Tidy Data

GAISE 2016

Teach statistical thinking.
- Teach statistics as an investigative process of problem-solving and decision-making.
- Give students experience with multivariable thinking.
Focus on conceptual understanding.
Integrate real data with a context and purpose.
Foster active learning.
Use technology to explore concepts and analyze data.
Use assessments to improve and evaluate student learning.

Real data

The StatPREP definition:

captured in the wild: not collected for teaching statistics
appropriate for investigation of a particular question or set of questions
- large enough (think: \(n \ge 1000\))
- rich enough (think: 5 or more variables)

For the moment, put aside the important matters of context and purpose.

Instead, focus on the organization of data in a way that will contribute to

good habits in the workplace
ability to use technology effectively
multivariable thinking

Discipline in presenting data encourages good work habits.

Data organization

Almost all StatPREP participants say they teach little or nothing about organizing data.
In textbooks, teaching about data focusses on variable types, not organization.
Organizing data properly is critically important for …
- thinking correctly with data
- using real data with technology

“Database Thinking”

Ironically, this is a very mathematical form of thinking, but few mathematicians know about it.
SQL is the algebra that gets you a job
Data base thinking \(\to\) explore and present both simple and complex data
Fundamental to data science

Unruly data

unruly: (adj.) disorderly and disruptive and not amenable to discipline or control

Unruly data

unruly: (adj.) disorderly and disruptive and not amenable to discipline or control

Example 1

Unruly data

unruly: (adj.) disorderly and disruptive and not amenable to discipline or control

Example 2

Rules tame unruliness

Always in a “spreadsheet” (ie, rows and columns, electronic) fail ✔︎✔︎✔︎
Rows: One row per unit of observation/analysis
- No other info (e.g. summaries, codebook info) in spreadsheet fail
- Any subset of rows is a valid data set.
Columns: One column per variable fail
- Always use column names fail
Always have a codebook fail
- Rows: how cases were selected
- Columns: units, specific questions/protocols/methods
Segregate data collection and data analysis
- Never change the original data as part of the analysis

Rules tame unruliness

Always in a “spreadsheet” (ie, rows and columns) fail ✔︎✔︎✔︎
Rows: One row per unit of observation/analysis
Columns: One column per variable fail
Always have a codebook fail
Segregate data collection and data analysis

Note: This sort of tidy/rectangular data is not the only kind of data, but it is an extremely important kind of data and sufficient for introductory courses.

Implications

You need to have a way to distribute such data.

Google spreadsheets, google forms
- Best practices
CSV files
More efficient forms

Operations on Tidy Data

Tidy Data Admits Standard Data Operations

One advantage of tidy data is that it is ammenable to standard, reusable data operations.

Data Operations: Fundamental

summarize / reduce: convert entire data set to 1-row of summary values
- in spreadsheet: often not tidy, summary formulas can be placed in untidy places
- in R: df_stats()
mutate / transform: add new variable
- in spreadsheet: add a new column
- in R: mutate()
filter: remove/ignore rows or columns
- in spreadsheet: “hide” data
- in R: filter()

Data Operations: Superficial

select: remove/ignore columns
- in spreadsheet: “hide” data
- in R: select()
arrange
- in spreadsheet, “sort”. But how do you do this without altering the data?
- in R: arrange()

Data Operations: Profound

group: handle summarize and mutate in a groupwise way
- in spreadsheet: pivot table
- in R: group_by(); formulas (eg, blood_pressure ~ age + sex)
join: bring together data from multiple sources
- in spreadsheet “fusion tables” but will be phased out in Dec. 2019 in terms of SQL operations
spread / gather: change unit of analysis
- in spreadsheet pivot/unpivot
- in R: spread(), gather() [soon to be pivot_wide(), pivot_long()]

Long vs Wide Data

country	continent	year	lifeExp
Afghanistan	Asia	1952	28.801
Afghanistan	Asia	1957	30.332
Afghanistan	Asia	1962	31.997
Afghanistan	Asia	1967	34.020

country	continent	pop1952	pop1957	pop1962	pop1967	pop1972	pop1977	pop1982	pop1987	pop1992	pop1997	pop2002	pop2007
Afghanistan	Asia	28.801	30.332	31.997	34.020	36.088	38.438	39.854	40.822	41.674	41.763	42.129	43.828
Albania	Europe	55.230	59.280	64.820	66.220	67.690	68.930	70.420	72.000	71.581	72.950	75.651	76.423
Algeria	Africa	43.077	45.685	48.303	51.407	54.518	58.014	61.368	65.799	67.744	69.152	70.994	72.301
Angola	Africa	30.015	31.999	34.000	35.985	37.928	39.483	39.942	39.906	40.647	40.963	41.003	42.731

Quiz

List the many ways in which this “table” violates the conventions for effective data organization.

Variable types

These are often introduced as definitions before the student has any use for them.

In stats book:

Variables Types

Key distinction / minor distinctions

numerical: interval, ratio, count, measure, …
categorical: unordered, ordered

Variable types in the wild

Numerical

quantities (with dimension and units) –
times and dates
counts (natural numbers)
~~codes: codes represent discrete levels, not really numerical~~

Variable types in the wild

Categorical

“Two-sample” isn’t (usually) two samples
- one sample, two variables
- categorical variable identifies the group
Summarizing with proportions doesn’t make categrical data numerical
Fixed vs “Random”
- Fixed: data contains all levels of interest
- Random: data represents a subset of/sample from levels of interest