Read the first worked example with the solution open so the structure is clear.
Try the practice problems before revealing each solution.
Use the related concepts and background knowledge badges if you feel stuck.
What to Focus On
Core idea: Data is the collection of observations or measurements you gather to answer a question about a group.
Common stuck point: The procedure is the easy part; the trap is treating a summary, such as the mean, as if it were the data. Asking "Is this the raw set of recorded observations, before any summary or conclusion?" first keeps a correct-looking calculation from being attached to the wrong concept.
Sense of Study hint: Ask: Is this the raw set of recorded observations, before any summary or conclusion?
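One way to see why a summary is not the data is to compare two different sets of observations that share the same mean. The sketch below is only an illustration: the variable names and the values (books read last month by five students) are made up for this purpose and do not come from any real survey.

```python
from statistics import mean

# Hypothetical raw data: books read last month by five students.
# Both lists are invented for illustration only.
spread_out = [2, 4, 6, 8, 10]
all_same = [6, 6, 6, 6, 6]

# A summary is computed FROM the data; it is not the data itself.
mean_spread_out = mean(spread_out)
mean_all_same = mean(all_same)

# The two summaries agree even though the raw observations differ.
same_summary = mean_spread_out == mean_all_same
same_data = spread_out == all_same

print("raw data A:", spread_out, "-> mean:", mean_spread_out)
print("raw data B:", all_same, "-> mean:", mean_all_same)
print("same summary?", same_summary)
print("same data?", same_data)
```

Because the mean alone cannot tell these two lists apart, checking whether you are holding the raw observations or a summary of them is worth doing before any further calculation.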
Worked Examples
Example 1
easy
A researcher records the following about 5 students: name, age, GPA, and favorite color. Classify each variable as quantitative or categorical, and explain the difference.
Full solution
1. Name: categorical (labels, not numeric quantities)
2. Age: quantitative (numeric, can be averaged; e.g., mean age = 17.4)
3. GPA: quantitative (numeric, arithmetic operations are meaningful)
4. Favorite color: categorical (labels with no inherent numeric order)
5. Key distinction: quantitative variables measure amounts; categorical variables classify into groups
Data abstraction begins with identifying the type of each variable. Quantitative data supports arithmetic operations; categorical data supports counting and proportions. Applying the wrong analysis to the wrong type leads to meaningless results.
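The idea that each variable type supports a different kind of analysis can be sketched in Python. This is a minimal illustration, not a prescribed method: the five student records, the `VARIABLE_TYPES` table, and the `summarize` helper are all hypothetical and invented for this example.

```python
from collections import Counter
from statistics import mean

# Hypothetical records for five students; every value is invented.
students = [
    {"name": "Ana", "age": 16, "gpa": 3.4, "favorite_color": "blue"},
    {"name": "Ben", "age": 17, "gpa": 3.1, "favorite_color": "red"},
    {"name": "Cara", "age": 17, "gpa": 3.8, "favorite_color": "blue"},
    {"name": "Dev", "age": 18, "gpa": 2.9, "favorite_color": "green"},
    {"name": "Eli", "age": 19, "gpa": 3.6, "favorite_color": "blue"},
]

# Variable types as classified in Example 1.
VARIABLE_TYPES = {
    "name": "categorical",
    "age": "quantitative",
    "gpa": "quantitative",
    "favorite_color": "categorical",
}

def summarize(records, variable, kind):
    """Pick a summary that matches the variable type."""
    values = [record[variable] for record in records]
    if kind == "quantitative":
        # Arithmetic is meaningful, so a mean makes sense.
        return mean(values)
    if kind == "categorical":
        # Only counting is meaningful, so report proportions per category.
        counts = Counter(values)
        return {value: count / len(values) for value, count in counts.items()}
    raise ValueError(f"unknown variable type: {kind}")

summaries = {
    variable: summarize(students, variable, kind)
    for variable, kind in VARIABLE_TYPES.items()
}

for variable, summary in summaries.items():
    print(variable, "->", summary)
```

The helper refuses to guess when it meets an unknown type, which mirrors the point above: the type has to be decided first, and the analysis follows from it.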
Example 2
medium
A survey asks: (1) What is your ZIP code? (2) How many hours do you sleep per night? (3) Rate your satisfaction 1–5. Classify each and identify potential misclassification pitfalls.
Example 3
easy
Look at three columns of a spreadsheet: 'age', 'eye color', 'shoe size'. Which are numerical and which are categorical?
Example 4
medium
A spreadsheet has 200 rows and 5 columns. How many data values does it hold, and what do rows and columns represent?
Example 5
hard
Pick the best summary for each variable: (a) eye color across 50 students, (b) heights of 50 students, (c) satisfaction (1-5) of 50 students.
Example 6
challenge
You want to study TV-watching habits in a city. Compare these two data-collection plans: (Plan A) survey 100 people leaving a movie theater; (Plan B) randomly call 100 phone numbers in the city. Which is better, and why?
Practice Problems
Try these problems on your own first, then open the solution to compare your method.
Example 1
easy
Classify each variable: (a) blood type (A, B, AB, O), (b) temperature in Celsius, (c) jersey number, (d) number of goals scored.
Example 2
medium
A data set contains 1000 rows and 8 columns. Explain what a row and a column represent, and define the terms 'observation', 'variable', and 'case' in a statistical context.
Example 3
easy
A survey records each student's favorite color (red, blue, green). Is this data numerical or categorical?
Example 4
easy
Heights of students in centimeters are recorded. Is this numerical or categorical data?
Example 5
easy
Data is best described as what?
Example 6
easy
Yes/no answers to 'Do you own a pet?' are collected. What type of data is this?
Example 7
easy
A pollster collects 1,000,000 responses, but only from one biased website. Does the large size guarantee good conclusions?
Example 8
easy
Before analyzing a data set, what crucial thing should you understand about it?
Example 9
easy
Test scores of 85, 90, 78 are recorded. What type of data are these values?
Example 10
easy
ZIP codes are stored as numbers like 90210. Should they be treated as numerical data for averaging?
Example 11
medium
A researcher wants to study average commute time. They survey only people leaving a train station at 8 a.m. Name the data problem and its effect.
Example 12
medium
Classify each: (a) eye color, (b) number of siblings, (c) shirt size labeled S/M/L. Which is numerical?
Example 13
medium
A dataset of customer reviews contains both star ratings (1-5) and written comments. Which part is numerical and which is text/categorical?
Example 14
medium
A study claims a strong link between ice cream sales and drowning. The data was observational. Can it conclude ice cream causes drowning?
Example 15
medium
A spreadsheet lists temperatures but some cells are blank. What should an analyst do before computing the mean?
Example 16
medium
Population vs sample: a researcher measures all 30 students in one class to learn about the class. Is this a population or a sample?
Example 17
medium
A dataset records survey responses on a 1-5 agreement scale. Why is reporting the median often safer than the mean here?
Example 18
medium
Two datasets answer 'How tall are students?': one lists exact heights, the other lists 'short/medium/tall'. Which supports computing a mean height?
Example 19
challenge
A company reports its 'average' customer age as 35 from sign-up forms, but 40% of customers left the age field blank. Explain why the reported average may be untrustworthy.
Example 20
challenge
Classify the variable 'temperature in Celsius' and explain why ratios like '20 degrees is twice as warm as 10 degrees' are invalid.
Example 21
challenge
A dataset combines heights measured in inches (US branch) and centimeters (EU branch) in one column without labels. Why is this dataset unusable as-is, and what fixes it?
Example 22
medium
A dataset stores 'number of children' (0, 1, 2, ...) and 'marital status' (single/married). Which variable can have a meaningful mean?
Example 23
easy
A class records each student's favorite fruit (apple, banana, orange). Is this numerical or categorical data?
Example 24
easy
True or false: a list of telephone numbers is numerical data you can average.
Example 25
easy
A pollster surveys 20 students out of 600. The 20 are a ___ and the 600 are the ___.
Example 26
easy
A survey asks 'Are you a vegetarian?' and records yes/no. What type of data is this?
Example 27
medium
A teacher claims 'students who eat breakfast score higher on tests'. This conclusion came from observing one classroom. Why is this conclusion weak?
Example 28
medium
A dataset lists 'shirt size' as S, M, L, XL. Is this categorical or numerical? Can you order the categories?
Example 29
medium
A column lists ice cream sales (numeric) per day. Another lists 'season' (winter/spring/summer/fall). Which can you compute an average of?
Example 30
medium
A class collects everyone's height in inches but writes some as feet+inches. Why is the dataset hard to analyze before cleaning?
Example 31
medium
A 'satisfaction' question on a 1-5 scale: which is safer to report, the mean or the median?
Example 32
medium
A study finds higher cell-phone use correlates with lower test scores in 7th graders. Does the data show that phones cause lower scores?
Example 33
medium
Sort these into numerical or categorical: (a) ZIP code, (b) number of pets, (c) brand of phone, (d) heart rate (bpm).
Example 34
hard
A dataset has these 'age' entries: 12, 13, 14, 11, 999. Why might 999 be problematic and what should you do?
Example 35
hard
A poll reports 'Average household has 2.4 cars' from a sample of 100 homes in a wealthy suburb. Why is generalizing this to the whole country a mistake?
Example 36
hard
A spreadsheet of student records mixes upper- and lowercase ('Apple', 'apple', 'APPLE') in the favorite-fruit column. What problem does this cause for analysis?
Example 37
challenge
A teacher's gradebook has columns 'student', 'attendance %', 'exam score', 'preferred name'. Classify each variable type.