Statistics Core
65 concepts in Statistics
Statistics is the science of learning from data. This core topic covers the entire statistical investigation cycle: formulating questions, collecting data through surveys and experiments, organizing and analyzing data with numerical summaries and visualizations, and drawing conclusions while acknowledging uncertainty. Students learn descriptive statistics (measures of center, spread, and shape) as well as the foundations of inferential statistics, including sampling distributions, confidence intervals, and hypothesis testing. Probability serves as the mathematical backbone, connecting data patterns to theoretical models. Particular attention is given to recognizing bias in data collection, understanding correlation versus causation, and evaluating statistical claims encountered in media and research. In an era of big data and data-driven decisions, statistical literacy is essential for informed citizenship, scientific inquiry, and professional competence across virtually every field.
Suggested learning path: Begin with exploratory data analysis and graphical displays, then study probability fundamentals, and progress to sampling distributions, confidence intervals, and basic hypothesis testing.
Data Collection
The systematic process of gathering information (data) to answer questions or learn about a topic.
Data Representation
Organizing and displaying data in ways that make patterns and information easier to see and understand.
Bar Graph
A graph that uses rectangular bars of different heights or lengths to compare quantities across categories.
Line Graph
A graph that uses points connected by lines to show how a quantity changes over time or across a continuous variable.
Mean as Fair Share
The mean (average) represents what each person would get if the total were divided equally among everyone.
Data Variability
How much the values in a data set are spread out or clustered together around the center.
Pictograph
A graph that uses pictures or symbols to represent data, where each symbol stands for a certain number of items.
Line Plot (Dot Plot)
A diagram showing data values as marks (usually X's or dots) above their values on a number line.
Mode
The value that appears most often in a data set. A set can have no mode, one mode, or multiple modes.
Range
The difference between the maximum and minimum values in a data set, measuring overall spread.
Tally Chart
A table that uses tally marks (lines) to count and organize data, with every fifth mark crossing the previous four.
Frequency Table
A table that records how often each value or category occurs in a data set, organizing raw data into a clear summary.
Categorical Data
Data that can be sorted into groups or categories, like colors, types, or names, rather than measured with numbers.
Statistical Question
A question that anticipates variability in its answers: it cannot be settled by a single observation, because different data points give different responses.
Median
The middle value when data is arranged in order (or the mean of the two middle values when the count is even). Half the values are above it, half below.
Mean vs Median
Mean and median are both measures of center but respond differently to extreme values (outliers).
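To make the contrast concrete, here is a minimal sketch using Python's standard `statistics` module. The allowance figures are invented for illustration and are not from the source.

```python
from statistics import mean, median

# Hypothetical weekly allowances (in dollars) for five students.
allowances = [10, 12, 13, 15, 20]

# The same data with one extreme value (an outlier) added.
with_outlier = allowances + [200]

print("without outlier:", mean(allowances), median(allowances))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

In this made-up data set the single outlier pulls the mean from 14 to 45, while the median only moves from 13 to 14.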
Spread vs Center
Center describes where the 'middle' of data lies; spread describes how far data extends from that center.
Correlation
A statistical relationship between two variables where changes in one are associated with changes in the other.
Correlation vs Causation
Correlation shows two variables move together; causation means one actually makes the other change. Correlation doesn't prove causation.
Sampling Bias
When a sample is collected in a way that makes some members of the population more likely to be included than others, leading to misleading conclusions.
Basic Probability
The chance or likelihood that an event will occur, expressed as a number between 0 (impossible) and 1 (certain).
Histogram
A graph that groups numerical data into ranges (bins) and shows the frequency of values in each range using bars that touch.
Box Plot
A visual display showing the five-number summary: minimum, Q1, median, Q3, and maximum, often with outliers marked separately.
Standard Deviation
A measure of how spread out data values are from the mean, calculated as the square root of the average squared deviation from the mean. It can be read as a typical distance from the average.
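The calculation can be sketched step by step as follows. The data set is an invented example, and `population_sd` is a hypothetical helper name; the population form (dividing by $n$) is shown by hand, with the standard library's `pstdev` and the sample form `stdev` (dividing by $n - 1$) printed for comparison.

```python
import math
from statistics import pstdev, stdev

# Hypothetical data set chosen so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    m = sum(values) / len(values)
    variance = sum((x - m) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

print(population_sd(data))  # hand-rolled population standard deviation
print(pstdev(data))         # library population standard deviation
print(stdev(data))          # library sample standard deviation (n - 1)
```

For this data the mean is 5 and the squared deviations sum to 32, so the population standard deviation is $\sqrt{32/8} = 2$.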
Misleading Graphs
Graphs can distort data through tricks like truncated axes, inconsistent scales, or cherry-picked time ranges to create false impressions.
Experimental Design
The careful planning of experiments to establish cause-and-effect relationships by controlling variables and using comparison groups.
Two-Way Tables
A table that displays the frequency of data categorized by two different variables, allowing comparison across groups.
Random Sampling
Selecting individuals from a population where every member has an equal chance of being chosen.
Dot Plot
A statistical chart using dots to display the frequency of different values, similar to a line plot but often used for larger datasets.
Quartiles
Values that divide ordered data into four equal parts: $Q_1$ (25th percentile), $Q_2$ (median, 50th), and $Q_3$ (75th percentile).
Interquartile Range (IQR)
The range of the middle 50% of data, calculated as $Q_3 - Q_1$. It measures spread while ignoring extreme values.
Mean Absolute Deviation (MAD)
The average distance of data points from the mean, ignoring whether they're above or below.
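A short sketch of the MAD calculation, assuming a small invented data set; `mean_absolute_deviation` is a hypothetical helper name.

```python
def mean_absolute_deviation(values):
    """Average of the absolute distances between each value and the mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

# Hypothetical data: the mean is 6, and the distances from it are 3, 1, 0, 4.
data = [3, 5, 6, 10]
print(mean_absolute_deviation(data))
```

Here the distances sum to 8, so the MAD is $8 / 4 = 2$.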
Scatter Plot
A graph that uses dots to show the relationship between two numerical variables, with each dot representing one data point.
Distribution Shape
The overall pattern of how data values are spread, including whether the distribution is symmetric, skewed left, skewed right, uniform, or bimodal.
Population vs Sample
Population is the entire group you want to study; sample is the smaller subset you actually measure to learn about the population.
Theoretical Probability
The expected probability of an event based on mathematical reasoning about equally likely outcomes, without conducting experiments.
Experimental Probability
The probability of an event based on actual experimental data: the number of times the event occurred divided by total trials.
Sample Space
The complete set of all possible outcomes for a probability experiment, listed without repetition.
Compound Events
Events made up of two or more simple events. Probabilities multiply for 'and' when the events are independent, and add for 'or' when the events are mutually exclusive.
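As an illustration, the sketch below enumerates the sample space for rolling two fair dice and checks both rules by counting. The two-dice setup and the `probability` helper are assumptions made for this example; the multiplication check relies on the two dice being independent, and the addition check relies on "sum is 7" and "sum is 11" being mutually exclusive.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely (first die, second die) outcomes.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """Fraction of outcomes in the sample space for which event(outcome) is true."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

# 'And' with independent events: P(A and B) = P(A) * P(B).
p_first_even = probability(lambda o: o[0] % 2 == 0)
p_second_even = probability(lambda o: o[1] % 2 == 0)
p_both_even = probability(lambda o: o[0] % 2 == 0 and o[1] % 2 == 0)

# 'Or' with mutually exclusive events: P(A or B) = P(A) + P(B).
p_sum_7 = probability(lambda o: sum(o) == 7)
p_sum_11 = probability(lambda o: sum(o) == 11)
p_7_or_11 = probability(lambda o: sum(o) in (7, 11))

print(p_both_even, p_first_even * p_second_even)
print(p_7_or_11, p_sum_7 + p_sum_11)
```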
Relative Frequency
The fraction or percentage of times a value occurs out of the total number of observations.
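A minimal sketch of building a frequency table and converting it to relative frequencies, using `collections.Counter`. The survey responses are invented for illustration.

```python
from collections import Counter

# Hypothetical survey responses: favorite color of six students.
responses = ["red", "blue", "red", "green", "blue", "red"]

# Frequency table: how many times each category occurs.
frequencies = Counter(responses)

# Relative frequency: each count divided by the total number of observations.
total = len(responses)
relative_frequencies = {
    color: count / total for color, count in frequencies.items()
}

for color, count in frequencies.most_common():
    print(f"{color:<6} count={count} relative={relative_frequencies[color]:.2f}")
```

The relative frequencies always sum to 1, which makes data sets of different sizes easier to compare.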
Normal Distribution
A symmetric, bell-shaped probability distribution where most data clusters around the mean, with probabilities decreasing symmetrically toward the tails.
Z-Score (Standard Score)
The number of standard deviations a value is from the mean: $z = \frac{x - \mu}{\sigma}$.
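The formula translates directly into code. The test-score numbers below are hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical exam with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))  # a score of 85 is above the mean
print(z_score(55, 70, 10))  # a score of 55 is below the mean
```

With these numbers a score of 85 has $z = 1.5$ and a score of 55 has $z = -1.5$.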
Percentiles
Values that divide a distribution into 100 equal parts. The nth percentile is the value below which n% of data falls.
Sampling Distribution
The probability distribution of a statistic (like the mean) calculated from all possible samples of a given size from a population.
Central Limit Theorem
For large enough samples, the sampling distribution of the mean is approximately normal, regardless of the population distribution's shape.
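The theorem can be explored by simulation. The sketch below is one possible setup, not a prescribed one: it assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1), a sample size of 30, and 5,000 repeated samples. The function name and seed are arbitrary choices.

```python
import math
import random
from statistics import mean, stdev

def sample_means(sample_size, num_samples, seed=0):
    """Draw repeated samples from a skewed population and return each sample's mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        draws = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(draws) / sample_size)
    return means

means = sample_means(sample_size=30, num_samples=5000)

# Theory predicts the sample means center on the population mean (1.0)
# with standard deviation sigma / sqrt(n) = 1 / sqrt(30).
predicted_se = 1 / math.sqrt(30)
print("mean of sample means:", round(mean(means), 3))
print("sd of sample means:  ", round(stdev(means), 3))
print("predicted sd:        ", round(predicted_se, 3))
```

Plotting `means` as a histogram would show a roughly bell-shaped curve even though the population itself is far from normal.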
Standard Error
The standard deviation of a sampling distribution, measuring how much a sample statistic typically varies from the true population parameter.
Confidence Interval
A range of values, calculated from sample data by a method that captures the true population parameter in a specified percentage of repeated samples (the confidence level).
Margin of Error
The largest difference expected between a sample statistic and the population parameter at a given confidence level, typically expressed as $\pm$ a value.
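The following sketch ties together standard error, margin of error, and a confidence interval for a mean. It assumes a normal (z) critical value; for a sample this small a t critical value would usually be preferred, so treat the result as an approximation. The sample values and the function name are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high, margin) for a z-based confidence interval for the mean."""
    n = len(sample)
    center = mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    standard_error = stdev(sample) / math.sqrt(n)
    # Critical value z*, about 1.96 for 95% confidence.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * standard_error
    return center - margin, center + margin, margin

# Hypothetical measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
low, high, margin = mean_confidence_interval(sample)
print(f"mean = {mean(sample)}, margin of error = {margin:.2f}")
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Raising the confidence level widens the interval, and increasing the sample size narrows it.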
Linear Regression
A statistical method for modeling the relationship between variables by fitting a line that minimizes the sum of squared distances from data points to the line.
Line of Best Fit
The straight line that best represents the trend in a scatter plot, minimizing the overall distance between the line and all data points.
Residuals
The differences between observed data values and the values predicted by a model (actual - predicted).
R-Squared (Coefficient of Determination)
The proportion of variance in the dependent variable that is explained by the independent variable(s) in a regression model, ranging from 0 to 1.
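The least-squares fit, its residuals, and $R^2$ can be computed together for a single predictor. This is a minimal sketch with invented data; `fit_line` is a hypothetical helper using the standard closed-form slope and intercept for simple linear regression.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]

# Residual = actual - predicted.
residuals = [y - p for y, p in zip(ys, predicted)]

# R-squared = 1 - (residual sum of squares / total sum of squares).
y_bar = sum(ys) / len(ys)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"y = {intercept:.2f} + {slope:.2f}x, R^2 = {r_squared:.2f}")
```

For this data the fitted line is $y = 2.2 + 0.6x$ with $R^2 = 0.6$, and the residuals sum to zero, as they always do for a least-squares line with an intercept.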
Outlier Detection
Methods for identifying data points that are unusually far from the rest, using techniques like IQR rule, z-scores, or visual inspection.
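One common version of the IQR rule flags values more than $1.5 \times \text{IQR}$ below $Q_1$ or above $Q_3$. The sketch below assumes that rule and the "inclusive" quartile method of Python's `statistics.quantiles`; other quartile conventions can give slightly different fences. The data and the function name are illustrative.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical data with one unusually large value.
data = [1, 3, 4, 5, 6, 7, 8, 9, 30]
print(iqr_outliers(data))
```

Here $Q_1 = 4$ and $Q_3 = 8$, so the fences sit at $-2$ and $14$ and only the value 30 is flagged.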
Observational vs Experimental Studies
Observational studies observe subjects without manipulation; experiments deliberately assign treatments to establish causation.
Confounding Variables
Variables that influence both the independent and dependent variables, creating a spurious association that can be mistaken for causation.
Statistical Simulation
Using random number generation to model real-world processes and estimate probabilities or outcomes that are difficult to calculate theoretically.
Law of Large Numbers
As the number of trials increases, the experimental probability (sample average) converges to the theoretical probability (population mean).
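A simulated coin flip shows the convergence. This is a sketch under simple assumptions: a fair coin with theoretical probability 0.5, Python's pseudo-random generator, and an arbitrary fixed seed for repeatability.

```python
import random

def experimental_probability(num_flips, seed=0):
    """Proportion of heads in num_flips simulated fair-coin flips."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The proportion of heads tends to settle near 0.5 as the number of flips grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n:>7} flips: {experimental_probability(n):.4f}")
```

Small runs can land far from 0.5; the law only describes what happens as the number of trials becomes large.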
Expected Value
The long-run average outcome of a random process, calculated as the sum of each outcome times its probability.
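A brief sketch of the calculation for a fair six-sided die, using exact fractions. The die example and the `expected_value` helper are assumptions made for illustration.

```python
from fractions import Fraction

def expected_value(distribution):
    """Sum of each outcome multiplied by its probability."""
    return sum(outcome * prob for outcome, prob in distribution.items())

# Fair six-sided die: each face has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(fair_die))
```

The result is $7/2 = 3.5$, a value the die can never actually show; the expected value describes the long-run average, not a single roll.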
Hypothesis Testing
A formal procedure for using sample data to decide between two competing claims (hypotheses) about a population parameter.
P-Value
The probability of observing results at least as extreme as the actual data, assuming the null hypothesis is true.
Statistical Significance
A result is statistically significant when the p-value falls below a predetermined threshold ($\alpha$), typically 0.05, suggesting the observed effect is unlikely due to chance alone.
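As a worked illustration of a p-value and a significance decision, consider testing whether a coin is biased toward heads. The sketch assumes a null hypothesis of a fair coin, a one-sided alternative, 10 flips, and $\alpha = 0.05$; all of these choices, and the function name, are hypothetical.

```python
from math import comb

def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) when X counts successes in `trials` independent tries."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

alpha = 0.05
for heads in (8, 9):
    # p-value: chance of a result at least this extreme if the coin is fair.
    p_value = binomial_upper_tail(heads, 10)
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"{heads} heads in 10 flips: p-value = {p_value:.4f} ({verdict})")
```

With 8 heads the p-value is $56/1024 \approx 0.055$, just above the threshold; with 9 heads it is $11/1024 \approx 0.011$, below it.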
Correlation Coefficient
A number between $-1$ and $1$ that measures the strength and direction of the linear relationship between two variables.
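The most common such coefficient is Pearson's $r$, which the sketch below computes from its definition. Choosing Pearson's $r$ is an assumption, since the source does not name a specific coefficient, and the data sets are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3], [2, 4, 6]))              # perfect positive relationship
print(pearson_r([1, 2, 3], [6, 4, 2]))              # perfect negative relationship
print(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))  # moderate positive relationship
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.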
Weighted Average
An average in which different values contribute unequally based on their assigned weights.
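A minimal sketch, assuming a hypothetical course grade in which three scores carry weights of 5, 3, and 2.

```python
def weighted_average(values, weights):
    """Sum of value * weight, divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical scores: exam (weight 5), project (weight 3), homework (weight 2).
scores = [80, 90, 70]
weights = [5, 3, 2]
print(weighted_average(scores, weights))
```

The weighted average here is 81, compared with an ordinary mean of 80, because the most heavily weighted items pull the result toward themselves. When all weights are equal, the weighted average reduces to the ordinary mean.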
Empirical Rule
In a normal distribution, about 68% of data falls within $1\sigma$, about 95% within $2\sigma$, and about 99.7% within $3\sigma$ of the mean.
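These percentages can be checked against the standard normal distribution using `statistics.NormalDist` from Python's standard library; the variable names are arbitrary.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of landing within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, prob in within.items():
    print(f"within {k} standard deviation(s): {prob:.4f}")
```

The computed probabilities are roughly 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.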
Skewness
A measure of the asymmetry of a distribution: how much it leans to one side of the mean.