Statistics Core

65 concepts in Statistics

Statistics is the science of learning from data. This core topic covers the entire statistical investigation cycle: formulating questions, collecting data through surveys and experiments, organizing and analyzing data with numerical summaries and visualizations, and drawing conclusions while acknowledging uncertainty. Students learn descriptive statistics — measures of center, spread, and shape — as well as the foundations of inferential statistics, including sampling distributions, confidence intervals, and hypothesis testing. Probability serves as the mathematical backbone, connecting data patterns to theoretical models. Particular attention is given to recognizing bias in data collection, understanding correlation versus causation, and evaluating statistical claims encountered in media and research. In an era of big data and data-driven decisions, statistical literacy is essential for informed citizenship, scientific inquiry, and professional competence across virtually every field.

Suggested learning path: Begin with exploratory data analysis and graphical displays, then study probability fundamentals, and progress to sampling distributions, confidence intervals, and basic hypothesis testing.

Data Collection

The systematic process of gathering information (data) to answer questions or learn about a topic.

Data Representation

Organizing and displaying data in ways that make patterns and information easier to see and understand.

data collection

Bar Graph

A graph that uses rectangular bars of different heights or lengths to compare quantities across categories.

data representation

Line Graph

A graph that uses points connected by lines to show how a quantity changes over time or across a continuous variable.

data representation

coordinate plane

Mean as Fair Share

The mean (average) represents what each person would get if the total were divided equally among everyone.

Data Variability

How much the values in a data set are spread out or clustered together around the center.

mean fair share

number comparison

Pictograph

A graph that uses pictures or symbols to represent data, where each symbol stands for a certain number of items.

Line Plot (Dot Plot)

A diagram showing data values as marks (usually X's or dots) above their values on a number line.

Mode

The value that appears most often in a data set. A set can have no mode, one mode, or multiple modes.
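
A minimal sketch of the three cases (no mode, one mode, several modes). The function name `modes` and the sample lists are illustrative, and treating "no value repeats" as "no mode" is a convention assumed here.

```python
from collections import Counter

def modes(data):
    """Return every most-frequent value, or [] when no value repeats."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Assumed convention: if nothing repeats, report "no mode".
        return []
    return sorted(value for value, n in counts.items() if n == highest)

no_mode = modes([1, 2, 3, 4])        # []
one_mode = modes([1, 2, 2, 3])       # [2]
two_modes = modes([1, 1, 2, 3, 3])   # [1, 3]
```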

Range

The difference between the maximum and minimum values in a data set, measuring overall spread.

Tally Chart

A table that uses tally marks (lines) to count and organize data, with every fifth mark crossing the previous four.

Frequency Table

A table that records how often each value or category occurs in a data set, organizing raw data into a clear summary.
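
One way to build such a table in code is to count each category. The survey responses below are made up for illustration.

```python
from collections import Counter

# Made-up survey responses (illustrative data only).
pets = ["dog", "cat", "dog", "fish", "cat", "dog"]

# Counter maps each category to the number of times it occurs.
frequency = Counter(pets)

# Print the table, most frequent category first.
for category, count in frequency.most_common():
    print(f"{category:<6}{count}")
```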

Categorical Data

Data that can be sorted into groups or categories, like colors, types, or names, rather than measured with numbers.

Statistical Question

A question that anticipates variability in the data: it cannot be answered with a single fixed value, because different individuals or observations give different responses.

data collection

Median

The middle value when data is arranged in order; with an even number of values, the average of the two middle values. Half the values lie at or above it, half at or below.

ordering numbers

mean fair share
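
A short sketch of the median rule for both odd and even counts. The function name and sample lists are illustrative. Averaging the two middle values for an even count is the usual convention, assumed here.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([7, 1, 5])       # sorted: 1, 5, 7    -> 5
even_case = median([8, 2, 4, 6])   # sorted: 2, 4, 6, 8 -> 5.0
```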

Mean vs Median

Mean and median are both measures of center but respond differently to extreme values (outliers).
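
The made-up price list below illustrates the difference: adding one extreme value pulls the mean far upward while the median barely moves.

```python
from statistics import mean, median

prices = [30, 32, 35, 38, 40]     # illustrative data
with_outlier = prices + [400]     # same data plus one extreme value

mean_before, median_before = mean(prices), median(prices)            # 35 and 35
mean_after, median_after = mean(with_outlier), median(with_outlier)  # about 95.8 and 36.5
```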

Spread vs Center

Center describes where the 'middle' of data lies; spread describes how far data extends from that center.

variability intro

Correlation

A statistical relationship between two variables where changes in one are associated with changes in the other.

Correlation vs Causation

Correlation shows two variables move together; causation means one actually makes the other change. Correlation doesn't prove causation.

correlation intro

Sampling Bias

When a sample is collected in a way that makes some members of the population more likely to be included than others, leading to misleading conclusions.

data collection

population vs sample

Basic Probability

The chance or likelihood that an event will occur, expressed as a number between 0 (impossible) and 1 (certain).

Histogram

A graph that groups numerical data into ranges (bins) and shows the frequency of values in each range using bars that touch.

Box Plot

A visual display showing the five-number summary: minimum, Q1, median, Q3, and maximum, often with outliers marked separately.

Standard Deviation

A measure of how spread out data values are from the mean: the square root of the average squared deviation from the mean, interpreted as a typical distance from the average.

variability intro
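
As a sketch of the standard deviation calculation, the function below computes the population version, dividing by the number of values. The sample version divides by one less than that. The name `population_sd` and the data are illustrative.

```python
from math import sqrt

def population_sd(values):
    """Square root of the mean squared deviation from the mean."""
    n = len(values)
    center = sum(values) / n
    variance = sum((x - center) ** 2 for x in values) / n
    return sqrt(variance)

# Mean is 5; squared deviations sum to 32; 32 / 8 = 4; sqrt(4) = 2.
sd = population_sd([2, 4, 4, 4, 5, 5, 7, 9])   # 2.0
```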

Misleading Graphs

Graphs can distort data through tricks like truncated axes, inconsistent scales, or cherry-picked time ranges to create false impressions.

Experimental Design

The careful planning of experiments to establish cause-and-effect relationships by controlling variables and using comparison groups.

correlation vs causation

Two-Way Tables

A table that displays the frequency of data categorized by two different variables, allowing comparison across groups.

data representation

categorical data

Random Sampling

Selecting individuals from a population so that every member has an equal chance of being chosen.

population vs sample

Dot Plot

A statistical chart using dots to display the frequency of different values, similar to a line plot but often used for larger datasets.

Quartiles

Values that divide ordered data into four equal parts: $Q_1$ (25th percentile), $Q_2$ (median, 50th), and $Q_3$ (75th percentile).

ordering numbers

Interquartile Range (IQR)

The range of the middle 50% of data, calculated as $Q_3 - Q_1$. It measures spread while ignoring extreme values.
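
A hedged example using Python's standard library. Textbooks and software use slightly different quartile conventions. The `"inclusive"` method is one choice, assumed here, and the scores are made up.

```python
from statistics import quantiles

scores = [2, 4, 6, 8, 10, 12, 14, 16, 18]   # illustrative data

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")   # 6.0, 10.0, 14.0
iqr = q3 - q1                                              # 8.0
```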

Mean Absolute Deviation (MAD)

The average distance of data points from the mean, ignoring whether they're above or below.
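
A minimal sketch of the calculation, with an illustrative function name and data set.

```python
def mean_absolute_deviation(values):
    """Average of the absolute distances from the mean."""
    center = sum(values) / len(values)
    return sum(abs(x - center) for x in values) / len(values)

# Mean is 5; absolute deviations are 3, 1, 1, 1, 0, 0, 2, 4 (sum 12); 12 / 8 = 1.5.
mad = mean_absolute_deviation([2, 4, 4, 4, 5, 5, 7, 9])   # 1.5
```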

Scatter Plot

A graph that uses dots to show the relationship between two numerical variables, with each dot representing one data point.

coordinate plane

two variable data

Distribution Shape

The overall pattern of how data values are spread, including whether the distribution is symmetric, skewed left, skewed right, uniform, or bimodal.

Population vs Sample

Population is the entire group you want to study; sample is the smaller subset you actually measure to learn about the population.

data collection

Theoretical Probability

The expected probability of an event based on mathematical reasoning about equally likely outcomes, without conducting experiments.

probability basic
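
To illustrate theoretical probability, the sketch below lists the equally likely outcomes for two fair six-sided dice (an assumed example) and counts those whose sum is 7.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of rolling two fair dice: (1, 1), (1, 2), ..., (6, 6).
sample_space = list(product(range(1, 7), repeat=2))

# Favorable outcomes: the two dice sum to 7.
favorable = [roll for roll in sample_space if sum(roll) == 7]

# favorable / total = 6 / 36 = 1/6
p_seven = Fraction(len(favorable), len(sample_space))
```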

Experimental Probability

The probability of an event based on actual experimental data: the number of times the event occurred divided by total trials.

probability basic

data collection
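
Experimental probability can be sketched with a simulated die. The seed and trial count are arbitrary choices. The result should land near, but not exactly on, the theoretical value of 1/6.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
trials = 60_000

# Count how many simulated rolls of a fair die show a six.
sixes = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)

experimental = sixes / trials   # occurrences divided by total trials
print(f"experimental: {experimental:.4f}, theoretical: {1 / 6:.4f}")
```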

Sample Space

The complete set of all possible outcomes for a probability experiment, listed without repetition.

Compound Events

Events made up of two or more simple events. Probabilities multiply for 'and' when the events are independent and add for 'or' when the events are mutually exclusive; otherwise the overlap must be subtracted.

probability basic
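
A sketch of the compound-event rules by direct counting, assuming two fair dice. Event A (first die even) and event B (second die shows 5 or 6) are independent, so the 'and' probability is a product. They are not mutually exclusive, so the 'or' probability subtracts the overlap.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Fraction of equally likely outcomes for which event(outcome) is true."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def a(o):            # A: first die is even
    return o[0] % 2 == 0

def b(o):            # B: second die shows 5 or 6
    return o[1] > 4

p_a, p_b = prob(a), prob(b)                # 1/2 and 1/3
p_and = prob(lambda o: a(o) and b(o))      # 1/6 = 1/2 * 1/3
p_or = prob(lambda o: a(o) or b(o))        # 2/3 = 1/2 + 1/3 - 1/6
```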

Relative Frequency

The fraction or percentage of times a value occurs out of the total number of observations.

Normal Distribution

A symmetric, bell-shaped probability distribution where most data clusters around the mean, with probabilities decreasing symmetrically toward the tails.

distribution shape

standard deviation intro

Z-Score (Standard Score)

The number of standard deviations a value is from the mean: $z = \frac{x - \mu}{\sigma}$.

standard deviation intro
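
A worked z-score example with made-up numbers: a score of 85 on a test with mean 70 and standard deviation 10.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

z = z_score(85, mu=70, sigma=10)        # (85 - 70) / 10 = 1.5
z_below = z_score(55, mu=70, sigma=10)  # negative: below the mean, -1.5
```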

Percentiles

Values that divide a distribution into 100 equal parts. The nth percentile is the value below which n% of data falls.

ordering numbers

Sampling Distribution

The probability distribution of a statistic (like the mean) calculated from all possible samples of a given size from a population.

population vs sample

standard deviation intro

Central Limit Theorem

For large enough samples, the sampling distribution of the mean is approximately normal, regardless of the population distribution's shape (provided the population has finite variance).

sampling distribution

normal distribution
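
The Central Limit Theorem can be explored with a small simulation. This sketch draws from a strongly skewed population (exponential with mean 1 and standard deviation 1, chosen for illustration) and checks that the sample means cluster near 1 with spread near 1/√n. The seed, sample size, and repetition count are arbitrary.

```python
import random
from math import sqrt
from statistics import stdev

rng = random.Random(1)   # fixed seed for repeatability
sample_size = 50
repetitions = 2000

# Draw many samples and record the mean of each one.
sample_means = []
for _ in range(repetitions):
    sample = [rng.expovariate(1.0) for _ in range(sample_size)]
    sample_means.append(sum(sample) / sample_size)

center = sum(sample_means) / repetitions   # expected to be close to 1
spread = stdev(sample_means)               # expected to be close to the value below
predicted_spread = 1 / sqrt(sample_size)   # about 0.141
```

Plotting `sample_means` as a histogram would show a roughly bell-shaped curve even though the population itself is skewed.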

Standard Error

The standard deviation of a sampling distribution, measuring how much a sample statistic typically varies from the true population parameter.

standard deviation intro

sampling distribution

Confidence Interval

A range of values, calculated from sample data, that is likely to contain the true population parameter. The confidence level (for example, 95%) describes how often intervals built this way capture the parameter across repeated samples.

sampling distribution
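
A hedged sketch of a confidence interval for a mean, using the normal approximation: sample mean ± z* × standard error. For small samples a t critical value would give a somewhat wider interval. The function name and the heights are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval: mean +/- z* times s / sqrt(n)."""
    n = len(sample)
    center = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # z* is about 1.96 for 95% confidence.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * standard_error
    return center - margin, center + margin

# Made-up heights in centimeters (illustrative data only).
heights = [162, 175, 168, 171, 180, 165, 177, 170, 173, 169]
low, high = mean_confidence_interval(heights)
margin_of_error = (high - low) / 2
```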

Margin of Error

The largest expected difference between a sample statistic and the population parameter at a given confidence level, typically expressed as $\pm$ a value. For a symmetric confidence interval, it is half the interval's width.

confidence interval

Linear Regression

A statistical method for modeling the relationship between variables by fitting a line that minimizes the sum of squared vertical distances (residuals) from the data points to the line.

correlation intro

slope intercept
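
A minimal sketch of a least-squares fit for one predictor, using the textbook formulas: slope = Sxy / Sxx and intercept = ȳ − slope · x̄. The function name and the five data points are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)   # about 0.6 and 2.2
```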

Line of Best Fit

The straight line that best represents the trend in a scatter plot, usually the least-squares line, which minimizes the sum of squared vertical distances between the line and the data points.

slope intercept

Residuals

The differences between observed data values and the values predicted by a model (actual - predicted).

linear regression

R-Squared (Coefficient of Determination)

The proportion of variance in the dependent variable that is explained by the independent variable(s) in a regression model, ranging from 0 to 1.

linear regression
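
As a sketch, R-squared can be computed from residuals as 1 − SSE / SST. The predictions below come from the line y = 2.2 + 0.6x, which is the least-squares line for these five made-up points. The function name is illustrative.

```python
def r_squared(observed, predicted):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    y_bar = sum(observed) / len(observed)
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in observed)                 # total
    return 1 - sse / sst

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
predicted = [2.2 + 0.6 * x for x in xs]

residuals = [y - p for y, p in zip(ys, predicted)]   # actual - predicted
r2 = r_squared(ys, predicted)                        # about 0.6
```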

Outlier Detection

Methods for identifying data points that are unusually far from the rest, using techniques like IQR rule, z-scores, or visual inspection.

interquartile range
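
A hedged sketch of the IQR rule: flag values more than 1.5 × IQR below Q1 or above Q3. The multiplier 1.5 is the customary default, and the quartile method is one of several conventions. The function name and data are illustrative.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Q1 = 3, Q3 = 7, IQR = 4, so the fences are -3 and 13.
flagged = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 50])   # [50]
```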

Observational vs Experimental Studies

Observational studies observe subjects without manipulation; experiments deliberately assign treatments to establish causation.

experimental design

correlation vs causation

Confounding Variables

A variable that influences both the independent and dependent variables, creating a spurious association that can be mistaken for causation.

correlation vs causation

Statistical Simulation

Using random number generation to model real-world processes and estimate probabilities or outcomes that are difficult to calculate theoretically.

probability basic

random sampling

Law of Large Numbers

As the number of trials increases, the experimental probability (sample average) converges to the theoretical probability (population mean).

probability basic

Expected Value

The long-run average outcome of a random process, calculated as the sum of each outcome times its probability.

probability basic

weighted average
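
A small example of the expected value calculation for one roll of a fair six-sided die (an assumed example), using exact fractions.

```python
from fractions import Fraction

# Each face 1..6 has probability 1/6.
distribution = {face: Fraction(1, 6) for face in range(1, 7)}

# Sum of each outcome times its probability: 21/6 = 7/2.
expected_value = sum(outcome * p for outcome, p in distribution.items())
```

The result, 3.5, is not a face of the die. It describes the long-run average over many rolls, not any single outcome.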

Hypothesis Testing

A formal procedure for using sample data to decide between two competing claims (hypotheses) about a population parameter.

sampling distribution

probability basic

P-Value

The probability of observing results at least as extreme as the actual data, assuming the null hypothesis is true.

hypothesis testing

probability basic

sampling distribution
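
A worked p-value sketch under assumed conditions: the null hypothesis is that a coin is fair, and the observed data are 9 heads in 10 flips. The function name is illustrative. Doubling the one-sided value is valid here because the null distribution is symmetric.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, observed_heads = 10, 9

# One-sided: probability of 9 or more heads if the coin is fair.
p_one_sided = sum(binomial_pmf(k, n, 0.5) for k in range(observed_heads, n + 1))

# Two-sided: results at least as extreme in either direction (<= 1 or >= 9 heads).
p_two_sided = 2 * p_one_sided   # 22/1024, about 0.0215
```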

Statistical Significance

A result is statistically significant when the p-value falls below a predetermined threshold ($\alpha$), typically 0.05, suggesting the observed effect is unlikely to be due to chance alone.

hypothesis testing

Correlation Coefficient

A number between −1 and 1 that measures the strength and direction of the linear relationship between two variables.

correlation intro

line of best fit
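
A minimal sketch of the Pearson correlation coefficient, computed as Sxy / √(Sxx · Syy). The function name and data are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Strength and direction of the linear relationship, from -1 to 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])          # 6 / sqrt(60), about 0.775
r_perfect = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])    # points on a rising line: 1
```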

Weighted Average

An average in which different values contribute unequally based on their assigned weights.

mean fair share

stat expected value
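
A short sketch of a weighted average, assuming a made-up grading scheme where a homework score counts twice and an exam score counts three times. The function name is illustrative.

```python
def weighted_average(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# (80 * 2 + 90 * 3) / (2 + 3) = 430 / 5 = 86.0
grade = weighted_average([80, 90], weights=[2, 3])
```

With equal weights the result reduces to the ordinary mean.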

Empirical Rule

In a normal distribution: ~68% of data falls within $1\sigma$, ~95% within $2\sigma$, and ~99.7% within $3\sigma$ of the mean.

stat normal distribution
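
The empirical rule percentages can be checked against the standard normal distribution in Python's standard library. The variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, probability in within.items():
    print(f"within {k} sd: {probability:.4f}")   # about 0.6827, 0.9545, 0.9973
```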

Skewness

A measure of the asymmetry of a distribution — how much it leans to one side of the mean.

distribution shape