Least Squares Regression Line Formula

The Formula

\hat{y} = a + bx \quad\text{where}\quad b = r \cdot \frac{s_y}{s_x}, \quad a = \bar{y} - b\bar{x}

When to use: You have a scatter plot of two quantitative variables whose points follow a roughly linear trend, and you want to predict y from x. The LSRL is the line that gets as close as possible to all the points simultaneously: the 'best' straight line through the cloud, where 'best' means it minimizes the total squared prediction error.
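
As a minimal sketch of how the formula turns summary statistics into a line, the Python function below applies b = r \cdot \frac{s_y}{s_x} and a = \bar{y} - b\bar{x} directly. The function name lsrl_from_summary and the input numbers are hypothetical, chosen only for illustration.

```python
def lsrl_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (a, b) for y-hat = a + b*x from summary statistics."""
    if s_x <= 0:
        raise ValueError("s_x must be positive: x needs some spread")
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept; forces the line through (x_bar, y_bar)
    return a, b

# Hypothetical summary statistics (not taken from any example on this page).
a, b = lsrl_from_summary(r=0.5, s_x=2.0, s_y=8.0, x_bar=10.0, y_bar=70.0)
print(f"y-hat = {a:.1f} + {b:.1f}x")
```

With these made-up inputs the slope is 0.5 × 8 / 2 = 2 and the intercept is 70 − 2 × 10 = 50.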

Quick Example

Study hours (x) and test scores (y) for 5 students. The LSRL might be \hat{y} = 52 + 4.8x. Interpretation: each additional hour of study is associated with a 4.8-point increase in the predicted test score. A student who studies 0 hours is predicted to score 52.

Notation

\hat{y} is the predicted value. b is the slope. a is the y-intercept. r is the correlation coefficient. s_x, s_y are the sample standard deviations of x and y. \bar{x}, \bar{y} are the means of x and y.

What This Formula Means

The LSRL is the unique straight line \hat{y} = a + bx that minimizes the sum of squared vertical distances (residuals) between the observed data points and the line.
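
To make "minimizes the sum of squared residuals" concrete, the sketch below fits the line to the five points from Example 1 and checks that nudging the intercept or slope always increases the sum of squared errors. The helper names sse and fit_lsrl and the particular nudges are illustrative assumptions.

```python
def sse(a, b, xs, ys):
    """Sum of squared vertical distances between the points and y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def fit_lsrl(xs, ys):
    """Least-squares intercept and slope from raw data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_lsrl(xs, ys)
best = sse(a, b, xs, ys)
print(f"fitted line: SSE = {best:.2f}")

# Shift the intercept and/or slope slightly; each perturbed line fits worse.
for da, db in [(0.2, 0.0), (-0.2, 0.0), (0.0, 0.1), (0.0, -0.1), (0.2, -0.1)]:
    worse = sse(a + da, b + db, xs, ys)
    print(f"a{da:+.1f}, b{db:+.1f}: SSE = {worse:.2f}")
```

A handful of nudges is only a spot check, not a proof, but it shows what "best" means in practice.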

Formal View

\hat{y} = a + bx where b = r \cdot \frac{s_y}{s_x} and a = \bar{y} - b\bar{x}; equivalently, b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}
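
The two slope expressions agree because the factors of n - 1 and s_y cancel. A short derivation, assuming the usual sample definitions of r and s_x (with n - 1 in the denominator):

```latex
\begin{aligned}
r &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y},
\qquad
s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} \\[4pt]
b &= r \cdot \frac{s_y}{s_x}
   = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} \cdot \frac{s_y}{s_x}
   = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x^2}
   = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
\end{aligned}
```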

Worked Examples

Example 1

medium
Find the least-squares regression line for the points (x, y): (1,2), (2,4), (3,5), (4,4), (5,5). Use b = r \frac{s_y}{s_x} and a = \bar{y} - b\bar{x}.

Solution

  1. Means and standard deviations: \bar{x} = 3, \bar{y} = 4; s_x = \sqrt{10/4} = \sqrt{2.5} \approx 1.58; s_y = \sqrt{6/4} = \sqrt{1.5} \approx 1.22
  2. Calculate r, using n - 1 = 4: \sum(x_i-\bar{x})(y_i-\bar{y}) = (-2)(-2)+(-1)(0)+(0)(1)+(1)(0)+(2)(1) = 4+0+0+0+2 = 6; r = \frac{6}{(n-1) \, s_x \, s_y} = \frac{6}{4\sqrt{2.5}\sqrt{1.5}} = \frac{6}{7.746} \approx 0.775 (rounding s_x and s_y to two decimals first gives 0.778)
  3. Slope: b = r \frac{s_y}{s_x} = 0.775 \times \frac{1.225}{1.581} \approx 0.775 \times 0.775 \approx 0.60
  4. Intercept: a = \bar{y} - b\bar{x} = 4 - 0.60(3) = 4 - 1.8 = 2.2

Answer

\hat{y} = 2.2 + 0.60x
The least-squares regression line minimizes the sum of squared residuals. The slope b is the predicted change in y for a one-unit increase in x. The line always passes through (\bar{x}, \bar{y}); here, 2.2 + 0.60(3) = 4 = \bar{y}.
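
The hand calculation can be double-checked with a few lines of Python. This is a sketch using only the standard-library statistics module; the variable names are arbitrary. Carrying full precision gives r of about 0.775, so small differences from a hand-computed r come from rounding s_x and s_y early.

```python
import statistics

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = statistics.mean(xs)
y_bar = statistics.mean(ys)
s_x = statistics.stdev(xs)   # sample standard deviation (divides by n - 1)
s_y = statistics.stdev(ys)

# Correlation from the sum of cross-products of deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
r = s_xy / ((n - 1) * s_x * s_y)

b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # intercept

print(f"r = {r:.3f}, b = {b:.2f}, a = {a:.2f}")
```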

Example 2

hard
The LSRL for predicting weight (y, kg) from height (x, cm) is \hat{y} = -100 + 0.8x. Interpret the slope and intercept, predict weight for height=175 cm, and explain why extrapolating to height=50 cm is problematic.
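
No worked solution is given for this example, so the following is only one way to approach it. The slope says each additional centimetre of height is associated with a 0.8 kg increase in predicted weight. The intercept says a height of 0 cm predicts -100 kg, which has no physical meaning. The Python sketch below evaluates the stated line at both heights; the function name predict_weight_kg is hypothetical.

```python
def predict_weight_kg(height_cm):
    """Predicted weight from the stated line y-hat = -100 + 0.8x."""
    return -100 + 0.8 * height_cm

at_175 = predict_weight_kg(175)   # the height asked about in the problem
at_50 = predict_weight_kg(50)     # the extrapolation the problem warns about

print(f"175 cm -> {at_175:.1f} kg")
print(f" 50 cm -> {at_50:.1f} kg")
```

The prediction at 175 cm is -100 + 0.8(175) = 40 kg. At 50 cm the line returns -60 kg, an impossible negative weight, which illustrates why a linear pattern should not be trusted far outside the heights used to fit it.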

Common Mistakes

  • Using the regression line to predict outside the range of the data (extrapolation)—the linear pattern may not hold beyond observed values.
  • Confusing the roles of x and y: the regression of y on x is different from the regression of x on y.
  • Interpreting the y-intercept literally when x = 0 is outside the data range or doesn't make sense in context.
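
To illustrate the second mistake, the sketch below fits both regressions to the five points from Example 1. The helper name slope_intercept is an assumption made for this illustration. Swapping the roles of x and y does not simply invert the line.

```python
def slope_intercept(us, vs):
    """Least-squares line for predicting v from u: v-hat = a + b*u."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    b = s_uv / s_uu
    return v_bar - b * u_bar, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_yx, b_yx = slope_intercept(xs, ys)   # regression of y on x
a_xy, b_xy = slope_intercept(ys, xs)   # regression of x on y

# Algebraically solving y = a_yx + b_yx * x for x would give this slope instead.
inverted_slope = 1 / b_yx

print(f"y on x: y-hat = {a_yx:.2f} + {b_yx:.2f}x")
print(f"x on y: x-hat = {a_xy:.2f} + {b_xy:.2f}y")
print(f"slope from inverting y-on-x: {inverted_slope:.2f}")
```

The two fitted slopes multiply to r^2 rather than to 1, so the lines coincide only when the correlation is perfect.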

Why This Formula Matters

Regression is the workhorse of data analysis. It allows prediction, quantifies relationships, and is the foundation for more advanced modeling techniques used everywhere from economics to medicine.

Frequently Asked Questions

What is the Least Squares Regression Line formula?

It is the unique straight line \hat{y} = a + bx that minimizes the sum of squared vertical distances (residuals) between the observed data points and the line.

How do you use the Least Squares Regression Line formula?

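Compute the slope b = r \cdot \frac{s_y}{s_x} and the intercept a = \bar{y} - b\bar{x} from your data, then substitute a value of x from within the observed range into \hat{y} = a + bx to get a predicted y. The formula applies when a scatter plot shows points scattered around a roughly linear trend; the LSRL is then the 'best' straight line through the cloud, in the sense that it minimizes the total squared prediction error.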

What do the symbols mean in the Least Squares Regression Line formula?

\hat{y} is the predicted value. b is the slope. a is the y-intercept. r is the correlation coefficient. s_x, s_y are the sample standard deviations of x and y. \bar{x}, \bar{y} are the means of x and y.

Why is the Least Squares Regression Line formula important in Math?

Regression is the workhorse of data analysis. It allows prediction, quantifies relationships, and is the foundation for more advanced modeling techniques used everywhere from economics to medicine.

What do students get wrong about the Least Squares Regression Line?

The slope is NOT the correlation. The slope has units (change in y per unit change in x), while r is unitless and bounded between -1 and 1.
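
One way to see the difference is to change the units of x. In the sketch below, the five x values from Example 1 are treated as hours (an assumption made purely for illustration) and then converted to minutes. The slope shrinks by a factor of 60, while r does not move.

```python
import math

def slope_and_r(xs, ys):
    """Return (slope, correlation) for the regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / s_xx, s_xy / math.sqrt(s_xx * s_yy)

xs_hours = [1, 2, 3, 4, 5]                  # hypothetical unit: hours
ys = [2, 4, 5, 4, 5]
xs_minutes = [60 * x for x in xs_hours]     # same data, x measured in minutes

b_hr, r_hr = slope_and_r(xs_hours, ys)
b_min, r_min = slope_and_r(xs_minutes, ys)

print(f"hours:   slope = {b_hr:.4f} per hour,   r = {r_hr:.3f}")
print(f"minutes: slope = {b_min:.4f} per minute, r = {r_min:.3f}")
```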

What should I learn before the Least Squares Regression Line formula?

Before studying the Least Squares Regression Line formula, you should understand: correlation, scatter plot, mean, standard deviation.