Statistics · Exploring Two-Variable Data · 14 min read · Updated 2026-05-11

Least Squares Regression — AP Statistics

1. What Is Least Squares Regression? ★★☆☆☆ ⏱ 3 min

Least squares regression (abbreviated LSR, with the resulting model called the least squares regression line, LSRL) is the standard objective method for fitting a straight line to bivariate quantitative data. Its core goal is to find the "best fitting" line for predicting values of a response variable $y$ from an explanatory variable $x$.

Unlike a line fit by eye, least squares uses a formal, replicable criterion to define "best": it minimizes the sum of the squared vertical distances (called residuals) between observed $y$-values and $y$-values predicted by the line. This method is the foundation for all further regression analysis in AP Statistics, and makes up 2-3% of total exam weight, appearing in both multiple-choice and free-response sections.

2. Least Squares Criterion and LSRL Coefficients ★★★☆☆ ⏱ 4 min

For any linear model predicting $y$ from $x$, we write the LSRL as:

$\hat{y} = a + bx$

where $\hat{y}$ is the predicted value of the response variable, $b$ is the slope, and $a$ is the y-intercept. A residual $e_i$ for the $i$-th data point is defined as $e_i = y_i - \hat{y}_i$, where $y_i$ is the observed response value. The sum of squared residuals (SSE) is:

$\text{SSE} = \sum_{i=1}^{n} (y_i - a - bx_i)^2$
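As a rough illustration of the criterion, the sketch below computes SSE for two candidate lines on a small made-up data set. The data values and the function name `sse` are assumptions chosen for illustration only. The second candidate happens to be the least squares line for these five points, so no other line gives a smaller SSE.

```python
def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two candidate lines: the one with the smaller SSE fits better
# under the least squares criterion.
sse_guess = sse(1.0, 1.0, xs, ys)   # y-hat = 1 + 1x
sse_better = sse(2.2, 0.6, xs, ys)  # y-hat = 2.2 + 0.6x

print(sse_guess, round(sse_better, 4))
```

On this data the first line gives an SSE of 4 and the second an SSE of about 2.4.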

Minimizing SSE gives closed-form formulas for LSRL coefficients. The slope $b$ is calculated as:

$b = r \frac{s_y}{s_x}$

where $r$ is the correlation between $x$ and $y$, $s_y$ is the standard deviation of $y$, and $s_x$ is the standard deviation of $x$. A key property of the LSRL is that it always passes through the point of the means $(\bar{x}, \bar{y})$, which we use to calculate the intercept $a$:

$a = \bar{y} - b\bar{x}$
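For readers curious where these formulas come from, here is an optional sketch of the derivation. It uses calculus, which AP Statistics does not require, and it assumes that $r$ is defined as the sum of products of z-scores divided by $n - 1$, with $s_x$ and $s_y$ the sample standard deviations. Setting both partial derivatives of SSE to zero gives:

```latex
\begin{aligned}
\frac{\partial\,\text{SSE}}{\partial a} &= -2\sum_{i=1}^{n} (y_i - a - bx_i) = 0
  \;\Longrightarrow\; a = \bar{y} - b\bar{x} \\
\frac{\partial\,\text{SSE}}{\partial b} &= -2\sum_{i=1}^{n} x_i\,(y_i - a - bx_i) = 0
  \;\Longrightarrow\; b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{(n-1)\, r\, s_x s_y}{(n-1)\, s_x^2} = r\,\frac{s_y}{s_x}
\end{aligned}
```

The first line is also why the LSRL passes through $(\bar{x}, \bar{y})$: substituting $x = \bar{x}$ into $\hat{y} = a + bx$ returns $\bar{y}$.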

Exam tip: Always confirm the sign of your slope matches the sign of the correlation. A negative correlation should always give a negative slope, and a positive correlation gives a positive slope; this is a quick check to catch calculation errors.
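The formulas for $b$ and $a$ can be checked numerically. The following is a minimal sketch on made-up data, not a replacement for calculator output; the helper names (`mean`, `sample_sd`, `correlation`, `lsrl`) are hypothetical and assume sample standard deviations with an $n - 1$ divisor.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """Correlation r as the sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def lsrl(xs, ys):
    """Return (a, b) using b = r * s_y / s_x and a = y_bar - b * x_bar."""
    r = correlation(xs, ys)
    b = r * sample_sd(ys) / sample_sd(xs)
    a = mean(ys) - b * mean(xs)
    return a, b


# Hypothetical data, made up purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = lsrl(xs, ys)
r = correlation(xs, ys)

# Quick checks from the text: slope and r share a sign, and the line
# passes through the point of the means.
same_sign = (b > 0) == (r > 0)
passes_through_means = math.isclose(a + b * mean(xs), mean(ys))

print(round(a, 4), round(b, 4), same_sign, passes_through_means)
```

For this data the intercept comes out at about 2.2 and the slope at about 0.6.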

3. Residual Calculation and Interpretation ★★★☆☆ ⏱ 3 min

Residuals measure the prediction error of our LSRL: they tell us how far off the line's prediction is for each observed data point. A positive residual means the line underpredicted $y$ (observed $y$ is higher than predicted), while a negative residual means the line overpredicted $y$ (observed $y$ is lower than predicted).

For any LSRL, the residuals always sum to 0, because the line passes through the point of the means: $\sum e_i = n(\bar{y} - a - b\bar{x}) = 0$ once $a = \bar{y} - b\bar{x}$. For a given data set, a lower SSE (sum of squared residuals) means a better-fitting linear model. Calculating and interpreting residuals is a very common AP exam task.
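A short sketch of these properties, again on made-up data: it lists each residual with its sign interpretation and confirms that the residuals sum to (numerically) zero. The data and the coefficients 2.2 and 0.6, which are the least squares values for these five points, are assumptions for illustration.

```python
def residuals(a, b, xs, ys):
    """Residual e_i = y_i - y-hat_i for each observation."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data and its least squares coefficients.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

res = residuals(a, b, xs, ys)

for x, y, e in zip(xs, ys, res):
    # Positive residual: line underpredicted; negative: line overpredicted.
    label = "underpredicted" if e > 0 else "overpredicted" if e < 0 else "exact"
    print(x, y, round(e, 2), label)

total = sum(res)
print("sum of residuals:", round(total, 10))
```

The pairs `(x, e)` printed here are also the coordinates that would go on a residual plot.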

Exam tip: If you are asked to plot a residual, the x-coordinate is the x-value of the original observation, and the y-coordinate is the residual, not the original $y$-value.

4. Interpreting LSRL Slope and Intercept ★★★☆☆ ⏱ 3 min

AP exam questions almost always require contextually correct interpretations of slope and intercept, and this is a common area for point deductions. Strict phrasing is required to earn full credit.

📐 Worked Example

A LSRL for predicting the height of a pine seedling (in cm) from the amount of water it receives per week (in mL) is $\hat{y} = 2.1 + 0.12x$. Interpret the slope and intercept of this line in context, and state whether the intercept is meaningful.

1. Slope interpretation: For each additional 1 mL of water given per week, the predicted average height of a pine seedling increases by 0.12 cm.
2. Intercept interpretation: The predicted average height of a pine seedling that receives 0 mL of water per week is 2.1 cm.
3. Since 0 mL of water per week is a plausible treatment, the intercept is meaningful in this context. If the study only included treatments from 10 mL to 50 mL of water, 0 mL would be outside the range of data, and the intercept would not be practically meaningful.
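The interpretations above can be read directly off the equation. The sketch below evaluates $\hat{y} = 2.1 + 0.12x$ at a few points; the function names are hypothetical, and the 10 mL to 50 mL range is only the hypothetical study range mentioned in step 3, not real data.

```python
def predict_height(water_ml):
    """Predicted average seedling height (cm) from weekly water (mL)."""
    return 2.1 + 0.12 * water_ml


# Slope: raising x by 1 mL changes the prediction by 0.12 cm,
# regardless of the starting value of x.
slope_check = predict_height(31) - predict_height(30)

# Intercept: the prediction at x = 0 mL.
intercept_check = predict_height(0)


def is_extrapolation(water_ml, x_min=10, x_max=50):
    """True if x lies outside the observed range of the data.

    The default range (10 to 50 mL) is the hypothetical range from
    the worked example, used here only for illustration.
    """
    return water_ml < x_min or water_ml > x_max


print(round(slope_check, 2), intercept_check, is_extrapolation(0))
```

If the data had only covered 10 mL to 50 mL, `is_extrapolation(0)` flags the intercept as a prediction outside the data, which is why it would not be practically meaningful in that case.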

Exam tip: Always include units of measurement for both $x$ and $y$ in your interpretation, and always use the phrase "predicted average" to avoid incorrect claims about individual changes or causation.

5. AP-Style Worked Practice Problems ★★★★☆ ⏱ 4 min

📐 Worked Example

A sociologist studies the relationship between median household income (x, in thousands of dollars) and average life expectancy (y, in years) for 50 zip codes in a large US state. Summary statistics are: $\bar{x} = 65$, $s_x = 18$, $\bar{y} = 78$, $s_y = 4.5$, $r = 0.72$. (a) Calculate the equation of the least squares regression line for predicting life expectancy from median income. (b) Interpret the slope of your regression line in context. (c) One zip code has a median income of \$80,000 and an average life expectancy of 79.2 years. Calculate and interpret the residual for this zip code.

(a) Calculate slope first:
$b = r \frac{s_y}{s_x} = 0.72 \times \frac{4.5}{18} = 0.18$
Next calculate intercept:
$a = \bar{y} - b\bar{x} = 78 - (0.18)(65) = 66.3$
Final LSRL: $\hat{y} = 66.3 + 0.18x$, where $\hat{y}$ = predicted average life expectancy (years), $x$ = median household income (thousands of dollars).
(b) Slope interpretation: For each additional \$1,000 in median household income in a zip code, the predicted average life expectancy increases by 0.18 years.
(c) Calculate predicted life expectancy for $x=80$:
$\hat{y} = 66.3 + 0.18(80) = 80.7 \text{ years}$
Calculate residual:
$e = y - \hat{y} = 79.2 - 80.7 = -1.5 \text{ years}$
Interpretation: The average life expectancy for this zip code is 1.5 years lower than the LSRL predicted based on its median household income.
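The arithmetic in parts (a) and (c) can be double-checked with a few lines of code. This is just a verification sketch using the summary statistics given in the problem; the variable names are chosen for illustration.

```python
# Summary statistics from the problem statement.
r, s_x, s_y = 0.72, 18, 4.5
x_bar, y_bar = 65, 78

# (a) Slope and intercept of the LSRL.
b = r * s_y / s_x
a = y_bar - b * x_bar

# (c) Prediction and residual for x = 80 (thousands of dollars), y = 79.2.
x_obs, y_obs = 80, 79.2
y_hat = a + b * x_obs
residual = y_obs - y_hat

print(round(b, 2), round(a, 1), round(y_hat, 1), round(residual, 1))
# prints: 0.18 66.3 80.7 -1.5
```

The negative residual matches the interpretation above: the observed life expectancy falls below the line's prediction.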

Common Pitfalls

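Pitfall: Computing a residual as $\hat{y} - y$ instead of $y - \hat{y}$.
Why: Students mix up the order of terms because they write predicted $y$ first in the regression equation.

Pitfall: Computing the slope as $r \frac{s_x}{s_y}$ instead of $r \frac{s_y}{s_x}$.
Why: Students swap the order of standard deviations because they forget which variable is which.

Pitfall: Interpreting the slope as the exact change in $y$ for an individual observation rather than the change in the predicted average.
Why: Students forget the LSRL models the average trend, not individual outcomes.

Pitfall: Interpreting the intercept in context even when $x = 0$ lies outside the range of the data.
Why: Students think all coefficients require a practical interpretation.

Pitfall: Writing the line as $y = a + bx$, or leaving out the word "predicted" in an interpretation.
Why: Students confuse observed response values with predicted response values.

Pitfall: Claiming that a change in $x$ causes the change in $y$ based on the regression alone.
Why: Students confuse association (measured by regression) with causation, which can only be inferred from randomized experiments.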
