Statistics · Exploring Two-Variable Data · 14 min read · Updated 2026-05-11
Least Squares Regression — AP Statistics
1. What Is Least Squares Regression? ★★☆☆☆ ⏱ 3 min
Least squares regression (abbreviated LSR, with the resulting model called the least squares regression line, LSRL) is the standard objective method for fitting a straight line to bivariate quantitative data. Its core goal is to find the "best fitting" line for predicting values of a response variable $y$ from an explanatory variable $x$.
Unlike a line fit by eye, least squares uses a formal, replicable criterion to define "best": it minimizes the sum of the squared vertical distances (called residuals) between observed $y$-values and $y$-values predicted by the line. This method is the foundation for all further regression analysis in AP Statistics, and makes up 2-3% of total exam weight, appearing in both multiple-choice and free-response sections.
2. Least Squares Criterion and LSRL Coefficients ★★★☆☆ ⏱ 4 min
For any linear model predicting $y$ from $x$, we write the LSRL as:
$$\hat{y} = a + bx$$
where $\hat{y}$ is the predicted value of the response variable, $b$ is the slope, and $a$ is the y-intercept. A residual $e_i$ for the $i$-th data point is defined as $e_i = y_i - \hat{y}_i$, where $y_i$ is the observed response value. The sum of squared residuals (SSE) is:
$$\text{SSE} = \sum_{i=1}^{n} (y_i - a - bx_i)^2$$
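To make the criterion concrete, here is a minimal Python sketch that evaluates SSE for two candidate lines on the same data. The data set and the function name `sse` are invented for illustration and are not part of the AP materials.

```python
def sse(xs, ys, a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, made up for this example only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# A line drawn "by eye" (assumed coefficients)
sse_by_eye = sse(xs, ys, a=2.0, b=0.5)

# The least squares line for these particular data (a = 2.2, b = 0.6)
sse_lsrl = sse(xs, ys, a=2.2, b=0.6)

print(sse_by_eye, sse_lsrl)  # the second value is the smaller of the two
```

For this data set, no other choice of $a$ and $b$ gives a smaller SSE than the least squares line.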
Minimizing SSE gives closed-form formulas for LSRL coefficients. The slope $b$ is calculated as:
$$b = r \frac{s_y}{s_x}$$
where $r$ is the correlation between $x$ and $y$, $s_y$ is the standard deviation of $y$, and $s_x$ is the standard deviation of $x$. A key property of the LSRL is that it always passes through the point of the means $(\bar{x}, \bar{y})$, which we use to calculate the intercept $a$:
$$a = \bar{y} - b\bar{x}$$
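The two formulas can be applied directly to summary statistics, which is how they often appear on the exam. The sketch below assumes a hypothetical helper `lsrl_from_summary` and made-up summary values. It illustrates the arithmetic and is not a required method.

```python
def lsrl_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (a, b) for y-hat = a + b*x from summary statistics."""
    b = r * s_y / s_x          # slope: b = r * (s_y / s_x)
    a = y_bar - b * x_bar      # intercept: line passes through (x_bar, y_bar)
    return a, b

# Hypothetical summary statistics, chosen only to keep the arithmetic simple
a, b = lsrl_from_summary(r=0.8, s_x=2.0, s_y=5.0, x_bar=10.0, y_bar=50.0)

# Slope is 0.8 * 5 / 2 = 2, intercept is 50 - 2 * 10 = 30
predicted_at_mean = a + b * 10.0   # equals y_bar, confirming the point of the means
print(a, b, predicted_at_mean)
```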
Exam tip: Always confirm the sign of your slope matches the sign of the correlation. Because $s_x$ and $s_y$ are always positive, $b = r \frac{s_y}{s_x}$ takes the sign of $r$: a negative correlation always gives a negative slope, and a positive correlation always gives a positive slope. This is a quick check to catch calculation errors.
3. Residual Calculation and Interpretation ★★★☆☆ ⏱ 3 min
Residuals measure the prediction error of our LSRL: they tell us how far off the line's prediction is for each observed data point. A positive residual means the line underpredicted $y$ (observed $y$ is higher than predicted), while a negative residual means the line overpredicted $y$ (observed $y$ is lower than predicted).
For any LSRL, the sum of all residuals is always 0, because the line is centered on the point of the means. For a given data set, a lower SSE (sum of squared residuals) means a better-fitting linear model. Calculating and interpreting residuals is a very common task on the AP exam.
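As a quick check of these properties, the following Python sketch fits a line to a small made-up data set and computes each residual. The helper `fit_lsrl` is hypothetical. It computes the slope as $\sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$, which is algebraically the same as $r \frac{s_y}{s_x}$.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx            # equivalent to r * s_y / s_x
    a = y_bar - b * x_bar      # line passes through (x_bar, y_bar)
    return a, b

# Hypothetical data, made up for this example only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_lsrl(xs, ys)

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
total = sum(residuals)         # zero, up to floating-point rounding

# residuals[0] is negative: the line overpredicts at x = 1
# residuals[1] is positive: the line underpredicts at x = 2
print(residuals, total)
```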
Exam tip: If you are asked to plot a residual, the x-coordinate is the x-value of the original observation, and the y-coordinate is the residual, not the original $y$-value.
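The coordinates of a residual plot can be sketched in a few lines of Python. The function name `residual_plot_points`, the data, and the line coefficients are assumptions made for illustration.

```python
def residual_plot_points(xs, ys, a, b):
    """Return (x, residual) pairs for the line y-hat = a + b*x."""
    return [(x, y - (a + b * x)) for x, y in zip(xs, ys)]

# Hypothetical data and the least squares line for it (a = 2.2, b = 0.6)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

points = residual_plot_points(xs, ys, a=2.2, b=0.6)

# The observation (3, 5) is plotted at roughly (3, 1.0):
# same x-value, but the vertical coordinate is the residual, not y = 5.
print(points[2])
```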
4. Interpreting LSRL Slope and Intercept ★★★☆☆ ⏱ 3 min
AP exam questions almost always require contextually correct interpretations of slope and intercept, and this is a common area for point deductions. Strict phrasing is required to earn full credit.
Exam tip: Always include units of measurement for both $x$ and $y$ in your interpretation, and always use the phrase "predicted average" to avoid incorrect claims about individual changes or causation.
5. AP-Style Worked Practice Problems ★★★★☆ ⏱ 4 min
Common Pitfalls
Why: Students mix up the order of terms because they write predicted $y$ first in the regression equation.
Why: Students swap the order of standard deviations because they forget which variable is which.
Why: Students forget the LSRL models the average trend, not individual outcomes.
Why: Students think all coefficients require a practical interpretation.
Why: Students confuse observed response values with predicted response values.
Why: Students confuse association (measured by regression) with causation, which can only be inferred from randomized experiments.