Statistics · Exploring Two-Variable Data · 14 min read · Updated 2026-05-11

AP Statistics: Linear Regression Models

1. Introduction to Linear Regression Models ★★☆☆☆ ⏱ 3 min

Linear regression models are statistical models that describe the linear relationship between an explanatory (independent) variable $x$ and a response (dependent) variable $y$. We distinguish between the population model, which describes the true underlying relationship with unknown parameters $\beta_0, \beta_1$ and a random error term $\varepsilon$, and the estimated sample model, which uses calculated statistics $b_0, b_1$ from sample data.

The most common method for fitting a linear model to sample data is least squares regression, which produces the line that minimizes the sum of squared vertical distances between the observed data points and the line. This topic belongs to the Exploring Two-Variable Data unit, which makes up approximately 5-7% of the total AP Statistics exam weight, and it appears in both the multiple-choice and free-response sections.

2. The Least Squares Regression Line (LSRL) ★★☆☆☆ ⏱ 4 min

A residual is the vertical difference between the observed response value and the predicted response value for any $x$: $e_i = y_i - \hat{y}_i$, where $\hat{y}_i$ is the predicted $y$ for observation $i$. The goal of least squares is to minimize $\sum e_i^2 = \sum (y_i - \hat{y}_i)^2$.
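To make the least squares criterion concrete, here is a minimal Python sketch that computes residuals and the sum of squared residuals for a few candidate lines. The five data points, the candidate lines, and the function names are all hypothetical, invented for illustration only.

```python
# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def residuals(b0, b1):
    """Residuals e_i = y_i - y_hat_i for the line y_hat = b0 + b1 * x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def sse(b0, b1):
    """Sum of squared residuals, the quantity least squares minimizes."""
    return sum(e ** 2 for e in residuals(b0, b1))

# Candidate (intercept, slope) pairs; (2.2, 0.6) is the least squares line for this data
candidates = [(2.2, 0.6), (2.0, 0.7), (1.0, 1.0)]
for b0, b1 in candidates:
    print(f"y_hat = {b0} + {b1}x  ->  SSE = {sse(b0, b1):.2f}")
```

Of the three candidates, the first gives the smallest sum of squared residuals (about 2.40, versus 2.55 and 4.00), and no other line does better on this data. Its residuals also sum to zero, which is a general property of the least squares line.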

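The LSRL has the form $\hat{y} = b_0 + b_1 x$, where the slope $b_1$ and intercept $b_0$ are calculated as:

$b_1 = r \cdot \frac{s_y}{s_x}$

$b_0 = \bar{y} - b_1\bar{x}$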

where $r$ is the correlation coefficient between $x$ and $y$, $s_y$ is the sample standard deviation of $y$, $s_x$ is the sample standard deviation of $x$, $\bar{y}$ is the sample mean of $y$, and $\bar{x}$ is the sample mean of $x$. A key property of the LSRL is that it always passes through the point $(\bar{x}, \bar{y})$, which can be used to check calculations. The LSRL is defined for predicting $y$ from $x$, so swapping $x$ and $y$ produces an entirely different line.
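The formulas and properties above can be sketched in Python using only the standard library. The data set and the names `correlation`, `lsrl`, `hours`, and `score` are hypothetical, chosen for illustration; the sketch fits the line in both directions to show that swapping $x$ and $y$ changes the result.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys)) / (n - 1)

def lsrl(xs, ys):
    """Return (b0, b1) for the least squares line predicting ys from xs."""
    b1 = correlation(xs, ys) * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1

# Hypothetical data, invented for illustration only
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

b0, b1 = lsrl(hours, score)  # predict score from hours
c0, c1 = lsrl(score, hours)  # predict hours from score (variables swapped)

print(f"score_hat = {b0:.2f} + {b1:.2f} * hours")
print(f"hours_hat = {c0:.2f} + {c1:.2f} * score")

# The fitted line should pass through (x_bar, y_bar)
passes_through_means = abs((b0 + b1 * mean(hours)) - mean(score)) < 1e-9
print("passes through (x_bar, y_bar):", passes_through_means)
```

For this data the first line has slope 0.60 and the swapped line has slope 1.00. Simply solving the first equation for hours would give a slope of about 1.67, so the swapped regression is a different line rather than an algebraic rearrangement.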

📐 Worked Example

A student researcher collects data on hours spent playing video games per week ($x$) and GPA ($y$, on a 4.0 scale) for 20 college students. Summary statistics are: $\bar{x} = 8.2$ hours, $s_x = 4.1$ hours, $\bar{y} = 3.1$, $s_y = 0.6$, $r = -0.58$. Calculate the equation of the LSRL for predicting GPA from weekly video game play.

Calculate the slope $b_1$ using the formula:
$b_1 = r \cdot \frac{s_y}{s_x} = -0.58 \cdot \frac{0.6}{4.1} \approx -0.58 \cdot 0.146 \approx -0.085$
Calculate the intercept $b_0$ using the fact that the LSRL passes through $(\bar{x}, \bar{y})$:
$b_0 = \bar{y} - b_1\bar{x} = 3.1 - (-0.085)(8.2) \approx 3.1 + 0.697 = 3.797$
Write the final equation with defined variables:
$\hat{y} = 3.80 - 0.085x, \text{ where } \hat{y} \text{ is predicted GPA and } x \text{ is weekly video game play in hours}$
Check that $(\bar{x}, \bar{y})$ satisfies the equation: $3.80 - 0.085(8.2) \approx 3.1 = \bar{y}$, so calculations are correct.
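As a cross-check on the arithmetic, the same calculation can be run in Python directly from the summary statistics given in the problem. This is only a sketch; the variable names are assumptions chosen to mirror the notation.

```python
# Summary statistics from the worked example
x_bar, s_x = 8.2, 4.1   # weekly video game hours
y_bar, s_y = 3.1, 0.6   # GPA on a 4.0 scale
r = -0.58

b1 = r * s_y / s_x        # slope
b0 = y_bar - b1 * x_bar   # intercept, computed with the unrounded slope

print(f"slope = {b1:.4f}, intercept = {b0:.3f}")
print(f"prediction at x_bar = {b0 + b1 * x_bar:.2f}")
```

With the unrounded slope the intercept comes out near 3.796 rather than 3.797. Both round to 3.80, which shows how small the effect of rounding the slope to -0.085 is here.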

Exam tip: Carry extra digits through intermediate steps and round only the final slope and intercept, typically to 2-3 significant digits consistent with the input data. Rounding too early introduces calculation errors, and reporting too many digits wastes time on the exam.

3. Interpreting Slope and Intercept in Context ★★★☆☆ ⏱ 3 min

One of the most frequently tested skills on the AP Statistics exam is correctly interpreting the slope and intercept of a linear regression model in context. Unlike pure math problems, AP requires interpretation tied directly to the scenario, not just a generic description.

The slope $b_1$ is the predicted *average* change in the response variable $y$ for a 1-unit increase in the explanatory variable $x$. It always has units of (units of y) per (unit of x). The intercept $b_0$ is the predicted average value of $y$ when $x = 0$. The intercept only has a practical, meaningful interpretation if $x = 0$ is a plausible possible value in the context of the problem. If $x = 0$ is impossible or far outside the range of observed data, the intercept is only a mathematical anchor for the line and has no practical meaning.

📐 Worked Example

Using the LSRL from the previous example: $\hat{y} = 3.80 - 0.085x$, where $x$ is weekly video game play in hours and $\hat{y}$ is predicted college GPA. Interpret the slope and intercept in context, and state if the intercept is practically meaningful.

Interpret the slope: The slope of -0.085 means that for each additional 1 hour of weekly video game play, the predicted average college GPA decreases by 0.085 points (on the 4.0 scale).
Interpret the intercept: The intercept of 3.80 means that for a student who plays 0 hours of video games per week, the predicted average GPA is 3.80.
Check for meaningfulness: 0 hours of weekly video game play is a plausible value for a college student, so the intercept has a practical interpretation in this context. If the explanatory variable were "height of adult men" instead, $x = 0$ would be impossible, and the intercept would have no practical meaning.
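The two interpretations can also be read straight off the prediction function. The short Python sketch below uses the rounded coefficients from the example; the function name `predict_gpa` is hypothetical.

```python
B0, B1 = 3.80, -0.085  # rounded intercept and slope from the worked example

def predict_gpa(hours):
    """Predicted average GPA for a given number of weekly video game hours."""
    return B0 + B1 * hours

at_zero = predict_gpa(0)                           # equals the intercept
change_per_hour = predict_gpa(11) - predict_gpa(10)  # equals the slope

print(f"predicted GPA at 0 hours: {at_zero:.2f}")
print(f"change in predicted GPA per extra hour: {change_per_hour:.3f}")
```

The difference between predictions one hour apart is the slope no matter which starting value of $x$ is used, which is why the slope is described as the change per 1-unit increase.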

Exam tip: If you are asked to compare the slopes of two models whose variables are measured in the same units, the steeper slope (larger absolute value) means a larger predicted change in $y$ per 1-unit change in $x$, regardless of the sign.

4. Residual Analysis and Coefficient of Determination ★★★☆☆ ⏱ 4 min

After fitting a linear regression model, we need to check if a linear model is actually appropriate for the data, and measure how much variation in $y$ the model explains. This is done with residual plots and the coefficient of determination ($R^2$).

A residual plot graphs residuals on the $y$-axis against the explanatory variable $x$ on the $x$-axis. For a linear model to be appropriate, residuals should be randomly scattered around the horizontal line at 0 with no clear pattern. A curved pattern means the true relationship between $x$ and $y$ is non-linear, so a linear model is a poor fit. A fan-shaped pattern (residuals getting wider or narrower as $x$ increases) means non-constant error variance, which violates regression assumptions.
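A curved residual pattern can be seen even without drawing the plot. The Python sketch below fits a line to deliberately curved data and prints the sign of each residual in order of $x$. The data, generated from the hypothetical curve $y = 20x - x^2$, are invented for illustration.

```python
from statistics import mean

# Hypothetical curved data: y = 20x - x^2 at x = 1..9
xs = list(range(1, 10))
ys = [20 * x - x ** 2 for x in xs]

x_bar, y_bar = mean(xs), mean(ys)

# Least squares slope written as S_xy / S_xx, algebraically equal to r * s_y / s_x
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

for x, e in zip(xs, residuals):
    print(f"x = {x}: residual = {e:+.2f}")
print("sign pattern:", signs)
```

The residuals are negative at both ends and positive in the middle, so the signs come in long runs instead of alternating at random. That is the numerical signature of the curved pattern described above.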

The coefficient of determination $R^2$ (equal to $r^2$ for simple linear regression) measures the proportion of variation in the response variable $y$ that is explained by the linear relationship with $x$. It ranges from 0 (no linear explanation) to 1 (all variation explained), or 0% to 100% when expressed as a percentage. Higher $R^2$ means a stronger linear relationship.
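The "proportion of variation explained" reading can be checked numerically. The sketch below uses the standard identity $R^2 = 1 - \text{SSE}/\text{SST}$, which is not stated in the text above, and confirms that it matches $r^2$ on a small hypothetical data set invented for illustration.

```python
from statistics import mean

# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y (SST)

# Fit the least squares line
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Variation left unexplained by the line (SSE)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r = s_xy / (s_xx * s_yy) ** 0.5
r_squared = 1 - sse / s_yy

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, 1 - SSE/SST = {r_squared:.4f}")
```

Here the line leaves 2.4 of the 6.0 units of total squared variation unexplained, so it explains 60% of the variation in $y$, matching $r^2$.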

📐 Worked Example

A botanist fits a linear regression model to data on tree age ($x$, years) and tree height ($y$, meters) for trees aged 1 to 50 years. Her residual plot shows residuals that are negative for young trees, positive for middle-aged trees, and negative again for old trees, forming a clear hump shape. $r = 0.82$ between age and height. What does the residual plot tell you about model fit? Calculate and interpret $R^2$.

The clear curved hump pattern in the residual plot indicates that a linear model is not appropriate for this relationship. This matches what we know about tree growth: trees grow quickly when young, level off when mature, so the relationship is curved, not linear.
Calculate $R^2$:
$R^2 = r^2 = (0.82)^2 = 0.6724 = 67.24\%$
Interpret $R^2$: Approximately 67% of the variation in tree height is explained by the linear relationship with tree age. Even though the linear relationship is strong, the curved pattern means a non-linear model would fit better.

Exam tip: Residual plots only check if a linear model is appropriate, not how strong the relationship is. A weak linear relationship can still have a random residual pattern, meaning a linear model is appropriate but not very predictive.

Common Pitfalls

Mixing up which variable is explanatory and which is the response. Correlation is symmetric, but regression is not.

Forgetting that regression predicts the average response, not an exact outcome for every individual.

Confusing association (what regression measures) with causation, which requires a randomized experiment.

Assuming every intercept needs a practical interpretation by default.

Assuming the linear relationship holds everywhere, which is almost never true.

Confusing model adequacy (linearity) with strength of relationship.

Quick Reference Cheatsheet
