Statistics · CED Unit 2: Exploring Two-Variable Data · 14 min read · Updated 2026-05-11

Regression Outliers and Influential Points — AP Statistics


1. Regression Outliers ★★☆☆☆ ⏱ 4 min

The residual for any observation measures how far the observed $y$-value lies from the $y$-value predicted by the least-squares regression line (LSRL). A positive residual means the point sits above the line; a negative residual means it sits below.

$$e_i = y_i - \hat{y}_i$$

The standard rule of thumb for identifying a regression outlier (for roughly normally distributed residuals) is that the absolute residual exceeds twice the standard deviation of the residuals, $s$: $|e_i| > 2s$. A regression outlier whose $x$-value lies near the middle of the $x$-range usually has very little impact on the LSRL, so such points are rarely influential.
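To see the $2s$ rule in action, here is a minimal Python sketch. The data are made up for illustration, the function names `fit_lsrl` and `residual_outliers` are hypothetical, and $s$ is computed with $n - 2$ degrees of freedom, which is an assumption matching typical regression output rather than something stated above.

```python
from math import sqrt

def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residual_outliers(xs, ys):
    """Return (residuals, s, indices where |e_i| > 2s)."""
    intercept, slope = fit_lsrl(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Assumption: s uses n - 2 degrees of freedom.
    s = sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))
    flagged = [i for i, e in enumerate(residuals) if abs(e) > 2 * s]
    return residuals, s, flagged

# Made-up data: roughly y = 2x, except the point at x = 6.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 20.0, 14.2, 15.9, 18.1, 20.0]

residuals, s, flagged = residual_outliers(xs, ys)
print("s =", round(s, 3))
print("flagged indices:", flagged)
```

The only flagged point is the one at $x = 6$, which sits in the middle of the $x$-range. It has a large residual but, as described above, little pull on the line.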

Exam tip: Never identify an outlier just by how far it is from the origin. Always check residual size to confirm a point is a regression outlier, regardless of its position on the scatterplot.

2. High-Leverage Points ★★★☆☆ ⏱ 4 min

Leverage quantifies how far an observation's $x$-value is from the mean of all $x$-values. For simple linear regression, leverage of observation $i$ is calculated as:

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}$$

The rule of thumb for high leverage in simple linear regression is $h_i > \frac{4}{n}$, derived from the general rule $h_i > \frac{2(p+1)}{n}$ for $p$ predictors. A high-leverage point is not automatically an outlier or influential: if it follows the pattern of the rest of the data, it will not change the LSRL much.
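The leverage formula and the $4/n$ cutoff can be sketched in a few lines of Python. The $x$-values are made up and the function name `leverages` is hypothetical; only the $x$-values are needed, because leverage does not depend on $y$.

```python
def leverages(xs):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

# Made-up x-values: nine clustered points and one far to the right.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

h = leverages(xs)
cutoff = 4 / len(xs)  # rule of thumb for simple linear regression
high_leverage = [i for i, h_i in enumerate(h) if h_i > cutoff]

print("leverages:", [round(h_i, 3) for h_i in h])
print("high-leverage indices:", high_leverage)
```

In this sketch only the point at $x = 30$ exceeds the cutoff. Whether it is also influential depends on its $y$-value, which the leverage calculation never looks at.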

Exam tip: Always check the $x$-value range to identify high leverage. A point can have a very small residual (not an outlier) and still be high leverage.

3. Influential Points and Cook's Distance ★★★★☆ ⏱ 6 min

A point that is both a regression outlier and high leverage is almost always influential. We can formally measure influence with Cook's Distance, which combines residual size and leverage into a single metric that captures how much all predicted $y$-values would change if the point were removed. For simple linear regression, with $p = 2$ estimated parameters (intercept and slope; here $p$ counts parameters, not predictors as in the leverage rule), Cook's Distance is:

$$D_i = \frac{e_i^2}{p \cdot s^2} \cdot \frac{h_i}{(1-h_i)^2}$$

The standard rule of thumb is that a point is influential if $D_i > 1$; a more conservative cutoff of $D_i > 0.5$ is often used for small datasets.
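The Cook's Distance formula above can be computed directly from residuals and leverages. The following is a minimal sketch on made-up data; the function name `cooks_distances` is hypothetical, and $s^2$ is taken as the sum of squared residuals divided by $n - 2$, which is an assumption.

```python
def cooks_distances(xs, ys, p=2):
    """Cook's Distance for each point in a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Assumption: s^2 uses n - 2 degrees of freedom.
    s2 = sum(e ** 2 for e in residuals) / (n - 2)
    h = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

    # D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
    return [
        (e ** 2 / (p * s2)) * (h_i / (1 - h_i) ** 2)
        for e, h_i in zip(residuals, h)
    ]

# Made-up data: nine points on y = 2x, plus one far-right point well below it.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
ys = [2, 4, 6, 8, 10, 12, 14, 16, 18, 10]

d = cooks_distances(xs, ys)
influential = [i for i, d_i in enumerate(d) if d_i > 1]

print("Cook's distances:", [round(d_i, 3) for d_i in d])
print("influential indices (D > 1):", influential)
```

Only the last point exceeds the $D_i > 1$ cutoff here, and it would also be the only point flagged under the $0.5$ cutoff.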

Exam tip: AP graders require you to link influence to a change in regression parameters. Always state how much the slope/intercept changes when you remove the point to justify your conclusion that it is influential.
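One way to produce the before-and-after comparison this tip calls for is to fit the LSRL twice, once with the suspect point and once without it. The sketch below uses made-up data and hypothetical function names (`fit_lsrl`, `fit_without`).

```python
def fit_lsrl(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def fit_without(xs, ys, i):
    """Refit the LSRL with observation i removed."""
    return fit_lsrl(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])

# Made-up data: nine points on y = 2x, plus one far-right point well below it.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
ys = [2, 4, 6, 8, 10, 12, 14, 16, 18, 10]

a_all, b_all = fit_lsrl(xs, ys)
a_drop, b_drop = fit_without(xs, ys, 9)

print(f"with point:    y-hat = {a_all:.3f} + {b_all:.3f}x")
print(f"without point: y-hat = {a_drop:.3f} + {b_drop:.3f}x")
print(f"slope change:  {b_drop - b_all:.3f}")
```

In this example, removing the point moves the slope from about 0.457 to 2 and the intercept from about 7.029 to 0. Quoting the size of that change is the kind of justification the tip describes.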

Common Pitfalls

Why: Students confuse distance from the origin with influence, mixing up x and y extremes

Why: Students think any extreme x is automatically influential, but high-leverage points that follow the pattern of the other points do not change the slope

Why: Students mix up vertical outliers with influential points; an outlier in y that is in the middle of the x-range rarely has meaningful pull on the LSRL

Why: Students think unusual points are mistakes and must be removed, but influential points can be valid data that reveal important patterns

Why: What looks like an outlier visually can be within the 2s range for residuals, especially with large datasets

Why: Students assume any change in r means a matching change in slope, which is not always true

Quick Reference Cheatsheet
