Statistics · 16 min read · Updated 2026-05-11

Exploring Two-Variable Data — AP Statistics

AP Statistics · College Board AP Statistics CED Unit 2 · 16 min read

1. Categorical Two-Variable Data: Two-Way Tables ★★☆☆☆ ⏱ 4 min

Two-way tables (also called contingency tables) organize frequency data for two categorical variables: one variable defines the table rows, the other defines columns, and each cell holds the count of observations that fall into both corresponding categories.

📐 Worked Example

A survey of 200 high school students tracks gender (male/female) and part-time job status (yes/no). The table below summarizes the counts. Calculate the conditional distribution of part-time job status given gender, and determine whether the variables are associated in this sample.

| | Has Part-Time Job | No Part-Time Job | Total |
|---|---|---|---|
| Male | 42 | 58 | 100 |
| Female | 51 | 49 | 100 |
| Total | 93 | 107 | 200 |

Calculate the conditional distribution for female students, using the marginal total of 100 females:
$\frac{51}{100} = 51\% \text{ (have job)}, \frac{49}{100} = 49\% \text{ (no job)}$
Calculate the conditional distribution for male students, using the marginal total of 100 males:
$\frac{42}{100} = 42\% \text{ (have job)}, \frac{58}{100} = 58\% \text{ (no job)}$
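Compare the two conditional distributions: 51% of female students have a part-time job, versus 42% of male students. Because the distribution of job status differs between the two genders, the two variables are associated in this sample.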
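The same calculation can be sketched in a few lines of Python. This is only an illustration using the counts from the table above; the function and variable names (`conditional_distribution`, `counts`) are made up for this example.

```python
# Counts from the worked example: one row per gender, one column per job status.
counts = {
    "Male": {"Has job": 42, "No job": 58},
    "Female": {"Has job": 51, "No job": 49},
}

def conditional_distribution(row):
    """Return each cell as a proportion of its row (marginal) total."""
    total = sum(row.values())
    return {category: count / total for category, count in row.items()}

# Condition on gender: each gender's row is divided by that gender's total.
conditional = {gender: conditional_distribution(row) for gender, row in counts.items()}

for gender, dist in conditional.items():
    summary = ", ".join(f"{category}: {share:.0%}" for category, share in dist.items())
    print(f"Given {gender} -> {summary}")

# The variables are associated in this sample if the conditional distributions differ.
associated = conditional["Male"] != conditional["Female"]
print("Associated in this sample:", associated)
```

Dividing by the row total (not the grand total of 200) is what makes these conditional percentages rather than joint percentages.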

Exam tip: Always explicitly state the condition when calculating conditional percentages to avoid mixing up which variable you are conditioning on, a common source of lost points.

2. Quantitative Association: Scatterplots and Correlation ★★☆☆☆ ⏱ 4 min

Scatterplots are used to visualize relationships between two quantitative variables: the explanatory (independent) variable is plotted on the x-axis, and the response (dependent) variable is plotted on the y-axis. When describing a scatterplot, always address four key features:

**Direction**: Positive (as $x$ increases, $y$ increases), negative (as $x$ increases, $y$ decreases), or no association
**Form**: Linear, non-linear (curved), or no clear form
**Strength**: How closely points follow the observed form
**Unusual features**: Outliers, clusters, or gaps in the data

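The correlation coefficient $r$ measures the direction and strength of the linear association between two quantitative variables. It always falls between $-1$ and $1$, and is calculated as the average product of the standardized $x$- and $y$-values:

$$r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$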

Key properties: $r$ has no units, is unaffected by changes to the units of $x$ or $y$, and is not resistant to outliers (a single extreme point can drastically change its value).
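As a rough illustration of the formula and these properties, the sketch below computes $r$ directly from standardized values. The data (hours studied and test scores) are hypothetical and chosen only for this example, as is the helper name `correlation`.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical data: hours studied and test score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = correlation(hours, scores)

# Changing units (hours -> minutes) leaves r unchanged.
r_minutes = correlation([h * 60 for h in hours], scores)

# Adding one extreme point can change r drastically (r is not resistant).
r_with_outlier = correlation(hours + [6], scores + [20])

print(f"r = {r:.3f}, r in minutes = {r_minutes:.3f}, r with outlier = {r_with_outlier:.3f}")
```

In this made-up data set, a single unusual student is enough to flip the sign of $r$, which is why a scatterplot should always be checked alongside the number.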

3. Least-Squares Regression Lines ★★★☆☆ ⏱ 4 min

A regression line models the linear relationship between $x$ and $y$, and is used to predict values of the response variable for given values of the explanatory variable. The *least-squares regression line (LSRL)* is the line that minimizes the sum of the squared vertical distances between observed $y$-values and predicted $y$-values.

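The LSRL has the equation $\hat{y} = b_0 + b_1 x$, where $\hat{y}$ is the predicted value of the response variable. The slope $b_1$ and y-intercept $b_0$ are calculated from summary statistics:

$$b_1 = r \frac{s_y}{s_x}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$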

AP exam questions almost always require context-specific interpretation of slope and intercept:

**Slope**: For each 1-unit increase in the explanatory variable $x$, the predicted value of $y$ changes by $b_1$ units (an increase when $b_1 > 0$, a decrease when $b_1 < 0$).
**Y-intercept**: The predicted value of $y$ when $x=0$. This is only practically meaningful if $x=0$ is a plausible value within or near the range of observed $x$-values.
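A minimal sketch of the slope and intercept formulas, followed by an in-context interpretation, might look like the following. The study-hours data and the function name `least_squares_line` are hypothetical and used only for illustration.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data: hours studied (explanatory) and test score (response).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

b0, b1 = least_squares_line(hours, scores)
predicted_score = b0 + b1 * 3.5  # prediction for 3.5 hours, inside the data range

print(f"predicted score = {b0:.1f} + {b1:.1f} * hours")
print(f"Slope: each additional hour of study raises the predicted score by {b1:.1f} points.")
print(f"Intercept: the predicted score for 0 hours of study is {b0:.1f} points.")
```

Note that the interpretations name the actual variables and units (hours, points) instead of saying "$x$" and "$y$", which is the level of context the exam expects.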

4. Residuals, Influential Points and Extrapolation ★★★☆☆ ⏱ 4 min

A residual is the difference between an observed and a predicted response value: $e = y - \hat{y}$. A residual plot graphs residuals on the y-axis against the explanatory variable $x$ (or the predicted values $\hat{y}$) on the x-axis, and is used to check whether a linear model is appropriate. Random, evenly spread scatter around $e=0$ suggests a linear model is appropriate, while a curved pattern or fanning (spread of residuals that changes with $x$) suggests it is not.
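The sketch below computes the residuals that would go on such a plot, taking each residual as observed minus predicted. The data are hypothetical, and the slope is computed with the equivalent sum-of-products form of the least-squares formula to keep the example short.

```python
from statistics import mean

# Hypothetical data: hours studied (explanatory) and test score (response).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

# Least-squares slope and intercept (algebraically the same as r * s_y / s_x).
x_bar, y_bar = mean(hours), mean(scores)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
      / sum((x - x_bar) ** 2 for x in hours))
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y.
predicted = [b0 + b1 * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

# These (x, residual) pairs are the points of the residual plot.
for x, e in zip(hours, residuals):
    print(f"x = {x}, residual = {e:+.1f}")
```

For a least-squares line the residuals always sum to zero, so what matters on the plot is their pattern, not their average.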

An influential point is an observation that drastically changes the slope, intercept, or correlation of the LSRL when removed. Points that are both outliers (large residual, far from the LSRL in the y-direction) and high-leverage (extreme $x$-value far from the mean of $x$) are almost always influential.
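One way to check for influence is to fit the line with and without the suspect point and compare the results. The sketch below does this for hypothetical data in which the last observation has an extreme $x$-value and does not follow the pattern of the others; the helper name `fit_line` is made up for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data; the last point (15, 60) is the suspect high-leverage point.
hours = [1, 2, 3, 4, 5, 15]
scores = [52, 60, 61, 70, 77, 60]

b0_all, b1_all = fit_line(hours, scores)              # all six points
b0_trim, b1_trim = fit_line(hours[:-1], scores[:-1])  # suspect point removed

print(f"With the point:    slope = {b1_all:.2f}, intercept = {b0_all:.2f}")
print(f"Without the point: slope = {b1_trim:.2f}, intercept = {b0_trim:.2f}")
```

A large change in slope between the two fits, as in this example, is the signature of an influential point.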

Extrapolation is the practice of using an LSRL to predict $y$-values for $x$-values that fall outside the range of $x$-values used to build the line. Predictions made by extrapolation are unreliable, because we have no evidence that the linear relationship between $x$ and $y$ holds outside the observed range of $x$, and they become less trustworthy the farther the $x$-value lies from that range.

Exam tip: When asked to critique a regression prediction, always first check if the $x$-value falls inside the original data range. If not, identify it as extrapolation and state it is unreliable.
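The range check in this tip can be sketched as a small helper. The line $\hat{y} = 46 + 6x$ and the data below are hypothetical, and `predict_with_range_check` is a name invented for this example.

```python
def predict_with_range_check(x_new, xs, b0, b1):
    """Return (prediction, is_extrapolation) for a line fitted on the x-values xs."""
    y_hat = b0 + b1 * x_new
    is_extrapolation = x_new < min(xs) or x_new > max(xs)
    return y_hat, is_extrapolation

# Hypothetical setup: the line was fitted on 1 to 5 hours of study.
hours = [1, 2, 3, 4, 5]
b0, b1 = 46.0, 6.0

inside = predict_with_range_check(3.5, hours, b0, b1)
outside = predict_with_range_check(20, hours, b0, b1)

for x_new, (y_hat, flagged) in [(3.5, inside), (20, outside)]:
    note = "extrapolation, unreliable" if flagged else "within the data range"
    print(f"x = {x_new}: predicted y = {y_hat:.1f} ({note})")
```

In this made-up setting the prediction for 20 hours exceeds 100 points, which shows how a line that fits well inside the data range can give impossible values outside it.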

Common Pitfalls

Why: Students forget $r$ only measures linear association between two quantitative variables

Why: Failing to label variables clearly at the start of analysis

Why: Memorizing the interpretation without considering problem context

Why: Strong linear association feels like a cause-effect link

Why: Assuming the linear relationship between variables holds indefinitely outside the observed data range

