Statistics · CED Exploring Two-Variable Data · 14 min read · Updated 2026-05-11

Introduction to Categorical Association — AP Statistics

1. What Is Categorical Association? ★★☆☆☆ ⏱ 3 min

This subtopic is part of AP Statistics Unit 2: Exploring Two-Variable Data, which the Course and Exam Description weights at 5-7% of the exam's multiple-choice section. It appears in both multiple-choice and free-response sections, most commonly as a set of MCQs or early parts of a multi-part FRQ.

Unlike association between quantitative variables (which uses correlation and linear regression), categorical association relies on comparing conditional proportions rather than linear trends. Standard notation for two-way (contingency) tables uses $O_{ij}$ for the observed count in row $i$ column $j$, $R_i$ for row totals, $C_j$ for column totals, and $N$ for total sample size, with rows for the explanatory variable and columns for the response variable.

2. Two-Way Tables and Frequency Types ★★☆☆☆ ⏱ 4 min

A two-way (contingency) table organizes counts of observations for all combinations of levels of two categorical variables. Three core frequency types are used for association analysis, each with a corresponding relative frequency that adjusts for differing sample sizes:

**Marginal frequency**: Total count for a single level of one variable (found in table margins). Marginal relative frequency = marginal frequency / total sample size $N$, describing the proportion of the full sample in that level.
**Joint frequency**: Count of observations in a specific combination of levels (a single table cell). Joint relative frequency = joint frequency / $N$, describing the proportion of the full sample with that combination of outcomes.
**Conditional frequency**: Count of observations for a level of one variable, restricted to a specific level of the other variable. Conditional relative frequency = conditional frequency / *conditioning group total* (not overall $N$). Comparing conditional relative distributions is the core method for detecting association.
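
As a compact restatement of the three definitions above, the relative frequencies can be written in the two-way table notation introduced earlier ($O_{ij}$, $R_i$, $C_j$, $N$). This is only a sketch of the definitions in symbols; the conditional form shown assumes we condition on row $i$, and conditioning on column $j$ would use $C_j$ as the denominator instead.

$$
\text{joint relative frequency} = \frac{O_{ij}}{N}, \qquad
\text{marginal relative frequency} = \frac{R_i}{N} \ \text{ or } \ \frac{C_j}{N}, \qquad
\text{conditional relative frequency (given row } i\text{)} = \frac{O_{ij}}{R_i}
$$

The only thing that changes between the three is the denominator: $N$ for joint and marginal, the conditioning group total for conditional.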

📐 Worked Example

A survey of 200 high school students asks: do you play a varsity sport (Yes/No), and do you have an after-school job (Yes/No)? Observed counts are below. Calculate: (a) the joint relative frequency of varsity athletes with no after-school job, (b) the marginal relative frequency of students with after-school jobs, (c) the conditional relative frequency of having an after-school job among varsity athletes.

| | After-school job: Yes | After-school job: No | Row Total |
|----------------|------------------------|-----------------------|-----------|
| Varsity: Yes | 32 | 48 | 80 |
| Varsity: No | 60 | 60 | 120 |
| Column Total | 92 | 108 | 200 |

Recall the definition for each frequency type to confirm the correct denominator: joint relative frequency uses overall $N$, marginal uses overall $N$, and conditional uses the conditioning group total.
Solve (a): The cell for varsity athletes with no after-school job has a count of 48. Joint relative frequency =
$\frac{48}{200} = 0.24$
Solve (b): The marginal total for students with after-school jobs is 92. Marginal relative frequency =
$\frac{92}{200} = 0.46$
Solve (c): We condition on being a varsity athlete, so the denominator is the row total for varsity athletes (80), not overall $N$. The count of varsity athletes with after-school jobs is 32. Conditional relative frequency =
$\frac{32}{80} = 0.40$

Exam tip: Always circle the condition mentioned in the question before calculating. Phrases like 'given that', 'among', or 'conditional on' mean the denominator is the condition group total, not the overall sample size.
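
The three calculations in the worked example above can be checked with a short script. This is a minimal sketch in plain Python; the dictionary layout and variable names are illustrative choices, not part of any AP-required method.

```python
# Observed counts from the worked example
# (rows: varsity status, columns: after-school job status).
counts = {
    "Varsity: Yes": {"Job: Yes": 32, "Job: No": 48},
    "Varsity: No": {"Job: Yes": 60, "Job: No": 60},
}

# Overall sample size N is the sum of every cell.
n_total = sum(sum(row.values()) for row in counts.values())

# (a) Joint relative frequency: one cell divided by N.
joint_varsity_no_job = counts["Varsity: Yes"]["Job: No"] / n_total

# (b) Marginal relative frequency: a column total divided by N.
job_yes_total = sum(row["Job: Yes"] for row in counts.values())
marginal_job_yes = job_yes_total / n_total

# (c) Conditional relative frequency: one cell divided by its row total,
# because the condition "among varsity athletes" restricts us to that row.
varsity_total = sum(counts["Varsity: Yes"].values())
cond_job_given_varsity = counts["Varsity: Yes"]["Job: Yes"] / varsity_total

print(f"(a) joint       = {joint_varsity_no_job:.2f}")    # 0.24
print(f"(b) marginal    = {marginal_job_yes:.2f}")        # 0.46
print(f"(c) conditional = {cond_job_given_varsity:.2f}")  # 0.40
```

Only the denominator changes between the three results: parts (a) and (b) divide by 200, while part (c) divides by the 80 varsity athletes.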

3. Detecting Categorical Association vs Independence ★★★☆☆ ⏱ 3 min

Two categorical variables are independent (no association) if the conditional distribution of the response variable is identical across all levels of the explanatory variable. In other words, knowing the value of the explanatory variable gives no additional information about the response variable. With real sample data, conditional distributions are almost never exactly identical, so we assess association by the magnitude of the difference between conditional proportions.
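
In the two-way table notation from Section 1, the independence condition above can be sketched as follows, assuming rows hold the explanatory variable and columns hold the response variable. If every row has the same conditional distribution, that shared distribution must equal the marginal distribution of the response:

$$
\frac{O_{ij}}{R_i} = \frac{C_j}{N} \quad \text{for every row } i \text{ and column } j
\qquad \Longleftrightarrow \qquad
O_{ij} = \frac{R_i \, C_j}{N}
$$

Sample data will rarely satisfy this exactly, which is why the comparison is made on how far the conditional proportions differ rather than on whether they match.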

📐 Worked Example

A researcher studies whether pet ownership (cat/dog/no pet) is associated with living arrangement (house/apartment). The conditional relative distribution of living arrangement, given pet type, is below. Is there evidence of an association between pet ownership and living arrangement in this sample? Justify your answer.

| | House | Apartment | Total |
|--------------|-------|-----------|-------|
| Cat owner | 0.65 | 0.35 | 1.00 |
| Dog owner | 0.72 | 0.28 | 1.00 |
| No pet | 0.68 | 0.32 | 1.00 |

To assess association, compare the conditional distributions of living arrangement across the three pet ownership groups.
Calculate the range of conditional proportions for each level of living arrangement to measure variation: For House, proportions range from 0.65 (cat owners) to 0.72 (dog owners), a difference of 0.07 (7 percentage points). For Apartment, proportions range from 0.28 to 0.35, also a 7 percentage point difference.
A difference of only 7 percentage points across groups is small, meaning that knowing a person's pet ownership gives little information about their living arrangement.
Conclusion: Because the conditional distributions of living arrangement are so similar across the three pet ownership groups, there is no meaningful evidence of an association between the two variables in this sample.

Exam tip: On FRQ questions asking about association, you must reference the magnitude of the difference in conditional proportions in context to earn full credit — never just state 'yes' or 'no' without numerical evidence.
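
The range comparison used in the worked example above can be sketched in a few lines of Python. The helper name `proportion_range` is a hypothetical choice for illustration, and the range is just one simple way to summarize how far the conditional proportions spread across groups.

```python
# Conditional distributions of living arrangement given pet type (from the table).
cond_props = {
    "Cat owner": {"House": 0.65, "Apartment": 0.35},
    "Dog owner": {"House": 0.72, "Apartment": 0.28},
    "No pet": {"House": 0.68, "Apartment": 0.32},
}


def proportion_range(table, response_level):
    """Largest minus smallest conditional proportion for one response level."""
    values = [row[response_level] for row in table.values()]
    return max(values) - min(values)


house_range = proportion_range(cond_props, "House")
apartment_range = proportion_range(cond_props, "Apartment")

print(f"House range:     {house_range:.2f}")      # 0.07
print(f"Apartment range: {apartment_range:.2f}")  # 0.07
```

With only two response levels the two ranges must match, since each row's proportions sum to 1.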

4. Introduction to Simpson's Paradox ★★★★☆ ⏱ 4 min

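Simpson's paradox occurs when an association that holds within every subgroup reverses direction once the subgroups are combined. It highlights the importance of checking for lurking (confounding) variables when analyzing categorical association, because such a variable can completely reverse the observed relationship between the two variables of interest.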

📐 Worked Example

Two college baseball pitchers record their hit rates over a season, split by batter handedness, as shown below. Explain how this is an example of Simpson's paradox, and identify the confounding variable.

| Pitcher | Hits vs Right | Total Right Batters | Hits vs Left | Total Left Batters |
|---------|---------------|---------------------|--------------|--------------------|
| A | 1 | 10 | 36 | 90 |
| B | 18 | 90 | 5 | 10 |

Calculate the conditional hit rate for each pitcher within each subgroup:
Pitcher A: $1/10 = 0.10$ (10%) vs right-handed batters, $36/90 = 0.40$ (40%) vs left-handed batters. Pitcher B: $18/90 = 0.20$ (20%) vs right-handed batters, $5/10 = 0.50$ (50%) vs left-handed batters.
Compare within subgroups: Pitcher A has a lower (better) hit rate than Pitcher B against both right-handed and left-handed batters.
Calculate the overall (aggregated) hit rate for each pitcher:
Pitcher A overall: $(1+36)/(10+90) = 37/100 = 0.37$ (37%). Pitcher B overall: $(18+5)/(90+10) = 23/100 = 0.23$ (23%).
The direction of association is reversed: when aggregated, Pitcher B has a lower (better) overall hit rate than Pitcher A, even though A performs better against both subgroups of batters. This reversal matches the definition of Simpson's paradox.
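The confounding variable is batter handedness: Pitcher A faces mostly left-handed batters (90 of 100), who get hits at a higher rate against both pitchers, while Pitcher B faces mostly right-handed batters (90 of 100). The aggregated rates therefore mostly reflect which type of batter each pitcher faced, not how well each pitcher performed.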

Exam tip: When asked to explain Simpson's paradox on the exam, you must explicitly state the reversal of the association direction and explain the uneven distribution of the confounding variable to earn full credit.
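
The reversal in the pitcher example above can be verified numerically. The following is a minimal Python sketch; the data structure and the function names `subgroup_rates` and `overall_rate` are illustrative assumptions rather than a standard routine.

```python
# (hits allowed, batters faced) for each pitcher, split by batter handedness.
records = {
    "A": {"Right": (1, 10), "Left": (36, 90)},
    "B": {"Right": (18, 90), "Left": (5, 10)},
}


def subgroup_rates(pitcher):
    """Conditional hit rate within each batter-handedness subgroup."""
    return {hand: hits / faced for hand, (hits, faced) in records[pitcher].items()}


def overall_rate(pitcher):
    """Aggregated hit rate: total hits divided by total batters faced."""
    total_hits = sum(hits for hits, _ in records[pitcher].values())
    total_faced = sum(faced for _, faced in records[pitcher].values())
    return total_hits / total_faced


rates_a = subgroup_rates("A")
rates_b = subgroup_rates("B")

# Lower hit rate is better for a pitcher.
a_better_in_every_subgroup = all(rates_a[hand] < rates_b[hand] for hand in rates_a)
b_better_overall = overall_rate("B") < overall_rate("A")

# Simpson's paradox: the within-subgroup direction flips after aggregation.
is_reversal = a_better_in_every_subgroup and b_better_overall

print(rates_a, rates_b)
print(f"Overall A = {overall_rate('A'):.2f}, overall B = {overall_rate('B'):.2f}")
print("Reversal:", is_reversal)
```

Note that the aggregated rate is computed from summed counts, not by averaging the two subgroup rates. The uneven subgroup sizes (10 vs 90 batters) are what allow the direction to flip.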

Common Pitfalls

Why: Students confuse joint and conditional frequency, forgetting that 'given' or 'among' means we restrict the sample to the condition group

Why: Students think any difference means association, not accounting for random sample variation

Why: Students confuse marginal and conditional distributions; association is about conditional distributions, not marginal

Why: Students mix up the definitions of joint vs conditional relative frequency

Why: Students assume splitting by any variable gives the 'true' result, but the variable may not be a confounding lurking variable

Quick Reference Cheatsheet
