AP Statistics · 2020+ AP Statistics CED Unit 3 · 18 min read
1. Introduction to Data Collection · ★☆☆☆☆ · ⏱ 3 min
Collecting data is the process of gathering representative, reliable observations to answer statistical questions, and it is the foundation of all inferential statistics in AP Statistics. Poor data collection leads to invalid conclusions even when the analysis is perfect. In the CED unit weightings, Unit 3 accounts for roughly 12-15% of the exam. It appears in both multiple choice and free response, often as part of the end-of-exam investigative task.
2. Probability Sampling Methods · ★★☆☆☆ · ⏱ 5 min
Sampling is the process of selecting a subset (sample) of a larger population to measure, since a full census (measuring every population member) is often impractical, costly, or impossible. Below are the four probability sampling methods tested on the AP exam.
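**Simple random sample (SRS)**: Every possible group of $n$ individuals has the same chance of being selected.
**Advantages**: Unbiased, simple for small populations with a complete sampling frame.
**Disadvantages**: Not efficient for large, geographically spread populations; may underrepresent small subgroups.
**Stratified random sample**: Divide the population into groups of similar individuals (strata), then take an SRS from every stratum.
**Advantages**: Ensures representation of small subgroups, reduces sampling variability.
**Disadvantages**: Requires prior knowledge of stratum membership for all population members.
**Cluster sample**: Divide the population into groups (clusters), randomly select some clusters, and measure every member of the selected clusters.
**Advantages**: Low cost, efficient for geographically spread populations, no need for a full population sampling frame.
**Disadvantages**: Higher sampling variability than SRS/stratified if clusters are not representative.
**Systematic sample**: Choose a random starting point in an ordered population, then select every $k$-th member.
**Advantages**: Easy to implement for ordered populations, no need for a full sampling frame.
**Disadvantages**: Biased if the population order has a repeating pattern whose period matches $k$.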
Exam tip: Examiners frequently ask you to distinguish stratified from cluster sampling. If you take an SRS from every group, it is stratified; if you randomly select entire groups and measure everyone in them, it is cluster.
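To make the differences between the four methods concrete, here is a minimal Python sketch using only the standard `random` module. The school, the homeroom labels, and all function names are hypothetical, chosen purely for illustration; the point is where the randomness is applied in each method.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every group of n members is equally likely to be chosen."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified: take an SRS from EVERY stratum, then combine."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Cluster: randomly pick whole clusters and measure everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

def systematic_sample(ordered_population, k, rng):
    """Systematic: random start among the first k, then every k-th member."""
    start = rng.randrange(k)
    return ordered_population[start::k]

# Hypothetical school: 4 homerooms of 5 students each, labelled "A1" to "D5".
homerooms = {room: [f"{room}{i}" for i in range(1, 6)] for room in "ABCD"}
students = [s for room in homerooms.values() for s in room]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

srs = simple_random_sample(students, 8, rng)
strat = stratified_sample(homerooms, 2, rng)   # homerooms act as strata
clus = cluster_sample(homerooms, 2, rng)       # homerooms act as clusters
syst = systematic_sample(students, 4, rng)
```

Notice that the same homerooms serve as strata in one call and as clusters in another: the stratified sample contains students from all four rooms, while the cluster sample contains every student from just two rooms.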
3. Sampling Bias · ★★☆☆☆ · ⏱ 3 min
Sampling bias occurs when some members of the population are systematically more likely to be selected in the sample than others, leading to unrepresentative results. This is distinct from random sampling error, which is natural variation between samples that cannot be eliminated, only reduced by increasing sample size.
**Nonresponse bias**: When selected individuals cannot be reached or refuse to participate, and non-participants differ systematically from participants. Example: an email job satisfaction survey that gets more responses from very happy or very unhappy workers.
**Response bias**: When respondents give inaccurate answers, usually due to leading questions, social desirability bias, or confusing wording. Example: 'Do you support the harmful new tax increase that will raise grocery prices?' leads to more negative responses than neutral wording.
**Voluntary response bias**: When respondents self-select to participate (e.g. online polls, call-in surveys). People with strong opinions are far more likely to respond, so results are almost always biased.
Exam tip: To get full marks on bias questions, follow the 3-step rule: 1) Name the bias, 2) Explain how it applies to the scenario, 3) State the direction of the bias (overestimate/underestimate).
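The direction of a bias is easier to see in a small simulation. The sketch below assumes a made-up population in which exactly 70% of customers are satisfied, and assumes (purely for illustration) that unsatisfied customers are six times as likely to answer a voluntary online poll. Neither figure comes from real data.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: True = satisfied, False = not satisfied.
population = [True] * 7000 + [False] * 3000
true_p = sum(population) / len(population)

def responds(satisfied):
    """Assumed response rates: 5% if satisfied, 30% if not."""
    return rng.random() < (0.05 if satisfied else 0.30)

# Voluntary response: people decide for themselves whether to take part.
voluntary = [s for s in population if responds(s)]
voluntary_p = sum(voluntary) / len(voluntary)

# SRS of 500: chance, not opinion strength, decides who is included.
srs = rng.sample(population, 500)
srs_p = sum(srs) / len(srs)

print(f"true proportion satisfied: {true_p:.2f}")
print(f"voluntary response poll:   {voluntary_p:.2f}")
print(f"simple random sample:      {srs_p:.2f}")
```

Under these assumptions the voluntary poll lands far below the true proportion, so a full-mark answer would name voluntary response bias, explain that unhappy customers are more motivated to respond, and state that the poll underestimates the proportion satisfied. The SRS estimate still varies from sample to sample, but that is random sampling error rather than bias.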
4. Core Principles of Experimental Design · ★★★☆☆ · ⏱ 4 min
First, distinguish between the two core study types:
**Observational study**: You measure variables without interfering with subjects (you only observe). You can only conclude association, not causation.
**Experiment**: You deliberately impose a *treatment* on subjects to measure their response. Well-designed experiments allow you to establish causal relationships.
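There are three core principles of experimental design, tested heavily on both multiple choice and free response questions:
**Control (comparison)**: Compare two or more treatment groups, often including a control or placebo group, so the treatment effect can be separated from other influences.
**Random assignment**: Use a chance process to decide which subjects receive which treatment, so the groups are roughly equivalent before treatment.
**Replication**: Use enough subjects in each group that chance differences between groups are unlikely to explain the result.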
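As a rough illustration of imposing treatments by chance rather than by choice, the sketch below randomly assigns subjects to treatment groups of equal size. The subject labels, treatment names, and the `randomly_assign` helper are all hypothetical.

```python
import random

def randomly_assign(subjects, treatments, rng):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# 24 hypothetical volunteers split between a treatment and a placebo group.
volunteers = [f"subject_{i:02d}" for i in range(1, 25)]
groups = randomly_assign(volunteers, ["new_drug", "placebo"], random.Random(1))

for treatment, members in groups.items():
    print(treatment, len(members))
```

The placebo group provides the comparison, the shuffle provides the randomness, and having 12 subjects per group rather than one or two provides replication.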
5. Confounding, Lurking Variables, and Scope of Inference · ★★★★☆ · ⏱ 3 min
Confounding and lurking variables both prevent drawing valid causal conclusions in observational studies or poorly designed experiments.
In well-designed experiments, random assignment does not remove these variables, but it balances them across treatment groups on average, so they cannot systematically favour one treatment over another.
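The phrase "on average" matters here. The following sketch, built on an invented list of 40 subject ages, repeats a random split into two groups 2000 times and records the difference in mean age each time. Age stands in for any variable that might otherwise be confounded with the treatment.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable
ages = [rng.randint(18, 80) for _ in range(40)]  # hypothetical subjects

def age_gap_after_random_assignment(ages, rng):
    """Randomly split subjects in half; return mean age gap between halves."""
    shuffled = ages[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return statistics.mean(shuffled[:half]) - statistics.mean(shuffled[half:])

gaps = [age_gap_after_random_assignment(ages, rng) for _ in range(2000)]
mean_gap = statistics.mean(gaps)            # long-run average imbalance
largest_gap = max(abs(g) for g in gaps)     # worst single randomisation

print(f"average gap over 2000 randomisations: {mean_gap:.2f} years")
print(f"largest gap in a single randomisation: {largest_gap:.2f} years")
```

The average gap sits close to zero, which is the sense in which random assignment is fair to both groups. Any single randomisation can still leave the groups noticeably different, which is one reason replication matters.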
The scope of inference refers to the valid conclusions you can draw from a study, based entirely on how data was collected. There are two core rules tested every year on the AP exam:
If random sampling was used from a defined population, you can *generalise results to that population*. If not, you can only generalise to the studied group.
If random assignment to treatments was used, you can *draw causal conclusions* about the treatment effect. If not, you can only conclude association, not causation.
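The two rules are independent of each other, which gives four possible combinations. One way to keep them straight is to write them as a small lookup; the function name and return format below are just an illustrative convention.

```python
def scope_of_inference(random_sampling: bool, random_assignment: bool) -> dict:
    """Map how the data were collected to the conclusions that are justified."""
    return {
        "generalise_to": (
            "the population sampled from"
            if random_sampling
            else "only the individuals studied"
        ),
        "conclusion": "causation" if random_assignment else "association only",
    }

# Example: volunteers (no random sampling) randomly assigned to treatments.
result = scope_of_inference(random_sampling=False, random_assignment=True)
print(result)
```

For the volunteer experiment in the example, a causal conclusion is justified, but only for individuals like those who took part.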
Exam tip: To get full marks when identifying a confounding variable, follow the 3-step rule: 1) Name the variable, 2) Explain how it differs between groups, 3) Explain how it affects the response variable.
Common Pitfalls
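**Confusing stratified sampling with cluster sampling.** Why: Both involve grouping the population before sampling, so it is easy to confuse the two.
**Naming a bias without explaining it in context.** Why: You think naming the bias is enough to get marks, but AP graders require full context.
**Claiming causation or generalising beyond what the design supports.** Why: It is intuitive to assume correlation = causation, or that the people you studied are representative of everyone.
**Confusing random sampling with random assignment.** Why: Both use randomness, but for very different purposes.
**Overlooking how subjects' expectations influence results.** Why: You forget that people's beliefs about treatment affect their response.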