Collecting Data — AP Statistics

AP Statistics · 2020+ AP Statistics CED Unit 3 · 18 min read

1. Introduction to Data Collection ★☆☆☆☆ ⏱ 3 min

Collecting data is the process of gathering representative, reliable observations to answer statistical questions, and it underpins all inferential statistics in AP Statistics. Poor data collection leads to invalid conclusions even when the analysis is flawless. In the AP Statistics CED, Unit 3 carries a 12-15% weighting on the multiple choice section, and the topic also appears in free response, often as part of the end-of-exam investigative task.

2. Probability Sampling Methods ★★☆☆☆ ⏱ 5 min

Sampling is the process of selecting a subset (sample) of a larger population to measure, since a full census (measuring every population member) is often impractical, costly, or impossible. Below are the four probability sampling methods tested on the AP exam.

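**Simple random sample (SRS)**: Every possible group of $n$ individuals from the population is equally likely to be chosen.
**Advantages**: Unbiased, simple for small populations with a complete sampling frame.
**Disadvantages**: Not efficient for large, geographically spread populations, and may by chance underrepresent small subgroups.

**Stratified random sample**: Divide the population into groups of similar individuals (strata), then take an SRS from every stratum.
**Advantages**: Ensures representation of small subgroups, reduces sampling variability.
**Disadvantages**: Requires prior knowledge of stratum membership for all population members.

**Cluster sample**: Divide the population into groups (clusters), randomly select some clusters, then measure every individual in the selected clusters.
**Advantages**: Low cost, efficient for geographically spread populations, no need for a full population sampling frame.
**Disadvantages**: Higher sampling variability than SRS or stratified sampling if clusters are not representative of the population.

**Systematic sample**: Choose a random starting point, then select every $k$-th individual from an ordered population.
**Advantages**: Easy to implement for ordered populations, no need for a full sampling frame.
**Disadvantages**: Biased if the population order has a repeating pattern whose period matches $k$.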
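
The four methods can be sketched in a few lines of Python. This is a minimal illustration, not an exam procedure: the toy population (four groups of 25 labelled members) and all function names are hypothetical, and the same groups stand in as strata in one call and as clusters in another purely to show the mechanical difference.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every set of n members is equally likely to be chosen."""
    return rng.sample(population, n)

def stratified_sample(strata, fraction, rng):
    """Take an SRS of the same fraction from every stratum."""
    sample = []
    for members in strata.values():
        size = round(len(members) * fraction)
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every member of each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

def systematic_sample(population, k, rng):
    """Random start among the first k members, then every k-th member."""
    start = rng.randrange(k)
    return population[start::k]

# Hypothetical toy population: groups A-D with 25 labelled members each.
groups = {g: [f"{g}{i}" for i in range(25)] for g in "ABCD"}
everyone = [member for members in groups.values() for member in members]
rng = random.Random(1)  # fixed seed so the run is repeatable

srs = simple_random_sample(everyone, 20, rng)
strat = stratified_sample(groups, 0.2, rng)   # 5 members from every group
clus = cluster_sample(groups, 1, rng)         # all 25 members of one group
syst = systematic_sample(everyone, 5, rng)    # every 5th member

print(len(srs), len(strat), len(clus), len(syst))  # prints: 20 20 25 20
```

Note that the stratified sample contains members of every group, while the cluster sample contains members of only one.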

📐 Worked Example

A local council wants to survey 500 residents about their support for a new park in a city with 4 distinct neighborhoods: A (15,000 residents), B (10,000), C (12,000), D (13,000). (a) Describe how to select a stratified random sample of 500 residents. (b) Describe how to select a cluster sample of 500 residents using neighborhoods as clusters. (c) Which method is more appropriate for this study? Justify.

For part (a), first calculate total population and sampling fraction:
$15000 + 10000 + 12000 + 13000 = 50000$, so the sampling fraction is $\frac{500}{50000} = 0.01$.
Neighborhoods are the strata we want to represent, so we select 1% of residents from each neighborhood. Assign each resident a unique ID, then use a random number generator to select 150 from A, 100 from B, 120 from C, 130 from D via SRS per stratum, then survey those selected.
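For part (b), treat each neighborhood as a cluster. Assign each neighborhood a unique ID from 1 to 4 and use a random number generator to select one cluster. A strict cluster sample would survey every resident of the selected neighborhood, but that is at least 10,000 people, so to reach the required 500 you then take an SRS of 500 residents from the selected neighborhood (technically a two-stage sample).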
For part (c), stratified random sampling is more appropriate. The study asks about city-wide support for a park, so the sample needs residents from all four neighborhoods, which stratified sampling guarantees. The cluster approach draws residents from only one neighborhood, and opinions there may not represent the entire city.
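
The proportional allocation in part (a) can be checked with a short Python sketch. The variable names, the seed, and the assumption that residents are numbered from 1 within each neighborhood are illustrative choices, not part of the question.

```python
import random

neighborhoods = {"A": 15_000, "B": 10_000, "C": 12_000, "D": 13_000}
sample_size = 500

total = sum(neighborhoods.values())  # 50,000 residents in the city

# Proportional allocation: each stratum contributes the same fraction (1%).
allocation = {
    name: round(size * sample_size / total)
    for name, size in neighborhoods.items()
}
print(allocation)  # prints: {'A': 150, 'B': 100, 'C': 120, 'D': 130}

# SRS within each stratum, assuming residents carry IDs 1..size.
rng = random.Random(2024)
selected_ids = {
    name: rng.sample(range(1, neighborhoods[name] + 1), allocation[name])
    for name in neighborhoods
}
```

Because `rng.sample` draws without replacement, no resident ID can be selected twice within a neighborhood.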

Exam tip: Examiners frequently ask you to distinguish stratified from cluster sampling. If you take an SRS from every group, it is stratified; if you randomly select entire groups and measure everyone in them, it is cluster.

3. Sampling Bias ★★☆☆☆ ⏱ 3 min

Sampling bias occurs when some members of the population are systematically more likely to be selected in the sample than others, leading to unrepresentative results. This is distinct from random sampling error, which is natural variation between samples that cannot be eliminated, only reduced by increasing sample size.
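
The difference between bias and random sampling error can be seen in a small simulation. This is a sketch under invented assumptions: a population of 10,000 with a true proportion of 0.40, and a hypothetical biased sampling frame from which half of the non-supporters are missing.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical population: 4,000 supporters (1) and 6,000 non-supporters (0).
population = [1] * 4000 + [0] * 6000      # true proportion = 0.40
# Hypothetical biased frame: half of the non-supporters cannot be selected.
biased_frame = [1] * 4000 + [0] * 3000    # proportion in frame is about 0.57

def simulate(frame, n, reps=500):
    """Mean and standard deviation of the sample proportion over many SRSs."""
    estimates = [sum(rng.sample(frame, n)) / n for _ in range(reps)]
    return statistics.mean(estimates), statistics.stdev(estimates)

results = {}
for n in (25, 400):
    results[n] = {
        "fair": simulate(population, n),
        "biased": simulate(biased_frame, n),
    }
    fair_mean, fair_sd = results[n]["fair"]
    biased_mean, biased_sd = results[n]["biased"]
    print(f"n={n}: fair mean={fair_mean:.3f}, sd={fair_sd:.3f} | "
          f"biased mean={biased_mean:.3f}, sd={biased_sd:.3f}")
```

In runs like this, the spread of the estimates shrinks as $n$ grows for both frames, but the estimates from the biased frame stay centred well above 0.40. A larger sample reduces variability, not bias.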

**Nonresponse bias**: When selected individuals cannot be reached or refuse to participate, and non-participants differ systematically from participants. Example: an emailed job satisfaction survey that gets most of its responses from very happy or very unhappy workers, while those with moderate views ignore it.
**Response bias**: When respondents give inaccurate answers, usually due to leading questions, social desirability bias, or confusing wording. Example: 'Do you support the harmful new tax increase that will raise grocery prices?' leads to more negative responses than neutral wording.
**Voluntary response bias**: When respondents self-select to participate (e.g. online polls, call-in surveys). People with strong opinions are far more likely to respond, so results are almost always biased.
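
A rough simulation shows how self-selection distorts an estimate. The response probabilities below are invented for illustration: opponents are assumed to feel strongly and volunteer half the time, while supporters are assumed to volunteer only 5% of the time.

```python
import random

rng = random.Random(11)

# Hypothetical population of (supports_proposal, chance_of_volunteering).
population = [(False, 0.50)] * 3000 + [(True, 0.05)] * 7000
true_support = sum(supports for supports, _ in population) / len(population)

# Voluntary response: each person decides for themselves whether to answer.
volunteers = [supports for supports, chance in population if rng.random() < chance]
voluntary_estimate = sum(volunteers) / len(volunteers)

# Comparison: an SRS of 500, where the researcher chooses who is asked.
srs = rng.sample(population, 500)
srs_estimate = sum(supports for supports, _ in srs) / len(srs)

print(f"true support:       {true_support:.2f}")
print(f"voluntary response: {voluntary_estimate:.2f}")
print(f"SRS of 500:         {srs_estimate:.2f}")
```

Under these assumptions the voluntary poll badly underestimates support, because the people most motivated to respond are the opponents. On the exam, state the direction of the bias in the same way.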

Exam tip: To get full marks on bias questions, follow the 3-step rule: 1) Name the bias, 2) Explain how it applies to the scenario, 3) State the direction of the bias (overestimate/underestimate).

4. Core Principles of Experimental Design ★★★☆☆ ⏱ 4 min

First, distinguish between the two core study types:

**Observational study**: You measure variables without interfering with subjects (you only observe). You can only conclude association, not causation.
**Experiment**: You deliberately impose a *treatment* on subjects to measure their response. Well-designed experiments allow you to establish causal relationships.

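There are three core principles of experimental design, tested heavily on both multiple choice and free response questions:

**Control**: Compare treatments (often against a control group or placebo) and keep other conditions the same, so that differences in the response can be attributed to the treatment.
**Randomisation**: Assign subjects to treatments at random, so that the treatment groups are roughly equivalent before treatment begins.
**Replication**: Use enough subjects in each group that chance differences between groups do not swamp the treatment effect.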

5. Confounding, Lurking Variables, and Scope of Inference ★★★★☆ ⏱ 3 min

Confounding and lurking variables both prevent drawing valid causal conclusions in observational studies or poorly designed experiments.

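In well-designed experiments, randomisation guards against confounding. Random assignment does not remove the other variables, but it tends to balance them across treatment groups on average, so they are no longer systematically tied to the treatment.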
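
The balancing effect of random assignment can be illustrated with a short simulation. This is a sketch with made-up data: 200 hypothetical subjects whose age is the potential confounder, compared under random assignment and under a non-random scheme in which the older half ends up in the treatment group.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical subjects; age is a variable that might affect the response.
ages = [rng.randint(18, 80) for _ in range(200)]

def random_assignment(subjects, rng):
    """Shuffle the subjects, then split them into two equal groups."""
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def older_half_treated(subjects):
    """Non-random assignment: the older half receives the treatment."""
    ordered = sorted(subjects)
    half = len(ordered) // 2
    return ordered[half:], ordered[:half]

def age_gap(treatment, control):
    """Difference in mean age between treatment and control groups."""
    return statistics.mean(treatment) - statistics.mean(control)

treatment, control = random_assignment(ages, rng)
gap_random = age_gap(treatment, control)
gap_nonrandom = age_gap(*older_half_treated(ages))

# Repeat the random assignment many times to see the average imbalance.
mean_gap = statistics.mean(
    age_gap(*random_assignment(ages, rng)) for _ in range(1000)
)

print(f"one random assignment: gap = {gap_random:.1f} years")
print(f"older half treated:    gap = {gap_nonrandom:.1f} years")
print(f"average over 1000 random assignments: {mean_gap:.2f} years")
```

Any single random assignment leaves a small chance imbalance, but the imbalance averages out to roughly zero, whereas the non-random scheme builds a large age difference into the comparison.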

The scope of inference refers to the valid conclusions you can draw from a study, based entirely on how data was collected. There are two core rules tested every year on the AP exam:

If random sampling was used from a defined population, you can *generalise results to that population*. If not, you can only generalise to the studied group.
If random assignment to treatments was used, you can *draw causal conclusions* about the treatment effect. If not, you can only conclude association, not causation.
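
The two rules combine into four cases, which can be written as a small lookup. The function name and the wording of the returned strings are illustrative only.

```python
def scope_of_inference(random_sampling: bool, random_assignment: bool) -> str:
    """Summarise the conclusions a study design supports."""
    if random_sampling:
        generalise = "generalise to the sampled population"
    else:
        generalise = "generalise only to the studied group"
    if random_assignment:
        causal = "causal conclusion supported"
    else:
        causal = "association only, not causation"
    return f"{generalise}; {causal}"

# Print all four combinations of the two design features.
for sampling in (True, False):
    for assignment in (True, False):
        print(f"random sampling={sampling}, random assignment={assignment}: "
              f"{scope_of_inference(sampling, assignment)}")
```

For example, an observational study based on a random sample falls in the case `scope_of_inference(True, False)`: results generalise to the population, but only an association can be claimed.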

Exam tip: To get full marks when identifying a confounding variable, follow the 3-step rule: 1) Name the variable, 2) Explain how it differs between groups, 3) Explain how it affects the response variable.

Common Pitfalls

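Pitfall: Confusing stratified sampling with cluster sampling.
Why: Both involve grouping the population before sampling, so it is easy to confuse the two.

Pitfall: Naming a type of bias without explaining it in the context of the scenario.
Why: You think naming the bias is enough to get marks, but AP graders require full context.

Pitfall: Claiming causation from an observational study, or generalising from a non-random sample to a wider population.
Why: It is intuitive to assume correlation = causation, or that the people you studied are representative of everyone.

Pitfall: Confusing random sampling with random assignment.
Why: Both use randomness, but for very different purposes.

Pitfall: Leaving out a placebo or blinding when designing an experiment on human subjects.
Why: You forget that people's beliefs about treatment affect their response.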