Stats without Tears / SWT [PDF]


Stats without Tears in One File Updated 1 Jan 2016 Copyright © 2001–2016 by Stan Brown

Contents
Help » About This Book
1. Statistics! 1A. Statistics? What’s That? 1B. Good Samples, Bad Samples 1C. Data and Variables 1D. Statistical Errors 1E. Observation and Experiment 1F. Sharp Points What Have You Learned? Exercises for Chapter 1 What’s New
2. Graphing Your Data 2A. Graphing Non-Numeric Data 2B. Graphing Numeric Data 2C. Bad Graphs 2D. Really Good Graphs What Have You Learned? Exercises for Chapter 2 What’s New
3. Numbers about Numbers 3A. Measures of Center 3B. Summary Numbers on the TI-83 … 3C. Measures of Spread 3D. Measures of Position 3E. Five-Number Summary What Have You Learned? Exercises for Chapter 3 What’s New
4. Linked Variables 4A. Mathematical Models 4B. Scatterplot, Correlation, and Regression on TI-83/84 4C. Finding ŷ from a Regression on TI-83/84 4D. Decision Points for Correlation Coefficient 4E. Optional: Scatterplot, Correlation, and Regression in Excel What Have You Learned? Exercises for Chapter 4 What’s New
5. Probability 5A. Probability Basics 5B. Combining Probabilities 5C. Sequences instead of Formulas What Have You Learned? Exercises for Chapter 5 What’s New
6. Discrete Probability Models 6A. Random Variables 6B. Discrete Probability Distributions 6C. Bernoulli Trials 6D. The Geometric Model 6E. The Binomial Model What Have You Learned? Exercises for Chapter 6 What’s New
7. Normal Distributions 7A. Continuous Random Variables 7B. The Normal Model 7C. The Standard Normal Distribution 7D. Checking for Normality What Have You Learned? Exercises for Chapter 7 What’s New
8. How Samples Vary 8A. Numeric Data / Means of Samples 8B. Binomial Data / Proportions of Samples 8C. Summary of Sampling Distributions What Have You Learned? Exercises for Chapter 8 What’s New
9. Estimating Population Parameters 9A. Estimating Population Proportion p 9B. Estimating Population Mean µ When You Know σ 9C. Estimating Population Mean µ When You Don’t Know σ What Have You Learned? Exercises for Chapter 9 What’s New
10. Hypothesis Tests 10A. Testing a Proportion (Binomial Data) 10B. Sharp Points 10C. Testing a Mean (Numeric Data) 10D. Confidence Interval and Hypothesis Test 10E. Testing a Non-Random Sample What Have You Learned? Exercises for Chapter 10 What’s New
11. Inference from Two Samples 11A. Numeric Data — Paired or Unpaired? 11B. Inference with Paired Numeric Data (Case 3) 11C. Inference with Unpaired Numeric Data (Case 4) 11D. Inference on Two Proportions (Case 5) 11E. Confidence Interval and Hypothesis Test (Two Populations) 11F. More Confidence Intervals for Two Populations What Have You Learned? Exercises for Chapter 11 What’s New
12. Tests on Counted Data 12A. Testing Goodness of Fit to a Model 12B. Testing for Independence or Homogeneity 12C. But Wait, There’s More! What Have You Learned? Exercises for Chapter 12 What’s New
Review What’s Important? Review Problems What’s New
Solutions to All Exercises Solutions for Chapter 1 Solutions for Chapter 2 Solutions for Chapter 3 Solutions for Chapter 4 Solutions for Chapter 5 Solutions for Chapter 6 Solutions for Chapter 7 Solutions for Chapter 8 Solutions for Chapter 9 Solutions for Chapter 10 Solutions for Chapter 11 Solutions for Chapter 12 Solutions to Review Problems
Reference Material
Statistics Symbol Sheet Roman Letters Greek Letters What’s New
Inferential Statistics: Basic Cases What’s New
Seven Steps of Hypothesis Tests
Big Names in Statistics What’s New
Recommended Statistics Books Statistics for Citizens Textbooks What’s New
TI-83/84 Cheat Sheet Sampling Statistics of a Sample or Parameters of a Population Correlation and Regression Discrete Probability Distributions Normal Distribution Confidence Intervals, Hypothesis Tests, Sample Size What’s New
TI-83/84 Troubleshooting Error Messages List Troubles Graphing Troubles Other Troubles What’s New
Sources Used What’s New

Help » About This Book This book is an alternative to the usual textbooks for a one-semester course in statistics. Whether you’re teaching in a classroom or learning on your own, you’ve come to the right place. DON’T PANIC!

Douglas Adams’ The Hitchhiker’s Guide to the Galaxy bore a “large, friendly label” with those words, and that’s also my message to you. I don’t see any reason for students to be afraid of statistics. It’s no more difficult than any other technical course, and it’s much more practical than other math courses. The mathematical details are here for those who want them, but I lean heavily on technology to relieve students of the “grunt work”.

Calculator:

You need a TI-83 or TI-84 family calculator to get the most out of this book. For $100 or less, this calculator has amazing capabilities for statistics, and it also supports other math courses up through calculus. I suggest you download my free MATH200A program, which adds some capabilities to the calculator, but this is optional. Some error conditions on your calculator can be scary when you see them the first time. Don’t panic! See TI-83/84 Troubleshooting.

View or Print:

These pages change automatically for your screen or printer. If you print, I suggest black-and-white, two-sided printing.

History of this book:

This textbook grew out of handouts I made for my students at TC3 (Tompkins Cortland Community College in Dryden, New York). The handouts filled gaps and corrected errors in our standard textbook. As time went on, I found myself replacing whole chapters. Student evaluations showed that they preferred these replacements to the textbook. In Spring 2013 I reached the tipping point: I had replaced more than half of the twelve textbook chapters. In good conscience I didn’t feel I could ask students to buy an expensive textbook that they would use less than half of, so I burned my bridges and announced the required textbook beginning in Summer 2013 as “none”. In Fall 2013, a second instructor at TC3 adopted this textbook for his class. Benjamin Kirk provided a lot of valuable suggestions and corrections, and I’m very grateful. They have improved the book considerably.

Feedback welcome!

Contact information is at BrownMath.com/about/#Contact. Please share your reactions, whether positive or negative! If I could explain something better, I’d like to know. If some section works particularly well for you, please tell me. If you find an error, I especially want to know about it. (My own students get extra credit for pointing out errors.)

Being on the Web, this book will get updated frequently, based on your feedback. You can see the revision dates in the chapter list above, and a revision history is shown at the end of every chapter. Students:

This eTextbook is a free resource for you. You can read it on line or print any or all chapters. If you print any chapters, you can keep your costs down by choosing black-and-white printing in duplex (two-sided) mode.

Just a word of advice. I’ve tried to make statistics approachable to anyone with high-school math, but it’s still a technical subject. You can’t just read a chapter in one pass from start to end, the way you would a novel or a book of history. Please see How to Read a Math Book for some tips on getting the most out of your time with this book, or any math book. Some material is marked BTW. This is stuff I find interesting, including mathematical details that some students have asked for, but you can get through the course without it.

Instructors:

Although this is a free resource, it is copyrighted and I would appreciate your asking permission to copy and distribute any of it. My contact information is at BrownMath.com/about/#Contact . Though you don’t need to ask permission simply to link to this material, I would appreciate knowing about it.

1. Statistics! Updated 21 Feb 2016 (What’s New?) Contents:

1A. Statistics? What’s That? 1A1. What Should You Expect? 1A2. What Do You Get From the Course? 1A3. Sample and Population 1A4. Descriptive and Inferential Statistics 1A5. Statistic and Parameter 1B. Good Samples, Bad Samples 1B1. The Gold Standard: Random Samples · Seeding the Random-Number Generator · Selecting Members of the Sample 1B2. Almost as Good: Systematic Samples · Taking a Systematic Sample 1B3. Good but Hard: Cluster Samples 1B4. Stratified Samples 1B5. Census 1B6. Bogus Samples 1C. Data and Variables 1C1. What Are Data? What Are Variables? 1C2. Quantitative or Qualitative? 1C3. Summary Statements 1D. Statistical Errors 1D1. Sampling Error 1D2. Nonsampling Errors · Self-Selected Samples · Sampling Bias · Selection Bias · Non-Response Errors · Response Errors · Data Errors · Inappropriate Analysis 1E. Observation and Experiment 1E1. Observational Study Versus Designed Experiment · Confounding and Lurking Variables · Extended Examples 1E2. Experimental Techniques · Completely Randomized Design · Randomized Block Design · Matched Pairs · Control Group and Placebo · Double Blind 1F. Sharp Points 1F1. Rounding and Significant Digits · How Many Digits? · How to Round Numbers · When to Round Numbers 1F2. Powers of 10 from Your Calculator 1F3. Show Your Work! 1F4. Optional: ∑ Means Add ’em Up What Have You Learned? Exercises for Chapter 1 What’s New

1A. Statistics? What’s That? Summary:

We live in an uncertain world. You never have complete information when you make a decision, but you have to make decisions anyway. Statistics helps you make sense of confusing and incomplete information, to decide whether a pattern is meaningful or just coincidence. It’s a way of taming uncertainty, a tool to help you assess the risks and make the best possible decisions.

1A1. What Should You Expect? Statistics is different from any other math course. Yes, you’ll solve problems. But most will be real-world practical problems: Does aspirin make heart attacks less likely? Was there racial bias in the selection of that jury? What’s your best strategy in a casino? (Most examples will be from business, public policy, and medicine, but we’ll hit other fields too.) There will be very little use of formulas. Real statisticians don’t do things by hand. They use calculators or software, and so will you. Your TI-83 or TI-84 may seem intimidating at first, but you’ll quickly get to know it and be amazed at how it relieves you of drudgery. With little grunt work to do, you will focus on what your numbers mean. You’re not just a button-pushing calculator monkey; you have to think about what you’re doing and understand it well enough to explain it. Most of the time your answers will be non-technical English, not numbers or statistical jargon. That may seem scary and unfamiliar at first, but if you stick with it you’ll love stretching your brain instead of just following a book’s examples by rote.

1A2. What Do You Get From the Course? It may be a required course, so you get that much closer to graduation. But you can get more than that. If you do it right, statistics teaches you to think. You become skeptical, unwilling to take everything at face value. Instead, when somebody makes a statement you question how they know that and what they’re not telling you. You can’t be fooled so easily. You become a more thoughtful citizen, a more savvy consumer. Who knows? You might even have some fun along the way. So— Let’s get started!

1A3. Sample and Population Suppose you want to know about the health of athletes who use steroids versus those who don’t. Or you want to know whether people are likely to buy your new type of chips. Or you want to know whether a new type of glue makes boxes less likely to come apart in shipping. How do you answer questions like that? With most things you want to know, it’s impossible or impractical to examine every member of the group you want to know about, so you examine part of that group and then generalize to the whole group. Definitions:

A sample is the group you actually take data from. The population is the group you want to know something about. In Good Samples, Bad Samples, later in this chapter, you’ll see how samples are actually taken. The sample is usually a subgroup of the population, but in a census the whole population is the sample.

Example 1: You want to know what proportion of likely voters will vote for your candidate, so you poll 850 people. The people you actually ask are your sample, and the likely voters are the population. Caution!: Your sample is the 850 people you took data from, not just the subgroup that said they would vote for your candidate. The population is all likely voters, regardless of which candidate they prefer. Yes, you want to know who will vote for your candidate, but everybody’s vote counts, so the group you want to know something about — the population — is all likely voters. Definitions:

The number of members of your sample is called the sample size or size of the sample (symbol n), and the number of members of the population is called the population size or size of the population (symbol N). “Sometimes it is not possible to count the units contained in the population. Such a population is called infinite or uncountable.” (Finite and Infinite Population 2014) “Smokers” is an example. There is a definite number of smokers in the world at any moment, but if you try to count them the number changes while you’re counting. The sample size is always a definite number, since you always know how many individuals you took data from.

Example 2: You’re monitoring quality in a factory that turns out 2400 units an hour, so you test 30 units from each hour’s production. The units you tested are your sample, and your sample size is 30. All production in that hour is the population, and the population size is 2400.

Isn’t the population the factory’s total production, since you want to know about the overall quality? No! Your sample was all drawn from one hour’s production. A sample from one production run can tell you about that production run, not about overall operations. This is why quality testing never ends.

Example 3: You’re testing a new herpes vaccine. 800 people agree to participate in your study. You divide them randomly into two groups and administer the vaccine to one group and a placebo (something that looks and feels like a vaccine but is medically inactive) to another group. Over the course of the study, a few people drop out, and at the end you have 397 vaccinated individuals and 396 who received the placebo. You have two samples, individuals who were vaccinated (n₁ = 397) and the control group (n₂ = 396). The corresponding populations are all people who will take this vaccine in the future, and all people who won’t. Both of those populations are uncountable or infinite because more people are being born all the time.

1A4. Descriptive and Inferential Statistics Sometimes you want to summarize the data from your sample, and other times you want to use the sample to tell you something about the larger population. Those two situations are the two grand branches of statistics. Definition:

Descriptive statistics is summarizing and presenting data that were actually measured. Inferential statistics is making statements about a population based on measurements of a smaller sample.

Example 4: “52.9% of 1000 voters surveyed said they will vote for Candidate A.” That is descriptive statistics because someone actually measured (took responses from) those 1000 people. Compare: “I’m 95% confident that 49.8% to 56.0% of voters plan to vote for Candidate A.” That is inferential statistics because no one has asked all voters. Instead, a sample of voters was asked, and from that an estimate was made of the feelings of all voters.
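Where might an interval like 49.8% to 56.0% come from? The method is covered later in the book. As a rough sketch only, the figures in Example 4 are consistent with the common normal-approximation interval for a proportion. The function name `proportion_interval` and the choice of that approximation are assumptions made here for illustration; this section does not say how the interval was computed.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion.

    p_hat is the sample proportion (a statistic); the returned interval
    is an estimate of the unknown population proportion (a parameter).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_interval(0.529, 1000)
print(f"{low:.1%} to {high:.1%}")  # 49.8% to 56.0%
```

The input 52.9% is descriptive (it was measured); the output range is inferential (it is a statement about all voters).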

1A5. Statistic and Parameter Definitions:

A statistic is a numerical summary of a sample. A parameter is a numerical summary of a population. Mnemonic: sample and statistic begin with s; population and parameter both begin with p.

Continuing with Example 4: “52.9% of 1000 voters surveyed plan to vote for Candidate A.” Here 52.9% is a statistic because it summarizes the sample. “I’m 95% confident that 49.8% to 56.0% of voters plan to vote for Candidate A.” Here 49.8% to 56.0% is an estimate of a parameter. (The actual parameter is the exact proportion presently planning to vote for A, which you don’t know exactly.)

A statistic is always a statement of descriptive statistics and is always known exactly, because a statistic is a number that summarizes a sample of actual measured data. A parameter is usually estimated, not known exactly, and therefore is usually a matter of inferential statistics. The exception is a census, in which data are taken from the whole population. In that case, the parameter is known exactly because you have complete data for the population, so the parameter is then descriptive statistics.

Describing … | The number is … | And the process is …
Any sample | A statistic | Descriptive statistics
A population (usually) | A parameter | Inferential statistics
A census (pop. w/ every member surveyed) | Both statistic and parameter | Descriptive statistics

1B. Good Samples, Bad Samples Summary:

A good sample is a smaller group that is representative of the population. A bad sample does a bad job of representing the population. You already know that a random sample is a good thing, but did you know that a random sample is actually carefully planned? What if you can’t take a true random sample? What are good and bad ways to gather samples? All valid samples share one characteristic: they are chosen through probability means, not selected by any decisions made by the person taking the sample. Every valid sample is gathered according to some rule that lets the impersonal operations of probability do the actual selection.

Definition:

A probability sample is a sample where the members are chosen by a predetermined process that uses the workings of chance and removes discretion from the investigators. Some of the types of probability samples are discussed below.

See also:

For lots of examples of good sampling and (usually) clear presentation of data about the American people, you might want to visit the Pew Research Center and its tech-oriented spinoff, Pew Internet. The venerable Gallup Poll also makes available its snapshots of the American public.

1B1. The Gold Standard: Random Samples Definition:

A random sample (also called a simple random sample) is a sample constructed through a process that gives every member of the population an equal chance of being chosen for the sample.

You always want a random sample, if you can get one. But to create a random sample you need a frame, and in many situations it’s impossible or unreasonably difficult to list all members of the population. The sections below explain alternative types of samples that can lead to statistically valid results. “Random” doesn’t mean haphazard. Humans think we’re good at constructing random sequences of letters and digits, but actually we’re very bad at it. Try typing 1300 “random” letters on your keyboard. If you do it really randomly, you should get about 1300÷26 = 50 of each letter. (Note: about 50 of each, not exactly 50. To determine whether a particular sample of text is unreasonably far from random letters, see Testing Goodness of Fit to a Model.) But if you’re like most people, the distribution will be very different from that: some letters will occur many more than 50 times, and others many less. So how do you construct a random sample? You need a frame, plus a random-number generator or random-number table. Definition:

A sampling frame, or simply a frame, is a list of all members of the population in a way that lets you assign a unique number to each one. The frame need not be a physical list; it can be a computer file — these days it usually is. But it has to be a complete list.
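The letter-typing experiment described above is easy to simulate. The sketch below draws 1300 letters with Python's pseudo-random generator and counts them; the function name and the seed value are arbitrary choices made for this illustration. With truly random letters each count should land near 1300÷26 = 50, though rarely exactly on it.

```python
import random
import string
from collections import Counter

def random_letter_counts(n_letters=1300, seed=None):
    """Draw n_letters letters uniformly at random and count each one."""
    rng = random.Random(seed)
    letters = rng.choices(string.ascii_lowercase, k=n_letters)
    counts = Counter(letters)
    # Report all 26 letters, including any that never came up
    return {letter: counts.get(letter, 0) for letter in string.ascii_lowercase}

counts = random_letter_counts(seed=2016)
expected = 1300 / 26  # 50 per letter on average
print("expected per letter:", expected)
print("smallest and largest counts:", min(counts.values()), max(counts.values()))
```

Comparing these counts with the counts from letters you typed by hand gives a feel for how far human "randomness" strays from the real thing.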

If you have a table of random numbers, the table will come with instructions for use. I’ll show you how to do it with the TI-83/84, but you could also do it with Excel’s RANDBETWEEN( ) function, or with any other software that generates pseudo-random numbers. (The Web site random.org provides true random numbers based on atmospheric noise.)

Seeding the Random-Number Generator

Random numbers from software or a calculator aren’t really random, but what we call pseudorandom numbers. That means that they are generated by deterministic calculations designed to mimic randomness pretty well but not perfectly. To help them do a better job, you need to “seed” the random number sequence, meaning that you give it a unique starting point so that your sequence of random numbers is different from other people’s. You seed the random numbers only once. To do this:

1. Turn on the calculator and press [CLEAR].
2. Come up with a number through some means other than choosing it. For instance, select the first number you see in the newspaper or in a book that you let fall open where it will. Type this number into the calculator. (Eyes closed, I tapped the financial page with a pen and used the number that the pen touched.)
3. Press [STO→], which shows on your screen as →.
4. Press [MATH], arrow to the PRB menu, and press [1] to paste rand to the screen. Press [ENTER].

Again, you need to seed random numbers only once on your calculator.

Selecting Members of the Sample

For this you need to know the size of the population, which is the number of individuals in your frame. You will generate a series of random numbers between 1 and the population size, as follows:

1. Press [MATH], arrow to the PRB menu, and press [5] to paste randInt( to your screen.
2. Press [1] [,], enter the population size, and press [)] [ENTER] to generate the first random number. In my case the population size was 20,147 and my first random number was 4413, so the first member of my sample will be the 4413th individual, in order, from the sampling frame.
3. Press [ENTER] to generate the next random number. (The randInt function may or may not be displayed again, depending on your calculator model and settings.) In my case, the next random number is 4949, so the 4949th individual in my frame becomes the second member of my sample.
4. Continue pressing [ENTER] until you have your desired sample size. If you get a duplicate random number, simply ignore it and take the next one. (If your calculator has [8] randIntNoRep, use it instead of plain randInt to prevent duplicates from appearing in the first place.)
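If you work in software rather than on the calculator, the same procedure can be sketched in Python. This is only an illustration: the function name `simple_random_sample` and the seed value are made up here, and Python's generator will not reproduce the calculator's numbers (4413, 4949, and so on).

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct positions from 1..population_size."""
    if sample_size > population_size:
        raise ValueError("sample cannot be larger than the population")
    rng = random.Random(seed)      # seed once, like the calculator
    chosen = []
    while len(chosen) < sample_size:
        pick = rng.randint(1, population_size)  # like randInt(1, N)
        if pick not in chosen:     # ignore duplicates and draw again
            chosen.append(pick)
    return chosen

positions = simple_random_sample(20147, 10, seed=12345)
print(positions)
```

Each number is a position in the sampling frame. Python's `random.sample(range(1, N + 1), n)` draws without repeats in one call, much as randIntNoRep does.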

1B2. Almost as Good: Systematic Samples Definition:

A systematic sample with k = some number is one where you take every kth individual from a representative subset of the population to make up your sample.

Example 5: Standing outside the grocery store all day, you survey every 40th person. That is a systematic sample with k=40.

If properly taken, a systematic sample can be treated like a random sample. Then why do I call it almost as good? Because you have to make one big assumption: that the variable you’re surveying is independent of the order in which individuals appear. In the grocery-store example, you have to assume that shoppers in the time period when you take your survey are representative of all shoppers. That may or may not be true. For example, a high proportion of Wegmans shoppers at lunch time are buying prepared foods to eat there or take back to work. At other times, the mix of groceries purchased is likely to be different.

Taking a Systematic Sample

1. Estimate the number of individuals you will be sampling from, and call this N. (Here your sampling frame is smaller than the population.) In the grocery-store example, estimate how many shoppers will pass the point where you will stand during the time you’re standing there. If you estimate 1200 shoppers during the six hours when you’ll take your survey, then N=1200. If you’re pretty unsure of N, you may need to observe that spot without taking the survey, just to get a preliminary count.
2. Decide how large a sample you want. Divide N by your desired sample size, rounding down, and call the result k. If you want 95 grocery shoppers in your sample, then k = N/95 = 1200/95 = 12.63 → k=12. If your estimate of N is uncertain, you’ll want to reduce k a bit. This will increase your sample size, but a sample that’s too large (within reason) is better than one that’s too small.
3. If you have never seeded the random-number generator, do it now. See Seeding the Random-Number Generator, above.
4. Take a random number from 1 to k to determine which person will be first in your sample. To do this, press [MATH], arrow to the PRB menu, and press [5] to paste randInt(, then press [1] [,]. Enter the value of k and press [)] [ENTER]. Caution: It’s 1 to k, not 1 to N. If you need to survey every 12th person, then you use randInt(1,12). For determining where to start in the first 12 people, randInt(1,95) and randInt(1,1200) are both wrong. For illustration, take k=12. In my case the calculator determined that I would start with the 2nd person and take every 12th person after that: 2, 14, 26, 38, 50, and so on.
5. If you reach your desired sample size sooner than expected, keep going for the originally planned time. Why? Because you don’t know whether the individuals that appear early are different from those that appear late. The good news is that the larger sample will give you more accurate results, always a good thing.
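The arithmetic in the steps above can be sketched in a few lines of Python. The helper names `sampling_interval` and `systematic_positions` are invented for this illustration; the numbers are the ones from the grocery-store example.

```python
import random

def sampling_interval(estimated_n, desired_sample_size):
    """k = N divided by the desired sample size, rounded down."""
    return estimated_n // desired_sample_size

def systematic_positions(start, k, estimated_n):
    """Every kth position beginning at start, up to estimated_n."""
    return list(range(start, estimated_n + 1, k))

k = sampling_interval(1200, 95)          # 1200/95 = 12.63, rounded down to 12
start = random.randint(1, k)             # 1 to k, not 1 to N
positions = systematic_positions(start, k, 1200)
print(k, start, positions[:5], len(positions))
```

With a start of 2, `systematic_positions(2, 12, 1200)` begins 2, 14, 26, 38, 50, matching the illustration in step 4. Because k was rounded down, the list holds 100 positions rather than 95, which errs on the side of a larger sample.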

1B3. Good but Hard: Cluster Samples Sometimes a true random sample is possible but unreasonably difficult. For example, you could use census records to take a random sample of 1000 adults in the US, but that would mean doing a lot of travel. So instead you take a cluster sample. Definition:

In a cluster sample, you first subdivide the population into a large number of subunits, called clusters, and then you construct a random sample from the clusters. “In single-stage cluster sampling, all members of the selected clusters are interviewed. In multi-stage cluster sampling, further subdivisions take place.” (Upton and Cook 2008, 76)

Example 6: You want to have 600 representative Americans try your new neck pillow to gauge your potential market. Travel to 600 separate locations across the country would be ridiculously expensive, so you randomly select 30 census tracts and then randomly select 20 individuals within each selected census tract. A cluster sample makes one big assumption: that the individuals in each cluster are representative of the whole population. You can get away with a slightly weaker assumption, that the individuals in all the selected clusters are representative of the whole population. But it’s still an assumption. For this and other technical reasons, a cluster sample cannot be analyzed in all the same ways as a random sample or systematic sample. Analysis of cluster samples is outside the scope of this course.
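The two-stage selection in Example 6 can be sketched as follows. Everything about the frame here is hypothetical: the 500 tracts, their names, and the 4000 numbered residents in each are placeholders standing in for real census records.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Randomly pick clusters, then randomly pick members within each.

    clusters maps a cluster name to the sequence of its members.
    """
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in picked}

# Hypothetical frame: 500 made-up tracts of 4000 numbered residents each
tracts = {f"tract-{t:03d}": range(1, 4001) for t in range(1, 501)}
sample = two_stage_cluster_sample(tracts, n_clusters=30, n_per_cluster=20, seed=1)
print(len(sample), sum(len(people) for people in sample.values()))  # 30 600
```

Note that only 30 locations need to be visited, which is the whole point of clustering. The code selects the sample; it does not address the special analysis that cluster samples require.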

1B4. Stratified Samples Sometimes you can identify subgroups of your population and you expect individuals within a subgroup to be more alike than individuals of different subgroups. In such a case, you want to take a stratified sample. Definition:

If you can identify subgroups, called strata (singular: stratum), that have something in common relative to the trait that you’re studying, you want to ensure that your sample has the same mix of those groups as the population. Such a sample is called a stratified sample.

Example 7: You’re studying people’s attitudes toward a proposed change in the immigration laws for a Presidential candidate. You believe that some races are more likely to favor loosening the law and others are more likely to oppose it. If the population is 66% non-Hispanic white, 14% Hispanic, 12% black, 4% Asian, and so on, your sample should have that same composition. A stratified sample is really a set of mini-samples grouped together. Example 8: You want to survey attitudes towards sports at a college that is 45% male and 55% female, and you want 400 in your sample. You would take a sample of 45%×400 = 180 male students and 55%×400 = 220 female students to make up your sample of 400. Each mini-sample would be taken by a valid technique like a random sample or systematic sample.
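The allocation in Example 8 is simple proportional arithmetic, sketched below. The function name `proportional_allocation` is an assumption for illustration, and the sketch covers only how many to take from each stratum, not how each mini-sample is drawn.

```python
def proportional_allocation(strata_percent, sample_size):
    """Split sample_size across strata in proportion to population shares.

    strata_percent maps each stratum to its percentage of the population.
    """
    if sum(strata_percent.values()) != 100:
        raise ValueError("stratum percentages must add up to 100")
    return {stratum: round(sample_size * pct / 100)
            for stratum, pct in strata_percent.items()}

allocation = proportional_allocation({"male": 45, "female": 55}, 400)
print(allocation)  # {'male': 180, 'female': 220}
```

When the percentages don't divide the sample size evenly, rounding can leave the total a person or two off the target, so check the sum and adjust by hand.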

1B5. Census Definition:

A census is a sample that contains every member of the population.

In many situations, it’s impossible or highly inconvenient to take a census. But with the near-universal computerization of records, a census is practical in many situations where it never used to be.

Example 9: At the push of a button, a librarian can get the exact average number of times that all library books have been checked out, with no need for sampling and estimation. An apartment manager can tell the exact average number of complaints per tenant. And so forth.

A census is the only sample that perfectly represents the population, because it is the whole population. If you can take a census, you’ve reduced a problem of inferential statistics to one of descriptive statistics. But even today, only a minority of situations are subject to a census. For instance, there’s no way to test a drug on every person with the condition that the drug is meant to treat. It’s totally impractical to interview every potential voter and determine his or her preferences. And so forth.

1B6. Bogus Samples Any sample where people select the individual members is a bogus sample. That means every sample where people select themselves, and every sample where the interviewer decides whether to include or exclude individual members. Why is that bad? Remember, a proper sample is a smaller group that is representative of the population. No sample will represent the population perfectly, but you do the best you possibly can. The good samples listed above can go bad if you make various kinds of mistakes, but a sample that doesn’t depend on the workings of chance is always wrong and cannot be made right. The textbooks will give you names for the types of bad samples (convenience sample, opportunity sample, snowball sample, and so on), but why learn the names when they’re all bogus anyway?

Good Samples | Bad Samples
Chosen through probability methods | Chosen by individual decisions about which persons or things to include
Represent the population as well as possible | Do not accurately represent the population
Uncertainty can be estimated, and can be reduced by increasing sample size | Uncertainty cannot be estimated, and bigger samples don’t help

So goodbye to Internet polls and petitions, letter-writing campaigns, “the first 500 volunteers”, and every other form of self-selected sample. If people select themselves for a sample, then by definition they are not representative because they feel more strongly than the people who didn’t select themselves. You can make statements about the people who selected themselves, but that tells you nothing about the much larger number who didn’t select themselves. (More about this in Simon 2001, Web Polls.) Goodbye also to any kind of poll where the pollster selects the individual people. If you set up a rule that depends on the workings of chance and then follow it, that’s okay. But if you decide on the spur of the moment who gets included, that’s bogus. Why is it bad to just approach people as you see them? Because studies show that you are more likely to approach people that you perceive to be like you, even if you’re not aware of that. Ask yourself if you are truly equally likely to select someone of a different race or sex from yourself, someone who is dressed much richer or poorer than you, someone who seems much more or much less attractive, and so forth. Unless you’re Gandhi, the honest answer is “not equally likely”. It doesn’t make you a bad person, just a bad pollster like everyone else. If you tend to pick people who are more like you, your sample is not representative of the population. The same principle applies to studies of non-humans. Here the investigator’s intrinsic biases may be less clear, but unless you choose your sample based on chance you can never be sure that those biases didn’t “skew it up”.
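Here is a minimal sketch of what "a rule that depends on the workings of chance" might look like in practice. It assumes you already have a complete list of the population (a sampling frame); the student ID numbers, the sample size, and the function name `simple_random_sample` are made up for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Pick n distinct members of the frame, every member equally likely."""
    rng = random.Random(seed)      # a seed makes the draw repeatable
    return rng.sample(frame, n)    # sampling without replacement

# Hypothetical sampling frame: ID numbers of 5,000 enrolled students.
frame = list(range(1, 5001))

# The rule is fixed before anyone is approached; chance does the choosing.
sample = simple_random_sample(frame, 50, seed=2016)
print(sorted(sample)[:10])
```

The point of the sketch is that the interviewer never decides who gets in: once the frame and the sample size are set, the random number generator makes every individual choice.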

1C. Data and Variables Summary:

Statistics is all about data and variables, but what exactly do those terms mean? What are the types of data and variables? This will be an important topic throughout the course, because different variable types are presented differently in descriptive statistics, and again are analyzed differently in inferential statistics. So before you do anything, you need to think what type of data you’re dealing with.

1C1. What Are Data? What Are Variables? Definitions:

Variables are the characteristics you’re studying. Data are the values of those characteristics that you record, and the value recorded from any given member of the sample is called a data point or datum. You can think of the variable as kind of like a question, and the data points as the answers to that question. If you record one piece of information from each member of the sample, you have univariate data; if you record two pieces of information from each member, you have bivariate data.

Example 10: You record the birth weights of babies born in a certain hospital during the year. The variable is “birth weight”. Example 11: In April, you ask all the members of your sample whether they had the flu vaccine that year and how many days of work or school they lost because of colds or flu. (Can you see at least two problems with that second question? If not, you will after you read about Nonsampling Errors, later in this chapter.) This is bivariate data. One variable is “flu shot?” and the data points are all yes or no; the other variable is “days lost to colds and flu” and the data points are whole numbers.

1C2. Quantitative or Qualitative? Definitions:

Quantitative data are data that are numbers. Quantitative data are also called numeric data. Numeric data are subdivided into discrete and continuous data. Discrete data are whole numbers and typically answer the question “how many?” Continuous data can take on any value (or any value within a certain range) and typically answer the question “how much?” Qualitative data are data that are not numbers. Qualitative data are also called nonnumeric data, attribute data or categorical data.

Common mistakes:

Just seeing numbers in a problem does not mean you have numeric data. Consider this statement: “45% of viewers polled said they thought Candidate X performed well in the debate.” There’s a number there, all right, but you have non-numeric data because each person answered “yes” or “no”, which means the individual data points are non-numeric. Some data look like numbers but aren’t: ZIP codes, for instance. When in doubt, ask yourself, “Would it make sense to average the data?” If the answer is no, you have non-numeric data.

Sometimes we talk about data types, and sometimes about variable types. They’re the same thing. For instance, “weight of a machine part” is a continuous variable, and 61.1 g, 61.4 g, 60.4 g, 61.0 g, and 60.7 g are continuous data. Quantitative (numeric)

Qualitative (categorical or non-numeric)

You get a number from each member of the sample.

You get a yes/no or a category from each member of the sample.

The data have units (inches, pounds, dollars, IQ points, whatever) and can be sorted from low to high.

The data may or may not have units and do not have a definite sort order.

It makes sense to average the data.

Your summary is counts or percentages in each category.

Examples (discrete): number of children in a family, number of cigarettes smoked per day, age at last birthday Examples (continuous): height, salary, exact age

Examples: hair color, marital status, gender, country of birth, and opinion for or against a particular issue

Continuous or discrete data? Sometimes when you have numeric data it’s hard to say whether you have discrete or continuous data. But since you’ll graph them differently, it’s important to be clear on the distinction. Here are two examples of doubtful cases: salary and age. It’s true that your salary can be only a whole number of pennies. But there are a great many possible values, and the distance between the possible values is quite small, so you call salary a continuous variable. Besides, you don’t ask “how many pennies do you make?” but rather “how much do you make?” What about age? Well, age at last birthday is clearly discrete since it can be only a whole number: “how many years old were you at your last birthday?” But age now, including years and months and days and fractions of days, would be continuous, again because you can subdivide it as finely as desired.

1C3. Summary Statements When you see a summary statement, you have to do a little mental detective work to figure out the data type. Always ask yourself, what was the original measurement taken or question asked? Example 12: “The average salary at our corporation is $22,471.” The original measurement was the salary of each individual, so this is continuous data. Example 13: “The average American family has 1.7 children.” Don’t let “1.7” fool you into identifying this as a continuous variable! What was the original question or measurement? “How many children are there in your family?” That’s discrete data. Example 14: “Four out of five dentists surveyed recommend Trident sugarless gum for their patients who chew gum.” Yes, there are numbers in the summary statement, but the original question asked of each dentist was “Do you recommend Trident?” That is a yes/no question, so the data type is categorical.
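A small sketch may help with the detective work. The answers below are invented purely for illustration; the idea is that the summary number can look continuous even though every original data point is a whole number or a yes/no answer.

```python
from collections import Counter
from statistics import mean

# Hypothetical answers to "How many children are there in your family?"
children = [2, 0, 1, 3, 2, 1, 4, 0, 2, 2]          # discrete data

# Hypothetical answers to "Do you recommend this gum?"
recommend = ["yes", "yes", "no", "yes", "yes"]     # categorical data

# The summary of discrete data can be a decimal (here 17 / 10)...
avg_children = mean(children)

# ...and the summary of categorical data is a count or a proportion.
counts = Counter(recommend)
share_yes = counts["yes"] / len(recommend)

print(avg_children, counts, share_yes)
```

To identify the data type, look at the items in the lists, not at the numbers computed from them.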

1D. Statistical Errors Summary:

In statistics, an error is not necessarily a mistake. This section explores the types of statistical errors and where they come from.

Definition:

An error is a discrepancy between your findings and reality. Some errors arise from mistakes, but some are an inevitable part of the sampling process.

1D1. Sampling Error Definition:

Even if you make no mistakes, inevitably samples will vary from each other and any given sample is almost sure to vary from the population. This variability is called sampling error. (It would probably be more helpful to call it sample variability, but we’re stuck with “sampling error”.) Sampling error “refers to the difference between an estimate for a population based on data from a sample and the ‘true’ value for that population which would result if a census were taken.” (Australian Bureau of Statistics 2013)

Except for a census, no sample is a perfect representation of the population. So the sample mean (average), for example, will usually be a bit different from the population mean. Sampling errors are unavoidable, even if you do everything right when you take a random sample. They’re not mistakes, they’re just part of the nature of things. Although sampling error cannot be eliminated, the size of the error can be estimated, and it can be reduced. For a given population, a larger sample size gives a smaller sampling error. You'll learn more about that when you study sampling distributions.
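If you'd like to see sampling error shrink as samples get bigger, a simulation along these lines is one way to do it. Everything here is an assumption made for the demonstration: the population is 20,000 made-up values, and "typical sampling error" is measured as the average absolute gap between a sample mean and the population mean over 500 random samples.

```python
import random
from statistics import mean

rng = random.Random(1)

# Hypothetical population of 20,000 measurements.
population = [rng.gauss(100, 15) for _ in range(20_000)]
mu = mean(population)              # the "true" value a census would give

def typical_sampling_error(n, trials=500):
    """Average absolute gap between a sample mean and the population mean."""
    gaps = []
    for _ in range(trials):
        sample = rng.sample(population, n)   # a fresh random sample each time
        gaps.append(abs(mean(sample) - mu))
    return mean(gaps)

small = typical_sampling_error(25)
large = typical_sampling_error(400)
print(f"n = 25:  typical error about {small:.2f}")
print(f"n = 400: typical error about {large:.2f}")
```

No mistakes are made anywhere in this process, yet the sample means still miss the population mean. The misses are simply smaller, on average, with the larger samples.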

1D2. Nonsampling Errors Definition:

Nonsampling errors are discrepancies between your sample and reality that are caused by mistakes in planning, in collecting data, or in analyzing data.

Nonsampling errors make your sample unrepresentative of the population and your results questionable if not useless. Unlike sampling errors, nonsampling errors cannot be reduced by taking larger samples, and you can’t even estimate the size of most nonsampling errors. Instead, the mistakes must be corrected, and probably a new sample must be taken. There are many types of nonsampling errors. Different authors give them different names, but it’s much more important for you to recognize the bad practice than to worry about what to name it. In taking your own samples, and in evaluating what other people are telling you about their samples, always ask yourself: what could go wrong here? has anything been done that can make this sample unrepresentative of the population? Here are some of the more common types of nonsampling errors. After you read through them, see how many others you can think of. Self-Selected Samples This is almost always bogus. People who select themselves are by definition different from people who don’t, which means they are not representative. It can be very hard to know whether that difference matters in the context of a particular study. Since you can never be sure, it is safest to avoid the problem and not let people select themselves. But medical studies all use volunteers. (They have to, ethically.) Why doesn’t that make the sample bogus? They’re volunteers, but usually they’re not self-selected volunteers. For example, researchers may ask doctors and hospitals to suggest patients who meet a particular profile; they use probability techniques to select a sample from that pool. But things are not always simple. For example, some companies or researchers may advertise and pay volunteers to undergo testing. In this case you have to ask very serious questions about whether the volunteers are representative of the general population. Statistical thinking isn’t a matter of black and white, but some pretty sophisticated judgment can be involved. 
Your take-away is: don’t accept anything at face value, but always ask: What important facts are being left out? What does that do to the credibility of the results? Sampling Bias Definition:

Sampling bias results from taking a sample in a way that tends to over- or underrepresent some subgroup of the population.

Example 15: If you’re doing a survey on student attitudes toward the cafeteria, and you conduct the survey in the cafeteria, you are systematically under-representing students who don’t use the cafeteria. It seems logical that attitudes are more negative among students who don’t use the cafeteria than among students generally, so by excluding them you will report overall attitude as more favorable than it really is. “Bias” is a good example of the words in statistics that don’t have their ordinary English meaning. You’re not prejudiced against students who dislike the cafeteria. “Bias” in statistics just means that something tends to distort your results in a particular direction. Example 16: The classic example of sampling bias is the Literary Digest fiasco in predicting that Landon would beat Roosevelt in the 1936 election. The magazine sent questionnaires to all its subscribers, it phoned randomly selected people in telephone books, and it left stacks of questionnaires at car dealerships with instructions to give one to every person who test drove a car. The sample size was in the millions. This procedure systematically over-represented people who were well off and systematically under-represented poorer people. In 1936 the Great Depression still held sway, and most people did not have the disposable income to subscribe to a fancy magazine, let alone a home telephone; the very thought of buying a car would have struck them as ridiculous or insulting. In that era, the Republicans appealed more to the rich and the Democrats more to the working class. So the net effect of the Literary Digest’s procedure was that it made the country look a lot more Republican than it actually was. Since Landon was a Republican and FDR a Democrat, FDR’s actual support was much greater than shown by the poll, and Landon’s was much less. Notice that a sample size of millions did not overcome sampling bias. A larger sample size is not an answer to nonsampling errors. 
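The lesson that a huge sample can't rescue a biased procedure can be sketched with a simulation. The numbers are invented and are not the real 1936 figures: assume 30% of voters are well off, that 35% of well-off voters and 65% of everyone else support candidate X, and that the biased poll reaches only one in five of the voters who aren't well off.

```python
import random

rng = random.Random(1936)

def draw_person():
    """Return (well_off, supports_x) for one hypothetical voter."""
    well_off = rng.random() < 0.30
    supports_x = rng.random() < (0.35 if well_off else 0.65)
    return well_off, supports_x

def poll(n, reach_well_off=1.0, reach_other=1.0):
    """Share supporting X among n respondents.

    reach_* is the chance that a contacted voter of that type
    actually ends up in the poll.
    """
    yes = 0
    got = 0
    while got < n:
        well_off, supports_x = draw_person()
        reach = reach_well_off if well_off else reach_other
        if rng.random() < reach:
            got += 1
            yes += 1 if supports_x else 0
    return yes / n

true_support = 0.30 * 0.35 + 0.70 * 0.65          # 0.56 under these assumptions
fair_small = poll(1_000)                          # everyone equally reachable
biased_huge = poll(100_000, reach_other=0.2)      # poorer voters under-reached

print(f"true {true_support:.3f}  fair n=1,000 {fair_small:.3f}  "
      f"biased n=100,000 {biased_huge:.3f}")
```

Under these assumptions the small fair poll lands near the true value, while the poll a hundred times larger is off by many percentage points, and in the direction of the over-represented group.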
The Digest’s original article can be found in Landon in a Landslide: The Poll That Changed Polling (American Social History Project). While we’re on the subject of presidential elections, different nonsampling errors also led to wrong predictions of a Dewey victory over Truman in 1948. For analyses of both the 1936 and the 1948 statistical mistakes, see Classic Polling Surprises (2004) and Introduction to Polling (n.d.). Selection Bias Beyond sampling bias, there are many other bad practices in selecting your sample that can bias the results. Wikipedia’s Selection Bias has a good rundown of quite a few. Non-Response Errors If you’re taking a mail survey, a significant number of people (probably a majority) won’t respond. Are the responders representative of the non-responders, or has a bias been introduced by the nonresponse? That’s a tough question, and the answer may not always be clear. For this reason, mail surveys are often coded so that the investigators can tell who did respond, and follow up with those who didn’t. That follow-up can be more mail, a phone call, or a visit. Even with in-person polls, non-response is a problem: many people will simply refuse to participate in your survey. Depending on what you’re surveying, that could be unimportant or it could be a fatal flaw. Response Errors Definition:

Response errors occur when respondents give answers that don’t match objective truth or their own actual opinions.

Poorly worded survey questions are a major source of response errors, and lead to biased results or completely meaningless results. There may not be a perfect survey question, but having several people review the questions against a list of possible problems will greatly reduce the level of response errors. But response errors can never be completely eliminated. For instance, people tend to shade their answers to make themselves look good in their own eyes or in the interviewer’s eyes. Most people rate themselves as better-than-average drivers, for example, which obviously can’t be true. And self-reporting of objective facts is always suspect because memory is unreliable. Example 17: “How often do you read to your child?” (People will tend to award themselves points for good intentions, and they don’t want to look like bad parents.) “Do you think immigrants should be allowed to take jobs from honest Americans?” (That’s a leading question. Compare with “Should immigrants be allowed to apply for jobs and pay taxes in the US?” You can see that a given person might give different answers to those two questions.) “How much do you spend on food, including groceries and restaurants, in a typical week?” (Most people don’t carry an accurate accounting system in their heads.) “Do you agree with the X Party platform on gun control, education, abortion, and taxes?” (Asking too much in one question. If you agree with some but not all, how do you answer?) “Do you favor prison reform?” (Too vague — nobody’s against prison reform in principle, but the specifics of a particular policy would make all the difference.) “The President has proposed a Federal 30-day waiting period for the purchase of any automatic or semiautomatic weapon. Do you favor this proposal?” (Superficially this looks good: it’s specific, and it’s asking only one thing. But in fact it’s biased by the use of “favor” alone. 
You should word questions neutrally: “Do you favor or oppose …” Better yet, “What is your opinion of this proposal?” with options of strongly agree, agree, neutral, disagree, and strongly disagree.) “In the race for mayor, do you favor candidate A, B, or C?” (Some people are more likely to choose the first alternative because it’s first, and they don’t like to say they have no strong opinion. It is better to vary the order of candidates randomly to avoid a response error in favor of A. Of course, you should also offer “other” and “not sure” as alternatives.)
Data Errors These include mistakes by interviewers in recording respondents’ answers, mistakes by investigators in measuring and recording data, and mistakes in entering the recorded data. Inappropriate Analysis In the second half of the course you’ll learn a number of inferential statistics procedures. Each one is appropriate in some circumstances and inappropriate in others. If you use the wrong form of analysis in a given situation, or you apply it wrongly, your results will be about as good as the results from using a hammer to drive a screw.

1E. Observation and Experiment Summary:

There are two main methods of gathering data, the observational study and the experiment. Learn the differences, and what each one can tell you.

1E1. Observational Study Versus Designed Experiment Many, many statistical investigations try to find out whether A causes B. To do this, you have two groups, one with A and one without A, or you have multiple groups with different levels of A. You then ask whether the difference in B among the groups is significant. The two main ways to investigate a possible connection are the observational study and the experiment. The concepts aren’t hard, but there’s a boatload of vocabulary. Let’s get through the definitions first, and then have some concrete examples to show how the terms are used. Please read the definitions first, then read the first example and refer back to the definitions; repeat for the other examples. Definition:

In an observational study, the investigator simply records what happens (a prospective study) or what has already happened (a retrospective study). In an experiment, the investigator takes a more active role, randomly assigning members of the sample to different groups that get treated differently.

Which is better? Well, in an observational study, you always have the problem of wondering whether the groups are different in some way other than what you are studying. This means that an observational study can never establish cause. The best you can do after an observational study is to say that you found an association between two variables. BTW: How do we establish cause, when for ethical or practical reasons we can’t do an experiment? The nine criteria are listed in Causation (Simon 2000b) and were first laid down by Sir Austin Bradford Hill in 1965.

Definitions:

In an observational study or an experiment, there are two or more variables. You want to show that changes in one or more of them, called the explanatory variables, go with changes in one or more response variables. Explanatory variables are the suspected causes, and response variables are the suspected effects or results.

Example 18: Over the course of a year, you have parents record the number of minutes they spend every day reading to their child, and at the end of the year you record each child’s performance on standard tests. The explanatory variable is parental time spent reading to the child, and the response variable(s) are performance on the standardized test(s). Definitions:

In an experiment, the experimenter manipulates the suspected cause(s), called explanatory variable(s) or factor(s). A specific level of each factor is administered to each group. The level(s) of the explanatory variable(s) in a given group are known as its treatment.

Example 19: To test productivity of factory workers, you randomly assign them to three groups. One group gets an extra hour at lunch, one group gets half-hour breaks in morning and afternoon, and one group gets six 10-minute breaks spaced throughout the day. The explanatory variable or factor is structuring of break time, and the three levels or treatments are as described. Definitions:

In an experiment, each member of a sample is called a unit or an experimental unit. However, when the experiment is performed on people they are called subjects or participants.

Definition:

In any study or experiment, results will vary for individuals within each group, and results will also vary between the groups as a whole. Some of that variation is due to chance: it is expected statistical variability or sampling error. If the differences between groups are bigger than the variation within groups — and enough bigger, according to some calculations you’ll learn later — then the investigator has a significant result. A significant result is a difference that is too big to be merely the result of normal statistical variability. I’ll have a lot more to say about significance when you study Hypothesis Tests.

Confounding and Lurking Variables In Example 18, about reading to children, you find generally that the more time parents spend reading to first graders, the better the children tend to do on standard tests of reading level. Is the reading time responsible for the improved test scores? You can easily think of other possible explanations. Parents who spend more time reading to their children probably spend more time with them in general. They tend to be better off financially — if you’re working two jobs to make ends meet, you probably have little time available for reading to your children. Economic status and time spent with children in general are examples of lurking variables in this study. Definition:

A hidden variable that isn’t measured and isn’t part of your design but affects the outcome is called a lurking variable.

Example 20: In a large elementary school, you schedule half the second grade to do art for an hour, two mornings a week, with the district’s art teacher. The other half does art for an hour, two afternoons a week, with the same teacher, but they are told at the beginning that all their projects will be displayed and prizes given for the best ones. Can you learn anything from this about whether the chance to win prizes prompts children to do a better job on art projects? The problem is that there’s not just one difference in treatment here, the promised prizes. There’s also the fact that everyone’s project will be on display. And maybe mornings are better (or worse) for doing art than afternoons. Maybe the teacher is a morning person and fades in the afternoon, or is not a morning person and really shines in the afternoon. Even if there’s a difference in quality of the projects, you can’t make any kind of simple statement about the cause, because of these confounding variables. Definition:

A confounding variable is “associated in a non-causal way with a factor and affects the response.” (DeVeaux, Velleman, Bock 2009, 346) “Confounding occurs in an experiment when you [can’t] distinguish the effects of different factors.” (Triola 2011, 32)

In the art example, you wanted to find out whether promising prizes makes children do better art work. But the promise of prizes wasn’t the only difference between the two groups. Time of day and public display are confounding variables built into the design of this experiment. You know what they are, but you can’t untangle their effect from the effect of what you actually wanted to study. What’s the difference between lurking variables and confounding variables? Both confuse the issue of whether A causes B. A lurking variable L is associated with or causes both A and B, so any relationship you see between A and B is just a side effect of the L/A and L/B relationships. For example, counties with more library books tend to have more murders per year. Does reading make people homicidal? Of course not! The lurking variable is population size. High-density urban counties have more books in the library and more murders than low-density rural counties. A confounding variable C is typically associated with A but doesn’t cause it, so when you look at B you don’t know whether any effect comes from A, from C, or from both. For example, after a year with a lot of motorcycle deaths, a state passes a strict helmet law, and the next year there are significantly fewer deaths. Was the helmet law responsible? Maybe, but time is a confounding variable here. Were motorcyclists shocked at the high death toll, so that they started driving more carefully or switched to other modes of transit? Don’t obsess over the difference between lurking and confounding variables. Some authors don’t even make a distinction. You should recognize variables that make results questionable; that’s a lot more important than what you call them. BTW: That said, if you want to see two more takes on the difference, have a look at Confounding and Lurking Variables (Virmani 2012) and Confounding Variables (Velleman 2005).
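The library-books-and-murders example can be imitated with made-up numbers. In this sketch, which is only an illustration under stated assumptions, 200 hypothetical counties get a random population, and both the number of library books and the number of murders are set to depend on population and on nothing else. The helper `pearson_r` computes the ordinary correlation coefficient.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(7)

# Lurking variable: population of each hypothetical county.
populations = [rng.uniform(10_000, 1_000_000) for _ in range(200)]

# Assumed: roughly 2 books per resident and roughly 5 murders per
# 100,000 residents, each with its own independent random wobble.
books = [p * rng.uniform(1.5, 2.5) for p in populations]
murders = [p * rng.uniform(2e-5, 8e-5) for p in populations]

r_raw = pearson_r(books, murders)
r_per_capita = pearson_r(
    [b / p for b, p in zip(books, populations)],
    [m / p for m, p in zip(murders, populations)],
)
print(f"raw counts: r = {r_raw:.2f}   per resident: r = {r_per_capita:.2f}")
```

Books and murders never influence each other in this simulation, yet the raw counts are strongly correlated because both follow population. Once you divide by population, the apparent relationship should largely disappear.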

Lurking and confounding variables are the boogeyman of any statistical work. Lurking variables are the reason that an observational study can show only association, not causation. In experiments, you have the potential to exclude lurking variables, or at least to minimize them, but it takes planning and extra work, and you need to be careful not to create a design with built-in confounding. Whenever any experiment claims that A causes B, ask yourself what lurking variables there might be, and whether the design of the study has ruled them out. You can’t take this for granted, because even professional researchers sometimes cut corners, knowingly or unknowingly. Extended Examples Example 21: Does smoking cause lung cancer? Initial studies in the mid-20th century had three or four groups: non-smokers, light smokers, moderate smokers, and heavy smokers. They looked at the number and severity of lung tumors in the groups to see whether there was a significant difference, and in fact they found one. This was an observational study. Ethically it had to be: if you suspect smoking is harmful you can’t assign people to smoke. Explanatory variable: smoking level (none, light, moderate, heavy). Levels or treatments don’t apply to an observational study. Response variable: tumor production Because this was an observational study, there was no control for lurking variables, and even with a significant result you can’t say from this study that smoking causes lung cancer. What lurking variables could there be? Well, maybe some genetic factor both makes some people more likely to smoke and makes them more susceptible to lung cancer. This is a problem with every observational study that finds an effect: you can’t rule out lurking variables, and therefore you can’t infer causation, no matter how strong an association you find. Since you can’t do an experiment on humans that involves possibly harming them, how do you know that smoking causes lung cancer? 
A good explanation is in Causation (Simon 2000b). Example 22: Does aspirin make heart attacks less likely? Here you can do an experiment, because aspirin is generally recognized as safe. Investigators randomly assigned people to two groups, gave aspirin to one group but not the other, and then monitored the proportion who had heart attacks. They found a significantly lower risk of heart attack in the aspirin group. This was a designed experiment. Explanatory variable: aspirin. There were two levels or treatments: yes and no. Response variable: heart attack (yes/no) From this experiment, you can say that aspirin reduces the risk of heart attack. How can you be sure there were no lurking variables? By randomly assigning people to the two groups, investigators made each group representative of the whole population. For example, overweight is a risk factor for heart attacks. The random assignment ensures that overweight people form about the same proportion in each group as in the population. And the same is true for any other potential lurking variable. (It helps to have larger samples, and in this study each sample was about 10,000 people.) Example 23: Does prayer help surgical patients? Here again, no one thinks prayer is harmful, so ethically the experimenters were in the clear to assign cardiac-bypass patients randomly to three groups: people who knew they were prayed for, people who were prayed for and didn’t know it, and people who were not prayed for. Investigators found no significant difference in frequency of complications between the patients who were prayed for and those who were not prayed for. This was a designed experiment. Explanatory variables: receipt of prayer (two levels, yes and no) and knowledge of being prayed for (also two levels, yes and no). There were three treatments: (a) receipt=yes and knowledge=yes, (b) receipt=yes and knowledge=no, (c) receipt=no. Response variable: occurrence of post-surgical complications (yes/no). 
(You can read an abstract of the experiment and its results in Study of the Therapeutic Effects of Intercessory Prayer [Benson 2006]. The full report of the experiment is in Benson 2005.) Because lurking variables can’t be ruled out in an observational study, investigators always prefer an experiment if possible. If ethical or other considerations prevent doing an experiment, an observational study is the only choice. But then the best you can hope for is to show an association between the two variables. Only with an experiment do you have a hope of showing causation.

1E2. Experimental Techniques Okay, so you always have to do an experiment if you want to show that A causes B. Let’s look in more detail at how experiments are conducted, and learn best practices for an experiment. Caution: Design of Experiments is a specialized field in statistics, and you could take a whole course on just that. This chapter can only give you enough to make you dangerous. While you’re planning your first experiment in real life, it’s a good idea to get help from someone senior or a professional statistician. BTW: R. A. Fisher “virtually invented the subject of experimental design” (Upton and Cook 2008, 144), and pioneered many of the techniques that we use today. He was a great champion of planning: Upton & Cook quote him as saying “To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of.”

Completely Randomized Design Definitions:

Experimenters randomly assign members to the various treatment groups. We say that they have randomized the groups, and this process is called randomization.
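As a rough sketch of randomization, here is one way to assign units to treatment groups purely by chance. The worker labels and the function name `randomize` are invented for illustration; the three treatments are borrowed from the break-time example earlier in the chapter.

```python
import random

def randomize(units, group_names, seed=None):
    """Shuffle the units, then deal them out to the groups in turn."""
    rng = random.Random(seed)
    shuffled = list(units)         # copy, so the original list is untouched
    rng.shuffle(shuffled)          # chance, not the experimenter, sets the order
    groups = {name: [] for name in group_names}
    for i, unit in enumerate(shuffled):
        groups[group_names[i % len(group_names)]].append(unit)
    return groups

# Hypothetical sample of 30 factory workers.
workers = [f"worker{i:02d}" for i in range(1, 31)]
treatments = ["long lunch", "two half-hour breaks", "six short breaks"]

groups = randomize(workers, treatments, seed=19)
for name, members in groups.items():
    print(name, len(members), members[:3])
```

Dealing the units out in turn is harmless here only because the list was shuffled first. Alternating through the sample in its original order would not be random assignment.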

Why randomize? Why not just put the first half of the sample in group A and the second half in group B? Because randomization is how you control for lurking variables. Think about the study with aspirin and heart attacks. You know that different individuals are more or less susceptible to heart attacks. Risk factors include smoking, obesity, lack of exercise, and family history. You want your aspirin group and your non-aspirin group to have the same mix of smokers and non-smokers as the general population, the same mix of obese and non-obese individuals, and so on. Actually it’s harder than that. There aren’t just “smokers” and “non-smokers”; people smoke various amounts. There aren’t just “obese” and “fit” people, but people have all levels of fitness. It would be very laborious to do stratified samples and get the right proportions for a lot of variables. You’d have to have a huge number of strata. And even if you did do those matchups, taking enormous trouble and expense, what about the variables you didn’t think of? You can never be sure that the samples have the same composition as the population. It really must be random assignments — you can’t just assign test subjects to groups alternately. Steve Simon (2000a) explains why, with examples, in Alternating Treatments. Randomization is the indispensable way out. Instead of trying to match everything up yourself — and inevitably failing — you let impersonal random chance do your work for you. Are you guaranteed that the sample will perfectly represent the population? No, you’re not. Remember sampling error, earlier in this chapter. Samples vary from the population; that’s just the nature of things. But when you randomize, in the long run most of your samples will be representative enough, even though they’re not perfectly representative. Randomized Block Design Notice I said that randomization works in the long run. But in the short run it may not. 
Suppose you are testing a weight-loss drug on a group of 100 volunteers, 50 men and 50 women. If you completely randomize them, you might end up with 20 men and 30 women in one group, and 30 men and 20 women in the other. (There’s about a 20% chance of this, or a more extreme split.) Why is this bad? Because you don’t know whether men and women respond differently to the drug. If you see a difference between your 20/30 placebo group and your 30/20 treatment group, you don’t know how much of that is the drug and how much is the difference between men and women. Gender is a confounding variable. What to do? Create blocks, in this case a block of the 50 men and a block of the 50 women. Then within each block you randomly assign individuals to receive medication or a placebo. Now you can find how the drug affects women and how it affects men. This is called a randomized block design. When you can identify a potentially confounding variable before you perform your experiment, first divide your subjects into blocks according to that variable, and then randomize within each block. Do this, and you have tamed that confounding variable. BTW: R. A. Fisher coined the term “randomized block” in 1926.
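The weight-loss design described above might be sketched as follows. This is only an illustration under assumed names: `randomize_within_blocks`, the `("M", i)` and `("F", i)` volunteer labels, and the seed are hypothetical, and the sketch assumes each block splits evenly between the treatments.

```python
import random

def randomize_within_blocks(subjects, block_of, treatments, seed=None):
    """Group subjects into blocks, then randomly assign treatments inside each block."""
    rng = random.Random(seed)
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of(s), []).append(s)   # e.g. one block per gender
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)                           # chance decides the order within the block
        for i, s in enumerate(members):
            # Cycle through the treatments so each block is split as evenly as possible.
            assignment[s] = treatments[i % len(treatments)]
    return assignment

# Hypothetical volunteers: 50 men and 50 women, identified by (gender, number).
volunteers = [("M", i) for i in range(50)] + [("F", i) for i in range(50)]
plan = randomize_within_blocks(
    volunteers, block_of=lambda v: v[0], treatments=["drug", "placebo"], seed=7
)

men_on_drug = sum(1 for (sex, _), t in plan.items() if sex == "M" and t == "drug")
women_on_drug = sum(1 for (sex, _), t in plan.items() if sex == "F" and t == "drug")
print(men_on_drug, women_on_drug)   # each block is split 25/25 by construction
```

Because the split is forced to be even inside each block, a lopsided 20/30 mix of men and women cannot occur; chance only decides which individuals within a block get the drug.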

In this example, gender would be called a blocking variable because you divide your subjects into blocks according to gender. Now there’s no problem separating the effects of the drug from the effects of gender in your experimental results.

When I talked about complete randomization, I said it would be laborious to take strata of a lot of variables, and that complete randomization was the answer. But here I’m suggesting exactly that for men and women in the weight-loss study. Right about now, you might be telling me, “Make up your mind!”

This is where some judgment is needed in making tradeoffs. Men and women typically have different percentages of body fat, and they are known to respond differently to some drugs. It makes sense that a weight-loss drug could get a different response from men and women, and therefore you block on gender. But no other factor stands out as both important and measurable. If you tried to block on motivation, for instance, how would you measure it?

“Block what you can, randomize what you cannot” is a good rule, sometimes attributed to George Box. A variable is a candidate for blocking when it seems like it could make a difference, and you can identify and measure it. For other variables, we depend on randomization, either complete randomization or randomization within blocks.

Matched Pairs

There’s one circumstance where you can be sure that the subgroups are perfectly matched with respect to lurking variables: when you use a matched-pairs design. This is kind of like a randomized block design where each block contains two identical subjects.

Example 24: You want to know whether one form of foreign-language instruction is more effective than another. So you take fifty pairs of identical twins, and assign one twin from each pair to group A and the other twin to group B. Then you know that genetic factors are perfectly balanced between the two groups. And if you restrict yourself to twins raised together, you’ve also controlled for environmental factors.

A special type of matched-pairs design matches each experimental unit to itself.

Example 25: You want to know the effect of caffeine on heart rate. You don’t assemble a sample, give coffee to half of them, and measure the difference in heart rate between the groups. People’s heart rates vary quite a bit, so you would have large variation within each group, and that might swamp the effect you’re looking for. Instead, you measure each individual’s resting heart rate, then give him or her a cup of coffee to drink, and after a specified time measure the heart rate again. By comparing each individual to himself or herself, you determine what effect caffeine has on each person’s heart rate, and people’s different resting heart rates aren’t an issue.

See also:

Experimental Design in Statistics shows how the same experiment would work out with a completely randomized design, randomized blocks, and matched pairs.

Control Group and Placebo

Definition:

In an experiment, usually one of the treatments will be no treatment at all. The group that gets no treatment is called the control group.

But “no treatment at all” doesn’t mean just leaving the control group alone. They should be treated the same way as the other groups, except that they get zero of whatever the other groups are getting. If the treatment groups are getting injections, the control group must get injections too. Otherwise you’ve introduced a lurking variable: effects of just getting a needle stick, and in humans the knowledge that they’re not actually getting medicine.

Definition:

A placebo is a substance that has no medical activity but that the subjects of the experiment can’t tell from the real thing.

The placebo effect is well known. Sick people tend to get better if they feel like someone is looking after them. So if you gave your treatment group an injection but your control group no injection, you’d be putting them in a different psychological state. Instead, you inject your control group with salt water.

BTW: TheProfessorFunk has a fun three-minute YouTube video on the placebo effect (Keogh 2011). Thanks to Benjamin Kirk for drawing this to my attention.

You might think placebos would be unnecessary when experimenting on animals. But if you’ve ever had a pet, you know that some animals are stressed by getting an injection. If the control group didn’t get an injection, you’d have those differing stress levels as a lurking variable. So you administer a placebo. Example 26: Sometimes, for practical or ethical reasons, you have to get a little bit creative with a control group. Here’s an excellent example from Wheelan (2013, 238): Suppose a school district requires summer school for struggling students. The district would like to know whether the summer program has any long-term academic value. As usual, a simple comparison between students who attend summer school and those who do not would be worse than useless. The students who attend summer school are there because they are struggling. Even if the summer school program is highly effective, the participating students will probably still do worse in the long run than the students who were not required to take summer school. What we want to know is how the struggling students perform after taking summer school compared with how they would have done if they had not taken summer school. Yes, we could do some kind of controlled experiment in which struggling students are randomly selected to attend summer school or not, but that would involve denying the control group access to a program that we think would be helpful. Instead, the treatment and control groups are created by comparing those students who just barely fell below the threshold for summer school with those who just barely escaped it. Think about it: the [group of all] students who fail a midterm are appreciably different from [the group of all] students who do not fail a midterm. But students who get a 59 percent (a failing grade) are not appreciably different from those students who get a 60 percent (a passing grade).

Double Blind

Some people do better if they think they’re getting medicine, even if they’re not. To avoid this placebo effect, the standard technique is the double blind.

Definitions:

In a double-blind experiment, neither the test subjects nor those who administer the treatments know what treatment each subject is getting. In a single-blind experiment, the test subjects don’t know which treatment they’re getting, but the personnel who administer the treatments do know.

Okay, given that people’s thoughts influence whether they improve, a single blind makes sense. If you let someone know they’re not getting medicine in a trial, they’re less likely to improve. But why isn’t that enough? Why is a double blind necessary?

For one thing, there’s always the risk that a doctor or nurse might tell the subject, accidentally or on purpose. But beyond that, if you’re treating someone who has a terrible disease, you might treat them differently if they’re getting a placebo than if they’re getting real medicine, even if you don’t realize you’re doing it. Why take the risk of introducing another lurking variable? Better to use a double blind and just rule out the possibility.

You might wonder how it’s done in practice. In a drug trial, for instance, each test subject is assigned a code number, and the drug company then packages pills or vaccines with a subject’s code number on each. The doctors and nurses who administer the treatments just match the code number on the pill or vaccine to each subject’s code number. Of course all the pills or vaccines look alike, so the workers who have contact with the subjects don’t know who’s getting medicine and who’s getting a placebo. And what they don’t know, they can’t reveal.
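The code-number scheme just described could be sketched roughly like this. Everything specific here is an assumption made for illustration: the function name `make_blinding_key`, the four-digit codes, and the even drug/placebo split are not taken from any real trial protocol.

```python
import random

def make_blinding_key(n_subjects, seed=None):
    """Return a private key {code number: treatment} for the trial sponsor to hold."""
    rng = random.Random(seed)
    codes = rng.sample(range(1000, 10000), n_subjects)      # unique four-digit code numbers
    treatments = ["drug", "placebo"] * (n_subjects // 2) + ["drug"] * (n_subjects % 2)
    rng.shuffle(treatments)                                  # chance decides who gets what
    return dict(zip(codes, treatments))

key = make_blinding_key(10, seed=3)    # kept by the sponsor, never shown to the clinic

# The clinic staff and the subjects see only the code numbers on the packages.
clinic_labels = sorted(key)
print(clinic_labels)
```

The point of the design is the separation: the clinic's list carries no treatment information at all, so nobody in contact with the subjects can reveal it, even by accident.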

1F. Sharp Points

1F1. Rounding and Significant Digits

You’ll be dealing with numbers through most of this course. Handle them right, and you won’t get burned! There are three issues here: how many digits to round to, how to round to that number of digits, and when to do your rounding.

How Many Digits?

There are a lot of rules for how many digits you should round to, but we’re not going to be that rigorous in this course. Instead, you’ll use common sense supplemented by a few rules of thumb. What’s common sense? Avoid false precision, and avoid overly rough numbers.

BTW: The rules are important, but we have only so much time, and you’ve probably learned them in your science courses. If you want to, look up “significant figures” or “significant digits” in the index of pretty much any science textbook, or look at Significant Figures/Digits.

Example 27: When you fill your car’s gas tank, the pump shows the number of gallons to three decimal places. You can also describe that as the nearest thousandth of a gallon. How much gas is that? Convert it to teaspoons (Brown 2009):

(0.001 gal) × (4 qt/gal) × (4 cup/qt) × (16 Tbsp/cup) × (3 tsp/Tbsp) ≈ 0.8 tsp

You can bet there’s several times that much in the hose when the pump shuts off. Three decimal places at the gas pump is false precision a/k/a spurious accuracy. That third decimal place is just noise, statistical fluctuations without real significance.

On the other hand, suppose the pump showed only whole gallons. This is too rough. You can go along pumping gas for no extra charge (bad for the merchant), and then abruptly the cost jumps by several dollars (bad for you).

Here are some rules of thumb to supplement your common sense. These are not matters of right and wrong, but conventions to save thinking time:

Round averages and other statistics to one more decimal place than the original data. If you’ve surveyed families for the number of children in each household, you have a bunch of whole numbers, which have zero decimal places. Your average should have one decimal place.
Round probabilities to four decimal places unless you show them as exact fractions.
Round z-scores (Chapter 3 and later) and other test statistics to two decimal places.

How to Round Numbers

Round in one step. Say you have a number 1.24789, and you want to round it to one decimal place. Draw a line, mentally or with your pencil, at the spot where you want to round: 1.2|4789. If the first digit to the right of that line is a 0, 1, 2, 3, or 4, throw away everything to the right of the line. It doesn’t matter what digits come after that first digit. Here, the first digit to the right of the line is a 4, so you throw away everything to the right of the line: 1.24789 rounded to one decimal place is 1.2.

Rounding in multiple steps, 1.24789 → 1.2479 → 1.248 → 1.25 → 1.3, is wrong. (Why? Because 1.24789 is 0.05211 units away from 1.3, but only 0.04789 units away from 1.2.) You must round in one step only.

As you know, if the first digit to the right of the line is a 5, 6, 7, 8, or 9, you raise the digit to the left of the line by one and throw away everything to the right of the line. To one decimal place, 1.27489 is 1.2|7489 → 1.3.

You may need to “carry one”. What is 1.97842 to one decimal place? 1.9|7842 needs you to increase that 9 by one. That means it becomes a zero and you have to increase the next digit over: 1.9+0.1 = 2.0. Therefore, 1.97842 rounded to one decimal place is 2.0.

When to Round Numbers

Here’s the Big No-No: Never do further calculations with rounded numbers. What’s the right way? Round only after the last step in calculation.

Example 28: True story: In Europe, average body temperature for healthy people was determined to be 36.8°C, as repeated in A Critical Appraisal of 98.6°F (Mackowiak, Wasserman, Levine 1992). Rounding to the nearest degree, the average human body temperature is 37°C. So far so good.

But in the US, thermometers for home use are marked in degrees Fahrenheit. Some nimrod converted 37°C using the good old formula 1.8C+32 and got 98.6°F, and that’s what’s marked on millions of US thermometers as “normal” temperature. If you’ve got one of those, ask for your money back, because it’s wrong.

Why is it wrong? The person who did the conversion committed the Big No-No and did further calculations with a rounded number. For a correct calculation, use the unrounded number, 36.8. (Okay, 36.8 was probably rounded from 36.77 or 36.82 or something. But the point is that it’s the least rounded number available.) 1.8×36.8+32 = 98.24 → 98.2, and that is the average body temperature for healthy humans.
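If you’d like to check the rounding examples above on a computer, here is one possible Python sketch. The helper name `round_half_up` is made up for this illustration. It uses the standard `decimal` module with numbers passed as text, because Python’s built-in `round` works on binary floating point and rounds exact halves to the nearest even digit, which is not the “5 rounds up” rule described in this section.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, places):
    """Round in ONE step to the given number of decimal places, with 5 rounding up."""
    quantum = Decimal(1).scaleb(-places)        # places=1 gives 0.1, places=0 gives 1
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

# Rounding in one step versus rounding in several steps.
one_step = round_half_up("1.24789", 1)
stepwise = "1.24789"
for places in (4, 3, 2, 1):                     # 1.2479, then 1.248, then 1.25, then 1.3
    stepwise = round_half_up(stepwise, places)
print(one_step, stepwise)                       # 1.2 (right) and 1.3 (wrong)

# "Carrying one".
carried = round_half_up("1.97842", 1)
print(carried)                                  # 2.0

# The Big No-No: converting a rounded number instead of the least rounded one.
celsius = Decimal("36.8")
right = round_half_up(Decimal("1.8") * celsius + 32, 1)
wrong = round_half_up(Decimal("1.8") * round_half_up(celsius, 0) + 32, 1)
print(right, wrong)                             # 98.2 and 98.6
```

The last two lines reproduce the thermometer story: rounding 36.8 to 37 before converting is what turns 98.2 into 98.6.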

1F2. Powers of 10 from Your Calculator

When a calculation results in a number lower than about 0.0005, your calculator will usually present it in the dreaded scientific notation, for example 1.99E-4. Be alert for this! Your answer is not 1.99 (or however you want to round it). Your answer is 1.99×10⁻⁴.

How do you convert this to a decimal for reading by ordinary humans? (And yes, you should usually do that; definitely, if your work will be read by non-technical people.) The exponent (the number after the E minus) tells you how many zeroes the decimal starts with, including the zero before the decimal point. 1.99×10⁻⁴ is 0.000 199 or 0.0002.

When a decimal starts with a bunch of zeroes, especially if the decimal is long, many people use spaces to separate groups of three digits. This makes the decimal easier to read.
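To see the conversion above done mechanically, here is a small Python sketch. It assumes the calculator display can be typed in as the text `1.99E-4`; the variable names are made up for this illustration.

```python
from decimal import Decimal

display = "1.99E-4"                       # what the calculator shows (assumed form)

# Write the number out as an ordinary decimal, with no exponent.
plain = format(Decimal(display), "f")
print(plain)                              # 0.000199

# Count the zeroes the decimal starts with, including the one before the point.
digits = plain.replace(".", "")
leading_zeroes = len(digits) - len(digits.lstrip("0"))
print(leading_zeroes)                     # 4, the same as the exponent after "E-"

# Group the decimal digits in threes for readability, or round to four places.
decimals = plain.split(".")[1]
grouped = "0." + " ".join(decimals[i:i + 3] for i in range(0, len(decimals), 3))
rounded = f"{float(display):.4f}"
print(grouped, rounded)                   # 0.000 199 and 0.0002
```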

1F3. Show Your Work!

Don’t just write down answers; show your work. This is in your own best interest:

It helps you organize your thoughts, so that you’re less likely to make a mistake.
If your work is substantially correct, you may get partial credit even if your final answer was wrong.
Your instructor probably won’t give full credit for a bare answer with nothing to back it up. (My own practice is to write WTCF for “where’d this come from?”)

“But,” I hear you object, “in the real world, all that matters is getting the right answer.” True enough, but there’s a difference between being in the real world and preparing for the real world. Part of your study is to develop thought and work habits that ensure you will get the right answer when there’s nobody around to check you. You expose your process now, so that problems can be corrected.

How do you show your work? The general idea is to show enough that someone familiar with the course content can follow what you did.

When evaluating a formula, write down the formula, then on a line below show it with the numbers replacing the letters. Your calculator can handle very complicated formulas in one step, so your next line will be your last line, containing the final answer and any rounding you do. Example:

SEM = σ/√n
SEM = 160/√37
SEM = 26.30383797 → SEM = 26.3

You’ll be using a lot of the menus and commands on the TI-83 or TI-84. Here are some tips:

Show all command arguments. If you’re using randInt to get five random integers from 1 to 100, write down randInt(1,100,5). That’s the only way your instructor will know that you know how to use that function. If you think the command is randInt(5,100), now is the time to correct that misunderstanding.
Abbreviate repetitive information. When you put a column of numbers into list 1, you don’t have to write down all the numbers and say “L1”. Instead, just write “x’s in L1” (use the actual column description if it’s not “x’s”).
Focus on commands, not keystrokes. If you’re doing 1-VarStats L1,L2, write that. For pity’s sake, don’t write all the keystrokes, [STAT] [►] [1] [2nd] [L1] [,] [2nd] [L2] [ENTER]. I put them in this book because you’re just learning them. But someone familiar with the course knows how to get the command, and I hate to think of all the time and paper you could waste.
Show inputs first, then outputs. Many students show their answer, then as an afterthought they write down the command. Write down the command before you enter it in the calculator. When writing down the outputs, you can omit any that are the same as the inputs.
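For readers working on a computer rather than a TI-83/84, the SEM example from this section can be laid out the same way: formula first, then numbers, then the unrounded and rounded results. This is only a sketch; the variable names are made up, and the Python `randint` loop is offered as a rough stand-in for the calculator’s randInt(1,100,5), not as an exact equivalent.

```python
import math
import random

# Formula: SEM = sigma / sqrt(n)
sigma = 160                     # the 160 in the example (dividing by sqrt(n) gives the SEM)
n = 37                          # sample size in the example

sem = sigma / math.sqrt(n)      # substitute the numbers, compute in one step
print(f"SEM = {sem:.8f}")       # unrounded result, as the calculator would show it
print(f"SEM = {sem:.1f}")       # rounded only at the very end

# Rough analog of randInt(1,100,5): five random integers from 1 to 100, repeats allowed.
draws = [random.randint(1, 100) for _ in range(5)]
print(draws)
```

Writing the inputs (`sigma`, `n`) before the output mirrors the “show inputs first, then outputs” tip.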

1F4. Optional: ∑ Means Add ’em Up

You’ll find that your calculator does the complicated stuff for you, but here and there I’ve scattered formulas in BTW paragraphs in case you want to peek behind the curtain.

Stats formulas usually need to do the same thing to every member of a data set and then add up the results. The Greek letter ∑, a capital sigma, indicates this. This summation notation makes formulas easier to write, and easier to read too, once you get used to it. Some examples:

∑x = sum of all data points. (x means a data point. If you had to write this out the long way, it would be x₁ + x₂ + x₃ + … + xₙ, where n is the size of your data set.)
∑x² = square each data point and add up the squares. (∑ is an addition operator, so powers and multiplication happen before the summation.)
∑xf = multiply each unique data point by the number of times it occurs, and add up the results. (f means frequency or repetition count.)
∑x²f = square each unique data point and multiply by the number of times it occurs, then add up the results.
∑(x − x̄)² = take each data point and subtract the average of the whole sample, square the result, and add up all the squares. (x̄ is the average of a sample. The parentheses tell you that you don’t square the average, you square the differences.)
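Each of the sums listed above corresponds to a one-line loop. Here is a short Python sketch using a small made-up data set (the five numbers are purely illustrative, not from any study in this book):

```python
from collections import Counter

data = [2, 3, 3, 5, 7]                     # hypothetical data set, n = 5
n = len(data)

sum_x = sum(data)                          # sum of x
sum_x2 = sum(x**2 for x in data)           # sum of x^2: square first, then add

freq = Counter(data)                       # unique values with their frequencies f
sum_xf = sum(x * f for x, f in freq.items())        # sum of x*f
sum_x2f = sum(x**2 * f for x, f in freq.items())    # sum of x^2*f

x_bar = sum_x / n                          # sample average
sum_sq_dev = sum((x - x_bar)**2 for x in data)      # sum of (x - x_bar)^2

print(sum_x, sum_x2, sum_xf, sum_x2f, x_bar, sum_sq_dev)
```

Notice that ∑xf gives the same total as ∑x, and ∑x²f the same as ∑x². The frequency form is just a shortcut for data that arrives as a table of unique values and counts.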

What Have You Learned?

Key ideas:

Descriptive versus inferential statistics.
Sample versus population, and statistic versus parameter.
Variable type (same as data type): numeric versus non-numeric, and the two types of numeric data.
Simple random sample. (“Random” doesn’t mean what you think.)
Systematic sample, k and randInt.
Mistakes in designing or taking samples. (Your sample is suspect if some factor other than chance determines the members of the sample.)
Sampling error. (A more descriptive term is sample variability. It can’t be eliminated, but it can be managed.)
Sampling bias among other nonsampling errors. (“Bias” doesn’t mean what you think it does.)
Observation versus experiment; only an experiment lets you infer cause and effect.
Lurking variables: always be on the lookout for the possibility.
Randomization and matched pairs.
Mechanics: How and when to round numbers, reading scientific notation from your calculator, and showing your work.

Study aids:

TI-83/84 Cheat Sheet

Statistics Symbol Sheet
How to Read a Math Book
How to Work a Math Problem
How to Take a Math Test or Quiz




Exercises for Chapter 1

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution; you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1

Briefly distinguish sampling error from nonsampling error. Which one represents avoidable mistakes? The other type can’t be eliminated, but what can you do to reduce it?

2

A gynecologist wants to study pregnant women’s use of prenatal vitamins. One month, she randomly selects one of her first five patients. For the rest of that month, she records data on every fifth pregnant patient that she sees. (a) What type of sample is this? (b) Is it a good sample or a bad sample? Why? (c) Is the gynecologist performing an observational study or an experiment?

3

To test Gro-Mor plant food, investigators randomly divide 150 bulbs into three groups. They are planted in a greenhouse under identical conditions, except that one group gets no plant food, one group gets Gro-Mor, and one group gets Magi-Grow, a competitor’s product. The height of each plant is measured at the end of each week for 13 weeks. Identify the following: (a) Type of experimental design. (b) Factor(s). (c) Treatments or levels. (d) Response variable(s). (e) Experimental units. (f) Explanatory variable(s). (g) Which is the control group?

4

The National Census of Borgovia released the statement, “The average number of children in Borgovian families is 2.1.” (a) Identify the variable. (b) State the specific variable type. (c) Is the number 2.1 a statistic or a parameter?

5

You’re taste-testing your new formula for Whoopsie Cola against your old formula. You assemble a focus group of 80 people and give them each a small cup of each drink. (Half the group gets old cola, water, new cola; the other half gets new, water, old. Of course you don’t tell them what they’re getting.) 55 of the people in the focus group like the new formula better. (a) Describe the sample. (b) What is the sample size? (c) Describe the population. (d) What is the population size?

6

No sample can perfectly represent the population, so no two samples will be the same, even if your sampling technique is perfect. (a) What is the name for this variation? (b) What can be done to reduce this variation?

7

“Have you ever left an infant alone in the house while you went to the store?” Explain how response bias might operate with this question.

8

You want to survey attitudes of resident students toward the cafeteria food. (There are 2000 resident students, and about 1500 of them eat in the cafeteria on a given day. The dorms have two students per room.) How would you construct a random sample of size 50? a systematic sample? a cluster sample? Which of these is the best balance between statistical purity and practicality?

9

Two studies — Misinformation and the 2010 Election (Ramsay 2010) at the University of Maryland, and Some News Leaves People Knowing Less (Fairleigh Dickinson University 2011) have shown that Fox News viewers know less about the world than people who watch no news at all. Can you conclude that this is because they watch Fox News? Why or why not?

10

You’re conducting a survey to determine Tompkins County voters’ willingness to pay for expanded bus routes. You randomly select twenty bus trips on each day one week, and on each selected bus you or your associate hand a questionnaire to each person who gets on the bus. (a) What is the most serious problem with this survey technique? (b) What is the technical term for this type of mistake?

11

Sandy said, “I took a random sample by walking around the halls at lunch time and just asking random people to take my survey.” What is wrong with this statement? What type of sample did Sandy actually take?

12

“42% of my sample said that they have at least one device in the house that can stream video.” (a) What is the data type? (b) Is this an example of descriptive or inferential statistics? (c) Is the number 42% a statistic or a parameter?

13

You want to test the effectiveness of a new medication for a condition that was previously untreatable. You randomly select thirty doctors from state lists of licensed doctors, and all of them agree to help. Each doctor will put up notices in the waiting rooms, and will select the first 30 adult volunteers, assigning the first 15 to the experimental group and the second 15 to the control group. Patients will not be told which group they are in; you supply placebo pills that are identical in appearance to the active medication. Doctors will administer the placebo and medication to the selected groups and report results back to you. Identify three serious errors in this technique. Are these examples of sampling or nonsampling error?

14

Which is larger, 0.0004 or 2.145E-4? Explain.

15

You survey 87 randomly selected households and find a total of 163 children. Dividing, you announce that the average number of children is 163/87 = 1.87356. What’s wrong with that, and how do you fix it?

16

Identify the type of each variable as discrete, continuous, or non-numeric: (a) Telephone area code. (b) Volume of a soap bubble. (c) Number of times a comment gets retweeted. (d) Ownership of a dog. (e) Level of pain, from “none” to “unbearable”. (f) Level of pain, from 0 to 10.

17

Here are some statements summarizing data. (I made all of them up.) State the original question asked or measurement taken from each member of the sample, and identify each data type as discrete, continuous, or non-numeric. The first one is done for you as an example. (a) The average weight loss in rats sent to space was 3.4 g. Answer — Measurement: weight loss of each rat. Continuous. Now you answer these: (b) The average dinner check at my restaurant last Friday was $38.23. (c) 45% of patients taking Effluvium complained of bloating and stomach pain. (d) The average size of a party at my restaurant last Friday was 2.9 people.




What’s New

21 Feb 2016: Add randIntNoRep.
23 May 2015: Add a reference to Steve Simon’s Web Polls. Mention that alternating assignments are not as good as a randomized design.
14 Mar 2015: Tie the example of statistic and parameter more closely to the example of descriptive and inferential statistics, and add a summary chart.
1 Feb 2015: Add text to clarify the exercise about Borgovian families.
11 Jan 2015: Add links and study aids in What Have You Learned?
(intervening changes suppressed)
16 Jan 2013: New document. (The section on Data and Variables was previously published separately.)

2. Graphing Your Data

Updated 16 Dec 2015 (What’s New?)

Summary:

To make sense out of a mass of raw data, make a graph. Non-numeric data want a bar graph or pie chart; numeric data want a histogram or stemplot. Histograms and bar graphs can show frequency or relative frequency.

Contents:

2A. Graphing Non-Numeric Data
2A1. Bar Graph · Optional: Bar Graph in Excel · Bar Graph with Relative Frequencies · Optional: Relative Frequencies in Excel · Side-by-Side Bar Graph · Stacked Bar Graph
2A2. Making a Table from Scratch
2A3. Pie Chart · Optional: Pie Chart in Excel
2B. Graphing Numeric Data
2B1. Histogram for Numeric Data · Histogram Versus Bar Graph · Relative-Frequency Histogram · Optional: Histogram in Excel
2B2. Ungrouped Discrete Data · Optional: Ungrouped Discrete Histogram in Excel
2B3. Shapes of Data Sets
2B4. Stem Plot
2C. Bad Graphs
2D. Really Good Graphs
What Have You Learned?
Exercises for Chapter 2
What’s New

2A. Graphing Non-Numeric Data

Any graph of non-numeric data needs to show two things: the categories and the size of each. Probably you’re already familiar with the two most common types, which are the bar graph and pie chart.

The sizes of categories can be shown as raw counts, called frequencies, or percentages, called relative frequencies. (Relative frequencies can also be shown as decimals, but I think most people respond better to “20%” than “.20”.)

How do you decide whether to show frequencies or relative frequencies? This is a stylistic choice, not a matter of right and wrong. Your choice depends on what’s important, what point you’re trying to make. If your main concern is just with the individuals in your sample, go with frequencies. But if you want to show the relationship of the parts to the whole, show relative frequencies.
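To make the difference between frequencies and relative frequencies concrete, here is a tiny Python sketch. The pet survey and its counts are invented purely for illustration; the only rule used is that a relative frequency is the category’s count divided by the total count.

```python
# Hypothetical survey: "What kind of pet do you own?" (made-up counts)
frequencies = {"Dog": 12, "Cat": 9, "Fish": 4}

n = sum(frequencies.values())                        # sample size: total of all frequencies
relative = {category: f / n for category, f in frequencies.items()}

for category, f in frequencies.items():
    # Show the raw count next to the percentage of the whole.
    print(f"{category:5s} {f:3d} {relative[category]:.0%}")
```

The counts (12, 9, 4) describe these 25 individuals; the percentages (48%, 36%, 16%) describe how the parts relate to the whole.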

2A1. Bar Graph

Example 1: In fall 2012, the Pew Research Center (2013a) surveyed American adults on their habits of reading to their children. The survey included 434 adults who had at least one child under age 12, and the results are shown in the table.

How Often Parents Read to Children under Age 12 (n=434)

How Often              Number of Parents
Every day              217
A few times a week     113
About once a week      39
A few times a month    26
Less often             30
Never                  9

(Remember, you can’t call the data numeric just because you see numbers in a summary statement. You have to go back to the individual data points, which are categorical: “every day”, “a few times a week”, and so on. If the Pew Center had asked “how many days a week do you read to your child?” and got answers 0, 1, 2, 3, 4, 5, 6, and 7, that would be a set of numeric data.)

Your bar chart or bar graph must follow these rules:

The bars have equal width and equal spacing; they do not touch.
Each bar is labeled with its category below the axis. Typically for non-numeric data, there’s no One True Order for the categories. Try to find an order that feels natural. If you prefer, you can order the categories from the tallest to the shortest bar; that is called a Pareto chart.
The frequency or relative-frequency axis (usually the vertical axis) starts at 0, and you need to show the 0 label. That axis is a number line, so tick marks are equally spaced and represent consistent numbering. (The frequency 0 goes next to the horizontal axis. Don’t offset it downward.)
The height or length of each bar is proportional to the number or percentage of individuals in that category. You can write frequencies or percentages at the top of every bar, but this is optional because you’re labeling your frequency axis.
The frequency axis always needs a title. The category axis may or may not need a title, depending on whether the graph title and category names make the chart easy enough to understand.
Usually the category axis is horizontal, so the frequency axis and the bars are vertical. But you can also make a horizontal bar chart, where the category axis is vertical and the frequency axis and bars are horizontal. You can make a bar graph by hand, or use software such as Microsoft Excel. If you make a bar graph by hand, use graph paper and draw the axes and bars with a straightedge — wobbly bars make you look like you had a liquid lunch. Here’s my bar graph for parents reading to children:

A couple of comments on best practices:

Notice that I made one square on the vertical axis equal 10 people, or five squares equal 50. That way when I have numbers like 113 or 39 I know how high to draw my bars. If you pick three or four squares per 50 people, you have a much harder job to draw the bars at the correct heights because you have to figure things like “if 50 is 3 squares, then 113 must be 113/(50/3) = about 6.8 squares.” Always pick “nice” numbers for your numeric scales.
Notice also that I drew horizontal lines at the major milestones. These “gridlines” help the reader assess the heights of the bars more accurately.

Optional: Bar Graph in Excel

Getting some kind of bar graph out of Excel is easy. But then there’s a lot of fiddling around to reverse some of Excel’s rather strange format choices. Here are instructions for Excel 2010. If you have Excel 2007, 2013, or 2016, you’ll find that they’re pretty similar.

1. Get your categories into one column and your frequencies into the next column. The first row of each column should be the column headings from the table. Don’t enter a total row.
2. With your mouse, highlight all rows and columns of your chart. (It doesn’t matter whether you include the column heads.) Click the Insert tab and then Charts » Column, and select the first 2-D column chart.

3. Right-click the useless legend at the right, “Series1” or “Number of Parents”, and select Delete.
4. When you right-clicked the legend, three Chart Tools tabs appeared. On the Layout tab of the ribbon click Chart Title » Above Chart. Click into the chart title and type a better one. (Maybe Excel already gave your chart a title, but “Number of Parents” is the proper title of the frequency axis, not the whole chart.)
5. Click Axis Titles » Primary Vertical Axis » Rotated Title. Click on the words “Axis Title” that appear in the chart, and type the new title “Number of Parents” for your frequency axis.
6. If your category axis needs a title, click Axis Titles » Primary Horizontal Axis » Title Below Axis and enter the axis title.
7. For some reason, the chart has tick marks between the categories. Right-click one of them, select Format Axis, and change Major tick mark type to None. That gives the chart you see here.

8. You may have to tweak the formatting of the graph further; here are some suggestions. (If you try something and don’t like the result, press Ctrl-Z to undo the change.) If the category names are long, try shifting them to vertical alignment: right-click on any of them and select Format Axis » Alignment. In Text direction, select Rotate 90°. You may need to resize the whole graph to improve spacing or to make the bars’ heights show better contrast. Look carefully at the frame and you’ll see handles in each corner and the middle of each side. To resize the graph, drag any of the handles. To change fonts of the axis labels or titles, click the element, then click the Home tab in the ribbon. Change font or font size as desired. If you prefer a horizontal bar chart, it’s easy to make the change. Click into the chart area, then on the Design tab on the ribbon click Change Chart Type » Bar and select the first one. Okay, well, nothing is that easy! Excel puts the categories in backwards order, so right-click the category axis and select Format Axis » Axis Options » Categories in reverse order. Still on the Axis Options dialog, click Horizontal axis crosses at maximum category.

Bar Graph with Relative Frequencies

The frequency bar graph tells us about the 434 individuals in the Pew Research Center’s sample. But why collect that sample except for what it can tell us about how often parents in general read to their children? You know from Sampling Error in Chapter 1 that the proportions in the population are probably not the same as the sample, but probably not very far off either. So you compute those proportions and then redraw your graph to show percentages instead of raw counts.

First, total all the frequencies to get the sample size n = 434. (In this case n is given already, but often it isn’t.) Then convert each frequency into a relative frequency. The formula, if you need one, is f/n. For example, 9 parents never read to their under-12 children. The relative frequency is f/n = 9/434 = 0.021 or 2%: 2% of parents never read to their children. Enter that and the other relative frequencies in the table, as shown below.

How Often Parents Read to Children under Age 12 (n=434)
How Often              Number of Parents   Rel. Freq.
Every day              217                 50%
A few times a week     113                 26%
About once a week      39                  9%
A few times a month    26                  6%
Less often             30                  7%
Never                  9                   2%

You may see some bar graphs with relative frequencies as decimals. There’s nothing wrong with that for technical audiences, but general audiences usually respond better to percentages. Your relative frequencies may not add up to exactly 100% (or 1.0000), because of rounding. Don’t change any of the numbers to force a total.

Once you have your relative frequencies, you can make your bar graph. Choose round numbers for the tick marks on your relative frequency axis, for example every 5% or every 10%. I won’t inflict another of my sketches on you, but you can see a finished relative-frequency bar graph below.

Optional: Relative Frequencies in Excel

To my surprise, I found that Excel doesn’t include relative-frequency bar graphs in its repertoire.
You have to enter some formulas to compute the relative frequencies, and then create the graph from them. (Of course you could compute the relative frequencies yourself and enter them in Excel as numbers, but whenever possible I like to be lazy and make the computer do the work.) 1. Enter the categories in a column, leave a blank column, and enter the frequencies. If you already have the categories and frequencies in adjacent columns, right-click on the letter at the top of the frequency column and select Insert. 2. Click into the cell below the last frequency, and type “=sum(” (without the quotes). Then with your mouse select the frequencies. Finally, type a closing parenthesis and hit the Enter key. 3. In the address box just above the first column of the spreadsheet, type a unique name such as TOTPARENTS and press the Enter key. This makes it easier to refer to this total cell.

4. Click into the empty relative-frequency cell for the first category. Type an = sign, then click on the first frequency cell. (In the illustration, the relative-frequency cells are in column B and the frequency cells are in column C.) Type /TOTPARENTS (including the / mark for division) and press the Enter key.
5. Grab the “handle” at the lower right of the cell you just typed into, and drag it down to fill the Relative Frequency column.
6. Click the % sign in the ribbon to change the decimals to percentages. (The % sign is near the middle of the ribbon, on the Home tab.)

Now highlight the category and relative-frequency columns, click the Insert tab and the first 2-D column chart, and tweak the graph as you did before. Your result should be something like the one you see here.

On this chart, neither axis really needs a label. The percent signs reinforce the message in the chart title that the bars show relative frequencies. And the category names together with the chart title tell the reader exactly what is being represented. It’s a judgment call where to place tick marks on the relative-frequency axis, and you really need to look at the data to make a decision. Four categories are under 10%, so it makes sense to show the 5% line and help the reader get a sense of the relative sizes. Of course, if you show 5% then you have to show every 5% increment up to the top of the graph.

Side-by-Side Bar Graph

You may want to compare two populations: men and women, for instance, or one year versus another year. To do this, a side-by-side bar graph is ideal. A side-by-side bar graph has two bars for each category, and a legend shows the meaning of the bars. The two populations you’re comparing are almost never the same size. Therefore side-by-side graphs almost always show relative frequencies rather than frequencies.

Example 2: In Educational Attainment, the Census Bureau (2014) showed the educational attainment of the population in selected years 1940 to 2012.
I chose the years 1992 and 2012 and prepared this graph to show the change over that 20-year period.

What do you see? Comparing 2012 to 1992, the proportion of the population with no college (the first four categories) declined, and the proportion with some college or a college degree increased. You should be able to see why this has to be a relative-frequency chart: in a frequency chart, the larger population in 2012 would make all the bars taller than the 1992 bars, and you’d be hard put to see any kind of trend.

Stacked Bar Graph Example 3: Another way to compare two populations is the stacked bar graph. In the side-by-side bar graph, above, each group of bars was one category, and each bar within a group was a population. With the stacked bar graph, you have one bar for each population, and one piece of that bar for each category. (A stacked bar graph is kind of like an unrolled pie chart.) Here’s a stacked bar graph for the same data set:

What do you see? Look first at the legend that lists the categories, then at the two bars. The top two segments represent some college. In 1992, about 56% of adults had no education beyond high school. But in 2012, only about 42% had a high-school diploma or less, meaning that 58% had at least some college. The proportions of college and no college were reversed in those 20 years.

You can also see that, though the group with four years of high school shrank, it didn’t shrink as much as the group with college grew. In other words, it’s not just more high-school graduates going on to college, it’s a higher proportion of the population entering high school. All the categories without a high-school diploma shrank. In 1992, 20% of adults had less than a high-school diploma and 80% were high-school graduates; in 2012, only about 12% had less than a high-school diploma and 88% had graduated from high school.

What’s the best way to compare two populations? The answer depends on what you’re trying to show. The side-by-side graph seems to be better at showing how each category changed, and the stacked graph is usually better at showing the mix, especially if you want to group the categories mentally. In the side-by-side graph, you can easily see the decline in adults with a fourth-grade education or less, but the shift to a college-educated population is much harder to see. It’s just the opposite with the stacked graph. As always, get clear in your own mind what you’re trying to show, and then select the type of graph that shows that most clearly.

Did you notice that this stacked bar graph shows relative frequencies? (Maybe you didn’t notice, because it seems like the natural way to go.) A stacked bar graph could show frequencies instead of relative frequencies, if you want to emphasize the different sizes of the populations, but then it becomes harder to compare the mix in the populations.
BTW: When you make a stacked bar graph in Excel, there’s no need to pre-compute the percentages. Just select the third type of 2-D column chart, 100% Stacked Column.
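If you'd rather let a program do the f/n arithmetic from this section, here is a minimal Python sketch. The counts are the ones from the parents-reading table; the function name `relative_frequencies` is just my own label for illustration, not anything from Excel or the book.

```python
# Frequencies from the table of how often parents read to children (n = 434).
frequencies = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}


def relative_frequencies(freqs):
    """Return {category: f/n}, where n is the total of all the frequencies."""
    n = sum(freqs.values())
    return {category: f / n for category, f in freqs.items()}


rel = relative_frequencies(frequencies)
for category, share in rel.items():
    # Round only for display; the stored values stay unrounded.
    print(f"{category:<20} {frequencies[category]:>4} {share:>5.0%}")
```

Rounding happens only when printing, which matches the advice above: don't adjust any of the numbers to force the percentages to total exactly 100%.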

2A2. Making a Table from Scratch

Example 4: In the first example, you were given a table of categories and counts. But more likely you’ll just have a mass of data points, like this:

Children’s Favorite Beach Toys
shovel      dump truck  shovel      bucket      shovel      ball
ball        bucket      sifter      ball        shovel      shovel
dump truck  ball        shovel      shovel      bucket      net
sifter      shovel      bucket      dump truck  bucket      shovel
ball        shovel      ball        bucket      net         ball

Before you can make any kind of graph, you need a table to summarize the data. You’re probably tempted to count the number of shovels, the number of balls, and so on, but it’s way too easy to make mistakes that way. Why? Because you have to go over the data set multiple times, and you may count something twice or miss something. The better procedure is to tally the categories in a table. It’s a win-win: the procedure is faster, and you’re less likely to make a mistake.

Simply go through the data, one item at a time. If you’re seeing a given category for the first time, add it to your list with a tally mark; if that category is already in your table, just add a tally mark. Here’s my table of tallies after going through the first two columns of data:

Toy         Tallies
shovel      |||
ball        |||
dump truck  ||
sifter      |
bucket      |

Please complete your tallies on your own before you look at mine.

After you’ve tallied all the data, count the tallies in each category and total the counts. Of course the total should equal your sample size n. Here’s my complete table:

Toy         Tallies     Frequency
shovel      |||| ||||   10
ball        |||| ||     7
dump truck  |||         3
sifter      ||          2
bucket      |||| |      6
net         ||          2
Total                   30

Always check the total of your frequencies. If it matches the sample size, that’s no guarantee everything is correct; but if it doesn’t match, you know something is wrong. Once you’ve got your table, you can make a graph by following the procedures above. If you’re publishing the table itself, give just the category names and sizes and the total, but leave out the tallies.
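The tally procedure in this section (one pass through the data, then check the total against n) can be sketched in Python. This is only an illustration: it uses the standard library's `collections.Counter`, and the list below is the 30 beach-toy responses from Example 4.

```python
from collections import Counter

# The 30 responses from Example 4 (order doesn't affect the counts).
toys = [
    "shovel", "dump truck", "shovel", "bucket", "shovel",
    "ball", "bucket", "sifter", "ball", "shovel",
    "dump truck", "ball", "shovel", "shovel", "bucket",
    "sifter", "shovel", "bucket", "dump truck", "bucket",
    "ball", "shovel", "ball", "bucket", "net",
    "ball", "shovel", "net", "shovel", "ball",
]

# One pass through the data: each item adds one "tally mark" to its category.
tally = Counter(toys)

for toy, count in tally.items():
    print(f"{toy:<12} {count}")

# Always check the total: a mismatch means something went wrong.
total = sum(tally.values())
print(f"{'Total':<12} {total}")
if total != len(toys):
    raise ValueError("frequencies do not add up to the sample size")
```

As with the hand tally, a matching total is no guarantee that every count is right, but a mismatch tells you for sure that something is wrong.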

2A3. Pie Chart Where a bar graph tends to emphasize the sizes of categories in relation to each other, a pie chart tends to emphasize the categories as divisions of the whole. This distinction is not hard and fast; it’s just a matter of emphasis. To make a pie chart, you need a compass, or something else that can draw a circle, and you need a protractor. The angle of each segment of the pie will be 360°×f/n, where f is the frequency of the category and n is the sample size — in other words, it’s 360° times the relative frequency, whether you’re showing frequencies or relative frequencies on the pie chart. But in practice, if you’re going to make a pie chart you’ll use Excel or some other software. Optional: Pie Chart in Excel Excel can draw a pie chart for you, but you have to make a bunch of tweaks before it’s usable. There’s one bit of good news: with a pie chart, unlike a bar graph, Excel can compute relative frequencies automatically. I’ll show you how to do that for the data about parents reading to children, for which we made a bar graph earlier. 1. Highlight the categories and frequencies, but not the total. Click the Insert tab and then Pie, and choose the first 2-D pie. You see the result at right. Many people stop there, but this is an absolutely horrible design. Readers have to keep looking back and forth to match up the colors, and often there are similar colors. Color-blind people are really screwed, and if you print the chart on a black-and-white printer it’s hopeless. Fortunately you can fix this! 2. You’re going to put the category names with the pie segments, so right-click the legend (the list of categories at the right) and select Delete. 3. Click on the “Number of Parents” title and type in a better one, such as “How Often Parents Read to Children”. (Don’t type the quotes in the title, of course.) 4. In the ribbon, on the Layout tab, click Data Labels » More Data Label Options. 
Under Label Contains, select Category Name, select either Value or Percentage, and select Show Leader Lines. Under Label Position, select Best Fit. Click Close. 5. You may want to resize the graph to make the labels less crowded, depending on the sizes of the segments. Drag a handle with your mouse, as you did before.
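If you ever do draw a pie chart by hand, the angle formula from the start of this section, 360°×f/n, is easy to compute. Here is a small Python sketch using the parents-reading frequencies; the function name `pie_angles` is a made-up name for illustration.

```python
# Frequencies from the parents-reading table (n = 434).
frequencies = {
    "Every day": 217,
    "A few times a week": 113,
    "About once a week": 39,
    "A few times a month": 26,
    "Less often": 30,
    "Never": 9,
}


def pie_angles(freqs):
    """Return {category: angle in degrees}, using 360 * f / n for each segment."""
    n = sum(freqs.values())
    return {category: 360 * f / n for category, f in freqs.items()}


angles = pie_angles(frequencies)
for category, angle in angles.items():
    print(f"{category:<20} {angle:6.1f} degrees")
```

Half the sample answered "Every day", so that segment takes up half the circle; the angles for all the segments together come to a full 360°.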

2B. Graphing Numeric Data Summary:

For numeric data, you want to show four things: the shape, center, and spread of the distribution plus any outliers. The histogram is the standard way to do this, and it can show frequencies or relative frequencies. Usually you’ll group the data into classes, but when you have discrete data without too many different values you can make an ungrouped histogram. For a discrete data set with a moderate number of values and a moderate range, a stemplot is an alternative. With a stemplot, it doesn’t matter how many different data values there are, but the number of data points matters.

2B1. Histogram for Numeric Data How can you draw a picture of numeric data? The answer is a histogram. BTW: The term “histogram” was coined by Karl Pearson in lectures some time before 1895.

Example 5: Let’s use the lengths of some randomly selected iTunes songs:

Lengths of iTunes Songs (seconds)
113  282  179  594  213
319  245  323  334  526
395  440  477  240  296
428  407  230  294  152
242  837  246  135  412
223  275  409  114  604
170  239  138  505  316
369  298  168  269  398
433  212  367  255  218
283  179  374  204  227

How do you make sense of this? As you might expect, the first step is to make a table. But you don’t want to treat each number as its own category, because that would produce a really uninteresting graph. Instead you create categories, except for numeric data you call them classes. The rules for classes are very simple: The classes must cover all the data points. They must all be the same width. There must be no gaps between classes.

Notice that the rules don’t tell you how many classes there must be, or what width a class must have. That’s where your discretion comes in. You want to pick class boundaries that are “nice” numbers, and you don’t want too many classes or too few. In practice, five to nine classes is usually about the right number.

How does that apply to the iTunes songs? Take a look at the data. The lowest number seems to be 113, and the highest is 837. That gives a range in “nice” numbers of about 100–850. If you set class width to 100 you have eight classes, so that seems about right. Now go ahead and make your tally marks to create the table. Instead of category names, you use class boundaries. You already know how to make tally marks, so I’ll just give you the results:

Lengths of iTunes Songs (seconds)
Class Boundaries   Tallies               Frequency
100–199            |||| ||||             9
200–299            |||| |||| |||| ||||   20
300–399            |||| ||||             9
400–499            |||| ||               7
500–599            |||                   3
600–699            |                     1
700–799                                  0
800–899            |                     1

Even though the 700–799 class has no data points, it’s still a class and it will occupy the same width in the histogram as any other class.
A bar with zero height shows in the histogram as a gap, and that’s good because it emphasizes that there’s something unusual about the point in the 800–899 class (which was 837 seconds). If the class width is 100, how come the class bounds are 100–199 and not 100–200? In fact, some authors do write these class bounds as 100–200, 200–300, and so on, with the understanding that if a number is right on the boundary it goes in the upper class. All authors agree that the class width is the difference between the lower bounds of two consecutive classes, not the difference between lower and upper bounds of one class. So whether you write 100–199 for the first class or 100–200, the class width is 200 minus 100, which is 100. Once you have the table, the histogram is straightforward. You can draw the histogram by hand or use Excel. I’ll show Excel later, but here’s my hand-made histogram for the iTunes data. Notice that you label the data bars on their edges: 100, 200, …, 900, not 100–199, 200–299, …. Label the left edge of each bar, and also the right edge of the last bar. The right edge of the last bar is always one class width more than the left edge, so even if you’ve got 800–899 in your table the last bar’s edges are 800 and 900. Like all histograms, this one is good at showing the shape of the data (skewed right; see below), the center (somewhere in the upper 200s to 300s), and the spread (from 100ish to 800ish seconds, or about two minutes to 13 minutes). In Chapter 3 you’ll learn how to measure center and spread numerically, but there’s always a place for a picture to help people grasp a data set as a whole. This data set also shows an outlier, located somewhere in the 800–899 class. Not every data set will have an outlier, of course, and a rare sample might have more than one. When an outlier occurs, your first move is to go back to your original data sheets and make sure that it’s not simply a mistake in entering your data. 
If it’s a real data point, then you can ask what it means. In this case, the message is pretty simple: tunes generally run up to about 11 or 12 minutes (700 seconds), but the occasional one can be several minutes longer.

Histogram Versus Bar Graph

A histogram is similar to a bar graph, but with the following differences:

                         Histogram                     Bar Graph
Data type                Numeric (grouped)             Discrete ungrouped†           Non-numeric
Order of categories      Numeric order, left to right  Numeric order, left to right  Any order you choose
Do the bars touch?       Yes                           No, they’re spaced            No, they’re spaced
Where are they labeled?  Below the edges               Below the centers             Below the centers

†Some authors treat ungrouped discrete data as numeric and make a histogram. Others, including this book, treat ungrouped discrete data as categories and make a bar graph.

For both histogram and bar graph, the frequencies must start at 0. However, in a histogram the data axis typically doesn’t start at zero. You just leave some space between the frequency axis and the first bar, and the scale of the data axis is considered to start at the first bar.

Relative-Frequency Histogram

Though I don’t show it here, you could make a relative-frequency histogram, the same way you made a relative-frequency bar chart. The relative frequencies range from 0 for the 700–799 class to 20/50 = 40% for the 200–299 class.

Optional: Histogram in Excel

Believe it or not, out of all the chart types in Excel, the standard histogram is not included. To make one, you have to combine a column chart and a scatterplot (Middleton), or download additional software. You can follow the detailed instructions in that document, or you can download the free Better Histogram add-in from TreePlan Software to do the job. (It works in Excel 2007 through 2016.) If you’re using Better Histogram:

1. Enter all the original numbers in a column in Excel.
2. Double-click the downloaded ZIP file, and within it double-click Better-Histogram-2007. You will have to enable macros.
3. Click the Add-Ins tab in the ribbon, and then Better Histogram. Data Range: Click the “_” button at right and highlight your numbers. Start Value: The lower bound of your first class, not the lowest number in the data. Step Value: Your class width. Stop Value: The right-hand edge of the last class, which in this example is 900 (not 899).
4. Better Histogram will create a new sheet in your workbook with a frequency table and histogram. Click on the chart title and enter a new title. Click on the horizontal axis title and either delete it or change it to more appropriate text. The result is shown at right.

5. Optional: You might wish to jazz up the chart visually. If so, click on the Design tab of Excel’s ribbon and choose a design. Color is fine, but don’t choose different colors for different bars because that can make bars look larger or smaller than they actually are. Here’s what I got from clicking the blue theme.
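The grouping into classes described earlier in this section can also be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `group_into_classes` is made up, and its `start`, `width`, and `stop` parameters simply mirror the start value, class width, and right-hand edge of the last class discussed above. The data are the 50 iTunes song lengths from Example 5.

```python
# Lengths of the 50 iTunes songs from Example 5, in seconds.
song_lengths = [
    113, 282, 179, 594, 213, 319, 245, 323, 334, 526,
    395, 440, 477, 240, 296, 428, 407, 230, 294, 152,
    242, 837, 246, 135, 412, 223, 275, 409, 114, 604,
    170, 239, 138, 505, 316, 369, 298, 168, 269, 398,
    433, 212, 367, 255, 218, 283, 179, 374, 204, 227,
]


def group_into_classes(data, start, width, stop):
    """Count data points per class; a value on a boundary goes in the upper class.

    Every class from start up to stop gets an entry, even if its count is 0,
    so equal-width classes with no gaps are guaranteed.
    """
    counts = {lower: 0 for lower in range(start, stop, width)}
    for x in data:
        if not start <= x < stop:
            raise ValueError(f"{x} is outside the classes; they must cover all the data")
        lower = start + ((x - start) // width) * width
        counts[lower] += 1
    return counts


width = 100
classes = group_into_classes(song_lengths, start=100, width=width, stop=900)
for lower, frequency in classes.items():
    print(f"{lower}-{lower + width - 1}: {frequency}")
```

The empty 700–799 class stays in the result with a frequency of 0, which is what produces the gap in the histogram.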

2B2. Ungrouped Discrete Data To make sense of most data sets, you need to group the data into classes. But sometimes your data have only a few different values. In such cases, you probably want to skip the grouping and just have one histogram bar for each different response. The height of the bar tells you how often that response occurred, as usual. Example 6: A state park collected data on the number of adults in each vehicle that entered the park in a given time interval: 3 1 1 3 3 3 0 7 3 1 3 6 4 5 3 2 3 4 2 3 0 2 2 4 8 3 3 1 3 3 3 4 1 5 2 2 6 3 4 2 There are only nine different values, so it seems a little silly to group them. Instead, just tally the occurrences, as shown at right.

Number of Adults in Vehicles Entering Park
Adults   Tallies          Frequency
0        ||               2
1        ||||             5
2        |||| ||          7
3        |||| |||| ||||   15
4        ||||             5
5        ||               2
6        ||               2
7        |                1
8        |                1
Total                     40

Label ungrouped data under the centers of the bars, just like categorical data, not under the edges. Some authors still make the bars touch because the data are numeric, and others keep the bars separated because the data are ungrouped. I prefer the second approach, but I’ll accept the other. Here’s my histogram:

Caution: This particular data set has at least one occurrence of every value between min and max. But suppose it didn’t; suppose there were no vehicles with 7 adults? In that case, you would draw the histogram exactly the same, except that the bar above “7” would have zero height. The horizontal axis for numeric data must always have a consistent scale for its whole length, so you never close up any gaps. Optional: Ungrouped Discrete Histogram in Excel You can graph ungrouped discrete data in Excel, if you wish. The key is to fool Excel into treating the data like categorical data: 1. Type the unique values in one column. But as you type each number, type an apostrophe (') first. Don’t put 0, 1, 2 and so on in the cells, but '0, '1, '2. The apostrophe won’t appear, but it tells Excel to treat the numbers like text. (You may notice that Excel left justifies those numbers.) 2. Type the frequencies in a second column. 3. Highlight the numbers in both columns, and on the Insert tab click Column. Select the first 2-D column.

4. Make all the same adjustments you made for the bar graph, above. By the way, you might notice that the tick marks on the vertical axis are every two cars on this graph, but they were every five cars on my hand-drawn histogram. One is not better than the other; it’s a stylistic choice.

5. Optional: If you want to make the bars touch, right-click on the graph, select Format Data Series, and under Series Options change Gap Width to 0%. Then click Border Color and select Solid Line with a color of white.
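The caution in this section, that a value with no occurrences still gets a zero-height bar, is easy to get wrong when a program does the counting. Here is a minimal Python sketch; the function name `ungrouped_table` is my own invention, and the data are the 40 vehicle counts from Example 6.

```python
from collections import Counter

# Number of adults in each of the 40 vehicles from Example 6.
adults = [
    3, 1, 1, 3, 3, 3, 0, 7, 3, 1,
    3, 6, 4, 5, 3, 2, 3, 4, 2, 3,
    0, 2, 2, 4, 8, 3, 3, 1, 3, 3,
    3, 4, 1, 5, 2, 2, 6, 3, 4, 2,
]


def ungrouped_table(data):
    """Return {value: frequency} for every integer from min to max, zeros included."""
    counts = Counter(data)
    # A Counter gives 0 for values that never occurred, so gaps are kept, not closed up.
    return {value: counts[value] for value in range(min(data), max(data) + 1)}


table = ungrouped_table(adults)
for value, frequency in table.items():
    print(f"{value}: {frequency}")
```

If no vehicle had carried 7 adults, the table would still contain an entry for 7 with frequency 0, so the horizontal axis keeps a consistent scale.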

2B3. Shapes of Data Sets You should know the names of the most common shapes of numeric data. Why? It’s easier to talk about data that way, and — as you’ll see in the next chapter — you treat different-shaped distributions a little differently. The first question is whether the data set is symmetric or skewed. The histogram of a symmetric data set would look pretty much the same in a mirror; a skewed data set’s histogram would look quite different in a mirror. If a distribution is skewed, you say whether it’s skewed left or skewed right. A distribution that is skewed left, like the first one below, has mostly high scores, and a distribution that is skewed right, like the second one below, has mostly low scores. The direction of skew is away from the bulk of the data, toward the long skinny tail, where there are few data points.

Skewed left or negatively skewed

Skewed right or positively skewed

Example 7: Scores on a really easy test would be skewed left: most people get high scores, but a few get low or very low scores. Lifespan in developed countries is skewed left: there are relatively few infant and child deaths, and most people live into their 60s, 70s, or 80s. (The first graph in Calculus Applied to Probability and Statistics [Waner and Costenoble 1996] illustrates this.) People’s own evaluations of their driving skills and safety are left skewed: few people rate themselves below average and most rate themselves above average. Illusory Superiority cites a study by Svenson showing this “Lake Wobegon effect”.

Example 8: People’s departure times after a concert would be skewed right: most people leave shortly before or after the performers finish, but a few straggle out for some time afterward. Skewed-right distributions are more common than skewed-left distributions. Salaries at almost any corporation are another good example of a distribution that is skewed right: most people make a modest wage, but a few top people make much more.

There are several types of symmetric distributions, but here are the two you’ll meet most often. A uniform distribution is one where all possible values are equally likely to occur. The normal distribution has a precise definition, which you’ll meet in Chapter 7, but for now it’s enough to say that it’s the famous bell curve, with the middle values occurring the most often and the extreme values occurring much less often. You’ll notice that both of the examples below are “bumpy”. That’s usual. In real life you pretty much never meet an exact match for any distribution, because there are always lurking variables, measurement errors, and so on. And even if a population does perfectly follow a given distribution, like the probability distributions you’ll meet in Chapter 6, still a sample doesn’t perfectly reflect the population it came from: sampling error is always with us.
When we say that a data set follows such-and-such a distribution, we mean it’s a close match, not a perfect match.

Uniform

Normal (“bell curve”)

Example 9: Winning lottery numbers are uniformly distributed. (In the short term some numbers occur more often than others, but over the long run they tend to even out.) The results of rolling one die many times are uniformly distributed. (But the results of rolling two dice are not uniformly distributed: 7 is the most likely, 2 and 12 are tied for least likely, and the other numbers are intermediate.) The normal distribution or bell curve occurs very often, and in fact many natural and industrial processes produce normal distributions. This happens so often that we often just say or write ND for “normal distribution” or “normally distributed”. Example 10: Men’s and women’s heights follow separate normal distributions. People’s arrival times at an event are ND. IQ scores, and scores on most tests, are ND. The amount of soda in two-liter bottles is ND. Your commute times on a given route are ND.
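The remark in Example 9 about two dice is easy to check by listing all 36 equally likely outcomes. This Python sketch is just that enumeration, using only the standard library; it is an illustration, not a method from the book.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every ordered pair (first die, second die): 6 * 6 = 36 equally likely outcomes.
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability of each total = (number of ways to roll it) / 36.
probabilities = {total: Fraction(ways, 36) for total, ways in sorted(totals.items())}

for total, ways in sorted(totals.items()):
    print(f"{total:>2}: {ways}/36 {'#' * ways}")
```

The row of # marks is a crude sideways bar graph: it peaks at 7 and tapers off evenly toward 2 and 12, so the distribution is symmetric but not uniform.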

2B4. Stem Plot Suppose you have a discrete data set with few repetitions. An ungrouped histogram would have most bars at the same low height; a grouped histogram might show a pattern but you’d lose the individual data points. If your discrete data set isn’t too large (n < 100, give or take), and the range isn’t too great, you can eat your cake and have it too. The stemplot, also known as a stem-and-leaf diagram, is a mutant hybrid between a histogram and a simple list of data. The idea is that you take all the digits of each data point except the last digit and call that the stem; the last digit is the leaf. For example, consider scores of 113 and 117. They are two leaves 3 and 7 on a common stem 11 (meaning 110). To construct a stemplot, you look over your data set for the minimum and maximum, then write the stems in a column, from lowest to highest. Just like with a histogram, there are no gaps, so if you have data in the 50s and the 70s but not in the 60s you still need a stem of 6. However, your stems probably won’t start at 0. Start them with the lowest data point that actually occurred, and end them with the highest data point that actually occurred. BTW: The stemplot was invented by John Tukey in 1970.

Example 11: Here is a set of IQ scores from 50 randomly-selected tenth graders: 99 77 83 111 141 89 98 84 93 124 110 73 96 60 102 87 123 120 100 95 100 90 104 85 129 81 119 112 103 76 108 91 94 114 108 92 96 94 88 101 117 106 103 105 113 97 106 109 80 116

To make your stemplot, eyeball the data for the minimum and maximum, which are 60 and 141. Write the stems, 6 to 14, in a column at the left of your paper, starting several lines below the top. Then draw a vertical line just to the right of them. Now go through the data points, one by one, and add each leaf to the proper stem. During this process, you might find a value outside what you thought were the min and max. That’s no problem. Just add the stem and then the leaf. (Again, the stems can’t have gaps, so if your first stem is 6 and you come across a data point 47, you have to add stems 4 and 5, not just 4.) Finally, add a title and a legend or key to your stemplot. Here is the result:

IQ Scores
 6 | 0
 7 | 7 3 6
 8 | 3 9 4 7 5 1 8 0
 9 | 9 8 3 6 5 0 1 4 2 6 4 7
10 | 2 0 0 4 3 8 8 1 6 3 5 6 9
11 | 1 0 9 2 4 7 3 6
12 | 4 3 0 9
13 |
14 | 1
key: 11 | 7 = 117

If you lie down and look at this sideways, it looks like a histogram. But the bonus is that you can still see all the actual data points within the groupings of 60–69, 70–79, etc. A stemplot is great at showing shape, center and spread of distributions plus outliers, but most data sets don’t lend themselves to a stemplot. If your data set is too large, your leaves will run off the edge of the page. If your data set is too sparse (if the range is large for the number of data points), most of your stems won’t have leaves and the plot won’t really show any patterns in the data. But when you have a moderate-sized data set and the data range is moderate, a stemplot is probably better than a histogram because the stemplot gives more information.

One last touch is sorting the leaves. I don’t think that’s important enough to take the extra effort in a homework problem or on a quiz, but if you’re going to be presenting your stemplot to other people then you probably want to sort the leaves. Here’s the same stemplot with sorted leaves:

IQ Scores
 6 | 0
 7 | 3 6 7
 8 | 0 1 3 4 5 7 8 9
 9 | 0 1 2 3 4 4 5 6 6 7 8 9
10 | 0 0 1 2 3 3 4 5 6 6 8 8 9
11 | 0 1 2 3 4 6 7 9
12 | 0 3 4 9
13 |
14 | 1
key: 11 | 7 = 117

A glance at this stemplot shows you quite a lot. The data set is normally distributed, the center is around 100 points, the spread is 60–141, and there’s an outlier at 141.
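The stemplot construction in this section (stem = every digit but the last, leaf = the last digit, no gaps between stems) can be sketched in Python. The function name `stemplot` is my own, and the sketch assumes non-negative whole numbers like the IQ scores from Example 11.

```python
# IQ scores of the 50 tenth graders from Example 11.
iq_scores = [
    99, 77, 83, 111, 141, 89, 98, 84, 93, 124,
    110, 73, 96, 60, 102, 87, 123, 120, 100, 95,
    100, 90, 104, 85, 129, 81, 119, 112, 103, 76,
    108, 91, 94, 114, 108, 92, 96, 94, 88, 101,
    117, 106, 103, 105, 113, 97, 106, 109, 80, 116,
]


def stemplot(data):
    """Return the lines of a stemplot with sorted leaves (non-negative integers only)."""
    stems = {}
    for x in sorted(data):           # sorting first gives sorted leaves on each stem
        stem, leaf = divmod(x, 10)   # 117 -> stem 11, leaf 7
        stems.setdefault(stem, []).append(leaf)
    lines = []
    # Walk every stem from lowest to highest so that empty stems still appear.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


lines = stemplot(iq_scores)
print("IQ Scores")
print("\n".join(lines))
print("key: 11 | 7 = 117")
```

Stem 13 has no data, but it still gets its own line, for the same reason an empty class still gets its width in a histogram.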

2C. Bad Graphs You now know how to make good graphs, so be on the lookout for bad graphs. Sometimes they’re bad just because whoever drew them didn’t know any better, or didn’t think. But some people may deliberately try to deceive you with a graph. Example 12: File this one under “what were they thinking?” The left-hand graph doesn’t have a title, so you don’t know what “Yes” and “No” mean. You have to look back and forth between the graph and the legend, and anyone with red-green color blindness probably won’t be able to see which segment is which. Oh yes — what percentages of the sample answered “Yes” and “No”? You can guess that it’s around a third versus two thirds, but that’s not very precise. The right-hand graph cures those problems. It’s now crystal clear which segment is Yes and which is No, and what proportion of the sample gave each answer. This actually lets you show more information in less space, a win-win. (Of course you wouldn’t use a vague term like “Opinions” — that’s just there to remind you to give your graph a title.)

Example 13: There’s no telling whether this one is deliberate deception or just incompetent graphing. An oatmeal company, which shall remain nameless, wanted to show that eating oatmeal for four weeks reduces cholesterol. The first graph makes a strong case — until you look at the scale on the vertical axis. (Don’t even think about wasting your time on a graph with no vertical scale.) The scale doesn’t start at zero, so it makes differences look much bigger than they are. Your frequency or relative frequency scale must always start at zero (and you must show the zero). The second graph is properly drawn, and now you can see that the drop in cholesterol is only a slight one.

Example 14: It’s all very well to create visual interest, but not if it makes the reader misinterpret the graph. In the left-hand graph, you can tell from the scale that B is supposed to be three times as large as A, but since it’s three times as high and three times as wide it’s actually nine times as large, giving the reader a distorted impression of the amount of difference. Even if your “bars” are pictures, they still have to be the same width. The corrected version is shown at right. (It’s still not quite correct, though, because 0 is not shown on the vertical axis.)

source: Misleading Graph

2D. Really Good Graphs If you follow the rules in this chapter, you’ll make good, professional graphs. But there are plenty of other ways to make good graphs, depending on the data you’re trying to show. There’s a classic picture book that can give you lots of good ideas. Edward Tufte’s The Visual Display of Quantitative Information has been around since 1983, and no one has yet done it any better. (Tufte has produced newer editions.) Example 15: One famous graph in Tufte’s book is particularly stunning. Charles Minard wanted to present a lot of time-series data about Napoleon’s disastrous campaign in Russia in the winter of 1812–1813: where battles took place, numbers of casualties, temperature, and so forth. He elected to make a kind of stylized map showing just the rivers and the cities where events happened. (Niemen at the left is the Niemen River, Russia’s western border at the time. Moscow, “Moscou” in French, is as far east as Napoleon got.) Across that, Minard showed the army strength as a broad swath at the start that shrank to almost nothing by the end of the retreat westward. Below are dates of events, temperatures, and precipitation. It’s a huge amount of information on one piece of paper.

This tiny rendition doesn’t do it justice, but if you click on it you’ll see it at a better size. (Your browser may still reduce it to fit on your screen. Try clicking into the picture and you should see it at original size, though you’ll have to scroll around to see the details. It sounds like a lot of effort, but I promise you it’s worth it. Or just get the book, because it has plenty more!)

Example 16: Here’s one I ran across in my reading. It’s not the graph of the century like Minard’s, but it’s a cut above the usual. In Bear Attacks: Their Causes and Avoidance (2002), Stephen Herrero had the problem of contrasting bears’ diet in spring, summer, and fall. (Of course in winter they’re not eating.) He could have drawn three pie charts, or a stacked bar graph, but instead he came up with a great alternative. (You can click on the picture to enlarge it.)

Each component of diet is clearly labeled right in the graph, not in some legend off to the side, and the contrasting backgrounds make it a little more interesting visually. A stacked bar graph would convey the same information, but I like this presentation because it suggests that “spring”, “summer”, and “fall” are not completely separate but rather transition one into the next. The vertical axis is clearly labeled, too. There’s no doubt what the numbers are (as opposed to some units of weight, for instance, or something more esoteric like pounds of feed per hundreds of pounds of bear). He probably could have left the title off the category axis (after all, we know that the seasons are seasons, and the graph title also conveys that information). But that’s a minor point. My only real quibble with this graph is that the overall graph title at the bottom is too small.

What Have You Learned? Overview:

With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.

Key ideas:

When you have a mass of data and need frequencies, don’t pass through the data repeatedly, counting a different category each time. Instead, use the tally system.

Relative frequency for any class or category is the number of data points in that class, divided by total sample size.

For non-numeric data, make a bar graph or pie chart. Place categories in any order that seems reasonable to you. Side-by-side bar graphs and stacked bar graphs can be useful for comparing populations.

Numeric data:

Group continuous data in classes, tally them, and make a grouped histogram. Bars must touch, and you label them under the edges, not the middles. Do the same with discrete data that have a lot of different values.

Present discrete data without too many different values in one bar for each different value. Label them under their middles. It’s a matter of taste whether the bars touch (ungrouped histogram) or not (bar graph).

For bar graphs and histograms, show scale on the frequency or relative-frequency axis, and show scale or category name on the data axis. Usually, each axis has a title, with a separate chart title at the top. But you can omit an axis title when it would be redundant information.

In every bar graph or histogram, the frequency or relative-frequency axis must start at 0 and have consistent scale for its whole length. Be on the lookout for violations of this rule and other signs of bad graphs.

Know the most common shapes of numeric distributions: uniform, bell curve, skewed left, and skewed right.

The stemplot (stem-and-leaf diagram) is also an option for discrete data with moderate range and ≤ about 100 data points.

Study aids:

Histogram Versus Bar Graph

Statistics Symbol Sheet How to Read a Math Book How to Work a Math Problem How to Take a Math Test or Quiz


Exercises for Chapter 2 Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand. Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1

The Pew Research Center (2013c) conducted a poll of 1000 adults in Mexico, asking whether they would move to the US if they had the means and opportunity to move. Draw a relative-frequency bar graph for their responses.

Would You Move to the US?
Response                      Frequency
Yes, with authorization       154
Yes, without authorization    204
No                            612
Don’t know                    30

2

What’s wrong with this graph? (You should be able to see at least two problems, maybe more.)

(source: Misleading Graph in Wikipedia)

3

Professor Marvel had a statistics class of fifteen students, and on one 15-point quiz their scores were 10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5 Construct a frequency table and bar graph for their letter grades on the quiz, where 90% is the minimum for an A, 80% for a B, 70% for a C, and 60% for a D.

4

Bulmer (1979, 92) quotes an 1898 study of deaths by horse kick in the Prussian army. Von Bortkiewicz compiled the number of deaths in 14 Prussian Army corps over the 20-year period 1875–1894, as shown at right. (14 corps over 20 years gives 14×20 = 280 observations.) For example, there were 32 observations in which two officers died of horse kicks. (a) What is the type of the variable? (b) Construct an appropriate graph.

Deaths by Horse Kick in 14 Prussian Army Corps, 1875–1894
Number of Deaths    Frequency
0                   144
1                   91
2                   32
3                   11
4                   2
Total               280

5

In a GM factory in Brazil, 25 workers were asked their commuting distance in kilometers. Construct a stem-and-leaf plot. (Adapted from Dabes and Janik (1999, 8).)

Commuting Distances in km
 5 15 23 12  9
12 22 26 31 21
11 19 16 45 12
 8 26 18 17  1
16 24 15 20 17

6

Abigail asked a number of students their major. She found 35 in liberal arts, 10 in criminal justice, 25 in nursing, 45 in business, and 20 in other majors. What was the relative frequency of the nursing group, rounded to the nearest whole percent?

7

(a) Name three types of graph used for ungrouped discrete data. Which type do you use when? (b) Name the type of graph used for grouped numeric data. (c) Name two types of graph used for qualitative data.

8

Bert asked his fellow students how many books they read for pleasure in a year. He found that most of them read 0, 1, or 2 books, but some read 3 or more and a very few read as many as 10. (He plotted the histogram shown at right.) Identify the shape of this distribution.

9

(a) In making a histogram, how do you decide whether to group data? (b) What are the two rules for classes when you group data?

10

At right is a grouped frequency distribution. (a) Create a frequency histogram. (For a real quiz, you’d use graph paper, but you can freehand this one.) (b) Find the class width. (c) What’s the shape of this distribution?

Test scores, x    Frequencies, f
470.0–479.9       15
480.0–489.9       22
490.0–499.9       29
500.0–509.9       50
510.0–519.9       38



Solutions

What’s New Add Excel 2016, here and here. 11 Feb 2015: In the first histogram, get a little more vague about the location of the center. 11 Jan 2015: Deprecate the common error of offsetting the 0 frequency. Note a remaining problem in a corrected graph. Add an overview, links, and study aids in What Have You Learned? (intervening changes suppressed) 3 June 2013: New document.

3. Numbers about Numbers Updated 1 Feb 2015 (What’s New?) Summary:

For numeric data, the goal of descriptive stats is to show the shape, center, spread, and outliers of a data set. In this chapter, you learn how to find and interpret numbers that do that. Measures of center: mean, median, mode. Median is resistant and therefore better for describing skewed data. Measures of spread: range, interquartile range, variance, standard deviation. Standard deviation is best. Measures of position: percentiles, z-scores, quartiles. The quartiles help determine which data points, if any, are outliers. The min, max, and quartiles appear in the five-number summary and are shown on a boxplot. BTW: Measures of shape called skewness and kurtosis do exist, but they’re not part of this course. Roughly, skewness tells how this data set differs from a symmetric distribution, and kurtosis tells how it differs from a normal distribution. If you’re interested, you can learn about them in Measures of Shape: Skewness and Kurtosis. The MATH200B Program part 1 can compute those measures of shape for you.

Contents:

3A. Measures of Center 3A1. The Three M’s: Mean, Median, Mode 3A2. Mean, Median, Mode, and the Shape of a Data Set 3B. Summary Numbers on the TI-83 … 3B1. … from a List of Numbers 3B2. … from an Ungrouped Distribution · Weighted Average 3B3. … from a Grouped Distribution 3C. Measures of Spread 3C1. Range and IQR (Interquartile Range) 3C2. Standard Deviation · What Good Is the Standard Deviation, Anyway? · The Empirical Rule for Normal Distributions · Optional: Chebyshev’s Inequality 3D. Measures of Position 3D1. Percentiles 3D2. Quartiles 3D3. z-Scores 3E. Five-Number Summary 3E1. Outliers 3E2. Box-Whisker Diagrams · Box-Whisker Plot, and Shape of a Data Set · Box-Whisker Plot on TI-83/84/89 · Finding Outliers with the TI-83/84/89 or Excel · Five-Number Summary from TI-83/84/89 Boxplot What Have You Learned? Exercises for Chapter 3 What’s New

3A. Measures of Center

3A1. The Three M’s: Mean, Median, Mode There are three common measures of the center of a data set: mean, median, and mode. Definition:

The mean is nothing more than the average that you’ve been computing since elementary school. The symbol for the mean of a sample is x̄, pronounced “x bar”. The symbol for the mean of a population is the Greek letter µ, pronounced “mew” and spelled “mu” in English. (Don’t write µ as “u”; the letter has a tail at the left.) You can think of the mean as the center of gravity of the distribution. If you made a wooden cutout of the histogram, you could balance it on a pencil or your finger placed exactly under the mean.

BTW: The formula for the mean is x̄ = ∑x/n or µ = ∑x/N, meaning that you add up all the numbers in the data set and then divide by sample size or population size.

Definition:

The median is the middle number of a sample or population. It is the number that is above and below equal numbers of data points. (Examples are below.) There’s no one agreed symbol for the median. Different books use M or Med or just “median”. To find the median by hand, you must put the numbers in order. If the data set has an odd number of data points, counting duplicates, then the median is the middle number. If the data set has an even number of data points, the median is half way between the two middle numbers. (In the next section, you’ll get the median from your TI calculator, with no need to sort the numbers.)

Definition:

The mode is the number that occurs most frequently in a data set. If two or more numbers are tied for most frequent, some textbooks say that the data set has no mode, and others say that those numbers are the modes (plural). We’ll follow the second convention. Most distributions have only one mode, and we call them unimodal. If a distribution has two modes, or even if it has two “frequency peaks” like the one at right, we call it bimodal. (This was students’ final grades in a math course: a lot of low or high grades, and few in the middle.) There’s no symbol for the mode.

Example 1: You’re interviewing at a company. You ask about the average salary, and the interviewer tells you that it’s $100,000. That sounds pretty good to you. But when you start work, you find that everybody you work with is making $10,000. What went wrong here? The interviewer told the truth, but left out a key fact: Everybody but the president makes below the average. Eight employees make $10,000 each, the vice president makes $50,000, and the president makes $870,000. Yes, the mean is (8×10,000 + 50,000 + 870,000)/10 = $100,000, but that’s not representative because the president’s salary is an outlier. It pulls the mean away from the rest of the data, and skews the salary distribution toward the right. This graph tells the sad tale:

There was your mistake. Salaries at most companies are strongly skewed right, so most employees make less than the average. When a data set is skewed, the mean is pulled toward the extreme values. (A data set can be skewed without outliers, but when there are outliers the data set is almost certain to be skewed.) You should have asked for the median salary, not the average (mean) salary. There are 10 employees, and 50% of 10 is 5, so the median is less than or equal to five data points and greater than or equal to five data points. The fifth-highest and sixth-highest salaries are both $10,000, so the median is $10,000. The median is more representative than the mean when a data set is skewed. The mean is pulled toward an extreme value, but the median is unaffected by extreme values in the data set. We say that the median is resistant. Example 2: What is the median of the data set 8, 15, 4, 1, 2? Put the numbers in order: 1, 2, 4, 8, 15. There are five numbers, and 50% of 5 is 2.5. You need the number that is above 2 data points and below 2 data points; the median is 4. Example 3: What is the median of the data set 7, 24, 15, 1, 7, 45? There are six data points, and in order they are 1, 7, 7, 15, 24, 45. 50% of 6 is 3; you need the number that is above 3 data points and below 3 data points. It’s clear that the median is between 7 and 15, but where exactly? When the sample size is an even number, the median is the average of the two middle numbers. Therefore the median for this data set is the average of 7 and 15, (7+15)/2 = 11.
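The three examples above can be checked in a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are made up for illustration.

```python
from statistics import mean, median

# Example 1: eight employees, the vice president, and the president.
salaries = [10_000] * 8 + [50_000, 870_000]
print("mean salary:  ", mean(salaries))
print("median salary:", median(salaries))

# Examples 2 and 3: median() sorts the data for you, and for an even
# number of data points it averages the two middle numbers.
example2 = [8, 15, 4, 1, 2]
example3 = [7, 24, 15, 1, 7, 45]
print(median(example2), median(example3))
```

The gap between the two salary figures is the whole story of Example 1: one outlier drags the mean far to the right, while the median stays with the bulk of the data.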

3A2. Mean, Median, Mode, and the Shape of a Data Set When a distribution is symmetric, the mean and median are close together. If it’s unimodal, the mode is close to the mean and median as well. But have you ever taken a course that was graded on a curve, and one or two “curve wreckers” ruined things for everyone else? What happened? Their high scores raised the class average (mean), so everybody else’s scores looked worse. The class scores were skewed right: low scores occurred frequently, and high scores were rare. (You can see shapes of skewed distributions in Chapter 2.) When a distribution is skewed, the mean is pulled toward the extreme values. The median is resistant, unaffected by extreme values. And you can reverse that logic too: if the mean is greater than the median, it must be because the distribution is skewed right. From the median to the mean is the direction of skew.

Skewed left, mean < median (usually)



Skewed right, mean > median (usually)

For heaven’s sake, don’t memorize that! Instead, just draw a skewed distribution and ask yourself approximately where the mean and median fall on it. BTW: Karl Pearson gives the rule median = (2×mean + mode)/3 for moderately skewed distributions. For more about this, see Empirical Relation between Mean, Median and Mode.

Caution! All the statements in this section are a rule of thumb, true for most distributions. The logic holds for almost every unimodal continuous distribution, and for discrete distributions with a lot of different values. But it tends to break down on discrete distributions that have only a few different values. For more about this, see von Hippel 2005.

3B. Summary Numbers on the TI-83 … Summary:

The 1-VarStats command gives you mean, median, and much more for any data set. If you have just a plain list of numbers, enter the name of that list on the command line. If you have a frequency distribution, enter the name of the data list and the name of the frequency list on the command line.

Excel:

Excel can do these computations. This isn’t an Excel course, but if you’re an Excel head you can figure out how to get this information. One way is with the Data Analysis tool, part of the Analysis Toolpak add-in that comes with Excel (though you may have to enable it). Another way is to click in a blank cell, click Formulas » More Functions » Statistical and select the appropriate worksheet function.

3B1. … from a List of Numbers

Example 4: Professor Marvel had a statistics class of fifteen students, and on one quiz their scores were

10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5

Your TI-83 or TI-84 can give you the mean, median, and other numbers that summarize this data set.

1. If you have any partial commands visible, press [CLEAR].
2. Press [STAT] [ENTER] to get into the edit screen for statistics lists. You can use any list, but let’s use L1 this time. (If you don’t see L1, and pressing the left arrow doesn’t bring it into view, press [STAT] [5] [ENTER] [STAT] [ENTER].)
3. Cursor to the L1 label at the top (the column heading, not the top number) and press [CLEAR] [ENTER] to clear the list.
4. Enter your numbers, pressing [ENTER] after each one.
5. After entering the last number, check all the numbers carefully and make any needed corrections. If you duplicated a number, press [DEL] to remove it; if you left out a number, press [2nd DEL makes INS] to open a space for it.
6. Press [STAT] [►] [1] to select 1-VarStats. If you have a newer TI-84 with the “wizard” interface selected, a little menu will appear. Identify the list that contains your data: [2nd 1 makes L1]. For a simple list of numbers like this one, there is no frequency list, so press [DEL]. If you have an older calculator or you’ve turned off the “wizard” interface, the calculator will paste 1-VarStats to the home screen. On the same line, identify the list that contains your data: [2nd 1 makes L1].
7. After writing down the complete command on your paper (1-VarStats L1), press [ENTER] to execute it.

The results screen is shown below. A down arrow on the screen says that there is more information if you press [▼], and an up arrow says that there is more information if you press [▲].

Look first at the bottom of the screen. Always check n first: if it doesn’t match your sample or population size, the other numbers are big sacks of bogosity. In this case a quick count of the original data set shows 15 numbers, which is the right quantity. (Of course, this check can’t determine if you miskeyed any numbers. Only double and triple checking can protect you from that kind of mistake.)

What are you seeing on this screen?

x̄ is the mean. The calculator doesn’t know whether a given data set is a sample or a population, so it can’t guess whether to display x̄ or µ. This data set is quiz scores from the complete class, so it’s a population and you write down µ = 9.7. Always use the right symbols, even if the calculator doesn’t.

A word about rounding: The rules for significant digits and rounding are beyond the scope of this course, but beware of being ridiculously precise. (For example, most gasoline pumps are calibrated in 0.001 gallon units. But 0.001 gallon is two tablespoons, and there’s considerably more gas than that in the hose, so that precision is just silly.) A good rule of thumb is to report sample statistics and population parameters to one more decimal place than the original data. Then why did I say µ = 9.7 instead of 9.72, since the original data have one decimal place? That’s a valid question, and my answer is that 9.72 would not be wrong but it feels overly precise when there are only fifteen data points, most are whole numbers, and the rest are a whole number plus ½.

∑x and ∑x² are the sum of the original data and the sum of the squares of the original data. You won’t use them in this course, and there’s no reason to write them down.

Sx and σx are the standard deviation, computed by two methods. Choose Sx with a sample and σx with a population. More when we get to Measures of Spread! Since this data set is a population, select σx = 3.057929583 and write down σ = 3.1.

BTW: The name standard deviation was created in 1893 by Karl Pearson. (We might wish that he had chosen something with fewer than six syllables.) He assigned the symbol σ to the standard deviation of a population in 1894.

n is the sample size or population size. Since you have a population, you write down N = 15, not n = 15.

minX and maxX are the smallest and largest data points.

Q1 and Q3 are the first and third quartiles, which we’ll get to under Measures of Position.

Med is the median, which you met earlier. Med = 10.5.

The minimum, Q1, median, Q3, and maximum are together called the five-number summary. I’ll have more to say about the five-number summary later in this chapter.

Showing your work and your results, you write down:

1-VarStats L1
µ = 9.7
σ = 3.1
N = 15
min = 2.5   Q1 = 8   Med = 10.5   Q3 = 11.5   max = 15
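If you want to cross-check the calculator away from the classroom, here is a rough Python sketch of the same summary. The helper `five_number_summary` is an illustrative name, and it assumes each quartile is the median of one half of the sorted data (with the middle value left out when the count is odd). That convention reproduces the Q1 and Q3 shown above, but other software may use slightly different quartile rules.

```python
from statistics import fmean, median, pstdev

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max); quartiles are medians of each half."""
    xs = sorted(data)
    half = len(xs) // 2                      # odd count: middle value is left out
    lower, upper = xs[:half], xs[len(xs) - half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

N = len(scores)          # always check the count first
mu = fmean(scores)       # mean
sigma = pstdev(scores)   # population standard deviation (divides by N)

print(N, round(mu, 2), round(sigma, 4))
print(five_number_summary(scores))
```

`pstdev` is used because the fifteen scores are the whole class, a population; for a sample, `statistics.stdev` plays the role of Sx.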

3B2. … from an Ungrouped Distribution Example 5: Your TI-83 or TI-84 can also compute statistics of a frequency distribution. Let’s try it with the data from Chapter 2 for number of adults in vehicles entering the park.

Number of Adults in Vehicles Entering Park
Adults in Vehicle    Number of Vehicles
0                    2
1                    5
2                    7
3                    15
4                    5
5                    2
6                    2
7                    1
8                    1
Total                40

Enter the data values in one statistics list, such as L1. Enter the frequencies in a second list, such as L2.

Press [STAT] [►] [1] to select 1-VarStats. In the “wizard” interface, enter [2nd 1 makes L1] for List and [2nd 2 makes L2] for FreqList. In the non-“wizard” interface, press [2nd 1 makes L1], then [,] (comma), then [2nd 2 makes L2] and [ENTER]. The data list must come first and the frequency list second.

Caution, rookie mistake: Students often leave off the frequency list. Your calculator is pretty good, but it can’t read your mind. The only way it knows that you have a frequency distribution is if you give it both the frequency list and the data list.

Either way, write down the complete command on your paper: 1-VarStats L1,L2. Here are the results:

Again, look at n first. That protects you from the rookie mistake of leaving off the frequency list. If n is wrong, redo your 1-VarStats command and this time do it right.

These forty vehicles are obviously not all the vehicles that enter the park, so they are a sample, not a population. You therefore write down the statistics as follows:

1-VarStats L1,L2
x̄ = 3.0
s = 1.7 (from Sx = 1.73186575)
n = 40
min = 0   Q1 = 2   Med = 3   Q3 = 4   max = 8

Weighted Average

Sometimes you take an average where some data points are more important than others. We say that they are weighted more heavily, and the mean that you compute in this way is called a weighted average or weighted mean. You’re intimately familiar with one example of a weighted average: your GPA or grade point average.

Example 6: The NHTSA’s Corporate Average Fuel Economy or CAFE Rule (NHTSA 2008) specifies a corporate average of 34.8 mpg (miles per gallon) for passenger cars. Let’s keep things simple and suppose that ZaZa Motors makes three models of passenger car: the Behemoth gets 22 mpg, the Ferret gets 35 mpg, and the Mosquito gets 50 mpg. Does ZaZa meet the standard?

To answer that, you can’t just average the three models: (22+35+50)/3 ≈ 35.7 mpg. Suppose the company sells one Mosquito and the rest are Behemoths and a sprinkling of Ferrets? You have to take into account the number of cars of each model sold. In effect, you have a frequency distribution with mpg figures and repetition counts. Let’s suppose these are the sales figures:

Auto Sales by ZaZa Motors
Model       Miles per Gallon    Number Sold
Behemoth    22                  100,000
Ferret      35                  250,000
Mosquito    50                  20,000
Total                           370,000

Put the miles per gallon in L1 and the frequencies in L2. (How do you know it’s not the other way around? You’re trying to find an average mpg, so the mpg numbers are your data.) You should find:

1-VarStats L1,L2
µ = 32.3 mpg
N = 370,000 passenger cars

Even though two of the three models meet the standard, the mix of sales is such that ZaZa Motors’ CAFE is 32.3 mpg, and it’s not in compliance.

BTW: The formula for the mean of a grouped distribution and the formula for a weighted average are the same formula: µ = ∑xf/N for a population or x̄ = ∑xf/n for a sample. Either way, take each data value times its frequency. Add up all those products, and divide by the population size or sample size. For the notation, see ∑ Means Add ’em Up in Chapter 1.
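The ∑xf/N formula is short enough to sketch in Python. The function name `weighted_mean` and the variable names are illustrative only; the numbers are the ZaZa Motors sales figures and the vehicles-entering-the-park frequencies from earlier in this section.

```python
def weighted_mean(values, weights):
    """Sum of value*weight, divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(x * w for x, w in zip(values, weights)) / total_weight

# Example 6: mpg figures are the data, cars sold are the weights.
mpg = [22, 35, 50]
sold = [100_000, 250_000, 20_000]
cafe = weighted_mean(mpg, sold)
unweighted = sum(mpg) / len(mpg)     # ignores how many of each model were sold

print("weighted:", round(cafe, 1), " unweighted:", round(unweighted, 1))
print("meets the 34.8 mpg standard:", cafe >= 34.8)

# Example 5 is the same computation: data values weighted by frequencies.
adults = [0, 1, 2, 3, 4, 5, 6, 7, 8]
vehicles = [2, 5, 7, 15, 5, 2, 2, 1, 1]
print("mean adults per vehicle:", weighted_mean(adults, vehicles))
```

Swapping the two lists is the programming version of the rookie mistake above: the quantity you are averaging goes first, the counts second.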

3B3. … from a Grouped Distribution In a grouped frequency distribution, one number called the class midpoint stands for all the numbers in the class. Definition:

The class midpoint for a given class equals the lower boundary plus half the class width. This is half way between the lower class boundary of this class and the lower class boundary of the next class.

Example 7: Let’s revisit the lengths of iTunes songs from the grouped histogram in Chapter 2. What is the midpoint of the 300 to 399 class?

The class width equals the difference between lower boundaries: 400−300 = 100. Half the class width is 50, so the midpoint is 300+50 = 350. You could also compute the class midpoint as (300+400)/2 = 350. However, it is wrong to take (300+399)/2 = 349.5 as class midpoint or 399−300 = 99 as class width. Don’t use the upper boundary in finding the class midpoint.

Of course you don’t have to compute every class midpoint the long way. Once you have the midpoint of the first class, (100+200)/2 = 150, just add the class width repeatedly to get the rest: 250, 350, … 850. The grouped frequency distribution, with the class midpoints, is shown below.

Lengths of iTunes Songs (seconds)
Class Boundaries    Class Midpoint    Frequency
100–199             150               9
200–299             250               20
300–399             350               9
400–499             450               7
500–599             550               3
600–699             650               1
700–799             750               0
800–899             850               1

What good is the class midpoint? It’s a stand-in for all the numbers in its class. Instead of being concerned with the nine different numbers in the 100 to 199 class, twenty different numbers in the 200 to 299 class, and so on, we pretend that the entire data set is nine 150s, twenty 250s, and so on. This means you get approximate statistics, but you get them with a lot less work.

Is this legitimate? How good is the approximation? Usually, quite good. In most data sets, a given class holds about equally many data points below the class midpoint and above the class midpoint, so the errors from making the approximation tend to balance each other out. And the bigger the data set, the more points you have in each class, so the approximation is usually better for a larger data set.

Procedure: Enter the class midpoints in one statistics list, such as L1. Enter the frequencies in another list, such as L2. Enter the command 1-VarStats L1,L2 and write down the complete command on your paper. Again, avoid the rookie mistake: include the class-midpoint list and the frequency list in your command.

The results screens are below. As usual, before you look at anything else, check that n matches the size of the data set. 50 is correct, so that’s one less worry.

There’s a problem with the second screen, though. Your calculator knows you have a frequency distribution, because you gave two lists to the 1-VarStats command. But it doesn’t have the original data, so it doesn’t know the true minimum (lowest data point). When you read minX=150, you interpret that to mean that the lowest data point occurs in the class whose midpoint is 150; in other words, the minimum is somewhere between 100 and 199. Your knowledge of the rest of the five-number summary has the same limitation. For instance, the median isn’t 250; all you know is that it occurs somewhere between 200 and 299. Because of these limitations, you don’t do anything with the second results screen from a grouped distribution.

The mean and standard deviation don’t have this problem: they’re approximate, but the approximation is good enough. (n is exact, not an approximation.)

These 50 iTunes songs are obviously not all the songs there are, not even all the songs in any particular person’s iTunes library. They are a sample, not a population. Therefore you write down your work and results like this:

1-VarStats L1,L2
x̄ = 316 (or you could write 316.0)
s = 145.1
n = 50
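The grouped-distribution procedure can be sketched in Python as well. This is an illustration, not the calculator’s own algorithm: the variable names are made up, the midpoints are built from the lower class boundaries as described above, and the standard deviation divides by n−1 because these songs are a sample.

```python
from math import sqrt

lower_bounds = [100, 200, 300, 400, 500, 600, 700, 800]
freqs = [9, 20, 9, 7, 3, 1, 0, 1]

# Class width is the difference between consecutive LOWER boundaries.
width = lower_bounds[1] - lower_bounds[0]
midpoints = [lb + width / 2 for lb in lower_bounds]

n = sum(freqs)                                    # check this first
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Sample standard deviation, with each midpoint standing in for its class.
sum_sq_dev = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
s = sqrt(sum_sq_dev / (n - 1))

print(midpoints)
print(n, mean, round(s, 1))
```

As in the text, only n, the mean, and the standard deviation are worth reporting; a minimum or median computed from midpoints would only tell you which class those values fall in.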

3C. Measures of Spread There are four common measures of the spread of a data set: range, interquartile range or IQR, variance, and standard deviation. (You may also see spread referred to as dispersion, scatter, variation, and similar words.)

3C1. Range and IQR (Interquartile Range) Definition:

The range of a data set is the distance between the largest and smallest members.

Example 8: If the largest number in a data set is 100 and the smallest is 20, the range is 100−20 = 80, regardless of what numbers lie between them and what shape the distribution might have. Caution: The range is one number: 80, not “20 to 100”. Obviously the range has a problem as a measure of spread: It uses only two of the numbers. Since only the two most extreme numbers in the data set get used to compute the range, the range is about as far from resistant as anything can be. In favor of the range is that it’s easy to compute, and it can be a good rough descriptor for data sets that aren’t too weird. The interquartile range has something of the same idea, but it is resistant. Definition:

The interquartile range (IQR) is the distance between the largest and smallest members of the middle 50% of the data points, taking repetitions into account. Alternative definition: The IQR is the third quartile minus the first quartile, or the 75th percentile minus the 25th percentile.

You’ll learn about percentiles and quartiles in the next section, Measures of Position, but for now let’s just take a quick non-technical example. Example 9: Consider the data set 1, 2, 3, 3, 3, 4, 5, 8, 11, 11, 15, 23. There are twelve numbers, and the middle 50% (six numbers) are 3, 3, 4, 5, 8, 11. The interquartile range is 11−3 = 8.
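If you'd like to check Example 9 on a computer, here is a minimal Python sketch. It takes the "middle 50%" literally by dropping the lowest quarter and the highest quarter of the sorted data, so it assumes the number of data points is divisible by 4, as it is here. The function name middle_half is made up for this illustration.

```python
def middle_half(data):
    """Return the middle 50% of the data points.

    Assumption for this sketch: len(data) is divisible by 4.
    """
    ordered = sorted(data)
    quarter = len(ordered) // 4
    return ordered[quarter:len(ordered) - quarter]


data = [1, 2, 3, 3, 3, 4, 5, 8, 11, 11, 15, 23]

data_range = max(data) - min(data)   # uses only the two extremes
middle = middle_half(data)
iqr = max(middle) - min(middle)      # spread of the middle 50%

print(data_range)  # 22
print(middle)      # [3, 3, 4, 5, 8, 11]
print(iqr)         # 8
```

Try changing the 23 to 2300: the range jumps, but the IQR stays at 8. That's what "resistant" means.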

3C2. Standard Deviation

The IQR is a better measure of spread than the range, because it’s resistant to the extreme values. But it still has the problem that it uses only two numbers in the data set. Isn’t there some measure of spread that uses all the numbers in the data set, as the mean does? The answer is yes: the variance and the standard deviation use all the numbers.

Your calculator gives you the standard deviation, as you saw above. The variance is important in a theoretical stats course, but not so much in this practical course. We’ll measure spread with the standard deviation almost exclusively. (To save wear and tear on my keyboard and your printer, I’ll often use the abbreviation SD.)

If you’d like to know how the variance and SD are computed, read the “BTW” section that follows. Otherwise, skip down to “What Good Is the Standard Deviation, Anyway?”

BTW: To see how the variance is computed, let’s go back to Professor Marvel’s quiz scores. We computed the mean as 9.7, or to use the unrounded value, µ = 9.72. (Never round numbers if you’re going to use them in further calculation; that’s the Big No-no.)

    x       x−µ      (x−µ)²
  10.5     0.78     0.6084
  13.5     3.78    14.2884
   8      −1.72     2.9584
  12       2.28     5.1984
  11.3     1.58     2.4964
   9      −0.72     0.5184
   9.5    −0.22     0.0484
   5      −4.72    22.2784
  15       5.28    27.8784
   2.5    −7.22    52.1284
  10.5     0.78     0.6084
   7      −2.72     7.3984
  11.5     1.78     3.1684
  10       0.28     0.0784
  10.5     0.78     0.6084
  Total    0.00   140.2640

If you want to devise a measure of spread, it seems reasonable to consider spread from the mean, so try subtracting the mean from each quiz score and then adding up all those deviations. You get zero, so obviously “sum of deviations” isn’t a useful measure of spread.

But with the next column you strike gold. Squaring all the deviations changes the negatives to positives, and also weights the larger deviations more heavily. This is progress! Now divide the total of squared deviations by the population size and you have the variance: σ² = 140.2640/15 = 9.3509. (σ is the Greek letter sigma.)

(When computing the variance of a sample, you divide by n−1 rather than n. The reasons are technical and are explained in Steve Simon’s articles Degrees of Freedom (1999a) and Degrees of Freedom, Part 2 (2004).)

The variance is quite a good measure of spread because it uses all the numbers and combines their differences from the mean in one overall measure. But it’s got one problem. If the data are dollars, the squared deviations will be in square dollars, and therefore the variance will be in square dollars. What’s a square dollar? (No, I don’t know either.) You want a measure of spread that is in the same units as the original data, just like the mean and median are.

The simplest solution is to take the square root of the variance, and when you do that you have the standard deviation (SD), σ = √(140.2640/15) = 3.05793, which rounds to 3.1. And because the standard deviation is in the same units as the original data, it can be used as a yardstick, as you’ll see below.

For lovers of formulas, here they are. The standard deviation of a population, σ, has population size N on the bottom of the fraction; the standard deviation of a sample, s, has sample size n minus 1 on the bottom of the fraction. If you’re not familiar with the ∑ notation (sigma or summation), ∑x² means square every data value and add the squares; ∑x²f means square every data value, multiply by the frequency, and add those products. For the notation, see ∑ Means Add ’em Up in Chapter 1.

Formulas for Standard Deviation

When the data set is the whole population
  of a List of Numbers:            σ = √[ ∑(x−µ)² / N ] = √[ (∑x² − Nµ²) / N ]
  of a Frequency Distribution:     σ = √[ (∑x²f − Nµ²) / N ]

When the data set is just a sample
  of a List of Numbers:            s = √[ ∑(x−x̄)² / (n−1) ] = √[ (∑x² − nx̄²) / (n−1) ]
  of a Frequency Distribution:     s = √[ (∑x²f − nx̄²) / (n−1) ]

Why are there two formulas on each row under “list of numbers”? The first formula is the definition, and the second is a shortcut for faster computations. Of course they’re mathematically equivalent; you could prove that if you wanted to.

BTW: Sir Ronald Fisher coined the term variance in 1918. He used the symbol σ² for the variance of a population, since Pearson had already assigned σ to the standard deviation, and the variance is the square of the SD.
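If you'd rather let a computer do the arithmetic, here is a minimal Python sketch of the quiz-score computation. It follows the definition (sum the squared deviations, divide by N, take the square root), and then compares the result against one common form of the shortcut, which is an assumption on my part about how the shortcut is written. The variable names are made up for this illustration; pstdev and stdev come from Python's standard statistics module.

```python
import math
import statistics

# Professor Marvel's quiz scores
scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
N = len(scores)

mu = sum(scores) / N
sum_sq_dev = sum((x - mu) ** 2 for x in scores)  # total of the (x - mu)^2 column

# Definition: divide by N because the 15 scores are the whole population
sigma = math.sqrt(sum_sq_dev / N)

# Shortcut (assumed form): sum of squares minus N times the squared mean
sigma_shortcut = math.sqrt((sum(x * x for x in scores) - N * mu ** 2) / N)

# If the same scores were only a sample, you would divide by N - 1 instead
s = math.sqrt(sum_sq_dev / (N - 1))

print(round(mu, 2))          # 9.72
print(round(sum_sq_dev, 4))  # 140.264
print(round(sigma, 5))       # 3.05793
```

The library functions statistics.pstdev(scores) and statistics.stdev(scores) should agree with sigma and s respectively, apart from tiny floating-point differences.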

What Good Is the Standard Deviation, Anyway?

The standard deviation will be the key to inferential statistics, starting in Chapter 8, but even within the realm of descriptive statistics there are some applications. In addition to this section, you’ll see an application in z-Scores, below.

Working with the quiz scores on your TI-83 or TI-84, you found that the population mean was µ = 9.7 and the population SD was σ = 3.1. What does this mean? Just as a concept, the standard deviation gives you an idea of the expected variation from one member of the sample or population to the next. The SD in this example is about a third of the mean, so you expect some variation but not a lot. But can you do better than this? Yes, you can!

The Empirical Rule for Normal Distributions

You can predict what percentage of the data will be within a certain number of standard deviations above or below the mean. In a normal distribution, 68% of the data are between one SD below the mean and one SD above the mean (µ±σ), 95% are within two SD of the mean (µ±2σ), and 99.7% are within three SD of the mean (µ±3σ). This is the Empirical Rule or 68–95–99.7 Rule. Caution! It’s good for normal distributions only.





BTW: You’ll notice that the 68%, 95%, and 99.7% of data occur within approximately one, two, and three SD of the mean. More accurate figures are shown in the pictures, but for now we’ll just use the simple rule of thumb. You’ll learn how to make precise computations in Chapter 7. BTW: It’s not a traditional part of the Empirical Rule, but another useful rule of thumb is that, in a normal distribution, about 50% of the data are within 2/3 of a SD above and below the mean.
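If you're curious how close the rule of thumb is, here is a short Python sketch that asks the standard library's NormalDist for the share of a normal distribution within k standard deviations of the mean. The helper name share_within is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1


def share_within(k):
    """Fraction of a normal distribution within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73

# The extra rule of thumb: about half the data within 2/3 of an SD
print(round(100 * share_within(2 / 3), 1))  # 49.5
```

So 68, 95, and 99.7 are rounded figures, and "about 50% within 2/3 of a SD" is close as well.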

Example 10: Adult women’s heights are normally distributed with µ = 65.5″ and σ = 2.5″. (By the way, different sources give different values for human heights, so don’t be surprised to see different figures elsewhere in this book.) How tall are the middle 95% of women?

Solution: The middle 95% of the normal distribution lies between two SD below and two SD above the mean. 2σ = 2×2.5 = 5″, and 65.5±5 = 60.5″ to 70.5″, so 95% of women are 60.5″ to 70.5″ tall.

Actually there are two interpretations. You can say that 95% of women are 60.5″ to 70.5″ tall, or you can say that if you randomly select one woman the probability that she’s 60.5–70.5″ tall is 95%. Any probability statement can be turned into a proportion statement, and vice versa. You’ll learn about this in Interpreting Probability Statements in Chapter 5.

Example 11: What fraction of women are 65.5″ to 68″ tall?

Solution: 68−65.5 = 2.5, so 68″ is one standard deviation above the mean. You know that 68% of a normal distribution is within µ±σ. You also know that the normal distribution is symmetric, so 68%/2 = 34% of women are within one SD below the mean, and 34% are within one SD above the mean. Therefore 34% of women are 65.5″ to 68″ tall.

You can combine the three diagrams above and show data in regions bounded by each whole number of standard deviations, like this:

BTW: Where do these figures come from? For example, how do we know that about 13.5% of the population is between one and two standard deviations below the mean in a normal distribution? Well, 95% is between two SD below and two SD above the mean. Half of 95% is 47.5%, so 47.5% of the population is between the mean and two SD below the mean. Similarly, about 68% is between one SD below and one SD above, so 68/2 = 34% is between the mean and one SD below. But if 47.5% is between µ−2σ and µ (call it Region A) and 34% is between µ−σ and µ, then the part of Region A that is not in the 34% is the part between µ−2σ and µ−σ, and that must be 47.5−34 = 13.5%. If you had an afternoon to kill, you could work out the other seven percentages.
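Or you could let a few lines of Python kill the afternoon for you. This sketch starts from nothing but the 68–95–99.7 figures and repeats the subtraction argument above for every region on one side of the mean; the other side is the mirror image. The dictionary names are made up for this illustration.

```python
# Empirical Rule: percent of data within k SD of the mean (both sides)
within = {1: 68.0, 2: 95.0, 3: 99.7}

# By symmetry, half of each figure lies on one side of the mean
one_side = {k: pct / 2 for k, pct in within.items()}

bands = {
    "mean to 1 SD": one_side[1],
    "1 SD to 2 SD": one_side[2] - one_side[1],
    "2 SD to 3 SD": one_side[3] - one_side[2],
    "beyond 3 SD": 50.0 - one_side[3],
}

for region, pct in bands.items():
    print(region, round(pct, 2))
# mean to 1 SD 34.0
# 1 SD to 2 SD 13.5
# 2 SD to 3 SD 2.35
# beyond 3 SD 0.15
```

The four figures add up to 50%, as they must for one half of a symmetric distribution.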

With this diagram, you can work Example 11 more easily, directly reading off the 34% figure for women between mean height and one SD above the mean. You can also work more complicated examples, like this one.

Example 12: If you randomly select a woman, how likely is it that she’s taller than 70.5″?

Solution: 70.5−65.5 = 5.0, so 70.5″ is two SD above the mean. From the diagram, you see that 2.35+0.15 = 2.5% of the population is more than two SD above the mean. Answer: a randomly selected woman has a 2.5% chance of being more than 70.5″ tall.

Optional: Chebyshev’s Inequality

If you have a normal distribution, the Empirical Rule tells you how much of the population is in each region. What if you don’t have a normal distribution? As you might expect, the portions of the population in the various regions depend on the shape of the distribution, but Chebyshev’s Inequality (or Chebyshev’s Rule) gives you a “worst case scenario”: no matter how skewed the distribution, at least 75% of the data are within 2 SD of the mean, and at least 89% are within 3 SD of the mean. More generally, within k SD above and below the mean, you will find at least (1−1/k²)·100% of the data. (If you plug in k = 1, you’ll find that at least 0% of the data lie within one SD of the mean. Distributions where all the data are more than one SD away from the mean are unusual, but they do exist.)

Example 13: For the quiz scores, two standard deviations is 2×3.0579 = 6.1, so you expect at least 1−1/2² = 1−¼ = 75% of the quiz scores to be within the range 9.7±6.1 = 3.6 to 15.8. Remember that this is a worst case. In fact, 14 of the 15 numbers (93%) are within those limits.
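Example 13 is easy to check by machine. The Python sketch below computes Chebyshev's guaranteed minimum for k = 2 and then counts how many quiz scores really do fall within two SD of the mean. The function name chebyshev_bound is made up for this illustration.

```python
import statistics

scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]


def chebyshev_bound(k):
    """Minimum fraction of any data set within k SD of the mean."""
    return 1 - 1 / k ** 2


mu = statistics.mean(scores)
sigma = statistics.pstdev(scores)  # population SD, as in the text

k = 2
low, high = mu - k * sigma, mu + k * sigma
inside = sum(1 for x in scores if low <= x <= high)

print(chebyshev_bound(k))               # 0.75
print(inside, "of", len(scores))        # 14 of 15
print(round(100 * inside / len(scores)))  # 93
```

The guarantee is 75% and the actual figure is 93%, which is what you'd expect from a worst-case bound.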

3D. Measures of Position Summary:

The measures of center and spread that you’ve studied are properties of the data set as a whole. Now we look at measures of position, which consider how a given data point stands in relation to the whole sample or population that it’s part of.

3D1. Percentiles Definition:

The percentile rank of a data point is the percentage of the data set that is equal to or less than the data point. We say that the data point is at the __th percentile or %ile for short. The symbol is P followed by a number. For example, P35 or P₃₅ denotes the 35th percentile, the member of the data set that is greater than or equal to 35% of the data.

Percentiles are most often used in measures of human development, like your child’s performance on standardized tests, or an infant’s length or weight. Example 14: Your daughter takes a standardized reading test, and the report says that she is in the 85th percentile for her grade. Does this make you happy or sad? Solution: 85% of her grade read as well as she does, or less well; only 15% read better than she does. Presumably this makes you happy. Example 15: Consider the data set 1, 4, 7, 8, 10, 13, 13, 22, 25, 28. (To find percentiles, you have to put the data set in order.) (a) What is the percentile rank of the number 13? Solution: There are ten numbers in the data set, and seven of those are ≤13. Seven out of ten is 70%, so the percentile rank of 13 is 70, or “13 is at the 70th percentile”, or P70 = 13. (b) Find P60 for this data set. Solution: What number is greater than or equal to 60% of the numbers in the data set? Counting up six numbers from the beginning, you find … 13 again. So 13 is both P60 and P70. (Anomalies like this are usual when you have small data sets. It really doesn’t make sense to talk about percentiles unless you have a fairly large data set, typically a population like all third graders or all six-week-old infants.) BTW: Everybody agrees on the idea of a percentile, but different authors have different ways to compute it. For example, some authors say a percentile rank is the percent of data less than the data point, instead of less than or equal to as I did. By their definition there is a 0th percentile but no 100th percentile; by my definition there is no 0th percentile but there is a 100th percentile. And some define percentiles in such a way that the percentile (like the mean) need not be a member of the data set. The different definitions can give very different answers for small data sets. Nobody worries too much about this, because in practice you seldom compute percentiles against small data sets. 
(What does “18th percentile” mean in a set of only 12 numbers?) All the definitions give pretty much the same answer for larger data sets. David Lane’s Percentiles (2010) gives three definitions of percentile and shows what difference they make. His Definition 2 is the one I use in this book.
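Here is a minimal Python sketch of the definition used in this book (less than or equal to), applied to the data of Example 15. It is one way to code that definition, not a standard library routine, and the function names are made up for this illustration. Other definitions of percentile will give different answers on a data set this small.

```python
def percentile_rank(data, value):
    """Percent of the data set that is less than or equal to value."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)


def percentile(data, k):
    """Smallest member of the data set that is >= k percent of the data."""
    for x in sorted(data):
        if percentile_rank(data, x) >= k:
            return x
    raise ValueError("k must be at most 100")


data = [1, 4, 7, 8, 10, 13, 13, 22, 25, 28]

print(percentile_rank(data, 13))  # 70.0
print(percentile(data, 60))       # 13
print(percentile(data, 70))       # 13
```

Notice that 13 comes out as both P60 and P70, the same anomaly the example points out.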

3D2. Quartiles Definitions:

The first quartile (Q1) is the member of the data set that is greater than or equal to a quarter of the data points. The third quartile (Q3) is the member of the data set that is greater than or equal to three quarters of the data points. To find quartiles by hand, put the data set in order and find the median. If you have an odd number of data points, strike out the median. Q1 is the median of the lower half, and Q3 is the median of the upper half.

One fourth is 25% and three fourths is 75%, so Q1 = P25 and Q3 = P75. (I chose a definition of percentiles that makes this happen. Some authors use different definitions, which may give slightly different results.)

What, no Q2? There is a Q2, but two quarters is one half, or 50%, so the second quartile is better known as the median: 50% of the data are less than or equal to the 50th %ile, alias Q2, alias the median.

The quartiles and the median divide the data set into four equal parts. We sometimes use the word quartile in a way that reflects this: the “bottom quartile” means the part of the data set that is below Q1, and the “upper quartile” or “top quartile” means the part of the data set that is above Q3.

Q1 and Q3 are part of the five-number summary (later in this chapter). From Measures of Spread, you already know that they’re used to find the interquartile range, and later in this chapter you’ll use the IQR to make a box-whisker plot.

BTW: Just like percentiles, quartiles are defined slightly differently by different authors. Dr. Math gives a nice, clear rundown of different ways of computing quartiles in Defining Quartiles in The Math Forum (2002). I follow Moore and McCabe’s method, which is also used by your TI-83 or TI-84.

3D3. z-Scores You’ll use z-scores more than any other measure of position. (Remember that every measure of position measures the position of one data point within the sample or population that it is part of.) Definition:

The z-score of a data point is how many standard deviations it lies above or below the mean. (A z-score is sometimes called a standard score.)

How do you find out how many SD a number is above or below the mean of its data set? You subtract the mean, and then divide the result by the SD.

z-score within a sample: z = (x − x̄) / s

z-score within a population: z = (x − µ) / σ

Either way, it’s the data value minus the mean, divided by the SD. When you compute a z-score, the top and bottom of the fraction are both in the same units as the original data, and therefore the z-score itself has no units. z-scores are pure numbers.

What good are z-scores? You’ll use them in inferential statistics, starting in Chapter 9, but you can also use them in descriptive statistics.

For one thing, a z-score gives you economy in language. Instead of saying “at least 75% of the data in any distribution must lie between two standard deviations below the mean and two standard deviations above the mean”, you can say “at least 75% of the data lie between z = ±2.”

A z-score helps you determine whether a measurement is unusual. For instance, how good is an SAT verbal score of 300? Scores on the SAT verbal are normally distributed with mean of 500 and SD of 100, so z = (300−500)/100 = −2. The Empirical Rule tells you only 2½% of students score that low or lower.

And z-scores are also good for comparing apples and oranges, as the next example shows.

Example 16: You have two candidates for an entry-level position in your restaurant kitchen. Both have been to chef school, but different schools, and neither one has any experience. Chris presents you with a final exam score of 86, and Fran with a final exam score of 67. Which one do you hire?

At first glance, you’d go with the one who has the higher score. But wait! Maybe Fran with the 67 is actually better, and just went to a tougher school. So you ask about the average scores at the two schools. Chris’s school had a mean score of 76, and Fran’s school had a mean score of 59. Assuming that the students at the two schools had equal innate ability, Fran went to a tougher school than Chris. Chris scored 10 points above the school average, while Fran scored only 8 points above the school average. Now do you hire Chris? Not yet!
Maybe there was more variability in Chris’s class, so 10 points above the average is no big deal, but there was less variability in Fran’s, so 8 points above the mean is a big deal. So you dig further and find that the standard deviations of the two classes were 8 and 4. At this point, you make a table:

                       Chris                Fran
  Candidate’s score    86                   67
  School mean          76                   59
  School SD            8                    4
  z-score              (86−76)/8 = 1.25     (67−59)/4 = 2.00

The z-scores tell you that Fran stands higher in Fran’s class than Chris stands in Chris’s class. Assuming that the two classes as a whole were of equal ability, Fran is the stronger candidate.
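The same comparison takes only a few lines of Python. This is just the z-score formula wrapped in a function (the name z_score is made up for this illustration), applied to the two candidates and to the SAT score of 300 mentioned earlier.

```python
def z_score(value, mean, sd):
    """How many standard deviations value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


chris = z_score(86, mean=76, sd=8)
fran = z_score(67, mean=59, sd=4)
sat_300 = z_score(300, mean=500, sd=100)

print(chris)    # 1.25
print(fran)     # 2.0
print(sat_300)  # -2.0
```

Because z-scores have no units, you can compare chris and fran directly even though the raw scores come from different schools.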

3E. Five-Number Summary Definition:

The five-number summary of a data set is the minimum value, Q1, median, Q3, and maximum value (in order).

The five-number summary combines measures of center (the median) and spread (the interquartile range and the range). A plot of the five-number summary, called a box-whisker diagram (below), shows you the shape of the data set.

On the TI-83 or TI-84, the five-number summary is the second output screen from 1-VarStats. Caution! Remember that the second screen is meaningful only for a simple list of numbers or an ungrouped distribution, not for a grouped distribution. To produce a five-number summary, you need all the original data points.

Example 17: Here is the second output screen from the quiz scores earlier in this chapter. The five-number summary is 2.5, 8, 10.5, 11.5, 15. The median is 10.5, meaning that half the students scored 10.5 or below and half scored 10.5 or above. The interquartile range is Q3−Q1 = 11.5−8 = 3.5. Half of the students scored between 8 and 11.5.
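If you don't have your calculator handy, here is a Python sketch that follows the by-hand recipe from the Quartiles section: sort the data, find the median, strike it out if the number of data points is odd, and take the median of each half. The function name five_number_summary is made up for this illustration. Library routines in other software may use a different quartile definition and give slightly different quartiles.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd number of points, the median itself belongs to neither half
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]
minimum, q1, median, q3, maximum = five_number_summary(scores)

print(minimum, q1, median, q3, maximum)  # 2.5 8 10.5 11.5 15
print(q3 - q1)                           # 3.5
```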

3E1. Outliers Definition:

An outlier is a data value that is well separated from most of the data. Conventionally, the values Q1−1.5×IQR and Q3+1.5×IQR (first quartile minus 1½ times interquartile range, and third quartile plus 1½ times interquartile range) are called fences, and any data points outside the fences are considered outliers.

Example 18: Here again are the quiz scores from earlier in this chapter: 10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5 Find the outliers, if any. The five-number summary, above, gave you the quartiles: Q1 = 8 and Q3 = 11.5. The interquartile range is 11.5−8 = 3.5, and 1.5 times that is 5.25. The fences are 8−5.25 = 2.75 and 11.5+5.25 = 16.75. All the data points but one lie within the fences; only 2.5 is outside. Therefore 2.5 is the only outlier in this data set. You can find outliers more easily by using your TI-83 or TI-84; see below. Why do you care about outliers? First off, an outlier might be a mistake. You should always check all your data carefully, but check your outliers extra carefully. But if it’s not a mistake, an outlier may be the most interesting part of your data set. Always ask yourself what an outlier may be trying to tell you. For example, does this quiz score represent a student who is trying but needs some extra help, or one who simply didn’t prepare for the quiz? What do you do with outliers? One thing you definitely don’t do: Don’t just throw outliers away. That can really give a false picture of the situation. But suppose you have to make some policy decision based on your analysis, or run a hypothesis test (Chapters 10 and 11) and announce whether some claim is true or false? One way is to do your analysis twice, once with the outliers and once without, and present your results in a two-column table. Anyone who looks at it can judge how much difference the outliers make. If you’re lucky, the two columns are not very different, and whatever decision must be made can be made with confidence. But maybe the two columns are so different that including or excluding the outliers leads to different decisions or actions. In that case, you may need to start over with a larger sample, change your data collection protocol, or call in a professional statistician. For more on handling outliers, see Outliers (Simon 2000d).
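The fence arithmetic in Example 18 can be sketched in Python like this. The quartiles 8 and 11.5 are taken from the five-number summary in the text rather than recomputed, and the function name fences is made up for this illustration.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 x IQR convention."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


scores = [10.5, 13.5, 8, 12, 11.3, 9, 9.5, 5, 15, 2.5, 10.5, 7, 11.5, 10, 10.5]

low_fence, high_fence = fences(q1=8, q3=11.5)
outliers = [x for x in scores if x < low_fence or x > high_fence]

print(low_fence, high_fence)  # 2.75 16.75
print(outliers)               # [2.5]
```

The code only flags the outlier. Deciding what it means, and what to do about it, is still your job.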

3E2. Box-Whisker Diagrams The five-number summary packs a lot of information, but it’s usually easier to grasp a summary through a picture if possible. A graph of the five-number summary is called a boxplot or box-whisker diagram. BTW: The box-whisker diagram was invented by John Tukey in 1970.

A box-whisker diagram has a horizontal axis, which is the number line of the data, and the number line need not start at zero. Either the axis or the chart as a whole needs a title, but there’s usually no need for a title on both. There is no vertical axis. For the graph itself, first identify any outliers and mark them as squares or crosses. Then draw a box with vertical lines at Q1, the median, and Q3. Lastly, draw whiskers from Q3 to the greatest value in the data set that isn’t an outlier, and from Q1 to the smallest value in the data set that isn’t an outlier. Example 19: Let’s look at a box-whisker plot of those same quiz scores, which were 10.5 13.5 8 12 11.3 9 9.5 5 15 2.5 10.5 7 11.5 10 10.5 The five-number summary is reproduced at right. You recall from the previous section that there is one outlier, 2.5, so the smallest number in the data set that isn’t an outlier is 5. Here’s a plot that I made with StatTools from Palisade Corporation:

Box-Whisker Plot, and Shape of a Data Set

The box-whisker plot is almost as good as a histogram for showing you the shape of a distribution. If one whisker is longer than the other, and especially if there are outliers on the same side as the long whisker, the distribution is skewed in that direction. If the whiskers are about the same length and there are no outliers, but one side of the box is longer than the other, that usually indicates skew in that direction as well.

Example 20: In the boxplot of quiz scores, just above, you see an outlier on the left side, and the left side of the box is longer than the right. That indicates that the distribution is left skewed.

Box-Whisker Plot on TI-83/84/89

You can use your TI-83 or TI-84 to make a box-whisker plot. The calculator comes with that ability (see Box-Whisker Plots on TI-83/84), but it’s easier to use MATH200A Program part 2. See Getting the Program for instructions on getting the program into your calculator. (If you have a TI-89, see Box-Whisker Plots on TI-89.)

To make a box-whisker plot with the program, begin by entering the numbers into a statistics list, such as L1. (If you have an ungrouped frequency distribution, put the numbers in one list and the frequencies in a second list. You need the original data for a boxplot, so you can’t make a boxplot of a grouped frequency distribution.)

Now press [PRGM]. If you can see MATH200A in the list, press its menu number; otherwise, use the [▲] or [▼] key to get to MATH200A, and press [ENTER]. With the program name on your home screen, press [ENTER] (again) to run the program, and yet again to dismiss the title screen. You’ll then see a menu. Press [2] for box-whisker plot.

The program asks whether you have one, two, or three samples. Select 1, since that’s what you have. The program wants to know whether you have a plain list of numbers or a grouped frequency distribution. Since you have a plain list, choose 1. The program needs to know which list holds the numbers to be plotted. Finally, the program presents the box-whisker plot.

Finding Outliers with the TI-83/84/89 or Excel

When you have a box-whisker plot on your screen, whether you used MATH200A part 2 or the calculator’s native commands, if you see any outliers press [TRACE] and then [◄] or [►] to find which data points are outliers. (For the TI-89, see Box-Whisker Plots on TI-89. If you prefer to use Excel to find outliers, see Normality Check and Finding Outliers in Excel.)

Five-Number Summary from TI-83/84/89 Boxplot

After pressing the [TRACE] key, you can get the five-number summary by pressing [◄] or [►] repeatedly. If there are outliers at the left, use the lowest one for the minimum (first number in the five-number summary); if there are outliers at the right, use the highest one for the maximum (last number in the five-number summary).

What Have You Learned? Overview:

With numeric data, the goal of descriptive stats is to show shape, center, spread, and outliers.

Key ideas:

Measures of center: mean, median, mode; “resistant”. When is mean better, and when is median more representative?
Measures of spread: range, variance, standard deviation (SD). Know the advantages and disadvantages of each.
Interpreting standard deviation in a normal distribution with the Empirical Rule (68–95–99.7 Rule).
Most important measure of position: z-score. Use formulas to find z-score from raw score or vice versa. Be able to interpret z-scores.
Other measures of position: percentiles and quartiles. Interquartile range (IQR).
Finding mean, SD, and five-number summary with the TI-83 or TI-84. You can do this for a simple list of numbers, for an ungrouped distribution, or for a grouped distribution. The five-number summary of a grouped distribution is not meaningful.
Meaning of outliers and what to do when they occur.
Boxplot shows outliers if any, plus the five-number summary. (A boxplot isn’t meaningful for a grouped distribution.)
Weighted mean.
Seeing shape of a distribution in its boxplot, or from the relationship between mean and median.

Study aids:

TI-83/84 Cheat Sheet

Statistics Symbol Sheet How to Read a Math Book How to Work a Math Problem How to Take a Math Test or Quiz


Exercises for Chapter 3 Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand. Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1. When is the mean not the best choice for a measure of center? What would you use instead?

2. Your doctor tells you that you’re in the 15th percentile for cholesterol. Should you be concerned, or should you go out and celebrate with bacon-wrapped shrimp? (Give your reason, not just an answer.)

3. Consider these questions about measures of spread: (a) What’s the biggest problem with the range? (b) What makes the interquartile range a better measure of spread? (c) Why is the variance better than both? (d) What makes the standard deviation (SD) better than the variance? (Give two reasons.)

4. Contrast (a) s and σ, (b) µ and x̄, (c) N and n.

5. Weights of apples (of a particular type) are normally distributed. In a large shipment, you find that nearly all the apples weigh between 4.50 and 8.50 ounces. Estimate the SD of the weight of apples in that shipment.

6. Your smart-alecky statistics prof distributes quiz results as z-scores rather than raw scores. It’s a large class, and quiz scores were normally distributed. Your z-score was +1.87. How did you do relative to the class?

7. The grouped frequency distribution below is the ages reported by a sample of Roman Catholic nuns, from Johnson and Kuby (2004, 67). (a) Approximate the mean and SD of the ages of these nuns, to two decimal places, and find the sample size. (b) Explain why a boxplot of this distribution is a bad idea.

  Ages       Frequency
  20 – 29    34
  30 – 39    58
  40 – 49    76
  50 – 59    187
  60 – 69    254
  70 – 79    241
  80 – 89    147

8. You took the courses shown below. On the usual scale of A = 4.0, A− = 3.7, B+ = 3.3, and so forth, compute your GPA. (Your GPA, grade point average, is the average of your course grades, weighted by number of credits in each course.)

  Course            Credits    Grade
  Statistics        3          A
  Calculus          4          B+
  Microsoft Word    1          C−
  Microbiology      3          B−
  English Comp      3          C

9. Your prof has a policy that you can skip the final exam if your quiz average is 87% or better. After ten quizzes, your average is 86%. One quiz remains. Is it still possible for you to skip the final, and if so what percentage score do you need on that last quiz?

10. In a GM factory in Brazil, 25 workers were asked their commuting distance in kilometers. The data, from Dabes and Janik (1999, 8), are shown below.

  Commuting Distances in km
   5  15  23  12   9
  12  22  26  31  21
  11  19  16  45  12
   8  26  18  17   1
  16  24  15  20  17

(a) Construct a grouped frequency distribution for 0–9 km, 10–19 km, and so on. (You made a stemplot for a homework problem in Chapter 2, so use that answer to save yourself some work.)
(b) What is the class width? What are the class midpoints?
(c) Use your grouped distribution to approximate the mean and SD of the commuting distances.
(d) Now compute the mean, median, and SD of the original data set.
(e) Construct a box-whisker plot from the original data set. (Never make a boxplot from grouped data.) Suggestion: do it on your calculator, then transfer it to paper where you have already drawn and labeled the number line.
(f) Which is the most appropriate measure of center for this sample? Why?
(g) Give the five-number summary and identify any outliers.

11. SAT verbal scores are normally distributed, with a mean of 500 and SD of 100. You randomly select a test taker. What’s the probability that s/he scored between 500 and 700?

12. Mensa, the largest high-IQ society, accepts SAT scores as indicating intelligence. Assume that the mean combined SAT score is 1500, with SD 300. Jacinto scored a combined 2070. Maria took a traditional IQ test and scored 129. On that test, the mean is 100 and the SD is 15. From the test scores, who is more intelligent? Explain.

13. Below is a sample shown as a grouped frequency distribution. Compute the following quantities and label each with its proper symbol: (a) sample size, (b) mean, (c) standard deviation. Round to two decimal places. Use any valid method, but show your work. (Begin by filling in the third column including column heading.)

  Test Scores      Frequencies, f    ________
  470.0–479.9      15                ________
  480.0–489.9      22                ________
  490.0–499.9      29                ________
  500.0–509.9      50                ________
  510.0–519.9      38                ________

14. In a particular data set (continuous data), the mean is around 8700 and the median is around 5000. What if anything can you say about the shape of the distribution?




What’s New 1 Feb 2015: Add references to TI-89 boxplots and using Excel to find outliers. 11 Jan 2015: New section: Five-Number Summary from TI-83/84 Boxplot; add links and study aids in What Have You Learned? (intervening changes suppressed) 3 June 2013: New document.

4. Linked Variables Updated 11 Jan 2015 (What’s New?) Intro:

When you get two numbers from each member of the sample (bivariate numeric data), you make a plot to look for a relationship between them. If a straight line seems like a good fit for the plotted points, we say that they follow a linear model. In this chapter, you’ll learn when to use a linear model, and how to find the best one.

Contents:

4A. Mathematical Models
4B. Scatterplot, Correlation, and Regression on TI-83/84
    Step 0. Setup
    Step 1. Make the Scatterplot
    Step 2. Perform the Regression · Correlation Coefficient, r · Regression Line, ŷ = ax+b · Coefficient of Determination, R²
    Step 3. Display the Regression Line
    Optional: Display the Residuals · Residual Plot Showing Problems · Optional Advanced: Residuals and R²
4C. Finding ŷ from a Regression on TI-83/84
    Method 1: Trace on the Regression Line Graph (preferred) · Extrapolation: Just Say No (Usually)
    Method 2: Use Calculated Regression Equation (if necessary)
    Finding Residuals
4D. Decision Points for Correlation Coefficient
    Procedure
    Examples
    Interpretation
4E. Optional: Scatterplot, Correlation, and Regression in Excel
    Plot the Points
    Show the Regression Line
    Show the Correlation Coefficient
    Predict the Average y
What Have You Learned?
Exercises for Chapter 4
What’s New

4A. Mathematical Models The chapter intro talks about points following a “linear model”. But what is a linear model, and what does it mean to follow one? Well, since a linear model is one kind of mathematical model, let’s talk a little bit about mathematical models. You know what a model is in general, right? A copy of the original, usually smaller and with unimportant details left out. Think of model airplanes, or architect’s models of buildings. A mathematical model is like that. Real Life is Complicated,™ and mathematical models help us manage those complications. Definition:

A mathematical model is a mathematical description of something in the real world. An object or process or data set follows a model if the calculations you do with the model match reality closely enough to be useful.

You’ve already met one model in Chapter 3: the grouped frequency distribution. Instead of dealing with all the data points, you do calculations using the midpoint of each class. That gives you approximate mean and SD, but the approximation is close enough to be useful. The MathIsFun site has a nice example of modeling the space inside a cardboard box, going beyond the h×w×l formula; see Mathematical Models. You’ll meet plenty more models in this book: probability models in Chapter 5, several discrete models in Chapter 6, and the normal model in Chapter 7. But in this chapter we’re concerned with the linear model. Definition:

The linear model uses the linear equation y = ax+b to model the relationship between two numeric variables x and y. In any particular model, a and b are constants. Because the graph of y = ax+b is a straight line, we can also call it a straight-line model, and we say that x and y have a straight-line relationship in the model.

The linear model is a good one if it describes the data well enough to let you make useful calculations.

4B. Scatterplot, Correlation, and Regression on TI-83/84 Summary:

When you have a set of (x,y) data points and want to find the best equation to describe them, you are performing a regression. You will learn how to find the strength of the association between your two variables (correlation coefficient), and how to find the line of best fit (least squares regression line). Usually you have some idea that your x variable can help predict your y variable, so you call x the explanatory variable and y the response variable. (Other names are independent variable and dependent variable.)

See also:

A separate version of these instructions for the TI-89 Scatterplot, Correlation, and Regression in Excel

4B1. Step 0. Setup Set floating point mode, if you haven’t already.

[MODE] [▼] [ENTER]

Go to the home screen

[2nd MODE makes QUIT] [CLEAR]

Turn on diagnostics with the [DiagnosticOn] command.

[2nd 0 makes CATALOG] [x⁻¹] Don’t press the [ALPHA] key, because the CATALOG command has already put the calculator in alpha mode. Scroll down to DiagnosticOn and press [ENTER] twice.

The calculator will remember these settings when you turn it off: next time you can start with Step 1.

4B2. Step 1. Make the Scatterplot Before you even run a regression, you should first plot the points and see whether they seem to lie along a straight line. If the distribution is obviously not a straight line, don’t do a linear regression. (Some other form of regression might still be appropriate, but that is outside the scope of this course.)

Let’s use this example from Sullivan (2011, 179): the distance a golf ball travels versus the speed with which the club head hit it.

Club-head speed, mph (x)   100  102  103  101  105  100   99  105
Distance, yards (y)        257  264  274  266  277  263  258  275

Turn off other plots.

[Y=] Cursor to each highlighted = sign or Plot number and press [ENTER] to deactivate.

Set the format screen.

Press [2nd ZOOM makes FORMAT]. Just select everything in the left column.

Enter the numbers in two statistics lists.

[STAT] [1] selects the list-edit screen. Cursor onto the label L1 at top of first column, then [CLEAR] [ENTER] erases the list. Enter the x values. Cursor onto the label L2 at top of second column, then [CLEAR] [ENTER] erases the list. Enter the y values.

Set up the scatterplot.

[2nd Y= makes STAT PLOT] [1] [ENTER] turns Plot 1 on. [▼] [ENTER] selects scatterplot. [▼] [2nd 1 makes L1] ties list 1 to the x axis. [▼] [2nd 2 makes L2] ties list 2 to the y axis. (Leave the square as the selected mark for plotting.)

Plot the points.

[ZOOM] [9] automatically adjusts the window frame to fit the data.

BTW: I have the grid turned on in some of these pictures, but earlier I told you to turn it off. That’s simplest. If you want the grid, you can turn it on, but then you’ll have to adjust the grid spacing for almost every plot. To adjust grid spacing, press [WINDOW], set Xscl and Yscl to appropriate values for your data, and press [GRAPH] to see the result.

Check your data entry by tracing the points.

[TRACE] shows you the first (x,y) pair, and then [►] shows you the others. They’re shown in the order you entered them, not necessarily from left to right.

A scatterplot on paper needs labels (numbers) and titles on both axes; the x and y axes typically won’t start at 0. Here’s the plot for this data set. (The horizontal lines aren’t needed when you plot on graph paper.)

When the same (x,y) pair occurs multiple times, plot the extra ones slightly offset. This is called jitter. In the example at the right, the point (6,6) occurs twice.

If the data points don’t seem to follow a straight line reasonably well, STOP! Your calculator will obey you if you tell it to perform a linear regression, but if the points don’t actually fit a straight line then it’s a case of “garbage in, garbage out.” For instance, consider this example from DeVeaux, Velleman, Bock (2009, 179). This is a table of recommended f/stops for various shutter speeds for a digital camera:

Shutter speed (x)   1/1000  1/500  1/250  1/125  1/60  1/30  1/15  1/8
f/stop (y)          2.8     4      5.6    8      11    16    22    32

If you try plotting these numbers yourself, enter the shutter speeds as fractions for accuracy: don’t convert them to decimals yourself. The calculator will show you only a few decimal places, but it maintains much greater precision internally. You can see from the plot at right that these data don’t fit a straight line. There is a distinct bend near the left. When you have anything with a curve or bend, linear regression is wrong. You can try other forms of regression in your calculator’s menu, or you can transform the data as described in DeVeaux, Velleman, Bock (2009, ch 10) and other textbooks.

4B3. Step 2. Perform the Regression Set up to calculate statistics.

[STAT] [►] [4] pastes LinReg(ax+b) to the home screen.



[2nd 1 makes L1] [,] [2nd 2 makes L2] defines L1 as x values and L2 as y values. If you have the “wizard” interface, leave FreqList blank, or press [DEL] if something is already filled in.

Set up to store regression equation.

[,] [VARS] [►] [1] [1] pastes Y1 into the LinReg command.

Show your work! Write down the whole command — LinReg(ax+b) L1,L2,Y1 in this case, not just LinReg or LinReg(ax+b).

Press [ENTER]. The calculator shows correlation and regression statistics and pastes the regression equation into Y1.

Your input screen should look like this, for the “wizard” and non-wizard interfaces:

Write down the slope a, the y intercept b, the coefficient of determination R², and the correlation coefficient r. (A decent rule of thumb is four decimal places for slope and intercept, and two for r and R².)

a = 3.1661, b = −55.7966
R² = 0.88, r = 0.94

Now let’s take a look in depth at each of those.

Correlation Coefficient, r

Look first at r, the coefficient of linear correlation. r can range from −1 to +1 and measures the strength of the association between x and y. A positive correlation or positive association means that y tends to increase as x increases, and a negative correlation or negative association means that y tends to decrease as x increases. The closer r is to 1 or −1, the stronger the association. We usually round r to two decimal places.

“Several sets of (x,y) [pairs], with the correlation coefficient for each set. Note that correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).” source: Correlation and Dependence

BTW: Karl Pearson developed the formula for the linear correlation coefficient in 1896. The symbol r is due to Sir Francis Galton in 1888.

For real-world data, 0.94 is a pretty strong correlation. But you might wonder whether there’s actually a general association between club-head speed and distance traveled, as opposed to just the correlation that you see in this sample. Decision Points for Correlation Coefficient, later in this chapter, shows you how to answer that question.

BTW: Though nobody ever computes r by hand any more, the formula explains the properties of r. Here are two equivalent forms:

r = [1/(n−1)] ∑ z_x·z_y
r = ∑(x−x̄)(y−ȳ) / [(n−1)·s_x·s_y]

In the first form, you compute the z score of each x within just the x’s and the z score of each y within just the y’s. The second formula is easier if you already have the means and SD of the x’s and y’s. For the meaning of ∑, see ∑ Means Add ’em Up in Chapter 1.

z-scores are pure numbers without units, and therefore r also has no units. You can interchange the x’s and y’s in the formula without changing the result, and therefore r is the same regardless of which variable is x and which is y.

Why is r positive when data points trend up to the right and negative when they trend down to the right? The product (x−x̄)(y−ȳ) explains this. When points trend up to the right, most are in the lower left and upper right quadrants of the plot. In the lower left, x and y are both below average, x−x̄ and y−ȳ are both negative, and the product is positive. In the upper right, x and y are both above average, x−x̄ and y−ȳ are both positive, and again the product is positive. The product is positive for most points, and therefore r is positive when the trend is up to the right.

On the other hand, if the data trend down to the right, most points are in the upper left (where x is below average and y is above average, x−x̄ is negative, y−ȳ is positive, and the product is negative) and the lower right (where x−x̄ is positive, y−ȳ is negative, and the product is also negative). Since the product is negative for most points, r is negative when data trend down to the right.
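As a rough check on the properties just described, here is a minimal Python sketch that computes r for the golf data in two ways: from products of z-scores, and from products of deviations divided by (n−1) times the two sample SDs. The function names are illustrative only, and the sketch assumes sample standard deviations (divisor n−1) throughout.

```python
from statistics import mean, stdev

# Golf data from this chapter's example.
speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards


def r_from_z_scores(xs, ys):
    """Sum of products of z-scores, divided by n-1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def r_from_deviations(xs, ys):
    """Sum of (x - x_bar)(y - y_bar), divided by (n-1) * s_x * s_y."""
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / ((len(xs) - 1) * stdev(xs) * stdev(ys))


r_z = r_from_z_scores(speed, distance)
r_dev = r_from_deviations(speed, distance)
r_swapped = r_from_z_scores(distance, speed)  # x and y interchanged

print(f"r (z-score form)    = {r_z:.2f}")
print(f"r (deviation form)  = {r_dev:.2f}")
print(f"r with x, y swapped = {r_swapped:.2f}")
```

All three values should agree, which illustrates that r does not depend on which variable you call x.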

Be careful in your interpretation! No matter how strong your r might be, say that changes in the y variable are associated with changes in the x variable, not “caused by” it. Correlation is not causation is your mantra. It’s easy to think of associations where there is no cause. For example, if you make a scatterplot of US cities with x as number of books in the public library and y as number of murders, you’ll see a positive association: number of murders tends to be higher in cities with more library books. Does that mean that reading causes people to commit murder, or that murderers read more than other people? Of course not! There is a lurking variable here: population of the city. When you have a positive or negative association, there are four possibilities: x might cause changes in y, y might cause changes in x, lurking variables might cause changes in both, or it could just be coincidence, a random sample that happens to show a strong association even though the population does not.

used by permission; source: http://xkcd.com/552/ (accessed 2014-09-15)

BTW: If correlation is not causation, then how can we establish causation? For example, how do we know that smoking causes lung cancer in humans? Obviously we can’t perform an experiment, for ethical reasons. Sir Austin Bradford Hill laid down nine criteria for establishing causation in a 1965 paper, The Environment and Disease: Association or Causation? Short summaries of the “Bradford Hill criteria” are in many places on the Web, including Steve Simon’s (2000b) Causation.

Regression Line, ŷ = ax+b

Write the equation of the line using ŷ (“y-hat”), not y, to indicate that this is a prediction. b is the y intercept, and a is the slope. Round both of them to four decimal places, and write the equation of the line as

ŷ = 3.1661x − 55.7966

(Don’t write 3.1661x + −55.7966.)

These numbers can be interpreted pretty easily. Business majors will recognize them as intercept = fixed cost and slope = variable cost, but you can interpret them in non-business contexts just as well.

The slope, a or b₁ or m, tells how much ŷ increases or decreases for a one-unit increase in x. In this case, your interpretation is “the ball travels about an extra 3.17 yards when the club speed is 1 mph greater.” The slope and the correlation coefficient always have the same sign. (A negative slope would mean that y decreases that many units for every one-unit increase in x.)

The intercept, b or b₀, says where the regression line crosses the y axis: it’s the value of ŷ when x is 0. Be careful! The y intercept may or may not be meaningful. In this case, a club-head speed of zero is not meaningful. In general, when the measured x values don’t include 0 or don’t at least come pretty close to it, you can’t assign a real-world interpretation to the intercept. In this case you’d say something like “the intercept of −55.7966 has no physical interpretation because you can’t hit a golf ball at 0 mph.”

Here’s an example where the y intercept does have a physical meaning. Suppose you measure the gross weight of a UPS truck (y) with various numbers of packages (x) in it, and you get the regression equation ŷ = 2.17x+2463. The slope, 2.17, is the average weight per package, and the y intercept, 2463, is the weight of the empty truck.

BTW: The slope (a or m or b₁) and y intercept (b or b₀) of the regression line can be calculated from formulas, if you have a lot of time on your hands:

For the meaning of ∑, see ∑ Means Add ’em Up in Chapter 1. Traditionally, calculus is used to come up with those equations, but all that’s really necessary is some algebra. See Least Squares — the Gory Details if you’d like to know more. The second formula for the slope is kind of neat because it connects the slope, the correlation coefficient, and the SD of the two variables.
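For readers who want to see the arithmetic, here is a hedged Python sketch of the standard least-squares formulas applied to the golf data. It assumes the usual textbook forms (slope = ∑(x−x̄)(y−ȳ) / ∑(x−x̄)², equivalently r·s_y/s_x, and intercept = ȳ − slope·x̄); the variable names are made up for this illustration and the exact form of the formulas in the original figures may differ.

```python
from statistics import mean, stdev

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards

x_bar, y_bar = mean(speed), mean(distance)
s_x, s_y = stdev(speed), stdev(distance)

# Sum of squared x deviations and sum of cross-products.
s_xx = sum((x - x_bar) ** 2 for x in speed)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))

# Correlation coefficient, needed for the second slope formula.
r = s_xy / ((len(speed) - 1) * s_x * s_y)

slope_from_sums = s_xy / s_xx                  # first form
slope_from_r = r * s_y / s_x                   # second form: r times ratio of SDs
intercept = y_bar - slope_from_sums * x_bar    # line passes through (x_bar, y_bar)

print(f"slope (sums)  = {slope_from_sums:.4f}")
print(f"slope (r, SD) = {slope_from_r:.4f}")
print(f"intercept     = {intercept:.4f}")
```

Both slope computations should match the calculator’s a, and the intercept should match its b, to four decimal places.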

Coefficient of Determination, R²

The last number to look at (third on the screen) is R², the coefficient of determination. (The calculator displays r², but the capital letter is standard notation.) R² measures the quality of the regression line as a means of predicting ŷ from x: the closer R² is to 1, the better the line. Another way to look at it is that R² measures how much of the total variation in y is predicted by the line. In this case R² is about 0.88, so your interpretation is “about 88% of the variation in distance traveled is associated with variation in club-head speed.”

Statisticians say that R² tells you how much of the variation in y is “explained” by variation in x, but if you use that word remember that it means a numerical association, not necessarily a cause-and-effect explanation. It’s best to stick with “associated” unless you have done an experiment to show that there is cause and effect.

There’s a subtle difference between r and R², so keep your interpretations straight. r talks about the strength of the association between the variables; R² talks about what part of the variation in the y variable is associated with variation in the x variable, and how well the line predicts y from x. Don’t use any form of the word “correlated” when interpreting R².

Only linear regression will have a correlation coefficient r, but any type of regression (fitting any line or curve to a set of data points) will have a coefficient of determination R² that tells you how well the regression equation predicts y from the independent variable(s). Steve Simon (1999b) gives an example for non-linear regression in R-squared.

BTW: In straight-line regression, R² is the square of r, so if you want a formula just compute r and square the result.

4B4. Step 3. Display the Regression Line Show line with original data points.

[GRAPH]

What is this line, exactly? It’s the one unique line that fits the plotted points best. But what does “best” mean? For each plotted point, there is a residual equal to y−ŷ, the difference between the actual measured y for that x and the value predicted by the line. Residuals are positive if the data point is above the line, or negative if the data point is below the line. You can think of the residuals as measures of how bad the line is at prediction, so you want them small. For any possible line, there’s a “total badness” equal to taking all the residuals, squaring them, and adding them up. The least squares regression line means the line that is best because it has less of this “total badness” than any other possible line. Obviously you’re not going to try different lines and make those calculations, because the formulas built into your calculator guarantee that there’s one best line and this is it.

[Figure: The same four points on left and right. The vertical distance from each measured data point to the line, y−ŷ, is called the residual for that x value. The line on the right is better because the residuals are smaller. source: Dabes and Janik (1999, 179)]

BTW: Carl Friedrich Gauss developed the method of least squares in a paper published in 1809.
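To make “total badness” concrete, the following Python sketch adds up the squared residuals for the least-squares line on the golf data and for a few other lines. The competitor lines are arbitrary choices made up for this illustration; any line other than the least-squares line should give a larger total.

```python
from statistics import mean

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards


def fit_line(xs, ys):
    """Least-squares slope and intercept for y-hat = a*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = s_xy / s_xx
    return a, y_bar - a * x_bar


def total_badness(a, b, xs, ys):
    """Sum of squared residuals (y - y_hat) for the line y-hat = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


a, b = fit_line(speed, distance)
best = total_badness(a, b, speed, distance)

# Hypothetical competitor lines: nudge the slope and/or the intercept.
competitors = [(a + 0.2, b), (a, b + 1.0), (a - 0.2, b + 20.0)]
others = [total_badness(ca, cb, speed, distance) for ca, cb in competitors]

print(f"least-squares line: {best:.2f}")
for (ca, cb), badness in zip(competitors, others):
    print(f"a = {ca:.4f}, b = {cb:.4f}: {badness:.2f}")
```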

4B5. Optional: Display the Residuals

I would like you to know the material in this section, but it's not part of the MATH200 syllabus so I don’t require it. No homework or quiz problems will draw from this section. You will, however, need to calculate individual residuals; see Finding Residuals, below.

“No regression analysis is complete without a display of the residuals to check that the linear model is reasonable.” DeVeaux, Velleman, Bock (1999, 227)

The residuals are automatically calculated during the regression. All you have to do is plot them on the y axis against your existing x data. This is an important final check on your model of the straight-line relationship.

Turn off other plots.

Press [Y=]. Cursor to the highlighted = sign next to Y1 and press [ENTER]. Cursor to PLOT1 and press [ENTER].

Set up the plot of residuals against the x data.

Set up Plot 2 for the residuals. Press [2nd Y= makes STAT PLOT] [▼] [ENTER] [ENTER] to turn on Plot 2. Press [▼] [ENTER] to select a scatterplot. The x’s are still in L1, so press [2nd 1 makes L1] [ENTER]. In this plot, the y’s will be the residuals: press [2nd STAT makes LIST], cursor up to RESID, and press [ENTER] [ENTER].

Display the plot.

[ZOOM] [9] displays the plot.

You want the plot of residuals versus x to be “the most boring scatterplot you’ve ever seen.” (DeVeaux, Velleman, Bock 2009, 203) “It shouldn’t have any interesting features, like a direction or shape. It should stretch horizontally, with about the same amount of scatter throughout. It should show no bends, and it should have no outliers. If you see any of these features, find out what the regression model missed.” Don’t worry about the size of the residuals, because [ZOOM] [9] adjusts the vertical scale so that they take up the full screen. If the residuals are more or less evenly distributed above and below the axis and show no particular trend, you were probably right to choose linear regression. But if there is a trend, you have probably forced a linear regression on non-linear data. If your data points looked like they fit a straight line but the residuals show a trend, it probably means that you took data along a small part of a curve. Here there is no bend and there are no outliers. The scatter is pretty consistent from left to right, so you conclude that distance traveled versus club-head speed really does fit the straight-line model. Residual Plot Showing Problems Refer back to the scatterplot of f/stop against shutter speed. I said then that it was not a straight line, so you could not do a linear regression. If you missed the bend in the scatterplot and did a regression anyway, you’d get a correlation coefficient of r = 0.98, which would encourage you to rely on the bad regression. But plotting the residuals (at right) makes it crystal clear that linear regression is the wrong type for this data set. This is a textbook case (which is why it was in a textbook): there’s a clear curve with a bend, variation on both sides of the x axis is not consistent, and there’s even a likely outlier. 
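The f/stop case can be checked numerically as well. This Python sketch fits a straight line to the shutter-speed data anyway and prints r and the residuals; it is only an illustration of the point above, with variable names chosen for this example. Look at the pattern of signs in the residuals rather than their size.

```python
from statistics import mean, stdev

# Shutter speeds entered as fractions, as recommended earlier.
shutter = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]   # x
f_stop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]                        # y

x_bar, y_bar = mean(shutter), mean(f_stop)
s_xx = sum((x - x_bar) ** 2 for x in shutter)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(shutter, f_stop))

# Force a straight-line fit even though the scatterplot bends.
a = s_xy / s_xx
b = y_bar - a * x_bar
r = s_xy / ((len(shutter) - 1) * stdev(shutter) * stdev(f_stop))

residuals = [y - (a * x + b) for x, y in zip(shutter, f_stop)]

print(f"r = {r:.2f}")
for x, e in zip(shutter, residuals):
    print(f"x = {x:.4f}  residual = {e:+.2f}")
```

A run of negative residuals, then a run of positive ones, then negative again is the kind of trend that signals a curve, even when r looks impressive.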
Optional Advanced: Residuals and R²

I said in Step 2 that the coefficient of determination measures the variation in measured y that’s associated with the variation in measured x. Now that you understand the residuals, I can make that statement more precise and perhaps a little easier to understand.

The set of measured y values has a spread, which can be measured by the standard deviation or the variance. It turns out to be useful to consider the variation in y’s as their variance. (You remember that the variance is the square of the standard deviation.)

The total variance of the measured y’s has two components: the so-called “explained” variation, which is the variation along the regression line, and the “unexplained” variation, which is the variation away from the regression line. The “explained” variation is simply the variance of the ŷ’s, computing ŷ for every x, and the “unexplained” variation is the variance of the residuals. Those two must add up to the total variance of the measured y’s, which means that as percentages of the variation in y they must add to 100%. So R² is the percent of “explained” variation in the regression, and 100%−R² is the percent of “unexplained” variation:

R² = s²ŷ / s²y and 1 − R² = s²e / s²y

Now I can restate what you learned in Step 2. R² is 88% because 88% of the variance in y is associated with the regression line, and the other 12% must therefore be the variance in the residuals.

This isn’t hard to verify: do a 1-VarStats on the list of measured y’s and square the standard deviation to get the total variance in y, s²y = 59.93. Then do 1-VarStats on the residuals list and square the standard deviation to get the “unexplained” variance, s²e = 7.12. The ratio of those is 7.12/59.93 = 0.12, which is 1−R². Expressing it as a percentage gives 100%−R² = 12%, so 12% of the variation in measured y’s is “unexplained” (due to lurking variables, measurement error, etc.).
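The same verification can be sketched in Python instead of 1-VarStats. This is an illustration under the assumption that sample variances (divisor n−1) are used for all three lists; the variable names are invented for the example.

```python
from statistics import mean, variance

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards

# Least-squares line, computed at full precision.
x_bar, y_bar = mean(speed), mean(distance)
s_xx = sum((x - x_bar) ** 2 for x in speed)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, distance))
a = s_xy / s_xx
b = y_bar - a * x_bar

predicted = [a * x + b for x in speed]                       # the y-hats
residuals = [y - y_hat for y, y_hat in zip(distance, predicted)]

var_y = variance(distance)       # total variance of measured y's
var_pred = variance(predicted)   # "explained" variance
var_resid = variance(residuals)  # "unexplained" variance

print(f"total       = {var_y:.2f}")
print(f"explained   = {var_pred:.2f}")
print(f"unexplained = {var_resid:.2f}")
print(f"R^2         = {var_pred / var_y:.2f}")
print(f"1 - R^2     = {var_resid / var_y:.2f}")
```

The explained and unexplained variances should add up to the total, apart from floating-point rounding.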

4C. Finding ŷ from a Regression on TI-83/84 Summary:

The regression line represents the model that best fits the data. One important reason for doing the regression in the first place is to answer the question, what average y value does the model predict for a given x? This page shows you two methods of answering that question.

See also:

A separate version of these instructions for the TI-89 A separate version of these instructions for Excel

4C1. Method 1: Trace on the Regression Line Graph (preferred) You can make predictions while examining the graph of the regression line on the TI-83/84 or TI-89. Advantages to this method: aside from being pretty cool, it avoids rounding errors, and it’s very fast for multiple predictions. Activate tracing on the regression line.

[TRACE]

Look in the upper left corner to make sure that the regression equation is displayed.

If you see P:L1,L2, press [▲] to display the regression equation.

Enter the x value.

Press the black-on-white numeric keys including [(−)] and decimal point if needed. As soon as you press the first number, you’ll see a large X= appear at the bottom left of the screen. Enter any additional digits and press [ENTER]. The TI-83/84 displays the predicted average y value (ŷ) at the bottom right and puts a blinking cursor at that point on the regression line.

Caution:

ŷ = 267.1 yards is the predicted or expected average distance for a club-head speed of 102 mph. But that does not mean any particular golf ball hit at that speed will travel that exact distance. You can think of ŷ as the average travel distance that you’d expect for a whole lot of golf balls hit at that speed.

Extrapolation: Just Say No (Usually)

Caution: A regression equation is valid only within the range of actual measured x values, and a little way left and right of that range. If you try to go too far outside the valid range, the calculator will display ERR:INVALID. It’s not just being cranky. The line describes the points you measured, so it’s usable between your minimum and maximum x values and maybe a little way outside those limits. But unless you have very solid reasons why the same straight-line model is good beyond that range, you can’t extrapolate. Take a look at this graph of men’s and women’s winning times in the Olympic 100-meter dash from 1928 to 2012, which I made from data compiled by Mike Rosenbaum. (The women’s 100 m dash became an Olympic event in 1928.)

From this you can reasonably guess that if women had run in the 1924 Olympics, the winner would have finished in around 12.2 or 12.3 seconds. And the 2016 winner will probably finish in around 11.5 seconds. But the further you go outside your measured data, the riskier your predictions.

Will men’s and women’s times generally continue to decrease? Probably: training will get better, nutrition will improve, global communications will make it less likely that a stellar runner goes undiscovered. But will the decrease follow a straight line? Certainly not! Think about it for a minute. If times keep decreasing on a straight line, eventually they’ll cross the x axis and go negative. Runners will finish the race before they start it! So obviously the straight-line model breaks down; the only question is where. You don’t know, and you can’t know. All you know is that it’s not safe to extrapolate.

Bogus extrapolations give statistics a bad name and make people say “you can prove anything with statistics.” Here’s an example. I’ve just extended the two trend lines to “prove” that after the 2160 Olympics women will run the 100 meters faster than men. Pretty clearly, the linear model breaks down before then.

It’s not safe to extrapolate to earlier times, either. The intercepts tell you that in the year zero, the fastest man in the world took 31.6 seconds to run 100 m, and the fastest woman took 44.7 seconds. Does that seem believable?

4C2. Method 2: Use Calculated Regression Equation (if necessary)

But what if you don’t still have the regression line on your calculator, for instance if you’ve done a different regression? In that case, you can go back to your written-down regression equation and plug in the desired x value.

Advantage of this method: You already know how to substitute into equations. Disadvantages: depending on the specific numbers involved, you may introduce rounding errors. Also, since you’re entering more numbers there’s an increased chance of entering a number wrong.

Example: To find the predicted average y value for x = 102, go back to the regression equation that you wrote down, and substitute 102 for x:

ŷ = 3.1661x − 55.7966
ŷ = 3.1661*102 − 55.7966
ŷ = 267.1456 → 267.1

In this example, the rounding error was very small, and it disappeared when you rounded ŷ to one decimal place. But there will be problems where the rounding error is large enough to affect the final answer, so always use the trace method if you can.

Again, please observe the Cautions above. With this method, the calculator won’t tell you when your x value is outside a reasonable range, so you need to be aware of that issue yourself.
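If you do the substitution in software rather than by hand, you can build the range check into the calculation. The Python sketch below is one possible way to do that, not a standard routine: the function name predict and the margin parameter (how far outside the measured x values you are willing to go) are assumptions made for this illustration.

```python
# Written-down regression equation from the golf example.
SLOPE = 3.1661
INTERCEPT = -55.7966
# Smallest and largest measured club-head speeds, in mph.
X_MIN, X_MAX = 99, 105


def predict(x, margin=1.0):
    """Return y-hat for x, refusing to extrapolate far outside the data.

    `margin` is a hypothetical allowance for "a little way" outside the
    measured range; choose it from your knowledge of the situation.
    """
    if x < X_MIN - margin or x > X_MAX + margin:
        raise ValueError(f"x = {x} is outside the measured range "
                         f"{X_MIN} to {X_MAX}; don't extrapolate")
    return SLOPE * x + INTERCEPT


y_hat = predict(102)
print(f"predicted average distance at 102 mph: {y_hat:.1f} yards")

# A club-head speed far outside the data is rejected, not predicted.
try:
    predict(150)
    blocked = False
except ValueError as err:
    blocked = True
    print(err)
```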

4C3. Finding Residuals Each measured data point has an associated residual, defined as y−ŷ, the distance of the point above or below the line. To find a residual, the actual y comes from the original data, and the predicted average ŷ comes from one of the methods above. Example: Find the residual for x = 102. Solution: From the original data, y = 264. From either of the methods above, ŷ = 267.1. Therefore the residual is y−ŷ = 264−267.1 = −3.1 yards. If a given x value occurs in more than one data point, you have multiple residuals for that x value.
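A short Python sketch of the same residual calculation follows, using the written-down regression equation. The helper name residuals_for is invented for this example. It also shows the last point above: x = 100 appears twice in the data, so it has two residuals.

```python
# Written-down regression equation from the golf example.
SLOPE, INTERCEPT = 3.1661, -55.7966

speed = [100, 102, 103, 101, 105, 100, 99, 105]        # x, mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]    # y, yards


def residuals_for(x_value):
    """All residuals y - y_hat for data points whose x equals x_value."""
    y_hat = SLOPE * x_value + INTERCEPT
    return [y - y_hat for x, y in zip(speed, distance) if x == x_value]


at_102 = residuals_for(102)
at_100 = residuals_for(100)

print("x = 102:", [f"{e:+.1f}" for e in at_102])
print("x = 100:", [f"{e:+.1f}" for e in at_100])
```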

4D. Decision Points for Correlation Coefficient Summary:

After you compute the linear correlation coefficient r of your sample, you may wonder whether this reflects any linear correlation in the population. By comparing r to a critical number or decision point, you either conclude that there is linear correlation in the population, or reach no conclusion. You can never conclude that there’s no correlation in the population.

BTW: This page gives a simple mechanical test, but a proper statistical test exists. The optional advanced handout Inferences about Linear Correlation explains how decision points are computed and the theory behind the test. You need to learn about t tests before you can understand all of it, but right now you can use the Excel spreadsheet that you’ll find there. Or you can use MATH200B Program part 6 to do the computations.

4D1. Procedure

The decision points are used to answer the question “From the linear correlation r of my sample, can I rule out chance as an explanation for the correlation I see? Can I infer that there is some correlation in the population?”

To answer that question, temporarily disregard the sign of r. This is the absolute value of r, written | r |. Then compare | r | to the decision point, and obtain one of the only three possible results:

If | r | ≤ d.p., then you cannot say whether there is any linear correlation in the population.
If | r | > d.p. and r is negative, then there is some negative linear correlation in the population.
If | r | > d.p. and r is positive, then there is some positive linear correlation in the population.

Here’s a table of decision points (also known as critical values of r) for various sample sizes.

Decision Points or Critical Numbers for r
(two-tailed test for ≠0 at significance level 0.05)

  n   d.p.     n   d.p.     n   d.p.     n   d.p.     n   d.p.
  5   .878    10   .632    15   .514    20   .444    30   .361
  6   .811    11   .602    16   .497    22   .423    40   .312
  7   .754    12   .576    17   .482    24   .404    50   .279
  8   .707    13   .553    18   .468    26   .388    60   .254
  9   .666    14   .532    19   .456    28   .374    80   .220
                                                    100   .196

(If your sample size is not shown, either refer to the Excel workbook or use the next lower number that is shown in the table. Example: n = 35 is not shown, and therefore you will use the decision point for n = 30.)

4D2. Examples

You survey 50 randomly selected college students about the number of hours they spend playing video games each week and their GPA, and you find r = −0.35. You look up n = 50 in the table and find 0.279 as the decision point. |r|>d.p. (0.35 > 0.279). You conclude that for college students in general, video game play time is negatively associated with GPA, or that GPA tends to decrease as video-game playing increases.

You randomly select 21 college students. For the amount they spend on textbooks and their GPA, you find r = +0.20. n=21 isn’t in the table of decision points, so you select 0.444, the decision point for n=20. |r|≤d.p. (0.20 ≤ 0.444). Therefore, you are unable to make any statement about an association between textbook spending and GPA for college students in general.
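The lookup-and-compare procedure is mechanical enough to sketch in Python. The table values below are the 0.05 two-tailed decision points from this section; the function names and the wording of the returned conclusions are assumptions made for this illustration. For a sample size that is not in the table, the sketch uses the next lower size that is shown, as the note under the table directs.

```python
# Decision points (critical values of r), two-tailed, significance level 0.05.
DECISION_POINTS = {
    5: 0.878, 6: 0.811, 7: 0.754, 8: 0.707, 9: 0.666,
    10: 0.632, 11: 0.602, 12: 0.576, 13: 0.553, 14: 0.532,
    15: 0.514, 16: 0.497, 17: 0.482, 18: 0.468, 19: 0.456,
    20: 0.444, 22: 0.423, 24: 0.404, 26: 0.388, 28: 0.374,
    30: 0.361, 40: 0.312, 50: 0.279, 60: 0.254, 80: 0.220,
    100: 0.196,
}


def decision_point(n):
    """Decision point for sample size n, using the next lower tabled size."""
    usable = [size for size in DECISION_POINTS if size <= n]
    if not usable:
        raise ValueError("sample size is smaller than any in the table")
    return DECISION_POINTS[max(usable)]


def conclusion(r, n):
    """Compare |r| to the decision point and report one of three results."""
    if abs(r) <= decision_point(n):
        return "no conclusion about linear correlation in the population"
    if r < 0:
        return "some negative linear correlation in the population"
    return "some positive linear correlation in the population"


print(conclusion(-0.35, 50))   # video games and GPA
print(conclusion(0.20, 21))    # textbook spending and GPA
print(conclusion(0.94, 8))     # golf example: club-head speed and distance
```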

4D3. Interpretation

Be very careful with your interpretation, and don’t say more than the statistics will allow.

The question was simply whether there is some correlation in the population, not how much. The population might have stronger or weaker correlation than your sample; all you know is that it has some. (Though you won’t learn how to do it in this course, it is possible to estimate the correlation coefficient of the population.)

If you conclude there is some correlation in the population, it’s probable, not certain. From a completely uncorrelated population, there’s still one chance in 20 of drawing a sample with | r | greater than the decision point. Because 1/20 is .05, we say that .05 is the significance level.

Even if you conclude that there is some correlation in the population, that’s the start of your investigation, not the end. If there’s a correlation in the population, you can’t just assume that one variable drives the other: correlation is not causation. (Steve Simon’s (2000b) Causation gives some hints for investigating causation, using smoking and lung cancer as an example.)

Finally, note that there’s no way to reach the conclusion “there’s no correlation in the population.” Either there (probably) is, or you can’t reach any conclusion. This will be a general pattern in inferential statistics: either you reach a conclusion of significance, or you don’t reach any conclusion at all. (As you’ll see in Chapter 10, you can conclude “something is going on”, you can fail to reach a conclusion, but you can never conclude “nothing is going on”. Lack of evidence for is not evidence against.)

4E. Optional: Scatterplot, Correlation, and Regression in Excel Summary:

In “Scatterplot, Correlation, and Regression on TI-83/84”, earlier in this chapter, you learned the concepts of correlation and regression, and you used a TI-83 or TI-84 calculator to plot the points and do the computations. The calculator is handy, but calculator screens aren’t great for formal reports. This section tells you how to do the same operations in Microsoft Excel, without repeating the concepts. I’m using Excel 2010, but Excel 2007 or 2013 should be almost identical.

4E1. Plot the Points Here again are the data:

Club-head speed, mph (x)   100  102  103  101  105  100   99  105
Distance, yards (y)        257  264  274  266  277  263  258  275

1. Enter the x-y pairs in rows or columns; row or column heads are optional.
2. With your mouse, highlight the data but not the headers. Click Insert. In the Charts section, click Scatter and choose the first scatterplot type.
3. Right-click the useless “Series1” legend and click Delete.
4. This time I got lucky, but sometimes Excel puts too much white space at the left or bottom of the chart. If this happens to you, right-click the axis numbers and select Format Axis. Change Minimum to Fixed and type in a sensible value.
5. In the Excel ribbon, click Layout » Axis Titles » Primary Horizontal Axis Title » Title Below Axis and type the axis title. Include units if any. In this case, you have club-head speed in miles per hour.
6. Click Axis Title » Primary Vertical Axis Title » Rotated Title and type the axis title, including units if any. In this case, you have distance traveled in yards.
7. Click Chart Title » Above Chart and type your chart title.
8. For a neater appearance, you can right-click the horizontal axis, select Format Axis, and change Major tick mark type to None. Repeat for the vertical axis.
Your chart should look like this:

4E2. Show the Regression Line
1. In the Excel ribbon, click Layout. In the Analysis group, click Trendline » More Trendline Options.
2. In the dialog box that appears, click Trendline Options at the left. At the top right, select Linear. At the bottom right, select Display Equation on chart and Display R-squared value on chart.
3. Click and drag the regression equation and R² value so that they’re not covering any data points or any part of the line. Then right-click them and select Format Trendline Label. Click Fill at the left, then at the right click Solid Fill and change the color to white. (This keeps the gridlines from running through the text.) If you wish, click Border Color » Solid Line.
Here’s the result:

4E3. Show the Correlation Coefficient Excel won’t put r on the chart, but you can compute it in a worksheet cell:
1. Click into an empty worksheet cell. Type =CORREL( including the = sign and opening parenthesis.
2. Highlight your y list with your mouse (numbers only, not the header) and type a comma.
3. Highlight your x list with your mouse (again, just the numbers). Type a closing parenthesis and press [ENTER].
(You can get the slope, y intercept, or R² into the worksheet by following the above procedure but substituting SLOPE, INTERCEPT, or RSQ for CORREL.)
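If you want to check the worksheet result outside Excel, the same correlation coefficient can be computed by hand. The sketch below uses the golf data from this section; `pearson_r` is a name invented for this illustration, not an Excel or calculator function, and it applies the usual textbook formula for r.

```python
from math import sqrt

# Golf data from section 4E1.
speed = [100, 102, 103, 101, 105, 100, 99, 105]      # x, club-head speed in mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]  # y, distance in yards


def pearson_r(xs, ys):
    """Linear correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(distance, speed)  # same argument order as the CORREL steps above
print(f"r = {r:.4f}, R^2 = {r * r:.4f}")
```

Swapping the two lists gives the same r, because correlation doesn't depend on which variable you call x.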

4E4. Predict the Average y Like your calculator, Excel can find the ŷ value (predicted average y) for any x. Caution:

A regression equation is valid only within the range of actual measured x values, and a little way left and right of that range. If you go outside that range, Excel will happily serve up garbage numbers to you.

On average, how far do you expect a golf ball to travel when hit at 102 mph?
1. Type your x value, 102, in an empty cell.
2. Click into an empty worksheet cell. Type =FORECAST( including the = sign and opening parenthesis.
3. Click into the cell that holds your x value, and type a comma.
4. Highlight your y list with your mouse (numbers only, not the header) and type a comma.
5. Highlight your x list with your mouse (again, just the numbers). Type a closing parenthesis and hit [ENTER].
You’ll see the predicted average distance, 267.1 yards. The prediction formula, like all Excel formulas, is “live”: if you type in a new x Excel will display the corresponding ŷ. If this doesn’t happen, in the Excel ribbon click Formulas » Calculation Options » Automatic.
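The prediction can be reproduced from the least-squares slope and intercept. In this sketch, `fit_line`, `forecast`, and `within_measured_range` are names invented for the illustration; `forecast` merely borrows the argument order of the worksheet steps above (x value, then y list, then x list).

```python
# Golf data from section 4E1.
speed = [100, 102, 103, 101, 105, 100, 99, 105]      # x, club-head speed in mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]  # y, distance in yards


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(x_new, known_ys, known_xs):
    """Predicted average y (y-hat) at x_new."""
    slope, intercept = fit_line(known_xs, known_ys)
    return intercept + slope * x_new


def within_measured_range(x_new, known_xs):
    """Simple guard against extrapolation: is x_new inside the measured x values?"""
    return min(known_xs) <= x_new <= max(known_xs)


y_hat = forecast(102, distance, speed)
print(round(y_hat, 1))                    # 267.1
print(within_measured_range(102, speed))  # True
```

The range check is stricter than the caution above, which also allows "a little way" outside the measured x values; how far is too far is a judgment call that code can't make for you.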

What Have You Learned? Key ideas:

Make scatterplot on calculator to decide whether to perform regression.
Compute linear correlation coefficient r and best-fitting line ŷ = ax+b on calculator.
Interpret r, R², slope, and intercept. Caution! r is about correlation, but R² is not.
Use the regression line to make and interpret predictions about ŷ. Remember, you are predicting an average. Caution! Don’t extrapolate.
Compute residuals. If you have several data points with the same x and different y’s, you have several residuals for that x.
Determine whether there’s a linear relation in the population. Don’t make handwaving arguments; use decision points.

Study aids:

TI-83/84 Cheat Sheet

Statistics Symbol Sheet


Exercises for Chapter 4 Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand. Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1

A researcher performed a regression on x = age and y = salary for all employees at MegaGrandeEnormoCorp (doing business as “Gramma’s Kitchen”). She found R² = 0.64. How would you explain this to a friend who doesn’t understand any math more complicated than percentages?

2

Manatees or “sea cows” are large, slow-moving mammals that live in coastal waters. They’re an endangered species. Sharyn O’Halloran (n.d., slide 4) quotes yearly figures from the US Fish & Wildlife Service for the number of powerboat registrations and number of manatees killed by power boats in Florida coastal waters. (a) The two variables are power-boat registrations and manatee deaths. Which should be the explanatory variable, and which should be the response variable? (b) On paper or on your calculator, make a scatterplot. Do the data seem to follow a straight line, more or less? (c) Give the symbol and numerical value of the correlation coefficient. (d) Write down the regression equation for manatee deaths as a function of power-boat registrations. (e) State and interpret the slope. (f) State and interpret the y intercept. (g) Give the coefficient of determination with its symbol, and interpret it. (h) How many deaths does the regression predict if 559,000 power boats are registered? Use the proper symbol. (i) Find the residual for x = 559. (j) How many manatee deaths would you expect for a million power-boat registrations?

Year   Power Boat Reg. (1000s)   Manatees Killed
1977   447                       13
1978   460                       21
1979   481                       24
1980   498                       16
1981   513                       24
1982   512                       20
1983   526                       15
1984   559                       34
1985   585                       33
1986   614                       33
1987   645                       39
1988   675                       43
1989   711                       50
1990   719                       47
1991   716                       53
1992   716                       38
1993   716                       35
1994   735                       49

3

Sascha randomly selected 10 TC3 students and asked how many hours of TV they watched on an average day and what was their GPA. The correlation was −0.57. What if anything can you say about TV watching and GPA for all TC3 students?

4

Dial   Temp, °F
0        6
2       −1
3       −3
5      −10
6      −16

Your deep freezer has a dial to regulate temperature, but it’s just numbered 0 to 8 with no indication of temperature. So you try various dial settings, allowing 24 hours for temperature to stabilize after each change. The results are shown at right. (a) Make a scatterplot. Does a straight-line model seem reasonable here? (b) What linear equation best describes the relation between dial setting x and temperature y? (c) State and interpret the slope. (d) State and interpret the y intercept. (e) Give the correlation coefficient with its symbol. (f) Give the coefficient of determination with its symbol, and interpret it. (g) Predict the temperature for a dial setting of 1.

5

A statistics professor asked students to write on their final exam the number of hours they had spent studying. After scoring the exams, she randomly selected 12 of them and plotted exam score against hours of study, with the result r = 0.85. What if anything can you say about the relation between study time and exam score for statistics students in general, assuming that this class is representative of all classes?

6

A public-school administrator with too much time on his hands studied shoe size and reading ability and found a correlation coefficient of 0.81. Are big feet a sign of intelligence?

7

A scatterplot is shown at right. Would the value of r be strongly positive, near zero, or strongly negative? Briefly explain your answer.

8

“In a large study of twins, the Minnesota Twin study found a correlation of +.71 between the IQ scores of identical twins. Another study found that family income is correlated +.30 with the IQ of children.” (Source: Pearson’s 2001 in the McGraw-Hill Statistical Primer.) How much of the variation in children’s IQ is associated with variation in family income?




What’s New 11 Jan 2015: Add links and study aids in What Have You Learned? 3 Jan 2015: Add a screen shot for setting up the scatterplot, and mention using the square as the plot symbol. (intervening changes suppressed) 10 Mar 2012: New document formed out of separate documents on regression, prediction, and decision points.

5. Probability Updated 13 Jan 2015 (What’s New?) Intro:

By now you know: There’s no certainty in statistics. When you draw a sample from a population, what you get is a matter of probability. When you use a sample to draw some conclusion about a population, you’re only probably right. It’s time to learn just how probability works.

Contents:

5A. Probability Basics 5A1. What Is Probability? 5A2. Where Do You Get Probabilities? 5A3. Interpreting Probability Statements 5A4. Law of Large Numbers 5A5. Sample Space 5A6. Probability Models 5B. Combining Probabilities 5B1. Probability “or” for Disjoint Events 5B2. Probability “or” for All Events 5B3. Probability “not” — Complements 5B4. Probability “and” for Independent Events 5B5. Probability “at least” for Independent Events 5B6. Conditional Probability · Optional: Conditional Probability Formula 5B7. Optional: Checking Independence 5B8. Optional: Probability “and” for All Events 5C. Sequences instead of Formulas What Have You Learned? Exercises for Chapter 5 Problem Set 1 Problem Set 2 What’s New

If you’re learning independently, you can skip the sections marked “Optional” and still understand the chapters that follow. If you’re taking this course with an instructor, s/he may require some or all of those sections. Ask if you’re not sure!

5A. Probability Basics

5A1. What Is Probability? Definitions:

Probability can be defined two ways: the long-term relative frequency of an event, or the likelihood that an event will occur. A trial is any procedure or observation whose result depends at least partly on chance. The result of a trial is called the outcome. We call a group of one or more repeated trials a probability experiment.

Example 1: Ten thousand doctors took aspirin every night for six years, and 104 of them had heart attacks. The relative frequency is 104/10000 = 1.04%, so the probability of heart attack is 1.04% for doctors taking aspirin nightly. Each doctor represents a trial, and the outcome of each trial is either “heart attack” or “no heart attack”. The group of 10,000 trials is a probability experiment. Definition:

An event is a group of one or more possible outcomes of a trial. Usually those outcomes are related in some way, and the event is named to reflect that.

Example 2: If you draw a card from a deck without looking, there are 52 possible outcomes (assuming the jokers have been removed). “Ace” is an event, representing a group of four outcomes, and the probability of that event is 4/52 or 1/13. “Spade” is an event, representing a group of 13 outcomes, so its probability is 13/52 or 1/4. “Ace of spades” is both an outcome and an event, with a probability of 1/52. Write probabilities as fractions, decimals, or percentages, like this: P(event) = number Example 3: On a coin flip, P(heads) = 0.5, read as the probability of heads is 0.5. “P(0.5)” is wrong. Don’t write P(number); always write P(event) = number. All probabilities are between 0 and 1 inclusive. A probability of 0 means the event is impossible or cannot happen; a probability of 1 means the event is a certainty or will definitely happen. Probabilities between 0 and 1 are assigned to events that may or may not happen; the more likely the event, the higher its probability. Definition:

When an event is unlikely — when it has a low probability of occurring — you call it an unusual event. Unless otherwise stated, “unlikely” means that the probability is below 0.05. This will be an important idea in inferential statistics.
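Example 2 above treats an event as a group of outcomes, which is easy to check by listing the deck and counting. The following is a minimal sketch; the rank and suit labels and the helper name `probability` are choices made for this illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]

# 52 equally likely outcomes, jokers removed.
deck = list(product(RANKS, SUITS))


def probability(event):
    """P(event) = number of outcomes in the event / number of possible outcomes."""
    favorable = [card for card in deck if event(card)]
    return Fraction(len(favorable), len(deck))


p_ace = probability(lambda card: card[0] == "A")
p_spade = probability(lambda card: card[1] == "spades")
p_ace_of_spades = probability(lambda card: card == ("A", "spades"))

print(p_ace, p_spade, p_ace_of_spades)  # 1/13 1/4 1/52
```

Fraction reduces 4/52 to 1/13 automatically; both forms are the same probability.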

5A2. Where Do You Get Probabilities? Pure thought is enough to give many probabilities: the probability of drawing a spade from a deck of cards, the probability of rolling doubles three times in a row at Monopoly, the probability of getting an all-white jury pool in a county with 26% black population. Any such probability is called a theoretical probability or classical probability. Theoretical probabilities come ultimately from a sample space, usually with help from some of the laws for combining events. Example 4: A standard die (used in Monopoly or Yahtzee) has six faces, all equally likely to come up. Therefore you know that the probability of rolling a two is 1/6. On the other hand, some probabilities are impossible to compute that way, because there are too many variables or because you don’t know enough: the probability that weather conditions today will give rise to rain tomorrow, the probability that a given radium nucleus will decay within the next second, the probability that a given candidate will win the next election, the probability that a driver will have a car crash in the next year. To find the probability of an event like that, you do an experiment or rely on past experience, and so it is called an experimental probability or empirical probability. Example 5: The CDC says that the incidence of TB in the US is 5 cases per 100,000 population. 5/100,000 = 0.005%. Therefore you can say that the probability a randomly selected person has TB is 0.005%. These two terms describe where a probability came from, but there’s no other difference between experimental and theoretical probabilities. They both obey the same laws and have the same interpretations. You probably don’t need formulas, but if you want them here they are: Theoretical or classical: P(success) = N(success) / N(possible outcomes) Empirical or experimental: P(success) = N(success) / N(trials)

5A3. Interpreting Probability Statements Every probability statement has two interpretations, probability of one and proportion of all. You use the interpretation that seems most useful in a given situation. Example 6: For doctors taking aspirin nightly, P(heart attack in six years) = 1.04%. The “probability of one” interpretation is that there’s a 1.04% chance any given doctor taking aspirin will have a heart attack. The “proportion of all” interpretation is that 1.04% of all doctors taking aspirin can be expected to have heart attacks. Which interpretation is right? They’re both right, but in a given situation you should use the one that feels more natural.

5A4. Law of Large Numbers You know that P(boy) is about 50% for live births, but you’re not surprised to see families with two or three girls in a row. Probability is long-term relative frequency; it can’t predict what will happen in any particular case. This is expressed in the law of large numbers: as you take more and more trials, the relative frequency tends toward the true probability. BTW: The law of large numbers was stated in 1689 by Jacob Bernoulli.

Example 7: For just a few babies, say the four children in one family, it’s quite common to find a proportion of boys very different from 50%, say one in four (25%) or even zero in four. But consider a class of thirty statistics students. The proportion may still be different from 50%, but a very different proportion (more than 70%, say, or less than 30%) would be unusual. And when you look at all babies born in a large hospital in a year, experience tells you that the proportion will be very close to 50%. The more trials you take, the closer the relative frequency is to the true probability — usually. But the Law of Large Numbers says that the relative frequency tends to the true probability. Probability can’t predict what will happen in any given case. The idea that a particular outcome is “due” is just wrong, and it’s such a classic mistake that it has a name. The Gambler’s Fallacy is the idea that somehow events try to match probabilities.

Trial   Result   Heads so far   rel. freq.
1       T        0              0.0000
2       H        1              0.5000
3       H        2              0.6667
4       H        3              0.7500
5       H        4              0.8000
6       T        4              0.6667

Example 8: I’ve just flipped a coin a few times, and the results are shown in the table above. The first flip was a tail, and after that flip the relative frequency (rf) of heads is 0. The next flip is a head, and after two flips I’ve had one head out of two trials, so the rf is 0.5. The third flip is also a head, so now the rf is 2/3 or about 0.6667. At this point someone might say, “you’re due for a tail, to move the rf back toward the true probability of 0.5.” That’s the Gambler’s Fallacy. The coin doesn’t know what it did before, and it doesn’t try to make things “right”. In my trials, the fourth flip moves the rf of heads further from 0.5, and the fifth flip moves it further still. True, the sixth flip moves the rf of heads closer to 0.5, but it could just as well have moved it further away, even if the coin is perfectly fair. I stopped after six trials. I know that if I went on to do ten trials, or a hundred, or a thousand, over time the proportion of heads would almost always move closer to 0.5, not necessarily on any particular flip, but in the long run.

Subconsciously you expect random events not to show a pattern, but you may see patterns along the way. For example, if you flip a fair coin repeatedly, inevitably you will see a run of ten heads or ten tails (about twice in every thousand sequences of ten). If you flip the coin once every two seconds, you can expect to see a run of ten flips the same about once every 17 minutes, on average. Here are two more examples of patterns cropping up in processes that are actually random:
Clustering Illusion at Wikipedia.
The “hot hand” illusion in basketball: see Gilovich, Vallone, Tversky (1985).

Example 9: You have flipped a coin 999 times, and there were 499 heads and 500 tails. What’s the probability of a head on the next flip? Solution: It is 50%, the same as on any other flip.
The Law of Large Numbers tells you that over time you tend to get closer and closer to 50% heads, but it doesn’t tell you anything at all about any particular flip. If you think that the coin is somehow “due for a head”, you’ve fallen into the Gambler’s Fallacy.
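You can watch the law of large numbers at work with a simulated coin. This is a sketch under stated assumptions: the coin is Python's pseudo-random generator, assumed fair, and the seed and function name are arbitrary choices. It also confirms the "about twice in every thousand" figure for ten flips coming out all the same.

```python
import random


def running_relative_frequency(num_flips, seed=0):
    """Relative frequency of heads after each flip of a simulated fair coin."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = 0
    freqs = []
    for trial in range(1, num_flips + 1):
        if rng.random() < 0.5:  # treat this as "heads"
            heads += 1
        freqs.append(heads / trial)
    return freqs


freqs = running_relative_frequency(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7} flips: rf of heads = {freqs[n - 1]:.4f}")

# Ten flips all the same: ten heads or ten tails.
p_ten_alike = 2 * 0.5 ** 10
print(p_ten_alike)  # 0.001953125, roughly 2 in 1000
```

The early relative frequencies can wander well away from 0.5, and individual flips may move the figure the "wrong" way, just as in Example 8; only the long-run tendency is toward 0.5.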

5A5. Sample Space At bottom, probability is about counting. Empirical probability is the number of times something did happen, divided by the number of trials. Classical probability is similar, but it makes use of a list or table of all possible outcomes, called a sample space. Technically a sample space is just a list of all possible outcomes, but it’s only useful if you make it a list of all possible equally likely outcomes. For repeated independent trials (flipping multiple coins, rolling multiple dice, making successive bets at roulette, and so on), the size of the sample space will be the number of outcomes in each trial, raised to the power of the number of trials. For example, if you want to compute probabilities for the number of girls in a family of four children, your sample space will have 2⁴ = 16 entries. Example 10: If you roll two dice, what’s the probability you’ll roll a seven? You could list the sample space as S = { 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 } but the outcomes are not equally likely. There’s only one way to get a twelve, for instance (double sixes), but there are several ways to get a seven (1–6, 2–5, and so on). So it’s much more useful to list your sample space with equally likely outcomes. When constructing a sample space, be systematic so that you don’t leave any out or list any twice. Here, you’re rolling two dice, and each die has six equally likely results, so you have 6×6 = 36 equally likely outcomes in your sample space. How can you be systematic? List the outcomes in some regular order, like the picture below. Each row lists all the possibilities with the same outcome for the first die; each column lists all the possibilities with the same outcome for the second die.

[Figure: the 36 equally likely outcomes for two dice; image courtesy of Bob Yavits, Tompkins Cortland Community College]

Once you have a sample space of equally likely outcomes, finding the probability is simple. There are six ways to roll a seven: 6-1, 5-2, 4-3, 3-4, 2-5, 1-6. There are 36 possible outcomes, all equally likely. Therefore the probability of rolling a seven is 6/36 or 1/6 or about 0.1667. In symbols, P(7) = 6/36 or P(7) = 1/6. Presenting numbers:

There’s no need to reduce fractions to lowest terms. If a decimal is not exactly equal to a fraction, it’s probably better to keep the fraction. But if the fraction is complex or you’re comparing fractions, round to four decimal places and use the “approximately equal” sign, like this: P(7) ≈ 0.1667 Caution: Round your final answer only. Never use a rounded number in further calculations; that’s the Big No-no. Fortunately, your calculator makes it easy to chain calculations so that you can see rounded numbers but it still uses the unrounded numbers for further calculations.

Example 11: Find the probability of rolling craps (two, three, or twelve). Solution: There’s one way to roll a two, two ways to roll a three, and one way to roll a twelve. P(craps) = (1+2+1)/36 = 4/36 or 1/9.
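Being systematic about a sample space is exactly what a program is good at. The sketch below lists the 36 equally likely outcomes for two dice and counts the ways to roll a seven and to roll craps, as in Examples 10 and 11. The helper name `prob` is invented for this illustration.

```python
from fractions import Fraction
from itertools import product

# Every (first die, second die) pair: 6 x 6 = 36 equally likely outcomes.
sample_space = list(product(range(1, 7), repeat=2))


def prob(event):
    """Count the outcomes in the event and divide by the size of the sample space."""
    ways = sum(1 for roll in sample_space if event(roll))
    return Fraction(ways, len(sample_space))


p_seven = prob(lambda roll: sum(roll) == 7)
p_craps = prob(lambda roll: sum(roll) in (2, 3, 12))
print(len(sample_space), p_seven, p_craps)  # 36 1/6 1/9

# Same idea for a family of four children: 2 outcomes per child, 4 children.
families = list(product("BG", repeat=4))
print(len(families))  # 16
```

Keeping the answers as fractions until the very end also follows the "round your final answer only" advice above.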

5A6. Probability Models Often, it’s not practical to construct a sample space and compute probabilities from it. Instead, you construct a probability model. Probability models are yet another kind of mathematical model as introduced in Chapter 4. Definition:

A probability model is a table showing all possible outcomes and their probabilities. Every probability must be 0 to 1 inclusive, and the total of the probabilities must be 1 or 100%. A probability model can be theoretical or empirical.

Example 12: Construct a probability model for the number of heads that appear when you flip two coins. Solution: Start by constructing the sample space. Remember that you need equally likely events if you are going to find probabilities from the sample space. The first coin can be heads or tails, and whatever the first coin is, the second coin can also be heads or tails. So the sample space has 2×2 = 4 outcomes: S = { HH, HT, TH, TT } There are four equally likely outcomes, so the denominator (bottom number) on all the probabilities will be 4. The possible outcomes are no heads (one way), one head (two ways), and two heads (one way). The probability model is shown below. Often a total row is included, as I did, to show that the probabilities add up to 1.

Number of Heads on Two Coin Flips
x    P(x)
0    1/4
1    2/4
2    1/4
∑    4/4 = 1

That was an easy example, so easy that you could just as well work from the sample space. But think about more complex situations, especially with empirical (experimental) probabilities. Constructing a sample space may be impractical, but a probability model is relatively easy to create.

Example 13: (adapted from Sullivan 2011, page 235 problem 40): The CDC asked college students how often they wore a seat belt when driving. 118 answered never, 249 rarely, 345 sometimes, 716 most of the time, 3093 always. Construct a probability model for seat-belt use by college students when driving. Solution: Probability of one is proportion of all, so to get the probabilities you simply calculate the proportions. Sample size was (118+249+345+716+3093) = 4521. The proportions or probabilities are then simply 118/4521, 249/4521, and so on. The probability model is shown below.

Seat-Belt Use by College Students Driving (sample size: 4521)
Never                2.61 %
Rarely               5.51 %
Sometimes            7.63 %
Most of the time    15.84 %
Always              68.41 %
Total              100.00 %

Comments: Don’t push this model too far. In this sample, 68.4% of college students reported that they always use a seat belt when driving. There’s no uncertainty about that statement; it’s a completely accurate statistic (summary number for a sample). But can you go further and say that 68.4% of college students always wear a seat belt when driving? No, for two reasons. First, this is a sample. Even if it’s a perfect random sample, it’s still not the population. There’s always sample variability. A different sample of college students would most likely give different answers: probably not very different, since this was a large sample, but almost certainly not identical. Second, and more serious, this survey depended on self reporting: students weren’t observed, they were just asked. When people report their behavior they tend to shade their responses in the direction of what’s socially approved or what they would like to think about themselves (response bias). How many of those “always” responses should have been “most of the time” or “sometimes”? You have no way to know.
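Turning counts into an empirical probability model, as in Example 13, is a one-line calculation per row. Here is a minimal sketch using the seat-belt counts from that example; the variable names are arbitrary.

```python
from fractions import Fraction

# Survey counts from Example 13.
counts = {
    "Never": 118,
    "Rarely": 249,
    "Sometimes": 345,
    "Most of the time": 716,
    "Always": 3093,
}

total = sum(counts.values())  # sample size
model = {answer: Fraction(count, total) for answer, count in counts.items()}

for answer, p in model.items():
    print(f"{answer:<17}{float(p):>8.2%}")
print(f"{'Total':<17}{float(sum(model.values())):>8.2%}")
```

The two checks from the definition are built in: every probability is between 0 and 1, and the exact fractions add up to exactly 1. The rounded percentages need not, in general, add up to exactly 100.00%.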

5B. Combining Probabilities You can find probabilities of simple events by making sample spaces and counting. But life isn’t usually that simple. To find probabilities of more interesting (and complex) events, you need to use rules for combining probabilities. The rules are the same whether your original probabilities are theoretical or experimental.

5B1. Probability “or” for Disjoint Events Definition:

When two events can’t both happen on the same trial, they are called mutually exclusive events or disjoint events.

Example 14: You select a student and ask where she was born. “Born in Cortland” and “born in Ithaca” are mutually exclusive events because they can’t both be true for the same person. Comment: Obviously it’s possible that neither is true. Disjoint events could both be false, or one might be true, but they can’t both be true in the same trial. Example 15: You select a student and ask his major. “Major in physics” and “major in music” are non-disjoint events because they could be true of the same student. (It doesn’t matter whether they are both true of the student you asked. They are non-disjoint because they could both be true of the same student — think about double majors.) Rule:

For disjoint events, P(A or B) = P(A)+P(B)

Example 16: You draw a card from a standard 52-card deck. What’s P(ace or face card)? (A face card is a king, queen, or jack.) Solution: Are the events “ace” and “face card” disjoint? Yes, because a given card can’t be both an ace and a face card. Therefore you can use the rule: P(ace or face card) = P(ace) + P(face card) But what are P(ace) and P(face card)? A picture may help.

[Figure: a standard 52-card deck; used by permission; source: http://www.jfitz.com/cards/ accessed 2012-09-26]

Now you can see that the deck of 52 cards has four aces and twelve face cards. Therefore
P(ace) = 4/52 and P(face card) = 12/52
Since the events are disjoint,
P(ace or face card) = P(ace) + P(face card)
P(ace or face card) = 4/52 + 12/52 = 16/52

Reminder: When you need to compute probability of A or B, always ask yourself first, are the events disjoint? Use the simple addition rule only if the events are disjoint. If events are non-disjoint (if it’s possible for both to happen on the same trial), you have to use the general rule, below.

Take a look at this table of marital status in 2006, from the US Census Bureau. It’s known as a contingency table or two-way table, because it classifies each member of the sample or population by two variables, in this case sex and marital status.

US Marital Status in 2006 (in Millions)
                  Men    Women   Totals
Married          63.6     64.1    127.7
Widowed           2.6     11.3     13.9
Divorced          9.7     13.1     22.8
Never married    30.3     25.0     55.3
Totals          106.2    113.5    219.7

Example 17: What’s the probability that a randomly selected person is widowed or divorced? Solution: Are those events disjoint? Yes, because a given person can’t be listed in both rows of the table. (You might argue that a given person can be both widowed and divorced in his or her lifetime, and that’s true. But the table shows marital status at the time the survey was made, not over each person’s lifetime. The “Widowed” row counts those whose most recent marriage ended with the death of their spouse.) Therefore
P(widowed or divorced) = P(widowed) + P(divorced)
How do you find those probabilities? Remember that probability of one = proportion of all. Find the proportions, and you have the probabilities.
P(widowed or divorced) = 13.9/219.7 + 22.8/219.7
P(widowed or divorced) = 36.7/219.7 ≈ 0.1670

Example 18: Find the probability that a randomly selected man is widowed or divorced. Solution: Disjoint events? Yes, a given man can’t be in both rows of the table. Again, the probabilities are the proportions, but now you’re looking only at the men:
P(widowed or divorced) = P(widowed) + P(divorced)
P(widowed or divorced) = 2.6/106.2 + 9.7/106.2
P(widowed or divorced) = 12.3/106.2 ≈ 0.1158

Now let’s look at a couple of examples of probability “or” for non-disjoint events.

Example 19: Find P(seven or club). Solution: Are the events “seven” and “club” disjoint? No, because a given card can be both a seven and a club. You can’t use the simple addition rule. The next section shows you a formula, but in math there’s usually more than one way to approach a problem. Here you can look back at the picture and count from the sample space. There are thirteen clubs, plus the sevens of spades, hearts, and diamonds, for a total of 16. (You don’t count the seven of clubs when counting sevens, because you already counted it when counting clubs.) And therefore P(seven or club) = 16/52.

Example 20: Find P(woman or divorced). Solution: Disjoint events? No, a given person can be both. So what do you do? The same thing as in the preceding example: you count up all the women, and all the divorced people who aren’t women, and divide by the number of people:
P(woman or divorced) = 113.5/219.7 + 9.7/219.7 = 123.2/219.7 ≈ 0.5608
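The table arithmetic in Examples 17, 18, and 20 can be checked with a small nested dictionary. This is just one possible way to hold a two-way table; the helper names are invented for the sketch, and the figures are the 2006 values in millions quoted above.

```python
# US marital status in 2006, in millions, keyed by status and then by sex.
table = {
    "Married":       {"Men": 63.6, "Women": 64.1},
    "Widowed":       {"Men": 2.6,  "Women": 11.3},
    "Divorced":      {"Men": 9.7,  "Women": 13.1},
    "Never married": {"Men": 30.3, "Women": 25.0},
}


def row_total(status):
    return sum(table[status].values())


def column_total(sex):
    return sum(row[sex] for row in table.values())


grand_total = sum(row_total(status) for status in table)

# Example 17: disjoint rows, so the probabilities simply add.
p_widowed_or_divorced = (row_total("Widowed") + row_total("Divorced")) / grand_total

# Example 18: same question, but only the men are counted.
p_man_widowed_or_divorced = (
    table["Widowed"]["Men"] + table["Divorced"]["Men"]
) / column_total("Men")

# Example 20: all the women, plus the divorced people who aren't women.
p_woman_or_divorced = (column_total("Women") + table["Divorced"]["Men"]) / grand_total

print(f"{p_widowed_or_divorced:.4f}")      # 0.1670
print(f"{p_man_widowed_or_divorced:.4f}")  # 0.1158
print(f"{p_woman_or_divorced:.4f}")        # 0.5608
```

Each probability is a proportion: "probability of one" is "proportion of all" within whichever group you are selecting from.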

5B2. Probability “or” for All Events Look back at P(seven or club). Those are not disjoint events, so you can’t just add P(seven) and P(club). But what did you do, when counting? You counted the clubs, then you counted the sevens that aren’t clubs. In other words, just adding P(seven) and P(club) would be wrong because that would double count the overlap. With 52 cards, it’s easy enough just to count. But that’s not practical in every problem, so there’s a rule: go ahead and double count by adding the probabilities, then fix it by subtracting the part you double counted. Rule:

P(A or B) = P(A) + P(B) − P(A and B)

This general addition rule works for all events, disjoint or non-disjoint. (If two events are disjoint, they can’t happen at the same time, P(A and B) is 0, and the general rule becomes the same as the simple rule.) Let’s redo the last two examples with this new general rule, to see that it gives the same answers.

Example 19 again: Find P(seven or club).
P(seven or club) = P(seven) + P(club) − P(seven and club)
Caution: P(seven and club) doesn’t mean “all the sevens and all the clubs”. It means the probability that one card will be both a seven and a club; in other words, it means the seven of clubs.
P(seven or club) = 4/52 + 13/52 − 1/52
P(seven or club) = 16/52

Example 20 again: Using the table, find P(woman or divorced). Solution:
P(woman or divorced) = P(woman) + P(divorced) − P(woman and divorced)
P(woman or divorced) = 113.5/219.7 + 22.8/219.7 − 13.1/219.7
P(woman or divorced) = 123.2/219.7 ≈ 0.5608

US Marital Status in 2006 (in Millions)
                   Men    Women   Totals
Married           63.6     64.1    127.7
Widowed            2.6     11.3     13.9
Divorced           9.7     13.1     22.8
Never married     30.3     25.0     55.3
Totals           106.2    113.5    219.7
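As a cross-check on the general addition rule, here is a minimal Python sketch that counts "seven or club" directly from a 52-card deck and then applies the rule. The deck representation and the helper names (`prob`, `is_seven`, `is_club`) are made up for this illustration; the marital-status numbers are the table totals in millions.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def prob(event):
    """Probability of an event = matching cards / all cards."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))


def is_seven(card):
    return card[0] == "7"


def is_club(card):
    return card[1] == "clubs"


# Direct count: cards that are a seven, a club, or both.
p_count = prob(lambda c: is_seven(c) or is_club(c))

# General addition rule: add, then subtract the double-counted overlap.
p_rule = prob(is_seven) + prob(is_club) - prob(lambda c: is_seven(c) and is_club(c))

print(p_count, p_rule)  # 4/13 4/13, which is 16/52 in lowest terms

# Example 20 with the table totals (millions of people).
p_woman_or_divorced = (113.5 + 22.8 - 13.1) / 219.7
print(round(p_woman_or_divorced, 4))  # 0.5608
```

Counting and the rule agree because subtracting P(seven and club) removes the one card, the seven of clubs, that the plain sum would count twice.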

5B3. Probability “not” — Complements About two thirds of students who register for a math class complete it successfully. What’s the probability that a randomly selected student who registers for a math class will not complete it successfully? Of course you already know it’s 1−(2/3) = 1/3. Let’s formalize this. Definitions:

Two events are complementary if they can’t both occur but one of them must occur. If A is an event from a given sample space, then the complement of A, written Aᶜ or not A, is the rest of that sample space.

Describing a complement usually involves using the word “not”. Complementary events (can’t both happen, but one must happen) are a subcategory of disjoint events (can’t both happen). Example 21: The complement of the event “the student completes the course successfully” is the event “the student does not complete the course successfully.” Obviously the complement need not be a simple event. The complement of “the student completes the course successfully” is “the student never shows up, or attends initially but stops attending, or withdraws, or earns an F, or takes an incomplete but never finishes”, or probably other outcomes I haven’t thought of.

Rule:

P(Aᶜ) = 1 − P(A)

This comes directly from the definition, and the rule for “or”. A and Aᶜ can’t both happen, so they’re disjoint and P(A or Aᶜ) = P(A) + P(Aᶜ). But one or the other must happen, so P(A or Aᶜ) = 1. Therefore P(A) + P(Aᶜ) = 1, and P(Aᶜ) = 1 − P(A).

Example 22: In rolling two dice, “doubles” and “not doubles” are complementary events because they can’t both happen on the same roll, but one of them must happen. “Boxcars” (double sixes) and “snake eyes” (double ones) can’t both happen, so they’re disjoint; but they are not complementary because other outcomes are possible.

The complement rule is useful on its own, but it really shines as a labor-saving device. Very often when a probability problem looks like a lot of tedious computation, the complement is your friend. This really sticks out with “at least” problems, but here are a few simpler examples.

Example 23: The color distribution for plain M&Ms is shown below. What’s the probability that a randomly selected plain M&M is any color but yellow?

Solution: You could add the probabilities of the five other colors, but of course it’s easier to say

P(Yellowᶜ) = 1 − P(Yellow)
P(Yellowᶜ) = 100% − 14% = 86%

Example 24: Referring again to the table of marital status, what’s the probability that a randomly selected person is not currently married?

Solution: Since the four marital statuses are disjoint, you could add the probabilities for widowed, divorced, and never married. But it’s easier to take the complement of “married”:

P(not currently married) = P(marriedᶜ)
P(not currently married) = 1 − P(married)
P(not currently married) = 1 − 127.7/219.7
P(not currently married) = 0.4188

Colors of Plain M&Ms
Blue      24%
Orange    20%
Green     16%
Yellow    14%
Brown     13%
Red       13%

US Marital Status in 2006 (in Millions)
                   Men    Women   Totals
Married           63.6     64.1    127.7
Widowed            2.6     11.3     13.9
Divorced           9.7     13.1     22.8
Never married     30.3     25.0     55.3
Totals           106.2    113.5    219.7
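The complement rule is easy to check by doing the work both ways. The sketch below is only an illustration (the variable names are invented here): it enumerates the 36 two-dice rolls for Example 22, then compares the long way and the short way for the M&M and marital-status examples.

```python
from fractions import Fraction
from itertools import product

# Example 22: "doubles" and "not doubles" on the 36 equally likely rolls.
rolls = list(product(range(1, 7), repeat=2))
p_doubles = Fraction(sum(1 for a, b in rolls if a == b), len(rolls))
p_not_doubles = Fraction(sum(1 for a, b in rolls if a != b), len(rolls))
print(p_doubles, p_not_doubles, 1 - p_doubles)  # 1/6 5/6 5/6

# Example 23: any color but yellow (percentages from the M&M table).
colors = {"Blue": 24, "Orange": 20, "Green": 16, "Yellow": 14, "Brown": 13, "Red": 13}
long_way = sum(pct for name, pct in colors.items() if name != "Yellow")
short_way = 100 - colors["Yellow"]
print(long_way, short_way)  # 86 86

# Example 24: not currently married (totals in millions).
p_not_married = 1 - 127.7 / 219.7
print(round(p_not_married, 4))  # 0.4188
```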

5B4. Probability “and” for Independent Events Definition:

Two events are called independent events if the occurrence of one doesn’t change the probability of the other.

Example 25: When you play poker, being dealt a pair in this hand and a pair in the next are independent events because the deck is shuffled between hands. But in casino blackjack, according to Scarne on Cards (Scarne 1965, 144), four decks are used and they aren’t necessarily shuffled between hands. Therefore, getting a natural (ace plus a ten or face card) in this hand and a natural in the next are not independent events, because the cards already dealt change the mix of remaining cards and therefore change the probabilities. That’s also an example of sampling with replacement (poker) and sampling without replacement (casino blackjack). Samples drawn with replacement are independent because the sample space is reset to its initial condition between draws. Samples drawn without replacement are usually dependent because what you draw out changes the mix of what is left. However, if you’re drawing from a very large group, the change to the proportions in the mix is very small, so you can treat small samples from a very large group as independent. Independent events are not disjoint, and disjoint events are not independent. If two events A and B are disjoint, then if A happens B can’t happen, so its probability is zero. One of two disjoint events happening changes the probability of the other, so they can’t be independent. Rule:

For independent events, P(A and B) = P(A) × P(B)

Example 26: In Monopoly, you get an extra roll if you roll doubles, but if you roll doubles three times in a row you have to go to jail. What’s the probability you’ll have to go to jail on any given turn?

Solution: Refer to the picture of the dice. There are six ways out of 36 to get doubles, so P(doubles) = 6/36 or 1/6. Each roll is independent, so the probability of doubles three times in a row is (1/6)×(1/6)×(1/6) or (1/6)³ = 1/216, about 0.0046. If you play a lot of Monopoly, you’ll go to jail, because of doubles, between four and five times per thousand turns.

Example 27: The first traffic light on your morning commute is red 40% of the time, yellow 5%, and green 55%. What’s the probability you’ll hit a green all five mornings in any given week?

Solution: Are the five days independent? Yes, because where you hit that light in its cycle on one morning doesn’t influence where you hit it on the next day. The probability of green is 55% each day regardless of what happens on any other day. Therefore, the probability of five greens on five successive mornings is 55%×55%×55%×55%×55% or (0.55)⁵ ≈ 0.0503. About one week in twenty, that light should be green for you all five mornings.

Example 28: Refer again to the table of marital status. What’s the probability that a randomly selected person is female and widowed?

Solution: In a two-way table, for probability “and”, you don’t worry about formulas or independence because everything is already laid out for you. 11.3 million persons are female and widowed, out of 219.7 million. Therefore:

P(female and widowed) = 11.3/219.7 ≈ 0.0514

US Marital Status in 2006 (in Millions)
                   Men    Women   Totals
Married           63.6     64.1    127.7
Widowed            2.6     11.3     13.9
Divorced           9.7     13.1     22.8
Never married     30.3     25.0     55.3
Totals           106.2    113.5    219.7
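To see the multiplication rule for independent events at work, the following sketch (an illustration, not part of the original text) counts the Monopoly case by brute force over every possible sequence of three rolls and compares that count with (1/6)³. It then computes the five-greens probability from Example 27.

```python
from fractions import Fraction
from itertools import product

# Example 26: every sequence of three rolls of two dice (36^3 = 46,656 of them).
one_roll = list(product(range(1, 7), repeat=2))
three_rolls = product(one_roll, repeat=3)
all_doubles = sum(1 for seq in three_rolls if all(a == b for a, b in seq))

p_jail_counted = Fraction(all_doubles, 36 ** 3)
p_jail_rule = Fraction(6, 36) ** 3  # P(doubles) multiplied three times
print(p_jail_counted, p_jail_rule)  # 1/216 1/216

# Example 27: green (55%) on five independent mornings.
p_five_greens = 0.55 ** 5
print(round(p_five_greens, 4))  # 0.0503
```

The brute-force count only works because the rolls are independent: every one of the 46,656 sequences is equally likely.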

Example 29: Earlier in this section, I said that samples drawn without replacement are usually dependent, but you can treat them as independent when drawing a small sample from a very large group. Here’s an example. If you randomly select three people, what’s the probability that all three are widowed women?

Solution: From the preceding example, the probability that any one person is a widowed woman was 11.3/219.7. Because three people is a small sample against the millions of people in the census, and the sample is random, you can treat them as independent. If you randomly select one person out of millions, the mix of sex and marital status in the remaining people is so nearly unchanged that you can ignore the difference. Therefore, the probability that all three are widowed women is (11.3/219.7) × (11.3/219.7) × (11.3/219.7) = (11.3/219.7)³ ≈ 0.0001.
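How good is the "treat them as independent" shortcut in Example 29? The sketch below compares the exact without-replacement probability with the shortcut, first for the census-sized group and then for a small group (three diamonds from a 52-card deck). It assumes, purely for illustration, that the table's rounded millions are exact head counts; the function name `p_all_match` is invented here.

```python
from fractions import Fraction


def p_all_match(matching, population, draws, replace):
    """P(every one of `draws` random picks comes from the matching group)."""
    p = Fraction(1)
    for already_drawn in range(draws):
        if replace:
            p *= Fraction(matching, population)
        else:
            # One matching member fewer, and one member fewer overall, per draw.
            p *= Fraction(matching - already_drawn, population - already_drawn)
    return p


# Assumption: treat the table's rounded millions as exact head counts.
big_exact = p_all_match(11_300_000, 219_700_000, 3, replace=False)
big_approx = p_all_match(11_300_000, 219_700_000, 3, replace=True)

# Small group for contrast: three diamonds from a 52-card deck.
small_exact = p_all_match(13, 52, 3, replace=False)
small_approx = p_all_match(13, 52, 3, replace=True)

print(float(big_approx / big_exact))      # barely above 1
print(float(small_approx / small_exact))  # about 1.21
```

For the large group the two answers differ by far less than rounding error, while for the deck the shortcut overstates the probability by roughly a fifth.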

5B5. Probability “at least” for Independent Events

There’s no special rule for “at least”, but textbook writers (and quiz writers) love this type of problem, so it’s worth looking at. “At least” problems usually want you to combine several of the probability rules.

Example 30: Think back to that traffic light that’s green 55% of the time, yellow 5%, red 40%. What’s the probability that you’ll catch it red at least one morning in a five-day week?

Solution: You could find the probability of catching it red one morning (five separate probabilities for five separate mornings), or two mornings (ten different ways to hit two mornings out of five), or three, four, or five mornings. This would be incredibly laborious.

Remember that the complement is your friend. What’s the complement of “at least one morning”? It’s “no mornings”. So you can find the probability of getting a red on no mornings, subtract that from 1, and have the desired probability of hitting red on at least one morning.

P(at least one red in five) = 1 − P(no red in five)

But the status of the light on each morning is independent of all the others, so

P(no red in five) = [P(no red on one)]⁵

What’s the probability of no red on any one morning? It’s 1 minus the probability of red on any one morning:

P(no red on one) = 1 − P(red on one) = 1 − 0.4

Now put all the pieces together:

P(no red on one) = 1 − P(red on one) = 1 − 0.4
P(no red in five) = [P(no red on one)]⁵ = (1 − 0.4)⁵
P(at least one red in five) = 1 − P(no red in five) = 1 − (1 − 0.4)⁵ ≈ 0.9222

About 92% of weeks, you hit red at least one morning in the week.

Be careful with your logic! You really do need to work things through step by step, and write down your steps. Some students just seem to subtract things from 1, and multiply other things, and hope for the best. That’s not a very productive approach.

One thing that can help you with these “at least” and “at most” problems is to write down all the possibilities and then cross out the ones that don’t apply, or underline the ones that do apply. For “at least one red in five”, you have 0 1 2 3 4 5 with the 0 crossed out, or 0 | 1 2 3 4 5. Either way, with this enumeration technique, taught to me by Benjamin Kirk, you can see that the complement of “at least one” is “none”.

A common mistake is computing 1 − 0.4⁵ for P(none), instead of the correct (1 − 0.4)⁵. “None are red” means “all are not-red”: every one of the five is something other than red.

Remember that all are not is different from not all are. In ordinary English, people often say “All my friends can’t go to the concert” when they really mean “Some of my friends can go, but not all of them can go.” In math you have to be careful about the distinction. Here’s an example.

Example 31: For the same situation, what’s the probability that you’ll hit a red light no more than four mornings in a five-day week? (This could also be asked as “at most four mornings” or “four mornings at most”.)

Solution: Try enumerating. “At most four out of five” looks like this: 0 1 2 3 4 5 with the 5 crossed out, or 0 1 2 3 4 | 5. The previous example was a “none are” or “all are not”, but this one is a “not all are”.

P(≤ 4 out of 5) = 1 − P(5 out of 5)
P(5 out of 5) = 0.4⁵
P(≤ 4 out of 5) = 1 − 0.4⁵ ≈ 0.9898

About 99% of weeks, you hit the red light no more than four mornings of the week.

Example 32: You’re throwing a barbecue, and you want to start the grill at 2 PM. Fred and Joe live on opposite sides of town, and they’ve both agreed to bring the charcoal. The problem is that they’re both slackers. Fred is late 40% of the time, and Joe is late 30% of the time. What’s the probability you’ll start the grill by 2 PM?

Solution: This is another “at least” problem for independent events, though this time the independent events don’t have the same probability. To have charcoal by 2 PM, at least one of them has to show up by then.
What’s the probability that at least one will be on time? Again, you could compute the probability that they’re both on time, that Fred’s on time but Joe’s late, and that Fred’s late and Joe’s on time — all of those together will be the probability of charcoal on time. But again, the complement is your friend. The complement of “charcoal on time” is “charcoal late”, which happens only if they’re both late. P(charcoal on time) = 1 − P(charcoal late) P(charcoal on time) = 1 − P(Fred late and Joe late) (Fred and Joe live on opposite sides of town, so whether one is late has no connection with whether the other one is late. The events are independent.) P(charcoal on time) = 1 − P(Fred late) × P(Joe late) P(charcoal on time) = 1 − 0.4×0.3 = 0.88 You’ve got an 88% chance of starting the grill on schedule. Example 33: The space shuttle Challenger exploded shortly after launch in the 1980s, when one of six gaskets failed. After the fact, engineers realized that they should have known the design was too risky, but they didn’t think past “each gasket is 97% reliable.” The trouble was that if any gasket failed, the shuttle would explode. If you were asked to evaluate the design while the plans were still on the drawing board, what would you conclude? (Note: The design makes the six gaskets independent.) Solution: The shuttle will explode if one or more gaskets fail. Here’s another “at least” problem, so enumerate the case you’re interested in: 0 | 1 2 3 4 5 6. P(explosion) = P(at least one gasket fails) The complement of “at least one gasket fails” (hard to compute) is “no gaskets fail” (much easier). What does it mean for no gaskets to fail? All gaskets must hold. 
Since the gaskets are independent, that’s easy to compute:

P(all six gaskets hold) = 0.97⁶

The answer you want is the complement of the all-hold or zero-fail case:

P(at least one gasket fails) = 1 − P(all six hold) = 1 − 0.97⁶
P(explosion) = P(at least one gasket fails) = 1 − 0.97⁶ ≈ 0.1670

Conclusion: There’s about a 17% chance that the shuttle will explode, just considering the gaskets and ignoring all other possible causes of trouble. This is about the same as the odds of shooting yourself in Russian roulette.
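The "at least" examples in this section all reduce to the same complement pattern, so a short sketch may help. The function names below are invented for illustration. `p_at_least_one` uses the complement, and `p_at_least_one_long_way` adds up every hit/miss pattern that contains at least one hit, which is the laborious route the text warns about.

```python
from itertools import product


def p_at_least_one(p_each, n):
    """P(at least one hit in n independent trials), via the complement."""
    return 1 - (1 - p_each) ** n


def p_at_least_one_long_way(p_each, n):
    """Same probability, adding up every hit/miss pattern that has a hit."""
    total = 0.0
    for pattern in product([True, False], repeat=n):
        if any(pattern):
            p_pattern = 1.0
            for hit in pattern:
                p_pattern *= p_each if hit else (1 - p_each)
            total += p_pattern
    return total


red_week = p_at_least_one(0.40, 5)                 # Example 30
red_week_check = p_at_least_one_long_way(0.40, 5)  # 31 patterns added up
at_most_four_reds = 1 - 0.40 ** 5                  # Example 31: "not all are"
charcoal_on_time = 1 - 0.40 * 0.30                 # Example 32: unequal probabilities
gasket_failure = p_at_least_one(0.03, 6)           # Example 33: each gasket fails 3%

print(f"{red_week:.4f} {red_week_check:.4f}")  # 0.9222 0.9222
print(f"{at_most_four_reds:.4f}")              # 0.9898
print(f"{charcoal_on_time:.2f}")               # 0.88
print(f"{gasket_failure:.4f}")                 # 0.1670
```

Note the difference between `(1 - 0.40) ** 5` inside `p_at_least_one` ("all are not red") and `0.40 ** 5` in Example 31 ("all are red"); mixing them up is the common mistake described above.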

5B6. Conditional Probability In 2012, the Honda Accord was the most frequently stolen vehicle in the US (Siu 2013). Does that mean that your Honda Accord is more likely to be stolen than another model? You’re tested for a rare strain of flu, and the result is positive. Your doctor tells you the test is 99% accurate. Does that mean that there’s a 99% chance you have that strain of flu? In New York City, a rape victim identifies physical characteristics that match only 0.0001% of people. Police find someone with those characteristics and arrest him. Is there only a 0.0001% chance that he’s innocent? These are examples of conditional probability — the probability of one event under the condition that another event happened. It’s probably the most misunderstood probability topic, but I’m going to demystify it for you. The definition may seem hard at first. But after you work through the examples you’ll find it makes sense. Definition:

The conditional probability of B given A, written P(B | A), is the probability of B under the condition that A occurs. Read B | A as “B given A” or “if A then B”. That’s the “probability of one” interpretation. You might find the “proportion of all” interpretation easier: P(B | A) is the proportion of A’s that are also B. Either way, the order matters — P(B | A) and P(A | B) mean different things and they’re different numbers.

Example 34: P(truck | Ford) is the probability that a vehicle is a truck if it’s a Ford, or the probability that a Ford is a truck, or the proportion of trucks among Fords. P(Ford | truck) is the probability that a vehicle is a Ford if it’s a truck, or the probability that a truck is a Ford, or the proportion of Fords among trucks. Example 35: Let’s look first at the suspected rapist. The prosecutor presents evidence that these physical characteristics are found in only 0.0001% of people. The prosecutor therefore claims that there’s only a 0.0001% chance the suspect is innocent. But the defense points out that there are over 8 million people in New York City. 0.0001% × 8,000,000 = 8, so the suspect is not a unique individual at all, but one of about eight people who match the eyewitness accounts. Seven of them are innocent. If there’s no evidence beyond the physical match to tie him to the crime, the probability that this defendant is innocent isn’t 0.0001%, it’s 7/8 or 87.5%. (And that’s just in the city. If you consider the metro area, or the US, or the world, there are even more people who match, so any one of them is even more likely to be innocent.) The prosecutor’s fallacy is the false idea that the probability of a random match equals the probability of innocence. You can also describe this fallacy as “consider[ing] the unlikelihood of an event, while neglecting to consider the number of opportunities for that event to occur”, in the words of “The Prosecutor’s Fallacy” on the Poker Sleuth site (Stutzbach 2011). It’s an easy mistake to make if you just think about low probabilities. To not make this error, think in whole numbers, as the defense did. 0.0001% is hard to think about; 8 is much easier. The key to solving conditional-probability problems is your old friend, probability of one equals proportion of all. 
The probability that this particular matching person is innocent is the same as the proportion of all matching people that are innocent, or the proportion of innocent people among those who match. Probability problems usually get easier when you turn them into problems about numbers of people or numbers of things. What does this look like in symbols? (Don’t be afraid of symbols! They are your friend, I promise. Words are slippery and confusing, but when you reduce a problem to symbols you make the situation clear and you are half way to solving it.) In this example, there’s a 0.0001% chance that a random person would match the physical type of the criminal: P(matching) = 0.0001% The prosecution wants you to believe that the probability of a matching individual being innocent is the same: P(innocent | matching) = 0.0001% (WRONG) This is a conditional probability, the probability that one thing is true if another thing is true. Formally, the whole expression is “the probability of innocent given matching”. But it’s easier to think of as “the probability that a person who matches is innocent” or “the proportion of matching people who are innocent”. The symbols help you clarify your thinking. “The probability of a match” and “the probability of innocence among those who match” are different symbols, and they’re different concepts. You’d expect them to be different probabilities. The defense showed the right way to figure the probability of innocence given a match. 0.0001%×8,000,000 = 8 people match, and 7 of them are innocent. The probability that a matching person is innocent — the probability that a person is innocent given that he matches — is 87.5%. P(innocent | matching) = 87.5% (CORRECT) Notice what happens with if-then probabilities. You’re considering one group within a subgroup of the population, not one group within the whole population. You’ve reduced your sample space — not all people, but all matching people. 
The bottom number of your fraction comes from the “given that” part of the conditional probability, because P(innocent | matching) is the proportion of matching people that are also innocent. To explode the prosecutor’s fallacy, you distinguish between a probability in the whole population and a probability in a subgroup. You also have to ask yourself, “which group?” The issue of medical test results is a good example. Example 36: There’s a rare skin disease, Texter’s Peril (TP), where you become hypersensitive to the buttons on your phone. (Yes, I am making this up.) It affects 0.03% of adults aged 18–30, three in ten thousand. The only cure is to lay off texting for 30 days, no exceptions. Naturally this is about the worst thing that can happen to anyone. Your doctor has tested you and the test comes up positive. She tells you that the test is 99% accurate. Does that mean you are 99% likely to have TP? You might think so, and sadly many doctors make the same mistake. You have a positive test result, and you want to know how likely it is that you have Texter’s Peril. In symbols, P(disease | positive) = ? Your doctor told you that the test is 99% accurate, meaning that 99% of people who actually have TP get a positive result: P(positive | disease) = 99% These are obviously not the same symbol, so the probability you care about, the probability you have the disease, may well be different from 99%. How can you compute it? Change those probabilities to whole numbers, and make a table. (I got this technique from the book Calculated Risks [Gigerenzer 2002]. The book cites a study showing that doctors routinely confused probabilities when counseling patients about test results.) You’ve already played with a two-way table; now you’re going to make one. It’s a little bit like filling in a puzzle. I hope you like puzzles. You don’t know the population size, but that’s okay. Just use a large round number, like a million. Start with what you know. 
P(disease) = 0.03%

Out of 1,000,000 people, 0.03% = 300 will have TP, and the other 999,700 won’t. That’s the bottom row of the table, the totals row.

P(positive | disease) = 99%

Of the 300 who actually have TP, 99% = 297 will get a correct positive result, and 3 will get a false negative. That’s the first column of the table.

P(negative | diseaseᶜ) = 99%

(In the real world, a given test may not be equally accurate for positives and negatives, but we’ll overlook that to keep things simple.) Out of 999,700 who don’t have TP, 99% = 989,703 will get a correct negative result, and 9,997 will get a false positive. This is the second column of the table, and now you can fill in the column of totals.

                 Have TP   Don’t Have TP       Total
Positive Test        297           9,997      10,294
Negative Test          3         989,703     989,706
Total                300         999,700   1,000,000
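The table-filling steps above can be sketched in a few lines of Python. This is only an illustration: the function name and the parameter names `sensitivity` (for P(positive | disease)) and `specificity` (for P(negative | no disease)) are labels chosen here, not terms used in the text.

```python
def natural_frequency_table(population, prevalence, sensitivity, specificity):
    """Fill in the two-way table with whole-number counts.

    sensitivity = P(positive | disease); specificity = P(negative | no disease).
    """
    have = round(population * prevalence)
    dont_have = population - have
    true_pos = round(have * sensitivity)
    true_neg = round(dont_have * specificity)
    return {
        "positive": {"have": true_pos, "dont_have": dont_have - true_neg},
        "negative": {"have": have - true_pos, "dont_have": true_neg},
    }


table = natural_frequency_table(1_000_000, 0.0003, 0.99, 0.99)
for test_result, row in table.items():
    print(test_result, row["have"], row["dont_have"], row["have"] + row["dont_have"])

# Proportion of positive results that belong to people who really have TP.
positives = table["positive"]
p_disease_given_positive = positives["have"] / (positives["have"] + positives["dont_have"])
print(f"{p_disease_given_positive:.2%}")  # 2.89%
```

Changing `prevalence` to something much larger, say 0.30, is a quick way to see how strongly the answer depends on how rare the disease is.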

Take a look at that table, specifically the “Positive Test” row. Do you see the problem? Most of the people with positive test results actually don’t have Texter’s Peril, even though the test is 99% accurate!

It took a while to get here, but it’s better to be correct slowly than to be wrong quickly. You can now compute the probability of having TP given that you have a positive test result. Once again, probability of one equals proportion of all, so this is really the same as the proportion of people with positive test results who actually have TP:

P(disease | positive) = 297 / 10,294 = 2.89%

The test is 99% accurate, but because TP is rare, most of the positive results are false positives, and there’s under a 3% chance that a positive result means you actually have Texter’s Peril. There’s a 1 − 297/10,294 = 97.11% chance that a positive result is a false positive.

Notice again: With conditional probability, you’re not concerned with the whole population. Rather, you focus on a subgroup within a subgroup. P(disease | positive) is the proportion of people who actually have the disease, within the subgroup that received a positive test result.

Example 37: What’s the chance that a negative is a false negative, that given a negative test result you actually have TP? In symbols,

P(disease | negative) = ?

You’ve already got the table, so this is a piece of cake. Out of a million people, 989,706 test negative and 3 of them have the disease. The probability that a negative is a false negative is

P(disease | negative) = 3/989,706 ≈ 0.000 003

which is essentially nil.

Example 38: A lot of Web sites in 2013 trumpeted the news that the Honda Accord was the most frequently stolen model in the US the year before. And that’s true. Out of 721,053 stolen cars and light trucks in 2012, Hot Wheels 2012 tells us that 58,596 were Honda Accords (NICB 2013). But many Web sites warned Honda owners that they were most at risk.
For instance, Honda Accord, Civic Remain Top Targets for Thieves at cars.com (Schmitz 2013) leads with “If you own a Honda Accord or Civic, or a full-size Ford pickup truck, you might want to take a moment to make sure your auto-insurance payments are up to date. You drive one of the top three most-stolen vehicles in the US.” Do you see what’s wrong here? Think about it for a minute before reading on. Yes, a lot of Honda Accords were stolen, because there are a lot of them on the road. Too many news organizations are sloppy and think that the likelihood a stolen car is an Accord is the same as the likelihood that an Accord will be stolen. This is the doctor’s mistake from the previous example, all over again. Let’s clarify. You have 58,596 Accords out of 721,053 thefts, so the probability that a stolen car was an Accord — the probability that a car was an Accord given that it was stolen — the probability of “if stolen then Accord” — is P(Accord | stolen) = 58,596/721,053 = 8.13% But that doesn’t tell you doodley-squat about your chance of having your Accord stolen. That would be the probability of a car being stolen given that it is an Accord, “if Accord then stolen”. The top number of that fraction is still 58,596, but the bottom number is the total number of Accords on the road: P(stolen | Accord) = 58,596/(total Accords on the road in 2012) Do you see the difference? They’re both conditional probabilities, but they’re different conditions. “If stolen then Accord” is different from “if Accord then stolen”. The first one is about Accord thefts as a proportion of all thefts, and the second one is about Accord thefts as a proportion of all Accords. Those are different numbers. To find the chance that an Accord will be stolen, you need the number of Accords on the road in 2012. A press release from Experian (2012) says there were “more than 245 million vehicles on US roads” in 2012, and 2.6% of them were Accords. 
P(stolen | Accord) = (stolen Accords)/(total Accords on the road in 2012) P(stolen | Accord) = 58,596/(2.6% of 245 million) P(stolen | Accord) = 58,596/6,370,000 P(stolen | Accord) = 0.92% Yes, over 8% of cars stolen in 2012 were Accords, but the chance of a given Accord being stolen was under 1%. P(Accord | stolen) = 8.13%, but P(stolen | Accord) = 0.92%. Optional: Conditional Probability Formula Rule:

P(B | A) = P(A and B) / P(A) or N(A and B) / N(A)

The “N” alternatives remind you that often it’s easier just to count than to find probabilities and then divide. Either way, when you consider P(B | A), remember that you’re interested in the likelihood of B given that A occurs. It’s the B cases within the A group, not all the B cases.

P(A | B) is not the same as P(B | A). You’ll get the probability right if you remember that the second event, the “given that” event, supplies the bottom number of the fraction.

Example 39: Find P(stolen | Accord), the chance that any one Accord will be stolen. Using the numbers from Example 38,

P(stolen | Accord) = N(Accord and stolen) / N(Accord)
P(stolen | Accord) = 58,596/6,370,000 = 0.92%

Example 40: I draw a card from the deck, and I tell you it’s red. What’s the probability that it’s a heart?

If you didn’t know anything about the card, you’d write P(heart) = ¼ because a quarter of the cards in the deck are hearts. But what is the probability given that it’s red?

P(heart | red) = P(heart and red) / P(red)

P(heart and red) is the probability of a red heart. A quarter of the cards in the deck are red hearts, so this is just ¼. P(red) is of course ½ because half the cards in the deck are red.

P(heart | red) = (¼) / (½) = (¼) × 2 = ½

This one is probably easier to do by just counting. There are 13 red hearts among the 26 red cards:

P(heart | red) = N(heart and red) / N(red)
P(heart | red) = 13/26 = ½

Either way, you’re concerned with the sub-subgroup of hearts within the subgroup of red cards. P(heart | red) = ½: half of the red cards are hearts.

Example 41: You know P(heart | red) = ½: given that a card is red, there’s a ½ probability that it’s a heart. But what is P(red | heart), the probability that a card is red given that it’s a heart? You probably already know the answer, but let’s run the formula:

P(red | heart) = N(red and heart) / N(heart)
P(red | heart) = 13/13 = 1 (or 100%)

Conditional probabilities often come up in two-way tables.
Example 42: Again using the table of marital status, what’s the probability that a randomly selected woman is divorced? In other words, given that the person is a woman, what’s the probability that she’s divorced?

Solution: The problem wants P(divorced | woman), the probability that the person is divorced given that she’s a woman.

P(divorced | woman) = N(divorced and woman) / N(woman)
P(divorced | woman) = 13.1/113.5 ≈ 0.1154

Because we have “given woman” or “if woman”, the bottom number is the number of women, 113.5 million.

US Marital Status in 2006 (in Millions)
                   Men    Women   Totals
Married           63.6     64.1    127.7
Widowed            2.6     11.3     13.9
Divorced           9.7     13.1     22.8
Never married     30.3     25.0     55.3
Totals           106.2    113.5    219.7
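Because the "given that" event supplies the bottom number, it can help to see both directions of a conditional probability computed side by side. The sketch below is an illustration with invented names (`TABLE`, `count`); it takes the counts from the marital-status table and the theft figures from Example 38.

```python
# Counts in millions, copied from the US Marital Status in 2006 table.
TABLE = {
    "Married": {"Men": 63.6, "Women": 64.1},
    "Widowed": {"Men": 2.6, "Women": 11.3},
    "Divorced": {"Men": 9.7, "Women": 13.1},
    "Never married": {"Men": 30.3, "Women": 25.0},
}


def count(status=None, sex=None):
    """Total of the cells matching the given status and/or sex (None = any)."""
    return sum(
        cell
        for row_status, row in TABLE.items()
        for col_sex, cell in row.items()
        if status in (None, row_status) and sex in (None, col_sex)
    )


# Same top number, different bottom numbers.
p_divorced_given_woman = count("Divorced", "Women") / count(sex="Women")
p_woman_given_divorced = count("Divorced", "Women") / count(status="Divorced")
print(round(p_divorced_given_woman, 4))  # 0.1154
print(round(p_woman_given_divorced, 4))  # 0.5746

# Example 38: 58,596 stolen Accords, 721,053 thefts, 2.6% of 245 million vehicles.
stolen_accords = 58_596
p_accord_given_stolen = stolen_accords / 721_053
p_stolen_given_accord = stolen_accords / (0.026 * 245_000_000)
print(f"{p_accord_given_stolen:.2%} {p_stolen_given_accord:.2%}")  # 8.13% 0.92%
```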

5B7. Optional: Checking Independence Remember the definition of independent events? A and B are independent if the occurrence of one doesn’t change the probability of the other. Now that you know about conditional probability, you can define independent events in terms of conditional probability: Definition:

Two events A and B are independent if and only if P(A|B) = P(A).

This makes sense. P(A) is the probability of A without considering whether B happened or not, and P(A|B) is the probability of A given that B happened. If B’s occurrence doesn’t change the probability of A, then those two numbers will be equal. Example 43: Referring again to the table of marital status, show that “woman” and “widowed” are dependent (not independent).

US Marital Status in 2006 (in Millions)
                   Men    Women   Totals
Married           63.6     64.1    127.7
Widowed            2.6     11.3     13.9
Divorced           9.7     13.1     22.8
Never married     30.3     25.0     55.3
Totals           106.2    113.5    219.7

Solution:

P(widowed) = 13.9 / 219.7 ≈ 0.0633
P(widowed | woman) = 11.3 / 113.5 ≈ 0.0996

These numbers are different: the probability of “widowed” changes when “woman” is given, or in English the proportion of widowed women is different from the proportion of widowed people. Therefore the events “woman” and “widowed” are not independent.

By the way, if A and B are independent then B and A are independent. So you could just as well compare P(woman) = 113.5/219.7 ≈ 0.5166 to P(woman | widowed) = 11.3/13.9 ≈ 0.8129. Since those are different, you conclude that “woman” and “widowed” are dependent.
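The independence check in Example 43 is just a comparison of two numbers, as this small sketch shows. The helper `is_independent` and its `tolerance` parameter are assumptions made for illustration; with rounded survey data you would rarely see exact equality, and the tolerance of 0.001 used below is an arbitrary choice. The card example is added only as a contrast where the two probabilities match exactly.

```python
from fractions import Fraction


def is_independent(p_a, p_a_given_b, tolerance=0):
    """A and B are independent when P(A | B) equals P(A) (within tolerance)."""
    return abs(p_a - p_a_given_b) <= tolerance


# Example 43, counts in millions from the marital-status table.
p_widowed = 13.9 / 219.7
p_widowed_given_woman = 11.3 / 113.5
print(round(p_widowed, 4), round(p_widowed_given_woman, 4))  # 0.0633 0.0996
print(is_independent(p_widowed, p_widowed_given_woman, tolerance=0.001))  # False

# Contrast: in a standard 52-card deck, "seven" and "club" are independent.
p_seven = Fraction(4, 52)            # four sevens in the deck
p_seven_given_club = Fraction(1, 13) # one seven among the thirteen clubs
print(is_independent(p_seven, p_seven_given_club))  # True
```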

5B8. Optional: Probability “and” for All Events When events are not independent, to find probability “and” you need to use a conditional probability. Remember the formula for conditional probability: P(B | A) = P(A and B) / P(A). Multiply both sides by P(A) and you have P(A) × P(B | A) = P(A and B), or: Rule:

For all events, P(A and B) = P(A) × P(B | A)

Example 44: You draw two cards from the deck without looking. What’s the probability that they’re both diamonds?

Solution: Are these independent events? No! P(diamond₁), the probability that the first card is a diamond, is 13/52 because there are 13 diamonds out of 52. But if the first card is a diamond, the probability that the second card is a diamond is different. Now there are only 12 diamonds left in the deck, out of a total of 51 cards. So P(diamond₂ | diamond₁) = 12/51, which is a bit less than 13/52.

P(diamond₁ and diamond₂) = P(diamond₁) × P(diamond₂ | diamond₁)
P(diamond₁ and diamond₂) = (13/52) × (12/51) ≈ 0.0588
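As a check on Example 44, this sketch lists every ordered pair of two different cards and counts the pairs where both are diamonds, then compares the count with the rule P(A) × P(B | A). The deck encoding (ranks 1 to 13, suits as single letters) is just a convenient assumption for the illustration.

```python
from fractions import Fraction
from itertools import permutations

# 52 cards as (rank, suit) pairs; "D" stands for diamonds.
deck = [(rank, suit) for suit in "CDHS" for rank in range(1, 14)]

# Every ordered way to draw two different cards: 52 * 51 = 2,652 pairs.
pairs = list(permutations(deck, 2))
both_diamonds = sum(1 for first, second in pairs if first[1] == "D" and second[1] == "D")

p_counted = Fraction(both_diamonds, len(pairs))
p_rule = Fraction(13, 52) * Fraction(12, 51)  # P(first) * P(second | first)
print(p_counted, p_rule, round(float(p_rule), 4))  # 1/17 1/17 0.0588
```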

5C. Sequences instead of Formulas

A lot of probability problems can be solved without using formulas, through the technique of sequences. Here’s the procedure:

1. Write down the “winning sequences”, the sequences that lead to the desired outcome.
2. Assign probabilities to each event in each sequence, from start to end.
3. Multiply the probabilities within each sequence, and then add up the probabilities of all the sequences.

Example 45: Suppose a bag contains 6 oatmeal cookies, 4 raisin cookies, and 5 chocolate chip. You are to draw two cookies from the bag without looking (and without replacement, which would be yucky). What is the probability that you will get two chocolate chip cookies?

Solution: To start with, notice that there are 6+4+5 = 15 cookies. There’s only one winning sequence, but this one illustrates an important point: you have to assign each probability in its situation at that point in its sequence.

1. Sequence: CC₁ and CC₂
2. Probabilities: 5/15 and 4/14. You compute the probability of CC₂ at this point in the sequence: it’s the probability of a second CC if the first cookie was CC. You don’t care about the probabilities if the first cookie was anything else, because the sequence starts with a CC cookie. That means that, when you are looking for the probability of a second CC cookie, the bag now contains only 14 cookies, and only 4 of them are CC.
3. Arithmetic: (5/15)×(4/14) ≈ 0.0952

Example 46: In the same situation, what’s the probability you’ll get one oatmeal and one raisin?

Solution: Even though you don’t care which order they come in, you have to list both orders among your winning sequences. Remember the example of flipping two coins, or the examples with dice: to make probabilities come out right, consider possible orderings.

1. Sequences: (A) O₁ and R₂; (B) R₁ and O₂
2. Probabilities: (A) 6/15 and 4/14; (B) 4/15 and 6/14
3. Arithmetic: (6/15)×(4/14) + (4/15)×(6/14) ≈ 0.2286

Example 47: Consider the same bag of 15 cookies, but now what’s the probability you get two cookies the same?

Solution:

1. Sequences: (A) O₁ and O₂; (B) R₁ and R₂; (C) CC₁ and CC₂
2. Probabilities (again, the probability for the second cookie takes into account the first cookie that was drawn): (A) 6/15 and 5/14; (B) 4/15 and 3/14; (C) 5/15 and 4/14
3. Arithmetic: (6/15)×(5/14) + (4/15)×(3/14) + (5/15)×(4/14) ≈ 0.2952

Example 48: Your teacher’s policy is to roll a six-sided die and give a quiz if a 2 or less turns up. Otherwise, she rolls again and collects homework if a 3 or less turns up. You haven’t done the homework for today and you’re not ready for a quiz. What is the probability you’ll get caught?

Solution: Though you could do this with formulas, you’ll get the same answer with less pain by following the method of sequences. The “winning sequences” in this case are the sequences that lead to either a quiz or homework.

1. There are two sequences: (A) quiz (and stop, without deciding about homework); (B) no quiz, but homework. Notice that you start each sequence from the same starting point. Notice also that you don’t consider the possible sequence “no quiz and no homework” because in that sequence you don’t get caught.
2. P(quiz) = 2/6 = 1/3. P(no quiz) = 1 − 1/3 = 2/3. P(homework if no quiz) = 3/6 = 1/2. (A) 1/3; (B) 2/3 and 1/2
3. (1/3) + (2/3)×(1/2) = (1/3) + (1/3) = 2/3

There’s a 2/3 probability of a quiz or homework.

Sequences let you think through a situation without getting confused about which formula may apply. Sometimes no formula applies. Here’s a famous example.

Example 49: You’re a contestant on Let’s Make a Deal. You have to pick one of three doors, knowing that there’s a new car behind one of them and a “zonk” (something funny but worthless) behind the other two. Let’s say you pick Door #1. The host, who of course knows where the car is, opens Door #2 and shows you a zonk.
He then asks whether you want to stick with your choice of Door #1, or instead take what’s behind Door #3. What should you do, and why? (I gave specific door numbers to help make this problem less abstract, but the specifics don’t matter. What does matter is that you pick a door at random, and the host reveals that a door you didn’t pick is the wrong one.) Solution: There’s really no formula for this one, because the host’s actions aren’t governed by probability. Once you realize that, it’s easy. 1. In the long run, 1/3 of contestants will choose the correct door, whichever one it is, and 2/3 will choose one of the two wrong doors. Why? The show’s producers have to make sure that prizes are equally distributed among the three doors over the long haul. If they favored one door over the others, people would notice and would start picking that door. Therefore, P(right door) = 1/3 and P(wrong door) = 2/3. 2. If you chose the right door, the host opens one of the two wrong doors, but obviously you would not benefit by switching. 3. If you chose the wrong door, the host opens the other wrong door and offers you the chance to switch doors. The host has eliminated the other wrong door, and the third door must be the winning door. You should switch. If you chose the wrong door and switch doors, you will always win because the host has eliminated the other wrong door. 4. The probability that you chose the right door initially, and will lose if you switch, is 1/3. The probability that you chose the wrong door initially, and will win if you switch, is 2/3. In the long run, keeping your original choice is the winning strategy 1/3 of the time, and switching is the winning strategy 2/3 of the time. 5. Switching doors doubles your chance of winning. BTW: This is the famous Monty Hall Problem. Monty Hall developed Let’s Make a Deal and hosted the show for many years. There was a lot of controversy (Tierney 1991) about the answer. 
Many people who should have known better thought that Door #1 and Door #3 were equally likely after Door #2 was opened. But they forgot that this is not a pure probability problem. The host knows where the car is and picks a door to open based on that knowledge, and that makes all the difference.
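The three-step sequences procedure in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name sequence_total is made up for this sketch, and each winning sequence is written as a list of the probabilities already worked out in Examples 45, 47, 48, and 49.

```python
from fractions import Fraction as F

def sequence_total(sequences):
    """Multiply the probabilities within each sequence, then add the products."""
    total = F(0)
    for seq in sequences:
        product = F(1)
        for p in seq:
            product *= p
        total += product
    return total

# Example 45: two chocolate chip cookies (one winning sequence).
two_cc = sequence_total([[F(5, 15), F(4, 14)]])

# Example 47: two cookies the same (O-O, R-R, or CC-CC).
same = sequence_total([
    [F(6, 15), F(5, 14)],
    [F(4, 15), F(3, 14)],
    [F(5, 15), F(4, 14)],
])

# Example 48: quiz, or else no quiz followed by homework.
caught = sequence_total([[F(1, 3)], [F(2, 3), F(1, 2)]])

# Example 49: switching wins exactly when the first pick was wrong;
# staying wins exactly when the first pick was right.
win_switch = sequence_total([[F(2, 3), F(1)]])
win_stay = sequence_total([[F(1, 3), F(1)]])

for name, value in [("two CC", two_cc), ("same", same), ("caught", caught),
                    ("switch", win_switch), ("stay", win_stay)]:
    print(name, value, round(float(value), 4))
```

The code does nothing you can't do by hand. Its only point is that every example in this section is the same pattern: multiply along each sequence, then add across sequences.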

What Have You Learned?

Key ideas:

Theoretical/classical and empirical/experimental probability.
Two interpretations: probability of one = proportion of all.
Law of Large Numbers and Gambler’s Fallacy.
Sample space, and the importance of equally likely outcomes.
Constructing and interpreting probability models.
Know meanings of disjoint events a/k/a mutually exclusive events, complementary events, independent events.
Probability “or”:
    For disjoint events, P(A or B) = P(A) + P(B)
    For all events, P(A or B) = P(A) + P(B) − P(A and B)
Probability “not” for complementary events: P(not A) or P(Aᶜ) = 1 − P(A)
Probability “and”:
    For independent events, P(A and B) = P(A) × P(B)
    Optional: For all events, P(A and B) = P(A) × P(B | A)
Conditional probability: P(B | A) means probability of B given A, or probability of if-A-then-B, or probability of B if A is known to have occurred, or proportion of B’s within the A’s.
    Optional: P(B | A) = P(A and B) / P(A)
Solve problems with “at most” and “at least” conditions by using the complement and the other rules.
Solve probability problems involving two-way tables.
Solve probability problems with sequences instead of formulas.

Study aids:

Statistics Symbol Sheet


Exercises for Chapter 5

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.
Caution! If you don’t see how to start a problem, don’t peek at the solution; you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

Problem Set 1

1

You toss three coins. (a) How many entries do you expect in the sample space of equally likely events?

(b) Construct that sample space. (c) Find P(2H), the probability of getting exactly two heads.

2

In 2003 a federal government survey estimated that 58.2% of US households had both a cell phone and a landline, 2.8% had only cell service, and 1.6% had no phone service at all. (a) Construct a probability model for type of phone service to US households. (Hint: You’re going to have to add a fourth case.) (b) Supposedly, polling agencies try not to call cell phones, because consumers object to paying for the calls. What proportion of US households could be reached by a landline in 2003?

3

According to DiscovertheOdds.com (2014), the probability of being struck by lightning in a given year is about 1 in 1,000,000. A blog post by Tara Parker-Pope (2007) says that the probability of suffering a shark attack in 2003 was about 1 in 4,691,000. Can you add these two numbers to find the probability of being struck by lightning or attacked by a shark in 2003 as 1/1,000,000 + 1/4,691,000? Briefly, why or why not?

4

P(A), the probability of event A, is 0.7. A and B are complementary events. Find (a) P(not A); (b) P(B); (c) P(A and B). If any of them cannot be determined from the information given, say so.

5

A blog post by Tara Parker-Pope (2007) reported that your lifetime risk of dying of heart disease is 1/5, and your lifetime risk of dying of cancer is 1/7. Can you add these two numbers to find the probability of dying of heart disease or cancer? Briefly, why or why not?

6

Explain the difference between P(divorced | man) and P(man | divorced).

7

You shuffle a standard 52-card deck well and deal five cards. What is the probability that the fifth card is a spade?

8

A company analyzed all 412 customer complaints that were received in January 2013. None of them were for unresolved billing disputes. Therefore the probability that a randomly selected complaint from January 2013 was for an unresolved billing dispute is zero. We’re used to interpreting a probability of zero as impossible, but obviously it is possible for a complaint to be about an unresolved billing dispute. How do you resolve this paradox? Need a hint? Think about the two kinds of probability from the beginning of the chapter.

9

Write out the sample space for flipping two coins, and use it to answer these questions. (a) If you are told that at least one of the flips came up heads, what is the probability that both are heads? (b) If you are told that the first coin came up heads, what is the probability that both are heads?

10

The chance of being a victim of violent crime in a given year varies by age and sex, according to What are my chances of being a victim of violent crime? Take 17.1 per thousand, or 1.71%, as the average.

(a) You’re waiting for a flight at the airport. You fall into conversation with a stranger, and you’re surprised to learn that both of you have been victims of violent crime in the past year. Assuming random selection, what are the chances of that happening? (b) Explain why you cannot use the same technique to find the probability that both members of a married couple have been victims of violent crime in the past year.

11

For this problem, please use the table of marital status below.

US Marital Status in 2006 (in Millions)
                   Men    Women   Totals
Married           63.6     64.1    127.7
Widowed            2.6     11.3     13.9
Divorced           9.7     13.1     22.8
Never married     30.3     25.0     55.3
Totals           106.2    113.5    219.7

(a) Find P(divorced).
(b) Give two interpretations of that probability.
(c) What type of probability is this: classical, empirical, experimental, theoretical?
(d) Find P(divorcedᶜ) and give one interpretation.
(e) Find P(man and married).
(f) Find P(man or married). (Work this with and without the formula.)
(g) Find the probability that a randomly selected male was never married: P(never married | male) = ?
(h) Find P(man | married), and interpret as “____% of ____ were ____.”
(i) Find P(married | man), and interpret as “____% of ____ were ____.”

12

In five-card draw poker, you are dealt five cards and then during the betting you can discard some in hopes that the replacements will improve your hand. You have a pat hand if the first five cards are good enough that you don’t need to discard. What’s the probability you’ll be dealt a diamond flush (five diamonds) as a pat hand?

13

There are 20 M&Ms left in the dish: 5 blue, 4 orange, 3 green, 3 yellow, 3 brown, and 2 red. The yellows are your favorites. Your friend takes three M&Ms without looking.

(a) What’s the chance that she leaves your favorites behind? (b) What’s the chance that all three of her picks are red?

14

Tom Turkey invested in two risky startup companies, A and W. There is a 0.90 probability that company A will go bankrupt, and a 0.80 probability that company W will go bankrupt. Assuming the two companies have no connection, find the probabilities that (a) both will go bankrupt; (b) one of them, but not both, will go bankrupt; (c) neither will go bankrupt.

15

Without looking, you take three M&Ms from a new three-pound bag. (The bag contains over a thousand M&Ms.) Use the probability model of plain M&M colors below to answer these questions.

Colors of Plain M&Ms
Blue      24%
Orange    20%
Green     16%
Yellow    14%
Brown     13%
Red       13%

(a) Find the probability that all three are red.
(b) Find the probability that none are red.
(c) Find the probability that at least one is green.
(d) Find the probability that exactly one is green.

16

A poll found that 45% of baseball fans had attended a game in person within the past year. Of five randomly selected baseball fans, find the probability that at least one fan had not attended a game within the past year.

17

Without looking, Grace Underfire takes two sourballs from a bowl that contains 11 cherry and 9 orange flavor. What is the probability that she will get one of each flavor?

18

An annual church raffle offers one chance in 500 of winning something. Find the chance that you win at least once if you play five years in a row.

19

Butch will miss an important TV program while taking his statistics exam, so he sets both his DVRs to record it. The first one records 70% of the time, and the second one records 60% of the time. (Their performance is independent.) What is the probability that he gets home after the exam and finds (a) No copies of his program? (b) One copy of his program? (c) Two copies of his program?

Problem Set 2

20

Police plan to enforce speed limits during the morning rush hour on four different routes into the city. The traps on routes A, B, C, and D are operated 40%, 30%, 20%, and 30% of the time, respectively. Biff always speeds to work, and he has probability 0.2, 0.1, 0.5, and 0.2 of using those routes. (a) What’s the probability that he’ll get a ticket on any one morning? (b) What’s the probability he’ll go five mornings without a ticket? (Hint: His choice of a route, and whether there’s a speed trap on that route, are independent.)

21

For this problem, please use the table of marital status below. Show that the events “man” and “divorced” are not independent.

US Marital Status in 2006 (in Millions)
                   Men    Women   Totals
Married           63.6     64.1    127.7
Widowed            2.6     11.3     13.9
Divorced           9.7     13.1     22.8
Never married     30.3     25.0     55.3
Totals           106.2    113.5    219.7

22

I remarked that if you flip a fair coin repeatedly, you’ll see a run of ten heads or ten tails. Show why this should happen twice in about every thousand flips.

23

(adapted from Dabes and Janik 1999, page 24) The probability that a certain door is locked is 0.5. The key to the door is one of five unidentified keys hanging on a rack. You select two keys before going to the door. Find the probability that you can open the door without returning for another key.




What’s New

11–13 Jan 2015: Add links and study aids in What Have You Learned?
27 Sept 2014: Replace the inline citations with links to the new Sources Used section.
20 Sept 2014:
    Relate probability models to the talk of mathematical models in Chapter 4.
    Add an example showing the assumption of independence for a small sample from a large population.
    Add the enumeration technique to the Challenger gasket example, along with some explanatory text.
    Add the “proportion of all” interpretation to the definition of conditional probability, to the analysis of the prosecutor’s fallacy, and to What Have You Learned?
(intervening changes suppressed)
22 Aug 2012: New document.

6. Discrete Probability Models

Updated 7 Oct 2015 (What’s New?)

Intro:

In Chapter 5, you looked at the probabilities of specific events. In this chapter, you’ll take a more global view and look at the probabilities of all possible outcomes of a given trial.

Contents:

6A. Random Variables
6B. Discrete Probability Distributions
    6B1. Mean and Standard Deviation of a DPD
    6B2. Comparing DPDs: Parking Choices
    6B3. Fair Price of a Game
6C. Bernoulli Trials
6D. The Geometric Model
    6D1. Computing Probabilities
    6D2. Mean and Standard Deviation of a Geometric Distribution
    6D3. Making a Decision
    6D4. Baseball
6E. The Binomial Model
    6E1. Computing Probabilities
    6E2. Baseball Again!
    6E3. Mean and Standard Deviation of a Binomial Distribution
    6E4. Surprised?
    6E5. A Life-or-Death Example
What Have You Learned?
Exercises for Chapter 6
What’s New

6A. Random Variables

The random variable is one of the main concepts of statistics, and we’ll be dealing with random variables from now till the end of the course.

Definitions:

A variable is “the characteristic measured or observed when an experiment is carried out or an observation is made.” —Upton and Cook (2008, 401) If the results of that procedure depend on chance, completely or partly, you have a random variable. Each outcome of the procedure is a value of the variable. We use a capital letter like X for a variable, and a lower-case letter like x for each value of the variable. As you learned in Chapter 1, numeric variables can be discrete or continuous. A discrete random variable can have only specific values, typically whole numbers. A continuous random variable can have infinitely many values, either across all the real numbers or within some interval.

In this chapter, you’ll be concerned with discrete random variables. In the next chapter, you’ll look at one particular type of continuous random variable, the normal distribution. Example 1:

You roll three dice. The number of sixes that appear is a random variable, and the total number of spots on the upper faces is another random variable. These are both discrete.

Example 2:

You randomly select a household and ask the family income for last year. This is a continuous random variable.

Example 3:

You randomly select twelve TC3 students, measure their heights, and take the average. “Height of a student” is a continuous random variable, and “average height in a 12-student sample” is another continuous random variable.

Example 4:

You randomly select 40 families and ask the number of children in each. “Number of children in family” is a discrete random variable, and “average number of children in a sample of 40 families” is a continuous random variable.

6B. Discrete Probability Distributions

Definition:

A discrete probability distribution or DPD (also known as a discrete probability model) lists all possible values of a discrete random variable and gives their probabilities. The distribution can be shown in a table, a histogram, or a formula. Like any probabilities, the probabilities in a DPD can be determined theoretically or experimentally.

Example 5: In March 2013, Royal Auto sent me one of those “Win big!” flyers with a fake car key taped to it. The various prizes, and chances of winning, are shown in the table below.

Prize           Declared Value, x    Chance of Winning, P(x)
Two Camaros     $100,000             1 in 5,000,000
Cash            10,000               1 in 1,000,000
Apple iPad      1,000                1 in 500,000
Various         500                  1 in 250,000
Gift card       5                    0.9999928

This is a discrete probability distribution. The discrete variable X is “prize value”, and the five possible values of X are $100,000 down to $5.
Remember the two interpretations of probability: probability of one = proportion of all. From the table, you can equally well say that any person’s chance of winning a $500 prize is 1/250,000 = 0.000 004 = 0.0004%, or that in the long run 0.0004% of all the people who participate in the promotion will win a $500 prize.
A discrete probability distribution must list all possible outcomes. The total probability for all possible outcomes in any situation is 1. Therefore, for any discrete probability distribution, the probabilities must add up to 1 or 100%.

6B1. Mean and Standard Deviation of a DPD

Definitions:

Suppose you do a probability experiment a lot of times. (For the Royal Auto example, suppose bazillions of people show up to claim prizes.) Each outcome will be a discrete value. The mean of the discrete probability distribution, µ, is the mean of the outcomes from an indefinitely large number of trials, and the standard deviation of the discrete probability distribution, σ, is the standard deviation of the outcomes from an indefinitely large number of trials. The mean of any probability distribution is also called the expected value, because it’s the expected average outcome in the long run.

How do you find the mean and SD of a discrete probability distribution? Well, one interpretation of probability is long-term relative frequency, so you can treat a discrete probability distribution as a relative frequency distribution. (You can also think of the probabilities as weights, with the mean as the weighted average.) On the TI-83/84, that means good old 1-Var Stats, just like in Chapter 3.
BTW: Textbooks all list the formulas, so if you want to know them here they are. But in fact everybody uses software except in the simplest cases.
µ = ∑ x·P(x)
σ = √[ ∑ x²·P(x) − µ² ]
For ∑, see ∑ Means Add ’em Up in Chapter 1.

Example 6: To find the mean and SD of the distribution of winnings in the Royal Auto sweepstakes, put the x’s in one list and the P(x)’s in another list. Caution: When the probability is a fraction, enter the fraction, not an approximate decimal. The calculator will display an approximate decimal, but it will do its calculations on a much more precise value. After entering the x’s and p’s, press [STAT] [►] [1] and specify your two lists, such as 1-Var Stats L1,L2. (Yes, the order matters: the x list must be first and the P(x) list second.) When you get your results, check n first. In a discrete probability distribution, n represents the total of the probabilities, so it must be exactly 1. If it’s just approximately 1, you made a mistake in entering your probabilities.

The mean of the distribution is µ = $5.03, and the standard deviation is σ = $45.85. Interpretation: In the long run, the dealership will have to pay out $5.03 per person in prizes. The SD is a little harder to get a grasp on, but notice that it’s more than nine times the mean. This tells you that there is a lot of variability in outcome from one person to the next. In general, the mean tells you the long-term average outcome, and the SD tells you the unpredictability of any particular trial. You can look at the SD as a measure of risk.
A couple of notes about the calculator output: The calculator knows that a DPD is a population, so it gives you σ and not s for the SD. It should give you µ for the mean, but instead it displays x̄, so you need to make the change. I’ve already mentioned that the sum of the probabilities (n) must be exactly 1, not just approximately 1.
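If you want to see what 1-Var Stats is doing with the two lists, here is a hedged Python sketch of the textbook formulas for the mean and SD of a DPD, applied to the Royal Auto numbers. The function name dpd_mean_sd is an assumption for this sketch, not anything on the calculator. Probabilities are entered as exact fractions so that the "n must be exactly 1" check can be exact too.

```python
from fractions import Fraction
from math import sqrt

def dpd_mean_sd(values, probs):
    """Return (mean, SD) of a discrete probability distribution."""
    if sum(probs) != 1:
        raise ValueError("probabilities must add up to exactly 1")
    mu = sum(x * p for x, p in zip(values, probs))               # sum of x*P(x)
    var = sum(x * x * p for x, p in zip(values, probs)) - mu**2  # sum of x^2*P(x), minus mu^2
    return float(mu), sqrt(float(var))

# Royal Auto prize values and probabilities from Example 5.
values = [100_000, 10_000, 1_000, 500, 5]
probs = [
    Fraction(1, 5_000_000),
    Fraction(1, 1_000_000),
    Fraction(1, 500_000),
    Fraction(1, 250_000),
    Fraction("0.9999928"),
]

mu, sd = dpd_mean_sd(values, probs)
print(f"mean = ${mu:.2f}, SD = ${sd:.2f}")
```

The ValueError plays the same role as checking n on the calculator: if the probabilities don't total exactly 1, something was entered wrong.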

6B2. Comparing DPDs: Parking Choices

Example 7: When visiting the city, should you park in a lot or on the street? On a quarter of your visits (25%), you park for an hour or less, which costs $10 in a lot; for parking more than an hour they charge a flat $14. If you park on the street, you might receive a simple $30 parking ticket (p = 20%), or a $100 citation for obstruction of traffic (p = 5%), but of course you might get neither. Which should you do? (Adapted from Paulos 2004.)
You have two probability models here, one for the outcomes of parking in a lot, and one for street parking. Begin by putting the two models into tables:

Parking in lot            x     P(x)
≤ 1 hour                $10     0.25
> 1 hour                $14

Parking on street         x     P(x)
Parking ticket          $30     0.20
Obstruction ticket     $100     0.05
No ticket

The problem leaves out some things that you can figure for yourself. Remember that every probability model includes all outcomes, and the probabilities add up to 1. If there’s a 25% chance of parking up to an hour, there must be a 100−25 = 75% chance of parking more than an hour. And on the street, if you have a 20+5 = 25% chance of getting some kind of ticket, you have a 100−25 = 75% chance of getting neither. The cost of getting neither ticket is zero. Now you can fill in the empty cells in the tables.

Parking in lot            x     P(x)
≤ 1 hour                $10     0.25
> 1 hour                $14     0.75
Total                           1.00

Parking on street         x     P(x)
Parking ticket          $30     0.20
Obstruction ticket     $100     0.05
No ticket                $0     0.75
Total                           1.00

BTW: I showed the total probability to emphasize that it’s 1. Never compute the total of the outcomes (x’s), because that wouldn’t mean anything.

How do these tables help you make up your mind where to park? By themselves, they don’t. But they let you compute µ and σ, and that will help you decide. I placed the x’s and P(x)’s for the parking lot in L1 and L2, and did 1-Var Stats L1,L2. I placed the x’s and P(x)’s for street parking in L3 and L4 and did 1-Var Stats L3,L4. Here are the results:

[Calculator screens: 1-Var Stats results for the lot and for the street]

As always, look first at n. If it’s not exactly 1, find your mistake in entering the probabilities. Now you can interpret these results. Parking in the lot is a bit more expensive in the long run (µ = $13.00 per day versus µ = $11.00 per day). But there are no nasty surprises (σ = $1.73, little variation from day to day). Parking on the street is much riskier (σ = $23.64), meaning that what happens today can be wildly different from what happened yesterday. So what should you do? Statistics can give you information, but part of your decision is always your own temperament. If you like stability and predictability (if you are risk averse), you’ll opt for the parking lot. If it’s more important to you to save $2 a day on average, and you can accept occasionally getting hit with a nasty fine, you’ll choose to park on the street.
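For readers without the calculator handy, the parking comparison can be reproduced with a short Python sketch. The dictionary layout and the name mean_and_sd are assumptions made for this illustration. The numbers come straight from the two filled-in parking models, where each cost x maps to its probability P(x).

```python
from math import sqrt

def mean_and_sd(model):
    """model maps each cost x to its probability P(x)."""
    if abs(sum(model.values()) - 1) > 1e-12:
        raise ValueError("probabilities must add up to 1")
    mu = sum(x * p for x, p in model.items())
    sigma = sqrt(sum(x * x * p for x, p in model.items()) - mu ** 2)
    return mu, sigma

lot = {10: 0.25, 14: 0.75}               # <= 1 hour, > 1 hour
street = {30: 0.20, 100: 0.05, 0: 0.75}  # parking ticket, obstruction ticket, no ticket

lot_mu, lot_sd = mean_and_sd(lot)
street_mu, street_sd = mean_and_sd(street)

print(f"Lot:    mean = ${lot_mu:.2f}, SD = ${lot_sd:.2f}")
print(f"Street: mean = ${street_mu:.2f}, SD = ${street_sd:.2f}")
```

Seeing the two results side by side makes the trade-off concrete: the street has the lower mean but a far larger SD.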

6B3. Fair Price of a Game

Definitions:

The fair price of a game is the price that would make all parties come out even in the long run. (We’re not just talking traditional games here. A game is any activity where the participants stand to gain or lose money or something else of value. Usually chance contributes to the outcome, but not necessarily.) The fair price of a game is the price that would make the expected value or mean value of the probability distribution equal to zero, the break-even point.

(“Fair price” is one of those math words that look like English but mean something different. You should expect to pay more than the fair price because the operator of the game (the insurance company or casino or stockbroker) also has to cover selling and administrative expenses.)
There are two ways to compute the fair price:
Method 1: Ignore the actual price of the game, multiply each prize by its probability, and add up the products.
Method 2: If you already know the mean of the probability distribution from the player’s point of view, then fair price = actual price + µ.

Example 8: Take a really simple bar game: a stranger offers to pay you $60 if you roll a 6 with a standard six-sided die, but you have to pay him $12 per roll. Find the fair price of this game.

Die shows      x                 P(x)
1,2,3,4,5      −$12              5/6
6              $60−12 = $48      1/6
Total                            6/6 = 1

Method 1: The only prize is $60, and you have a 1/6 chance of winning it. $60×(1/6) = $10.
Method 2: Amounts in L1, probabilities in L2; 1-Var Stats L1,L2. Verify that n=1, and read off the mean of −$2. The actual price is $12, so the fair price is $12 + (−2) = $10.
Naturally, the two methods always give the same answer. Method 2 is easier if you already know the mean of the probability distribution; otherwise Method 1 is easier.

Example 9: A lottery has a $6,000,000 grand prize with probability of winning 1 in 3,000,000. It also has a $10 consolation prize with probability of winning 1 in 1000. What is the fair price of your $5 lottery ticket?
Solution: You don’t need µ, so Method 1 is easier: multiply each prize by its probability and add up the products.
$6,000,000×(1/3,000,000) + $10×(1/1000) → fair price is $2.01.
Why does a lottery ticket that is worth $2.01 actually cost $5.00? In effect, the lottery is paying out about 2.01/5.00 ≈ 40% of ticket sales in prizes. Some of the 60% that the lottery commission keeps will cover the lottery’s own expenses, and the rest is paid to the state treasury.
This is actually fairly typical: most lotteries pay out in prizes less than half of what they take in. By contrast, the illegal “numbers game” pays out about 70%, or at least it did in the 1980s in Cleveland. (Don’t ask me how I know that!)
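The two fair-price methods can be checked against each other with a small Python sketch. The function name fair_price and the variable names are invented for this illustration. The prizes, prices, and probabilities are the ones from the die game and the lottery in Examples 8 and 9.

```python
from fractions import Fraction

def fair_price(prizes):
    """Method 1: add up prize x probability, ignoring the actual price."""
    return sum(Fraction(value) * prob for value, prob in prizes)

# Example 8, Method 1: the only prize is $60, won with probability 1/6.
dice_method1 = fair_price([(60, Fraction(1, 6))])

# Example 8, Method 2: fair price = actual price + mean of the player's outcomes.
price = 12
mu_player = (-price) * Fraction(5, 6) + (60 - price) * Fraction(1, 6)
dice_method2 = price + mu_player

# Example 9: grand prize plus consolation prize.
lottery = fair_price([
    (6_000_000, Fraction(1, 3_000_000)),
    (10, Fraction(1, 1000)),
])

print("die game:", dice_method1, dice_method2)
print("lottery: ", float(lottery))
print("share of a $5 ticket paid out in prizes:", float(lottery / 5))
```

Both methods land on the same value for the die game, which is the point of the remark that the two methods always give the same answer.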

6C. Bernoulli Trials

In the examples so far of probability models, I’ve had to give you a table of probabilities. But there are many subtypes of discrete probability distribution where the probabilities can be calculated by a formula. The rest of this chapter will look at part of one family, discrete probability distributions that come from Bernoulli trials.

Definition:

Repeated trials of a process or an event are called Bernoulli trials if they have both of these characteristics: 1. Each trial has only two possible outcomes. We call those “success” and “failure”. However, “success” is not necessarily a desirable outcome. Success simply means the outcome you’re interested in, and failure is the other outcome. 2. The probability of success, denoted p, is the same for every trial. This is another way of saying that the trials are independent. (Even if they’re not independent, you can usually treat the trials as independent if the sample is a small part of the population, not more than about 10%.) If the probability of success on each trial is p, then the probability of failure on each trial is 1−p, or q for short.

BTW: Bernoulli trials are named after Jacob Bernoulli, a Swiss mathematician. He developed the binomial distribution, which you’ll meet later in this chapter.

Example 10:

You randomly interview 30 people to find out which party they will vote for in the next election. These are not Bernoulli trials, because there are more than two possible outcomes. (New York State ballots often have six or more parties listed, though some parties just endorse the Republican or Democratic candidate.)

Example 11:

On reflection, you realize that you don’t care which party a given voter will choose. All you care about is whether they are voting for your candidate or not, so you randomly select 30 registered voters and ask, “Will you be voting for Abe Snake for President?” These are Bernoulli trials, because there are only two answers, and the probability of voting for Abe Snake is the same for each randomly selected person. (p equals the proportion of Abe Snake voters in the population. Remember, proportion of all = probability of one.) BTW: Actually, this overlooks the undecided or “swing” voters. These become fewer as the election gets closer, but in real life they can’t be overlooked because they may be a larger proportion than the leading candidate’s lead.

Example 12:

You draw cards from a deck until you get a heart. These are not Bernoulli trials. Although there are only two outcomes, heart and other suit, the probability changes with each draw because you have removed a card from the deck.
Variation: You replace each card and reshuffle the deck before drawing the next card. Then these become Bernoulli trials because the probability of drawing a heart is 25% on every trial.
Variation: You have five decks shuffled together, instead of one 52-card pack. You don’t replace cards after drawing them. You can treat these as Bernoulli trials even without replacement, because you won’t be drawing enough cards to alter the probabilities significantly. How do I know? Five packs is 260 cards, and 10% of 260 is 26. On the first card, P(heart) = 25%. It’s quite unlikely that you’d have no hearts by the 26th card (0.04% chance), but if you did, the probability of a heart on the 27th card would be: 5×13/(5×52−26) ≈ 27.8%. That’s not much different from the original 25%. (You don’t have to take my word for these probabilities. Use the sequences method from Chapter 5 to compute them.) Although this sample without replacement violates independence, it doesn’t violate it by very much, not enough to worry about. This bears out what I said earlier: Trials without replacement can still be treated as independent when the sample is small relative to the population.
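Here is one way to take up that invitation and verify the five-deck figures with the sequences method, sketched in Python. The variable names are illustrative only. The "no hearts in 26 draws" sequence is 26 non-hearts in a row, each probability computed for the deck as it stands at that point.

```python
from fractions import Fraction

cards = 5 * 52   # five decks shuffled together
hearts = 5 * 13
draws = 26       # 10% of the 260 cards

# Sequence: non-heart on draw 1, non-heart on draw 2, ..., non-heart on draw 26.
p_no_hearts = Fraction(1)
for i in range(draws):
    non_hearts_left = cards - hearts - i
    cards_left = cards - i
    p_no_hearts *= Fraction(non_hearts_left, cards_left)

# If that happened, all 65 hearts remain among the 234 cards left.
p_heart_27th = Fraction(hearts, cards - draws)

print(f"P(no hearts in first 26) = {float(p_no_hearts):.4%}")
print(f"P(heart on 27th | no hearts so far) = {float(p_heart_27th):.1%}")
```

The first number should come out near the 0.04% quoted above and the second near 27.8%, which is the worst-case drift from the original 25%.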

6D. The Geometric Model

Example 13: According to the AVMA (2014) 30.4% of US households own one or more cats. Suppose you randomly select some households. (a) How likely is it that the first time you find cat owners is in the fifth household? (b) How likely is it that your first cat-owning household will be somewhere in the first five you survey?
Although you could compute these individual probabilities using techniques from Chapter 5, there’s a specific model called the geometric model that makes it a lot easier to compute. Also, using the geometric model you can get an overview of the probabilities for various outcomes, which you’d miss by computing probabilities of specific events using the previous chapter’s techniques. If trials are independent, and you want the probability of a string of failures before your first success, you’re using a geometric model.

Definition:

The geometric model, also known as the geometric probability distribution, is a kind of discrete probability distribution that applies to Bernoulli trials when you try, and try, and try again until you get a success. P(x) is the probability that your first success will come on your xth attempt, after x−1 failures. Expanding on the definition of Bernoulli trials, you can say that a geometric model is one where Each trial has only two possible outcomes, called success and failure. There’s no fixed number of trials. You keep on till you have a success. The probability of success, p, is the same on every trial (the trials are independent). The random variable X is the number of trials, including the successful final trial, so x ≥ 1, with no upper limit.

The probability of success on any given trial, p, completely describes a geometric model. Here’s a picture of part of the geometric model for cat-owning households, with p = 0.304. How do you read this? The horizontal axis is x, the number of the trial that gives your first success, and the vertical axis is P(x), the probability of that outcome. For example, there’s a hair over a 30% chance that you’ll find cat owners in your first household, P(1) = 30.4%. There’s about a 21% chance that the first household won’t own cats but the second household will, P(2) ≈ 21%. Skipping a bit to x = 6, there’s just about a 5% chance that the first five households won’t have cats but the sixth will, P(6) ≈ 5%. And so forth. x = 1 is always the most likely outcome, and larger x values are successively less and less likely. This is true for every geometric distribution, not just this particular one with p = 0.304. The geometric model never actually ends. The probabilities eventually get too small to show in the picture, but no matter what x you pick, the probability is still greater than 0.
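To get a feel for the shape of this distribution without the picture, here is a rough Python sketch that tabulates the first ten probabilities and draws a crude text bar chart. It assumes the rule “x−1 failures, then a success”, so P(x) is the failure probability multiplied x−1 times, then multiplied by p; the names are mine.

```python
p = 0.304      # probability that a household owns cats (Example 13)
q = 1 - p      # probability that it doesn't

# P(x): x-1 failures in a row, followed by the first success.
probs = {x: q ** (x - 1) * p for x in range(1, 11)}

for x, prob in probs.items():
    bar = "#" * round(prob * 100)   # one '#' per percentage point
    print(f"P({x:2d}) = {prob:.4f}  {bar}")
```

Each bar is shorter than the one before it, which matches the statement that x = 1 is always the most likely outcome.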

6D1. Computing Probabilities Your TI-83/84 calculator has two menu selections for the geometric model: geometpdf(p,x) answers the question “what’s the probability that my first success will come at trial number x?” geometcdf(p,x) answers the question “what’s the probability that my first success will come at or before trial number x?” (The “c” stands for cumulative, because the cdf functions accumulate the probabilities for a range of outcomes.) They’re both in the [2nd VARS makes DISTR] menu. (If you have a calculator in the TI-89 family, use the [F5] Distr menu. Select Geometric Pdf and Geometric Cdf.) Let’s use the calculator to find the answers for Example 13. Here p, the probability of success in any given household, is 30.4% or 0.304. Part (a) wants the probability of four failures followed by a success on the fifth try. For that you use geometpdf. Press [2nd VARS makes DISTR] [▲] [▲] to get to geometpdf, and press [ENTER].

With the “wizard” interface: Enter p and x. Press [ENTER] twice, and your screen will look like the one at right.

With the classic interface: After entering p and x, press [)] [ENTER] to get the answer.

geometpdf(.304,5) = .0713362938 → 0.0713

There’s about a 7% chance you won’t find any cat owners in the first four households but you will in the fifth household. (You could calculate this the long way. The probability of four failures followed by a success is (1−.304)^4 × .304. But the geometric model is easier. That’s the point of a model: one general rule works well enough for all cases, so you don’t have to treat each situation as a special case with its own unique methods.) Part (b) wants the probability of a success occurring anywhere in the first five trials. This is a geometcdf problem. Press [2nd VARS makes DISTR] [▲] to get to geometcdf, and press [ENTER].

With the “wizard” interface: Enter p and x. Press [ENTER] twice, and your screen will look like the one at right.

With the classic interface: After entering p and x, press [)] [ENTER] to get the answer.

geometcdf(.304,5) = .8366774327 → 0.8367

There’s almost an 84% chance you will find at least one cat-owning household among the first five. (Doing this the long way, you would use the complement. The complement of “at least one cat-owning household in the first five” is “no cat-owning households in the first five”. The probability that a given household doesn’t own a cat is q = 1−.304 = 0.696, and the probability that five in a row don’t own cats is 0.696^5. Therefore the original probability you wanted is 1−(.696^5) = 0.8367.) BTW: You don’t actually need formulas for the geometric model, but if you’re curious about what your calculator is doing, here they are: geometpdf(p,x) = q^(x−1)·p and geometcdf(p,x) = 1−q^x, where q = 1−p as usual. You can see that the two “long way” paragraphs above actually used those formulas.
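If you don’t have the calculator handy, the two formulas are easy to try in Python. This is only a sketch: the function names are borrowed from the TI menu entries for readability, and they are not part of any Python library.

```python
def geometpdf(p: float, x: int) -> float:
    """P(first success comes exactly on trial x): x-1 failures, then a success."""
    q = 1 - p
    return q ** (x - 1) * p


def geometcdf(p: float, x: int) -> float:
    """P(first success comes on or before trial x): complement of x failures in a row."""
    q = 1 - p
    return 1 - q ** x


# Example 13, with p = 0.304
part_a = geometpdf(0.304, 5)   # first cat owners found in the fifth household
part_b = geometcdf(0.304, 5)   # first cat owners found somewhere in the first five

print(f"(a) {part_a:.4f}")
print(f"(b) {part_b:.4f}")
```

Adding up geometpdf for x = 1 through 5 gives the same number as geometcdf, which is exactly what “cumulative” means.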

6D2. Mean and Standard Deviation of a Geometric Distribution The geometric distribution is completely specified by p, so you can compute the mean and standard deviation quite easily: µ = 1/p and σ = µ√q, or equivalently σ = (1/p)√(1−p). Example 14: 30.4% of US households own cats. How many households do you expect you’ll need to visit to find a cat-owning household? Solution: The expected value of a distribution is the mean. µ = 1/p = 1/.304 = 3.289473684. µ = 3.3. Interpretation: On average, you expect to have to visit between 3 and 4 households to find the first cat owners. Caution! The expected value (mean) is not the most likely value (mode). Take a look back at the histogram, and you’ll see that the most likely value is 1: you’re more likely to get lucky on the first trial than on any other specific trial. But the distribution is highly skewed right, so the average gets pulled toward the higher numbers. To compute the SD, just multiply the mean by √q. A handy technique is called chaining calculations. After first calculating the mean, press the [×] key, and the calculator knows you are multiplying the previous answer by something. Here you see that σ = 2.7. Interpreting σ is a bit harder. The geometric distribution is a type of discrete probability distribution, so you interpret its standard deviation the same way as for any other DPD. In this particular example, σ is almost as large as µ, so you expect a lot of variability. If you and a lot of coworkers go out independently looking for households with cats, the group average number of visits will be about 3.3 households, but there will be a lot of variability between different workers’ experience. You can’t use the Empirical Rule here because the geometric model is not a bell curve, but you can at least say you won’t be surprised to find workers who get lucky on the first house (µ−σ ≈ 0.5), and workers who have to visit six houses or more (µ+σ ≈ 6.0).
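Here is a small Python sketch of the same two formulas for Example 14. As a sanity check it also computes the mean and SD the general way for any discrete probability distribution, by summing over the first 200 outcomes. That cutoff is my own choice; the probabilities beyond it are far too small to matter.

```python
from math import sqrt

p = 0.304
q = 1 - p

# Shortcut formulas for the geometric model
mu = 1 / p              # expected number of households visited
sigma = mu * sqrt(q)    # standard deviation

# Cross-check with the general DPD definitions, truncated at x = 200
xs = range(1, 201)
probs = [q ** (x - 1) * p for x in xs]
mu_check = sum(x * px for x, px in zip(xs, probs))
sigma_check = sqrt(sum((x - mu) ** 2 * px for x, px in zip(xs, probs)))

print(f"mu    = {mu:.4f}  (check: {mu_check:.4f})")
print(f"sigma = {sigma:.4f}  (check: {sigma_check:.4f})")
```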

6D3. Making a Decision Some people find it very hard to make choices because they feel they must consider all the pros and cons of every possibility. Others look at possibilities one at a time and take the first one that’s acceptable. Studies such as The Tyranny of Choice (Roets, Schwartz, Guan 2012) show that the first group may make better choices objectively, but the second group is happier with the items they choose. Example 15: You have to buy a new sofa. You’d be content with 55% of the sofas out there. Let’s assume that your Web search presents sofas in an order that has nothing to do with your preferences. There are hundreds to choose from, so you decide to adopt the “first one that’s acceptable” strategy. How likely is it that you’d order the third sofa you’d see? Solution: This is a geometric model, with two failures followed by one success. p = 55%. geometpdf(.55,3) = .111375. There’s about an 11% chance you’d order the third sofa.

6D4. Baseball Example 16: Larry’s batting average is .260. During which time at bat would he expect to get his first hit of the game? How likely is he to get his first hit within his first four times at bat? Solution: This is a geometric model with p = 0.260. The mean or expected value is 1/p = 1/.26 = 3.85, about 4. On average, his first hit each game will come on his fourth time at bat. For the second question, geometcdf(.26,4) = .70013424; there’s about a 70% chance he’ll get his first hit within his first four times at bat.

6E. The Binomial Model In the previous section, we looked at the geometric model, where you just keep trying until you get a success. In this section, we’ll look at the binomial model, where you have a fixed number of trials and a varying number of successes. Definition:

The binomial model, also known as the binomial probability distribution or BPD, is a kind of discrete probability distribution that applies to Bernoulli trials when you have a fixed number of trials, n. Expanding on the definition of Bernoulli trials, you can say that a binomial model is one where
• Each trial has only two possible outcomes, called success and failure.
• You have a fixed number of trials, n.
• The probability of success, p, is the same on every trial (the trials are independent). The probability of failure on any trial is q = 1−p.
• The random variable X is the number of successes, so 0 ≤ x ≤ n, and P(x) is the probability of x successes.

Example 17: Cats again! 30.4% of US households own one or more cats. You visit five households, selected randomly. (a) What’s the chance that no more than two have cats? (b) What’s the chance that exactly two have cats? (c) What’s the chance that at least two have cats? (d) What’s the chance that two to four have cats? This problem fits the binomial model: n = 5 trials, each household does or does not have cats, and the probability p = 30.4% is the same for each household. A picture of this binomial distribution is shown at right, and you can see some differences from the picture of the geometric distribution: The geometric distribution extends from one trial to ∞, but the binomial distribution can have only 0 to n successes in n trials. The binomial distribution is less strongly skewed than the geometric distribution, and 1 is not necessarily the most likely value of X (though it happens that way in this particular distribution). How do you read the picture? There’s about a 16% probability that none of the five households will have cats, about 36% that one of the five will have cats, and so on. (Why 36% and not 30.4%? Because there’s a greater chance of “winning” one out of five than one out of one.) BTW: In this book we’re more concerned with computing probabilities, but it can be nice to get an overall picture of a distribution. I made this particular graph by using @RISK from Palisade Corporation, but you can also make histograms of binomial distributions by using MATH200A Program part 1(5).
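If you can’t see the graph, a rough text version is easy to produce. The Python sketch below assumes the standard binomial rule (number of ways to choose which x trials succeed, times the probability of x successes and n−x failures); the variable names are mine.

```python
from math import comb

n, p = 5, 0.304    # five households, 30.4% chance each owns cats
q = 1 - p

# P(x) for x = 0, 1, ..., n successes
probs = [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]

for x, prob in enumerate(probs):
    bar = "#" * round(prob * 100)   # one '#' per percentage point
    print(f"P({x}) = {prob:.4f}  {bar}")

print(f"total = {sum(probs):.4f}")
```

The bars stop at x = 5, and the probabilities add up to 1, as they must for any discrete probability distribution.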

6E1. Computing Probabilities Here you have a choice. Your TI-83/84 calculator comes with two menu selections for the binomial model, but the MATH200A program gives you a simpler interface. Here’s a quick overview of both, before we start on computations:

With the MATH200A program (recommended): MATH200A Program part 3 gives you one interface for all binomial probability calculations. The program might already be on your calculator from Chapter 3 boxplots, but if it’s not, see Getting the Program for instructions. To find binomial probability with the program, press [PRGM]. If you see MATH200A in the list, press its menu number; otherwise, press [▼] or [▲] to get to MATH200A, and press [ENTER]. That puts the program name on your home screen. Press [ENTER] again to run the program, and yet again to dismiss the title screen. You’ll then see a menu. Press [3] for binomial probability.

If you’re not using the program: These are both in the [2nd VARS makes DISTR] menu: binomcdf(n,p,x) answers the question “what’s the probability of no more than x successes in n trials (0 to x successes)?” (The “cdf” stands for cumulative distribution function, because the cdf functions accumulate the probabilities for a range of outcomes.) binompdf(n,p,x) answers the question “what’s the probability of exactly x successes in n trials?” (The “pdf” stands for probability distribution function, because the probability for any particular number of successes is a function of [determined by] that number.)

(Got a TI-89 family calculator? Use the [F5] Distr menu. Select Binomial Pdf or Binomial Cdf. The Cdf function can handle any range of successes, not just 0 to x. See Binomial Probability Distribution on TI-89 for full instructions.)


Now let’s use your TI-83/84 to answer the questions in Example 17. You have five trials, so n = 5. The probability of success on any given household is 30.4%, so p = 0.304. (a) What’s the probability that no more than two of the five randomly selected households have cats?

With the MATH200A program (recommended): Press [PRGM], select MATH200A, and press [3] in the MATH200A menu. Enter n and p. “No more than two cats” is from 0 to 2 cats, so enter those values when prompted. The program echoes back your inputs and shows the computed probability. To show your work, write down the screen name, the inputs, and the result. Conclusion: P(x ≤ 2) or P(0 ≤ x ≤ 2) = 0.8316.

If you’re not using the program: The probability that no more than two of your five households have cats (in other words, the probability that 0 to 2 have cats) is binomcdf(5,.304,2). Press [2nd VARS makes DISTR] and scroll up to binomcdf. If you don’t have the “wizard” interface, or you have it turned off, binomcdf( will appear on your screen. Enter n, p, and the desired maximum number of successes, in that order, then the closing paren and [ENTER]. If you have the “wizard” interface, you get a menu screen, but you enter the same information. Press [ENTER] once on Paste and then again when the command is pasted to your home screen. Either way, write down the binomcdf command and the argument numbers to show your work. Conclusion: P(x ≤ 2) or P(0 ≤ x ≤ 2) = 0.8316.

(b) What’s the probability that exactly two of five randomly selected households are cat owners?

With the MATH200A program (recommended): You need a specific number of successes, instead of a range. It’s almost exactly the same deal: you just enter the same number for from and to. In this example, to get the probability of exactly two successes, enter number of successes from 2 to 2. Conclusion: P(x = 2) or P(2) = 0.3116.

If you’re not using the program: The probability of exactly two cat-owner households in five is binompdf(5,.304,2). Press [2nd VARS makes DISTR] and then press [▲] several times to get to binompdf. (Caution! pdf, not cdf.) Press [ENTER], type in the numbers, and press [)] [ENTER]. (The “wizard” interface screen is the same as it was for binomcdf.) Conclusion: P(x = 2) or P(2) = 0.3116.

(c) What’s the probability that at least two of the five randomly selected households have cats?

With the MATH200A program (recommended): “At least two”, in a sample of five, means from two to five successes. Enter those values in MATH200A part 3. Conclusion: P(x ≥ 2) or P(2 ≤ x ≤ 5) = 0.4800.

If you’re not using the program: This one is a little trickier. You could find P(2), P(3), P(4), and P(5) and add them up by hand, but that’s tedious and error prone, and it can introduce rounding errors. Instead, you’ll make the calculator add them up for you. First, get all the probabilities for 0 through n successes into a statistics list. To do this, use binompdf (not cdf) but with only the n and p arguments. (If you have the “wizard” interface, leave x value blank.) After the closing paren, don’t press [ENTER] just yet. Instead, press the [STO→] key and select a statistics list, such as [2nd 6 makes L6]. Then press [ENTER]. This puts the probabilities for 0 successes, 1 success, and so on to 5 successes into L6. (If you want, you could examine them by scrolling with the arrow keys, or on the [STAT] edit screen.)

Now you need to sum the desired range of cells. You want 2 ≤ x ≤ 5. But the lowest possible x is 0, and the cells in statistics lists are numbered starting at 1. So to get x from 2 through 5, you need cells 3 through 6. When summing part of a list, add 1 to your desired x values. Press [2nd STAT makes LIST] [◄] [5] to paste sum(, then [2nd 6 makes L6] [,] 3 [,] 6 [)] [ENTER]. Your answer: P(x ≥ 2) or P(2 ≤ x ≤ 5) = 0.4800.

Beware of off-by-one errors when you solve problems with phrases like at least and no more than. Always test the “edge conditions”. “Okay, I need at least 2, and that’s 2 through 5, not 3 through 5. Oh yeah, add 1 for the statistics list in the TI-83, so I’m summing cells 3 through 6, not 2 through 5.”

Alternative solution: Do you remember solving “at least” problems in Chapter 5? What was the lesson there? With laborious probability problems, the complement is your friend. What’s the complement of “at least two”? It’s “fewer than two”, which is the same as “no more than one”. Shaky on the logic of complements? Use the enumeration method from Chapter 5: 0 1 2 3 4 5 or 0 1 | 2 3 4 5. Find the probability of ≤1 household with cats, and subtract from 1:
P(x ≥ 2) = 1 − P(x ≤ 1)
P(x ≥ 2) = 1 − binomcdf(5, .304, 1)
P(x ≥ 2) = .4799959639 → 0.4800

(d) What’s the chance for two to four cat-owning households in your random sample of five households?

With the MATH200A program (recommended): Nothing new here: just use good old MATH200A part 3. P(2 ≤ x ≤ 4) = 0.4774.

If you’re not using the program: You need x from 2 through 4, but remember you always add 1 when summing binomial probabilities from a statistics list, so you put 3 to 5 in your sum command. (You’re still using the same distribution, so there’s no need to repeat binompdf.) P(2 ≤ x ≤ 4) = 0.4774.

Alternative solution: You can also do it without summing. If you think about it, the probability for x from 2 to 4 is the probability for x from 0 to 4, with x below 2 (x no more than 1) removed: 0 1 | 2 3 4. In symbols, P(0 ≤ x ≤ 1) + P(2 ≤ x ≤ 4) = P(0 ≤ x ≤ 4) and by subtracting that first term you get P(2 ≤ x ≤ 4) = P(0 ≤ x ≤ 4) − P(0 ≤ x ≤ 1). Your probability is the result of subtracting two cumulative probabilities, the cdf from 0 to 4 minus the cdf from 0 to 1, that is, binomcdf(5,.304,4) − binomcdf(5,.304,1). This is tricky, I admit. You have to set that x value correctly in the second binomcdf, so this method is not much better than the other one. About all it has going for it is that it avoids storing values in a list and then using sum.

BTW: You don’t actually need a formula for the binomial model, but if you’re curious about what your calculator is doing, here it is: binompdf(n,p,x) = nCx · p^x · q^(n−x). Why? p^x is the probability of getting successes on all of the first x trials. q is the probability of failure on one trial, and therefore q^(n−x) is the probability of failure on the remaining trials, after the x out of n successes. But in a binomial probability model, you care how many successes and failures there are, not in what order they occur. To account for the fact that order doesn’t matter, the formula has to multiply by nCx, “the number of ways to choose x objects out of n”. (If you want to know more about nCx, search “combinations” at your favorite math site.)

BTW: Unlike the geometric case, there’s no simple formula for binomcdf. Your calculator just has to compute probabilities for x = 0, 1, and so on and add them up.
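For readers working on a computer, here is a minimal Python sketch of the binomial formula and of all four parts of Example 17. The function names mimic the TI menu entries but are my own definitions, not library calls, and binomcdf is simply a sum of binompdf values.

```python
from math import comb


def binompdf(n: int, p: float, x: int) -> float:
    """P(exactly x successes in n trials): nCx * p^x * q^(n-x)."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


def binomcdf(n: int, p: float, x: int) -> float:
    """P(0 to x successes in n trials): add up the individual probabilities."""
    return sum(binompdf(n, p, k) for k in range(x + 1))


n, p = 5, 0.304
part_a = binomcdf(n, p, 2)                      # no more than two
part_b = binompdf(n, p, 2)                      # exactly two
part_c = 1 - binomcdf(n, p, 1)                  # at least two, via the complement
part_d = binomcdf(n, p, 4) - binomcdf(n, p, 1)  # two to four

for label, value in zip("abcd", (part_a, part_b, part_c, part_d)):
    print(f"({label}) {value:.4f}")
```

Notice that Python’s range(x + 1) plays the same role as the “add 1” rule for statistics lists: both exist because counting of successes starts at 0.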

6E2. Baseball Again! Example 18: Larry’s batting average is .260. How likely is it that he’ll get more than one hit in four times at bat? Solution: This is a binomial model with n = 4, p = 0.26, x = 2 to 4. You can use MATH200A part 3 or the binompdf-sum technique to get .27870128. P(x > 1) = 0.2787 or about 28%. (The program is completely straightforward, so I’m showing only the tricky binompdf-sum sequence here.) Alternative solution: If you don’t have the program, can you see how to use the complement to solve this problem more easily? Check your answer against mine to be sure that your method is correct.

6E3. Mean and Standard Deviation of a Binomial Distribution The binomial distribution depends on the proportion in the population (p) and your sample size (n). You can compute the mean and SD quite easily: µ = np and σ = √[npq]. What are the mean and SD of the number of cat-owning households in a random sample of five households? µ = np = 5 × 0.304 = 1.52 and σ = √[npq] = √[5 × .304 × (1−.304)] = 1.028552381. Conclusion: µ = 1.5 and σ = 1.0. Interpretation: in a sample of five households, the expected number of cat-owning households is 1.5. Or, if you take a whole lot of samples of five households, on average you will find that 1.5 households per sample own cats. The SD is 1.0. You can’t use the Empirical Rule, but you can say that you expect most of the samples of five to contain µ±2σ = 1.5±2×1.0 cat-owning households. That’s −0.5 to 3.5, which for a count of households means 0 to 3.
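As a quick check of the shortcut formulas, this Python sketch computes µ and σ both ways: with np and √[npq], and with the general definitions for any discrete probability distribution. The variable names are mine.

```python
from math import comb, sqrt

n, p = 5, 0.304
q = 1 - p

# Shortcut formulas for the binomial model
mu = n * p
sigma = sqrt(n * p * q)

# The long way: weight each outcome by its probability
probs = [comb(n, x) * p ** x * q ** (n - x) for x in range(n + 1)]
mu_check = sum(x * px for x, px in enumerate(probs))
sigma_check = sqrt(sum((x - mu_check) ** 2 * px for x, px in enumerate(probs)))

print(f"mu    = {mu:.4f}  (check: {mu_check:.4f})")
print(f"sigma = {sigma:.4f}  (check: {sigma_check:.4f})")
```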

6E4. Surprised? Example 19: 30.4% of US households own one or more cats. You visit ten random households and seven of them own cats. Are you surprised at this result? Definition:

A result is surprising or unusual or unexpected if it has low probability, given what you think you know about the population in question. The threshold for “low probability” can vary in different problems, but a typical choice is 5%. When we ask whether a result is surprising (unusual, unexpected), we are really talking about that result or one even further from the expected value.

You think you know that 30.4% of US households own cats. A sample of ten doesn’t seem very large; how do you decide whether seven successes seems reasonable or unreasonable? First, what’s the expected value? That’s µ = np = 10×.304 = 3.04. Next, what does “that result or one further from the expected value” mean? The expected value is 3.04, seven is greater than 3.04, so we’re talking about seven or more successes, x = 7 to 10. Find the probability of that result or one even further from the expected value. That’s easiest with MATH200A part 3: set n=10, p=.304, x=7 to 10. You can also do it with binomcdf: seven or more successes is the complement of zero to six successes (0 1 2 3 4 5 6 | 7 8 9 10). Either way, the probability is 0.0115 or just over 1%. Draw your conclusion. If 30.4% of US households own cats, finding seven or more cat houses in a random sample of ten households is unusual (surprising, unexpected). That was a trivial example. But in real life, when a result is unexpected it can cast doubt on what you’ve been told. Here’s an example.

6E5. A Life-or-Death Example Example 20: In Talladega County, Alabama, in 1962, an African American man named Robert Swain was accused of rape. 26% of eligible jurors in the county were African American, but the 100-man jury panel for Swain’s trial included only 8 African Americans. (Through exemptions and peremptory challenges, all were excluded from the final 12-man jury.) Swain was convicted and sentenced to death. Swain’s lawyer appealed, on grounds of racial bias in jury selection. The Supreme Court ruled in 1965 that “The overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of blacks.” (Adapted from Michailides.) What do you think of that ruling? If 100 men in the county were randomly selected, is eight out of 100 in the jury pool unexpected (unusual, surprising)? Solution: This is a binomial model: every man in the county either is or is not African American, the sample size is a fixed 100, and in a random sample there’s the same 26% chance that any given man is African American. To determine whether eight in 100 is unexpected, ask what is expected. For binomial data, µ = np = 100×.26 = 26; in a sample of 100, you expect 26 African Americans. Okay, 26 is expected, 8 is less than 26, “further away from expected” is less than 8, so you compute the probability for x = 0 to 8. Use binomcdf(100,.26,8) or MATH200A part 3. Either way you get a probability of 4.734795002E-6, or about 0.000005, five chances in a million. That is unexpected. It’s so unlikely that we have to question the county’s claim that the selection was random. Unfortunately, Mr. Swain’s lawyer didn’t consult a statistician.
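The “surprised?” procedure used in Examples 19 and 20 can be sketched in Python as follows. The helper name tail_probability and the way it picks a direction are my own illustration of the steps described above, with the typical 5% threshold; they are not a standard library routine.

```python
from math import comb


def binompdf(n: int, p: float, x: int) -> float:
    """P(exactly x successes in n trials)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def tail_probability(n: int, p: float, x: int) -> float:
    """P(that result or one even further from the expected value np)."""
    mu = n * p
    if x >= mu:
        outcomes = range(x, n + 1)   # result is above expected: x or more
    else:
        outcomes = range(0, x + 1)   # result is below expected: x or fewer
    return sum(binompdf(n, p, k) for k in outcomes)


THRESHOLD = 0.05   # a typical choice for "low probability"

cats = tail_probability(10, 0.304, 7)   # Example 19: seven or more of ten
jury = tail_probability(100, 0.26, 8)   # Example 20: eight or fewer of 100

for name, prob in (("cats", cats), ("jury", jury)):
    verdict = "surprising" if prob < THRESHOLD else "not surprising"
    print(f"{name}: {prob:.3g} ({verdict})")
```

Both results fall below 5%, but the jury-panel probability is smaller by several orders of magnitude, which is what makes that example so striking.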

What Have You Learned? Key ideas:

• Concept of a random variable. (You don’t actually work problems about this.)
• Discrete probability distribution (DPD or discrete PD), a/k/a probability model: list of possible outcomes with their probabilities.
• µ and σ for a DPD: computing and interpreting.
• Concept of Bernoulli trials: only two possible outcomes; p = probability of success and q = probability of failure.
• Geometric model: definition, computing probabilities, computing and interpreting µ and σ. (This is a less important topic.)
• Binomial probability distribution (BPD or binomial PD): definition. How do you know when you have a binomial model?
• Computing probabilities in the binomial model. Use either binomcdf/binompdf or MATH200A, but MATH200A is less work.
• µ and σ for a binomial PD: computing and interpreting. (You must use formulas to compute them.)
• Determining whether an outcome is unusual or surprising.

Study aids:

TI-83/84 Cheat Sheet

Statistics Symbol Sheet


Exercises for Chapter 6 Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand. Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1

You roll five dice and count the number of twos that appear. (a) List the possible values of the discrete random variable, X = “number of twos in five dice”. (b) What type of probability model is appropriate? Why?

2

A lottery has a 1 in 10 million chance of paying $10,000,000, a 1 in 125 chance of paying $100, and a 1 in 20 chance of paying $10. A ticket costs $5, and you do not get that money back if you win a prize. (a) Construct a discrete probability distribution. (b) Is this a good deal or a bad deal for you? Explain.

3

Blood Types at the Stanford School of Medicine’s Web site lists the relative frequencies of blood types in the US. (There’s also a nice chart of what blood types you can safely receive, based on your own blood type.) Only 6.6% of the US have O negative blood. Velma the Vampire will drink anything, but she prefers O negative. She doesn’t know a victim’s blood type until she tastes it. (a) How many does she expect to drain before she gets some O negative? (b) How likely is it that she’ll find her first O negative within her first ten victims? (c) How likely is it that exactly two of her first ten victims will be O negative?

4

In January 2013, a CBS News story by Sarah Dutton and others reported poll results: 92% of American adults favored universal background checks for gun buyers. (a) If TC3 students are representative of American adults when the poll was taken, what’s the chance that you’ll have to ask three TC3 students before the third one opposes universal background checks? (b) How likely is it that you’d find a student opposing universal background checks somewhere in the first three you ask, not necessarily in third position?

5

Suppose 80% of students who register for Elizabethan Sonnets complete the course successfully. (a) Imagine taking many, many samples of seven people, with replacement. What would be the expected number and standard deviation of the number of people that would finish successfully, per sample of seven? (b) At the end of the semester, imagine a random group of seven students who originally registered for the course. Find the probability that four to six of them completed it successfully. (c) What’s the chance that, when you ask each person in turn, the third person you ask is the first one who successfully completed the course? (d) What’s the chance that the first person that you find who successfully completed the course is one of the first two you ask?

6

In a June 2013 poll, the Pew Research Center (2013b) found that 49% of American adults approved of President Obama’s job performance. In a random sample of 40 American adults, taken at the same time, would you be surprised if 13 approved his performance? Why or why not?

7

According to the Social Security Administration (2010), 0.1304% of 22-year-old males are expected to die in the next year. (a) What is the fair price of a $100,000 one-year term life insurance policy on a 22-year-old male? (To keep things simple, assume that the company will charge the same price to every 22-year-old male, without regard to lifestyle or health factors.) (b) The company actually charges $180.00 for this policy, more than the fair price. Is this unfair? Explain.

8

A coin is weighted: the chance of heads is not 50%. On five flips of that coin, the probability of various numbers of heads is shown by this model:

x      0       1       2       3       4       5
P(x)   0.0778  0.2591  0.3456  0.2305  0.0768  0.0102

(a) Find and interpret the mean and standard deviation of this probability model. (b) For an extra challenge, can you use your answer from part (a) to construct a simpler probability model for five flips of this coin?

9

Long experience shows that a particular drug will help 70% of the people who take it. (a) If you take a random sample of five people, what is the probability that the drug helps at least three? (b) If you take many samples of 10 people, what’s the average number of people per sample that the drug will help? (c) In a random sample of 10 people, would you be surprised if the drug helps only five? Why or why not?

10

In April 2013, the Pew Research Center released poll results for the question “Which of the following best describes how you feel about doing your taxes?” Surprisingly (to me, anyway), 34% said they like or love doing their taxes. (a) How many Americans would you expect to have to ask to find one who likes or loves doing her taxes? (b) If you ask five random Americans, what’s the probability that none of them will say they like or love doing their taxes?

11

In a sentence or two, write down the difference between the geometric and binomial models. (Write it, don’t just think it. It’s easy to tell yourself you understand something, but the rubber meets the road when you have to put your understanding into words on paper.)

12

In a sentence or two, write down the difference between pdf and cdf.




What’s New 7 Oct 2015: Simplify the definition of a random variable. 13 Mar 2015: Rewrite the section on computing fair price, removing the use of 1-VarStats and adding the relation among mean, fair price, and actual price. Add an example showing both methods. Simplify the computation in Example 9 (formerly Example 8). In one example, correct geometcdf to geometpdf, thanks to David Messmer. 11 Jan 2015: Add links and study aids in What Have You Learned? (intervening changes suppressed) 16 Apr 2013: New document.

7. Normal Distributions Updated 30 Mar 2016 (What’s New?) Summary:

The normal distribution (ND) is important for two reasons. First, many natural and artificial processes are ND. You’ll look at some of those in this chapter. Second, any process can be treated as a ND through sampling. That will be the subject of Chapter 8, and it’s also the foundation of the inferential statistics you’ll do in Chapters 9 through 11.

Contents:

7A. Continuous Random Variables 7A1. Density Curves 7A2. Probability and Continuous Distributions · Area = Probability · Two Interpretations of Probability 7B. The Normal Model 7B1. Properties of the Normal Distribution 7B2. From Boundaries, Find Probability · Computing the Area · Percentiles 7B3. From Probability, Find Boundaries · Percentiles Again 7C. The Standard Normal Distribution 7C1. “Normal” and “Standard Normal” 7C2. Applying the Standard Normal Distribution 7C3. The z Function (Critical z) 7D. Checking for Normality 7D1. Checking Data Sets 7D2. Optional: How Normal Probability Plots Work What Have You Learned? Exercises for Chapter 7 What’s New

7A. Continuous Random Variables You met random variables back in Chapter 6. Any random variable has a single numerical value, determined by chance, for each outcome of a procedure. Discrete random variables are limited to specified values, usually whole numbers. But a continuous random variable can take any value at all, within some interval or across all the real numbers. Just as discrete probability models are used to model discrete variables, continuous probability models are used to model continuous variables. Of course, because a continuous random variable has infinitely many possible values, you can’t make a table of values and probabilities as you could do for a discrete distribution. Instead, either there’s an equation, or just a density curve (below). A probability model is often called a distribution, so you can say that a variable “is normally distributed” (ND), that it “is a normal distribution” (also ND), or that it “follows a normal probability model”. There are lots of specialized continuous distributions, but the normal distribution is most important by a wide margin. Many, many real-life processes follow the normal model, and the ND is also the key to most of our work in inferential statistics. This section will give you some concepts that are common to all continuous distributions, and the rest of the chapter will talk about special properties of the normal distribution and applications. In Chapter 8, you’ll apply the normal distribution to get a handle on the variation from one sample to the next.

7A1. Density Curves In Chapter 2, you learned to graph continuous data by grouping the data in classes and making a histogram, like the one below left. This is wait times in a fast-food drive-through, with time in minutes — not whole minutes, which would make a discrete distribution, but minutes and fractional minutes. Any sample you might take has a finite number of data points, so you set up classes, place the data points in the classes, and then draw a histogram. The height of each bar is proportional to the frequency or relative frequency of that class.

But when you come to consider all the possible values of a continuous variable, you have an infinite number of data points. If you tried to assign them to classes, it would take you forever, literally! Instead, you draw a smooth curve, called a density curve, to show the possible values and how likely they are to occur. An example is shown above right. The density curve is a picture of a continuous probability model. It doesn’t just represent the data in a particular sample, but all possible data for that variable, along with the probabilities of their occurrence, as you’ll see next.

7A2. Probability and Continuous Distributions Up to now, the height of a bar in a histogram has been the number of data points in that class, or the relative frequency of that class. But how do you interpret the height of a density curve? Answer: you don’t! The height of the curve above any particular point on the x axis just doesn’t lend itself to a simple interpretation. You might think it would be the probability of that value occurring. But with infinitely many possible values, “what’s the likelihood of a wait time of exactly 4 minutes?” just isn’t a meaningful question, because what about 3.99997 minutes or 4.002 minutes? Area = Probability What is meaningful is the probability within an interval, which equals the area under the curve within that interval. For example, in this illustration, the probability of a wait time of 6.4 to 9.5 minutes is 29.4%. In symbols, P(6.4 ≤ x ≤ 9.5) = 29.4% or P(6.4 < x < 9.5) = 29.4% That’s right — the probability is the same whether you include or exclude the endpoints of the interval. BTW: Okay, I lied. The height of the curve is meaningful, but only if you’ve had some calculus. The curve is the graph of a probability density function or pdf. The integral of that curve from a to b is the area between x=a and x=b and is the probability that the random variable will have a value between a and b. This explains why the probability is the same whether you include or exclude either endpoint of the interval. The difference is the area of a “rectangle” whose height is the height of the density curve and whose width is the distance from a to a — which is zero. Thus the area of the “rectangle” is zero, and the probability of the random variable taking any particular value, exactly, is zero. Since area equals probability, and total probability must be 1, total area must be 1. Every pdf — the height of every density curve — is scaled so that the integral from −∞ to +∞ is 1.
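For readers who have had some calculus, the three facts in the note above can be written compactly. This is only a restatement in symbols; the name f for the density function and the integration variable t are notation chosen here, not taken from the text.

```latex
P(a \le x \le b) = \int_a^b f(t)\,dt ,
\qquad
P(x = a) = \int_a^a f(t)\,dt = 0 ,
\qquad
\int_{-\infty}^{\infty} f(t)\,dt = 1
```

The middle equation is the zero-width "rectangle": it is why including or excluding an endpoint never changes the probability.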

You can also have the probability for an interval with one boundary: < or ≤ some value, like the picture at right, or > or ≥ some value. For example, 3.33 minutes is about 3 minutes and 20 seconds, so the probability of waiting up to 3 minutes and 20 seconds is 20.6%: P(x ≤ 3.33) = 20.6%. The total area under any density curve equals the probability that the random variable will take any one of its possible values, which of course is 1, or 100%. So you can use the complement to say that the probability of waiting 3 minutes and 20 seconds or more (or, more than 3 minutes and 20 seconds) is 100% − 20.6% = 79.4%. Two Interpretations of Probability You remember from Interpreting Probability Statements in Chapter 5 that every probability can be interpreted as a probability of one or a proportion of all. For example, P(x > 3.33) = 79.4% can equally well be interpreted in two ways: Probability of one: “Any randomly selected person has a 79.4% chance of waiting more than 3 minutes and 20 seconds.” Proportion of all: “79.4% of people will wait more than 3 minutes and 20 seconds.” Which interpretation you use in a given situation depends on what seems simplest and most natural in the situation. Here, the “proportion of all” interpretation seems simpler. But you’re always free to switch to the other interpretation if it helps you in thinking about a situation.

Area = Probability of One = Proportion of All

7B. The Normal Model Why study the normal distribution? First, it’s useful on its own. Lots and lots of real-life distributions match the normal model: body temperature or blood pressure of healthy people, scores on most standardized tests, commute times on a given route, lifetimes of batteries or light bulbs, heights of men or women, weights of apples of a particular variety, measurement errors (in many situations), and on and on. BTW: Why is the ND so common? In real life, very few events have just one cause; most things are the result of many factors operating independently. It turns out that if you take a lot of independent random variables and add them up, their sum is approximately ND. For example, your IQ score results from multiple genetic factors, countless occurrences in your education and your family life, even transient factors like how well you slept the night before the test. Most of these are independent of each other, so the result of adding them is close to a ND. BTW: Several mathematicians can claim the discovery of the normal distribution. Abraham de Moivre (1667–1754, French) was probably first, in 1733. But the name of Carl Friedrich Gauss is permanently coupled to the normal distribution, literally. Although Sir Francis Galton coined the term normal distribution in 1889, Karl Pearson called it the Gaussian distribution in 1905, and that’s still a recognized synonym.

Second, through sampling, even non-ND populations follow a normal model. You’ll use this model in inferential statistics to make statements about a whole population based on just one sample. You’ll learn about this neat trick in Chapter 8.
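The claim that adding up many independent random quantities gives something close to a ND can be tried out with a small simulation. The sketch below is only an illustration under assumptions that are not in the text: it uses Python's standard library, adds 12 uniform random numbers per trial (the count 12 and the function name sum_of_uniforms are arbitrary choices), and checks one ND fingerprint, namely that roughly 68% of the sums land within one SD of their mean.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable


def sum_of_uniforms(k=12):
    """Add up k independent random numbers, each uniform between 0 and 1."""
    return sum(random.random() for _ in range(k))


sums = [sum_of_uniforms() for _ in range(20_000)]
m, s = mean(sums), stdev(sums)

# If the sums are roughly ND, about 68% of them fall within one SD of the mean.
within_one_sd = sum(1 for x in sums if abs(x - m) <= s) / len(sums)

print(f"mean = {m:.2f}, SD = {s:.2f}, within one SD = {within_one_sd:.1%}")
```

No single uniform number is anywhere near bell-shaped, yet the sums behave much like a normal model. The match is close, not perfect, which is the usual situation with real data too.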

7B1. Properties of the Normal Distribution The normal distribution (ND) has the properties of other continuous distributions as listed earlier. In particular, area = probability, and the total area under the density curve is the total probability, which is 1. The ND also has these special properties: A ND is completely described by its mean and SD. The mean locates the center of the curve, but has no effect on the shape. For example, here are three normal curves with µ = 0, 2, and 5 and σ = 4.



The standard deviation determines the shape of the curve, but has no effect on the location. Smaller SD means the data stick closer to the mean, so the peak is higher and the tails are shorter and fatter. Larger SD means the data vary more, so they spread out from the mean: the peak is lower and the tails are longer and thinner. The second picture shows three normal curves with µ = 2 and σ = 2, 4, and 6. (The vertical scale is different from the first picture.) The ND is symmetric: left and right sides are mirror images of each other. This implies that the mean, median and mode are all equal. In principle, the tails of the normal curve run out to ±∞. However, data points more than 3 standard deviations from the mean are rare. (This is part of the Empirical Rule from Chapter 3.) The books all say that inflection points are one SD above and below the mean. Inflection points, if you haven’t had calculus, are where the curve transitions between concave up and concave down. The books don’t tell you that those points are far from obvious visually. Just do the best you can when making sketches. All of this is the theoretical normal distribution. In fact, nothing in real life is perfectly ND, because nothing in real life has an infinite number of data points. When we say something is ND, we mean it’s a close match, not a perfect match. “Normally distributed” (or ND) is short for “using a normal distribution to model this data set, the calculations will come out close enough to reality.” This is a lot like what you did in Chapter 3, when you computed the statistics of a grouped distribution. The statistics were only approximate, because of the simplification you introduced by grouping, but the approximation was good enough. Now let’s get to some applications! 
There are two main categories: “forward” problems, where you have the boundaries and you have to find the area or probability, and “backward” problems, where you have a probability or area and you have to find the boundaries. BTW: In case you’re interested, the pdf, the height of the density curve above a given x, is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$. The cdf, the area to the left of a given x, is the integral of that, just the same as finding the area under any curve to the left of a given x: $F(x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t-\mu)^2/(2\sigma^2)}\,dt$. This integral doesn’t have a “closed form”, a finite sequence of basic algebraic operations, so it must be found by successive approximations. That’s what your calculator does with normalcdf and Excel does with NORM.DIST.
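To make "found by successive approximations" concrete, here is a rough sketch in Python (standard library only). It codes the normal pdf by hand and estimates the area to the left of a point by adding up many thin rectangles, then compares the result with the library's own cdf. The function names, the choice to start 10 SD below the mean, and the number of rectangles are all assumptions made for this illustration; real calculators use much smarter approximations.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Height of the normal density curve above x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def normal_cdf_by_rectangles(x, mu, sigma, steps=100_000):
    """Approximate the area to the left of x with thin midpoint rectangles.

    The left tail is cut off 10 SD below the mean, where the area that
    remains is far too small to matter.
    """
    left = mu - 10 * sigma
    if x <= left:
        return 0.0
    width = (x - left) / steps
    heights = (normal_pdf(left + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width


mu, sigma = 2, 4                      # one of the curves pictured earlier
approx = normal_cdf_by_rectangles(5, mu, sigma)
exact = NormalDist(mu, sigma).cdf(5)  # the library's built-in cdf

print(f"rectangles: {approx:.6f}   library: {exact:.6f}")
```

The two numbers should agree to several decimal places. The peak height, normal_pdf(mu, mu, sigma), shrinks as sigma grows, which matches the pictures: larger SD, lower peak.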

7B2. From Boundaries, Find Probability Summary:

Make a sketch, estimate the probability (area), then compute it.

TI-83/84/89:

Use normalcdf(left bound, right bound, mean, SD ). I’ll walk you through the TI-83/84 keystrokes in the first example below. If you have a TI-89, press [CATALOG] [F3] [plain 6 makes N] [ENTER].

Excel:

In Excel 2010 or later, use (deep breath here) =NORM.DIST(right bound, mean, SD, TRUE) − NORM.DIST(left bound, mean, SD, TRUE). In Excel 2007 or earlier, it’s NORMDIST rather than NORM.DIST.

Example 1: Heights of human children of a given age and sex are ND. One study found that three-year-old girls’ heights have a mean of 38.72″ and SD of 3.17″. What percentage of three-year-old girls are 35″ to 40″ tall? Solution: Take the time to make a sketch. It doesn’t have to be beautiful, but you should make it as accurate as you reasonably can. It’s an important safeguard against making boneheaded mistakes. Here’s what should be on your sketch: 1. Draw the axis line. 2. Label the axis, x or z as appropriate. x is the symbol for real-world data points, and z is the symbol for z-scores in the standard normal distribution, below. 3. Draw a vertical line in the middle of the distribution and write the numerical value of the mean below the axis where that central line meets it. (If necessary, offset it with a tick mark, as I did.) 4. Draw a horizontal line at about the right spot and show the numerical value of the standard deviation. 5. Draw a line and show the value for each boundary. Important: When you marked the SD, you set the scale for the sketch. Now you have to honor that and place your boundaries in proportion. For instance, in this problem the mean is 38.72 and the left boundary is 35, which is 3.72 below the mean. Your left boundary therefore needs to be a bit more than one SD (3.17) left of the mean. The right bound is 40, which is 1.28 above the mean, so your line needs to be just over a third of a SD to the right of the mean. (Students often put in more numbers and lines, like the values of 1, 2, and 3 SD above and below the mean. That’s not wrong, but it’s usually not helpful, and it definitely clutters up the sketch.) 6. Shade the area you’re trying to find. 7. Look at your sketch and estimate the area before you pull out your calculator. That way, if you make a mistake that leads to a ridiculous answer, you’ll recognize it as ridiculous and fix it. From my sketch, I estimate an area of 50%–60%. 
If it’s 45% or 70% I won’t be terribly surprised, but if it’s 5% or 99% I’ll know something is wrong. 8. Compute the area (below). If you wish, add that number to your sketch — not below the axis, please. Write it within the shaded area, if there’s room, or as a callout to the left or right of the diagram, the way I did here. Computing the Area On a TI-83 or TI-84, press [2nd VARS makes DISTR] [2] to select normalcdf. Enter the left boundary (35), right boundary (40), mean (38.72), and SD (3.17). (If you have a TI-89 or you’re using Excel, see above.) With the “wizard” interface:

With the classic interface: After entering the standard deviation, press [)] [ENTER] to get the answer.

Press [ENTER] twice, and your screen will look like the one at right. You always need to show your work, so write down normalcdf(35,40,38.72,3.17) before you proceed to the answer. (There’s no need to write down the keystrokes you used.) In this book, I round probabilities to four decimal places, or two decimal places if expressed as a percentage. The probability is P(35 ≤ x ≤ 40) = 0.5365 That number matches my estimate of 50%–60%. But the problem asked for a percentage. (Always, always, always look back at the problem and make sure you’re answering the question that was actually asked.) The answer: 53.65% of three-year-old girls are 35″ to 40″ tall. Example 2: A three-year-old girl is randomly chosen. Would it be unusual (unexpected, surprising) if she’s over 45″ tall? In Chapter 5 you learned to call a low-probability event unusual (a/k/a surprising or unexpected). The standard definition of unusual events is a probability below 0.05, so really this problem is just asking you to find the probability and compare it to 0.05. Solution: The sketch is at right, and obviously the probability should be small. The left boundary is 45, but what’s the right boundary? The normal distribution never quite ends, so the right boundary is ∞ (infinity). TI-89s have a key for ∞, but TI-83s and TI-84s don’t and Excel doesn’t, so use 10^99 instead. (That’s 10 to the 99th power; the [^] key on your TI calculator is between [CLEAR] and [÷].) Show your work: P(x > 45) = normalcdf(45,10^99,38.72,3.17) = 0.0238 That’s rounded from 0.0237914986, and it’s in line with my estimate of “small”. Now answer the question: There’s only a 2.38% chance that a randomly selected three-year-old girl will be over 45″ tall, so that would be unusual. Example 3: For the same population, find and interpret P(x < 33). Solution: The sketch is at right, and again the expected probability is small. The right boundary is 33, but what’s the left boundary? 
You might want to use 0, since no one can be under 0″ tall, but you could make the same argument for 1″ or 5″, so that can’t be right. To locate the left boundary, remember that you’re using a normal model to approximate the data, and the normal distribution runs right out to ±∞. Therefore, the left boundary is minus ∞ on a TI-89, or minus 10^99 on a TI-83/84. (Use the [(-)] key, not the [−] subtraction key.) P(x < 33) = normalcdf(-10^99,33,38.72,3.17) = 0.0356 The proportion of three-year-old girls under 33″ tall is 0.0356 or 3.56%; or, 3.56% of three-year-old girls are under 33″ tall. The other interpretation is that the chance that a randomly selected three-year-old girl is under 33″ tall is 0.0356 or 3.56%. Percentiles Example 4: What’s the percentile rank of a three-year-old girl who is 33″ tall? Solution: Long ago, in a galaxy called Numbers about Numbers, you learned the definition of percentiles. The percentile rank of a data point is the percentage of the data set that is ≤ that data point. So you need P(x ≤ 33). But that’s exactly what you computed in the previous example: 3.56%. So the 33″-tall girl is between the third and fourth percentiles for her age group. “That was P(x < 33), and for a percentile I need P(x ≤ 33)!” I hear you yell. But those two are equal. When we talked about density curves, near the beginning of this chapter, you learned that the area and probability are the same whether you include or exclude the boundary. And this is why it doesn’t make much difference whether you define a percentile rank in terms of < or ≤, because the probability in a continuous distribution is the same either way.
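If you'd like to check Examples 1 through 4 without a TI calculator or Excel, here is one way to do it in Python. This is offered as a cross-check only, not as part of the book's method. It assumes Python 3.8 or later, whose standard library includes statistics.NormalDist; the variable names are made up for this sketch. The cdf method gives the area to the left of a boundary, so a two-sided interval is a subtraction, exactly as in the Excel formula above.

```python
from statistics import NormalDist

# Heights of three-year-old girls, in inches (mean and SD from Example 1).
heights = NormalDist(mu=38.72, sigma=3.17)

# Example 1: area between 35 and 40 = (area left of 40) - (area left of 35).
p_between = heights.cdf(40) - heights.cdf(35)

# Example 2: area to the right of 45, found with the complement.
p_over_45 = 1 - heights.cdf(45)

# Examples 3 and 4: area to the left of 33, which is also the percentile rank.
p_under_33 = heights.cdf(33)

print(f"P(35 <= x <= 40) = {p_between:.4f}")
print(f"P(x > 45)        = {p_over_45:.4f}")
print(f"P(x < 33)        = {p_under_33:.4f}")
```

There is no need for a stand-in like 10^99 here, because cdf already measures from −∞. The three results should agree with the calculator answers in the examples: 0.5365, 0.0238, and 0.0356.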

7B3. From Probability, Find Boundaries Summary:

Make a sketch, estimate the value(s), then compute the value(s).

TI-83/84/89:

Use invNorm(area to left, mean, SD ). I’ll walk you through the TI-83/84 keystrokes in the first example below. If you have a TI-89, press [CATALOG] [F3] [plain 9 makes I] [t 3 times] [ENTER].

Excel:

In Excel 2010 or later, use =NORM.INV(area to left, mean, SD ). In Excel 2007 or earlier, it’s NORMINV rather than NORM.INV.

Example 5: Blood pressure is stated as two numbers, systolic over diastolic. The World Health Organization’s MONICA Project (Kuulasmaa 1998) reported these parameters for the US: Systolic: µ = 120, σ = 15 Diastolic: µ = 75, σ = 11 Blood pressure in the population is normally distributed. The lowest 5% is considered “hypotensive”, according to Kuzma and Bohnenblust (2005, 103). What systolic blood pressure would be considered hypotensive? Solution: Always make a sketch for these problems. Your sketch is similar to the ones you made for the first group of problems, except that you use a symbol like x1 or “?” for the unknown boundary, and you write in the known area. Always estimate your answer to guard against at least some errors. In the sketch, x1 looks like it’s not quite two SD left of the mean, so I’ll estimate a pressure of 95 to 100. (Okay, I cheated by using my calculator to make my “sketch”. But even with a real pencil-and-paper sketch, you ought to be in the right ballpark.) Now you’re ready to calculate. TI-89 or Excel users, please see the instructions above. On your TI-83 or TI-84, press [2nd VARS makes DISTR] [3] to select invNorm. Enter the area to the left of the point you’re interested in (.05), the mean (120), and the SD (15). With the “wizard” interface:

With the classic interface: After the standard deviation, press [)] [ENTER] to get the answer.

Press [ENTER] twice, and your screen will look like the one at right. Show your work! Write down invNorm(.05,120,15) before you proceed to the answer. (There’s no need to write down the keystrokes you used.) Answer: Systolic blood pressure (first number) under 95 would be considered hypotensive. Example 6: The same source considers the top 5% “hypertensive”. What is the minimum systolic blood pressure that is hypertensive? Solution: My “sketch” is at right. It’s mostly straightforward: the x1 boundary is between the 5% tail and the rest of the distribution. But what’s up with the 1−0.05? The problem asks you about the upper 5%, which is the area to the right of the unknown boundary. But invNorm on the calculator, and NORM.INV in Excel, need area to left of the desired boundary. The area to the left is the probability of “not hypertensive”, and area is probability, so the area to left is 1 minus the area to right, in this case 1−0.05. Could you just write down 0.95? Sure, that would be correct. But if the area to right was 0.1627 you’d probably make the calculator compute 1 minus that for you, so why not be consistent? x1 = invNorm(1−.05,120,15) = 144.6728044 → 145 (That’s actually a little liberal. Several sources that I’ve seen give 140 as the threshold.) Example 7: Kuzma and Bohnenblust describe the middle 80% as “normal”. What is that range of systolic blood pressure? This problem wants you to find two boundaries, lower and upper. You have to convert the 80% middle into two areas to left. Here’s how. If the middle is 80%, then the two tails combined must be 100% − 80% = 20%. But the curve is symmetric, so each tail must be 20%/2 = 10%. Strictly speaking, I probably should have written that computation on the diagram, instead of just a laconic “0.1”, but it would take up a lot of space and the computation was easy enough. You’ll probably do the same; just be careful. 
Once you have the areas squared away, the computation is simple enough: x1 = invNorm(.1,120,15) = 100.7767265 → 101 x2 = invNorm(1−.1,120,15) = 139.2232735 → 139 Check: The boundaries of the middle 80% (or the middle any percent) should be equal distances from the mean. (100.7767265+139.2232735)/2 = 120, so at least it’s consistent. Answer: Systolic b.p. of 101 to 139 is considered normal. Percentiles Again Example 8: What’s the 40th percentile for systolic blood pressure? Sometimes the gods smile on us. The kth percentile is the value that is ≥ k% of the population, so k% is exactly the area to left that you need. P40 = invNorm(.4,120,15) = 116.1997935 → 116
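The “backward” problems in Examples 5 through 8 can be cross-checked the same way. The sketch below assumes Python 3.8 or later and uses statistics.NormalDist from the standard library; its inv_cdf method plays the role of invNorm and, like invNorm, expects the area to the left of the boundary. The variable names are invented for this illustration.

```python
from statistics import NormalDist

# Systolic blood pressure model from Example 5.
systolic = NormalDist(mu=120, sigma=15)

# Example 5: boundary with 5% of the area to its left.
hypotensive_below = systolic.inv_cdf(0.05)

# Example 6: top 5% means 1 - 0.05 of the area is to the left.
hypertensive_from = systolic.inv_cdf(1 - 0.05)

# Example 7: middle 80% leaves 10% in each tail.
normal_low = systolic.inv_cdf(0.10)
normal_high = systolic.inv_cdf(1 - 0.10)

# Example 8: the 40th percentile has 40% of the area to its left.
p40 = systolic.inv_cdf(0.40)

print(f"hypotensive below   {hypotensive_below:.1f}")
print(f"hypertensive from   {hypertensive_from:.1f}")
print(f"middle 80%          {normal_low:.1f} to {normal_high:.1f}")
print(f"40th percentile     {p40:.1f}")
```

The symmetry check from Example 7 works here too: (normal_low + normal_high) / 2 should come out to the mean, 120.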

7C. The Standard Normal Distribution Definition:

The standard normal distribution is a normal distribution with a mean of 0 and standard deviation of 1, sometimes written N(0,1).

The standard normal distribution is a picture of z-scores of any possible real-world ND — more about that later. The standard normal distribution lets you make computations that apply to all normal models, not just a particular model. You’ll see some examples shortly, but first —

7C1. “Normal” and “Standard Normal” The main point about the standard normal distribution is that it’s a stand-in for every ND from real life. How does this work? Well, if you take any real data set and subtract the mean from every data point, the mean of the new data set is 0. And if you then divide that data set by the standard deviation (which doesn’t change when you subtract a constant from every data point), then the SD of the new-new data set is 1. But all you did with those manipulations was replace the numbers with z-scores. Remember the formula: $z = (x - \mu)/\sigma$. The standard normal distribution is what you get when you convert any

normal model to z-scores. BTW: Long ago, when dinosaurs ruled the earth (okay, up through the early 1980s), a “computer” was a person who used a slide rule to make computations. (I swear I am not making this up.) There were no statistical calculators and no Excel. The only way for most people to make computations on a normal model was to look up probabilities in printed tables. But obviously a book couldn’t print tables for every normal model. So the printed tables were for the standard normal distribution. If you had boundaries and wanted the probability of the interval, you converted your real-world numbers to z-scores, looked up the probabilities in the table, and subtracted them. If you had a probability and needed a boundary, you looked up the z-score in the table and then converted it to a raw score using the mean and SD of your data set. The need to do normal computations the hard way has gone the way of the dinosaurs, but I think this history is why many stats books still use tables to do their computations. Inertia is a powerful force in textbooks! BTW: The pdf and cdf functions for the standard normal distribution are what you get when you set µ=0 and σ=1 in the general equations for the ND: $f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$ and $F(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\,dt$. Again, the integral must be found by successive approximations. That’s where the tables in books come from, and it’s what your calculator does with normalcdf and Excel does with NORM.DIST.
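The table-era procedure described above (convert to z-scores, look up, convert back) can be imitated in a few lines, which also shows why the standard normal distribution is a stand-in for every other ND. This is a sketch under assumptions: it uses Python's statistics.NormalDist in place of a printed table, and the helper name z_score is made up here. The numbers come from Example 1 and Example 5 earlier in the chapter.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, SD 1: the "printed table"


def z_score(x, mu, sigma):
    """Convert a raw score to a z-score."""
    return (x - mu) / sigma


# Forward problem (Example 1): convert both boundaries to z, look up, subtract.
mu, sigma = 38.72, 3.17
p_via_z = standard.cdf(z_score(40, mu, sigma)) - standard.cdf(z_score(35, mu, sigma))
p_direct = NormalDist(mu, sigma).cdf(40) - NormalDist(mu, sigma).cdf(35)

# Backward problem (Example 5): look up z for the area, then convert to a raw score.
z = standard.inv_cdf(0.05)
x_via_z = 120 + z * 15
x_direct = NormalDist(120, 15).inv_cdf(0.05)

print(f"forward:  {p_via_z:.4f} via z-scores, {p_direct:.4f} directly")
print(f"backward: {x_via_z:.2f} via z-scores, {x_direct:.2f} directly")
```

Both routes give the same answers, so working directly with the real-world mean and SD loses nothing.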

7C2. Applying the Standard Normal Distribution I said above that the standard normal distribution lets you make statements about all normal models. What sort of statements? Well, the Empirical Rule for one. Example 9: The Empirical Rule says that 68% of the population in a normal model lies within one SD of the mean. How good is the rule? In other words, what’s the actual proportion? Solution: As usual, you start with a sketch. This is the standard ND, so the axis is z, not x. There’s no need to mark the mean or SD, because the z label identifies this as a standard normal distribution and therefore µ = 0 and σ = 1. Just label the boundaries. Compute the probability the same way you’ve already learned. (Both Excel and the TIs have special procedures available for the standard normal distribution, but it’s not worth taking brain cells to learn them, when the regular procedures for the ND work just fine with N(0,1).) P(−1 ≤ z ≤ 1) = normalcdf(−1,1,0,1) = .6826894809 → 68.27% The Empirical Rule says 68% of the data are within z = ±1. Actually it’s about 68¼%, close enough. Example 10: How many standard deviations must you go above and below the mean to take in the middle 50% of the data in a normal model? Solution: This is similar to finding the middle 80% of blood pressures earlier, except now you’re making a statement about all normal models, not just a particular one. Shading the middle 50% leaves 100% − 50% = 50% in the two tails combined, so each tail is 50%/2 = 25%. z1 = invNorm(.25,0,1) = −.6744897495 → −0.67 By symmetry, z2 must be numerically equal to z1 but have the opposite sign: z2 = 0.67. 50% of the data in any normal model are within about 2/3 of a SD of the mean. Since the bounds of the middle 50% of the data are Q1 and Q3, the IQR of any normal distribution is twice that, about one and a third standard deviations. More precisely, the IQR is 2×0.674 ≈ 1.35 times the SD.

7C3. The z Function (Critical z) There’s one special notation you’ll use when you compute confidence intervals in Chapter 9. Definition:

$z_{\text{area}}$ or z(area), also known as critical z, is the z-score that divides the standard normal distribution such that the right-hand tail has the indicated area.

This may seem a little weird, but really it’s just a recipe to specify a number. Compare with the square root of 48. That is the positive number such that, if you multiply it by itself, you get 48. Or consider π: the number that you get when you divide the circumference of a perfect circle by its diameter. Math is full of numbers that are specified as recipes. An example will make things clearer. Example 11: Find $z_{0.025}$. Solution: The problem is diagrammed at right. Caution! 0.025 is an area, not a z-score, so you don’t write 0.025 on the number line (the z axis). $z_{0.025}$ is a z-score (though you don’t know its value yet), so it goes on the number line. Once you have your sketch, the computation is straightforward. Have area (probability), compute boundary. The area is 0.025, but it’s an area to right, and invNorm needs an area to left, so you subtract from 1 as usual: $z_{0.025}$ = invNorm(1−.025, 0, 1) = 1.959963986 → 1.96 Caution! You’re computing a boundary for the right-hand tail. If you get a negative number, that can’t possibly be right. $z_{0.025}$ = 1.96 makes sense, if you think about it. If you also shaded in the left-hand tail with an area of 0.025, the two tails together would total 5%, leaving 95% in the middle. The Empirical Rule says that 95% of data are within 2 SD above and below the mean, and 1.96 is approximately 2.
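The critical-z recipe translates into a one-line function. This is a sketch assuming Python's statistics.NormalDist; the function name critical_z is a label chosen here, not anything defined by the book or by the library.

```python
from statistics import NormalDist


def critical_z(right_tail_area):
    """z-score that leaves the given area in the right-hand tail of N(0,1)."""
    # inv_cdf wants the area to the LEFT, so subtract from 1 first.
    return NormalDist().inv_cdf(1 - right_tail_area)


z_025 = critical_z(0.025)
print(f"z(0.025) = {z_025:.2f}")

# Sanity check from the text: both 0.025 tails together leave 95% in the middle.
middle = NormalDist().cdf(z_025) - NormalDist().cdf(-z_025)
print(f"area between -z and +z = {middle:.2%}")
```

For any right-tail area under 0.5 the result is positive, which matches the caution in Example 11.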

7D. Checking for Normality How do you know whether a normal model is appropriate? How do you know whether your data are normally distributed? A histogram can rule out skewed data, or data with more than one peak. But what if your data are unimodal and not obviously skewed? Is that enough to justify a normal model? No, it’s not. You need to perform a test called a normal probability plot. You’ll need this procedure in Chapters 8 through 11, whenever you have a small sample of numeric data. Summary:

To check whether a normal model can represent your sample, make a normal probability plot. This plots the actual data points, against the z-scores you would expect for this number of points that are ND. If the plot is close to a straight line, a normal model is appropriate; if the plot is far from a straight line, a normal model is not appropriate. That’s the bare outline, and you’ll get a little bit more with the examples. For those who want the full theory, it’s marked optional at the end of this section.

Technology:

Testing for normality can be automated partly or completely, depending on what technology you have: On a TI-83/84, you have two choices: Normality Check on TI-83/84, or the MATH200A program (shown below). I strongly recommend the program, not just because I wrote it but because it saves you a lot of work. See Getting the Program. On a TI-89, you have to do the plot and the computations yourself. See the step-by-step procedure in Normality Check on TI-89. There’s an Excel workbook that does everything described here, and even includes a second test of normality. See Normality Check and Finding Outliers in Excel.

7D1. Checking Data Sets Example 12: Consider these vehicle weights (in pounds): 2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200 Do they fit a normal model? Solution: Put the data in any statistics list, then press [PRGM], scroll down to MATH200A, and press [ENTER] twice. Select Normality chk. The program makes the plot, and you can look at the points to determine whether they seem to be pretty much on a straight line. At least, that’s the theory. In practice, most data sets are a lot less clear cut than this one. It can be hard to tell whether the points fit a line, particularly if you have only a few of them. The plot takes up the whole screen, so deviations can look bigger than they really are. Fortunately, there’s a test for whether points lie on a straight line. As you know from Chapter 4, the closer the correlation coefficient r is to 1, the closer the points are to a straight line. The program computes r for you, and it also computes a critical value* to help you determine if the points are close enough to a straight line. (For technical reasons, the critical value is different from the decision points of Chapter 4.) If r ≥ crit, it’s close enough to 1, the points are close enough to a straight line, and you can use a normal model. If r < crit, the points are too far from a straight line, and you can’t use a normal model. For these data, r > crit, and therefore these vehicle weights fit the normal model. *The “classic TI-83” (non-“Plus” model) doesn’t compute the critical value, so you have to do it yourself. See the formula in item 4 in the next section. Example 13: Here’s a random sample of the lengths (in seconds) of tunes in my iTunes library: 120 219 242 134 129 105 275 76 412 268 486 199 651 291 126 210 151 98 100 92 305 231 734 468 410 313 644 117 451 375

Do they fit a normal model? Solution: I entered them in a statistics list and then ran MATH200A Program part 4. The result was the plot at the right. You can see that the plot is curved. This is reinforced by comparing r=0.9473 to crit=0.9639. r < crit. The points diverge too far from a straight line, and therefore I cannot use a normal model for the lengths of my iTunes songs.

7D2. Optional: How Normal Probability Plots Work The basic idea isn’t too bad. You make an xy scatterplot where the x’s are the data points, sorted in ascending order, and the y’s are the expected z scores for a normal distribution. Why would you expect that to be a straight line? Recall the formula for a z score: z = (x − x̄)/s. Breaking the one fraction into two, you have z = x/s − x̄/s. That’s just a linear equation in x, with slope 1/s and intercept −x̄/s. So an xz plot of any theoretical ND, plotting each data point’s z score against the actual data value, would be a straight line. Further, if your actual data points are ND, then their actual z scores will match their expected-for-a-normal-distribution z scores, and therefore a scatterplot of expected z scores against actual data values will also be a straight line. Now, in real life no data set is ever exactly a ND, so you won’t ever see a perfectly straight line. Instead, you say that the closer the points are to a straight line, the closer the data set is to normal. If the data points are too far from a straight line (if their correlation coefficient r is lower than some critical value), then you reject the idea that the data set is ND. Okay, so you have to plot the data points against what their z-scores should be if this is a ND, and specifically for a sample of n points from a ND, where n is your sample size. This must be built up in a sequence of steps: 1. Divide the normal curve (mentally) into n regions of equal probability and take one probability from each region. For technical reasons, the probability number you use for region i is (i−.375)/(n+.25). This formula is in many textbooks, and also in Normal Probability Plots and Tests for Normality (Ryan and Joiner 1976). 2. Compute the expected z scores for those probabilities. Working with the calculator, that’s just invNorm of (i−.375)/(n+.25). 3. Plot those expected z scores against the data values. 
This xy plot (or xz plot) has a correlation coefficient r, computed just like any other correlation coefficient. 4. Compare the r for your data set to the critical value for the size of your data set. Ryan and Joiner determined that the critical value for sample size n is 1.0063−.1288/√n −.6118/n+1.3505/n², but to make it a little easier on the calculator I rearranged it as 1.0063−.6118/n+1.3505/n²−.1288/√n. The closer the points are to a straight line, the closer the data set is to fitting a normal model. In other words, a larger r indicates a ND, and a smaller r indicates a non-ND. You can draw one of two conclusions: If r is less than the critical value, reject the hypothesis of normality at the 0.05 significance level and say that the data set is not ND. (If you haven’t studied hypothesis testing yet, another way to say it is that you’re pretty sure the data set doesn’t fit the normal model because there’s less than a 5% probability that it does.) If r is greater than the critical value, fail to reject the hypothesis that the data set comes from a ND. This doesn’t mean you are certain it does, merely that you can’t rule it out. Technically you don’t know either way, but practically it doesn’t matter. Remember (or you will learn later) that inferential statistics procedures like t tests are robust, meaning that they still work even if the data are moderately non-normal. But if your data were extremely non-normal, r would be less than the critical value. When r is greater than the critical value, you don’t know whether the data set comes from normal data or moderately non-normal data, but either way your inferential statistics procedures are okay. So the bottom line is, if r > CRIT, treat the data as normal, and if r < CRIT, don’t. BTW: The normal probability plot is just one of many possible ways to determine whether a data set fits the normal model. 
Another method, the D’Agostino-Pearson test, uses numerical measures of the shape of a data set called skewness and kurtosis to test for normality. For details, see Assessing Normality in Measures of Shape: Skewness and Kurtosis.
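If you'd like to see the four steps of the normal probability plot outside the calculator, here is a minimal Python sketch. It is not the MATH200A program; the function names are made up for illustration, and the only formulas used are the (i−.375)/(n+.25) probabilities and the Ryan-Joiner critical value quoted in the text. The sample data set is invented.

```python
from math import sqrt
from statistics import NormalDist, fmean

def normal_plot_r(data):
    """Steps 1-3: correlation between sorted data and expected z scores."""
    xs = sorted(data)
    n = len(xs)
    # Steps 1 and 2: probability for region i, then the equivalent of invNorm
    zs = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    # Step 3: ordinary correlation coefficient of the (x, z) points
    mx, mz = fmean(xs), fmean(zs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    sxx = sum((x - mx) ** 2 for x in xs)
    szz = sum((z - mz) ** 2 for z in zs)
    return sxz / sqrt(sxx * szz)

def critical_r(n):
    """Critical value at the 0.05 level, using the formula quoted in the text."""
    return 1.0063 - 0.1288 / sqrt(n) - 0.6118 / n + 1.3505 / n ** 2

def looks_normal(data):
    """Step 4: True when r is at least the critical value."""
    return normal_plot_r(data) >= critical_r(len(data))

# Invented, strongly right-skewed data, just to exercise the functions
skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]
print(round(normal_plot_r(skewed), 4), round(critical_r(len(skewed)), 4))
print(looks_normal(skewed))
```

For a data set with one value far out in the right tail, like this one, you should see r fall well below the critical value.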

What Have You Learned? Key ideas:

Properties of the ND.
Sketching the ND.
Area = proportion of all = probability of one.
“Forward problem”: have boundary(ies), find area or probability or proportion. Sketch and use normalcdf.
“Backward problem”: have area, find boundaries (values of the ND). Sketch and use invNorm. That function needs area to left, so if the problem gives area to right you have to use 1 minus that area.
For problems involving percentiles, remember that k% of the area is to the left of the kth %ile.
Standard ND and the z_area notation (z with the area as a subscript). In this notation, area is area to right, so you need invNorm(1−area).
Determining whether a data set is ND. Use MATH200A Program part 4.

Study aids:

TI-83/84 Cheat Sheet

Statistics Symbol Sheet


Exercises for Chapter 7

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution, because you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

You’ll need this information for several of the problems:
US men’s heights: ND with µ = 69.3″, σ = 2.92″
US women’s heights: ND with µ = 64.1″, σ = 2.75″
Source: “Is Human Height Bimodal?” (Schilling 2002).

1

Suppose that variable X is Chantal’s commute time between home and school, in minutes. Give two interpretations of the statement P(x < 17) = 0.0900.

2

A male co-worker is “six foot four and a half”, or 76.5″ tall. How unusual is that? (Give two interpretations of your number.)

3

What proportion of women are 64″ to 67″ tall?

4

To enter the Pennsyltuckey Police Academy, you have to be at or above the 15th percentile in height. How tall is that, for a man?

5

What heights for men would be considered unusual (less than 5% likely)? Hint: Your answer will be in the form “under ____ inches or over ____ inches”.

6

(a) Find the 25th and 75th %iles for women’s heights. (b) Find the interquartile range. (c) Example 10 found that, in a normal distribution, the interquartile range equals 1.35 standard deviations. Does your computed IQR match that prediction?

7

Determine whether this sample of diastolic blood pressures fits the normal model: 78 66 98 90 74 70 70 76 72 86 62 84 66 70 68

8

Scores on the math SAT are ND with a mean of 500 and standard deviation of 100. What percentile is represented by a score of 735?

9

To join Mensa, you must be in the top 2% of the population on a recognized intelligence test. Mensa accepts the SAT as a qualifying test for membership. The mean on the combined three parts is 1500 and the SD is 300. What’s the minimum combined score to qualify you for Mensa?

10

Find z_0.01.

11

A small shop decided to stock formal wear for men and women in the middle 90% of height. How tall must men and women be to shop there?

12

For men’s heights, find P(x < 60″) and write two interpretations.

13

Test scores are supposed to be ND, but this is questionable on small tests. Here are scores from a recent quiz; do they fit the normal model? 0.3 8.8 11.5 12 12.3 12.5 13 13.5 14.8




What’s New 30 Mar 2016: For students with the “TI-83 classic”, add a pointer to the formula for the critical number for normality check. 1 Jan 2016: Retake the screen shots here and here, for the new version of MATH200A Program part 4. 23 June 2015: Correct “diastolic” to systolic, thanks to Darlene Huff. 22 Feb 2015: Add a reference to the D’Agostino-Pearson method of testing for a ND, and reorganize the alternatives to the MATH200A program. 1 Feb 2015: Add a reference to Excel methods for testing normality. 23 Jan 2015: Cross reference the new TI-89 procedure for checking normality. 11 Jan 2015: Add links and study aids in What Have You Learned? (intervening changes suppressed) 19 June 2013: New document.

8. How Samples Vary Updated 25 June 2015 (What’s New?) Intro:

Inferential statistics says, “I’ve got this sample. What does it tell me about the population it came from?” Eventually, you’ll estimate a population mean or proportion from a sample and use a sample to test a claim about a population. In essence, you’re reasoning backward from known sample to unknown population. But how? This chapter lays the groundwork. First you have to reason forward. Way back in Chapter 1, you learned that samples vary because no one sample perfectly represents its population. In this chapter, you’ll put some numbers on that variation. You’ll learn about sampling distributions, and you’ll calculate the likelihood of getting a particular sample from a known population. That will be the basis for all your inferential statistics, starting in Chapter 9.

Contents:

8A. Numeric Data / Means of Samples 8A1. One Sample and Its Mean 8A2. Meet the Sampling Distribution of x̄ 8A3. Properties of the Sampling Distribution of x̄ · There’s an App for That · Center of the Sampling Distribution of x̄ · Spread of the Sampling Distribution of x̄ · Shape of the Sampling Distribution of x̄ · Requirements, Assumptions, and Conditions 8A4. Applications · How to Work Problems · Example 1: Bank Deposits · Example 2: Women’s Heights · Example 3: Elevator Load Limit 8B. Binomial Data / Proportions of Samples 8B1. Sampling Distribution of p̂ · Center of the Sampling Distribution of p̂ · Spread of the Sampling Distribution of p̂ · Shape of the Sampling Distribution of p̂ · Requirements, Assumptions, and Conditions 8B2. Applications · How to Work Problems · Example 5: Swain v. Alabama 8C. Summary of Sampling Distributions What Have You Learned? Exercises for Chapter 8 What’s New

Acknowledgements: The approach I take to this material was suggested by What Is a p-Value Anyway? (Vickers 2010, ch 10), though of course any problems with this chapter are my responsibility and not Vickers’. The software used to prepare most of the graphs and all of the simulations for this chapter is @RISK from Palisade Corporation.

8A. Numeric Data / Means of Samples

8A1. One Sample and Its Mean

Having time on my hands, I was curious about the lengths of tunes in the Apple Store. Being lazy, I decided to look instead at the lengths of tunes in my iTunes library. There are 10113 of them, and I’m going to assume that they are representative. (That’s my story, and I’m sticking to it.)

I set Shuffle to Songs and then took the first 30, which gave me the times you see in the table below for a random sample of size 30.

Lengths of 30 Tunes
mm:ss   seconds
2:00    120
3:39    219
4:02    242
2:14    134
2:09    129
1:45    105
4:35    275
1:16    76
6:52    412
4:28    268
8:06    486
3:19    199
10:51   651
4:51    291
2:06    126
3:30    210
2:31    151
1:38    98
1:40    100
1:32    92
5:05    305
3:51    231
12:14   734
7:48    468
6:50    410
5:13    313
10:44   644
1:57    117
7:31    451
6:15    375

Here is a histogram of the data. The tune times are moderately skewed right. That makes sense: most tunes run around two to five minutes, but a few are longer.

The mean of this sample is 280.9 seconds, and the standard deviation is 181.7 seconds. But you know that there’s always sampling error. No sample can represent the population perfectly, so if you take another sample from the same population you’d expect to see a different mean, but not very different. This chapter is all about what differences you should expect.

First, ask yourself: Why should you expect the mean of a second sample to be “different, but not very different” from the mean of the first sample? The samples are independent, so why should they relate to each other at all?

Answer: because they come from the same population. In a given sample, you would naturally expect some data points below the population mean µ, and others above µ. You’d expect that the points below µ and the points above µ would more or less cancel each other out, so that the mean of a sample should be in the neighborhood of µ, the mean of the population.

And if you think a little further about it, you’ll probably imagine that this canceling effect works better for larger samples.
If you have a sample of four data points, you wouldn’t be much surprised if they’re all above µ or all below µ. If you have a sample of 100 data points, having them all on one side of µ would surprise you as much as flipping a coin 100 times and getting 100 heads. So you expect that the means of large samples tend to stick closer to µ than the means of small samples do. That’s absolutely true, as you’ll find out in this chapter.

To get a handle on “different, but not very different”, take a look at a second sample of 30 from the same population. This one has x̄ = 349.1, s = 204.2 seconds. From its histogram, you can see it’s a bit more strongly skewed than the first sample.

The two sample means differ by 349.14−280.93 ≈ 68.2 seconds. That might seem like a lot, but it’s only about a quarter of the first sample mean and under a fifth of the second sample mean. Also, it’s a lot less than the standard deviations of the two samples, meaning that the difference between samples is much less than the variability within samples.

There’s an element of hand waving in that paragraph. Sure, it seems plausible that the two sample means are “different, but not very different”; but you could just as well construct an argument in words that the two means are different. Without numbers to go on, how much of a difference is reasonable? In statistics, we like to use numbers to decide whether a thing is reasonable or not. How can we make a numerical argument about the difference between samples? Well, put on your thinking cap, because I’m about to blow your mind.

8A2. Meet the Sampling Distribution of x̄

The key to sample variability is the sampling distribution.

Definition:

Imagine you take a whole lot of samples, each sample with n data points, and you compute the sample mean x̄ of each of them. All those x̄’s form a new data set, which can be called the distribution of sample means, or the sampling distribution of the mean, or the sampling distribution of x̄, for sample size n. Notice that n is the size of each sample, not the number of samples. There’s no symbol for the number of samples, because it’s indefinitely large.

The sampling distribution is a new level of abstraction. It exists only in our minds: nobody ever takes a whole lot of samples of the same size from a given population. You can think of the sampling distribution as a “what if?” — if you took a whole lot of samples of a given size from the same population, and computed the means of all those samples, and then took those means as a new set of data for a histogram, what would that distribution look like? Why ask such an abstract question? Simply this: if you know how samples from a known population are distributed, you can work backward from a single sample to make some estimates about an unknown population. In this chapter, I work from a population of tunes with known mean and standard deviation, and I ask what distribution of sample means I can expect to get. In the following chapters, I’ll turn that around: looking at one sample, we’ll ask what that says about the mean and standard deviation of the population that the sample came from. What does a sampling distribution look like? Well, I used a computer simulation with @RISK from Palisade Corporation to take a thousand samples of 30 tunes each — the same n as before — and this is what I got:


“Big whoop!” I hear you say. I agree, it’s not too impressive at first glance. But let’s compare this distribution of sample means to the population those samples come from. (In real life, you wouldn’t know what the population looks like. But in this chapter I work from a known population to explore what the distribution of its samples looks like. Starting in the next chapter, I’ll turn that around and use one sample to explore what the population probably looks like.) Look at the two histograms below. The left-hand plot shows the individual lengths of all the tunes in the population — it’s a histogram of the original population. The right-hand plot shows the means of a whole lot of samples, 30 tunes per sample — it’s a histogram of the sampling distribution of the mean. That right-hand plot is the same as the plot I showed you a couple of paragraphs above, just rescaled to match the left-hand plot for easier comparison.

Now, what can you see?

Shape: The original population is skewed strongly to the right, but the sampling distribution is nearly a bell curve. (The shape is easier to see if you look at the first picture of the sampling distribution. Remember, the right-hand plot and the earlier plot are the same plot, just drawn on different scales.)

Center: The mean of the sampling distribution is 296.9 seconds, the same as the mean of the population.

Spread: Individual tune lengths (original population, left graph) vary quite a lot, but means of 30-tune samples (sampling distribution of x̄, right graph) vary much less. You can say that most individual tune lengths are a lot shorter or longer than the population average, but most mean lengths in samples of 30 are very close to the population average. Compare these measures of spread from the two graphs:

                        Population        Sampling
                        (indiv. tunes)    Distribution
Values                  50 to 1000s(*)    200 to 400
Middle 95% of values    98.0 to 696.3     244.6 to 359.1
Standard deviation      158.6             29.0

(*) I cut off the right tail of the population graph to save space.

At this point, you’re probably wondering if similar things are true for other numeric populations. The answer is a resounding YES.
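You can try the same kind of experiment yourself. The sketch below is not the @RISK simulation and does not use the actual iTunes library. As a stand-in, it invents a right-skewed population of 10113 “tune lengths” from a lognormal distribution (an assumption chosen only because it is skewed right with a similar mean and SD), then takes 1000 samples of 30 and compares center and spread.

```python
import random
from statistics import fmean, pstdev

random.seed(1)  # fixed seed so the run is repeatable

# Stand-in population (an assumption, not the real tune data):
# 10113 right-skewed lengths in seconds
population = [random.lognormvariate(5.55, 0.5) for _ in range(10113)]

def sample_means(pop, n, how_many=1000):
    """Means of how_many random samples of size n, each drawn without replacement."""
    return [fmean(random.sample(pop, n)) for _ in range(how_many)]

means = sample_means(population, 30)

print("population:   mean %.1f, SD %.1f" % (fmean(population), pstdev(population)))
print("sample means: mean %.1f, SD %.1f" % (fmean(means), pstdev(means)))
```

The two means should come out close together, while the SD of the sample means should be only a fraction of the population SD, the same pattern as in the comparison above.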

8A3. Properties of the Sampling Distribution of x̄

When you describe a distribution of continuous data, you give the center, spread, and shape. Let’s look at those in some detail, because this will be key to everything you do in inferential statistics.

There’s an App for That

Before I get into the properties of the sampling distribution, I’d like to tell you about two Web apps that let you play with sampling distributions in real time. (I’m grateful to Benjamin Kirk for suggesting these.)

Sampling Distributions, part of the Rice Virtual Lab in Statistics. This app lets you sample from symmetric and skewed distributions, at various sample sizes, and see how the sampling distribution builds up. The app plots the sampling distribution and calculates its mean and SD, so you can compare them to the original population and also to the expected center, spread, and shape described below.

CentLimApplet. This shows you why “sample size at least 30 or so” is a good rule of thumb for numeric data. Try setting the number of samples to the maximum, then increase the sample size one unit at a time, and you’ll see how the sampling distribution gets closer and closer to a ND.

If you possibly can, try out these apps, especially the second one. Sampling distributions are new and strange to you, and playing with them in real time will really help you to understand the text that follows.

Center of the Sampling Distribution of x̄

Summary:

The mean of the sampling distribution of x̄ equals the mean of the population: µ_x̄ = µ. This is true regardless of the shape of the original population and regardless of sample size.

Why is this true? Well, you already know that when you take a sample, usually you have some data points that are higher than the population mean and some that are lower. Usually the highs and lows come pretty close to canceling each other out, so the mean of each sample is close to µ (closer than the individual data points, that is).

When you take a distribution of sample means, the same thing happens at the second level. Some of the sample means x̄ are above µ and some are below. The highs and lows tend to cancel, so the average of the averages is pretty darn close to the population mean.

Spread of the Sampling Distribution of x̄

Summary:

The standard deviation of the sampling distribution of x̄ has a special name: standard error of the mean or SEM; its symbol is σ_x̄. The standard error of the mean for sample size n equals the standard deviation of the population divided by the square root of n: SEM or σ_x̄ = σ/√n. This is true regardless of the shape of the original population and regardless of sample size.

BTW: Why is this true? Each member of the sample is a random variable, all drawn from the same population with a SD of σ and therefore a variance of σ². If you combine random variables (independent random variables, that is), their variances add. Okay, the sample is n random values drawn from a population with a variance of σ². The total of those n values in the sample is a random variable with a variance of σ²n, and therefore the standard deviation of the total is √(σ²n) = σ√n. Now divide the sample total by n to get the sample mean. x̄ is a random variable with a standard deviation of (σ√n)/n = σ/√n. QED, which is Latin for “told ya so!”
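If you'd rather check σ/√n by brute force than by algebra, here is a small Python sketch. It uses a tiny made-up population of four values and lists every possible sample of size n drawn with replacement (so that the draws are independent, as the argument above assumes). The SD of all those sample means should match σ/√n.

```python
from itertools import product
from math import sqrt
from statistics import fmean, pstdev

population = [2, 3, 5, 10]    # tiny made-up population
sigma = pstdev(population)    # population standard deviation

results = {}
for n in (1, 2, 3, 4):
    # every possible sample of size n, drawn with replacement
    all_means = [fmean(sample) for sample in product(population, repeat=n)]
    results[n] = (fmean(all_means), pstdev(all_means), sigma / sqrt(n))
    print(n, results[n])

# The same formula with the tune population's numbers from the text
sem_tunes = 158.6 / sqrt(30)
print(round(sem_tunes, 1))
```

In each printed row, the first number is the mean of the sample means (always equal to the population mean), and the last two numbers are the SD of the sample means and σ/√n. For the tunes, σ/√n with σ = 158.6 and n = 30 is about 29.0, in line with the standard deviation of the simulated sampling distribution.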

Shape of the Sampling Distribution of x̄

Summary:

If the original population is normally distributed (ND), the sampling distribution of the mean is ND. If the original population is not ND, still the sampling distribution is nearly ND if sample size is ≥ 30 or so but not more than about 10% of population size.

You can probably see that if you take a bunch of samples from a ND population and compute their means, the sample means will be ND also. But why should the means of samples from a skewed population be ND as well? The answer should be called the Fundamental Theorem of Statistics, but instead it’s called the Central Limit Theorem. (The name was given by George Pólya in a 1920 paper, but the theorem itself is due to the Marquis de Laplace, in his Théorie analytique des probabilités [1812].) The CLT is the only theorem in this whole course. There is a mathematical way to state and prove it, but we’ll go for just a conceptual understanding.

Central Limit Theorem:

The sampling distribution of the mean approaches the normal distribution, and does so more closely at larger sample sizes. An equivalent form of the theorem says that if you take a selection of independent random variables, and add up their values, the more independent variables there are, the closer their sum will be to a ND.

The second form of the theorem explains why so many real-life distributions are bell curves: Most things don’t have a single cause, but many independent causes. Example: Lots of independent variables affect when you leave the house and your travel time every day. That means that any person’s commute times are ND, and so are people’s arrival times at an event. The same sorts of variables affect when buses arrive, so wait times are ND. Most things in nature have their growth rate affected by a lot of independent variables, so most things in nature are ND. But it’s the first form of the theorem that we’ll use in this chapter. If samples are randomly chosen, or chosen by another valid sampling technique, then they will be independent and the Central Limit Theorem will apply. The further the population is from a ND, the bigger the sample you need to take advantage of the CLT. Be careful! It’s size of each sample that matters, not number of samples. The number of samples is always large but unspecified, since the sampling distribution is just a construct in our heads. As a rule of thumb, n=30 is enough for most populations in real life. And if the population is close to normal (symmetric, with most data near the middle), you can get away with smaller samples. On the other hand, the sample can’t be too large. For samples drawn without replacement (which is most samples), the sample shouldn’t be more than about 10% of the population. In symbols, n ≤ 0.1N. Suppose you don’t know the population size, N? Multiply left and right by 10 and rewrite the requirement as 10n ≤ N. You always know the sample size, and if you can make a case that the population is at least ten times that size then you’re good to go. You’ll remember that the population of tune times was highly skewed, but the sampling distribution for n=30 was pretty nearly bell shaped. 
To show how larger sample size moves the sampling distribution closer to normal, I ran some simulations of 1000 samples for some other sample sizes. Remember that the sampling distribution is an indefinitely large number of samples; you’re still seeing some lumpiness because I ran only 1000 samples in each simulation.
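Here is a rough way to put a number on “closer to normal” as the sample size grows. This Python sketch is not the simulation that produced the histograms in this chapter. It assumes an exponential population with mean 300 as a stand-in for a strongly right-skewed population, draws each sample fresh from that distribution, and measures the skewness of 1000 sample means at each sample size. A skewness near 0 means a nearly symmetric distribution.

```python
import random
from statistics import fmean

random.seed(2)  # fixed seed so the run is repeatable

def skewness(values):
    """Moment-based skewness: about 0 for a symmetric distribution."""
    values = list(values)
    m = fmean(values)
    m2 = fmean((v - m) ** 2 for v in values)
    m3 = fmean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

def draw():
    """One value from the stand-in population (exponential, mean 300)."""
    return random.expovariate(1 / 300)

skew_by_size = {}
for n in (3, 10, 20, 100):
    means = [fmean(draw() for _ in range(n)) for _ in range(1000)]
    skew_by_size[n] = skewness(means)
    print(n, round(skew_by_size[n], 2))
```

With only 1000 samples per run the numbers are lumpy, just like the histograms, but the skewness for n = 100 should come out far smaller than the skewness for n = 3.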







The means of 3-tune samples are still fairly well skewed, though the range is less than the population range. Increasing sample size to 10, the skew is already much less. 20-tune samples are pretty close to a bell curve except for the extreme right-hand tail. Finally, with a sample size of 100, we’re darn close to a bell curve. Yes, there’s still some lumpiness, but that’s because the histogram contains only 1000 sample means. Requirements, Assumptions, and Conditions The requirements mentioned in this chapter will be your “ticket of admission” to everything you do in the rest of the course. If you don’t check the requirements, the calculator will happily calculate numbers for you, they’ll be completely bogus, and your conclusions will be wrong but you won’t know it. Always check the requirements for any type of inference before you perform the calculations. I talk about “requirements”. By now you’ve probably noticed that I think very highly of DeVeaux, Velleman, and Bock’s Intro Stats (2009). They test the same things in practice, but they talk about “assumptions” and “conditions”. Assumptions are things that must be true for inference to work, and conditions are ways that you test those assumptions in practice. You might like their approach better. It’s the same content, just a different way of looking at it. And sampling distributions are so weird and abstract that the more ways you can look at them the better! Following DeVeaux pages 591–593, here’s another way to think about the requirements. Independence Assumption: Always look at the overall situation and try to see if there’s any way that different members of the sample can affect each other. If they seem to be independent, you’ll then test these conditions: Randomization Condition: Was the sample randomly selected? (A proper systematic sample counts as random.) Later, when you do inference on two samples in Chapter 11, you’ll ask instead whether the participants were randomly assigned to treatments. 
10% Condition: If the population is small, a decent-sized sample may be too large. Remember, back in Chapter 5, you learned that sampling without replacement changes the mix of what’s left? In practice, if the sample is less than about 10% of the population, the effect is not serious enough to worry about. These conditions must always be met, but they’re a supplement to the Independence Assumption, not a substitute for it. If you can see any way in which individuals are not independent, it’s game over regardless of the conditions. Normal Population Assumption: For numeric data, the sampling distribution must be ND or you’re dead in the water. There are two conditions to check this: Nearly Normal Condition: If the sample is small, check for normality as you learned in Chapter 7. This matters because skewed data and outliers can distort the sample mean and SD. Large Sample Condition: But if the sample is larger, more than about 30, outliers and skew have less effect on the mean and SD, and you don’t have to worry about the Nearly Normal Condition. The Normal Population Assumption and the Nearly Normal Condition or Large Sample Condition are for numeric data and only numeric data. We’ll have a separate set of requirements, assumptions, and conditions for binomial data later in this chapter. See also:

Is That an Assumption or a Condition? is a very nice summary by Bock of all assumptions and conditions. It puts all of our requirements for all procedures into context. (Just ignore the language directed at instructors.)
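As a memory aid, the mechanical part of the requirements check for numeric data can be written as a short Python function. This is only a sketch: the function and parameter names are invented here, the cutoff of 30 is the rule of thumb from this chapter, and nothing in it can replace looking at the overall situation to judge independence.

```python
def failed_requirements(n, random_sample, population_size=None,
                        population_is_normal=False, sample_looks_normal=False):
    """Return the names of the conditions that are not met (empty list = good to go).

    Independence is a judgment call about the situation, so this sketch covers
    only the conditions that can be checked mechanically.
    """
    failed = []
    if not random_sample:
        failed.append("Randomization Condition")
    # 10% Condition, written as 10n <= N; skipped when N is unknown
    if population_size is not None and 10 * n > population_size:
        failed.append("10% Condition")
    # ND population, or a small sample that passed a normality check,
    # or a sample of about 30 or more
    if not (population_is_normal or sample_looks_normal or n >= 30):
        failed.append("Nearly Normal / Large Sample Condition")
    return failed

# A random sample of 30 tunes from a library of 10113
print(failed_requirements(30, True, population_size=10113))
# A random sample of 12 from an unknown, possibly skewed population
print(failed_requirements(12, True))
```

When the population size is unknown, leave it out and make the “at least ten times the sample size” argument in words instead.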

8A4. Applications

Ultimately, you’ll use sampling distributions to estimate the population mean or proportion from one sample, or to test claims about a population. That’s the next four chapters, covering confidence intervals and hypothesis tests. But before that, you can still do some useful computations.

How to Work Problems

For all problems involving sampling distributions and probability of samples, follow these steps:

1. Determine center, spread, and shape of the sampling distribution, even if you’re not explicitly asked to describe the distribution.
2. If you can’t show that the sampling distribution is ND, stop!
3. Sketch the curve, and estimate the answer. (See examples below.)
4. Compute the probability (area) using normalcdf. Caution! Don’t use rounded numbers in this calculation.

Example 1: Bank Deposits

You are auditing a bank. The bank managers have told you that the average cash deposit is $200.00, with standard deviation $45.00. You plan to take a random sample of 50 cash deposits. (a) Describe the distribution of sample means for n = 50. (b) Assuming the given parameters are correct, how likely is a sample mean of $189.56 or below?

Solution (a): Recall that describing a distribution means giving its center, its spread, and its shape.

Center: The mean of the sampling distribution equals the mean of the original population: µ_x̄ = µ, so µ_x̄ = $200.00. This does not depend on whether the sampling distribution is normal.

Spread: The standard deviation of the sampling distribution of the mean, better known as the standard error of the mean, is σ_x̄ = σ/√n = 45/√50, so σ_x̄ = $6.36. This does not depend on whether the sampling distribution is normal.

Shape: The sample was random, and 10n = 10×50 = 500 is obviously less than the number of cash deposits at any bank. Sample size 50 is ≥30, so the sampling distribution of the mean is near enough to a normal model.
(If n were much under 30, you would be unable to say anything about the shape and you would be unable to solve part (b).)

Solution (b): Please refer to How to Work Problems, above. You’ve already described the distribution, so the next step is to make the sketch. You may be tempted to skip this step, but it’s an important reality check on the numerical answer you get from your calculator. The sketch for this problem is shown at right. Please observe these key points when sketching sampling distribution problems:

1. Draw the axis line.
2. Label the axis, x̄ or p̂ as appropriate.
3. Draw a vertical line in the middle of the distribution and show the numerical value of the mean. Caution! This is the mean of the sampling distribution, equal to the population mean, not the sample mean.
4. Draw a horizontal line at about the right spot and show the numerical value of the SEM, not of the original population. (For Binomial Data, below, you’ll use the SEP instead of the SEM.)
5. Draw a line and show the value for each boundary.
6. Shade the area you’re trying to find, and estimate it. (From the sketch for this problem, I estimated a few percent, definitely under 10%.)
7. (optional) After you find the area, show its value.

Next, compute the probability on your calculator. Press [2nd VARS makes DISTR] [2] to select normalcdf. Fill in the arguments, either on the wizard interface or in the function itself. Either way, you need four arguments, in this order:

Left boundary. In this case, there is no left boundary because the problem specifies ≤$189.56. Conceptually, the boundary is −∞, but your calculator doesn’t have an infinity key, so use (−)10^99 instead. (Don’t use 0. Yes, 0 is the lower limit for a deposit, but you’re using the normal model for the sampling distribution, so the tails go on forever.)
Right boundary. For this problem, 189.56 is the right boundary.
Mean. 200 in this problem.
Standard error. You computed it earlier as $6.36, but that’s an approximate number.
Never use rounded numbers in further calculations. normalcdf calculations are particularly sensitive to rounding errors, especially when one or both boundaries are out in the tails, so use the exact value: 45/√50.

With the “wizard” interface: The wizard prompts you for a standard deviation σ. Don’t enter the SD of the population. Do enter the SD of the sampling distribution, which is the standard error. After entering the standard error, press [ENTER] twice and your screen will look like the one at right.

With the classic interface: After entering the standard error, press [)] [ENTER]. You’ll have two closing parentheses, one for the square root and one for normalcdf.

Always show your work. There’s no need to write down all your keystrokes, but do write down the function and its arguments:

normalcdf(−10^99, 189.56, 200, 45/√50)

Answer: P(x̄ ≤ 189.56) = 0.0505

Comment: Here you see the power of sampling. With a standard deviation of $45.00, an individual deposit of $189.56 or lower can be expected almost 41% of the time. But a sample mean under $189.56 with n=50 is much less likely, only a little over 5%.

BTW: This is one reason you should take the trouble to make your sketch reasonably close to scale. If you enter the standard deviation, 45, instead of the standard error, 45/√50 (a common mistake), you’ll get 0.4083. A glance at your sketch will tell you that can’t possibly be right, so you then know to find and fix your mistake.
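If you have Python handy, you can cross-check the calculator. This is a minimal sketch using NormalDist from Python's standard statistics module, not a replacement for the TI-83/84 steps; the variable names are made up for illustration. It computes the same area as normalcdf(−10^99, 189.56, 200, 45/√50), and also the result of the common mistake of entering the population SD.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 200.00, 45.00, 50
sem = sigma / sqrt(n)    # standard error, kept unrounded

# P(sample mean <= 189.56): cdf already means "all the area to the left",
# so no stand-in for minus infinity is needed
p_sample_mean = NormalDist(mu, sem).cdf(189.56)

# The common mistake: population SD in place of the standard error
p_mistake = NormalDist(mu, sigma).cdf(189.56)

print(round(p_sample_mean, 4), round(p_mistake, 4))
```

The two printed values should agree with the 0.0505 and 0.4083 given above.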

Example 2: Women’s Heights

US women’s heights are normally distributed (ND), with mean 65.5″ and standard deviation 2.5″. You visit a small club on a Thursday evening, and 25 women are there. (Let’s assume they are a representative sample.) Your pickup line is that you’re a statistics student and you need to measure their heights for class. Amazingly, this works, and you get all 25 heights. How likely is it that the average height is between 65″ and 66″?

Solution: First, get the characteristics of the sampling distribution:

Center: The mean of the sampling distribution is 65.5″, the same as the mean of the original population.

Spread: The standard deviation of the sampling distribution (standard error of the mean or SEM) is σ/√n = 2.5/√25 = 0.5″.

Shape: The sample is representative of all women (we assume), and 10n = 10×25 = 250 is less than the total number of women. The sample size is under 30, but the original population is a ND and therefore the sampling distribution is also ND.

If the SEM is 0.5″, then 65″ and 66″ equal the mean ± one standard error. The Empirical Rule (68–95–99.7 Rule) tells you that about 68% of the data fall between those bounds. In this problem, the sketch is a really good guide to the answer you expect.

BTW: This is the distribution of sample means, so you expect 68% of them to fall between those bounds. But do the computation anyway, because the Empirical Rule is approximate and now you’re able to be precise. Also, the SEM of 0.5″ is an exact number, but still I put the whole computation into the calculator just to be consistent.

The chance that the sample mean is between 65″ and 66″ is P(65 ≤ x̄ ≤ 66) = 0.6827

BTW: Remember the difference between the distribution of sample means and the distribution of individual heights. From the computation at the right, you expect to see under 16% of women’s heights between 65″ and 66″, versus over 68% of sample mean heights (for n=25) between 65″ and 66″. That’s the whole point of this chapter: sample means stick much closer to the population mean.
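The contrast between individual heights and sample means can be reproduced with a short Python sketch. This is only an illustration, assuming Python 3.8+ and its statistics.NormalDist class; the variable names are mine.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 65.5, 2.5, 25      # inches; values from Example 2
sem = sigma / sqrt(n)             # standard error of the mean: 2.5/5 = 0.5

sample_means = NormalDist(mu, sem)    # model for means of samples of 25
individuals = NormalDist(mu, sigma)   # model for one woman's height

# Probability of landing between 65 and 66 inches under each model
p_mean = sample_means.cdf(66) - sample_means.cdf(65)
p_one = individuals.cdf(66) - individuals.cdf(65)

print(f"sample mean between 65 and 66: {p_mean:.4f}")
print(f"one height between 65 and 66:  {p_one:.4f}")
```

The only difference between the two models is the spread, which is why the same interval captures about 68% of sample means but under 16% of individual heights.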

Example 3: Elevator Load Limit

Suppose hotel guests who take elevators weigh on average 150 pounds with standard deviation of 35 pounds. An engineer is designing a large elevator, to lift 50 people. If she designs it to lift 4 tons (8000 pounds), what is the chance a random group of 50 people will overload it?

Need a hint? This is a problem in sample total. You haven’t studied that kind of problem, but you have studied problems in sample means. In math, when you have an unfamiliar type of problem, it’s always good to ask: Can I change this into some type of problem I do know how to solve? In this case, how do you change a problem about the total number of pounds in a sample (∑x) into a problem about the average number of pounds per person (x̄)? Please stop and think about that before you read further.

Solution: To convert a problem in sums into a problem in averages, divide by the sample size. If the total weight of a sample of 50 people is 8000 lb, then the average weight of the 50 people in the sample is 8000/50 = 160 lb. So the desired probability is P(x̄ > 160):

P(∑x > 8000 for n = 50) = P(x̄ > 160)

And you know how to find the second one. What does the sampling distribution of the mean look like for µ = 150, σ = 35, n = 50? The mean is µ_x̄ = 150 lb, and the standard error is σ_x̄ = 35/√50 ≈ 4.9 lb. That’s all you need to draw the sketch at right. Samples are random, 10×50 = 500 is less than the number of people (or potential hotel guests) in the world, and n = 50 ≥ 30, so the sampling distribution follows a normal model.

Now make your calculation. This time the left boundary is a definite number and the right boundary is pseudo infinity, 10^99. And again, you want the standard error, not the SD of the original population. With the “wizard” interface:

With the classic interface: After entering the standard error, press [)] [ENTER].

After entering the standard error, press [ENTER] twice, and your screen will look like the one at right. Show your work: normalcdf(160, 10^99, 150, 35/√50). There’s a 0.0217 chance that any given load of 50 people will overload the elevator. That’s not 2% of all loads, but 2% of loads of 50 people. Still, it’s an unacceptable level of risk. Is there an inconsistency here? Back in Chapter 5, I said that an unusual event was one that had a low probability of occurring, typically under 5%. Since 2% is less than 5%, doesn’t that mean that an overloaded elevator is an unusual event, and therefore it can be ignored? Yes, it’s unusual. But no, fifty people plunging to a terrible death can’t be ignored. The issue is acceptable risk. Yes, there’s some risk any time you step in an elevator that it will be your last journey. But it’s a small risk, and it’s one you’re willing to accept. (The risk is much greater every time you get into a car.) Without knowing exact figures, you can be sure it’s much, much less than 2%; otherwise every big city would see many elevator deaths every day. In Chapter 10, you’ll meet the significance level, which is essentially the risk of being wrong that you can live with. The worse the consequences of being wrong, the lower the acceptable risk. With an elevator, 5% is much too risky — you want crashes to be a lot more unusual than that.
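The trick of turning a total into a mean takes only a few lines of Python. The following is a sketch under the same assumptions as the example (Python 3.8+ for statistics.NormalDist; variable names are mine, not part of any calculator or library).

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 150, 35               # pounds per person, from Example 3
n, limit_total = 50, 8000         # people per load, design limit in pounds

limit_mean = limit_total / n      # total of 8000 lb is a mean of 160 lb
sem = sigma / sqrt(n)             # standard error of the mean, unrounded

# P(sample mean > 160) is the area to the right of the boundary
p_overload = 1 - NormalDist(mu, sem).cdf(limit_mean)

print(f"mean weight that overloads: {limit_mean} lb")
print(f"standard error: {sem:.2f} lb")
print(f"chance of overload: {p_overload:.4f}")
```

Changing limit_total lets you see how quickly the risk drops as the design limit rises, which is the engineer’s real question.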

8B. Binomial Data / Proportions of Samples

Binomial data are yes/no or success/failure data. Each sample yields a count of successes. (A reminder: “success” isn’t necessarily good; it’s just the name for the condition or response that you’re counting, and the other one is called “failure”.) Need a refresher on the binomial model? Please refer back to Chapter 6.

The summary statistic or parameter is a proportion, rather than a mean. In fact, the proportion of success (p) is all there is to know about a binomial population. In Chapter 6 you computed probabilities of specific numbers of successes. Now you’ll look more at the proportions of success in all possible samples from a binomial population, using the normal distribution (ND) as an approximation.

Here’s a reminder of the symbols used with binomial data:

p: The proportion in the population. Example: If 83% of US households have at least one cell phone, then p = 0.83. Remember “proportion of all equals probability of one”, so p is also the probability that any randomly selected response from the population will be a success.
q = 1−p: The proportion of failure, or the chance that any given response will be a failure.
n: The sample size.
x: The number of successes in the sample. Example: if 45 households in your sample have at least one cell phone, then x = 45.
p̂ (“p-hat”): The proportion in the sample, equal to x/n. Example: If you survey 60 households and 45 of them have at least one cell phone, then p̂ = 45/60 = 0.75 or 75%.

8B1. Sampling Distribution of p̂

The sampling distribution of the proportion is the same idea as the sampling distribution of the mean, and there are a lot of parallels between the two. (A table at the end of this chapter summarizes them.)

Definition: Imagine you take a whole lot of samples from the same population. Each sample has n success/failure data points, and you compute the sample proportion p̂ of each of them. All those p̂’s form a new data set, which can be called the distribution of sample proportions, or the sampling distribution of the proportion, or the sampling distribution of p̂, for sample size n. As before, n is the size of each sample, not the number of samples. There’s no symbol for the number of samples, because it’s indefinitely large.

One change from the sampling distribution of x̄ is that the sampling distribution of p̂ is a different data type from the population. The original data are non-numeric (yeses and noes), but the distribution of p̂ is numeric because the p̂’s are numbers. Each p̂ says “so many percent of this sample were successes.”

Center of the Sampling Distribution of p̂

Summary: The mean of the sampling distribution of p̂ equals the proportion of the population: µ_p̂ = p (“mu sub p-hat equals p”). This is true regardless of the proportion in the original population and regardless of sample size.

Why is this true? The reasons are similar to the reasons in Center of the Sampling Distribution of x̄. p̂ for a given sample may be higher or lower than p of the population, but if you take a whole lot of samples then the high and low p̂’s will tend to cancel each other out, more or less.

Spread of the Sampling Distribution of p̂

Summary: The standard deviation of the sampling distribution of p̂ has a special name: standard error of the proportion or SEP; its symbol is σ_p̂ (“sigma-sub-p-hat”). The standard error of the proportion for sample size n equals the square root of the population proportion, times 1 minus the population proportion, divided by the sample size: SEP or σ_p̂ = √[pq/n]. This is true regardless of the proportion in the original population and regardless of sample size.

BTW: Why is this true? For a binomial distribution with sample size n, the standard deviation is √[npq]. That is the SD of the random variable x, the number of successes in a sample of size n. The sample proportion, random variable p̂, is x divided by n, and therefore the SD of p̂ is the SD of random variable x, also divided by n. In symbols, σ_p̂ = √[npq] / n = √[npq/n²] = √[pq/n].
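You can confirm both formulas numerically without any simulation. The sketch below (Python 3.8+, standard library only) adds up the exact binomial probabilities for every possible count of successes and computes the mean and SD of the sample proportion directly. The function name and the values n = 60, p = 0.83 are purely illustrative choices of mine.

```python
from math import comb, sqrt

def phat_mean_and_sd(n, p):
    """Mean and SD of the sample proportion x/n under the binomial model."""
    q = 1 - p
    # Probability of exactly k successes, for k = 0..n
    pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum((k / n) * prob for k, prob in enumerate(pmf))
    var = sum((k / n - mean) ** 2 * prob for k, prob in enumerate(pmf))
    return mean, sqrt(var)

n, p = 60, 0.83                   # illustrative values only
mean, sd = phat_mean_and_sd(n, p)

print(f"mean of p-hat: {mean:.6f}   population p: {p}")
print(f"SD of p-hat:   {sd:.6f}   sqrt(pq/n):   {sqrt(p * (1 - p) / n):.6f}")
```

Try other values of n and p: the two columns should match every time, which is what “regardless of the proportion and regardless of sample size” means.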

Shape of the Sampling Distribution of p̂

Summary: If np and nq are both ≥ about 10, and 10n ≤ N, the normal model is good enough for the sampling distribution.

Let’s look at some sampling distributions of p̂. First I’ll show you the effect of the population’s proportion of success p, and then the effect of the sample size n. Using @RISK from Palisade Corporation, I simulated all of the sampling distributions shown here. The mathematical sampling distribution has an indefinitely large number of samples, but I stopped at 10,000.

These first three graphs show the sampling distributions for samples of size n = 4 from three populations with different proportions of successes. Reminder: these are not graphs of the population; they’re not responses from individuals. They are graphs of the sampling distributions, showing the proportion of successes (p̂) found in a lot of samples.



How do you read these? For example, look at the first graph. This shows the sampling distribution of the proportion for a whole lot of samples, each of size 4, where the probability of success on any one individual is 0.1. You can see that about 67% of all samples have p̂ = 0 (no successes out of four), about 29% have p̂ = .25 (one success out of four), about 4% have p̂ = .50 (two successes out of four), and so on.

Why the large gaps between the bars? With n = 4, each sample can have only 0, 1, 2, 3, or 4 successes, so the only possible proportions for those samples are 0, 25%, 50%, 75%, and 100%.
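The simulated bars approximate values that can be computed exactly from the binomial model of Chapter 6. As a rough check, this Python sketch (Python 3.8+ for math.comb; the variable names are mine) lists every possible sample proportion for n = 4 and p = 0.1 with its exact probability.

```python
from math import comb

n, p = 4, 0.1                     # sample size and population proportion

# Map each possible sample proportion k/n to its exact binomial probability
sampling_dist = {
    k / n: comb(n, k) * p**k * (1 - p)**(n - k)
    for k in range(n + 1)
}

for phat, prob in sampling_dist.items():
    print(f"p-hat = {phat:.2f}: {prob:.4f}")

# Prints:
# p-hat = 0.00: 0.6561
# p-hat = 0.25: 0.2916
# p-hat = 0.50: 0.0486
# p-hat = 0.75: 0.0036
# p-hat = 1.00: 0.0001
```

Percentages read off a graph of 10,000 simulated samples will be close to these exact figures but need not match them to the last digit.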



But let’s not obsess over the details of these graphs. I’m more interested in the shapes of the sampling distributions. What do you see? If you take many samples of size 4 from a population with p = 0.1 (10% successes and 90% failures), the sampling distribution of the proportion is highly skewed. Now look at the second graph. When p = .25 (25% successes and 75% failures in the population), again with n = 4 individuals in each sample, the sample proportions are still skewed, but less so. And in the third graph, where the population has p = 0.5 (success and failure equally likely), then the sampling distribution is symmetric even with these small samples. For a given sample size n, it looks like the closer the population p is to 0.5, the closer the sampling distribution is to symmetric. And in fact that’s true. That’s your take-away from these three graphs.

Now let’s look at sampling distributions using different sample sizes from the same population. I’ll use a population with 10% probability of success for each individual (p = 0.1). You’ve already seen the graph of the sampling distribution when n = 4. The three graphs here show the sampling distribution of p̂ for progressively larger samples. (Remember always that n is the number of individuals in one sample. The number of samples is indefinitely large, though in fact I took 10,000 samples for each graph.)



What do you see here? The distribution of p̂’s from samples of 50 individuals is still noticeably skewed, though a lot less than the graph for n = 4. If I take samples of size 100, the graph is starting to look nearly symmetric, though still slightly skewed. And if I take samples of 500 individuals, the distribution of p̂ looks like a bell curve.

What do you conclude from these graphs? First, even if p is far from 0.5 (if the population is quite unbalanced), with large enough samples, the sampling distribution of p̂ is very close to a normal distribution. Second, you need big samples for binomial data. Remember that 30 is usually good enough for numeric data. For binomial data, it looks like you need bigger samples.
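One way to put a number on “how skewed” is the skewness coefficient, which is 0 for a perfectly symmetric distribution. That measure is my addition here, not something this chapter uses, so treat the following Python sketch (Python 3.8+, standard library only, hypothetical function name) as a side illustration of the graphs rather than a required technique.

```python
from math import comb

def phat_skewness(n, p):
    """Skewness of the sampling distribution of p-hat, from exact probabilities.

    Skewness is unchanged by dividing counts by n, so it is computed on the
    count of successes; values near 0 indicate a nearly symmetric shape.
    """
    q = 1 - p
    pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * w for k, w in enumerate(pmf))
    var = sum((k - mean) ** 2 * w for k, w in enumerate(pmf))
    third = sum((k - mean) ** 3 * w for k, w in enumerate(pmf))
    return third / var ** 1.5

# Same population (p = 0.1), progressively larger samples
for n in (4, 50, 100, 500):
    print(f"n = {n:3d}: skewness = {phat_skewness(n, 0.1):.3f}")

# Same sample size (n = 4), populations closer and closer to p = 0.5
for p in (0.1, 0.25, 0.5):
    print(f"p = {p}: skewness = {phat_skewness(4, p):.3f}")
```

The skewness shrinks toward 0 as n grows or as p approaches 0.5, matching what the two sets of graphs show.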

Okay, let’s put it together. If the size of each sample is large enough, the sampling distribution is close enough to normal. How large a sample is large enough? It depends on how skewed the original population is, which means it depends on the proportion of successes in the population. The further p is from 0.5, the more unbalanced the population and the larger n must be.

How big a sample is big enough? Here’s what some authors say:

De Veaux, Velleman, Bock (2009, 462): np ≥ 10 and nq ≥ 10.
Johnson and Kuby (2004, 432): n > 20, np > 5, and nq > 5.
Bulmer (1979, 120): npq ≥ 2.
The Math Forum (2003): np > 5 and nq > 5.
Sullivan (2011, 437): npq ≥ 10.

Why the disagreements? They can’t all be right, can they? Actually, they can. The question is, what’s close enough to a ND? That’s a judgment call, and different statisticians are a little bit more or less strict about what they consider close enough. Fortunately, with samples bigger than a hundred or so, which are customary, all the conditions are usually met with room to spare.

We’ll follow De Veaux and his buddies and use np ≥ 10 and nq ≥ 10. This is easy to remember: at least ten “yes” and at least ten “no” expected in a sample. (You can compute the expected number of noes as nq = n(1−p) or simply n−np, sample size minus the expected number of yeses.)

How does this work out in practice? Look at the next-to-last graph, with n=100 and p=0.1. It’s close to a bell curve, but has just a little bit of skew. (It’s easier to see the skew if you cover the top part of the graph.) Check the requirements: np = 100×.1 = 10, and nq = 100−10 = 90. In a sample of 100, 10 successes and 90 failures are expected, on average. This just meets requirements. And that matches the graph: you can see that it’s not a perfect bell curve, but close; if it were a little more skewed, the normal model wouldn’t be a good enough fit.

BTW: De Veaux and friends (page 440) give a nice justification for choosing ≥ 10 yeses and noes. Briefly, the ND has tails that go out to ±infinity, but proportions are between 0 and 1. They chose their “success/failure condition”, at least ten of each, so that the mismatch between the binomial model and the normal model is only in the rare cases.

But there’s an additional condition: the individuals in the sample must be independent. This translates to a requirement that the sample can’t be too large, or drawing without replacement would break the binomial model. Big surprise (not!): Authors disagree about this too. For example, De Veaux and Johnson & Kuby say the sample can’t be bigger than 10% of the population (n ≤ 0.1N); Sullivan says 5%. We’ll use n ≤ 0.1N, just like with numeric data. And just as before, you can think of that as 10n ≤ N when you don’t have the exact size of the population.

Example 4: You asked 300 randomly selected adult residents of Ithaca a yes-or-no question. Is the sample too large to assume independence? You may not know the population of Ithaca, but you can compute 10×300 = 3000 and be confident that there are more than 3000 adult Ithacans. Therefore your sample is not too large.

Don’t just claim 10n ≤ N. Show the computation, and identify the population you’re referring to, like this: “10n = 10×300 = 3000 ≤ number of adult Ithacans.”

Remember to check your conditions: np ≥ about 10, nq ≥ about 10, and 10n ≤ N. And of course your sample must be random.

Requirements, Assumptions, and Conditions

Just like with numeric data, you might find it helpful to name the requirements for binomial data. These are the same requirements that I just gave you, but presented differently. I’m following De Veaux, Velleman, Bock (2009, 493).

Independence Condition, Randomization Condition, 10% Condition: These are the same for every sampling distribution and every procedure in inferential stats. I’ve already talked about them under numeric data earlier in the chapter. In practice, the 10% Condition comes into play more often for binomial data than numeric data, because binomial samples are usually much larger.

Sample Size Assumption: For binomial data, the sample is like Goldilocks and porridge: it can’t be too big and it can’t be too small. (Maybe it was beds or chairs and not porridge? And what the heck is porridge?) “Too big” is checked by the 10% Condition; “too small” is checked by the Success/Failure Condition.

Success/Failure Condition: The more lopsided the population is (the further the population proportion is from 50%), the larger sample you need for the sampling distribution to be close enough to a normal model. Our rule of thumb is that your sample needs to be big enough that you expect ≥ 10 successes and ≥ 10 failures based on the population proportion p.

See also:

Is That an Assumption or a Condition? (Bock). Again, these are the same requirements you see in this textbook, just presented differently.
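The conditions for binomial data can be collected into a small checker. This is a sketch in Python, not a standard library routine: the function name, the dictionary keys, and the population size passed in the last call are all hypothetical, and the randomness of the sample is something no code can check for you.

```python
def check_proportion_conditions(n, p, population_size=None):
    """Check the rule-of-thumb conditions for a normal model of p-hat.

    n: sample size; p: population proportion;
    population_size: N if known (or a safe lower bound), else None.
    """
    expected_yes = n * p              # np, expected successes per sample
    expected_no = n - expected_yes    # nq = n - np, expected failures
    report = {
        "np": expected_yes,
        "nq": expected_no,
        "success_failure_ok": expected_yes >= 10 and expected_no >= 10,
    }
    if population_size is not None:
        # 10% Condition: the sample is at most a tenth of the population
        report["ten_percent_ok"] = 10 * n <= population_size
    return report

print(check_proportion_conditions(100, 0.1))   # just meets np >= 10
print(check_proportion_conditions(4, 0.1))     # far too small a sample
# 3000 is a deliberately low stand-in for the number of adult Ithacans
print(check_proportion_conditions(300, 0.5, population_size=3000))
```

Keeping the expected counts in the report, rather than only True or False, mirrors the advice above to show the computation instead of just claiming the condition holds.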

8B2. Applications

How to Work Problems

Working with the sampling distribution of p̂, the technique is exactly the same as for problems involving the sampling distribution of x̄. Follow these steps:

1. Determine center, spread, and shape of the sampling distribution, even if you’re not explicitly asked to describe the distribution.
2. If you can’t show that the sampling distribution is ND, stop!
3. Sketch the curve, and estimate the answer. (See example below.)
4. Compute the probability (area) using normalcdf. Caution! Don’t use rounded numbers in this calculation.

BTW: The ND is continuous and goes out to ±infinity, but the binomial distribution is discrete and bounded by 0 and n. If the requirements are met (at least 10 successes and 10 failures expected), the normal model is a good fit near the middle of the distribution. The fit is usually good enough in the tails, but not as good as it is in the middle.

Because of this, some authors apply a continuity correction to make the normal model a better fit for the binomial. This means extending the range by half a unit in each direction. For example, if n = 100 and p = 0.20, and you’re finding the probability of 10 to 15 successes, MATH200A part 3 gives a probability of 0.1262. The normal model with standard error √[.20×(1−.20)/100] = 0.04 gives

normalcdf(10/100, 15/100, .2, .04) = 0.0994

With the continuity correction, you compute the probability for 9½ to 15½ successes. Then the normal model gives a probability of

normalcdf(9.5/100, 15.5/100, .2, .04) = 0.1260

This is a better match to the exact binomial probability.

Why use the normal model at all, then? Why not just compute the exact binomial probability? Because there’s only a noticeable discrepancy far from the center, and only when the sample is on the small side. (100 is a small sample for binomial data, as you’ll see in the next two chapters.) You can apply the continuity correction if you want, but many authors don’t because it usually doesn’t make enough difference to matter.
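The three probabilities in the continuity-correction example can be compared side by side in Python. This sketch assumes Python 3.8+ (math.comb and statistics.NormalDist); the exact binomial sum plays the role that the MATH200A program plays on the calculator, and the variable names are mine.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.20
lo, hi = 10, 15                   # counts of successes, inclusive

# Exact binomial probability of lo..hi successes
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))

# Normal model for the sample proportion, standard error sqrt(pq/n) = 0.04
model = NormalDist(p, sqrt(p * (1 - p) / n))

# Without correction: boundaries at 10/100 and 15/100
plain = model.cdf(hi / n) - model.cdf(lo / n)

# With continuity correction: extend half a unit each way, 9.5 to 15.5
corrected = model.cdf((hi + 0.5) / n) - model.cdf((lo - 0.5) / n)

print(f"exact binomial:        {exact:.4f}")
print(f"normal, no correction: {plain:.4f}")
print(f"normal, corrected:     {corrected:.4f}")
```

Raising n while keeping p fixed is an easy experiment: the gap between the uncorrected normal value and the exact value shrinks, which is why many authors skip the correction for typical sample sizes.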

Example 5: Swain v. Alabama

1965. Talladega County, Alabama. An African American man named Robert Swain is accused of rape. The 100-man jury panel includes 8 African Americans, but through exemptions and peremptory challenges none are on the final jury. Swain is convicted and sentenced to death. (Juries are all male. 26% of men in the county are African American.)

(a) In a county that is 26% African American, is it unexpected to get a 100-man jury panel with only eight African Americans?

Solution: “Unexpected”, “unusual”, or “surprising” describes an event with low probability, typically under 5%. This problem is asking you to find the probability of getting that sample and compare that probability to 0.05.

This is binomial data: each member of the sample either is or is not African American. Putting the problem into the language of the sampling distribution, your population proportion is p = 0.26. You’re asked to find the probability of getting 8 or fewer successes in a sample of 100, so n = 100 and your sample proportion must be p̂ ≤ 8/100 or p̂ ≤ 0.08.

Why “8 or fewer”? p̂ = 8% for the questionable sample, and you think it’s too low since it’s below the expected 26%. Therefore, in determining how likely or unlikely such a sample is, you’ll compute the probability of p̂ ≤ 0.08.

First, describe the sampling distribution of the proportion:

Center: µ_p̂ = p = 0.26
Spread: σ_p̂ = √[pq/n] = √[.26(1−.26)/100] = 0.044
Shape: For the sampling distribution to be normal, you have four requirements to check:
Random sample? That’s the county’s claim. Check.
Sample not too large? 10n = 10×100 = 1000. We don’t know how many men are in the county, but it must be more than a thousand. Check.
Expected number of successes? np = 100×.26 = 26 ≥ 10. Check. (Use 0.26, not 0.08. The sampling distribution is based on the population proportion p, not on any particular sample proportion.)
Expected number of failures? nq = 100−26 = 74 ≥ 10. Check.
Therefore the sampling distribution of the proportion is a ND.

Next, make the sketch and estimate the answer. I’ve numbered the key points in the sketch at right, but if you need a refresher please refer back to the sketch under Example 1. From this sketch, you’d expect the probability to be very small, and indeed it is.

Compute the probability using normalcdf as before. Be careful with the last argument, which is the standard deviation of the sampling distribution. Don’t use a rounded number for the standard error, because it can make a large difference in the probability. With the “wizard” interface:

With the classic interface:

The standard error expression, √(.26* (1−.26)/100), scrolls off the screen as you type it in, so be extra careful!

Press [)] [ENTER] after entering the standard error. You’ll have two closing parentheses, one for the square root and one for normalcdf.

Press [ENTER] twice, and your screen will look like the one at right. Always show your work, not keystrokes but the function and its arguments: normalcdf(−10^99, .08, .26, √(.26*(1−.26)/100))

BTW: The SEP is a nasty expression, and you have to enter it twice in every problem. You might like to save some keystrokes by computing it once and then storing it in a variable, as I did at the right. When you’re drawing the sketch and need the standard error, compute it as usual but before pressing [ENTER] press [STO→] [x,T,θ,n]. Then when you need the standard error in normalcdf, in the wizard or the classic interface, just press the [x,T,θ,n] key instead of reentering the whole SEP expression. The probability is naturally the same whether you use the shortcut or not.

P(p̂ ≤ 0.08) = 2.0×10⁻⁵, or P(x ≤ 8) = 0.000 020. There are only 20 chances in a million of getting a 100-man jury pool with so few African Americans by random selection from that county’s population. This is highly unexpected, so unlikely that it raises the gravest doubts about the county’s claim that jury pools were selected without racial bias.

BTW: You might remember that in Chapter 6 you computed this binomial probability as 0.000 005, five chances in a million. If the ND is a good approximation, why does it give a probability that’s four times the correct probability? Answer: The normal approximation gets a little dicier as you move further out the tails, and this sample is pretty far out (z = −4.10). But is the approximation really that bad? Sure, the relative error is large, but the absolute error is only 0.000 015, 15 chances in a million. Either way, the message is “This is extremely unlikely to be the product of random chance.”
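For readers who want to see the normal approximation and the exact binomial probability side by side, here is a minimal Python sketch (Python 3.8+, standard library only; variable names are mine). Storing the standard error in a variable is the same idea as the calculator shortcut: compute it once, unrounded, and reuse it.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.26                  # panel size and population proportion
x = 8                             # African Americans on the panel
sep = sqrt(p * (1 - p) / n)       # standard error of the proportion, unrounded

# Normal model: P(p-hat <= 0.08)
normal_approx = NormalDist(p, sep).cdf(x / n)

# Exact binomial: P(8 or fewer successes in 100 trials)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

# z-score of the observed proportion, to show how far out in the tail it is
z = (x / n - p) / sep

print(f"SEP = {sep:.4f}, z = {z:.2f}")
print(f"normal approximation: {normal_approx:.6f}")
print(f"exact binomial:       {exact:.6f}")
```

Both numbers are a few chances in a million or less, so the conclusion does not depend on which one you use.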

(b) From 1950 to 1965, as cited in the Supreme Court’s decision, every 100-man jury pool in the county had 15 or fewer African Americans. How likely is that, if they were randomly selected? Solution: 15 out of 100 is 15%. You know how to compute the probability that one jury pool would be ≤15% African American, so start with that. You’ve already described the sampling distribution, so all you have to do is make the sketch and then the calculation. Everything’s the same, except your right boundary is 0.15 instead of 0.08. If you use my little shortcut:

Otherwise:

Either way, P(p̂ ≤ 0.15) = 0.0061.

The Talladega County jury panels are multiple samples with n = 100 in each, so the “proportion of all” interpretation makes sense: In the long run, you expect 0.61% of jury panels to have 15% or fewer African Americans, if they’re randomly selected.

But actually 100% of those jury panels had 15% or fewer African Americans. How unlikely is that? Well, we don’t know how many juries there were in the county in those 16 years, but surely it must have been at least one a year, or a total of 16 or more. The probability that 16 independent jury pools would all have 15% or fewer African Americans, just by chance, is 0.0060745591^16 ≈ 3×10⁻³⁶, effectively zip. And if there was more than one jury a year, as there probably was, the probability would be even lower. Something is definitely fishy.

BTW: The binomial probability is 0.0061 also. This is still pretty far out in the left-hand tail (z = −2.51), but the normal approximation is excellent. The message here is that the normal approximation is pretty darn close except where the probabilities are so small that exactness isn’t needed anyway.
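The last step, raising the one-panel probability to the 16th power, is easy to mistype on a calculator, so here is a short Python sketch of the whole computation. It assumes Python 3.8+ for statistics.NormalDist and, like the text, assumes at least 16 independent jury pools; the variable names are mine.

```python
from math import sqrt
from statistics import NormalDist

n, p = 100, 0.26                  # panel size and population proportion
sep = sqrt(p * (1 - p) / n)       # standard error of the proportion, unrounded

# Probability that one randomly selected panel is 15% or less African American
p_one_panel = NormalDist(p, sep).cdf(0.15)

# Probability that 16 independent panels all come out that way:
# multiply the probability by itself 16 times
panels = 16                       # assumed minimum: one panel per year
p_all_panels = p_one_panel ** panels

print(f"one panel:  {p_one_panel:.4f}")
print(f"all {panels} panels: {p_all_panels:.1e}")
```

Because the one-panel probability is unrounded here, the result can be compared directly with the 0.0060745591^16 figure above.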

8C. Summary of Sampling Distributions

Here’s a side-by-side summary of sampling distributions of the mean (numeric data) and sampling distributions of the proportion (binomial data). Always check requirements for the type of data you actually have!

What each individual provides
    Numeric Data: Each individual in sample provides a number.
    Binomial Data: Each individual in sample provides a success or failure, and you count successes.

Statistic of one sample
    Numeric Data: mean x̄ = ∑x/n
    Binomial Data: proportion p̂ = x/n

Parameter of population
    Numeric Data: mean µ
    Binomial Data: proportion p

Sampling distribution of the ...
    Numeric Data: Sampling distribution of the mean (sampling distribution of x̄)
    Binomial Data: Sampling distribution of the proportion (sampling distribution of p̂)

Mean of sampling distribution
    Numeric Data: µ_x̄ = µ
    Binomial Data: µ_p̂ = p

Standard deviation of sampling distribution
    Numeric Data: SEM = standard error of the mean, σ_x̄ = σ/√n
    Binomial Data: SEP = standard error of the proportion, σ_p̂ = √[pq/n]

Sampling distribution is close enough to normal if ...
    Numeric Data: Random sample; 10n ≤ N; population is ND or n ≥ about 30
    Binomial Data: Random sample; 10n ≤ N; np ≥ about 10 and nq ≥ about 10

NOTE: n is number of individuals per sample. Number of samples is indefinitely large and has no symbol.

What Have You Learned?

Key ideas:

The sampling distribution of the mean (sampling distribution of x̄) and the sampling distribution of the proportion (sampling distribution of p̂) are concepts, not something you ever construct in reality.
x̄ or p̂ is a random variable, varying with each sample.
The size of each sample is n.
The distribution contains an indefinitely large number of samples of size n. (There’s no symbol for the number of samples in the distribution.)
Treat all those sample means x̄ or sample proportions p̂ as a new data set and mentally draw a histogram. This is a picture of the sampling distribution.

The Central Limit Theorem: the closer to normal the original population, and the larger the sample, the closer the sampling distribution will be to a ND. For numeric data, n ≥ 30 is almost always good enough. For binomial data, it’s more complicated.

Describing a sampling distribution means giving its center, spread, and shape.

For any data type, the sampling distribution describes random samples that aren’t too large, not more than 10% of population.

For numeric data, the sampling distribution of x̄ (sampling distribution of the mean) has these properties:
The center of the sampling distribution (mu sub x-bar, µ_x̄) always equals the mean of the population (µ).
The standard deviation of the sampling distribution (sigma sub x-bar, σ_x̄, also known as the standard error of the mean or SEM) always equals the standard deviation of the population divided by the square root of the sample size, σ/√n.
If n ≥ about 30, or population is ND, then the shape of the sampling distribution is close enough to normal. If requirements are not met, you generally can’t say anything useful about the shape, and you can’t use the normal model.

For binomial data, the sampling distribution of p̂ (sampling distribution of the proportion) has these properties:
The center of the sampling distribution (mu sub p-hat, µ_p̂) always equals the proportion in the population (p).
The standard deviation of the sampling distribution (sigma sub p-hat, σ_p̂, also known as the standard error of the proportion or SEP) always equals √[pq/n].
If there are at least 10 successes and 10 failures expected per sample (that is, if np ≥ 10 and nq ≥ 10), then the shape of the sampling distribution is close enough to normal. If requirements are not met, you generally can’t say anything useful about the shape, and you can’t use the normal model.

Given µ and σ of a numeric population, or p of a binomial population, find the probability of a specified sample. To do this,
1. Check requirements. If they’re not met, stop.
2. Compute the standard error and make a sketch to scale.
3. Use normalcdf to compute probability. In normalcdf, the fourth argument is the unrounded standard error, not the population standard deviation.

Study aids:

TI-83/84 Cheat Sheet Statistics Symbol Sheet


Exercises for Chapter 8 Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand. Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1

Household incomes in the country Freedonia are a skewed distribution with mean $48,000 and standard deviation (SD) $2,000. You take a random sample of size 64 and compute the mean of the sample. That x̄ is one sample mean out of the distribution of all possible sample means. Describe the sampling distribution of the mean, including all symbols and formulas.

2

A manufacturer of light bulbs claims a mean life of 800 hours with SD 50 hours. You take a random sample of 100 bulbs and find a sample mean of 780 hours. (a) If the manufacturer’s claim is true, is a sample mean of 780 hours surprising? (Hint: Think about whether you need the probability of x̄ ≤ 780 or x̄ ≥ 780.) (b) Would you accept the manufacturer’s claim?

3

Suppose 72% of Americans believe in angels, and you take a simple random sample of 500 Americans. (a) Describe the sampling distribution of the proportion who believe in angels in samples of 500 Americans. (b) Use the normal approximation to compute the probability of finding that 350 to 370 in a sample of 500 believe in angels. Reminder: You can’t use the sample counts directly; you have to convert them to sample proportions.

4

In a town with 100,000 households, the last census showed a mean income of $32,400 with SD $19,000. The city manager believes that average income has fallen since the census. Students at the local community college randomly survey 1000 households and find a sample mean income of $31,000. What’s the chance of getting a sample mean ≤$31,000 if the true mean and SD are still what the census found?

5

Roulette is a popular casino game. The croupier spins the wheel in one direction and spins a white ball in the other direction along the rim, and the ball drops into one of the slots. In the US, roulette wheels have 38 slots: 18 red, 18 black, and 2 green. (In Monte Carlo, the wheels have 37 slots because there’s only one green.) (a) One way beginners play is to bet on red or black. If the ball comes up that color, they double their money; if it comes up any other color, they lose their money. Construct a probability model for the outcome of a $10 bet on red from the player’s point of view. (b) Find the mean and SD for the outcome of $10 bets on red, and write a sentence interpreting the mean. (c) Now take the casino’s point of view. A large casino can have hundreds of thousands of bets placed in a day. Obviously they won’t all be same, but it doesn’t take many days to see a whole lot of any given bet. Describe the sampling distribution of the mean for a sample of 10,000 $10 bets on red. (d) How much does the casino expect to earn on 10,000 $10 bets on red? (e) What’s the chance that the casino will lose money on those 10,000 $10 bets on red? (f) What’s the casino’s chance of making at least $2000 on those 10,000 $10 bets?

6

A sugar company packages sugar in 5-pound bags. The amount of sugar per bag varies according to a normal distribution. A random sample of 15 bags is selected from the day’s production. If the total weight of the sample is more than 75.6 pounds, the machine is packing too much per bag and must be adjusted. What is the probability of this happening, if the day’s mean is 5.00 pounds and SD 0.05 pounds?

7

The weights of cabbages in a shipment are normally distributed, with a mean of 38.0 ounces and SD of 5.1 ounces. (a) If you randomly pick one cabbage, what is the probability that its weight is more than 43.0 ounces? (b) If you randomly pick 14 cabbages, what is the probability that their average weight is more than 43.0 ounces?

8

Suppose the average household consumes 12.5 KW of electric power at peak time, with SD 3.5 KW. A particular substation in a typical neighborhood serves 1000 households and has a capacity of 12,778 KW at peak time. (That’s 12 thousand and some, not 12 point something.) Find the probability that the substation will fail to supply enough power.

9

In the Physicians’ Health Study, about 22,000 male doctors were randomly assigned to take aspirin daily or a placebo daily. (Of course the study was double blind.)

          Heart Attack   No Attack   Total    p̂
Placebo   189            10845       11034    1.71%
Aspirin   104            10933       11037    0.94%

In the placebo group, 1.71% of doctors had heart attacks over the course of the study. Let’s take 1.71% as the rate of heart attacks in an adult male population that doesn’t take aspirin. The heart attack rate among aspirin takers was 0.94%, which looks like an impressive difference. Is there any chance that aspirin makes no difference, and this was just the result of random selection? In other words, how likely is it for that second sample to have p̂ = 0.94% if the true proportion of heart attacks in adult male aspirin takers is actually 1.71%, no different from adult males who don’t take aspirin?

10

Men’s heights are normally distributed with mean 69.3″ and SD 2.92″. If a random sample of 16 men is taken, what values of the sample mean would be surprising? In other words, what values of x̄ are in the 5% of the sampling distribution furthest away from the population mean? (Hint: The 5% is the tails, the part of the sampling distribution that is not in the middle 95%.)

11

In June 2013, the Pew Research Center found that 45% of Americans had an unfavorable view of the Tea Party. In the second week of October 2013, according to Tea Party’s Image Turns More Negative (Pew Research Center 2013e), 737 adults in a random sample of 1504 had an unfavorable view of the Tea Party. In a population where 45% have an unfavorable view of the Tea Party, how likely is a sample of 1504 where 737 or more have an unfavorable view? Can you draw any conclusions from that probability?




What’s New

25 June 2015: Correct a typo, ≤ for ≥, thanks to Darlene Huff.
1 Apr 2015: Modify the exercise on belief in angels to include converting counts to proportions.
(intervening changes suppressed)
24 Mar 2013: New document.

9. Estimating Population Parameters

Updated 1 Jan 2016

Summary:

In Chapter 8, you learned what sort of samples to expect from a known population. In the rest of the course, you’ll learn how to use a sample to make statements about an unknown population. This is inferential statistics. In inferential statistics, there are two types of things you want to do: test whether some claim is true, and estimate the size of some effect. In this chapter you’ll construct confidence intervals that estimate population means and proportions; Chapter 10 starts you on testing claims.

Contents:

9A. Estimating Population Proportion p
    9A1. Confidence Interval for p (Binomial Data)
        Computing a Confidence Interval
        Interpreting a Confidence Interval
        Easy CIs with TI-83/84
    9A2. How Big a Sample for Binomial Data?
9B. Estimating Population Mean µ When You Know σ
    9B1. Confidence Interval
    9B2. How Big a Sample Do You Need?
9C. Estimating Population Mean µ When You Don’t Know σ
    9C1. Student’s t Distribution
    9C2. Confidence Interval for µ (Numeric Data)
    9C3. The Trouble with Outliers
    9C4. How Big a Sample for Numeric Data?
What Have You Learned?
Exercises for Chapter 9
What’s New

9A. Estimating Population Proportion p

In the Physicians’ Health Study, about 22,000 male doctors were randomly assigned to take aspirin or a placebo every night. Of 11,037 in the treatment group, 104 had heart attacks and 10,933 did not. Can you say how likely it is for people in general (or, at least, male doctors in general) to have a heart attack if they take aspirin nightly?

As always, probability of one equals proportion of all. So you could just as well ask, what proportion of people who take aspirin would be expected to have heart attacks?

Before statistics class, you would divide 104/11037 = 0.0094 and say that 0.94% of people taking nightly aspirin would be expected to have heart attacks. This is known as a point estimate.

But you are in statistics class. You know that a sample can’t perfectly represent the population, and therefore all you can say is that the true proportion of heart attacks in the population of aspirin takers is around 0.94%. Can you be more specific? Yes, you can. You can compute a confidence interval for the proportion of heart attacks to be expected among aspirin takers, based on your sample, and that’s the subject of this chapter. We’ll get back to the doctors and their aspirin later, but first, let’s do an example with M&Ms.

9A1. Confidence Interval for p (Binomial Data)

Example 1: You take a random sample of 605 plain M&Ms, and 87 of them are red. What can you say about the proportion of reds in all plain M&Ms?

Definition:

A point estimate of a population parameter is the single best available number, and in fact it’s nothing more than the corresponding sample statistic. In this example, your point estimate for population proportion is sample proportion, 87/605 = 14.4%, and you conclude “Somewhere around 14.4% of all plain M&Ms are red.” The sample proportion is a point estimate of the proportion in the population, the sample mean is a point estimate of the mean of the population, the sample standard deviation is a point estimate of the standard deviation of the population, and so on.

Definition:

A confidence interval estimate of a population parameter is a statement of bounds on that parameter and includes your level of confidence that the parameter actually falls within those bounds. For instance, you could say “I’m 95% confident that 11.6% to 17.2% of plain M&Ms are red.” 95% is your confidence level (symbol: 1−α, “one minus alpha”), and 11.6% and 17.2% are the boundaries of your estimate or the endpoints of the interval.

As an alternative to endpoint form, you could write a confidence interval as a point estimate and a margin of error, like this: “I’m 95% confident that the proportion of red in plain M&Ms is 14.4% ± 2.8%.” 14.4% is your point estimate, and 2.8% is your margin of error (symbol: E), also known as the maximum error of the estimate. Since the confidence interval extends one margin of error below the point estimate and one margin of error above the point estimate, the margin of error is half the width of the confidence interval.

BTW: For all the cases you’ll study in this course, the point estimate — the mean or proportion of your sample — is at the middle of the confidence interval. But that’s not true for some other cases, such as estimating the standard deviation of a population. For those cases, computing the margin of error is uglier.

Computing a Confidence Interval

As you might expect, your TI-83/84 and lots of statistical packages can compute confidence intervals for you. But before doing it the easy way, let’s take a minute to understand what’s behind computing a confidence interval. You can compute an interval to any level of confidence you desire, but 95% is most common by far, so let’s start there. How do you use those 87 reds in a sample of 605 M&Ms to estimate the proportion of reds in the population, and have 95% confidence in your answer?

In Chapter 8, you learned how to find the sampling distribution of p̂. Given the true proportion p in the population, you could then determine how likely it is to get a sample proportion p̂ within various intervals. To find a confidence interval, you simply run that backward.

You don’t know the proportion of reds in all plain M&Ms, so call it p. You know that, if the sample size is large enough, sample proportions are ND and there’s a 95% chance that any given sample proportion will be within 2 standard errors on either side of p, whatever p is. The standard error of the proportion is σp̂ = √[pq/n]. You don’t know p; that’s what you’re trying to find. Are you stuck? No, you have an estimate for p. Your point estimate for the population proportion p is the sample proportion p̂ = 87/605. You can estimate the standard error of the proportion (the SEP) by using the statistics of your sample:

σp̂ ≈ √[(87/605)(1−87/605)/605] = 0.0142656 or about 1.4%

Two standard errors is 0.0285312 → 0.029 or 2.9%.

BTW: How good is this estimate? For decent-sized samples, it’s quite good. For example, suppose the true population proportion p is 50% or 0.5. For a sample of n = 625, the SEP is √[.5(1−.5)/625] = 0.0200 or 2.00%. Your sample proportion is very, very, very unlikely to be as far away as 40% or 0.4, but even if it is then you would estimate the SEP as √[.4(1−.4)/625] = 0.0196 or 1.96%, which is extremely close.
BTW: Different authors use the term “standard error” slightly differently. Some use it only for the standard deviation of the sampling distribution, which you never know exactly because you never know the population parameters exactly. Others use it only for the estimate based on sample statistics, which I computed just above. Still others use it for either computation. In practice it doesn’t make a lot of difference. I don’t see much point to getting too fussy about the terminology, given that only one of them can be computed anyway.

Any given sample proportion is 95% likely to be within two standard errors or 2.9% of the population proportion:

p−0.029 ≤ p̂ ≤ p+0.029 (probability = 95%)

Now the magic reverso: Given a sample proportion, you’re 95% confident that the population proportion is within 2.9% of that sample proportion:

p̂−0.029 ≤ p ≤ p̂+0.029 (95% confidence)

In this case, your sample proportion is 87/605 ≈ 0.144:

0.144−0.029 ≤ p ≤ 0.144+0.029 (95% confidence)
0.115 ≤ p ≤ 0.173 (95% confidence)

So your 95% confidence interval is 0.115 to 0.173, or 11.5% to 17.3%.

BTW: If the magic reverso seems like cheating, it’s not. Suppose you’re 95% sure that Cortland is within 12 miles of Dryden; aren’t you equally sure that Dryden is within 12 miles of Cortland? But you can also prove it with algebra. Here was our starting point:

p−0.029 ≤ p̂ ≤ p+0.029

Multiply by −1. When you multiply by a negative, you have to reverse the inequality signs.

−p+0.029 ≥ −p̂ ≥ −p−0.029

Rewrite in conventional order, from smallest to largest.

−p−0.029 ≤ −p̂ ≤ −p+0.029

Now add p+p̂ to all three “sides”.

p̂−0.029 ≤ p ≤ p̂+0.029
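The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation, not part of the calculator procedure; the variable names are invented for the example. Note that the text rounds p̂ to 0.144 and the margin to 0.029 before adding, so the unrounded upper endpoint computed here can differ from the text in the third decimal place.

```python
from math import sqrt

# Example 1 data: 87 red M&Ms in a random sample of 605.
successes, n = 87, 605

p_hat = successes / n                      # point estimate of p
sep = sqrt(p_hat * (1 - p_hat) / n)        # estimated standard error of the proportion
rough_margin = 2 * sep                     # "2 standard errors" (Empirical Rule)

# Rough 95% interval, computed without intermediate rounding.
low, high = p_hat - rough_margin, p_hat + rough_margin

print(f"p-hat = {p_hat:.4f}, SEP = {sep:.7f}, 2 SEP = {rough_margin:.7f}")
print(f"rough 95% interval: {low:.4f} to {high:.4f}")
```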

You might have noticed that I changed from 95% probability to 95% confidence. What’s up with that? Well, the sample proportion is a random variable: different samples will have different sample proportions p̂, and you can compute the probability of getting p̂ in any particular range. But the population proportion p is not a random variable. It has one definite value, even though you don’t know what that definite value is. Probability statements about a definite number make about as much sense as discussing the probability of precipitation for yesterday. The population proportion is what it is, and you have some level of confidence that your estimated range includes that true value.

What does “95% confident” mean, then? Simply this: In the long run, when you do everything right, 95% of your 95% intervals will actually include the population proportion, and the other 5% won’t.

5% is 5/100 = 1/20, so in the long run about one in 20 of your 95% confidence intervals will be wrong, just because of sample variability. Probability of one = proportion of all, so there’s one chance in twenty that this interval is wrong, meaning that it doesn’t contain the true population proportion, even if you did everything right. If that makes you too nervous, you can use a higher confidence level, but you can never reach 100% confidence.

There’s one more wrinkle. That margin of error of 0.029 was 2σp̂, two standard errors. The figure of 2 standard errors for the middle 95% of a ND comes from the Empirical Rule or 68–95–99.7 Rule, so it’s only approximately right. But you can be a little more precise. In Chapter 7 you learned to find the boundaries of the middle of a ND for any percentage, and that lets you generalize to any confidence level:

                                           This Example                           General Case
Confidence level (middle area of the ND)   95%                                    1−α
Area in the two tails combined             100%−95% = 5% or 0.05                  1−(1−α) = α
Area in each tail                          0.05/2 = 0.025                         α/2
The boundaries are                         ±z0.025 = invNorm(1−0.025) = 1.9600    ±zα/2
The margin of error is                     E = 1.96 σp̂                            E = zα/2 σp̂
And you compute it as                      E = 1.96 √[p̂(1−p̂)/n]                   E = zα/2 √[p̂(1−p̂)/n]
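If you want to check the table away from the calculator, the following sketch plays the role of invNorm using Python’s standard library (`statistics.NormalDist`). The function name `z_critical` is made up for this illustration; the M&M numbers are the ones from Example 1.

```python
from math import sqrt
from statistics import NormalDist

def z_critical(conf_level):
    """Critical z for a given confidence level, e.g. 0.95 -> about 1.96."""
    alpha = 1 - conf_level            # area in the two tails combined
    # inv_cdf wants area to the left, so pass 1 - alpha/2 (like invNorm(1-0.025)).
    return NormalDist().inv_cdf(1 - alpha / 2)

# Red M&Ms: 87 of 605.
p_hat, n = 87 / 605, 605
sep = sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of the proportion

for level in (0.90, 0.95, 0.99):
    z = z_critical(level)
    print(f"{level:.0%}: z = {z:.4f}, margin of error E = {z * sep:.4f}")

margin_95 = z_critical(0.95) * sep
```

Higher confidence levels give a larger critical z and therefore a wider interval for the same sample.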

The margin of error on a 1−α confidence interval is zα/2 standard errors. (This will be important when you determine necessary sample size, below.)

The margin of error on a 95% confidence interval is close to 2σp̂, but more accurately it’s 1.96σp̂. For the proportion of red M&Ms, where the SEP was σp̂ = 0.0142656, the margin of error is 1.96σp̂ = 0.0279606 → 0.028 or 2.8%. Since the point estimate was 14.4%, you’re 95% confident that the proportion of reds in plain M&Ms is within 14.4%±2.8%, or 11.6% to 17.2%.

Interpreting a Confidence Interval

You’ve seen that there are two ways to state a confidence interval: from ____ to ____ with ____% confidence, or ____ ± ____ with _____% confidence. Mathematically these are equivalent, but psychologically they’re very different. The first form is better than the second.

What’s wrong with the ____ ± ____ form? It’s easy to misinterpret. If you say “I’m 95% confident that the proportion of reds in plain M&Ms is within 14.4% ±2.8%”, some people will read 14.4% and stop: they’ll think that the population proportion is 14.4%. And even people who get past that will probably think that there’s something special about 14.4%, that somehow it’s more likely to be the true proportion of reds among all plain M&Ms. But 14.4% is just a value of a random variable, namely the proportion of reds in this sample. Another sample would almost certainly have a different p̂ and therefore a different midpoint for the interval.

It’s much better to use the endpoint form, because the endpoint form is harder to misinterpret. When you say “I’m 95% confident that the proportion of reds in plain M&Ms is 11.6% to 17.2%”, you lead the reader, even the non-technical reader, to understand that the proportion could be anything in that range, and even that there’s a slight chance that it’s outside that range.

Requirements check (RC): This is an essential step; do it before you compute the confidence interval. Computing the CI assumes that the sampling distribution of p̂ is a ND, but “assumes” in statistics means you don’t assume, you check it. The requirements are stated in Chapter 8 as simple random sample (or equivalent), np and nq both ≥ about 10, 10n ≤ N. You don’t know p, but for binomial data it’s okay to use p̂ as an estimate. But np̂ is just the number of yeses or successes in your sample, and nq̂ is just the number of noes or failures in your sample, so you really don’t need to do any multiplications. Here’s how you check the requirements:

Random sample: stated at start of section, OK.
Successes in sample: 87; failures in sample: 605−87 = 518; both ≥ 10, OK.
10n = 10×605 = 6050. We don’t know how many plain M&Ms there are in the world, but surely M&M Mars makes far more than that every second, so this is also OK.

Easy CIs with TI-83/84

Your TI-83 or TI-84 can easily compute confidence intervals for a population proportion. With binomial data, this is Case 2 in Inferential Statistics: Basic Cases. (Excel can do it too, but it’s significantly harder in Excel.)

Example 2: Let’s do the red M&Ms, since you already know the answer. See the requirements check above.

Press [STAT] [◄] to get to the STAT TESTS menu, and scroll up or down to find 1-PropZInt. (Caution: you don’t want 1-PropZTest. That’s reserved for Chapter 10.) Enter the number of successes in the sample, the sample size, and the confidence level, easy-peasy! Write down the screen name and your inputs, then proceed to the output screen and write down just the new stuff:

Here’s how you show your work:

1-PropZInt 87, 605, .95 (not PropZInt, please!)
(.11584, .17176), p̂ = .1438016529

There’s no need to write n=605 because you already wrote it down from the input screen.

Interpretation: I’m 95% confident that 11.6% to 17.2% of plain M&Ms are red.

You can vary that in several ways. For instance, some people like to put the confidence level last: 11.6% to 17.2% of plain M&Ms are red (95% confidence). Or they may choose more formal language: We’re 95% confident that the true proportion of reds in plain M&Ms is 11.6% to 17.2%.

I’ve already pooh-poohed the margin-of-error form, but sometimes you have to write it that way, for instance if your boss or your thesis advisor demands it. You can get it easily from the TI-83/84 output screen. The center of the interval, the point estimate, is given: 14.38%. To find the margin of error, subtract that from the upper bound of the interval, or subtract the lower bound from it: .17176−.1438 = .02796, or .1438−.11584 = .02796. Either way it’s 2.8%. You can then express the CI as 14.4%±2.8% with 95% confidence.

Example 3: What about the male doctors who started this section? 104 out of 11037 of the doctors taking nightly aspirin had heart attacks. Assuming that male doctors are representative of adults in general, in terms of heart-attack risk, what can you say about the chance of heart attack for anyone who takes aspirin nightly? Use confidence level 1−α = 95%.

Solution: Requirements check (RC):

Random sample stated, OK.
104 successes in sample, 11037−104 = 10933 failures, OK.
10n = 10×11037 = 110370. Without knowing the number of adults in the US or the world, we know it’s a lot more than that; OK.

1-PropZInt 104, 11037, .95
(.00762, .011223), p̂ = .0094228504

Conclusion: People who take nightly aspirin have a 0.76% to 1.12% chance of heart attack (95% confidence).

BTW: An interesting special case occurs when you have no successes. Although you can’t do the regular calculation, because 0 successes doesn’t meet the requirement, you can use an approximate procedure called the Rule of Three. In Confidence Intervals with Zero Events, Steve Simon (2010) explains, “zero to 3/n is an approximate 95% confidence interval for a data set where we observed 0 events in n patients.” Example: Suppose that your sample was only 50 doctors, and none of them had heart attacks. 3/50 = 6%, so you would be 95% confident that people who take nightly aspirin have a zero to 6% chance of heart attack.
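For readers working without a TI-83/84, here is a minimal Python sketch of the same interval, assuming the formula p̂ ± zα/2·√[p̂(1−p̂)/n] developed earlier in this section. The function names `one_prop_z_interval` and `rule_of_three` are invented for this illustration and are not calculator or library commands. The function checks the successes and failures requirement; the random-sample and 10n ≤ N requirements still have to be checked by you.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(successes, n, conf_level):
    """Confidence interval for a population proportion (normal approximation)."""
    failures = n - successes
    if successes < 10 or failures < 10:
        # Requirement from the text: about 10 or more successes and failures.
        raise ValueError("need at least about 10 successes and 10 failures")
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # critical z
    margin = z * sqrt(p_hat * (1 - p_hat) / n)           # margin of error E
    return p_hat - margin, p_hat + margin

def rule_of_three(n):
    """Approximate 95% interval when 0 events were observed in n trials."""
    return 0.0, 3 / n

mm_low, mm_high = one_prop_z_interval(87, 605, 0.95)        # red M&Ms
dr_low, dr_high = one_prop_z_interval(104, 11037, 0.95)     # aspirin takers
print(f"M&Ms:    {mm_low:.5f} to {mm_high:.5f}")
print(f"Aspirin: {dr_low:.5f} to {dr_high:.5f}")
print("Zero events in 50:", rule_of_three(50))
```

Calling `one_prop_z_interval(0, 50, 0.95)` raises an error on purpose, which is the situation where the Rule of Three is the suggested fallback.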

9A2. How Big a Sample for Binomial Data?

The equation for margin of error is packed with information:

E = zα/2 · √[p̂(1−p̂)/n]

You can see that a larger sample size n means a narrower confidence interval, but the sample size is inside the square-root sign so you don’t get as much benefit as you might hope for. If you take a sample four times as big, the square root of 4 is 2 and so your interval is half as wide, not ¼ as wide.

You can see also that you get a narrower interval if you’re willing to live with a lower confidence level. The lower your confidence level, the smaller zα/2 will be, and therefore the narrower your confidence interval.

The bottom line is that there’s a three-way tension among sample size, confidence level, and margin of error. You can choose any two of those, but then you have to live with the third. (p̂ doesn’t come into it. Although p̂ does contribute to the standard error and therefore to the margin of error, you can’t choose what p̂ you’re going to get in a sample.)

If you want to get a confidence interval at your preferred confidence level with (no more than) a specified margin of error, how big a sample do you need? MATH200A Program part 5 will compute this for you, but let’s look at the formula first. (See Getting the Program for instructions on getting the MATH200A program into your calculator.)

The equation at the start of this section shows the margin of error you get for a given sample size and confidence level. You can solve for the sample size n, like this:

E = zα/2 · √[p̂(1−p̂)/n]   ⇒   n = p̂(1−p̂) · [zα/2 / E]²

In the formula, p̂ is your prior estimate if you have one. This can be the result of a past study, or a reasonable estimate if it has some logical basis. If you don’t have a prior estimate, use 0.5.

BTW: .5 or 50% is the conservative choice. It gives the largest possible sample size for a given E and CLevel. Why is that? Because the formula contains a multiplication by p̂(1−p̂), and that product takes on its largest value when p̂ = 0.5. Using .5 as your prior estimate, you’re guaranteed that your sample won’t be too small, though it may be larger than necessary. Why not just use .5 all the time? Because taking samples always costs time and usually costs money, so you don’t want a larger sample than necessary.

Example 4: In a sample of 605 plain M&Ms, 87 were red. The 95% confidence interval had a 2.8% margin of error. How big a sample would you need to reduce the margin of error to 2%?

With the MATH200A program (recommended):

Press [PRGM], select MATH200A, and press [ENTER] twice. Dismiss the title screen and you’ll see a menu. Press [5] for sample size. Then select your data type, binomial.

The next screen wants your estimated p, your desired margin of error, and your desired confidence level. Your prior estimate is 87/605, from your earlier study. Your margin of error is 2% = .02 (not .2 !), and your confidence level is 95% = .95.

The output screen echoes back your inputs, in case you forgot to write down the input screen, and then tells you that the sample size must be at least 1183 M&Ms. Notice the inequalities: for a margin of error of .02 (2%) or less, you need a sample size 1183 or more. (z Crit is critical z or zα/2, the number of standard errors associated with your chosen confidence level.)

If you’re not using the program:

Marshal your data: prior estimate p̂ = 87/605, desired margin of error E = 0.02, and confidence level 1−α = 0.95.

You need zα/2. Get α/2 from the confidence level:

1−α = 0.95 ⇒ α = 0.05 ⇒ α/2 = 0.025

zα/2 is the z-score such that the area to the right is α/2. In this problem, α/2 = 0.025, so you’re computing z0.025. You’ll use invNorm, but invNorm wants area to left and 0.025 is an area to right, so you compute invNorm(1−.025).

Now, to avoid re-entering that z value, chain your calculations. The formula says you need to divide by E, so simply press [/] and type .02, the desired margin of error, then press [ENTER]. Notice how the calculator displays Ans as soon as you press the [/] key, to confirm that you’re continuing the previous calculation. To square the fraction, press [x²] [ENTER].

Finally, multiply by p̂ and (1−p̂). You get 1182.4…, and therefore your required sample size is 1183.

Caution! Your answer is 1183, not 1182. You don’t round the result of a sample-size calculation. If it comes out to a whole number (unusual), that’s your answer. Otherwise, you round up to the next whole number. Why? Smaller sample size makes larger margin of error. n = 1182.4… corresponds to E = 0.02 exactly. A sample of 1182 would be just slightly under 1182.4…, and your margin of error would be just slightly over 0.02. But 0.02 was the maximum acceptable margin of error, so 1182.4… is the minimum acceptable sample size. You can’t take a fraction of an M&M in your sample, so you have to go up to the next whole number.

There’s no requirements check in sample-size problems. These are planning how to take your sample; requirements apply to your sample once you have it.

Example 5: You’re taking the first political poll of the season, and you’d like to know what fraction of adults favor your candidate. You decide you can live with a 90% confidence level and a 3% margin of error. How many adults do you need in your random sample?

Solution: Since you have no prior estimate for p, make p̂ = .5.

With the MATH200A program (recommended):

MATH200A/sample size/binomial
p̂=.5, E=.03, C-Level=.9, n ≥ 752

If you’re not using the program:

1−α = .9, E = .03, p̂ = .5
1−α = .9 ⇒ α = 0.1 ⇒ α/2 = 0.05
z0.05 = invNorm(1−.05)
Divide by E, which is .03. Square the result. Multiply by p̂ times (1−p̂).
n = 751.5… → 752
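The manual steps in Examples 4 and 5 can be sketched in Python as a cross-check. This assumes the formula n = p̂(1−p̂)·[zα/2 / E]² from this section, with the result rounded up; `sample_size_proportion` is a hypothetical name chosen for this illustration, not a MATH200A routine.

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(margin, conf_level, p_prior=0.5):
    """Minimum n for a proportion CI with margin of error at most `margin`.

    p_prior defaults to 0.5, the conservative choice when no prior estimate exists.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # critical z
    n_exact = p_prior * (1 - p_prior) * (z / margin) ** 2
    return ceil(n_exact)                                  # always round up

n_example4 = sample_size_proportion(0.02, 0.95, p_prior=87 / 605)
n_example5 = sample_size_proportion(0.03, 0.90)           # no prior estimate
n_no_prior = sample_size_proportion(0.02, 0.95)           # Example 4 without a prior

print("Example 4:", n_example4)
print("Example 5:", n_example5)
print("Example 4 with p = .5:", n_no_prior)
```

Comparing the first and third results shows the cost of the conservative choice: without the prior estimate of 87/605, the required sample for the same margin and confidence level is roughly twice as large.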

9B. Estimating Population Mean µ When You Know σ

Numeric data are pretty much the same deal as binomial data, though there are a couple of wrinkles:

• The requirements are different. For numeric data you need a sample bigger than about 30. If your sample is smaller, then it must be ND with no outliers. (You need a random sample for every procedure.)
• The standard error is σx̄ = σ/√n, so the margin of error is E = zα/2 · σ/√n.

The second one is a problem, because you almost never know the standard deviation of the population. Therefore, we won’t be working any problems for this case. Instead, I’ll give you a little more theory to lay the groundwork for the next section, which explains how we get around this knowledge gap.

9B1. Confidence Interval

If you know the standard deviation of the population (and you hardly ever do), then your confidence interval is

x̄ − zα/2 · σ/√n ≤ µ ≤ x̄ + zα/2 · σ/√n

If you’re ever in this situation, you can compute a confidence interval on your TI-83/84 by choosing ZInterval in the STAT TESTS menu.

9B2. How Big a Sample Do You Need?

The margin of error is E = zα/2 · σ/√n, so the required sample size for a margin of error E with confidence level 1−α is

n = [zα/2 · σ / E]²

(You can also use MATH200A part 5.)
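Since the text works no problems for this case, here is a small Python sketch of both formulas from 9B1 and 9B2. The numbers (a sample mean of 100, σ = 15, n = 36, and a desired margin of 2) are made up purely for illustration, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, conf_level):
    """CI for the mean when the population SD sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    margin = z * sigma / sqrt(n)          # E = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

def sample_size_mean(sigma, margin, conf_level):
    """Minimum n so that the margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil((z * sigma / margin) ** 2)   # round up, never down

# Invented numbers, for illustration only.
low, high = z_interval(x_bar=100, sigma=15, n=36, conf_level=0.95)
needed = sample_size_mean(sigma=15, margin=2, conf_level=0.95)
print(f"95% CI: {low:.2f} to {high:.2f}; n needed for E = 2: {needed}")
```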

9C. Estimating Population Mean µ When You Don’t Know σ

“Houston, we have a problem!” A confidence interval is founded on the sampling distribution of the mean or proportion. Everything in Chapter 8 on the sampling distribution of the mean was based on knowing the standard deviation of the population. But you almost never know the standard deviation of the population. How to resolve this?

The solution comes from William Gosset, who worked for Guinness in Dublin as a brewer. (I swear I am not making this up.) In 1908 he published a paper called The Probable Error of a Mean. For competitive reasons, the Guinness company wouldn’t let him use his own name, and he chose the pen-name “Student”. The t distribution that he described in his paper has been known as Student’s t ever since.

BTW: While looking for Gosset’s original paper, I stumbled on Probable Error of a Mean, The (“Student”) (Moulton). It’s a fascinating look at what Gosset did and didn’t accomplish, and how this classic paper was virtually ignored for years. Things didn’t start to happen till Gosset sent a copy of his tables to R. A. Fisher with the remark that Fisher was the only one who would ever use them! It was Fisher who really got the whole world using Student’s t distribution.

9C1. Student’s t Distribution

Gosset knew that the standard error of the mean is σ/√n, but he didn’t know σ. He wondered what would happen if he estimated the standard error as s/√n, and did some experiments to answer that question.

Since s varies from one sample to the next, this new t distribution spreads out more than the ND. Its peak is shallower, and its tails are fatter. Actually, there’s no such thing as “the” t distribution. There’s a different t for each sample size. The larger the sample, the closer that t distribution is to a normal distribution, but it’s never quite normal.

For technical reasons, t distributions aren’t identified by sample size, but rather by degrees of freedom (symbol df or Greek ν, “nu”). df = n−1. Here are two t distributions:



Solid: standard normal distribution Line: Student’s t for df = 4, n = 5

Solid: standard normal distribution Line: Student’s t for df = 29, n = 30

What do you see? Student’s t for 4 degrees of freedom is quite a bit more spread out than the ND: 12.2% of sample means are more than two standard errors from the mean, versus only 5% for the ND. At this scale, Student’s t for 29 degrees of freedom looks identical to the ND, but it’s not quite the same. You can see that 6% of sample means are more than two standard errors from the mean, versus 5% for the ND.

You don’t really need a list of properties of Student’s t, because your calculator is going to do the work for you. It’s enough to know this:

• There’s a t distribution for each sample size n. They’re identified by degrees of freedom, or df = n−1.
• t distributions show more variability than z (the standard normal curve).
• As df or n increases, the t distributions get closer and closer to normal. Beyond about df = 30, the difference is slight.
• High t numbers occur more often than high z numbers, but only for small samples is the difference very noticeable.
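The tail percentages quoted above can be checked numerically. The sketch below integrates the Student’s t density with Simpson’s rule using only Python’s standard library. Two assumptions are worth flagging: the density formula is the standard textbook one rather than anything taken from this chapter, and the cutoff used is 1.96 standard errors, which appears to be the value behind the “two standard errors” figures of 12.2% and 6%.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom (standard formula)."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + t * t / df))

def middle_area(df, cutoff, steps=2000):
    """Area under the t curve between -cutoff and +cutoff (Simpson's rule)."""
    h = 2 * cutoff / steps                       # steps must be even
    total = t_pdf(-cutoff, df) + t_pdf(cutoff, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(-cutoff + i * h, df)
    return total * h / 3

cutoff = 1.96                                     # assumed cutoff, see lead-in
normal_tails = 2 * (1 - NormalDist().cdf(cutoff))
t_tails = {df: 1 - middle_area(df, cutoff) for df in (4, 29)}

print(f"normal: {normal_tails:.1%} beyond ±{cutoff}")
for df, area in t_tails.items():
    print(f"t, df = {df}: {area:.1%} beyond ±{cutoff}")
```

The fatter tails for df = 4 are the reason a t-based confidence interval is wider than a z-based one for the same small sample.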

9C2. Confidence Interval for µ (Numeric Data)

The logic of confidence intervals for numeric data is the same whether you know the standard deviation of the population or not. Even the requirements are the same. The only difference is between using a z and a t.

BTW: The confidence interval formula for numeric data with unknown σ looks a lot like the one for known σ. You just replace σ by s and z by t:

x̄ − tα/2 · s/√n ≤ µ ≤ x̄ + tα/2 · s/√n (1−α confidence)

(It’s understood that you have to use the right number of degrees of freedom, df = n−1, in finding critical t.)

Example 6: You’re auditing a bank. You take a random sample of 50 cash deposits and find a mean of $189.56 and standard deviation of $42.17. (a) Estimate the mean of all cash deposits, with 95% confidence. (b) The bank’s accounting department tells you that the average cash deposit is over $210.00. Is that believable?

Solution: You want to compute a confidence interval about the mean of all deposits. You have numeric data, and you don’t know the standard deviation of the population, σ. This is Case 1 in Inferential Statistics: Basic Cases. In your sample, n = 50, x̄ = 189.56, and s = 42.17.

First, check the requirements (RC):

Random sample given, OK.
Sample size 50 is > 30, OK.
10n = 10×50 = 500, and surely the bank has more deposits than that.

(Since the sample is large enough, there’s no need to verify normality or check for outliers.)

Now calculate the interval. On your TI-83/84, in the STAT TESTS menu, select 8:TInterval. The difference between Data and Stats is whether you have all the data points, or just summary statistics. In this case you have only the stats, so cursor onto Stats and press [ENTER]. (The lower part of the screen may change.) Enter your sample statistics and your desired confidence level. Write down your inputs before you select Calculate:

TInterval 189.56, 42.17, 50, .95

Proceed to the output screen, and write down everything new. There isn’t much:

(177.58, 201.54)

Finally, write your interpretation. I’m 95% confident that the average of all cash deposits is between $177.58 and $201.54. Caution! Don’t say anything like “95% of deposits are between $177.58 and $201.54.” Your confidence interval is an estimate of the true average of all deposits, and it’s not about the individual deposits. With a standard deviation of $42 and change, you would predict that 95% of deposits are within 2×42.17 = $84.34 either side of the mean, which is a much wider interval. Now turn to part (b). Management claims that the average of all cash deposits is > $210.00. Is that believable? Well, it’s not impossible, but it’s unlikely. You’re 95% confident that the average of all deposits is between $177.58 and $201.54, which means you’re 95% confident that it’s not < $177.58 or > $201.54. But they’re claiming $210, which is outside your confidence interval. Again, they’re unlikely to be correct — there’s less than a 5% likelihood (100%−95% = 5%). Example 7: In a random sample from the 237 vehicles on a used-car lot, the following weights in pounds were found: 2500 3250 4000 3500 2900 4500 3800 3000 5000 2200 Estimate the average weight of vehicles on the lot, with 90% confidence. Solution: Check the requirements first. You have a small sample (n < 30), so you have to verify that the data are ND and there are no outliers. Here are the results of normality check and box-whisker plot in MATH200A:

There’s not much you can write for the box-whisker plot, but you can show the normality test numerically: Random sample, OK. Box-whisker: no outliers, OK. Normality: r(.9936) > crit(.9179). OK. 10% of population size is 23.7, and this sample of 10 is smaller than that. Now proceed to your TInterval. This time you have the actual data, so you choose Data on the screen. Specify your data list. Freq (frequency) should already be set to 1; if not, first press the [ALPHA] key once, and then [1] [ENTER]. Enter your confidence level, and write down your inputs: TInterval L1, 1, .90

When you have raw data, everything on the output screen is new:

(2956, 3974) x̄ = 3465, s = 878.1, n = 10

You’re 90% confident that the average weight of all vehicles on the lot is between 2956 and 3974 pounds. Again, this is an estimate of the average weight of the population (the 237 cars on the lot). In your interpretation, you can’t say anything about the weights of individual vehicles, because you don’t know anything about the weights of individual vehicles, apart from your sample.
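If you’d like to check the TInterval by hand, here is a minimal Python sketch of the same computation, x̄ ± t·s/√n. It uses only the standard library, so the critical t is an assumption typed in from a t table (1.833 for 90% confidence and df = 9), not something the code computes. The variable names are just for illustration.

```python
import math
import statistics

# Vehicle weights in pounds, from Example 7.
weights = [2500, 3250, 4000, 3500, 2900, 4500, 3800, 3000, 5000, 2200]

# Assumption: critical t for 90% confidence with df = n - 1 = 9,
# read from a standard t table rather than computed here.
T_CRIT_90_DF9 = 1.833

n = len(weights)
x_bar = statistics.mean(weights)   # sample mean
s = statistics.stdev(weights)      # sample SD (divides by n - 1)

# Margin of error E = t * s / sqrt(n); the interval is x_bar +/- E.
margin = T_CRIT_90_DF9 * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"x-bar = {x_bar:.0f}, s = {s:.1f}, n = {n}")
print(f"90% CI: ({lower:.0f}, {upper:.0f}), margin of error = {margin:.0f}")
```

The requirements check still has to happen first; the arithmetic is the easy part.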

9C3. The Trouble with Outliers Why do you have to check for outliers? If your sample passes the normality check, isn’t that enough? No! If a sample passes the normality check, it still might have outliers. BTW: How can this be? How can a sample that contains outliers still pass the normality check? Well, back in Chapter 7 I said that if r > crit you can use the normal model and if r < crit you can’t. But that simple rule hides a more complicated truth. No sample is perfectly normal, so you’re not actually deciding “is it normal or not?” Instead, you’re finding the strength of evidence against normality. The smaller r is, the stronger the evidence against a ND. If r < crit, the evidence is so strong that you say the data are non-normal. But if r > crit, you can’t say that the data are definitely normal, only that you can’t rule out a ND based on this test. But outliers make the evidence against the normal model too strong, so if outliers are present then you can’t treat the data as normal. This “fail to prove” is similar to what you saw in Chapter 4 with decision points: you could prove that the correlation was non-zero, but you couldn’t prove that it was zero. Starting in Chapter 10, you’ll see that this is how inferential statistics works whenever you’re testing some proposition.

Why are outliers a problem? Well, your confidence interval depends on the mean and standard deviation of your sample. But x̄ and s are sensitive to outliers. (That sensitivity goes down as sample size goes up, so you don’t have to worry with samples bigger than about 30.)

To make this clearer, let’s look at an example. I drew these 15 points from a moderately skewed population:

157 171 182 189 201 208 217 219 229 242 247 252 265 279 375

The normality test shows r > crit. So far so good. But the box plot shows a big honkin’ outlier:

How big a difference does it make? Quite a lot, unfortunately. Here are the 95% confidence intervals for the original sample, and the sample with the outlier removed. The means are different, the standard deviations are really different, and the high ends of the confidence intervals are pretty different too. (The screens don’t show the margins of error, but they too are quite different: (258.45−199.28)/2 = 29.6 and (239.36−197.5)/2 = 20.9.)

95% CI from full sample

95% CI excluding outlier

Do you say that the outlier increased the mean by almost 5% and the SD by almost 50%, moved the confidence interval and made it wider? That’s not really fair — the sample is what it is (assuming you’ve ruled out a mistake in data entry). If you start throwing out points, you no longer have a random sample. On the other hand, that one point does seem to carry an awful lot of weight, and it doesn’t seem right to have results depend so heavily on one point. So what do you do? If you can, you take another sample, preferably a larger one. Larger samples are less likely to have outliers in the first place, and outliers that do occur have less influence on the results. But taking a new sample may not be practical. An alternative — not really great, but better than nothing — is to do the analysis both ways, once with the full sample and once with the outlier(s) excluded. That will at least give a sense of how much the outliers affect the results.
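To see the “do it both ways” comparison in numbers, here is a small Python sketch that recomputes the mean and SD of the 15-point sample with and without the 375. It is only an illustration of the sensitivity discussed above, not a recommendation to drop the point.

```python
import statistics

# The 15 points from the outlier example.
sample = [157, 171, 182, 189, 201, 208, 217, 219, 229, 242,
          247, 252, 265, 279, 375]
trimmed = sample[:-1]   # same sample with the outlier 375 left out

mean_full, sd_full = statistics.mean(sample), statistics.stdev(sample)
mean_trim, sd_trim = statistics.mean(trimmed), statistics.stdev(trimmed)

# Percent change caused by including the single outlier.
mean_change = (mean_full / mean_trim - 1) * 100
sd_change = (sd_full / sd_trim - 1) * 100

print(f"with outlier:    mean = {mean_full:.2f}, SD = {sd_full:.2f}")
print(f"without outlier: mean = {mean_trim:.2f}, SD = {sd_trim:.2f}")
print(f"mean up {mean_change:.1f}%, SD up {sd_change:.1f}%")
```

One point out of fifteen moves the SD far more than it moves the mean, which is why the interval gets so much wider.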

9C4. How Big a Sample for Numeric Data? With the MATH200A program (recommended):

If you’re not using the program:

Example 8: For the vehicle weights, your margin of error in a 90% CI was 3974−3465 = 509 pounds. How many vehicles would you need in your sample to get a 95% confidence interval with a margin of error of 500 pounds?

What if you don’t have the program? Since t is not super different from the normal distribution, you can alter the above formula and use z in place of t: n = [z_{α/2}·s/E]². But the t distribution is more spread out than the normal (z) distribution, so your answer may be smaller than the actual necessary sample size. If you do that and you get more than about 30, it’s probably nearly right for the t distribution. If your answer is small, you should increase it so that the TInterval doesn’t come out with too large a margin of error.

You calculate z_{α/2} exactly as you did in the sample-size formula for a confidence interval about a proportion. For example, with a 95% CI, 1−α = 0.95, α = 0.05, and α/2 = 0.025. z_{α/2} = z_{0.025} = invNorm(1−.025) = 1.9600, so using z for t you compute sample size [1.96·878.1/500]² = 11.8… → 12. That’s well under 30, so you want to bump it up a bit.
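Here is the z shortcut as a short Python sketch, using the standard library’s NormalDist in place of invNorm. The function name sample_size_z is made up for this illustration. Remember that the result is a lower bound when the true method calls for t.

```python
import math
from statistics import NormalDist

def sample_size_z(s, margin, confidence):
    """Approximate sample size n = [z * s / E]^2, rounded up.

    Uses z in place of t, so the answer can be too small
    when it comes out well under 30.
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # same role as invNorm(1 - alpha/2)
    return math.ceil((z * s / margin) ** 2)

# Vehicle weights: s = 878.1 pounds, desired E = 500 pounds, 95% confidence.
n = sample_size_z(878.1, 500, 0.95)
print(n)
```

Rounding up (never down) keeps the margin of error at or below the target.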

Solution: In MATH200A part 5, select 2:Num unknown since you don’t know the standard deviation of the population. You’re first prompted for the estimated standard deviation s, which is based on your sample. Enter that, then the desired margin of error E and the desired confidence level. When you enter the last piece of information, you’ll notice that the calculator takes several seconds to come up with an answer; this is normal because it has to do an iterative calculation (fancy words for trial and error). Critical t for a 95% CI with 14 degrees of freedom (n = 15) is 2.14, larger than critical z of 1.96 because the t distribution is more spread out. But of course what you really care about is the bottom line: to keep margin of error no greater than 500 pounds in a 95% CI, you need to sample at least 15 vehicles.

BTW: I’m deliberately glossing over this, because the program is a lot easier. But if you want more, check out Case 1 in How Big a Sample Do I Need? That page gives you all the details of the method, with worked-out examples.

At first glance, this procedure is less precise than the successive approximations done by MATH200A. But in fairness, there’s one more source of imprecision that neither method can avoid. Unlike binomial data, where small variations in the prior estimate p̂ made little difference to the computed sample size, for numeric data variations in the standard deviation do make a difference in computed sample size. Since s is squared in the formula, it can be a big difference. This can swamp any pettifogging details about t versus z.

BTW: How is this computed? Start with the margin of error and solve for sample size: E = t_{α/2}·s/√n ⇒ n = [t_{α/2}·s/E]². The problem here is that t_{α/2} depends on df, which depends on n, so you haven’t really isolated sample size on the left side. The only way to solve this equation precisely is by a process of trial and error, and that’s what MATH200A does.
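The trial-and-error idea can be sketched in Python. This is not MATH200A’s actual code, just one way to do it under stated assumptions: the critical t comes from numerically integrating the t density (Simpson’s rule) and bisecting, and the search simply tries n = 2, 3, 4, … until the margin of error is small enough. All function names here are invented for the illustration.

```python
import math

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, by Simpson's rule on the Student's t density.

    math.gamma overflows for df in the hundreds, so this sketch is
    meant for small samples only.
    """
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    h = x / steps
    area = pdf(0) + pdf(x)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + area * h / 3

def t_critical(confidence, df):
    """Two-sided critical t, found by bisection on t_cdf (assumes it is below 200)."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 200.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def sample_size_t(s, margin, confidence):
    """Smallest n with t * s / sqrt(n) <= margin, where t uses df = n - 1."""
    n = 2
    while t_critical(confidence, n - 1) * s / math.sqrt(n) > margin:
        n += 1
    return n

# Vehicle weights: s = 878.1, E = 500 pounds, 95% confidence.
print(sample_size_t(878.1, 500, 0.95))
```

Because t depends on n, there is no closed-form answer; the loop is the “iterative calculation” in a very plain form.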

What Have You Learned? Key ideas:

The point estimate for a population parameter is the sample statistic: sample mean estimates population mean, sample proportion estimates population proportion, and so on. But x̄ and p̂ vary from one sample to the next, so your estimate for µ or p must be a range.
A confidence interval has three numbers: either confidence level, lower bound, upper bound, or confidence level, point estimate, margin of error.
For binomial data, use 1-PropZInt to compute a CI estimate of the population proportion p. See Inferential Statistics: Basic Cases for the requirements.
Find necessary sample size to estimate a population proportion p to within a desired margin of error. Use MATH200A Program part 5 or the formula.
For numeric data, use TInterval to compute a CI estimate of the population mean µ. See Inferential Statistics: Basic Cases for requirements.
To reduce margin of error by a given factor, sample size must increase by the square of that factor.
Always write an interpretation of your CI. With binomial data, your words should make it clear that you’re talking about the proportion in the population, not just your sample. With numeric data, make it clear that you’re talking about an average, not individual data points, and that it’s an average of a whole population, not just your sample.

Study aids:

Inferential Statistics: Basic Cases

Interactive: Triage: Which Inferential Stats Case Should I Use? Statistics Symbol Sheet


Exercises for Chapter 9 Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand. Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1

For a confidence interval, people sometimes say, “There’s a 95% chance that the mean height of all trees in the forest is …” Why is that not correct — why must you say “I’m 95% confident” and not “there’s a 95% chance”?

2

Simple Simon took a random sample of 40 TC3 students and computed a 90% confidence interval of (45.20,60.14) for weekly food expense per student. (He checked all requirements, and his computations were correct.) He wrote, “I’m 90% confident that TC3 students spend between $45.20 and $60.14 a week for food.” There’s a huge mistake in that conclusion. Identify it, and write a correct conclusion.

3

Silly Sally took a random sample of 150 TC3 students and found that 50 of them “usually” or “always” prepare their own food instead of buying from the cafeteria. She computed a 90% confidence interval of (.27002, .39664) and reported “With 90% confidence, on average 27% to 40% of TC3 students usually or always prepare their own food instead of buying from the cafeteria.” What is the biggest mistake in her conclusion?

4

The Neveready Company tested 40 randomly selected A-cell batteries to see how long they would operate a wireless mouse. They found a mean of 1756 minutes (29 hours, 16 minutes) and standard deviation (SD) 142 minutes. With 95% confidence, what’s the average life of all Neveready A cells in wireless mice?

5

In World War II, a prisoner of war flipped a coin 10,000 times and recorded 5,067 heads. (a) Find the point estimate for proportion of heads. (b) What is the sample size? What is the population size?

6

You’re planning to conduct a poll about people’s attitudes toward a hot political issue, and you have absolutely no idea what proportion will be in favor and what proportion will be opposed. If you want a margin of error no more than 3.5% at 95% confidence, how large must your sample be?

7

The Department of Veterans Affairs is under fire for slow processing of veterans’ claims. An investigator for the Nightly Show randomly selected 100 claims (out of 68,917 at one office) and found that 40 of them had been open for more than a year. Find a 90% confidence interval for the proportion of all claims that have been open for more than a year.

8

For her statistics project, Sandra kept track of her commute times for 40 consecutive mornings (8 weeks). Treat this as a random sample. Her mean commute time was 17.7 minutes and her SD was 1.8 minutes. Find a 95% confidence interval for her average time on all commutes, not just this sample of 40.

9

Fifteen women in their 20s were randomly selected for health screening. As part of this, their heights in inches were recorded: 62.5 63 67 63.5 62 63 65 64.5 66.5 64.5 62.5 62 61.5 64.5 67.5 Construct a 95% confidence interval for the average height of women aged 20–29.

10

For his statistics project, Fred measured the body temperature of 18 randomly selected healthy male students. Here are his figures in °F: 98.3 97.7 98.6 98.5 97.5 98.6 98.2 96.9 97.9 96.9 97.8 99.3 98.6 99.2 96.9 97.8 97.9 98.3 (a) Write a 90% confidence interval for the average body temperature of healthy male students. (b) What does this say about the famous “normal” temperature of 98.6°? (c) What is his margin of error? (d) To get an answer to within 0.1° with 95% confidence, how many students would he have to sample?

11

The Colorectal Cancer Screening Guidelines (CDC 2014) recommend a colonoscopy every ten years for adults aged 50 to 75. A public-health researcher interviews a simple random sample of 500 adults aged 50–75 in Metropolis (pop. 6.4 million) and finds that 219 of them have had a colonoscopy in the past ten years. (a) What proportion of all Metropolis adults in that age range have had a colonoscopy in the past ten years, at the 90% level of confidence? (b) Still at the 90% confidence level, what sample size would be required to get an estimate within a margin of error of 2%, if she uses her sample proportion as a prior estimate?

12

The next year, you go back to audit the bank again. This time, you take a random sample of 20 cash deposits. Here are the amounts: 192.68 188.24 152.37 211.73 201.57 167.79 177.19 191.15 209.22 178.49 185.90 226.31 192.38 190.23 156.13 224.07 191.78 203.45 186.40 160.83 Construct a 95% confidence interval for the average of all cash deposits at the bank.

13

Not wanting to wait for the official results, Abe Snake commissioned an exit poll of voters. In a systematic sample of 1000 voters, 520 (52%) said they voted for Abe Snake. (14,000 people voted in the election.) That sounds good, but can he be confident of victory, at the 95% level?




What’s New 1 Jan 2016: Retake the screen shots here and here, for the new version of MATH200A Program part 4. 30 June 2015: Correct a typo, thanks to Darlene Huff. 23 May 2015: Add the Rule of Three. 10 Jan 2015: New section: The Trouble with Outliers. (intervening changes suppressed) 26 June 2013: New document.

10. Hypothesis Tests Updated 1 Jan 2016 (What’s New?) Summary:

You want to know if something is going on (if there’s some effect). You assume nothing is going on (null hypothesis), and you take a sample. You find the probability of getting your sample if nothing is going on (p-value). If that’s too unlikely, you conclude that something is going on (reject the null hypothesis). If it’s not that unlikely, you can’t reach a conclusion (fail to reject the null).

Contents:

10A. Testing a Proportion (Binomial Data) 10A1. Example 1: Swain v. Alabama · Step 1: Hypotheses · Step 2: Significance Level · Step RC: Requirements Check · Steps 3/4: Test Statistic and p-Value · Step 5: Decision Rule · Step 6: Conclusion (in English) 10A2. Example 2: Cancer Screening 10A3. Example 3: Small Samples 10B. Sharp Points 10B1. Type I and Type II Errors 10B2. One-Tailed or Two-Tailed? · Pick the Right Hypotheses · p < α in Two-Tailed Test: What Does it Tell You? 10B3. What Does the p-Value Mean? 10B4. Practical and Statistical Significance 10B5. Conclusions: Write ’em Right! · When p < α, you reject H 0 and accept H 1 . · When p > α, you fail to reject H 0 . 10C. Testing a Mean (Numeric Data) 10C1. Example 12: Bank Deposits 10C2. Example 13: Smokers and Retirement 10D. Confidence Interval and Hypothesis Test 10E. Testing a Non-Random Sample What Have You Learned? Exercises for Chapter 10 Problem Set 1 Problem Set 2 What’s New

10A. Testing a Proportion (Binomial Data) Remember the Swain v. Alabama example? In a county that was 26% African American, Mr. Swain’s jury pool of 100 men had only eight African Americans. In that example, you assumed that selection was not racially biased, and on that basis you computed the probability of getting such a low proportion. You found that it was very unlikely. This disconnect between the data and the claim led you to reject the claim. You didn’t know it, but you were doing a hypothesis test. This is the standard way to test a claim in statistics: assume nothing is going on, compute the probability of getting your sample, and then draw a conclusion based on that probability. In this chapter, you’ll learn some formal methods for doing that. BTW: The basic procedure of a hypothesis test or significance test is due to Jerzy Neyman (1894–1981), a Polish American, and Egon Pearson (1895–1980), an Englishman. They published the relevant paper in 1933.

We’re going to take a seven-step approach to hypothesis tests. The first examples will be for binomial data, testing a claim about a population proportion. Later in this chapter you’ll use a similar approach with numeric data to test a claim about a population mean. In later chapters you’ll learn to test other kinds of claims, but all of them will just be variations on this theme.

10A1. Example 1: Swain v. Alabama Step 1: Hypotheses Your first task is to turn the claim into algebra. The claim may be that nothing is going on, or that something is going on. You always have two statements, called the null and alternative hypotheses. Definition:

The null hypothesis, symbol H 0 , is the statement that nothing is going on, that there is no effect, “nothin’ to see here. Move along, folks!” It is an equation, saying that p, the proportion in the population (which you don’t know), equals some number.

Definition:

The alternative hypothesis, symbol H 1 , is the statement that something is going on, that there is an effect. It is an inequality, saying that p is different from the number mentioned in H 0 . (H 1 could specify <, >, or just ≠.)

The hypotheses are statements about the population, not about your sample. You never use sample data in your hypotheses. (In real life you can’t make that mistake, since you write your hypotheses before you gather data. But in the textbook and the classroom, you always have sample data up front, so don’t make a rookie mistake.) You must have the algebra (symbols) in your hypotheses, but it can also be helpful to have some English explaining the ultimate meaning of each hypothesis, or the consequences if each hypothesis is true. Here you want to know whether there’s racial bias in jury selection in the county. You don’t want to know if the proportion of African Americans in Mr. Swain’s jury pool is less than 26%: obviously it is. You want to know if it’s too different — if the difference is too great to be believable as the result of random chance. Write your hypotheses this way: (1) H 0 : p = 0.26, there’s no racial bias in jury selection H 1 : p < 0.26, there is racial bias in jury selection Obviously those can’t both be true. How will you choose between them? You’ll compute the probability of getting your sample (or a more unexpected one), assuming that the null hypothesis H0 is true, and one of two things will happen. Maybe the probability will be low. In that case you rule out the possibility that random chance is all that’s happening in jury selection, and you conclude that the alternative hypothesis H 1 is true. Or maybe the probability won’t be too low, and you’ll conclude that this sample isn’t unusual (unexpected, surprising) for the claimed population. The number in your null hypothesis H 0 , with binomial data, is called po because it’s the proportion as given in H 0 . (You may want to refer to the Statistics Symbol Sheet.) BTW: What exactly is p? Yes, it’s the population proportion being tested, but what’s the population? It can’t be people in the county, or men in the county, or African-American men in the county. 
In fact it’s all people serving on Talladega County jury pools past, present and future. If there’s racial bias, then African Americans are less likely to be selected than whites, and — probability of one, proportion of all — therefore the overall population of jury pools has less than 26% African Americans. If there’s no racial bias, then in the long run the overall population of jury pools has the same 26% of African Americans as the county. BTW: Although a hypothesis test is officially about the population, in cases like this one it’s okay to think of it as answering a simpler question: Is the difference between the claim of no racial bias and the reality of this sample significant, or could it be explained away as the result of random chance? The hypotheses are the same either way, the calculations are the same, and the conclusions are the same. This is why a hypothesis test is also called a significance test or a test of significance.

Step 2: Significance Level Okay, you’re looking to figure out if this sample is inconsistent with the null hypothesis. In other words, is it too unlikely, if the null hypothesis H 0 is true? But what do you mean by “too unlikely”? Back in Chapter 5, we talked about unusual events, with a threshold of 5% or 0.05 for such events. We’ll use that idea in hypothesis testing and call it a significance level. Definition:

The significance level, symbol α (the Greek letter alpha), is the chance of being wrong that you can live with. By convention, you write it as a decimal, not a percentage.


(2) α = 0.05

A significance level of 0.05 is standard in business and science. If you can’t tolerate a 5% chance of being wrong, because the consequences are particularly serious, use a lower significance level, 0.01 or 0.001 for example. (0.001 is common if there’s a possibility of death or serious disease or injury.) If the consequences of being wrong are especially minor, you might use a higher significance level, such as 0.10, but this is rare in practice. In a classroom setting, you’re usually given a significance level to use.

BTW: Later in this chapter, you’ll see that the significance level is actually concerned with a particular way of being wrong, a Type I error.

Step RC: Requirements Check Back in Chapter 8, you learned the CLT’s requirements for binomial data: random sample not larger than 10% of population, and at least 10 successes and 10 failures expected if the null hypothesis is true. You compute expected successes as npo by using po , which is the number from H 0 . Expected failures are then sample size minus expected successes, n−npo in symbols. Steps 3 and 4 need the sampling distribution of the proportion to be a ND, so you must check the requirements as part of your hypothesis test. (RC)

Random sample? Yes, according to the county. ✔
10n = 10×100 = 1000. We don’t know the number of adult males in the county, but it must be greater than 1000, surely. (“I know that, and don’t call me Shirley.”) ✔
Expected successes = npo = 100×.26 = 26; expected failures are 100−26 = 74; both are ≥ 10. ✔
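The arithmetic part of this requirements check is easy to script. Here is a small Python sketch; the function name and the dictionary keys are invented for illustration, and the randomness of the sample is something no code can verify for you.

```python
def check_binomial_requirements(n, p0, population_size=None):
    """Numeric requirements for a one-proportion z test.

    Expected counts use p0 from the null hypothesis, not the sample proportion.
    """
    expected_successes = n * p0
    expected_failures = n - expected_successes
    return {
        "expected_successes": expected_successes,
        "expected_failures": expected_failures,
        # At least 10 expected successes and 10 expected failures.
        "enough_expected": expected_successes >= 10 and expected_failures >= 10,
        # Sample no larger than 10% of population; None if the size is unknown.
        "sample_small_enough": (
            None if population_size is None else 10 * n <= population_size
        ),
    }

# Swain v. Alabama: n = 100, p0 = 0.26, population size not known exactly.
swain = check_binomial_requirements(100, 0.26)
print(swain)
```

When "enough_expected" comes back False, the normal model is out and you would compute the binomial probability directly instead.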

You might wonder about the first test. “The county may say it’s random, but I don’t believe it. Isn’t that why we’re running this test?” Good question! Answer: Every hypothesis test assumes the null hypothesis is true and computes everything based on that. If you end up deciding that the sample was too unlikely, in effect you’ll be saying “I assumed nothing was going on, but the sample makes that just too hard to believe.”

This same idea (the null hypothesis H0 is innocent till proven guilty) explains why you use 0.26 (po ) to figure expected successes and failures, not 0.08 (p̂). Again, the county claims that there’s no racial bias. If that’s true, if there’s no funny business going on, then in the long run 26% of members of jury pools should be African American.

Comment: Usually, if requirements aren’t met you just have to give up. But for one-population binomial data, where the other two requirements are met but expected successes or failures are much under 10, you can use MATH200A part 3 to compute the p-value directly. There’s an example in “Small Samples”, below.

Steps 3/4: Test Statistic and p-Value

This is the heart of a hypothesis test. You assume that the null hypothesis is true, and then use what you know about the sampling distribution to ask: How likely is this sample, given that null hypothesis? Definition:

A test statistic is a standardized measure of the discrepancy between your null hypothesis H 0 and your sample. It is the number of standard errors that the sample lies above or below H 0 .

You can think of a test statistic as a measure of unbelievability, of disagreement between H 0 and your sample. A sample hardly ever matches your null hypothesis perfectly, but the closer the test statistic is to zero the better the agreement, and the further the test statistic is from 0 the worse the sample and the null hypothesis disagree with each other.

BTW: Because you showed that the sampling distribution is normal and the standard error of the proportion is implicitly known, this is a z test. The test statistic is z = (p̂ − po) / σ_p̂, where σ_p̂ = √(po(1−po)/n), but as you’ll see your calculator computes everything for you.

Definition:

The p-value is the probability of getting your sample, or a sample even further from H 0 , if H 0 is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
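To make the two definitions concrete, here is a hedged Python sketch of a one-proportion z test, using the standard error √(po(1−po)/n) computed from the null hypothesis. The function name one_prop_z_test and its tail argument are made up for this illustration; it is meant to show the arithmetic, not to replace the calculator’s 1-PropZTest.

```python
import math
from statistics import NormalDist

def one_prop_z_test(p0, x, n, tail="<"):
    """Return (z, p_value) for H0: p = p0 against H1: p <, >, or != p0."""
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error, assuming H0 is true
    z = (p_hat - p0) / se               # standard errors above or below H0
    cdf = NormalDist().cdf(z)
    if tail == "<":
        p_value = cdf
    elif tail == ">":
        p_value = 1 - cdf
    else:                               # two-tailed
        p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value

# Swain v. Alabama: 8 African Americans in a pool of 100, H0: p = 0.26, H1: p < 0.26.
z, p_value = one_prop_z_test(0.26, 8, 100, "<")
print(f"z = {z:.2f}, p-value = {p_value:.6f}")
```

A z this far below zero says the sample sits several standard errors under what H0 predicts, which is why the p-value is so tiny.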



Inferential Statistics: Basic Cases tells you that binomial data in one population are Case 2. This is a hypothesis test of population proportion, and you use 1-PropZTest on your calculator. To get to that menu selection, press [STAT] [◄] [5]. Enter po from the null hypothesis H 0 , followed by the number of successes x, the sample size n, and the alternative hypothesis H 1 . Write everything down before you select Calculate. When you get to the output screen, check that your alternative hypothesis H 1 is shown correctly at the top of the screen, and then write down everything that’s new.

(3/4) 1-PropZTest .26, 8, 100, […] 0.096, percentage has increased

(2)

α = 0.05

(RC)

SRS? Systematic sample can be analyzed like a random sample. ✔
10n≤N? 10×80 = 800, less than number of car owners in any county. ✔
Expected successes are npo = 80×.096 = 7.7, too far below 10 to live with. ✘
The sampling distribution of p̂ doesn’t follow the normal model, so you can’t use 1-PropZTest. But the other two requirements are met, so you can proceed, calculating the binomial probability directly.

(3/4)

MATH200A/Binomial prob: n=80, p=0.096, x=13 to 80; p-value = 0.0410 (If you don’t have the program, use 1−binomcdf(80,0.096,12) = 0.0410.) [Why 13 to 80? H 1 contains >, so you test the probability of getting the sample you got, or a larger one, if H 0 is true. If H 1 contained […]

[…] The significance level was given in the problem. (Problems will usually give you an α to use.)

(2) α = 0.05

Next is the requirements check. Even though it doesn’t have a number, it’s always necessary. In this case, n = 20, which is less than 30, so you have to test for normality and verify that there are no outliers. Enter your data in any statistics list (I used L5), and check your data entry carefully. Use the MATH200A program “Normality chk” to check for a normal distribution and “Box-whisker” to verify that there are no outliers.

You don’t need to draw the plots, but do write down r and crit and show the comparison, and do check for outliers. (For what to do if you have outliers, see Chapter 3.) (RC)

Random sample: given. 10n = 10×20 = 200, and the bank had better have more deposits than that or it can’t afford to pay you for your work! Normality: yes. From MATH200A part 4, r(0.9864) > crit(0.9503). Outliers: none (MATH200A part 2).

Now it’s time to compute the test statistic (t) and the p-value. On the T-Test screen, you have to choose Data or Stats just as you did on the TInterval screen. You have the actual data, so you select Data on the T-Test screen, instead of Stats. Then the sample mean, sample SD, and sample size are shown on the output screen, so you write them down as part of your results. Always write down x̄, s, and n.

(3/4) T-Test: µo =200, List=L5, Freq=1, µ≠µo results: t=−2.33, p=0.0311, x̄=189.40, s=20.37, n=20

The decision rule is the same for every single hypothesis test, regardless of data type. In this case:

(5) p < α. Reject H 0 and accept H 1 .

And as usual, you can write your conclusion with the significance level or the p-value:

(6) At the 0.05 level of significance, management is incorrect and the average of all cash deposits is different from $200.00. In fact, the true average is lower than $200.00.

Or, (6) Management is incorrect, and the average of all cash deposits is different from $200.00 (p = 0.0311). In fact, the true average is lower than $200.00.

Remember what happens when you do a two-tailed test (≠ in H 1 ) and p turns out less than α: After you write your “different from” conclusion, you can go on to interpret the direction of the difference. See p < α in Two-Tailed Test.

In a classroom exercise, if you were asked to do a hypothesis test you would do a hypothesis test and only a hypothesis test. But in real life, and in the big labs for class, it makes sense to answer the obvious question: If the true mean is less than $200.00, what is it? You don’t have to check requirements for the CI, because you already checked them for the HT.

TInterval L5, 1, .95 outputs: (179.86, 198.93)

With 95% confidence, the average of all cash deposits is between $179.86 and $198.93.
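For the curious, the T-Test numbers can be approximated from the summary statistics alone. The sketch below computes t = (x̄ − µo)/(s/√n) and gets the two-tailed p-value by numerically integrating the t density with Simpson’s rule. That integration is a stand-in chosen for this illustration (the standard library has no t distribution), and the function names are invented. Because the inputs are rounded, expect agreement with the calculator only to about two or three decimal places.

```python
import math

def t_upper_tail(t, df, steps=2000):
    """P(T > |t|) for Student's t, by Simpson's rule on the density."""
    x = abs(t)
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    h = x / steps
    area = pdf(0) + pdf(x)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - area * h / 3   # area from 0 to |t| removed from the upper half

def t_test_from_stats(mu0, x_bar, s, n):
    """Return (t, two-tailed p-value) for H0: mu = mu0, with df = n - 1."""
    t = (x_bar - mu0) / (s / math.sqrt(n))
    return t, 2 * t_upper_tail(t, n - 1)

# Example 12: mu0 = 200, x-bar = 189.40, s = 20.37, n = 20, two-tailed.
t_stat, p_value = t_test_from_stats(200, 189.40, 20.37, 20)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

For a one-tailed test you would use the single tail in the direction of H 1 instead of doubling.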

10C2. Example 13: Smokers and Retirement

Here’s an example where you have statistics without the raw data. It’s adapted from Sullivan (2011, 483).

According to the Centers for Disease Control, the mean number of cigarettes smoked per day by individuals who are daily smokers is 18.1. Do retired adults who are daily smokers smoke less than the general population of daily smokers? To answer this question, Sascha obtains a random sample of 40 retired adults who are current daily smokers and records the number of cigarettes smoked on a randomly selected day. The data result in a sample mean of 16.8 cigarettes and a SD of 4.7 cigarettes. Is there sufficient evidence at the α = 0.01 level of significance to conclude that retired adults who are daily smokers smoke less than the general population of daily smokers?

Solution: Start with the hypotheses. You’re comparing the unknown mean µ for retired smokers to the fixed number 18.1, the known mean for smokers in general. Since the data type is numeric (number of cigarettes smoked), and there’s one population, and you don’t know the SD of the population, this is Case 1, test of population mean, from Inferential Statistics: Basic Cases. (1)

H 0 : µ = 18.1, retired smokers smoke the same amount as smokers in general H 1 : µ < 18.1, retired smokers smoke less than smokers in general Comment: The claim is a population mean of 18.1, so you use 18.1 in your hypotheses. Using the sample mean of 16.8 in Step 1 is a rookie mistake, one of the Top 10 Mistakes of Hypothesis Tests. Never use sample data in your hypotheses. Comment: Why does H 1 have < instead of ≠? The short answer is: that’s what the problem says to do. In the real world, you would do a two-tailed test (≠) unless there’s a specific reason to do a one-tailed test (< or >); see One-Tailed or Two-Tailed? (earlier in this document). Presumably there’s some reason why they are interested only in the case “retired smokers smoke less” and not in the case “retired smokers smoke more”.

(2)

α = 0.01

(RC)

Random sample (given). n > 30. 10n = 10×40 = 400, less than the total number of retired smokers. Therefore the sampling distribution is normal.

(3/4)

T-Test: µo =18.1, x̄=16.8, s=4.7, n=40, µ<µo […]

(5) p > α. Fail to reject H 0 .

(6)

At the 0.01 level of significance, we can’t determine whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not. Or, We can’t tell whether the average number of cigarettes smoked per day by retired adults who are current smokers is less than the average for all daily smokers or not (p = 0.0440).

When you fail to reject H0, you cannot reach any conclusion. You must use neutral language in your non-conclusions. Please review When p > α, you fail to reject H0 earlier in this chapter.

10D. Confidence Interval and Hypothesis Test Summary:

You can use a confidence interval to conclude whether results are statistically significant. A hypothesis test (HT) and confidence interval (CI) are two ways of looking at the same thing: what possibilities for the population mean or proportion are consistent with my sample? A 95% CI is the flip side of a 0.05 two-tailed HT. More generally, a 1−α CI is the complement of an α two-tailed HT.

Example 14: The baseline rate for heart attacks in diabetes patients is 20.2% in seven years. You have a new diabetes drug, Effluvium, that is effective in treating diabetes. Clinical trials on 89 patients found that 27 (30.3%) had heart attacks. The 95% confidence interval is 20.8% to 39.9% likelihood of heart attack within seven years for diabetes patients taking Effluvium. What does this tell you about the safety of Effluvium?

Solution: Okay, you’re 95% confident that Effluvium takers have a 20.8% to 39.9% chance of a heart attack within seven years. If you’re 95% confident that their chance of heart attack is inside that interval, then there’s only a 5% or 0.05 probability that their chance of heart attack is outside the interval, namely below 20.8% or above 39.9%. But 20.2% is outside the interval, so there’s less than a 0.05 chance that the true probability of heart attack with Effluvium is 20.2%.

CI and HT calculations both rely on the sampling distribution. The open curve centered on 20.2% shows the sampling distribution for a hypothetical population proportion of 20.2%. Only a very small part of it extends beyond 30.3%, the proportion of heart attacks you actually found in your sample. The chance of getting your sample, given a hypothetical proportion po in the population, is the p-value. If po = 20.2%, your sample with p̂ = 30.3% would be unlikely (p-value below 0.05). You would reject the null hypothesis and conclude that Effluvium takers have a different likelihood of heart attack from other diabetes patients, at the 0.05 significance level. Further, the entire confidence interval is above the baseline value, so you know that Effluvium increases the likelihood of heart attack in diabetes patients.

At significance level 0.05, a two-tailed test against any value outside the 95% confidence interval (the shaded curve) would lead to rejecting the null hypothesis. And you can say the same thing for any other significance level α and confidence level 1−α.
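To see the interval and the test for the Effluvium trial side by side, here is a minimal Python sketch that is not part of the original text. It assumes the usual normal approximation for a proportion; all variable names are illustrative. The confidence interval builds its standard error from the sample proportion, while the test builds its standard error from the baseline po.

```python
from math import sqrt
from statistics import NormalDist

x, n = 27, 89      # heart attacks and patients in the trial
p0 = 0.202         # baseline seven-year heart attack rate
alpha = 0.05

p_hat = x / n
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Confidence interval: standard error uses the sample proportion.
se_ci = sqrt(p_hat * (1 - p_hat) / n)
ci_low, ci_high = p_hat - z_crit * se_ci, p_hat + z_crit * se_ci

# Hypothesis test: standard error uses the hypothesized proportion p0.
se_ht = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se_ht
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"95% CI: {ci_low:.1%} to {ci_high:.1%}")
print(f"z = {z:.2f}, two-tailed p-value = {p_value:.4f}")
print("baseline inside CI:", ci_low <= p0 <= ci_high)
```

The interval should match the 20.8% to 39.9% quoted in the example, with the baseline 20.2% outside it and the two-tailed p-value below 0.05.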
What if the interval does include the baseline or hypothetical value? Then you fail to reject the null hypothesis.

Example 15: A machine is supposed to be turning out something with a mean value of 100.00 and SD of 6.00, and you take a random sample of 36 objects produced by the machine. If your sample mean is 98.4 and SD is 5.9, your 95% confidence interval is 96.4 to 100.4. Now, can you make any conclusion about whether the machine is working properly?

Solution: Well, you’re 95% confident that the machine’s true mean output is somewhere between 96.4 and 100.4. With this sample, you can rule out a true population mean below 96.4 or above 100.4, at the 0.05 significance level; but you can’t rule out a true population mean between 96.4 and 100.4 at α = 0.05. A hypothesis test would fail to reject the hypothesis that µ = 100. You can’t determine whether the true mean output of the machine is equal to 100 or not.
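The interval in Example 15 can be recomputed from the summary numbers. This is a minimal Python sketch, not part of the original text. It assumes a t interval based on the sample SD, and the helpers t_pdf, t_cdf, and t_critical are hypothetical names that find the critical t by numerical integration and bisection, using only the standard library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): 0.5 plus the signed area from 0 to x, by Simpson's rule."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(conf, df):
    """Critical t that leaves central area conf, found by bisection on the CDF."""
    target = 1 - (1 - conf) / 2
    lo, hi = 0.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

xbar, s, n = 98.4, 5.9, 36   # sample mean, sample SD, sample size
mu0 = 100.0                  # the value the machine is supposed to produce

margin = t_critical(0.95, n - 1) * s / math.sqrt(n)
low, high = xbar - margin, xbar + margin
print(f"95% CI: {low:.1f} to {high:.1f}")
print("100 inside CI, so fail to reject H0" if low <= mu0 <= high else "100 outside CI, so reject H0")
```

The result should reproduce the 96.4 to 100.4 interval from the example, with 100 inside it.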

When µo or po is inside the 1−α CI, the two-tailed p-value is > α. Your sample does not contradict H0 and you fail to reject H0.
When µo or po is outside the 1−α CI, the two-tailed p-value is < α. Your sample contradicts H0, and you reject H0.

Leaving the symbols aside, when you test a null hypothesis your sample either is surprising (and you reject the null hypothesis) or is not surprising (and you fail to reject the null). Any null hypothesis value inside the confidence interval is close enough to your sample that it would not get rejected, and any null hypothesis value outside the interval is far enough from the sample that it would get rejected.

Special Note for Binomial Data

For numeric data, the CI and HT are exactly equivalent. But for binomial data, the CI and HT are only approximately equivalent. Why? Because with binomial data, the HT uses a standard error derived from po in the null hypothesis, but the CI uses a standard error derived from p̂, the sample proportion. Since the standard errors are slightly different, right around the borderline they might get different answers. But when the hypothetical po is a fair distance outside the CI, as it was in the drug example, the p-value will definitely be less than α.

What about One-Tailed Tests?

Good question! A confidence interval is symmetric (for the cases you study in this course), so it’s intrinsically two-tailed. A one-tailed HT for < or > at α = 0.01 corresponds to a two-tailed HT for ≠ at α = 0.02, so the CI for a one-tailed HT at α = 0.01 is a 98% CI, not a 99% CI. The confidence level for a one-tailed test is 1−2α, not 1−α.

Correspondence between Significance Level and Confidence Level

α        tails    C-Level
0.05     1        1−2×.05 = 90%
0.05     2        1−.05 = 95%
0.01     1        1−2×.01 = 98%
0.01     2        1−.01 = 99%
0.001    1        1−2×.001 = 99.8%
0.001    2        1−.001 = 99.9%

If the baseline value is outside the confidence interval, you can say (at the appropriate significance level) that the true value of µ or p is different from the baseline, and then go on to say whether it’s bigger or smaller, so you get your one-tailed result. On the other hand, if the baseline value is inside the confidence interval, you can’t say whether the true µ or p is equal to the baseline or different from it, and if you can’t say whether they’re different then you can’t say which one is bigger than the other.
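The correspondence between significance level and confidence level is a one-line rule. Here is a minimal Python sketch of it, not part of the original text; the function name confidence_level is hypothetical.

```python
def confidence_level(alpha, tails):
    """Confidence level that matches a test at significance alpha with 1 or 2 tails."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # A one-tailed test at alpha pairs with a two-tailed test at 2*alpha.
    return 1 - alpha if tails == 2 else 1 - 2 * alpha

for alpha in (0.05, 0.01, 0.001):
    for tails in (1, 2):
        level = confidence_level(alpha, tails)
        print(f"alpha = {alpha}, tails = {tails}: C-Level = {level:.1%}")
```

For example, confidence_level(0.01, 1) gives the 98% level that goes with a one-tailed test at the 0.01 significance level.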

10E. Testing a Non-Random Sample

Though most hypothesis tests are to find out something about a population, sometimes you just want to know whether this sample is significantly different from a population. In this case, you don’t need a random sample, but the other requirements must still be met.

Example 16: At Wossamatta University, instructors teach the statistics course independently but all sections take the same final exam. (There are several hundred students.) One semester, the mean score on the exam is 74. In one section of 30 students, the mean was 68.2 and the SD was 10.4. The students felt that they had not been adequately prepared for the exam by the instructor. Can they make their case?

Solution: In effect, they are saying that their section performance was significantly below the performance of students in the course overall. This is a testable hypothesis. But the hypothesis is not about the population that these 30 students were drawn from; we already know about that population. Instead, it is a test whether this sample, as a sample, is different from the population.

(1) H0: This section’s mean was no different from the course mean.
H1: This section’s mean was significantly below the course mean.

(2) α = 0.05

(RC) (Omit the requirement for a random sample.) 10n = 10×30 = 300 is less than the “several hundred students” in the course. Sample size is ≥30, so the sampling distribution is normal.

(3/4) T-Test: µo = 74, x̄ = 68.2, s = 10.4, n = 30, µ < µo
Outputs: t = −3.05, p-value = 0.0024

(5) p < α. Reject H0 and accept H1.

(6) This section’s average exam score was less than the overall course average (p-value = 0.0024).

Okay, there was a real difference. This section’s mean exam score was not only below the average for the whole course, but too far below for random chance to be enough of an explanation. But did the students prove their case? Their case was not just that their average score was lower, but that the difference was the result of poor teaching. Statistics can’t answer that question so easily. Maybe it was poor teaching; maybe these were weaker students; maybe it was environmental factors like classroom temperature or the time of day; maybe it was all of the above.

What Have You Learned?

Key ideas:

You don’t know the proportion or mean of a population. You want to test whether it is different from some baseline number. You take a sample, and then compute how likely that sample would be if the true proportion or mean in the population is equal to that baseline. If the sample is too unlikely, you reject the null hypothesis and conclude that the true proportion or mean must be different from that baseline number.
Know the seven steps of hypothesis tests. Know them by heart, and write them on your cheat sheet if you need to.
Know whether you have binomial or numeric data. This totally determines which type of test you will do, so think before you act! When you have numeric data, you test for the mean of a population (hypotheses about µ). When you have binomial data in a count of successes, you test for the proportion in a population (hypotheses about p).
Understand one-tailed versus two-tailed tests. When should you use which one? How do you interpret the results in step 6?
Understand the significance level α. Know how to pick an appropriate level.
Understand the p-value. It’s the probability, if H0 is true, of getting the sample you got (or one even further away from H0).
Know how to write conclusions (if p-value < α) or non-conclusions (if p-value > α).
Understand Type I and Type II errors. Describe what each one means in specific situations.
Understand the relationship between a confidence interval and a hypothesis test. How can you relate the endpoints of a CI to whether you do or don’t have a statistically significant result, so that H0 would or wouldn’t be rejected?

Study aids:

Inferential Statistics: Basic Cases
Interactive: Triage: Which Inferential Stats Case Should I Use?
Seven Steps of Hypothesis Tests
Top 10 Mistakes of Hypothesis Tests
Statistics Symbol Sheet


Exercises for Chapter 10

Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand.

Caution! If you don’t see how to start a problem, don’t peek at the solution; you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

Problem Set 1

1

List the seven steps of every hypothesis test.

2

Why must you select a significance level before computing a p-value?

3

Explain the p-value in your own words.

4

You’ve tested the hypothesis that the new accelerant makes a difference to the time to dry paint, using α = 0.05. What is wrong with each conclusion, based on the p-value? Write a correct conclusion for that p-value. (a) p = 0.0214. You conclude, “The accelerant may make a difference, at the 0.05 significance level.” (b) p = 0.0714. You conclude, “The accelerant makes no difference, at the 0.05 significance level.”

5

You are testing whether the new accelerant makes your paint dry faster. (You have already eliminated the possibility that it makes your paint dry slower.) (a) What conclusion would be a Type I error? What wrong action would a Type I error lead you to take? (b) What conclusion would be a Type II error? What wrong action would a Type II error lead you to take?

6

Are Type I and Type II errors actually mistakes? What one thing can you do to prevent both of them, or at least make them both less likely?

7

What can you do to make a Type I error less likely at a given sample size? What’s the unfortunate side effect of that?

8

Explain in your own words the difference between “accept H0” (wrong) and “fail to reject H0” (correct) when your p-value is > α.

9

The engineering department claims that the average battery lifetime is 500 minutes. Write both hypotheses in symbols.

10

Suppose H0 is “the directors are honest” and H1 is “the directors are stealing from the company.” Write conclusions, in Statistics and in English, if … (a) p = 0.0405 and α = 0.01 (b) p = 0.0045 and α = 0.01

11

In your hypothesis test, H 0 is “the defendant is innocent” and H 1 is “the defendant is guilty”. The crime carries the death penalty. Out of 0.05, 0.01, and 0.001, which is the most appropriate significance level, and why?

12

When Keith read the AAA’s statement that 10% of drivers on Friday and Saturday nights are impaired, he believed the proportion was actually higher for TC3 students. He took a systematic sample of 120 students and, on an anonymous questionnaire, 18 of them admitted being alcohol impaired the last Friday or Saturday night that they drove. Can he prove his point, at the 0.05 significance level?

13

In 2006–2008 there was controversy about creating a sewer district in south Lansing, where residents have had their own septic tanks for years. The Sewer Committee sent out an opinion poll to every household in the proposed sewer district. In a letter to the editor, published 3 Feb 2007 in the Ithaca Journal, John Schabowski wrote, in part: The Jan. 4 Journal article about the sewer reported that “only” 380 of 1366 households receiving the survey responded, with 232 against it, 119 supporting it, and 29 neutral. ... The survey results are statistically valid and accurate for predicting that the sewer project would be voted down by a large margin in an actual referendum.

Can you do a hypothesis test to show that more than half of Lansing households in the proposed district were against the sewer project? (You’re trying to show a majority against, so combine “supporting” and “neutral” since those are not against.)

14

Esperanza wanted to determine whether more than 40% of grocery shoppers — specifically, the primary grocery shoppers in their households — regularly use manufacturers’ coupons. She conducted a random telephone survey and contacted 500 people. (For this exercise, let’s assume that telephone subscribers are representative of grocery shoppers.) Of the 500 she contacted, 325 do the grocery shopping in their households. Of those 325, 182 said they regularly use manufacturers’ coupons. (a) What is the size of the sample? (Think before you answer!) (b) What is the population, and how large is it? (c) What does the number 182 represent? (d) Don’t do a hypothesis test. But if you did, what would po be? (e) Is it a source of bias that she considered only each household’s primary grocery shopper?

15

Doubting Thomas remembered the Monty Hall example from Chapter 5, but he didn’t believe the conclusion that switching doors would improve the chance of winning to 2/3. (It’s okay if you don’t remember the example. All the facts you need are right here.) Thomas watched every Let’s Make a Deal for four weeks. (Though this isn’t a random sample, treat it as one. There’s no reason why the show should operate differently in these four weeks from any others.) In that time, 30 contestants switched doors, and 18 of them won. (a) At the 0.05 significance level, is it true or false that your chance of winning is 2/3 if you switch doors? (b) At the 95% confidence level, estimate your chance of winning if you switch doors. (c) If you don’t switch doors, your chance of winning is 1/3. Using your answer to (b), is switching doors definitely a good strategy, or is there some doubt?

16

Most of us have spam filters on our email. The filter decides whether each incoming piece of mail is spam. Heather trusts her spam filter, and she sets it to just delete spam rather than save it to a folder. (a) What would Heather’s spam filter do if it makes a Type I error? What would it do if it makes a Type II error? (b) Which is more serious here, a Type I error or a Type II error? Should the significance level be set higher or lower?

17

Rosario read in Chapter 6 that 30.4% of US households own cats. She felt like dogs were a lot more visible than cats in Ithaca, so she decided to test whether the true proportion of cat ownership in Ithaca was less than the national proportion. She took a systematic sample of Wegmans shoppers one day, and during the same time period a friend took a systematic sample of Tops shoppers. (They counted groups shopping together, not individual shoppers, so they didn’t have to worry about getting the same household twice.) Together, they accumulated a sample of 215 households, and of those 54 owned cats. Did she prove her case, at the 0.05 significance level?




Problem Set 2

18

What is wrong with each pair of hypotheses? Correct the error. (a) H 0 = 14.2; H 1 > 14.2 (b) H 0 : µ < 25; H 1 : µ > 25 (c) You’re testing whether batteries have a mean life of greater than 750 hours. You take a sample, and your sample mean is 762 hours. You write H 0 :µ=762 hr; H 1 :µ>762 hr. (d) Your conventional paint takes 4.3 hours to dry, on average. You’ve developed a drying accelerant and you want to test whether adding it makes a difference to drying time. You write H 0 : µ=4.3 hr; H 1 : µ < 4.3 hr.

19

This year, water pollution readings at State Park Beach seem to be lower than last year. A sample of 10 readings was randomly selected from this year’s daily readings: 3.5 3.9 2.8 3.1 3.1 3.4 3.2 2.5 3.5 3.1 Does this sample provide sufficient evidence, at the 0.01 level, to conclude that the mean of this year’s pollution readings is significantly lower than last year’s mean of 3.8?

20

Dairylea Dairy sells quarts of milk, which by law must contain an average of at least 32 fl. oz. You obtain a random sample of ten quarts and find an average of 31.8 fl. oz. per quart, with SD 0.60 fl. oz. Assuming that the amount delivered in quart containers is normally distributed, does Dairylea have a legal problem? Choose an appropriate significance level and explain your choice.

21

You’re in the research department of StickyCo, and you’re developing a new glue. You want to compare your new glue against StickyCo’s best seller, which has a bond strength of 870 lb/in². You take 30 samples of your new glue, at random, and you find an average strength of 892.2 lb/in², with SD 56.0. At the 0.05 significance level, is there a difference in your new glue’s strength?

22

New York Quick Facts from the Census Bureau (2014b) says that 32.8% of residents of New York State aged 25 or older had at least a bachelor’s degree in 2008–2012. Let’s assume the figure hasn’t changed today. You conduct a random sample of 120 residents of Tompkins County aged 25+, and you find that 52 of them have at least a bachelor’s degree. (a) Construct a 95% confidence interval for the proportion of Tompkins County residents aged 25+ with at least a bachelor’s degree. (b) Don’t do a full hypothesis test, but use your answer for (a) to determine whether the proportion of bachelor’s degrees in Tompkins County is different from the statewide proportion, at the 0.05 significance level.

23

You’re thinking of buying new Whizzo bungee cords, if the new ones are stronger than your current Stretchie ones. You test a random sample of Whizzo and find these breaking strengths, in pounds: 679 599 678 715 728 678 699 624 At the 0.01 level of significance, is Whizzo stronger on average than Stretchie? (Stretchies have mean strength of 625 pounds.)

24

For her statistics project, Jennifer wanted to prove that TC3 students average more than six hours a week in volunteer work. She gathered a systematic sample of 100 students and found a mean of 6.75 hours and SD of 3.30 hours. Can she make her case, at the 0.05 significance level?

25

As a POW in World War II, John Kerrich flipped a coin 10,000 times and got 5067 heads. At the 0.05 level of significance, was the coin fair?

26

People who take aspirin for headache get relief in an average of 20 minutes (let’s suppose). Your company is testing a new headache remedy, PainX, and in a random sample of 45 headache sufferers you find a mean time to relief of 18 minutes with SD of 8 minutes. (a) Construct a 95% confidence interval for the mean time to relief of PainX. (b) Don’t do a full hypothesis test, but use your answer for (a) to determine at the 0.05 significance level whether PainX offers headache relief to the average person in a different time than aspirin.




What’s New

1 Jan 2016: Retake the screen shot here, for the new version of MATH200A Program part 4.
27 Nov 2015: Add “treat as random” to the problem statement, to avoid a confusing side issue. Move this problem to Problem Set 2, since numeric data are introduced in the second half of the chapter.
2 July 2015: Correct a typo, thanks to Darlene Huff.
19 Apr 2015: Correct a typo, thanks to Gabriela LeBaron.
20 Jan 2015: Add a suggestion to use the Statistics Symbol Sheet. Expand on the definition of the test statistic. Compare the t statistic to z. Add a table showing correspondence between α and C-Level. Many small edits for clarity or flow.
13 Jan 2015: Several changes to What Have You Learned? Add links and study aids. Add advice to know the HT steps by heart. Add an item about one- and two-tailed tests. Be more explicit about what’s important in the relationship between HT and CI.
(intervening changes suppressed)
5–10 Mar 2012: New document, formed by merging eight class handouts.

11. Inference from Two Samples

Updated 1 Jan 2016

Intro:

In Chapter 10, you looked at hypothesis tests for one population, where you asked whether a population mean or proportion is different from a baseline number. In this chapter, you’ll ask “are these two populations different from each other?” (hypothesis test) and “how large is the difference?” (confidence interval).

Contents:

11A. Numeric Data — Paired or Unpaired?
    Unpaired Data / Independent Samples
    Paired Data / Dependent Samples
    Paired and Unpaired Data Compared
    Example 5: Seed Corn
    When to Use Paired Data?
    Example 6: Where the Rubber Meets the Road
11B. Inference with Paired Numeric Data (Case 3)
    Example 7: The Freshman Fifteen
    Entering Paired Numeric Data
    Hypothesis Test for Mean Difference
    Confidence Interval for Mean Difference
    Example 8: Coffee and Heart Rate
11C. Inference with Unpaired Numeric Data (Case 4)
    Example 9: A Tough Grader?
    Hypothesis Test for Difference of Means
    Confidence Interval for Difference of Means
    Example 10: Sorority Academics
11D. Inference on Two Proportions (Case 5)
    Example 11: Traffic Stops and Traffic Tickets
    Hypothesis Test for Difference of Proportions
    Confidence Interval for Difference of Proportions
    Necessary Sample Size for Confidence Interval
    Example 14: Gardasil Vaccine
11E. Confidence Interval and Hypothesis Test (Two Populations)
11F. More Confidence Intervals for Two Populations
    Example 17: Heights of Men and Women
    Example 18: Coffee and Heart Rate with Negatives
    Example 19: Opinion Poll
    Example 20: GPA of Fraternity Members and Nonmembers
What Have You Learned?
Exercises for Chapter 11
What’s New

11A. Numeric Data — Paired or Unpaired? That’s the key question when you’re doing inference on numeric data from two samples. Your answer will control how you analyze the data, so let’s look closely at the difference.

11A1. Unpaired Data / Independent Samples Definitions:

You have unpaired data when you get one number from each individual in two unrelated groups. The two groups are known as independent samples.

Independent samples result when you take two samples completely independently, or if you take one sample and then randomly assign the members to groups. Randomization always gives you independent samples. Example 1:

What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 married men and 40 married women and find how much time a week each spends in yard work. There’s no reason to associate Man A with Woman B any more than Woman C; these are independent samples and the data are unpaired.

Example 2:

How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November. Then during the last week of February you randomly select another 500 adults and weigh them. The data are unpaired, and the samples are independent.

Before you read further, what’s the big problem in the design of those two studies? Right! Our old enemy, confounding variables. Look at the examples again, and see how many you can identify. For example, what might make a random person in one sample weigh more or less than a random person in the other sample, other than the passage of time? What might make a random woman spend more or less time on yard work than a random man, apart from their genders? With independent samples, if there’s actually a difference between the two groups, it may be swamped by all the differences within each group.

11A2. Paired Data / Dependent Samples Definitions:

You have paired data when each observational unit gives you two numbers. These can be one number each from a matched pair of individuals, or two numbers from one individual. Paired data come from dependent samples.

Example 3:

What if any is the average difference in time husbands and wives spend on yard work? You randomly select 40 couples and find how much time a week each person spends in yard work. Each husband and wife are a matched pair. The samples are dependent because once you’ve chosen a couple you’ve equally specified a member of the “wives” sample and a member of the “husbands” sample.

Example 4:

How much “winter weight” does the average adult gain? You randomly select 500 adults and weigh them all during the first week of November, then again during the last week of February. You have paired data in the before and after numbers. The two samples are dependent because they are the same individuals.

Do you see how a design with paired data (dependent samples) overcomes the big problem with unpaired data (independent samples)? You want to study weight gain, and now that’s what you’re measuring directly. You wanted to know whether husband or wife spends more time on yard work, and now you’ve eliminated all the differences between couples. Paired data are more likely than unpaired to reveal an effect, if there is one. Why? Because a paired-data design minimizes differences within each group that can swamp any difference between groups.

In studying human development and behavior, twins are a prime source of dependent samples. If you have a pair of identical twins who were raised apart (and that’s surprisingly common), you can investigate which differences between people’s behavior are genetic and which are learned. The Minnesota Study of Twins (Bouchard 1990) found that a lot of behaviors that “should” be learned seem to be genetic. The New York Times published a nontechnical account in Major Personality Study Finds That Traits Are Mostly Inherited (Goleman 1986).

11A3. Paired and Unpaired Data Compared

Sample type                                       Dependent       Independent, or randomized
Numeric data type                                 Paired Data     Unpaired Data
How many numbers from each experimental unit?     Two             One
Can you rearrange* one sample?                    No              Yes
Problem of confounding variables                  Minimal         Severe
Use this design …                                 … if you can    … if you must

*If the data from the sample are arranged in two rows or two columns, can you rearrange one row or column without destroying information?

11A4. Example 5: Seed Corn You’re the head of research for the Whizzo Seed Company, and you’ve developed a new type of seed that looks promising. You randomly select three farmers in Western New York to receive new corn, and three to receive your standard product. (Of course you don’t tell them which one they’re getting.) At the end of the season they report their yield figures to you. What’s wrong with this picture? You can easily think of all sorts of confounding variables here: different soils, different weather, different insects, different irrigation, different farming techniques, and on and on. Those differences can be great enough to hide (confound) a difference between the two types of corn, especially in a small sample.

Testing new corn versus standard corn for yield. Can you see a problem with the sample in Western New York that’s not a problem with the sample in Central New York? Adapted from Dabes and Janik (1999, 263)

The following year, you try again in Central New York. This time you send each farmer two stocks of seed corn, with instructions to plant one field with the first stock and another field with the second. Does that eliminate confounding variables? Maybe not totally, but it reduces them as far as possible. Now, if you see significant differences in yield between two fields planted by the same farmer, it’s almost sure to be due to differences in the seed.

11A5. When to Use Paired Data?

You always want to structure an experiment or observation with paired data (dependent samples), if you can. “If you can.” Aye, there’s the rub. Suppose you want to know whether attending kindergarten makes kids do better in first grade. There’s no way to set this up as paired data: how can a given kid both go through kindergarten and not go through kindergarten? Twin studies don’t help you here, because if the twins are raised together the parents will send both of them to kindergarten, or neither; and if the twins are raised apart then there will be too many other differences in their upbringing that could affect their performance in first grade.

If the samples are independent, you can’t pair the data, even if the samples are the same size. If you’re not sure whether you have dependent or independent samples, look back at Paired and Unpaired Data Compared (11A3).

11A6. Example 6: Where the Rubber Meets the Road You want to determine whether a new synthetic rubber makes tires last longer than the competitor’s product. Can you see how to do this with independent samples (unpaired data) and with dependent samples (paired data)? Think about it before you read on. For independent samples, you randomly assign drivers to receive four tires with your new rubber or four of the competitor’s tires. For dependent samples, you put two tires of one type on the left side of every driver’s car, and two on the right side of every driver’s car. (You do half the cars one way and half the other, to eliminate differences like the greater likelihood of hitting the curb on the right.) With the first method, if there’s only a small difference between your rubber and the competitor’s, it may not show up because you’ve also got differences in driving styles, roads, and so forth — confounding variables again. With the second method, those are eliminated.

11B. Inference with Paired Numeric Data (Case 3) Summary:

The hypothesis test is almost exactly like the Case 1 hypothesis test. The difference is that you define a new variable d (difference) in Step 1 and write hypotheses about µd instead of µ. For a confidence interval, you’re estimating the average difference, not the average of either population. You need to state both size and direction of the effect.

11B1. Example 7: The Freshman Fifteen

You’ve probably heard about the “freshman fifteen”, the weight gain many students experience in their first year at college. The Urban Dictionary even talks about the “freshman twenty” (2004). Francine wanted to know if that was a real thing or just an urban legend. During the first week of school, she got the other nine women in her chemistry class at Wossamatta U to agree to help her collect data. (She reasoned that students in any particular class would effectively be a random sample of the school, since class choice is unrelated to weight or other health issues. Of course that would be questionable for a spin class or a cooking class.)

Wossamatta U CHEM101 — Women’s Weights (in pounds)

Student    A     B     C     D     E     F     G     H     I     J
Sept.      118   105   123   112   107   130   120   99    119   126
May        125   114   128   122   106   143   124   103   125   135

When she had the data, Francine realized she didn’t know what to do next. If she had just one set of numbers, she would do a Student’s t test, since she doesn’t know the population standard deviation (SD). But what to do with two lists? Then she had a brainstorm. She realized that she’s not trying to find out anything about students’ weights. She wants to know about their weight gain. Looking at their weights, she’d have plenty of lurking variables starting with pre-college diet and lifestyle. Looking only at the weight gain minimizes or eliminates those variables, and measures just what happened to each student during freshman year. So she added a third row to her chart:

Wossamatta U CHEM101: Women’s Weights (in pounds)
Student            A    B    C    D    E    F    G    H    I    J
Sept.            118  105  123  112  107  130  120   99  119  126
May              125  114  128  122  106  143  124  103  125  135
d = May−Sept.      7    9    5   10   −1   13    4    4    6    9

Notice the new variable d, the difference between matched pairs. (You know the data must be paired, because each May number is associated with one and only one September number. You can’t rearrange the May numbers and still have everything make sense.) This is the heart of Case 3 in Inferential Statistics: Basic Cases: reducing paired numeric data to a simple t test of a population mean difference. Here’s what’s new:

- Define d so that you always subtract in the same direction. Always include the definition of d in your analysis.
- Your population parameter becomes µd, the mean difference.
- In requirements check, test the d’s, not the original data.
- Write your conclusion about the mean difference as a positive number with a direction. (Example: not “a mean difference of −14”, but “a mean decrease of 14”.)

Now she’s all set. She has one set of ten numbers, representing the continuous variable “weight gain in freshman year” for a random sample of Wossamatta U women. (Notice with student E, Francine has a negative value for d because May minus September is 106−107 = −1. That student lost weight as a freshman.) Time for a t test!

But first, what will she test? Her original idea was to test the “freshman fifteen”. But a glance at the d’s shows her that no one gained as much as 15 lb. An average can’t be larger than every member of the data set, so there’s no way she could prove a hypothesis that the average gain is above fifteen pounds. She decides instead to try to prove a “freshman five”, µd > 5, with 0.05 significance.

BTW: Subtle point here: You never use sample data in a hypothesis, but you can sometimes adjust your hypotheses after you collect your data, especially when it’s obvious that your data won’t prove what you wanted to prove. Another reasonable choice for Francine would be to try to prove simply that the average student gains weight, µd > 0. When you do a confidence interval, you don’t have to make any decision of this kind because you just follow the data where they lead.

Entering Paired Numeric Data

Francine subtracted by hand here, but you shouldn’t do that because it’s a rich source of errors and makes it harder to check your work. Instead, follow this procedure on your TI-83/84:

1. Enter the first data set (September, in this case) in L1.
2. Enter the second data set (May) in L2. Unlike the one-population cases, the order matters.
3. Check your data entry. Since you entered all of the September figures and then all the May figures, check them the opposite way, first student A September and May, then student B, and so on.
4. Cursor to L3 (the column heading, not the first number).
5. Francine defined d as May−Sept., which is L2−L1, so enter that formula. (To subtract in the other direction, enter L1−L2.) As soon as you press [ENTER], the calculator does all the subtractions, wiping out whatever was in L3 previously.

This isn’t Excel — if you change L1 or L2 after entering the formula for L3, L3 won’t change. You need to re-enter the formula for L3 in that case. (You actually can make the calculator behave like Excel by binding a formula to a list, but it’s not worth the hassle.)
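If you’d like to double-check the subtraction away from the calculator, here is a minimal Python sketch of the same idea. It isn’t part of the TI-83/84 procedure; the names sept, may, and d are just stand-ins for L1, L2, and L3.

```python
# Weights in pounds for students A through J, from Francine's table.
sept = [118, 105, 123, 112, 107, 130, 120, 99, 119, 126]  # stands in for L1
may = [125, 114, 128, 122, 106, 143, 124, 103, 125, 135]   # stands in for L2

# Pairing only makes sense if each student has exactly one entry in each list.
assert len(sept) == len(may)

# d = May - Sept., computed pair by pair (the calculator's L2 - L1).
d = [m - s for s, m in zip(sept, may)]
print(d)
```

Like the calculator list, d here is computed once: if you change sept or may afterward, you have to recompute it.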

Hypothesis Test for Mean Difference

With paired numeric data, your population parameter is the mean difference µd. The random variable is a difference (in this case, a number of pounds gained from September to May), so the parameter is the mean of all those weight gains.

(1) d = May−September
    H0: µd = 5, average student gains 5 lb or less
    H1: µd > 5, average student gains more than 5 lb

(2) α = 0.05

(RC) Random sample? Yes, effectively. (It’s a random sample of Wossamatta U women frosh, not necessarily those from other colleges.)
    10n ≤ N? Yes, because any university has more than 10×10 = 100 women in the freshman class.
    n = 10 (< 30), so Francine must test for normality and verify absence of outliers. She tests L3, not L1 and L2, because L3 holds her sample data of weight gain: r=.9811 and crit=.9179. r>crit, and the box-whisker shows no outliers.

(3/4) This is a regular T-Test, number 2 in the STAT TESTS menu. Francine writes down
    T-Test: 5, L3, 1, >µo
    results: t=1.29, p = 0.1146, d̄=6.6, s=3.9, n=10
    BTW: The sample mean is d̄ (“d-bar”), not x̄, because the data are d’s, not x’s.

(5) p > α. Fail to reject H0.

(6) You can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not (p = 0.1146).
    Or, At the 0.05 significance level, you can’t determine whether the average Wossamatta U woman student gains more than 5 pounds in her freshman year or not.
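The arithmetic behind Francine’s T-Test screen can be sketched in a few lines of Python. This is only an illustration using the standard library, not what the calculator literally runs. It reproduces d̄, s, and t; it stops short of the p-value because the standard library has no t distribution.

```python
import math
import statistics

# Francine's differences, d = May - Sept., for students A through J.
d = [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]
mu_0 = 5  # hypothesized mean difference from H0

n = len(d)
d_bar = statistics.mean(d)    # sample mean difference
s = statistics.stdev(d)       # sample standard deviation (n - 1 in the denominator)
std_err = s / math.sqrt(n)    # standard error of the mean difference
t = (d_bar - mu_0) / std_err  # one-sample t statistic computed on the d's

print(f"d-bar = {d_bar:.1f}, s = {s:.1f}, n = {n}, t = {t:.2f}")
```

The printed values should agree with the calculator’s d̄=6.6, s=3.9, and t=1.29.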

After a “fail to reject H0”, you always remember to write your conclusion in neutral language, right? Maybe the true average weight gain is greater than 5 pounds but this particular sample just happened not to show it; maybe the true average weight gain really is 5 pounds or less. A confidence interval can help you get a handle on the effect size.

Confidence Interval for Mean Difference

When a hypothesis test fails to reach a conclusion, a confidence interval can salvage at least some information. When a hypothesis test does reach a conclusion, a confidence interval can give you more precise information. If Francine were doing only the confidence interval, she’d have to start off by testing requirements. But she has already tested them as part of the hypothesis test, so she goes right to the TINTERVAL screen.

Which confidence level does she choose? Her one-tailed hypothesis test at α = 0.05 would be equivalent to a two-tailed test at α = 0.10, and that suggests a confidence level of 90%. But she decides that since her hypothesis test has already failed to reach a conclusion, she’d at least like to get a 95% CI.

TInterval: L3, 1, .95
results: (3.7948, 9.4052)

Conclusion: Francine is 95% confident that the average woman student at Wossamatta U gains 3.8 to 9.4 pounds during her freshman year. (Francine doesn’t write down d, s, and n because she’s already written them in the hypothesis test. She would write them down when she does only a confidence interval.) Common mistake: Don’t say the average weight is 3.8 to 9.4 pounds. You aren’t estimating the average first-year woman’s weight, but her weight gain. Always re-read your conclusion after you write it, and ask yourself whether it seems reasonable in the context of the problem. That can save you from mistakes like this.
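To see where the TInterval endpoints come from, here is a rough Python sketch. One assumption to flag: the critical value t* = 2.262 (95% confidence, 9 degrees of freedom) is copied from a standard t table, because the Python standard library can’t compute it.

```python
import math
import statistics

d = [7, 9, 5, 10, -1, 13, 4, 4, 6, 9]  # Francine's weight gains, May - Sept.

# Assumption: critical t for 95% confidence with df = n - 1 = 9, from a t table.
T_STAR = 2.262

n = len(d)
d_bar = statistics.mean(d)
std_err = statistics.stdev(d) / math.sqrt(n)

margin = T_STAR * std_err  # margin of error
lower = d_bar - margin
upper = d_bar + margin

print(f"95% CI for the mean gain: {lower:.1f} to {upper:.1f} lb")
```

The endpoints land on the calculator’s 3.8 to 9.4 pounds after rounding. Remember that this is an interval for the mean gain, not for anyone’s weight.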

11B2. Example 8: Coffee and Heart Rate

A few years back, a coffee company tried to market drinking coffee as a way to relax, and they weren’t talking about decaf. Jon decided to test this. He randomly selected six adults. He recorded their heart rates, then recorded them again half an hour after each person drank two cups of regular coffee. His data are shown below. (Data come from Dabes and Janik [1999, 264].)

Person   Before   After
1        78       83
2        64       66
3        70       77
4        71       74
5        70       75
6        68       71

The data are paired, because each person (experimental unit) gives you two numbers, Before and After; because each After is associated with one specific Before; and because you can’t rearrange Before or After and still have the data make sense.

Jon selected the 0.01 significance level. (He tests for difference even though he believes coffee increases heart rate, because it could decrease it.) Jon could equally well define d as Before−After or After−Before. At least, mathematically he could. But you’ll find it’s easier to interpret results if you always define d as high minus low so that all or most of the d’s will be positive numbers. (You can do this based on your common sense or by looking at the data.) Jon sees that the After numbers are generally larger than the Before numbers, so he chooses d = After−Before.

(1) d = After−Before
    H0: µd = 0, coffee makes no difference to heart rate
    H1: µd ≠ 0, coffee makes a difference to heart rate

(2) α = 0.01

(RC) Jon has a random sample, but the sample size is 6 (< 30), so he must test the d’s for normality and verify the absence of outliers. r > crit.

(3/4) T-Test: 0, L3, 1, ≠µo
    results: t=5.56, p = 0.0026, d̄=4.2, s=1.8, n=6

(5) p < α. Reject H0 and accept H1.

(6) Drinking coffee does make a difference in heart rate half an hour later (p = 0.0026). In fact, coffee increases heart rate.
    Or, Drinking coffee does make a difference in heart rate half an hour later, at the 0.01 significance level. In fact, drinking coffee increases heart rate.

As usual, when you do a two-tailed test and p < α, you can interpret it in a one-tailed manner. Jon defined d as After−Before, which is the amount of increase in each subject’s heart rate. His sample mean d̄ was positive, so the average outcome in his sample was an increase. Because he proved that the mean difference µd for all people is nonzero, the sign of his sample mean difference d̄ tells him the sign of the population mean difference µd.

Jon can’t say that the average increase for people in general is 4.2 beats per minute. That was the mean difference in his sample. If he wants to know the mean difference for all people, he has to construct a confidence interval:

TInterval: L3, 1, .99
result: (1.146, 7.187)

Jon is 99% confident that the average increase in heart rate for all people, half an hour after drinking two cups of coffee, is 1.1 to 7.2 beats per minute.

Caution! The confidence interval expresses a difference, not an absolute number. You are estimating the amount of increase or decrease, not the heart rate. A common mistake would be to say something about the heart rate being 1.1 to 7.2 bpm after coffee. Again, you’re not estimating the heart rate, you’re estimating the change in heart rate.

11C. Inference with Unpaired Numeric Data (Case 4) With paired data, you tested the population mean difference µd between matched pairs. But suppose you don’t have matched pairs? With unpaired data in independent samples, you test the difference between the means of two populations, µ1 −µ2 . This is Case 4 in Inferential Statistics: Basic Cases. Key features: You identify population 1 and population 2 at the start of your HT or CI. The two samples must be independent, and you check requirements for each sample separately. Sample sizes need not be equal, but they should not be very different. You use 2-SampTTest for hypothesis test, 2-SampTInt for confidence interval. Always use Pooled:No with both. Advice: Take your time when you look at data to decide whether you have paired or unpaired data. If your sample sizes are different, it’s a no-brainer: the data are unpaired. But if the sample sizes are the same, think carefully about whether the data are paired or unpaired. Sometimes students just seem to take a stab in the dark at whether data are paired or unpaired, but if you just stop and think about how the data were taken you can make the right decision every time. Look back at Paired and Unpaired Data if you need a refresher on the difference.

Example 9: A Tough Grader?

Prof. Sullivan’s students at Wossamatta U felt that he was a tougher grader than the other speech professors. They decided to test this, at the 0.05 significance level. Eight of them each took a two-hour shift, assigned randomly at different times and days of the week, and distributed a questionnaire to each student on the main quad. They felt this was a reasonable approximation to a random sample of current students. (They asked students not to take a questionnaire if they had already submitted one.) The questionnaire asked whether the student had taken speech in a previous semester, and if so from which professor and what grade they received. They then divided the questionnaires into three piles, “no speech”, “Sullivan”, and “other prof”.

It would be possible to do an analysis with the categorical data of letter grades. But you should always use numerical data when you can, because p-values are usually lower with numeric data than attribute data, for a given sample size. The students counted an A as 4 points, A-minus as 3.7, and so on. Here is a summary of their findings:

Students of   Mean   Standard Deviation   Sample Size
Sullivan      2.21   1.44                 32
Other prof    2.68   1.13                 54

Hypothesis Test for Difference of Means

In this test, you have unpaired numeric data in two samples. The requirements for each sample are the same as the requirements for the sample in a one-sample t test:

- Simple random samples (or equivalent, such as systematic).
- Each sample is either n ≥ 30, or normally distributed with no outliers.
- Each sample is no more than 10% of the population it was drawn from.

There’s an additional requirement for the two samples: The two samples must be independent.

Here’s the hypothesis test, as performed by Prof. Sullivan’s students:

(1) pop. 1 = Sullivan students, pop. 2 = other speech profs’ students
    H0: µ1 = µ2, no difference in average grades
    H1: µ1 < µ2, Sullivan’s grades lower on average

(2) α = 0.05

(RC) Random sample (systematic).
    Are samples less than 10% of their populations? 10×32 = 320, and 10×54 = 540. At a university there are almost certainly more speech students per professor than that, especially considering multiple years.
    Both sample sizes > 30.
    Samples independent (no connection between Sullivan students and non-Sullivan students).

(3/4) 2-SampTTest: x̄1=2.21, s1=1.44, n1=32, x̄2=2.68, s2=1.13, n2=54, µ1<µ2, Pooled:No

(5) p > α. Fail to reject H0.

(6) At the 0.05 level of significance, they can’t determine whether Prof. Sullivan is a tougher grader than the other professors or not.

BTW: How does your calculator analyze a difference of independent means? If you remember what you learned about one-sample t tests, all you have to do is extend it. You’re working with a difference of sample means. The standard error of the mean for the first population is s1/√n1 and therefore the variance is s1²/n1, and similarly for the second population. The variance of the sum or difference of independent variables is the sum of their variances, so VAR(x̄1−x̄2) = s1²/n1 + s2²/n2. The standard deviation (the standard error of the difference of sample means) is the square root of the variance:

SE = √[s1²/n1 + s2²/n2].

It turns out that the difference of sample means follows a t distribution, if you choose the right number of degrees of freedom (more on that later). The one-sample test statistic was t = (x̄−µo) / (s/√n). The two-sample test statistic is analogous, with the differences substituted. The test statistic becomes t = [(x̄1−x̄2) − (µ1−µ2)] / √[s1²/n1 + s2²/n2]. In this course, you’ll just be testing whether one population mean is greater than, less than, or different from the other. In other words, you’ll test against a hypothetical mean difference of 0. That simplifies t a bit:

t = (x̄1−x̄2) / √[s1²/n1 + s2²/n2].

What about degrees of freedom? You might think df would be n1+n2−1, but it isn’t. The sampling distribution approximately follows a t with df equal to the lower of n1−1 and n2−1. It’s only approximate because the population SDs are usually different. The exact degrees of freedom were computed by B. L. Welch (1938), and the horrendous, ugly equation is

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1−1) + (s2²/n2)²/(n2−1) ].

Fortunately, your TI-83/84 has the computation built in, and you don’t have to worry about it.

What about pooling? Why do you always select Pooled:No on your TI-83/84? Well, if the two populations have the same SD (if they are homoscedastic) you can treat them as a single population (pool the data sets) and use a higher number of degrees of freedom. That in turn means your p-value will be a bit lower, so you’re a bit more likely to be able to reject H0. Sounds good, right? But there are problems:

- You have to perform another test, called an F test, to determine whether σ1 = σ2. If the F test gets a large p-value, that doesn’t prove homoscedasticity; it just fails to prove that the SDs are different. So you can never be sure that it’s okay to pool the data for the t test.
- The F test requires the populations to be normal, not just approximately normal. This is usually difficult or impossible to prove.
- Even if you do pool the two samples, there’s seldom enough difference in the p-value to make a difference in whether you reject H0 or not.

For these reasons and others, the issue of pooling is controversial. Some books don’t even mention it. It’s best just to use Pooled:No always.
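If you’re curious, the unpooled standard error, test statistic, and Welch degrees of freedom described above can be sketched in Python from the summary statistics in the Sullivan example. The function name welch_t is made up for this illustration; it is not a calculator command or a library routine.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled two-sample t statistic, Welch df, and standard error."""
    v1 = sd1 ** 2 / n1            # variance of the first sample mean
    v2 = sd2 ** 2 / n2            # variance of the second sample mean
    std_err = math.sqrt(v1 + v2)  # variances of independent means add
    t = (mean1 - mean2) / std_err # testing against a mean difference of 0
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, std_err

# pop. 1 = Sullivan students, pop. 2 = other speech profs' students
t, df, std_err = welch_t(2.21, 1.44, 32, 2.68, 1.13, 54)
print(f"SE = {std_err:.4f}, t = {t:.2f}, df = {df:.1f}")
```

Notice that df comes out between the lower of n1−1 and n2−1 (here 31) and n1+n2−2 (here 84), which is why the simpler rule of thumb is only approximate.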

Confidence Interval for Difference of Means

The requirements are exactly the same as the requirements for the hypothesis test. You compute a confidence interval on your TI-83/84 through 2-SampTInt. Since they couldn’t prove that Prof. Sullivan was a tough grader, the students decided to compute a 90% confidence interval for the difference between Prof. Sullivan’s average grades and the other speech profs’ average grades:

pop. 1 = Sullivan students; pop. 2 = other speech profs’ students
Requirements: already covered in hypothesis test.
2-SampTInt: x̄1=2.21, s1=1.44, n1=32, x̄2=2.68, s2=1.13, n2=54, C-Level=.9, Pooled:No
Results: (−.9678, .02779)

Interpretation: The TI-83 gives you the bounds for the confidence interval about µ1−µ2. A negative number indicates µ1 smaller than µ2, and a positive number indicates µ1 larger than µ2. Therefore: We’re 90% confident that the average student in Prof. Sullivan’s classes receives somewhere between 0.97 of a letter grade lower than the average student in other profs’ speech classes, and 0.03 of a letter grade higher.

Remark: The 90% confidence interval is almost all negative. This reflects the fact that the p-value in the one-tailed test for µ1 < µ2 was almost as low as 0.05. The students could have chosen any confidence level they wanted, just for showing an effect size. But for a confidence interval equivalent to their one-tailed hypothesis test that used α = 0.05, the confidence level has to be 1−2×0.05 = 0.90 = 90%.

Why do you need a special two-sample t procedure? Can’t you just compute a confidence interval from each sample and then compare them? No, because the standard errors are different. The two-sample standard error takes the sample SDs and sample sizes into account. Here’s a simple example, provided by Benjamin Kirk: A farmer tests two diets for his pigs, randomly assigning 36 pigs to each sample. The Diet A group gained an average 55 lb with SD of 3 lb; that gives a 95% confidence interval 54.0 to 56.0 lb. The Diet B group gained 53 lb on average, with SD of 4 lb; the CI is 51.6 to 54.4 lb. Those intervals overlap slightly, which would not let you conclude that there’s any difference in the diets. But the 2-SampTInt is 0.3 to 3.7 lb in favor of Diet A, which says there is a difference. The issue is that the B group had a lower sample mean, but there was more variation within the group.
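The pig example can be checked with a short Python sketch. The two critical values are assumptions taken from a t table (about 2.030 for df = 35, and about 1.997 for the Welch df of roughly 65), since the standard library can’t compute them.

```python
import math

n = 36                 # pigs in each diet group
mean_a, sd_a = 55, 3   # Diet A: mean gain and SD, in pounds
mean_b, sd_b = 53, 4   # Diet B

# Assumptions: critical t values for 95% confidence, read from a t table.
T_STAR_ONE_SAMPLE = 2.030  # df = 35
T_STAR_TWO_SAMPLE = 1.997  # Welch df is roughly 65 for these samples

se_a = sd_a / math.sqrt(n)
se_b = sd_b / math.sqrt(n)

# Separate one-sample intervals for each diet.
ci_a = (mean_a - T_STAR_ONE_SAMPLE * se_a, mean_a + T_STAR_ONE_SAMPLE * se_a)
ci_b = (mean_b - T_STAR_ONE_SAMPLE * se_b, mean_b + T_STAR_ONE_SAMPLE * se_b)
intervals_overlap = ci_a[0] <= ci_b[1]

# Two-sample interval: the standard errors combine as a root of squares,
# which is smaller than simply adding the two margins of error.
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
diff = mean_a - mean_b
ci_diff = (diff - T_STAR_TWO_SAMPLE * se_diff, diff + T_STAR_TWO_SAMPLE * se_diff)

print(f"Diet A: {ci_a[0]:.1f} to {ci_a[1]:.1f}; Diet B: {ci_b[0]:.1f} to {ci_b[1]:.1f}")
print(f"Overlap? {intervals_overlap}; A minus B: {ci_diff[0]:.1f} to {ci_diff[1]:.1f}")
```

The separate intervals overlap, yet the interval for the difference stays above zero, matching the 0.3 to 3.7 lb quoted above.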

11C1. Example 10: Sorority Academics

The Alpha Alpha Alpha sorority chapter at Staples University (Yes, corporate sponsorship is getting ridiculous!) has a tradition of putting in extra effort academically. They gave their incoming pledges the task of proving that Alpha Alpha Alpha had higher average GPA than other sororities, at the 0.05 level of significance. The Alphas are a large sorority, with 119 members. The pledges hacked the campus server and obtained GPAs of ten randomly selected Alphas and ten randomly selected members of other sororities on campus. Do their ill-gotten data prove their point?

Alphas:           2.31 3.36 2.77 2.93 2.27 2.35 3.13 2.20 3.20 2.45
Other sororities: 1.49 1.74 2.70 2.40 2.17 1.08 1.85 1.96 2.08 1.49

Since you have independent samples (unpaired data) from two different populations, this is Case 4, difference of population means, in Inferential Statistics: Basic Cases.

Caution: You can’t treat these as paired data just because the sample sizes are equal; that’s a rookie mistake. When deciding between a paired or an unpaired analysis, always ask yourself: “Is data point 1 from the first sample truly associated with data point 1 from the second sample?” In this case, they’re not.

(1) pop. 1 = Alpha Alpha Alpha; pop. 2 = other sororities
    H0: µ1 = µ2, No difference in average GPA
    H1: µ1 > µ2, Average GPA of all Alphas is higher than other sororities

(2) α = 0.05

You check requirements against both samples independently. These samples are both smaller than 30, so you have to check normality and outliers on both. Here are the normality checks:

The first picture doesn’t look much like a straight line, but r is greater than crit, so it’s close enough. (With small data sets like this one, fitting the data to the screen can make differences look larger than they really are.)

The calculator lets you “stack” two or three boxplots on one screen. Not only is this a bit of a labor saver, but it also gives you a good sense of how different the samples are. To do this, select “Compare 2 smpl” on the first box-whisker screen. (You can guess what “Compare 3 smpl” does, but we don’t use it in this course.)

For these samples, the difference is dramatic. Every single Alpha’s GPA (in the sample) is above the third quartile in the sample of other sororities, and the max of other sororities is just barely above the median Alpha. With such a big difference, why do the pledges even need to do a hypothesis test? Because they know these are just samples. Maybe the Alphas actually aren’t any better academically, but these particular samples just happened to be far apart. The hypothesis test tells you whether the difference you see is too big to be purely the result of random chance in sample selection.

(RC) Random samples, OK
    10% of Alphas is 12, and the sample is smaller than that. We don’t know how many are in all the other sororities combined, but it must be more than 10×10 = 100. OK
    Normality check, sample 1: r(.9567) > crit(.9179), OK
    Normality check, sample 2: r(.9946) > crit(.9179), OK
    Box-whisker: no outliers in either sample, OK

(3/4) 2-SampTTest: L1, L2, 1, 1, >µ2, Pooled:No
    outputs: t = 3.93, p-value = 0.0005, x̄1 = 2.70, s1 = 0.43, n1 = 10, x̄2 = 1.90, s2 = 0.48, n2 = 10

(5) p < α. Reject H0 and accept H1.

(6) The average GPA in Alpha Alpha Alpha is higher than the average GPA of other sorority members (p = 0.0005).
    Or, At the 0.05 level of significance, the average GPA in Alpha Alpha Alpha is higher than the average GPA of other sorority members.

Comment: You have to phrase your conclusion carefully. The pledges proved that the average GPA of Alphas is higher than the average GPA of all other sorority members, not all other sororities. What’s the difference? Here’s a simple example. Suppose there are ten other sororities besides the Alphas. The Omegas have an average GPA of 3.66, higher than the Alphas’ average. If the other nine each have an average GPA of 1.70, that could easily produce exactly the sample that the pledges got. The message here: Aggregating data can lose information. Sometimes that’s okay, but be wary when one population is being compared to an aggregate of multiple other populations.
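The arithmetic behind that comment is easy to check with a few lines of Python. One assumption that isn’t in the text: all ten hypothetical sororities have the same number of members, so the aggregate mean is just the plain average of their ten means.

```python
# Hypothetical averages from the comment above: the Omegas at 3.66,
# and nine other sororities at 1.70 each.
# Assumption: equal membership, so each sorority gets equal weight.
other_sorority_means = [3.66] + [1.70] * 9

aggregate_mean = sum(other_sorority_means) / len(other_sorority_means)
alpha_sample_mean = 2.70  # sample mean for the Alphas from the 2-SampTTest output

print(f"Aggregate mean of other sororities: {aggregate_mean:.3f}")
print(f"Best single sorority: {max(other_sorority_means):.2f}")
```

Under that assumption the aggregate works out to about 1.90, the same as the pledges’ sample mean for “other sororities”, even though one of those sororities outscores the Alphas.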

11D. Inference on Two Proportions (Case 5) When you have two samples of binomial data, they represent two populations. Each population has some proportion of successes, p1 and p2 respectively. You don’t know those true proportions, and in fact you’re not concerned with them. Instead, you’re concerned with the difference between the proportions, p1 −p2 . You can test whether there is a difference (hypothesis test), or you can estimate the size of the difference (confidence interval). This is Case 5 in Inferential Statistics: Basic Cases. Key features of Case 5, the difference of proportions: You identify population 1 and population 2 at the start of your HT or CI. The two samples must be independent. (It’s possible to analyze paired binomial data, but we don’t do it in this course.) You check requirements for each sample separately. Sample sizes need not be equal, but they should not be very different. Advice: take your time with two-sample binomial data. You have a lot of p’s and a lot of percentages floating around, and it’s easy to get mixed up if you try to hurry. Take extra care when writing conclusions. You’re making statements about the difference between the two proportions, not about the individual proportions. And you’re making statements about the difference in proportions between the populations, not between the samples.

11D1. Example 11: Traffic Stops and Traffic Tickets

One of my students (call him Don) had several traffic tickets, and he knew one more would trigger a suspension. He felt that women stopped by a traffic cop were more likely than men to get off with just a warning, and for his Field Project he set out to prove it, with α = 0.05. Don quickly realized that he should test whether men and women stopped by a cop are equally likely to get a ticket, not just whether men are more likely. After all, he couldn’t rule out the possibility that women are more likely to get a ticket if stopped.

Stopped by Traffic Cop
         Ticket Issued   Just a Warning   Total   p̂
Men      86              11               97      89%
Women    55              15               70      79%

Don distributed a questionnaire to a systematic sample of TC3 students. (He assumed that any gender-based difference in TC3 students would be representative of college students in general. That seems reasonable.) He asked three questions:

1. Male or female?
2. Stopped by a traffic cop since your 18th birthday?
3. If yes, did you receive a ticket the last time you were stopped?

Don disregarded any questionnaires from students who had never been stopped as adults. He wasn’t interested in the likelihood of getting a ticket, but in the likelihood of getting a ticket after being stopped by a cop. You could say that he was interested in the different proportions, for men and women, of stops that lead to tickets.

Hypothesis Test for Difference of Proportions

This is just another variation on the good old Seven Steps of Hypothesis Tests:

- You identify population 1 and population 2 at the beginning of step 1.
- As usual, your TI-83 will compute the p-value for you; the menu selection this time is a 2-PropZTest, so your test statistic is a z-score.
- In step 6, if your p-value is < α, you conclude that the proportion of ___ in one population is greater than or less than the proportion of ___ in the other. If p-value is > α, you can’t tell whether the two populations have the same proportion of ___ or different proportions.

Here are the requirements for a Case 5 hypothesis test of a difference of proportions:

- Each sample must be random, and ≤ 10% of population, so 10n1 ≤ N1 and 10n2 ≤ N2.
- The two samples must be independent.
- You need at least 10 successes and 10 failures in each sample.

Actually, that last one is an approximation to the real requirement. We use it because it nearly always gives the same answer, and it’s easier to test. The real requirement is at least 10 successes and 10 failures EXPECTED in each sample. The expected numbers are what you would see in your samples if H0 is true and there’s no difference between the two population proportions. In that case, the pooled proportion p̂, which is the overall percentage of success in the combined samples, is an estimator of the true proportion in both populations. That pooled proportion is p̂ = (x1+x2) / (n1+n2), where x1 and x2 are the numbers of successes in the two samples. (2-PropZTest shows you p̂ on the output screen.) Using that pooled proportion p̂, the expected successes and failures in sample 1 are n1p̂ and n1−n1p̂, and the expected successes and failures in sample 2 are n2p̂ and n2−n2p̂. All four of these must be ≥ 10. The Gardasil vaccine example, below, shows a situation where you have to use the blended proportion to test requirements.

Here is Don’s hypothesis test about the different proportions of men and women that receive tickets after being stopped in traffic.

(1) population 1 = college men stopped by traffic cops; population 2 = college women stopped by traffic cops
    H0: p1 = p2, college men and women equally likely to get a ticket after being stopped
    H1: p1 ≠ p2, college men and women not equally likely to get a ticket after being stopped

(2) α = 0.05

(RC) Samples 1 and 2 random? Yes, effectively (systematic).
    10n1 = 10×97 = 970, and there have been more than 970 male students (at all colleges) stopped by traffic cops.
    10n2 = 10×70 = 700, and there have been more than 700 female students (at all colleges) stopped by traffic cops.
    Sample 1 has 86 successes and 97−86 = 11 failures; sample 2 has 55 successes and 70−55 = 15 failures.

(3/4) 2-PropZTest: 86, 97, 55, 70, p1≠p2
    Results: z=1.77, p-value = 0.0760, p̂1=.89, p̂2=.79, p̂=.84


There’s a difference of 10 percentage points between the sample proportions, but with Don’s sample sizes that difference is not large enough to be statistically significant. Even if there really is a difference in proportions for college men and women in general, random chance would be enough to explain the difference Don sees in his samples.

(5) p > α. Fail to reject H0.

(6) At the 0.05 level of significance, Don can’t tell whether men and women stopped by traffic cops are equally likely to get tickets, or not.

If this non-conclusion leaves you non-satisfied, you’re not alone. As usual, the confidence interval (next section) can provide some information.

BTW: Why does the “official” requirement use a pooled proportion p̂ instead of testing each sample? In fact, for a confidence interval you always test requirements for each sample. But in a hypothesis test, your H0 is always “no difference in population proportions”, and a hypothesis test always starts by assuming H0 is true. If the null is true, then there is no difference in the two populations, and you really just have one big sample of size n1+n2 and sample proportion p̂. So that’s what you test.

BTW: Why is this a z test? For the same reason that a one-proportion test is a z test: from the population proportion p you know the SD. Of course the two-population case is a bit more complicated. You need the key fact that when you add or subtract independent random variables, their variances add. If the two populations have the same proportion, as H0 assumes, then the SD of the sampling distribution of the proportion for population 1 is √[p̂(1−p̂)/n1], and similarly for population 2, where p̂ is the pooled proportion mentioned in the requirements check, above. Square the SDs to get the variances, add them, and take the square root to get the standard error:

SE = √[p̂(1−p̂)/n1 + p̂(1−p̂)/n2].

And from this you have the test statistic:

z = (p̂1−p̂2) / √[p̂(1−p̂)/n1 + p̂(1−p̂)/n2].
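Here is a minimal Python sketch of that computation with Don’s counts, including the “official” expected-count requirement. It is an illustration built on the standard library’s NormalDist, not the calculator’s own code, and the variable names are invented for the example.

```python
import math
from statistics import NormalDist

x1, n1 = 86, 97  # men: tickets issued, total stopped
x2, n2 = 55, 70  # women: tickets issued, total stopped

p1_hat = x1 / n1
p2_hat = x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)  # pooled proportion, valid if H0 is true

# "Official" requirement: at least 10 expected successes and failures per sample.
expected = [n1 * p_pooled, n1 - n1 * p_pooled, n2 * p_pooled, n2 - n2 * p_pooled]
requirements_ok = all(count >= 10 for count in expected)

# Standard error under H0, then the z statistic and two-tailed p-value.
std_err = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"pooled = {p_pooled:.2f}, z = {z:.2f}, p-value = {p_value:.4f}")
print(f"expected counts OK? {requirements_ok}")
```

The output should agree with the 2-PropZTest results of z=1.77 and p-value 0.0760.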

Confidence Interval for Difference of Proportions In a confidence interval for the difference of two proportions, some unknown proportion p1 of population 1 has some characteristic, and some unknown proportion p2 of population 2 has that characteristic. You aren’t concerned with those proportions on their own, but you want to estimate which population has the greater proportion, and by how much. You identify population 1 and population 2 at the beginning of your analysis. As usual, your TI-83 will compute the interval for you; the menu selection this time is a 2-PropZInt. The CI estimate is for p1 −p2 , the true difference between the proportion of success in the two populations. A negative number in the confidence interval means the population 1 proportion is lower than the population 2 proportion, and a positive number means p1 is greater than p2 . Your conclusion states the size and direction of the difference between the two population proportions. It may take a couple of drafts before you get this into understandable language. The requirements for a CI are almost the same as a HT, but with one subtle difference: Each sample must be random, and ≤ 10% of the population it was drawn from. The two samples must be independent. Each sample must have ≥ 10 successes and ≥ 10 failures. Why is that last requirement different from the “official” requirement for the hypothesis test? With the HT, you assumed H 0 was true and both populations had the same proportion. That let you use a blended or pooled proportion from your combined samples. But with a CI, you don’t make any such assumption. What would be the point of a confidence interval for the difference if you assume there is no difference? But despite the difference in theory, as a practical matter you can just test for ≥ 10 successes and ≥ 10 failures in each sample for both HT and CI. Don has already checked requirements in the hypothesis test, so he moves right to a 2-PropZInt:

Don gets a result of −1.4% to +21.6%. How does he interpret that? Well, he can write it as

−1.4% ≤ p1 − p2 ≤ 21.6% (95% conf.)

Adding p2 to all three “sides” gives

p2 − 1.4% ≤ p1 ≤ p2 + 21.6% (95% conf.)

With 95% confidence, p1 is somewhere between 1.4% below p2 and 21.6% above p2. You don’t know the numerical value of p1, but out of male students who are stopped by a traffic cop, p1 is the proportion who get a ticket, and similarly for p2 and women. So Don can write his confidence interval like this:

I’m 95% confident that, out of students stopped by traffic cops, the proportion of men who actually get tickets is somewhere between 1.4 percentage points less than women, and 21.6 percentage points more than women.

If you’re not feelin’ the love with the algebra approach, you can reason it out in words. The confidence interval is the difference in proportions for men minus women. If that’s negative, the proportion for men is less than the proportion for women; if the difference is positive, the proportion for men is greater than the proportion for women.

Why do I say “percentage points” instead of just “percent” or “%”? Well, how do you describe the difference between 1% and 2%? It’s a difference of one percentage point, but it’s a 100% increase, because the second one is 200% of the first. When you subtract two percentages, the difference is a number of percentage points. If you just say “percent”, that means you’re expressing the difference using one of the percentages as a base, even if you don’t mean to.

Getting back to Don’s confidence interval, the −1.4% to +21.6% difference between men and women in traffic tickets is a simple subtraction of men’s rate minus women’s rate, so it is percentage points, not percent.

used by permission; source: http://xkcd.com/985/ (accessed 2014-10-03)

BTW: Where does the confidence interval come from? First you have to find the standard error. Yes, it’s different from the standard error associated with the hypothesis test. Why? That standard error assumed H0 was true and used the pooled p̂. You can’t do that in the confidence interval, because if H0 is true then the difference between the population proportions is zero and you don’t have a confidence interval!

The standard deviation of the sampling distribution of the proportion for population 1 is √[p̂1(1−p̂1)/n1], and similarly for population 2. Square them, add, and take the square root to get the SD of the distribution of differences in sample proportions, also known as the standard error of the difference of proportions:

$$SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

The margin of error E is $z_{\alpha/2}$ times that. The center of the confidence interval is the point estimate, p̂1 − p̂2, so the bounds for the (1−α) confidence interval are

$$(\hat{p}_1 - \hat{p}_2) - E \;\le\; p_1 - p_2 \;\le\; (\hat{p}_1 - \hat{p}_2) + E \quad\text{where}\quad E = z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Just like with numeric data, you have to use the two-sample procedure to compute a correct confidence interval. Here’s an example. Two candidates are running for city council, so they each commission an exit poll on Election Day. Of 200 voters polled, 110 voted for Mr. X; 90 of a different 200 voted for Ms. Y. The 95% confidence intervals are 48.1% to 61.9% and 38.1% to 51.9%. The intervals overlap, so Ms. Y might still hope for victory. But a 2-PropZInt tells a different story. The interval for the difference of proportions, X−Y, is 0.2% to 19.8%, so Mr. X is 95% confident of winning, and the only question is whether it will be a squeaker or a landslide.
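If you'd like to check the exit-poll numbers away from the calculator, here is a minimal Python sketch. It assumes the usual normal-approximation formulas (the same ones behind 1-PropZInt and 2-PropZInt); the function names `one_prop_ci` and `two_prop_ci` are made up for this illustration and are not part of any calculator or library.

```python
from math import sqrt
from statistics import NormalDist


def one_prop_ci(x, n, conf=0.95):
    """Normal-approximation CI for a single proportion (hypothetical helper)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # critical z for this confidence level
    e = z * sqrt(p_hat * (1 - p_hat) / n)          # margin of error
    return p_hat - e, p_hat + e


def two_prop_ci(x1, n1, x2, n2, conf=0.95):
    """Normal-approximation CI for p1 - p2, using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    e = z * se
    return (p1 - p2) - e, (p1 - p2) + e


ci_x = one_prop_ci(110, 200)             # Mr. X's poll on its own
ci_y = one_prop_ci(90, 200)              # Ms. Y's poll on its own
ci_diff = two_prop_ci(110, 200, 90, 200) # difference X - Y

for label, (lo, hi) in [("X", ci_x), ("Y", ci_y), ("X - Y", ci_diff)]:
    print(f"{label}: {lo:.1%} to {hi:.1%}")
```

The two single-sample intervals overlap, yet the interval for the difference stays above zero, which is the point of the example.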

Necessary Sample Size for Confidence Interval

You have a confidence level and a desired margin of error in mind. How large must each sample be?

You may remember that with the one-population binomial case, part of the calculation was your prior estimate, or if you had no prior estimate you used 0.5. With two binomial populations, you need a prior estimate (or 0.5) for each one.

The easiest way to compute the necessary sample size is to use MATH200A Program part 5. If you don’t have the program and want to get it, see Getting the Program. If you don’t have the program, you can also calculate necessary sample size by using the formula in the next paragraph.

BTW: The formula for sample size is not too difficult. Start with the formula for margin of error. The desired confidence level determines critical z. But when you fill in your desired margin of error E and your prior estimates p1 and p2, you still have two unknowns, n1 and n2. The simplest assumption is that you’ll make your two samples the same size, so set n1 = n2 and solve:

$$n_1 = n_2 = \bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]\left(\frac{z_{\alpha/2}}{E}\right)^{2}$$

For a detailed explanation, with worked examples, see How Big a Sample Do I Need?.

Caution! When you’re planning to study the difference between two binomial populations, you have to use the two-population binomial computation of sample size. If you compute one sample size for sample 1 and a separate sample size for sample 2, you’ll come out much too low.

Example 12: Let’s look back once more at Don and his traffic stops. His 95% confidence interval was −0.0141 to +0.21587. That’s a margin of error of (.21587−(−.0141))/2 = 11½ percentage points. How large must his samples be if he wants a margin of error of no more than 5 percentage points but he’s willing to be only 90% confident?

Solution: Don can use his sample proportions as prior estimates. Those were 86/97 ≈ 0.8866 for men and 55/70 ≈ 0.7857 for women.

With the MATH200A program (recommended):

If you’re not using the program:

Here’s the output screen from MATH200A Program part 5, 2-pop binomial:

The calculation is a little easier if you break it into chunks. First compute p1(1−p1) + p2(1−p2). When you press [Enter], the calculator displays that result. You want to multiply that by $(z_{\alpha/2}/E)^2$. Press the [×] key, and the calculator displays Ans*. Then press the opening paren [(], enter the fraction, and square it.

What is $z_{\alpha/2}$? You did this in How Big a Sample for Binomial Data? in Chapter 9. The confidence level is 1−α = 0.9, so α = 0.1, α/2 = 0.05, and $z_{\alpha/2}$ is invNorm(1−.05). The margin of error is 5% or .05 (not .5!).

Caution: You don’t round the sample size. If you don’t get a whole number from the calculation, always go up to the next whole number. A sample size of 291.0255149 or greater gives a margin of error of .05 or less, at 90% confidence. The smallest whole number that is 291.0255149 or greater is 292, not 291.

Answer: Don needs a sample of 292 men and 292 women if he wants 90% confidence in an estimate of the difference with margin of error no more than 5 percentage points.

Rookie mistake: Don’t just say “292”. It’s 292 from each population.

Why do you need such large samples, even at a confidence level as low as 90%? Part of the answer is that binomial data do need large samples; remember that a single sample of just over a thousand gives you a 3% margin of error at the 95% confidence level. And when you have two populations, you are estimating the difference between two unknown parameter values, p1 and p2. If each of those was estimated within a 3% margin of error, the error in their difference could be as large as 6%, so the samples have to be larger in the two-population binomial case.

Example 13: The Prime Minister knows that his program of tax cuts and reduced social services appeals more to Conservatives than to Labour, but he wants to know how large the difference is. To estimate the difference with 95% confidence, with a margin of error of no more than 3 percentage points, how many members of each party must he survey?

Solution: You’re given no estimate of support within either party, so use 0.5 for p1 and p2. E = 0.03 (not 0.3).

With the MATH200A program (recommended):

If you’re not using the program:

MATH200a/sample size/2-pop binomial:

First compute p1(1−p1) + p2(1−p2) = 0.5(1−0.5) + 0.5(1−0.5). You have to multiply that by $(z_{\alpha/2}/E)^2$. You find $z_{\alpha/2}$ like this: CLevel = 1−α = 0.95 ⇒ α = 1−0.95 = 0.05 ⇒ α/2 = 0.025 ⇒ $z_{\alpha/2}$ = invNorm(1−.025).

Answer: To gauge the difference within a margin of error of 3 percentage points, at the 95% confidence level, the Prime Minister needs to poll 2135 Conservative Party members and 2135 Labour Party members.
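For readers without the MATH200A program, the sample-size arithmetic in Examples 12 and 13 can be sketched in a few lines of Python. This assumes equal sample sizes and the margin-of-error formula described above; the function name `two_pop_sample_size` is invented for the illustration.

```python
from math import ceil
from statistics import NormalDist


def two_pop_sample_size(p1, p2, margin, conf):
    """Per-sample size for a CI on p1 - p2, assuming n1 = n2 (hypothetical helper).

    p1, p2 : prior estimates of the two proportions (use 0.5 if you have none)
    margin : desired margin of error as a decimal, e.g. 0.05 for 5 points
    conf   : confidence level as a decimal, e.g. 0.90
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)            # critical z
    raw = (p1 * (1 - p1) + p2 * (1 - p2)) * (z / margin) ** 2
    return raw, ceil(raw)                                   # always round UP


# Example 12: Don's traffic stops, 90% confidence, 5-point margin
raw12, n12 = two_pop_sample_size(86 / 97, 55 / 70, 0.05, 0.90)

# Example 13: no prior estimates, 95% confidence, 3-point margin
raw13, n13 = two_pop_sample_size(0.5, 0.5, 0.03, 0.95)

print(n12, "from each population")
print(n13, "from each population")
```

Because the sketch uses the unrounded fractions 86/97 and 55/70, its unrounded result for Example 12 differs slightly in the second decimal place from the calculator figure quoted above, but the rounded-up answer is the same.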

11D2. Example 14: Gardasil Vaccine

The Gardasil vaccine is marketed by Merck to prevent cervical cancer. What are the statistics behind it? How do women decide whether to get vaccinated? Should the vaccine be mandatory?

A Cortland Standard story (21 Nov 2002) summarized an article from the New England Journal of Medicine as follows:

A new vaccine can protect against Type 16 of the human papilloma virus, a sexually transmitted virus that causes cervical cancer, a new study shows. An estimated 5.5 million people become infected with a strain of HPV [not necessarily this strain] each year in the United States.

Efficiency rate of vaccine and placebo

| Group | Group size | Infections |
|---|---|---|
| Placebo | 765 | 41 |
| HPV-16 vaccine | 768 | 0 |

Note: The study included 1533 women with an average age of 20.

(Similar studies were done for the vaccine’s effectiveness against another strain, HPV-18. According to the front page of the Wall Street Journal on 16 Apr 2007, HPV-16 and -18 between them “are thought to cause 70% of cervical-cancer cases.” The vaccine, developed by Merck, is now marketed as Gardasil.)

The samples certainly show an impressive difference, but is it statistically significant? Could the luck of random sampling be enough to account for that difference in infection rates?

The claim is “the vaccine protects against HPV-16.” To translate this into the language of statistics, realize that there are two populations: (1) women who don’t get the vaccine, and (2) women who do get the vaccine. Notice that the populations are all women, past, present, and future, who don’t or do get vaccinated. The 765 and 768 women are samples, not populations.

The populations are unvaccinated and vaccinated, not placebo and vaccine. Placebos are administered to members of a sample, but a population doesn’t “get placeboed”.

The data type is attribute (binomial) because the original question or measurement of each participant is the yes/no question: “Did this woman contract the virus?” (“Success” is an HPV-16 infection, not a good thing.) Since you’re comparing two populations, this is Case 5, Difference of Two Proportions.

Is the Vaccine Effective?

If the vaccine works, then you expect more women without the vaccine to contract the virus, so make them population 1. (That’s not necessary; it just usually makes things a little simpler to call population 1 the one with higher numbers expected.)

Although you hope that the vaccine population will have a lower infection rate, it’s not impossible that they could have a higher rate. Therefore you do a two-tailed test (≠). If p < α, then it’s time to say whether the vaccine makes things better or worse.

Let’s use α = 0.001. You’re talking about cancer in humans, after all. A Type I error would be saying that Gardasil makes a difference when actually it doesn’t. You don’t want women to get vaccinated, and have a false sense of security, if the vaccine actually doesn’t work, so a Type I error is pretty serious.

(1) population 1 = unvaccinated women; population 2 = vaccinated women
H0: p1 = p2, the vaccine makes no difference
H1: p1 ≠ p2, the vaccine does make a difference

(2) α = 0.001

(RC)

- Randomized design? We’re not told in so many words, but this is a high-profile medical study so you can be pretty confident it was done right.
- Samples less than 10% of population? Yes, since millions of women will get the vaccine (if it’s proved effective) and millions won’t.
- At least 10 yes and 10 no in each sample? In the placebo group, there were 41 yes and 765−41 = 724 no. In the treatment group, there were no successes at all. Does that mean you can’t do the hypothesis test?

Remember that “at least 10 yes and 10 no in each sample” is a shortcut for the real requirement, which is “at least 10 yes and 10 no expected in each sample if the null hypothesis is true”. If H0 is true, then the pooled proportion p̂ = 0.0267 is an estimator of the proportions in both populations. What would you expect if H0 is true? In the placebo group of 765, you would expect n1·p̂ = 765×.0267 ≈ 20 yes and n1 − n1·p̂ = 765−20 = 745 no. You’d expect about the same in the treatment group of 768, so the “at least 10” requirement is met.

(3/4) 2-PropZTest: 41, 765, 0, 768, p1≠p2
results: z = 6.50, p-value = 7.9E-11, p̂1 = .0536, p̂2 = 0, p̂ = .0267

Pause for a minute to make sure you can keep all those p’s straight. The first one, p = 7.9E-11, is the p-value, the chance of getting such different sample results if the vaccine makes no difference. p̂1 and p̂2 are those sample results: 5.4% of unvaccinated women and 0% of vaccinated women in the samples contracted HPV-16 infections. p̂ without subscript is the pooled proportion: 2.7% of all women in the study contracted HPV-16.

(5) p < α. Reject H0 and accept H1.

(6) The Gardasil vaccine does make a difference to HPV-16 infection rates (p = 8×10⁻¹¹). In fact, it lowers the chance of infection.

Or,

At the 0.001 level of significance, the Gardasil vaccine does make a difference to HPV-16 infection rates. In fact, it lowers the chance of infection.
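To see where the 2-PropZTest numbers come from, here is a rough Python sketch of a pooled two-proportion z test applied to the study counts. It is an illustration under the standard textbook formulas, not the calculator's actual code, and the name `two_prop_z_test` is hypothetical.

```python
from math import sqrt, erfc


def two_prop_z_test(x1, n1, x2, n2):
    """Two-tailed pooled z test of H0: p1 = p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                    # blended proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))                  # area in both tails of N(0, 1)
    return z, p_value, p1, p2, pooled


# population 1 = unvaccinated (placebo group), population 2 = vaccinated
z, p_value, p1_hat, p2_hat, pooled = two_prop_z_test(41, 765, 0, 768)

print(f"z = {z:.2f}, p-value = {p_value:.1e}")
print(f"p1-hat = {p1_hat:.4f}, p2-hat = {p2_hat:.4f}, pooled = {pooled:.4f}")
print("expected yes in placebo group under H0:", round(765 * pooled))
```

The tail area is computed with `erfc` rather than `1 - cdf` because the p-value here is so tiny that subtracting from 1 would waste precision.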

It’s worth reviewing what this p-value of 8×10⁻¹¹ means. If the vaccine actually made no difference, there are only 8 chances in a hundred billion of getting the difference between samples that the researchers actually got, or a larger difference.

How do you get from “makes a difference” to “reduces infection rate”? Remember that when p < α in a two-tailed test, you can interpret the result in a one-tailed manner. If the vaccine makes things different, as appears virtually certain, then it must either make them better or make them worse. But in the sample groups, the vaccine group did better than the placebo group. Therefore the vaccine can’t make things worse, and it must make them better.

How Effective Is the Vaccine?

Can you do a confidence interval to estimate how much Gardasil reduces a woman’s risk of HPV-16 infection? Unfortunately, you can’t, because the requirements aren’t met: There were zero successes in the second sample.

You can’t think like the hypothesis test and use the blended p̂ to meet requirements. Why wouldn’t that make sense? In a confidence interval, you’re specifically trying to estimate the difference between p1 and p2 (likelihood of infection for unvaccinated and vaccinated women), so you can’t very well assume there is no difference.

In terms of what you’re required to know for the course, you can skip to the next section right now. But if you want to know more, keep reading.

One informal calculation finds a number needed to treat per person actually helped (Simon 2000c). The difference in sample proportions is 5.4 percentage points, and 1/.054 ≈ 18.5 is called the number needed to treat. (You may recognize this as the expected value of the geometric distribution with p = 5.4%.) In the long run, for every 18 or 19 women who are vaccinated, one HPV-16 infection is prevented.

Caution! 5.4 percentage points is a difference in sample proportions. You can say only that the difference in the population is somewhere in the neighborhood of 5.4 percentage points, not that it is that. The number needed to treat is therefore not exactly 18.5, just somewhere in the neighborhood of 18.5. Even so, this is valuable information for women and their doctors.

Another approach is the rule of three, explained in Confidence Interval with Zero Events (Simon 2010). When there are zero successes in n events, the 95% confidence interval is 0 to 3/n. Here 3/768 = 0.0039, about 0.4%. The 95% confidence interval for the unvaccinated population is 3.8% to 7.0%. So a doctor can tell her patients that about 38 to 70 unvaccinated women in a thousand will be infected with HPV-16, but only about four vaccinated women in a thousand.

BTW: Each of those is a 95% confidence interval, but the combination isn’t a 95% confidence interval! In the long run, if you do a bunch of 95% CIs, one in 20 of them won’t capture the true population parameter. Here you’re doing two, so there’s only a 95%×95% = 90.3% chance that both of these actually capture the true population proportions.
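The informal calculations in this optional section are simple enough to check by hand, but a short Python sketch may help keep them straight. The variable names are invented for the illustration, and the one-sample interval uses the ordinary normal approximation.

```python
from math import sqrt
from statistics import NormalDist

# Study counts from the newspaper table
placebo_n, placebo_infected = 765, 41
vaccine_n, vaccine_infected = 768, 0

# Number needed to treat: reciprocal of the difference in sample proportions
risk_difference = placebo_infected / placebo_n - vaccine_infected / vaccine_n
nnt = 1 / risk_difference

# Rule of three: with 0 events in n trials, the 95% CI is roughly 0 to 3/n
rule_of_three_upper = 3 / vaccine_n

# Ordinary 95% interval for the unvaccinated proportion
p_hat = placebo_infected / placebo_n
z = NormalDist().inv_cdf(0.975)
e = z * sqrt(p_hat * (1 - p_hat) / placebo_n)
unvaccinated_ci = (p_hat - e, p_hat + e)

# Chance that two independent 95% intervals both capture their parameters
joint_coverage = 0.95 * 0.95

print(f"number needed to treat: about {nnt:.1f}")
print(f"vaccinated: 0 to {rule_of_three_upper:.2%}")
print(f"unvaccinated: {unvaccinated_ci[0]:.1%} to {unvaccinated_ci[1]:.1%}")
```

The sketch divides by the unrounded difference in proportions, so its number needed to treat comes out a little above the 18.5 you get from the rounded 5.4%; either way it lands between 18 and 19.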

11E. Confidence Interval and Hypothesis Test (Two Populations) Summary:

If you have a confidence interval for the difference of two population means or proportions, you can conclude whether the difference is statistically significant or not, just like the result of a hypothesis test.

Example 15: You’re testing the new drug Effluvium to see whether it makes people drowsy. Your 95% confidence interval for the difference between proportions of drowsiness in people who do and don’t take Effluvium is (0.017, 0.041). That means you’re 95% confident that Effluvium is more likely, by 1.7 to 4.1 percentage points, to cause drowsiness.

There’s the key point. You’re 95% confident that it does increase the chance of drowsiness by something between those two figures. How likely is it that Effluvium doesn’t affect the chance of drowsiness, then? Clearly it’s got to be less than 5%.

When both endpoints of your confidence interval are positive (or both are negative), so that the confidence interval doesn’t include 0, you have a significant difference between the two populations.

Example 16: Now, suppose that confidence interval was (−0.013, 0.011). That means you’re 95% confident that Effluvium is somewhere between 1.3 percentage points less likely and 1.1 percentage points more likely to cause drowsiness.

Can you now conclude that Effluvium affects the chance of drowsiness? No, because 0 (“no difference”) is inside your confidence interval. Maybe Effluvium makes drowsiness less likely, maybe it has no effect, maybe it makes drowsiness more likely; you can’t tell.

When one endpoint of your confidence interval is negative and one is positive, so that the confidence interval includes 0, you can’t tell whether there’s a significant difference between the two populations or not.

- When 0 is inside the 1−α CI (the two endpoints have different signs), the two-tailed p-value is > α. Your sample doesn’t show a difference between the population means or proportions, and you fail to reject H0.
- When 0 is outside the 1−α CI (the two endpoints have the same sign), the two-tailed p-value is < α. Your sample shows a difference between the two population means or proportions, and you reject H0.

This is exact for numeric data but approximate for binomial data. Why? Because the HT and CI use the same standard error for the numeric data cases, but slightly different standard errors for two-population binomial data. (The two calculations are in BTW paragraphs earlier in the chapter.)

11F. More Confidence Intervals for Two Populations Summary:

Confidence intervals for two populations are easy enough to calculate on your TI-83. But one or both endpoints can be negative, and that means you have to write your interpretation carefully. Don’t just say “difference”; specify which population’s mean or proportion is larger or smaller. You must also distinguish between mean difference (for paired data) and difference in means (for unpaired data). Study these examples of confidence intervals for two populations, and you’ll learn how to write your interpretations like a pro!

11F1. Example 17: Heights of Men and Women

Here’s an example adapted from Johnson and Kuby (2003, 425). Men’s and women’s heights are normally distributed (ND). From this random sample, estimate the difference in average height as a 95% CI.

| Sample | Mean, x̄ | Standard Deviation, s | Sample Size, n |
|---|---|---|---|
| Female, pop. 2 | 63.8″ | 2.18″ | 20 |
| Male, pop. 1 | 69.8″ | 1.92″ | 30 |

Analysis

You have independent samples here: you get one number from each individual. The data type is numeric (height), so you have Case 4, difference of independent means.

Requirements Check

With independent means, you check requirements for each sample separately. You’re told that each sample was an SRS, so that’s no problem. The samples are smaller than 10% of all men and 10% of all women. The sample of 30 men is big enough that normality and outliers aren’t an issue.

But what about the sample of 20 women? You don’t have the original data, so you can’t check normality and outliers. Fortunately, you don’t need to. Since women’s heights are normally distributed, the distribution of sample means will be normal regardless of sample size. All requirements for Case 4 are met.

Calculation and Conclusion

The TI-83 or TI-84 computes µ1 − µ2, so you need to decide which will be population 1 and which will be population 2. I like to avoid negative signs, so unless there’s a good reason to do otherwise I take the sample with the larger mean as sample 1; in this case that’s the men. Whichever way you decide, write it down: pop 1 = ________, pop 2 = ________.

On your calculator, press [STAT] [◄] and scroll up or down to find 0:2-SampTInt. Enter the sample statistics and use Pooled:No. Here are the input and output screens:

Conclusion: With 95% confidence, the average man at that college is between 4.8″ and 7.2″ taller than the average woman, or µM − µF = 6.0″ ± 1.2″. (You would probably present one or the other of those forms, not both.)

(6.0 is the difference of sample means and is the center of the confidence interval: x̄1 − x̄2 = 69.8 − 63.8 = 6.0.)

Remark: The difference from the case of dependent means is subtle but important. With dependent means (paired data), the CI is about the average difference in measurements of a single randomly selected individual or matched pair. But with independent means (unpaired data), the CI is about the difference between the averages for two different populations.
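As a cross-check on the 2-SampTInt result, here is a minimal Python sketch of the unpooled (Welch) interval from summary statistics. Python's standard library has no inverse t function, so the critical value `t_star = 2.026` is an assumption read from a t table for about 37 degrees of freedom; the function name `welch_ci` is likewise made up for this illustration.

```python
from math import sqrt


def welch_ci(mean1, s1, n1, mean2, s2, n2, t_star):
    """Unpooled CI for mu1 - mu2 from summary statistics (hypothetical helper).

    t_star must be supplied by the caller, e.g. from a t table, for the
    degrees of freedom that this function reports.
    """
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    se = sqrt(v1 + v2)                                  # standard error of the difference
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    diff = mean1 - mean2
    e = t_star * se                                     # margin of error
    return diff - e, diff + e, df


# pop 1 = men, pop 2 = women; t_star is an assumed table value for df near 37
lo, hi, df = welch_ci(69.8, 1.92, 30, 63.8, 2.18, 20, t_star=2.026)

print(f"df is about {df:.1f}")
print(f"95% CI for muM - muF: {lo:.1f} to {hi:.1f} inches")
```

If you change the confidence level, remember to look up a new critical t; the sketch does not do that for you.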

11F2. Example 18: Coffee and Heart Rate with Negatives

Now let’s make up new data for the coffee example. (The new d’s are still normally distributed with no outliers.) Again, you’re estimating the mean difference in heart rate due to drinking coffee.

| Person | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Before | 78 | 64 | 70 | 71 | 70 | 68 |
| After | 79 | 62 | 73 | 70 | 71 | 67 |
| d = A−B | 1 | −2 | 3 | −1 | 1 | −1 |

Notice that some heart rates declined after the people drank coffee.

Now when you compute a 95% CI you get the results shown at right. How should you interpret a negative endpoint in the interval?

Remember that you are computing a CI for the quantity After−Before. You could follow the earlier pattern and say “With 95% confidence, the mean increase in heart rate for all individuals after drinking coffee is between −1.8 and +2.1 beats per minute,” but only a mathematician would love a statement that talks about an increase being negative. Instead, you draw attention to the fact that the change might be a decrease or an increase, as follows.

Conclusion: With 95% confidence, the mean change in heart rate for all individuals after drinking coffee is between a decrease of 1.8 and an increase of 2.1 beats per minute.

Since it’s obviously very important to get the direction right, be sure to check your conclusion against your H1 (if any) and your original definition of d.

Remark 1: Though it’s correct to present the CI as a point estimate and margin of error, it’s probably not a good idea because that form is so easy to misinterpret. If you say “With 95% confidence, the mean increase in heart rate for all individuals is 0.2±1.9 beats per minute,” many people won’t notice that the margin of error is bigger than the point estimate, and they’ll come to the false conclusion that you have established an increase in heart rate after drinking coffee. As statistics mavens, we have a responsibility to present our results clearly, so that people draw the right conclusions and not the wrong ones.

Remark 2: Remember that the CI occupies the middle of the distribution while the HT looks at the tails. If 0 is inside the CI, it can’t be in either tail. Therefore, from this confidence interval you know that testing the null hypothesis µd = 0 at the 0.05 level (0.05 = 1−95%) would fail to reject H0: this experiment failed to find a significant difference in heart rate after drinking coffee. (See Confidence Interval and Hypothesis Test (Two Populations).)

Remember the difference between “no significant difference found” and “no difference exists”. Since 0 is in the CI, you can’t say whether there is a difference. The correct statement, “I don’t know whether there is a difference,” is different from the incorrect “There is no difference.”
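The paired-data interval in this example can be reproduced from the table with a short Python sketch. As before, the critical value is an assumption taken from a t table (`t_star = 2.571` for 5 degrees of freedom at 95% confidence), since the standard library cannot compute it.

```python
from math import sqrt
from statistics import mean, stdev

before = [78, 64, 70, 71, 70, 68]
after = [79, 62, 73, 70, 71, 67]

# d = After - Before, one difference per person (Case 3, paired data)
d = [a - b for a, b in zip(after, before)]

d_bar = mean(d)                      # point estimate of the mean difference
se = stdev(d) / sqrt(len(d))         # standard error, using the sample SD of the d's
t_star = 2.571                       # assumed table value: df = 5, 95% confidence
margin = t_star * se

lo, hi = d_bar - margin, d_bar + margin
print("d =", d)
print(f"mean change: {d_bar:.1f} +/- {margin:.1f} beats per minute")
print(f"95% CI: {lo:.1f} to {hi:.1f}")
print("0 inside the interval?", lo < 0 < hi)
```

The last line is the check described in Remark 2: with 0 inside the 95% interval, a two-tailed test of µd = 0 at the 0.05 level would fail to reject H0.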

11F3. Example 19: Opinion Poll

The following data are from Dabes and Janik (1999, 269). Men and women were polled in a systematic sample on whether they favored legalized abortion, and the results were as follows:

| Sample | Number in Favor, x | Sample Size, n |
|---|---|---|
| Females, pop. 1 | 60 | 100 |
| Males, pop. 2 | 40 | 80 |

Find a 98% confidence interval for the difference in level of support between women and men.

Analysis

You have binomial data: each person either supports legalized abortion or not. (Obviously this example is oversimplified.) Binomial data with two populations is Case 5, difference of proportions.

Requirements Check

Support among the sample of women is 60/100 = 60%, and among the sample of men is 40/80 = 50%, so let’s define population 1 = women, population 2 = men.

- Each sample is a systematic sample, as good as an SRS.
- 10n1 = 10×100 = 1000; 10n2 = 10×80 = 800. There are more than 1000 women and 800 men in the world.
- In the sample of women there are 60 successes and 100−60 = 40 failures; in the sample of men there are 40 successes and 80−40 = 40 failures. All four numbers are well above 10.

All requirements for a Case 5 CI are met.

Calculation and Conclusion

On the TI-83 or TI-84, press [STAT] [◄] and scroll up to find B:2-PropZInt. The input and output screens look like this:

Two-population confidence intervals can be tricky to interpret, particularly when the two endpoints have different signs and particularly for Case 5, two population proportions. You can reason it out in words, or use algebra.

In words, remember that the confidence interval is the estimated difference p1 − p2, which is the estimated amount by which the proportion in the first population exceeds the proportion in the second population. So a negative endpoint for your CI means that the first proportion is lower than the second, and a positive endpoint means that the first proportion is larger.

Using algebra, begin with the calculator’s estimate of p1 − p2:

−0.0729 ≤ p1 − p2 ≤ +0.27292 (98% conf.)

Add p2 to all three parts of the inequality, and you have

p2 − 0.0729 ≤ p1 ≤ p2 + 0.27292 (98% conf.)

That’s a little easier to work with. The 98% confidence bounds on p1 (level of women’s support) are p2 − 0.0729 (7.3 percentage points below men’s support) and p2 + 0.27292 (27.3 percentage points above men’s support).

Conclusion: You are 98% confident that support for legalized abortion is somewhere between 7.3 percentage points lower and 27.3 points higher among females than males.

Remark: It would be equally valid to turn that around and say you’re 98% confident that support is between 27.3 percentage points lower and 7.3 points higher among males than females.
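Turning signed endpoints into words is mechanical enough to sketch in Python. The helper below is purely illustrative (the names `two_prop_ci` and `describe` are invented here); it recomputes the 98% interval from the poll counts with the unpooled normal approximation and then phrases each endpoint as “lower” or “higher”.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_ci(x1, n1, x2, n2, conf):
    """Normal-approximation CI for p1 - p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    e = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - e, (p1 - p2) + e


def describe(lo, hi, first, second):
    """Phrase an interval for (first - second) in percentage points."""
    def side(value):
        direction = "lower" if value < 0 else "higher"
        return f"{abs(value) * 100:.1f} percentage points {direction}"
    return f"between {side(lo)} and {side(hi)} among {first} than {second}"


# population 1 = females, population 2 = males
lo, hi = two_prop_ci(60, 100, 40, 80, conf=0.98)
sentence = describe(lo, hi, "females", "males")
print("Support is", sentence)

# Swapping the populations just negates and swaps the endpoints
flipped = describe(-hi, -lo, "males", "females")
print("Support is", flipped)
```

The second call shows the Remark in action: negating and swapping the endpoints gives the equally valid statement about males relative to females.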

11F4. Example 20: GPA of Fraternity Members and Nonmembers

Johnson and Kuby (2003, 427) present another example. What is the difference (if any) in academic performance between fraternity members and nonmembers? Forty members of each population were randomly selected, and their cumulative GPA recorded as an indication of performance. The results were as follows:

| Sample | x̄ | s | n |
|---|---|---|---|
| Fraternity members, pop. 1 | 2.03 | 0.68 | 40 |
| Independents, pop. 2 | 2.21 | 0.59 | 40 |

Analysis

Here you have numeric data, two independent samples. (You know it’s independent samples, unpaired data, because each member of the sample gives you just one number.) This is Case 4, difference of independent means.

Requirements Check

Each sample was random, and each sample size is >30. We can assume that there are more than 10×40 = 400 fraternity members and 400 independents on campus. All requirements for Case 4 are met.

Calculation and Conclusion

The CI is −0.46 to +0.10, with 95% confidence. To interpret this, remember that the TI-83 computes a CI for µ1 − µ2, and we defined population 1 as fraternity and population 2 as independent. The calculator is telling you that

−0.46 ≤ µ1 − µ2 ≤ +0.10 (95% conf.)

or, adding µ2 to all three parts,

µ2 − 0.46 ≤ µ1 ≤ µ2 + 0.10 (95% conf.)

Conclusion: The true difference in academic performance, as measured by average GPA, is somewhere between 0.46 worse and 0.10 better for fraternity members relative to nonmembers, with 95% confidence.

You could also write a somewhat longer form: with 95% confidence, the average fraternity member’s academic performance, as measured by GPA, is somewhere between 0.46 worse and 0.10 better than the average independent’s performance.

Remark 1: Don’t be fooled by the fact that the CI is mostly below zero. You really cannot conclude that fraternity members probably have lower academic performance. Remember that the 95% CI is the result of a process that captures the true population mean (or difference, in this case) 95 times out of 100. But you can’t know where in that interval the true mean (or difference) lies. If you could, there would be no point to having a CI!

Remark 2: Even though zero is within the CI, you must not say that there is no difference in performance between members and nonmembers. The difference might indeed be zero, but it might also be anywhere between 0.46 in one direction and 0.10 in the other. There’s even a 5% chance that the true difference lies outside those limits. Always bear in mind the difference between insufficient evidence for and evidence against. (You may hear that said as “lack of evidence for is not evidence against.”)

What Have You Learned?

This chapter covered confidence intervals and hypothesis tests for two samples, both binomial and numeric data. Instead of testing a population’s µ or p against some baseline number, you test the µ or p of two populations against each other.

Key ideas:

- Numeric data can be paired or unpaired, a/k/a dependent or independent samples. Paired data arise when one experimental unit generates two numbers. With unpaired data, there’s no specific association between a data point in one sample and any specific data point in the other sample. If there’s an effect to be found, you’re more likely to find it with a paired-data design than with unpaired data.
- Caution! Just having equal sample sizes is not enough for paired data; there has to be an association between each member of one sample and a specific member of the other sample. They can be two tasks performed by the same individual, husband-wife studies, identical-twin studies, etc.
- With paired numeric data, you have Case 3, mean difference. In step 1 of your HT or at the beginning of your CI, write the definition of d, showing which direction you will subtract. Your HT is about µd. Do your requirements check on the d’s; you don’t care whether the original numbers pass requirements. Use plain T-Test or TInterval on the differences.
- With unpaired numeric data, you have Case 4, difference of means. In HT step 1 or at start of CI, identify population 1 and population 2 (not sample 1 and sample 2). Check requirements on each sample separately. Use 2-SampTTest or 2-SampTInt.
- Binomial data are never paired in this course; you have Case 5, difference of proportions. In HT step 1 or at start of CI, identify population 1 and population 2 (not sample 1 and sample 2). Requirements are slightly different between CI and HT, and in fact with HT it’s easier to check requirements after the computations. Use 2-PropZTest or 2-PropZInt.
- Be able to calculate necessary sample size to keep margin of error below a desired value for a desired confidence level.
- Spend time to interpret confidence intervals correctly. You must identify one population’s mean or proportion as larger or smaller than the other (not just different), and by how much; in other words, you give the direction and size of the effect. This may take more than one draft, especially when the ends of the CI have opposite signs. With binomial data, the difference is a matter of percentage points, not percent. Your interpretation should clearly be about the populations, not the samples.
- CI and HT are intimately associated. Zero outside the CI at a C-level of 1−α rules out “no difference”, so it matches up with a two-tailed HT result of “reject H0” at the α level. Zero inside the interval admits “no difference” as a possibility, so it matches up with a “fail to reject H0”.

Study aids:

Inferential Statistics: Basic Cases

Interactive: Triage: Which Inferential Stats Case Should I Use?
Seven Steps of Hypothesis Tests
Top 10 Mistakes of Hypothesis Tests
Paired and Unpaired Data Compared
Statistics Symbol Sheet


Exercises for Chapter 11 Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand. Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1

You want to determine whether sports fans would pay 20% extra for reserved bleacher seats. You suspect that the answer may be different between people aged under 30 and people 30 years old or older. (a) How big must your samples be to let you construct a 95% confidence interval for the difference, with a margin of error of 3 percentage points? (b) You discover a poll done last year, in which 30% of young people and 45% of older people said they would pay extra. Now how large must your samples be?

2

A researcher wanted to see whether the English like soccer more than the Scots. She randomly selected eight English and eight Scots and asked them to rate their liking for soccer on a numeric scale of 1 (hate) to 10 (love), and she recorded these responses:

English   6.4   5.9   2.9   8.2   7.0   7.1   5.5   9.3
Scots     5.1   4.0   7.2   6.9   4.4   1.3   2.2   7.7

(a) From the above data, can the researcher prove that the English have a stronger liking for soccer than the Scots? Use α = 0.05. (b) Construct a 90% confidence interval for the different average levels of English and Scottish enthusiasm for soccer.

3

Another researcher took a different approach. She polled random samples with the question “Do you watch football at least once a week?” (In the UK they call soccer “football”.) She got the results shown in the table below.

          Sample Size   Number of “Yes”
English   150           105
Scots     200           160

(a) At the 0.05 level of significance, are the English and the Scots equally fans of soccer? (b) Construct a 95% confidence interval for the difference. (c) Find the margin of error in that interval. (d) If the researcher repeats her survey, what sample size would she need to reduce the margin of error to 4 percentage points at the same confidence level?

4

To see if running raises the HDL (“good”) cholesterol level, five female volunteers (randomly selected) had their HDL level measured before they started running and again after each had run regularly for six months, an average of four miles daily. (a) See if you can prove that the average person’s HDL cholesterol level would be raised after all that running. Use α = 0.05. (b) Compute and interpret a 90% confidence interval for the change in HDL from running four miles daily for six months.

Person   Before Running   After Running
1        30               35
2        34               39
3        36               42
4        34               33
5        40               48

5

The Physicians’ Health Study tested the effects of 325 mg aspirin every other day in preventing heart attack in people with no personal history of heart attack. 22,071 doctors were randomized into an aspirin group and a placebo group. Of the 11,037 doctors who received aspirin, 10 had fatal heart attacks and 129 had non-fatal heart attacks; total 139. For the 11,034 doctors in the placebo group, the figures were 26 and 213; total 239. (a) At the 0.001 level of significance, does this aspirin regimen make a difference to the likelihood of a heart attack? (b) Find a 95% confidence interval for the reduced risk.

6

June was planning to relocate to Central New York, considering Binghamton and Cortland. She found an online survey of prices of recently completed house sales as shown in the table below. The survey was two random samples taken about a month before she looked at the Web site.

               Mean       S.D.      n
Cortland Co.   $134,296   $44,800   30
Broome Co.     $127,139   $61,200   32

(a) Construct a 95% confidence interval for the difference in mean house price in the two counties. (b) Use that answer to determine which county has a lower average price of houses, at the 0.05 significance level.

7

The Canter Polling Service conducted two national polls in the same week, one for the Red Party candidate and one for the Blue Party candidate. Each one was a random sample of 1000 likely voters — not the same 1000, of course. (Most national polls have sample size 1000.) In the first sample, 520 (52%) said they would vote for Red. In the second sample, 480 (48%) said they would vote for Blue. The newspaper reported that Red was leading by 4%. What’s wrong with that? Write a correct statement, at the 95% level of confidence.

8

You have two independent random samples of yes/no data. Sample 1 has 7 yes out of 28, and sample 2 has 18 yes out of 32. Each sample is smaller than 10% of the population. (a) Is it valid to use 2-PropZInt to compute a confidence interval? Why or why not? (b) Is it valid to use 2-PropZTest to compute a p-value? Why or why not?




What’s New 1 Jan 2016: Retake the screen shots here, here, and here, for the new version of MATH200A Program part 4. 15 Dec 2015: Thanks to Briann Lehman, correct typos here and here. 9 July 2015: Correct H 0 to H 1 here and here, thanks to Halsey Huff. 19 Jan 2015: In Numeric Data — Paired or Unpaired? (formerly “Paired and Unpaired Data”), invert the first two pairs of examples so that unpaired comes first and I can ask about common problems with unpaired, and add section heads for unpaired and paired. Add the section 11F5. Paired and Unpaired Data Compared with a side-by-side chart. Add a section head, When to Use Paired Data? Add more explicit “thought process” on recognizing paired data; also here. Fix an instance of backward data entry under Entering Paired Numeric Data. Also add a warning about changing numbers after entering a formula. Add a problem that requires the formal requirements check for Case 5. (intervening changes suppressed) 29 Apr 2013: New document.

12. Tests on Counted Data Updated 21 Jan 2015 (What’s New?) Intro:

In Chapter 10 you learned about hypothesis tests, using one sample of numeric or binomial data to test a hypothesis about a population mean or proportion. In Chapter 11, you extended that to inferences about the difference between two numeric or binomial populations. With binomial data, you have counts of success and failure: two categories for one or two populations. But what if you have more categories or populations? That’s when you use the tests in this chapter. The hypothesis tests will use the same seven steps that you already know and love, but with a new test statistic called chi-squared. Your data will be counts of members of the sample that fall into particular categories of one or two variables.

Contents:

12A. Testing Goodness of Fit to a Model First Problem: M&M Colors · The Theory · Hypothesis Test in Practice · Optional: Residuals · Confidence Intervals? Second Problem: Fruit Flies · Solution · Scientific Method and Your Conclusions Third Problem: Equal Preferences · Solution 12B. Testing for Independence or Homogeneity First Problem: Office Equipment · The Theory · Hypothesis Test in Practice · Optional: Residuals and More Second Problem: The “Monday Effect” · Solution Third Problem: Tobacco Smoke and Tumors · Solution 12C. But Wait, There’s More! What Have You Learned? Exercises for Chapter 12 What’s New

12A. Testing Goodness of Fit to a Model

Suppose you have one population divided into three or more categories: there are ≥ 3 possible non-numeric responses from each subject. For example, instead of monitoring whether each patient had a heart attack or not (two possibilities), you might monitor whether each person had a fatal heart attack, a non-fatal heart attack, or no heart attack (three possibilities). When there were only two possibilities, you could talk about the proportion of successes in the population, because failure was the only other possibility. If you knew about successes, you knew about failures. The population proportion of successes, p, was the population parameter. But when you have three or more possibilities, that goes out the window. Knowing the proportion of one category in the population doesn’t tell you the proportions of the others. So instead of testing against a particular proportion, you test against all the proportions at once. You have a probability model in mind, and you perform a goodness-of-fit (GoF) test to compare the data and the model. If the data are too far away from the model, you reject the model. This is a standard hypothesis test, but you’ll learn that you compute the p-value from a new distribution called χ². As usual, I’ll show you the theory first, and then you’ll do calculations the easy way.

12A1. First Problem: M&M Colors The M&M Mars Web site used to give the color distribution of plain M&Ms as 24% blue, 13% brown, 16% green, 20% orange, 13% red, and 14% yellow. My Spring 2011 class counted the colors of 628 plain M&Ms and computed the sample proportions, as shown at right. Obviously their percentages differ from the company model. But are they different enough to let the class reject the company’s model?

Plain M&Ms
Color    Model   Observed   Sample p
Blue     24%     127        20.2%
Brown    13%     63         10.0%
Green    16%     122        19.4%
Orange   20%     147        23.4%
Red      13%     93         14.8%
Yellow   14%     76         12.1%
Totals   100%    628        99.9%

The company’s model is your null hypothesis H0. The alternative hypothesis H1 is that the company model is wrong. Let’s use a significance level of 0.05.

Caution: As always, to apply the analysis techniques you need a simple random sample, and the class didn’t have that. The Fun Size M&Ms packs that they analyzed were bought from the same store on the same day and almost certainly came from one small part of one production run. Although this wasn’t a true random sample, I’m going to proceed as though it was, to show you the method. The Theory By now you know the drill. Samples vary, so just about any real-life sample will be different from the theoretical expectation in H 0 . The question is always the same: Is pure sample variability enough to account for the difference between H 0 and this sample, or is there some real effect here beyond that? For each data type, you have a method to figure a test statistic and a p-value. The test statistic is a standardized measure of the discrepancy between H 0 and the sample, taking sample size into account; the p-value is the probability of getting that sample, or one even further away from H 0 , if random chance is the only thing going on. (Most software and statistical calculators compute the test statistic and p-value for you at the same time.) So far you know two test statistics, z and t. Can you use one of them on this problem? There’s the obvious choice of performing six z tests of proportion on the six colors. But in the immortal words of Richard Nixon in Watergate, “We could do that, but it would be wrong.” Why would it be wrong? Well, you’re doing a hypothesis test at 0.05 significance, right? That means that you can live with a one in twenty chance of a Type I error, calling the model bad when it’s actually good. But if you do a 0.05 significance test of each color, then you have a 0.05 chance of a Type I error on blue, a 0.05 chance of a Type I error on brown, and so forth. Suddenly your real significance level is almost 0.30, which is ridiculously high. 
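To see roughly where “almost 0.30” comes from, you can compute the chance of at least one Type I error across six separate tests. The sketch below assumes, purely for illustration, that the six tests are independent, which is not really true for M&M colors; the variable names are mine, not the book’s.

```python
# Rough illustration: chance of at least one Type I error in k tests,
# ASSUMING the tests are independent (an assumption that does not
# actually hold for the six M&M colors).
alpha = 0.05   # significance level used for each individual test
k = 6          # one z test per color

upper_bound = k * alpha               # simple sum of the six risks
familywise = 1 - (1 - alpha) ** k     # P(at least one false rejection)

print(f"sum of individual risks:          {upper_bound:.2f}")
print(f"at least one Type I error (est.): {familywise:.3f}")
```

Under that independence assumption the combined risk works out to about 0.26, somewhat below the simple sum of 0.30 but still far above the 0.05 you intended.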
(It’s not quite equal to 6×0.05 because you might get Type I errors on more than one color, and also because the colors aren’t independent.) Never do multiple tests on the same data, because that makes Type I errors way more likely than you can live with. You must do a single overall test of the model as a whole, and that means a new test statistic. It’s called χ² or chi-squared. The χ² computation will look a little weird: you have to deal with each category because there are no summary statistics like x̄ and s to help you along. But I’ll walk you through it and you’ll see that it’s not too bad, really.

How to pronounce χ² or chi-squared: The first consonant is roughly a k sound, so you can pronounce χ or chi as Kyle without the l sound. χ² rhymes with “high-chaired”. If you want to get technical (and you know I do), the Greek letter χ sounds similar to Yiddish ch in l’chaim or Scottish ch in loch. It’s definitely not English ch as in church. χ is not an X, by the way, even though it looks like one. Greek words beginning with χ are written with ch in English: words like chiropractor and chronology. The Greek letter with the x sound is ξ, spelled xi and pronounced ksee, but it doesn’t figure in this course. Okay, that’s enough Greek class. We now return you to statistics.

Computing Expected Counts (E’s)

The key concept in testing a probability model against data is expected count. Samples never actually match a model, but what would a sample with this same size look like if it did? Well, if colors are supposed to be distributed in 24% blue, 13% brown, and so on, then a perfect match within 628 M&Ms would be distributed in 24% blue, 13% brown, and so on. The expected counts are computed in the table below.

Plain M&Ms
Color    Model   Observed   Expected
Blue     24%     127        24% × 628 = 150.7
Brown    13%     63         13% × 628 = 81.6
Green    16%     122        16% × 628 = 100.5
Orange   20%     147        20% × 628 = 125.6
Red      13%     93         13% × 628 = 81.6
Yellow   14%     76         14% × 628 = 87.9
Totals   100%    628        627.9

The observed column is counts, which means whole numbers. But E’s are averages in a sense (what’s the average number of blues you’d expect if you took many, many samples of size 628 and the company’s 24% is correct? 150.7), so they don’t need to be whole numbers and typically are not. As you can see, even carrying E’s to one decimal place there’s a slight rounding error, 627.9 versus 628; rounding to whole numbers would give a bigger rounding error. Software and calculators avoid this issue by carrying more precision internally than they display.

In a goodness-of-fit test, your data are counts, just as they were in tests on binomial data. So it’s no surprise that the requirements for GoF are similar to the requirements for binomial data. You need a random sample (or equivalent) that is less than 10% of the population. But there are more than two categories in the model, so instead of a success/failure condition you have a condition on the expected counts (E): The expected count in every category must be ≥ 5. (Some authors use a looser requirement, that none of the E’s can be below 1, and no more than 20% of them can be below 5.)

By the way, make sure you actually have counted data. (Dave Bock calls this the Counted Data Condition.) Sometimes students try to do a chi-squared test on sample means, but the chi-squared distribution is just for counts of categorical data.

What do you do if your E’s are too small? You can combine smaller categories, if the combination seems reasonable. For example, suppose you’re studying some characteristic of people based on their home state. You could combine adjacent small states like Connecticut–Rhode Island and Delaware–Maryland. But it’s best to plan ahead and not get into this position. Your smallest E will come from your sample size times your smallest model category. Just plan for a large enough sample size to make that product ≥ 5.

Computing χ² Contributions and Total

Eyeballing the observed and expected numbers doesn’t really tell you much. What you need is a single number that shows the overall badness of fit and can be related to a standardized distribution. To find this, you take the difference between observed and expected, square it so that it’s always positive, and then divide by expected to scale the effect size by the sample size. Do this for each row, and the result is called the “χ² contribution” for that row. Add up the rows and you have your χ² test statistic. The computations are shown in the table below, and for this sample and this model you have χ² = 19.42. This is a standardized measure of how far the model and the data disagree.

Plain M&Ms
Color    Model   Observed   Expected   χ² contribution
Blue     24%     127        150.7      (127−150.7)²/150.7 = 3.73
Brown    13%     63         81.6       (63−81.6)²/81.6 = 4.24
Green    16%     122        100.5      (122−100.5)²/100.5 = 4.60
Orange   20%     147        125.6      (147−125.6)²/125.6 = 3.65
Red      13%     93         81.6       (93−81.6)²/81.6 = 1.59
Yellow   14%     76         87.9       (76−87.9)²/87.9 = 1.61
Totals   100%    628        627.9      19.42

BTW: All these computations are summarized in the formula χ² = ∑(O−E)²/E, where the summation is over categories, not individual data points.

Properties of the χ² Distribution

Let’s pause to talk a little about the χ² distribution. χ² is actually a family of distributions distinguished by degrees of freedom (Greek ν, “nu”, or df). df = number of categories minus 1, so for the M&Ms example df = 6−1 = 5. BTW: The chi-squared distribution was developed independently by Ernst Carl Abbe (1840–1895, German) in 1863, by Friedrich Robert Helmert (1843–1917, German) in 1875, and by Karl Pearson (1857–1936, English) in 1900. The name “chi-squared” is due to Pearson, who also invented the goodness-of-fit test.

What does the distribution look like? Some good pictures on the Web show χ² curves for different df overlaid on one graph, but like this one at Wikipedia they really need to be seen on a color screen. So here are my own side-by-side shots.

[Figure: χ² distributions for df = 2, 3, 4, 5, 8, and 10, all on the same scale of χ² = 0 to 16]

Obviously χ² is skewed right, though less so for higher df. That makes sense, if you think about it. You compute χ² by adding positive numbers, so obviously it can’t have a left tail that goes below 0, as z and t do. And it also makes sense in terms of what you’re testing. Higher χ² represent poorer matches between model and data. χ² = 0 would mean that the data match the model exactly, which is extremely rare. Negative χ² would mean that the data and model are better than a perfect match, which obviously can’t happen. BTW: You might be interested to know that the mean of the distribution equals df, the mode is at df−2 (for df ≥ 2), and the median is about df−2/3. And then again, you might not.

p-values must be looked up in a table, or more likely with software or a calculator. When the χ² value is a lot bigger than the degrees of freedom, the data and the model are very different and the p-value is small. In the M&Ms example, df = 5 and χ² = 19.42, so the p-value is small. (In fact, it’s 0.0016.) BTW: Your TI-83 has the χ² distribution in the [2nd] [VARS] (makes [DIST]) menu, so if you needed to you could compute the p-value as χ²cdf(19.42,10^99,5). But in practice your calculator will give you the p-value automatically, the same way it does in z and t tests.
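If you want to check the theory away from the calculator, the whole M&Ms computation fits in a few lines. The sketch below is only an illustration: the helper chi2_upper_tail is my own name, not part of the book or the MATH200A program, and it relies on the standard closed-form series for a χ² upper tail with whole-number df, so it plays the role of χ²cdf(x, 10^99, df).

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-squared >= x) for a whole-number df (hypothetical helper)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) plus
    # sqrt(2x/pi) * exp(-x/2) * sum of x^j / (1*3*5*...*(2j+1))
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total)

# Company model and the class's observed counts, from the tables above
model = {"Blue": 0.24, "Brown": 0.13, "Green": 0.16,
         "Orange": 0.20, "Red": 0.13, "Yellow": 0.14}
observed = {"Blue": 127, "Brown": 63, "Green": 122,
            "Orange": 147, "Red": 93, "Yellow": 76}

n = sum(observed.values())                        # sample size, 628
expected = {c: model[c] * n for c in model}       # E = model share x n
contribution = {c: (observed[c] - expected[c]) ** 2 / expected[c]
                for c in model}                   # (O - E)^2 / E per row
chi2 = sum(contribution.values())                 # test statistic
df = len(model) - 1                               # categories minus 1
p_value = chi2_upper_tail(chi2, df)

for c in model:
    print(f"{c:7s} E = {expected[c]:7.2f}   contribution = {contribution[c]:.2f}")
print(f"df = {df}, chi-squared = {chi2:.2f}, p = {p_value:.4f}")
```

Because the expected counts here are not rounded to one decimal place first, the statistic comes out a little above the hand-computed 19.42.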

Hypothesis Test in Practice So much for the theory. But how will you test goodness of fit in practice? This section runs through the complete hypothesis test. There’s still some commentary, but the stuff in boxes is what you’d actually write for a quiz or homework. 1. Hypotheses

With goodness of fit, there’s no single population parameter to test for. (If you want to get technical, the population parameter is a probability distribution.) So you state the hypotheses in words, but usually including the model: (1) H 0 : The 24:13:16:20:13:14 color distribution is correct. H 1 : The color distribution on the Web site is incorrect.

2. Significance Level

Nothing new here: (2) α = 0.05

3–4. Test Statistic and p-Value

Here you have a choice. The MATH200A Program is easiest to use, and also saves you work with several other statistics procedures. If you don’t have the program, follow the procedure in Testing Goodness of Fit on TI-83/84. If you have a calculator in the TI-89 family, please see Testing Goodness of Fit on TI-89.

Put the model numbers in L1 (not the total). The model can be percentages or ratios. For example, with the M&Ms you can enter 24, 13, 16, and so on, or .24, .13, .16, and so on; it doesn’t matter as long as you’re consistent. Similarly, if you have a 9:3:3:1 model you can enter 9, 3, 3, 1 or 9/16, 3/16, 3/16, 1/16.

Put the observed counts in L2: counts, not ratios or percentages. Never enter the total.

Press [PRGM], then the number you see for MATH200A, then [ENTER]. Caution! Don’t press [] or []. Dismiss the splash screen and press [6] to select the GoF test. The confirmation screen asks you if you’ve entered the two necessary lists. If you have, press [9] [ENTER].





The program performs the computations and graphs the χ² curve, also showing the p-value, test statistic, and degrees of freedom. (In this case the graph looks blank because the p-value is so small.) You might notice that the test statistic is 19.44, not 19.42 as computed earlier. That’s because the calculator keeps many digits of precision, avoiding problems with rounding. The program tells you how many of the categories have expected counts below 5. (If any are below 5, it also tells you how many are below 1, but we don’t use that information in this book.) See Requirements Check below.

(3–4) MATH200A/GoF Test: df=5, χ²=19.44, p=0.0016

RC. Requirements Check

The program computes the expected counts, and places them in L3. As discussed above, you must not have any E’s below about 5 to be sure that the test procedure is valid. I said “about 5”. If the results screen shows one or more E’s below 5, look at L3 to see how far below 5. One expected count just a little below 5 is not necessarily a fatal flaw in the test. (RC)

Random sample? Not really, but we’re pretending. Sample size under 10% of population? Yes, 10×628 = 6280 is far less than the total number of M&Ms. All E’s ≥ 5? Yes, the smallest E in L3 is 81.64.

5. Decision Rule

This is the same for every type of hypothesis test. (5) p < α. Reject H0 and accept H1.

6. Conclusion

(6) At the 0.05 level of significance, the color distribution on the Web site is incorrect. [Or, “... the color distribution on the Web site is inconsistent with the data.”] Or, The color distribution on the Web site is inconsistent with the data (p = 0.0016).

Optional: Residuals

If you reject H0, can you say anything about which categories are most “responsible” for the overall deviation from the model? Yes. DeVeaux, Velleman, Bock (2009, 699–700) suggest that you can look at the standardized residuals (observed−expected)/√expected. These are essentially z-scores, and you recall that z has only a 5% chance of being outside ±2 if the null hypothesis is true. MATH200A part 6 already computes the squares of the residuals for you in list L4. The square of ±2 is 4, so when you look at list L4 after running the program, you can be pretty sure that any row with a value above 4 indicates a category that doesn’t match the model. (It’s more complicated, but that’s a decent rule of thumb.)

In this example, brown and green (rows 2 and 3) have squared residuals above 4. Therefore, for those colors, the differences between this sample and the model are probably significant. Remember that L2 is the observed counts in the sample, and L3 is the expected counts from the model for this sample size. You can see that there were significantly fewer browns than expected, and significantly more greens than expected. You might be a little suspicious of blue (row 1) and orange (row 4), but this sample’s differences from the model are probably not significant.

Even so, you can’t simply do 1-PropZTest on each category after rejecting H0 on your GoF test, because that would greatly increase your chance of a Type I error above your stated α. More advanced textbooks will suggest alternatives, such as adjusting the significance level or taking a new sample.

Confidence Intervals?

You may be wondering about computing a confidence interval. You can’t just do 1-PropZInt confidence intervals on the category proportions. A confidence interval is the complement of a hypothesis test, so multiple confidence intervals on the same data have the same problem as multiple hypothesis tests. Confidence intervals can be computed for individual categories or the overall model, but the techniques are beyond the scope of this course. If you’re interested, please look at Confidence Intervals for Goodness of Fit. It shows how to make the calculations and includes an Excel workbook with instructions.
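Returning to the optional residuals discussion, here is a small sketch of the rule of thumb for the M&Ms data: compute each standardized residual (observed−expected)/√expected and flag the rows whose square is above 4. The variable names and the flagging threshold as a variable are my own choices for illustration; the calculator program stores the squared residuals in L4 instead.

```python
import math

colors   = ["Blue", "Brown", "Green", "Orange", "Red", "Yellow"]
model    = [0.24, 0.13, 0.16, 0.20, 0.13, 0.14]   # company percentages
observed = [127, 63, 122, 147, 93, 76]            # class counts

n = sum(observed)
threshold = 4.0        # square of the +/-2 rule of thumb for z-scores
flagged = []

for color, share, obs in zip(colors, model, observed):
    exp = share * n                                # expected count
    residual = (obs - exp) / math.sqrt(exp)        # standardized residual
    squared = residual ** 2                        # same as chi-squared contribution
    if squared > threshold:
        flagged.append(color)
    print(f"{color:7s} residual = {residual:+.2f}   squared = {squared:.2f}")

print("Categories worth a closer look:", flagged)
```

The sign of each residual tells you the direction: negative means fewer than the model expects, positive means more.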

12A2. Second Problem: Fruit Flies

“A problem which frequently arises is that of testing the agreement between observation and hypothesis.” (Bulmer 1979, 154)

The 9:3:3:1 ratio for crosses is pretty basic in genetics, when two independent traits are involved. Here the traits are green or red eyes and having wings or not. Dabes and Janik (1999, 273) give some data for the hybrid offspring of fruit flies; see the figures below. The flies were randomly selected. Your task is to determine whether this cross follows the 9:3:3:1 model or not. Use α = 0.05.

                      Model ratio   Observed
Green-eyed winged     9             120
Green-eyed wingless   3             49
Red-eyed winged       3             36
Red-eyed wingless     1             12
Total                 16            217

Suggestion: Stop reading at this point, and try to write out all the steps on your own, using the preceding example as a model if you need to. Then compare your work to what follows.

BTW: You can read about the 9:3:3:1 ratio in many places such as Wikipedia’s Mendelian Inheritance: scroll down to “Law of Independent Assortment (the ‘Second Law’)”. A Web search for “9:3:3:1” will bring up plenty more.

Solution

(1) H0: The fruit flies follow the 9:3:3:1 model. H1: The fruit flies do not follow the model.

(2) α = 0.05

(3–4) MATH200A/GoF Test: df=3, χ²=2.45, p=0.4838

(RC) Random sample. 10×217 = 2170, and the number of fruit flies is far greater. All E’s are >5; the smallest is 13.563.

(5) p > α. Fail to reject H0.

(6)

At the 0.05 level of significance, we can’t say whether the fruit flies follow the 9:3:3:1 model or not. Or, We can’t say whether the fruit flies follow the 9:3:3:1 model or not (p = 0.4838).
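For readers who like to double-check the calculator, the fruit-fly numbers can be reproduced with a short sketch. It shows that a model given as a ratio just needs to be divided by its total before multiplying by the sample size. The p-value line uses a closed form that is valid only for df = 3, which is all this problem needs; the variable names are mine, and this is not how the MATH200A program is written.

```python
import math

ratio    = [9, 3, 3, 1]          # model entered as a ratio, as in L1
observed = [120, 49, 36, 12]     # observed counts, as in L2

n = sum(observed)                # 217 flies
parts = sum(ratio)               # 16 parts in the 9:3:3:1 ratio
expected = [r / parts * n for r in ratio]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1           # 4 categories minus 1

# Upper-tail probability; this closed form holds ONLY for df = 3.
p_value = (math.erfc(math.sqrt(chi2 / 2))
           + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

alpha = 0.05
print(f"smallest E = {min(expected):.3f}")
print(f"df = {df}, chi-squared = {chi2:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Entering the model as 9/16, 3/16, 3/16, 1/16 instead would give the same expected counts, since only the proportions matter.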

(Again, if you don’t have the program you can follow the procedure in Testing Goodness of Fit on TI-83/84.)

Scientific Method and Your Conclusions

While it’s true that this one experiment gave no conclusion, science wouldn’t stop there. You know that the scientific method calls for experiments to be replicated. Now, when the experiment is repeated, either H0 will be rejected or it will fail to be rejected. Here’s how those possibilities interact with what you’ve learned about writing conclusions.

If additional experiments do reject H0, then we conclude that H0 is actually false, and this first sample just happened to be one of the unlucky ones that failed to show an effect that actually exists. There’s one caveat, though. Experiments at the 0.05 significance level will reject H0 in about one case in twenty where it’s actually true. So while a “reject H0” deserves a lot of respect, if it’s one result out of dozens we can’t take it on its own as enough to overthrow H0.

If additional experiments still come up with a “fail to reject H0”, we begin to think that H0 is probably true. How can we do that on the basis of multiple experiments when we can’t do it from one experiment? Well, remember what “fail to reject H0” means: either H0 is actually true, or it’s actually false but this experiment’s sample happened not to show it. If it was actually false, we would expect most experiments to reject it. But as test after test fails to disprove H0, we grow more and more confident that it’s not going to be disproved.

For this reason, in scientific contexts the conclusion after failing to reject H0 is often written in terms like “the data are not inconsistent with the model” or even “we were unable to rule out the model.” The scientists are not accepting the null hypothesis here; they’re writing for a technical audience that understands what a “fail to reject H0” means.
When you’re writing for a general audience, stick to neutral language when you fail to reject H 0 .

12A3. Third Problem: Equal Preferences A store manager always has to decide how to use limited shelf space or freezer space most effectively. The store currently carries four brands of veggie burgers, and the manager wants to know if customers have a preference. (This is the last store in America that has not computerized its inventory.) She randomly selects a week, and finds the following sales figures: 145 Brand B, 195 Brand G, 189 Brand Q, and 153 Brand V. At the 0.05 level of significance, can you say that customers have equal or different preference for the brands? Solution You’re not explicitly given a model, so you have to develop one. But “equal preferences” must mean the expected counts are all equal, or in other words the numbers in the model are all equal. You could enter ¼:¼:¼:¼, or 1:1:1:1, or any numbers as long as it’s four equal numbers. (1)

H 0 : Consumers have equal preference for the four brands. H 1 : Consumers have unequal preference for the four brands. Comment: Students often write these backwards. Remember that H 0 is always some variation on “nothin’ special goin’ on here.” Preference for one brand over another would be something, so that must be H 1 .

(2) α = 0.05

(3–4) L1=1,1,1,1; L2=145,195,189,153. MATH200A/GoF Test: df=3, χ²=11.14, p=0.0110

(RC) Random sample? The week was random, and we assume that the week’s customers are representative. Sample size is 145+195+189+153 = 682, and 10×682 = 6820. There will be more than that number of shoppers, past, present, and future. All E’s equal 170.5, ≥ 5.

(5) p < α. Reject H0 and accept H1.

(6)

Consumers in general, at this store anyway, do have unequal preferences among the four brands (p = 0.0110). Or, At the 0.05 level of significance, we can say that consumers in general do have unequal preferences among the four brands.

So what should the manager do? The χ² test shows that brand preferences aren’t equal, and Brand B is clearly the loser in this sample, but is that really enough to throw out Brand B? I wouldn’t. Its χ² contribution is below the threshold discussed in Residuals, above. And it did sell only eight units fewer than Brand V; that’s just a 5% difference. Maybe in another week it might sell more. What the manager can do, now that the finger of suspicion is pointed at Brand B, is make another study (this time maybe taking two random weeks) and focus on just Brand B as a proportion of total veggie-burger sales. If they’re all equal, every brand would have 25%, so the manager might want to drop Brand B if a one-proportion test shows that less than say 20% of all sales are Brand B. But again, this would need to be a new sample, not just a 1-PropZTest on the data from this sample. You should never perform multiple significance tests on the same data.

12B. Testing for Independence or Homogeneity Summary:

You’ve already met two-way tables back in the chapter on probability. Now you’ll learn two types of inferences on those samples:

Test of independence: table of one population and two attributes
Test of homogeneity: table of two or more populations and one attribute

People may not always agree on whether a given situation is a test for independence or homogeneity, but that’s okay because the two tests are identical in every way; it’s just a matter of how you phrase your conclusions to match what you tested.

12B1. First Problem: Office Equipment

“In 1970, SCM surveyed 150 office managers in three states to see if typewriter brand preference varies between states.” The quote and the table below are from Dabes and Janik (1999, 274). They didn’t say, but presumably this was a random sample. They go on to ask, “Do [the] above indicate that brand preference depends on state? ... α = 0.05.”

Preference     IBM   SCM   Total
New York       35    35    70
Pennsylvania   25    15    40
Connecticut    30    10    40
Total          90    60    150

(A “typewriter” was a Stone Age piece of office equipment, sort of like a keyboard and printer fused into some sort of bizarre hybrid. Believe it or not, in the 1970s every business had several, and many homes had at least one. They were popular gifts for high-school graduation!)

The question seems clear: Does typewriter brand preference vary among states? But be careful in your thinking! The question is not asking whether preference varies among the managers surveyed. Obviously it does: NY has 50%–50%, PA has 63%–37%, and CT has 75%–25%. The question is whether this sample lets us conclude that brand preference varies among all managers in the three states. The first is descriptive statistics; this is inferential statistics.

But how to analyze it? Well, you have three populations, office managers in the three states. And you have one attribute, preferred brand. So you need to do a test of independence. As always, the first step is to set up your hypotheses. Recall that H0 is always some variant of “Nothin’ goin’ on here” or “no effect”. So your null hypothesis must be that brand preference is independent of state, and the alternative naturally is that brand preference depends on (is associated with, varies by, is not independent of) state.

What do you do to come up with a test statistic and a p-value? As usual, the calculator is your friend. But as usual, first I’ll take you on a little tour so that you understand what you’re testing.
The Theory

Just like goodness of fit, two-way tables are analyzed using the χ² distribution. So you are once more concerned with the differences between observed and expected, and χ² will be the sum of (observed − expected)² / expected just as it was in goodness of fit. But the computation of “expected” is a bit more complicated.

Computing Expected Counts (E’s)

What is meant by “expected” for this two-way table? Well, in the overall sample IBM was preferred 60–40 over SCM: 90/150 = 60%, 60/150 = 40%. So if brand preference doesn’t vary by state (that is, if H0 is true) you would expect that same 60–40 split in each state.

Once you’ve got that, it’s just a matter of applying the 60–40 split to each state. In New York, 60% of 70 is 42, and 40% of 70 is 28. (Conventionally, the expected numbers are written in parentheses in the cell, under the observed numbers.) Pennsylvania and Connecticut are filled in the same way in the table below.

Preference       IBM     SCM    Total
New York          35      35       70
                 (42)    (28)
Pennsylvania      25      15       40
                 (24)    (16)
Connecticut       30      10       40
                 (24)    (16)
Total             90      60      150
                (60%)   (40%)

This table just happens to have whole numbers for all the expected counts. But it’s possible and okay for expected numbers to be decimals.

BTW: There’s an alternative formula for the expected numbers. You may have observed that this was a two-pass procedure: first calculate the overall percentage preferences, and then apply those percentages. There exists a one-pass procedure that is mathematically equivalent:

expected = (row total) × (column total) / (grand total)

For example, the expected count for NY IBM is 70×90/150 = 42, and similarly for all the others. This is a neat formula, but then you never get to see the real point, which is that equal percentage split among all the populations.
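If you’d like to check the expected counts away from the calculator, here’s a minimal sketch in Python. Python isn’t part of this course, and the function name expected_counts is a hypothetical one chosen for illustration. It applies the one-pass formula, (row total) × (column total) / (grand total), to the typewriter table, with the total row and total column left out of the input.

```python
def expected_counts(observed):
    """Expected counts for a two-way table if H0 is true.

    observed: list of rows of observed counts, WITHOUT the total row
    or total column.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # One-pass formula: (row total) x (column total) / (grand total)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Typewriter survey: rows are NY, PA, CT; columns are IBM, SCM
observed = [[35, 35], [25, 15], [30, 10]]
expected = expected_counts(observed)
for state, row in zip(["NY", "PA", "CT"], expected):
    print(state, row)
```

The printed rows should agree with the numbers in parentheses in the table above.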

The expected counts are how you test requirements. They’re exactly the same as for goodness of fit: random sample less than 10% of population, with all expected counts at least 5. (Again, some authors require only that none of the E’s can be below 1, and no more than 20% of them can be below 5.)

What do you do if your E’s are too small? You can combine smaller categories or smaller populations, if the combination seems reasonable. For example, if this was a four-row table and included Rhode Island, but the RI expected counts were too low, you could combine CT and RI since they’re adjacent small states in coastal New England. (If you had 50 states, you wouldn’t combine Rhode Island and Wyoming, because they’re geographically and demographically different.)

But it’s best to plan ahead and not get into this position. You should make some kind of guess about how the percentages will work out, and then plan a large enough sample in each population based on the percentages you expect.

Computing χ² and p-Value

Eyeballing the observed and expected numbers doesn’t really tell you much. You can see that observed and expected are pretty close in PA but further apart in NY and CT. But what you can’t see is whether that difference is too great to be purely the result of random sample selection. For that, you need to compute χ² in each of the six cells and add them up. Those computations are shown in the table here.

χ² Contributions   IBM                   SCM
New York           (35−42)²/42 = 1.17    (35−28)²/28 = 1.75
Pennsylvania       (25−24)²/24 = 0.04    (15−16)²/16 = 0.06
Connecticut        (30−24)²/24 = 1.50    (10−16)²/16 = 2.25

Add up those six numbers and you have your test statistic: χ² = 6.77.

Degrees of freedom is a bit different from the goodness-of-fit case. You might expect df to be 6−1=5, but for two-way tables it’s actually

df = (rows−1) × (columns−1)

You don’t count the total row and total column. For this table, df = (3−1)×(2−1) = 2.

Finally, computing χ²cdf(6.77,∞,2) gives p-value = 0.0339.

Hypothesis Test in Practice

So much for the theory. But how will you perform the test in practice? This section runs through the complete hypothesis test. There’s still some commentary, but the stuff in boxes is what you’d actually write for a quiz or homework.

1. Hypotheses

With independence and homogeneity (two-way tables), there’s no single population parameter to test for. So you state the hypotheses in words. In a test of independence, H0 is always independence, and H1 is always dependence.

(1) H0: Brand preference doesn’t vary by state [or, is independent of state, is not associated with state, etc.].
H1: Brand preference varies by state [or, is dependent on state, is not independent of state, is associated with state, etc.].

2. Significance Level

Nothing new here: (2) α = 0.05

3–4. Test Statistic and p-Value

Here you have a choice. The MATH200A Program is a bit easier to use and gives more information, but you can also use the native TI-83/84 test called χ²-Test. (In the TI-89 family, it’s Chi2 2-way.) Either way, start by putting the observed numbers into a matrix, as follows:

1. If you have a [MATRX] key, press it. But you probably don’t, so press [2nd x⁻¹ makes MATRX]. You get the matrix menu, similar to the one shown at right.

2. Unlike the stats menu, the matrix menu doesn’t come up ready for editing. You have to press [] [] before pressing [ENTER]. You’re then prompted for the numbers of rows and columns, not including the total row and total column. As you enter the number of rows and number of columns, the matrix changes shape to match.

3. Enter the observed numbers in the matrix. As you press the [ENTER] key after entering each number, the calculator automatically moves to the next cell.

After you fill matrix A with the observed numbers, it’s time to perform the calculations.

With the MATH200A program (recommended):

Press [PRGM], then the number you see for MATH200A, then [ENTER]. Dismiss the splash screen and press [7] to select the two-way test.

As soon as you make the selection, the program begins computing. (It assumes that you have the observed counts in matrix A; it knows how to compute the expected numbers and puts them in matrix B automatically.) The results look like this:

You’ll notice that the program tells you that it put some information in matrix C. Under Residuals and More, below, we’ll look at that.

(3–4) MATH200A/Two-way table df=2, χ²=6.77, p=0.0339

If you’re not using the program:

Press [STAT] [] and then press [s] repeatedly to get to χ²-Test. Press [ENTER].

Chances are good that Observed and Expected will already show [A] and [B]. If not, change either or both by pressing [2nd x⁻¹ makes MATRX] (or the [MATRX] key, if you have one), and then [1] to select matrix A or [2] to select matrix B.

Select Calculate and press [ENTER]. You should see these results:

(3–4) χ²-Test χ²=6.77, df=2, p=0.0339
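For readers who want to see what the calculator is doing in steps 3–4, here’s a rough sketch in Python (not part of this course). The names chi2_sf and two_way_test are hypothetical, invented for this example. The p-value routine uses a standard series for the incomplete gamma function, which is an assumption about one reasonable way to compute it, not a description of how the TI-83/84 or MATH200A computes it.

```python
import math


def chi2_sf(x, df):
    """Right-tail area of the chi-square distribution: P(X >= x)."""
    if x <= 0:
        return 1.0
    a, z = df / 2, x / 2
    # Series for the regularized lower incomplete gamma function P(a, z)
    term = 1 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1 - lower


def two_way_test(observed):
    """Return (chi2, df, p) for a table without total row or column."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            chi2 += (obs - e) ** 2 / e  # one cell's chi-square contribution
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, chi2_sf(chi2, df)


# Typewriter survey: rows NY, PA, CT; columns IBM, SCM
chi2, df, p = two_way_test([[35, 35], [25, 15], [30, 10]])
print(f"chi2={chi2:.2f}, df={df}, p={p:.4f}")
```

The printed values should agree with the calculator results above to the digits shown.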

RC. Requirements Check

In addition to a random sample less than 10% of population, you need all the E’s to be at least 5. Don’t just say this without checking, because sooner or later you’ll have a case where that’s not true.

With the MATH200A program (recommended):

The results screen (above) shows you how many categories had expected counts below 5. A piece of cake! The expected counts are stored in matrix B in case you need to look at them. (For classes that use the looser requirement, if any E’s are below 5 then the program tells you how many are below 1.)

If you’re not using the program:

The calculator puts the expected counts in matrix B while doing the χ²-Test. To view matrix B, press [2nd x⁻¹ makes MATRX] [] [] [2]. Look at every value (you may have to scroll) for any that are below 1, or below 5 but ≥1.

In this case you see that all expected counts are above 5.

(RC) Random sample. NY 10×70 = 700, PA and CT 10×40 = 400. Surely there are many more office managers in those states. All E’s in [B] are >5.
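The same requirements check can be sketched in Python (not part of this course). The function name requirements_report and the dictionary keys are hypothetical, made up for this illustration. It counts expected values below 5 and below 1, applies both the strict rule and the looser rule mentioned earlier, and lists the minimum population size each sample would need under the 10% rule.

```python
def requirements_report(observed):
    """Summarize the expected-count checks for a two-way table.

    observed: rows of observed counts, without total row or column.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [r * c / n for r in row_totals for c in col_totals]
    below_5 = sum(e < 5 for e in expected)
    below_1 = sum(e < 1 for e in expected)
    return {
        "smallest_expected": min(expected),
        "below_5": below_5,
        "below_1": below_1,
        # Strict rule: every expected count is at least 5
        "strict_ok": below_5 == 0,
        # Looser rule: none below 1, and no more than 20% below 5
        "looser_ok": below_1 == 0 and below_5 <= 0.20 * len(expected),
        # 10% rule: each population must be at least 10 x its sample
        "population_needed": [10 * r for r in row_totals],
    }


report = requirements_report([[35, 35], [25, 15], [30, 10]])
for key, value in report.items():
    print(key, value)
```

Whether the real populations are big enough is still a judgment you have to make yourself; the code only tells you how big they need to be.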

5. Decision Rule

This is the same for every type of hypothesis test. (5) p < α. Reject H0 and accept H1.

6. Conclusion

(6) At the 0.05 level of significance, brand preference does vary by [or is dependent on, associated with, not independent of] state. Or, Brand preference does vary by [or is dependent on, associated with, not independent of] state (p = 0.0339).

Optional: Residuals and More

Just as with the goodness-of-fit test, if you reject the null hypothesis then you can look at the standardized residuals, which are

(observed − expected) / √expected

A standardized residual outside of ±2 is probably significant. The χ² contributions are the squares of the standardized residuals, so a χ² contribution above 4 is probably significant.

The MATH200A program shows you the χ² contributions and more. (If you used the TI’s native χ²-Test and want this information, you have to compute it for yourself.) To view the additional information, which is in matrix C, press [2nd x⁻¹ makes MATRX] [] [3]. Here’s what it looks like for the typewriter survey. (I’ve pasted screens together to save the effort of scrolling.)

Unfortunately it’s not possible to put captions in a matrix, but here’s your guide to interpreting it. There are three regions: the χ² contributions, the row and column totals, and the row and column percentages. The following paragraphs show and explain those.

The χ² contributions are in the upper left corner of the matrix, with the size and shape of the original matrix. The original matrix had three rows and two columns, so you want to look at the top left 3×2.

As I mentioned at the start of this section, if you’re able to reject H0 then a χ² contribution >4 is probably significant. In this problem, your p-value was < α and you did reject H0, but the χ² contributions are all well under 4. How do you interpret this? There is indeed some variation in brand preference among states, but you can’t tell just where it is.

Isn’t this kind of a paradox? Maybe, but I would say instead that the sample was large enough to show that some effect existed, but not large enough to show the details of the effect. If you repeat the survey with a larger sample, you might be able to learn more.

“But wait!” I hear you say. “Isn’t it obvious? NY was 50–50, PA was just over 60–40, and CT was 75–25. Isn’t it obvious that CT is very different from NY?” Yes, it is, in the sample. But you don’t know whether it’s true in the population. For all you know, the NY sample just happened to under-represent IBM lovers and over-represent SCM lovers, and CT the opposite, so that in another sample the proportions might be reversed. You simply don’t have enough information to draw more detailed conclusions.

The row and column totals are in the next section, in an extra column and an extra row. This particular problem gave them to you, but many problems do not, and you’d be surprised how hard it is to grasp a problem and interpret a result without this information. Here you see that the three states’ sample sizes were 70, 40, and 40; the overall preference for the two manufacturers was 90 and 60. Of course whether you add down or across, you get the same 150 for overall sample size.

Finally, you have the row and column percentages in the last column and last row. What is this telling you? NY was 46.7% of the whole sample, and PA and CT were each 26.7%. IBM lovers were 60% of the whole sample, and SCM lovers were 40%. The two 0’s are just space fillers; the 100 at the lower right reminds you that the row percentages and column percentages add up to 100% in either direction.

Why do you care about the row and column percentages? Because they explain what the null hypothesis means. The null hypothesis is that brand preference is the same among states. So if the null hypothesis is true, then NY, PA, and CT all have the same 60–40 split between IBM and SCM that the overall sample has. (I figured the percentages by hand in Computing Expected Counts (E’s).)

You can read the percentages in the other direction, too. It doesn’t have much use in this particular problem, but you can do it. NY was 46.7% of the whole sample, so if H0 is true then 46.7% of the IBM lovers and 46.7% of the SCM lovers should be in the NY sample. Similarly, if H0 is true then PA and CT should each have 26.7% of the IBM lovers and 26.7% of the SCM lovers. And if you look back at the matrix of expected counts you’ll see that it matches. (Hey, I told you it wasn’t very useful in this situation! But there are others where it can be useful to read the table down or across.)

BTW: If your two-way table has two rows and two columns, you’re testing the proportions in two populations. That’s a case you had back in Chapter 11. In this situation, you can do a χ² test or a 2-proportion z test; you’ll get the same p-value from both. But if you want to know the size of the difference between the two population proportions, you have to do a 2-proportion z interval. (There is such a thing as a confidence interval in the χ² procedure, but it’s pretty gnarly and we don’t study it in this course.)
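If you used the native χ²-Test and have to compute the residuals for yourself, a short Python sketch like this one can do it (Python isn’t part of this course, and residual_tables is a hypothetical name used only here). It computes the standardized residuals, (observed − expected) / √expected, and squares them to get the χ² contributions for the typewriter survey.

```python
import math


def residual_tables(observed):
    """Return (standardized residuals, chi-square contributions)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(observed):
        res_row = []
        for j, obs in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            res_row.append((obs - e) / math.sqrt(e))
        residuals.append(res_row)
    # Each contribution is the square of its standardized residual
    contributions = [[r ** 2 for r in row] for row in residuals]
    return residuals, contributions


# Typewriter survey: rows NY, PA, CT; columns IBM, SCM
residuals, contributions = residual_tables([[35, 35], [25, 15], [30, 10]])
for state, res, con in zip(["NY", "PA", "CT"], residuals, contributions):
    print(state, [round(r, 2) for r in res], [round(c, 2) for c in con])
```

None of the residuals should fall outside ±2, which matches the observation above that every χ² contribution is under 4.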

12B2. Second Problem: The “Monday Effect”

It’s a persistent idea that cars manufactured on Monday are of lower quality because the workers are recovering from wild weekends. But is it true? A quality analyst randomly chose 100 records from each weekday over the past year and obtained the following results:

                Mon   Tue   Wed   Thu   Fri
Defective        15    10     5     5    10
Non-defective    85    90    95    95    90

At the 0.05 significance level, are the proportions of defective cars different on different days?

Comment: The way I phrased it, this is a problem of homogeneity, five populations (Monday cars, Tuesday cars, and so on) with one attribute (defectiveness). But the question could just as well be asked whether likelihood of a defect depends on the day of the week when the car was manufactured. In those terms, it’s a test of independence: one population (cars) with two attributes (day manufactured, and defectiveness).

This is a good illustration of what I said in the summary: many situations can be treated equally well as tests of independence or tests of homogeneity. Fortunately, the procedure is identical; you just need to state your hypotheses and your conclusions in terms of what you were asked to test.

Solution

(1) H0: Proportions of defective cars are the same on all weekdays.
H1: Proportions of defective cars are different on different weekdays.

(2) α = 0.05

(3–4) χ²-Test or MATH200A/Two-way table
df=4, χ²=8.55, p=0.0735

(RC) Random sample. We don’t know production figures, but we can be confident they’re far higher than 10×500 = 5000 during the year. All E’s ≥ 5.

(5) p > α. Fail to reject H0.

(6) At the 0.05 significance level, it’s impossible to say whether the proportions of defective cars are the same or different on different days.
Or, We can’t determine whether the proportions of defective cars are the same or different on different days (p = 0.0735).
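To see the homogeneity idea in numbers, here’s a small Python sketch (not part of this course; the variable names and the helper chi2_sf_even_df are hypothetical). If H0 is true, every weekday shares the pooled defect rate, so the expected counts come straight from that rate. The p-value uses a closed-form shortcut that is valid only when df is an even number, which happens to be the case here; that shortcut is a convenience for this example, not the calculator’s method.

```python
import math


def chi2_sf_even_df(x, df):
    """Right-tail chi-square area; closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this shortcut needs an even number of df")
    z = x / 2
    return math.exp(-z) * sum(z ** k / math.factorial(k) for k in range(df // 2))


defective = [15, 10, 5, 5, 10]  # Mon through Fri
sample_size = 100               # records drawn per weekday

# Under H0 every day has the same (pooled) proportion of defectives
pooled = sum(defective) / (sample_size * len(defective))

chi2 = 0.0
for d in defective:
    cells = [
        (d, sample_size * pooled),                      # defective
        (sample_size - d, sample_size * (1 - pooled)),  # non-defective
    ]
    for observed, expected in cells:
        chi2 += (observed - expected) ** 2 / expected

df = (2 - 1) * (len(defective) - 1)
p = chi2_sf_even_df(chi2, df)
print(f"pooled rate={pooled:.2f}, chi2={chi2:.2f}, df={df}, p={p:.4f}")
```

The results should agree with step (3–4) of the solution, and since p is above 0.05 the decision is the same: fail to reject H0.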

12B3. Third Problem: Tobacco Smoke and Tumors

“The cancer-producing potential of pyrobenzene (a major constituent of cigarette smoke) was tested. Eighty mice were used as a control group with no exposure to pyrobenzene, eighty more were exposed to a low dose, … and another seventy were given a high dose.”

             No tumors   One tumor   Two or more tumors   Total
Control             74           5                    1      80
Low dose            63          12                    5      80
High dose           45          15                   10      70
Total              182          32                   16     230

Can you prove that pyrobenzene dosage affects production of tumors, at the 0.01 significance level, or “could the above apparent difference be due to random chance?”

Source: Dabes and Janik (1999, 53–54)

Please stop here and write out your complete hypothesis test on paper, then check your solution against mine.

Solution

(1) H0: Pyrobenzene dosage does not affect tumor production.
H1: Pyrobenzene dosage affects tumor production.

Caution! Students sometimes write their null hypothesis as something like “random chance can account for the observed difference.” Yes, if your p-value is high, it means that random chance could explain the observed sample, but that doesn’t mean that it is the explanation. There’s always the possibility that dosage does affect tumors but this sample just happened not to show it. So write your H0 and H1 as contrasting statements about tumors and dosage, as I’ve done here.

(2) α = 0.01

(3–4) χ²-Test or MATH200A/Two-way table
df=4, χ²=19.25, p=7.012755E-4 or about 0.0007

Caution! You have a 3×3 table, not a 4×4 table. Never enter total rows or total columns in your matrix. (If you got df=9, you made that mistake.)

(RC) Random sample? In a lab setup with mice, we can assume so. (Most likely the mice were genetically identical, purchased from a supply house.)
Sample size less than 10% of population? Yes, the population of mice is unlimited (unfortunately).
All E’s ≥ 5? Look at matrix B, and you see that one of the nine expected counts is below 5: specifically, 4.8696. That’s just a smidge below 5, and given the very low p-value it’s not a problem.

(5) p < α. Reject H0 and accept H1.

(6) At the 0.01 significance level, we can conclude that pyrobenzene dosage level does affect production of tumors.
Or, Pyrobenzene dosage level does affect production of tumors (p = 0.0007).

Comment: Why can we make a statement of cause here, rather than the weaker “is associated with”? Answer: This was an experiment, and the mice were either genetically identical or randomly assigned to the three groups, or both.
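The caution about total rows and columns in step (3–4), and the smallest expected count in the requirements check, can both be illustrated with a few lines of Python (not part of this course; degrees_of_freedom and the variable names are hypothetical). The sketch computes df from the shape of the matrix, first for the correct 3×3 table and then for the same table with totals wrongly included.

```python
def degrees_of_freedom(matrix):
    """df = (rows - 1) x (columns - 1), taken from the matrix shape."""
    return (len(matrix) - 1) * (len(matrix[0]) - 1)


# Rows: control, low dose, high dose; columns: 0, 1, 2+ tumors
tumors = [[74, 5, 1], [63, 12, 5], [45, 15, 10]]

row_totals = [sum(row) for row in tumors]
col_totals = [sum(col) for col in zip(*tumors)]
n = sum(row_totals)

# The mistake: entering the totals as if they were data
with_totals = [row + [total] for row, total in zip(tumors, row_totals)]
with_totals.append(col_totals + [n])

smallest_expected = min(r * c / n for r in row_totals for c in col_totals)

print("correct df:", degrees_of_freedom(tumors))
print("df with totals entered by mistake:", degrees_of_freedom(with_totals))
print("smallest expected count:", round(smallest_expected, 4))
```

The smallest expected count belongs to the high-dose, two-or-more-tumors cell: 70×16/230.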

12C. But Wait, There’s More!

Every course has to draw the line somewhere and leave out lots of interesting things, and this one is no exception. Inferential Statistics Cases lists quite a few cases that we don’t have time to study in this course. For those who are interested, here are handouts for some of the cases that I had to leave out:

In this chapter, you studied only the hypothesis test for goodness of fit. It’s possible to do a confidence interval as well, but it’s tricky because every category can vary simultaneously. See Confidence Intervals for Goodness of Fit, which includes Excel workbooks to do the calculations.

As you know, the sampling distribution for means is the t distribution. But standard deviations of samples vary according to a χ² distribution. You can do Inferences about One Population Standard Deviation. That page includes an Excel workbook for the calculations, or you can use MATH200B Program part 5.

In Chapter 10 you tested the mean of one population, and in Chapter 11 you tested the means of two populations. It’s possible to test the means of three or more populations, and your calculator already contains the needed command. See One-Way ANOVA.

Back in Chapter 4, you computed the correlation coefficient of your sample and the best fitting regression line for your sample. You also did some primitive inference about the correlation coefficient by using a table of decision points. You can actually do a hypothesis test and confidence interval about the correlation coefficient of a population, as explained in Inferences about Linear Correlation. That page includes an Excel workbook for the calculations (including where the decision point numbers come from), or you could use MATH200B Program part 6.

The slope and intercept you computed for your regression line are actually random variables. If you had a different sample, you would have come up with a different regression line. For hypothesis test and confidence interval about the regression line, see Inferences about Linear Regression. There’s an Excel workbook for calculations, or you can use MATH200B Program part 7.

What Have You Learned?

Key ideas:

With non-numeric data involving three or more responses to a variable, multiple populations, or both, each bucket contains some number of data points.

The expected count for each bucket is the number you would expect to see in that bucket if H0 is true. (Although actual counts are whole numbers, the expected counts typically aren’t.)

The observed counts are usually different from the expected counts. The χ² statistic measures how different, as one number for the whole model. The p-value says how likely this difference (or a greater difference) is if H0 is true.

When you have one population with one categorical variable, you test goodness of fit to a model, Case 6. H0 is that the model is consistent with the data, and H1 is that it’s not consistent. Use MATH200A Program part 6.

When you have one population with two categorical variables, you test independence of the two attributes. When you have two or more populations with one categorical variable, you test homogeneity. The line between independence and homogeneity is very blurry, but fortunately both are computed exactly the same way, in Case 7. H0 is that the variables are independent or the population proportions are all equal; H1 is that the variables are not independent or some population proportions are different from others. Use MATH200A Program part 7 or χ²-Test.

Requirements for both χ² tests are a random sample, sample size less than 10% of population, and all expected counts at least 5.

These cases have only hypothesis tests. Although confidence intervals exist, we don’t study them in this course.

Study aids:

Inferential Statistics: Basic Cases

Interactive: Triage: Which Inferential Stats Case Should I Use?
Seven Steps of Hypothesis Tests
Top 10 Mistakes of Hypothesis Tests
Statistics Symbol Sheet


Exercises for Chapter 12 Write out your solutions to these exercises, showing your work for all computations. Then check your solutions against the solutions page and get help with anything you don’t understand. Caution! If you don’t see how to start a problem, don’t peek at the solution — you won’t learn anything that way. Ask your instructor or a tutor for a hint. Or just leave it and go on to a different problem for now. You may find when you return to that “impossible” problem that you see how to do it after all.

1

Explain the difference between a model of 25% to 40% to 35% and a model of 25 to 40 to 35.

2

You think that people’s ice cream favorites are 25% each vanilla and chocolate, 20% strawberry, 15% butter pecan, 8% rocky road, and 7% other or no preference. You randomly survey 1000 people and find preferences are 220 vanilla, 255 chocolate, 190 strawberry, 170 butter pecan, 95 rocky road, and 70 other or no preference. Using α = 0.05, was your idea right or wrong?

3

Democrats and Republicans (random sample) were surveyed for their opinions on gun control, and the results are shown in the table below. Based on this sample, does a person’s opinion on gun control depend on party affiliation, at the .05 level of significance?

              favor   oppose   unsure   total
Democrat        440      400      120     960
Republican      320      480      100     900
total           760      880      220    1860

4

In a random sample, 425 first graders were surveyed about what they want to be when they grow up, out of a choice of five professions. The results were Teacher 80, Doctor 105, Lawyer 70, Police officer 70, Firefighter 100. Obviously these particular children preferred some occupations over others. Test whether their preferences reflect a real difference among all first graders, at the 0.05 significance level.

5

A random sample of women were observed for their consumption of eggs and their age at menarche:

                          Egg Consumption
Age at Menarche   Never   Once a week   2–4 times a week
Low                   5            13                  8
Medium                4            20                 14
High                 11            18                 15

Data were adapted from Kuzma and Bohnenblust (2005, 224). Test, at the 0.01 level, whether age at menarche is independent of egg consumption.

6

In Jury Selection, George Michailides (n.d.a) quotes a study of the age breakdown of grand jurors in Alameda County, California, in 1973. It’s pretty obvious that this sample doesn’t match the age distribution of the county, but is the discrepancy too great to be random chance? Choose an appropriate significance level. (This isn’t a random sample, but that’s okay because you’re not asked to generalize to a population. For the same reason, the “≤10% of population” requirement doesn’t apply.)

Age      County   Jurors
21–40      42%        5
41–50      23%        9
51–60      16%       19
≥ 61       19%       33
Total     100%       66

7

Do men choose the size town to live in based on the size town they grew up in? 500 men (a simple random sample) were surveyed. Use α = 0.05.

                            Now residing in
Raised in        < 10,000   10,000–49,999   ≥ 50,000   Total
< 10,000               24              45         45     114
10,000–49,999          18              64         70     152
≥ 50,000               21              54        159     234
Total                  63             163        274     500

8

Is Echinacea effective against the common cold? The New England Journal of Medicine (Turner 2005) reported a study on 437 volunteers who were randomly assigned to receive a CO2 extract of Echinacea, a 60% alcohol extract, a 20% alcohol extract, or a placebo for seven days before and five days after exposure to rhinovirus type 39. Some withdrew or were excluded from the study for another reason. The final results are shown below. The common cold is annoying but rarely fatal, so use α = 0.01 to determine whether Echinacea makes a difference in the likelihood of catching a cold.

Day −7 to 0    Day 0 to 5     Cold   No Cold   Totals
CO2 extract    CO2 extract      40         5       45
60% extract    60% extract      42        10       52
20% extract    20% extract      48         4       52
Placebo        CO2 extract      43         5       48
Placebo        60% extract      44         4       48
Placebo        20% extract      44         7       51
Placebo        Placebo          88        15      103
Totals                         349        50      399




What’s New

21 Jan 2015: In the hypotheses for equal preferences, explain why they have to be as written and can’t be the other way around. Do a better job with a requirements check. Numerous small edits for clarity and correction of a few typos.
30 Nov 2014: Add a problem on the effectiveness of Echinacea.
(intervening changes suppressed)
6 Jan 2013: New document.

Review Updated 9 Dec 2015 (What’s New?) Summary:

Even if you’ve been doing all the work and keeping up with the course, the mass of material you need to know for the exam can be overwhelming. This page helps you identify what’s most important in preparing for the exam. (If you’re an independent learner, it points you to the most important things you should have learned from your study.)

Contents:

What’s Important?
    Do This for Every Chapter
    Finish with Overall Course Review
    Links to “What Have You Learned?”
Review Problems
    Problem Set 1: Short Answers
    Problem Set 2: Calculations
What’s New

What’s Important? Here are your guidelines for reviewing the subject matter.

Do This for Every Chapter

Read the Summary at the beginning, when there is one.

Notice when a section is marked optional. When you’re doing your final studying, spend your time on the core concepts, not the optional extras.

Scroll through the chapter and look at the definitions. Do you understand the meaning of each term and how to use it?

Scroll through the chapter again, and this time look at section heads and key concepts, which are marked in bold. Make notes for your cheat sheet of anything important that you think you might forget.

Pay attention also to calculator procedures and formulas. Know when and how to carry out each calculator procedure, and when and how to use the very few formulas that aren’t built into your calculator or the MATH200A program. There’s at least one example for each calculator procedure and each formula, so work through it if you need to refresh your memory. Again, make notes for your cheat sheet of anything you’re likely to forget.

The “What Have You Learned?” section at the end of the chapter lists the most important concepts. (Links to “What Have You Learned?” are below.) If you’ve actually learned everything listed there, you should be in good shape for the exam. If you haven’t, review that section of the text and work the examples.

Glance over the chapter exercises. If you had trouble with any of them before, make sure you thoroughly understand it now.

Finish with Overall Course Review

So much for the trees. Now it’s time to think “forest”.

Go through the cheat sheets you just made, and boil them down to one sheet, front and back. Making a one-sheet cheat sheet is always useful, even if the exam will be open book or if the instructor doesn’t allow any notes at all. Writing your summary of the course helps you make sense of the course material, to see it as a whole instead of an unrelated jumble of facts to memorize.

Practice with the review problems below. If you can’t work a problem, go back and learn what you’re missing. If something is missing from your cheat sheet, add it.

Get a good night’s rest the night before the exam. Sleep deprivation makes people make stupid mistakes, so protect yourself from that.

Study aids:

TI-83/84 Cheat Sheet

Histogram Versus Bar Graph
Inferential Statistics: Basic Cases
Interactive: Triage: Which Inferential Stats Case Should I Use?
Seven Steps of Hypothesis Tests
Top 10 Mistakes of Hypothesis Tests
Paired and Unpaired Data Compared
Statistics Symbol Sheet

Links to “What Have You Learned?”

Statistics!
Graphing Your Data
Numbers about Numbers
Linked Variables
Probability
Discrete Probability Models
Normal Distributions
How Samples Vary
Estimating Population Parameters
Hypothesis Tests
Inference from Two Samples
Tests on Counted Data

Review Problems Here are practice problems to help you test your knowledge and prepare for the final exam. Solutions are provided, but make a genuine effort to work any given problem on your own before you turn to the solution. How to use:

Don’t necessarily make it your goal to work every problem. But do at least look at every one and make sure that you can set it up correctly. Your success on the final exam hinges on your ability to identify which type of problem you are facing.

Don’t panic!

This problem set is much longer than the exam will be, and some problems are harder than the problems you will meet on the exam.

Problem Set 1: Short Answers

Write your answer to each question. There’s no work to be shown. Don’t bother with a complete sentence if you can answer with a word, number, or phrase.

1

Two events A and B are disjoint. Is it possible for those same events to be independent as well? Give an example, or explain why it’s impossible.

2

Yummo candy bars are supposed to have an average weight of 87.5 grams (about three ounces). To test this, a team of students bought one Yummo bar from each of the six stores in the village of Carlyle and weighed it.

(a) The data would best be analyzed as an example of
A. one population proportion
B. two populations, difference in proportions
C. one population mean
D. two populations, difference in means, paired data
E. two populations, difference in means, unpaired data
F. goodness of fit
G. contingency table

(b) Which two tests must you perform on your sample data before doing the analysis mentioned above? (In other words, how would you make sure that the sample meets the requirements?)

3

The two main types of data are qualitative and quantitative. What other names can you give for each? Give an example of each.

4

The probability of rolling a 6 on an honest die is 1/6. If you roll an honest die ten times and none of the rolls comes up 6, is the probability of rolling a 6 on the next roll less than 1/6, equal to 1/6, or greater than 1/6? Explain why.

5

In a large elementary school, you select two age-matched groups of students. Group 1 follows the normal schedule. Group 2 (with parents’ permission) spends 30 minutes a day learning to play a musical instrument. You want to show that learning a musical instrument makes a student less likely to get into trouble. You consider a student in trouble if s/he was sent to the principal’s office at any time during the year. (a) Write your hypotheses, in symbols. (b) Identify either the case number or the specific TI-83 test you would use.

6

Imagine rolling five standard dice. You compute the probability of rolling no 3s, one 3, and so on up to five 3s. Is this a binomial probability distribution? With reference to the definition of a binomial PD, why or why not?

7

Over the course of many statistical experiments, which one of these values for the significance level would enable you to prove the most results?
A. 5%
B. 1%
C. 0.1%
D. Significance level has no effect on how likely you are to prove a hypothesis.

8

A key step in hypothesis testing is computing a p-value and comparing it to your preselected α. After you do that, which of the following conclusions would be possible, depending on the specific values of p and α? (Write the letter of each correct answer; there may be more than one.) A. Accept H₀, reject H₁ B. Reject H₀, accept H₁ C. Fail to accept H₀, no conclusion D. Fail to reject H₀, no conclusion

9

Distinguish disjoint events, mutually exclusive events, and complementary events. Give an example of each.

10 11 12

When is a histogram an appropriate graphical method of presentation?

13

What are the two types of numeric data called? Explain the difference, and give an example of each.

For what type of events does P(A or B) = P(A) + P(B)? Give an example.

In a χ² goodness-of-fit test, which of the following is/are true? (A question with this many technical alternatives will not be on the exam. Just use it to test your own understanding of χ².) A. The hypotheses are stated in words rather than relating some population parameter to a number. B. The null hypothesis is always some variation on “the observed sample matches the model.” C. The alternative hypothesis is always some variation on “our model is good.” D. Instead of a p-value, we compare the value of χ² to α to draw a conclusion. E. Degrees of freedom equals the number of cells in our model. F. If the difference between our observed results and our expected results could likely have occurred by random chance, we reject the null hypothesis.

14

Suppose the null hypothesis is that a machine is producing the allowed 1% proportion of defectives (H₀: p = 0.01). Your experiment could end in one of several conclusions, depending on your sample data. List the letters of all possible conclusions from those below. (The actual conclusion would depend on the choice of H₁, the choice of α, and the calculated p-value. Not all possible conclusions are listed below.) A. The machine is producing exactly the acceptable proportion of defectives. B. The machine is producing no more defectives than acceptable. C. The machine is producing too many defectives. D. Unable to prove anything either way.

15 16

How can you avoid making a Type I error in a hypothesis test?

17

You’re doing a hypothesis test to try to show that Drug A is more effective than Drug B, and your p-value is 0.0678. Your roommate, who has not taken statistics, asks, “So there’s a 6.78% chance that the drugs are equally effective, right?” Explain what the p-value actually means.

You want to find what proportion of churchgoers believe that evolution should be taught in public schools, so you take a systematic survey at a local mall. You collect 487 survey forms. Of those, 321 identify as churchgoers, and 227 of those 321 say that evolution should be taught in public schools. (a) What is the population? (b) What is the population size? (c) What is the sample size? (d) Is limiting the sample to churchgoers a bias source?

18

Eight percent of the 2×4s from a lumber yard have cracks longer than an inch. Assume that the defectives are randomly distributed. Do you use a binomial or a geometric distribution to compute each of the following, and why? (You don’t actually need to compute the probabilities; just identify the distributions.) (a) The probability that no more than five 2×4s in a random sample of 100 have cracks longer than an inch. (b) The probability that exactly five 2×4s in a random sample of 100 have cracks longer than an inch. (c) The probability that, pulling 2×4s at random, the first four don’t have cracks longer than an inch but the fifth one does.

19

Data are gathered and a computation is done to answer the question “As near as we can tell, how much does the average high-school student spend on lunch?” This computation would be part of A. hypothesis test B. sample size C. confidence interval D. none of the above

20

Linear correlation coefficients must lie between what two values? What value indicates “no linear correlation”? Does this mean no correlation at all?

21

“Four out of five dentists surveyed recommend Trident sugarless gum for their patients who chew gum.” Which of these is the correct symbol for “four out of five dentists surveyed”? µ  p  p̂  pₒ  x  x̄  s

22

A poll concludes that 26.9% of TC3 students are satisfied with the food service. What is the type of the original data gathered?

23 24

For what sort of data might you use a pie chart? Why?

25

Usually you make what you want to prove the alternative hypothesis, not the null hypothesis. Why?

The mean is usually the best measure of center of numerical data. But under certain circumstances the mean is not representative and you prefer a different measure of center. Which circumstances, and which measure of center?

26

A company wishes to claim, “People who eat our shredded wheat for breakfast every day for a month lose more than ten points on their cholesterol.” One or more of the following state the null and alternative hypotheses correctly. Which one(s)?

A. H₀ > 10, H₁ ≤ 10
B. H₀: x > 10, H₁: x ≤ 10
C. H₀: µ > 10, H₁: µ ≤ 10
D. H₀: x > 10, H₁: x ≤ 10
E. H₀ = 10, H₁ > 10
F. H₀: x = 10, H₁: x > 10
G. H₀: µ = 10, H₁: µ > 10
H. H₀: x = 10, H₁: x > 10
I. H₀ ≤ 10, H₁ > 10
J. H₀: x ≤ 10, H₁: x > 10
K. H₀: µ ≤ 10, H₁: µ > 10
L. H₀: x ≤ 10, H₁: x > 10

27

Which of the following is a Type I error? A. failing to reject the null hypothesis when it is true B. failing to reject the null hypothesis when it is false C. rejecting the null hypothesis when it is true D. rejecting the null hypothesis when it is false

28 29

Compare an experiment and an observational study.

Our symbol for level of confidence in a confidence interval is α/2  1–α  z(α/2)  E (If none of these, supply the correct symbol.)

30

You gather a random sample of selling prices of 2006 Honda Civics. Which selection on your TI-83 would be used to test the claim “In the US, 2006 Honda Civics sell, on average, for more than $2,000”? A. Z-Test B. T-Test C. 1-PropZTest D. 1-PropTTest E. χ²-Test F. none of these

31 32

Compare descriptive and inferential statistics, and give an example of each.

33 34

Compare “sample” and “population”; give an example.

35

You believe that more than 25% of high-school students experienced strong peer pressure to have sex. To test this belief, you survey 500 randomly selected graduating seniors nationwide and find that 150 of them say that they did feel such pressure.

(a) The data would best be analyzed as an example of A. one population proportion B. two populations, difference in proportions C. one population mean D. two populations, difference in means, paired data E. two populations, difference in means, unpaired data F. goodness of fit G. contingency table (b) Which tests must you perform on your sample data before doing the analysis mentioned above? (In other words, how would you make sure that the sample meets the requirements?)

You find that your maximum error of estimate (margin of error) is ±3.3 at a confidence level of 95%. At 90% confidence, what would be the maximum error of estimate? A. more than 3.3 B. 3.3 C. less than 3.3 D. can’t say without more information.

You take a random sample of Lamborghini owners and a random sample of Subaru owners. Which selection on your TI-83 would be used to answer the question “How much more do Lamborghini owners spend per year on maintenance than Subaru owners?” A. ZInterval B. TInterval C. 2-SampZInt D. 2-SampTInt E. 2-PropZInt F. none of these

Problem Set 2: Calculations

Show your work for all problems. Round probabilities to four decimal places and test statistics (t, z, χ²) to two. For hypothesis tests, check requirements and show all six numbered steps.

36

You are testing the assertion, “Judge Judy is more friendly to plaintiffs than Judge Wapner was.” Since it would be tedious to tabulate the hundreds or thousands of decisions each judge has handed down, you randomly select 32 of each judge’s decisions. Judge Judy’s average award to plaintiffs was $650 (standard deviation = $250) and Judge Wapner’s was $580 (standard deviation = $260). Assume that the amounts are normally distributed without outliers. Using a significance level of 0.05, can you conclude that Judge Judy does indeed give higher awards on average?

37

Weights of frozen turkeys at one large market were normally distributed with a mean of 14.8 pounds and a standard deviation of 2.1 pounds. If there were 10,000 turkeys in the market, how many choices would a shopper have who wanted a bird 20.5 pounds or larger? (Hint: begin by figuring the percentage or proportion of turkeys in that weight range.)

38

(from Johnson and Kuby 2003, problem 9.26) “The addition of a new accelerator is claimed to decrease the drying time of latex paint by more than 4%. Several test samples were conducted with the following percentage decrease in drying time:

“5.2  6.4  3.8  6.3  4.1  2.8  3.2  4.7

“If we assume that the percentage decrease in drying time is normally distributed” (a) Test the claim, at the .05 level. (b) “Find the 95% confidence interval for the true mean decrease in the drying time based on this sample.”

39

28% of a certain breed of rabbits are born with long hair. Assume that the distribution is random, and consider a litter of five rabbits.

(a) What is the probability that none of the rabbits in the litter have long hair? (b) What is the probability that one or more in a litter have long hair? (c) What is the probability that four or five of them have long hair? (d) What is the average number (mean) of long-haired rabbits you expect in a litter of five?

40

A survey asked a number of professionals, “Which of the following is your most common choice for breakfast?” Using the following data from a random survey, determine whether doctors choose breakfasts in different proportions from other self-employed professionals, to a .05 significance level.

           Cereal   Pastry   Eggs   Other   No bfst   Total
Doctors        85       22     47      60        17     231
Others        185       90    160     135        35     605
Total         270      112    207     195        52     836

41

Suppose that the mean adult male height is 5′10″ (70″) and the standard deviation is 2.4″. (a) If a particular man’s z-score is −1.2, what is his actual height to the nearest 0.1″? (b) Using the Empirical Rule, what percentile is a height of 67.6″? (c) By the Empirical Rule, what proportion of adult men are shorter than 74.8″?

42

The length of life of a random sample of incandescent light bulbs was obtained, and the results are in the table below. (a) Plot a histogram of the data. (b) What is the size of the sample, with its proper symbol? (c) What are the mean and standard deviation? (Use the proper symbols and round to one decimal place.) (d) What is the relative frequency of the 1100–1250 class?

life, hr     count
500–650          6
650–800         18
800–950         60
950–1100        89
1100–1250       29
1250–1400       17

43

One way to set speed limits is to observe a random sample of drivers and set the speed limit at the 85th percentile. What speed corresponds to that 85th percentile, assuming drivers’ speeds are normally distributed with µ = 57.6 and σ = 5.2 mph?

44

You’re planning a survey to see what fraction of people who live in Virgil would take the bus if the county added a route between Greek Peak and downtown Cortland via routes 392 and 215. (a) You think the answer is only about 20% of them. If you need 90% confidence in an answer to within ±4%, how many people will you need to survey? (b) What if you have no idea of the answer? How many would you need to survey then?

45

Some popular fast-food items were compared for calories and fat, and the results are shown below:

Calories (x)   270   420   210   450   130   310   290   450   446   640   233
Fat (y)          9    20    10    22     6    25     7    20    20    38    11

(a) Make a scatterplot on your TI-83. Do you expect a positive, negative, or zero correlation? Why? (b) Find the correlation coefficient and the equation of the line of best fit and write them down. Round to four decimal places and use proper symbols. (c) Give the value of the y intercept and interpret its meaning. (d) Using the regression equation or your TI-83 graph, how many grams of fat would you predict for an item of 310 calories? Explain why this is different from the actual data point (310 calories, 25 grams). (e) What is the value of the residual for the data point (310,25)? (f) What is the value of the coefficient of determination in this regression? What does it mean? (g) The decision point for n = 11 is 0.602. What if anything can you say about the correlation for all fast foods?

46

Aluminum plates produced by a company are normally distributed with a mean thickness of 2.0 mm and a standard deviation of 0.1 mm. If 6% of the plates are too thick, what is the cutoff point between “too thick” and “acceptable?”

47

Many people took a physical fitness course. Seven of them were randomly selected and were tested for how many sit-ups they could do. The same seven were re-tested after the course. From the data below, can you conclude that improvement took place among the general run of people who took the course? Use α = 0.01.

          Anne   Bill   Chance   Deb   Ed   Frank   Grace
Before      29     22       25    29   26      24      31
After       30     26       25    35   33      36      32

48

Your average morning commute time is 27 minutes, with SD 4 minutes. Your morning commute times are normally distributed. (a) How likely is a morning commute under 24 minutes? (b) You pick a week (five mornings) at random. How likely is an average commute time under 24 minutes?

49

(adapted from Johnson and Kuby 2003 problem 11.15) A survey was taken nationally to see what size vacation home people preferred. A separate survey was taken in Nebraska. Both were random samples. Do the Nebraska results differ significantly (0.05 level) from the national results?

Unit size            Entire US   Nebraska
Studio/efficiency        18.2%         75
1 bedroom                18.2%         60
2 bedrooms               40.4%        105
3 bedrooms               18.2%         45
Over 3 bedrooms           5.0%         15
Total                   100.0%        300

50

An experiment was designed to test the effectiveness of a short course that teaches diabetic self-care. Fifty diabetic patients were enrolled in the course, and fifty others served as a control group. (Patients were randomly assigned between the two groups.) Six months after the course, blood sugar levels were tested and results obtained as follows:

Diabetic course group: mean = 6.5, standard deviation = 0.7
Control group: mean = 7.1, standard deviation = 0.9

(a) At a significance level of 0.01, does the diabetic course succeed in lowering patients’ blood sugar? (b) Obviously diabetic patients are not all the same. In this experiment, the largish sample sizes and randomization mean that confounding variables are probably balanced out in the two groups. But suppose you had money only for a smaller study, with a total of 30 patients. Suggest an experimental design that would control for most lurking variables. What problem can you see with that design?

51

(adapted from Johnson and Kuby 2003 problem 9.36) “A study in the journal PAIN, October 1994, reported on six patients with chronic myofascial pain syndrome. The mean duration of pain had been 3.0 years for the 6 patients and the standard deviation had been 0.5 year. Test the hypothesis that the mean pain duration of all patients who might have been selected for this study [meaning, of all persons who suffer from this condition] was greater than 2.5 years. Use α = 0.05. Assume that the sample is a random sample, normally distributed with no outliers.

52

In a survey of working parents, 200 men and 200 women were randomly selected and asked, “Have you refused a promotion because it would mean less time with your family?” Of the men, 60 said yes; 48 of the women said yes. (a) Obviously more men in the sample refused promotions. But can you conclude at the 0.05 significance level that a higher percentage of all working men have refused promotions, versus the percentage of all working women? (b) In an English sentence, state a 95% confidence interval for the difference in percentages of men and women who refuse promotions.

53

Ten thousand students take a test, and their scores are normally distributed. If the middle 95% of them score between 70 and 130, what are the mean and standard deviation?

54

An insurance company advertises that 75% of its claims are settled within two months of being filed. The state insurance commission thinks the percentage is less than 75, and sets out to prove it. First a small study is done. For this preliminary study, the commissioner can live with a 5% chance of making a Type I error. The commission staff randomly selects 65 claims, and finds out that 40 were settled within two months. Based on this study, can you say that less than 75% of claims are settled within two months?

Work this problem only if you studied the optional extras in the Probability chapter. A shoe store gets its shoes from just two companies, 40% from A and 60% from B. 2.5% of pairs from Brand A are mislabeled, and 1.5% of pairs from Brand B are mislabeled. Find the probability that a randomly selected pair of shoes in the store is mislabeled.

55 56

Ten randomly selected men compared two brands of razors. Each man shaved one side of his face with brand A and the other side with brand B. (They flipped coins to decide which razor to use on which side.) Each tester assigned a “smoothness score” of 1 to 10 to each side after shaving. The scores are as shown below. Determine whether there is a difference in smoothness performance between the two razors, using α = 0.10.

Man         1   2   3   4   5   6   7   8   9   10
A score     7   8   3   5   4   4   9   8   7    4
B score     5   6   3   4   6   5   6   7   3    4

57

In August 2009, the National Geographic News Web site reported that 90% of US currency was tainted with cocaine. (a) If you drew a random sample of two bills, what is the chance that exactly one of them is tainted with cocaine? (b) You have ten bills, and you’ve been told that 90% of these ten bills are tainted with cocaine. If you draw two of the ten bills at random, what is the chance that exactly one of your two is tainted with cocaine?

58

Fifteen farms were randomly selected from a large agricultural region. Each farm’s yield of wheat per acre was measured. For the 15 farms, the mean yield per acre was 85.5 bushels and the standard deviation was 10.0 bushels. Find a 90% confidence interval for the mean yield per acre for all farms in this region, assuming yield per acre is normally distributed and there were no outliers in the sample.

59

You draw five cards from a deck, without replacement, and record the number of aces you drew. Then you replace the five cards and shuffle the deck thoroughly. If you repeat this experiment many times, is the number of aces in five cards drawn a binomial distribution? Why or why not?

60

In a survey of 300 people from Tompkins County, 128 of them preferred to rent or stream a movie on Saturday night rather than watch broadcast or cable TV. In Cortland County, 135 of 400 people surveyed preferred a movie. You’re interested in the difference of proportion in movie renters for Tompkins County over Cortland County. Both surveys were random samples. (a) What is the point estimate for that difference? (b) Find the 98% confidence interval for the difference in the two proportions for all residents of the counties. (c) What is the maximum error of estimate, at the 98% confidence level?

61

Two batches of seeds were randomly drawn from the same lot, and one batch was given a special treatment. Consider the data for germination shown below. At significance level 0.05, does the treatment make any difference in how likely seeds are to germinate?

             Germinated   Didn’t
Untreated            80       20
Treated             135       15

Now check yourself on the solutions page.

What’s New

9 Dec 2015: Make it explicit that these surveys were random.
6 May 2015: Change a problem from gasoline octane to weight of candy bars.
13 Jan 2015: Add several study aids to the list of review documents.
(intervening changes suppressed)
11 Nov 2007: Amalgamate the old separate sets of review problems for descriptive and inferential statistics.
4 Nov 2011: Create the list of key concepts.

Solutions to All Exercises

Updated 1 Jan 2016

Contents:

Solutions for Chapter 1
Solutions for Chapter 2
Solutions for Chapter 3
Solutions for Chapter 4
Solutions for Chapter 5
Solutions for Chapter 6
Solutions for Chapter 7
Solutions for Chapter 8
Solutions for Chapter 9
Solutions for Chapter 10
Solutions for Chapter 11
Solutions for Chapter 12
Solutions to Review Problems

Solutions for Chapter 1


Because this textbook helps you, please click to donate!

1

Sampling error is another name for sample variability, the fact that each sample is different from the next because no sample perfectly represents the population it was drawn from. Nonsampling errors are problems in setting up or carrying out the data collection, such as poorly worded survey questions and failure to randomize. Nothing can eliminate sampling error, but you can reduce it by increasing your sample size. (Most nonsampling errors can be avoided by proper experimental design and technique.)

2

(a) systematic sample. (b) It is probably a good sample of that gynecologist’s patients, since there’s no reason to think that one month is different from another. But it’s a bad sample of pregnant women in general, because it suffers from selection bias. This gynecologist’s patients may use prenatal vitamins differently from pregnant women who see other gynecologists or who don’t have a regular gynecologist. (c) observational study

3

(a) completely randomized (b) the plant food administered (c) no food, Gro-Mor, Magi-Grow (d) 13 heights at the end of the 13 weeks (You could also make a case for growth rate.) (e) the 150 bulbs (f) selection of plant food (g) the group that gets no plant food

4

Each family answered the question “How many children do you have?” (a) The variable is number of children. (b) It is a discrete variable. (c) It summarizes population data, and therefore it is a parameter. Although “numeric” or “quantitative” is correct, it’s not an adequate answer because it is not as specific as possible. Discrete and continuous data are treated differently in descriptive statistics, so it matters which type you have. Students are sometimes fooled by the decimal. Always ask yourself what was the original question asked or the original measurement taken from each member of the sample.

5

(a) The sample is the 80 people in your focus group. (It is not the drinks. It’s also not the people’s preferences: Their preferences are the data or sample data.) (b) The sample size is 80, because that’s the number of people you took data from. It’s not 55: That’s just the number who gave one particular response. (c) The population is not stated explicitly, but you can infer that it’s cola drinkers in general, or Whoopsie Cola drinkers in general. (d) You don’t know how many cola drinkers (or Whoopsie Cola drinkers) there are. You can’t know, since people change their soft-drink habits all the time. You can say that the population is indefinitely large, or you can say that it’s infinite. (You can say that the population is uncountable, but don’t say that the population size is uncountable.) Common mistake: Students sometimes answer “80” for population size, but this is not correct. You took data from 80 people, so those 80 people are your sample and 80 is your sample size.

6

(a) sampling error (or sample variability) (b) increase sample size

7

You’re asking people to admit to socially disapproved behavior. People tend to shade their answers toward socially acceptable behavior. What can be done to reduce response bias? Interviewers should be trained to be absolutely neutral in voice and facial expression, which is how the Kinsey team gathered data on sexual behavior. Or the question can be asked on a written questionnaire, so that the subject isn’t looking another person in the face when answering. The question can also be made less threatening: “Have you ever left an infant alone in the house, even for just a minute?”

8

Random sample: Get a list of the resident students. On your calculator, do randInt(1,2000) 50 times, not counting duplicates, and interview the students who came up in those positions.

Systematic sample: You can’t station yourself in the cafeteria because that would exclude all students who don’t use it. Instead, station yourself at the main entrance to the dorm complex (or station yourself and confederates at the main entrance to each dorm) and interview every 20th person. Why k=20 and not 2000/50 = 40? Because whenever you’re there, you’re bound to miss a sizable proportion of students. To select the first person to survey, use randInt(1,20). Remember that a systematic survey begins with a randomly selected person from 1 to k, not 1 to 50 (sample size) or 1 to 2000 (population size). Notice that I didn’t suggest a time frame. What do you think would be a good time to do this? An alternative procedure might be to walk through the dorms (assuming you can get in) and interview the students in every 20th room. You may get better coverage that way than if you wait for them to come to you.

Cluster sample: Randomly select 25 rooms, and interview both of the students in those rooms. (This is a single-stage cluster.)

Best balance? Probably the cluster sample. The true random sample is a lot of work for a sample of 50, because after selecting the names you have to track the students down. The systematic sample, no matter how you do it, is going to miss a lot of students, and you have that time-period problem. With the cluster sample, you can time it for when students are likely to be home, and you can go back to follow up on those you missed.

But nothing is perfect, in this life where we are born to trouble as the sparks fly upward. The cluster sample works if the students were randomly assigned to rooms. When students pick their own roommates, they tend to pick people with similar attitudes, interests, and activities. That means those two are more similar to each other than other students, and there’s no way you can treat that cluster sample as a random sample. The cluster would probably be safe for freshmen, where the great majority would be randomly assigned, but less so for students in later years.
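If you would rather pick the positions on a computer than with randInt on the calculator, the random and systematic selections could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are made up for this sketch, the population of 2000 and sample of 50 come from the exercise, and the fixed seed is there only so that the run is repeatable.

```python
import random

POPULATION_SIZE = 2000   # resident students, from the exercise
SAMPLE_SIZE = 50
K = 20                   # interview every 20th person, as in the solution

def simple_random_positions(population_size, sample_size, rng):
    """Pick distinct list positions from 1 to population_size.

    Plays the role of repeating randInt(1,2000) and skipping duplicates.
    """
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

def systematic_positions(k, count, rng):
    """Start at a random position from 1 to k, then take every k-th person."""
    start = rng.randint(1, k)          # like randInt(1,20); both ends included
    return [start + i * k for i in range(count)]

rng = random.Random(2016)              # hypothetical seed, for repeatability only
random_sample = simple_random_positions(POPULATION_SIZE, SAMPLE_SIZE, rng)
systematic_sample = systematic_positions(K, SAMPLE_SIZE, rng)

print(random_sample[:5])
print(systematic_sample[:5])
```

The systematic list counts people as they pass you, not positions on the student roster, which is why its starting point runs from 1 to k rather than 1 to 2000.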

9

No, you can’t reach that conclusion, because you can never conclude causation from an observational study. You would have to do an experiment, where people were randomly assigned to watch Fox News or to watch no news at all, and then see if there was a difference in how much they knew about the world. Students often answer questions like this with hand-waving arguments, either coming up with reasons why it’s a plausible conclusion or coming up with reasons why it isn’t. This is statistics, and we have to follow the facts. Whatever you may think about Fox News, the fact is that observational studies can’t prove causation.

10

(a) It excludes people who don’t use the bus. This means that people who are dissatisfied with the bus are systematically under-represented. Your survey will probably show that willingness to pay is higher than it actually is. (b) sampling bias

11

“Random” doesn’t mean unplanned; it takes planning. This is a bogus sample. If you want a more formal statistical word, call it a convenience sample, an opportunity sample or a nonprobability sample.

12

(a) This is attribute data or qualitative data or non-numeric data. Don’t be fooled by the number 42: the original question asked was “Do you have at least one streaming device?” and that’s a yes/no question. Alternative: the more specific answer binomial data, which you may have heard in the lecture though it’s not in the book till Chapter 6. (b) This is descriptive statistics because it’s reporting data actually measured: 42% of the sample. If it said “42% of Americans”, then it would be inferential because you know not every American was asked, so the investigators must have extrapolated from a sample to the population. (c) It is a statistic because it is a number that summarizes data from a sample.

13

The first people who present themselves are chosen. You should randomly select from among all volunteers. (Better still would be to randomly select from among all patients, and ask the selected individuals to volunteer.)

Participants are not randomly assigned to control and experimental groups. This is always bad, but it’s especially bad when you accept a block of volunteers in order.

The experiment is not double blind, only single blind. When doctors know who is getting a placebo and who is getting medicine, they may treat the two groups differently, consciously or unconsciously.

All of these are nonsampling errors.

14

2.145E-4 is 0.0002145, and 0.0004 is larger than that.
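If you want to double-check a comparison like this away from the calculator, a small Python sketch can do it, since Python reads calculator-style E notation directly. The variable names here are only illustrative.

```python
# 2.145E-4 is calculator notation for 2.145 times 10 to the power -4.
small = float("2.145E-4")
other = 0.0004

print(f"{small:.7f}")    # the same value written in ordinary decimal form
print(other > small)     # True when 0.0004 is the larger number
```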

15

It’s spurious precision. (That much precision could be appropriate if you had surveyed a few hundred thousand households.) To fix it, round to one decimal place: 1.9. (Don’t make the common mistake of “rounding” to 1.8.)

16

(a) Non-numeric. (It has the form of a number, but think about the average area code in a group and you’ll realize an area code is not a number.) (b) Continuous. (c) Discrete. (d) Non-numeric. (e) Non-numeric. (f) Discrete. (or continuous if you allow answers like 6.3)

17

(a) was done for you. (b) Measurement: Amount of each dinner check. Continuous. (c) Question: “Did you experience bloating and stomach pain?” Non-numeric. (d) Measurement: Number of people in each party. Discrete.

Solutions for Chapter 2


1

2

There’s no scale to interpret the quantities. And if one fruit in each row is supposed to represent a given quantity, then banana and apple have the same frequency, yet banana looks like its frequency is much greater.

3

90% of 15 is 13.5, 80% is 12, 70% is 10.5, and 60% is 9.

Score        Grade   Tallies   Frequency
13.5–15      A       ||        2
12–13.4      B       |         1
10.5–11.9    C       ||||      5
9–10.4       D       |||       3
0–8.9        F       ||||      4

Alternatives: Instead of a title below the category axis, you could have a title above the graph. You could order the grades from worst to best (F through A) instead of alphabetically as I did here. And you could list the class boundaries as 13.5–15, 12–13.5, 10.5–12, and so on, with the understanding that a score of 12 goes into the 12–13.5 class, not the 10.5–12 class. (Data points “on the cusp” always go into the higher class.)
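The class boundaries in this solution are just percentages of the 15-point maximum, and the rule that a score “on the cusp” goes into the higher class is a simple greater-than-or-equal test. A minimal Python sketch, assuming the 90/80/70/60 percent scale used above (the names `GRADE_PERCENTS`, `cutoffs`, and `grade` are invented for this illustration):

```python
MAX_SCORE = 15
# Minimum percentage for each letter grade, from the solution above.
GRADE_PERCENTS = [("A", 90), ("B", 80), ("C", 70), ("D", 60)]

# Lower class boundary for each grade, computed as a percentage of 15.
cutoffs = [(letter, MAX_SCORE * pct / 100) for letter, pct in GRADE_PERCENTS]

def grade(score):
    """Return the letter grade; a score on a boundary goes into the higher class."""
    for letter, lower in cutoffs:
        if score >= lower:
            return letter
    return "F"

for letter, lower in cutoffs:
    print(letter, lower)
print(grade(12))   # 12 is exactly on a boundary
```

Counting how many scores fall in each grade then gives the frequency column of the table.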

4

(a) The variable is discrete, “number of deaths in a corps in a given year”. (b)

Alternatives: Some authors would draw a histogram (bars touching) or even a pie chart. Those are okay but not the best choice.

5

Commuting Distance

0 | 5 9 8 1
1 | 5 2 2 1 9 6 2 8 7 6 5 7
2 | 3 2 6 1 6 4 0
3 | 1
4 | 5

Key: 2 | 3 = 23 km

6

Relative frequency is f/n. f = 25, and n = 35+10+25+45+20 = 135. Dividing 25/135 gives 0.185185... ≈ 0.19 or 19%
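The same arithmetic can be checked with a few lines of Python. This is just a sketch of the f/n calculation above; the list holds the five class frequencies quoted in the solution, and the variable names are illustrative.

```python
# Class frequencies as listed in the solution; 25 is the class of interest.
frequencies = [35, 10, 25, 45, 20]
f = 25
n = sum(frequencies)          # total number of data points

relative_frequency = f / n
print(n, round(relative_frequency, 2))
```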

7

(a) Bar graph, histogram, stemplot. A bar graph or histogram can be used for any ungrouped discrete data. (Some authors use one, some use the other. I like the bar graph for ungrouped discrete data.) A stemplot, or stem-and-leaf diagram, can be used when you have a moderate data range without too many data points. (b) Histogram. (c) Bar graph, pie chart.

8

skewed right

9

(a) Group the data when you have a lot of different values. (b) The classes must all be the same width, and there must be no gaps.

10

(a) See the histogram at right. Important features: The bars are labeled at their edges, not their centers, because this is a grouped histogram. Both axes are titled. The horizontal axis has a real-world title. (Sometimes you also need an overall title for the graph, but here the axis title says all that needs to be said.)

(b) 480.0−470.0 = 10.0 or just plain “10”. Don’t make the common mistake of subtracting 479.9−470.0. Subtract consecutive lower bounds, always. (c) skewed left

Solutions for Chapter 3

← Exercises for Ch 3

Because this textbook helps you, please click to donate!

1

When the data set is skewed, the median is better. Outliers tend to skew a data set, so usually the median is a better choice when you have outliers.

2

15% of people have cholesterol equal to or less than yours, so yours is on the low end. Though you might not really celebrate by eating high-cholesterol foods, there is no cause for concern.

3

(a) It uses only the two most extreme values. (b) It uses only two values, but they are not the most extreme, so it is resistant. (c) It uses all the numbers in the data set. (d) Any two of: It is in the same units as the original data, it can be used in comparing z-scores from different data sets, you can predict what percentage of the data set will be within a certain number of SD from the mean.

4

(a) s is standard deviation of a sample; σ is standard deviation of a population. (b) µ is mean of a population; x̄ is mean of a sample. (c) N is population size or number of members of the population; n is sample size or number of members of the sample.

5

You were 1.87 standard deviations above average. This is excellent performance. 1.87 is almost 2, and in a normal distribution, z = +2 would be better than 95+2.5 = 97.5% of the students. 1.87 is not quite up there, but close. (In Chapter 7, you’ll learn how to compute that a z-score of 1.87 is better than 96.9% of the population.)

6

Since the weights are normally distributed, 99.7% (“almost all”) of them will be within three SD above and below the mean. 3σ above and below is a total range of 6σ. The actual range of “almost all” the apples was 8.50−4.50 = 4.00 ounces. 6σ = 4.00; therefore σ = 0.67 ounces.

Alternative solution: In a normal distribution, the mean is half way between the given extremes: µ = (4.50+8.50)/2 = 6.50. Then the distance from the mean to 8.50 must be three SD: 8.50−6.50 = 2.00 = 3σ; σ = 0.67 ounces.

7

(a) This is a grouped distribution, so you need the class midpoints, as shown in the table below. Enter the midpoints in L1 and the frequencies in L2. Caution! The midpoints are not midway between lower and upper bounds, such as (20+29)/2 = 24.5. They are midway between successive lower bounds, such as (20+30)/2 = 25.

Ages       Midpoint (L1)   Frequency (L2)
20 – 29    25              34
30 – 39    35              58
40 – 49    45              76
50 – 59    55              187
60 – 69    65              254
70 – 79    75              241
80 – 89    85              147

1-VarStats L1,L2 (Check n first!)
x̄ = 63.85656971 → x̄ = 63.86
s = 15.43533244 → s = 15.44
n = 997

Common mistake: People tend to run 1-VarStats L1, leaving off the L2, which just gives statistics of the seven numbers 25, 35, …, 85. Always check n first. If you check n and see that n = 7, you realize that can’t possibly be right since the frequencies obviously add up to more than 7. You fix your mistake and all is well.

(b) You need the original data to make a boxplot, and here you have only the grouped data. A boxplot of a grouped distribution doesn’t show the shape of the data set accurately, because only class midpoints are taken into account. The class midpoints are good enough for approximating the mean and SD of the data, but not the five-number summary that is pictured in the boxplot.
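If you want to check the grouped-data calculation away from the calculator, here is a minimal Python sketch. It assumes the same approach as 1-VarStats L1,L2: every class is represented by its midpoint, weighted by its frequency, and the SD uses the sample formula (dividing by n−1). The function name grouped_stats is only an illustration, not a calculator or library command.

```python
from math import sqrt

def grouped_stats(midpoints, freqs):
    """Return (n, mean, sample SD) for a grouped frequency distribution."""
    n = sum(freqs)  # total number of data points, not the number of classes
    mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
    squared_dev = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
    s = sqrt(squared_dev / (n - 1))  # sample SD, like Sx on the TI-83/84
    return n, mean, s

midpoints = [25, 35, 45, 55, 65, 75, 85]      # class midpoints (L1)
freqs = [34, 58, 76, 187, 254, 241, 147]      # frequencies (L2)

n, mean, s = grouped_stats(midpoints, freqs)
print(n, round(mean, 2), round(s, 2))
```

As on the calculator, look at n first: if it equals the number of classes, the frequencies were left out.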

8

You need the weighted average, so put the quality points in L1 and the credits in L2. (No, you can’t do it the other way around. The quality points are the numeric forms of your grades, and you have to give them weights according to the number of credits in each course.)

Course           Credits (L2)   Grade   Quality Points (L1)
Statistics       3              A       4.0
Calculus         4              B+      3.3
Microsoft Word   1              C−      1.7
Microbiology     3              B−      2.7
English Comp     3              C       2.0

1-VarStats L1,L2
n = 14 (This is the number of credits attempted. If you get 5, you forgot to include L2 in the command.)
x̄ = 2.93
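The weighted average can be sketched in a few lines of Python. This is only an illustration of the idea behind 1-VarStats L1,L2; the names weighted_mean, quality_points, and credit_hours are assumptions made for the example.

```python
def weighted_mean(values, weights):
    """Weighted average: each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

quality_points = [4.0, 3.3, 1.7, 2.7, 2.0]   # L1: numeric form of each grade
credit_hours = [3, 4, 1, 3, 3]               # L2: credits for each course

gpa = weighted_mean(quality_points, credit_hours)
print(sum(credit_hours), round(gpa, 2))
```

The first number printed plays the role of n: it should be the total credits, not the number of courses.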

9

You don’t have the individual quiz scores, but remember what the average means: it’s the total divided by the number of data points. If your quiz average is 86%, then on 10 quizzes you must have a total of 86×10 = 860 percentage points. If you need an 87% average on 11 quizzes, you need 11×87 = 957 percentage points. 957−860 = 97; you can still skip the final exam if you get a 97 on the last quiz.

10

Commute Distance, km   Frequency
0–9                    4
10–19                  12
20–29                  7
30–39                  1
40–49                  1
Total                  25

(a)

(b) The class width is 10 (not 9). The class midpoints are 5, 15, 25, 35, 45 (not 4.5, 14.5, etc.). (c) Class midpoints in one list such as L2 and frequencies in another list such as L3. This is a sample, so symbols are x̄, s, n, not µ, σ, N. 1-VarStats L2,L3 x̄ = 18.2 km s = 9.5 km n = 25

(d) Data in a list such as L1. 1-VarStats L1 gives x̄ = 17.6 km, Median = 17, s = 9.0 km, n = 25 (e)

(f) Mean, because the data are nearly symmetric. Or, median, because there is an outlier. Comment: The stemplot made the data look skewed, but that was just an artifact of the choice of classes. The boxplot shows that the data are nearly symmetric, except for that outlier. This is why the mean and median are close together. This is a good illustration that sometimes there is no uniquely correct answer. It’s why your justification or explanation is an important part of your answer. (g) The five-number summary, from MATH200A part 2 [TRACE], is 1, 12, 17, 22.5, 45. There is one outlier, 45. (The five-number summary includes the actual min and max, whether they are outliers or not.)

11

Since 500 equals the mean, its z score is 0. For 700, compute the z score as z = (700−500)/100 = 2. So you need the probability of data falling between the mean and two SD above the mean. Make a sketch and shade this area. Draw an auxiliary line at z = −2. You know that the area between z = −2 and z = +2 is 95%, so the area between z = 0 and z = 2 is half that, 47.5% or 0.475.

12

To compare apples and oranges, compute their z scores:
z_J = (2070−1500)/300 = 570/300 = 1.90
z_M = (129−100)/15 = 29/15 = about 1.93
Because she has the higher z score, according to the tests Maria is more intelligent.

Remark: The difference is very slight. Quite possibly, on another day Jacinto might do slightly better and Maria slightly worse, reversing their ranking.
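The z-score comparison is easy to reproduce in code. The sketch below assumes the means and standard deviations used in the computation above (1500 and 300 for Jacinto’s test, 100 and 15 for Maria’s); the function name z_score is just an illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_jacinto = z_score(2070, 1500, 300)
z_maria = z_score(129, 100, 15)

print(round(z_jacinto, 2), round(z_maria, 2))
print("Maria" if z_maria > z_jacinto else "Jacinto")
```

Because z scores are unit-free, the two results can be compared directly even though the tests use different scales.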

13

Start with the class marks or midpoints, as shown in the table below. (Class midpoints are halfway between successive lower bounds: (470+480)/2 = 475. You can’t calculate them between lower and upper bounds, (470+479.9)/2 = 474.95.)

Test Scores    Frequencies, f (L2)   Class Midpoints, x (L1)
470.0–479.9    15                    475.0
480.0–489.9    22                    485.0
490.0–499.9    29                    495.0
500.0–509.9    50                    505.0
510.0–519.9    38                    515.0

Put class midpoints in a list, such as L1, and frequencies in another list, such as L2. (Either label the columns with the lists you use, as I did here, or state them explicitly: “class marks in L1, frequencies in L2”.)

1-VarStats L1,L2 (Always write down the command that you used.)
(a) n = 154
(b) x̄ = 499.81 (before rounding, 499.8051948)
(c) s = 12.74 (before rounding, 12.74284519)

Be careful with symbols. Use the correct one for sample or population, whichever you have.

Common mistake: The SD is 12.74 (Sx), not 12.70 (σx), because this is a sample and not the population.

14

The mean is much greater than the median. This usually means that the distribution is skewed right , like incomes at a corporation.

Solutions for Chapter 4

← Exercises for Ch 4

1

64% of the variation in salary is associated with variation in age. Common mistake: Don’t use any form of the word “correlation” in your answer. Your friend wouldn’t understand it, but it’s wrong anyway. Correlation is the interpretation of r, not R². Yes, r is related to R², but R² as such is not about correlation. Common mistake: R² tells you how much of the variation in y is associated with variation in x, not the other way around. It’s not accurate to say 64% of variation in age is associated with variation in salary. Common mistake: Don’t say “explained by” to non-technical people. The regression shows an association, but it does not show that growing older causes salary increases.

2

(a) We know that power boats kill manatees, so the boat registrations must be the explanatory variable (x) and the manatee power-boat kills must be the response variable (y). (Although this is an observational study, the cause of death is recorded, so we do know that the boats cause these manatee deaths.) (b)

Yes

(c) The results of LinReg(ax+b) L1,L2,Y1 are shown at right. The correlation coefficient is r = 0.91 (d) ŷ = 0.1127x − 35.1786 Note: ŷ, not y. Note: −35.1786, not +−35.1786. (e) The slope is 0.1127. An increase of 1000 power-boat registrations is associated with an increase of about 0.11 manatee deaths, on average. It’s every 1000 boats, not every boat, because the original table is in thousands. Always be specific: “increase”, not just “change”. Remark: Although this is mathematically accurate, people may not respond well to 0.11 as a number of deaths, which obviously is a discrete variable. You might multiply by 100 and say that 100,000 extra registrations are associated with 11 more manatee deaths on average; or multiply by 10 and round a bit to say that 10,000 extra registrations are associated with about one more manatee death on average. (f) The y intercept is −35.1786. Mathematically, if there were no power boats there would be about minus 35 manatees killed by power boats. But this is not applicable because x=0 (no boats) is far outside the range of x in the data set. (g) R² = 0.83. About 83% of variation in manatee deaths from power boats is associated with the variation in registrations of power boats. It’s R², not r². And don’t use any form of the word “correlate” in your answer. 100% of manatee power-boat deaths come from power boats, so why isn’t the association 100%? The other 17% is lurking variables plus natural variability. For instance, maybe the weather was different in some years, so owners were more or less likely to use their boats. Maybe a campaign of awareness in some years caused some owners to lower their speeds in known manatee areas. (h)

ŷ = 27.8

(i) y−ŷ = 34−27.8 = 6.2 (j) Remember that x is in thousands, so a million boats is x = 1000. But x=1000 is far outside the data range, so the regression can’t be used to make a prediction.

3

The decision point for n=10 is 0.632, and |r| = 0.57. |r| < d.p., and therefore you can’t reach a conclusion. From the sample data, it’s impossible to say whether there is any association between TV watching and GPA for TC3 students in general. Note: Always state the decision point and show the comparison to r.

4

(a)

Yes

The point (0,6) is hard to see behind the y axis, but it’s there. (b) The results of LinReg(ax+b) L3,L4,Y2 are shown at right. ŷ = −3.5175x+6.4561 (c) The slope is −3.5175. Increasing the dial setting by one unit decreases temperature by about 3.5°. Again, state whether y increases or decreases with increasing x. (d) The y intercept is 6.4561. A dial setting of 0 corresponds to about 6.5°. (e) r = −0.99 (f) R² = 0.98. About 98% of variation in temperature is associated with variation in dial setting. This seems almost too good to be true, as though the data were just made up. But it’s hard to think of many lurking variables. Maybe it happened that some measurements were taken just after the compressor shut off, and others were taken just before the compressor was ready to switch on again in response to a temperature rise. (g)

ŷ = 2.9°

5

For n = 12, the decision point is 0.576. |r| = 0.85 is greater than that, so there is an association. Increased study time is associated with increased exam score for statistics students in general.

6

No. There’s a lurking variable here: age. Older pupils tend to have larger feet and also tend to have increased reading ability.

7

r, the linear correlation coefficient, would be roughly zero . Taking the plot as a whole, as x increases, y is about equally likely to increase or decrease. A straight line would be a terrible model for the data. Clearly there is a strong correlation, but it is not a linear correlation. Probably a good model for this data set would be a quadratic regression, ŷ = ax²+bx+c. Though we study only linear regressions, your calculator can perform quadratic and many other types.

8

The coefficient of determination, R², answers this question. For linear correlations, R² is indeed the square of the correlation coefficient r. r = 0.30 ⇒ R² = 0.09. Therefore 9% of the variation in IQ is associated with variation in income.

Remark: Don’t say “caused by” variation in family income. Correlation is not causation. You can think of some reasons why it might be plausible that wealthier families are more likely to produce smarter children, or at least children who do better on standardized tests, but you can’t be sure without a controlled experiment. Remark: Though it’s an interesting fact, the correlation in twins’ IQ scores is not needed for this problem. In real life, an important part of solving problems and making decisions is focusing on just the relevant information and not getting distracted.

Solutions for Chapter 5

← Exercises for Ch 5

Problem Set 1

1

(a) There are three coins, and each has two possible outcomes, so the sample space will have 2³ = 8 entries .

(b) S = { HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }

(c) Three events out of eight equally likely events: P(2H) = 3/8

Common mistake: Sometimes students write the sample space correctly but miss one of the combinations of 2 heads. I wish I could offer some “magic bullet” for counting correctly, but the only advice I have is just to be really careful.
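One way to be “really careful” is to let a program list the sample space. This is a small sketch using Python’s standard itertools module; the variable names are only illustrative.

```python
from fractions import Fraction
from itertools import product

# Every sequence of H/T for three coins: 2 * 2 * 2 = 8 outcomes
sample_space = ["".join(flips) for flips in product("HT", repeat=3)]

# Keep the outcomes with exactly two heads
two_heads = [outcome for outcome in sample_space if outcome.count("H") == 2]

# Equally likely outcomes, so probability = favorable / total
p_two_heads = Fraction(len(two_heads), len(sample_space))
print(sample_space)
print(two_heads, p_two_heads)
```

Counting by filtering the full list avoids the common mistake of overlooking one of the two-head combinations.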

2

(a) In a probability model, the probabilities must add to 1 (= 100%). The given probabilities add to 62.6%. What is the missing 37.4%? They’ve accounted for cell and landline, cell only, and nothing; the remaining possibility is landline only. The model is shown at right. (b) P(Landline) = P(Landline only) + P(Landline and cell) P(Landline) = 37.4% + 58.2% = 95.6%

Service type         Prob.
Landline and cell    58.2%
Landline only        37.4%
Cell only            2.8%
No phone             1.6%
Total                100.0%

Remark: “Landline” and “cell” are not disjoint events, because a given household could have both. But “landline only” and “landline and cell” are disjoint, because a given house can’t both have a cell phone with landline and have no cell phone with landline.

3

No, because the events are not disjoint. The figures are for being struck or attacked, not killed. You’d have to be pretty unlucky to be struck by lightning and attacked by a shark in the same year, but it could happen. If the question were about being killed by lightning or by a shark, then the events would be disjoint and you could add the probabilities.

4

(a) P(not A) = 1−P(A) = 1−0.7 → P(not A) = 0.3

(b) That A and B are complementary means that one or the other must happen, but not both. Therefore P(B) = P(not A) → P(B) = 0.3 (c) Since the events are complementary, they can’t both happen: P(A and B) = 0 Common mistake: Many students get (c) wrong, giving an answer of 1. If events are complementary, they can’t both happen at the same time. That means P(A and B) must be 0, the probability of something impossible. Maybe those students were thinking of P(A or B). If A and B are complementary, then one or the other must happen, so P(A or B) = P(A) + P(B) = 1. But part (c) was about the probability of “A and B”, not the probability of “A or B”.

5

Yes, because the events are disjoint or mutually exclusive: a person might have both cancer and heart disease, but the death certificate will list one cause of death. (1/5 + 1/7 ≈ 34%.)

6

P(divorced | man) is the probability that a randomly selected man is divorced, or the proportion of men who are divorced . P(man | divorced) is the probability that a randomly selected divorced person is a man, or the proportion of divorced persons that are men .

7

If the probability of a future event is zero, then that event is impossible. If the probability of a past event is zero, that just means that it didn’t happen in the cases that were studied, not that it couldn’t have happened. This is the difference between theoretical and empirical probability. A truly impossible event has a theoretical probability of zero. But the 0 out of 412 figure is an empirical probability (based on past experience). Empirical probabilities are just estimates of the “real” theoretical probability. From the empirical 0/412, you can tell that the theoretical probability is very low, but not necessarily zero. In plain language, an unresolved complaint is unlikely, but just because it hasn’t happened yet doesn’t mean it can’t happen.

8

13/52 or 1/4

Common mistake: Students often try some sort of complicated calculation here. You would have to do that if conditions were stated on all five of those cards, but they weren’t. Think about it: any card has a 1/4 chance of being a spade.

9

S = { HH, HT, TH, TT } (a) Three outcomes (HH, HT, TH) have at least one head. One of the three has both coins heads. Therefore the probability is 1/3 . (b) Two outcomes (HH, HT) have heads on the first coin. One of the two has both coins heads. Therefore the probability is 1/2 .
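The same “shrink the sample space” reasoning can be written as a short Python sketch: keep only the outcomes that satisfy the condition, then count how many of those are HH. The helper name conditional is an assumption for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = ["".join(pair) for pair in product("HT", repeat=2)]  # HH, HT, TH, TT

def conditional(event, given):
    """P(event | given) when all outcomes are equally likely."""
    reduced = [o for o in outcomes if given(o)]    # the condition shrinks the sample space
    hits = [o for o in reduced if event(o)]
    return Fraction(len(hits), len(reduced))

def both_heads(outcome):
    return outcome == "HH"

p_a = conditional(both_heads, lambda o: "H" in o)       # (a) at least one head
p_b = conditional(both_heads, lambda o: o[0] == "H")    # (b) first coin is heads
print(p_a, p_b)
```

The two answers differ only because the two conditions leave different numbers of outcomes in the reduced sample space.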

10

(a) 0.0171 × 0.0171 = 0.0003

(b) The events are not independent. When a married couple are at home together or out together, any attack that involves one of them will involve the other also.

11

(a) P(divorced) = 22.8/219.7 ≈ 0.1038

(b) About 10.38% of American adults in 2006 were divorced. If you randomly selected an American adult in 2006, there was a 0.1038 probability that he or she was divorced.

(c) Empirical or experimental

(d) P(divorced^C) = 1−P(divorced) = 1−22.8/219.7 ≈ 0.8962 About 89.62% of American adults in 2006 were not divorced (or, had a marital status other than divorced).

(e) P(man and married) = 63.6/219.7 ≈ 0.2895 (You can’t use a formula on this one.)

(f) Add up P(man) and P(not man but married): P(man or married) = 106.2/219.7 + 64.1/219.7 ≈ 0.7751 Alternative solution: By formula: P(man or married) = P(man) + P(married) − P(man and married) P(man or married) = 106.2/219.7 + 127.7/219.7 − 63.6/219.7 = 0.7751 Remember, math “or” means one or the other or both.

(g) What proportion of males were never married? 30.3/106.2 = 28.53%.

(h) P(man | married) uses the sub-subgroup of men within the subgroup of married persons. P(man | married) = 63.6/127.7 = 0.4980 49.80% of married persons were men. Remark: You might be surprised that it’s under 50%. Isn’t polygamy illegal in the US? Yes, it is. But the table considers only resident adults. Women tend to marry slightly earlier than men, so fewer grooms than brides are under 18. Also, soldiers deployed abroad are more likely to be male.

(i) P(married | man) uses the sub-subgroup of married persons within the subgroup of men. P(married | man) = 63.6/106.2 = 0.5989 59.89% of men were married.

12

P(five cards, all diamonds) = (13/52) × (12/51) × (11/50) × (10/49) × (9/48) ≈ 0.0005 (I was surprised that the probability is that high, about once every 2000 hands. And the probability of being dealt a five-card flush of any suit is four times that, about once in every 500 hands.)

13

(a) 3 of 20 M&Ms are yellow, so 17 are not yellow. You want the probability of three nonyellows in a row: (17/20)×(16/19)×(15/18) ≈ 0.5965

(b) The probability is zero, since there are only two reds to start with.
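Problems 12 and 13 both multiply a chain of fractions for draws without replacement: each draw removes one item from the favorable group and one from the total. Here is a hedged sketch of that pattern with exact fractions; the function name p_all_from_group is an assumption made for this example.

```python
from fractions import Fraction

def p_all_from_group(group_size, total, draws):
    """P(every one of `draws` items, taken without replacement, comes from the group)."""
    p = Fraction(1)
    for i in range(draws):
        # After i successful draws, the group and the total are each i smaller
        p *= Fraction(group_size - i, total - i)
    return p

p_diamonds = p_all_from_group(13, 52, 5)     # five cards, all diamonds
p_nonyellow = p_all_from_group(17, 20, 3)    # three M&Ms, none yellow
p_three_red = p_all_from_group(2, 20, 3)     # three reds when only two exist

print(float(p_diamonds), float(p_nonyellow), p_three_red)
```

When the group runs out before the draws do, one factor in the product is zero, so the whole probability is zero, which matches the reasoning in 13(b).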

14

You’re being asked about all three possibilities: two fail, one fails, none fail. Therefore the three probabilities must add up to 1, and you need to compute only two of them. It’s also important to note that the companies are independent: whether one fails has nothing to do with whether the other fails. (Without knowing that the companies are independent, you could not compute the probability that both fail.)

(a) Since the companies are independent, you can use the simple multiplication rule:
P(A bankrupt and W bankrupt) = P(A bankrupt) × P(W bankrupt)
P(A bankrupt and W bankrupt) = .9 × .8 = 0.72

At this point you could compute (b), but it’s a little messy because you need the probability that A fails and W is okay, plus the probability that A is okay and W fails. (c) looks easier, so do that first.

(c) “Neither bankrupt” means both are okay. Again, the events are independent so you can use the simple multiplication rule.
P(neither bankrupt) = P(A okay and W okay)
P(A okay) = 1−.9 = 0.1; P(W okay) = 1−.8 = 0.2
P(neither bankrupt) = .1 × .2 = 0.02

(b) is now a piece of cake.
P(only one bankrupt) = 1 − P(both bankrupt) − P(none bankrupt)
P(only one bankrupt) = 1 − .72 − .02 = 0.26

Remark: If you have time, it’s always good to check your work and work out (b) the long way. You have only independent events (whether A is okay or fails, whether W is okay or fails) and disjoint events (A fails and W okay, A okay and W fails). The “okay” probabilities were computed in part (c).
P(only one bankrupt) = P(A bankrupt and W okay) + P(A okay and W bankrupt)
P(only one bankrupt) = (.9 × .2) + (.1 × .8) = 0.26

Common mistake: When working this out the long way, students often solve only half the problem. But when you have probability of exactly one out of two, you have to consider both A-and-not-W and W-and-not-A.

You can’t use the “or” formula here, even if you studied it. That computes the probability of one or the other or both, but you need the probability of one or the other but not both.

Remark: If you computed all three probabilities the long way, pause a moment to check your work by adding them to make sure you get 1. Whenever possible, check your work with a second type of computation.
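The three cases for two independent events (both, exactly one, neither) can be computed together and checked against each other. The sketch below uses exact fractions; the function name two_independent_events is only an illustration.

```python
from fractions import Fraction

def two_independent_events(p_a, p_b):
    """Return P(both), P(exactly one), P(neither) for independent events A and B."""
    both = p_a * p_b
    neither = (1 - p_a) * (1 - p_b)
    # Exactly one: A-and-not-B plus B-and-not-A (two disjoint sequences)
    exactly_one = p_a * (1 - p_b) + (1 - p_a) * p_b
    return both, exactly_one, neither

# P(A bankrupt) = 0.9 and P(W bankrupt) = 0.8, as in the problem
both, exactly_one, neither = two_independent_events(Fraction(9, 10), Fraction(8, 10))
print(float(both), float(exactly_one), float(neither))
print(both + exactly_one + neither)   # sanity check: the three cases cover everything
```

The final line is the “second type of computation” check: the three probabilities must total 1.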

15

(a) (You can assume independence because it’s a small sample from a large population.) P(red₁ and red₂ and red₃) = 0.13×0.13×0.13 = 0.0022

(b) P(red) = 0.13; P(red^C) = 1−0.13 = 0.87. P(red₁^C and red₂^C and red₃^C) = 0.87×0.87×0.87 or 0.87³ = 0.6585 Common mistake: Students sometimes compute 1−.13³. But .13³ is the probability that all three are red, so 1−.13³ is the probability that fewer than three (0, 1, or 2) are red. You need the probability that zero are red, not the probability that 0, 1, or 2 are red. Think carefully about where your “not” condition must be applied!

(c) The complement is your friend with “at least” problems. The complement of “at least one is green” is “none of them is green”, which is the same as “every one is something other than green.” P(green) = 0.16, P(non-green) = 1−0.16 = 0.84. P(≥1 green of 3) = 1 − P(0 green of 3) = 1 − P(3 non-green of 3) = 1−0.84³ ≈ 0.4073

(d) (Sequences are the most practical way to solve this one.)
(A) G₁ and G₂^C and G₃^C; (B) G₁^C and G₂ and G₃^C; (C) G₁^C and G₂^C and G₃
.16×(1−.16)×(1−.16) + (1−.16)×.16×(1−.16) + (1−.16)×(1−.16)×.16 ≈ 0.3387

16

In “at least” and “no more than” probability problems, the complement is often your friend. The complement of “at least one had not attended” is “all had attended”. If the fans are randomly selected, their attendance is independent and you can use the simple multiplication rule. P(all 5 attended) = 0.45^5 = 0.0185 P(at least 1 had not attended) = 1 − 0.0185 = 0.9815

17

Sequences are the way to go here: (cherry₁ and orange₂) or (orange₁ and cherry₂)

There are 11+9 = 20 sourballs in all, and Grace is choosing the sourballs without replacement (one would hope!), so the probabilities are: (11/20)×(9/19) + (9/20)×(11/19) = 99/190 or about 0.5211

Common mistake: There are two ways to get one of each: cherry followed by orange and orange followed by cherry. You have to consider both probabilities.

18

The complement is your friend, and the complement of “win at least once in 5 years” is “win 0 times in 5 years” or “lose 5 times in 5 years”.
P(win ≥1) = 1−P(win 0) = 1−P(lose 5)
P(lose) = 1−P(win) = 1−(1/500) = 499/500
P(lose 5) = [P(lose)]^5 = (499/500)^5 = 0.9900
P(win ≥1) = 1−P(lose 5) = 1−0.9900 = 0.0100 or 1.00%

Common mistake: If you compute 1−(499/500)^5 in one step and get 0.00996008, be careful with your rounding! 0.00996... rounds to 0.0100 or 1%, not 0.0010 or 0.1%.

Common mistake: 1/500 + 1/500 + ... is wrong. You can add probabilities only when events are disjoint, and wins in the various years are not disjoint events. It is possible (however unlikely) to win more than once; otherwise it would make no sense for the problem to talk about winning “at least once”.

Common mistake: You can’t multiply 5 by anything. Take an analogy: the probability of heads in one coin flip is 50%. Does that mean that the probability of heads in four flips is 4×50% = 200%? Obviously not! Any process that leads to a probability >1 must be incorrect.

Common mistake: 1−(1/500)^5 is wrong. (1/500)^5 is the probability of winning five years in a row, so 1−(1/500)^5 is the probability of winning 0 to 4 times. What the problem asks is the probability of winning 1 to 5 times.
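The complement trick for “at least one” problems is short enough to sketch in Python. The function name p_at_least_one is an assumption for illustration, and the trials are assumed to be independent with the same probability each time.

```python
def p_at_least_one(p_success, trials):
    """P(at least one success) = 1 - P(no successes) for independent trials."""
    return 1 - (1 - p_success) ** trials

p_win = p_at_least_one(1 / 500, 5)      # win the drawing at least once in 5 years
p_green = p_at_least_one(0.16, 3)       # at least one green M&M out of three

print(p_win, round(p_win, 4))
print(round(p_green, 4))
```

Printing the unrounded value next to the rounded one is a cheap guard against the rounding slip described above.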

19

(a), (b), and (c) are all the possibilities there are, so the probabilities must total 1. You can compute two of them and then subtract from 1 to get the third.

(a) P(not first and not second) = P(not first) × P(not second) = (1−.7)×(1−.6) = 0.12 (c) P(first and second) = P(first) × P(second) = .7×.6 = 0.42 (b) 1−.12−.42 = 0.46 Alternative: You could compute (b) directly too, using sequences: P(exactly one copy recorded) = P(first and not second) + P(second and not first) = P(first)×(1−P(second)) + P(second)×(1−P(first)) = .7×(1−.6) + .6×(1−.7) = 0.46 A very common mistake on problems like this is writing down only one of the sequences. When you have exactly one success (or exactly any definite number), almost always there are multiple ways to get to that outcome. You can’t use the “or” formula here, even if you studied it. That computes the probability of one or the other or both, but you need the probability of one or the other but not both.

Problem Set 2

20

(a) P(ticket on route A) = P(taking route A) × P(speed trap on route A) = 0.2×0.4 = 0.08. In the same way, the probabilities of getting a ticket on routes B, C, D are 0.1×0.3 = 0.03, 0.5×0.2 = 0.10, and 0.2×0.3 = 0.06. He can’t take more than one route to work on a given day, so those are disjoint events. The probability that he gets a ticket on any one morning is therefore 0.08+0.03+0.10+0.06 = 0.27. (b) The probability of not getting a ticket on a given morning is 1−0.27 = 0.73. The probability of getting no tickets on five mornings in a row is therefore 0.73^5 ≈ 0.2073 or about 21%.
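Here is a small sketch of the same two steps: multiply along each route, add across the disjoint routes, then raise the complement to the fifth power. It assumes that the first factor in each product above is the probability of taking the route and the second is the probability of a speed trap on it; the dictionary layout is only an illustration.

```python
# route: (P(taking the route), P(speed trap on that route))
routes = {
    "A": (0.2, 0.4),
    "B": (0.1, 0.3),
    "C": (0.5, 0.2),
    "D": (0.2, 0.3),
}

# Routes are disjoint, so the per-route ticket probabilities can be added
p_ticket = sum(p_route * p_trap for p_route, p_trap in routes.values())

# Mornings are treated as independent, so multiply the "no ticket" probability 5 times
p_no_ticket_5_days = (1 - p_ticket) ** 5

print(round(p_ticket, 2), round(p_no_ticket_5_days, 4))
```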

21

Two events A and B are independent if P(A|B) = P(A).
P(man) = 106.2/219.7 ≈ 0.4834
P(man|divorced) = 9.7/22.8 ≈ 0.4254
Since P(man|divorced) ≠ P(man), the events are not independent.

Alternative solution: You could equally well show that P(divorced|man) ≠ P(divorced):
P(divorced|man) = 9.7/106.2 ≈ 0.0913
P(divorced) = 22.8/219.7 ≈ 0.1038

22

What’s the probability of ten of the same flip in a row? In other words, given either result, what’s the probability that the next nine will be the same? That must be (1/2)^9 = 1/512. You therefore expect this to happen about once in every 500 flips, or about twice in every thousand.

23

P(open door) = P(unlocked) + P(locked)×P(right key) P(open door) = 0.5 + 0.5×(2/5) = 0.7

Solutions for Chapter 6

← Exercises for Ch 6

1

(a) 0, 1, 2, 3, 4, 5 (b) There are five trials, each die is either a two or not a two, and the dice are independent. This fits the binomial model.

2

(a) The probability model is shown below. (I computed the probability of losing $5 as 1−[1/10000000+1/125+1/20].)

x ($)        P(x)
9,999,995    1/10,000,000
95           1/125
5            1/20
−5           .9419999

(b) $ in L1, probabilities in L2. 1-VarStats L1,L2 yields µ = −2.70. The expected value of a ticket is −$2.70. This is a bad deal for you. (It’s a very good deal for the lottery company. They’ll make $2.70 per ticket, on average.)

Common mistakes: Students sometimes give hand-waving arguments such as the top prize being very unlikely, or the lottery company always getting to keep the ticket price, but these are not relevant. The only thing that determines whether it’s a good or bad deal for the player is the expected value µ.
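The expected value is just the sum of each net outcome times its probability. This sketch uses exact fractions so that no rounding creeps in; the variable names are assumptions for illustration.

```python
from fractions import Fraction

# Net winnings in dollars (prize minus the $5 ticket) and their probabilities
winning_outcomes = {
    9_999_995: Fraction(1, 10_000_000),
    95: Fraction(1, 125),
    5: Fraction(1, 20),
}

model = dict(winning_outcomes)
# Everything else is a loss of the $5 ticket price
model[-5] = 1 - sum(winning_outcomes.values())

expected_value = sum(x * p for x, p in model.items())
print(float(model[-5]), float(expected_value))
```

A negative expected value means the player loses money on average, which is the only criterion the solution uses.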

3

(a) This is a geometric model: repeated failures until a success, with p = 0.066. µ = 1/p = 1/.066 ≈ 15.2 Over the course of her undead existence, taking each night’s hunt as a separate experience, the average of all nights has her first getting an O negative drink from her fifteenth victim.

(b) geometcdf(.066,10) = .4947936946 ≈ 0.4948. Velma has almost a 50% chance of getting O negative blood within her first ten victims. (You could also do this as a binomial, n = 10, p = 0.066, x = 1 to 10.)

(c) This is a binomial model with n = 10, p = 0.066, and x = 2. Use MATH200A part 3 or binompdf(10,.066,2) = .1135207874 ≈ 0.1135. Velma has just over an 11% chance of getting exactly two O negative victims within her first ten.

4

This is a geometric distribution. You’re looking for someone who is opposed to universal background checks, so p = 1−.92 = 0.08.

(a) geometpdf(.08, 3) = .067712 → 0.0677 (b) geometcdf(.08, 3) = .221312 → 0.2213 (You could also do this as a binomial with n = 3, p = 0.08, x = 1 to 3.)
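For readers without a TI-83/84, the calculator functions used in problems 3 and 4 can be approximated with the textbook formulas. The Python functions below borrow the calculator names for convenience, but they are my own sketch written from the geometric and binomial definitions, not the calculator’s implementation.

```python
from math import comb

def geometpdf(p, k):
    """P(first success happens on trial k): k-1 failures, then a success."""
    return (1 - p) ** (k - 1) * p

def geometcdf(p, k):
    """P(first success happens on or before trial k) = 1 - P(k failures in a row)."""
    return 1 - (1 - p) ** k

def binompdf(n, p, k):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(round(geometcdf(0.066, 10), 4))    # problem 3(b)
print(round(binompdf(10, 0.066, 2), 4))  # problem 3(c)
print(round(geometpdf(0.08, 3), 4))      # problem 4(a)
print(round(geometcdf(0.08, 3), 4))      # problem 4(b)
```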

5

(a) This is a binomial distribution: each student passes or not, whether one student passes has nothing to do with whether anyone else passes, and there are a fixed seven trials.
µ = np = 7*0.8 ⇒ µ = 5.6 people
σ = √[npq] = √[7*0.8*(1−0.8)] = 1.058300524 ⇒ σ = 1.1 people

(b) Binomial again, n = 7, p = 0.8, x = 4 to 6. Use binompdf-sum or MATH200A part 3 to find P(4 ≤ x ≤ 6) = 0.7569 .

(c) Geometric model: p = 0.8, x = 3. geometpdf(.8,3) = 0.0320 (d) geometcdf(.8,2) = 0.9600 Alternative solution: Binomial probability with n = 2, p = 0.8, x = 1 to 2 gives the same answer.

6

This is binomial data, p = .49. For a sample of 40, expected value is µ = np = 40×.49 = 19.6. 13 is less than 19.6, so asking whether 13 is surprising is really asking whether 0 to 13 is surprising; see Surprise! binomcdf(40,.49,13) or MATH200A Program part 3 with n=40, p=.49, x=0 to 13 gives .0259693307 → 0.0260, less than 5%, so you would be surprised though maybe not flabbergasted.

7

(a) Probability of one equals proportion of all, and therefore a randomly selected 22-year-old male has a 0.1304% chance of dying in the next year. That’s the only “prize”, so multiply it by its probability to find fair price: 100000×0.001304 = $130.40

(b) The company’s gross profit is $180.00−130.40 = $49.60, about 28%. But it could very well cost the company that much to sell the policy, pay the agent’s commission, and enter the policy in the computer. Also, all policies must bear part of the company’s general overhead costs. The price is not necessarily unfair in the plain English sense.

8

(a) x’s in L1, P’s in L2. 1-VarStats L1,L2 yields µ = 2 (exactly) and σ = 1.095353824 or σ ≈ 1.1. Interpretation: In the long run, on average you expect to get two heads per group of five flips. You expect most groups of five flips will yield between µ−σ ≈ 1 head and µ+σ ≈ 3 heads.

(b) (I wouldn’t use this part as a regular quiz question.) The long-term average is 2 heads out of 5 flips, which is p = 2/5 = 40%. Obviously coin flips are independent, so the probability of heads must be the same every time. Therefore you have a binomial model with n = 5 and p = 0.4 .

9

(a) Binomial probability with n = 5, p = 0.7, x = 3 to 5. MATH200A part 3 5, .7, 3, 5 yields .83692 or P(x ≥ 3) = 0.8369. Or, binompdf(5,.7)→L6 and then sum(L6,4,6) to get the same answer. Or, use the complement: 1−binomcdf(5,.7,2).

(b) You need the mean of the binomial distribution: µ = np = 10×0.7 = 7 (c) 5 is less than the expected number, so you compute P(x≤5): MATH200A part 3 10, .7, 0, 5 yields 0.1503, or binomcdf(10,.7,5) = 0.1503, not surprising. Common mistake: Don’t just compute P(x=5), which is 0.1029. When you want to know whether a result is unusual or surprising, you have to find the probability of that result or one even further from the expected value.
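To see the difference between the tail probability P(x≤5) and the single-value probability P(x=5), here is a short sketch built from the binomial formula. The function names mirror the calculator’s binompdf and binomcdf, but the code is an illustration written from the definition, and the 5% cutoff is the rule of thumb used in these solutions.

```python
from math import comb

def binompdf(n, p, k):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    """P(0 through k successes): add up the individual probabilities."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

p_at_least_3 = 1 - binomcdf(5, 0.7, 2)   # part (a), by the complement
p_tail = binomcdf(10, 0.7, 5)            # part (c): 5 or fewer out of 10
p_exact = binompdf(10, 0.7, 5)           # the common-mistake value

print(round(p_at_least_3, 4), round(p_tail, 4), round(p_exact, 4))
print("surprising" if p_tail < 0.05 else "not surprising")
```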

10

(a) Geometric model, p = 0.34. µ = 1/.34 ≈ 2.94. About three (b) binompdf(5,.34,0) = .1252332576, about a 12.5% chance

11

Your words will vary, but you should have the idea that the binomial model is a fixed number of trials with varying number of successes, whereas the geometric model is a varying number of trials that ends with the first success.

12

Your words will vary, but you should have the idea that a pdf is the probability of a specific outcome, and the cdf is the cumulative probability of all outcomes 0 through a specified number. I’m not so concerned that you know what pdf and cdf stand for, as long as you understand what they mean and when to use each.

Solutions for Chapter 7

← Exercises for Ch 7

1

Because this textbook helps you, please click to donate!

On any given trip, there’s a 9% chance that Chantal’s commute will be less than 17 minutes. 9% of Chantal’s commutes are shorter than 17 minutes.

2

P(x ≥ 76.5) = normalcdf(76.5, 10^99, 69.3, 2.92) = .0068362782. Here are the two interpretations, from Interpreting Probability Statements in Chapter 5: The probability that a randomly selected man is 76.5″ or taller is 0.0068 or 0.68%. Only 0.68% of men are 76.5″ tall or taller.

3

“Have boundaries, find probability.” P(64 ≤ x ≤ 67) = normalcdf(64, 67, 64.1, 2.75) = 0.3686871988 → 0.3687. 36.87% of women are 64″ to 67″ tall.

4

5% probability in the two tails means 2.5% or 0.025 in each tail.

x1 = invNorm(.025, 69.3, 2.92) = 63.57690516
x2 = invNorm(1−.025, 69.3, 2.92) = 75.02309484

Heights under 63.6″ or over 75.0″ would be considered unusual.
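Here is a minimal Python sketch of the same two boundaries, assuming the standard library's `statistics.NormalDist` as a stand-in for the calculator's invNorm (its `inv_cdf` method takes area to the left, just as invNorm does). The variable names are mine.

```python
from statistics import NormalDist

heights = NormalDist(mu=69.3, sigma=2.92)   # men's heights, in inches

tail = 0.05 / 2                   # 5% split evenly between the two tails
low = heights.inv_cdf(tail)       # boundary with 2.5% of the area to its left
high = heights.inv_cdf(1 - tail)  # boundary with 2.5% of the area to its right

print(round(low, 1), round(high, 1))
```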

5

The area to left is given as 15% or 0.15, and you need the boundary. P15 = invNorm(.15, 69.3, 2.92) = 66.27361453. You must be at least 66″ or 5′6″ tall. Also acceptable: at least 66¼ inches, or at least 66.3 inches.

6

(a) By the definition of percentile, the number of the desired percentile is also the area to left.
P25 = invNorm(.25, 64.1, 2.75) = 62.24515319 → P25 = 62.2″
P75 = invNorm(.75, 64.1, 2.75) = 65.95484681 → P75 = 66.0″

(b) Q3 is P75 and Q1 is P25, so the IQR is P75−P25 = 65.95484681−62.24515319 = 3.70969362 → IQR = 3.7″.

(c) 1.35σ = 1.35×2.75 = 3.7125 → 3.7″, matching the IQR as expected. (The match isn’t perfect, because 1.35 is a rounded number.)
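The three parts can be reproduced in a few lines of Python, again assuming `statistics.NormalDist` in place of invNorm. This is only a sketch with made-up variable names, but it shows why the IQR and 1.35σ land so close together.

```python
from statistics import NormalDist

women = NormalDist(mu=64.1, sigma=2.75)   # women's heights, in inches

p25 = women.inv_cdf(0.25)   # 25th percentile = Q1
p75 = women.inv_cdf(0.75)   # 75th percentile = Q3
iqr = p75 - p25

rule_of_thumb = 1.35 * women.stdev   # the rounded shortcut from part (c)

print(round(p25, 1), round(p75, 1), round(iqr, 1), round(rule_of_thumb, 1))
```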

7

Use MATH200A Program part 4. The screens are shown at right. The points fall reasonably close to a line. r = 0.9595 and crit = 0.9383. r > crit, and therefore you can say that the normal model is a good fit to the data.

8

The percentile is the percent of the population that scored ≤ 735. P(x ≤ 735) = normalcdf(−10^99, 735, 500, 100) = 0.9906. A score of 735 is at the 99th percentile.

9

2% or 0.02 is area to right, but invNorm needs area to left, so you subtract from 1. x1 = invNorm(1−.02, 1500, 300) = 2116.124673 You must score at least 2117. (If you round to 2116, you get a number that is a bit less than the computed minimum. While rounding usually makes sense, there are situations where you have to round up, or round down, instead of following the usual rule.)

10

z0.01 = invNorm(1−0.01, 0, 1) = 2.326347877 → z0.01 = 2.33

11

P(x < 60) = normalcdf(−10^99, 60, 69.3, 2.92) = 7.240062385E−4 → P(x < 60) = 7.24×10⁻⁴ or (better) 0.0007

Common mistake: The probability is not 7.24! That’s not just wrong, it’s very wrong: probabilities are never greater than 1. “E−4” on your calculator comes at the end of the number, but it’s critical info. It means “times 10 to the minus 4th power”, so the probability is 7×10⁻⁴ or 0.0007.

The probability that a randomly selected man is under 60″ tall is 0.0007 or 0.07%. 0.07% of men are under 60″ tall.
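The same "E−4" trap exists in most software, so here is a small Python sketch of it, assuming `statistics.NormalDist` in place of normalcdf. The format codes are standard Python: `.4e` forces scientific notation and `.4f` forces ordinary decimal notation.

```python
from statistics import NormalDist

p = NormalDist(mu=69.3, sigma=2.92).cdf(60)   # P(x < 60)

scientific = f"{p:.4e}"   # mantissa followed by a power-of-ten exponent
decimal = f"{p:.4f}"      # the same number written out in decimal places

print(scientific)
print(decimal)
```

Reading only the mantissa of the scientific form would give a "probability" above 1, which is the mistake the solution warns about.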

12

The plot is pretty clearly not a straight line — there’s a sharp bend around the second and third data points. The numbers confirm this: r = .8363, crit = .9121, r < crit, and therefore the normal model is not a good fit for this data set.

13

The middle 90% leaves 10% in the two tails, or 5% in each tail.

xm1 = invNorm(.05, 69.3, 2.92) = 64.49702741 xm2 = invNorm(1−.05, 69.3, 2.92) = 74.10297259 xf1 = invNorm(.05, 64.1, 2.75) = 59.57665253 xf2 = invNorm(1−.05, 64.1, 2.75) = 68.62334747 Men must be 64.5 to 74.1 inches tall; women must be 59.6 to 68.6 inches tall.

Solutions for Chapter 8

← Exercises for Ch 8

1

This is numeric data. You have a random sample, and it’s less than 10% of the households in a country. Despite the skew, with sample size so far above 30 you can be sure that the shape of the sampling distribution is approximately normal. The mean of the sampling distribution is µx̄ = µ = $48,000. The SD of the sampling distribution of the mean, a/k/a standard error of the mean, is σx̄ = σ/√n = $2000/√64 → σx̄ = $250

2

(a) First, describe the distribution and sketch the situation. For the population, you’re given µ = 800, σ = 50, n = 100.

Center: The mean of the sampling distribution is the same as the mean of the population, 800 hours.

Spread: The standard error of the mean is σx̄ = σ/√n = 50/√100 = 5 hours.

Shape: You have a random sample, 10n = 10×100 = 1000 is certainly less than the total number of light bulbs, and your sample size is comfortably larger than 30. Therefore you can use the normal model for the sampling distribution. Sample means are ND with mean 800 hours and SD 5 hours. The sketch is at right.

Common mistake: The correct standard deviation is 5 hours, not 50. You’re not sketching the population of light bulbs. Rather, you’re now interested in the distribution of average lifetimes in samples of 100 bulbs. (The axis is the x̄ axis, not the x axis.)

780 hours, the sample mean that the problem asks about, is 20 hours below the population mean of 800. 20/5 = 4 standard errors, so you should have marked 780 hours at four standard deviations below the mean. A sample mean of 780 is less than the population mean of 800 hours. Therefore you compute the probability of a sample mean of 780 hours or less. It will be surprising (unusual, unexpected) if the probability is under 5%.

P(x̄ ≤ 780) = normalcdf(−10^99, 780, 800, 50/√(100)) = 3.1686E-5 → P(x̄ ≤ 780) = 0.00003 (You can also give the probability as 75.6). A sample weighing 75.6 lb total will have a sample mean of 75.6/15 = 5.04 lb, so this is really just another problem in finding the probability of turning up a sample mean in a given range. µx̄ = µ = 5.00 lb. The SEM is σx̄ = 0.05/√15 ≈ 0.013 lb. The sample means are normally distributed, even for this small sample, because the original population is normally distributed.

P(∑x > 75.6) = P(x̄ > 5.04) = normalcdf(75.6/15, 10^99, 5.00, 0.05/√(15)) = 9.7295E-4 ≈ 0.0010, about one chance in a thousand.

7

(a) This part is a standard Chapter 7 problem about individuals, not samples, so the axis is x rather than x̄.

Answer: P(x > 43.0) = 0.1634

(b) The sampling distribution of x̄ is ND, even for this small sample, because the population is ND. The standard error is σx̄ = 5.1/√14 ≈ 1.4.

P(x̄ > 43.0) = normalcdf(43, 10^99, 38, 5.1/√(14)) = 1.2212E-4 → P(x̄ > 43.0) = 0.0001 or 0.01%

Remark: This sketch is not very well proportioned, because it makes the probability look much larger than it actually is.

8

12,778 KW shared among 1000 households is 12778/1000 = 12.778 KW per household on average. “Fail to supply enough power” means that the households are using more power than that. You need P(x̄ > 12.778) for n = 1000. The standard error of the mean is σx̄ = 3.5/√1000, about 0.11. The sampling distribution of the mean is normal because data are numeric and n = 1000, greater than 30. (Treat the sample as random because it’s a “typical neighborhood”. And a thousand households is less than 10% of all the households that there are.) P(x̄ > 12.778) = normalcdf(12.778, 10^99, 12.5, 3.5/√(1000)) = 0.0060
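The key step here is using the standard error, not the population SD, as the spread of the sampling distribution. A minimal Python sketch of that step, assuming `statistics.NormalDist` in place of normalcdf and using variable names of my own choosing:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.5, 3.5, 1000        # population mean and SD (KW), sample size

sem = sigma / sqrt(n)                 # standard error of the mean
sampling_dist = NormalDist(mu, sem)   # distribution of sample means, not of households

# Probability that average demand exceeds the 12.778 KW per household available
p_overload = 1 - sampling_dist.cdf(12.778)

print(round(sem, 2), round(p_overload, 4))
```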

9

p = 0.0171, n = 11,037, and you want to find P(p̂ ≤ 0.0094). First check that the sampling distribution of p̂ is a ND: The doctors were randomized between treatment and placebo groups. 10×11,037 = 110,370. There are more adult males than that. np = 11037×.0171 = about 189; nq = 11037−189 = 10848. Both are well above 10. Therefore the sampling distribution can be approximated by a normal distribution. The standard error of the proportion or SEP is σp̂ = √[pq/n] = √[.0171(1−.0171)/11037] ≈ 0.0012. If you use my shortcut, your screen will look like the one at the left; if not, it will look like the one at the right.

Either way, the probability is 2.2013×10⁻¹⁰, or 0.000 000 000 2. There are only two chances in ten billion of getting a sample proportion of 0.94% or less with sample size 11,037, if the true population proportion is 1.71%. That’s pretty darn unlikely, so based on this experiment you can rule out coincidence and decide that aspirin does reduce the chance of a heart attack among adult males.
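For readers without the calculator screens, the computation can be sketched in Python. This assumes `statistics.NormalDist` as a substitute for normalcdf; the variable names are illustrative, and the last digits may differ slightly from the calculator's because the probability is so small.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.0171, 11037               # assumed population proportion, sample size

sep = sqrt(p * (1 - p) / n)        # standard error of the proportion
sampling_dist = NormalDist(p, sep)

# Probability of a sample proportion of 0.94% or less if the true rate is 1.71%
prob = sampling_dist.cdf(0.0094)

print(round(sep, 4))
print(f"{prob:.4e}")
```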

10

Heights are ND, so the sampling distribution is also. By the Empirical Rule or 68–95–99.7 Rule, 95% of a ND falls within 2 SD of the mean. The distribution that concerns you in this problem is the sampling distribution of x̄, not the original distribution of individual men’s heights. Therefore, the SD that concerns you is the standard error of the mean, not the SD of men’s heights.

The standard error of the mean or SEM is σx̄ = σ/√n = 2.92/√16 = 0.73″. µx̄ ± 2σx̄ = 69.3 ± 2×.73 = 67.84 to 70.76. Sample means between those values would not be surprising, and therefore a sample mean would be surprising if it is under 67.84″ or over 70.76″.

Alternative solution: That back-of-the-envelope calculation is good enough, but you could also get a more precise answer:
L = invNorm(0.025, 69.3, 2.92/√(16)) = 67.87
H = invNorm(1−0.025, 69.3, 2.92/√(16)) = 70.73
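The back-of-the-envelope answer and the more precise one can be compared in a short Python sketch, assuming `statistics.NormalDist` in place of invNorm. The variable names are mine, not the calculator's.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 69.3, 2.92, 16
sem = sigma / sqrt(n)                      # standard error of the mean

# Empirical Rule: about 95% of sample means fall within 2 standard errors
rough_low, rough_high = mu - 2 * sem, mu + 2 * sem

# More precise: cut off exactly 2.5% in each tail of the sampling distribution
sampling_dist = NormalDist(mu, sem)
exact_low = sampling_dist.inv_cdf(0.025)
exact_high = sampling_dist.inv_cdf(1 - 0.025)

print(round(rough_low, 2), round(rough_high, 2))
print(round(exact_low, 2), round(exact_high, 2))
```

The two pairs differ only in the second decimal place because the exact multiplier is about 1.96 rather than 2.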

11

This is like the Swain v. Alabama example. You have to convert the sample counts into a proportion: p̂ = 737/1504 ≈ 49%. The problem is really asking you for P(p̂ ≥ 49%) in a sample of 1504 with population proportion of 45%.

What does the sampling distribution look like? The center is µp̂ = p = 0.45. The standard error is σp̂ = √[.45×(1−.45)/1504] ≈ 0.013.

Check requirements to make sure that a normal model can be used for the sampling distribution: Random sample? Yes, given. Sample less than 10% of population? 10×1504 = 15,040, compared to millions of American adults, OK. Sample large enough? Yes, 0.45×1504 ≈ 677 successes and 1504−677 ≈ 827 failures expected, both above 10.

P(x ≥ 737) = P(p̂ ≥ 49%) = normalcdf(737/1504, 10^99, .45, √(.45*(1−.45)/1504)) ≈ 9E-4 or 0.0009.

Can you draw a conclusion? Yes, you can. In a population with 45% unfavorable rating of the Tea Party, there are only 9 chances in 10,000 of getting a sample as unfavorable as this one (or more unfavorable). That’s pretty unlikely, so you conclude that the true unfavorable rating in October was most likely more than 45% of all Americans. (In Chapter 9, you’ll learn how to estimate that proportion from a sample.)

Solutions for Chapter 9

← Exercises for Ch 9

1

You make probability statements about things that can change if you repeat the experiment. There’s a 1/6 chance of rolling doubles, because you’ll get doubles about 1/6 of the times that you roll two dice. But the mean of the population is one definite number. It doesn’t change from one experiment to the next. Your estimate changes, because it’s based on your sample and no sample is perfect. But the thing you’re trying to estimate, mean or proportion, is what it is even though you don’t know it exactly. (Statisticians would say, “the population mean or proportion is not a random variable.” By that, they mean just what I said in less technical language.)

Answer: A confidence interval for numeric data is an estimate of the average, and tells you nothing about individuals. Correct his conclusion to I’m 90% confident that the average food expense for all TC3 students is between $45.20 and $60.14 per week.

Remark: Use all or a similar word to show that you’re estimating the mean for the population, not just the sample of 40 students. There’s no need to estimate the mean of the sample, because you know the exact sample mean x̄ for your sample.

Remark: Be clear in your mind that you’re estimating the average spending per student at $45–60 a week. Some individual students will quite likely spend outside that range, so your interpretation shouldn’t say anything about individual student spending.

2

Answer: It’s the use of the word average. When you collect data points that are all yes/no or success/failure, you have a sample proportion p̂, equal to the number of successes divided by sample size, and you can estimate a population proportion. There is no “average” with nonnumeric data. Your 90% confidence estimate is simply that 27% to 40% usually or always prepare their own food.

3

4

This is a confidence interval about a mean, Case 1 in Inferential Statistics: Basic Cases.

Requirements: random sample, OK. 10n = 10×40 = 400 is less than total number of batteries made; OK. n = 40 > 30, OK.

TInterval 1756, 142, 40, .95 → (1710.6, 1801.4)

Neveready is 95% confident that the average Neveready A cell, operating a wireless mouse, lasts 1711 to 1801 minutes (28½ to 30 hours).

Common mistake: Don’t make any statement about 95% of the batteries! Your CI is about your estimate of one number, the average life of all batteries. Your CI has a margin of error of ±45 minutes (half the width of the interval); the 95% range for all batteries would be about ±4 to 5 hours.

5

(a) p̂ = 5067/10000 = 0.5067. Don’t make the term “point estimate” harder than it is! The point estimate for the population mean (or proportion, standard deviation, etc.) is just the sample mean (or proportion, standard deviation, etc.).

(b) The sample is his actual data, the 10,000 flips. Therefore the sample size is n = 10,000. The population is what he wants to know about, all possible flips. The population size is infinite or “indefinitely large”.

6

This is sample size for a confidence interval about a proportion, Case 2 in Inferential Statistics: Basic Cases. Since you have no prior estimate, use 0.5 for p.

With the MATH200A program (recommended): MATH200A/sample size/binomial, p = .5, E = .035, C-Level = .95, sample size is at least 784.

If you’re not using the program: The formula is n = p(1−p)·(zα/2 / E)². 1−α = .95 ⇒ α/2 = 0.025. z0.025 = invNorm(1−.025, 0, 1). Divide by .035, square the result, and multiply by .5*(1−.5). Answer: at least 784. Remember: you’re not rounding, you’re going up to a whole number.
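The by-hand steps for the sample size (find the critical z, divide by E, square, multiply by p(1−p), then go up to a whole number) can be sketched in Python. The function name `sample_size_proportion` is hypothetical, and `statistics.NormalDist` stands in for invNorm; this is an illustration of the arithmetic, not the MATH200A program itself.

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(p, margin, conf):
    """Smallest n so that a one-proportion interval has the given margin of error."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical z for the confidence level
    raw = p * (1 - p) * (z / margin) ** 2
    return ceil(raw)                          # always go UP to a whole number

needed = sample_size_proportion(p=0.5, margin=0.035, conf=0.95)
print(needed)
```

Using `ceil` rather than `round` is the code version of "you’re not rounding, you’re going up".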

7

This is a confidence interval about a proportion, Case 2 in Inferential Statistics: Basic Cases.

Requirements: Random sample, OK. 10n = 10×100 = 1000 < 68,917, OK. 40 successes, 100−40 = 60 failures, both > 10, OK.

Common mistake: Don’t say “n > 30” or “n ≥ 30”. That’s true, but it doesn’t help you with binomial data. For computing a confidence interval about a proportion from binomial data, the “sample size large enough” condition is at least 10 successes and at least 10 failures, not sample size at least 30.

1-PropZInt 40, 100, .9 → (.31942, .48058), p̂ = .4

31.9% to 48.1% of all claims at that office have been open for more than a year (90% confidence).
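To see what sits behind this interval, here is a minimal Python sketch of the usual one-proportion z interval, p̂ ± z·√[p̂(1−p̂)/n]. I am assuming that is the textbook formula the calculator's 1-PropZInt applies; the function name `one_prop_z_interval` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_interval(x, n, conf):
    """Confidence interval for a population proportion from x successes in n trials."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # critical z for this confidence level
    margin = z * sqrt(p_hat * (1 - p_hat) / n)     # margin of error E
    return p_hat - margin, p_hat + margin

low, high = one_prop_z_interval(40, 100, 0.90)
print(round(low, 5), round(high, 5))
```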

8

This is a confidence interval about a mean, Case 1 in Inferential Statistics: Basic Cases.

9

This is a confidence interval about a mean, Case 1 in Inferential Statistics: Basic Cases.

Requirements check: Random sample, OK. 10×40 = 400, less than the number of times she could commute (past, present, and future), OK. Sample size 40 > 30, OK. TInterval 17.7, 1.8, 40, .95 → (17.124, 18.276). She’s 95% confident that the average of all her commutes is 17.1 to 18.3 minutes.

Requirements check: Random sample, OK. 10×15 = 150 is less than total number of women in their 20s. MATH200A/Normality: r=.9667, CRIT=.9383, r>CRIT, OK. MATH200A/Box-whisker: no outliers, OK.

TInterval L6, 1, .95 → (62.918, 65.016), x̄=63.96666667, s=1.894226818, n=15

The average height of women aged 20–29 is 62.9 to 65.0 inches (95% confidence).

Remark: Since adult women’s heights are known to be normally distributed, you could get away without checking for normality and outliers in this sample. But it does no harm to check every time.

10

This is a confidence interval about a mean, Case 1 in Inferential Statistics: Basic Cases.

Requirements check: Random sample, OK. 10×18 = 180. There are far more than 180 male students; OK. MATH200A/box-whisker: no outliers, OK. MATH200A/Normality check: r=.9787, CRIT=.9461, r>CRIT, OK.

TInterval L5, 1, .9 → (97.757, 98.343), x̄ = 98.05, s = .7155828558 → 0.72, n = 18.

(a) Fred is 90% confident that the average body temperature of healthy male students is 97.8 to 98.3 °F.

(b) He’s 90% confident that the average body temperature is not more than 98.3°, so 98.6° as normal (average) temperature is inconsistent with his data.

(c) E = 98.343−98.05 = 0.3°, or E = 98.05−97.757 = 0.3°, or (98.343−97.757)/2 = 0.3°.

(d) With the MATH200A program (recommended): MATH200A/Sample size/Num unknown σ: s=.7155828558, E=.1, C-Level=.95, n≥202. He will need at least 202 in his sample.

(d) If you’re not using the program: Confidence level = 1−α = 0.95 ⇒ α = 0.05 ⇒ α/2 = 0.025. z0.025 = invNorm(1−.025). Multiply by s, divide by E, and square the result. This gives 197. But the t distribution is more spread out than the normal (z) distribution, so you probably want to bump that number up a bit, say to 200 or so.

11

This problem is about a confidence interval about a proportion, Case 2 in Inferential Statistics: Basic Cases.

(a) Requirements check: Random sample, OK. 10×500 = 5000. A city of 6.4 million must have more than 5000 in that age range. OK. 219 successes, 500−219 = 281 failures. Both > 10, OK.

1-PropZInt, 219, 500, .9 → (.4015, .4745), p̂ = .438

You’re 90% confident that 40.2% to 47.5% of Metropolis adults aged 50–75 have had a colonoscopy in the past ten years.

(b) MATH200A/sample size/binomial, p = .438, E = .02, C-Level = .9 → at least 1665

12

This is a confidence interval about a mean, Case 1 in Inferential Statistics: Basic Cases.

13

This is a confidence interval about a proportion, Case 2 in Inferential Statistics: Basic Cases.

Requirements check: Random sample, OK. 10×20 = 200, less than the total number of cash deposits, OK. MATH200A/normality check, r=.9864, CRIT=.9503, r>CRIT, OK. MATH200A/box-whisker, no outliers, OK.

TInterval L4, 1, .95 → (179.86, 198.93), x̄ = 189.40, s = 20.37, n = 20

You’re 95% confident that the average of all cash deposits is between $179.86 and $198.93.

Common mistake: Don’t say that 95% of deposits are between those values; if you look at the sample you’ll see that’s pretty unlikely. You’re estimating the average, not the individual deposits in the population.

Requirements check: Systematic sample, OK. Sample 10n = 10×1000 = 10,000, less than the number of voters; OK. 520 successes and 1000−520 = 480 failures, OK.

1-PropZInt 520, 1000, .95 → (.48904, .55096), p̂ = .52. With 95% confidence, 48.9% to 55.1% of voters voted Snake. At the 95% confidence level, we can’t tell whether more or less than 50% of voters voted for Abe Snake.

Solutions for Chapter 10

← Exercises for Ch 10

Problem Set 1

1

1. Hypotheses
2. Significance level
RC. Requirements check
3–4. Test statistic and p-value
5. Decision rule (or, conclusion in statistics language)
6. Conclusion (in English)

2

It keeps you honest. If you could select a significance level after computing the p-value, you could always get the result you want, regardless of evidence.

3

Answers will vary here. But you should get in the key idea that if H 0 is true, the p-value is the chance of getting the sample you got, or a sample even further from H 0 , purely by random chance. For more correct statements, and common incorrect statements, see What Does the p-Value Mean?

4

(a) It’s too wishy-washy. When p you can’t reach a conclusion. Accepting H 0 is wrong because it reaches the conclusion that H 0 is true. Failing to reject H 0 is correct because it leaves both possibilities open. It’s like a jury verdict of “not guilty beyond a reasonable doubt. The jury is not saying the defendant didn’t do it. They are saying that either he didn’t do it or he did it but the prosecution didn’t present enough evidence to convince them. A hypothesis test can end up rejecting H 0 or failing to reject it, but the result can never be to accept H 0 .

9

H 0 : µ = 500 H 1 : µ ≠ 500

Remark: It must be ≠, not > or ; fail to reject H 0 . At the 0.01 significance level, we can’t determine whether the directors are stealing from the company or not.

(b) p < α; reject H 0 and accept H 1 . At the 0.01 level of significance, we find that the directors are stealing from the company.

11

α is the probability of a Type I error that you can tolerate. A Type I error in this case is determining that the defendant is guilty (calling H 0 false) when actually he’s innocent (H 0 is really true), and the consequence would be putting an innocent man to death. You specify a low α to make it less likely this will happen. Of the given choices, 0.001 is best. 80%

12

This is binomial data, a Case 2 test of proportion in Inferential Statistics: Basic Cases.

(1)

H 0 : p = .1, 10% of TC3 students driving alcohol impaired H 1 : p > .1, more than 10% of TC3 students driving alcohol impaired

(2)

α = 0.05

(RC)

Systematic sample (counts as random), OK. npo = 120×.10 = 12 successes and n−npo = 120−12 = 108 failures expected, OK. 10n = 10×120 = 1200, and there are many more students than that at TC3, OK.

(3/4)

1-PropZTest: .1, 18, 120, >po results: z=1.825741858 → z = 1.83, p=.0339445194 → p = 0.0339, p̂ = .15

(5)

p < α. Reject H 0 and accept H 1 .

(6)

At the 0.05 significance level, more than 10% of TC3 students were alcohol impaired on the most recent Friday or Saturday night when they drove. Or, More than 10% of TC3 students were alcohol impaired on the most recent Friday or Saturday night when they drove (p = 0.0339).
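The test statistic and p-value in step (3/4) can be reproduced with a short Python sketch of the standard one-proportion z test, where the standard error is computed from the hypothesized po rather than from the sample. The function name `one_prop_z_test_greater` is hypothetical, and `statistics.NormalDist` stands in for the calculator.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test_greater(p0, x, n):
    """Right-tailed test of H0: p = p0 against H1: p > p0."""
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)         # standard error uses p0, because H0 is assumed true
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)    # area to the right of z
    return z, p_value

z, p_value = one_prop_z_test_greater(p0=0.10, x=18, n=120)
alpha = 0.05
print(round(z, 2), round(p_value, 4), "reject H0" if p_value < alpha else "fail to reject H0")
```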

13

This is binomial data (against or not against): a Case 2 test of population proportion in Inferential Statistics: Basic Cases.

Requirements check: Random sample? NO, this is a self-selected sample, consisting only of those who returned the poll. (That could be overcome by following up on those who did not return the poll, but nobody did that.) The 10n≤N requirement also fails. 10n = 10×380 = 3800, much larger than the 1366 population size. Answer: No, you cannot do any inferential procedure because the requirements are not met.

14

(a) The sample size is 325 . Why not the 500 she talked to? Because she was studying the habits of the primary grocery shoppers. The 325 were members of that population and could therefore be part of her sample; the rest of the 500 were not. (b) The population is all persons who do the primary grocery shopping in their households . We don’t know the precise number, but it is surely in the millions since there are millions of households. We can say that it is indefinitely large . (c) The number 182 is x, the number of successes in the sample . (d) She wanted to know whether the true proportion is greater than 40%, so her alternative hypothesis is H 1 : p > 0.4 and po is 0.4 . (e) No. The researcher is interested in the habits of the primary grocery shoppers in households; therefore she must sample only people who are primary grocery shoppers in their households. If you even thought about saying Yes, please go back to Chapter 1 and review what bias actually means.

15

(a) This is inference about the proportion in one population, Case 2 in Inferential Statistics: Basic Cases.

(1)

H 0 : p = 2/3, the chance of winning is 2/3 if you switch doors. H 1 : p ≠ 2/3, the chance of winning is different from 2/3 if you switch doors. Remark: You need to test for ≠, not . Fail to reject H 0 .

(6)

We can’t determine whether the claim “switching doors gives a 2/3 chance of winning” is true or false (p = 0.4386). Or, At the 0.05 significance level, we can’t determine whether the probability of winning after switching doors is equal to 2/3 or different from 2/3. Remark: It’s true that you can’t disprove the claim, but it’s also true that you can’t prove it. This is where a confidence interval gives useful information.

(b) Requirements have already been checked. 1-PropZInt 18, 30, .95. Results: (.4247, .7753), p = .6. We’re 95% confident that the true probability of winning if you switch doors is between 42.5% and 77.5%. (c) It’s possible that the true probability of winning if you switch doors is 1/3 (33.3%) or even worse, but it’s very unlikely. Why? You’re 95% confident that it’s at least 42.5%. Therefore you’re better than 95% confident that the true probability if you switch is better than the 1/3 probability if you don’t switch doors. Switching is extremely likely to be the good strategy.

16

The null hypothesis is always “no effect”, “nothin’ goin’ on here.” In this case “no effect” is “not spam”, so H 0 is “This piece of mail is not spam.”

(a) A Type I error is rejecting the null hypothesis when it’s actually true. Here, a Type I error means deciding a piece of mail is spam when it’s actually not, so if Heather’s spam filter makes a Type I error then it will delete a piece of real mail. A Type II error is failing to reject H 0 when it’s actually false, treating a piece of spam as real mail, so a Type II error would let a piece of spam mail into Heather’s in-box.

(b) Most people would rather see a piece of spam (Type II) than miss a piece of real mail (Type I), so a Type I error is more serious in this situation. Lower significance levels make Type I errors less likely (and Type II errors more likely), so a lower α is appropriate here.

17

This is a test of one population proportion, Case 2 in Inferential Statistics: Basic Cases.

(1)

H 0 : p = .304 H 1 : p < .304, less than 30.4% of Ithaca households own cats.

(2)

α = 0.05

(RC)

Random sample? Systematic, OK. Sample too large? 10×215 = 2150. Without knowing how many households are in Ithaca, we can be sure it’s more than 2150. Sample large enough? In a sample of 215, according to H 0 you expect 215×.304 » 65 successes and 215−65 = 150 failures, OK.

(3/4)

1-PropZTest .304, 54, 215, <po results: z = −1.68, p-value = 0.0461, p̂ = 0.2512

(5)

p < α. Reject H 0 and accept H 1 .

(6)

At the 0.05 significance level, fewer than 30.4% of Ithaca households own cats. Or, Fewer than 30.4% of Ithaca households own cats (p = 0.0461).

Problem Set 2

18

(a) The population parameter is missing. It should be either µ or p, but since a proportion can’t be greater than 1 it must be µ. Correction: H 0 : µ = 14.2; H 1 : µ > 14.2

(b) H 0 must have an = sign. Correction: H 0 : µ = 25; H 1 : µ > 25 (c) You used sample data in your hypotheses. Correction: H 0 :µ=750; H 1 :µ>750 (d) You were supposed to test “makes a difference”, not “is faster than”. Never do a one-tailed test (> or crit, therefore normal. Outliers? MATH200A part 2 shows none.

(3/4)

T-Test: 625, L1, 1, >µo results: t=3.232782217 → t = 3.23, p=.0071980854 → p = 0.0072, x̄ = 675, s=43.74602023 → s = 43.7, n = 8

(5)

p < α. Reject H 0 and accept H 1 .

(6)

At the 0.01 level of significance, Whizzo is stronger on average than Stretchie. Or, Whizzo is stronger on average than Stretchie (p = 0.0072).

24

This is numeric data, with σ unknown: test of population mean, Case 1 in Inferential Statistics: Basic Cases.

(1)

H 0 : µ = 6 H 1 : µ > 6

(2)

α = 0.05

(RC)

Systematic sample. n =100 >30. 10n = 10×100 = 1000, less than the number of TC3 students.

(3/4)

T-Test: 6, 6.75, 3.3, 100, >µo results: t=2.272727273 → t = 2.27, p=.0126021499 → p = 0.0126

(5)

p < α. Reject H 0 and accept H 1 .

(6)

TC3 students do average more than six hours a week in volunteer work, at the 0.05 level of significance. Or, TC3 students do average more than six hours a week in volunteer work (p = 0.0126).

25

Binomial data (head or tail) implies Case 2, test of population proportion in Inferential Statistics: Basic Cases. A fair coin has heads 50% likely, or p = 0.5.

(1)

H 0 : p = 0.5, the coin is fair H 1 : p ≠ 0.5, the coin is biased Common mistake: You must test ≠, not >. An unfair coin would produce more or less than 50% heads, not necessarily more than 50%. Yes, this time he got more than 50% heads, but your hypotheses are never based on your sample data.

(2)

α = 0.05

(RC)

Random sample? Yes, it’s coin flips. npo = 10000×.5 = 5000 successes and 10000−5000 = 5000 failures expected. 10n = 10×10,000 = 100,000. It would be possible to flip the coin more than 100,000 times.

(3/4)

1-PropZTest, .5, 5067, 10000, prop≠po results: z = 1.34, p = .1802454677 → p-value = 0.1802, p̂ = .5067

(5)

p > α. Fail to reject H 0 .

(6)

At the 0.05 level of significance, we can’t tell whether the coin is fair or biased. Or, We can’t determine from this experiment whether the coin is fair or biased (p = 0.1802).

Common mistake: You can’t say that the coin is fair, because that would be accepting H 0 . You can’t say “there is insufficient evidence to show that the coin is biased”, because there is also insufficient evidence to show that it’s fair.

Remark: “Fail to reject H 0 ” situations are often emotionally unsatisfying. You want to reach some sort of conclusion, but when p > α you can’t. What you can do is compute a confidence interval: 1-PropZInt: 5067, 10000, .95 results: (.4969, .5165). You’re 95% confident that the true proportion of heads for this coin (in the infinity of all possible flips) is 49.69% to 51.65%. So if the coin is biased at all, it’s not biased by much.
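The two-tailed test and the follow-up confidence interval for the coin can be sketched together in Python. This assumes the standard formulas (standard error from po for the test, from p̂ for the interval) and uses `statistics.NormalDist` in place of the calculator; all variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

p0, x, n = 0.5, 5067, 10000
p_hat = x / n
std_normal = NormalDist()

# Two-tailed test of H0: p = 0.5 (standard error computed from p0)
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - std_normal.cdf(abs(z)))    # double the one-tail area

# 95% confidence interval (standard error computed from p_hat)
z_star = std_normal.inv_cdf(1 - 0.05 / 2)
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)

print(round(z, 2), round(p_value, 4))
print(round(ci[0], 4), round(ci[1], 4))
```

The interval contains 0.5, which agrees with the failure to reject H 0 , and its narrowness is what supports "not biased by much".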

26

You have numeric data, and you don’t know the SD of the population, so this is a Case 1 test of population mean in Inferential Statistics: Basic Cases.

(a) Check requirements: random sample, n = 45 > 30, and there are more than 10×45 = 450 people with headaches. TInterval: x̄=18, s=8, n=45, C-Level=.95 Results: (15.597, 20.403). We’re 95% confident that the average time to relief for all headache sufferers using PainX is 15.6 to 20.4 minutes.

(b) Requirements have already been checked. A two-tailed test (a test for “different”) at the 0.05 level is equivalent to a confidence interval at the 1−0.05 = .95 = 95% confidence level. Since the 95% CI includes 20, the mean time for aspirin, we cannot determine, at the 0.05 significance level, whether PainX offers headache relief to the average person in a different time than aspirin or not.

Solutions for Chapter 11

← Exercises for Ch 11

1

(a) Use MATH200A part 5 and select 2-pop binomial. You have no prior estimates, so enter 0.5 for p1 and p2 . E is 0.03, and C-Level is 0.95. Answer: you need at least 2135 per sample , 2135 people under 30 and 2135 people aged 30 and older. Here’s what it looks like, using MATH200A part 5:

Caution! Even if you don’t identify the groups, at least you must say “per sample”. Plain “2135” makes it look like you need only that many people in the two groups combined, or around 1068 per group, and that is very wrong.

Caution! You must compute this as a two-population case. If you compute a sample size for just one group or the other, you get 1068, which is just about half of the correct value.

If you don’t have the program, you have to use the formula: [p1(1−p1)+p2(1−p2)]·(zα/2 / E)². You don’t have any prior estimates, so p1 and p2 are both equal to 0.5. Add p1×(1−p1) + p2×(1−p2) to get .5. Next, 1−α = 0.95, so α = 0.05 and α/2 = 0.025. zα/2 = z0.025 = invNorm(1−0.025). Divide that by E (.03), square, and multiply by the result of the computation with the p’s.

(b) Using MATH200A Program part 5 with .3, .45, .03, .95 gives 1953 per sample.

Alternative solution: Using the formula, .3(1−.3)+.45(1−.45) = .4575. Multiply by (invNorm(1−.05/2)/.03)² as before to get 1952.74157 → 1953 per sample.

Again, you must do this as two-population binomial. If you do the under-30 group and the 30+ group separately, you get sample sizes of 897 and 1057, which are way too small. If your samples are that size, the margins of error for under-30 and 30+ will each be 3%, but the margin of error for the difference, which is what you care about, will be around 4.2%, and that’s greater than the desired 3%.
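The two-population formula above translates directly into a few lines of Python. The function name `sample_size_two_proportions` is made up for this sketch, and `statistics.NormalDist` stands in for invNorm; it is an illustration of the formula, not the MATH200A program.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, margin, conf):
    """Per-sample size so the difference p1 - p2 is estimated within the margin."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)       # critical z for the confidence level
    spread = p1 * (1 - p1) + p2 * (1 - p2)             # the two terms are ADDED, not multiplied
    return ceil(spread * (z / margin) ** 2)            # go up to a whole number

no_prior = sample_size_two_proportions(0.5, 0.5, 0.03, 0.95)       # part (a)
with_prior = sample_size_two_proportions(0.3, 0.45, 0.03, 0.95)    # part (b)
print(no_prior, with_prior)   # each is a per-sample size, needed in BOTH groups
```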

2

(a) You have numeric data in two independent samples. You’re testing the difference between the means of two populations, Case 4 in Inferential Statistics: Basic Cases. (The data aren’t paired because you have no reason to associate any particular Englishman with any particular Scot.) (1)

Population 1 = English; population 2 = Scots.
H0: µ1 = µ2 (or µ1−µ2 = 0)
H1: µ1 > µ2 (or µ1−µ2 > 0)

(2)

α = 0.05

(RC) The problem states that samples were random. For English, r=.9734 and crit=.9054; for Scots, r=.9772 and crit=.9054. Both r’s are greater than crit, so both are nearly normally distributed. The stacked boxplot shows no outliers. And obviously the samples of 8 are far less than 10% of the populations of England and Scotland.



(3/4)

English numbers in L1, Scottish numbers in L2. 2-SampTTest with Data; L1, L2, 1, 1, µ1 > µ2, Pooled:No. Outputs: t=1.57049305 → t = 1.58, p=.0689957991 → p = 0.0690, df=13.4634, x̄1=6.54, x̄2=4.85, s1=1.91, s2=2.34, n1=8, n2=8

(5)

p > α. Fail to reject H0.

(6)

At the 0.05 level of significance, we can’t say whether English or Scots have a stronger liking for soccer. Or, We can’t say whether English or Scots have a stronger liking for soccer (p = 0.0690).

(b) Requirements are already covered. 2-SampTInt, C-Level=.90 Results: (−.2025, 3.5775) We’re 90% confident that, on a scale from 1=hate to 10=love, the average Englishman likes soccer between 0.2 points less and 3.6 points more than the average Scot.

3

(a) This is the difference of proportions in two populations, Case 5 in Inferential Statistics: Basic Cases.

(1)

Population 1 = English, population 2 = Scots.
H0: p1 = p2 (or p1−p2 = 0)
H1: p1 ≠ p2 (or p1−p2 ≠ 0)

(2)

α = 0.05

(RC)

Populations of England and Scotland are greater than 10×150 = 1500 and 10×200 = 2000. England: 105 successes, 150−105 = 45 failures, both ≥ 10. Scotland: 160 successes, 200−160 = 40 failures, both ≥ 10. The samples were stated to be random.

(3/4)

2-PropZTest x1=105, n1=150, x2=160, n2=200, p1≠p2. Results: z=−2.159047761 → z = −2.16, p=.030846351 → p = 0.0308, p̂1 = 0.70, p̂2 = 0.80, p̂ = 0.7571428571

(5)

p < α. Reject H0 and accept H1.

(6)

The English and Scots are not equally likely to be soccer fans, at the 0.05 level of significance; in fact the English are less likely to be soccer fans. Or, The English and Scots are not equally likely to be soccer fans (p = .0308); in fact the English are less likely to be soccer fans.

(b) Requirements already checked. 2-PropZInt with C-Level = .95 → (−.1919, −.0081). That’s the estimate for p1−p2, English minus Scots. Since that’s negative, English like soccer less than Scots do. With 95% confidence, Scots are more likely than English to be soccer fans, by 0.8 to 19.2 percentage points.

(c) [(−.0081) − (−.1919)] / 2 = 0.0919, a little over 9 percentage points.

(d) MATH200A part 5, 2-pop binomial, p1=.7, p2=.8, E=.04, C-Level .95 gives 889 per sample. By formula, z_{α/2} = z_{0.025} = invNorm(1−0.025) = 1.96. n1 = n2 = [.7(1−.7)+.8(1−.8)]×(1.96/.04)² = 888.37 → 889 per sample
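To see what 2-PropZTest and 2-PropZInt are doing with these numbers, here is a rough Python sketch. The function names are invented for this illustration. It assumes the usual textbook formulas: the test pools the two samples to get its standard error, while the interval uses each sample's own proportion.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_z_test(x1, n1, x2, n2):
    """Two-tailed test of H0: p1 = p2, using the pooled (blended) proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    p_value = 2 * NormalDist().cdf(-abs(z))  # both tails
    return z, p_value


def two_prop_z_interval(x1, n1, x2, n2, conf_level):
    """Confidence interval for p1 - p2, using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    std_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    diff = p1 - p2
    return diff - z_crit * std_error, diff + z_crit * std_error


# English: 105 fans out of 150; Scots: 160 fans out of 200
z, p_value = two_prop_z_test(105, 150, 160, 200)
low, high = two_prop_z_interval(105, 150, 160, 200, 0.95)
margin = (high - low) / 2
print(round(z, 2), round(p_value, 4), round(low, 4), round(high, 4), round(margin, 4))
```

The half-width of the interval is the margin of error from part (c).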

4

(a) This is before-and-after paired data, Case 3 in Inferential Statistics: Basic Cases. You’re testing the mean difference.

(1)

d = After−Before
H0: µd = 0, running makes no difference in HDL
H1: µd > 0, running increases HDL

Remark: If this was a research study, they would probably test for a difference in HDL, not just an increase. Maybe this study was done by a fitness center or a running-shoe company. They would want to find an increase, and HDL decreasing or staying the same would be equally uninteresting to them.

(2)

α = 0.05

(RC) Before in L1, After in L2, L3=L2−L1 Random sample. Five women is obviously less than 10% of all women. Box-whisker (L3) shows no outliers. Normality check (L3): r(.9131)>crit(.8804).

(3/4)

T-Test 0, L3, 1, µ>0. Results: t=3.059874484 → t = 3.06, p=.0188315555 → p = 0.0188, d̄=4.6, s=3.36, n=5

(5)

p < α. Reject H0 and accept H1.

(6)

At the 0.05 level of significance, running 4 miles daily for six months raises HDL level. Or, Running 4 miles daily for six months raises HDL level (p = 0.0188).

(b) TInterval with C-Level .9 gives (1.3951, 7.8049). Interpretation: You are 90% confident that running an average of four miles a day for six months will raise HDL by 1.4 to 7.8 points for the average woman.

Caution! Don’t write something like “I’m 90% confident that HDL will be 1.4 to 7.8”. The confidence interval is not about the HDL level, it’s about the change in HDL level.

Remark: Notice the correspondence between hypothesis test and confidence interval. The one-tailed HT at α = 0.05 is equivalent to a two-tailed HT at α = 0.10, and the complement of that is a CI at 1−α = 0.90 or a 90% confidence level. Since the HT did find a statistically significant effect, you know that the CI will not include 0. If the HT had failed to find a significant effect, then the CI would have included 0. See Confidence Interval and Hypothesis Test.

5

(a) Each participant either had a heart attack or didn’t, and the doctors were all independent in that respect. This is binomial data. You’re testing the difference in proportions between two populations, Case 5 in Inferential Statistics: Basic Cases.

(1)

Population 1: Aspirin takers; population 2: non-aspirin takers.
H0: p1 = p2, taking aspirin makes no difference
H1: p1 ≠ p2, taking aspirin makes a difference

(2)

α = 0.001

(RC)

SRS. 10n1 = 10×11,037 = 110,370. According to A Census of Actively Licensed Physicians in the United States, 2010 (Young (2011)), in that year there were 850,085 actively licensed physicians in the US. Even if we assume half were women and there were fewer doctors in 1982 when the study began, still 10n1 is lower. 10n2 = 10×11,034 = 110,340, also within the limit. Treatment group: 139 successes, 11037−139 = 10898 failures, both ≥ 10. Placebo group: 239 successes, 11034−239 = 10795 failures, both ≥ 10.

(3/4)

2-PropZTest: x1=139, n1=11037, x2=239, n2=11034, p1≠p2. Results: z=−5.19, p-value = 2×10⁻⁷, p̂1 = .0126, p̂2 = .0217, p̂ = .0171

(5)

p < α. Reject H0 and accept H1.

(6)

At the 0.001 level of significance, aspirin does make a difference to the likelihood of heart attack. In fact it reduces it. Or, Aspirin makes a difference to the likelihood of heart attack (p < 0.0001). In fact, aspirin reduces the risk.

Remark The study was conducted from 1982 to 1988 and was stopped early because the results were so dramatic. For a non-technical summary, see Physicians’ Health Study (2009). More details are in the original article from the New England Journal of Medicine (Steering Committee 1989). (b) 2-PropZInt with C-Level .95 gives (−.0125, −.0056). We’re 95% confident that 325 mg of aspirin every other day reduces the chance of heart attack by 0.56 to 1.25 percentage points. Caution! You’re estimating the change in heart-attack risk, not the risk of heart attack. Saying something like “with aspirin, the risk of heart attack is 0.56 to 1.25%” would be very wrong.

6

(a) You’re estimating the difference in means between two populations. This is Case 4 in Inferential Statistics: Basic Cases. Requirements: Random samples (given). Sample sizes both >30. 10×30 = 300 and 10×32 = 320 are less than the numbers of houses in the two counties.

Population 1 = Cortland County houses, population 2 = Broome County houses. 2-SampTInt, 134296, 44800, 30, 127139, 61200, 32, .95, No. Results: (−20004, 34318). June is 95% confident that the average house in Cortland County costs $20,004 less to $34,318 more than the average house in Broome County.

(b) A 95% confidence interval is the complement of a significance test for ≠ at α = 0.05. Since 0 is in the interval, you know the p-value would be >0.05 and therefore June can’t tell, at the 0.05 significance level, whether there is any difference in average house price in the two counties or not. If both ends of the interval were positive, that would indicate a difference in averages at the 0.05 level, and you could say Cortland’s average is higher than Broome’s. Similarly, if both ends were negative you could say Cortland’s average is lower than Broome’s. But as it is, nada.

Remark: Obviously Broome County is cheaper in the sample. But the difference is not great enough to be statistically significant. Maybe the true mean in Broome really is less than in Cortland; maybe they’re equal; maybe Broome is more expensive. You simply can’t tell from these samples.

7

The immediate answer is that those are proportions in the sample, not the proportions among all voters. This is two-population binomial data, Case 5 in Inferential Statistics: Basic Cases. Requirements check: Random samples, OK. Each sample 10n = 10×1000 = 10,000. There are far more than 10,000 voters nationally; OK. The two samples were independent, OK. Red: 520 successes and 1000−520 = 480 failures, OK. Blue: 480 successes and 1000−480 = 520 failures, OK.

Population 1 = Red voters, population 2 = Blue voters. 2-PropZInt 520, 1000, 480, 1000, .95. Results: (−.0038, .08379), p̂1 = .52, p̂2 = .48. With 95% confidence, the Red candidate is somewhere between 0.4 percentage points behind Blue and 8.4 ahead of Blue. The confidence interval contains 0, and so it’s impossible to say whether either one is leading.

Remark: Newspapers often report the sample proportions p̂1 and p̂2 as though they were population proportions, but now you know that they aren’t. A different poll might have similar results, or it might have samples going the other way and showing Blue ahead of Red.

8

(a) For a confidence interval, each sample must have at least 10 successes and at least 10 failures. Sample 1 has only 7 successes. Requirements are not met, and you cannot compute a confidence interval with 2-PropZInt.

(b) For a hypothesis test, we often use “at least 10 successes and 10 failures in each sample” as a shortcut requirements test, but the real requirement is at least 10 successes and 10 failures expected in each sample, using the blended proportion p̂. If the shortcut procedure fails, you must check the real requirement. In this problem, the blended proportion is p̂ = (x1+x2)/(n1+n2) = (7+18)/(28+32) = 25/60, about 42%. For sample 1, with n1 = 28, you would expect 28×25/60 ≈ 11.7 successes and 28−11.7 = 16.3 failures. For sample 2, with n2 = 32, you would expect 32×25/60 ≈ 13.3 successes and 32−13.3 = 18.7 failures. Because all four of these expected numbers are at least 10, it’s valid to compute a p-value using 2-PropZTest.
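The expected-count check in part (b) is easy to get wrong by hand, so here is a small Python sketch of it. The helper name expected_counts is made up for this example; the arithmetic is just the blended proportion applied to each sample size.

```python
def expected_counts(x1, n1, x2, n2):
    """Expected (successes, failures) in each sample under the blended proportion."""
    pooled = (x1 + x2) / (n1 + n2)
    return [(n * pooled, n * (1 - pooled)) for n in (n1, n2)]


counts = expected_counts(7, 28, 18, 32)
requirement_met = all(value >= 10 for pair in counts for value in pair)

for label, (successes, failures) in zip(("sample 1", "sample 2"), counts):
    print(label, round(successes, 1), "successes,", round(failures, 1), "failures")
print("OK for 2-PropZTest:", requirement_met)
```

Note that this check applies only to the hypothesis test. The confidence interval in part (a) still fails its requirement, because it looks at the actual counts and sample 1 has only 7 successes.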

Solutions for Chapter 12

← Exercises for Ch 12


1

There is no difference. What matters in a model is the relative sizes of the predictions for the categories. 40% is 1.6 times 25%, just as 40 is 1.6 times 25.
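A quick way to convince yourself is to compute expected counts both ways. The sketch below borrows the 25:25:20:15:8:7 ice cream model and the sample of 1000 from solution 2 below; the function name expected_from_model is invented for the illustration.

```python
from math import isclose


def expected_from_model(weights, sample_size):
    """Scale model weights (ratios or decimals) to expected counts."""
    total = sum(weights)
    return [sample_size * w / total for w in weights]


ratios = [25, 25, 20, 15, 8, 7]
decimals = [0.25, 0.25, 0.20, 0.15, 0.08, 0.07]

by_ratio = expected_from_model(ratios, 1000)
by_decimal = expected_from_model(decimals, 1000)

print(by_ratio)
print(all(isclose(a, b) for a, b in zip(by_ratio, by_decimal)))
print("smallest expected count:", min(by_ratio))
```

Dividing by the total of the weights is what makes the two forms interchangeable: only the relative sizes survive.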

2

This is attribute data, one population, more than two possible responses: Case 6, goodness-of-fit, in Inferential Statistics: Basic Cases. There are 6 categories, therefore 5 degrees of freedom.

(1)

H0: The 25:25:20:15:8:7 model for ice cream preference is good.
H1: The 25:25:20:15:8:7 model for ice cream preference is bad.

(2)

α = 0.05

(3–4) Use MATH200A part 6. df=5, χ²=9.68, p-value = 0.0849

(If you have MATH200A V6, you’ll see the p-value, degrees of freedom, and χ² test statistic on the same screen as the graph.)

Common mistake: When a model is given in percentages, some students like to convert the observed numbers to percentages. Never do this! The observed numbers are always actual counts and their total is always the actual sample size.

Remark: You could give the model as decimals, .25, .20, .15 and so on. But for the model, all that matters is the relative size of each category to the others, so it’s simpler to use whole-number ratios.

Common mistake: If you do convert the percentages to decimals, remember that 8% and 7% are 0.08 and 0.07, not 0.8 and 0.7.

(RC)

L3 shows the expected counts, and the lowest is 70, so all are ≥5. The problem says that the 1000 people were a random sample. There are millions of ice cream lovers, so the sample of 1000 is less than 10% of population.

(5)

p > α. Fail to reject H0.

(6)

At the 0.05 level of significance, you can’t say whether the model is good or bad. Or, It’s impossible to determine from this sample whether the model is good or bad (p = 0.0849).

Remark: For Case 6 only, you could write your non-conclusion as something like “the model is not inconsistent with the data” or “the data don’t disprove the model.”

Remark: The χ² test keeps you from jumping to false conclusions. Eyeballing the observed and expected numbers (L2 and L3), you might think they’re fairly far off and the model must be wrong. Yet the test gives a largish p-value.

Remark: If it had gone the other way (if p was less than α), you would say something like “At the .05 level of significance, the model is inconsistent with the data” or “the data disprove the model” or simply “the model is wrong”.

3

Solution: Use Case 7, 2-way table, in Inferential Statistics: Basic Cases.

(1)

H0: Gun opinion is independent of party
H1: Gun opinion depends on party

(2)

α = .05

(3–4) Put the two rows and three columns in matrix A. (Don’t enter the totals.) Select χ²-Test from the menu. Outputs are χ² = 26.13, df = 2, p=2.118098E-6 → p = 0.000 002 or […] p > α. Fail to reject H0.

(6)

At the 0.01 significance level, we can’t determine whether Echinacea is effective against the common cold or not. Or, We can’t determine whether Echinacea is effective against the common cold or not (p = 0.5769).

Remark: Researchers might write something like “Echinacea made no significant difference to infection rates in our study” with the p-value or significance level. It’s understood that this does not prove Echinacea ineffective; this particular study fails to reach a conclusion. But as additional studies continue to find p > α, our confidence in the null hypothesis increases.

Remark: If you used MATH200A part 7, there’s some interesting information in matrix C. The top left 7 rows and 2 columns are the χ² contributions for each of the seven treatments and two outcomes. All are quite low, in light of the rule of thumb that only numbers above 4 or so are significant, even at the less stringent 0.05 level. The last two rows are the total numbers and percentages of people who did and didn’t catch cold: 349 (87.5%) and 50 (12.5%). If Echinacea is ineffective, you’d expect to see about that same infection rate for each of the seven treatments. Sure enough, compute the rates from the rows of the data table, and you’ll find that they vary between 81% and 92%. The third column is the total subjects in each of the seven treatments, and the overall total. Of course you were given those in the data table, but it’s always a good idea to use this information to check your data entry. The fourth column is the percentage of subjects who were assigned to each of the seven treatments, totaling 100% of course.

Solutions to Review Problems



Problem Set 1: Short Answers Write your answer to each question. There’s no work to be shown. Don’t bother with a complete sentence if you can answer with a word, number, or phrase.

1

Disjoint events cannot be independent. Why? Disjoint events, by definition, can’t happen on the same trial. That means if A happens, P(B) = 0. But if A and B are independent, whether A happens has no effect on the probability of B. With disjoint events, whether A happens does affect the probability of B. Therefore disjoint events can’t be independent.

2

(a) C

(b) For numeric data with sample size under 30, you check for outliers by making a box-whisker plot and check for normality by making a normal probability plot.

3

qualitative = attribute, non-numeric, categorical. Examples: political party affiliation, gender. quantitative = numeric. Examples: height, number of children.

Common mistake: Binomial is a subtype of qualitative data so it’s not really a synonym. Discrete and continuous are subtypes of numeric data.

4

equal to 1/6. The die has no memory: each trial is independent of all the others. The Gambler’s Fallacy is believing that the die is somehow “due for a 6”. The Law of Large Numbers says that in the long run the proportion of 6’s will tend toward 1/6, but it doesn’t tell us anything at all about any particular roll.

5

(a) Population 1 = control, population 2 = music. H0: p2 = p1 and H1: p2 < p1. Or: H0: p2 − p1 = 0 and H1: p2 − p1 < 0.

(b) Case 5, Difference between Two Pop. Proportions; or 2-PropZTest.

Common mistake: You must specify which is population 1 and which is population 2.

Common mistake: The data type is binomial: a student is in trouble, or not. There are no means, so µ is incorrect in the hypotheses.

6

Check this against the definition: Are there a fixed number of trials? Yes, you are rolling five dice, n = 5. Are there only two outcomes, success and failure? Yes, each die is either a 3 or not. Is the probability of success the same from trial to trial; are the trials independent? Yes, p = 1/6 for each die, and the dice are independent. This is a binomial PD.
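Once you have established that the model is binomial with n = 5 and p = 1/6, the whole probability distribution follows from the binomial formula. The sketch below is one way to tabulate it in Python; binomial_pmf is a name chosen for this example, and exact fractions are used so that nothing is lost to rounding.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success chance p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 5, Fraction(1, 6)   # five dice; "success" means a die shows 3
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

for k, prob in distribution.items():
    print(k, prob, round(float(prob), 4))
```

The probabilities for 0 through 5 threes add up to exactly 1, as they must for any probability distribution.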

7

A

8

B,D — B if p

Remark: The significance level is the level of risk of a Type I error that you can live with. If you can live with more risk, you can reach more conclusions.

9

“Disjoint” means the same as “mutually exclusive”: two events that can’t happen at the same time. Example: rolling a die and getting a 3 or a 6. Complementary events can’t happen at the same time and one or the other must happen. Example: rolling a die and getting an odd or an even. Complementary events are a subtype of disjoint events.

10

For any set of continuous data, or discrete data with many different values. If the variable is discrete with only a few different answers, you could use a bar graph or an ungrouped histogram. For a small- to moderate-sized set of numeric data, you might prefer a stemplot.

11

For mutually exclusive (disjoint) events. Example: if you draw one card from a standard deck, the probability that it is red is ½. The probability that it is a club is ¼. The events are disjoint; therefore the probability that it is red or a club is ½+¼ = ¾.
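If you want to check the card example by brute force, you can list all 52 cards and count. This Python sketch does that; the helper names are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

ranks = "A 2 3 4 5 6 7 8 9 10 J Q K".split()
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 (rank, suit) pairs


def is_red(card):
    return card[1] in ("hearts", "diamonds")


def is_club(card):
    return card[1] == "clubs"


def probability(event):
    """Fraction of the deck for which the event is true."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


p_red = probability(is_red)
p_club = probability(is_club)
p_both = probability(lambda card: is_red(card) and is_club(card))
p_either = probability(lambda card: is_red(card) or is_club(card))

print(p_red, p_club, p_both, p_either)
```

Because no card is both red and a club, the probability of both is 0, and that is exactly why simple addition gives the right answer for "or".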

12

A, B

Remark: C is wrong because “model good” is H0. D is also wrong: every hypothesis test, without exception, compares a p-value to α. For E, df is number of cells minus 1. F is backward: in every hypothesis test you reject H0 when your sample is very unlikely to have occurred by random chance.

13

Continuous data are measurements and answer “how much” questions. Examples: height, salary Discrete data usually count things and answer “how many” questions. Example: number of credit hours carried

14

C, D

Remark: As stated, what you can prove depends partly on your H 1 . There are three things it could be: If H 1 : p > 0.01 (machine producing too many defectives) and you calculate p-value 30”. That’s true, but not relevant here. Sample size 30 is important for numeric data, not binomial data.

Problem Set 2: Calculations

36

Numeric data, two populations, independent samples with σ unknown: Case 4 (2-SampTTest).

Common mistake: You cannot do a 2-SampZTest because you do not know the standard deviations of the two populations. (1)

Population 1 = Judge Judy’s decisions; Population 2 = Judge Wapner’s decisions
H0: µ1 = µ2, no difference in awards
H1: µ1 > µ2, Judge Judy gives higher awards

(2)

α = 0.05

(RC)

Random sample Sample sizes are both above 30, so there’s no worry about whether the population data are normal.

(3–4) 2-SampTTest: x̄1=650, s1=250, n1=32, x̄2=580, s2=260, n2=32, µ1 > µ2, Pooled: No. Results: t=1.10, p-value = .1383

(5)

p > α. Fail to reject H0.

(6)

At the 0.05 level of significance, we can’t tell whether Judge Judy was more friendly to plaintiffs (average award higher than Judge Wapner’s) or not. BTW: Some instructors do a preliminary F-test. It gives p=0.9089>0.05, so after that test you would use Pooled:Yes in the 2-SampTTest and get p=0.1553.
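For the curious, here is roughly what 2-SampTTest with Pooled:No computes from the summary statistics. This is a sketch under stated assumptions, not the calculator's actual code: it assumes the standard unpooled (Welch) formulas for t and df, and because Python's standard library has no t distribution, it approximates the tail area by integrating the t density numerically with Simpson's rule. All function names are invented for the example.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_density = (lgamma((df + 1) / 2) - lgamma(df / 2)
                   - 0.5 * log(df * pi)
                   - (df + 1) / 2 * log(1 + x * x / df))
    return exp(log_density)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) as 0.5 minus the area from 0 to t (Simpson's rule)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled two-sample t statistic and its approximate degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


t, df = welch_t(650, 250, 32, 580, 260, 32)
p_value = t_upper_tail(t, df)   # one-tailed, because H1 is mu1 > mu2
print(round(t, 2), round(df, 1), round(p_value, 4))
```

The one-tailed area is used because the alternative hypothesis is µ1 > µ2; for a two-tailed test you would double it.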

37

normalcdf(20.5, 10^99, 14.8, 2.1) = .00332. Then multiply by population size 10,000 to obtain 33.2, or about 33 turkeys.
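The same area can be found without a calculator. In this Python sketch, NormalDist from the standard library plays the role of normalcdf; the variable names are chosen for the example.

```python
from statistics import NormalDist

model = NormalDist(mu=14.8, sigma=2.1)

# Area to the right of 20.5, like normalcdf(20.5, 10^99, 14.8, 2.1)
p_above = 1 - model.cdf(20.5)
expected_turkeys = 10_000 * p_above

print(round(p_above, 5), round(expected_turkeys, 1))
```

There is no need for an artificial upper bound like 10^99 here, because the total area under the curve is 1 and you can subtract the area to the left.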

38

Solution: This is one-population numeric data, and you don’t know the standard deviation of the population: Case 1. Put the data in L1, and 1-VarStats L1 tells that x̄ = 4.56, s = 1.34, n = 8.

(1)

H 0 : µ = 4, 4% or less improvement in drying time H 1 : µ > 4, better than 4% decrease in drying time Remark: Why is a decrease in drying time tested with > and not
