Estimating with Confidence, Part II
Review • We use y-bar to estimate a population mean, µ. • When sampling from a population with true mean µ, the mean of the sampling distribution of y-bar is µ. • On average, the mean of a larger sample will be closer to the true mean than the mean of a smaller sample.
• When sampling from a population with true standard deviation σ, the standard deviation of the distribution of y-bar is σ/√n.
• Lastly, one nearly magical property of the sample mean, y-bar, is that it is (at least approximately) normally distributed no matter what the original distribution of y, as long as one of two conditions is met: – the original distribution is normal, or – the sample size is large.
Sample Size • The rule of thumb is that, in most practical situations, n = 30 is satisfactory. • As a practical matter, though, if the original distribution is severely non-normal then it may take many more than 30 observations to assure us that the sample mean will be approximately normally distributed.
Central Limit Theorem • More formally, what we've been discussing are the implications of the Central Limit Theorem. • The CLT is the only theorem we'll cover in BST 621 (because it's that important).
CLT • Draw a simple random sample of size n from any population whatsoever with mean µ and finite standard deviation σ. • When n is large, the sampling distribution of the sample mean y-bar is approximately normally distributed with mean µ and standard deviation σ/√n (Daniel, p.134)
• It's not surprising that when sampling from a normal population the means will be normally distributed. • It's far more useful to know that no matter what the underlying distribution is, your means will be approximately normally distributed, as long as you have sufficient n. • How large an n is required? It depends on the underlying distribution, but the rule of thumb is 30.
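A small simulation can make the theorem concrete. The sketch below is illustrative only: it assumes a right-skewed exponential population with mean 1 and standard deviation 1 (not one of the course populations), draws many samples of size n = 30, and compares the mean and standard deviation of the sample means with µ and σ/√n.

```python
import random
import statistics

def simulate_sample_means(n, reps, seed=621):
    """Return `reps` sample means, each from n draws of an exponential(1) population."""
    rng = random.Random(seed)
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

n = 30
means = simulate_sample_means(n, reps=5000)
mu, sigma = 1.0, 1.0  # exponential(1) has mean 1 and standard deviation 1

print(f"mean of sample means: {statistics.fmean(means):.3f} (mu = {mu})")
print(f"SD of sample means:   {statistics.stdev(means):.3f} "
      f"(sigma/sqrt(n) = {sigma / n ** 0.5:.3f})")
```

Both printed pairs should agree closely, even though the individual observations are far from normal.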
• However, this theorem cannot save us from an ill-conceived sampling methodology. • That is, if we draw a simple random sample then we can trust that the CLT will hold. • Say we didn't do a simple random sample; are we in trouble? We're not in great danger if the data can plausibly be thought of as observations taken at random from a population. • If the data are representative, we're probably OK.
• However, there is no way to rescue a study using data collected haphazardly. • The data will have unknown biases and no fancy formula can rescue badly produced data. • Garbage in, garbage out
• Let’s assume the data are representative. • So far, our estimation methods have resulted in point estimates. • Confidence intervals are even more useful.
Confidence Intervals • Confidence intervals use point estimates and an estimate of dispersion to form interval estimates. • Recall that estimating a parameter with an interval involves three components:
CIs 1. The point estimate of µ. This is the sample mean y-bar. 2. The standard error. When the population standard deviation is known to be σ, the standard error of y-bar is σ/√n. 3. The reliability coefficient. We use the z value for 100(1 - α)% confidence.
Reliability Coefficients • For 90% confidence, use z = 1.645. • For 95% confidence, use z = 1.96. • For 99% confidence, use z = 2.575.
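These coefficients are the (1 - α/2) quantiles of the standard normal distribution. As a minimal sketch, they can be recomputed with Python's standard library; the function name `z_coefficient` is ours, not from the course. The 99% value computes to about 2.576, which tables commonly round to 2.575 or 2.58.

```python
from statistics import NormalDist

def z_coefficient(confidence):
    """Two-sided z reliability coefficient for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_coefficient(level):.3f}")
```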
General Form Estimate ± (reliability coefficient) × (standard error) This will yield two values, a lower limit and an upper limit, around the point estimate. The confidence interval will, with specified reliability, contain the true (unknown) population mean.
Known Variance • So, if we know the population standard deviation, σ, then a 95% confidence interval for the population mean is: y-bar ± 1.96 × σ/√n
Examples • In our example population the known σ is 45.9194. • Using a sample of size n = 9, the first simulated experiment yielded a y-bar of 217.6, so the 95% confidence interval is 217.6 ± 1.96 × 45.9194/√9 = [187.6, 247.6]. • Notice that this interval covers the true mean of 205.7.
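The arithmetic for this example can be checked with a short sketch. The helper name `z_interval` is hypothetical; the inputs are the values quoted on the slide.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(ybar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population SD sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return ybar - margin, ybar + margin

lower, upper = z_interval(ybar=217.6, sigma=45.9194, n=9)
print(f"[{lower:.1f}, {upper:.1f}]")  # [187.6, 247.6]
```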
Problem • There is a major problem with this method for calculating confidence intervals. • It requires knowledge of the population standard deviation, σ .
Unknown Variance • In practice, we never know σ. • The obvious solution is to use the estimated standard deviation, s, we determined from our sample. • But simply substituting s for σ does not work. The problem is that the reliability coefficient (1.96) is wrong. • It's wrong because now there are two random terms entering into the confidence interval, y-bar and s. • Both of these are subject to random fluctuation.
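One way to see the problem is to simulate it. The sketch below assumes a normal population with the mean (205.7) and standard deviation (45.9194) of the example population; the normality and the function name are our assumptions. It builds intervals as y-bar ± 1.96 × s/√n with n = 9 and counts how often they cover the true mean.

```python
import random
import statistics
from math import sqrt

def coverage_with_z_and_s(n, reps, mu=205.7, sigma=45.9194, seed=621):
    """Fraction of intervals y-bar +/- 1.96*s/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = statistics.fmean(sample)
        margin = 1.96 * statistics.stdev(sample) / sqrt(n)
        hits += (ybar - margin <= mu <= ybar + margin)
    return hits / reps

coverage = coverage_with_z_and_s(n=9, reps=10000)
print(f"n = 9: nominal 95%, simulated coverage {coverage:.1%}")
```

The simulated coverage should fall noticeably short of 95%, which is the sense in which 1.96 is the wrong coefficient when s replaces σ.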
Solution • Gosset, a statistician who worked at the Guinness brewery, figured out the solution to this problem: the t-distribution. • But to keep from getting fired, he had to publish the work under a pseudonym, "Student." • Thus, you may have seen a reference to "Student's t." • The t-distribution is very close to the standard normal (z) distribution but has heavier tails, reflecting the extra variability ignored by z.
• The degrees of freedom for the t-distribution when estimating a single mean is df = n - 1. • It's no accident that this is the denominator used to calculate s, the estimated standard deviation.
New CI • So, the correct formula for the 100(1 - α)% confidence interval on a population mean when estimating both the mean and standard deviation is: y-bar ± t × s/√n, where t is the (1 - α/2) percentile of the t-distribution with df = n - 1.
• In Appendix Table E, Daniel gives the appropriate t-values for various df. • If you use this table, you want to use the value labeled t.975 for a 95% CI. • That is, for a 95% CI, α = 0.05; so, (1 - α/2) = 0.975.
• Notice as n gets larger the t value gets closer to the z value.
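The convergence of t toward z can also be checked numerically. The sketch below is not how Daniel's table was produced: it is a standard-library approximation that integrates the t density with Simpson's rule and finds the .975 quantile by bisection. The function names are ours.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0: 0.5 plus Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Quantile for p > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)
for df in (2, 8, 19, 30, 120):
    print(f"df = {df:3d}: t.975 = {t_quantile(0.975, df):.3f}  (z = {z:.3f})")
```

For df = 8 this gives about 2.306 and for df = 19 about 2.093, matching standard t tables, and the values approach 1.960 as df grows.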
Using JMP • JMP automatically calculates the 95% confidence interval on the mean and shows it in the Distribution of Y report window. • For instance, consider the Moments report from the first n = 9 cholesterol sample. • The 95% confidence interval from this sample is [173.4, 261.7].
Sample Size and Confidence • A 95% confidence interval means that we're 95% sure that the interval covers the true (but unknown) mean, in the sense that 95% of intervals built this way from repeated samples would cover it. • On the other hand, it also means that 5% of the intervals we calculate will not cover the true mean. • This is true whether we use n = 2 or n = 2,000,000.
• What changes with sample size is the width of the interval. • With larger sample sizes the width of the interval is narrower; we're still going to be wrong 5% of the time, but by smaller amounts. • Let's look at confidence intervals when the population is not normal.
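A simulation sketch of this point follows. It assumes a normal population with the mean and standard deviation of the example population, and uses t.975 values as printed in standard t tables for df = 4, 19, and 60. The sample sizes and the function name are chosen only for illustration.

```python
import random
import statistics
from math import sqrt

# t(.975) values for df = n - 1, as printed in standard t tables
T_975 = {5: 2.776, 20: 2.093, 61: 2.000}

def simulate(n, reps=4000, mu=205.7, sigma=45.9194, seed=621):
    """Return (coverage, average width) of 95% t intervals from a normal population."""
    rng = random.Random(seed)
    hits, total_width = 0, 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = statistics.fmean(sample)
        margin = T_975[n] * statistics.stdev(sample) / sqrt(n)
        hits += (ybar - margin <= mu <= ybar + margin)
        total_width += 2 * margin
    return hits / reps, total_width / reps

results = {n: simulate(n) for n in T_975}
for n, (coverage, width) in results.items():
    print(f"n = {n:2d}: coverage {coverage:.1%}, average width {width:.1f}")
```

Coverage should stay near 95% at every sample size, while the average width shrinks as n grows.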
CI's for Triglyceride • Now let's look at 95% confidence intervals using the triglyceride population, the non-normal population. • Just as before, we simulate 100 studies, each with a different sample size.
• Notice how much more variable the widths are. • The first sample's y-bar estimate was 164.6 and estimated standard deviation s = 101.2. • The second sample's y-bar estimate was 323.2 and estimated standard deviation s = 383.3. • With larger estimates, you're seeing the effect that an outlier can have.
• With larger n, the intervals are narrower. • The effect of outliers is diminished. • Here, we have sufficient sample size to trust the Central Limit Theorem.
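The same behavior can be reproduced with a stand-in skewed population. The sketch below does not use the course's triglyceride data: it assumes a hypothetical lognormal population and compares 95% t intervals at n = 9 and n = 100, reporting both coverage and how variable the interval widths are relative to their average.

```python
import random
import statistics
from math import exp, sqrt

# Hypothetical right-skewed population: lognormal with log-mean 4.7 and log-SD 1.0
LOG_MEAN, LOG_SD = 4.7, 1.0
TRUE_MEAN = exp(LOG_MEAN + LOG_SD ** 2 / 2)  # mean of a lognormal distribution

# t(.975) values for df = n - 1, from standard t tables
T_975 = {9: 2.306, 100: 1.984}

def simulate_skewed(n, reps=3000, seed=621):
    """Return (coverage, SD of widths / mean width) for 95% t intervals."""
    rng = random.Random(seed)
    hits, widths = 0, []
    for _ in range(reps):
        sample = [rng.lognormvariate(LOG_MEAN, LOG_SD) for _ in range(n)]
        ybar = statistics.fmean(sample)
        margin = T_975[n] * statistics.stdev(sample) / sqrt(n)
        hits += (ybar - margin <= TRUE_MEAN <= ybar + margin)
        widths.append(2 * margin)
    return hits / reps, statistics.stdev(widths) / statistics.fmean(widths)

skewed = {n: simulate_skewed(n) for n in T_975}
for n, (coverage, rel_spread) in skewed.items():
    print(f"n = {n:3d}: coverage {coverage:.1%}, relative spread of widths {rel_spread:.2f}")
```

With the small sample, coverage should fall short of the nominal 95% and the widths should vary widely; with n = 100 both should improve, though how quickly depends on how skewed the population is.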
Summary • Sample estimates have distributions that are affected by the underlying distribution and sample size. • Estimates may be totally worthless if obtained from a haphazard "sample" with unknowable bias. • But, if the data are representative of the population then we can rely on the sample mean to estimate the center of the distribution. • The sample mean is unbiased.
Summary (cont) • Further, if the population is known to be normal, then a sample mean will also be normal. • If the population distribution has an unknown shape then, with a sufficient sample size, we can rely on the CLT and trust that a sample mean will also be approximately normally distributed.
Assessing Normality • Use the Normal Quantile Plot in JMP to assess whether a distribution appears normal.
SD versus SE • The standard error of the sample mean will be smaller with larger samples. • The standard error describes the variability of the sample mean, not the variability of the sample data. • If the variability of the data is σ, then the standard error of the mean is σ/√n.
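As a small illustration of the distinction, the sketch below computes both quantities from one sample, using s in place of the unknown σ. The data values and the function name are hypothetical.

```python
import statistics
from math import sqrt

def sd_and_se(values):
    """Return (sample SD, estimated standard error of the mean)."""
    sd = statistics.stdev(values)   # variability of the data
    se = sd / sqrt(len(values))     # variability of the sample mean
    return sd, se

# Hypothetical cholesterol-like values, for illustration only
sample = [182, 240, 195, 221, 168, 259, 203, 214, 176]
sd, se = sd_and_se(sample)
print(f"n = {len(sample)}, SD = {sd:.1f}, SE = {se:.1f}")
```

With n = 9 the standard error is one third of the standard deviation; quadrupling the sample size would halve it again.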
CIs • The confidence interval on the population mean, obtained from a sample of n observations, is: y-bar ± t × s/√n
• Here, y-bar is the sample mean, s is the sample standard deviation, and the t-reliability coefficient is the (1 - α/2) percentile of the t-distribution with df = n - 1. • When describing a confidence interval in a sentence or table, be sure to indicate the level of confidence and the sample size.
• Always be aware that the shape of the underlying distribution and the size of your sample will directly affect the believability of your point and interval estimates.
Example write ups • In the case where you judge that the distribution is markedly non-normal (skewed), say we begin with the following raw data:
• Since the sample was small and the distribution was skewed, the sample is described by the median and range: • A random sample of n = 20 subjects was assessed for serum triglycerides. The median triglyceride was 115 and the values ranged between 31 and 755. Half of the values were between 91.25 and 195.0.
Another example
• One example write-up: A random sample of n = 20 subjects was assessed for serum cholesterol. The average cholesterol was 201.8, SD = 53.25. We are 95% confident that the range 176.8-226.7 includes the population mean.
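The interval in this write-up can be checked from the reported summary statistics. The sketch below uses a hypothetical helper, `t_interval`, and takes t.975 with df = 19 as 2.093 from a standard t table.

```python
from math import sqrt

def t_interval(ybar, s, n, t_value):
    """CI for the mean: ybar +/- t_value * s / sqrt(n), with t_value from a t table."""
    margin = t_value * s / sqrt(n)
    return ybar - margin, ybar + margin

# t(.975) with df = 20 - 1 = 19 is 2.093 in standard t tables
lower, upper = t_interval(ybar=201.8, s=53.25, n=20, t_value=2.093)
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")
```

The result agrees with the reported 176.8-226.7 to within rounding of the published mean and SD.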