workshop - Neeley School of Business [PDF]



START WORKSHOP - STATISTICS

INSC 60010 STATISTICAL MODELS FOR MANAGERIAL DECISIONS START WORKSHOP - FALL 2013

Ranga Ramasesh

DISCUSSION TOPICS

1. Workshop Objectives
   1. Data Analysis – Motivations and Goals
   2. Preparatory Tools
2. Graphs for Exploratory Data Analysis
   1. Histograms / Frequency Distributions
   2. Scatter Plots
   3. Time Series Plots
3. Models of Uncertainty
   1. Sampling and Sampling Distributions
   2. Z-Distribution
   3. T-Distribution
4. Inference Based on Sample Data
   1. Confidence Interval Estimation
   2. Hypothesis Testing
5. Module Overview
   1. Introduction to Regression Analysis
   2. Syllabus and Administrative Details

NOTE

USE OF GRAPHS FOR EXPLORATORY DATA ANALYSIS

Example: Graduation Rates

Data on 142 colleges are available in a spreadsheet with filename prefix college06. This information was obtained from the 2006 issue of U.S. News and World Report and includes:

1. name of the college
2. graduation rate (GRATE)
3. freshman retention rate (FRESH)
4. percent of classes with fewer than 20 students (CLASS20)
5. percent of classes with more than 50 students (CLASS50)
6. percent of full time faculty (FTFAC)
7. 75th percentile of SAT scores (SAT75)
8. percent of incoming students in top 10% of high school class (TOP10)
9. acceptance rate (ARATE)
10. alumni giving rate (ALUM)
11. indicator for private school (1 = private; 0 = public) (PRIV)

A small subset of the data is shown in the following table.

Managerial Concerns

1. How do we make sense of the graduation rates across the different colleges?
2. How do we understand the relationship, if any, between graduation rate and the SAT scores?

Sample Data Set

School Name                            GRATE  FRESH  CLAS20  CLAS50  FTFAC  SAT75  TOP10  ARATE  ALUM  PRIV
Harvard University                     0.98   0.97   0.70    0.13    0.92   1580   0.96   0.11   0.47  1
Princeton University                   0.97   0.98   0.74    0.11    0.91   1560   0.94   0.13   0.61  1
Yale University                        0.96   0.98   0.74    0.08    0.89   1560   0.95   0.10   0.46  1
University of Pennsylvania             0.94   0.98   0.75    0.07    0.88   1500   0.94   0.21   0.40  1
Duke University                        0.94   0.97   0.72    0.05    0.97   1530   0.87   0.24   0.45  1
Stanford University                    0.93   0.98   0.69    0.12    0.99   1550   0.87   0.13   0.38  1
California Institute of Technology     0.88   0.96   0.63    0.09    0.98   1570   0.93   0.21   0.32  1
Massachusetts Institute of Technology  0.92   0.98   0.61    0.16    0.91   1560   0.97   0.16   0.37  1
Columbia University                    0.93   0.98   0.69    0.10    0.91   1540   0.86   0.13   0.34  1
Dartmouth College                      0.95   0.97   0.61    0.10    0.93   1550   0.88   0.19   0.49  1
Washington University in St. Louis     0.92   0.97   0.74    0.08    0.92   1520   0.93   0.22   0.39  1
Northwestern University                0.92   0.97   0.73    0.08    0.93   1500   0.82   0.30   0.29  1
Cornell University                     0.92   0.96   0.44    0.22    0.99   1490   0.85   0.29   0.35  1
Johns Hopkins University               0.91   0.95   0.55    0.17    1.00   1490   0.80   0.30   0.33  1
Brown University                       0.96   0.97   0.65    0.12    0.94   1520   0.90   0.17   0.38  1
University of Chicago                  0.87   0.95   0.55    0.06    0.95   1530   0.82   0.40   0.29  1
Rice University                        0.91   0.96   0.60    0.10    0.93   1540   0.86   0.22   0.36  1
University of Notre Dame               0.96   0.98   0.56    0.10    0.85   1470   0.85   0.30   0.49  1
Vanderbilt University                  0.86   0.94   0.67    0.06    0.97   1440   0.77   0.38   0.28  1
Emory University                       0.86   0.94   0.67    0.07    0.95   1460   0.90   0.39   0.19  1
University of California - Berkeley    0.87   0.96   0.58    0.15    0.91   1450   0.99   0.25   0.15  0

Frequency Distributions and Histograms

Definition: A frequency distribution is a table that summarizes the numerical values of a variable by recording the number of times (frequency) values fall within certain ranges called classes or bins.

Definition: A histogram is a graph of a frequency distribution.

Using Excel

A frequency distribution can be constructed in Excel by choosing the “Data” tab, then “Data Analysis” from the “Analysis” category, and then “Histogram” (or, in earlier versions of Excel, by choosing “Histogram” from the “Data Analysis” option on the “Tools” menu).

The variable examined in this example will be the graduation rates for the colleges in our sample. Bin limits are the upper inclusive values for each bin, so the numbers shown in the Bin column of the Excel frequency distribution represent the upper limits of the bins. The bin limits chosen by Excel may be somewhat awkward values to work with, so Excel also allows you to specify the bin limits.

Note: If you don’t like the bin limits that Excel chooses (and I don’t), you can choose your own. The frequency distribution of graduation rate shown below uses bin limits that I chose. The first (lower) bin limit is 0.20, with increments of 0.05 up to a maximum of 1.00. Excel will always add a last row to the frequency distribution and label it “More”. You can delete it, as I did here.
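For readers who want to see the counting rule outside Excel, here is a minimal Python sketch of a frequency distribution with upper-inclusive bin limits and a final “More” row. The function name and the four bin limits are illustrative assumptions; the data are only the 21 GRATE values from the sample table, not the full 142-college file.

```python
from bisect import bisect_left

def frequency_distribution(values, upper_limits):
    """Count values per bin, where each bin limit is an inclusive upper bound.

    Returns one count per limit plus a final "More" count for values
    above the last limit.
    """
    limits = sorted(upper_limits)
    counts = [0] * (len(limits) + 1)
    for v in values:
        # bisect_left finds the first limit that is >= v, so a value equal
        # to a limit is counted in that limit's bin (upper-inclusive).
        counts[bisect_left(limits, v)] += 1
    return counts

# GRATE values for the 21 schools listed in the sample table.
grate = [0.98, 0.97, 0.96, 0.94, 0.94, 0.93, 0.88, 0.92, 0.93, 0.95, 0.92,
         0.92, 0.92, 0.91, 0.96, 0.87, 0.91, 0.96, 0.86, 0.86, 0.87]
limits = [0.85, 0.90, 0.95, 1.00]  # illustrative bin limits, chosen by hand

counts = frequency_distribution(grate, limits)
labels = [f"{u:.0%}" for u in limits] + ["More"]
for label, count in zip(labels, counts):
    print(label, count)
```

As in Excel, the “More” row is empty whenever the last bin limit covers the largest value, which is why it can safely be dropped.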


Frequency Distribution

Bin (upper limit)   Frequency
25%                 1
30%                 1
35%                 0
40%                 5
45%                 4
50%                 9
55%                 14
60%                 15
65%                 6
70%                 14
75%                 21
80%                 13
85%                 11
90%                 9
95%                 14
100%                5

Histogram

This is the histogram (formatted) for the above frequency distribution.

[Figure: Histogram. x-axis: bins, 25% to 100% in steps of 5%; y-axis: Frequency, 0 to 25.]

Scatterplots

Definition: A scatterplot is a plot showing the relationship between two variables X and Y.

Suppose we want to examine the relationship between graduation rates and SAT score. The following plot is the scatterplot of GRATE versus SAT75.

[Figure: Scatterplot of Graduation Rate versus SAT. x-axis: SAT 75th Percentile, 800 to 1700; y-axis: Graduation Rate, 0.00 to 1.00. Fitted trend line: y = 0.0011x - 0.7635, R² = 0.7596.]
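The trend line and R² reported on the scatterplot come from a least-squares fit. The sketch below shows one way such a fit can be computed in plain Python; the function name is hypothetical, and the six (SAT75, GRATE) pairs are just the first six rows of the sample table, so the fitted values will not match the line reported for all 142 colleges.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_resid / ss_total
    return slope, intercept, r_squared

# First six rows of the sample table (top-ranked schools only).
sat75 = [1580, 1560, 1560, 1500, 1530, 1550]
grate = [0.98, 0.97, 0.96, 0.94, 0.94, 0.93]
slope, intercept, r2 = fit_line(sat75, grate)
print(f"subset fit: y = {slope:.5f}x + {intercept:.4f}, R^2 = {r2:.4f}")

# Using the trend line reported on the plot to read off a prediction:
# a school with SAT75 = 1200 (a hypothetical input).
predicted = 0.0011 * 1200 - 0.7635
print(f"predicted graduation rate at SAT75 = 1200: {predicted:.4f}")
```

Because the subset contains only highly selective schools, it covers a narrow range of SAT scores; a fit on the full data set is needed to reproduce the reported line.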

Time Series Plots

Time-series plot (Line Plot) of furniture sales in millions of dollars, January 1992 through December 2007

Data file: Furnsales

[Figure: Time series plot of FURNSALES. x-axis: month index, 1 to 187 in steps of 6; y-axis: 0 to 14000.]

Art and Science of Graphical Presentations

Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency. Graphical displays should

A. Show the data
B. Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else
C. Avoid distorting what the data have to say
D. Present many numbers in a small space
E. Make large data sets coherent
F. Encourage the eye to compare different pieces of data
G. Reveal the data at several levels of detail, from a broad overview to the fine structure
H. Serve a reasonably clear purpose: description, exploration, tabulation, or decoration
I. Be closely integrated with the statistical and verbal descriptions of a data set.

Graphics reveal data. Indeed graphics can be more precise and revealing than conventional statistical computations.

From: The Visual Display of Quantitative Information by Edward R. Tufte, Cheshire, Connecticut: Graphics Press 1983.

NOTE

STATISTICAL INFERENCE

Introduction

A common challenge faced by business managers is making judgments about the characteristics of large populations. For example, managers in a large electronics retail organization with several thousand stores all across the country may want to know the average time per day spent by its salespersons with pseudo customers. (Pseudo customers are those who come to retail stores to get help from salespeople in understanding and comparing different products but stop short of buying anything in the store.) The managers in this organization are interested in a characteristic of the entire group or the “population” of the salespersons across all of its retail stores. In statistical terms, the managers are interested in the numerical value of a “population parameter.” In this case, the parameter they are interested in is the “population mean”.

Statistical Inference is the technique of making judgments about the unknown “population parameters” of interest, such as a population mean, based on appropriate “sample statistics.” For example, the appropriate sample statistic to estimate an unknown mean of a population is the sample mean. There are two distinct but related approaches to statistical inference. These are called “Estimation” and “Hypothesis Testing”.

Estimation

In some situations a manager or an analyst may not know the numerical value of the population parameter of interest (or may not even have a tentative or claimed value for it). For example, what is the average number of miles driven by families during a summer vacation, or what proportion of the eligible voters will vote in favor of a particular candidate in an upcoming election? In such situations the technique called “Estimation” is used. Estimation deals with the determination of the numerical value of an unknown population parameter, such as the population mean (i.e., the average value of a specific measurement variable), using data from a random sample. The error that arises from the fact that the estimates are based on the data from a relatively small sample is quantified by what is called the “standard error of the sample estimate”. In general, we estimate the population parameter by specifying an estimate (sometimes called the “point estimate”) and a margin of error around it, i.e., by specifying a range of values called the “confidence interval”. We multiply the standard error of the sample estimate by an appropriate factor to obtain the margin of error; the resulting confidence interval is one in which we have a specified level of confidence. The appropriate factor is based on the distribution of the sample estimate.

Note: The confidence level is the proportion of times that our estimation procedure is correct, i.e., the long-run proportion of intervals constructed this way that contain the true population parameter.
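The recipe “point estimate plus or minus factor times standard error” can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workshop's own procedure: the function name is hypothetical, the factor is taken from the standard normal (Z) distribution, and the input numbers are made up.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, std_dev, n, confidence=0.95):
    """Return (lower, upper) for a population mean, using a z factor."""
    standard_error = std_dev / sqrt(n)
    # Two-sided factor: leave (1 - confidence) / 2 in each tail.
    factor = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = factor * standard_error
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical numbers, for illustration only.
lower, upper = z_confidence_interval(sample_mean=50.0, std_dev=10.0, n=100)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # prints 95% CI: (48.04, 51.96)
```

Raising the confidence level increases the factor and therefore widens the interval; increasing the sample size shrinks the standard error and narrows it.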

Hypothesis Testing

In some situations a manager or a business analyst may already have some idea about the population parameter based on intuition or on a claim made by others. For example, a firm’s marketing department might claim that the average sales for a new model of computer will be 500 units per week. The managerial concern here is whether to reject the claim and take an alternative position, or not to reject the claim and go with it. In such situations, the “Hypothesis Testing” technique is used.

Recall that in the confidence interval approach, a manager has no idea about the population parameter of interest (e.g., the population mean) and constructs a confidence interval using the sample estimate. The sample estimate is the value of an appropriate statistic (e.g., the sample mean) based on a random sample of observations drawn from the population.

In the hypothesis testing approach, a manager does have some specific numerical value for the population parameter of interest. This value may come from a variety of considerations. It may be the manager’s target value for the parameter or it may be a claim made by someone. To test this claim, data from a random sample drawn from the population are collected and used as evidence to judge whether the claim should be rejected.

An Informal Understanding of Hypothesis Testing

A good way to understand the hypothesis testing approach is to think of a criminal trial. The jury’s concern here is the (unknown) truth about the defendant. There are two positions: not guilty (defense’s position) or guilty (prosecution’s position). These are mutually exclusive (i.e., non-overlapping) and collectively exhaustive (i.e., there are no other positions). The jury must decide if it should (a) reject the defense’s position of “not guilty” and go with the alternate position or (b) not reject it. How does the jury decide? It looks at all the evidence presented and seeks an answer to the question: Does the evidence enable us to reject the “not-guilty” position beyond reasonable doubt?

In an analogous way, in hypothesis testing we start with two mutually exclusive and collectively exhaustive positions (or claims) about a population parameter. One of these is called the null hypothesis and the other is called the alternate hypothesis. (We will discuss the technicalities of how to designate the null and the alternate hypotheses later.) Similar to the jury’s concern of whether or not to reject the defense’s “not-guilty” claim, our concern is to take one of the following two decisions: (a) Reject the Null Hypothesis or (b) Do Not Reject the Null Hypothesis.

The jury considers a variety of evidence presented to it. In our case, the evidence is simply the appropriate “test statistic” calculated from the data from a random sample.

How does the jury decide if there is evidence beyond reasonable doubt? For example, if the prosecution establishes that the murderer was wearing blue jeans and points out that the defendant owns a pair of blue jeans, would it constitute evidence beyond reasonable doubt? We would say: most likely not. Since many people own blue jeans, the likelihood or probability of the evidence presented arising from pure chance is quite high. There is nothing extraordinary about this observation. It is not compelling enough. On the other hand, if the prosecution presents evidence that the DNA of the defendant matches the DNA of the body fluids found on the murder victim, we would say: most likely yes. The fundamental consideration here is the likelihood or the chance associated with the evidence presented or the sample outcome. The jury asks the question: “Assuming that the defendant is not guilty, what is the probability that there is a DNA match?” Since the probability of a DNA match is extremely small, the evidence that there is a DNA match is compelling enough to reject our initial position that the defendant is not guilty. In a similar vein, we first tentatively assume that the null hypothesis is true. We then determine the probability of getting a test statistic as large as or as small as the one we got. This probability is called the p-value of the test statistic. (We will discuss how to calculate this probability, or “p-value”, later.)

We finally ask the question: Is this probability small enough for us to consider the outcome rather extreme or extraordinary? If it is, we feel compelled to reject our tentative assumption that the claim (or the hypothesized value) contained in the null hypothesis is true. To answer the above question, we must establish how low the p-value should be for us to reject the null hypothesis. This is analogous to the issue of what is considered “beyond reasonable doubt” in the jury decision. What is the threshold for the p-value to be considered significant? Analysts often use the rule that a p-value below 5% can be viewed as “statistically significant”. In different settings different levels of significance may be appropriate. In medical research, for example, a significance level of 1% to 2% may be required to consider the results publication worthy. In some business applications 5% or even 10% may be acceptable. In hypothesis testing jargon the threshold or cutoff probability is called “the significance level for the hypothesis test”. It is denoted by the Greek letter α (alpha).

If the p-value is less than the significance level, we conclude that the sample outcome is so extreme that it is very unlikely to have come from a population having the parameter value specified in the null hypothesis. Hence we reject the null hypothesis and adopt the position stipulated in the alternate hypothesis. This is analogous to the jury’s conclusion that the probability of a DNA match is so small that it does not seem reasonable to continue to hold the position that the defendant is not guilty and dismiss this evidence as a fluke. Hence the jury will reject the null position and return the “guilty” verdict.

The choice of an appropriate value for α is a managerial policy decision. Generally speaking, if the consequences of mistaking chance variation for a real discovery are very bad, managers use a very strict cutoff (a low number like 1% for the significance level). If the consequences are not serious, managers use a more lenient cutoff (like 5% or 10%). In medical research the consequences of this type of mistake may be serious because an ineffective or dangerous vaccine may be approved by the FDA and given to millions of people. The p-value is often quoted in research journals. A researcher may write something like “we found that patients got better after taking our new drug (p-value < 0.01)” to mean that benefits of the drug were found to be statistically significant with a p-value of less than 1%. The advice to the manager is not to worry so much about exactly which level of significance to use. If the conclusions of your study change drastically when you adopt a slightly different level of significance, then your results are probably close to the borderline of statistical significance, and you should consider this when basing decisions on the study.

Finally, note that jury decisions are not perfect. A jury sometimes reaches a wrong verdict. Likewise, a hypothesis test may sometimes lead to an incorrect decision. There are two types of errors:

(1) A jury may reject the “not-guilty” position and wrongly convict a defendant who is truly not guilty. Likewise, a hypothesis test might reject a null hypothesis which is indeed true. This type of error is called a Type I error. It occurs if a hypothesis test finds the sample evidence to be statistically significant when in reality it is due to pure chance variation.

(2) A jury may fail to reject the “not-guilty” position and wrongly let go a defendant who is truly guilty. Likewise, a hypothesis test might fail to reject a null hypothesis which is indeed false. This type of error is called a Type II error. It occurs if the sample evidence is due to something beyond chance variation but the test does not recognize it as statistically significant.

The significance level (α) represents the maximum acceptable probability of committing a Type I error. If Type I errors are very costly you might use a smaller value for α such as 0.01 or 0.001. For example, in testing the effectiveness of a new drug for a common ailment, committing a Type I error means that an ineffective drug may be prescribed to millions of people. This is very bad and therefore it is appropriate to use a small value for α such as 0.001. On the other hand, in the case of an experimental treatment for a fatal but otherwise incurable disease, a Type II error is probably worse than a Type I error. Now a Type II error means that someone could miss out on the opportunity to be cured, and a higher value for α such as 0.01 might be better. In fact, you might consider giving this treatment to someone even if it has only been tested on a few people in the past; waiting until you can establish the adequacy of the drug at the usual level of statistical significance may be too conservative and patients may die in the meantime.
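The link between α and the Type I error rate can be checked by simulation. The sketch below repeatedly draws samples from a population where the null hypothesis is true and counts how often a two-tailed z-test rejects it anyway. Everything specific here is an assumption made for illustration: the function name, the population values (mean 100, standard deviation 0.2), the sample size, the number of trials, and the random seed.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def type_i_error_rate(alpha, trials=4000, n=25, mu0=100.0, sigma=0.2, seed=1):
    """Fraction of samples drawn under a TRUE null that the test rejects."""
    rng = random.Random(seed)
    standard_error = sigma / sqrt(n)
    standard_normal = NormalDist()
    rejections = 0
    for _ in range(trials):
        # The null hypothesis (mu = mu0) really is true for every sample.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / standard_error
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1  # a Type I error
    return rejections / trials

rate_05 = type_i_error_rate(alpha=0.05)
rate_01 = type_i_error_rate(alpha=0.01)
print(f"alpha = 0.05 -> observed Type I error rate {rate_05:.3f}")
print(f"alpha = 0.01 -> observed Type I error rate {rate_01:.3f}")
```

The observed rejection rates should land close to the chosen α values, which is what it means for α to cap the probability of a Type I error.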

The maximum probability of committing a Type II error is often denoted by the Greek letter β, and (1 - β) is called the power of a hypothesis test. In this course we will not discuss the power of a test. But it is important for you to note that once you have chosen a certain level of significance to control the probability of a Type I error (i.e., chosen a value for α), the probability of making a Type II error can be reduced only by increasing the sample size.

Let us recap and formally state the steps in hypothesis testing.

Step 1: State the Null and Alternate Hypotheses

• Note that in any decision context, there will be a certain specific numerical value claimed for the unknown population parameter µ. Let us denote it by µ0.

• There are three possible positions that you might take with respect to µ0:
  1. µ ≠ µ0
  2. µ > µ0
  3. µ < µ0

• Corresponding to each of the above positions, the complementary positions are:
  1. µ = µ0
  2. µ ≤ µ0
  3. µ ≥ µ0

• Always choose the position that has a strict inequality as the “Alternate Hypothesis” (usually denoted by Ha). Conversely, the position which has an equality component will be chosen as the “Null Hypothesis” (denoted by H0). Thus, the three possible scenarios are:

  1. H0: µ = µ0    Ha: µ ≠ µ0
  2. H0: µ ≤ µ0    Ha: µ > µ0
  3. H0: µ ≥ µ0    Ha: µ < µ0

Scenario 1 is called a “two-tailed” test. Scenarios 2 and 3 are “one-tailed” tests.

Step 2: Determine the Test Statistic (based on the sample data) or “Observed TS”

$$TS = \frac{\text{Sample Estimate} - \text{Value claimed in the Null Hypothesis}}{\text{Standard Error of the Estimate}}$$

Step 3: See if the Test Statistic is significant (two approaches)

1. Find the Critical Value of the Test Statistic corresponding to the significance level of the hypothesis test and establish the Rejection Region. The Critical Value of the Test Statistic for the specified significance level is found using the distribution of the test statistic. The type of test determines the Rejection Region. For a two-tailed test, we want to know how large or how small the observed test statistic must be for us to consider it extreme enough at the specified significance level, and hence reject the null hypothesis.

2. Find the p-value of the Observed Test Statistic. The p-value is the probability of observing a value for the test statistic as extreme as (i.e., as large as or as small as) or more extreme than the one we observed, under the assumption that the null hypothesis is true. The p-value associated with the TS is found using the distribution of the test statistic.

[Figures: rejection regions for the TWO-TAILED TEST and the ONE-TAILED TESTS]

Step 4: Make the Statistical Decision

1. Reject the Null if the Observed Test Statistic falls in the Rejection Region
2. Reject the Null if the p-value is less than the α-value

Step 5: State the Managerial Conclusion in plain English
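The five steps can be traced in a short Python sketch for the two-tailed case. This is an illustration under assumptions, not a prescribed implementation: the function name is hypothetical, the test statistic is assumed to follow the standard normal (Z) distribution with a known population standard deviation, and the sample figures are invented around the marketing claim of 500 units per week mentioned earlier.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (test statistic, critical value, p-value, reject?)."""
    standard_normal = NormalDist()
    standard_error = sigma / sqrt(n)
    # Step 2: observed test statistic.
    test_statistic = (sample_mean - mu0) / standard_error
    # Step 3, approach 1: critical value; reject if |TS| exceeds it.
    critical_value = standard_normal.inv_cdf(1 - alpha / 2)
    # Step 3, approach 2: two-tailed p-value of the observed TS.
    p_value = 2 * (1 - standard_normal.cdf(abs(test_statistic)))
    # Step 4: statistical decision (both approaches give the same answer).
    reject = p_value < alpha
    return test_statistic, critical_value, p_value, reject

# Step 1: H0: mu = 500 versus Ha: mu != 500.
# Hypothetical sample: mean 480 units/week, sigma 60, n = 36 weeks.
ts, crit, p, reject = two_tailed_z_test(sample_mean=480, mu0=500, sigma=60, n=36)
print(f"TS = {ts:.2f}, critical value = +/-{crit:.2f}, p-value = {p:.4f}")

# Step 5: managerial conclusion in plain English.
if reject:
    print("The sample does not support the claim of 500 units per week.")
else:
    print("The sample gives no strong reason to doubt the claim of 500 units per week.")
```

Here the test statistic is -2.00, which lies just beyond the critical value of about 1.96 in absolute terms, so the result is significant at 5% but would not be at 1%; that is exactly the borderline situation the text advises managers to keep in mind.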

Standard Deviation of the Sample Estimate

In the above discussion we used the standard error of the sample estimate (a) to establish a desired confidence interval for an unknown population parameter or (b) to test a claim about the population parameter. How do we get this standard error of the estimate? To answer this question, we must understand the behavior of samples randomly drawn from the population of interest in our study.

First, we will understand the behavior of the sample means ($\bar{X}$) when we take a sample (of size n) from a population with a known population mean (µ). This behavior is described by what is called the “sampling distribution of sample means”. It is the foundation for statistical inference. Also, it has significant applications in statistical process control.

Then we will learn the application of the concepts and the techniques of “Confidence Interval Estimation” and “Hypothesis Testing” in the context of a case. We will limit our focus to the estimation of an unknown population mean and testing claims about it.

Important Note: Statistical Significance versus Practical Significance

Statistical significance doesn’t tell you whether or not the results are of practical significance. Something is statistically significant if it is clearly more than just a chance occurrence, whereas something is practically significant if it would have an important impact on the business situation. If you gather enough data, almost any difference, however small, looks statistically significant because then there is very little room for chance variation.

NOTE

SAMPLING DISTRIBUTION OF SAMPLE MEANS

A Thought Experiment

Before we get into the theoretical concepts, definitions, and analytical details, please visualize playing a simple game and answer the questions that follow. In this game let us go back to the case of a consumer products company like Procter & Gamble. This firm manufactures liquid detergent, which is sold in 100-ounce plastic bottles. In the final stages of the manufacturing process the 100-ounce plastic bottles are filled with liquid detergent on an automated filling and packaging line.

The automatic bottle-filling machine used to fill the bottles is set to fill an average of 100 ounces of the detergent in each bottle. However, no machine is guaranteed to fill exactly 100 ounces in each bottle. Rather, the fill amount varies from bottle to bottle. Thus, some bottles will have slightly more than 100 ounces and some slightly less, although these bottles will be labeled as “100 ounce” bottles. It has been established through extensive data analysis that the variability in the “fill volume” is adequately represented by a normal distribution with a mean of 100 ounces and a standard deviation of 0.2 ounce. The company distributes shrink-wrapped bundles of one dozen (12) bottles to its retailers. Now let us think of the average fill volume in a bundle of 12 bottles randomly selected from the output of the machine.

1. Before we draw a 12-bottle bundle (i.e., a sample of 12 bottles), what do you expect the “Average Fill Volume” of this sample of 12 bottles (call it the “Sample Mean”) to be?

_______________________________________________________________________________

2. We draw a bundle at random, i.e., a sample of 12 bottles, measure the volume of the 12 individual bottles, and find the sample average. Suppose this average value, i.e., the sample mean, is 98 ounces. Would you consider this to be significantly lower than what you expect? How could you tell?

________________________________________________________________

Now let us formalize our thoughts and pick up some theoretical fundamentals.

Distribution of Sample Means

Consider a population of observations such as heights, weights, test scores, weekly demands, salaries, and so on.

µ = Population mean

σ = Population standard deviation

We draw random samples of n observations. In statistical jargon, we say

Sample size = n

The sample mean is a random variable. It varies from one sample to another. To work with sample means, i.e., to make judgments about the sample means or to use sample means to make judgments about the populations, we must know the distribution of the sample means. This distribution is called the sampling distribution of sample means. Recall that a distribution must specify three things: the central measure or the mean, the variability around the mean or the standard deviation, and the shape of the distribution. From the Central Limit Theorem in Statistics, we have the following result:

Mean or the expected value of the distribution of sample means $\bar{X}$:

$$E(\bar{X}) = \mu$$

The standard deviation of the distribution of sample means $\bar{X}$:

$$SD_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

The distribution of sample means follows the Normal Model.
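Applied to the bottle-filling thought experiment (population mean 100 ounces, standard deviation 0.2 ounce, bundles of 12), the result above can be worked through as a small Python sketch. The variable names are illustrative; the only inputs are the figures given in the thought experiment.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 100.0, 0.2, 12   # fill-volume population and bundle size

# Standard deviation of the sample mean: sigma / sqrt(n).
sd_of_mean = sigma / sqrt(n)

# Sampling distribution of the mean fill volume of a 12-bottle bundle.
sampling_dist = NormalDist(mu, sd_of_mean)

# How unusual is an observed sample mean of 98 ounces?
observed_mean = 98.0
z = (observed_mean - mu) / sd_of_mean
prob_at_or_below = sampling_dist.cdf(observed_mean)

print(f"expected sample mean      : {mu}")
print(f"SD of the sample mean     : {sd_of_mean:.4f} ounce")
print(f"z-score of 98 ounces      : {z:.1f}")
print(f"P(sample mean <= 98)      : {prob_at_or_below:.3g}")
```

The standard deviation of the sample mean is only about 0.058 ounce, so a bundle averaging 98 ounces sits more than 30 standard deviations below what we expect. Under the stated model that is practically impossible by chance alone, which suggests something is wrong with the filling process.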

The following conditions must be satisfied for the above result to hold:

1. Randomization Condition

• The sampling method must be unbiased and representative of the population.

2. 10% Condition

• The sample size, n, must be no more than 10% of the population size, N. If it is larger, then a “Finite Population Correction Factor” must be applied to the standard error.

3. Nearly Normal Condition and Sample Size Requirement

• The data must come from a distribution that is unimodal and approximately symmetric.

• If the data distribution is known to be normal, then any sample size is OK.

• If the data distribution is not known to be normal, then the sample size must be ≥ 30.

• The approximation to the normal distribution will become closer as the sample size increases. If the parent distribution is symmetric, smaller samples are adequate than if the parent population is skewed or long-tailed. Symmetry of the parent distribution is particularly important.
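The notes name the “Finite Population Correction Factor” without giving its formula. A commonly used form multiplies the standard error by sqrt((N - n) / (N - 1)); the sketch below assumes that form and applies it only when the 10% condition is violated. The function name and the example numbers are hypothetical.

```python
from math import sqrt

def standard_error_of_mean(sigma, n, population_size=None):
    """Standard error of the sample mean, with an optional finite
    population correction when n exceeds 10% of the population size."""
    se = sigma / sqrt(n)
    if population_size is not None and n > 0.10 * population_size:
        # Assumed standard form of the correction: sqrt((N - n) / (N - 1)).
        se *= sqrt((population_size - n) / (population_size - 1))
    return se

# Hypothetical numbers: sigma = 10, n = 100.
print(standard_error_of_mean(10, 100))                         # no population size given
print(standard_error_of_mean(10, 100, population_size=10000))  # 10% condition holds
print(standard_error_of_mean(10, 100, population_size=400))    # correction applied
```

The correction always shrinks the standard error, reflecting that a sample covering a large share of the population leaves less room for sampling variation.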

Example:

In 2008, the average salary for federal workers whose occupations also exist in the private sector was $67,691. By contrast, the average salary for employees working in similar jobs in the private sector in 2008 was $60,046. Assume that the population standard deviation of the salaries of federal workers is $15,300. A random sample of 34 federal employees is selected.

a) What is the probability that the sample mean will be less than $64,000?

b) What is the probability that the sample mean will be more than $70,000?

c) What is the probability that the sample mean will be less than $60,046?

Analysis and Solution

First, we make sure that the conditions are satisfied. In this case, we have (1) a random sample, (2) a sample size that is ≥ 30, and (3) reason to believe that the 10% condition is satisfied.

We are interested in finding probabilities about the sample mean, i.e., the average salary in the sample of federal workers whose occupations also exist in the private sector. Let us call it $\bar{X}$.

We must determine the distribution of $\bar{X}$. The distribution of $\bar{X}$ is Normal with

Mean = Population mean = 67691

$$\text{Standard Deviation} = SD_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15300}{\sqrt{34}} \approx 2623.93$$

Now, we can find the desired probabilities using the NORM.DIST function in EXCEL. In fact, we can do all the computation in an EXCEL worksheet.
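The same three probabilities can be computed outside a worksheet. The sketch below uses Python's `statistics.NormalDist` as a stand-in for Excel's NORM.DIST with the cumulative option; the variable names are illustrative, and the inputs are the figures from the example (mean 67691, standard deviation 15300, n = 34).

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 67691, 15300, 34

# Standard deviation of the sample mean.
sd_of_mean = sigma / sqrt(n)

# Sampling distribution of the sample mean salary.
xbar = NormalDist(mu, sd_of_mean)

# a) P(sample mean < 64000); Excel: NORM.DIST(64000, 67691, sd, TRUE)
p_a = xbar.cdf(64000)
# b) P(sample mean > 70000); Excel: 1 - NORM.DIST(70000, 67691, sd, TRUE)
p_b = 1 - xbar.cdf(70000)
# c) P(sample mean < 60046); Excel: NORM.DIST(60046, 67691, sd, TRUE)
p_c = xbar.cdf(60046)

print(f"SD of sample mean: {sd_of_mean:.2f}")
print(f"a) {p_a:.4f}   b) {p_b:.4f}   c) {p_c:.4f}")
```

Part (c) is the smallest of the three: a sample of 34 federal workers averaging below the private-sector figure would be a rare outcome under the stated population mean and standard deviation.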

START WORKSHOP - STATISTICS

NOTE

ESTIMATION OF AN UNKNOWN POPULATION MEAN

Guided Case Analysis: Retail Store Operation

ELCO is a chain of stores that sells consumer electronics in the US. ELCO suspects that the main reasons for declining profits are the falling quality of service and growing competition. Managers at ELCO want to know the average time that a salesperson spends with customers. They are worried about pseudo customers who take up a salesperson’s time to get details about a product but then make their purchases elsewhere. ELCO managers are concerned about the average time spent by salespersons with pseudo customers across the entire population of salespersons in the company.

Specifically, ELCO managers are interested in estimating the population average time spent with pseudo customers by the company’s salespersons. They want 95% confidence in their estimate.

ELCO has collected data on the service time spent with pseudo customers in a day from a random (i.e., representative) sample of 100 salespersons. The data set is given in the following table. See EXCEL file service.

Data Set

Time spent by salespersons with pseudo customers.

Service time in seconds 3897 6743 6692 5301 2466 5702 4973 3482 5456 6981 3589 4320 1245 562 6824 9010 8910 1003 8821 5797 6712 1349 4239 2134 4687 1688 8904 3099 921 5817 6984 2485 8901 4111 8903 8933 6986 7133 2349 9042 7120 4713 4344 5921 1471 7432 7059 8425 7027 5479 6934 7234 1358 2302 8324 2309 2329 7912 2399 4456 7632 11921 1357 5691 3216 4865 9249 8349 3369 4771 9214 5578 2316 1279 3130 5892 3870 2390 3190 7243 2390 2891 8238 4349 1208 3999 4389 2348 5681 3123 4992 3356 1217 1109 5002 4006 1730 2100 2305 7349

Analysis and Solution

The variable of interest here is “the time spent by salespersons with pseudo customers.” Let us denote it by the symbol X. X is a random variable because its value changes across salespersons and days of operation.

We are interested in “the average time spent with pseudo customers by the entire population of the company’s salespersons”. This is the population parameter, i.e., the population mean (denoted by µ). The numerical value of µ is unknown.

We estimate the value of an unknown population parameter using data from a representative sample and determining the appropriate sample statistic. In this case the appropriate sample statistic is the sample mean. We call it the sample estimate.

The sample estimate is not perfect, in the sense that we cannot be 100% confident that the population mean is exactly equal to the sample estimate. Sampling error is inevitable. We must recognize a margin of error surrounding our sample estimate that is based on (a) the variability in our sampling process and (b) the level of confidence we desire. Therefore, we estimate the unknown population mean not by a single number (i.e., the sample estimate) but by an interval called the “Confidence Interval”. A confidence interval for the population mean is given by

Sample Estimate ± [Margin of Error]

The margin of error is a product of two components:

Margin of Error = Confidence Factor x Standard Deviation of the Sample Estimate


In the present case, the sample estimate is the sample mean, computed in EXCEL as “=AVERAGE(Data)”. The standard deviation of the estimate (i.e., the sample mean) is given by the formula

$SD_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

In this formula, we know that $n = 100$, but we do not know $\sigma$, the population standard deviation. (In fact, except in some rare situations, $\sigma$ is usually unknown.) So how do we proceed?

Statisticians have offered a solution to this problem. Recognize that $\sigma$ is essentially a measure of the variability in the population. If we don’t know its numerical value, the best we can do is to substitute the measure of variability in the sample that is representative of the population. This measure is the sample standard deviation (denoted by s). We can easily calculate (or let EXCEL calculate) the numerical value of the sample standard deviation from the sample data. But this substitution comes with some adjustments.

First, we use a slightly different terminology. Instead of using the term “Standard Deviation of the sample estimate” we use the term “Standard Error of the sample estimate”. We use the notation $SE_{\bar{X}}$.

The standard error of the estimate for the mean is given by the following formula:

$SE_{\bar{X}} = \frac{s}{\sqrt{n}}$

We can compute the standard error easily.


Second, and more important, what about the confidence factor? If we use the above formula, the appropriate sampling distribution will no longer be the normal distribution that we used in the previous discussion. This means that we cannot find the confidence factors using the Z-distribution functions. We must use a slightly different distribution called the T-distribution. But it is not a big deal!

What it means is just this: to find the value of the confidence factor we should use the T-distribution rather than the Z-distribution. Using T-distribution functions is very much like using the normal and standard normal distribution functions, although the way EXCEL’s T-distribution functions work is somewhat different. We must learn how to use the T-distribution functions in EXCEL.

A key difference between the two distributions is this: whereas there is a unique standard normal distribution, no matter what the sample size is, the T-distribution depends on the sample size. More precisely, it depends on (n−1), which is usually called the degrees of freedom. A T-distribution resembles the Z-distribution but has thicker ‘tails’. A T-distribution with large degrees of freedom closely resembles the standard normal distribution.

We will now learn how to use the EXCEL functions related to the T-distribution.

[Figure: T.INV.2T function]

[Figure: T.DIST.2T function]

[Figure: T.INV function]

[Figure: T.DIST function]

[Figure: T.DIST.RT function]

Back to the Case: How to Find the 95% Confidence Factor?

For a 95% confidence level, the T-value must be such that 5% of the area (equally divided between the two tails) under the T-distribution curve falls outside this value.

Visualization: [Figure: T-distribution curve with 2.5% of the area in each tail]

Finding the desired T-value:

Use the T.INV.2T function, enter 0.05 for probability and (100 − 1) or 99 for Degrees of Freedom, or enter “=T.INV.2T(0.05, 99)” in any cell.

We get the desired confidence factor = 1.98421
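To see what the two-tailed T functions are doing, here is a rough pure-Python sketch. It is not how EXCEL computes these values: it numerically integrates the T density with Simpson's rule and inverts by bisection, and the names `t_pdf`, `t_cdf`, `t_dist_2t` and `t_inv_2t` are hypothetical helpers written only for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of the T-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df):
    """Area to the LEFT of x: Simpson's rule on [0, |x|], then symmetry."""
    a = abs(x)
    if a == 0:
        return 0.5
    n = 2 * max(1, math.ceil(a / 0.02))  # even panel count, panel width <= 0.01
    h = a / n
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def t_dist_2t(x, df):
    """Total area in the two tails beyond |x| (the quantity T.DIST.2T reports)."""
    return 2 * (1 - t_cdf(abs(x), df))

def t_inv_2t(p, df):
    """Positive t whose two-tailed area equals p (the quantity T.INV.2T reports)."""
    lo, hi = 0.0, 1.0
    while t_dist_2t(hi, df) > p:   # widen the bracket until the tail area drops below p
        hi *= 2
    for _ in range(60):            # bisection
        mid = (lo + hi) / 2
        if t_dist_2t(mid, df) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# 95% confidence factor for a sample of 100 (99 degrees of freedom)
print(round(t_inv_2t(0.05, 99), 5))
```

The call `t_inv_2t(0.05, 99)` mirrors “=T.INV.2T(0.05, 99)”, and should agree with the worksheet value to several decimal places.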

Now we have all three pieces required to build the 95% Confidence Interval:

Sample Estimate ± Confidence Factor x Standard Error of the Sample Estimate

Answer: A 95% confidence interval for the population mean time spent by salespersons with pseudo customers is: [4362 seconds, 5398 seconds]. We are 95% confident that this interval contains the true population mean.
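The arithmetic behind this interval can be sketched in a few lines of Python. The sample mean (4880.03) and sample standard deviation (2610.622) are the values reported for this data set in the EXCEL worksheet of the hypothesis-testing note; the confidence factor is the T.INV.2T result quoted above. Variable names are illustrative only.

```python
from math import sqrt

n = 100                  # sample size
sample_mean = 4880.03    # =AVERAGE(A2:A101), as reported in the worksheet
sample_sd = 2610.622     # =STDEV.S(A2:A101), as reported in the worksheet
t_factor = 1.98421       # =T.INV.2T(0.05, 99)

# Standard error of the sample mean: s / sqrt(n)
se = sample_sd / sqrt(n)

# Margin of error = confidence factor x standard error
margin = t_factor * se

lower = sample_mean - margin
upper = sample_mean + margin
print(f"SE = {se:.4f}, margin of error = {margin:.2f}")
print(f"95% CI: [{lower:.0f}, {upper:.0f}] seconds")
```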


NOTE

HYPOTHESIS TESTING - UNKNOWN POPULATION MEAN

Guided Case Analysis: Retail Store Operation

ELCO is a chain of stores that sells consumer electronics in the US. ELCO suspects that the main reasons for declining profits are the falling quality of service and growing competition. Managers at ELCO want to know the average time that a salesperson spends with customers. They are worried about pseudo customers who take up a salesperson’s time to get details about a product but then make their purchases elsewhere. ELCO managers are concerned about the average time spent by salespersons with pseudo customers across the entire population of salespersons in the company.

The manager of the Human Resources Department has claimed that the average time spent by salespersons with pseudo customers is equal to 15% of an 8-hour work day, or 4320 seconds. Senior managers at ELCO are wondering how they should react to the claim made by the HR Department manager: should they reject the claim or not reject it? They deem that the decision to reject the claim must be made at a stringent significance level of 1 percent, or 0.01.

ELCO has collected data on a simple random sample of 100 observations (service time spent with pseudo customers in a day by 100 salespersons). The data set is in the EXCEL file service.


Analysis

We first recognize that this is a hypothesis test of a claim made about the unknown population mean.

Step 1: Statement of the Hypotheses

H0: µ = 4320

Ha: µ ≠ 4320

It is a two-tailed test. Significance level α = 0.01

Step 2: Test Statistic

$TS = \frac{\text{Sample Estimate} - \text{Value claimed in the Null Hypothesis}}{\text{Standard Error of the Estimate}}$

This test statistic follows a T-distribution with (n-1) = (100-1) = 99 degrees of freedom.

Step 3: Decision Criteria

1. Critical Values of the Test Statistic and the Rejection Region
2. P-value

Step 4: Statistical Decision

Step 5: Managerial Conclusion

Let us do the computations in EXCEL and complete the above steps.

EXCEL Worksheet

Hypothesis Testing of Population Mean - Two-tailed Test

Item | Symbol or EXCEL formula | Value
Claimed value of the Population Mean | µ0 | 4320
Type of test | Direction of alternate hypothesis: ≠ | 2-Tail
Significance Level | α | 0.01
Confidence Level | (1-α) | 0.99
Sample size | n = COUNT(A2:A101) | 100
Sample Estimate = Sample Mean | =AVERAGE(A2:A101) | 4880.03
Sample Standard Deviation, s | =STDEV.S(A2:A101) | 2610.622
Standard Error of the Sample Estimate, SE | Sample Standard Deviation / SQRT(n) | 261.0622
Observed Sample Test Statistic | T = (Observed value - Claimed value) / Standard Error of the Estimate | 2.145197
Degrees of Freedom | (n-1) | 99
Critical value of Test Statistic for the level of significance | T* = T.INV.2T(α, df) | 2.626405
P-value of the Observed Sample Test Statistic = the area in the two tails beyond the Absolute Value of the Observed Test Statistic | =T.DIST.2T(Observed Sample Test Statistic, df) | 0.034383
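As a cross-check on the worksheet, the test statistic and the two decision criteria can be sketched in Python. The critical value and P-value are simply copied from the worksheet (they are not recomputed here), and the variable names are illustrative.

```python
from math import sqrt

mu0 = 4320.0             # value claimed in the null hypothesis
alpha = 0.01             # significance level
n = 100                  # sample size
sample_mean = 4880.03    # worksheet: =AVERAGE(A2:A101)
sample_sd = 2610.622     # worksheet: =STDEV.S(A2:A101)
t_critical = 2.626405    # worksheet: =T.INV.2T(0.01, 99)
p_value = 0.034383       # worksheet: =T.DIST.2T(2.145197, 99)

# Step 2: observed test statistic
se = sample_sd / sqrt(n)
t_observed = (sample_mean - mu0) / se

# Step 3: the two equivalent decision criteria for a two-tailed test
reject_by_critical_value = abs(t_observed) > t_critical
reject_by_p_value = p_value < alpha

print(f"T = {t_observed:.6f}")
print("Reject H0 (critical value rule):", reject_by_critical_value)
print("Reject H0 (P-value rule):", reject_by_p_value)
```

With these numbers both rules give the same answer: the observed statistic lies inside the critical values and the P-value exceeds 0.01, so the HR manager's claim is not rejected at the 1 percent level.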

Appendix 1: Normal Distribution

The normal distribution is a continuous probability distribution useful in describing many real-world situations. It is also a very important distribution in statistical applications. There is actually a family of normal distributions, with each distribution completely specified by the values of two parameters, the mean, µ, and the standard deviation, σ. Every normal distribution is symmetric and centered at its mean. The standard deviation determines how spread out the values in the distribution are.

Although the limits of any normal distribution are, in theory, ± ∞, about 99.7% of the values are within ± 3σ of µ.

Probability in a normal distribution is determined as the area under the normal curve. The total area (total probability) under the normal curve is 1.

The Standard Normal Distribution

If X is normal with mean µ and standard deviation σ, then

$Z = \frac{X - \mu}{\sigma}$

is called standard normal. A probability statement about any normal random variable X can be transformed into an equivalent probability statement about the standard normal random variable Z.

The Z-distribution has:
Mean: μ = 0
Standard Deviation: σ = 1.0

Z will always be used to represent a standard normal random variable. Probabilities under the standard normal curve have been tabulated and are shown in a table of standard normal probabilities.

Note that P(Z ≤ z) = P(Z < z) (including or excluding a single number does not change the probability).

Cumulative Standard Normal (CSN) Table: P(Z < z)

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
1.6   0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
1.7   0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.8   0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.9   0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
2.0   0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
2.1   0.9821  0.9826  0.9830  0.9834  0.9838  0.9842  0.9846  0.9850  0.9854  0.9857
2.2   0.9861  0.9864  0.9868  0.9871  0.9875  0.9878  0.9881  0.9884  0.9887  0.9890
2.3   0.9893  0.9896  0.9898  0.9901  0.9904  0.9906  0.9909  0.9911  0.9913  0.9916
2.4   0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936
2.5   0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
2.6   0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
2.7   0.9965  0.9966  0.9967  0.9968  0.9969  0.9970  0.9971  0.9972  0.9973  0.9974
2.8   0.9974  0.9975  0.9976  0.9977  0.9977  0.9978  0.9979  0.9979  0.9980  0.9981
2.9   0.9981  0.9982  0.9982  0.9983  0.9984  0.9984  0.9985  0.9985  0.9986  0.9986
3.0   0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990
3.1   0.9990  0.9991  0.9991  0.9991  0.9992  0.9992  0.9992  0.9992  0.9993  0.9993
3.2   0.9993  0.9993  0.9994  0.9994  0.9994  0.9994  0.9994  0.9995  0.9995  0.9995
3.3   0.9995  0.9995  0.9995  0.9996  0.9996  0.9996  0.9996  0.9996  0.9996  0.9997
3.4   0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9997  0.9998
3.5   0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998  0.9998
3.6   0.9998  0.9998  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
3.7   0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
3.8   0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
3.9   1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
4.0   1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000

Examples Using the Standard Normal Table

a. Find P(Z < 1.00)

b. Find P(0 < Z < 1)

c. Find P(-1.3 < Z < 2.0)

d. Find P(-1.57 < Z < -0.82)

e. Find P(Z < -2.53)

f. Find P(Z < -6)

g. Find z so that a probability of 5% falls above (to the right) of that value: P(Z > z) = 0.05.
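Table lookups for examples a through g can be checked with a short Python sketch. Here `statistics.NormalDist()` from the standard library (mean 0, standard deviation 1) plays the role of the CSN table; the dictionary name `answers` is just for illustration. Negative z values are handled directly, whereas with the table you would use symmetry: P(Z < -z) = 1 - P(Z < z).

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

answers = {
    "a": Z.cdf(1.00),                   # P(Z < 1.00)
    "b": Z.cdf(1) - Z.cdf(0),           # P(0 < Z < 1)
    "c": Z.cdf(2.0) - Z.cdf(-1.3),      # P(-1.3 < Z < 2.0)
    "d": Z.cdf(-0.82) - Z.cdf(-1.57),   # P(-1.57 < Z < -0.82)
    "e": Z.cdf(-2.53),                  # P(Z < -2.53)
    "f": Z.cdf(-6),                     # P(Z < -6), essentially zero
    "g": Z.inv_cdf(0.95),               # z with 5% of the area to its right
}

for label, value in answers.items():
    print(f"{label}. {value:.4f}")
```

For part g the area to the right is 5%, so the area to the left is 95%; that is why `inv_cdf` is called with 0.95.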


Using EXCEL to Find the Desired Probabilities

Let us use the following example as context to illustrate the use of EXCEL to find the probabilities of interest.

Example:

Each year thousands of high school students take the Scholastic Aptitude Test (SAT). The distribution of the scores on each SAT is approximately unimodal and symmetric, and it is well described by a Normal model with a mean of 500 and a standard deviation of 100.

1. Suppose a student scored 600 on an SAT test. Where does this student stand among all the students that took this SAT?
2. What proportion of students scored between 450 and 600 on this SAT?
3. Suppose a college says it accepts only students with SAT scores among the top 10%. How high should a student’s SAT score be in order to be accepted at this college?

Solution Procedure

First, visualize the situation by drawing a picture of the Normal distribution and marking out the desired probability (i.e., the area under the normal curve).

To find areas (or probabilities) we use the NORM.DIST function in EXCEL.

The function NORM.DIST(X, mean, std-dev, TRUE) gives the area under the normal curve to the LEFT of the value that you input for “X”.

You may follow one of two alternative approaches:

1. In any EXCEL cell enter the formula “=NORM.DIST(X, mean, std-dev, TRUE)” with appropriate numerical values and then hit “enter”.

2. With the cursor on any EXCEL cell, click on the built-in function button and choose NORM.DIST from the menu of “Statistical” functions. This will open the dialog box, where you can fill in the appropriate numerical values.

Answer to Question 1

Visualization: [Figure: Normal curve with the area to the left of 600 marked]

Answer: The student’s score of 600 is such that about 84% of the students scored below it.

Answer to Question 2

Visualization: [Figure: Normal curve with the area between 450 and 600 marked]

The desired area is the difference between two areas. We can find these separately and then do the subtraction. Or we can directly enter the formula that represents the subtraction into an EXCEL cell and get the answer.

Answer: 53.28% of the students scored in the range between 450 and 600.

Answer to Question 3

In this case, we know the probability (or the area) and we must find the corresponding score (or the X-value). Specifically, we must find X such that the area under the curve to the right of X is 10%, or equivalently the area to the left of X is 90%.

NORM.INV(Probability, mean, standard-dev) gives the X value for which the area under the normal curve to the LEFT is equal to the value you input for probability.

Answer: The cutoff score at this college is 628.
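The three SAT answers can be reproduced with a brief Python sketch, offered only as a cross-check on the EXCEL results. `NormalDist.cdf` stands in for NORM.DIST with the cumulative flag set to TRUE, and `NormalDist.inv_cdf` stands in for NORM.INV; the variable names are illustrative.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)   # Normal model for SAT scores

# Question 1: area to the left of 600
below_600 = sat.cdf(600)

# Question 2: difference of two left-tail areas
between_450_and_600 = sat.cdf(600) - sat.cdf(450)

# Question 3: top 10% means 90% of the area lies to the left of the cutoff
cutoff = sat.inv_cdf(0.90)

print(f"1. Proportion below 600: {below_600:.4f}")
print(f"2. Proportion between 450 and 600: {between_450_and_600:.4f}")
print(f"3. Cutoff score for the top 10%: {cutoff:.0f}")
```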

Appendix 2: Practice Problems

1. Filling Tide Detergent Bottles

Procter and Gamble manufactures liquid Tide detergent (among many other products). Liquid Tide is sold in plastic bottles. One of the final steps in the manufacturing process is to fill the bottles of Tide. One machine used to fill the bottles is set to put an average of 100 ounces of Tide in each bottle. However, this machine cannot be guaranteed to put exactly 100 ounces of Tide in each bottle. Rather, the fill amount is known to follow a normal distribution with a mean of 100 ounces and a standard deviation of 0.2 ounces. Thus, some bottles will contain slightly more than 100 ounces and some slightly less, even though these bottles will be labeled as “100 ounce” bottles.

a. What is the probability that less than 99.6 ounces will be put into a “100 ounce” bottle of liquid Tide?

b. Calculate the probability that a single bottle of Tide will contain between 99.9 and 100.1 ounces.

c. What is the 90th percentile of the fill amounts?

d. Suppose P&G can adjust the mean fill amount on the machine that fills the Tide bottles. At what value should the mean fill be set in order to ensure that only 5% of the Tide bottles will contain less than 99.8 ounces?

e. We plan to examine a random sample of 100 Tide bottles to assess the operating efficiency of the machine. Calculate the probability that the average fill for a random sample of 100 Tide bottles is between 99.9 and 100.1 ounces.


2. Stereo Component Warranty

A company that produces an expensive stereo component is considering offering a warranty on the component. Suppose the population of lifetimes of the components is a normal distribution with a mean of 84 months and a standard deviation of 7 months. If the company wants no more than 2% of the components to wear out before they reach the warranty date, what number of months should be used for the warranty? (Answer: 69.62, or about 70 months)

3. Textbook

A large required chemistry course at a state university has been using the same textbook for a number of years. Over the years, the students have been asked to rate this text on a 10-point scale, and the average rating has been stable at about 5.2. This year the faculty decided to try a new text. After the course, 35 randomly selected students were asked to rate this new text. The results are shown below:

6 3 6 7 6
10 6 8 7 10
3 6 5 7 8
10 6 7 6 4
6 6 4 6 8
7 7 9 10 9
5 8 6 8 7

The sample mean of the 35 sample values is 6.77. The sample standard deviation is 1.85. Do the data provide evidence that the average rating for the new book is different from that of the old book (5.2)?

Hypotheses:

H0:

Ha:

Decision Rule:

Test Statistic:

Decision:

Conclusion:

4. Battery Lifetimes

DC Company makes batteries for cell phones. Recently, the R&D department at the company came up with a new battery design that they believed would last longer than batteries currently on the market. However, senior managers were concerned about making the claim that the new battery lifetime was greater, on average, than the current industry standard of 30 hours. They believed there would be serious bad publicity and sales would decline if trade publication tests showed otherwise. Before the new battery is put into production, the company planned to test a random sample of 100 batteries. The battery lifetime data is shown below:

42.2 30.9 27.4 35 26.8 33.2 38.5 31.2 29.4 25.2

34 26.1 28.4 31.3 39.1 32.8 24.6 30.2 19.6 37.9

29.9 42.8 30 30 32.8 34.6 37.8 26.4 32.2 32.9

26 39.8 28.7 30.5 26.6 31.8 31.1 34.3 22.3 29.6

29.6 30.3 35.3 34.3 32 29.5 27.5 21.7 21.6 35

19.7 38.3 32.1 26.3 30.7 30.7 29.6 26.8 25.1 33.3

26.5 29.5 31.6 31.3 38.8 31.4 28.9 26.6 33.9 25.1

35.7 37.2 28.9 32.2 30.1 32.3 32.7 28.2 26.2 30.4

29.7 29.1 44.1 28.2 30.5 30.8 29.4 33.7 29.2 25.6

29.4 28.1 26.1 26.4 30.3 30.9 21.3 37.3 24.8 38.2

Is there sufficient evidence to conclude that the population of new batteries will have average lifetime greater than 30 hours? Set up the hypotheses and conduct the test using α = .05. The sample standard deviation is 4.816 hours. The sample mean of the 100 lifetimes is 30.639 hours.

Hypotheses:

H0:

Ha:

Decision Rule:

Test Statistic:

Decision:

Conclusion:
