Statistics for Molecular Medicine -- Statistical inference and statistical tests
Barbara Kollerits & Claudia Lamina, Medical University Innsbruck, Division of Genetic Epidemiology
Molekulare Medizin, SS 2016
The principles of statistical testing:
Formulating Hypothesis & Test-statistics & p-values
Formulating Hypothesis & Statistical Tests
Steps in conducting a statistical test:
■ Quantify the scientific problem from a clinical / biological perspective
■ Formulate the model assumptions (distribution of the variable of interest)
■ Formulate the problem as a statistical testing problem: null hypothesis versus alternative hypothesis
■ Define the "error" you are willing to tolerate
■ Calculate the appropriate test statistic
■ Decide for or against the null hypothesis
Formulating Hypothesis & Statistical Tests
Hypothesis formulation:
■ Null hypothesis H0: the conservative hypothesis you want to reject
■ Alternative hypothesis H1: the hypothesis you want to prove
■ Examples:
Scientific hypothesis: A new therapy is assumed to prevent myocardial infarctions (MI) in risk patients better than the old therapy.
Statistical hypothesis: H0: π_new ≥ π_old versus H1: π_new < π_old,
with π_new / π_old the proportion of patients experiencing an MI during the study under the new / old therapy.
Scientific hypothesis: Women and men achieve equally good scores in the EMS-AT test.
Statistical hypothesis: H0: μ_men = μ_women versus H1: μ_men ≠ μ_women,
with μ_men / μ_women the mean scores for men / women.
Formulating Hypothesis & Statistical Tests
Possible decisions in statistical tests:

              Decide for H0                       Decide for H1
Reality H0    Correct decision                    Wrong decision: Type I error (α)
Reality H1    Wrong decision: Type II error (β)   Correct decision: Power (1-β)
■ Type I and Type II error cannot be minimized simultaneously
■ Statistical tests are constructed such that the probability of a Type I error is not bigger than the significance level α (typically set to 0.01 or 0.05)

Example:
■ Test the new MI therapy on patients at a significance level of 5%.
■ In reality, H0 is true and there is no difference between the therapies.
■ If the study is repeated 100 times on 100 different samples, the statistical test rejects the null hypothesis in at most 5 of the 100 tests.
Formulating Hypothesis & Statistical Tests
The one-sample test of the mean: Gauß-Test (also called z-Test)
■ Situation: Compare the sample mean (μ_sample) with a specified mean (μ_0)
■ Assumption: normal distribution of the sample
■ Hypothesis: H0: μ_sample = μ_0 versus H1: μ_sample ≠ μ_0

Example: From a former population-based sample, you know that the mean of the non-fasting cholesterol level was 230. Now you have finished the measurements in your new sample. Since you want to conduct a study on cholesterol levels that is comparable to the old study, you test whether the mean in the new study equals the mean in the old study:
One-sample test of the mean: H0: μ_sample = 230 vs. H1: μ_sample ≠ 230
Assuming normal distribution and assuming H0 is true, standardization gives the test statistic

T = √n · (X̄ - μ_0) / σ ~ N(0,1)

If the test statistic is "too extreme", it is not very likely that H0 is true → reject H0 if T is "too extreme".
Formulating Hypothesis & Statistical Tests
■ You cannot avoid a Type I error, but you can control it, since it can only occur if H0 is true. Since the test statistic follows a known distribution under H0, the probability of each value given H0 can be calculated.
The probability distribution given H0 is N(0,1):
[Figure: standard normal density with the acceptance region (area = 1-α) between the critical values -z_{1-α/2} and z_{1-α/2}, and a rejection region of area α/2 in each tail]
■ All values falling in the rejection region (|T| ≥ z_{1-α/2}) do not support the null hypothesis (significance level α).
Formulating Hypothesis & Statistical Tests
The one-sample test of the mean revisited: Gauß-Test (also called z-Test)
■ Situation: Compare the sample mean (μ_sample) with a specified mean (μ_0)
■ Assumption: normal distribution of the sample
■ Hypothesis: H0: μ_sample = μ_0 versus H1: μ_sample ≠ μ_0

Assuming normal distribution and assuming H0 is true:
Test statistic: T = √n · (X̄ - μ_0) / σ ~ N(0,1)
Test decision: |T| > z_{1-α/2}: Reject H0 → the test is "significant" at level α
Formulating Hypothesis & Statistical Tests
■ Attention: Rejection of H0 is not a decision for H1, but a decision against H0, since no distribution can be specified if H1 is true.
■ The (1-α/2)-quantiles can be found in tables or calculated by computers; e.g. the 97.5% quantile of the standard normal distribution, used for a 5% two-sided significance test, is ≈ 1.96.
■ As for confidence intervals: if σ is not known and the sample size is not large enough, estimate σ by S.
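The decision rule above can be sketched in a few lines; the sample values, the (assumed known) σ = 40 and the helper name z_test are illustrative assumptions, not taken from the slides:

```python
import math
from statistics import NormalDist

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample Gauß test (z-test) with known sigma."""
    n = len(sample)
    xbar = sum(sample) / n
    t = math.sqrt(n) * (xbar - mu0) / sigma       # T ~ N(0,1) under H0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{1-alpha/2}
    return t, z_crit, abs(t) > z_crit

# Hypothetical cholesterol values; test H0: mu = 230 with sigma = 40 assumed known
t, z_crit, reject = z_test([225, 240, 250, 231, 219, 247, 255, 236], 230, 40)
```

Here |T| stays below z_{1-α/2} ≈ 1.96, so H0 is not rejected.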
Formulating Hypothesis & Statistical Tests
■ So far, a statistical test gives you a test statistic. If the test statistic is more extreme than the critical value, decide against the null hypothesis; if it is smaller than the critical value, decide for the null hypothesis.
■ This is a simple yes/no decision rule; it does not give you the certainty / strength of the decision.
■ Assume two two-sided z-tests at α = 5%: one gives T = 2, the other T = 2.6.
■ Both test statistics are > z_{1-α/2} (≈ 1.96) → both are "significant".
■ However, 2.6 is "more extreme" than 2, given the truth of the null hypothesis!
[Figure: standard normal density with the values -2.6, -2, 2 and 2.6 marked on the x-axis]
Formulating Hypothesis & Statistical Tests
The p-value:
■ The probability of estimating an effect that is as extreme as the observed effect or even more extreme, under the assumption that the null hypothesis (= no association) is true.
■ For a two-sided test, the areas under the curve in both tails of the distribution function have to be added.
[Figure: standard normal density; each tail area beyond ±2 is ≈ 0.025, each tail area beyond ±2.6 is ≈ 0.005]
For T = 2, the two-sided p-value is ≈ 0.025 + 0.025 = 0.05; for T = 2.6 it is ≈ 0.005 + 0.005 = 0.01.
Formulating Hypothesis & Statistical Tests
■ The p-value p is a measure of certainty against the null hypothesis.
Example: A one-sample z-test comparing the sample mean to μ_0 (H0: μ_sample = μ_0; H1: μ_sample ≠ μ_0) results in a test statistic T = 2.6, which corresponds to a p-value of 0.01.
A popular interpretation, but wrong: "The probability that the sample mean is different from μ_0 is 1%." The sample mean does not have a probability: it equals μ_0 or it does not!
Correct interpretation: "A different random sample is drawn 100 times from the population of interest. The population mean is μ_0 (= null hypothesis). At most 1 of the 100 experiments results in a test statistic ≥ |2.6|." The randomness lies in the sample!
■ If p < α, reject H0; this is equivalent to the decision rule |T| > z_{1-α/2}.
Cholesterol example, three equivalent decisions:
|T| > z_{1-α/2}: 6.3 > 1.96 → reject H0
p = 2.97e-10 < 0.05 → reject H0
230 not in the 95% confidence interval [234.89, 239.31] → reject H0
→ The sample mean is significantly different from the specified mean.
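The two-sided p-values quoted in this section follow directly from the standard normal distribution; a small sketch using only the Python standard library:

```python
from statistics import NormalDist

def two_sided_p(t):
    """Two-sided p-value for a z statistic: tail area beyond |t| on both sides."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

p_20 = two_sided_p(2.0)   # ~ 0.046 (the "just significant" case)
p_26 = two_sided_p(2.6)   # ~ 0.01
p_63 = two_sided_p(6.3)   # ~ 3e-10 (the cholesterol example)
```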
Formulating Hypothesis & Statistical Tests: Exercise
The manufacturer of a laboratory measurement device claims that one measurement will take 5 minutes on average. You want to test that statement with 10 measurements (mean = 5.3, standard deviation = 0.3).
What hypothesis do you want to test? H0: ___ versus H1: ___
Determine the test statistic: T = ___
What is the critical value (significance level 5%)? Test decision?

Quantiles of the standard normal distribution:
p    0.95   0.975   0.995
z_p  1.64   1.96    2.58
The most common statistical tests:
Testing measures of location

The most common statistical tests

Quantitative / Continuous outcome variable:
                    Normal distribution             Any other distribution
Compare 2 groups    t-test                          Wilcoxon test / Mann-Whitney U-test
Compare >2 groups   Analysis of Variance (ANOVA)    Kruskal-Wallis test

Qualitative / Categorical outcome variable:
                    Expected frequency in each      Expected frequency in each
                    cell of the crosstable "high"   cell of the crosstable "low"
Compare 2 groups    Chi-square test                 Fisher's exact test
Compare >2 groups   Chi-square test                 Fisher's exact test

Testing measures of location: Does the mean/median differ between the groups?
Testing frequencies in a crosstable: Are the rows and columns independent of each other?
Testing measures of location
The one-sample t-test (the "standard test" for mean comparisons):
■ Situation: Compare the sample mean (μ_sample) with a specified mean (μ_0)
■ Assumption: normal distribution of the sample, σ is not known
■ Hypothesis: H0: μ_sample = μ_0 versus H1: μ_sample ≠ μ_0
■ Test statistic: T = √n · (X̄ - μ_0) / S ~ t(n-1) under H0
■ Test decision for a two-sided test: |T| > t_{n-1, 1-α/2}: Reject H0
■ Test decision for a one-sided test: |T| > t_{n-1, 1-α}: Reject H0
■ The Gauß test is the general form of the t-test, if σ were known
■ The t-test approximates the Gauß test for large n
■ Quantiles of the t-distribution are needed to decide for or against the null hypothesis
■ But in practice: statistical programs give out p-values
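In practice a statistics package computes T and the p-value directly; a minimal SciPy sketch (the cholesterol values below are hypothetical illustration data):

```python
from scipy import stats

# Hypothetical cholesterol measurements; test H0: mu = 230 with sigma unknown
sample = [225, 240, 250, 231, 219, 247, 255, 236]
t, p = stats.ttest_1samp(sample, popmean=230)  # two-sided by default
# reject H0 at alpha = 0.05 only if p < 0.05
```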
Testing measures of location
Comparing the normal and t-distribution:
[Figure: the t-distribution has heavier tails than N(0,1); with increasing degrees of freedom it approaches the standard normal distribution]

Quantiles of the t-distribution:
0.95 quantile: critical value for a one-sided test at significance level 5%; example, n = 16: t0.95(15) = 1.753
0.975 quantile: critical value for a two-sided test at significance level 5%; example, n = 823: t0.975(822) ≈ 1.96 ≈ the quantile of the standard normal distribution

df (= n-1)   0.95    0.975
1            6.314   12.706
2            2.920   4.303
3            2.353   3.182
4            2.132   2.776
5            2.015   2.571
6            1.943   2.447
7            1.895   2.365
8            1.860   2.306
9            1.833   2.262
10           1.812   2.228
11           1.796   2.201
12           1.782   2.179
13           1.771   2.160
14           1.761   2.145
15           1.753   2.131
16           1.746   2.120
17           1.740   2.110
18           1.734   2.101
19           1.729   2.093
20           1.725   2.086
21           1.721   2.080
22           1.717   2.074
23           1.714   2.069
24           1.711   2.064
25           1.708   2.060
26           1.706   2.056
27           1.703   2.052
28           1.701   2.048
29           1.699   2.045
30           1.697   2.042
120          1.658   1.980
∞            1.645   1.960
Testing measures of location: Exercise
Determine the following critical values:

One-sided or two-sided test   Significance level   n        Critical value
two                           0.05                 >> 120   ___
one                           0.05                 >> 120   ___
two                           0.05                 15       ___
one                           0.05                 28       ___
two                           0.01                 10       ___
one                           0.01                 20       ___
two                           0.01                 >> 120   ___

Quantiles of the t-distribution:
df (= n-1)   0.95    0.975    0.99     0.995
1            6.314   12.706   31.821   63.657
2            2.920   4.303    6.965    9.925
3            2.353   3.182    4.541    5.841
4            2.132   2.776    3.747    4.604
5            2.015   2.571    3.365    4.032
6            1.943   2.447    3.143    3.707
7            1.895   2.365    2.998    3.499
8            1.860   2.306    2.896    3.355
9            1.833   2.262    2.821    3.250
10           1.812   2.228    2.764    3.169
11           1.796   2.201    2.718    3.106
12           1.782   2.179    2.681    3.055
13           1.771   2.160    2.650    3.012
14           1.761   2.145    2.624    2.977
15           1.753   2.131    2.602    2.947
16           1.746   2.120    2.583    2.921
17           1.740   2.110    2.567    2.898
18           1.734   2.101    2.552    2.878
19           1.729   2.093    2.539    2.861
20           1.725   2.086    2.528    2.845
21           1.721   2.080    2.518    2.831
22           1.717   2.074    2.508    2.819
23           1.714   2.069    2.500    2.807
24           1.711   2.064    2.492    2.797
25           1.708   2.060    2.485    2.787
26           1.706   2.056    2.479    2.779
27           1.703   2.052    2.473    2.771
28           1.701   2.048    2.467    2.763
29           1.699   2.045    2.462    2.756
30           1.697   2.042    2.457    2.750
120          1.658   1.980    2.358    2.617
∞            1.645   1.960    2.326    2.576
Testing measures of location
The two-sample t-test for unpaired samples:
■ Situation: Compare the means (μ_1, μ_2) of two unpaired samples
■ Assumption: normal distribution of both samples, σ is not known. Here: equal σ is assumed, but there are methods (Welch t-test) for unequal σ
■ Hypothesis: H0: μ_1 = μ_2 versus H1: μ_1 ≠ μ_2
■ Test statistic: T = (X̄_1 - X̄_2) / √( s² · (1/n_1 + 1/n_2) ) ~ t(n_1 + n_2 - 2) under H0,
with the pooled variance s² = ( (n_1 - 1)·S_1² + (n_2 - 1)·S_2² ) / (n_1 + n_2 - 2)
■ Test decision for a two-sided test: |T| > t_{n1+n2-2, 1-α/2}: Reject H0
■ Test decision for a one-sided test: |T| > t_{n1+n2-2, 1-α}: Reject H0
Testing measures of location
Example: A biotech company claims that their new biomarker XY can distinguish diseased from non-diseased. A pilot study on 10 diseased and 10 healthy persons gives the following results:

Lab parameter XY   Diseased   Healthy
                   8.70       3.36
                   11.28      18.35
                   13.24      5.19
                   8.37       8.35
                   12.16      13.1
                   11.04      15.65
                   10.47      4.29
                   11.16      11.36
                   4.28       9.09
                   19.54      (missing)
X̄                  11.024     9.86
S²                 15.227     27.038
Pooled variance: s² = (9·15.227 + 8·27.038) / 17 = 20.78512

T = (11.024 - 9.86) / √(20.78512 · (1/10 + 1/9)) = 0.556 ~ t(17) under H0

Critical value of a t(17)-distribution at α = 5% (two-sided test) = 2.11
0.556 < 2.11 → XY does not differ significantly between diseased and non-diseased (two-sided p ≈ 0.59; the one-sided p is 0.29)
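The hand calculation can be cross-checked with SciPy (pooled-variance Student t-test, as assumed on the slide):

```python
from scipy import stats

diseased = [8.70, 11.28, 13.24, 8.37, 12.16, 11.04, 10.47, 11.16, 4.28, 19.54]
healthy = [3.36, 18.35, 5.19, 8.35, 13.1, 15.65, 4.29, 11.36, 9.09]  # one value missing

# Student two-sample t-test with pooled variance, two-sided
t, p = stats.ttest_ind(diseased, healthy, equal_var=True)
# t ~ 0.556, far below the critical value 2.11 -> do not reject H0
```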
Testing measures of location: Exercise
Comparing the cholesterol levels between physically active (= 1) and inactive (= 0) patients:

Patient ID   Physically active   Cholesterol level
1            0                   195
2            0                   159
3            0                   166
4            0                   244
5            0                   169
6            0                   168
7            0                   222
8            0                   238
9            0                   216
10           0                   180
11           1                   146
12           1                   145
13           1                   147
14           1                   208
15           1                   182
16           1                   145
17           1                   187
18           1                   206
19           1                   218
20           1                   161

Mean(inactive) = ___   Mean(active) = ___
What hypothesis do you want to test? H0: ___ versus H1: ___
The following information is given: S(inactive) = 31.94; S(active) = 29.27; pooled S² = 938.48
Test statistic T = ___
Critical value (two-sided test, α = 5%): ___
Test decision?
Testing measures of location
The two-sample t-test for paired samples:
■ Situation: Compare the means of two paired samples, e.g. compare the means of variables in the same patients before a treatment and after the treatment
■ Assumption: normal distribution of both samples, σ is not known
■ Hypothesis: H0: μ_before = μ_after versus H1: μ_before ≠ μ_after
Calculate d = x_before - x_after for each patient
→ new hypothesis: H0: the mean of the difference is 0 (μ_d = 0) versus H1: the mean of the difference is ≠ 0 (μ_d ≠ 0)
→ same situation as the one-sample t-test
t-tests can also be used approximately for any distribution that is not too skewed.
Testing measures of location
Example: A wannabe health guru claims that he has invented the perfect weight-loss method. A pilot study on 10 obese individuals gives the following results:

ID   kg at baseline   kg after 6 months   Difference
1    108              90                  18
2    97               97                  0
3    88               91                  -3
4    120              111                 9
5    98               94                  4
6    95               91                  4
7    87               82                  5
8    85               77                  8
9    99               103                 -4
10   134              127                 7

X̄    101.1            96.3                4.8
S²   242.767          209.122             41.07 (s = 6.41)
Testing measures of location
Paired t-test = one-sample t-test on the difference:

T = √10 · (4.8 - 0) / 6.41 = 2.368 ~ t(9) under H0
t0.975(9) = 2.262 (two-sided) → H0 can be rejected (p = 0.042)

Since you want to prove that kg(before) > kg(after), a one-sided test is more appropriate (more power):
t0.95(9) = 1.833, p = 0.021 (= 0.5 · two-sided p) → H0 can be rejected

An unpaired t-test would have yielded non-significant results: p = 0.24 (one-sided test)
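The paired example can be cross-checked with SciPy; ttest_rel is the paired test, and the unpaired ttest_ind is run alongside to reproduce the comparison on the slide:

```python
from scipy import stats

before = [108, 97, 88, 120, 98, 95, 87, 85, 99, 134]
after = [90, 97, 91, 111, 94, 91, 82, 77, 103, 127]

t_paired, p_paired = stats.ttest_rel(before, after)      # paired, two-sided
t_unpaired, p_unpaired = stats.ttest_ind(before, after)  # ignoring the pairing
# paired: t ~ 2.368, two-sided p ~ 0.042 (one-sided p ~ 0.021)
# unpaired: one-sided p = p_unpaired / 2 ~ 0.24 -> not significant
```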
Testing measures of location
Analysis of Variance (ANOVA)
■ Situation: Compare the means of k samples (k > 2)
■ Assumption: normal distribution of the population, σ_1 = … = σ_k
■ Hypothesis: H0: μ_1 = μ_2 = … = μ_k versus H1: μ_i ≠ μ_j for at least one pair (i ≠ j): at least two of the means differ

How to construct the test statistic?
Idea from the two-sample t-test: relate the variability (difference) between the groups to the variability within the groups.
Testing measures of location
Analysis of Variance (ANOVA): Variance partitioning
■ There are k groups (j = 1, …, k)
■ X_1j, X_2j, …, X_njj are the observed values of the variable of interest for i = 1, …, n_j patients in the jth group
■ X̄ is the overall mean of the variable; X̄_j are the means within the groups

SST = SSM + SSE

SST = Sum of squares Total: SST = Σ_j Σ_i (X_ij - X̄)²
SSE = Sum of squares Error (or SSR = Sum of squares Residual) ~ variance "within": SSE = Σ_j Σ_i (X_ij - X̄_j)²
SSM = Sum of squares Model ~ variance "between": SSM = Σ_j n_j (X̄_j - X̄)²

[Figure: all observations x_ij in groups 1-3, with the group means X̄_1, X̄_2, X̄_3 and the overall mean X̄]

Mean of squares: MSM = SSM / (k - 1), MSE = SSE / (n - k)
Testing measures of location
■ Test statistic: F = MSM / MSE ~ F(k-1, n-k) under H0
■ Test decision: F > F_{k-1, n-k, 1-α}: Reject H0
■ Since F is always positive, there are no one-sided tests
■ If H0 is rejected, you can tell that there are at least two groups which differ from each other significantly. You can't tell which groups differ!
→ perform pairwise t-tests after the overall F-test (see closed test procedure)

Example: There are 3 different medications (Med1, Med2, Med3), which are intended to increase the HDL-cholesterol levels in patients:
1. Perform ANOVA as an overall test of whether there is a difference between the groups
2. If the F-test was significant, you know that there is a difference
3. Test Med1 against Med2, Med1 against Med3, Med2 against Med3
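A minimal sketch of the overall F-test; the HDL values and group sizes are invented for illustration:

```python
from scipy import stats

# Hypothetical HDL-cholesterol levels under three medications (illustration only)
med1 = [52, 48, 55, 50, 47]
med2 = [61, 58, 64, 60, 59]
med3 = [51, 49, 54, 52, 50]

f, p = stats.f_oneway(med1, med2, med3)  # overall test of H0: mu1 = mu2 = mu3
# if p < alpha, follow up with pairwise t-tests (closed test procedure)
```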
Testing measures of location
All tests so far assumed a normally distributed variable → parametric tests. If the assumption does not hold → nonparametric tests.

Parametric test   Nonparametric counterpart
t-test            Wilcoxon test / Wilcoxon rank-sum test / Mann-Whitney U-test (different names for the same test): tests whether two independent samples come from the same distribution ≈ tests equality of the medians
ANOVA             Kruskal-Wallis test: tests equality of the population medians between groups

Characteristics of nonparametric tests:
■ Robust against outliers and skewed distributions
■ However: parametric tests should be preferred over nonparametric tests, if appropriate, since they have the higher power.
Testing measures of location
Two-sample test on equality of distributions: Wilcoxon test
■ Situation: Compare location measures of two unpaired samples X and Y if the assumptions of a t-test do not hold
■ Assumption: the form of the continuous distributions of the variables X and Y is the same → a test on equality of distributions = a test on equality of the medians
■ Hypothesis: H0: x_med = y_med versus H1: x_med ≠ y_med
■ The test is based on ranks. What are ranks?

Example: Wilcoxon test
Original values: X = {1, 2, 4, 6, 9, 9, 11}, x_med = 6; Y = {1, 3, 4, 5, 6, 7, 8}, y_med = 5
■ Sort both variables into one sequence: 1/1, 2, 3, 4/4, 5, 6/6, 7, 8, 9/9, 11 (ties marked with /)
■ Ranking (tied values get the average rank): 1.5, 1.5, 3, 4, 5.5, 5.5, 7, 8.5, 8.5, 10, 11, 12.5, 12.5, 14
■ A test statistic is calculated from these ranks by the computer!

Efficiency of the Wilcoxon test (Kruskal-Wallis test) compared to the t-test (ANOVA):
■ To achieve the same power as a t-test/ANOVA, a higher sample size is needed!
■ The parametric tests are in general more powerful (if their assumptions are fulfilled).
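The ranking scheme and the rank-based test can be reproduced with SciPy:

```python
from scipy import stats

X = [1, 2, 4, 6, 9, 9, 11]
Y = [1, 3, 4, 5, 6, 7, 8]

# Tied values receive the average of the ranks they occupy
ranks = stats.rankdata(sorted(X + Y))

# Rank-based two-sample test (Mann-Whitney U / Wilcoxon rank-sum)
u, p = stats.mannwhitneyu(X, Y, alternative="two-sided")
```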
The most common statistical tests:
Testing frequencies

Testing frequencies
One-sample test on frequencies: χ² goodness-of-fit test for categorical traits: compare the frequencies of a categorical variable to specified proportions
■ Example: The quality manager of a gummy-bear factory claims that the 5 flavours have the same proportions in each package; you want to test this claim. → proportions under H0: π_1 = π_2 = π_3 = π_4 = π_5 = 1/5
■ Hypothesis: H0: P(X = i) = π_i versus H1: P(X = i) ≠ π_i for at least one i, i = 1, …, k (k = number of categories)
■ Idea: Compare the observed numbers (O_i = h_i) in each category with the expected numbers (E_i = n·π_i) in each category
■ Assumption: n·π_i ≥ 1 for all i and n·π_i ≥ 5 for at least 80% of the categories
■ Test statistic: χ² = Σ_i (O_i - E_i)² / E_i ~ χ²(k-1) under H0
■ Test decision: χ² > χ²_{k-1, 1-α}: Reject H0
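A sketch of the goodness-of-fit test; the flavour counts are hypothetical:

```python
from scipy import stats

# Hypothetical flavour counts in a package of n = 100 gummy bears (k = 5)
observed = [28, 17, 20, 22, 13]
expected = [20, 20, 20, 20, 20]  # E_i = n * pi_i under H0: pi_i = 1/5

chi2, p = stats.chisquare(observed, f_exp=expected)  # df = k - 1 = 4
# chi2 = 6.3 < 9.49 = chi2_{4, 0.95} -> do not reject H0 at the 5% level
```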
Testing frequencies
Excursion: What are degrees of freedom (df)?
Definition: the number of values that can vary freely.
Example: The χ² goodness-of-fit test with k categories has (k - 1) degrees of freedom. Why? There is the constraint that all proportions sum up to 1: Σ_i π_i = 1. π_1, π_2, …, π_{k-1} can be chosen freely; π_k is then fixed.
Example with real data: suppose you have three groups; then with π_1 = 0.20 and π_2 = 0.50, π_3 = 1 - 0.70 = 0.30.
Testing frequencies
Two-sample test on frequencies: χ² test of independence
■ Situation: Compare the frequencies between two or more groups. Or: test whether two categorical variables X (i = 1, …, k) and Y (j = 1, …, m) depend on each other.
→ All such situations can be grouped into contingency tables:

            Y = 1   …   Y = m   Row sum
X = 1       h11     …   h1m     h1.
X = 2       h21     …   h2m     h2.
:           :           :       :
X = k       hk1     …   hkm     hk.
Column sum  h.1     …   h.m     n

■ A possible scenario: Compare the number of smokers, ex-smokers and never-smokers between men and women.
Testing frequencies
Two-sample test on frequencies: χ² test of independence
■ Hypothesis: H0: X and Y are independent of each other; H1: X and Y are dependent on each other (are associated)
■ Assumption: expected frequencies ≥ 1 for all cells and expected frequencies ≥ 5 for at least 80% of the cells → none of the cells should have a very rare expectation; if the assumption is not fulfilled → use Fisher's exact test
■ Idea to construct the test statistic: Compare the observed numbers in each cell with the numbers expected if H0, and therefore independence of the two factor variables, is assumed.
Testing frequencies
Table of observed numbers:

            Y = 1   …   Y = m   Row sum
X = 1       h11     …   h1m     h1.
X = 2       h21     …   h2m     h2.
:           :           :       :
X = k       hk1     …   hkm     hk.
Column sum  h.1     …   h.m     n

h1. … hk. and h.1 … h.m are the marginal sums.

Example (smoking status by gender):
Gender         Current Smoker   Ex-Smoker   Never Smoker   Row Total
Men            144              310         268            722
Women          117              143         475            735
Column Total   261              453         743            1457
Testing frequencies
Table of expected numbers: expected number in each cell = (row sum · column sum) / total sum

            Y = 1      …   Y = m      Row sum
X = 1       h1.h.1/n   …   h1.h.m/n   h1.
X = 2       h2.h.1/n   …   h2.h.m/n   h2.
:           :              :          :
X = k       hk.h.1/n   …   hk.h.m/n   hk.
Column sum  h.1        …   h.m        n

Smoking example: expected number in the upper left cell: 722 · 261 / 1457 = 129.336
Testing frequencies
With O_ij the observed and E_ij = h_i. · h_.j / n the expected number in cell (i, j):

Test statistic: χ² = Σ_i Σ_j (O_ij - E_ij)² / E_ij ~ χ²((k-1)·(m-1)) under H0
Test decision: χ² > χ²_{(k-1)(m-1), 1-α}: Reject H0
Testing frequencies
Example (smoking status by gender):

Observed:
Gender         Current Smoker   Ex-Smoker   Never Smoker   Row Total
Men            144              310         268            722
Women          117              143         475            735
Column Total   261              453         743            1457

Expected:
Gender         Current Smoker   Ex-Smoker   Never Smoker   Row Total
Men            129.336          224.479     368.185        722
Women          131.664          228.521     374.815        735
Column Total   261              453         743            1457

χ² = (144 - 129.336)²/129.336 + (310 - 224.479)²/224.479 + (268 - 368.185)²/368.185
   + (117 - 131.664)²/131.664 + (143 - 228.521)²/228.521 + (475 - 374.815)²/374.815
   = 121.9218

Critical value: χ²_{(2-1)(3-1), 0.95} = χ²_{2, 0.95} = 5.99
121.9218 >> 5.99 → the test is significant (p = 3.3e-27) → the null hypothesis that gender and smoking status are independent can be rejected. The test itself does not tell you, however, whether men smoke more than women etc.
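The same result can be obtained with SciPy's chi2_contingency:

```python
from scipy import stats

observed = [[144, 310, 268],
            [117, 143, 475]]

# Pearson chi-square test of independence; correction=False matches the
# hand calculation (no Yates continuity correction)
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
# chi2 ~ 121.92, dof = 2, expected[0][0] ~ 129.336
```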
Testing frequencies
Two-sample test on frequencies if the expected numbers in the cells are rare: Fisher's exact test
■ Situation: the assumptions of a χ² test do not hold (number of expected values in each cell ≥ 1 and number of expected values ≥ 5 in 80% of the cells)
■ Idea of the test: enumerate all tables that are possible with the observed marginal sums. Then count all the tables that are "more extreme" relative to the null hypothesis than the observed table: p = number of more extreme tables / number of all tables → computer-intensive and not really solvable "by hand"
■ Test decision: p < α: Reject H0
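A sketch with SciPy, which implements Fisher's exact test for 2×2 tables; the counts are hypothetical:

```python
from scipy import stats

# Hypothetical sparse 2x2 table (e.g. treatment x outcome) with small counts
table = [[1, 9],
         [8, 2]]

odds_ratio, p = stats.fisher_exact(table)  # SciPy handles 2x2 tables only
# reject H0 of independence if p < alpha
```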
Testing frequencies: Exercise
In an experiment with mice, you want to test the development of insulin resistance under three different diets (D1, D2, D3). After following up the mice for several months, you observe the following:

                        D1   D2   D3   Σ
insulin resistant       6    1    3    10
not insulin resistant   4    9    7    20
Σ                       10   10   10   30

Calculate the expected numbers in each cell under the assumption of independence between diet and insulin resistance:

                        D1   D2   D3   Σ
insulin resistant       __   __   __   10
not insulin resistant   __   __   __   20
Σ                       10   10   10   30

Which diet results in more insulin-resistant mice than expected, which in fewer? Which statistical test should be performed?
The multiple testing problem
The multiple testing problem
The situation:
■ Consider a dataset with 100 independent parameters which do not play a role in the etiology of the disease of interest (which you don't know, of course)
■ 100 statistical tests are performed at a significance level of α = 0.05
■ The tests are constructed such that at most 5 of the 100 tests reject the null hypothesis although it is true
→ You expect 5 tests to be significant just by chance!!!
The multiple testing problem
■ The probability of getting at least one Type I error increases with an increasing number of tests.
■ Family-wise error rate (the error rate for the complete family of k tests performed): α* = 1 - (1 - α)^k, with α being the comparison-wise error rate. α* is the probability of getting one or more false discoveries (Type I errors).

k     α* (α = 0.05)
1     0.05
5     0.226
10    0.401
100   0.994

→ The significance level has to be modified for multiple testing situations.
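The α* column can be reproduced directly from the formula:

```python
# Family-wise error rate alpha* = 1 - (1 - alpha)^k for k independent tests
alpha = 0.05
fwer = {k: 1 - (1 - alpha) ** k for k in (1, 5, 10, 100)}
# fwer[5] ~ 0.226, fwer[10] ~ 0.401, fwer[100] ~ 0.994
```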
The multiple testing problem
The Bonferroni correction method:
■ Control the comparison-wise error rate: reject H0 if p < α
■ Control the family-wise error rate (including k tests): reject H0 if p < α/k. This is equivalent to p_Bonferroni = p·k < α
■ Advantage: simple
■ Problem: the Bonferroni correction increases the probability of a Type II error → the power of detecting a true association is reduced. Disadvantage: too conservative
■ Can be used if tests are dependent, but is just too conservative for this case

k     α/k (α = 0.05)
1     0.05
5     0.05/5 = 0.01
10    0.005
100   0.0005
The multiple testing problem
Some modifications of the Bonferroni method: the Bonferroni-Holm method
■ You have a list of k p-values → sort it so that the minimal p-value (p(1)) comes first: p(1), p(2), p(3), p(4), …, p(k)
■ If p(1) ≥ α/k → stop and accept all null hypotheses. If p(1) < α/k → reject H0(1) and continue with the reduced list: p(2), p(3), p(4), …, p(k)
■ If p(2) ≥ α/(k-1) → stop and accept all remaining null hypotheses. If p(2) < α/(k-1) → reject H0(2) and continue with the reduced list: p(3), p(4), …, p(k)
■ Continue until the hypothesis with the smallest remaining p-value cannot be rejected → stop and accept all hypotheses that have not been rejected before
■ Less conservative than Bonferroni
■ Can also be used if tests are dependent
■ There are multiple other methods, e.g. the Hochberg method etc.
In R: function p.adjust()
The multiple testing problem
Example: Assume you have this vector of p-values (10 independent tests at α = 0.05); unadjusted, all except 0.1 would be "significant" if multiple testing is not accounted for:

Unadjusted p:      0.0001, 0.001, 0.002, 0.005, 0.007, 0.01, 0.012, 0.02, 0.04, 0.1
With Bonferroni:   significant: 0.0001, 0.001, 0.002
With B.-Holm:      significant: 0.0001, 0.001, 0.002, 0.005, 0.007

E.g. Bonferroni: 0.05/10 = 0.005 → all p-values smaller than 0.005 are significant.
E.g. B.-Holm: first p-value (0.0001): 0.0001 < (0.05/10 = 0.005), continue! Next p-value (0.001): 0.001 < (0.05/9 ≈ 0.0056), continue! … until: p-value (0.007): 0.007 < (0.05/6 ≈ 0.0083), then stop, because the next p-value (0.01) satisfies 0.01 ≥ (0.05/5 = 0.01).
All these methods are still too conservative if the tests are not independent (e.g. highly correlated parameters)!
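The two correction rules can be written out in a few lines of plain Python (holm here returns the rejections in sorted p-value order, which is enough for counting):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 for every p-value below alpha / k."""
    k = len(pvals)
    return [p < alpha / k for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down Holm procedure on the sorted p-values."""
    k = len(pvals)
    reject = [False] * k
    for i, p in enumerate(sorted(pvals)):
        if p < alpha / (k - i):
            reject[i] = True
        else:
            break  # stop: this and all larger p-values are accepted
    return reject

pvals = [0.0001, 0.001, 0.002, 0.005, 0.007, 0.01, 0.012, 0.02, 0.04, 0.1]
n_bonf = sum(bonferroni(pvals))  # 3 rejections
n_holm = sum(holm(pvals))        # 5 rejections
```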
Multiple testing: Exercise
You conduct an experiment with knock-out mice and compare the mean values of 5 different parameters between wildtype (WT) and knockout (KO) mice. In the following list, the p-values for all t-tests are given.

Parameter   Uncorrected p-value   Significant at α = 5%?   Significant after Bonferroni correction?   Significant after Bonferroni-Holm correction?
1           0.04
2           0.001
3           0.012
4           0.015
5           0.1

Significance level after Bonferroni correction? α = ___
Conclusion:
The multiple testing problem
The closed testing principle
■ Applies in cases where a set of hypotheses can be tested simultaneously ("overall" or "omnibus" tests)
■ Suppose there are 3 hypotheses H1, H2, H3 to be tested: then H1 can be rejected at level α if all intersections including H1 can be rejected, i.e. H1 ∩ H2 ∩ H3, H1 ∩ H2, H1 ∩ H3 and H1 can all be rejected at level α
■ Example: There are 4 different medications (Med1, Med2, Med3, Med4) which are intended to lower the HDL-cholesterol levels in patients:
1. Perform ANOVA as an overall test of whether there is a difference between the groups
2. If the F-test was significant, perform an ANOVA on all possible intersections
3. If there was any significant F-test, perform pairwise t-tests for all medications included in this significant F-test, etc.
■ Attention: with more than 3 groups it is not sufficient to run an ANOVA first and then pairwise t-tests if the ANOVA was significant → also test all intersections in between
The multiple testing problem
■ Example: Testing the following hypotheses (each individual test at significance level α):
1. Step, ANOVA: μ1 = μ2 = μ3 = μ4
2. Step, if significant, ANOVA for all intersections: μ1 = μ2 = μ3; μ1 = μ2 = μ4; μ1 = μ3 = μ4; μ2 = μ3 = μ4
3. Step, if significant, pairwise t-tests: μ1 = μ2; μ1 = μ3; μ1 = μ4; μ2 = μ3; μ2 = μ4; μ3 = μ4
→ Significant at family-wise error rate α* = α
But: very complicated to perform if there are more than 3 groups → see e.g. Tukey's test (in R)