CM229S: Machine Learning for Bioinformatics
Lecture 2 - 03/31/2016
Introductory statistics
Lecturer: Sriram Sankararaman
Scribe: Sriram Sankararaman
We will provide an overview of statistical inference, focusing on the key concepts. For more details, as well as more precise statements of the mathematical results (e.g., "regularity conditions"), consult a standard statistics text (Casella and Berger).
1 Statistical inference
$(X_1, \ldots, X_n)$ is the data drawn from a distribution $P$. $\theta$ is an unknown population parameter, i.e., a function of $P$. The goal of statistical inference is to learn $\theta$ from $(X_1, \ldots, X_n)$. There are three broad and related tasks in statistical inference:

1. Estimation
2. Hypothesis testing
3. Prediction

For each of these tasks, we have procedures for finding an estimator, test, or predictor, and procedures for evaluating them. We will focus on the first two tasks.
2 Point estimation
A statistic is a function of the data. The goal of point estimation is to find a statistic $t$ such that $\hat{\theta}_n = t(X_1, \ldots, X_n)$ is "close" to $\theta$; $\hat{\theta}_n$ is an estimator of $\theta$. Some procedures for finding point estimators:

1. Maximum likelihood
2. Method of moments
3. Bayesian methods
2.1 Maximum Likelihood Estimation

Definition 2.1. Given $X_1, \ldots, X_n \overset{iid}{\sim} P(x|\theta)$, the likelihood function is
$$L(\theta|x) = \prod_{i=1}^n P(x_i|\theta).$$

Definition 2.2. A maximum likelihood estimator (MLE) is
$$\hat{\theta}(x) \triangleq \operatorname{argmax}_\theta L(\theta|x) = \operatorname{argmax}_\theta LL(\theta|x),$$
where the log likelihood is
$$LL(\theta|x) \triangleq \log L(\theta|x) = \sum_{i=1}^n \log P(x_i|\theta).$$

For a differentiable log likelihood function, the possible candidates for the MLE are the values of $\theta$ that solve
$$\frac{d}{d\theta} LL(\theta) = 0.$$
Why only possible candidates?

• This is not a sufficient condition for a maximum (unless the log likelihood is concave).
• The maximum could also be on the boundary of the parameter space.

Example 1 (Normal likelihood).
$X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$
$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 = \frac{n-1}{n} S^2$$
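The closed-form MLEs above (the sample mean, and the variance with a $1/n$ factor) are easy to check on simulated data. A minimal Python sketch, with illustrative parameter values:

```python
# Sketch: normal-model MLEs on simulated data (parameter values are illustrative).
import random

random.seed(0)
mu_true, sigma_true, n = 2.0, 3.0, 100_000
xs = [random.gauss(mu_true, sigma_true) for _ in range(n)]

mu_hat = sum(xs) / n                                  # MLE of the mean: the sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n   # MLE of the variance (divides by n)
S2 = sum((x - mu_hat) ** 2 for x in xs) / (n - 1)     # sample variance (divides by n - 1)

# sigma2_hat = (n-1)/n * S2, as in the example
assert abs(sigma2_hat - (n - 1) / n * S2) < 1e-9
```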
We would like to evaluate point estimators. Generally, we can evaluate an estimator in a finite sample (asking how it behaves for a fixed sample size) or in an asymptotic setting (asking how it behaves as the sample size becomes large).
2.2 Finite sample
Definition 2.3. Bias of an estimator:
$$\text{bias}(\hat{\theta}_n) = E\left[\hat{\theta}_n\right] - \theta$$
An estimator is unbiased if its bias is 0.

Definition 2.4. Variance of an estimator:
$$\text{var}(\hat{\theta}_n) = E\left[\left(\hat{\theta}_n - E[\hat{\theta}_n]\right)^2\right]$$

Definition 2.5. Mean squared error (MSE) of an estimator:
$$\text{mse}(\hat{\theta}_n) = E\left[\left(\hat{\theta}_n - \theta\right)^2\right]$$

The MSE decomposes as
$$\text{mse}(\hat{\theta}_n) = \text{bias}(\hat{\theta}_n)^2 + \text{var}(\hat{\theta}_n).$$
These definitions assess the performance of the estimator over repeated experiments, i.e., over new data that we do not see in practice. Comparing estimators is not straightforward, though. Besides depending on the evaluation criterion, the performance of an estimator can depend on the true parameter.

Example 1 (Normal likelihood (continued)).
1. $X_1$ is an unbiased estimator of $\mu$.
2. $\hat{\mu} = \bar{X}$ is also an unbiased estimator of $\mu$.
3. $\text{Var}[X_1] = \sigma^2$ while $\text{Var}[\hat{\mu}] = \frac{\sigma^2}{n}$.
4. $\hat{\sigma}^2$ is not an unbiased estimator of $\sigma^2$. Its bias is $-\frac{\sigma^2}{n}$.
5. $S^2$ is an unbiased estimator of $\sigma^2$.
6. On the other hand, $\text{mse}(\hat{\sigma}^2) < \text{mse}(S^2)$.
7. $S$ is not an unbiased estimator of $\sigma$.
8. $\bar{X}$ and $S^2$ are unbiased estimators of the population mean and variance for any i.i.d. sample from a population with finite mean and variance.
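The bias and MSE comparisons between $\hat{\sigma}^2$ and $S^2$ can be seen directly in a repeated-sampling simulation. A Python sketch, with illustrative parameter values:

```python
# Sketch: estimate bias and MSE of the two variance estimators over repeated
# experiments (parameter values are illustrative).
import random

random.seed(1)
sigma2, n, reps = 4.0, 10, 20000

sigma2_hats, S2s = [], []
for _ in range(reps):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    sigma2_hats.append(ss / n)        # MLE: biased downward by sigma^2/n
    S2s.append(ss / (n - 1))          # unbiased

def bias(est): return sum(est) / reps - sigma2
def mse(est): return sum((e - sigma2) ** 2 for e in est) / reps

# bias(sigma2_hat) should be near -sigma2/n = -0.4; bias(S2) near 0;
# yet mse(sigma2_hat) < mse(S2).
print(bias(sigma2_hats), bias(S2s), mse(sigma2_hats), mse(S2s))
```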
2.3 Asymptotics
It is often challenging to compute finite-sample properties. Alternatively, we can ask how the estimator behaves as the sample size goes to infinity. To do so, we need a notion of the limit of a sequence of random variables, much like the limit of a sequence of numbers.

Definition 2.6 (Convergence in probability). A sequence of random variables $X_1, \ldots, X_n$ converges in probability to a random variable $X$, denoted $X_n \overset{p}{\to} X$, if for every $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr(|X_n - X| > \epsilon) = 0.$$

Definition 2.7 (Convergence in distribution). A sequence of random variables $X_1, \ldots$ converges in distribution to a random variable $X$, denoted $X_n \overset{d}{\to} X$, if
$$\lim_{n \to \infty} F_{X_n}(x) = F(x)$$
for all $x$ at which $F(x)$ is continuous.

Theorem 1 (Weak Law of Large Numbers). Let $X_1, \ldots$ be iid random variables with $E[X_i] = \mu$, $\text{Var}[X_i] = \sigma^2 < \infty$. Then $\bar{X}_n \overset{p}{\to} \mu$.

Theorem 2 (Central Limit Theorem). Let $X_1, \ldots$ be iid random variables with $E[X_i] = \mu$, $\text{Var}[X_i] = \sigma^2 < \infty$. Then
$$\sqrt{n} \, \frac{\bar{X}_n - \mu}{\sigma} \overset{d}{\to} N(0, 1).$$

With these concepts, we can now talk about asymptotic properties of point estimators.

Definition 2.8 (Consistency). A sequence of estimators $\hat{\theta}_n$ is a consistent sequence of estimators of a parameter $\theta$ if $\hat{\theta}_n \overset{p}{\to} \theta$.

Example 1 (Normal likelihood (continued)). $\hat{\mu}_n = \bar{x}_n$ is a consistent estimator of $\mu$. $\hat{\sigma}^2_n$ is a consistent estimator of $\sigma^2$. $S_n^2$ is a consistent estimator of $\sigma^2$. $S_n$ is a consistent estimator of $\sigma$.

Theorem 3 (Consistency of the MLE). Let $X_1, \ldots, X_n \overset{iid}{\sim} f(x|\theta)$ and let $\hat{\theta}_n$ denote the MLE of $\theta$. Then
$$\hat{\theta}_n \overset{p}{\to} \theta.$$
If the bias of $\hat{\theta}_n \to 0$ and its variance $\to 0$, i.e., its MSE goes to 0, then it is consistent.
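Consistency of the sample mean is easy to see empirically: its typical error shrinks as the sample size grows. A Python sketch, with illustrative parameter values:

```python
# Sketch: mean absolute error of the sample mean shrinks with n (WLLN /
# consistency of mu_hat); parameter values are illustrative.
import random

random.seed(2)
mu, sigma = 1.5, 2.0

def xbar(n):
    return sum(random.gauss(mu, sigma) for _ in range(n)) / n

def mean_abs_err(n, reps=100):
    return sum(abs(xbar(n) - mu) for _ in range(reps)) / reps

errs = {n: mean_abs_err(n) for n in (10, 100, 10_000)}
print(errs)  # errors shrink roughly like sigma/sqrt(n)
```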
Theorem 4 (Asymptotic Normality of the MLE).
$$\sqrt{n} (\hat{\theta}_n - \theta) \overset{d}{\to} N(0, I^{-1}(\theta))$$
where
$$I(\theta) = -E\left[\frac{\partial^2 \log P(X|\theta)}{\partial \theta^2}\right]$$
is the Fisher information associated with the model.
Example 1 (Normal likelihood (continued)). Let us focus on the mean. For a single observation $X \sim N(\mu, \sigma^2)$:
$$\frac{\partial \log P(X|\theta)}{\partial \mu} = \frac{x - \mu}{\sigma^2}$$
$$\frac{\partial^2 \log P(X|\theta)}{\partial \mu^2} = -\frac{1}{\sigma^2}$$
$$I(\mu) = \frac{1}{\sigma^2}$$
This makes intuitive sense. The inverse of the Fisher information is the asymptotic variance: the smaller the variance, the larger the Fisher information, and the more precise the estimates of the mean.
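Theorem 4 predicts that $\text{Var}[\hat{\mu}]$ is approximately $I^{-1}(\mu)/n = \sigma^2/n$, which can be checked by simulation. A Python sketch, with illustrative parameter values:

```python
# Sketch: the sampling variance of mu_hat matches the inverse Fisher
# information 1/(n I(mu)) = sigma^2/n (parameter values are illustrative).
import random

random.seed(3)
mu, sigma, n, reps = 0.0, 2.0, 50, 20000

mu_hats = [sum(random.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]
mean = sum(mu_hats) / reps
var_mu_hat = sum((m - mean) ** 2 for m in mu_hats) / reps

print(var_mu_hat, sigma ** 2 / n)  # both close to 0.08
```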
3 Hypothesis testing
A hypothesis is a constraint on the parameter $\theta$. Examples:

1. An offspring has an equal chance of inheriting either of two alleles from a parent (Mendel's first law holds).
2. Two genes segregate independently (Mendel's second law holds).
3. A genetic mutation has no effect on disease.
4. A drug has no effect on blood pressure.

Definition 3.1. Null hypothesis: $H_0: \theta \in \Theta_0$. Alternate hypothesis: $H_1: \theta \in \Theta_1$.

Definition 3.2 (Hypothesis test). A rule that specifies the values of the sample for which $H_0$ should be accepted or rejected. Typically, a hypothesis test is specified in terms of a test statistic $t(X_1, \ldots, X_n)$ and a rejection/acceptance region.
3.1 Finding tests
One commonly used procedure for finding tests is the likelihood ratio test. Other procedures include the Wald test and the score test.

Definition 3.3 (LRT statistic).
$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta|x)}{\sup_{\theta \in \Theta} L(\theta|x)} = \frac{L(\hat{\theta}_0|x)}{L(\hat{\theta}|x)}, \qquad \Theta = \Theta_0 \cup \Theta_1.$$

Definition 3.4 (Likelihood Ratio Test (LRT)). A likelihood ratio test (LRT) is a test that rejects $H_0$ if $\lambda(x) < c$.
Example 2 (Test of the mean of a normal model). Given $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)$, test $H_0: \mu = 0$ versus $H_1: \mu \neq 0$.
$$\lambda(x) = \frac{\exp\left(-\sum_{i=1}^n x_i^2 / 2\right)}{\exp\left(-\sum_{i=1}^n (x_i - \hat{\mu})^2 / 2\right)} = \exp\left(-\frac{n \bar{x}^2}{2}\right)$$
Rejecting $H_0$ if $\lambda(x) < c$ is equivalent to rejecting $H_0$ if $|\bar{x}| > \sqrt{\frac{2 \log(1/c)}{n}}$.

How do we choose $c$?

3.2 Evaluating tests

                    Decision
Truth         H0         H1
H0            TN         FP
H1            FN         TP

FP: false positive (type-I error). FN: false negative (type-II error). TP: true positive. TN: true negative.

For $\theta \in \Theta_0$, the probability of a false positive for a test is $P_\theta(\text{Test rejects})$. For $\theta \in \Theta_1$, the probability of a false negative is $P_\theta(\text{Test accepts})$. To evaluate a test, we need to trade off false positives and false negatives: among tests whose false positive probability is below some threshold, what is the false negative probability?

Definition 3.5 (Level of a test). A test has level $\alpha$ if $\sup_{\theta \in \Theta_0} P_\theta(\text{Test rejects}) \leq \alpha$. That is, $\alpha$ is an upper bound on the false positive probability.

Remark. Often we set $\alpha = 0.05$.

For the LRT, we choose $c$ to control $\alpha$: $\sup_{\theta \in \Theta_0} P_\theta(\lambda(X) \leq c) = \alpha$. A trivial test is to never reject. Instead, we would like to maximize the probability of rejecting the null when it is false (the power) while controlling the false positive probability.

Example 2 (Normal mean (continued)). For the test of the normal mean, $\Theta_0 = \{0\}$. For $\theta = 0$, the test statistic $\bar{X} \sim N(0, \frac{1}{n})$. To achieve the desired significance level:
$$P_0(\lambda(X) \leq c) = P_0\left(|\bar{X}| \geq \sqrt{\frac{2 \log(1/c)}{n}}\right) = P\left(|Z| \geq \sqrt{2 \log(1/c)}\right)$$
Here $Z \sim N(0, 1)$ and $P(|Z| \geq z_{1-\alpha/2}) = \alpha$, where $z_\alpha$ is the $\alpha$-quantile of the standard normal distribution. We therefore set $\sqrt{2 \log(1/c)} = z_{1-\alpha/2}$.
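Solving $\sqrt{2 \log(1/c)} = z_{1-\alpha/2}$ for $c$ gives the LRT cutoff, and the corresponding rejection threshold on $|\bar{x}|$ works out to $z_{1-\alpha/2}/\sqrt{n}$. A Python sketch (the sample size is illustrative):

```python
# Sketch: choosing the LRT cutoff c at level alpha for the normal-mean test.
import math
from statistics import NormalDist

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1-alpha/2}, about 1.96
c = math.exp(-z ** 2 / 2)                 # from sqrt(2 log(1/c)) = z

n = 25
threshold = math.sqrt(2 * math.log(1 / c) / n)  # reject H0 if |xbar| > threshold
print(z, c, threshold)                          # threshold equals z / sqrt(n)
```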
The significance level is a crude measure. If we reject the null hypothesis at level $\alpha = 0.05$, it does not tell us how strong the evidence against $H_0$ is. The p-value of a test is the smallest $\alpha$ at which the test rejects $H_0$; thus, a small p-value implies stronger evidence against $H_0$.

Definition 3.6 (p-value). For a statistic $t(X)$ such that large values of $t$ give evidence against $H_0$, the p-value is
$$p(x) = \sup_{\theta \in \Theta_0} P_\theta(t(X) \geq t(x)).$$
A p-value is valid if, for every $0 \leq \alpha \leq 1$,
$$\sup_{\theta \in \Theta_0} P_\theta(p(X) \leq \alpha) \leq \alpha.$$
If $p(X)$ is a valid p-value, a test that rejects $H_0$ when $p(X) \leq \alpha$ is a level-$\alpha$ test.

Example 2 (p-value for the test of the normal mean). The LRT rejects $H_0: \mu = 0$ if $|t(X)| = \sqrt{n}|\bar{X}|$ is large. If $\mu = 0$, then $t(X) \sim N(0, 1)$. So if we observe data $x$, the p-value for the LRT is
$$p(x) = P(|t(X)| \geq t(x)) = P(|Z| \geq t(x)) = \Phi(-t(x)) + 1 - \Phi(t(x)) = 2(1 - \Phi(t(x))).$$
We can compute the p-value for different values of the test statistic $t$ (Table 1).

t    p-value
1    0.31
2    0.04
3    2.69e-03
4    6.33e-05
5    5.73e-07

Table 1: p-value for the test of the normal mean as a function of the test statistic $t = \sqrt{n}|\bar{X}|$.
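The entries of Table 1 follow from the formula $p = 2(1 - \Phi(t))$; the standard normal CDF is available in the Python standard library. A minimal sketch:

```python
# Sketch: computing two-sided normal p-values, p = 2 * (1 - Phi(t)).
from statistics import NormalDist

Phi = NormalDist().cdf
for t in (1, 2, 3, 4, 5):
    p = 2 * (1 - Phi(t))
    print(t, p)
```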
To use the LRT, we need to choose $c$ so that the false positive probability is bounded. To compute the false positive probability, we need to know how the statistic is distributed under the null (the sampling distribution of the statistic). This is easy for the test of the normal mean but difficult more generally. An alternative approach to evaluating the LRT (or equivalently, to choosing $c$) relies on asymptotics.

3.2.1 Asymptotics of the LRT
Theorem 5 (Wilks' theorem). Let $X_1, \ldots, X_n \overset{iid}{\sim} p(x|\theta)$. For testing $\theta = \theta_0$ versus $\theta \neq \theta_0$,
$$-2 \log \lambda(X^n) \overset{d}{\to} \chi^2_1.$$
For testing $\theta \in \Theta_0$ versus $\theta \notin \Theta_0$, if $\nu$ is the difference in the number of free parameters between $\theta \in \Theta$ and $\theta \in \Theta_0$, then
$$-2 \log \lambda(X^n) \overset{d}{\to} \chi^2_\nu.$$
Rejecting $H_0$ iff $-2 \log \lambda(X) \geq \chi^2_{\nu, 1-\alpha}$ gives us an asymptotic (hence approximate) level-$\alpha$ test.

Remark. One of the conditions for Wilks' theorem to hold is that the true parameter lies in the interior of the parameter space. This condition can be violated in practice, e.g., when the null hypothesis restricts the parameter to the boundary of the parameter space.
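For the normal-mean test of Example 2, $-2 \log \lambda(x) = n \bar{x}^2$, so the Wilks recipe rejects when $n \bar{x}^2$ exceeds the $0.95$ quantile of $\chi^2_1$ (about 3.841), which should fire roughly 5% of the time under $H_0$. A Python sketch of that check, with illustrative parameter values:

```python
# Sketch: Wilks' theorem on the normal-mean test; under H0,
# -2 log lambda = n * xbar^2 is chi^2_1, so rejecting when it exceeds
# 3.841 (the 0.95 quantile of chi^2_1) should happen about 5% of the time.
import random

random.seed(4)
n, reps, chi2_1_q95 = 100, 20000, 3.841

rejections = 0
for _ in range(reps):
    xbar = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n  # data under H0
    if n * xbar ** 2 >= chi2_1_q95:
        rejections += 1

rate = rejections / reps
print(rate)  # close to alpha = 0.05
```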
3.2.2 Permutation tests
In many instances, even asymptotic results are difficult to obtain. It is also not clear whether the sample size is large enough for the asymptotics to hold.

Example 3 (Permutation test). Test whether the distributions of two samples are the same:
$$X_1, \ldots, X_n \sim F$$
$$Y_1, \ldots, Y_n \sim G$$
$$H_0: F = G = P_0$$
We choose the statistic $t((X_1, \ldots, X_n), (Y_1, \ldots, Y_n)) = \bar{X} - \bar{Y}$. The key idea is that under $H_0$, all permutations of the data are equally likely. We use this idea to compute the sampling distribution:

1. Randomly choose $n$ out of the $2n$ elements to belong to group 1. The remaining belong to group 2.
2. Compute the difference in means between the two groups.
3. How often does the difference in means on the permuted data exceed the observed difference?

In many cases, the number of permutations is too large to enumerate exhaustively. We then use a random sample of permutations (a Monte Carlo approximation).
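The three steps above can be sketched directly in Python; the data and helper names here are illustrative, and the two-sided p-value uses an add-one correction to keep it valid:

```python
# Sketch: Monte-Carlo permutation test for equality of two distributions,
# using the difference in means as the statistic (data are illustrative).
import random

random.seed(5)
n = 100
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [random.gauss(1.0, 1.0) for _ in range(n)]  # shifted, so H0 is false

def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

observed = abs(mean_diff(xs, ys))
pooled = xs + ys
B = 2000  # number of random permutations

exceed = 0
for _ in range(B):
    random.shuffle(pooled)  # random relabeling of the 2n points
    if abs(mean_diff(pooled[:n], pooled[n:])) >= observed:
        exceed += 1

p_value = (exceed + 1) / (B + 1)  # add-one correction keeps the p-value valid
print(p_value)
```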
4 Interval estimation
We would like a measure of confidence in our estimates.

Definition 4.1. For a scalar parameter $\theta$, given data $X$, an interval estimator is a pair of functions $L(x_1, \ldots, x_n)$ and $U(x_1, \ldots, x_n)$, $L \leq U$, such that $[L(X), U(X)]$ covers $\theta$.

Remark. The interval is random while $\theta$ is fixed.
4.1 Evaluating interval estimators
Definition 4.2 (Coverage probability). The coverage probability of an interval estimator $[L(X), U(X)]$ is the probability that the interval covers the true parameter $\theta$, denoted $P_\theta(\theta \in [L(X), U(X)])$.

Definition 4.3 ($(1 - \alpha)$ confidence interval). An interval estimator is termed a $(1 - \alpha)$ confidence interval if $\inf_\theta P_\theta(\theta \in [L(X), U(X)]) \geq 1 - \alpha$.
4.2 Finding interval estimators
A common method for constructing confidence intervals is to invert a test statistic.

Example 4. Given $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)$, find a $(1 - \alpha)$ confidence interval for $\mu$. Set $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$. One test rejects $H_0$ on the set
$$\left\{ x : |\bar{x} - \mu_0| > \frac{z_{1-\alpha/2}}{\sqrt{n}} \right\}.$$
So $H_0$ is accepted for $\left\{ x : |\bar{x} - \mu_0| \leq \frac{z_{1-\alpha/2}}{\sqrt{n}} \right\}$, or equivalently
$$\bar{x} - z_{1-\alpha/2} \frac{1}{\sqrt{n}} \leq \mu_0 \leq \bar{x} + z_{1-\alpha/2} \frac{1}{\sqrt{n}}.$$
Since the test has size $\alpha$, $P(H_0 \text{ is accepted} \mid \mu = \mu_0) = 1 - \alpha$:
$$P\left(\bar{x} - z_{1-\alpha/2} \frac{1}{\sqrt{n}} \leq \mu_0 \leq \bar{x} + z_{1-\alpha/2} \frac{1}{\sqrt{n}} \,\Big|\, \mu = \mu_0\right) = 1 - \alpha.$$
Since this is true for all $\mu_0$, we have
$$P_\mu\left(\bar{x} - z_{1-\alpha/2} \frac{1}{\sqrt{n}} \leq \mu \leq \bar{x} + z_{1-\alpha/2} \frac{1}{\sqrt{n}}\right) = 1 - \alpha.$$
So $\left[\bar{x} - z_{1-\alpha/2} \frac{1}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2} \frac{1}{\sqrt{n}}\right]$ is a $(1 - \alpha)$ confidence interval.
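The coverage of this interval can be verified empirically: across repeated experiments, the random interval $\bar{x} \pm z_{1-\alpha/2}/\sqrt{n}$ should contain the fixed $\mu$ about $1 - \alpha$ of the time. A Python sketch, with illustrative parameter values:

```python
# Sketch: empirical coverage of the (1 - alpha) confidence interval
# xbar +/- z_{1-alpha/2}/sqrt(n) for N(mu, 1) data.
import math
import random
from statistics import NormalDist

random.seed(6)
mu, n, alpha, reps = 3.0, 25, 0.05, 10000
z = NormalDist().inv_cdf(1 - alpha / 2)
half_width = z / math.sqrt(n)

covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

coverage = covered / reps
print(coverage)  # close to 1 - alpha = 0.95
```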