CM229S: Machine Learning for Bioinformatics

Lecture 2 - 03/31/2016

Introductory statistics

Lecturer: Sriram Sankararaman

Scribe: Sriram Sankararaman

We will provide an overview of statistical inference focussing on the key concepts. For more details as well as more precise statements of the mathematical results (e.g., “regularity conditions”), consult a standard statistical text (Casella and Berger).

1 Statistical inference

$(X_1, \ldots, X_n)$ is the data drawn from a distribution $P$. $\theta$ is an unknown population parameter, i.e., a function of $P$. The goal of statistical inference is to learn $\theta$ from $(X_1, \ldots, X_n)$. Three broad and related tasks in statistical inference:

1. Estimation
2. Hypothesis testing
3. Prediction

For each of these tasks, we have procedures for finding an estimator, test, or predictor, and procedures for evaluating them. We will focus on the first two tasks.

2 Point estimation

A statistic is a function of the data. The goal of point estimation is to find a statistic $t$ such that $\hat{\theta}_n = t(X_1, \ldots, X_n)$ is "close" to $\theta$. $\hat{\theta}_n$ is an estimator of $\theta$. Some procedures to find point estimators:

1. Maximum likelihood
2. Method of moments
3. Bayesian methods

2.1 Maximum Likelihood Estimation

Definition 2.1. Given $X_1, \ldots, X_n \overset{iid}{\sim} P(x \mid \theta)$, the likelihood function is
$$L(\theta \mid x) = \prod_{i=1}^{n} P(x_i \mid \theta).$$

Definition 2.2. A maximum likelihood estimator (MLE) is
$$\hat{\theta}(x) \overset{\Delta}{=} \operatorname*{argmax}_{\theta} L(\theta \mid x) = \operatorname*{argmax}_{\theta} LL(\theta \mid x),$$
where the log likelihood is
$$LL(\theta \mid x) = \log L(\theta \mid x) = \sum_{i=1}^{n} \log P(x_i \mid \theta).$$

For a differentiable log likelihood function, possible candidates for the MLE are the $\theta$ that solve
$$\frac{d}{d\theta} LL(\theta) = 0.$$

Why only possible candidates?

• Not a sufficient condition for a maximum (unless the log likelihood is concave).
• The maximum could also be on the boundary of the parameter space.

Example 1 (Normal likelihood). $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$. The MLEs are
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{n-1}{n} S^2.$$
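As an added numerical check (not part of the original notes; the sample size and true parameters are arbitrary choices), the sketch below compares these closed-form MLEs with the output of a generic numerical optimizer applied to the negative log likelihood.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, mu_true, sigma_true = 200, 1.5, 2.0
x = rng.normal(mu_true, sigma_true, size=n)

# Closed-form MLE: sample mean and sample variance with divisor n.
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()

# Numerical MLE: minimize the negative log likelihood over (mu, log sigma^2).
def neg_log_lik(params):
    mu, log_s2 = params
    s2 = np.exp(log_s2)  # parameterize by log variance to keep s2 > 0
    return 0.5 * np.sum(np.log(2 * np.pi * s2) + (x - mu) ** 2 / s2)

res = minimize(neg_log_lik, x0=[0.0, 0.0])
print(mu_hat, sigma2_hat)          # analytic MLE
print(res.x[0], np.exp(res.x[1]))  # numerical MLE, should agree closely
```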

We would like to evaluate point estimators. Generally, we can evaluate an estimator in a finite-sample setting (asking how it behaves for a fixed sample size) or in an asymptotic setting (asking how it behaves as the sample size becomes large).

2.2 Finite sample

Definition 2.3. The bias of an estimator is
$$\mathrm{bias}(\hat{\theta}_n) = E\left[\hat{\theta}_n\right] - \theta.$$
An estimator is unbiased if its bias is 0.

Definition 2.4. The variance of an estimator is
$$\mathrm{var}(\hat{\theta}_n) = E\left[\left(\hat{\theta}_n - E\left[\hat{\theta}_n\right]\right)^2\right].$$

Definition 2.5. The mean squared error (MSE) of an estimator is
$$\mathrm{mse}(\hat{\theta}_n) = E\left[\left(\hat{\theta}_n - \theta\right)^2\right].$$

The MSE decomposes as
$$\mathrm{mse}(\hat{\theta}_n) = \mathrm{bias}(\hat{\theta}_n)^2 + \mathrm{var}(\hat{\theta}_n).$$
These definitions assess the performance of the estimator over repeated experiments, i.e., over new data that we do not see in practice. Comparing estimators is not straightforward, though. Besides depending on the evaluation criterion, the performance of the estimator can depend on the true parameter.

Example 1 (Normal likelihood (continued)).

1. $X_1$ is an unbiased estimator of $\mu$.
2. $\hat{\mu} = \bar{X}$ is also an unbiased estimator of $\mu$.
3. $\mathrm{Var}[X_1] = \sigma^2$ while $\mathrm{Var}[\hat{\mu}] = \frac{\sigma^2}{n}$.
4. $\hat{\sigma}^2$ is not an unbiased estimator of $\sigma^2$. Its bias is $-\frac{\sigma^2}{n}$.
5. $S^2$ is an unbiased estimator of $\sigma^2$.
6. On the other hand, $\mathrm{mse}(\hat{\sigma}^2) < \mathrm{mse}(S^2)$ (see the simulation sketch after this list).
7. $S$ is not an unbiased estimator of $\sigma$.
8. $\bar{X}$ and $S^2$ are unbiased estimators of the population mean and variance for any i.i.d. sample from a population with finite mean and variance.
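Over repeated experiments these quantities can be estimated by simulation. The following sketch is an added illustration (the sample size, $\sigma^2$, and number of repetitions are arbitrary): it estimates the bias, variance, and MSE of $\hat{\sigma}^2$ and $S^2$, and shows that the MLE is biased but has smaller MSE.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 10, 4.0, 100_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1, keepdims=True)
ss = ((x - xbar) ** 2).sum(axis=1)

sigma2_hat = ss / n   # MLE, biased
s2 = ss / (n - 1)     # unbiased sample variance

for name, est in [("sigma2_hat", sigma2_hat), ("S^2", s2)]:
    bias = est.mean() - sigma2
    var = est.var()
    mse = ((est - sigma2) ** 2).mean()
    print(name, bias, var, mse)  # MLE: bias ~ -sigma2/n, yet smaller MSE
```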

2.3 Asymptotics

It is often challenging to compute finite-sample properties. Alternatively, we can ask how the estimator behaves when the sample size goes to infinity. To analyze this behavior, we need a notion of the limit of a sequence of random variables, much like the limit of a sequence of numbers.

Definition 2.6 (Convergence in probability). A sequence of random variables $X_1, \ldots, X_n$ converges in probability to a random variable $X$, denoted $X_n \overset{p}{\to} X$, if for every $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr(|X_n - X| > \epsilon) = 0.$$

Definition 2.7 (Convergence in distribution). A sequence of random variables $X_1, \ldots$ converges in distribution to a random variable $X$, denoted $X_n \overset{d}{\to} X$, if
$$\lim_{n \to \infty} F_{X_n}(x) = F(x)$$
for all $x$ where $F(x)$ is continuous.

Theorem 1 (Weak Law of Large Numbers). Let $X_1, \ldots$ be iid random variables with $E[X_i] = \mu$, $\mathrm{Var}[X_i] = \sigma^2 < \infty$. Then $\bar{X}_n \overset{p}{\to} \mu$.

Theorem 2 (Central Limit Theorem). Let $X_1, \ldots$ be iid random variables with $E[X_i] = \mu$, $\mathrm{Var}[X_i] = \sigma^2 < \infty$. Then
$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \overset{d}{\to} N(0, 1).$$

With these concepts, we can now talk about asymptotic properties of point estimators.

Definition 2.8 (Consistency). A sequence of estimators $\hat{\theta}_n$ is a consistent sequence of estimators of a parameter $\theta$ if $\hat{\theta}_n \overset{p}{\to} \theta$.

Example 1 (Normal likelihood (continued)). $\hat{\mu}_n = \bar{x}_n$ is a consistent estimator of $\mu$. $\hat{\sigma}^2_n$ is a consistent estimator of $\sigma^2$. $S_n^2$ is a consistent estimator of $\sigma^2$. $S_n$ is a consistent estimator of $\sigma$.

Theorem 3 (Consistency of the MLE). Let $X_1, \ldots, X_n \overset{iid}{\sim} f(x \mid \theta)$ and let $\hat{\theta}_n$ denote the MLE of $\theta$. Then $\hat{\theta}_n \overset{p}{\to} \theta$.

If the bias of $\hat{\theta}_n \to 0$ and its variance $\to 0$, i.e., its MSE goes to 0, then it is consistent.
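These limit theorems can be checked by simulation. The sketch below is an added illustration (the exponential distribution and the sample sizes are arbitrary choices): the fraction of sample means far from $\mu$ shrinks as $n$ grows (WLLN), and the standardized mean behaves like a standard normal even for non-normal data (CLT).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma = 1.0, 1.0  # exponential(1) has mean 1 and variance 1

# WLLN: the fraction of sample means farther than epsilon from mu shrinks with n.
for n in [10, 100, 1_000]:
    xbar = rng.exponential(scale=1.0, size=(2_000, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - mu) > 0.1))

# CLT: sqrt(n) (xbar - mu) / sigma is approximately N(0, 1).
n = 1_000
xbar = rng.exponential(scale=1.0, size=(2_000, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma
print(np.mean(z <= 1.96), norm.cdf(1.96))  # empirical vs normal CDF
```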


Theorem 4 (Asymptotic Normality of the MLE).
$$\sqrt{n}\,(\hat{\theta}_n - \theta) \overset{d}{\to} N(0, I^{-1}(\theta)),$$
where
$$I(\theta) = -E\left[\frac{\partial^2 \log P(X \mid \theta)}{\partial \theta^2}\right]$$
is the Fisher information associated with the model.

Example 1 (Normal likelihood (continued)). Let us focus on the mean. For a single observation $X \sim N(\mu, \sigma^2)$,
$$\frac{\partial \log P(X \mid \theta)}{\partial \mu} = \frac{x - \mu}{\sigma^2}, \qquad \frac{\partial^2 \log P(X \mid \theta)}{\partial \mu^2} = -\frac{1}{\sigma^2}, \qquad I(\mu) = \frac{1}{\sigma^2}.$$

This makes intuitive sense: the inverse of the Fisher information is the variance. The smaller the variance, the larger the Fisher information, and the more precise the estimates of the mean should be.
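A quick simulation (added here; the parameter values are arbitrary) is consistent with the theorem for the normal mean: the sampling variance of $\hat{\mu}$ is close to $I(\mu)^{-1}/n = \sigma^2/n$.

```python
import numpy as np

# Added check: for X_i ~ N(mu, sigma^2), the MLE of mu is the sample mean,
# and Theorem 4 predicts Var(mu_hat) ~ 1 / (n * I(mu)) = sigma^2 / n.
rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.5, 4.0, 50, 50_000

mu_hat = rng.normal(mu, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)
print(mu_hat.var())  # empirical variance of the MLE over repeated samples
print(sigma2 / n)    # inverse Fisher information divided by n
```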

3 Hypothesis testing

A hypothesis is a constraint on the parameter $\theta$. Examples:

1. Offspring has equal chance of inheriting either of two alleles from a parent (Mendel's first law holds)
2. Two genes segregate independently (Mendel's second law holds)
3. A genetic mutation has no effect on disease
4. A drug has no effect on blood pressure

Definition 3.1. Null hypothesis: $H_0 : \theta \in \Theta_0$. Alternate hypothesis: $H_1 : \theta \in \Theta_1$.

Definition 3.2 (Hypothesis test). A rule that specifies values of the sample for which $H_0$ should be accepted or rejected. Typically, a hypothesis test is specified in terms of a test statistic $t(X_1, \ldots, X_n)$ and a rejection/acceptance region.

3.1 Finding tests

One commonly used procedure to find tests is the Likelihood Ratio Test. Other procedures include the Wald test and the score test.

Definition 3.3 (LRT statistic).
$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid x)}{\sup_{\theta \in \Theta} L(\theta \mid x)} = \frac{L(\hat{\theta}_0 \mid x)}{L(\hat{\theta} \mid x)}, \qquad \Theta = \Theta_0 \cup \Theta_1.$$

Definition 3.4 (Likelihood Ratio Test (LRT)). A likelihood ratio test (LRT) is a test that rejects $H_0$ if $\lambda(x) < c$.


Example 2 (Test the mean of normal model). Given $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)$, test $H_0 : \mu = 0$ versus $H_1 : \mu \neq 0$.
$$\lambda(x) = \frac{\exp\left(-\sum_{i=1}^{n} x_i^2 / 2\right)}{\exp\left(-\sum_{i=1}^{n} (x_i - \hat{\mu})^2 / 2\right)} = \exp\left(-\frac{n \bar{x}^2}{2}\right)$$

Rejecting $H_0$ if $\lambda(x) < c$ is equivalent to rejecting $H_0$ if $|\bar{x}| > \sqrt{\frac{2 \log(1/c)}{n}}$. How do we choose $c$?
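Before turning to that question, here is a minimal numerical sketch (added for illustration; the data, sample size, and the value of $c$ are arbitrary). It computes the LRT statistic for simulated data and checks that rejecting when $\lambda(x) < c$ agrees with rejecting when $|\bar{x}|$ exceeds the threshold above.

```python
import numpy as np

rng = np.random.default_rng(4)
n, c = 25, 0.1
x = rng.normal(0.3, 1.0, size=n)  # data with a small true mean shift

xbar = x.mean()
lam = np.exp(-n * xbar ** 2 / 2)  # LRT statistic for H0: mu = 0 under N(mu, 1)

reject_lrt = lam < c
reject_mean = abs(xbar) > np.sqrt(2 * np.log(1 / c) / n)
print(lam, reject_lrt, reject_mean)  # the two rejection rules agree
```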

3.2 Evaluating tests

                 Decision
Truth        H0        H1
 H0          TN        FP
 H1          FN        TP

FP: False positive (type-I error). FN: False negative (type-II error). TP: True positive. TN: True negative.

For $\theta \in \Theta_0$, the probability of a false positive for a test: $P_\theta(\text{Test rejects})$. For $\theta \in \Theta_1$, the probability of a false negative: $P_\theta(\text{Test accepts})$. To evaluate a test, we need to trade off false positives and false negatives. For tests with false positive probability below some threshold, what is the false negative probability?

Definition 3.5 (Level of a test). A test has level $\alpha$ if $\sup_{\theta \in \Theta_0} P_\theta(\text{Test rejects}) \leq \alpha$. $\alpha$ is an upper bound on the false positive probability.

Remark. Often we set $\alpha = 0.05$.

For the LRT, choose $c$ to control $\alpha$: $\sup_{\theta \in \Theta_0} P_\theta(\lambda(X) \leq c) = \alpha$. A trivial test is to never reject. Instead, we would like to maximize the probability of rejecting the null, i.e., the power, when it is false, while controlling the false positive probability.

Example 2 (Normal mean (continued)). For the test of the normal mean, $\Theta_0 = \{0\}$. For $\theta = 0$, the test statistic $\bar{X} \sim N(0, \frac{1}{n})$. We choose $c$ to achieve the desired significance level:
$$P_0(\lambda(X) \leq c) = P_0\left(|\bar{X}| \geq \sqrt{\frac{2 \log(1/c)}{n}}\right) = P\left(|Z| \geq \sqrt{2 \log(1/c)}\right).$$
Here $Z \sim N(0, 1)$. $P(|Z| \geq z_{1-\alpha/2}) = \alpha$, where $z_\alpha$ is the $\alpha$-quantile of the standard normal distribution. We set $\sqrt{2 \log(1/c)} = z_{1-\alpha/2}$.


The significance level is a crude measure. If we reject the null hypothesis at level $\alpha = 0.05$, it does not tell us how strong the evidence against $H_0$ is. The p-value of a test is the smallest $\alpha$ at which the test rejects $H_0$. Thus, a smaller p-value implies stronger evidence against $H_0$.

Definition 3.6 (p-value). For a statistic $t(X)$ such that large values of $t$ give evidence against $H_0$, the p-value is
$$p(x) = \sup_{\theta \in \Theta_0} P_\theta(t(X) \geq t(x)).$$
A p-value is valid if, for every $0 \leq \alpha \leq 1$,
$$\sup_{\theta \in \Theta_0} P_\theta(p(X) \leq \alpha) \leq \alpha.$$
If $p(X)$ is a valid p-value, a test that rejects $H_0$ because $p(X) \leq \alpha$ is a level $\alpha$ test.

Example 2 (p-value for the test of normal mean). The LRT rejects $H_0 : \mu = 0$ if $|t(X)| = \sqrt{n}|\bar{X}|$ is large. If $\mu = 0$, then $t(X) \sim N(0, 1)$. So if we observe data $x$, the p-value for the LRT is
$$p(x) = P(|t(X)| \geq t(x)) = P(|Z| \geq t(x)) = \Phi(-t(x)) + 1 - \Phi(t(x)) = 2(1 - \Phi(t(x))).$$

We can compute the p-value for different values of the test statistic $t$ (Table 1).

t    p-value
1    0.31
2    0.04
3    2.69e-03
4    6.33e-05
5    5.73e-07

Table 1: P-value for the test of the normal mean as a function of the test statistic $t = \sqrt{n}|\bar{X}|$.
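The entries of Table 1 follow directly from the formula $p(x) = 2(1 - \Phi(t(x)))$; the short check below is an added illustration using scipy's standard normal CDF.

```python
from scipy.stats import norm

for t in [1, 2, 3, 4, 5]:
    p = 2 * (1 - norm.cdf(t))
    print(t, p)  # ~0.317, 0.046, 2.7e-03, 6.3e-05, 5.7e-07 (cf. Table 1, up to rounding)
```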

To use the LRT, we need to choose $c$ so that the false positive probability is bounded. To compute the false positive probability, we need to know how the statistic is distributed under the null (the sampling distribution of the statistic). This is easy for the test of the normal mean but difficult more generally. An alternative approach to evaluate the LRT (or equivalently to choose $c$) relies on asymptotics.

3.2.1 Asymptotics of the LRT

Theorem 5 (Wilks' theorem). Let $X_1, \ldots, X_n \overset{iid}{\sim} p(x \mid \theta)$. For testing $\theta = \theta_0$ versus $\theta \neq \theta_0$,
$$-2 \log \lambda(X_n) \overset{d}{\to} \chi^2_1.$$
For testing $\theta \in \Theta_0$ versus $\theta \notin \Theta_0$, if $\nu$ is the difference in the number of free parameters between $\theta \in \Theta$ and $\theta \in \Theta_0$,
$$-2 \log \lambda(X_n) \overset{d}{\to} \chi^2_\nu.$$
Rejecting $H_0$ iff $-2 \log \lambda(X) \geq \chi^2_{\nu, 1-\alpha}$ gives us an asymptotic (hence approximate) level $\alpha$ test.

Remark. One of the conditions for Wilks' theorem to hold is that the true parameter lies in the interior of the parameter space. This condition can be violated in practice, e.g., when the null hypothesis restricts the parameter to the boundary of the parameter space.
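As an added illustration of how the asymptotic LRT is used in practice (the Bernoulli model and the parameter values are my own choice, not from the notes), the sketch below tests $H_0 : p = 0.5$ for a Bernoulli success probability and rejects when $-2 \log \lambda$ exceeds the chi-square quantile.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n, p0, alpha = 200, 0.5, 0.05
x = rng.binomial(1, 0.6, size=n)  # data generated under an alternative

p_hat = x.mean()                  # unrestricted MLE
k = x.sum()
# log likelihood ratio: restricted (p = p0) minus unrestricted (p = p_hat)
loglam = (k * np.log(p0) + (n - k) * np.log(1 - p0)
          - k * np.log(p_hat) - (n - k) * np.log(1 - p_hat))
stat = -2 * loglam
print(stat, stat >= chi2.ppf(1 - alpha, df=1))  # reject H0 iff stat >= chi^2_{1, 1-alpha}
```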


3.2.2 Permutation tests

In many instances, even asymptotic results are difficult to obtain. It is also not clear if the sample size is large enough for the asymptotics to hold.

Example 3 (Permutation test). Test whether the distributions of two samples are the same:
$$X_1, \ldots, X_n \sim F, \qquad Y_1, \ldots, Y_n \sim G, \qquad H_0 : F = G = P_0.$$
We choose the statistic $t((X_1, \ldots, X_n), (Y_1, \ldots, Y_n)) = \bar{X} - \bar{Y}$. The key idea is that under $H_0$, all permutations of the data are equally likely. We use this idea to compute the sampling distribution (see the sketch after this list):

1. Randomly choose $n$ out of the $2n$ elements to belong to group 1. The remaining belong to group 2.
2. Compute the difference in means between the two groups.
3. How often does the difference in means on the permuted data exceed the observed difference?

In many cases, the number of permutations is too large to exhaustively enumerate. Then we use a random sample of permutations (Monte-Carlo approximation).
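A Monte-Carlo version of this procedure might look like the following sketch (added for illustration; the simulated data, group sizes, and number of permutations are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, size=30)  # group 1
y = rng.normal(0.5, 1.0, size=30)  # group 2, shifted mean

observed = abs(x.mean() - y.mean())
pooled = np.concatenate([x, y])

n_perm, count = 10_000, 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)  # relabel the pooled sample at random
    diff = abs(perm[:len(x)].mean() - perm[len(x):].mean())
    count += diff >= observed

p_value = (count + 1) / (n_perm + 1)  # add-one correction avoids a zero p-value
print(observed, p_value)
```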

4 Interval estimation

We would like a measure of confidence on our estimates.

Definition 4.1. For a scalar parameter $\theta$, given data $X$, an interval estimator is a pair of functions $L(x_1, \ldots, x_n)$ and $U(x_1, \ldots, x_n)$, $L \leq U$, such that $[L(X), U(X)]$ covers $\theta$.

Remark. The interval is random while $\theta$ is fixed.

4.1 Evaluating interval estimators

Definition 4.2 (Coverage probability). The coverage probability of an interval estimator $[L(X), U(X)]$ is the probability that this interval covers the true parameter $\theta$, denoted $P_\theta(\theta \in [L(X), U(X)])$.

Definition 4.3 ($(1-\alpha)$ confidence interval). An interval estimator is termed a $1 - \alpha$ confidence interval if $\inf_\theta P_\theta(\theta \in [L(X), U(X)]) \geq 1 - \alpha$.

4.2 Finding interval estimators

A common method for constructing confidence intervals is to invert a test statistic.

Example 4. Given $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, 1)$, find a $(1 - \alpha)$ confidence interval for $\mu$. Set $H_0 : \mu = \mu_0$ versus $H_1 : \mu \neq \mu_0$. One test is to reject $H_0$ if $\left\{x : |\bar{x} - \mu_0| > \frac{z_{1-\alpha/2}}{\sqrt{n}}\right\}$. So $H_0$ is accepted for $\left\{x : |\bar{x} - \mu_0| \leq \frac{z_{1-\alpha/2}}{\sqrt{n}}\right\}$, or equivalently
$$\bar{x} - z_{1-\alpha/2} \frac{1}{\sqrt{n}} \leq \mu_0 \leq \bar{x} + z_{1-\alpha/2} \frac{1}{\sqrt{n}}.$$


Since the test has size $\alpha$, $P(H_0 \text{ is accepted} \mid \mu = \mu_0) = 1 - \alpha$:
$$P\left(\bar{x} - z_{1-\alpha/2} \frac{1}{\sqrt{n}} \leq \mu_0 \leq \bar{x} + z_{1-\alpha/2} \frac{1}{\sqrt{n}} \;\middle|\; \mu = \mu_0\right) = 1 - \alpha.$$
Since this is true for all $\mu_0$, we have
$$P_\mu\left(\bar{x} - z_{1-\alpha/2} \frac{1}{\sqrt{n}} \leq \mu \leq \bar{x} + z_{1-\alpha/2} \frac{1}{\sqrt{n}}\right) = 1 - \alpha.$$
So $\left[\bar{x} - z_{1-\alpha/2} \frac{1}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2} \frac{1}{\sqrt{n}}\right]$ is a $1 - \alpha$ confidence interval.
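As an added check (not from the notes; the true mean, sample size, and $\alpha$ are arbitrary), the sketch below constructs the interval $\bar{x} \pm z_{1-\alpha/2}/\sqrt{n}$ on many simulated data sets and verifies that its empirical coverage is close to $1 - \alpha$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
mu, n, alpha, reps = 2.0, 50, 0.05, 20_000
z = norm.ppf(1 - alpha / 2)  # z_{1 - alpha/2}

xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
lower, upper = xbar - z / np.sqrt(n), xbar + z / np.sqrt(n)
coverage = np.mean((lower <= mu) & (mu <= upper))
print(coverage)  # should be close to 1 - alpha = 0.95
```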
