# Hypothesis Tests, Significance - Caltech High Energy Physics

Hypothesis Tests, Significance

– Significance as hypothesis test
– Pitfalls
– Avoiding pitfalls
– Goodness-of-fit

Context is frequency statistics.


Frank Porter, BaBar analysis school – Statistics, 12-15 February 2008

Significance as Hypothesis Test

When asking for the “significance” of an observation (of, perhaps, a new effect), you ask for a test of the hypotheses:

– Null hypothesis H0: There is no new effect; against
– Alternative hypothesis H1: There is a new effect.

Reject the null hypothesis (that is, claim a new effect) if the observation falls in a region that is “unlikely” if the null hypothesis is correct.

“Significance” (as typically used in HEP, e.g., “a significance of 5σ”) is the probability that we erroneously reject the null hypothesis. Also called “confidence level”, or “P-value”, or the probability of a “Type I error”.

Significance

A 68% confidence interval does not always tell you much about significance. The tails may be non-normal. A separate analysis is generally required, which models the tails appropriately.

No recommendation on when to label a result as “significant”. The label implies interpretation.

– No uniform prescription seems to make sense; it involves judgement. E.g., bizarre new particle vs. expected branching fraction.
– Not our most essential experimental role; it is ultimately up to the consumer to decide what they want to believe.

Nevertheless, people insist on making qualitative statements (“observation of”, “evidence for”, “discovery of”, “not significant”, “consistent with”).

Code: “observation of” ≡ > 4σ, “evidence for” ≡ > 3σ

Significance (more semantics...)

From Physics Today, http://www.physicstoday.org/pt/vol-54/iss-9/p19.html (coloring mine, references deleted). nb: Just an observation on human nature, no criticism should be inferred.

“In March, back-to-back papers in Physical Review Letters reported the measurement of CP symmetry violation in the decay of neutral B mesons by groups in Japan and California. Now the word “measurement” has been replaced by “observation” in the titles of two new back-to-back reports by these same groups in the 27 August Physical Review Letters. That is to say, with a lot more data and improved event reconstruction, the BaBar collaboration at SLAC and the Belle collaboration at KEK in Japan have at last produced the first compelling evidence of CP violation in any system other than the neutral K mesons.”

Some people think a measurement should not be called a “measurement” unless the result is significantly different from zero. A senior Assistant Editor at a prominent journal suggested that “bounds on” might be more appropriate than “measurement” in reference to a CP asymmetry angle which was observed as consistent with zero. Finding sin 2β = 0.00 ± 0.01 would be pretty exciting. But it isn’t a “measurement”?

Computation of significance: Example

Flat normal background (avg 100/bin) + gaussian signal (100 events, fixed) with mean 0, σ = 1.

[Figure: histogram of Events/0.5 versus x, for x from −10 to 10. Annotations: Generated signal = 100 events; Cut&Count signal = 194 ± 39 events; Significance = 5.0; Expected significance = 2.7.]

Distribution is normal here, so P = 5.7 × 10⁻⁷ (two-tail).

Note: H0 is flat distribution with no signal. H1 is Nsig ≠ 0.

Note: The significance is not obtained by dividing the signal estimate (194) by the uncertainty in the signal (39), 194/39 = 5.0. That would be akin to asking how likely a signal of the estimated size would be to fluctuate to zero. It is a good approximation in this example, however, because B/S is large.

The significance was estimated here with a simple cut-and-count method:

– Level of background estimated from region |x| > 3.
– Counts in “signal region” |x| < 3 added up.
– The excess in the signal region over the estimated background is divided by the square root of the estimated background in the signal region.

In this example, we assumed we knew the mean and width of the signal we were looking for, as well as the background shape. The uncertainty in the background estimate is negligible in this example. A more sophisticated fit may yield a more powerful test by incorporating the known shape of the signal.
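The cut-and-count recipe can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the plot: the function name `cut_and_count`, the toy event sample (4000 flat background events, assuming 40 bins of width 0.5 at about 100 per bin), and the scaling of the sideband yield by the ratio of region widths are all assumptions, and the uncertainty on the background estimate is ignored. A different random sample will not reproduce the 194 ± 39 quoted for the plot.

```python
import math
import random


def cut_and_count(xs, x_min=-10.0, x_max=10.0, cut=3.0):
    """Cut-and-count significance for a signal centred at zero on a flat background.

    The signal region is |x| < cut; the sideband is cut <= |x| inside [x_min, x_max].
    Returns (excess, estimated background in the signal region, significance).
    """
    n_signal_region = sum(1 for x in xs if abs(x) < cut)
    n_sideband = sum(1 for x in xs if abs(x) >= cut and x_min <= x <= x_max)

    # Flat background assumed: scale the sideband yield by the ratio of widths.
    signal_width = 2.0 * cut
    sideband_width = (x_max - x_min) - signal_width
    background = n_sideband * signal_width / sideband_width

    excess = n_signal_region - background
    significance = excess / math.sqrt(background)
    return excess, background, significance


# Hypothetical toy sample: flat background plus a gaussian signal (mean 0, sigma 1).
rng = random.Random(2008)
events = [rng.uniform(-10.0, 10.0) for _ in range(4000)]
events += [rng.gauss(0.0, 1.0) for _ in range(100)]

excess, background, significance = cut_and_count(events)
print(f"excess = {excess:.0f}, background = {background:.0f}, "
      f"significance = {significance:.1f}")
```

Dividing by the square root of the estimated background, rather than by the uncertainty on the signal, is what makes this a test of the no-signal hypothesis.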

What about Systematic Uncertainties?

B(e+e− → Nobel prize) = 10 ± 1 ± 5.

They may be important!
– Maybe the ±5 is a systematic uncertainty in the estimate of the background expectation. A “10σ” statistical significance is really only a “2σ” effect.

They may be irrelevant.
– Maybe the ±5 is a systematic uncertainty on the efficiency, entering as a multiplicative factor. It makes no difference to the significance whether the result is 10 ± 1 or 5 ± 0.5.

They may be “fuzzy”.
– E.g., how is the background expectation estimated? What is the sampling distribution?
– E.g., are “theoretical” uncertainties present?
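The first two cases can be checked with simple arithmetic. The sketch below assumes, which the slide does not state, that the statistical and systematic uncertainties on the background expectation combine in quadrature:

```latex
% Case 1: the \pm 5 is an uncertainty on the background expectation.
\frac{10}{\sqrt{1^2 + 5^2}} = \frac{10}{\sqrt{26}} \approx 1.96
\quad\Rightarrow\quad \text{a ``}10\sigma\text{'' statistical effect is roughly ``}2\sigma\text{''.}

% Case 2: the \pm 5 is a multiplicative efficiency uncertainty.
\frac{10}{1} = \frac{5}{0.5} = 10
\quad\Rightarrow\quad \text{rescaling value and error together leaves the significance unchanged.}
```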

Systematic Uncertainties

“Blind checks” and “educated checks”:
– Blind check: testing for mistakes; no correction is expected. If the test is passed, no contribution to systematic error. E.g., divide data into chronological subsets and compare results.
– Educated check: measuring biases, corrections. May affect quoted result. Always contributes to systematic error. E.g., dependence of efficiency on model.

Quote systematic uncertainty separately from statistical.
– Systematic uncertainty may contain statistical components, e.g., MC statistics in evaluation of efficiency.

Systematic Uncertainties

Example: D mixing and DCSD revisited

– Want simple procedure. Willing to accept approximation.
– Scale statistical-only contour uniformly along ray from best-fit value.
– Factor is $\sqrt{1 + \sum_i m_i^2}$, where $m_i$ is an estimate of the effect of systematic uncertainty i measured in units of the statistical uncertainty. This estimate is obtained by determining the effect of the systematic uncertainty on $\hat{x}'^2$, $\hat{y}'$.

[Figure: contours in the (x′², y′) plane, y′ versus x′² / 10⁻³. Legend: Physical (x′², y′); Central (x′², y′) No CPV; 95% CL CPV allowed; 95% CL CPV allowed, stat only; 95% CL CP conserved; 95% CL CP conserved, stat only.]

Method conservative (≡ lazy) in sense that scaling for a given systematic in one direction is applied uniformly in all directions. On the other hand, a linear approximation is being made.
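The contour scaling can be sketched as follows. This is an illustration only: the function names, the toy contour, and the values of the m_i are hypothetical, and the scale factor is taken to be the square root of one plus the sum of the squared m_i.

```python
import math


def systematic_scale_factor(m_values):
    """sqrt(1 + sum_i m_i^2), each m_i in units of the statistical uncertainty."""
    return math.sqrt(1.0 + sum(m * m for m in m_values))


def scale_contour(contour, best_fit, m_values):
    """Move every contour point away from the best-fit point along its ray."""
    factor = systematic_scale_factor(m_values)
    x0, y0 = best_fit
    return [(x0 + factor * (x - x0), y0 + factor * (y - y0)) for x, y in contour]


# Hypothetical statistical-only contour around a best-fit point at the origin.
stat_only = [(1.0, 0.0), (0.0, 0.5), (-1.0, 0.0), (0.0, -0.5)]
with_syst = scale_contour(stat_only, best_fit=(0.0, 0.0), m_values=[0.3, 0.5])
print(with_syst)
```

Because the same factor is applied on every ray, a systematic effect that matters in only one direction still inflates the contour everywhere, which is the sense in which the method is conservative.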

Aside: Significance as “nσ”

HEP parlance is to say an effect has, e.g., “5σ” significance. At face value, this means the observation is “5 standard deviations” away from the mean:

$$\sigma \equiv \sqrt{\langle (x - \bar{x})^2 \rangle}.$$

But we often don’t really mean this. Note that a 5σ effect of this sort may not be improbable:

[Figure: a discrete distribution P(x) versus x, annotated P(|x − x̄| = 5σ) = 20%!]

Aside: Significance as “nσ” (continued)

Instead, we often mean that the probability (P-value) for the effect is given by the probability of a fluctuation in a normal distribution 5σ from the mean, i.e.,

$$P = P(|x| > 5) = 5.7 \times 10^{-7}, \quad \text{for } x \sim N(0, 1)$$

(two-tailed probability).

But sometimes we really do mean 5σ, usually presuming that the sampling distribution is approximately normal. [This may not be an accurate presumption when far out in the tails!]

Also now popular to call $\sqrt{-2\Delta \ln L}$ the “n” in “nσ”. From:

$$L_0(\theta = 0; x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2\right], \qquad L_{\max}(\hat\theta = x; x) = \frac{1}{\sqrt{2\pi}\,\sigma},$$

giving $\sqrt{-2\Delta \ln L} = \sqrt{\Delta\chi^2} = x/\sigma = n$.

Desirable to be more concise by quoting probabilities, or “P-values” as is common in the statistics world. At least say what you mean!
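The conversions between “nσ” and a two-tailed P-value under the normal convention can be sketched with the Python standard library. The function names are hypothetical, and the conversion is only as good as the assumption that the statistic really is normally distributed.

```python
import math
from statistics import NormalDist


def two_tailed_p(n_sigma):
    """P(|x| > n_sigma) for x drawn from a standard normal distribution."""
    return math.erfc(n_sigma / math.sqrt(2.0))


def n_sigma_from_p(p_two_tailed):
    """Inverse of two_tailed_p: the n for which P(|x| > n) equals the P-value."""
    return -NormalDist().inv_cdf(p_two_tailed / 2.0)


def n_sigma_from_delta_lnl(delta_lnl):
    """The 'n' defined as sqrt(-2 * Delta ln L), with Delta ln L = ln L0 - ln Lmax <= 0."""
    return math.sqrt(-2.0 * delta_lnl)


for n in (3.0, 4.0, 5.0):
    print(f"{n:.0f} sigma -> two-tailed P = {two_tailed_p(n):.2e}")
```

For n = 5 this gives the 5.7 × 10⁻⁷ quoted above. Quoting the P-value directly avoids any ambiguity about which of the “nσ” conventions is meant.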

Estimating significance: Pitfalls

What are the dangers? In a nutshell: unknown or unknowable sampling distributions.

Ways to not know the distribution:

– The Improbable Tails
– Systematic Unknowns
– The Stopping Problem
– The exploratory Bump Hunt

The Improbable Tails

Our earlier example was known to be normal sampling.

[Figure: the earlier histogram of Events/0.5 versus x, for x from −10 to 10. Annotations: Generated signal = 100 events; Cut&Count signal = 194 ± 39 events; Significance = 5.0; Expected significance = 2.7.]

Often, this is true approximately (central limit theorem). But for significance, we are often interested in the distribution far into the tails. The normal approximation may be very bad here!

If there is any doubt, need to compute the actual distribution. Typically this is done with a “toy Monte Carlo” to simulate the distribution of the significance statistic. To get to the tails, this may require a fair amount of computing time. Still need to be wary of pushing the calculation beyond its validity as a model of the actual distribution.

BaBar (and others) now routinely perform these calculations.
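A toy Monte Carlo of this kind can be sketched for the simplest case, a counting experiment with a known expected background. Everything here is an assumption made for illustration: the statistic z = (n − b)/√b, the low background of 4 events (chosen so the Poisson tail differs visibly from the normal one), and the function names. It is not the BaBar procedure.

```python
import math
import random


def poisson(mean, rng):
    """Draw one Poisson variate (Knuth's method; adequate for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def toy_p_value(z_obs, background, n_toys, seed=1):
    """Fraction of background-only toys with statistic z >= z_obs."""
    rng = random.Random(seed)
    n_exceed = 0
    for _ in range(n_toys):
        n = poisson(background, rng)
        z = (n - background) / math.sqrt(background)
        if z >= z_obs:
            n_exceed += 1
    return n_exceed / n_toys


def normal_p_value(z_obs):
    """One-sided tail probability if z were exactly standard normal."""
    return 0.5 * math.erfc(z_obs / math.sqrt(2.0))


background = 4.0   # hypothetical expected background in the signal region
z_obs = 3.0        # hypothetical observed value of the statistic
print("toy MC P-value:     ", toy_p_value(z_obs, background, n_toys=20000))
print("normal-tail P-value:", normal_p_value(z_obs))
```

With a small background the simulated tail probability comes out well above the normal-tail value. The number of toys needed grows roughly as the inverse of the P-value being probed, which is why reaching far into the tails is expensive.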

Systematic Unknowns

Nuisance parameters
– Unknown, but relevant parameters. Estimated somehow, but with some uncertainty. Even if the sampling distribution is known, cannot in general derive exact P-values in lower dimensional parameter space.
– Central Limit Theorem is our friend.
– Can try other values besides best estimate of nuisance parameter.
– See our discussion of confidence intervals.

“Theoretical” systematic uncertainties. Guesses, no sampling distribution.
– Use worst case values when evaluating significance. Requires understanding what is meant by the theory “errors”.
– Or, give the dependence, e.g., as a range.

The Stopping Problem

There is a strong tendency to work on an analysis until we are convinced that we got it “right”, then we stop. Simple example: “Keep sampling” until we are satisfied.

Motivate our example:

• Ample historical evidence that experimental measurements are sometimes biased by some preconception of what the answer “should be”. For example, a preconception could be based on the result of another experiment, or on some theoretical prejudice.
• A model for such a biased experiment is that the experimenter works “hard” until s/he gets the expected result, and then quits.

Let’s consider a simple example of a distribution which could result from such a scenario.

Stopping Problem: Normal likelihood function example

• Consider an experiment in which a measurement of a parameter θ corresponds to sampling from a Gaussian distribution of standard deviation one:

$$N(x; \theta, 1)\,dx = \frac{1}{\sqrt{2\pi}}\, e^{-(x-\theta)^2/2}\,dx.$$

[Figure: Gaussian frequency distribution of x, centered at θ, with the x axis marked at θ − 2, θ, θ + 2.]

• Suppose the experimenter has a prejudice that θ is greater than one.
• Subconsciously, s/he makes measurements until the sample mean, $m = \frac{1}{n}\sum_{i=1}^{n} x_i$, is greater than one, or until s/he becomes convinced (or tired) after a maximum of N measurements.
• The experimenter then uses the sample mean, m, to estimate θ.

Stopping Problem: Normal likelihood function example

For illustration, assume that N = 2. In terms of the random variables m and n, the pdf is:

$$f(m, n; \theta) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(m-\theta)^2}, & n = 1,\ m > 1 \\[1ex] 0, & n = 1,\ m < 1 \\[1ex] \dfrac{1}{\pi}\, e^{-(m-\theta)^2} \displaystyle\int_{-\infty}^{1} e^{-(x-m)^2}\,dx, & n = 2 \end{cases}$$

[Figure: histogram of the sampling distribution for m, with pdf given by the above equation, for θ = 0. Axes: number of experiments versus sample mean, from −4 to 4.]

The likelihood function, as a function of θ, has the shape of a normal distribution, given any experimental result. The peak is at θ = m, so m is the maximum likelihood estimator for θ.
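The sampling distribution of m under this stopping rule is easy to simulate. The sketch below is a minimal illustration with hypothetical function names, using θ = 0 and N = 2 as in the example; the number of simulated experiments and the seed are arbitrary choices.

```python
import random
from statistics import mean


def run_experiment(theta, n_max, rng, threshold=1.0):
    """Sample from N(theta, 1) until the running mean exceeds threshold or n_max is reached.

    Returns (m, n): the final sample mean and the number of measurements taken.
    """
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(theta, 1.0)
        m = total / n
        if m > threshold:
            break
    return m, n


def simulate(theta=0.0, n_max=2, n_experiments=20000, seed=17):
    """Repeat the biased experiment many times with a fixed random seed."""
    rng = random.Random(seed)
    return [run_experiment(theta, n_max, rng) for _ in range(n_experiments)]


results = simulate()
sample_means = [m for m, _ in results]
frac_stopped_early = sum(1 for _, n in results if n == 1) / len(results)
print(f"mean of m = {mean(sample_means):.3f} (true theta = 0)")
print(f"fraction stopping after one measurement = {frac_stopped_early:.3f}")
```

For θ = 0 the average of m should come out near 0.12 rather than 0, since experiments that start high stop early and keep their upward fluctuation. A histogram of `sample_means` shows the non-normal shape of the sampling distribution.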

Stopping Problem: Normal likelihood function example

In spite of the normal form of the likelihood function, the sample mean is not sampled from a normal distribution. The “4σ” tail is more probable (for some θ) than the experimenter thinks.

[Figure: Probability(θ < m − 4σ) versus θ, for θ from −8 to 4. The straight line is the probability for a “4σ” fluctuation of a normal distribution.]

Stopping Problem: Normal likelihood function example

The likelihood function, as a function of θ, has the shape of a normal distribution, given any experimental result. The peak is at θ = m, so m is the maximum likelihood estimator for θ.

In spite of the normal form of the likelihood function, the sample mean is not sampled from a normal distribution. The interval defined by where the likelihood function falls by $e^{-1/2}$ does not correspond to a 68% CI:

[Figure: a probability plotted against θ; the axis label is truncated in the extracted text.]

[Figure: frequency distributions of Δχ² (H0 − H1), for Δχ² from 0 to 12. H0: fit with no signal. H1: float signal yield. One panel label ends in “>0 constrained”; the remaining labels are not recoverable.]

Counting Degrees of Freedom - Sample fits

[Figure: two sets of fit panels showing counts/bin versus x, for x from 0 to 80. One set shows fits to data with no signal, the other shows fits to data with a signal. Panels in each set: H0 fit; H1 fit with signal ≥ 0, fixed mean; H1 fit with any signal, fixed mean; H1 fit with any signal, floating mean.]

Significance – Conclusions

Evaluation of significance is a “hypothesis test”. It is essentially the same problem as evaluating confidence intervals.
– Except for the more obvious role played by improbable tails.

Pitfalls amount to (not) knowing the sampling distribution.

Techniques exist to avoid pitfalls:
– Simulating the sampling distribution
– Vary the nuisance/theory parameters
– Blind your experiment

Doing it properly requires patience and discipline; the benefit is a more meaningful, convincing result to yourself and to others.

Goodness-of-Fit

No perfect general goodness-of-fit test:
– Given a dataset generated under null hypothesis, can usually find a test which rejects the null hypothesis (i.e., choosing the test after you see the data is dangerous).
– Given a dataset generated under alternative hypothesis, can usually find a test for which the null passes (i.e., should think about what you want to test for).

Nominal recommendation:
– If you have a specific question, test for that.
– χ² test when valid.
– Consider more general likelihood ratio test, Kolmogorov-Smirnov, etc., otherwise.
– Monte Carlo evaluation of distribution of test statistic.
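The last recommendation, Monte Carlo evaluation of the distribution of the test statistic, can be sketched for a binned χ² test. This is an illustration under stated assumptions: the null hypothesis fully specifies the expected bin contents (no fitted parameters), bin counts are Poisson distributed, and the function names and example numbers are hypothetical.

```python
import math
import random


def chi2_statistic(observed, expected):
    """Pearson chi-square between observed bin counts and expected bin contents."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def poisson(mean, rng):
    """Draw one Poisson variate (Knuth's method; adequate for modest means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def gof_p_value(observed, expected, n_toys=5000, seed=3):
    """Fraction of toy datasets generated under the null with chi2 >= the observed chi2."""
    rng = random.Random(seed)
    chi2_obs = chi2_statistic(observed, expected)
    n_worse = 0
    for _ in range(n_toys):
        toy = [poisson(e, rng) for e in expected]
        if chi2_statistic(toy, expected) >= chi2_obs:
            n_worse += 1
    return n_worse / n_toys


expected = [5.0, 8.0, 12.0, 8.0, 5.0]   # hypothetical null prediction per bin
observed = [9, 6, 10, 11, 2]            # hypothetical data
print("chi2 =", round(chi2_statistic(observed, expected), 2))
print("toy MC P-value =", gof_p_value(observed, expected))
```

With bin contents this small the statistic need not follow the asymptotic χ² distribution, which is the situation where simulating its distribution is most useful.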

Goodness-of-Fit (continued)

But recognize when the test may not answer the desired question, e.g., in the sin 2β analysis, a likelihood ratio (or a χ²) test on the time distribution may have little sensitivity to testing goodness-of-fit of the asymmetry.

[Figure: Entries / 0.6 ps versus Δt (ps) for B⁰ tags and B̄⁰ tags, with the background indicated, together with the raw asymmetry versus Δt (ps), for Δt from −5 to 5 ps.]

Consistency of two correlated results

BaBar has encountered several times the question of whether a new analysis is consistent with an old analysis. Often, the new analysis is a combination of additional data plus changed (improved...) analysis of original data.

The stickiest issue is handling the correlation in testing for consistency in the overlapping data. People sometimes have difficulty understanding that statistical differences can arise even comparing results based on the same events.

Given a sampling $\hat\theta_1, \hat\theta_2$ from a bivariate normal distribution $N(\theta, \sigma_1, \sigma_2, \rho)$, with $\langle\hat\theta_1\rangle = \langle\hat\theta_2\rangle = \theta$, the difference $\Delta\theta \equiv \hat\theta_2 - \hat\theta_1$ is $N(0, \sigma)$-distributed with

$$\sigma^2 = \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2.$$

If the correlation is unknown, all we can say is that the variance of the difference is in the range $(\sigma_1 - \sigma_2)^2 \ldots (\sigma_1 + \sigma_2)^2$. If we at least believe ρ ≥ 0 then the maximum variance of the difference is $\sigma_1^2 + \sigma_2^2$.

Consistency – Simple example of two analyses on same events

Suppose we measure a neutrino mass, m, in a sample of n = 10 independent events. The measurements are $x_i$, i = 1, ..., 10. Assume the sampling distribution for $x_i$ is $N(m, \sigma_i)$. We may form an unbiased estimator, $\hat m_1$, for m:

$$\hat m_1 = \frac{1}{n}\sum_{i=1}^{n} x_i \pm \sqrt{\frac{1}{n^2}\sum_{i=1}^{n} \sigma_i^2}.$$

The result (from a MC) is $\hat m_1$ = 0.058 ± 0.039.

Then we notice that we have some further information which might be useful: we know the experimental resolutions, $\sigma_i$, for each measurement. We form another unbiased estimator, $\hat m_2$, for m:

$$\hat m_2 = \frac{\sum_{i=1}^{n} x_i/\sigma_i^2}{\sum_{i=1}^{n} 1/\sigma_i^2} \pm \sqrt{\frac{1}{\sum_{i=1}^{n} 1/\sigma_i^2}}.$$

The result (from the same MC) is $\hat m_2$ = 0.000 ± 0.016.
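The two estimators can be written out directly. The sketch below uses hypothetical measurements and resolutions, not the Monte Carlo sample behind the quoted results, and the function names are illustrative.

```python
import math


def unweighted_mean(xs, sigmas):
    """m1 = (1/n) sum x_i, with error sqrt(sum sigma_i^2) / n."""
    n = len(xs)
    estimate = sum(xs) / n
    error = math.sqrt(sum(s * s for s in sigmas)) / n
    return estimate, error


def weighted_mean(xs, sigmas):
    """m2 = sum(x_i / sigma_i^2) / sum(1 / sigma_i^2), with error 1 / sqrt(sum 1 / sigma_i^2)."""
    weights = [1.0 / (s * s) for s in sigmas]
    total_weight = sum(weights)
    estimate = sum(w * x for w, x in zip(weights, xs)) / total_weight
    error = 1.0 / math.sqrt(total_weight)
    return estimate, error


# Hypothetical measurements with very different resolutions.
xs = [0.02, -0.01, 0.30, 0.00, -0.15]
sigmas = [0.02, 0.03, 0.20, 0.02, 0.15]
print("unweighted: %.3f +/- %.3f" % unweighted_mean(xs, sigmas))
print("weighted:   %.3f +/- %.3f" % weighted_mean(xs, sigmas))
```

Both estimators use exactly the same events, yet they generally give different values because the poorly measured events are weighted differently. When all resolutions are equal the two coincide.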

Example continued

The results are certainly correlated, so the question of consistency arises (we know the error on the difference is between 0.023 and 0.055). In this example, the difference between the results is 0.058 ± 0.036, where the 0.036 error includes the correlation (ρ = 0.41).
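The numbers in this example follow from the bivariate-normal variance of the difference, σ² = σ₁² + σ₂² − 2ρσ₁σ₂. A short check, with a hypothetical function name and the quoted errors 0.039 and 0.016 as inputs:

```python
import math


def sigma_of_difference(sigma1, sigma2, rho):
    """Standard deviation of the difference of two estimates with correlation rho."""
    return math.sqrt(sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2)


sigma1, sigma2 = 0.039, 0.016   # errors on the two estimates quoted in the example

lower = sigma_of_difference(sigma1, sigma2, 1.0)    # rho = +1 gives |sigma1 - sigma2|
upper = sigma_of_difference(sigma1, sigma2, -1.0)   # rho = -1 gives sigma1 + sigma2
actual = sigma_of_difference(sigma1, sigma2, 0.41)  # rho quoted in the example

print(f"range: {lower:.3f} to {upper:.3f}; with rho = 0.41: {actual:.3f}")
# prints: range: 0.023 to 0.055; with rho = 0.41: 0.036
```

The observed difference of 0.058 is therefore about 1.6 standard deviations once the correlation is included.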

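Consistency – Evaluating the Correlation

Art Snyder developed an approximate formula for evaluating the correlation in a comparison of maximum likelihood analyses (e.g., in the one-dimensional case).

Suppose we perform two maximum likelihood analyses, with event likelihoods $L_1$, $L_2$, on the same set of events [nb, may use different information in each analysis]. The results are estimators $\hat\theta_1$, $\hat\theta_2$ for parameter θ. The correlation coefficient ρ may be estimated according to:

$$\rho \approx \frac{\displaystyle\sum_{i=1}^{N} R_i \left.\frac{d\ln L_{1i}}{d\theta}\right|_{\theta=\hat\theta_1} \left.\frac{d\ln L_{2i}}{d\theta}\right|_{\theta=\hat\theta_2}}{\sqrt{\left[\displaystyle\sum_{i=1}^{N} \left.\frac{d^2\ln L_{1i}}{d\theta^2}\right|_{\theta=\theta_0}\right]\left[\displaystyle\sum_{i=1}^{N} \left.\frac{d^2\ln L_{2i}}{d\theta^2}\right|_{\theta=\theta_0}\right]}},$$

where ($\theta_0$ is an expansion reference point)

$$R_i = \left[1 - (\hat\theta_1 - \theta_0)\, \frac{\left.d^2\ln L_{1i}/d\theta^2\right|_{\theta=\theta_0}}{\left.d\ln L_{1i}/d\theta\right|_{\theta=\theta_0}}\right] \left[1 - (\hat\theta_2 - \theta_0)\, \frac{\left.d^2\ln L_{2i}/d\theta^2\right|_{\theta=\theta_0}}{\left.d\ln L_{2i}/d\theta\right|_{\theta=\theta_0}}\right].$$

If $\theta_0 \approx \hat\theta_1 \approx \hat\theta_2$, then

$$\rho \approx \tilde\sigma_{\theta_1} \tilde\sigma_{\theta_2} \sum_{i=1}^{N} \left.\frac{d\ln L_{1i}}{d\theta}\right|_{\theta=\theta_0} \left.\frac{d\ln L_{2i}}{d\theta}\right|_{\theta=\theta_0},$$

where

$$\tilde\sigma^2_{\theta_k} \equiv 1 \Big/ \sum_{i=1}^{N} \left(\left.\frac{d\ln L_{ki}}{d\theta}\right|_{\theta=\theta_0}\right)^2.$$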

Consistency – Example: sin 2β

– PRL, vol 87, 27 August 2001: 32 × 10⁶ $B\bar{B}$ pairs
  sin 2β = 0.59 ± 0.14(stat) ± 0.05(syst)
– SLAC-PUB-9153, March 2002: 62 × 10⁶ $B\bar{B}$ pairs
  sin 2β = 0.75 ± 0.09(stat) ± 0.04(syst)

Second result includes the earlier data, re-reconstructed. Analysis involves multivariate maximum likelihood fits; reprocessing changes, e.g., the relative likelihood for an event to be signal or background. Not simply counting events.

Question: are the two results statistically consistent?

If these were independent data sets, a difference of 0.16 ± 0.17 would not be a worry. The issue is the correlation. A specialized analysis deriving from the previous formula is performed on the events in common between the two analyses. A correlation of ρ = 0.87 is deduced, yielding a difference of ∼ 2.2σ.
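As a rough cross-check, the quoted statistical uncertainties and ρ = 0.87 can be put into the bivariate-normal variance of the difference. This is only a sketch: it uses the statistical errors alone and treats ρ as applying to the two full results, so it need not reproduce the specialized analysis exactly.

```latex
% Independent data sets (rho = 0):
\sigma_{\Delta} = \sqrt{0.14^2 + 0.09^2} \approx 0.17

% With the deduced correlation rho = 0.87:
\sigma_{\Delta}^2 = 0.14^2 + 0.09^2 - 2(0.87)(0.14)(0.09) \approx 0.0058,
\qquad \sigma_{\Delta} \approx 0.076,
\qquad \frac{0.75 - 0.59}{0.076} \approx 2.1
```

The result, about 2.1σ, is close to the quoted ∼ 2.2σ; the small gap may reflect rounding or details of the specialized analysis.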
