Introduction to Statistical Methods for Data Analysis
Dr Lorenzo Moneta CERN PH-SFT CH-1211 Geneva 23
sftweb.cern.ch root.cern.ch
Outline
• Probability definition
• Probability Density Functions
• Some typical distributions
• Bayes Theorem
• Parameter Estimation
• Hypothesis Testing
Lorenzo Moneta CERN PH-SFT
Data Analysis Tutorial at UERJ 2015: Introduction to Statistics
References
• A lot of the material for this introduction to statistical methods is extracted from a course:
  – Statistical Methods for Data Analysis (Luca Lista, INFN Napoli)
  – Material also available in his book: Statistical Methods for Data Analysis in Particle Physics (Springer), http://www.springer.com/us/book/9783319201757
• Another suggested book:
  – Data Analysis in High Energy Physics (Wiley)
Definition of Probability
• Two main definitions:
  – Frequentist
    • Probability is the ratio of the number of occurrences of an event to the total number of experiments, in the limit of a very large number of repeatable experiments.
    • Can only be applied to a specific class of events (repeatable experiments)
    • Meaningless to state: “probability that the lightest SUSY particle’s mass is less than 1 TeV”
  – Bayesian
    • Probability measures someone’s degree of belief that something is or will be true: would you bet?
      – Probability that Barcelona will win the next Champions League
Classical Probability
• Assume all accessible cases are equally probable
• Valid for discrete cases only
  – Problem in continuous cases (definition of the metric)
Binomial Distribution
• Distribution of the number of successes n in N trials
  – e.g. tossing a coin or rolling a die N times
• Each trial has a probability p of success
• Average: ⟨n⟩ = Np
• Variance: σ² = Np(1−p)
• Used for efficiency estimates
• Available in ROOT as ROOT::Math::binomial_pdf(n, p, N)
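The average and variance quoted above can be checked numerically. The following is a minimal sketch in plain C++ (no ROOT needed), assuming the standard binomial probability P(n; N, p) = C(N, n) pⁿ (1−p)^(N−n). The names binomial_pmf and binomial_moments are hypothetical helpers introduced here for illustration; only the argument order (n, p, N) is borrowed from the ROOT call on the slide.

```cpp
#include <cassert>
#include <cmath>

// Binomial probability P(n; N, p) = C(N, n) p^n (1-p)^(N-n).
// Hypothetical helper; the argument order follows ROOT::Math::binomial_pdf(n, p, N).
// The binomial coefficient is evaluated through lgamma to avoid factorial overflow.
double binomial_pmf(unsigned int n, double p, unsigned int N) {
    assert(p >= 0.0 && p <= 1.0);
    if (n > N) return 0.0;
    if (p == 0.0) return n == 0 ? 1.0 : 0.0;
    if (p == 1.0) return n == N ? 1.0 : 0.0;
    const double log_coeff = std::lgamma(N + 1.0) - std::lgamma(n + 1.0)
                           - std::lgamma(N - n + 1.0);
    return std::exp(log_coeff + n * std::log(p) + (N - n) * std::log1p(-p));
}

struct Moments { double mean; double variance; };

// Mean and variance obtained by summing over all possible outcomes n = 0..N.
Moments binomial_moments(double p, unsigned int N) {
    double mean = 0.0, second = 0.0;
    for (unsigned int n = 0; n <= N; ++n) {
        const double w = binomial_pmf(n, p, N);
        mean += n * w;
        second += static_cast<double>(n) * n * w;
    }
    return {mean, second - mean * mean};
}
```

For example, binomial_moments(0.3, 20) should reproduce Np = 6 and Np(1−p) = 4.2 up to rounding.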
Frequentist Probability
• Law of large numbers
• This also means that
• Circular definition of probabilities
  – a phenomenon can be proven to be random only if we observe an infinite number of cases
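In standard notation, the law of large numbers and the frequentist definition that follows from it can be written as below. This is a sketch of the usual textbook form, not a transcription of the slide: the symbols N(E) (number of occurrences of event E in N experiments), p and ε are assumptions introduced here. Note that a probability P already appears inside the statement, which is the circularity the last bullet refers to.

```latex
\[
\forall\, \varepsilon > 0: \quad
\lim_{N \to \infty} P\!\left( \left| \frac{N(E)}{N} - p \right| < \varepsilon \right) = 1
\qquad\text{with}\qquad
p = P(E) = \lim_{N \to \infty} \frac{N(E)}{N}
\]
```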
Conditional Probability
• Probability of A, given B: P(A|B)
  – probability that an event known to belong to set B is also a member of set A
  – P(A | B) = P(A ∩ B) / P(B)
  – A is independent of B if the conditional probability of A given B is equal to the probability of A:
    • P(A | B) = P(A)
  – Hence, if A is independent of B
    • P(A ∩ B) = P(A) P(B)
  – If A is independent of B, B is independent of A
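A short worked example may help here. It assumes a fair six-sided die, with A = "the result is even" and B = "the result is larger than 3"; both events are chosen for illustration only.

```latex
\[
A = \{2,4,6\},\quad B = \{4,5,6\},\quad A \cap B = \{4,6\}
\]
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{2/6}{3/6} = \frac{2}{3}
\;\neq\; P(A) = \frac{1}{2}
\]
```

Since P(A | B) differs from P(A), the two events are not independent. Equivalently, P(A ∩ B) = 1/3 differs from P(A) P(B) = 1/4.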
Prob. Density Functions (PDF)
Gaussian (Normal) Distribution
• Average = µ
• Variance = σ²
• Widely used because of the central limit theorem
• In ROOT:
  – TMath::Gaus(x, μ, σ, true)
  – ROOT::Math::normal_pdf(x, σ, μ)
  – TF1 f("f", "gausn", xmin, xmax);
  – x = gRandom->Gaus(μ, σ);
• N.B. "gausn" for a normalised (PDF) Gaussian
[Figure: Gaussian PDF, PDF(x) versus x for −5 ≤ x ≤ 5, with curves for µ=0 σ=0.3, µ=0 σ=1, µ=0 σ=3 and µ=−2 σ=1]
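For reference, the normalised Gaussian density is f(x) = exp(−(x−µ)²/(2σ²)) / (σ√(2π)). The sketch below evaluates it in plain C++ without ROOT. The function name gaussian_pdf is a hypothetical stand-in; its argument order (x, σ, µ) simply copies the ROOT::Math::normal_pdf call quoted on the slide.

```cpp
#include <cassert>
#include <cmath>

// Normalised Gaussian density with mean mu and standard deviation sigma.
// Hypothetical helper; the (x, sigma, mu) order copies ROOT::Math::normal_pdf.
double gaussian_pdf(double x, double sigma = 1.0, double mu = 0.0) {
    assert(sigma > 0.0);
    const double pi = std::acos(-1.0);
    const double z = (x - mu) / sigma;  // distance from the mean in units of sigma
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
}
```

The peak height is 1/(σ√(2π)), so narrower Gaussians are taller: about 0.40 for σ = 1 and about 1.33 for σ = 0.3.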
Central limit theorem
• The sum of n random variables x_i converges to a Gaussian as n grows, irrespective of the original distributions of the variables x_i (only some basic regularity conditions must hold)
  – ∑ x_i → Gaussian
  – Example: adding n flat distributions, for n = 2 and for n = 5 (x is uniform in [0,10])
[Figure: two histograms with Gaussian fits, horizontal axis from 0 to 10
  n = 2: χ²/ndf = 422.9 / 97, Constant = 190.8 ± 2.3, Mean = 4.989 ± 0.022, Sigma = 2.031 ± 0.015
  n = 5: χ²/ndf = 87.47 / 83, Constant = 306.4 ± 3.7, Mean = 5.011 ± 0.013, Sigma = 1.293 ± 0.009]
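The example can be reproduced with a small simulation. The fitted means near 5 and the 0 to 10 axis range suggest that the plots show the average of the n uniform values rather than their raw sum; the sketch below assumes that. It uses the C++ standard library generator instead of ROOT's gRandom, and the names average_of_uniforms and SampleStats are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <random>

struct SampleStats { double mean; double sigma; };

// Draw n_events values, each the average of n uniform numbers in [lo, hi],
// and return the sample mean and standard deviation of those averages.
// Assumption: the slide histograms show averages, not raw sums.
SampleStats average_of_uniforms(int n, int n_events, double lo, double hi,
                                unsigned seed) {
    assert(n > 0 && n_events > 1);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> flat(lo, hi);
    double sum = 0.0, sum2 = 0.0;
    for (int i = 0; i < n_events; ++i) {
        double avg = 0.0;
        for (int j = 0; j < n; ++j) avg += flat(gen);
        avg /= n;
        sum += avg;
        sum2 += avg * avg;
    }
    const double mean = sum / n_events;
    const double var = (sum2 - n_events * mean * mean) / (n_events - 1);
    return {mean, std::sqrt(var)};
}
```

For x uniform in [0,10] the expected width of the average is 10/√(12n), that is about 2.04 for n = 2 and about 1.29 for n = 5, close to the fitted Sigma values. The poor χ² for n = 2 reflects that the average of two flat variables is triangular, not yet Gaussian.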
Uniform ("flat") distribution
• Standard deviation: σ = (b − a)/√12, for x uniform in [a, b]
• Model for the position of rain drops, the time of cosmic ray passage, etc.
• Basic distribution for pseudo-random number generation
• In ROOT:
  – ROOT::Math::uniform_pdf(x, a, b)
  – x = gRandom->Uniform(a, b);
Cumulative Distribution
• Given a PDF f(x), the cumulative distribution is defined as F(x) = ∫_{−∞}^{x} f(x′) dx′
• If x is distributed according to f(x), the variable F(x) is uniformly distributed in [0,1]
• By inverting the cumulative distribution one can generate pseudo-random numbers according to any distribution
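The inversion method can be sketched for a case where the cumulative distribution inverts in closed form. The exponential density f(x) = λ exp(−λx) is chosen here purely as an illustration: it has F(x) = 1 − exp(−λx), so x = −ln(1 − u)/λ for u uniform in [0,1). The function names and the use of the C++ standard library generator (instead of ROOT's gRandom) are assumptions of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Inverse-transform sampling for f(x) = lambda * exp(-lambda * x), x >= 0.
// F(x) = 1 - exp(-lambda * x), hence x = -ln(1 - u) / lambda for uniform u.
double sample_exponential(std::mt19937& gen, double lambda) {
    assert(lambda > 0.0);
    std::uniform_real_distribution<double> flat(0.0, 1.0);
    const double u = flat(gen);       // uniform value playing the role of F(x)
    return -std::log1p(-u) / lambda;  // log1p(-u) = ln(1 - u), accurate for small u
}

// Sample mean of n_events generated values; it should approach 1/lambda.
double exponential_sample_mean(int n_events, double lambda, unsigned seed) {
    assert(n_events > 0);
    std::mt19937 gen(seed);
    double sum = 0.0;
    for (int i = 0; i < n_events; ++i) sum += sample_exponential(gen, lambda);
    return sum / n_events;
}
```

When F cannot be inverted analytically, the same idea still applies with a numerical inversion of F, at a higher cost per generated number.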
Example of Cumulative Distributions
• Probability density function
  – ROOT::Math::normal_pdf(x,σ,μ)
• Cumulative distribution and its complement (right tail integral)
  – ROOT::Math::normal_cdf(x,σ,μ)
  – ROOT::Math::normal_cdf_c(x,σ,μ)
• Inverse of the cumulative distributions (quantile distributions)
  – ROOT::Math::normal_quantile(p,σ)
  – ROOT::Math::normal_quantile_c(p,σ)
[Figure: three panels: normal_pdf versus x; normal_cdf and normal_cdf_c versus x, for −5 ≤ x ≤ 5; normal_quantile and normal_quantile_c versus p, for 0 ≤ p ≤ 1]
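The relation between these functions can be illustrated without ROOT. The sketch below writes the Gaussian cumulative distribution and its complement in terms of the complementary error function, and obtains the quantile by bisection on the cumulative distribution. The function names are hypothetical, and the bisection is a simple stand-in chosen for clarity; it is not a description of how ROOT implements normal_quantile.

```cpp
#include <cassert>
#include <cmath>

// Gaussian cumulative distribution: integral of the PDF from -infinity to x.
double gaussian_cdf(double x, double sigma = 1.0, double mu = 0.0) {
    assert(sigma > 0.0);
    return 0.5 * std::erfc(-(x - mu) / (sigma * std::sqrt(2.0)));
}

// Complement (right-tail integral), written with erfc directly so that
// very small tail probabilities do not suffer from the subtraction 1 - cdf.
double gaussian_cdf_c(double x, double sigma = 1.0, double mu = 0.0) {
    assert(sigma > 0.0);
    return 0.5 * std::erfc((x - mu) / (sigma * std::sqrt(2.0)));
}

// Quantile: the x for which gaussian_cdf(x) = p, found by bisection.
double gaussian_quantile(double p, double sigma = 1.0, double mu = 0.0) {
    assert(p > 0.0 && p < 1.0);
    double lo = mu - 40.0 * sigma, hi = mu + 40.0 * sigma;
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (gaussian_cdf(mid, sigma, mu) < p) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```

For instance, gaussian_quantile(0.975) is close to 1.96, the familiar two-sided 95% point of the unit Gaussian, and gaussian_cdf(x) + gaussian_cdf_c(x) equals 1 for any x.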
Poisson Distribution
• Probability to have n entries in x, a subset of X >> x
• Limit of binomial distribution when
p = x/X = 𝜈/N