Functional Data Analysis: An introduction
David Wooff
[email protected]
University of Durham
FDA – p. 1/42
Main steps in FDA
• Collect, clean, and organize the raw data.
• Convert the data to functional form.
• Explore the data through plots and summary statistics.
• Register the data, if necessary, so that important features occur at the same argument values.
• Carry out exploratory analysis, such as functional principal components analysis.
• Construct models, if appropriate.
• Evaluate model performance.
Example: Weather data
• Daily average temperature computed for N = 365 consecutive days at each of 35 Canadian weather stations.
• Are there common patterns of interest?
• Is there variation in amplitude (peakedness)?
• Is there variation in phase (timing of features)?
Raw data: Daily mean temperatures at 35 Canadian weather stations
[Figure: "Average daily temperature, 35 Canadian weather stations" — one curve per station; temperature (about −30 to 20) against day (0–365).]
Averaged data: Monthly mean of daily mean temperatures at 35 Canadian weather stations
[Figure: "Average monthly temperature, 35 Canadian weather stations" — one curve per station; temperature against month (1–12).]
Summary of visual features
• Clear that temperatures rise in the summer and fall in the winter.
• Differences in level: some very cold places in arctic Canada.
• Strong differences in amplitude.
• Coastal stations display little amplitude: cool winters and summers.
• Continental stations show high amplitude/peakedness: cold winters and hot summers.
• Some variation in phase: the peak for some (coastal?) stations is after the peak for others (continental?).
Basis construction
• Main idea is that we take each observed series $x_i(t)$ and approximate it by $\hat{x}_i(t)$ chosen from the same functional family.
• We represent a function $x$ by a linear expansion
  $$x(t) = \sum_{k=1}^{K} c_k \phi_k(t)$$
  in terms of $K$ known basis functions $\phi_k(t)$.
• $c_1, \dots, c_K$ are coefficients to be estimated.
• Let $n$ be the sample size, i.e. the number of observations in each series.
Basis construction
• We hope that a small $K$ leads to a reasonable fit and captures the essential characteristics, balanced against retaining enough residual degrees of freedom to allow computation of confidence intervals.
• Once we have the approximation, we have $x(t) = \hat{x}(t) + \epsilon(t)$, and observed residual series $\hat{\epsilon}_i(t) = x_i(t) - \hat{x}_i(t)$, from which we may obtain a standard measure of quality of fit for an individual series:
  $$s_i^2 = \frac{1}{n-K} \sum_{j=1}^{n} \hat{\epsilon}_i(t_j)^2.$$
Basis choice
• Several possibilities: B-splines, exponential, polynomial ($\phi_k = t^k$). B-splines are mostly used for aperiodic data.
• Reasonable to assume that the weather data are periodic.
• A reasonable choice is the Fourier basis system:
  $$\hat{x}(t) = c_1 + c_2 \sin \omega t + c_3 \cos \omega t + c_4 \sin 2\omega t + c_5 \cos 2\omega t + \dots,$$
  where $\omega = 2\pi/T$; here $T = 12$.
• Here we choose $K = 5$; for a Fourier basis system we choose an odd number to capture variation in phase, i.e. we require sine/cosine pairs.
• Various ways of estimating $c_1, \dots, c_K$: the simplest is ordinary least squares (used for this example), but other approaches, e.g. a roughness penalty, may be superior.
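As a concrete sketch (not the slides' own R/fda code), the OLS fit of a Fourier basis can be written in a few lines of Python/NumPy. The monthly grid, the period T = 12, and the toy series below are assumptions for illustration:

```python
import numpy as np

def fourier_basis(t, K, T=12.0):
    """Evaluate K Fourier basis functions (constant plus sine/cosine pairs) at times t."""
    omega = 2 * np.pi / T
    cols = [np.ones_like(t)]
    for m in range(1, (K - 1) // 2 + 1):
        cols.append(np.sin(m * omega * t))
        cols.append(np.cos(m * omega * t))
    return np.column_stack(cols)

def fit_fourier_ols(t, x, K=5, T=12.0):
    """Ordinary least squares estimate of the coefficients c_1, ..., c_K."""
    Phi = fourier_basis(t, K, T)
    c, *_ = np.linalg.lstsq(Phi, x, rcond=None)
    return c

# toy series: an annual cycle sampled at 12 monthly time points, plus noise
t = np.arange(1, 13, dtype=float)
rng = np.random.default_rng(0)
x = 5.0 + 10.0 * np.sin(2 * np.pi * t / 12.0) + rng.normal(0, 0.1, size=12)
c = fit_fourier_ols(t, x)
xhat = fourier_basis(t, 5) @ c
```

On an equally spaced grid covering a full period the basis columns are orthogonal, so the estimated coefficients recover the constant and first-harmonic amplitudes of the toy signal almost exactly.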
The basis curves, $\hat{x}(t)$, for the weather data
[Figure: "Average monthly temperature: 35 Canadian weather stations, Fourier basis, K = 5" — fitted temperature curves against month (1–12).]
Residual plots, $x_i(t) - \hat{x}_i(t)$, $i = 1, \dots, 35$, for $K = 5$.
[Figure: "Raw data minus basis representation, Fourier basis, K = 5" — residual temperature (about −4 to 4) against month.]
Residual plots
• Estimated curves appear plausible.
• Should also explore the fitted curves: plot the raw residuals $\hat{\epsilon}_i(t) = x_i(t) - \hat{x}_i(t)$.
• Periodicity in the residuals is probably an artefact of the basis representation.
• Do we get a better answer if we choose more basis functions?
Residual plots for $K = 5$ and $K = 7$
[Figure: two panels, "Raw data minus basis representation", for a Fourier basis with K = 5 and with K = 7 — residual temperature against month.]
Residual plots
• Raw residuals decrease as we increase $K$ (of course), but the model is more complicated.
• A simple way of assessing the number of basis functions to use is to examine $s_i^2 = \frac{1}{n-K} \sum_j (\hat{x}_i(t_j) - x_i(t_j))^2$.
• For example, we can plot $\sum_i s_i^2$ versus $K$; note that $s_i^2$ is penalized by the number of basis functions used.
• Analogous to the usual measure of variability in LS regression, $\frac{1}{n-p} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
Plots of $\sum_i s_i^2$ versus $K$
[Figure: penalized RSS (about 60–140) against the number of basis functions, $K = 3, \dots, 9$.]
Number of basis functions
• The plot suggests that there is a big advantage in using $K = 5$ rather than $K = 3$.
• Not much extra advantage in using $K = 7$: we are starting to lose degrees of freedom.
• The pragmatic choice seems to be $K = 5$.
• This is a black art!
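The penalized-RSS comparison across K can be sketched as follows. The toy data (two harmonics plus noise, so that K = 3 underfits while K = 5 captures the signal) are an assumption for illustration:

```python
import numpy as np

def fourier_basis(t, K, T=12.0):
    """Constant plus sine/cosine pairs, as in the OLS fit sketched earlier."""
    omega = 2 * np.pi / T
    cols = [np.ones_like(t)]
    for m in range(1, (K - 1) // 2 + 1):
        cols.append(np.sin(m * omega * t))
        cols.append(np.cos(m * omega * t))
    return np.column_stack(cols)

def penalized_rss(t, X, K):
    """Sum over series of s_i^2 = RSS_i / (n - K) for a K-term Fourier fit.
    X holds one series (station) per row, one time point per column."""
    n = len(t)
    Phi = fourier_basis(t, K)
    C, *_ = np.linalg.lstsq(Phi, X.T, rcond=None)  # fit all series at once
    resid = X.T - Phi @ C
    return float(np.sum(resid ** 2) / (n - K))

# toy data: 35 series with first and second harmonics plus noise
rng = np.random.default_rng(1)
t = np.arange(1, 13, dtype=float)
signal = 10 * np.sin(2 * np.pi * t / 12) + 3 * np.cos(4 * np.pi * t / 12)
X = rng.normal(0, 5, size=(35, 1)) + signal + rng.normal(0, 1, size=(35, 12))
scores = {K: penalized_rss(t, X, K) for K in (3, 5, 7, 9)}
```

Plotting `scores` against K reproduces the pattern on the slide: a large drop from K = 3 to K = 5, then little further gain.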
Exploring derivatives
• Derivatives of the fitted curves can be explored and interpreted.
• The first derivative may suggest common landmarks.
• Plots of derivatives may suggest certain models, e.g. the differential equation $Dx = -\alpha(x - b_1)$ is consistent with an exponential model, $x(t) = b_1 + b_2 e^{-\alpha t}$, so a plot showing linear association between $x(t)$ and $x'(t)$ suggests the exponential model.
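Because the fitted curve is a linear combination of basis functions, its derivatives come from differentiating the basis analytically rather than differencing the data. A sketch, assuming the Fourier basis and coefficient vector from the earlier fit (the coefficients below are illustrative, not the weather-data estimates):

```python
import numpy as np

def fourier_basis(t, K, T=12.0):
    omega = 2 * np.pi / T
    cols = [np.ones_like(t)]
    for m in range(1, (K - 1) // 2 + 1):
        cols.append(np.sin(m * omega * t))
        cols.append(np.cos(m * omega * t))
    return np.column_stack(cols)

def fourier_basis_deriv(t, K, T=12.0, order=1):
    """Exact derivative of each basis function, so xhat^(order)(t) = Phi_deriv @ c."""
    omega = 2 * np.pi / T
    cols = [np.zeros_like(t)]  # derivative of the constant term is 0
    for m in range(1, (K - 1) // 2 + 1):
        a = m * omega
        if order == 1:
            cols.append(a * np.cos(a * t))        # (sin at)'  =  a cos at
            cols.append(-a * np.sin(a * t))       # (cos at)'  = -a sin at
        elif order == 2:
            cols.append(-a ** 2 * np.sin(a * t))  # (sin at)'' = -a^2 sin at
            cols.append(-a ** 2 * np.cos(a * t))  # (cos at)'' = -a^2 cos at
    return np.column_stack(cols)

c = np.array([1.0, 2.0, 0.5, -1.0, 0.3])  # illustrative coefficients
t = np.linspace(1.0, 12.0, 50)
xdot = fourier_basis_deriv(t, 5) @ c       # first-derivative curve on the grid
```

Zero crossings of `xdot` locate the minima and maxima discussed on the next slides.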
The basis curves, $\hat{x}'(t)$
[Figure: "Average monthly temperature: 35 Canadian weather stations, Fourier basis, K = 5, first derivative" — derivative curves against month.]
The basis curves, $\hat{x}''(t)$
[Figure: "Average monthly temperature: 35 Canadian weather stations, Fourier basis, K = 5, second derivative" — second-derivative curves against month.]
Derivatives plots
• The plot of the first derivative shows the locations of winter minima and summer maxima at $\hat{x}'(t) = 0$.
• These don't quite align, indicating that the change points differ according to location.
• The plot of the second derivative can highlight more subtle differences, e.g. the July shape is peaked for most locations but troughed for others.
• We may need to be careful when interpreting such features: they may simply be artefacts of the fitting process.
• Avoid post-hoc rationalization.
Plotting the mean curve, $\bar{x}(t) = \frac{1}{N} \sum_{i=1}^{N} \hat{x}_i(t)$
[Figure: "Average monthly temperature: mean curve, Fourier basis, K = 5" — mean temperature against month.]
Superimposing the mean curve on the estimated curves
[Figure: "Average monthly temperature: 35 Canadian weather stations, Fourier basis, K = 5, mean curve added".]
The deviations from the mean curve show quite a lot of variation beyond the mean
[Figure: "Deviations from the mean: 35 Canadian weather stations, Fourier basis, K = 5" — deviation curves $\hat{x}_i(t) - \bar{x}(t)$ against month.]
Plotting the standard deviation curve, $s_X(t) = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (\hat{x}_i(t) - \bar{x}(t))^2}$
[Figure: "Average monthly temperature: standard deviation curve, Fourier basis, K = 5" — $s_X(t)$ (about 4–9) against month.]
Mean curve with ad hoc SD envelope based on $\bar{x}(t) \pm \frac{1}{\sqrt{N}} s_X(t)$
[Figure: "Average monthly temperature: mean curve + standard deviation envelope, Fourier basis, K = 5".]
Interpretation
• Variation across the series is highest in the winter months and lowest in the summer.
• Can plot an ad hoc confidence envelope for the underlying mean curve, assuming these 35 stations to be a random sample from a superpopulation of weather stations.
• A prediction envelope for a new weather station would be much wider.
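The mean curve, SD curve, and ad hoc envelope are simple pointwise computations once the fitted curves are evaluated on a common grid. A minimal sketch; the toy stand-in for the 35 fitted station curves is an assumption:

```python
import numpy as np

def mean_and_envelope(Xhat):
    """Pointwise mean curve and the ad hoc envelope xbar(t) +/- s_X(t)/sqrt(N).
    Xhat holds the N fitted curves evaluated on a common grid, one per row."""
    N = Xhat.shape[0]
    xbar = Xhat.mean(axis=0)
    s_X = Xhat.std(axis=0, ddof=1)  # the standard deviation curve
    half = s_X / np.sqrt(N)
    return xbar, xbar - half, xbar + half

# toy stand-in for the 35 fitted station curves on a monthly grid
rng = np.random.default_rng(2)
t = np.arange(1, 13)
Xhat = 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, size=(35, 12))
xbar, lower, upper = mean_and_envelope(Xhat)
```

The division by sqrt(N) is what makes this an envelope for the mean curve rather than a prediction envelope for a new station, which would use s_X(t) itself and so be much wider.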
Principal components analysis for functional data
• To extract the first functional PC we seek a further function, $\zeta_1 = \int \beta(s) x(s)\,ds$, which in some sense maximises variation over the space of interest.
• This requires choosing a weight function $\beta(s)$.
• This is analogous to standard PCA, in which we choose weights $\beta_1, \dots, \beta_n$ to maximise $\mathrm{var}\left(\sum_{i=1}^{n} \beta_i x_i\right)$.
• Subsequent PCs are $\zeta_2, \dots, \zeta_K$, with the extraction arranged so that the PCs are orthogonal: $\int \zeta_i(s) \zeta_j(s)\,ds = 0, \ \forall i \ne j$.
Principal components estimation
• The estimation of the PCs is usually carried out on centred data, and so the components may be considered as perturbations of the mean curve.
• The estimation procedure is quite complicated: see Ramsay & Silverman.
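When the curves are evaluated on a common grid, a workable stand-in for the full functional treatment in Ramsay & Silverman is PCA of the discretized curves via an SVD of the centred data matrix. A sketch under that assumption, with toy curves dominated by one mode of variation:

```python
import numpy as np

def functional_pca(Xhat, n_pc=4):
    """PCA of curves evaluated on a common grid, via SVD of the centred matrix.
    Columns of W are the discretized weight functions; scores are inner
    products of each centred curve with the weight functions."""
    Xc = Xhat - Xhat.mean(axis=0)             # centre: PCs perturb the mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_explained = s ** 2 / np.sum(s ** 2)
    W = Vt[:n_pc].T                           # grid-evaluated weight functions
    scores = Xc @ W                           # one row of scores per curve
    return W, scores, var_explained[:n_pc]

# toy curves: random amplitudes on an annual cycle, plus noise
rng = np.random.default_rng(3)
t = np.arange(1, 13)
amp = rng.normal(0, 5, size=(35, 1))
Xhat = amp * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, size=(35, 12))
W, scores, ve = functional_pca(Xhat)
```

With a single dominant mode, the first component absorbs most of the variance, mirroring the 89.1% figure for the weather data; the fda package's `pca.fd` handles the proper functional version.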
Monthly data PCA, raw components shown as weights
[Figure: four panels plotting each harmonic against month — PCA function 1 (89.1% of variability), PCA function 2 (8.5%), PCA function 3 (1.9%), PCA function 4 (0.4%).]
The first PC for the weather data
• $\zeta_1$ explains 89.1% of the variation. It contrasts winter months (high variation) with summer months (low variation), i.e. the weights placed on winter measurements are higher than the weights for summer measurements, and so lead to higher variability.
• Scores can be computed for each data series (weather station) as $\zeta_{1i} = \int \beta(s) x_i(s)\,ds$.
• Weather stations with high positive values of $\zeta_{1i}$ will have much warmer than average winters and warmer than average summers.
• Weather stations with high negative values of $\zeta_{1i}$ will have much colder than average winters and colder than average summers.
Subsequent PCs for the weather data
• $\zeta_2$ explains 8.5% of the variation. It contrasts warm winter months with cool summer months.
• Weather stations with high positive values of $\zeta_{2i}$ will have warmer than average winters and colder than average summers.
• Weather stations with high negative values of $\zeta_{2i}$ will have colder than average winters and warmer than average summers (for example, mid-continental stations).
• The remaining PCs appear consonant with random variation and explain little of the variation.
Re-adding the mean curve to the PCs
• An alternative visualization is obtained by plotting $\bar{x}(t) \pm \gamma \zeta_j$ for some appropriate value of $\gamma$.
• That is, we add the average to the PCs and so can see the implication of high positive and high negative values for each $\zeta(t)$.
• $\gamma$ can be chosen by inspection, to distinguish sufficiently between the curves; alternatively Ramsay et al. suggest a semi-automatic procedure for choosing $\gamma$.
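The perturbation display above is easy to sketch. The default choice of γ below (the RMS of the mean curve about its own level) is only an illustrative heuristic of my own, not the semi-automatic procedure of Ramsay et al.; the toy mean curve and component are also assumptions:

```python
import numpy as np

def mean_perturbations(xbar, zeta, gamma=None):
    """Curves xbar + gamma*zeta and xbar - gamma*zeta for plotting.
    If gamma is None, use a crude visual heuristic (an assumption, not
    Ramsay's rule): the RMS of xbar about its own mean level."""
    if gamma is None:
        gamma = float(np.sqrt(np.mean((xbar - xbar.mean()) ** 2)))
    return xbar + gamma * zeta, xbar - gamma * zeta

t = np.arange(1, 13)
xbar = 10 * np.sin(2 * np.pi * t / 12)   # toy mean curve
zeta = np.full(12, 1 / np.sqrt(12))      # toy unit-norm "level" component
upper, lower = mean_perturbations(xbar, zeta)
```

Plotting `upper` and `lower` alongside `xbar` shows what a high positive or high negative score on that component implies for a station's curve.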
Monthly data PCA: plotting components as perturbations of the mean
[Figure: four panels against month — PCA function 1 (89.1% of variability), 2 (8.5%), 3 (1.9%), 4 (0.4%) — each shown as the mean curve plus and minus a multiple of the harmonic.]
Interpreting the remaining PCs
• Using this visualization, we might now interpret $\zeta_3$ as a slight time-shift component: high values have slightly later summers and winters, and vice versa for low values.
• $\zeta_4$ is still hard to interpret, but might concern the timing of autumn and spring.
• It is also possible to carry out rotations of the PCs in order to improve interpretability.
• These can be orthogonal or oblique, as desired.
• Oblique rotations lead to correlated components, which is OK in my view, but . . .
• . . . rotation doesn't give appreciably better answers for these data.
Daily data PCA, $K = 65$. We don't seem to have lost much by smoothing from daily to monthly data.
[Figure: four panels against day (0–365) — PCA function 1 (88.4% of variability), PCA function 2 (8.5%), PCA function 3 (2%), PCA function 4 (0.5%).]
Plot of PC scores of 35 weather stations. Neighbouring stations have similar PC scores.
[Figure: scatter of station scores labelled by name (e.g. Resolute, Iqaluit, Vancouver, Winnipeg, Dawson). Horizontal axis, PC1: colder in the winter and colder in the summer. Vertical axis, PC2: colder in the winter and warmer in the summer.]
Registration
• The weather data have a natural starting and ending position. This is not so for other data.
• After transforming to functional form, it may be necessary to align the curves prior to further exploration.
• The process of aligning curves according to their principal features is called landmark registration.
• Example: 68 cross-sections of the ends of knee bones were taken; each datum is a bitmap image.
• 12 landmarks were identified as capturing the essential features.
• A smooth curve is fitted through the landmarks and standard FDA methods applied.
Warp functions
• One way of aligning functional objects is via a warping function.
• A warping function maps from $(0, T)$ to $(0, T)$ in the $x$ direction, but compresses part of the range whilst stretching the remainder.
• An example with warping parameter $\alpha$ is:
  $$h(x, \alpha) = \frac{e^{\alpha x} - 1}{e^{\alpha} - 1}, \quad \alpha \ne 0.$$
• $\alpha < 0$ deforms to the right and $\alpha > 0$ deforms to the left. We have $\lim_{\alpha \to 0} h(x, \alpha) = x$, so that a value of $\alpha = 0$ implies no warping.
• Each data object requires the warping parameter to be estimated.
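The exponential warping function is one line of code; the sketch below assumes the unit interval (T = 1), as in the porosity example that follows:

```python
import numpy as np

def warp(x, alpha):
    """h(x, alpha) = (exp(alpha*x) - 1) / (exp(alpha) - 1) on [0, 1].
    Fixes both endpoints and tends to the identity as alpha -> 0."""
    x = np.asarray(x, dtype=float)
    if abs(alpha) < 1e-12:
        return x  # limiting case: no warping
    return (np.exp(alpha * x) - 1.0) / (np.exp(alpha) - 1.0)

x = np.linspace(0.0, 1.0, 101)
warped = warp(x, -2.0)  # alpha < 0 pushes interior points to the right
```

Evaluating an observed curve at `warp(x, alpha)` instead of `x` shifts its features; estimating one `alpha` per data object aligns the features across objects.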
Warping applied to height in order to align oil reservoir porosities, $h(x, \alpha) = (e^{\alpha x} - 1)/(e^{\alpha} - 1)$
[Figure: four panels for location 117 with warp factors −2, −1, 1, and 2, each plotting actual and warped porosity (about 0.11–0.15) against elevation (0–1).]
Other techniques
• Canonical correlation: which modes of variability in two sets of curves are most closely associated?
• Discriminant analysis: determination of a function which separates two sets of curves.
• Functional linear models:
  • The response variable $x(t)$ is a function
  • At least one explanatory variable is a function
  • Both of the above
• Functional analysis of variance.
• Linear differential equations: principal differential analysis.
Summary
• Data treated as functions, with functional summaries and functional analogues to classical methods.
• Methods extend to more complicated objects such as shapes in many dimensions.
• We tend to work with smoothed versions of the data and not the raw data: this understates the actual variability.
• Powerful methods with much elegant theory.
• Needs some experience to make interpretations; there may be a danger of post-hoc rationalization.
• A useful exploratory tool. If the voice of the data is strong, it can also be a useful inferential tool.
• Used R and the fda package for computation and plots.
References and resources
• J.O. Ramsay & B.W. Silverman (2002) Applied functional data analysis: methods and case studies. New York: Springer. Contains a lot of examples, but little theory.
• http://www.stats.ox.ac.uk/~silverma/fdacasebook/ Data and some (Matlab, R, S+) code for the case studies book.
• J.O. Ramsay & B.W. Silverman (2005) Functional data analysis. New York: Springer. Second edition, much expanded.
• http://ego.psych.mcgill.ca/misc/fda/ Main FDA website, containing the R package and data sets.
• J.O. Ramsay, G. Hooker & S. Graves (2009) Functional data analysis with R and Matlab. New York: Springer.