MULTIVARIATE REGRESSION [PDF]

Keywords: data analysis, multivariate statistical methods, chemometrics; cluster analysis, linear regression, orthogonal

3 downloads 3 Views 7MB Size

Recommend Stories


Multivariate multiple regression
Learning never exhausts the mind. Leonardo da Vinci

Regression Analysis of Multivariate Fractional Data
Keep your face always toward the sunshine - and shadows will fall behind you. Walt Whitman

estimasi multivariate adaptive regression splines (mars)
Almost everything will work again if you unplug it for a few minutes, including you. Anne Lamott

pemodelan multivariate adaptive regression splines (mars)
Seek knowledge from cradle to the grave. Prophet Muhammad (Peace be upon him)

Bias-corrected inference for multivariate nonparametric regression
Open your mouth only if what you are going to say is more beautiful than the silience. BUDDHA

Testing Multivariate Adaptive Regression Splines (MARS)
I cannot do all the good that the world needs, but the world needs all the good that I can do. Jana

(IHSG) MENGGUNAKAN MULTIVARIATE ADAPTIVE REGRESSION SPLINES
I cannot do all the good that the world needs, but the world needs all the good that I can do. Jana

[PDF] Applied Logistic Regression
Never let your sense of morals prevent you from doing what is right. Isaac Asimov

Idea Transcript


Helsinki University of Technology Control Engineering Laboratory Espoo 2001

Report 125

MULTIVARIATE REGRESSION Techniques and tools Heikki Hyötyniemi

TEKNILLINEN KORKEAKOULU TEKNISKA HÖGSKOLAN HELSINKI UNIVERSITY OF TECHNOLOGY TECHNISCHE UNIVERSITÄT HELSINKI UNIVERSITE DE TECHNOLOGIE D´HELSINKI

Helsinki University of Technology Control Engineering Laboratory Espoo July 2001

Report 125

MULTIVARIATE REGRESSION Techniques and tools Heikki Hyötyniemi Abstract: Multivariate statistical methods are powerful tools for analysis and manipulation of large data sets. This report introduces the most important statistical tools that can be used for multivariate regression in a general framework.

Keywords: data analysis, multivariate statistical methods, chemometrics; cluster analysis, linear regression, orthogonal least squares, ridge regression, principal component analysis and regression, partial least squares, canonical correlation analysis and regression, independent component analysis, factor analysis; neural networks; subspace identification

Helsinki University of Technology Department of Automation and Systems Technology Control Engineering Laboratory

Distribution: Helsinki University of Technology Control Engineering Laboratory P.O. Box 5400 FIN-02015 HUT, Finland Tel. +358-9-451 5201 Fax. +358-9-451 5208 E-mail: [email protected] http://www.control.hut.fi/

ISBN 951-22-5587-1 ISSN 0356-0872

Picaset Oy Helsinki 2001

Preface In psychology, regression means degeneration, or return to a prior, lower level of development. And, indeed, speaking of statistical methods sounds somewhat outdated today | all those neural networks and fuzzy systems being now available. However, it has been recognized that the new soft computing methods are no panacea. These methods cannot nd models any better than the more conventional ones if there is not enough information available in the data to start with; on the other hand, many celebrated soft computing applications could have been solved with age-old methods | assuming that somebody were familiar with them. After all, neural networks can only operate on the statistical properties visible in the data. The starting point here is that when analysing complex systems, simple methods are needed. When the system under study is very complicated, one needs data analysis methods that are reliable and fast and that give possibility for closer analysis. Unfortunately, the statistical literature seems to be mathematically rather unpenetrable for normal engineers (for those that are not too ashamed to admit that). Researchers seem to represent the results in such sophisticated forms that the very simple and elegant ideas remain hidden | the problem is that the powerful methods may not be applied in practice. What is more, di erent methods have been developed in di erent research communities: It is diÆcult to see the connections between the methods when the approaches and notations di er. New methods are constantly being developed; there exists no one-volume book that would cover the whole eld in suÆcient detail. This text tries to show how simple the basic ideas are and how closely related di erent methods turn out to be. The approach is rather pragmatic, many proofs being omitted for readability (for more material, see, for example, 3 and 36). All the methods are presented in a homogeneous framework. One objective is to show that there is still room for new ideas and innovations. As the reader can see, there are still plenty of ideas waiting to be explored and exploited | one is very near to frontier science, maybe able to cross the boundary towards new discoveries. The theoretical methods are supported by Matlab routines. In addition to the printed version, this report is available in public domain in PostScript format. The Matlab Toolbox and the textual material can be accessed through the HTML page at the Internet address http://saato014.hut.fi/hyotyniemi/publications/01_report125.htm.

During this work nancial support was received from Nokia Research Center, Visual Communications Laboratory, and this support is gratefully acknowledged.

List of symbols The same variable names are used consistently in the theoretical discussion and in the accompanying Regression Toolbox for Matlab, if possible.

                   

A, B , C , D: Matrices determining a state-space system d: Dimension of a dynamic system i, j : Matrix and vector indices e, E : Measurement error vector and matrix, dimensions m  1 and k  m, respectively : State error, dimension d  1 f , F : Vector and matrix de ning a linear mapping g(): Any function (scalar or vector-valued) J (): Cost criterion

I , In : Identity matrix (of dimension n  n) k: Time index, sample number (given in parentheses) K : Kalman gain m: Dimension of the output space M : Arbitrary matrix (or vector) n: Dimension of the input space N : Dimension of the latent space / Number of levels or substructures P : Error covariance matrix, dimension d  d p(): Probability density

R: Covariance matrix (or, more generally, association matrix) u, U : Input vector and matrix, dimensions n  1 and k  n, respectively v, V : Combined data vector and matrix (x and y together), dimensions m + n  1 and k  m + n, respectively

 w, W : Weight vector and matrix, respectively  x, X : Data vector and matrix, dimensions n  1 and k  n, respectively. In the case of a dynamic system, state vector and matrix, dimensions d  1 and k  d, respectively.  y, Y : Output data vector and matrix, dimensions m  1 and k  m, respectively

 z, Z :

Latent data vector and matrix, dimensions N  1 and k  N , respectively

    

 ,  : Arbitrary scalars , : Vector of eigenvalues and eigenvalue matrix, respectively , : Lagrange multipliers , : Reduced base, dimensions n  N and m  N , respectively

, : Matrices of data basis vectors, dimensions n  n and m  m, respectively

Notations  M: Unprocessed data  M : Mean value of M (columnwise mean matrix if M is matrix)  M^ : Estimate of M  M~ : Error in M  M 0 : M modi ed (somehow)  M i : Power of M / Level i data structure  Mi : Column i for matrix M / Element i for vector-form M  M T : Transpose of M  M y : Psedoinverse of M (for de nition, see page 18)  EfM g: Expectation value of M  M , M : Independent testing data or run-time data   M1 M2 : Partitioning of a matrix  M : Matrix dimensions ( rows,  columns)  test

est

Abbreviations       

CCA/CCR: Canonical Correlation Analysis/Regression (page 94) CR: Continuum Regression (page 89) CA: Cluster Analysis (page 25) DA or FDA: (Fisher) Discriminant Analysis (page 28) EIV: Error In Variables model (page 60) FOBI: Fourth-Order Blind Identi cation (page 105) GHA: Generalized Hebbian Algorithm (page 126)

          

ICA/ICR: Independent Component Analysis/Regression (page 99) MLR: Multi-Linear Regression (page 57) NNR: Neural Networks based Regression (page 117) OBW: \OÆcial Bullshit Warning" OLS: Orthogonal Least Squareas (page 66) PCA/PCR: Principal Component Analysis/Regression (page 71) PLS: Partial Least Squares regression (page 87) RR: Ridge Regression (page 67) SOM: Self-Organizing Map (page 114) SSI: SubSpace Identi cation (page 129) TLS: Total Least Squares (page 61)

Contents I Theoretical Toolbox

7

1 Introduction to Multivariate Modeling

9

1.1 About systems and models . . . . . 1.2 Paradigms and practices . . . . . . . 1.3 About mathematical tools . . . . . . 1.3.1 Challenge of high dimension . 1.3.2 About matrices . . . . . . . . 1.3.3 Optimization . . . . . . . . . 1.3.4 Lagrange multipliers . . . . .

. . . . . . .

2 About Distributions

2.1 Gaussian distribution . . . . . . . . . . 2.1.1 About distribution parameters 2.1.2 Why linear models? . . . . . . 2.2 Mixture models . . . . . . . . . . . . . 2.2.1 On \natural data" . . . . . . . 2.2.2 Outliers . . . . . . . . . . . . . 2.3 Cluster analysis . . . . . . . . . . . . . 2.3.1 K-means algorithm . . . . . . . 2.3.2 EM algorithm . . . . . . . . . . 2.3.3 Fisher discriminant analysis . . 2.4 Histogram equalization . . . . . . . . .

3 Role of Knowledge 3.1 From intuition to information . . 3.1.1 Some philosophy . . . . . 3.1.2 About a priori structure . 3.1.3 Structuring the data . . . 1

. . . .

. . . .

. . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

9 11 11 11 12 16 17

21

21 22 23 23 24 25 25 25 26 28 30

33 33 34 34 35

2

CONTENTS

3.2

3.3

3.4 3.5 3.6

3.1.4 Experiment design . . . . . . . . . . . Data preprocessing . . . . . . . . . . . . . . . 3.2.1 Selection of variables . . . . . . . . . . 3.2.2 Missing data . . . . . . . . . . . . . . 3.2.3 \Operating point" . . . . . . . . . . . 3.2.4 Data scaling . . . . . . . . . . . . . . . Model construction . . . . . . . . . . . . . . . 3.3.1 Mathematics vs. reality . . . . . . . . 3.3.2 Analysis vs. synthesis . . . . . . . . . Postprocessing . . . . . . . . . . . . . . . . . 3.4.1 Evaluation criteria . . . . . . . . . . . Summary: Modeling procedures . . . . . . . . Case studies . . . . . . . . . . . . . . . . . . . 3.6.1 Analysis of the paper machine dry line 3.6.2 Modeling of otation froth . . . . . .

4 \Quick and Dirty" 4.1 Linear regression model . . . . . 4.1.1 Least-squares solution . . 4.1.2 Piece of analysis . . . . . 4.1.3 Multivariate case . . . . . 4.2 Modi cations . . . . . . . . . . . 4.2.1 Error in variables . . . . . 4.2.2 Total least squares . . . . 4.3 Collinearity . . . . . . . . . . . . 4.4 Patch xes . . . . . . . . . . . . 4.4.1 Orthogonal least squares . 4.4.2 Ridge regression . . . . .

. . . . . . . . . . .

5 Tackling with Redundancy 5.1 Some linear algebra . . . . . . . . . 5.1.1 On spaces and bases . . . . 5.1.2 About linear mappings . . . 5.1.3 Data model revisited . . . . 5.2 Principal components . . . . . . . 5.2.1 Eigenproblem properties . . 5.2.2 Analysis of the PCA model

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36 37 37 38 40 40 43 43 44 46 46 47 48 48 51

57 57 57 58 60 60 60 61 64 66 66 67

71 71 71 72 73 75 76 77

3

CONTENTS

5.2.3 Another view of \information" . . . . 5.3 Practical aspects . . . . . . . . . . . . . . . . 5.3.1 Regression based on PCA . . . . . . . 5.3.2 Selection of basis vectors . . . . . . . . 5.3.3 Analysis tools . . . . . . . . . . . . . . 5.3.4 Calculating eigenvectors in practice . . 5.4 New problems . . . . . . . . . . . . . . . . . . 5.4.1 Experiment: \Associative regression"

6 Bridging Input and Output

6.1 Partial least squares . . . . . . . . . . . . . 6.1.1 Maximizing correlation . . . . . . . 6.2 Continuum regression . . . . . . . . . . . . 6.2.1 On the correlation structure . . . . . 6.2.2 Filling the gaps . . . . . . . . . . . . 6.2.3 Further explorations . . . . . . . . . 6.3 Canonical correlations . . . . . . . . . . . . 6.3.1 Problem formulation . . . . . . . . . 6.3.2 Analysis of CCA . . . . . . . . . . . 6.3.3 Regression based on PLS and CCA . 6.3.4 Further ideas . . . . . . . . . . . . .

7 Towards the Structure

7.1 Factor analysis . . . . . . . . . . . . . 7.2 Independent components . . . . . . . . 7.2.1 Why independence? . . . . . . 7.2.2 Measures for independence . . 7.2.3 ICA vs. PCA . . . . . . . . . . 7.3 Eigenproblem-oriented ICA algorithms 7.3.1 Data whitening . . . . . . . . . 7.3.2 Deformation of the distribution 7.3.3 Further explorations . . . . . . 7.4 Beyond independence . . . . . . . . . 7.4.1 Sparse coding . . . . . . . . . .

8 Regression vs. Progression

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

79 80 80 81 82 83 84 84

87

87 87 89 89 90 92 94 94 95 96 97

99

100 101 101 101 102 102 104 105 107 109 110

113

8.1 Neural clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 8.1.1 Self-organizing maps . . . . . . . . . . . . . . . . . . . . . 114

4

CONTENTS

8.1.2 \Expectation Maximizing SOM" 8.1.3 Radial basis function regression . 8.2 Feedforward networks . . . . . . . . . . 8.2.1 Perceptron networks . . . . . . . 8.2.2 Back-propagation of errors . . . 8.2.3 Relations to subspace methods . 8.3 \Antropomorphic models" . . . . . . . . 8.3.1 Hebbian algorithms . . . . . . . 8.3.2 Generalized Hebbian algorithms 8.3.3 Further extensions . . . . . . . .

9 Application to Dynamic Models

9.1 State-space models . . . . . . . 9.2 Subspace identi cation . . . . . 9.2.1 Determining the state . 9.2.2 Finding system matrices 9.3 Stochastic models . . . . . . . . 9.3.1 Discussion . . . . . . . .

. . . . . .

10 Relations to Systems Engineering

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

10.1 MIMO vs. SISO systems . . . . . . . . . . . . . 10.2 Dimension reduction . . . . . . . . . . . . . . . 10.2.1 About state-space systems . . . . . . . . 10.2.2 Motivating (?) experiments . . . . . . . 10.2.3 Balanced realizations . . . . . . . . . . . 10.2.4 Eliminating states . . . . . . . . . . . . 10.3 State estimation . . . . . . . . . . . . . . . . . 10.3.1 Kalman lter . . . . . . . . . . . . . . . 10.3.2 Optimality vs. reality . . . . . . . . . . 10.3.3 Reducing the number of measurements . 10.4 SISO identi cation . . . . . . . . . . . . . . . . 10.4.1 Black-box model . . . . . . . . . . . . . 10.4.2 Recursive least-squares algorithm . . . . 10.4.3 Structure of dynamic data . . . . . . . . 10.4.4 Further analysis: System identi ability .

II Practical Toolbox

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

115 116 117 118 119 122 124 124 126 126

129

129 130 130 132 133 134

137

137 138 138 140 141 144 145 145 146 147 148 148 149 151 152

161

CONTENTS

A Good Book Must Have Obscure Quotes A theory has only the alternative of being wrong. A model has a third possibility | it might be right but irrelevant. | M. Eigen Statistics in the hands of an engineer are like a lamppost to a drunk | they're used more for support than illumination. | B. Sangster Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners. | G. O. Ashley Maske the model as simple as possible | but not simpler! | A. Einstein There are three kinds of lies: lies, damned lies, and statistics. | B. Disraeli Torture the data long enough and they will confess to anything. | unknown

5

6

CONTENTS

Part I

Theoretical Toolbox

7

Lesson 1

Introduction to Multivariate Modeling Statistics tell the biggest lies, everybody knows that! | However, this is not necessarily true. It all depends on the user who is interpreting the results: If the statistical methods are applied appropriately, by somebody who understands their properties, excellent results are often reached. Gaining this understanding is the objective of this report.

1.1 About systems and models The role of a model is to organize information about a given system. There are various questions that need to be answered when a model is being constructed | the selection of the modeling principle being perhaps the most signi cant, a ecting all subsequent analysis. First, one can start from the physical rst principles, constructing the model in the bottom-up fashion. However, this modeling style is very knowledge-intensive, and the scarcity of the domain-area experts is an ever increasing problem. And, what is more, the larger the system is, the more probable it is that such a model will not really be suited for real use, for analysis of the system behavior or for control design. To understand this, see Fig. 1.1: What is the dimension of this constant-volume system (what is the number of independent system states)? The problem with qualitative analyses is that the relevance of di erent constructs is not at all judged. For example, assume that one of the tanks is negligible, so that its volume is small as compared to the other two tanks: This tank does not contribute in the system behavior, and the nominal three-state model is unnecessarily complex for practical use. On the extreme, all the system time constants may be negligible as compared to the sampling interval, and, seemingly, the system becomes static with no dynamics whatsoever. On the other hand, assume that the mixing is not perfect: If the tanks are not ideal mixers, the one-state-per-tank approach is no more suÆcient, and one ends in a partial di erential equation model. The relevant dimension of the system can be anything between zero and in nity! 9

10

Lesson 1. Introduction to Multivariate Modeling C0 C1

C2

C3

Figure 1.1: What is the dimension of the dynamic system? As compared to bottom-up methods, the multivariate statistical methods operate in the top-down fashion: Plenty of data is collected and one tries to nd the relevant system properties underlying the measurements. Instead of being knowledge-intensive, multivariate methods are data-intensive. The eld of systems engineering is getting wider and wider, and the systems to be modeled are getting more and more complicated and less structured (see Fig. 1.2). All these trends necessitate data-oriented approaches in both ends of the systems continuum. Another fact is that totally new branches of research | for example, data mining and microactuators must be based on massively parallel analysis and modeling methods. “Black box” Psychological systems

eo

Prediction

Mechanical systems Electric circuits

Analysis

el

em

yst

fs

eo

Chemical systems

od

p Ty

Biological systems

fm

Economic systems

Us

Speculation

Social systems

Control Design

“White box”

Figure 1.2: Spectrum of systems to be modeled

11

1.2. Paradigms and practices

MLR PCA PLS CCA FA ICA SSI NNR

Status

Community

Well known

All Almost all

\Try and pray"!

Chemometrics

\Forgotten"

Statistics

Well known

Social sciences

Emerging

Neural networks

Emerging

Control theorists

Mature

Heuristic

Almost all!

Keywords

Curve tting Data compression Calibration models Canonical variates Hidden factors Pattern recognition System identi cation Learning synapses

Era        

1800's { 30's { 70's { 60's { 30's { 90's { 90's { 80's {

Figure 1.3: Simpli ed view over the eld of multivariate analysis

1.2 Paradigms and practices The wide and heterogeneous eld of modern multivariate methods can brie y (and informally) be presented as shown in Fig. 1.3. For the notations, see page 6. The di erent elds are somewhat diÆcult to compare against each other | researchers in di erent disciplines may have di erent objectives and application elds, and certainly they do have di ering terminology and notations. The methods are still in turmoil and their overall relevance has not yet been generally understood. An illustration of this heterogeneity is the fact that there seems not to exist an engineering level treatment of all these methods in a common framework. This report tries to do that: To o er a homogeneous view over the various disciplines in multivariate analysis.

1.3 About mathematical tools 1.3.1 Challenge of high dimension In multivariate framework the structural complexity of the system is changed into a problem of high dimensionality: It is assumed that when one includes enough measurement data in the model, arbitrary accuracy concerning the system behavior can be achieved. One needs to keep in mind the lesson learned in \Flatland" [1]: Assume that there existed two-dimensional creatures, being only able to recognize their environment in two dimensions. Now, assume that a three-dimensional ball goes through this two-dimensional world | what the creatures perceive, is a point coming from nowhere, expanding into a circle, and nally again contracting into a dot before vanishing. This circle can emerge anywhere in the two-dimensional space. How can the creatures understand what happened? The answer is that this behavior exceeds the capacity of those creatures, there is no way they could really understand it. The point is that we are bound to the three-dimensional world | the behavior of higher-dimensional objects similarly exceeds our capacity. This is the basic problem when trying to make multivariate methods comprehen-

12

Lesson 1. Introduction to Multivariate Modeling

sible: The phenomena emerging in higher dimensions cannot really be visualized as two-dimensional plots (or even three-dimensional wire models). When more sophisticated methods are analyzed, illustrations do not help much. On the other hand, mathematics is a robust tool for these purposes. Additionally, if there are some special constraints posed on the structure of the high-dimensional problem (typically, linearity will be assumed), essentially the same methodologies work no matter what is the dimension of the problem. Linear algebra is the theoretical and matrix calculus is the practical language for concisely discussing high-dimensional phenomena, and good understanding of them is vital.

Figure 1.4: Cover page of E. Abbott's \Flatland"

1.3.2 About matrices The principles of matrix calculus are not repeated here (for more information, see, for example, [2] or [10]); however, let us repeat what are eigenvalues and eigenvectors, what are singular values, and how matrix-form expressions are di erentiated and how their extrema can be found. The discussion is restricted to real-valued matrices.

Eigenvalues and eigenvectors It turns out that, rather astonishingly, most of the major regression methods can be presented in a homogeneous framework; this is the framework of the so called eigenproblem. A square matrix M of dimension n  n generally ful lls the following formula for some  and :

M   =   :

(1.1)

Here,  is a scalar called eigenvalue and  is a vector called eigenvector. This means that the eigenvector directions are, in a sense, \natural" to the matrix M :

13

1.3. About mathematical tools

Vectors in this direction, when multiplied by M , are only scaled, so that their direction is not changed (later, when speaking of linear mappings, this property will be elaborated on further). From the construction, it is clear that if  is eigenvecter, then  also is, where is an arbitrary scalar. For uniqueness, from now on,pit will be assumed that the eigenvectors are always normalized, so that k k =  T  = 1 (this constraint is automatically ful lled by the eigenvectors that are returned by the eig function of Matlab, for example). In those cases that we will study later the matrices will be non-defective by construction (however, see the exercise); this means that there will exist n distinct eigenvectors ful lling the expression (1.1). These eigenvectors i and corresponding eigenvalues i , where 1  i  n, can be compactly presented as matrices  and , respectively: 0

 = 1

   n



and

=B @

1 0

...

0

n

1 C A;

(1.2)

where the dimension of matrices  and  is n  n. Using these notations, it is easy to verify that the n solutions to the eigenproblem (1.1) can now be expressed simulateneously in a compact matrix form as

M   =   :

(1.3)

In those cases that we will be later studying, the eigenvectors are linearly independent,  has full rank and the matrix M is diagonalizable: The above expression equals

  = ;

(1.4)

M =      1;

(1.5)



1M

or

so that M is similar to a diagonal matrix consisting of its eigenvalues. One of the specially useful properties of the above eigenvalue decomposition is due to the following: 

Mi =      1 i 1      =  |     {z i times

i 1 =  0  i 1 0 . . = B @ . 0 in

1

}

(1.6)

1 C A

  1:

That is, calculation of matrix powers reduces to a set of scalar powers. From this it follows that all matrix functions determined as power series can be calculated using their scalar counterparts after the matrix eigenstructure has been

14

Lesson 1. Introduction to Multivariate Modeling

determined. On the other hand, it is interesting to note that matrix functions cannot a ect the matrix eigenvectors. The eigenvalues can be determined also as the roots of the characteristic equation detf  I

M g = 0:

(1.7)

Even though the eigenvalues should not be calculated this way, the roots of highorder polynomials being numerically badly behaving, some theoretical properties can easily be proven in this framework. For example, if a matrix of a form q  I , where q is scalar, is added to the matrix M , all of the eigenvalues are shifted by that amount: detf( q)  I

M g = 0:

(1.8)

The properties of the eigenvalues and eigenvectors will be discussed more when we know more about the properties of the matrix M .

Singular value decomposition The eigenvalue decomposition is de ned only for square matrices (and not even all square matrices matrices can be decomposed in such a way). The generalization, the singular value decomposition (SVD), on the other hand, is de ned for all matrices M :

M =     T :

(1.9)

Here  and are orthogonal square matrices, so that T  = I and T = I , and  is a diagonal matrix of singular values. Note that  does not need to be square; if M has dimension  times  , where  >  (and analogous results are found if  <  ), the decomposition looks like 0 B B B B M =  B   B B @



1 0

...

0

0

1

C C C T  C C : C   C A



(1.10)

The singular values i are characteristic to a matrix; they are positive real numbers, and it is customary to construct  so that they are ordered in descending order. The singular values are close relatives of eigenvalues: Note that, because of the orthogonality of  and there holds 0

M T M =  T   T =  B @

12 0

...

0

2

1 C A

 T

(1.11)

15

1.3. About mathematical tools

and

0

MM T =   T  T =  

B B B B B B B @

12 0

1

0

...

0

2

0

0

C C C C C C C A

 T :

(1.12)

These are eigenvalue decompositions of M T M and MM T , respectively; this means that the (non-zero) eigenvalues of M T M (or MM T ) are squares of the singular values of M , the corresponding eigenvectors being collected in (or , respectively). Generalization of functions to non-square matrices can be based in the following matrix power de nition, following the idea of (1.6): 0

Mi =  

B B B B B B B @

1i 0

1

0

...

C C C i C C C C A

0

 T :

(1.13)

Matrix di erentiation Corresponding to the di erentiation with respect to a scalar, we can di erentiate a scalar-valued function f () with respect to a vector; the result is a vector called gradient. Assume that the function f : R ! R is being di erentiated: 0

d dz

f (z ) .. . d f (z ) d z

d f (z ) = B @ dz

1

1

C A:

(1.14)

Note that now we choose that gradients to be column vectors (in literature, this is not always the case). Assuming that the matrix M has dimension    , its row dimension being compatible with z , so that there exists

zT M =

 P 

i=1 zi Mi1



P

i=1 zi Mi



;

the di erentiation can be carried out columnwise: 0

d dz

zT M



= =

d P z M i=1 i i1 d z B . B @

1

..

 ...

d P z M i=1 i i1 1   0 d z M11    M1 B @

= M:

.. . . . . .. . M 1    M

C A

(1.15) 1 d P z M i=1 i i C dz 1

d d z

.. . P

 zM i=1 i i

C A

(1.16)

16

Lesson 1. Introduction to Multivariate Modeling

This is how one would expect a linear matrix function to behave. On the other hand, it turns out that if z is multiplied from the other side, the matrix has to be transposed:  d M T z = M: dz

(1.17)

Thus, using the product di erentiation rule, d dz

z T Mz



d dz





z T Mz + ddz zT Mz z=z M + M T z:

= =

(1.18)

Here, M must have dimension    ; z is assumed to be a (dummy) constant with respect to z . For symmetric M the above coincides with 2Mz , something that looks familiar from scalar calculus. It turns out that more sophisticated di erentiation formulas are not needed in what follows.

1.3.3 Optimization Matrix expressions can be optimized (minimized or maximized) similarly as in the scalar case | set the derivative to zero (this time, \zero" is a vector consisting of only zeros):

d J (z ) = 0: dz

(1.19)

For matrix functions having quadratic form (like xT Ax) the minima (maxima) are unique; this is the case in all optimization tasks encountered here. For an extremum point to be maximum, for example, the (hemo) Hessian matrix must be negative semide nite:  T d2 d d J (z ) = J (z ) d z2 dz dz

 0:

(1.20)

Note the transpose; the other gradient must be written as a row vector, so that the nal result would be a square matrix. Here \ n, generally no exact solutions can be found. To nd the best approximation, the model needs to be extended to include the modeling errors as

Yi = X  Fi + Ei ;

(4.3)

where Ei is a k  1 vector containing the reconstruction error for each measurement sample k. The errors can be solved from (4.3) as Ei = Yi XFi ; these errors should be somehow simultaneously minimized. It turns out that the easiest way to proceed is to minimize the sum of error squares. The sum of the squared errors can be expressed as

EiT Ei = (Yi XFi )T (Yi = YiT Yi YiT XFi

XFi ) FiT X T Yi + FiT X T XFi :

(4.4)

This (scalar) can be di erentiated with respect to the parameter vector Fi : 

d EiT Ei = 0 X T Yi d Fi

X T Yi + 2X T XFi :

(4.5)

Because 

d2 EiT Ei = 2X T X > 0; d Fi2

(4.6)

this extremum is minimum, and because the extremum of a quadratic function is unique, setting the derivative to zero (vector) gives the unique optimum parameters: 2X T Yi + 2X T XFi = 0;

(4.7)

resulting in

Fi = X T X

 1

X T Yi :

(4.8)

The estimate for yi is found as

y^ ;i = FiT x = YiT X X T X est

est

 1

x : est

(4.9)

4.1.2 Piece of analysis In what follows, some theoretical analyses that can be used to evaluate the above least squares model are presented. More analysis can be found in various sources, for example, in [34].

59

4.1. Linear regression model

Model consistency First, study the consistency of the least squares model. According to (4.3), assuming that the only available information about the output, the really measured Y~ = Y + Ei is used in (4.9) for model construction, the expression for the estimate becomes 

y^ ;i = (Yi + Ei )T X X T X 1  x  = YiT X X T X 1  x + EiT X X T X = y ;i + EiT X  X T X 1  x : est

est

1

est

est

x

(4.10)

est

est

Only assuming that X and Ei are uncorrelated, so that EiT X = 0, this gives an unbiased estimate for yi . The opposite case | the noise being correlated | is studied in 4.2.1.

Parameter sensitivity The reliability of the regression model (4.8) can be approximated, for example, by checking how much the parameters vary as there are stochastic variations in Yi . The parameter vector covariance matrix becomes EfFi FiT g = E = = =



XT X 

 1

X T Yi 



XT X

 1



X T Yi

T 



X T X  1 X T  E Yi YiT  X X T X 1 X T X  1 X T  varfyi ()gI  X X T X 1 X T X 1  varfyi ()g:

(4.11)

The parameter variance is also intimately related to the properties of the matrix X T X | see Sec. 4.3 for more analysis.

Measures of t To evaluate how well the regression model matches the training data, the so called R squared criterion can be applied: how much of the real output variance can be explained by the model. That is, one can calculate the quantity SS R2 = 1 ; (4.12) SS where, for the i'th output, E

T



The \error sum of squares" is de ned as SS = (Yi Y^i )T (Yi Y^i ) = (Yi XFi )T (Yi E



XFi ):

(4.13)

The \total sum of squares" (for zero-mean data) is

SS = YiT Yi : (4.14) So, R2 measures how much of the total variation in the output can be explained by the model. This quantity has value 1 if all the variation in the output can be exactly predicted, and lower value otherwise. T

60

Lesson 4. \Quick and Dirty"

4.1.3 Multivariate case If there are various output signals, so that m > 1, the above analysis can be carried out for each of them separately. When collected together, there holds 8 > > < > > :

F1 = .. . Fm =

XT X

 1

X T Y1

(4.15)

 X T X 1 X T Ym :

It turns out that this set of formulas can simultaneously be rewritten in a compact matrix form, so that

F = F1

   Fm



= XT X

 1

X T  Y1



   Ym :

(4.16)

This means that the multilinear regression (MLR) model from X to estimated Y can be written as

F

MLR

= XT X

 1

X T Y:

(4.17)

The MLR solution to modeling relationships between variables is exact and optimal in the sense of the least squares criterion. However, in Sec. 4.3 it will be shown that one has to be careful when using this regression method: In practical applications and in nonideal environments this MLR approach may collapse altogether. The problem is that trying to explain noisy data too exactly may make the model sensitive to individual noise realizations. In any case, in later chapters, the above MLR model is used as the basic engine to reach mappings between variables; the de ciencies of the basic approach are taken care of separately.

4.2 Modi cations There are various problems with the original MLR model, and di erent kinds of methods have been proposed for avoiding them. Even though the above MLR solution will be the standard approach to implementing the mapping between two sets of variables in later chapters, let us brie y study some alternatives.

4.2.1 Error in variables In (4.3) it is assumed that the error is added exclusively to the Y values. However, the X measurements can also contain uncertainty; the model matching problem becomes very di erent if both X and Y blocks are regarded as equally stochastic data values and errors in all variables should be taken into account (see Fig. 4.1). This assumption results in the so called Error In Variables (EIV) model. Let us study the MLR solution, assuming that, in addition to the errors Ey added to the Y matrix, there are also errors Ex added to X . Assume that Ex and Ey have zero mean, and they are mutually uncorrelated, and also uncorrelated

61

4.2. Modi cations

y

y

yi

yi

ei

ei

x

x

Figure 4.1: Normal least-squares matching principle, on the left, assuming that only the y variables contain noise, and the total least squares principle, assuming that errors in all variables are equally meaningful, on the right. For visualization purposes only one x and y variable is employed with X and Y , so that there holds ExT X = 0, EyT X = 0, and ExT Ey = 0, etc. Instead of Y = XF + E , the measurement data that is used for model construction is then

X~ = X + Ex ; and ~ + Ey = (X + Ex )  F + Ey = Y + Ex F + Ey ; Y~ = XF

(4.18)

where F is the unknown correct mapping matrix. If this data is used for MLR model construction, one has 

F~ = =

 1



X~ T X~ X~ T Y~   X T X + ExT Ex 1 X T Y + ExT Ex F ;

(4.19)

and, further, one can formally write the formula for the estimate 0

Y~ = X~ est

est

1 10

 @X T X + E| xT{zEx} A scale error

@X T Y

1

+ ExT Ex F A : | {z }

(4.20)

bias error

It is clear that if Ex is non-zero, there is a bias in the estimates caused by ExT Ex F . Additionally, it can be seen that the model parameters are typically more conservative than in the nominal MLR case (in (4.19) there is the additional positive de nite matrix ExT Ex also in the \denominator").

4.2.2 Total least squares One approach to attacking this bias problem in the EIV model is the Total Least Squares (TLS) algorithm1 [10]. The idea is to search for such a regression 1 there are other alternatives, like the Instrumental Variable (IV) method, etc., but they are not discussed here. Also, note that the more sophisticated regression methods that will be concentrated on in subsequent chapters, naturally include errors in the variables

62

Lesson 4. \Quick and Dirty"

hyperplane that when data points are orthogonally projected onto this plane, the (squared) distances reach minimum. To achieve this, some knowledge of analytic geometry is needed. Note that the analysis below is again carried out for only one of the outputs yi at a time; for notational simplicity, denote the corresponding regression vector as f = Fi . The regression formula (parameters unknown at this stage) can then be written as

yi = f1 x1 +    + fn xn :

(4.21)

We would like the output to be formally identical with the inputs | assume that the output to be modeled, yi , is included among the other input variables. Expression (4.21) can be presented in the form

yi

f1 x1

   fnxn = 0;

or, more generally, f 0 yi + f 0 x1 +    + f 0 xn = 0; y

n

1

(4.22) (4.23)

where all variables now have similar roles. De ning 0 B

v=B B @

yi x1 .. . xn

1 C C C; A

(4.24)

this can be written as vT f 0 = 0:

(4.25)

In other words, vectors f 0 and v must be orthogonal. The regression hyperplane spanned by the vectors v that lie on that plane is also uniquely characterized by its normal vector f 0 . Now, assume that f 0 is normalized, so that kf 0 k = 1; then the dot product e = vT f 0 directly tells the shortest distance from the point v to the regression hyperplane (for points lying exactly on the plane this measure, of course, giving 0). Indeed, if f 0 is normalized, the average of squared distances for a set of points v(1) to v(k) can be expressed as k k  1 X 1 X 1  e2 () =  f 0T v()  vT ()f 0 =  f 0T  V T V  f 0 ; k =1 k =1 k

(4.26)

where 

V = Yi X : kn+1

(4.27)

To minimize this with the requirement that the normal vector must be normalized, 1 0T T 0 Maximize k 0T f 0  V V  f (4.28) when f f = 1;

63

4.2. Modi cations

leads to the Lagrangian formulation (see page 17) where one has 

f (f 0 ) = k1  f 0T  V T V  f 0 ; g(f 0 ) = 1 f 0T f 0 :

when

(4.29)

The cost criterion becomes

J (f 0 ) =

1 0T T 0  f V V f + i (1 f 0T f 0 ): k

(4.30)

This results in 



d 1 0T T 0  f V V f + i (1 f 0T f 0 ) = 0; d f0 k

(4.31)

giving 1  2V T V  f 0 2i  f 0 = 0; k

(4.32)

1 T  V V  f 0 = i  f 0 : k

(4.33)

or

The distance minimization has become an eigenvalue problem with the searched normal vector f 0 being an eigenvector of the data covariance matrix R = k1 V T V . The eigenvectors of data covariance matrix are called principal components. Later, principal components will be discussed more (see next chapter). The solution is also given by some of the eigenvectors | but there are generally n + 1 of them, which one of them to choose? Look at the second derivative:

d2 J (f 0 ) 1 =  2V T V d f 02 k

2i  I:

(4.34)

To reach the minimum of J (f 0 ), there must hold d2 J (f 0 )=d f 02  0, that is, the second derivative matrix (Hessian) must be positively semide nite: For any vector f 0 there must hold

f 0T 



1  2V T V k

2i  I



 f 0 > 0:

(4.35)

Specially, this must hold for the eigenvector fj0 : 

fj0T  k1  2V T V 2i  I  fj0 = k1  2fj0T  V T V fj0 2i  fj0T fj0 = k1  2j  fj0T fj0 2i  fj0T fj0 = 2j 2i  0:

(4.36)

Because j can be any of the eigenvectors, the above can hold only for the eigenvector fi0 corresponding to the smallest eigenvalue.

64

Lesson 4. \Quick and Dirty"

The searched normal vector is also given by the smallest principal component of data V , meaning that it is the \null space" of data that determines the best dependency between variables! In a sense, this is understandable: if the smallest eigenvalue happens to be 0, there must be exact linear dependency between the variables. Remembering the de nition of the vector f 0 , the regression formula becomes f0 f0 yi = 10  x1    n0  xn : (4.37) fy fy For a multivariate system, the same analysis can be repeated for all outputs yi separately; note that the eigenproblem is generally di erent for all outputs. However, one needs to be careful: In the derivation yi was interpreted as any of the other input variables, meaning that it is not the output that was explicitly being explained (as is the case with MLR). This means that the TLS model not necessarily gives a good regression model for estimating the output. This method is also called \last principal component analysis"; you can compare this approach to the next chapters, where, astonishingly, the solutions (to di erent problems) are given in terms of the most signi cant principal components!

4.3 Collinearity The MLR regression model is optimal2 . In simple cases it is diÆcult to see why optimality is in contrast with usability. Today, when the problems to be modeled involve large amounts of poor data, the problems of MLR have become evident. The main problem plaguing MLR is caused by (multi)collinearity. What this means can best be explained using an example. Assume that we can observe two variables x1 and x2 , so that x = ( x1 x2 )T . Further, assume that these variables are not strictly independent; they can be written as x1 () =  () + 1 () and x2 () =  () + 2 (), where the sequences 1 () and 2 () are mutually uncorrelated, both having the same variance 2 . This can be interpreted so that we have two noisy measurements of the same underlying variable  , and together these measurements should give a more reliable estimate for it. Let us check what this collinearity of x1 and x2 means in practice. First, calculate the matrix X T X that has an essential role in the regression formula

XT X =



 k

P x2 () P kappa 1 x ()x ()  kappa2 1 2 2

Ef g +  Ef 2 g

P

x1 ()x2 () kappa P x2 () kappa  2 2 Ef g Ef 2 g + 2 :



(4.38)

To understand the properties of the regression formula, let us study the eigenvalues of the above matrix. It turns out that the solutions to the eigenvalue 2 of course, only in the least-squares sense; but, because of the mathematical bene ts, the same criterion will be applied later, too

65

4.3. Collinearity

Figure 4.2: Collinearity visualized in two dimensions equation 

det   I2

k



Ef 2 ()g + 2 Ef 2 ()g 2 Ef ()g Ef 2 ()g + 2



=0

(4.39)

are 

1 = 2k  Ef 2 ()g + k2 ; 2 = k2 :

and

(4.40)

The theory of matrices reveals that the condition number of a matrix determines its numerical properties | that is, the ratio between its largest and smallest eigenvalue dictates how vulnerable the formulas containing it are to unmodeled noise. As the condition number grows towards in nity the matrix becomes gradually uninvertible. In this case, the matrix X T X has the condition number condfX T X g = 1 + 2 

Ef 2 ()g ; 2

(4.41)

telling us that the smaller the di erence between the variables x1 and x2 is (2 being small), the higher the sensitivity of the regression formula becomes. The above result reveals that when using regression analysis, one has to be careful: It is the matrix X T X that has to be inverted, and problems with invertibility are re ected in the model behavior. There only need to exist two linearly dependent measurements among the variables in x, and the problem instantly becomes ill-conditioned. In practice, it may be extremely diÆcult to avoid this kind of \almost" collinear variables | as an example, take a system that has to be modeled using partial di erential equation (PDE) model (say, a rod that is being heated). PDE models are often called \in nite-dimensional"; that is, one needs very high number (in principle, in nitely many) measurements to uniquely determine the process state. It is not a surprise that temperature readings along the rod do not change rapidly, or nearby measurements deliver almost identical values, variables becoming linearly dependent; a regression model

66

Lesson 4. \Quick and Dirty"

trying to utilize all the available information becomes badly behaving. When aiming towards accuracy, the model robustness is ruined! To see an example of what collinear data looks like in a two-dimensional space, see Fig. 4.2: the data points in the gures are created using the above model, where Ef 2 ()g = 1:0 and 2 = 0:01, the sequences being normally distributed random processes. The data points seem to be located along a line; they do not really seem to \ ll" the whole plane. Intuitively, this is the key to understanding the ideas of further analyses in later chapters. The TLS approach by no means solves the above collinearity problem | on the contrary, even more severe problems emerge. Note that the last principal component essentially spans the null space of the covariance matrix, that is, if there is linear dependency among the variables, this dependency dominates in f 0 . Assuming that the linear dependency is between, say, input variables xi and xj , the parameters li and lj have high values, all other coeÆcients being near zero. Now, if (4.37) is applied, the parameter ly (having negligible numerical value) in the denominator makes the model badly conditioned. The main problem with TLS is that while solving a minor problem (error in variables), it may introduce more pathological problems in the model.

4.4 Patch xes Because of the practical problems caused by collinearity, various ways to overcome the problems have been proposed. In what follows, two of such propositions are brie y presented | more sophisticated analyses are concentrated on in next chapters.

4.4.1 Orthogonal least squares Because the basic source of problems in linear regression is related to inversion of the matrix X T X , one can try to avoid the problem by enhancing the numerical properties of this matrix. Intuitively, it is clear that if the input variables were mutually orthogonal, so that X T X = I , the numerical properties would be nice. Indeed, one can construct new variables Z so that this orthogonality holds using the so called Gram-Schmidt procedure: Corresponding to all indices 1  i  n, de ne Zi by

Zi0 = Xi

i 1 X j =1

XiT Zj  Zj ;

(4.42)

and normalize it, q

Zi = Zi0 = Zi0T Zi0 ;

(4.43) p

starting from Z1 = X1 = X1T X1 . These data manipulation operations can be presented in a matrix form

Z = X  M;

(4.44)

67

4.4. Patch xes

where M is an upper-triangular matrix3 . It is easy to see that there holds

ZiT Zj =



1; 0;

if i = j , and otherwise,

(4.45)

so that Z T Z = I . Using these intermediate variables one has the mapping matrix from Z to Y as

F = ZT Z

 1

Z T Y = Z T Y;

(4.46)

or returning to the original variables X , the Orthogonal Least Squares (OLS) formula becomes

F

OLS

= MZ T Y:

(4.47)

Of course, reformatting formulas does not solve the fundamental problems | the inversion of the matrix is implicitly included in the construction of M . However, it turns out that reorganizing the calculations still often enhances the numerical properties of the problem.

4.4.2 Ridge regression Ridge Regression (RR) is another (ad hoc) method of avoiding the collinearity problem | the basic idea is to explicitly prevent the covariance matrix from becoming singular. Ridge regression belongs to a large class of regularization methods where the numerical properties of the data | as seen by the algorithms | are somehow enhanced. The idea here is not to minimize exclusively the squared error, but to include weighting for parameter size in the optimization criterion, so that instead of (4.4), what is really minimized is

EiT Ei + FiT Qi Fi = YiT Yi YiT XFi

FiT X T Yi + FiT X T XFi + FiT Qi Fi ;

(4.48)

where Qi is a positive de nite weighting matrix. Di erentiation yields 

d EiT Ei = 0 X T Yi X T Yi + 2X T XFi + 2Qi Fi : dFi

(4.49)

Setting the derivative to zero again gives the optimum: 2X T Yi + 2X T XFi + 2Qi Fi = 0;

(4.50)

resulting in

Fi = X T X + Qi 3 Actually,

 1

X T Yi :

(4.51)

the so called QR factorization of X that is readily available, for example, in

Matlab, gives the same result (note that the resulting R matrix is the inverse of our M . The

inversions of the triangular matrix are, however, nicely conditioned)

68

Lesson 4. \Quick and Dirty"

q1 q2

Figure 4.3: The \virtual" distribution of collinear data as seen by the ridge regression algorithm for di erent values of q In the multi-output case, assuming that Qi = Q is the same for all outputs, one can compactly write

F

RR

 1

= XT X + Q

X T Y:

(4.52)

Usually there is no a priori information about the parameter values and the weighting matrix Q cannot be uniquely determined. The normal procedure is to let Q be diagonal; what is more, it is often chosen as Q = q  I , where q > 0 is a small number. This approach eÆciently prevents the matrix being inverted from becoming singular | when adding q  I to the matrix X T X , its all eigenvalues are shifted up by the amount q, so that originally zero eigenvalues will have numerical value q. Note that the same ridge regression behavior in standard MLR is achieved also if white noise with covariance q  I is added to data | this approach is often used, for example, when training neural networks. Just think of it: Noise is deliberately added to data to make it better behaving! There is an uneasy feeling of heuristics here, and something more sophisticated is clearly needed | alternatives are presented in the following chapters.

4.4. Patch xes

69

Computer exercises 1. Check how the MLR sensitivity is a ected when the data properties are changed; that is, try di erent values for the parameters k (number of samples), n (data dimension), dofx (true degrees of freedom), and x (deviation of the noise) below, and calculate the covariance matrix condition number: k = 20; n = 10; dofx = 5; sigmax = 0.001; X = dataxy(k,n,NaN,dofx,NaN,sigmax); Lambda = eig(X'*X/k); max(Lambda)/min(Lambda)

2. Study how robust the di erent regression algorithms are. First generate data, and test the methods using cross-validation (try this several times for fresh data): [X,Y] = dataxy(20,10,5,5,3,0.001,1.0); E = crossval(X,Y,'mlr(X,Y)'); errorMLR = sum(sum(E.*E))/(20*5) E = crossval(X,Y,'tls(X,Y)'); errorTLS = sum(sum(E.*E))/(20*5) E = crossval(X,Y,'ols(X,Y)'); errorOLS = sum(sum(E.*E))/(20*5) E = crossval(X,Y,'rr(X,Y,0.001)'); % Change this value! errorRR = sum(sum(E.*E))/(20*5)

70

Lesson 4. \Quick and Dirty"

Lesson 5

Tackling with Redundancy The collinearity problem is essentially caused by redundancy in the data: Measurements are more or less dependent of each other. However, none of the measurements is completely useless, each of them typically delivers some fresh information. Qualitative analyses cannot help here | on the other hand, when the quantitative approach is adopted, powerful methods turn out to be readily available.

5.1 Some linear algebra Linear algebra is a highly abstract eld of systems theory. In this context, it suÆces to concentrate on just a few central ideas, and theoretical discussions are kept in minimum; these issues are studied in more detail, for example, in [14] or [32].

5.1.1 On spaces and bases To have a deeper understanding of how the mapping from the \space" of input variables into the \space" of output variables can be analyzed, basic knowledge of linear algebra is needed. The main concepts are space, subspace, and basis. The de nitions are brie y summarized below: The set of all possible real-valued vectors x of dimension n constitutes the linear space Rn . If S 2 Rn is a set of vectors, a subspace spanned by S , or L(S ), is the set of all linear combinations of the vectors in S . An (ordered) set of linearly independent vectors i spanning a subspace is called a basis for that subspace. Geometrically speaking, subspaces in the n dimensional space are hyperplanes (lines, planes, etc.) that go through the origin. The number of linearly independent vectors in the subspace basis determines the dimension of the subspace. The basis vectors 1 to N can conveniently be represented in a matrix form:

 = 1



   N :

(5.1) 71

72

Lesson 5. Tackling with Redundancy

This basis matrix has dimension n  N , assuming that the dimension of the subspace in the n dimensional space is N . Given a basis, all points x in that subspace have a unique representation; the basis vectors i can be interpreted as coordinate axes in the subspace, and the \weights" of the basis vectors, denoted now zi , determine the corresponding \coordinate values" (or scores) of the point:

x=

N X i=1

zi  i :

(5.2)

The elements in i are called loadings of the corresponding variables. In matrix form, the above expression can be written as

x =   z:

(5.3)

Further, if there are various data vectors, the matrix formulation can be written as

X = Z  T :

(5.4)

There is an in nite number of ways of choosing the basis vectors for a (sub)space. One basis of special relevance is the so called \natural" basis: fundamentally, all other bases are de ned with respect to this natural basis. For the space of n measurements the natural basis vector directions are determined directly by the measurement variables; formally speaking, each entry in the data vector can be interpreted as a coordinate value, the basis vectors constituting an identity matrix,  = In . However, even though this trivial basis is easy to use, it is not necessarily mathematically the best representation for the data (as was shown in the example about collinearity above). Next we see how to change the basis.

5.1.2 About linear mappings The matrix data structure can have various roles. It can be used as a collection of data values (as X and Y above, for example), or it can be used as a frame for a vector system (as in the case of basis vectors); but perhaps the most important role of a matrix is its use as a means of accomplishing linear transformations between di erent bases of (sub)spaces. Whereas all matrix operations can be interpreted as linear transformations, now we are specially interested in mappings between di erent bases. The transformations from a given basis to the natural basis are straightforward: applying (5.3) gives the transformed coordinates directly. The question that arises is how one can nd the coordinate values z for a given x when the new basis  is given. There are three possibilities depending on the dimensions n and N :

 If n  N , matrix  is square and invertible (because the linear independence of the basis vectors was assumed). Then one can directly solve

z=

1  x:

(5.5)

73

5.1. Some linear algebra

 If n > N , the data point cannot necessarily be represented in the new ba-

sis. Using the least squares technique (see the previous section; substitute symbols so that X ! , Yi ! x, and Fi ! z ) results in an approximation  z^ = T  1 T  x:

(5.6)

 If n < N , there are an in nite number of exact ways of representing the

data point in the new basis. Again, the least squares method o ers the solution, now in the sense of minimizing z T z , that is, nding the minimum numerical values of the coordinates (see page 17):

z = T T

 1

 x:

(5.7)

All of the above cases can be conveniently presented using the pseudoinverse notation:

z = y  x:

(5.8)

If the basis vectors are orthonormal (orthogonal and normalized at the same time, meaning that iT j = 0, if i 6= j , and iT j = 1, if i = j ) there holds T  = IN (or T = In , whichever is appropriate). Thus, all the above formulas (5.5), (5.6), and (5.7) give a very simple solution:

z = T  x;

(5.9)

or, corresponding to (5.4),

Z = X  :

(5.10)

The above result visualizes the bene ts of basis orthonormality; there are additional advantages that are related to the numerical properties of orthogonal transformation matrices (manipulations in an orthonormal basis are optimally conditioned)1 .

5.1.3 Data model revisited To enhance the basic regression method, a more sophisticated scheme is now adopted (see Fig. 5.1). Speaking informally, we search for an \internal structure" that would capture the system behavior optimally; this internal structure is assumed to be implemented as a linear subspace. The data samples are rst projected onto the internal basis, and from there they are further projected onto the output space, the nal projection step being based on MLR regression. Note that because all of the mappings are linear, they can always be combined so that, seen from outside, the original \one-level" model structure is still valid: Y = (XF 1 )F 2 = X (F 1F 2 ) = XF . 1 Note that the orthogonality condition is always ful lled by the basis vectors that are generated by the PCR and PLS approaches that will be presented later. Furthermore, when using Matlab, say, for calculating the eigenvectors, they will be automatically normalized; this means that the practical calculations are rather straightforward

74

Lesson 5. Tackling with Redundancy

F1

X

F2

Z

Y

Figure 5.1: The dependency model y = f (x) re ned The overall regression model construction becomes a two-phase process, so that there are the following tasks: 1. Determine the basis .

 2. Construct the mapping F 1 =  T  1 .

3. Calculate the \latent variables" Z = XF 1 . 4. Construct the second-level mapping F 2 = Z T Z

 1

ZT Y .

5. Finally, estimate Y^ = X F = X F 1 F 2 . est

est

est

Here Z stands for the internal coordinates corresponding to the training data X and Y . In special cases (for example, for orthonormal ) some of the above steps may be simpli ed. The remaining problem is to determine the basis  so that the regression capability would be enhanced. How the internal structure should be chosen so that some bene ts would be reached? When the rank of the basis is the same as the number of degrees of freedom in the data (normally meaning that there are n basis vectors representing the n dimensional data), the data can be exactly reconstructed, or the mapping between data and the transformed representation can be inverted. This means that also the random noise that is present in the samples will always remain there. A good model, however, should only represent the relevant things, ignoring something, hopefully implementing this compression of data in a clever way. In concrete terms, this data compression means dimension reduction, so that there are fewer basis vectors than what is the dimension of the data, or N < n. Let us study this a bit closer | assume that the dimension of input is n, the dimension of output is m, the dimension of the latent basis is N , and the number of samples is k. The nominal regression model, matrix F mapping input to output contains n  m free parameters; there are k  m constraint equations. This means that on average, there are

km k = nm n

(5.11)

constraints for each parameter. The higher this gure is, the better the estimate becomes in statistical sense, random noise having smaller e ect. On the other hand, if the latent basis is used in between the input and output, there is rst the mapping from input to the latent basis (n  N parameters) and additionally

75

5.2. Principal components

the mapping from the latent basis to the output (N  m parameters). Altogether the average number of constraints for each parameter is

k  km = : n  N + N  m N 1 + mn

(5.12)

Clearly, if N  n, bene ts can be achieved, or the model sensitivity against random noise can be minimized | of course, assuming that these N latent variables can carry all the relevant information. How an automatic modeling machinery can accomplish such a clever thing of compression, or \abstracting" the data? There are di erent views of how the relevant phenomena are demonstrated in the data properties. Speaking philosophically, it is the ontological assumption that is left to the user: The user has to decide what are the most interesting features carrying most of the information about the system behavior. Concentrating on di erent aspects and utilizing the statistical properties of the data accordingly results in di erent regression methods.

5.2 Principal components The hypothesis that will now be concentrated on is that data variance carries information. This is the assumption underlying Principal Component Analysis (PCA), also known as Karhunen{Loeve decomposition, and the corresponding regression method PCR. In brief, one searches for the directions in the data space where the data variation is maximum, and uses these directions as basis axes for the internal data model. Whereas noise is (assumed to be) purely random, consistent correlations between variables hopefully reveal something about the real system structure. Assume that i is the maximum variance direction we are searching for. Data points in X can be projected onto this one-dimensional subspace determined by i simply by calculating Zi = Xi ; this gives a vector with one scalar number for each of the k measurement samples in X . The (scalar) variance of the projections can be calculated2 as E fzi2(k)g = k1  ZiT Zi = k1  iT X T Xi . Of course, there can only exist a solution if the growth of the vector i is restricted somehow; the length of this vector can be xed, so that, for example, there always holds iT i = 1. This means that we are facing a constrained optimization problem (see Sec. 1.3.4) with 

f (i ) = k1  iT X T Xi ; g(i ) = 1 iT i :

and

(5.13)

Using the the method of Lagrange multipliers, the optimum solution i has to obey

d J (i ) d = (f (i ) i  g(i )) = 0 di di

(5.14)

2 Here, again, maximum degrees of freedom existent in the data is assumed; for example, if the centering for the data is carried out using the sample mean, the denominator should be k 1. However, this scaling does not a ect the nal result, the directions of the eigenvectors

76

Lesson 5. Tackling with Redundancy

or 2

1 T  X Xi k

2i i = 0;

(5.15)

giving 1 T X X  i = i  i : k

(5.16)

Now, the variance maximization has become an eigenvalue problem with the searched basis vector i being an eigenvector of the matrix R = k1  X T X . The eigenvectors of the data covariance matrix are called principal components. Because of the eigenproblem structure, if i ful lls the equation (5.16), so does i , where is an arbitrary scalar; it will be assumed that the eigenvectors are always normalized to unit length, so that iT i = 1. The solution to the variance maximization problem is also given by some of the eigenvectors | but there are n of them, which one to choose? Look at the second derivative:

d2 J (i ) 2 T = X X di2 k

2i  I:

(5.17)

To reach the maximum of J (i ), there must hold d2 J (i )=di2  0, that is, the second derivative matrix (Hessian) must be semi-negative de nite: For any vector  there must hold

T



 k2  X T X

2i  I



   0:

(5.18)

For example, one can select  as being any of the eigenvectors,  = j :

jT



2 k

 XT X



2i  I  j = k2  jT  X T X  j 2i  jT j = 2j  jT j 2i  jT j = 2j 2i  0:

(5.19)

This always holds regardless of the value of 1  j  n only for the eigenvector i corresponding to the largest eigenvalue.

5.2.1 Eigenproblem properties Let us study closer the eigenvalue problem formulation (5.16). It seems that the matrix R = k1  X T X (or the data covariance matrix) determines the properties of the PCA basis vectors, and, indeed, these properties turn out to be very useful. First, it can be noted that R is symmetric, because the elements Rij and Rji have equivalent expressions:

Rij =

1 T  X X = 1  X T X = Rji : k i j k j i

(5.20)

5.2. Principal components

77

Next, let us multiply (5.16) from left by the vector iT (note that, of course, this vector is rank de cient, and only \one-way" implication can be assumed): 1 T T   X  Xi = i  iT i : (5.21) k i This expression consists essentially of two dot products (iT X T  Xi on the left, and iT  i on the right) that can be interpreted as squares of vector lengths. Because these quantities must be real and non-negative, and because k is positive integer, it is clear that the eigenvalue i is always real and non-negative. Let us again multiply (5.16) from left; this time by another eigenvector jT :

jT Ri = i  jT i :

(5.22)

Noticing that because R is symmetric (or R = RT ), there must hold jT R = (RT j )T = (Rj )T = j jT , so that we have an equation

j  jT i = i  jT i ;

(5.23)

j )  jT i = 0:

(5.24)

or (i

For i 6= j this can only hold if jT i = 0. This means that for a symmetric matrix R, eigenvectors are orthogonal (at least if the corresponding eigenvalues are di erent; for simplicity, this assumption is here made). Further, because of the assumed normalization, the eigenvectors are orthonormal. The above orthogonality property is crucial. Because of orthogonality, the eigenvectors are uncorrelated; that is why, the basis vectors corresponding to the maximum variance directions can, at least in principle, be extracted one at a time without disturbing the analysis in other directions.

5.2.2 Analysis of the PCA model Let us study the properties of the variables in the new basis. There are n eigenvectors i corresponding to eigenvalues i ; from now on, assume that they are ordered in descending order according to their numerical values, so that i  j for i < j . This is possible because it was shown that the eigenvalues are real and positive (note that the eig function in Matlab does not implement this ordering automatically). When the eigenvectors and eigenvalues are presented in the matrix form 0 1 1 0  C ...  = 1    n and  = B (5.25) @ A; 0 n where the dimension of  and  is n  n, the eigenproblem can be expressed compactly as 1 T X X   =   : (5.26) k

78

Lesson 5. Tackling with Redundancy

It was shown that the vectors constituting  are orthonormal; this means that the whole matrix  also is, so that T =  1. Noticing that X  = Z is the sequence of variables as presented in the new latent basis, one can write 1 T  Z Z = k1  T X T X  = T    = : (5.27) k What this means is that the new variables are mutually uncorrelated (because their covariance matrix  is diagonal); what is more, the eigenvalues i directly reveal the variances of the new variables. Let us elaborate on this a bit closer. varfz1 g +    + varfzng = 1 +    + n = trfg De nition of matrix trace = trf k1  T X T  X g = trf k1  X T X  T g (See below) = trf k1  X T X g Orthonormality of  = k1 x21 +    + k1 x2n = varfx1 g +    + varfxn g:

(5.28)

The matrix trace used above returns the sum of the diagonal elements of a square matrix. The change of the multiplication order above is motivated by the trace properties: Note that for all square matrices A and B there must hold trfAB g =

nA X nB X i=1 j =1

Aij Bji =

nB X nA X j =1 i=1

Bji Aij = trfBAg:

(5.29)

The above result (5.28) means that the total variability in x is redistributed in z . It was assumed that variance directly carries information | the information content is then redistributed, too. If the dimension is to be reduced, the optimal approach is to drop out those variables that carry least information: If an N < n dimensional basis is to be used instead of the full n dimensional one, it should be constructed as

 = 1



   N ;

(5.30)

where the vectors 1 to N are the directions of the most variation in the data space. If one tries to reconstruct the original vector x using the reduced basis variables, so that x^ = z , the error

x~ = x x^ = x

N X i=1

zi  i =

n X i=N +1

zi  i

(5.31)

has the variance

E fx~T (k)~x(k)g =

n X i=N +1

i :

(5.32)

This reveals that the the eigenvalues of R = k1  X T X give a straightforward method for estimating the signi cance of PCA basis vectors; the amount of data variance that will be neglected when basis vector i is dropped is i .

79

5.2. Principal components

As an example, study the case of Sec. 4.3 again. The eigenvalues of the data covariance matrix are 

1 = 2  E f 2 ()g + 2 2 = 2 ;

(5.33)

and the corresponding eigenvectors are

1 =

p1  2



1 1



and

2 =

p1  2



1 1



:

(5.34)

These basis vectors are shown in Fig. 5.2 (on the right); in this example, the data variance was E f 2 (k)g = 1 and the noise variance was 2 = 0:01. In this case, the ratio between the eigenvalues becomes very large, 1 =2  200; the basis vector 1 is much more important as compared to 2 . When a reduced basis with only the vector 1 is applied, all deviations from the line x2 = x1 are assumed to be noise and are neglected in the lower-dimensional basis. The data collinearity problem is avoided altogether.

Figure 5.2: Illustration of the \natural" and the PCA bases for the collinear data

5.2.3 Another view of \information" In the beginning of the chapter it was claimed that it is variance maximization that is the means of reaching good data models. But why does this seemingly arbitrary assumption really seem to do a good job? It must be recognized that the main goal in the data compression is to enhance the signal-to-noise ratio, so that the amount of misleading disinformation would be minimized as compared to the valuable real information. And it is here that the assumptions about \noise ontology" are utilized: The distribution of the noise hopefully di ers from that of real information. Typically the underlying basic assumption is that the noise is \more random" than the real signal is; this assumption can have di erent manifestations: 1. Truly random signals ful ll the assumptions of central limit theorem, so that noise distribution is more Gaussian than that of real information (this starting point is elaborated on in Chapter 7).

80

Lesson 5. Tackling with Redundancy

2. If one assumes that noise signals are uncorrelated with other signals, the noise is distributed approximately evenly in di erent directions in the n dimensional space. The second assumption is utilized in PCA: It is assumed that the same information is visible in various variables, so that the information introduces correlation in the data, whereas noise has no correlations or preferred directions in the data space (see Fig. 5.3). The noise variation remaining constant regardless of the direction, the maximum signal-to-noise ratio is reached in the direction where the signal variation is maximum | that is, in the direction of the rst principal component. PCR is strongest when MLR is weakest | in large-scale systems with high number of redundant measurements. Note that PCA gives tools also for further data analysis: For example, if one of the variables varies alone (just one variable dominating in the loadings), this variable seemingly does not correlate with other variables | one could consider leaving that variable out from the model altogether (however, see the next section). pc 2

Noise Signal

li pc 1 Noise Signal

i

Figure 5.3: Two views of the \directional" information vs. the \undirectional" noise: Five-dimensional data projected onto the rst two principal components, on the left, and the corresponding PCA eigenvalues on the right (note that adding a matrix q  I , the noise covariance, to the data covariance matrix shifts all eigenvalues up by the amount q). Relatively the most of the noise seems to be concentrated in the directions of lowest overall variation

5.3 Practical aspects Below, some practical remarks concerning the PCA method are presented. For more theoretical discussions, for the validity of the principal components model, etc., the reader should study, for example, [3].

5.3.1 Regression based on PCA The PCA approach has been used a long time for data compression and classi cation tasks. In all applications the basic idea is redundancy elimination | this is the case also in regression applications.

81

5.3. Practical aspects

Summarizing, it turns out that the eigenvector corresponding to the largest eigenvalue explains most of the data covariance. The numeric value of the eigenvalue directly determines how much of the data variation is contained in that eigenvector direction. This gives a very concrete way of evaluating the importance of the PCA basis vectors: One simply neglects those basis vectors that have minor visibility in the data. Using this reduced set of vectors as the internal model subspace basis  , principal component regression (PCR) is directly implemented3 . Because of the orthogonality of the basis vectors there holds Z = X , and the general modeling procedure (see page 74) reduces into the expression PCA

PCA

F

PCR

= F 1 F 2 = T X T X PCA

PCA

 1 T 

PCA

X T Y:

(5.35)

5.3.2 Selection of basis vectors How P n

to determine the dimension of the latent basis? For normalized data = n; a crude approximation is to include only those latent vectors i in the model for which there holds i > 1 | those directions carry \more that average amount" of the total information (being manifested in the variances). However, the overall behavior of the eigenvalue envelope should be taken into account: That is, plot the eigenvalues in descending order; if there is a signi cant drop between some of them, this may suggest where to put the model order. As a rule, it can be argued that the directions of largest eigenvalues are the most important, the dependency relations between variables being concentrated there, whereas the e ects of noise are pushed to the later principal components. However, analysis of the later components may also reveal some pecularities in the system operation, like outlier data, etc., and this information should not automatically be rejected. Often the rst few eigenvectors represent general dependencies within data whereas some may start representing individual outliers if they are dominant enough; this all is dependent of the numerical ratios between di erent phenomena. If the rst principal component dominates excessively, it may be reasonable to check whether the data preprocessing has been successfull: If the data is not mean-centered, it is this mean that dominates in the model rather than the true data variation, specially if the numerical data values are far from origin. The absolute minimum eigenvalue is zero, meaning that the set of measurements is linearly dependent; this can happen also if there are too few measurements, so that k < n; note, however, that PCA type data modeling can still be carried out in such case, whereas MLR would collapse. In general, the more there are goodquality samples as compared to the problem dimension, that is, if k  n, MLR often given good results, whereas the latent basis methods outperform MLR if the number of samples is low (and random variations are visible in data). It needs to be noted that the PCA results are very dependent of scaling: The i=1 i

3 Even if the basis would not be reduced, the orthogonality of the basis vectors already enhances the numeric properties of the regression model: in a non-orthogonal basis, the di erent coordinates have to \compete" against each other (heuristically speaking), often resulting in excessive numeric values

82

Lesson 5. Tackling with Redundancy

principal components can be turned arbitrarily by de ning an appropriate orthogonal transformation matrix D. Assume that X 0 = XD; if there holds  = k1  T X T X , then =

1 T   D  X 0T X 0  DT ; k

(5.36)

so that the new set of eigenvactors is DT  | directions being freely adjustable. If there exist eigenvectors with exactly equal eigenvalues in the covariance matrix, the selection of the eigenvectors is not unique; any linear combination of such eigenvectors also ful lls the eigenvalue equation (5.16). This is specially true for whitened data, where the data is preprocessed so that the covariance matrix becomes identity matrix; PCA can nd no structure in whitened data (however, see Chapter 7).

5.3.3 Analysis tools The numerical values of the principal component loadings reveal the dependencies (covariances) between di erent variables, and they also give information about the relevances of di erent input variables in the regression model. This kind of analysis is important specially when the model structure is iteratively re ned: Non-existent weighting of some of the inputs in all of the latent variables suggests that these inputs could perhaps be excluded from the model altogether. The PCA model can be analyzed against data in various ways in practice. One can for example calculate the measure for lack of t, the parameter called Q. This is simply the sum of error squares when a data sample is tted against the reduced basis, and then reconstructed. Because z () = T x() and x^() = z (), there holds x^() = T x(), so that the reconstruction error becomes x~() = (In T )  x(). The sum of error squares is then

Q() = = = =

x~T ()~x()   xT ()  In T T In T  x() xT ()  In 2T + T T  x() xT ()  In T  x();

(5.37)

because due to orthonormality of  there holds   T   T = T . The Q statistic indicates how well each sample conforms to the PCA model telling how much of the sample remains unexplained. Another measure, the sum of normalized squared scores, known as Hotellings T 2 statistic, reveals how well the data ts the data in another way: It measures the variation in each sample within the PCA model. In practice, this is revealed by the scores z (); the T 2 () is calculated as a sum of the squared normalized scores. Because the standard deviation of zi to be normalized is known to be p , there holds i

T 2 () = z T ()  N1  z () = xT ()  N1 T  x():

(5.38)

Roughly speaking, the smaller both of these Q() and T 2() turn out to be, the better the data ts the model. There are essential di erences, though: For

5.3. Practical aspects

83

example, in ating the basis, or letting N grow, typically increases the value of T 2(), whereas Q() decreases. Closer analyses could be carried out to nd exact statistical con dence intervals for these measures; however, these analyses are skipped here.

5.3.4 Calculating eigenvectors in practice There exist very eÆcient methods for calculating eigenvalues and eigenvectors, available, for example, in Matlab. However, let us study such a case where the dimension n is very high, and only few of the eigenvectors are needed. Assuming that the measurement signals are linearly independent, the (unknown) eigenvectors of the covariance matrix span the n dimensional space, that is, any vector  can be expressed as a weighted sum of them:

 = w1 1 + w2 2 +    + wn n :

(5.39)

If this vector is multiplied by the covariance matrix, each of the eigenvectors behaves in a characteristic way:

R = 1  w1 1 + 2  w2 2 +    + n  wn n :

(5.40)

Further, if this is repeated k times:

Rk  = k1  w1 1 + k2  w2 2 +    + kn  wn n :

(5.41)

If some of the eigenvalues is bigger than the others, say, 1 , nally it starts dominating, no matter what was the original vector  ; that is, the normalized result equals the most signi cant principal component : 



Rk  lim = 1 : k!1 kRk  k

(5.42)

Assuming that the eigenvalues are distinct, this power method generally converges towards the eigenvector 1 corresponding to the highest eigenvalue 1 | but only if w1 6= 0. Starting from a random initial vector  this typically holds. However, one can explicitly eliminate 1 from  , so that

0 = 

1T   1 :

(5.43)

Now there holds

1T  0 = 1T 

1T   1T 1 = 0;

(5.44)

meaning that 1 does not contribute in  0 , and necessarily wi = 0. If the power method is applied starting from this  0 as the initial guess, the iteration converges towards the eigenvector direction corresponding to the next highest eigenvalue 2 . Further, after the second principal component 2 is found, the procedure can be continued starting from  00 were both 1 and 2 are eliminated, resulting in the third eigenvector, etc. If only the most signi cant eigenvectors

84

Lesson 5. Tackling with Redundancy

are needed, and if the dimension n is high, the power method o ers a useful way to iteratively nd them in practice (in still more complex cases, where the matrix R itself would be too large, other methods may be needed; see Sec. 8.3.1). Of course, numerical errors cumulate, but the elimination of the contribution of the prior eigenvectors (5.43) can be repeated every now and then. The elimination of basis vectors can be accomplished also by applying so called de ation methods for manipulating the matrix R explicitly.

5.4 New problems The PCR approach to avoiding the collinearity problem is, however, not a panacea that would always work. To see this, let us study another simple example. Again, assume that we can observe two variables x1 and x2 , so that x = ( x1 x2 )T . This time, however, these variables are independent; and to simplify the analysis further, assume that no noise is present. This means that the covariance matrix becomes 1 k

Pk

x21 ()  X T X =  Pk =1 =1 x1 ()x2 ()  2  E fx01 ()g E fx02 ()g 2 1 k

! Pk x (  ) x (  ) 1 2 P =1 k x2 () =1 2 

:

(5.45)

The eigenvalues are now trivially 1 = E fx21 ()g and 2 = E fx22 ()g, and the eigenvectors are 1 = ( 1 0 )T and 2 = ( 0 1 )T , respectively. If either of the eigenvalues has much smaller numerical value, one is tempted to drop it out (as was done in the previous PCA example). So, assume that 2 is left out. What happens if the underlying relationship between x and y can be expressed as y = f (x2 ), so that x1 (or 1 ) is not involved at all? This means that a regression model that uses the reduced PCA basis will fail completely.

5.4.1 Experiment: \Associative regression" It is evident that one has to take the output into account when constructing the latent variables | so, what if we de ne

v() =



x() y()



;

(5.46)

and construct a PCA model for this | then the input and output variables should be equally taken into account in the construction of the latent variables. The corresponding covariance matrix becomes 1 1 T  V V=  k k



XT X XT Y Y TX Y TY



;

(5.47)

85

5.4. New problems

so that the eigenproblem can be written as 1  k



XT X XT Y YTX YTY

 

 ii



= i 



i i



:

(5.48)

Here, the eigenvectors are divided in two parts: First, i corresponds to the input variables and i to outputs. The selection of the most important eigenvectors proceeds as in standard PCA, resulting in the set of N selected eigenvectors 

 



:

(5.49)

The eigenvectors now constitute the mapping between the x and y variables, and the matrices  and  can be used for estimating y in an \associative way". During regression, only the input variables are known; these x variables are tted against the \input basis" determined by , giving the projected z variables4:

Z = X  T (T ) 1 :

(5.50)

The output mapping is then determined by the \output basis" : Because the coordinates z are known, the estimate is simply

Y^ = Z  :

(5.51)

Combining these gives the regression model

F

ASS

= T (T ) 1 :

(5.52)

The problem of loosely connected input and output variables still does not vanish: The correlated variables dominating in the eigenvectors can be in the same block, that is, they may both be input variables or they may both be output variables. Modeling their mutual dependency exclusively may ruin the value of the regression model altogether. What one needs is a more structured view of the data; the roles of inputs and outputs need to be kept clear during the analysis, and it is the regression models duty to bind them together. This objective is ful lled when applying the methods that are presented in the following chapter. It needs to be noted that when concentrating on speci c details, something always remains ignored. Now we have seen two methods (MLR and PCA) that o er the best possible solutions to well-de ned compact problems. In what follows, MLR will routinely be used when it is justi ed, and PCA will be used for data compression tasks, understanding their de ciencies; the problems they may possibly ignore are then solved separately. It is expert knowledge to have such a mental \theoretical toolbox" for attacking di erent problems using appropriate combinations of basic methods depending on the situation at hand.

4 Note that, whereas the eigenvectors of the whole system are orthogonal, the truncated vectors in  are not

86

Lesson 5. Tackling with Redundancy

Computer exercises 1. Study how the data properties a ect the principal component analysis; that is, change the degrees of data freedom and noise level (parameters dofx and sigmax, respectively): dofx = 5; sigmax = 0.5; X = dataxy(100,10,NaN,dofx,NaN,sigmax); pca(X);

2. Compare the eigenvectors and eigenvalues of the matrix R = k1  X T X when the data preprocessing is done in di erent ways; that is, create data as DATA = dataclust(3,1,100,50,5);

and analyze the results of showclust(X,ones(size(X))); hold on; plot(0,0,'o'); pca(X)

when the following approaches are used: X X X X X X X

= = = = = = =

DATA; center(DATA); scale(DATA); scale(center(DATA)); center(scale(DATA)); whiten(DATA); whiten(center(DATA));

Explain the qualitative di erences in the eigenvalue distributions. Which of the alternatives is recommended for PCR modeling?

Lesson 6

Bridging Input and Output In the previous chapter it was shown that (one) thing plaguing PCA is its exclusive emphasis on the input variables. The next step to take is then to connect the output variables in the analysis. But, indeed, there are various ways to combine the inputs and outputs. In this chapter, two strategies from the other ends of the scienti c community are studied | the rst of them, Partial Least Squares, seems to be very popular today among chemical engineers. This approach is pragmatic, usually presented in an algorithmic form1 . The second one, Canonical Correlation Analysis, has been extensively studied among statisticians, but it seems to be almost unknown among practicing engineers. However, both of these methods share very similar ideas and structure | even though the properties of the resulting models can be very di erent.

6.1 Partial least squares The Partial Least Squares (PLS)2 regression method has been used a lot lately, specially for calibration tasks in chemometrics [30],[37]. In this section, a different approach to PLS is taken as compared to usual practices, only honoring the very basic ideas. The reason for this is to keep the discussion better comprehensible, sticking to the already familiar eigenproblem-oriented framework.

6.1.1 Maximizing correlation The problem with PCA approach is that it concentrates exclusively on the input data X , not taking into account the output data Y . It is not actually the data variance one wants to capture, it is the correlation between X and Y that should be maximized. The derivation of the PLS basis vectors can be carried out as in the PCA case, 1 On page 11, PLS was characterized as a \try and pray" method; the reason for this is | it can be claimed | that a practicing engineer simply cannot grasp the unpenetrable algorithmic presentation of the PLS ideas. He/she can just use the available toolboxes and hope for the best 2 Sometimes called also Projection onto Latent Structure

87

88

Lesson 6. Bridging Input and Output

now concentrating on correlation rather than variance. The procedure becomes slightly more complex than in the PCA case: It is not only the input X block that is restructured, but the internal structure of the output Y block is also searched for. The regression procedure becomes such that the X data is rst projected onto a lower dimensional X oriented subspace spanned by the basis vectors i ; after that, data is projected onto the Y oriented subspace spanned by the basis vectors i , and only after that, the nal projection onto the Y space is carried out.

F1

X

F2

Z1

F

Z2

3

Y

Figure 6.1: The dependency model y = f (x) re ned The objective now is to nd the basis vectors i and i so that the correlation between the projected data vectors Xi and Y i is maximized while the lengths of the basis vectors are kept constant. This objective results in the constrained optimization problem (1.27) where 8 < :

f (i ; i ) = k1  iT X T  Y i ; g1 (i ) = 1 iT i and g2 (i ) = 1 Ti i :

when

(6.1)

There are now two separate constraints, g1 and g2 ; de ning the corresponding Lagrange multipliers i and i gives the Hamiltonian 1 T T   X  Y i k i

i 1 iT i





i 1 Ti i ;

(6.2)

and di erentiation gives (

d di d di

1 T T k  i X  Y i 1 T T k  i X  Y i



i (1 iT i ) i (1 Ti i )  = 0 i (1 iT i ) i (1 Ti i ) = 0;

(6.3)

resulting in a pair of equations  1

 X T Y i  Y T Xi

k1 k

2i i = 0 2i i = 0:

(6.4)

Solving the rst of these for i and the second for i , the following equations can be written:  1

k1 k

2 2

 X T Y Y T Xi = 4i i  i  Y T XX T Y i = 4i i  i :

(6.5)

This means that, again, as in Sec. 5.2, the best basis vectors are given as solutions to eigenvalue problems; the signi cance of the vectors i (for the X block) and i (for the Y block) is revealed by the corresponding eigenvalues i = 4i i .

89

6.2. Continuum regression

Because the matrices X T Y Y T X and Y T XX T Y are symmetric, the orthogonality properties again apply to their eigenvectors. The expression (5.35) can directly be utilized; the internal basis  consists of a subset of eigenvectors, selection of these basis vectors being again based on the numeric values of the corresponding eigenvalues. In practice, the basis vectors i are redundant and they need not be explicitly calculated (see Sec. 6.3.3). Because the rank of a product of matrices cannot exceed the ranks of the multiplied matrices, there will be only minfn; mg non-zero eigenvalues; that is why, the PCR approach may give higher dimensional models than PLS (when applying this eigenproblem oriented approach). It should be recognized that the PLS model is usually constructed in another way (for example, see [30]); this \other way" may sometimes result in better models, but it is extremely uninstructive and implicit, being de ned through an algorithm. It can be shown that the two approaches exactly coincide only what comes to the most signi cant basis vector. Let us study the example that was presented in the previous chapter, now in the PLS framework. The output is scalar; it is assumed that it is linearly dependent of the second input variable, so that y() = f  x2 (), where f is a constant. The matrix in (5.45) becomes PLS

1 k

TX  X T YY P  P P P x (  ) y (  ) x (  ) y (  ) x (  ) y (  ) x (  ) y (  ) 1 1 1 2 1     P P P P =k  )  x21 ()y ()  x2 ()y ()  x2 ()y ()  x2 ()y (  E f x (  ) y (  ) g E f x (  ) y (  ) g  E f x (  ) y (  ) g 1 2  Efx1 ()y()1g  Efx2 ()y()g E2 fx2 ()y()g  

2

2

=

0 0 0 f 2  E 2 fx22 ()g

;

because x1 and y are not assumed to correlate. This result reveals that the maximum eigenvalue is f 2 E2 fx22 ()g corresponding to the second input variable | no matter what is the ratio between the variances of x1 and x2 . This means that the basis always includes the vector ( 0 1 )T | and according to the assumed dependency structure, this is exactly what is needed to construct a working regression model. As a matter of fact, it can be seen that the eigenvalue corresponding to the rst input variable is zero, re ecting the fact that x1 has no e ect on y whatsoever.

6.2 Continuum regression 6.2.1 On the correlation structure Let us study the correlation between input and output from yet another point of view. The correlation structure is captured by the (unnormalized) crosscorrelation matrix

X T Y:

(6.6)

90

Lesson 6. Bridging Input and Output

The eigenvalues and eigenvectors are already familiar to us, and it has been shown how useful they are in the analysis of matrix structures. Perhaps one could use the same approaches to analysis of this correlation matrix? However, this matrix is generally not square and the eigenstructure cannot be determined; but the singular value decomposition, the geralization of the eigenvalue decomposition exists (see Sec. 1.3.2)

X T Y = XY XY TXY :

(6.7)

Here XY and XY are orthogonal matrices, the rst being compatible with X and the other being compatible with Y ; XY is a diagonal matrix, but if the input and output dimensions do not match, it is not square. Multiplying (6.7) by its transpose either from left or right, the orthonormality of XY and XY (so that TXY = XY1 and TXY = XY1 ) means that there holds

X T Y Y T X = XY XY TXY XY1

(6.8)

Y T XX T Y = XY TXY XY XY1 :

(6.9)

and Because XY TXY and TXY XY are diagonal square matrices, these two expressions are eigenvalue decompositions (1.5) of the matrices X T Y Y T X and Y T XX T Y , respectively. This means that there is a connection between the singular value decomposition and the above PLS basis vectors: The matrices XY and XY consist of the (full sets) of PLS basis vectors i and i . The diagonal elements of XY , thepsingular values, are related to the PLS eigenvalues in such a way that i = k  i . What is more, one can see that the SVD of the input data block X alone is similarly closely related to the PCA constructs:

X T = X X TX ;

(6.10)

so that

X T X = X X TX X1 ;

(6.11)

meaning that, again, the singular value decomposition does the trick, principal components being collected p in X and singular values being related to the eigenvalues through i = k  i .

6.2.2 Filling the gaps What if one de nes the matrix3 (X T ) X (Y ) Y ;

(6.12)

so that both of the analysis methods, PCA and PLS, would be received by selecting the parameters X and Y appropriately (for PCA, select X = 1 and 3 The

powers of non-square matrices being de ned as shown in Sec. 1.3.2

91

6.2. Continuum regression

Figure 6.2: Schematic illustration of the relation between regression approaches

Y = 0, and for PLS, select X = 1 and Y = 1), and applying SVD? And, further, why not try other values for X and Y for emphasizing the input and output data in di erent ways in the model construction? Indeed, there is a continuum between PCA and PLS | and this is not the whole story: Letting the ratio X = Y go towards zero, we go beyond PLS, towards models where the role of the output is emphasized more and more as compared to the input, nally constructing an singular value decomposition for Y alone (or eigenvalue decomposition for Y T Y ). It is only the ratio between X and Y that is relevant; we can eliminate the other of them, for example, by xing X = 1. Representing the problem in the familiar eigenproblem framework, multiplying (6.12) from left by its transpose and compensating the number of samples appropriately one has the eigenproblem formulation for the Continuum Regression (CR) basis vectors de ned as4 1

k1+



 X T Y Y T X  i = i  i :

(6.13)

Correspondingly, the \dual" formulation becomes 1

k1+1=

1

 Y T XX T Y  i = 0i  i :

(6.14)

When grows from 0 towards 1, the modeling emphasis is rst put exclusively on the input data, and nally exclusively on the output data (see Fig. 6.2); some special values of do have familiar interpretations:

 If = 0, the PCA model results, only input being emphasized.  If = 1, the PLS model results, input and output being in balance.  If ! 1, an \MLR type" model results, only output being emphasized5 . Which of the regression approaches, MLR, PCR, or PLS, is the best, cannot be determined beforehand; it depends on the application and available data. All of these methods have only mathematical justi cation; from the physical point of 4 These eigenproblems should not be solved directly in this form: The matrix XXT has dimension k  k, even though there are only n non-zero eigenvalues (or singular values) 5 Note that MLR is not based on basis vectors; that is why, the correspondence is somewhat arti cial (the rst basis vector of the CR model explaining the rst principal component of the output data, thus explaining maximum amount of the output variance)

92

Lesson 6. Bridging Input and Output

view, none of them can be said to always outperform the others. It may even be so that the ultimate optimum model lies somewhere on the continuum between PCR, PLS, and MLR (it may also lie somewhere else outside the continuum). In Figs. 6.3 and 6.4, the CR performance is visualized: There were 30 machinegenerated data samples with 20 input and 20 output variables; the number of independent input variables was 10 and the \correct" dimension of the output was 5; relatively high level of noise was added. And, indeed, it seems that when the cross-validation error is plotted as the function of latent variables N and continuum parameter as a two-dimensional map, interesting behavior is revealed: Starting from = 0, the minimum error is reached for about N = 12 whereas the overall optimum is found near MLR with N = 6. 1

a

0.5

a

N 0

N 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Figure 6.3: Continuum regression performance for di erent parameter values N and

Figure 6.4: Continuum regression performance as a \mountain view"

6.2.3 Further explorations It needs to be emphasized again that there are typically no absolutely correct methods for determininf physically optimal latent basis vectors. As in the whole report, the goal here is to show that there is plenty of room for experimenting and research (after all, the history of CR is less than ten years long; by no means one should assume that the nal word has been said). For example, a whole class of methods can be de ned that share the idea of continuum regression. Let us study a slightly di erent approach. MLR can be interpreted as modeling the covariance structure of the estimated output Y . The problem that emerges is that the output space usually does not have the same dimension as the input space has; that is why, the output variations need to be somehow presented in the input space to make this approach compatible with the other ones, PCR and PLS. The outputs can be projected into the input space by applying MLR in the \inverse direction", that is, X^ = Y  (Y T Y ) 1 Y T X , so that the covariance to be modeled has the form 1 k

 X^ T X^

= k1  X T Y (Y T Y ) 1  Y T Y  (Y T Y ) 1 Y T X = k1  X T Y (Y T Y ) 1 Y T X:

(6.15)

93

6.2. Continuum regression a1

1

a1

1

a

a 1

1 a2

a2 -1

-1

Option 1

Option 2

Figure 6.5: Two alternative feasible function forms (see text) Actually, this formulation gives a new \latent structure" oriented view of MLR. Assuming that all eigenvectors are utilized, the normal MLR results (of course, this is true for all latent variables based methods if all latent variables are employed), but if a lower dimensional internal model is constructed, the output properties are preserved based on their \visibility" in Y . It turns ot that if one de nes the latent vectors i as 

1

 XT Y Y T Y k1+ (1+ ) 1

2

 2

YT

 1

X  i = i  i ;

(6.16)

all of the above regression methods can be simulated by appropriately selecting the parameters 1 and 2 :

  

PCR is given by 1 = 0, whereas parameter 2 can have any value; PLS results if 1 = 1 and 2 = 0; and MLR is found if 1 = 1 and 2 = 1.

We would like to have a single parameter spanning the continuum between the approaches, so that = 0 would give MLR, = 1=2 would give PLS, and = 1 would give PCR (note that the range of is now from 0 to 1). There is an in nity of alternative options | for example, the following de nitions ful ll our needs: 1. 1 = 2 2 + + 1 and 2 = 2 1, or 2. 1 =

3 2



j

1j 2

1 2

and 2 =

+

j

1 j. 2

The outlooks of these functions are presented in Fig. 6.5. As an example, selecting the option 1 above, the latent vectors of CR can be calculated as solutions to the following eigenproblem:   1 T Y Y T Y 2 1 Y T 2 + +1 X   =   :  X k 2

(6.17)

Here, the parameter can be selected as = 4 3 + 2 2 + 2 + 1 to compensate for the changes in the number of samples. It needs to be noted that the

94

Lesson 6. Bridging Input and Output

outer matrix that one has to calculate the power function of may be very large (dimension being k  k); however, there are only m eigenvalues di erent from zero, meaning that (in principle) only m power functions have to be calculated. The matrix power is best to calculate using the singular value decomposition. The basis  is again constructed from the selected eigenvectors; because of the symmetricity of the matrix in (6.17), the basis is orthonormal. CR

6.3 Canonical correlations Another approach to modeling the dependency structure between the input and the output is o ered by Canonical Correlation Analysis (CCA) [31].

6.3.1 Problem formulation Again, one would like to nd the latent basis vectors i and i so that the correlation between the input and output blocks would be maximized. The criterion to be maximized is again 1 f (i ; i ) =  i T X T Y i ; (6.18) k but the constraints are modi ed slightly: 

g1 (i ) = k1  i T X T Xi = 1 g2 (i ) = k1  i T Y T Y i = 1:

(6.19)

Note the di erence as compared to the PLS derivation: It is not the basis vector i itself that is kept constant size; it is the projected data vector size Zi = Xi that is regulated, i T X T  Xi being kept constant. The same applies also in the output block: The size of i T Y T  Y i is limited. Again using the Lagrangian technique the following expression is to be maximized: 1 T T   X Y i + i  (1 k1  i T X T Xi ) + i  (1 k1  i T Y T Y i ):(6.20) k i This expression can be minimized with respect to both i and i separately: (

1 d k  di 1 d k  di



iT X T Y i i (1 iT X T Xi ) i (1 Ti Y T Y i )  = 0 iT X T Y i i (1 iT X T Xi ) i (1 Ti Y T Y i ) = 0;

resulting in a pair of equations 

X T Y i 2i X T Xi = 0 (6.21) Y T Xi 2i Y T Y i = 0: Solving the rst of these for i and the second for i , the following equations can be written (assuming invertibility of the matrices): 

X T Y (Y T Y ) 1 Y T Xi = 4i i  X T Xi Y T X (X T X ) 1 X T Y i = 4i i  Y T Y i ;

(6.22)

95

6.3. Canonical correlations

or 

(X T X ) 1X T Y (Y T Y ) 1 Y T X  i = 4i i  i (Y T Y ) 1 Y T X (X T X ) 1 X T Y  i = 4i i  i :

(6.23)

This means that, again, the best basis vectors are given as solutions to eigenvalue problems; the signi cance of the vectors i (for the X block) and i (for the Y block) is revealed by the corresponding eigenvalues i = 4i i (note the equivalences of the corresponding eigenvalues in di erent blocks). If either X T X or Y T Y is not invertible, either one of the generalized eigenvalue problems in (6.22) can directly be solved. It needs to be recognized that data must be explicitly scaled in the CCA case6 : The property k1  T X T X = I is not automatically guaranteed by the eigenproblem formulation. The matrix is diagonal (see the next section), but the diagonal elements are ones only after appropriate scalings: r

i

i =

1 T T   X Xi : k i

(6.24)

6.3.2 Analysis of CCA If the former equation in (6.22) is multiplied from left by jT , one has

jT X T  Y (Y T Y ) 1 Y T  Xi i  jT X T  Xi = 0:

(6.25)

When rearranged in the above way, one can see that the matrix Y (Y T Y ) 1 Y T is symmetric | meaning that (as in Chapter 5) the eigenproblem can be read in the \inverse" direction, and the following must hold 

jT X T  Y (Y T Y ) 1 Y T  Xi i  jT X T  Xi = j  jT X T Xi i  jT X T Xi = (j i )  jT X T  Xi = 0;

(6.26)

meaning that Xi and Xj must be orthogonal if i 6= j so that iT X T Xj = 0 (remember that for i = j it was assumed that iT X T Xj = 1). The same result can be derived for the output block: The projected variables are mutually uncorrelated. Further, if the equations in (6.21) are multiplied from left by jT and Tj , respectively, one has 

jT X T Y i = 2i  jT X T Xi Tj Y T Xi = 2i  Tj Y T Y i :

(6.27)

Observing the above uncorrelatedness result, one can conclude that also for the cross-correlations between the projected input and output blocks the same structure has emerged: Only for j = i there is correlation, otherwise not; this 6 This kind of extra scaling is not needed in the above PCA and PLS approaches: By construction, the eigenvectors were assumed to be normalized to unit length

96

Lesson 6. Bridging Input and Output

p

correlation coeÆcient is 2i = 2i = i . The above results can be summarized by showing the correlation structure between the latent input and output bases: 0 

  X T X Y Y

B B  B B =B B B B @

1

p

... 1

...

p

1

1

p

1 n

1

... ...

C C C n C C: C C C A

p

(6.28)

1

For notational simplicity, it is assumed here that n = m (otherwise, the nondiagonal blocks are padded with zeros). The basis vectors i and p i are called canonical variates corresponding to the canonical correlations i . The very elegant structure of (6.28) suggests that there must be going on something more important | the dependencies between the input and output blocks are channelled exclusively through these variates. Indeed, it has been recognized that the canonical variates typically reveal some kind of real physical structure underlying the observations, and they have been used for \exploratory data analysis" already in the 1960's. The underlying real structure will be concentrated on more in the next chapter. Note that, because of the non-symmetricity of the eigenproblem matrices, the bases are now generally not orthogonal! This is one concrete di erence between CCA and PCA/PLS. It can be claimed that whereas PCA and PLS are mathematically better conditioned, CCA is often physically better motivated | the underlying real structures seldom represent orthogonality.

6.3.3 Regression based on PLS and CCA In Fig. 6.1, it was explained that regression is a three-step procedure with two latent bases. However, it needs to be noted that this cumulating complexity is only illusion, presented in this form only to reach conceptual comprehensibility. In practice, it is only the rst mapping from X to Z 1 where the data compression takes place, the step between Z 1 to Z 2 introducing no additional information loss | thus, the same functionality as in the \stepwise" procedure is reached if one maps the data directly from Z 1 to Y , discarding the level Z 2 . With PLS, the structure of the regression model reduces into the same expression as with PCR (see page 74):

F

PLS

= T X T X PLS

PLS

 1 T 

PLS

X T Y:

(6.29)

With CCR, however, the basis vectors are not orthogonal but the projected data score vectors are | see (6.28). That is why, there is again reduction to the algorithm presented on page 74, but the result looks very di erent7 :

F

CCR

7 By

= T  CCA

CCA

 1 T 

CCA

X T Y:

(6.30)

the way, note the similarity between these regression formulas and the expressions (4.17) and (5.35): It is always the correlation between X and Y , or X T Y , that is used as the basis for the mapping between input and output; how this basic structure is modi ed by the additional matrix multiplier is only dependent of the method

97

6.3. Canonical correlations

6.3.4 Further ideas There are various bene ts when all methods are presented in the same eigenproblem-oriented framework | one of the advantages being that one can uently combine di erent approaches. For example, it turns out that if one de nes

R

CR2

1 = k

 XT X

2 1  T X Y

YTY

2 1

Y TX

1

;

(6.31)

the methods from CCR to PLS and PCR are found for = 0, = 21 , and = 1, respectively!8 Parameter can be selected as = 2 1 + (1 )(2 1) = 2 2 + 5 2. MLR could also easily be included somewhere along the continuum when using another choice of expressions for the exponents There is one drawback, though | only for the distinct values = 12 and = 1 the eigenvectors are orthogonal, as compared with the standard continuum regression. Study yet another idea: Observe the combination of matrices in the CCA solution (X T X ) 1 X T Y (Y T Y ) 1 Y T X:

(6.32)

Note that this can be divided in two parts: The rst part can be interpreted as a mapping X from input to Y^ , and the second part maps Y^ to X^ :

X^ = X  F 1 F 2 ;

(6.33)

where

F 1 = (X T X ) 1 X T Y; and F 2 = (Y T Y ) 1 Y T X:

(6.34)

That is, CCA can be interpreted as modeling the behaviors of the mappings when data X is rst projected onto output Y and from there back to input. This introduces yet another (CCA oriented) way of constructing the latent basis: One can study what are the statistical properties of this \twice projected" data in the PCA way, that is, the orthogonal basis vectors can be de ned trough the eigenproblem

X^ T X^  i = i  i ;

(6.35)

X T Y (Y T Y ) 1 Y T X (X T X ) 1X T Y (Y T Y ) 1 Y T X  i = i  i :

(6.36)

or

8 In this case, all the matrices that are involved are low-dimensional and the powers are easily calculated; also note that in the PLS case the square root of the nominal formulation is used for notational simplicity | the eigenvectors, however, remain invariant in both cases

98

Lesson 6. Bridging Input and Output

Computer exercises 1. Study the robustness of the di erent regression methods trying di erent values for parameter k (number of samples): k = 20; [X,Y] = dataxy(k,5,4,3,2,0.001,1.0); E = crossval(X,Y,'mlr(X,Y)'); errorMLR = sum(sum(E.*E))/(k*4) E = crossval(X,Y,'mlr(X,Y,pca(X,3))'); errorPCR = sum(sum(E.*E))/(k*4) E = crossval(X,Y,'mlr(X,Y,pls(X,Y,2))'); errorPLS = sum(sum(E.*E))/(k*4) E = crossval(X,Y,'mlr(X,Y,cca(X,Y,2))'); errorCCR = sum(sum(E.*E))/(k*4)

% Try different % Try different % Try different

2. If installed on your computer, get acquainted with the Chemometrics Toolbox for Matlab, and PLS Toolbox. Try the following demos: plsdemo; crdemo;

Lesson 7

Towards the Structure During the previous discussions, the role of the latent structure has become more and more emphasized. And, indeed, now we are taking yet another leap in that direction: It will be assumed that there really exists some underlying structure behind the observations (see Fig. 7.1)1 . The observations x are used to determine the internal phenomena taking place within the system; the output variables are calculated only after that. Truly knowing what happens within the system no doubt helps to pinpoint the essential behavioral patterns, thus promising to enhance the accuracy of the regression model. In the earlier chapters the latent structure was just a conceptual tool for compressing the existing data, now it takes a central role in explaining the data.

X

Z

Y

Figure 7.1: Yet another view of signal dependency structure As has been noticed, the methods presented this far do not o er us intuitively appealing ways to nd the real structure: If simple scaling can essentially change the PCA model, for example (see (5.36), it cannot be the physical structure that is being revealed. On the other hand, somehow the idea of continuity between the methods (as utilized in CR) does not promise that a uniquely correct structure would be found. The mathematically motivated structure is not necessarily physically meaningful. 1 Note that the causal structure is now assumedly di erent as it was before: If both X and Y are only re ections of some internal system structure, so that no causal dependence is assumed between them, the applications of the nal models should also recognize this fact. This means that control applications are somewhat questionable: If x values are altered in order to a ect the y values according to the correlations as revealed by the model, it may be that the intended e ects are not reached. On the other hand, di erent kinds of soft sensor applications are quite all right: The observed correlations justify us to make assumptions about y variables when only x has been observed (assuming invariant process conditions)

99

100

Lesson 7. Towards the Structure

It is an undeniable truth that the underlying primary structure cannot be determined when only observations of the behavior are available. We can only make optimistic guesses | if we trust the benevolence of Nature these guesses are perhaps not all incorrect. However, remember Thomas Aquinas and his theories of \First Cause": \... And so we must reach a First Mover which is not moved by anything; and this all men think of as God."

7.1 Factor analysis

An age-old method for feature extraction, or for finding the underlying explanations beyond the observations, is Factor Analysis. It has been applied widely in the social sciences, etc. The basic model is familiar:

x(\kappa) = \theta z(\kappa),    (7.1)

or

X = Z \theta^T.    (7.2)

The goal is to find the basis \theta and the scores Z (factors) so that the residual errors E = X - Z\theta^T would be minimized. Nothing strange here; actually the goal sounds identical to the PCA problem formulation. However, now there is an additional uncorrelatedness constraint for the residual: The residual errors E_i should be uncorrelated^2:

E\{ e(\kappa) e^T(\kappa) \} = \frac{1}{k} E^T E =
\begin{pmatrix}
  \mathrm{var}\{ e_1(\kappa) \} &        & 0 \\
                                & \ddots &   \\
  0                             &        & \mathrm{var}\{ e_n(\kappa) \}
\end{pmatrix}.    (7.3)

All dependencies between the data should be explained by the factors alone. Assuming that the residual errors and the factors are uncorrelated, the data covariance matrix can be written as

\frac{1}{k} X^T X = \frac{1}{k} \left( \theta Z^T Z \theta^T + \theta Z^T E + E^T Z \theta^T + E^T E \right) \approx \frac{1}{k} \theta Z^T Z \theta^T + \frac{1}{k} E^T E.    (7.4)

From this it follows that, if one defines

\theta' = \theta M, \qquad Z'^T Z' = M^{-1} Z^T Z (M^T)^{-1},    (7.5)

the same residual errors are received for a different factor structure; the new model is an equally valid factor model as the original one was, for any invertible matrix M. This means that the results are not unique. Factor analysis is more like art than science; there are more or less heuristic basis rotations that can be applied to enhance the model. These algorithms will not be studied here.

^2 Note that this uncorrelatedness property is not fulfilled by the PCA basis.

7.2. Independent components

101

Note that the uniqueness of the PCA model (at least if the eigenvalues are distinct) is caused by the assumed ordering of the basis vectors according to their relevance in terms of explained variance; in the factor analysis model, this kind of ordering is not assumed and uniqueness is not reached.
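To make the rotation indeterminacy (7.5) concrete, the following Matlab fragment is a minimal sketch (not a Regression Toolbox routine; all variable names and sizes are arbitrary illustrations) showing that any invertible M leaves the residual untouched:

   % Rotation indeterminacy of the factor model (7.5):
   % theta*M together with Z*inv(M)' explains the data exactly
   % as well as the original factor structure does.
   k = 100; n = 4; N = 2;
   Z     = randn(k,N);                 % hypothetical factor scores
   theta = randn(n,N);                 % hypothetical loadings
   E     = 0.01*randn(k,n);            % small residual
   X     = Z*theta' + E;               % data generated by the model

   M      = [2 1; 0 1];                % any invertible matrix
   thetaM = theta*M;                   % rotated loadings
   ZM     = Z*inv(M)';                 % compensating factor scores

   norm(X - Z*theta',  'fro')          % residual of the original model
   norm(X - ZM*thetaM','fro')          % identical residual for the rotated model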

7.2 Independent components

Above, factor analysis tried to find the original sources by emphasizing uncorrelatedness, but the results were not quite satisfactory, the uniqueness of the results remaining out of reach. Could one define more restrictive objectives that would fix the problems of traditional factor analysis? And, indeed, the answer is yes: During the last decade, it has turned out that the independence of the sources is a good starting point. This approach is called Independent Component Analysis (ICA), and it has lately been studied especially in the neural networks community. It has been successfully applied to blind source separation, image coding, etc. (see [15], [27]).

7.2.1 Why independence?

Intuitively, the original sources are those that are independent of the other sources. Finding the underlying structure can be based on this idea: Search for data that is maximally independent. In mathematical terms, two variables x_1 and x_2 can be said to be independent if there holds^3

E\{ f_1(x_1(\kappa)) f_2(x_2(\kappa)) \} = E\{ f_1(x_1(\kappa)) \} \cdot E\{ f_2(x_2(\kappa)) \}.    (7.6)

According to the above formulation, it can be said that maximizing independence between signals simultaneously minimizes the mutual information between them. In a way, the idea of ICA is to invert the central limit theorem: When various independent variables are mixed, the net distribution more or less approximates a normal distribution. So, when searching for the original, unmixed signals, one can search for maximally non-normal projections of the data distribution!
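The condition (7.6) can also be checked numerically. The following fragment is only an illustrative sketch (the signals and the test functions f_1(x) = x^2 and f_2(x) = |x| are arbitrary choices, not constructions from the text); it shows that the condition holds for independent signals but can fail for signals that are merely uncorrelated:

   % Independence test (7.6) with f1(x) = x.^2 and f2(x) = abs(x).
   k  = 100000;

   % An independent pair: both sides of (7.6) roughly agree.
   y1 = randn(k,1); y2 = randn(k,1);
   [mean(y1.^2.*abs(y2)), mean(y1.^2)*mean(abs(y2))]

   % An uncorrelated but dependent pair: x2 has the magnitude of x1
   % but a random sign, so E{x1*x2} = 0 although the signals are
   % clearly dependent.
   x1 = randn(k,1);
   x2 = sign(randn(k,1)).*x1;
   mean(x1.*x2)                                      % approximately zero
   [mean(x1.^2.*abs(x2)), mean(x1.^2)*mean(abs(x2))] % clearly different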

7.2.2 Measures for independence

Probability distributions can be uniquely determined in terms of moments or cumulants. The Gaussian distribution is determined by the first-order cumulant (mean value) and the second-order cumulant (variance) alone; for this distribution, all higher-order cumulants vanish. This means that the "non-normality" of a distribution can be measured (in some sense) by selecting any of the higher-order cumulants; the farther this cumulant value is from zero (in the positive or negative direction), the more the distribution differs from the Gaussian one. For example, non-normality in the sense of "non-symmetricity" can be measured using the third-order cumulant, skewness. In ICA, the standard selection is the fourth-order cumulant called kurtosis that measures the "peakedness" of the distribution:

\mathrm{kurt}\{ x_i(\kappa) \} = E\{ x_i^4(\kappa) \} - 3 \cdot E^2\{ x_i^2(\kappa) \}.    (7.7)

For normalized data this becomes

\mathrm{kurt}\{ x_i(\kappa) \} = E\{ x_i^4(\kappa) \} - 3.    (7.8)

^3 Note that independence is much more than simple uncorrelatedness, where the formula (7.6) holds only when both of the functions are identities, f_1(x_1) = x_1 and f_2(x_2) = x_2. Because independence is a much more restricting condition than uncorrelatedness, one is capable of finding more unique solutions than is the case with traditional factor analysis.

If the data is appropriately normalized, the essence of kurtosis is captured in the fourth-power properties of the data; this fact will be utilized later. After the ICA basis has been determined somehow, regression based on the independent components can be implemented (this method could be called "ICR"). Note that the expressions are somewhat involved because the basis vectors are non-orthogonal.
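As a quick sanity check of (7.8), the following sketch (illustrative only; the three test distributions are arbitrary choices) computes the sample kurtosis of normalized data:

   % Sample kurtosis (7.8) for zero-mean, unit-variance data:
   % uniform data is "flat" (negative kurtosis), Gaussian data gives
   % roughly zero, and a heavy-tailed distribution a positive value.
   k = 100000;
   u = rand(k,1);     u = (u - mean(u))/std(u);   % uniform
   g = randn(k,1);    g = (g - mean(g))/std(g);   % Gaussian
   h = randn(k,1).^3; h = (h - mean(h))/std(h);   % heavy-tailed

   [mean(u.^4)-3, mean(g.^4)-3, mean(h.^4)-3]     % roughly [-1.2, 0, large positive]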

7.2.3 ICA vs. PCA

Figs. 7.3 and 7.2 illustrate the difference between the principal components and the independent components in a two-dimensional case. The data is assumed to have a uniform distribution within the diamond-shaped region, and in these figures, the PCA and ICA bases for this data are shown, respectively. It really seems that independence means non-Gaussianity: Note that the trapezoidal marginal distributions in the non-independent PCA case are much more Gaussian than the "flat", negatively kurtotic uniform distributions in the ICA case. The "mixing matrix" (using the ICA terminology) in the case of Fig. 7.3 is

\theta = \begin{pmatrix} 1/\sqrt{2} & 1 \\ 1/\sqrt{2} & 0 \end{pmatrix},    (7.9)

meaning that x = θz. Note that, as compared to the Gaussian distribution, the uniform distribution is rather "flat"; in this case the kurtosis is maximally negative in the directions of the original sources, other projections having smoother, more Gaussian distributions.
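As an illustration (a sketch only; the sample count and the plotting command are arbitrary choices), the diamond-shaped data of the figures can be reproduced by mixing two independent uniform sources with the matrix (7.9):

   % Two independent, uniformly distributed ("flat", negatively kurtotic)
   % sources mixed with the matrix theta of (7.9); samples stored as rows.
   k     = 1000;
   Z     = 2*rand(k,2) - 1;               % independent sources, uniform on [-1,1]
   theta = [1/sqrt(2) 1; 1/sqrt(2) 0];    % mixing matrix (7.9)
   X     = Z*theta';                      % x(kappa) = theta*z(kappa)
   plot(X(:,1),X(:,2),'.'); axis equal    % the diamond-shaped distribution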

Figure 7.2: The ICA basis vectors, or "independent components". Knowing the value of z_1(κ), say, nothing can be said about the value of z_2(κ). The distribution remains intact, or p(z_2(κ)) = p(z_2(κ)|z_1(κ)), and the two projected variables really are independent (compare to the PCA case below: information about z_1(κ) affects the a posteriori probabilities of z_2(κ))

Figure 7.3: The PCA basis vectors, or "principal components": the first of them captures the maximum variance direction, and the other one is perpendicular to it. The variables are not independent

7.3 Eigenproblem-oriented ICA algorithms

Normally, independent component analysis is carried out in an algorithmic, iterative framework [15]; there are good reasons for this, but in this context we would like to bring ICA into the same eigenproblem-oriented framework as all the other approaches before. In what follows, kurtosis (or, equally, the fourth moment of the data, as shown in (7.8)) is concentrated on as the measure of independence (even though other contrast functions can also be defined). The problem with the eigenproblem framework is that it naturally emerges only when the second-order data properties, covariances and correlations, are studied. It is now asked whether the higher-order statistical properties like kurtosis could somehow be captured in the same way. And, indeed, tensor methods for ICA have been found^4. In principle, the tensors are linear operators just as normal matrices are, and the eigenstructure can be defined also for the four-dimensional tensors; however, the procedures are computationally involved, the tensors consisting of n × n × n × n elements, and the mathematical theory is also cumbersome (the "eigenvectors" now being n × n matrices!). Here the excessive growth of the search space (and the sophisticated mathematics) is avoided and some alternative approaches are studied.

7.3.1 Data whitening

The key point is to modify the data distribution so that the structural features, as assumedly revealed by the fourth-order properties, become visible. To reach this, the lower-order properties have to be compensated for, because they typically outweigh the higher-order properties:

- First-order properties are eliminated by only studying mean-centered data, that is, E{x_i(κ)} = 0 for all i;
- Third-order properties (or "skewness") vanish if one assumes that the distributions are symmetric, so that E{x_i(κ) x_j(κ) x_l(κ)} = 0 for all i, j, l; and
- Second-order properties are eliminated if the data is whitened.

The data whitening means that the data is preprocessed so that its covariance matrix becomes an identity matrix. This can be accomplished by

x(\kappa) = \sqrt{ E\{ x(\kappa) x^T(\kappa) \} }^{\,-1} \cdot x(\kappa),    (7.10)

where the square root of a matrix is here defined so that √M^T √M = M. After this modification there holds E{x(κ)x^T(κ)} = I. No matter what kind of additional preprocessing is needed, the above elimination of the lower-order statistics is assumed in what follows^5. We are again searching for a basis θ so that x(κ) = θz(κ), the signals z_i now hopefully being independent; and, again, we assume that in the whitened data space the basis is orthogonal (of course, when expressed in the original coordinates, the orthogonality does not hold; see Fig. 7.2).

^4 Note that the first-order statistical properties of a distribution are captured by the one-dimensional mean value vector, and the second-order properties are captured by the two-dimensional covariance matrix; similarly, the fourth-order properties can be captured by the four-dimensional tensor.

^5 If one is capable of finding some structure in the data after this prewhitening, this structure cannot be dependent on the measurement scaling, thus reflecting the real structure in a more plausible way; this dependency on the scaling was one of the arguments against the PCA model.
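A minimal Matlab sketch of the whitening step (7.10), assuming mean-centered data stored as rows of X (the test data below is arbitrary, and the fragment is not a Toolbox routine):

   % Whitening (7.10): after the transformation the sample covariance
   % of the data is (approximately) the identity matrix.
   k = 1000;
   X = randn(k,3) * [1 0 0; 0.8 0.5 0; 0.3 0.3 2];   % correlated test data
   X = X - repmat(mean(X),k,1);                      % mean centering
   C  = (X'*X)/k;                                    % sample covariance
   Xw = X / sqrtm(C);                                % x = sqrtm(C)^(-1)*x for every sample
   (Xw'*Xw)/k                                        % approximately the identity matrix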


7.3.2 Deformation of the distribution

One way to reduce the fourth-order properties to second-order properties is to explicitly change the distribution. In Fourth-Order Blind Identification (FOBI) the data is preprocessed (after first being whitened) so that the samples are either stretched or contracted about the origin. This can be accomplished as

x'(\kappa) = f(\| x(\kappa) \|) \cdot x(\kappa),    (7.11)

where f is some function. For example, selecting f(‖x‖) = ‖x‖ means that, when analyzing the variance properties of x', the fourth-order properties of the original x are modeled. This can be seen when the new covariance matrix is studied:

E\{ x'(\kappa) x'^T(\kappa) \}
  = E\{ x(\kappa) x^T(\kappa) \cdot \| x(\kappa) \|^2 \}
  = E\{ \theta z(\kappa) z^T(\kappa) \theta^T \cdot z^T(\kappa) \theta^T \theta z(\kappa) \}
  = \theta \cdot E\{ z(\kappa) z^T(\kappa) \cdot z^T(\kappa) z(\kappa) \} \cdot \theta^T.    (7.12)

This formulation is justified because one assumes that there exists an orthogonal basis θ and independent signals z_i. Let us study the matrix E{z(κ)z^T(κ) · z^T(κ)z(κ)} closer. The element i, j has the form

E\{ z_i(\kappa) z_j(\kappa) \cdot z^T(\kappa) z(\kappa) \}
  = E\{ z_i(\kappa) z_j(\kappa) \cdot ( z_1^2(\kappa) + \cdots + z_n^2(\kappa) ) \}
  = E\{ z_i(\kappa) z_j(\kappa) z_1^2(\kappa) \} + \cdots + E\{ z_i(\kappa) z_j(\kappa) z_n^2(\kappa) \}
  = \begin{cases}
      E\{ z_i^4(\kappa) \} + E\{ z_i^2(\kappa) \} \cdot \sum_{l \neq i} E\{ z_l^2(\kappa) \} = E\{ z_i^4(\kappa) \} + n - 1, & \text{if } i = j, \text{ and} \\
      E\{ z_i^3(\kappa) z_j(\kappa) \} + E\{ z_i(\kappa) z_j^3(\kappa) \} + E\{ z_i(\kappa) z_j(\kappa) \} \cdot \sum_{l \neq i,\, l \neq j} E\{ z_l^2(\kappa) \} = 0, & \text{otherwise.}
    \end{cases}

The above simplifications are justified because of the assumed independence of the signals z_i; for example, E{z_i(κ)z_j(κ)} = E{z_i(κ)} · E{z_j(κ)} for i ≠ j. Also, because of centering, E{z_i(κ)} = 0, and because of whitening, E{z_i²(κ)} = 1. Additionally, taking into account the assumed orthogonality of θ (in the whitened data space), there holds θ^T = θ^{-1}, and

E\{ x'(\kappa) x'^T(\kappa) \}
  = \theta
    \begin{pmatrix}
      E\{ z_1^4(\kappa) \} + n - 1 &        & 0 \\
                                   & \ddots &   \\
      0                            &        & E\{ z_n^4(\kappa) \} + n - 1
    \end{pmatrix}
    \theta^T
  = \theta \Lambda \theta^{-1}.    (7.13)

This means that the right-hand side can be interpreted as the eigenvalue decomposition of the covariance matrix of the modified data. The diagonal elements in the eigenvalue matrix are directly related to the fourth-order properties of the (assumed) independent components. The cumulant maximization/minimization task (for whitened data) is also transformed into a variance maximization/minimization task for the modified variable^6.

^6 Note that if the signals z_i are independent, kurtosis can be maximized/minimized using this algorithm even if the distributions are skewed, that is, even if E{z_i³(κ)} ≠ 0.


This means that the standard PCA approach can be applied; simultaneously as the covariance structure of x' is analyzed, so is the kurtosis structure of the original variables x. However, contrary to standard PCA, now the principal components carrying the least of the variance may be equally as interesting as the first ones, depending on whether one is searching for the latent basis of maximal or minimal kurtosis. The eigenvalues reveal the kurtoses of the signals z_i so that kurt{z_i(κ)} = E{z_i^4(κ)} − 3 = λ_i − n − 2. As an example, study the four-dimensional data samples shown in Fig. 7.4. Here, the data sequence is interpreted as constituting a continuous signal; however, note that this signal interpretation is only for visualization purposes. Using the above scheme, the underlying signals can be extracted, with no additional information, just based on the statistical properties of the samples (see Fig. 7.5)!

Figure 7.4: The original mixture of signals x_1, ..., x_4 (200 of the original 1000 samples are shown as time series). Can you see any structure here?

Figure 7.5: The extracted sources z_1, ..., z_4 and their distributions
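The whole procedure of Sections 7.3.1 and 7.3.2 condenses into a few lines of Matlab. The following is only a sketch reconstructed from the text (not the Toolbox implementation); the test sources and the random mixing are arbitrary illustrations:

   % FOBI-type source separation following (7.10)-(7.13).
   k = 1000;
   S = [sin(0.1*(1:k))', sign(randn(k,1)), 2*rand(k,1)-1];  % three test sources
   X = S*randn(3,3)';                    % an unknown linear mixture
   X = X - repmat(mean(X),k,1);          % mean centering
   [k,n] = size(X);

   C  = (X'*X)/k;                        % covariance of the centered data
   Xw = X / sqrtm(C);                    % whitening step (7.10)

   normx = sqrt(sum(Xw.^2,2));           % ||x(kappa)|| for every sample
   Xd = Xw .* repmat(normx,1,n);         % deformation (7.11) with f(r) = r

   [V,L] = eig((Xd'*Xd)/k);              % eigenstructure of the deformed data (7.13)
   Z = Xw*V;                             % estimated source signals
   kurt_est = diag(L) - n - 2            % kurt{z_i} = lambda_i - n - 2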


It is important to note that curve continuity and periodicity, properties that are intuitively used as criteria for "interesting" signals, are not at all utilized by the ICA algorithms: indeed, the samples could be freely rearranged, the continuity and periodicity vanishing, but the analysis results would still remain the same. In fact, traditional methods like some kind of harmonic analysis could reveal the underlying periodic signal structure, but ICA is especially powerful when such periodicity or continuity properties cannot be assumed.

7.3.3 Further explorations

One of the disadvantages of the above algorithm is that it cannot distinguish between independent components that have equal kurtosis^7. Let us try to find another approach offering more flexibility. First, study the basic properties of the fourth power of the data point norm:

\| x \|^4
  = \left( \sqrt{ x_1^2 + \cdots + x_n^2 } \right)^4
  = \left( x_1^2 + \cdots + x_n^2 \right)^2
  = x_1^4 + \cdots + x_n^4 + 2 x_1^2 x_2^2 + 2 x_1^2 x_3^2 + \cdots + 2 x_{n-1}^2 x_n^2.    (7.14)

Let us define a modified (n² + n)/2 dimensional data vector as follows:

x' = \begin{pmatrix}
       x_1^2 \\ \vdots \\ x_n^2 \\ \sqrt{2} \cdot x_1 x_2 \\ \vdots \\ \sqrt{2} \cdot x_{n-1} x_n
     \end{pmatrix},    (7.15)

containing all possible second-order cross-products between the x elements. Using this kind of modified data vector one can express the fourth moment of the original data vector simply as

\| x \|^4 = \| x' \|^2 = x'^T \cdot x'.    (7.16)

Now in the x' space one can project a point onto an axis l as x'^T l, and, further, it is possible to express the fourth moment of this projected data as

l^T \cdot x' x'^T \cdot l.    (7.17)

One should find such an axis l that the average of this quantity over all the modified samples x'(κ), where 1 ≤ κ ≤ k, is maximized (or minimized). First, construct the expression for the average of the projected fourth moment values:

\frac{1}{k} \sum_{\kappa=1}^{k} l^T \cdot x'(\kappa) x'^T(\kappa) \cdot l = \frac{1}{k} \cdot l^T \cdot X'^T X' \cdot l,    (7.18)

^7 Note that the non-uniqueness problem is the same with PCA if there are equal eigenvalues; however, in this case, when we are searching for the real explanations beneath the observations, not merely for some compression of the information, the problem is more acute.

where the modified data vectors are written in the matrix form. Letting ‖l‖ = 1, a Lagrangian constrained optimization problem results:

J(l) = \frac{1}{k} \cdot l^T \cdot X'^T X' \cdot l + \lambda \cdot \left( 1 - l^T l \right),    (7.19)

so that

\frac{d\, J(l)}{d l} = \frac{2}{k} \cdot X'^T X' \cdot l - 2 \lambda \cdot l = 0,    (7.20)

again resulting in an eigenproblem:

\frac{1}{k} \cdot X'^T X' \cdot l = \lambda \cdot l.    (7.21)

Figure 7.6: The basis vectors corresponding to the l of lowest kurtosis. Note that only the first two are "correct" signals, these sources being sub-Gaussian

Figure 7.7: The basis vectors corresponding to the l of highest kurtosis. Only the first two are correct, these sources being super-Gaussian

Substituting (7.21) into (7.19), one can see that the eigenvalue equals the cost criterion value, that is, λ is the average of the projected fourth moments of the samples. Note that here the least significant eigenvector can be more important than the most significant one, depending on whether one is searching for a sub-Gaussian or a super-Gaussian distribution. This principal component is now presented in the high-dimensional x' space, and to make it useful as a basis vector, one needs to approximate it in the lower-dimensional space of x vectors. For this purpose, remember the interpretation of each of the elements in l:

\begin{cases}
  l_1 \sim x_1^2 \\
  \vdots \\
  l_n \sim x_n^2 \\
  l_{n+1} \sim \sqrt{2}\, x_1 x_2 \\
  \vdots \\
  l_{(n^2+n)/2} \sim \sqrt{2}\, x_{n-1} x_n
\end{cases}
\qquad \Longrightarrow \qquad
\begin{cases}
  x_1^2 \sim l_1 \\
  \vdots \\
  x_n^2 \sim l_n \\
  x_1 x_2 = x_2 x_1 \sim \frac{1}{\sqrt{2}} \cdot l_{n+1} \\
  \vdots \\
  x_{n-1} x_n = x_n x_{n-1} \sim \frac{1}{\sqrt{2}} \cdot l_{(n^2+n)/2}
\end{cases}


These are not expectation values, but they still tell something about the connections between the variables for some hypothetical data; from the elements of l one can construct an association matrix

R = \begin{pmatrix}
      l_1                              & \frac{1}{\sqrt{2}} \cdot l_{n+1} & \cdots & \frac{1}{\sqrt{2}} \cdot l_{2n-1} \\
      \frac{1}{\sqrt{2}} \cdot l_{n+1} & l_2                              &        &                                   \\
      \vdots                           &                                  & \ddots &                                   \\
      \frac{1}{\sqrt{2}} \cdot l_{2n-1} &                                 &        & l_n
    \end{pmatrix}.    (7.22)

Using this matrix, one can determine the n-dimensional basis vectors θ_i that best span the higher-dimensional space; the answers must be given by the principal components of R. Note that the eigenvalues may now be negative, as well as the diagonal elements; this could be explained by assuming that the data is complex-valued. However, because the matrix is symmetric (in this case, actually, Hermitian), the eigenvalues and eigenvectors are real-valued.
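The construction (7.15)-(7.22) can be written out as a short Matlab sketch. The fragment below is only an illustration of the formulas (it is not a Toolbox routine, and the loop-based construction is written for clarity, not efficiency); the whitened data is assumed to be stored as rows of Xw, for instance as computed in the earlier whitening sketch:

   % Kurtosis-oriented eigenproblem (7.15)-(7.22) for whitened data Xw.
   [k,n] = size(Xw);

   % Modified data vectors x' with all second-order cross-products (7.15):
   Xp = Xw.^2;
   for i = 1:n-1
     for j = i+1:n
       Xp = [Xp, sqrt(2)*Xw(:,i).*Xw(:,j)];
     end
   end

   [Lvec,Lval] = eig((Xp'*Xp)/k);      % eigenproblem (7.21)
   [dummy,idx] = min(diag(Lval));      % e.g. the minimum fourth-moment direction
   l = Lvec(:,idx);

   % Fold the elements of l into the association matrix R of (7.22):
   R = diag(l(1:n));
   pos = n;
   for i = 1:n-1
     for j = i+1:n
       pos = pos + 1;
       R(i,j) = l(pos)/sqrt(2);
       R(j,i) = R(i,j);
     end
   end
   [thetaR,lambdaR] = eig(R);          % basis vector candidates from R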

7.4 Beyond independence

Study the distributions in Fig. 7.8: The intuitively correct basis vectors fulfill the non-Gaussianity goal, the marginal distributions being peaked, or positively kurtotic. However, note that the variables z_1 and z_2 are in this case not independent: knowing, for example, that z_1 has a high value, one immediately knows that z_2 must be near zero, whereas low values of z_1 leave much more freedom for z_2; in a way, these variables are rather mutually exclusive than independent, either of them telling very much about the other one. The independence objective does not seem to always work when searching for good basis vectors. What kinds of alternatives exist?

Figure 7.8: Another (more difficult) example of underlying structure: Positive kurtosis in the basis that is suggested by the structure of the distribution (on the left), and zero kurtosis using PCA (two equal projections of normal distributions summed together). In this case, for example, the PCA analysis hides the underlying structure altogether; all samples belonging to different distribution regions are mixed up. However, for this distribution the independence assumption also collapses (see text)


Figure 7.9: Three basis vectors q_1, q_2, q_3 in a two-dimensional space: The basis set is overcomplete

7.4.1 Sparse coding

It turns out that in those types of distributions that seem to be characteristic of measurement data, and that we are especially interested in, namely the mixture models as studied in Sec. 2.2, this mutual exclusiveness is more like a rule than an exception: If a sample belongs to some specific subdistribution, the other subdistributions have no role in explaining it. And there are more surprises: It may be that the correct number of latent structures is higher than the dimension of the data space (see Fig. 7.9). Perhaps it is this exclusiveness that could be taken as the starting point? And, indeed, this approach results in a framework that could be called Sparse Component Analysis (SCA). In sparse coding it is assumed that a sample x is represented in the latent basis so that most of the scores are zeros. Sparse coding is a rather general framework: For example, the various submodels constituting a mixture model can be presented within the same sparse structure. But sparse models are more general than the mixture models are: Whereas the constructs in mixture models strictly belong to one submodel only, in the sparse framework the components may be shared, so that the submodels can have common substructures. This exchange of substructures is the key to the expressional power of sparse models. Unfortunately, this power also suggests that there exist no explicit algorithms for constructing sparse models^8. Also the varimax, quartimax, and infomax rotation algorithms resemble sparse coding; these approaches are commonly used within the factor analysis community for maximizing the score variance, thus distributing the activity into more specialized factors. As compared to the modeling methods discussed before, ICA is typically not seen as a compression technique; rather, it carries out data reorganization, so that the z vectors often have the same dimension as x.

^8 However, various iterative approaches exist; for example, see the next chapter.


In the sparse coding case, the goal is still different: The model may even be inflated, so that the number of constructs is higher than the data dimension, hoping that the structural phenomena become better visible when the data is less compactly packed. One of the implicit general objectives in modeling is simplicity, in the spirit of Occam's razor. Now this simplicity objective has to be interpreted in another way: Usually, a "model of minimum size" means a minimum number of substructures; in sparse coding it means a minimum number of simultaneously active units (see Fig. 7.10).

Figure 7.10: The structure of a sparse model

Regression based on a sparse model is nonlinear; however, the nonlinearity is concentrated on the selection of the appropriate latent vectors among the candidates: after they have been selected, the model is linear. The latent variables can be selected so that together they explain the data sample as well as possible. The abrupt switching between latent structures means that the model behavior is discontinuous if no additional measures are applied. When only the most specialized constructs are used, sparse representations often seem to be "easily interpreted", being sometimes connected to intuitive mental (subsymbolic) constructs. There is some evidence that the human brain organizes at least sensory information in this way: In the visual cortex, there are areas, groups of neurons, that have specialized in very narrow tasks, like detecting tilted lines in the visual image. The observed image is mentally reconstructed using these low-level visual features; what is more, it seems that similar principles may govern higher-level perception, too. There is perhaps room for fruitful cooperation between cognitive science and multivariate statistics.
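To make the selection idea concrete, the following fragment is one possible greedy interpretation of "selecting the latent vectors that together explain the sample as well as possible". It is only an illustrative sketch; the overcomplete basis, the sample, and the matching-pursuit-style selection rule are assumptions for illustration, not constructions from the text:

   % Greedy selection of a few active latent vectors from an overcomplete
   % basis theta (columns assumed normalized); only an illustration.
   theta = [1 0 1/sqrt(2);
            0 1 1/sqrt(2)];        % three basis vectors in two dimensions
   x = [2; 1];                     % a sample to be explained
   nActive = 2;                    % number of simultaneously active units

   idx = [];                       % indices of the selected basis vectors
   r = x;                          % residual
   for i = 1:nActive
     [dummy,best] = max(abs(theta'*r));   % best-matching remaining vector
     idx = [idx, best];
     z = theta(:,idx) \ x;                % least-squares scores for the chosen set
     r = x - theta(:,idx)*z;              % updated residual
   end
   idx, z                          % active units and their scores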


Computer exercises

1. Study the robustness of the eigenproblem-formulated ICA by iteratively running the two commands below with different selections of the parameter alpha (here α denotes the power used in the data preprocessing x' = ‖x‖^α · x; note that the default value α = 1, resulting in the nominal, strictly kurtosis-oriented algorithm, is not necessarily the best choice; for example, try α = −1 for this data):

   X1 = dataindep;
   ica(X1,alpha);

2. Define the data as

   X2 = dataindep(1000,...
        'randn(1000,1)',...
        'sign(randn(1000,1)).*(rand(1000,1)
