
SPARSE SOLUTION OF UNDERDETERMINED LINEAR SYSTEMS: ALGORITHMS AND APPLICATIONS

A DISSERTATION SUBMITTED TO THE INSTITUTE FOR COMPUTATIONAL AND MATHEMATICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Yaakov Tsaig June 2007

© Copyright by Yaakov Tsaig 2007

All Rights Reserved


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

David L. Donoho Principal Advisor

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Michael A. Saunders

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Michael Elad

Approved for the University Committee on Graduate Studies.


Abstract

Recent theoretical developments have generated a great deal of interest in sparse signal representation. The setup assumes a given dictionary of "elementary" signals, and models an input signal as a linear combination of dictionary elements, with the provision that the representation is sparse, i.e., involves only a few of the dictionary elements. Finding sparse representations ultimately requires solving for the sparsest solution of an underdetermined system of linear equations. Such models arise often in signal processing, image processing, and digital communications. The main contributions of this dissertation are in the development and analysis of algorithms to find sparse solutions of underdetermined linear systems, and in offering potential applications in signal and image processing.

The first contribution concerns the solution of ℓ1 minimization problems whose solution is sparse. A large volume of work established that the minimum ℓ1-norm solution to an underdetermined linear system is often, remarkably, also the sparsest solution to that system. However, general-purpose linear optimization algorithms are proving to be far too slow for ℓ1 minimization in many applications. We suggest the use of the Homotopy algorithm, originally developed by Osborne et al. and Efron et al., to solve ℓ1 minimization problems in an overdetermined setting. In the underdetermined context, we show that when the solution is sufficiently sparse, Homotopy has the following k-step solution property: if the sparsest solution to an underdetermined system has at most k nonzeros, the algorithm recovers it in exactly k iterative steps. When the conditions for this property are met, Homotopy runs in a fraction of the time it takes to solve the ℓ1 minimization problem with general-purpose solvers. Our analysis sheds light on the evident parallelism in the ability of ℓ1 minimization and Orthogonal Matching Pursuit (OMP) to recover sparse solutions, by pointing out a series of connections between the two seemingly disparate approaches.

The second contribution addresses the problem of finding sparse solutions to underdetermined linear systems from a different perspective, introducing Stagewise Orthogonal Matching Pursuit (StOMP), a rapid iterative method to find sparse approximate solutions to such systems. Each iteration of the algorithm simply involves residualization, thresholding, and projection, with the total number of iterations small and fixed. Thus, StOMP is much faster than competing approaches to recover sparse solutions, and at the same time, as our work demonstrates, its ability to recover the sparsest solution is comparable with that of full-blown ℓ1 minimization.

The third contribution of this dissertation comes in analyzing the performance and applicability of the theory of Compressed Sensing from a practitioner's viewpoint. The notion of Compressed Sensing proposes that a signal or image, unknown but supposed to be compressible by a known transform, can be subjected to fewer measurements than the nominal number of data points, and yet be reconstructed accurately by solving an ℓ1 minimization problem. We offer an empirical analysis of Compressed Sensing, establishing that it may be applied successfully in various practical settings. In addition, we offer several extensions to the original proposal, and describe a natural application of this theory for fast acquisition of MRI data.

Finally, we apply the sparse representation model to a classical problem in digital communications, of designing codes that can correct errors in a noisy communication environment. We consider a digital channel model corrupted by two types of interference: additive white Gaussian noise, and additive impulse noise. We propose a class of 'random' codes for robust transmission over this communication channel, which may be decoded by solving an ℓ1 minimization problem using the Homotopy algorithm. We present simulation results demonstrating that for certain channel configurations, the rate at which we can reliably transmit information with a negligible probability of error using the proposed codes is comparable to the fundamental limits set by Shannon's capacity.


Acknowledgements

The journey leading to this Ph.D. thesis has been a wonderful and often overwhelming experience. I am indebted to many people who made the time spent working on this thesis an unforgettable experience.

It is hard to overstate my gratitude to my Ph.D. advisor, David Donoho. To be in the presence of such a prodigious mind is an eye-opening experience. With his guidance, I learned to challenge traditional patterns of thought, and to tackle seemingly unsolvable scientific problems. He has taught me to follow high scientific standards in my work. He has shaped and influenced my thinking in a way no other person did.

I am also indebted to Michael Saunders, who has followed and guided my research throughout my time at Stanford. I have lost count of the number of times I came to him for optimization advice, and left his office with a mind full of new directions and insights. His valuable feedback and expert counsel helped shape much of the research carried out while working on this thesis. Most importantly, Michael is one of the kindest people I have met, and I wholeheartedly enjoyed spending time with him.

I am grateful to Michael Elad for his infinite support during my first year in the program. He took me under his wing, and showed me how to tackle research problems and stimulate original thought. He has taught me that with clear, structured thinking, any scientific challenge, no matter how hard, is within reach. His energy and dedication are an inspiration to me.

I have had the pleasure of knowing and working with Gene Golub, one of the most inspirational and influential figures in the numerical analysis community. I thank him for his guidance and support during my years at Stanford. His teachings opened my mind to the beauty of linear algebra and numerical analysis, and shaped my career path as well as my interests.

I would like to thank Tsachy Weissman for his counsel on the error correction work appearing in this thesis. His expert advice and stimulating discussions are a major part of that work. I am also grateful to Biondo Biondi, Stephen Boyd, and Robert Tibshirani, who took the time and effort to serve on my oral committee.

I would like to express my deepest gratitude to Indira Choudhury, who was always there to help with the administrative requirements of the program. I am forever indebted to her for the tremendous support and help she offered in arranging my oral defense. I also thank Helen Tombropoulos and the administrative staff in the Statistics Department at Stanford, my home away from home in the past few years.

Special thanks go to my friends and colleagues Miki Lustig, Ofer Levi, Tomer Yahalom, Vicki Stodden, and Armin Schwartzman, for their advice and for countless hours spent discussing fruitful ideas, as well as for their support and encouragement throughout this endeavor.

I am forever grateful to Kristen Backor for her endless love and uncompromising support throughout this entire journey. Sharing my life with her in the past few years made my time at Stanford one of the sweetest and most unforgettable periods in my life.

Lastly, I wish to thank my parents, Sammy and Varda Tsaig. They bore me, raised me, supported me, taught me, and loved me. To them I dedicate this thesis.


Contents

Abstract
Acknowledgements

1 Introduction
  1.1 The Search for Sparsity
  1.2 In Pursuit of Sparse Solutions
  1.3 Making Sense of Random Sensing
  1.4 Contributions

I Algorithms

2 Sparse Solution of ℓ1 Minimization Problems with the Homotopy Method
  2.1 Introduction
    2.1.1 Greedy Approaches
    2.1.2 A Continuation Algorithm To Solve (P1)
    2.1.3 LARS
    2.1.4 Main Results
    2.1.5 Relation to Earlier Work
    2.1.6 Contents
  2.2 The Homotopy Algorithm
  2.3 Computational Cost
    2.3.1 Running Times
    2.3.2 Complexity Analysis
    2.3.3 Number of Iterations
  2.4 Fast Solution with the Incoherent Ensemble
    2.4.1 Correct Term Selection
    2.4.2 Sign Agreement
    2.4.3 Proof of Theorem 1
  2.5 Fast Solution with the Uniform Spherical Ensemble
    2.5.1 Phase Diagrams and Phase Transitions
    2.5.2 The k-Step Solution Property
    2.5.3 Correct Term Selection and Sign Agreement
    2.5.4 Random Signs Ensemble
  2.6 Fast Solution with Partial Orthogonal Ensembles
  2.7 Bridging ℓ1 Minimization and OMP
    2.7.1 ℓ1 Minimization → Homotopy
    2.7.2 Homotopy → LARS
    2.7.3 LARS → OMP
  2.8 Homotopy and PFP
  2.9 Recovery of Sparse Solutions
  2.10 Example: Component Separation
  2.11 Discussion

3 Rapid Approximate Solution of Underdetermined Linear Systems with Stagewise Orthogonal Matching Pursuit
  3.1 Introduction
    3.1.1 StOMP
    3.1.2 Contents
  3.2 Stagewise Orthogonal Matching Pursuit
    3.2.1 The Procedure
    3.2.2 An Example
    3.2.3 Approximate Gaussianity of Residual Correlations
    3.2.4 Threshold Selection
  3.3 Performance Analysis
  3.4 Computational Analysis
    3.4.1 Complexity Analysis
    3.4.2 Running Times
  3.5 Variations
    3.5.1 Partial Orthogonal Ensembles
    3.5.2 Other Coefficient Ensembles
    3.5.3 Noisy Data
  3.6 Example: Component Separation
  3.7 Discussion

II Applications

4 Extensions of Compressed Sensing
  4.1 Introduction
  4.2 ℓ0 Sparsity
  4.3 ℓp Sparsity
  4.4 Different Ensembles of CS Matrices
  4.5 Denoising CS Results
  4.6 Noise-Aware Reconstruction
  4.7 Two-Gender Hybrid CS
  4.8 Multiscale Compressed Sensing
  4.9 Compressed Sensing with StOMP
  4.10 Compressed MRI Acquisition
  4.11 Discussion

5 Error Correction in a Mixed Interference Channel
  5.1 Introduction
  5.2 Encoding
  5.3 Decoding, σ_U = 0
  5.4 Decoding, General Case
  5.5 Discussion

A Reproducible Research with SparseLab
  A.1 Solvers
  A.2 Figures
  A.3 Data Sets

Bibliography

List of Tables

2.1 Execution times (seconds) of Homotopy, LP_Solve and PDCO, applied to instances of the problem suite S(USE, Gauss; d, n, k).
2.2 Maximum number of Homotopy iterations, as a fraction of d, for various matrix / coefficient ensembles.
2.3 Deviation of estimated k̂_{η;d,n} from d/(2 log(n)) for S(USE, V; d, n, k), for different coefficient ensembles V.
2.4 Deviation of estimated k̂_{η;d,n} from d/(2 log(n)) for S(RSE, V; d, n, k), for different coefficient ensembles V.
3.1 Worst-case complexity bounds for StOMP, OMP and PDCO.
3.2 Comparison of execution times (seconds) for instances of the problem suite S(USE; d, n, k).
3.3 Comparison of execution times (seconds) in the random partial Fourier suite S(PFE; d, n, k).
4.1 Empirically derived constant C̃_p.

List of Figures

2.1 Computational cost of Homotopy.
2.2 Phase diagram of the k-step solution property of Homotopy applied to the problem suite S(USE, Uniform; d, n, k).
2.3 Sharpness of phase transition.
2.4 Phase diagrams for correct term selection and sign agreement.
2.5 Phase diagram of the k-step solution property of Homotopy applied to the problem suite S(RSE, Uniform; d, n, k).
2.6 k-step solution property of Homotopy for the suite S(PFE, V; d, n, k) on a ρ-δ grid.
2.7 k-step solution property of Homotopy for the suite S(PFE, V; d, n, k) on a ρ-n grid.
2.8 Estimated phase transition curves.
2.9 Bridging ℓ1 minimization and OMP.
2.10 Phase diagrams of PFP for the problem suite S(USE, Uniform; d, n, k).
2.11 The k-sparse solution property for the suite S(USE, V; d, n, k).
2.12 The k-sparse solution property for the suite S(URPE, V; d, n, k).
2.13 Time-frequency separation with Homotopy.
2.14 Time-frequency separation in the presence of noise.
3.1 Schematic representation of the StOMP algorithm.
3.2 Progression of the StOMP algorithm.
3.3 Approximate Gaussianity of the correlation vector.
3.4 QQ plots comparing MAI with Gaussian distribution.
3.5 QQ plots comparing residual disturbance with Gaussian distribution.
3.6 Phase diagram of the k-sparse solution property of StOMP applied to the problem suite S(USE, Gauss; d, n, k).
3.7 Comparison of the performance of StOMP, ℓ1 minimization, and OMP in recovering the sparsest solution.
3.8 QQ plots comparing MAI with Gaussian distribution for the problem suite S(URPE, Gauss; d, n, k).
3.9 Phase diagram of the k-sparse solution property of StOMP applied to the problem suite S(URPE, Gauss; d, n, k).
3.10 Performance of StOMP when the nonzeros have a non-Gaussian distribution.
3.11 Time-frequency separation with StOMP.
3.12 Time-frequency separation with StOMP, in the presence of Gaussian noise.
4.1 Error of reconstruction versus number of samples.
4.2 Signal Blocks.
4.3 CS reconstructions of Blocks.
4.4 CS reconstruction of a signal with controlled ℓp norm.
4.5 Reconstruction error versus number of measurements.
4.6 Reconstruction error versus signal length.
4.7 Reconstruction error versus number of measurements, with ℓ0 sparsity.
4.8 Error versus number of measurements d for p = 3/4, n = 2048.
4.9 Signal Bumps.
4.10 CS reconstruction of Bumps.
4.11 CSDN reconstruction of Blocks.
4.12 CSDN reconstruction of Bumps.
4.13 Multiscale reconstruction of signal Blocks.
4.14 Multiscale reconstruction of signal Bumps.
4.15 Reconstruction of Mondrian image with Multiscale CS.
4.16 CS reconstruction of Shepp-Logan Phantom.
4.17 Reconstructed Curvelet coefficients of Shepp-Logan at scale 7.
4.18 Reconstruction of Bumps with hybrid CS and StOMP.
4.19 Reconstruction of 2-D imagery with Multiscale CS and StOMP.
4.20 Zoom in on a horizontal slice of wavelet coefficients.
4.21 CS reconstruction of 3-D MRI data.
5.1 A mixed Gaussian/impulse noise channel model.
5.2 Example of the ℓ1 minimization based coding scheme when σ_U = 0 and V ~ N(0, 10).
5.3 Performance of coding scheme when σ_U = 0.
5.4 Performance of coding scheme with V ~ N(0, 10), for different configurations of σ_U and q.
5.5 Performance of coding scheme with V ~ U(0, 10), for different configurations of σ_U and q.

Chapter 1

Introduction

The grand aim of all science is to cover the greatest number of empirical facts by logical deduction from the smallest number of hypotheses or axioms.
Albert Einstein

1.1 The Search for Sparsity

At the forefront of scientific research lies the goal of finding the most parsimonious description of a phenomenon or dataset. It is often the parsimonious description that is also the most insightful one, and in a sense, the most natural one. This understanding has been guiding researchers in the signal processing community for decades. In a signal processing context, finding a compact description involves finding a transformation from the signal domain to an analysis space, where signals of interest may be represented compactly as a superposition of a few basic elements. Virtually every aspect of signal processing, be it analysis, synthesis, compression, denoising, etc., relies on such a transformation at its core. Consequently, a vast body of research in the area has focused on finding sparsifying transformations. Traditionally, efforts in this area focused mostly on the design of orthogonal transformations, the motivation being twofold. First, since an orthogonal transformation


is an isomorphism, the dimension of the analysis space equals the dimension of the signal space, and no redundancy is introduced in the analysis phase. Intuitively at least, that seems an important feature, as redundancy and parsimony are seemingly contradicting goals. Second, orthogonal transformations, being essentially rotations in a high-dimensional space, can be readily inverted by applying the adjoint linear operator. The availability of a simple analytic inverse implies that minimal effort is needed to compute an inverse transformation. Together, these two features make most orthogonal transforms very simple to apply in practice. Trends in the signal processing community have seen the rise and fall of several orthogonal transforms. In the ‘80, the Discrete Fourier Transform (DFT), and particularly its variant the Discrete Cosine Transform (DCT) have seen widespread use, with the DCT taking a leading role in several industry standards for compression of still and video imagery, namely JPEG [49], MPEG-1 [50] and MPEG-2 [51]. In the ‘90, the multiscale revolution has led to the popularization of wavelets [58], in particular orthogonal and bi-orthogonal wavelet bases, that have been included in the JPEG-2000 image compression standard [52]. In accord with orthogonality, practitioners in signal processing traditionally observed strict basis monogamy; a fixed representation – Fourier, DCT, Wavelet – was used to represent a wide class of signals, with little adaptation to the properties of a particular family of signals. Recently, perhaps mirroring a loosening of traditional attitudes in other matters, some researchers call for a rejection of orthogonality and/or monogamy: advocating to represent signals and images using redundant representations, sometimes composed of several different bases or frames at once [59, 17, 15]. The motivation is simple, yet iconoclastic; it stems from the realization that no one orthogonal basis can efficiently describe all signals. Different phenomena are best represented with different bases, each particularly suitable for describing a particular phenomenon. Thus, signals exhibiting a multitude of different phenomena are best represented with a mixture of bases, carefully chosen to accommodate a wide range of signal classes. This approach faces an obvious hurdle, stemming from traditional notions of linear algebra. To illustrate the difficulty, suppose we have a signal vector y containing d


entries, for simplicity assumed real, i.e. y ∈ R^d. We wish to represent y in a mixture of p different ortho-bases, represented by d × d matrices Φ_1, . . . , Φ_p. To do so, we construct a dictionary as the collection of all basis elements from these p bases. Formally, it is the matrix A = [Φ_1 Φ_2 · · · Φ_p]. To represent y as a superposition of elements from the dictionary A, we use the linear system

    y = Aα;                                  (1.2)

any α solving this system of equations gives a representation of y. Since A is d by pd, with p > 1, there are infinitely many such α's. In other words, the system (1.2) is underdetermined. By traditional attitudes, this indeterminacy is problematic. To resolve it, researchers have suggested a fresh point of view: it is not the collection of all solutions to the system (1.2) we are interested in, but the sparsest of all such solutions. As a measure of sparsity, they suggested the ℓ0 quasi-norm, defined as the number of nonzero elements of a vector,

    ‖α‖_0 = #{ j : α_j ≠ 0 }.                (1.3)

When ‖α‖_0 ≪ n for α ∈ R^n, we say that α is sparse. Thus, to find the sparsest solution to the underdetermined system of equations (1.2), one must solve

    (P0)    min_α ‖α‖_0  subject to  Aα = y.

In words, we look for the solution with the least number of nonzeros, among all solutions to the system of equations Aα = y. We call the optimization problem (P0 ) the Sparse Representation Problem (SRP).
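The dictionary construction and the ℓ0 sparsity measure are easy to make concrete. The sketch below is my own illustration, not code from the thesis; the helper dct_basis, the sizes, and the chosen nonzero positions are assumptions. It builds a spikes-plus-cosines dictionary A = [Φ_1 Φ_2] and synthesizes a signal with a 3-sparse representation.

```python
import numpy as np

def dct_basis(d):
    """d x d orthonormal DCT-II matrix; returned columns are discrete cosines."""
    j, k = np.meshgrid(np.arange(d), np.arange(d))   # j = sample index, k = frequency
    C = np.sqrt(2.0 / d) * np.cos(np.pi * (j + 0.5) * k / d)
    C[0, :] /= np.sqrt(2.0)                          # rescale so C @ C.T = I
    return C.T                                       # columns = basis elements

d = 64
A = np.hstack([np.eye(d), dct_basis(d)])             # d x 2d dictionary (p = 2)

alpha = np.zeros(2 * d)
alpha[[3, 70, 100]] = [1.5, -2.0, 0.7]               # one spike, two cosines
y = A @ alpha                                        # observed signal

print("l0 measure of alpha:", np.count_nonzero(alpha))          # 3
print("system is underdetermined:", A.shape[0] < A.shape[1])    # True
```

Any of infinitely many coefficient vectors reproduces this y; (P0) asks for the one with the fewest nonzeros.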

1.2 In Pursuit of Sparse Solutions

The sparse representation problem has been studied for nearly a century in various forms. In the statistical literature, a similar problem is known as subset selection [43], and a related problem in coding theory is called the minimum distance problem


[84]. Unfortunately, the SRP formulation does not yield a practical procedure for finding the sparsest solution. In general, solving (P0) is known to be NP-hard; it contains classic combinatorial optimization problems as subcases [61]. As a result, a large volume of work is dedicated to finding tractable methods for solving the SRP. In recent years, two approaches to solve the SRP have gained increasing popularity in the scientific literature. The first is best known as Matching Pursuit (MP), and was introduced to the signal processing community by Mallat et al. [59]. In the statistics community, a similar procedure, known as Forward Stagewise Regression or Projection Pursuit Regression [38], has been in use for over five decades. In the approximation theory literature, it is better known as the Pure Greedy Algorithm [77]. As the latter name suggests, MP is a simple greedy scheme that, at each step, selects the element of the dictionary (a column of the matrix A) that best correlates with the residual signal. As the residual is decreased at each iteration, the method can be shown to converge to a solution, though convergence is often slow. A variant of MP, Orthogonal Matching Pursuit (OMP) [19], adds a least-squares minimization at each step to significantly expedite convergence of the iteration. In fact, it has been shown that in certain cases, OMP requires only k iterations to recover a sparse representation consisting of k elements of the dictionary [26, 75, 76, 79]. This property, combined with the fact that it can be applied very efficiently, has made OMP a popular choice in applications.

Another approach that has sparked much interest is Basis Pursuit. This method, originating in an influential paper by Chen, Donoho, and Saunders [15], amounts to relaxing the stringent ℓ0-norm objective in (P0) with the ℓ1-norm, obtaining

    (P1)    min_α ‖α‖_1  subject to  Aα = y.

(P1) is a convex optimization problem, and in fact can be recast as a linear program via a simple transformation of variables. Letting Ã = [A  −A] and α̃ = [α_+^T  α_−^T]^T, we obtain

    (LP1)    min_α̃ 1^T α̃  subject to  Ãα̃ = y,  α̃ ≥ 0.

A solution α of (P1) corresponds to a solution α̃ of (LP1) via the relation α = α_+ − α_−. Thus, Basis Pursuit replaces an NP-hard problem with a linear optimization problem, for which many off-the-shelf solvers exist. Of course, such a transformation is meaningless unless there is evidence that the solutions of (P1) and (P0) coincide. In their original paper, Chen et al. presented extensive empirical evidence suggesting that (P1) does, in fact, recover the sparsest solution in many cases [15]. Later work by Donoho and Huo established conditions on the matrix A under which the solution to (P1) is equivalent to the sparsest solution [27]. Their seminal work sparked the interest of several research groups, leading to a series of papers studying connections between solutions to (P1) and (P0) [25, 26, 37, 39, 44, 45, 80, 82].

The theoretical study of BP and OMP has demonstrated that, perhaps surprisingly, in many cases these two approaches offer comparable performance in recovering the sparsest solution to an underdetermined system of linear equations. These findings agree with empirical studies of the two methods, suggesting that in practice, OMP, despite its heuristic nature, performs roughly as well as BP for certain problem suites. This has the air of mystery: why do BP and OMP behave so similarly, when one has a rigorous foundation and the other one apparently does not? Is there some deeper connection or similarity between the two ideas?
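To make the (LP1) recast concrete, here is a minimal sketch, my own illustration rather than the thesis's SparseLab code, that solves (P1) with a generic LP solver (scipy.optimize.linprog); the function name basis_pursuit and the problem sizes are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Solve (P1) min ||alpha||_1 s.t. A alpha = y via the (LP1) recast.

    Split alpha = a_plus - a_minus with a_plus, a_minus >= 0, so the objective
    ||alpha||_1 becomes the linear form 1^T [a_plus; a_minus].
    """
    d, n = A.shape
    c = np.ones(2 * n)                       # 1^T alpha_tilde
    A_eq = np.hstack([A, -A])                # A_tilde = [A  -A]
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:n] - res.x[n:]             # alpha = alpha_plus - alpha_minus

# Tiny demo: a random 30 x 80 system with a 4-sparse solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 80))
A /= np.linalg.norm(A, axis=0)               # unit-norm columns
alpha0 = np.zeros(80)
alpha0[[5, 17, 42, 63]] = rng.standard_normal(4)
y = A @ alpha0

alpha_hat = basis_pursuit(A, y)
print("max recovery error:", np.max(np.abs(alpha_hat - alpha0)))
```

This general-purpose LP route is exactly the kind of solver whose cost, at larger scales, motivates the Homotopy and StOMP algorithms developed in Part I.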

1.3 Making Sense of Random Sensing

The study of BP and Omp presented a significant theoretical breakthrough; it established bounds on the performance of these methods in recovering the sparsest solution to the SRP. Yet, researchers faced a major hurdle; the theoretical bounds established for structured dictionaries did not agree with the performance observed in practice. Candes, Romberg and Tao [10] offered an elegant solution to this problem. They concentrated on the case where the dictionary is obtained by sampling at random rows of a Fourier matrix. They showed that when the solution to (P0 ) with a random Fourier dictionary is sufficiently sparse, Basis Pursuit recovers the solution with high probability. They further demonstrated that their theoretical predictions are in line with empirical observations. Their work established that by studying solutions to the SRP in a probabilistic setting, where the underlying dictionary is ‘random’, more insight


can be gained about the ability of BP to recover the sparsest solution. Guided by this new approach, several research groups presented results about the ability of BP and Omp to recover sparse solutions in a random setting [11, 14, 13, 22, 23, 20, 24, 81]. In fact, the implications of the work by Candes et al. [10] had even more significance. They suggested that a signal can be perfectly reconstructed from random samples in the frequency domain, with the number of samples proportional to the number of nonzeros in the signal, rather than to the number of degrees of freedom it has. This notion challenged traditional thinking of sampling theory, which suggests that in order to obtain an accurate representation of a signal, uniform sampling at a specified rate should be applied to it. Donoho extended their work and showed that a new theory of sampling can be developed for sparse signals. He coined the name Compressed Sensing for this approach [21]. Here is the basic idea underlying Compressed Sensing. We assume the object of interest is a vector x0 ∈ Rn representing the n sampled values of a digital signal or image. A key assumption is that the object is compressible; there exists some linear transformation Ψ such that the resulting vector ΨT x0 has most of its information contained in a relatively small number of coordinates. The compressibility constraint is a weaker form of the sparsity constraint, allowing for a wider class of signals. To make our compressed measurements, we construct a d by n matrix Φ with d < n, by randomly sampling from a uniform distribution. We take the d-pieces of measured information y = ΦΨT x0 . In words, each measurement yi is some random linear combination of the coefficients of x0 . Since d < n this amounts to having fewer measurements than degrees of freedom for the signal x0 . To reconstruct an approximation to x0 we solve (CS1 )

    min_x ‖Ψ^T x‖_1  subject to  ΦΨ^T x = y.        (1.4)

In [21], Donoho proved error bounds showing that despite the apparent undersampling (d < n), good accuracy reconstruction is possible for compressible objects. Since its inception, the theory of Compressed Sensing has drawn considerable attention from theorists as well as practitioners. Applications of compressed sensing


have been proposed in Magnetic Resonance Imaging [10, 55, 56], digital image acquisition [86], wireless sensor networks [2], and analog to digital conversion [54], to name a few.
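As a toy illustration of the undersampling idea, the following is my own sketch (not an experiment from the thesis); for simplicity the sparsifying transform Ψ is the identity, so (CS1) reduces to (P1), and the sizes and seed are arbitrary choices.

```python
import numpy as np
from scipy.optimize import linprog

# Take d < n random measurements y = Phi @ x0 of a k-sparse object and
# reconstruct it by l1 minimization, as in (CS1) with Psi = I.
rng = np.random.default_rng(1)
n, d, k = 256, 80, 10                             # signal length, measurements, sparsity
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

Phi = rng.standard_normal((d, n)) / np.sqrt(d)    # random sensing matrix
y = Phi @ x0                                      # d measurements, d << n

# l1 reconstruction via the same split-variable LP used for (LP1).
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([Phi, -Phi]), b_eq=y,
              bounds=(0, None), method="highs")
x_hat = res.x[:n] - res.x[n:]

print("undersampling ratio d/n =", d / n)
print("max reconstruction error:", np.max(np.abs(x_hat - x0)))
```

Despite sampling less than a third of the nominal number of data points, the sparse object is typically recovered essentially exactly, which is the phenomenon Chapter 4 examines from a practitioner's viewpoint.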

1.4 Contributions

With the rising popularity of sparse representation models and compressed sensing theory, the need has grown for fast and reliable algorithms to find sparse solutions to underdetermined systems of linear equations. Moreover, the sparse representation model offers a fresh viewpoint that can be applied to a diverse set of problems in signal processing. Accordingly, this thesis is divided into two parts. The first part suggests and analyzes new algorithms to solve sparse approximation problems. The second part offers a practitioner’s view of compressed sensing theory, and suggests several potential application scenarios in which compressed sensing and sparse modeling may be useful. Chapter 2 of this thesis proposes applying the Homotopy algorithm [62], a.k.a. the Lars/Lasso algorithm [32], to solve (P1 ). The Homotopy algorithm, originally developed by Osborne et al. [62] and Efron et al. [32] to solve `1 -penalized regression problems, bears a striking resemblance to the greedy Omp, and yet it is able to solve the convex optimization problem (P1 ) exactly. In particular, Homotopy shares Omp’s modest computational requirements, and solves (P1 ) much faster than generalpurpose LP solvers when the solution is sparse. We characterize this efficiency formally, and demonstrate that in certain cases, Homotopy has the following k-step solution property: If the sparsest solution is composed of k nonzeros, Homotopy finds it in exactly k iterative steps. To show this property, we consider two scenarios. In the deterministic setting, we consider incoherent matrices A, where off-diagonal entries of AT A are all smaller than M . We prove that, as long as the solution has at most k ≤ (M −1 +1)/2 nonzeros, Homotopy recovers it in k steps. In the probabilistic setting, we introduce an empirical analysis framework, utilizing phase diagrams and phase transitions. In this framework, we show that for instances of d × n random matrices A, where A has iid Gaussian


entries, Homotopy recovers the sparsest solution with high probability as long as it has at most k < d/(2 log(n)) · (1 − n ) nonzeros, for some small n > 0. We use our analysis framework to empirically characterize the k-step property with matrices A drawn from other random ensembles, including partial Fourier matrices and partial Hadamard matrices. Our approach sheds light on the evident parallelism in results on `1 minimization and Omp that has puzzled researchers in the field. We point out a series of links connecting Homotopy, which solves an `1 minimization problem, to the greedy Omp, thus resolving much of the mystery surrounding the similar results obtained by these two seemingly disparate approaches to solve the SRP. In Chapter 3, we approach the sparse representation problem from a different perspective, presenting Stagewise Orthogonal Matching Pursuit (Stomp), an algorithm to find approximate sparse solutions to underdetermined systems of linear equations. Stomp is an iterative thresholding algorithm that takes a fixed, small number of iterations to generate a solution. Much like Homotopy and Omp, Stomp successively computes a series of sparse approximate solutions, identifying new coordinates to enter the model at each stage by examination of the vector of correlations with the residual. In contrast to Omp or Homotopy, many coefficients can enter the model at each stage in Stomp. Thus, Stomp runs much faster than competing proposals for sparse solutions. Our proposal is based on the observation that when the underdetermined matrix A is drawn from a random ensemble, the correlation vector formed at each iteration can be viewed as the vector of true nonzero coefficients contaminated by a random interference with an approximately Gaussian behavior. In other words, the indeterminacy of the random matrix A manifests itself similarly to Gaussian noise in the measurements (although we assume no noise is present). Relying on this observation, we design Stomp as essentially solving a sequence of Gaussian denoising problems, and propose a natural strategy for threshold selection, based on false alarm rate control. Using the empirical framework devised in Chapter 2, we analyze the performance of Stomp, and compare it with other algorithms. Our analysis suggests that for certain problem suites, Stomp’s ability to recover the sparsest solution is almost as good as


`1 minimization — at a significantly lower cost. Together, our two proposed schemes, Homotopy and Stomp, form an arsenal of solution techniques to tackle sparse representation problems. Homotopy is an efficient stepwise algorithm able to solve (P1 ) exactly at a much lower cost than general-purpose LP solvers. It is augmented by Stomp, which finds approximate sparse solutions to underdetermined systems rapidly using a small, fixed number of iterations, and is of great usefulness when dealing with large-scale problems. In the second part of this thesis, we discuss several application scenarios in which these two solution techniques are used. Chapter 4 starts with an investigation of compressed sensing from a practitioner’s point of view. We present an empirical study to investigate the constant term hiding in the error bound derived in [21], for a class of compressible signals that have sparse coefficients as measured by `p norms, 0 < p ≤ 1. Our results suggest that in practice, the simulated behavior of compressed sensing agrees with theoretical predictions, with a modest constant term. Further, we explore the freedom available in the choice of a sensing matrix, describing several different matrix ensembles that seem to give equivalent results. We consider a deployment of compressed sensing in a two-scale and multi-scale fashion, and demonstrate that in certain cases, we can obtain reconstruction results superior to the ones obtained with a mono-scale approach. In addition, we examine the performance of Stomp as an alternative for `1 minimization in a compressed sensing framework, and demonstrate that it is very effective in recovering solutions to large-scale compressed sensing problems. We finally suggest a natural application of compressed sensing to reconstruction of images from partial MR data, and demonstrate that we can achieve a faithful reconstruction using only half the number of measurements needed with standard methods. The last chapter applies ideas and algorithms developed in earlier chapters to a classical problem in communications, of designing codes that can correct errors in a noisy communication environment. We consider a digital channel model corrupted by a mixture of two interferences: additive white Gaussian noise, and additive impulse noise. We propose a class of ‘random’ codes for robust transmission over this communication channel. Our proposed scheme employs the Homotopy algorithm


to solve an `1 minimization at the decoding stage. We present simulation results demonstrating that for certain channel configurations, the rate at which we can reliably transmit information with a negligible probability of error using the proposed codes is comparable to the fundamental limits set by Shannon’s capacity. Finally, in Appendix A we describe SparseLab, a library of Matlab functions and scripts to solve sparse approximation problems. SparseLab was developed at Stanford University, and has been used by the author to generate all the simulation results, figures, and tables appearing in this thesis. In the spirit of reproducible research [7], SparseLab is freely available to download from the web [73], and we hope that its release will lead to new and exciting results in the field.

Part I

Algorithms

Chapter 2

Sparse Solution of ℓ1 Minimization Problems with the Homotopy Method

A growing collection of works [15, 27, 37, 25, 22, 23, 39, 44, 10, 14] has demonstrated that the minimum ℓ1-norm solution to an underdetermined system of linear equations y = Ax is often, remarkably, also the sparsest solution to that system. This sparsity-seeking property has attracted the attention of practitioners in signal and image processing [34, 36, 55, 56, 70, 71, 5]. In Chapter 1, we mentioned that the ℓ1 minimization problem (P1) may be expressed as the linear program (LP1) and solved as such. Indeed, researchers in the field have traditionally advocated the use of off-the-shelf linear programming solvers to solve ℓ1 minimization problems, largely due to the immediate availability of such solvers. Yet, general-purpose optimizers are proving to be far too slow for ℓ1 minimization in most applications of practical interest.

In this chapter, we apply the Homotopy method, originally proposed by Osborne et al. [62] and Efron et al. [32], to solve underdetermined ℓ1 minimization problems. We demonstrate that Homotopy runs much more rapidly than general-purpose LP solvers when sufficient sparsity is present. In detail, we show that in many cases, the method has the following k-step solution property: if the underlying solution has only


k nonzeros, the Homotopy method reaches that solution in only k iterative steps. When this property holds and k is small compared to the problem size, this means that Homotopy solves an `1 minimization problem at a fraction of the cost of solving the corresponding linear program. We demonstrate this k-step solution property for two kinds of problem suites. First, we consider incoherent matrices A with unit length columns, where off-diagonal entries of the Gram matrix AT A are all smaller than M . If y is a linear combination of at most k ≤ (M −1 + 1)/2 columns of A, we show that Homotopy has the k-step solution property. Second, ensembles of d×n matrices A, with iid Gaussian entries. If y is a linear combination of at most k < d/(2 log(n)) · (1 − n ) columns, for some small n > 0, Homotopy again exhibits the k-step solution property with high probability. Further, we give evidence showing that for ensembles of d × n partial orthogonal matrices, including partial Fourier matrices and partial Hadamard matrices, with high probability, the k-step solution property holds up to a dramatically higher threshold k, satisfying k/d < ρˆ(d/n) for a certain empirically-determined function ρˆ(δ). The results in this chapter imply that Homotopy can efficiently solve some very ambitious large-scale problems. In addition, the analysis presented here sheds light on the evident parallelism in results on `1 minimization and Orthogonal Matching Pursuit. The material in this chapter is derived from the manuscript [29], which has been submitted to IEEE Transactions on Information Theory.

2.1 Introduction

Traditionally, the Sparse Representation Problem (P0 ) has been catalogued as belonging to a class of combinatorial optimization problems. Recent theoretical developments have shown that when the underlying solution is sufficiently sparse, the solution to (P0 ) coincides with the solution of the `1 minimization problem (P1 ) [27, 37, 25, 22, 23, 39, 44, 10, 14]. Since (P0 ) is, in general, NP-hard [61], whereas (P1 ) can be solved by linear programming, this equivalence is quite remarkable. Nonetheless, general-purpose linear optimizers ultimately involve repeated solution of ‘full’


d × d linear systems, each costing order O(d³) flops to solve. Such complexity demands limit the usefulness of the LP approach to modestly sized problems. Yet, many of the interesting problems where (P1) has been contemplated are very large in scale. For example, over a decade ago, Basis Pursuit [15] tackled problems with d = 8,192 and n = 262,144, and problem sizes approached currently [9, 34, 36, 55, 83] are even larger. Such large-scale problems are simply too large for general-purpose strategies to be used in routine processing.

2.1.1 Greedy Approaches

Many of the applications of (P1) can instead be attacked heuristically by fitting sparse models, using greedy stepwise least squares. As discussed in Chapter 1, a popular scheme used in signal processing is Orthogonal Matching Pursuit (OMP) [19]. Rather than minimizing an objective function, OMP constructs a sparse solution to a given problem by iteratively building up an approximation; the vector y is approximated as a linear combination of a few columns of A. The set of columns to be used is built column by column, in a greedy fashion; at each iteration the column that best correlates with the residual signal is selected. Although OMP is a heuristic method, in some cases it works marvelously. In particular, there are examples where the data y admit sparse synthesis using only k columns of A, and greedy selection finds exactly those columns in just k steps. These examples are supported by theoretical work establishing this property; work by Tropp et al. [79, 81] and Donoho et al. [26] has shown that OMP can, in certain circumstances, succeed at finding the sparsest solution. However, theoretical work has also shown that in certain scenarios where ℓ1 minimization succeeds in finding the sparsest solution, OMP fails [15, 22, 79].
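The greedy selection loop is short enough to state in full. The sketch below is my own minimal NumPy rendering of OMP as just described; the function name, sizes, tolerance, and seed are assumptions, not the thesis's implementation.

```python
import numpy as np

def omp(A, y, k, tol=1e-10):
    """Greedy OMP sketch: at each step select the column of A most correlated
    with the residual, then re-fit the active set by least squares."""
    d, n = A.shape
    active, residual = [], y.copy()
    x = np.zeros(n)
    for _ in range(k):
        if np.linalg.norm(residual) <= tol:
            break
        j = int(np.argmax(np.abs(A.T @ residual)))    # best-correlated column
        if j not in active:
            active.append(j)
        coef, *_ = np.linalg.lstsq(A[:, active], y, rcond=None)
        x = np.zeros(n)
        x[active] = coef                               # orthogonal projection step
        residual = y - A @ x
    return x

# Demo: recover a 5-sparse vector from a 64 x 256 Gaussian system in 5 steps.
rng = np.random.default_rng(2)
A = rng.standard_normal((64, 256))
A /= np.linalg.norm(A, axis=0)
x0 = np.zeros(256)
x0[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
x_hat = omp(A, A @ x0, k=5)
print("max recovery error:", np.max(np.abs(x_hat - x0)))
```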

2.1.2 A Continuation Algorithm To Solve (P1)

In parallel with developments in the signal processing literature, there has been interest in the statistical community in fitting regression models while imposing `1 norm constraints on the regression coefficients. Tibshirani [78] proposed the so-called


Lasso problem (Least Absolute Shrinkage and Selection Operator), which we state using our notation as follows:

    (L_q)    min_x ‖y − Ax‖²_2  subject to  ‖x‖_1 ≤ q;

in words: a least-squares fit subject to an ℓ1-norm constraint on the coefficients. In Tibshirani's original proposal, the d × n matrix A was assumed to have d > n, i.e. representing an overdetermined linear system. It is convenient to consider instead the unconstrained optimization problem

    (D_λ)    min_x ‖y − Ax‖²_2 / 2 + λ‖x‖_1,

i.e., a form of ℓ1-penalized least squares. Indeed, problems (L_q) and (D_λ) are equivalent under an appropriate correspondence of parameters. To see that, associate with each problem (D_λ), λ ∈ [0, ∞), a solution x̃_λ (for simplicity assumed unique). The set {x̃_λ : λ ∈ [0, ∞)} identifies a solution path, with x̃_λ = 0 for λ large and, as λ → 0, x̃_λ converging to the solution of (P1). Similarly, {x̃_q : q ∈ [0, ∞)} traces out a solution path for problem (L_q), with x̃_q = 0 for q = 0 and, as q increases, x̃_q converging to the solution of (P1). Thus, there is a reparametrization q(λ) defined by q(λ) = ‖x̃_λ‖_1 so that the solution paths of (D_λ) and (L_{q(λ)}) coincide.

In the classic overdetermined setting (d > n), Osborne, Presnell and Turlach [62, 63] made the useful observation that the solution path of (D_λ), λ ≥ 0 (or (L_q), q ≥ 0) is polygonal. Further, they characterized the changes in the solution x̃_λ at vertices of the polygonal path. Vertices on this solution path correspond to subset models, i.e. vectors having nonzero elements only on a subset of the potential candidate entries. As we move along the solution path, the subset is piecewise constant as a function of λ, changing only at critical values of λ corresponding to the vertices on the polygonal path. This evolving subset is called the active set. Based on these observations, they presented the Homotopy algorithm, which follows the solution path by jumping from vertex to vertex of this polygonal path. It starts at x̃_λ = 0 (and an empty active set) for λ > λ_0, where λ_0 = ‖A^T y‖_∞. As λ is reduced, the active set is updated at each vertex of the polygonal path through the addition and removal of variables. Thus, in a sequence of steps, it successively obtains the solutions x̃_{λ_ℓ} at a special problem-dependent sequence λ_ℓ associated with vertices of the polygonal path. The name Homotopy refers to the fact that the objective function for (D_λ) is undergoing a homotopy from the ℓ2 constraint to the ℓ1 objective as λ decreases.

Efron, Hastie, Johnstone and Tibshirani [32] extended the work of Osborne et al., and formally showed that the Homotopy algorithm solves the minimization problem (D_λ) for all λ > 0, i.e. that it yields the entire solution path of (D_λ). Malioutov et al. [57] were the first to apply the Homotopy method to the formulation (D_λ) in the underdetermined setting (d < n), when the data are noisy. We also suggest using the Homotopy method to solve (P1) in the underdetermined setting.

Homotopy Algorithm for Solving (P1): Apply the Homotopy method, following the solution path from x̃_{λ_0} = 0 to x̃_0. When the λ = 0 limit is reached, (P1) is solved.

Traditionally, to solve (P1), one would apply the simplex algorithm or an interior-point method, which start with a dense solution and converge to the solution of (P1) through a sequence of iterations, each requiring the solution of a d × d linear system. In contrast, the Homotopy method starts out at x̃_{λ_0} = 0 and successively builds a sparse solution by adding or removing elements from its active set. Clearly, in a sparse setting, this latter approach is much more favorable, since as long as the solution has few nonzeros, Homotopy will reach the solution in a few steps. Numerically, each step of the algorithm involves a rank-one update of a linear system, and so if the whole procedure stops in k steps, yielding a solution with k nonzeros, its overall complexity is bounded by k³ + kdn flops. For k ≪ d and d ∝ n, this is far better than the O(d³) flops it would take to solve (P1) with a general-purpose LP solver. Moreover, to solve (D_λ) for all λ ≥ 0 by a traditional approach, one would need to solve a quadratic program for every λ value of interest. For any problem size beyond the very smallest, that would be prohibitively time consuming. In contrast, Homotopy delivers all the solutions x̃_λ to (D_λ), λ ≥ 0.
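To get a feel for the gap between the two bounds, here is a back-of-the-envelope comparison at roughly the Basis Pursuit scale quoted above; the particular value of k is my own illustrative assumption.

```python
# Rough flop counts for the bounds discussed above (illustrative only).
d, n, k = 8192, 262144, 100

homotopy_flops = k**3 + k * d * n      # k-step Homotopy bound: k^3 + kdn
lp_iteration_flops = d**3              # one dense d x d solve inside a generic LP solver

print(f"Homotopy bound:   {homotopy_flops:.2e} flops")   # ~2.1e+11
print(f"One LP iteration: {lp_iteration_flops:.2e} flops")  # ~5.5e+11
# A general-purpose solver performs many such d x d solves, so when k << d the
# k-step Homotopy path is cheaper by a large factor.
```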

2.1.3 LARS

In [32], Efron et al. suggested an alternate derivation of the Homotopy algorithm, taking a geometric perspective, which is quite instructive. They defined a basic procedure, called Least Angle Regression (Lars), inspired by stagewise regression methods in statistics. The Lars algorithm maintains an active set of nonzero variables composing the current solution. At each step, the algorithm adds a new element to the active set, by taking a step along an equiangular direction, that is, a direction having equal angles with the vectors in the active set. Efron et al. observed that, quite remarkably, experimental results suggest that the solution obtained by following the basic Lars procedure is often identical to the Lasso solution. This equality is very interesting in the present context, because Lars is so similar to Omp. Both algorithms build up a model a step at a time, adding a new variable to the active set at each step, and ensuring that the new variable is in some sense the most important among the potential candidate variables. The details of determining importance differ but in both cases involve the inner product between the candidate new variables and the current residual. They further noted that the Homotopy algorithm is very similar to Lars, with the exception that, when moving to a new vertex of the solution polygon, Homotopy may remove existing elements from the active set. Thus, they suggested a variant of Lars, named Lars with the Lasso modification, which is identical to the Homotopy algorithm, and thus solves (Dλ ). In this thesis, we use the common name Homotopy to refer to both algorithms. From an algorithmic viewpoint, the difference between Homotopy and Lars translates into a few lines of code. And yet this small variation makes Homotopy rigorously able to solve a global optimization problem. On the other hand, it implies that the number of steps required by Homotopy can, in principle, be significantly greater than the number of steps required by Lars, as terms may enter and leave the active set numerous times. If so, we observe “model churning”, which causes Homotopy to run slowly. We present evidence that under favorable conditions, such churning does not occur. The Homotopy algorithm is then roughly as fast as both Lars and Omp.
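The LARS/Homotopy path is available in standard libraries, which makes the relationship easy to inspect numerically. The sketch below is my own illustration and assumes scikit-learn's lars_path is available (the thesis itself uses SparseLab); method="lar" runs plain LARS, while method="lasso" applies the Lasso modification, i.e. the Homotopy algorithm. On sparse, well-conditioned draws like this one, the two paths typically coincide, mirroring the observation of Efron et al.

```python
import numpy as np
from sklearn.linear_model import lars_path   # assumes scikit-learn is installed

rng = np.random.default_rng(3)
d, n, k = 40, 120, 5
A = rng.standard_normal((d, n))
A /= np.linalg.norm(A, axis=0)
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x0

_, _, coefs_lars = lars_path(A, y, method="lar")          # no removals from the active set
alphas, _, coefs_lasso = lars_path(A, y, method="lasso")  # Homotopy / Lasso modification

print("LARS path vertices:    ", coefs_lars.shape[1])
print("Homotopy path vertices:", coefs_lasso.shape[1])
print("path endpoint matches x0:", np.allclose(coefs_lasso[:, -1], x0, atol=1e-6))
```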

2.1.4 Main Results

Our results about the stopping behavior of Homotopy require some terminology.

Definition 1. A problem suite S(E, V; d, n, k) is a collection of problems defined by three components: (a) an ensemble E of matrices A of size d × n; (b) an ensemble V of n-vectors α_0 with at most k nonzeros; (c) an induced ensemble of left-hand sides y = Aα_0.

Where the ensemble V is omitted from the definition of a problem suite, we interpret that as saying that the relevant statements hold for any nonzero distribution. Otherwise, V will take one of three values: (Uniform), when the nonzeros are iid uniform on [0, 1]; (Gauss), when the nonzeros are iid standard normal; and (Bernoulli), when the nonzeros are iid equi-probable Bernoulli distributed with values ±1.

Definition 2. An algorithm A has the k-step solution property at a given problem instance (A, y) drawn from a problem suite S(E, V; d, n, k) if, when applied to the data (A, y), the algorithm terminates after at most k steps with the correct solution.

Our main results establish the k-step solution property for specific problem suites.

Incoherent Systems

The mutual coherence M(A) of a matrix A whose columns are normalized to unit length is the maximal off-diagonal entry of the Gram matrix A^T A. We call the collection of matrices A with M(A) ≤ µ the incoherent ensemble with coherence bound µ (denoted Inc_µ). Let S(Inc_µ; d, n, k) be the suite of problems with d × n matrices drawn from the incoherent ensemble Inc_µ, with vectors α_0 having ‖α_0‖_0 ≤ k. For the incoherent problem suite, we have the following result.

Theorem 1. Let (A, y) be a problem instance drawn from S(Inc_µ; d, n, k). Suppose that

    k ≤ (µ^{-1} + 1)/2.        (2.1)

Then, the Homotopy algorithm runs k steps and stops, delivering the solution α_0.


Condition (2.1) has appeared elsewhere; work by Donoho et al. [27, 25], Gribonval and Nielsen [44], and Tropp [79] established that when (2.1) holds, the sparsest solution is unique and equal to α_0, and both ℓ1 minimization and Omp recover α_0. In particular, Omp takes at most k steps to reach the solution. In short, we show here that for the general class of problems S(Inc_µ; d, n, k), where a unique sparsest solution is known to exist, and where Omp finds it in k steps, Homotopy finds that same solution in the same number of steps. Note that Homotopy always solves the ℓ1 minimization problem; the result shows that it operates particularly rapidly over the sparse incoherent suite.

Random Matrices

We first consider d × n random matrices A from the Uniform Spherical Ensemble (USE). Such matrices have columns independently and uniformly distributed on the sphere S^{d−1}. Consider the suite S(USE; d, n, k) of random problems where A is drawn from the USE and where α_0 has at most k nonzeros at k randomly-chosen sites, with the sites chosen independently of A.

Empirical Finding 1. Fix δ ∈ (0, 1). Let d = d_n = ⌊δn⌋. Let (A, y) be a problem instance drawn from S(USE; d, n, k). There is ε_n ∈ (0, 1) so that, with high probability, for

    k ≤ (d / (2 log(n))) · (1 − ε_n),        (2.2)

the Homotopy algorithm runs k steps and stops, delivering the solution α_0.

Remarks:

• We conjecture that corresponding to this empirical finding is a theorem, in which the conclusion is that as n → ∞, ε_n → 0, and “high probability” means with probability exceeding 1 − ε_n.

• The empirical result is in fact stronger. We demonstrate that for problem instances with k emphatically not satisfying (2.2), but rather

    k ≥ (d / (2 log(n))) · (1 + ε̃_n),


then, with high probability for large n, the Homotopy algorithm fails to terminate in k steps. Thus, (2.2) delineates a boundary between two regions in (k, d, n) space; one where the k-step property holds with high probability, and another where the chance of k-step termination is very low.

• This empirical finding also holds for matrices drawn from the Random Signs Ensemble (RSE). Matrices in this ensemble are constrained to have their entries ±1 (suitably normalized to obtain unit-norm columns), with signs drawn independently and with equal probability.

We next consider ensembles made of partial orthogonal transformations. In particular, we consider the following matrix ensembles:

• Partial Fourier Ensemble (PFE). Matrices in this ensemble are obtained by sampling at random d rows of an n by n Fourier matrix.

• Partial Hadamard Ensemble (PHE). Matrices in this ensemble are obtained by sampling at random d rows of an n by n Hadamard matrix.

• Uniform Random Projection Ensemble (URPE). Matrices in this ensemble are obtained by generating an n by n random orthogonal matrix and sampling d of its rows at random.

As it turns out, when applied to problems drawn from these ensembles, the Homotopy algorithm maintains the k-step property at a substantially greater range of signal sparsities. In detail, we give evidence supporting the following conclusion.

Empirical Finding 2. Fix δ ∈ (0, 1). Let d = d_n = ⌊δn⌋. Let (A, y) be a problem instance drawn from S(E, V; d, n, k), with E ∈ {PFE, PHE, URPE}, and V ∈ {Uniform, Gauss, Bernoulli}. There is ε_n ∈ (0, 1) with the following property. The empirical function ρ̂(δ) depicted in Figure 2.6 marks a transition such that, for k/d ≤ ρ̂(δ)(1 − ε_n), the Homotopy algorithm runs k steps and stops, delivering the solution α_0.


A third, equally important finding we present involves the behavior of the Homotopy algorithm when the k-step property does not hold. We provide evidence to the following. Empirical Finding 3. Fix δ ∈ (0, 1). Let d = dn = bδnc. Let (A, y) be a problem instance drawn from S(E,V; d, n, k), with E ∈ { PFE,PHE,URPE }, V ∈ { Uniform,Gauss,Bernoulli }, and (d, n, k) chosen so that the k-step property does not hold. With high probability, Homotopy applied to (A, y) operates in either of two modes: 1. It recovers the sparse solution α0 in cs ·d iterations, with cs a constant depending on E, V , empirically found to be less than 1.6. 2. It does not recover α0 , returning a solution to (P1 ) in cf · d iterations, with cf a constant depending on E, V , empirically found to be less than 4.85. We interpret this finding as saying that we may safely apply Homotopy to problems whose solutions are not known a priori to be sparse, and still obtain a solution rapidly. In addition, if the underlying solution to the problem is indeed sparse, Homotopy is particularly efficient in finding it. This finding is of great usefulness in practical applications of Homotopy. To summarize, we consider two settings for performance measurement: deterministic incoherent matrices and random matrices. We demonstrate that, when a k-sparse representation exists, k  d, the Homotopy algorithm finds it in k steps. Results in each setting parallel existing results about Omp in that setting. Moreover, each step of Homotopy is identical to a step of Lars and therefore very similar to a corresponding step of Omp. We interpret this parallelism as follows. The Homotopy algorithm, in addition to solving the `1 minimization problem, runs just as rapidly as the heuristic algorithms Omp/ Lars, in those problem suites where those algorithms have been shown to correctly solve the sparse representation problem. Since there exists a wide range of cases where `1 -minimization is known to find the sparsest solution while Omp is known to fail [15, 22, 79], Homotopy gives in some


sense the best of both worlds by a single algorithm: speed where possible, sparsity where possible. In addition, our viewpoint sheds new light on previously-observed similarities between results about `1 minimization and about Omp. Previous theoretical work [26, 79, 81] found that these two seemingly disparate methods had comparable theoretical results for finding sparse solutions. Our work exhibits inherent similarities between the two methods that motivate these parallel findings.

2.1.5 Relation to Earlier Work

Our results seem interesting in light of previous work by Osborne et al. [62] and Efron et al. [32]. In particular, our proposal for fast solution of ℓ1 minimization amounts simply to applying to (P1) previously known algorithms (Homotopy, a.k.a. Lars/Lasso) designed for fitting equations to noisy data. We make the following contributions.

• Few Steps. Efron et al. considered the overdetermined noisy setting, and remarked that while, in principle, Homotopy could take many steps, they had encountered examples where it took a direct path to the solution, in which a term, once entered into the active set, stayed in the active set [32]. Our work in the noiseless underdetermined setting formally identifies a precise phenomenon, namely the k-step solution property, and delineates a range of problem suites where this phenomenon occurs.

• Similarity of Homotopy and Lars. Efron et al., in the overdetermined noisy setting, commented that while in principle the solution paths of Homotopy and Lars could be different, in practice they came across examples where Homotopy and Lars yielded very similar results [32]. Our work in the noiseless case formally defines a property implying that Homotopy and Lars have the same solution path, and delineates a range of problem suites where this property holds. In addition, we present simulation studies showing that, over a region in parameter space where ℓ1 minimization recovers the sparsest solution, Lars also recovers the sparsest solution.


• Similarity of Homotopy and Omp. In the random setting, Tropp and Gilbert demonstrated that Omp recovers the sparsest solution under a condition similar to (2.2), although with a somewhat worse constant term; their result (and proof) can be interpreted in the light of our work as saying that Omp actually has the k-step solution property under their condition. In fact it has the k-step solution property under the condition (2.2). Below, we elaborate on the connections between Homotopy and Omp leading to these similar results. • Similarity of Homotopy and Polytope Faces Pursuit. Recently, Plumbley introduced a greedy algorithm named Polytope Faces Pursuit (Pfp) to solve (P1 ) in the dual space [66, 65]. The algorithm bears strong resemblance to Homotopy and Omp. Below, we elaborate on Plumbley’s work, to show that the affinities are not coincidental, and in fact, under certain conditions, Pfp is equivalent to Homotopy. We then conclude that Pfp maintains the k-step solution property under the same conditions required for Homotopy. In short, we provide theoretical underpinnings, formal structure, and empirical findings. We also believe our viewpoint clarifies the connections between Homotopy, Lars, and Pfp.

2.1.6 Contents

The remainder of this chapter is organized as follows. Section 2.2 reviews the Homotopy algorithm in detail. Section 2.3 presents running-time measurements alongside a formal complexity analysis of the algorithm, and discusses evidence leading to Empirical Finding 3. In Section 2.4 we prove Theorem 1, the k-step solution property for the sparse incoherent problem suite. In Section 2.5 we demonstrate Empirical Finding 1, the k-step solution property for the sparse USE problem suite. Section 2.6 follows, with evidence to support Empirical Finding 2. In Section 2.7 we elaborate on the relation between `1 minimization, Lars, and Omp. This provides a natural segue to Section 2.8, which discusses the connections between Homotopy and Pfp. Section 2.9 then offers a comparison of the performance of Homotopy, Lars, Omp


and Pfp in recovering sparse solutions. A final section offers a discussion and some concluding remarks.

2.2 The Homotopy Algorithm

We start our exposition with a description of the Homotopy algorithm, following the derivation in [32]. The general principle undergirding homotopy methods is to trace a solution path, parametrized by one or more variables, while evolving the parameter vector from an initial value for which the corresponding solution is known, to the desired value. For the Lasso problem (D_λ), this implies following the solution x_λ, starting at λ large and x_λ = 0, and terminating when λ → 0 and x_λ converging to the solution of (P1). The solution path is followed by maintaining the optimality conditions of (D_λ) at each point along the path.

Specifically, let f_λ(x) denote the objective function of (D_λ). By classical ideas in convex analysis, a necessary condition for x_λ to be a minimizer of f_λ(x) is that 0 ∈ ∂_x f_λ(x_λ); i.e., the zero vector is an element of the subdifferential of f_λ at x_λ. We calculate

    ∂_x f_λ(x_λ) = −A^T (y − A x_λ) + λ ∂||x_λ||_1,        (2.3)

where ∂||x_λ||_1 is the subgradient

    ∂||x_λ||_1 = { u ∈ R^n : u_i = sgn(x_{λ,i}) if x_{λ,i} ≠ 0;  u_i ∈ [−1, 1] if x_{λ,i} = 0 }.

Let I = {i : x_λ(i) ≠ 0} denote the support of x_λ, and call c = A^T (y − A x_λ) the vector of residual correlations. Then, equation (2.3) can be written equivalently as the two conditions

    c(I) = λ · sgn(x_λ(I)),        (2.4)

and

    |c(I^c)| ≤ λ.        (2.5)

In words, residual correlations on the support I must all have magnitude equal to λ, and signs that match the corresponding elements of x_λ, whereas residual correlations off the support must have magnitude less than or equal to λ.

The Homotopy algorithm now follows from these two conditions, by tracing out the optimal path x_λ that maintains (2.4) and (2.5) for all λ ≥ 0. The key to its successful operation is that the path x_λ is a piecewise linear path, with a discrete number of vertices [32, 62]. The algorithm starts with an initial solution x_0 = 0, and operates in an iterative fashion, computing solution estimates x_ℓ, ℓ = 1, 2, . . .. Throughout its operation, it maintains the active set I, which satisfies

    I = {j : |c_ℓ(j)| = ||c_ℓ||_∞ = λ},        (2.6)

as implied by conditions (2.4) and (2.5). At the ℓ-th stage, Homotopy first computes an update direction d_ℓ, by solving

    A_I^T A_I d_ℓ(I) = sgn(c_ℓ(I)),        (2.7)

with d_ℓ set to zero in coordinates not in I. This update direction ensures that the magnitudes of residual correlations on the active set all decline equally. The algorithm then computes the step size to the next breakpoint along the homotopy path. Two scenarios may lead to a breakpoint. First, that a non-active element of c_ℓ would increase in magnitude beyond λ, violating (2.5). This first occurs when

    γ_ℓ^+ = min_{i ∈ I^c} { (λ − c_ℓ(i)) / (1 − a_i^T v_ℓ),  (λ + c_ℓ(i)) / (1 + a_i^T v_ℓ) },        (2.8)

where v_ℓ = A_I d_ℓ(I), and the minimum is taken only over positive arguments. Call the minimizing index i^+. The second scenario leading to a breakpoint in the path occurs when an active coordinate crosses zero, violating the sign agreement in (2.4). This first occurs when

    γ_ℓ^− = min_{i ∈ I} { −x_ℓ(i) / d_ℓ(i) },        (2.9)

where again the minimum is taken only over positive arguments. Call the minimizing index i^−. Homotopy then marches to the next breakpoint, determined by

    γ_ℓ = min{ γ_ℓ^+, γ_ℓ^− },        (2.10)

updates the active set, by either appending i^+ to I or removing i^−, and computes a solution estimate x_ℓ = x_{ℓ−1} + γ_ℓ d_ℓ. The algorithm terminates when ||c_ℓ||_∞ = 0, and the solution of (P1) has been reached.

Remarks:

1. In the algorithm description above, we implicitly assume that at each breakpoint on the homotopy path, at most one new coordinate is considered as candidate to enter the active set (Efron et al. called this the “one at a time” condition [32]). If two or more vectors are candidates, more care must be taken in order to choose the correct subset of coordinates to enter the active set; see [32] for a discussion.

2. In the introduction to this chapter, we noted that the Lars scheme closely mimics Homotopy, the main difference being that Lars does not allow removal of elements from the active set. Indeed, to obtain the Lars procedure, one simply follows the sequence of steps described above, omitting the computation of i^− and γ_ℓ^−, and replacing (2.10) with γ_ℓ = γ_ℓ^+. The resulting scheme adds a single element to the active set at each iteration, never removing active elements from the set.

3. The Homotopy algorithm may be easily adapted to deal with noisy data. Assume that rather than observing y_0 = Aα_0, we observe a noisy version y = Aα_0 + z, with ||z||_2 ≤ ε_n. Donoho et al. [26, 23] have shown that, for certain matrix ensembles, the solution x_q of (L_q) with q = ε_n has an error at worst proportional to the noise level. To solve for x_{ε_n}, we simply apply the Homotopy algorithm as described above, terminating as soon as the residual satisfies ||r_ℓ|| ≤ ε_n. Since in most practical scenarios it is not sensible to assume that the measurements are perfectly noiseless, this attribute of Homotopy makes it an apt choice for use in practice.
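To make the description above concrete, here is a minimal NumPy sketch of one way to implement this iteration. It is only a sketch: it assumes unit-norm columns, re-solves (2.7) with dense linear algebra at every step instead of updating a Cholesky factor, ignores ties (the “one at a time” condition of Remark 1), and the function and variable names are ours rather than part of any published code.

    import numpy as np

    def homotopy(A, y, tol=1e-8, max_iter=None):
        """Follow the piecewise-linear Homotopy (Lars/Lasso) path from x = 0 to a solution of (P1).

        Returns the solution estimate and the number of breakpoints traversed.
        """
        d, n = A.shape
        max_iter = max_iter or 10 * d
        x = np.zeros(n)
        active = []                                   # active set I
        c = A.T @ y                                   # residual correlations c = A^T (y - A x)
        steps = 0
        for _ in range(max_iter):
            lam = np.max(np.abs(c))                   # current value of lambda, cf. (2.6)
            if lam < tol:                             # correlations vanish: solution of (P1) reached
                break
            if not active:
                active.append(int(np.argmax(np.abs(c))))
            AI = A[:, active]
            dI = np.linalg.solve(AI.T @ AI, np.sign(c[active]))   # update direction, eq. (2.7)
            v = AI @ dI
            a = A.T @ v
            # first breakpoint scenario: an inactive correlation reaches lambda, eq. (2.8)
            gamma_plus, idx_plus = np.inf, None
            for i in range(n):
                if i in active:
                    continue
                for num, den in ((lam - c[i], 1.0 - a[i]), (lam + c[i], 1.0 + a[i])):
                    if abs(den) > tol and tol < num / den < gamma_plus:
                        gamma_plus, idx_plus = num / den, i
            # second breakpoint scenario: an active coefficient crosses zero, eq. (2.9)
            gamma_minus, idx_minus = np.inf, None
            for j, i in enumerate(active):
                if abs(dI[j]) > tol and tol < -x[i] / dI[j] < gamma_minus:
                    gamma_minus, idx_minus = -x[i] / dI[j], i
            gamma = min(gamma_plus, gamma_minus)
            if gamma >= lam - tol:                    # no breakpoint before lambda reaches zero
                gamma, idx_plus, idx_minus = lam, None, None
            for j, i in enumerate(active):            # march to the next vertex
                x[i] += gamma * dI[j]
            if idx_minus is not None and gamma == gamma_minus:
                x[idx_minus] = 0.0
                active.remove(idx_minus)              # the Lasso modification: drop the crossing term
            elif idx_plus is not None:
                active.append(idx_plus)               # a new index joins the active set
            c = A.T @ (y - A @ x)
            steps += 1
        return x, steps

Dropping the γ_ℓ^− computation and the removal branch turns this sketch into the Lars variant discussed above.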

2.3 Computational Cost

In earlier sections, we claimed that Homotopy can solve the problem (P1 ) much faster than general-purpose LP solvers, which are traditionally used to solve it. In particular, when the k-step solution property holds, Homotopy runs in a fraction of the time it takes to solve one d×d linear system, making it as efficient as fast stepwise algorithms such as Omp. To support these assertions, we now present the results of running-time simulations, complemented by a formal analysis of the asymptotic complexity of Homotopy.

2.3.1 Running Times

To evaluate the performance of the Homotopy algorithm applied to (P1), we compared its running times on instances of the problem suite S(USE, Gauss; d, n, k) with two state-of-the-art algorithms for solving Linear Programs. The first, LP Solve, is a Mixed Integer Linear Programming solver, implementing a variant of the Simplex algorithm [4]. The other, Pdco, a Primal-Dual Convex Optimization solver, is a log-barrier interior point method, written by Michael Saunders of the Stanford Systems Optimization Laboratory [69]. Table 2.1 shows the running times for Homotopy, LP Solve and Pdco, for various problem dimensions. The figures appearing in the table were measured on a 3GHz Xeon workstation.

    (d, n, k)             Homotopy   LP Solve     Pdco
    (200, 500, 20)            0.03       2.04     0.90
    (250, 500, 100)           0.94       9.27     1.47
    (500, 1000, 100)          1.28      45.18     4.62
    (750, 1000, 150)          2.15      85.78     7.68
    (500, 2000, 50)           0.34      59.52     6.86
    (1000, 2000, 200)         7.32     407.99    22.90
    (1600, 4000, 320)        31.70    2661.42   122.47

    Table 2.1: Execution times (seconds) of Homotopy, LP Solve and Pdco, applied to instances of the problem suite S(USE, Gauss; d, n, k).

Examination of the running times in the table suggests two important observations. First, when the underlying solution to the problem (P1) is sparse, a tremendous saving in computation time is achieved using Homotopy, compared to traditional LP solvers. For instance, when A is 500 × 2000 and y admits sparse synthesis with k = 50 nonzeros, Homotopy terminates in about 0.34 seconds, 20 times faster than Pdco, and over 150 times faster than LP Solve. Second, when the linear system is highly underdetermined (i.e. d/n is small), even when the solution is not very sparse, Homotopy is more efficient than either LP Solve or Pdco. This latter observation is of particular importance for applications, as it implies that Homotopy may be ‘safely’ used to solve ℓ1 minimization problems even when the underlying solution is not known to be sparse.

2.3.2 Complexity Analysis

The timing studies above are complemented by a detailed analysis of the complexity of the Homotopy algorithm. We begin by noting that the bulk of computation is invested in the solution of the linear system (2.7) at each iteration. Thus, the key to an efficient implementation is maintaining a Cholesky factorization of the Gram minor A_I^T A_I, updating it with the addition/removal of elements to/from the active set. This allows for fast solution for the update direction; O(d^2) operations are needed to solve (2.7), rather than the O(d^3) flops it would ordinarily take.

In detail, let s = |I|, where I denotes the current active set. The dominant calculations per iteration are the solution of (2.7) at a cost of 2s^2 flops, and computation of the step to the next vertex on the solution path, using nd + 6(n − s) flops. In addition, the Cholesky factor of A_I^T A_I is either updated by appending a column to A_I at a cost of s^2 + sd + 2(s + d) flops, or ‘downdated’ by removing a column of A_I using at most 3s^2 flops. To conclude, without any sparsity constraints on the data, k Homotopy steps would take at most 4kd^2/3 + kdn + O(kn) flops. However, if the conditions for the k-step solution property are satisfied, a more favorable estimate holds.

Lemma 1. Let (A, y) be a problem instance drawn from a suite S(E, V; d, n, k). Suppose the k-step solution property holds and the Homotopy algorithm performs k steps, each time adding a single element into the active set. The algorithm terminates in k^3 + kdn + (1/2)k^2(d − 1) + n(8k + d + 1) + (1/2)k(5d − 3) flops.

Notice that, for k ≪ d, the dominant term in the expression for the operation count is kdn, the number of flops needed to carry out k matrix-vector multiplications. Thus, we may interpret Lemma 1 as saying that, under such favorable conditions, the computational workload of Homotopy is roughly proportional to k applications of a d × n matrix. To visualize this statement, panel (a) of Figure 2.1 displays the operation count of Homotopy on a grid with varying sparsity and indeterminacy factors. In this simulation, we measured the total operation count of Homotopy as a fraction of dn (the cost of one application of a d × n matrix), for random instances of the problem suite S(USE, Uniform; d, n, k). We fixed n = 1000, varied the indeterminacy of the system, indexed by δ = d/n, in the range [0.1, 1], and the sparsity of the solution, indexed by ρ = k/d, in the range [0.05, 1]. In words, each point on the grid measures the operation count of Homotopy in quanta corresponding to one matrix-vector product. For reference, we superposed the theoretical bound ρ_W below which the solution of (P1) is, with high probability, the sparsest solution (see Section 2.9 for more details). Close inspection of this plot reveals that below this curve, Homotopy delivers the solution to (P1) rapidly, using a modest number of matrix multiplications.
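The Cholesky bookkeeping mentioned above can be sketched as follows. This is an illustrative fragment only (it assumes SciPy is available and shows just the column-append case; the ‘downdate’ for column removal is omitted):

    import numpy as np
    from scipy.linalg import solve_triangular

    def chol_append(R, AI, b):
        """Update upper-triangular R with R^T R = AI^T AI when column b joins the active set."""
        if R is None:                                  # first column to enter the active set
            return np.array([[np.sqrt(b @ b)]])
        w = solve_triangular(R, AI.T @ b, trans='T')   # solve R^T w = AI^T b in O(s^2)
        rho = np.sqrt(b @ b - w @ w)                   # new diagonal entry
        s = R.shape[0]
        R_new = np.zeros((s + 1, s + 1))
        R_new[:s, :s] = R
        R_new[:s, s] = w
        R_new[s, s] = rho
        return R_new

With the factor maintained this way, each solve of (2.7) reduces to two triangular solves, consistent with the O(s^2) per-iteration cost quoted above.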

2.3.3 Number of Iterations

Considering the analysis just presented, it is clear that the computational efficiency of Homotopy greatly depends on the number of vertices on the polygonal Homotopy path. Indeed, since it allows removal of elements from the active set, Homotopy may, conceivably, require an arbitrarily large number of iterations to reach a solution. We note that this property is not shared by Lars or Omp; owing to the fact that these algorithms never remove elements from the active set, they are guaranteed to terminate with zero residual after at most d iterations.

Figure 2.1: Computational cost of Homotopy. Panel (a) shows the operation count as a fraction of one matrix application on a ρ-δ grid, with n = 1000. Panel (b) shows the number of iterations as a fraction of d = δ · n. The superimposed dashed curve depicts the curve ρ_W, below which Homotopy recovers the sparsest solution with high probability.

In [32], Efron et al. briefly noted that they had observed examples where Homotopy does not ‘drop’ elements from the active set very frequently, and so, overall, it doesn’t require many more iterations than Lars to reach a solution. We explore this initial observation further, and present evidence leading to Empirical Finding 3. Specifically, consider the problem suite S(E, V; d, n, k), with E ∈ {USE, PFE, PHE, URPE}, and V ∈ {Uniform, Gauss, Bernoulli}. The space (d, n, k) of dimensions of underdetermined problems may be divided into three regions:

1. k-step solution. When (d, n, k) are such that the k-step property holds, then Homotopy successfully recovers the sparse solution in k steps, as suggested by Empirical Findings 1 and 2. Otherwise;

2. k-sparse solution. When (d, n, k) are such that ℓ1 minimization correctly recovers k-sparse solutions but Homotopy takes more than k steps, then, with high probability, it takes no more than c_s · d steps, with c_s a constant depending on E, V, empirically found to be ∼ 1.6. Otherwise;


3. ℓ1 solution. With high probability, Homotopy does not recover the sparsest solution, returning a solution to (P1) in c_f · d steps, with c_f a constant depending on E, V, empirically found to be ∼ 4.85.

We note that the so-called “constants” c_s, c_f depend weakly on δ and n. A graphical depiction of this division of the space of admissible (d, n, k) is given in panel (b) of Figure 2.1. It shows the number of iterations Homotopy performed for various (d, n, k) configurations, as a shaded attribute on a grid indexed by δ and ρ. Inspection of this plot reveals that Homotopy performs at most ∼ 1.6 · d iterations, regardless of the underlying solution sparsity. In particular, in the region below the curve ρ_W, where, with high probability, Homotopy recovers the sparsest solution, it does so in less than d steps.

More extensive evidence is given in Table 2.2, summarizing the results of a comprehensive study. We considered four matrix ensembles E, each coupled with three nonzero ensembles V. For a problem instance drawn from S(E, V; d, n, k), we recorded the number of iterations required to reach a solution. We repeated this at many different (d, n, k) configurations, generating 100 independent realizations for each (d, n, k) and computing the average number of iterations observed at each instance. Table 2.2 displays the estimated constants c_s, c_f for different combinations of matrix ensemble and coefficient ensemble. Thus, the results in Table 2.2 read, e.g., ‘Applied to a problem instance drawn from S(USE, Uniform; d, n, k), Homotopy takes, with high probability, no more than 1.69 · d iterations to obtain the minimum ℓ1 solution’.

    Matrix Ensemble    Uniform (c_s, c_f)    Gauss (c_s, c_f)    Bernoulli (c_s, c_f)
    USE                1.65, 1.69            1.6, 1.7            1.23, 1.68
    PFE                0.99, 3.44            0.99, 1.42          0.86, 1.49
    PHE                0.99, 4.85            0.92, 1.44          0.87, 1.46
    URPE               1.05, 4.4             1, 1.46             1, 1.44

    Table 2.2: Maximum number of Homotopy iterations, as a fraction of d, for various matrix / coefficient ensembles.

2.4 Fast Solution with the Incoherent Ensemble

Let A be a d × n matrix with unit-length columns a_j, ||a_j||_2 = 1. The mutual coherence

    M(A) = max_{i ≠ j} |⟨a_i, a_j⟩|

measures the smallest angle between any pair of columns. As n > d, this angle must be greater than zero: the columns cannot be mutually orthogonal; in fact, there is an established lower bound on the coherence of A:

    M(A) ≥ √( (n − d) / (d(n − 1)) ).

Matrices that satisfy this lower bound with equality are known as optimal Grassmannian frames [74]. There is by now much work establishing relations between the sparsity of a coefficient vector α_0 and the coherence of a matrix A needed for successful recovery of α_0 via ℓ1 minimization [27, 25, 80, 44] or Omp [26, 79]. In detail, for a general matrix A with coherence M(A), both ℓ1 minimization and Omp recover the solution α_0 from data y = Aα_0 whenever the number of nonzeros in α_0 satisfies ||α_0||_0 < (M(A)^{-1} + 1)/2. See, for example, Theorem 7 of [25] or Corollary 3.6 of [79]. Comparing this bound with (2.1), we see that Theorem 1 states that a parallel result holds for the Homotopy algorithm.

Before proceeding to the proof of the Theorem, we make a few introductory assumptions. As in the above, we assume that the Homotopy algorithm operates with a problem instance (A, y) as input, with y = Aα_0 and ||α_0||_0 = k. To simplify notation, we assume, without loss of generality, that α_0 has its nonzeros in the first k positions. Further, we operate under the convention that at each step of the algorithm, only one vector is introduced into the active set. If two or more vectors are candidates to enter the active set, assume the algorithm inserts them one at a time, in separate stages.


In the following, we let x` denote the Homotopy solution at the `-th step, and r` = y − Ax` denote the residual at that step. Finally, we define c ` = AT r ` as the corresponding residual correlation vector. To prove Theorem 1, we introduce two useful notions. Definition 3. We say that the Homotopy algorithm has the Correct Term Selection Property at a given problem instance (A, y), with y = Aα0 , if at each iteration, the algorithm selects a new term to enter the active set from the support set of α0 . Homotopy has the correct term selection property if, throughout its operation, it builds the solution using only correct terms. Thus, at termination, the support set of the solution is guaranteed to be a subset of the support set of α0 . Definition 4. We say that the Homotopy algorithm has the Sign Agreement Property at a given problem instance (A, y), with y = Aα0 , if at every step `, for all j ∈ I, sgn(x` (j)) = sgn(c` (j)). In words, Homotopy has the sign agreement property if, at every step of the algorithm, the residual correlations in the active set agree in sign with the corresponding solution coefficients. This ensures that the algorithm never removes elements from the active set. These two properties are the fundamental building blocks needed to ensure that the k-step solution property holds. In particular, these two properties are necessary and sufficient conditions for successful k-step termination, as the following lemma shows. Lemma 2. Let (A, y) be a problem instance drawn from a suite S(E,V; d, n, k). The Homotopy algorithm, when applied to (A, y), has the k-step solution property if and only if it has the correct term selection property and the sign agreement property.


Proof. To argue in the forward direction, we note that after k steps, correct term selection implies that the active set is a subset of the support of α0 , i.e. I ⊆ {1, . . . , k}. In addition, the sign agreement property ensures that no variables leave the active set. Therefore, after k steps, I = {1, . . . , k} and the Homotopy algorithm recovers the correct sparsity pattern. To show that after k steps, the algorithm terminates, we note that at the k-th step, the step-size γk is chosen so that, for some j ∈ I c , |ck (j) − γk aTj AI dk (I)| = λk − γk ,

(2.11)

with λk = kck (I)k∞ . In addition, for the k-th update, we have AI dk (I) = AI (ATI AI )−1 ATI rk = rk , since rk is contained in the column space of AI . Hence ck (I c ) = ATIc AI dk (I), and γk = λk is chosen to satisfy (2.11). Therefore, the solution at step k has Axk = y. Since y has a unique representation in terms of the columns of AI , we may conclude that xk = α0 . The converse is straightforward. In order to terminate with the solution α0 after k steps, the Homotopy algorithm must select one term out of {1, . . . , k} at each step, never removing any elements. Thus, violation of either the correct term selection property or the sign agreement property would result in a number of steps greater than k or an incorrect solution. Below, we show that when the solution α0 is sufficiently sparse, both these properties hold as the algorithm traces the solution path. Theorem 1 then follows naturally.

2.4.1 Correct Term Selection

We now show that, when sufficient sparsity is present, the Homotopy solution path maintains the correct term selection property.


Lemma 3. Suppose that y = Aα_0, where α_0 has only k nonzeros with k satisfying

    k ≤ (µ^{-1} + 1)/2.        (2.12)

Assume that the residual at the ℓ-th step can be written as a linear combination of the first k columns in A, i.e.

    r_ℓ = Σ_{j=1}^{k} β_ℓ(j) a_j.

Then the next step of the Homotopy algorithm selects an index from among the first k.

Proof. We show that at the ℓ-th step,

    max_{1≤i≤k} |⟨r_ℓ, a_i⟩| > max_{i>k} |⟨r_ℓ, a_i⟩|,        (2.13)

and so at the end of the ℓ-th iteration, the active set is a subset of {1, . . . , k}. Let G = A^T A denote the Gram matrix of A. Let î = arg max_{1≤i≤k} |β_ℓ(i)|. The left-hand side of (2.13) is bounded below by

    max_{1≤i≤k} |⟨r_ℓ, a_i⟩| ≥ |⟨r_ℓ, a_î⟩|
                            ≥ | Σ_{j=1}^{k} β_ℓ(j) G_{îj} |
                            ≥ |β_ℓ(î)| − Σ_{j≠î} |G_{îj}| |β_ℓ(j)|
                            ≥ |β_ℓ(î)| − M(A) Σ_{j≠î} |β_ℓ(j)|
                            ≥ |β_ℓ(î)| − M(A)(k − 1)|β_ℓ(î)|,        (2.14)

Here we used: ||a_j||_2^2 = 1 for all j and G_{îj} ≤ M(A) for j ≠ î. As for the right-hand side of (2.13), we note that for i > k we have

    |⟨r_ℓ, a_i⟩| ≤ Σ_{j=1}^{k} |β_ℓ(j)| |G_{ij}|
                ≤ M(A) Σ_{j=1}^{k} |β_ℓ(j)|
                ≤ k M(A) |β_ℓ(î)|.        (2.15)

Combining (2.14) and (2.15), we get that for (2.13) to hold, we need

    1 − (k − 1)M(A) ≥ k M(A).        (2.16)

Since k was selected to exactly satisfy this bound, relation (2.13) follows. Thus, when k satisfies (2.12), the Homotopy algorithm only ever considers indices among the first k as candidates to enter the active set.

2.4.2 Sign Agreement

Recall that an index is removed from the active set at step ℓ only if condition (2.4) is violated, i.e. if the signs of the residual correlations c_ℓ(i), i ∈ I do not match the signs of the corresponding elements of the solution x_ℓ. The following lemma shows that, when α_0 is sufficiently sparse, an even stronger result holds: at each stage of the algorithm, the residual correlations in the active set agree in sign with the direction of change of the corresponding terms in the solution. In other words, the solution moves in the right direction at each step. In particular, it implies that throughout the Homotopy solution path, the sign agreement property is maintained.

Lemma 4. Suppose that y = Aα_0, where α_0 has only k nonzeros, with k satisfying k ≤ (µ^{-1} + 1)/2. For ℓ ∈ {1, . . . , k}, let c_ℓ = A^T r_ℓ, and let the active set I be defined as in (2.6). Then the update direction d_ℓ defined by (2.7) satisfies

    sgn(d_ℓ(I)) = sgn(c_ℓ(I)).        (2.17)

Proof. Let λ_ℓ = ||c_ℓ||_∞. We show that for i ∈ I, |d_ℓ(i) − sgn(c_ℓ(i))| < 1, implying that (2.17) holds. To do so, we note that (2.7) can be rewritten as

    λ_ℓ (A_I^T A_I − Id) d_ℓ(I) = −λ_ℓ d_ℓ(I) + c_ℓ(I),

where Id is the identity matrix of appropriate dimension. This yields

    ||λ_ℓ d_ℓ(I) − c_ℓ(I)||_∞ ≤ ||A_I^T A_I − Id||_{(∞,∞)} · ||λ_ℓ d_ℓ(I)||_∞
                              ≤ ((1 − M(A))/2) · ||λ_ℓ d_ℓ(I)||_∞
                              ≤ ((1 − M(A))/2) · ( ||c_ℓ(I)||_∞ + ||λ_ℓ d_ℓ(I) − c_ℓ(I)||_∞ )
                              ≤ ((1 − M(A))/2) · ( λ_ℓ + ||λ_ℓ d_ℓ(I) − c_ℓ(I)||_∞ ),

where || · ||_{(∞,∞)} denotes the induced ℓ∞ operator norm. Rearranging terms, we get

    ||λ_ℓ d_ℓ(I) − c_ℓ(I)||_∞ ≤ λ_ℓ · (1 − M)/(1 + M) < λ_ℓ,

and thus, ||d_ℓ(I) − sgn(c_ℓ(I))||_∞ < 1. Relation (2.17) follows.

2.4.3 Proof of Theorem 1

Lemmas 3 and 4 establish the correct term selection property and the sign agreement property at a single iteration of the algorithm. We now give an inductive argument showing that the two properties hold at every step ℓ ∈ {1, . . . , k}.

In detail, the algorithm starts with x_0 = 0. By the sparsity assumption on y = Aα_0, we may apply Lemma 3 with r_0 = y to get that at the first step, a column among the first k is selected. Moreover, at the end of the first step we have x_1 = γ_1 d_1, and by Lemma 4, sgn(x_1(I)) = σ_1(I).

By induction, assume that at step ℓ, only indices among the first k are in the active set, i.e.

    r_ℓ = Σ_{j=1}^{k} β_ℓ(j) a_j,

and the sign condition (2.4) holds, i.e. sgn(x_ℓ(I)) = σ_ℓ(I). Applying Lemma 3 to r_ℓ, we get that at the (ℓ + 1)-th step, the term to enter the active set will be selected from among the first k indices. Moreover, for the updated solution we have sgn(x_{ℓ+1}(I)) = sgn(x_ℓ(I) + γ_{ℓ+1} d_{ℓ+1}(I)). By Lemma 4 we have sgn(d_{ℓ+1}(I)) = σ_{ℓ+1}(I). We observe that c_{ℓ+1} = c_ℓ − γ_{ℓ+1} A^T A d_{ℓ+1}, whence, on the active set, |c_{ℓ+1}(I)| = |c_ℓ(I)|(1 − γ_{ℓ+1}), and so σ_{ℓ+1}(I) = σ_ℓ(I). In words, once a vector enters the active set, its residual correlation maintains the same sign. We conclude that

    sgn(x_{ℓ+1}(I)) = σ_{ℓ+1}(I).        (2.18)


Hence no variables leave the active set. To summarize, both the correct term selection property and the sign agreement property are maintained throughout the execution of the Homotopy algorithm. We may now invoke Lemma 2 to conclude that the k-step solution property holds. This completes the proof of Theorem 1.

2.5 Fast Solution with the Uniform Spherical Ensemble

Suppose now that A is a random matrix drawn from the Uniform Spherical Ensemble. It is not hard to show that such matrices A are naturally incoherent [27]. In fact, for some ε > 0, with high probability for large n,

    M(A) ≤ √( 4 log(n) / d ) · (1 + ε).

Applying Theorem 1, we get as an immediate corollary that if k satisfies

    k ≤ √( d / log(n) ) · (1/4 − ε′),        (2.19)

then Homotopy is highly likely to recover any sparse vector α_0 with at most k nonzeros. However, as it turns out, Homotopy operates much more efficiently when applied to instances from the random suite S(USE; d, n, k) than what is implied by incoherence. We now discuss evidence leading to Empirical Finding 1. We demonstrate, through a comprehensive suite of simulations, that the formula d/(2 log(n)) accurately describes the breakdown of the k-step solution property. Before doing so, we introduce two important tools that we use repeatedly in the empirical analysis below.
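To see concretely how loose the incoherence route is here, the following small sketch (our own illustrative dimensions, assuming NumPy) compares the coherence-based bound of Theorem 1 with the empirical formula d/(2 log(n)):

    import numpy as np

    def mutual_coherence(A):
        """M(A): largest absolute off-diagonal entry of the Gram matrix of column-normalized A."""
        An = A / np.linalg.norm(A, axis=0)
        G = np.abs(An.T @ An)
        np.fill_diagonal(G, 0.0)
        return G.max()

    d, n = 200, 1000
    A = np.random.default_rng(0).standard_normal((d, n))
    coherence_bound = (1.0 / mutual_coherence(A) + 1.0) / 2.0   # condition (2.1): roughly 2 here
    empirical_bound = d / (2.0 * np.log(n))                      # Empirical Finding 1: roughly 14

For these dimensions M(A) is typically near 0.37, so incoherence alone certifies only k of about 2, an order of magnitude below the empirically observed k-step range.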

2.5.1 Phase Diagrams and Phase Transitions

In statistical mechanics, a phase transition is the transformation of a thermodynamic system from one phase to another. The distinguishing characteristic of a phase transition is an abrupt change in one or more of its physical properties, as underlying parameters cross a region of parameter space [87]. Phase transition phenomena in combinatorial search problems have been studied extensively in the combinatorics literature [41, 60]. Borrowing terminology, we say that the family of events {E_n; k_n, d_n} exhibits a phase transition with threshold function T(d, n) if for each ε > 0 and every family {E_n; k_n, d_n} that obeys k_n < T(d_n, n)(1 − ε), P{E_n; k_n, d_n} → 1 as n → ∞, while if k_n > T(d_n, n)(1 + ε), P{E_n; k_n, d_n} → 0 as n → ∞.

As we show below, the k-step solution property of Homotopy, applied to problem instances drawn from S(USE; d, n, k), exhibits a phase transition with threshold function T(d, n) = d/(2 log(n)). Thus, there is a sequence {ε_n} with each term small and positive, so that for k < (1 − ε_n) · d/(2 log(n)), Homotopy delivers the sparsest solution in k steps with high probability; on the other hand, if k > (1 + ε_n) · d/(2 log(n)), then with high probability, Homotopy fails to terminate in k steps.

To visualize phase transitions, we utilize phase diagrams. In statistical mechanics, a phase diagram is a graphical depiction of a parameter space decorated by boundaries of regions where certain properties hold. Here we use this term in an analogous fashion, to mean a graphical depiction of a subset of the parameter space (d, n, k), illustrating the region where an algorithm has a given property P when applied to a problem suite S(E, V; d, n, k). It will typically take the form of a 2-D grid, with the threshold function associated with the corresponding phase transition dividing the grid into two regions: a region where property P occurs with high probability, and a region where P occurs with low probability.

While phase transitions are sometimes discovered by formal mathematical analysis, more commonly, the existence of phase transitions is unearthed by computer simulations [41, 60]. We now describe an objective framework for estimating phase transitions from simulation data. For each problem instance (A, y) drawn from a suite S(E, V; d, n, k), we associate a binary outcome variable Y_k^{d,n}, with Y_k^{d,n} = 1 when property P is satisfied on that realization, and Y_k^{d,n} = 0 otherwise.


Within our framework, Y_k^{d,n} is modeled as a Bernoulli random variable with probability π_k^{d,n}. Thus, P(Y_k^{d,n} = 1) = π_k^{d,n}, and P(Y_k^{d,n} = 0) = 1 − π_k^{d,n}. Our goal is to estimate a value k̃^{d,n} where the transition between these two states occurs. To do so, we employ a logistic regression model on the mean response E{Y_k^{d,n}} with predictor variable k,

    E{Y_k^{d,n}} = exp(β_0 + β_1 k) / (1 + exp(β_0 + β_1 k)).        (2.20)

Logistic response functions are often used to model threshold phenomena in statistical data analysis [47]. Let π̂^{d,n}(k) denote the value of the fitted response function. We may then compute a value k̂_η^{d,n} indicating that, with probability exceeding 1 − η, for a problem instance drawn from S(E, V; d, n, k) with k < k̂_η^{d,n}, property P holds. Then k̂_η^{d,n} is given by the implicit relation

    π̂^{d,n}(k̂_η^{d,n}) = 1 − η.

Thus, computing k̂_η^{d,n} for a range of (d, n) values essentially maps out an empirical phase transition of property P. The value η fixes a location on the transition band below which we define the region of success. In our work, we set η = 0.25.
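As an illustration of this estimation step, the sketch below fits (2.20) by maximum likelihood and solves π̂(k̂_η) = 1 − η in closed form; it assumes statsmodels is available, and the array names (k_values, successes, trials) are our own:

    import numpy as np
    import statsmodels.api as sm

    def estimate_k_hat(k_values, successes, trials, eta=0.25):
        """Fit the logistic model (2.20) by maximum likelihood and return k-hat_eta."""
        # expand the per-k success counts into individual Bernoulli outcomes
        y = np.concatenate([np.r_[np.ones(s), np.zeros(t - s)]
                            for s, t in zip(successes, trials)])
        x = np.concatenate([np.full(t, k) for k, t in zip(k_values, trials)])
        beta0, beta1 = sm.Logit(y, sm.add_constant(x)).fit(disp=0).params
        # pi-hat(k) = 1 - eta  is equivalent to  beta0 + beta1 k = log((1 - eta)/eta)
        return (np.log((1.0 - eta) / eta) - beta0) / beta1

With η = 0.25 the right-hand side of the final equation is log 3 ≈ 1.1.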

2.5.2 The k-Step Solution Property

We now apply the method described in the previous section to estimate an empirical phase transition for the k-step solution property of the Homotopy algorithm. Before doing so, we briefly describe the experimental setup. For (d, n, k) given, we generated a problem instance (A, y) drawn from the problem suite S(USE,Uniform; d, n, k). To this instance we applied the Homotopy algorithm and recorded the outcome of the response variable Ykd,n . We did so in two cases; first, fixing d = 200, varying n through 40 log-equispaced points in [200, 20000] and k through 40 equispaced points in [1, 40]; second, keeping n fixed at 1000, varying d through 40 equispaced points in [100, 1000] and k through 40 equispaced points in [2, 100]. For each (d, n, k) instance we ran 100 independent trials. To estimate the regression coefficients β0 , β1 in (2.20), we used maximum likelihood estimation [47].
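A single trial of this experiment might be sketched as follows (reusing the homotopy sketch from Section 2.2; the dimensions and the success criterion shown here are illustrative):

    import numpy as np

    def k_step_trial(d, n, k, rng):
        """Draw (A, y) from S(USE, Uniform; d, n, k) and test the k-step solution property."""
        A = rng.standard_normal((d, n))
        A /= np.linalg.norm(A, axis=0)                    # columns uniform on the sphere S^(d-1)
        alpha0 = np.zeros(n)
        support = rng.choice(n, size=k, replace=False)
        alpha0[support] = rng.uniform(0.0, 1.0, size=k)   # Uniform nonzero ensemble
        x, steps = homotopy(A, A @ alpha0)                # sketch from Section 2.2
        return steps == k and np.allclose(x, alpha0, atol=1e-6)

    rng = np.random.default_rng(0)
    success_rate = np.mean([k_step_trial(200, 1000, 10, rng) for _ in range(20)])
    # for k well below d/(2 log n) (about 14 here) one expects a success rate near 1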


Figure 2.2: Phase diagram of the k-step solution property of Homotopy applied to the problem suite S(USE, Uniform; d, n, k). Shaded attributes display the proportion of successful terminations, out of 100 trials, in the case: (a) k = 1 . . . 40, d = 200, and n = 200 . . . 20000; (b) k = 2 . . . 100, d = 100 . . . 1000, and n = 1000. Overlaid are curves for d/(2 log(n)) (solid), k̂_0.25^{d,n} (dashed), and the confidence bound for k̂_0.25^{d,n} with 95% confidence level (dotted).

Figure 2.2 displays two phase diagrams, giving a succinct depiction of the simulation results. Panel (a) displays a grid of (k, n) values, k ∈ [1, 40], n ∈ [200, 20000]. Each point on this grid takes a value between 0 and 1, representing the ratio of successful outcomes to overall trials; these are precisely the observed E{Y_k^{d,n}} on a grid of (k, n) values, with d fixed. Overlaid are curves for the estimated k̂_η^{d,n} for each n with η = 0.25, and the theoretical bound d/(2 log(n)). Panel (b) has a similar configuration on a (k, d) grid, with n fixed at 1000.

Careful inspection of these phase diagrams reveals that the estimated k̂_η^{d,n} closely follows the curve d/(2 log(n)); Panel (a) shows the logarithmic behavior of the phase transition as n varies, and panel (b) shows its linear behavior as d varies. Table 2.3 offers a summary of the results of this experiment for the three nonzero ensembles used in our simulations: Uniform, Gauss, and Bernoulli. In each case, two measures of statistical goodness-of-fit are reported: the mean-squared error (MSE) and the R^2 statistic. The MSE measures the squared deviation of the estimated k values from the proposed formula d/(2 log(n)). The R^2 statistic measures how well the observed phase transition fits the theoretical model; a value close to 1 indicates a good fit. These statistical measures give another indication that the empirical behavior is in line with theoretical predictions. Moreover, they support the theoretical assertion that the k-step solution property of the Homotopy algorithm is not affected by the distribution of the nonzeros.

    Coefficient Ensemble    MSE     R^2
    Uniform                 0.09    0.98
    Gauss                   0.095   0.98
    Bernoulli               0.11    0.98

    Table 2.3: Deviation of estimated k̂_η^{d,n} from d/(2 log(n)) for S(USE, V; d, n, k), for different coefficient ensembles V.

The results in Figure 2.2 and Table 2.3 all demonstrate consistency between the empirical phase transition and the bound d/(2 log(n)), even for very modest problem sizes. We now turn to examine the sharpness of the phase transition as the problem size increases. In other words, we ask whether for large d, n, the width of the transition band becomes narrower, i.e. ε_n in (2.2) becomes smaller as n increases. We note that the regression coefficient β_1 in (2.20) associated with the predictor variable k dictates how sharp the transition from π_k^{d,n} = 1 to π_k^{d,n} = 0 is; a large negative value for β_1 implies a ‘step function’ behavior. Hence, we expect β_1 to grow in magnitude as the problem size increases. This is indeed verified in panel (a) of Figure 2.3, which displays the behavior of β_1 for increasing n and d fixed at 200. For illustration, panels (b) and (c) show the observed mean response Y_k^{d,n} vs. k for n = 200 (β_1 ≈ −0.35), and n = 20000 (β_1 ≈ −1.1), respectively. Clearly, the phase transition in panel (c) is much sharper.
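Under the logistic model (2.20), the width of the transition band follows directly from β_1; a small sketch using the slopes just quoted:

    import numpy as np

    def transition_width(beta1, eta=0.25):
        """Width in k between fitted response values 1 - eta and eta under model (2.20)."""
        return 2.0 * np.log((1.0 - eta) / eta) / abs(beta1)

    transition_width(-0.35)   # about 6.3 for n = 200
    transition_width(-1.1)    # about 2.0 for n = 20000, a visibly sharper transition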

2.5.3 Correct Term Selection and Sign Agreement

In Section 2.4, we identified two fundamental properties of the Homotopy solution path, namely, correct term selection and sign agreement, that constitute necessary and sufficient conditions for the k-step solution property to hold. We now present the results of a simulation study identifying phase transitions in these cases. In detail, we repeated the experiment described in the previous section, generating


Figure 2.3: Sharpness of phase transition. The top panel shows the behavior of the regression coefficient β1 of (2.20) for d = 200 and n = 200 . . . 20000. The bottom panels show two instances of the fitted transition model, for (b) d = 200, n = 200 and (c) d = 200, n = 20000.


Figure 2.4: Phase diagrams for (a) Correct term selection property; (b) Sign agreement property, on a k-n grid, with k = 1 . . . 40, d = 200, and n = 200 . . . 20000. Overlaid are curves for k̂_0.25^{d,n} (dashed), and the confidence bound for k̂_0.25^{d,n} with 95% confidence level (dotted). Panel (a) also displays the curve d/(√2 · log(n)) (solid), and panel (b) displays the curve d/(2 log(n)) (solid). The transition on the right-hand display occurs significantly above the transition in the left-hand display. Hence the correct term selection property is the critical one for overall success.

problem instances from the suite S(USE, Uniform; d, n, k) and applying the Homotopy algorithm. As the outcome of the response variable Y_k^{d,n}, we recorded first the property { The Homotopy algorithm selected only terms from among {1, . . . , k} as entries to the active set }, and next, the property { At each iteration of the Homotopy algorithm, the signs of the residual correlations on the active set match the signs of the corresponding solution terms }. Finally, we estimated phase transitions for each of these properties.

Results are depicted in panels (a) and (b) of Figure 2.4. Panel (a) displays a phase diagram for the correct term selection property, and Panel (b) has a similar phase diagram for the sign agreement property. The results are quite instructive. The phase transition for the correct term selection property agrees well with the formula d/(2 log(n)). Rather surprisingly, the phase transition for the sign agreement property seems to be at a significantly higher threshold, d/(√2 · log(n)). Consequently, the ‘weak link’ for the Homotopy k-step solution property, i.e., the attribute that dictates the sparsity bound for the k-step solution property to hold, is the addition of incorrect terms into the active set. We interpret this finding as saying that the Homotopy algorithm would typically err first ‘on the safe side’, adding terms off the support of α_0 into the active set (false discoveries), before removing correct terms from the active set (missed detections).

2.5.4 Random Signs Ensemble

Matrices in the Random Signs Ensemble (RSE) are constrained to have elements of constant amplitude, with signs independently drawn from an equi-probable Bernoulli distribution. This ensemble is known to be effective in various geometric problems associated with underdetemined systems, through work of Kashin [53], followed by Garnaev and Gluskin [40]. Previous work [82, 83, 81] gave theoretical and empirical evidence to the fact that many results developed in the USE case hold when the matrices considered are drawn from the RSE. Indeed, as we now demonstrate, the RSE shares the k-step solution property with USE. To show this, we conducted a series of simulations, paralleling the study described in Section 2.5.2, with the problem suite S(RSE; d, n, k) replacing S(USE; d, n, k). The results of the simulation are depicted in Figure 2.5. Table 2.4 summarizes the experiment, showing MSE and R2 measures for the deviation of the empirical phase transition from the formula d/(2 log(n)), for different coefficient ensembles. The results indeed suggest that the Homotopy algorithm has the k-step solution property when applied to the problem suite S(RSE; d, n, k), with k < d/(2 log(n)). An important implication of this conclusion is that in practice, we may use matrices drawn from the RSE to replace matrices from the USE without loss of performance. This may result in reduced complexity and increased efficiency, as matrices composed of random signs can be generated and applied much more rapidly than general matrices composed of real numbers.
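For instance, a minimal way to draw such a matrix (a sketch, with our own normalization by √d so that columns have unit norm):

    import numpy as np

    def sample_rse(d, n, rng):
        """A d x n Random Signs Ensemble matrix: iid +/-1 entries, columns scaled to unit norm."""
        return rng.choice([-1.0, 1.0], size=(d, n)) / np.sqrt(d)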


Figure 2.5: Phase diagram of the k-step solution property of Homotopy applied to the problem suite S(RSE, Uniform; d, n, k). Shaded attributes display the proportion of successful terminations, out of 100 trials, in the case: (a) k = 1 . . . 40, d = 200, and n = 200 . . . 20000; (b) k = 2 . . . 100, d = 100 . . . 1000, and n = 1000. Overlaid are curves for d/(2 log(n)) (solid), k̂_0.25^{d,n} (dashed), and the confidence bound for k̂_0.25^{d,n} with 95% confidence level (dotted).

    Coefficient Ensemble    MSE     R^2
    Uniform                 0.12    0.98
    Gauss                   0.11    0.98
    Bernoulli               0.08    0.99

    Table 2.4: Deviation of estimated k̂_η^{d,n} from d/(2 log(n)) for S(RSE, V; d, n, k), for different coefficient ensembles V.

2.6 Fast Solution with Partial Orthogonal Ensembles

We now turn our attention to a class of matrices composed of partial orthogonal matrices, i.e., matrices constructed by sampling a subset of the rows of an orthogonal matrix. In particular, we are interested in the following three ensembles:

• Partial Fourier Ensemble (PFE). Matrices in this ensemble are obtained by taking an n by n Fourier matrix and sampling d rows at random. The PFE is of utmost importance in medical imaging and spectroscopy, as it can be used to model reconstruction of image data from partial information in the frequency domain [10, 14, 55, 56].

• Partial Hadamard Ensemble (PHE). Matrices in this ensemble are obtained by taking an n by n Hadamard matrix and sampling d of its rows at random. This ensemble is known to be important for various extremal problems in experimental design [46].

• Uniform Random Projections (URPE). Matrices in this ensemble are obtained by taking an n by n random orthogonal matrix and sampling d of its rows at random. It serves as a general model for the PFE and PHE.

As we now demonstrate, the k-step solution property of Homotopy holds at a significantly greater range of signal sparsities than observed in Section 2.5.

We first consider the PFE and PHE. Partial Fourier matrices and partial Hadamard matrices have drawn considerable attention in the sparse approximation community [10, 14, 83, 82, 42]. We mention two incentives:

• Modeling of Physical Phenomena. Fourier operators play a key role in numerous engineering applications modeling physical phenomena. For instance, acquisition of 2- and 3-D Magnetic Resonance images is modeled as applying a Fourier operator to spatial data. Thus, application of a partial Fourier operator would correspond to acquisition of partial frequency data. Likewise, the Hadamard operator, albeit not as common, is often used in Hadamard Transform Spectroscopy.

• Fast Implementations. Both the Fourier transform and the Hadamard transform have special structure allowing for rapid computation. They can be computed by algorithms that run much faster than the naïve implementation from definition. In detail, the Fast Fourier Transform (FFT) can be applied to an n-vector using O(n log(n)) operations, rather than the n^2 operations needed by direct computation [6]. Similarly, a Fast Hadamard Transform (FHT) exists, which applies a Hadamard operator using O(n log(n)) operations.

With these fast algorithms, rapid computation of partial Fourier or partial Hadamard products is straightforward. Specifically, note that we may write a d by n matrix A from the PFE as A = R · F, with F an n by n Fourier matrix, and R a d by n random sampling matrix. Translating to linear operations, application of A amounts to applying F, using the FFT, followed by sampling d entries of the resulting vector (a small sketch appears at the end of this section). Similarly for partial Hadamard matrices.

Figure 2.6: k-step solution property of Homotopy for the suite S(PFE, V; d, n, k). Each phase diagram presents success rates on a ρ-δ grid, with ρ = k/d in [0.05, 1], δ = d/n in [0.1, 1], and n = 1024, for: (a) V = Uniform; (b) V = Gauss; (c) V = Bernoulli. Overlaid are curves for ρ̂_0.25(δ) (dash-dotted), with the confidence bounds for 95% confidence level (dotted), and ρ = 1/(2 log(n)) (solid). The range of sparsities in which the k-step solution property holds is much more extensive than implied by the bound k < d/(2 log(n)).

The experimental setup leading to Empirical Finding 2 utilizes a more extensive phase diagram. Expressly, we fix n and generate a grid of points indexed by variables δ = d/n and ρ = k/d, with δ in (0, 1] and ρ in (0, 1]. Thus, at a fixed δ, d = ⌊δn⌋ and k = ⌊ρd⌋ ranges between 1 and d. In a sense, this ρ-δ grid gives a complete picture of all sparse underdetermined settings of interest. We considered two discretizations. First, we fixed n = 1024, varied δ in the range [0.1, 1], and ρ in [0.05, 1]. Second, we fixed δ = 0.5, varied n in [256, 1024], and ρ in [0.05, 1]. Data collection and estimation of an empirical phase transition were done in the same manner described in Section 2.5.2.

Figures 2.6 and 2.7 summarize the simulation results, each depicting three phase diagrams, for three different nonzero distributions. The results are indeed surprising; for problems from the suite S(PFE; d, n, k), Homotopy maintains the k-step solution property at a much broader range of sparsities k than observed for problems from S(USE; d, n, k), particularly at high values of the indeterminacy ratio δ. For instance, at δ = 3/4, the empirical results indicate


Figure 2.7: k-step solution property of Homotopy for the suite S(PFE, V; d, n, k). Each phase diagram presents success rates on a ρ-n grid, with ρ = k/d in [0.05, 1], n in [128, 8192], and d = 125, for: (a) V = Uniform; (b) V = Gauss; (c) V = Bernoulli. Overlaid are curves for ρ̂_0.25(δ) (solid), with the confidence bounds for 95% confidence level (dotted).

Figure 2.8: Estimated phase transition curves. Each panel displays empirical phase transition curves for the PFE (solid), PHE (dashed), URPE (dash-dotted), with (a) Uniform nonzero distribution; (b) Gaussian nonzero distribution; (c) Bernoulli nonzero distribution.

that when the nonzeros are uniformly distributed, the empirical phase transition is at k̂ = ρ̂ · d ≈ 174, whereas the formula d/(2 log(n)) ≈ 55, implying that the range of sparsities for which Homotopy terminates in k steps has grown threefold.

We conducted parallel simulation studies for the PHE and the URPE, and we summarize the empirical findings in Figure 2.8. It displays the empirical phase transitions ρ̂(δ) for the three partial orthogonal ensembles with different nonzero distributions. Careful inspection of the curves reveals that the k-step property of Homotopy holds at roughly the same region of (δ, ρ) space for all combinations of matrix ensembles and coefficient distributions considered, with mild variations in the location and shape of the phase transition for different nonzero distributions. In particular, the data indicate that problems with underlying nonzero distribution Bernoulli exhibit the lowest k-step breakdown. Thus, in some sense, the ‘hardest’ problems are those with uniform-amplitude nonzeros in the solution.

The empirical phase transition ρ̂(δ) does not yield itself to a simple representation in functional form. It does, however, have several important characteristics, which we now highlight.

• First, notice that at the right end of the phase diagram, we have ρ̂(δ = 1) = 1. In other words, when the matrix A is square, Homotopy always recovers the sparsest solution in k steps. This is hardly surprising, since, at δ = 1, the matrix A is orthogonal, and the data y is orthogonal to columns of A off the support of α_0, ensuring correct term selection. In addition, the update step d_ℓ(I) in (2.7) is equal to sgn(c_ℓ(I)), implying that the sign agreement property holds. Thus, k-step termination is guaranteed.

• Second, at the left end of the phase diagram, as δ → 0, we may rely on recent theoretical developments to gauge the behavior of the phase transition curve. In their work on randomly projected polytopes, Donoho and Tanner showed that an upper bound for the value ρ(δ) below which ℓ1 minimization recovers the sparsest solution with high probability, asymptotically behaves like 1/(2 log(1/δ)) as δ → 0 [28]. In particular, this same upper bound holds in our case, as failure to recover the sparsest solution necessarily implies failure of the k-step property.

• Lastly, and most importantly, we note that the k-step phase transition for the partial orthogonal ensembles is stable with increasing n, in the following sense. For δ ∈ (0, 1] fixed, the range of ρ in which the k-step property holds does not change with n. Evidence to the stability of the phase transition is given in Figure 2.7, which demonstrates that for δ fixed at 0.5, the phase transition occurs at an approximately constant value of ρ. The implication of this property is that the empirical phase transitions computed here may be used to estimate the k-step behavior of Homotopy for problem instances of any size.
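As promised earlier in this section, here is a minimal sketch of a fast partial Fourier operator and its adjoint built on the FFT (a complex-valued illustration; the normalization and names are our own):

    import numpy as np

    def partial_fourier(n, d, rng):
        """Fast handles for A = R * F (d random rows of the unitary n x n DFT) and its adjoint."""
        rows = rng.choice(n, size=d, replace=False)

        def A(x):                                   # apply F via the FFT, then keep d entries
            return np.fft.fft(x)[rows] / np.sqrt(n)

        def At(z):                                  # adjoint: zero-pad to length n, apply F*
            full = np.zeros(n, dtype=complex)
            full[rows] = z
            return np.fft.ifft(full) * np.sqrt(n)

        return A, At

Each application costs O(n log(n)) operations, versus O(dn) for an explicit matrix; the residual-correlation computations in Homotopy can then use such handles in place of an explicitly stored A.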


Figure 2.9: Bridging `1 minimization and Omp. (1) Homotopy provably solves `1 minimization problems [32]. (2) Lars is obtained from Homotopy by removing the sign constraint check. (3) Omp and Lars are similar in structure, the only difference being that Omp solves a least-squares problem at each iteration, whereas Lars solves a linearly-penalized least-squares problem. There are initially surprising similarities of results concerning the ability of OMP and `1 minimization to find sparse solutions. We view those results as statements about the k-step solution property. We present evidence showing that for sufficiently small k, steps (2) and (3) do not affect the threshold for the k-sparse solution property. Since both algorithms have the k-step solution property under the given conditions, they both have the sparse solution property under those conditions.

2.7

Bridging `1 Minimization and OMP

In the introduction to this chapter, we alluded to the fact that there is undeniable parallelism in results about the ability of `1 minimization and Omp to recover sparse solutions to systems of underdetermined linear equations. Indeed, earlier reports [15, 26, 22, 79, 80, 81] documented instances where, under similar conditions, both `1 minimization and Omp recover the sparsest solution. Yet, these works did not offer any insights as to why the two seemingly disparate techniques should offer similar performance in such cases. In this section, we develop a linkage between `1 minimization and Omp. This relation is summarized in Figure 2.9, which highlights the intimate bonds between Homotopy, Lars, and Omp. Before proceeding, we clear up some confusion in terminology. Previous work studying the performance of Omp set conditions on the sparsity k under which the


algorithm ‘correctly recovers the sparsest solution’ [26, 79, 81]. Yet, in effect, what was proven is that Omp ‘selects a correct term at each iteration, and terminates with the correct solution after k iterations’; we now recognize this as the k-step solution property. We note that the k-step solution property is not, in general, a necessary condition for Omp to correctly recover the sparsest solution; it is clearly sufficient, however. Its usefulness lies in the simplicity of k-step solutions; when the k-step property holds, the solution path is the simplest possible, consisting of k correct steps. Thus, it yields itself naturally to focused analysis. We now argue that, through a series of simplifications, we can move from `1 minimization to Omp, at each step maintaining the k-step solution property. This transition is depicted in Figure 2.9, which shows the sequence of steps leading from `1 minimization to Omp; the two links in the sequence are Homotopy and Lars. Below, we elaborate on each of the bridging elements.

2.7.1

`1 Minimization −→ Homotopy

For the phrase 'algorithm A has the k-step solution property' to make sense, the algorithm must have a stepwise structure, building a solution approximation term by term. Thus, we cannot, in general, speak of `1 minimization as having the k-step solution property, as there are many algorithms for such minimization, including those with no meaningful notion of stepwise solution construction. This is where the Homotopy algorithm fits in, bridging the high-level notion of `1 minimization with the lower-level stepwise structure of Omp. Earlier work [32, 62] has shown that Homotopy is a correct algorithm for solving the `1 minimization problem (P1). In addition, it builds an approximate solution in a stepwise fashion, thus making the k-step solution property applicable.

2.7.2

Homotopy −→ LARS

As noted earlier, the Lars procedure is a simplification of the Homotopy algorithm, achieved by removing the condition for sign agreement of the current solution and the residual correlations. This brings us one step closer to Omp; while Homotopy


allows for removal of terms from the active set, both Lars and Omp insert terms one by one into the active set, never removing any active elements. Interestingly, when the k-step property holds, this simplification changes nothing in the resulting solution. To see that, recall that Homotopy has the k-step solution property if and only if it has the correct term selection property and the sign agreement property. Correct term selection is all that is needed by Lars to reach the solution in k steps; after k correct terms are inserted into the active set, the algorithm would terminate with zero residual. The sign agreement property implies that Homotopy would successfully terminate in k steps as well. In summary, when the k-step solution property holds, the solution paths of Homotopy and Lars coincide. In particular, the assertion of Theorem 1 holds for Lars as well.

Corollary 1. Let (A, y) be a problem instance drawn from S(Inc_µ; d, n, k), with k ≤ (µ^{−1} + 1)/2. Then, Lars runs k steps and stops, delivering the solution α_0.

Similar conclusions can be made about the assertions of Empirical Findings 1 and 2.

2.7.3

LARS −→ OMP

We already hinted at the fact that Homotopy and Omp share an inherent affinity. We now demonstrate that the two methods are based on the same fundamental procedure; solving a sequence of least-squares problems on an increasingly large subspace, defined by the active columns of A. More precisely, assume we are at the ℓ-th iteration, with a solution estimate x_{ℓ−1} and an active set I consisting of ℓ − 1 terms. Recall the procedure underlying Omp: It appends a new term to the active set by selecting the vector a_j that maximizes the absolute residual correlation, j = arg max_{j ∉ I} |a_j^T (y − A x_{ℓ−1})|, and setting I := I ∪ {j}. It then projects y onto R(A_I) by solving

A_I^T A_I x_ℓ(I) = A_I^T y,     (2.21)

thus guaranteeing that the residual is orthogonal to the subspace R(A_I). Once y is


spanned by the active columns, the algorithm terminates with zero residual. While Lars was derived from a seemingly disparate perspective, it in fact follows the same basic procedure, replacing (2.21) with a penalized least-squares problem. Recalling the discussion in Section 2.2, we note that Lars also selects a new term according to the maximum correlation criterion, and then solves

A_I^T A_I x_ℓ(I) = A_I^T y − λ_ℓ · s,     (2.22)

where s is a vector of length ℓ recording the signs of the residual correlations of each term at the point it entered the active set, and λ_ℓ is the correlation magnitude at the breakpoint on the Lars path. Indeed, (2.21) and (2.22) are identical except for the penalization term on the right-hand side of (2.22). As it turns out, this modification does not have an effect on the k-step solution property of the procedure. We offer two examples. First, for the incoherent problem suite S(Inc_µ; d, n, k), the statement of Theorem 1 holds for Omp as well (see, e.g., Corollary 3.6 in [79]). Second, for the random problem suite S(USE; d, n, k), Tropp et al. proved results that can now be reinterpreted as saying that when k ≤ d/(c log(n)), then with high probability for large n, Omp has the k-step solution property, paralleling the statement of Empirical Finding 1. We believe, in fact, that the bound k ≤ d/(2 log(n)) holds for Omp as well. To summarize, we exhibited a series of transformations, starting with `1 minimization, and ending with greedy pursuit. Each transformation is characterized clearly, and maintains the k-step property of the solution path. We believe this sequence clarifies initially surprising similarities between results about `1 minimization and Omp.
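To make the shared least-squares structure concrete, here is a minimal NumPy sketch of the Omp iteration built around (2.21); the random test instance at the bottom is illustrative, and the comment inside the loop marks where a Lars-style step would substitute the penalized system (2.22). This is a sketch of the generic procedure, not the implementation used for the experiments reported in this thesis.

```python
import numpy as np

def omp(A, y, k):
    """Greedy OMP sketch: k rounds of maximum-correlation selection,
    each followed by the least-squares projection (2.21)."""
    d, n = A.shape
    I, x = [], np.zeros(n)
    r = y.copy()
    for _ in range(k):
        c = A.T @ r                          # residual correlations
        j = int(np.argmax(np.abs(c)))        # maximum-correlation term selection
        if j not in I:
            I.append(j)
        AI = A[:, I]
        # OMP step (2.21): plain least squares on the active columns.
        # A Lars-style step (2.22) would instead solve
        #     AI.T @ AI @ xI = AI.T @ y - lam * s
        # with s the recorded correlation signs and lam the current breakpoint.
        xI, *_ = np.linalg.lstsq(AI, y, rcond=None)
        x[:] = 0.0
        x[I] = xI
        r = y - AI @ xI
    return x

# Illustrative random instance (not one of the problem suites used in the text):
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256)); A /= np.linalg.norm(A, axis=0)
x0 = np.zeros(256); x0[rng.choice(256, 5, replace=False)] = rng.uniform(0.5, 1.0, 5)
print(np.allclose(omp(A, A @ x0, 5), x0, atol=1e-6))   # typically True at this sparsity
```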

2.8

Homotopy and PFP

Polytope Faces Pursuit was introduced by Plumbley [66, 65] as a greedy algorithm to solve (P1 ) in the dual space. As pointed out in [65], although the algorithm is derived from a seemingly different viewpoint, it exhibits interesting similarities to


Homotopy. To study these in more detail, we first briefly describe the algorithm. Define the standard-form linear program equivalent to (P_1):

(P̄_1)     min 1^T x̄  subject to  y = Ā x̄,  x̄ ≥ 0,

where Ā = [A −A]. The equivalence of (P_1) and (P̄_1) is well established; the solutions x̄ of (P̄_1) and x of (P_1) have the correspondence x̄(j) = max{x(j), 0} for 1 ≤ j ≤ n, and x̄(j) = max{−x(j − n), 0} for n + 1 ≤ j ≤ 2n. The Lagrangian dual of the linear program (P̄_1) is given by

(D̄_1)     max y^T v  subject to  Ā^T v ≤ 1.

By the strong duality theorem, an optimal solution x̄ of the primal problem (P̄_1) is associated with an optimal solution v to the dual problem (D̄_1), such that 1^T x̄ = y^T v. Pfp operates in the dual space, tracing a solution to the dual problem (D̄_1), which then yields a solution to the primal problem (P̄_1) (or, equivalently, (P_1)). To do so, the algorithm maintains an active set, I, of nonzero coordinates, and carries out a familiar procedure at each iteration: selection of a new term to enter the active set, removal of violating terms from the active set (if any), and computation of a new solution estimate. Specifically, at the ℓ-th stage, Pfp first selects a new term to enter the active set, via the criterion

i_ℓ = arg max_{i ∉ I} { ā_i^T r_ℓ / (1 − ā_i^T v_ℓ) : ā_i^T r_ℓ > 0 },     (2.23)

with r_ℓ = y − Ā x̄_ℓ the current residual. It then updates the active set, I ← I ∪ {ā_i}, and computes a primal solution estimate, the projection of y onto the subspace spanned by the active vectors,

x̄_{ℓ+1} = (Ā_I^T Ā_I)^{−1} Ā_I^T y,     (2.24)

and a dual solution estimate

v_{ℓ+1} = Ā_I (Ā_I^T Ā_I)^{−1} 1.     (2.25)


Finally, the algorithm verifies that all the active coefficients x̄_{ℓ+1}(I) are non-negative; coordinates with negative coefficients are removed from the active set. The affinities between Homotopy and Pfp are not merely cosmetic. In fact, the two algorithms, derived from seemingly distinct viewpoints, carry out nearly identical procedures. To highlight these similarities, assume that both algorithms are applied to the standard form linear program (P̄_1). We begin by examining the criteria for selection of a new term to enter the active set. In particular, assume we are at the ℓ-th stage of the Homotopy algorithm. Recalling the discussion in Section 2.2 above, the Homotopy term selection criterion (when applied to (P̄_1)) is

i^+ = arg min^+_{i ∈ I^c} { (λ − c_ℓ(i)) / (1 − ā_i^T v_ℓ) },     (2.26)

where v_ℓ = Ā_I d_ℓ(I) = Ā_I (Ā_I^T Ā_I)^{−1} 1 as in (2.25) above, and min^+ indicates that the minimum is taken over positive arguments. Comparing (2.26) with (2.23), we note that the main difference lies in the residual; since Pfp projects the data y onto the space spanned by the active vectors, denoted R_I, its residual is necessarily orthogonal to R_I. Homotopy, in contrast, does not enforce orthogonality, and its residual can be expressed as the sum of a component in R_I and a component in the orthogonal complement R_I^⊥. Formally, r_ℓ = P_I r_ℓ + r_ℓ^⊥, where P_I = Ā_I (Ā_I^T Ā_I)^{−1} Ā_I^T is the orthogonal projector onto R_I, and r_ℓ^⊥ = (Id − P_I) r_ℓ is the component of r_ℓ in R_I^⊥. The component of r_ℓ in R_I obeys

P_I r_ℓ = Ā_I (Ā_I^T Ā_I)^{−1} c_ℓ = λ v_ℓ.     (2.27)


Plugging this into (2.26), we get

i^+ = arg min^+_{i ∈ I^c} { (λ − ā_i^T (r_ℓ^⊥ + λ v_ℓ)) / (1 − ā_i^T v_ℓ) }
    = arg min^+_{i ∈ I^c} { λ − ā_i^T r_ℓ^⊥ / (1 − ā_i^T v_ℓ) }
    = arg max_{i ∈ I^c} { ā_i^T r_ℓ^⊥ / (1 − ā_i^T v_ℓ) : ā_i^T r_ℓ^⊥ > 0 },



where the last equality holds because Ā^T v_ℓ < 1 throughout the Homotopy solution path. Thus, conditional on r_ℓ^{Pfp} = r_ℓ^⊥, the Homotopy selection criterion is equivalent to Pfp's selection criterion. Now, since r_ℓ^⊥ = y − P_I y, as long as no elements are removed from the active set, r_ℓ^{Pfp} = r_ℓ^⊥ by definition. In summary, the discussion above establishes the following.

Corollary 2. As long as no elements are removed from the active set, the solution path of Pfp is identical to the Homotopy solution path. In particular, define the algorithm Pfp0 as Pfp with the condition that the solution coefficients be non-negative omitted. Then Pfp0 is equivalent to Lars.

This finding is interesting in light of the earlier discussion; it implies that when the k-step solution property holds for Homotopy, it also holds for Pfp. In particular, the statement of Theorem 1 holds for Pfp as well. Formally,

Corollary 3. Let (A, y) be a problem instance drawn from S(Inc_µ; d, n, k), with k ≤ (µ^{−1} + 1)/2. Then, Pfp runs k steps and stops, delivering the solution α_0.

Similarly, Empirical Findings 1 and 2 hold for Pfp as well. Figure 2.10 displays phase diagrams for the k-step solution property (Panel (a)), the correct term selection property (Panel (b)), and the sign agreement property (Panel (c)), when Pfp is applied to instances of the problem suite S(USE,Uniform; d, n, k). The three properties exhibit a clear phase transition at around d/(2 log(n)).
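For completeness, the Pfp selection rule (2.23) is easy to state in code; the sketch below assumes a current residual r, dual estimate v, and active set I are already available, and the guard on the denominator simply reflects the dual feasibility Ā^T v ≤ 1 maintained along the path.

```python
import numpy as np

def pfp_select(A_bar, r, v, I):
    """Pfp term selection (2.23): among inactive columns with positive
    residual correlation, maximize  a_i^T r / (1 - a_i^T v)."""
    best, best_val = None, -np.inf
    for i in range(A_bar.shape[1]):
        if i in I:
            continue
        num = A_bar[:, i] @ r
        den = 1.0 - A_bar[:, i] @ v   # positive along the path, since A_bar^T v <= 1
        if num > 0 and den > 0 and num / den > best_val:
            best, best_val = i, num / den
    return best
```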


Figure 2.10: Phase diagrams of Pfp for the problem suite S(USE,Uniform; d, n, k). The shaded attribute displays the proportion of successful executions among total trials, for: (a) the k-step solution property; (b) the correct term selection property; (c) the sign agreement property. Overlaid on each plot are curves for d/(2 log(n)) (solid), k̂^{d,n}_{0.25} (dashed), and the confidence bound for k̂^{d,n}_{0.25} with 95% confidence level (dotted).

2.9

Recovery of Sparse Solutions

Previous sections in this chapter examined various properties of Homotopy, all centered around the k-step solution property. The motivation is clear: the Homotopy algorithm is at peak efficiency when it satisfies the k-step solution property. Yet, we noted that failure to terminate in k steps need not imply failure to recover the sparsest solution. We now study this property in more detail.

Definition 5. An algorithm A has the k-sparse Solution Property at a given problem instance (A, y) drawn from S(E,V; d, n, k) if, when applied to the data (A, y), the algorithm terminates with the solution α_0.

Earlier papers used a different term for the same concept [25, 22, 82]; they would use the term `0 equivalence to indicate recovery of the sparsest solution, i.e. equivalence to the solution of the combinatorial optimization problem

(P_0)     min_x ‖x‖_0  subject to  y = Ax.

For the Homotopy algorithm, the k-sparse solution property has been well-studied. Indeed, since the Homotopy algorithm is guaranteed to solve (P1 ), it will have the k-sparse solution property whenever the solutions to (P0 ) and (P1 ) coincide. In a series of papers [22, 20, 24], Donoho has shown that for problem instances from the


suite S(USE; d, n, k), the equivalence between solutions of (P0) and (P1) exhibits a clear phase transition, denoted ρW, that can be computed numerically. This phase transition provides a theoretical prediction for the conditions under which Homotopy satisfies the k-sparse solution property. Yet, at present, no such predictions exist for Lars, Omp, or Pfp. We now present the results of an extensive simulation study, examining the validity of the k-sparse solution property for Homotopy, Lars, Omp, and Pfp, comparing the performance of these four closely related algorithms. The study we conduct closely parallels the experiments described above. In particular, we measure the k-sparse solution property for the four algorithms, applied to problem instances from the suites S(USE,Uniform; d, n, k), S(USE,Gauss; d, n, k) and S(USE,Bernoulli; d, n, k). We generated 100 independent trials at each point on the parameter grid, indexed by δ = d/n and ρ = k/d, with δ ranging through 40 equispaced points in [0.1, 1], ρ ranging through 40 equispaced points in [0.05, 1], and n fixed at 1000. The resulting phase diagrams appear in Figure 2.11. Parallel results summarizing a similar experiment done for the URPE are presented in Figure 2.12. Preliminary inspection of these phase diagrams reveals that the suggestive similarities between the four algorithms indeed translate into comparable performance in recovering the sparsest solution. More careful examination yields several important observations.

• Lars approximates Homotopy. In the displayed phase diagrams, the empirical phase transition of Lars closely follows the transition of Homotopy. In other words, almost anytime `1 minimization successfully recovers the sparsest solution, Lars succeeds in finding it as well. This is a rather surprising finding, as, at least in the USE case, for k > d/(2 log(n)), Lars is bound to err and add elements not in the support of α_0 into its active set. The evidence we present here suggests that inside the `1-`0 equivalence phase, Lars still correctly recovers the support of α_0 and obtains the sparsest solution.

• Dependence on the Distribution of Nonzeros. Examining Figure 2.11, we note that for the suite S(USE; d, n, k), the performance of Homotopy and Lars is independent of the underlying distribution of the nonzero coefficients.


Figure 2.11: The k-sparse solution property for the suite S(USE,V; d, n, k). Each panel has plots of the empirical phase transition curves of Homotopy (dashed), Lars (dash-dotted), Omp (dotted) and Pfp (solid) alongside the theoretical curve ρW , on a ρ-δ grid, with n = 1000, for: (a) V = Uniform; (b) V = Gauss; (c) V = Bernoulli.

This is not true for Omp, whose performance is affected by the underlying distribution. In particular, Omp's worst performance is recorded when the nonzeros have constant amplitude and a random sign pattern; this has been noted previously in the analysis of Omp [22, 79].

• Better Recovery with Partial Orthogonal Ensembles. Comparison of Figure 2.12 with Figure 2.11 reveals that, in general, all four algorithms perform better when applied to problem instances from the suite S(URPE; d, n, k). In particular, Homotopy and Lars show a marked increase in performance when applied to the suite S(URPE,Uniform; d, n, k). In our experiments, we observed similar phenomena when applying the algorithms to problems drawn from the suites S(PFE; d, n, k) and S(PHE; d, n, k) as well. This is indeed an important finding, as these suites find extensive uses in applications.

2.10

Example: Component Separation

We conclude this chapter with an example showing Homotopy in action. The problem we consider is separating a signal into its harmonic and impulsive components. This relatively simple problem, explored in [15, 27], inspired more ambitious applications such as texture-geometry separation [70] and astronomical image representation


Figure 2.12: The k-sparse solution property for the suite S(URPE,V; d, n, k). Each panel has plots of the empirical phase transition curves of Homotopy (dashed), Lars (dash-dotted), Omp (dotted) and Pfp (solid) alongside the theoretical curve ρW , on a ρ-δ grid, with n = 1000, for: (a) V = Uniform; (b) V = Gauss; (c) V = Bernoulli.

[71]. In detail, let y be a signal of length d, composed of a few sinusoids and a few spikes, with a total of k such components, k ≪ d. Note that y may be written as y = Aα_0, with A = [I F] a d × 2d matrix representing an overcomplete dictionary composed of the d × d identity matrix I, and the d × d Fourier matrix F. α_0 is then a 2d coefficient vector, with ‖α_0‖_0 = k. We apply the Homotopy algorithm to solve this underdetermined system for the time and frequency components. Figure 2.13 presents the results of this simulation. Panel (a) displays a signal y of length d = 512, composed of 2 harmonic terms superposed with 40 spikes, with amplitudes distributed normally. These are plotted in panels (b) and (c). Panels (d)-(f) on the bottom row of Figure 2.13 show the corresponding output of the Homotopy algorithm, applied to the data y. As evident from the figure, Homotopy perfectly recovers all of the components in α_0. In Section 2.2, we commented that Homotopy may be easily adapted to treat noisy data, by terminating at a prescribed point on the Homotopy path. We now illustrate the use of this strategy, in the context of component separation. Specifically, we consider the same scenario described above, this time with y contaminated by additive white Gaussian noise. To do so, we measure data y = Aα_0 + z, where z is a vector of iid Gaussian entries, such that the overall noise power satisfies ‖z‖_2 ≤ ε_n. We again apply the Homotopy algorithm to the data, stopping when the residual gets below


Figure 2.13: Time-frequency separation with Homotopy: Top row: The original signal, with its time and frequency components. Bottom row: Corresponding output of Homotopy.


Figure 2.14: Time-frequency separation in the presence of noise: Top row: The original signal, its time and frequency components, and the noisy signal with SNR = 3. Bottom row: Corresponding output of Homotopy.

the noise level ε_n. The results of this experiment are depicted in Figure 2.14. Panel (a) displays the original signal, and panels (b),(c) show its time and frequency components. Panel (d) shows the observed data, buried in white Gaussian noise, with a signal-to-noise ratio of 3. Thus, the noise power is one third the signal power. Panels (e)–(h) on the bottom row of Figure 2.14 show the corresponding output of the Homotopy algorithm applied to the data y. As evident from the figure, Homotopy manages to recover most of the components of α_0, leaving mostly noise in the residual. The relative reconstruction error is ‖α̂ − α_0‖_2/‖α_0‖_2 = 0.19.
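The following sketch reproduces the flavor of this experiment on a smaller scale. It substitutes an orthonormal cosine basis for the Fourier matrix F so that everything stays real-valued, solves the standard-form split of the `1 problem (as in Section 2.8) with a generic linear-programming solver rather than the Homotopy implementation studied in this chapter, and uses illustrative signal length and sparsity levels; it is a sketch of the setup, not the code used for the figures.

```python
import numpy as np
from scipy.fft import dct
from scipy.optimize import linprog

d = 128
F = dct(np.eye(d), norm="ortho", axis=0)        # orthonormal cosine basis, a real-valued
                                                # stand-in for the Fourier matrix F
A = np.hstack([np.eye(d), F])                   # overcomplete dictionary [I  F], d x 2d

rng = np.random.default_rng(1)
alpha0 = np.zeros(2 * d)
alpha0[rng.choice(d, 8, replace=False)] = rng.standard_normal(8)        # spikes
alpha0[d + rng.choice(d, 2, replace=False)] = rng.standard_normal(2)    # harmonic terms
y = A @ alpha0

# Solve min ||alpha||_1 s.t. A alpha = y, via the split alpha = u - w with u, w >= 0.
n2 = A.shape[1]
res = linprog(c=np.ones(2 * n2), A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=[(0, None)] * (2 * n2), method="highs")
alpha_hat = res.x[:n2] - res.x[n2:]
print(np.linalg.norm(alpha_hat - alpha0) / np.linalg.norm(alpha0))  # ~0 when recovery succeeds
```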


2.11


Discussion

In this chapter, we studied the Homotopy algorithm, introduced by Osborne et al. [62] and Efron et al. [32]. It was originally proposed in the context of overdetermined noisy systems. We demonstrated that Homotopy is naturally suited to solve for sparse solutions of underdetermined systems of linear equations. In particular, we showed that when the underlying solution is sufficiently sparse and the underlying system is incoherent or random, Homotopy has the k-step solution property; it finds a k-sparse solution in exactly k steps. In such scenarios, it significantly outperforms state-of-the-art linear optimizers, and its computational complexity is then comparable to greedy stepwise algorithms such as Omp. In fact, the k-step analysis we presented sheds light on the relationship between `1 minimization and Omp. We argued that the two approaches can be bridged through a series of relaxations, each step maintaining the k-step property. Using the notions of phase diagrams and phase transitions, we mapped out the performance of Homotopy for varying indeterminacy and sparsity, showing that the algorithm exhibits three modes of operation: it either finds the solution in k steps (k-step), recovers the sparsest solution (k-sparse), or solves the `1 minimization problem (`1). We conducted empirical studies to trace phase transitions delineating these three regions for problem instances drawn from the USE, RSE, URPE, PFE and PHE. In particular, our results suggest that Homotopy exhibits higher phase transitions for problems drawn from the partial orthogonal ensembles, compared with the USE. This latter discovery is of great importance in applications, as partial Fourier and Hadamard ensembles find many uses in practice. To conclude, our analysis suggests that Homotopy is a very competitive algorithm for solving `1 minimization problems with sparse solutions. Yet, we would like to point out that when dealing with problems of very large dimensions, even an efficient algorithm such as Homotopy may be too costly to use. In such cases, we might consider an approximate approach, trading off performance for speed. Indeed, several authors have explored such approximate algorithms [18, 48, 70, 71, 35, 9].


In this spirit, the following chapter presents an approximate scheme, Stagewise Orthogonal Matching Pursuit (Stomp), to rapidly find sparse solutions to large-scale underdetermined systems of linear equations.

Chapter 3

Rapid Approximate Solution of Underdetermined Linear Systems with Stagewise Orthogonal Matching Pursuit

In the previous chapter, we carefully analyzed the Homotopy algorithm, and showed that for a diverse set of problems, the algorithm finds a solution composed of k nonzeros in exactly k steps. When the conditions for the k-step solution property hold, the algorithm operates at peak efficiency. Yet, in scenarios involving large data sets, even when these favorable conditions hold, Homotopy may still be too costly to apply. Indeed, due to their inherent structure, stepwise algorithms such as Homotopy, Lars, and Omp allow only one coefficient to enter the active set at each iteration. When the problem dimensions are very large, even when the algorithm makes only 'correct selections', it may still take a prohibitive number of iterations to reach a solution. In this chapter, we present Stagewise Orthogonal Matching Pursuit (Stomp), a rapid approximate scheme to find sparse solutions of underdetermined linear equations. Much like Homotopy and Omp, Stomp successively computes a series of sparse approximate solutions by forming the vector of residual correlations at each


iteration, identifying all coordinates with amplitudes exceeding a specially-chosen threshold, and solving a least-squares problem using the updated set of coordinates. It stops after reaching a certain number of iterations, fixed a priori and independently of the data. In contrast to Omp or Homotopy, many coefficients can enter the model at each stage in Stomp. Thus, Stomp runs much faster than competing proposals for sparse solutions. We analyze Stomp’s computational complexity and demonstrate that the algorithm can rapidly find sparse solutions of very large linear systems. Using the notion of phase diagrams and phase transitions, we analyze the performance of Stomp, and compare it with other algorithms. Our analysis suggests that for certain problem suites, Stomp’s ability to recover the sparsest solution is almost as good as `1 minimization, at a significantly lower cost. The material in this chapter is derived from the manuscript [30], which has been submitted for publication in IEEE Transactions on Information Theory.

3.1

Introduction

The exponential growth in storage volume, computing power, and communication bandwidth, along with the soaring popularity of multimedia content, has resulted in massive amounts of data being processed and relayed at every second. Within the scientific community, researchers are tackling problems of very large scale at near instantaneous processing times. As a concrete example, consider the following scenario, often arising in image processing tasks such as image inpainting [34] or cartoon-texture decomposition [70, 71]. We wish to analyze an image of dimensions m × m in an overcomplete dictionary composed of wavelet basis elements and Fourier basis elements. As each dictionary element has m^2 degrees of freedom, the dictionary has dimensions m^2 by 2m^2. For example, when m = 512, the underdetermined linear system is 262,144 by n = 524,288 in size. Thus, even when the image is sparsely represented by the dictionary elements, and the favorable computational estimate in Lemma 1 holds, the cost of applying Homotopy to find a sparse representation is excessively large. Moreover, one can easily conjure examples where the problem


dimensions are in the millions, rendering the algorithm impractical. Impelled by similar reasoning, and armed with the belief that true `1 minimization is far too slow for large-scale applications, numerous researchers turned to exploring approximate algorithms for recovery of sparse solutions. Perhaps one of the better known algorithms designed to answer strict computational demands is Block Coordinate Relaxation (BCR), developed by Sardy, Bruce, and Tseng [68]. The authors focused on scenarios where the dictionary is a concatenation of several orthogonal bases. The design of BCR utilizes this special structure, by decoupling the system of linear equations into blocks corresponding to the different bases, and iteratively solving for the coefficients in each basis via soft-thresholding (often called shrinkage). This makes for a very efficient method, particularly when the problem dimensions are large and the individual orthogonal basis transformations can be evaluated via fast linear operators. Sardy et al. showed that when the dictionary is composed of several orthogonal bases, BCR converges to the minimum `1 solution. Yet, the convergence result does not hold for more general dictionary constructions (for instance, when the dictionary is composed of tight frames). The immediate inspiration for our work was research done by Jean-Luc Starck and Michael Elad [34, 70, 71], who attacked large-scale problems in image processing and component separation. Their method, named Morphological Component Analysis (MCA), extends the ideas behind BCR to deal with dictionaries composed of frames and/or bases, and exhibits rapid convergence. They found that by iterative application of hard thresholding, residualization, and matched filtering, they could often obtain results comparable to `1 optimization much more rapidly. Other papers [18, 35, 33, 48] suggest similarly-flavored iterated thresholding approaches for fast approximate recovery of sparse solutions.

3.1.1

STOMP

In this chapter we describe Stagewise Orthogonal Matching Pursuit (Stomp), a method for approximate sparse solution of underdetermined systems with the property that either the dictionary is ‘random’ or the nonzeros in the solution are randomly


located, or both. Stomp recovers a solution to the underdetermined linear system y = Ax by successively transforming the data y into a negligible residual. Starting with initial residual r_0 = y, at the s-th stage it forms the 'residual correlations' A^T r_{s−1}, identifies all coordinates with amplitudes exceeding a specially-chosen threshold, solves a least-squares problem using the selected coordinates, and subtracts the least-squares fit, producing a new residual. Since Stomp takes a small, fixed number of steps and then terminates, it runs much faster than stepwise algorithms for sparse approximation, particularly when the problem dimensions are large. At the core of our proposal lies a thresholding scheme that allows us to determine which entries in the solution ought to be considered nonzero. The thresholding scheme is derived from well-understood methods in the problem of Gaussian denoising. We apply them here although there is no noise in the observations. Indeed, the analysis we present in this chapter suggests an intriguing phenomenon: noiseless underdetermined problems behave like noisy well-determined problems. In other words, incompleteness of the measurement data (for 'random' dictionaries) manifests itself similarly to Gaussian noise in complete measurements. Using the notion of comparison of phase transitions as a performance metric, we show below that the "success phase" (the region in the phase diagram where Stomp successfully finds sparse solutions) is large and comparable in size to the success phase for `1 minimization. In a sense, Stomp is more effective at finding sparse solutions to large, extremely underdetermined problems than either `1 minimization or Omp; its phase transition boundary is even higher at extreme sparsity than the other methods. Most importantly, Stomp takes a few seconds to solve problems for which hours of computations may be required to obtain a minimum-`1 solution. As a result, Stomp is well suited to large-scale applications.

3.1.2

Contents

The rest of this chapter is organized as follows. Section 3.2 describes the basic procedure underlying Stomp, gives heuristic justification for the approximate Gaussianity


of residual correlations, and describes the thresholding strategy. Section 3.3 then analyzes the performance of Stomp by means of phase diagrams and phase transitions, comparing its performance with `1 minimization. In Section 3.4, we present a formal complexity analysis of the algorithm, augmented by running-time measurements. Section 3.5 investigates the performance of Stomp applied to other matrix and coefficient ensembles, and discusses its behavior in the presence of noise. In Section 3.6, we demonstrate the operation of Stomp on a toy example, of separating a signal into its harmonic and impulsive components. A final section discusses the results presented, and offers some concluding remarks.

3.2

Stagewise Orthogonal Matching Pursuit

Recall the Sparse Representation Problem (SRP) described in Chapter 1. In the SRP, an unknown vector α0 ∈ Rn is of interest; it is assumed sparse, which is to say that the number k of nonzeros is much smaller than n. We have the linear measurements y = Aα0 , where A is a known d by n matrix, and we wish to recover α0 . Stomp aims to achieve an approximate solution to problems drawn from the suite S(USE; d, n, k), where α0 is sparse (other problem suites are considered in a later section). We now describe its basic ingredients.

3.2.1

The Procedure

Stomp operates in S stages, building up a sequence of approximations x0 , x1 , . . ., xS by removing detected structure from a sequence of residual vectors r1 , r2 , . . . , rS . Figure 3.1 gives a diagrammatic representation. Stomp starts with initial ‘solution’ x0 = 0 and initial residual r0 = y. The stage counter s starts at s = 1. The algorithm also maintains a sequence of estimates I1 , . . . , Is of the locations of the nonzeros in α0 . The s-th stage applies matched filtering to the current residual, getting a vector of residual correlations cs = AT rs−1 ,


Figure 3.1: Schematic representation of the Stomp algorithm.

which we think of as containing a small number of significant nonzeros in a vector disturbed by Gaussian noise in each entry (more on that below). The procedure next performs hard thresholding to find the significant nonzeros; the thresholds are specially chosen based on the assumption of Gaussianity. Thresholding yields a small set J_s of 'significant' coordinates: J_s = {j : |c_s(j)| > t_s}, where t_s is the selected threshold parameter. We merge the subset of newly selected coordinates with the previous support estimate, thereby updating the estimate: I_s = I_{s−1} ∪ J_s. We then project the vector y onto the columns of A belonging to the enlarged support. Letting A_I denote the d × |I| matrix with columns chosen using index set I, we have the new approximation x_s supported in I_s with coefficients given by

(x_s)_{I_s} = (A_{I_s}^T A_{I_s})^{−1} A_{I_s}^T y.


The updated residual is r_s = y − A x_s. The algorithm terminates when the norm of the current residual ‖r_s‖_2 is below a prescribed threshold, or when we have reached the iteration limit S. If it is not yet time to stop, we set s := s + 1 and go to the next stage of the procedure. Otherwise, we set x̂_S = x_s as the final output of the procedure. Keen readers will notice that the procedure resembles Orthogonal Matching Pursuit. In fact the two would give identical results if S were equal to d and if the threshold t_s were set in such a way that a single new term were obtained in J_s at each stage. However, the thresholding strategy used in Stomp (to be described below) aims to have numerous terms enter at each stage, and aims to have a fixed number of stages (hence the name Stagewise OMP). The choice of S, the number of iterations of Stomp performed, is dictated by the computational budget allotted, as well as the required solution accuracy. Empirically, we found that S = 10 offers a good trade-off between speed and accuracy. Our recommended choice of S (10) and our recommended threshold-setting procedures (see Section 3.2.4 below) have been designed so that when α_0 is sufficiently sparse, the following two conditions are likely to hold upon termination:

• All nonzeros in α_0 are selected in I_S.

• I_S has no more than d entries.

With these two conditions met, Stomp is guaranteed to terminate with the final solution x_S = α_0.
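A minimal NumPy sketch of the stage structure just described is given below. The rule t · σ_s with σ_s = ‖r_s‖_2/√d is a simplified stand-in for the threshold-selection rules of Section 3.2.4, and the dense least-squares solve stands in for the conjugate-gradient solver used in the actual implementation; treat it as an illustration of the procedure rather than the reference code.

```python
import numpy as np

def stomp(A, y, S=10, t=2.5, tol=1e-6):
    """Stomp-style sketch: threshold the residual correlations, enlarge
    the support, and re-project y onto the selected columns, for at most
    S stages."""
    d, n = A.shape
    x = np.zeros(n)
    r = y.copy()
    I = np.zeros(n, dtype=bool)
    for _ in range(S):
        c = A.T @ r                              # matched filtering
        sigma = np.linalg.norm(r) / np.sqrt(d)   # formal noise level of the correlations
        J = np.abs(c) > t * sigma                # hard thresholding
        if not J.any():
            break
        I |= J                                   # merge with previous support estimate
        AI = A[:, I]
        xI, *_ = np.linalg.lstsq(AI, y, rcond=None)   # projection onto active columns
        x[:] = 0.0
        x[I] = xI
        r = y - AI @ xI                          # new residual
        if np.linalg.norm(r) < tol:
            break
    return x
```

On instances like the example of Section 3.2.2 (k = 32, d = 256, n = 1024), this sketch typically terminates with x = α_0 after a few stages.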

3.2.2

An Example

We give a simple example showing that the procedure works in a special case. We generated a coefficient vector α0 with k = 32 nonzeros, having amplitudes uniformly distributed on [0, 1]. We sampled a matrix A at random from the USE with d = 256, n = 1024, and computed a linear measurement vector y = Aα0 . Thus the problem of recovering α0 given y is 1 : 4 underdetermined (i.e. δ = d/n = .25), with underlying


Figure 3.2: Progression of the Stomp algorithm. Panels (a),(d),(g): successive matched filtering outputs c1 ,c2 , c3 ; Panels (b),(e),(h): successive thresholding results; Panels (c),(f),(i): successive partial solutions. In this example, k = 32, d = 256, n = 1024.

sparsity measure ρ = k/d = .125. To this SRP, we applied Stomp coupled with the CFAR threshold selection rule to be discussed below. The results are illustrated in Figure 3.2. Panels (a)–(i) depict each matched filtering output, its hard thresholding and the evolving approximation. As can be seen, after 3 stages a result is obtained that is quite sparse and, as the final panel shows, matches well the object α_0 that truly generated the data. In fact, the error at the end of the third stage measures ‖x̂_3 − α_0‖_2/‖α_0‖_2 = 0.022, i.e. a mere 3 stages were required to achieve an accuracy of 2 decimal digits.


3.2.3


Approximate Gaussianity of Residual Correlations

As we mentioned before, the design of Stomp is based on the observation that the residual correlations cs at each iteration can be thought of as composed of a small number of ‘true’ nonzeros buried in Gaussian noise (even though at this point we assume there is no noise in y!). This property is best illustrated with a simple example. Panel (a) of Figure 3.3 shows a sparse vector α0 of length n = 256, with only 3 nonzero coefficients. We generated a uniform spherical matrix A with dimensions 64 by 256, and computed the data vector y = Aα0 . Panel (b) displays the correlation vector c1 = AT y at the initial phase of Stomp. Inspection of the plot reveals that (visually, at least) the correlation vector is a combination of the nonzeros in α0 (marked by crosses) and ‘random noise’. A histogram of the disturbance z = AT y − α0 is shown in Panel (c) of the figure. Indeed, the noise exhibits an approximately Gaussian behavior. This important observation can be justified heuristically. First, note that z = AT y − α0 is a measure of the disturbance to exact reconstruction caused by the indeterminacy of A. The same notion arises in digital communications where it is called Multiple-Access Interference (MAI) [85]. Perhaps surprisingly—because there is no noise in the problem—the MAI in our setting typically has a Gaussian behavior. More specifically, if A is a matrix from the USE and if d and n are both large, then the entries in the disturbance vector z have a histogram that is nearly Gaussian with standard deviation

σ ≈ ‖α_0‖_2 / √d.     (3.1)

The heuristic justification is as follows. The disturbance z has the form

z(j) = a_j^T y − α_0(j) = Σ_{ℓ ≠ j} a_j^T a_ℓ α_0(ℓ).

The thing we regard as 'random' in this expression is the matrix A. The term ξ_{kj} ≡ a_j^T a_k measures the projection of a random point on the sphere S^{d−1} onto another random point. This random variable has approximately a Gaussian distribution N(0, 1/d). For A from the USE and for a given fixed a_j, the different random variables


Figure 3.3: Approximate Gaussianity of the correlation vector. Panels (a): A sparse vector of length 256, with 3 nonzero coefficients. Panel (b): Correlation vector c1 = AT y. The locations of the nonzeros are marked with crosses. Panel (c): Histogram of the disturbance z. The disturbance has a Gaussian-like behavior.


Figure 3.4: QQ plots comparing MAI with Gaussian distribution. Left column: k/d = .125, middle column: k/d = .1875, right column: k/d = .25. The plots all appear nearly linear, indicating that the disturbance has a nearly Gaussian distribution.

(ξ_{kj} : k ≠ j) are independently distributed. Hence the quantity z(j) is an iid sum of approximately normal r.v.'s, and so, by standard arguments, should be approximately normal with mean 0 and variance

σ_j^2 = Var[ Σ_{ℓ ≠ j} ξ_{ℓj} α_0(ℓ) ] = ( Σ_{ℓ ≠ j} α_0(ℓ)^2 ) · Var(ξ_{1j}) ≈ d^{−1} ‖α_0‖_2^2.

Setting σ^2 = ‖α_0‖_2^2/d, this justifies (3.1). Figure 3.4 gives further evidence of the validity of Gaussian approximation. Panels (a),(b),(c) display Gaussian QQ-plots of the disturbance vector z in the sparse case with k/d = .125, .1875 and .25, for the Uniform Spherical Ensemble with d = 256 and n = 1024. In each case, the QQ-plot appears nearly straight, as the Gaussian model would demand. Our discussion so far has only considered the first stage of the algorithm. At later stages there is residual disturbance (or residual MAI), i.e., disturbance that has not yet been cancelled. This can be defined as

z_s(j) = a_j^T r_s / ‖P_{I_{s−1}} a_j‖_2^2 − α_0(j),   j ∉ I_{s−1};

note that the coordinates j ∈ I_{s−1} are ignored at stage s; the residual in those coordinates is deterministically 0. Empirically, the residual disturbance also has a Gaussian behavior. Figure 3.5 shows


Figure 3.5: QQ plots comparing residual disturbance with Gaussian distribution. Quantiles of zs for s = 1, . . . , 6 are plotted against Gaussian quantiles. Near-linearity indicates approximate Gaussianity.

quantile-quantile plots for the first few stages of Stomp, comparing zs with a standard normal distribution, for s = 1, . . . , 6. The plots are effectively straight lines, illustrating the Gaussian approximation.
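The standard-deviation heuristic (3.1) is easy to check numerically; the dimensions and coefficient distribution in the snippet below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 256, 1024, 32
A = rng.standard_normal((d, n))
A /= np.linalg.norm(A, axis=0)               # columns uniformly distributed on S^{d-1}
alpha0 = np.zeros(n)
alpha0[rng.choice(n, k, replace=False)] = rng.uniform(size=k)

z = A.T @ (A @ alpha0) - alpha0              # the disturbance (MAI) vector
print(z.std(), np.linalg.norm(alpha0) / np.sqrt(d))   # the two values should be close
```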

3.2.4

Threshold Selection

Our threshold selection proposal is inspired by the observed approximate Gaussian behavior of the residual disturbance. We view the vector of correlations c_s at stage s as consisting of a small number of 'truly nonzero' entries, combined with a large number of 'Gaussian noise' entries. Therefore, we have essentially reduced the sparse representation problem to a sequence of successive denoising problems. The problem of removing Gaussian noise from corrupted measurements has been studied extensively in the scientific literature, and we may employ known denoising methods to separate the nonzeros from the Gaussian disturbance.


In detail, recall that throughout its operation, Stomp attempts to recover the support I_0 of α_0. Borrowing terminology from statistical estimation theory, if a coordinate belonging to I_0 does not appear in I_S, we call this a missed detection. If a coordinate not in I_0 does appear in I_S we call this a false alarm. The False Alarm Rate (FAR) is then the number of false alarms divided by the number of coordinates not in I_0. Our strategy for setting the threshold is Controlled False Alarm Rate (CFAR) [1]. We attempt to guarantee that the number of total false alarms, across all stages, does not exceed the natural codimension of the problem, defined as d − k. Subject to this, we attempt to make the maximal number of discoveries possible. To do so, we choose a threshold so that the false alarm rate at each stage does not exceed a per-stage budget. Ultimately, this strategy should land us in a position where #I_S ≤ d and there are no missed detections. Then, as discussed earlier, we perfectly recover: x_S = α_0. We note that the CFAR strategy requires knowledge of the number of nonzeros k or some upper bound. However, our experiments seem to suggest that a crude estimate is often sufficient to set the threshold.
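Under the Gaussian model for the off-support correlations, a CFAR threshold is simply a Gaussian quantile scaled by the formal noise level. The sketch below uses the per-stage false-alarm budget (d − k)/(S(n − k)) employed in the experiments of Section 3.3 and the scale estimate σ_s = ‖r_s‖_2/√d; it is a simplified rendering of the strategy, not the exact rule used in the thesis code.

```python
import numpy as np
from scipy.stats import norm

def cfar_threshold(r, d, n, k, S=10):
    """Per-stage CFAR threshold: allow a false-alarm rate of
    q = (d - k) / (S * (n - k)) per stage for the two-sided test
    |c_s(j)| > t_s, assuming off-support correlations ~ N(0, sigma_s^2)
    with sigma_s = ||r||_2 / sqrt(d)."""
    q = (d - k) / (S * (n - k))
    sigma_s = np.linalg.norm(r) / np.sqrt(d)
    return norm.isf(q / 2.0) * sigma_s
```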

3.3

Performance Analysis

We now turn to investigate the performance of Stomp applied to the sparse representation problem (P0 ). To do so, we utilize the empirical analysis framework developed for the Homotopy algorithm in Chapter 2. Namely, we generate phase diagrams and compute empirical phase transitions from extensive simulation data. In our simulation studies, the attribute we measure is recovery of the sparsest solution. In the terminology of Chapter 2, we measure the k-sparse solution property for Stomp applied to problem instances from the random suite S(USE,Gauss; d, n, k). In detail, we fixed n = 1000, and generated problem configurations on a δ-ρ grid, with δ = d/n ranging through 40 equispaced points in the interval [.1, .95] and ρ = k/d ranging through 40 equispaced points in [.1, .95]. At each (d, n, k) configuration, we conducted 100 independent trials, and recorded the number of trials in which Stomp successfully recovered the solution α0 . The per-iteration false alarm rate was set to


Figure 3.6: Phase diagram of the k-sparse solution property of Stomp applied to the problem suite S(USE,Gauss; d, n, k). Overlaid red curve is the empirical phase transition.

(d − k)/(S(n − k)). The resulting phase diagram is displayed in Figure 3.6. Superposed on this display is the empirical phase transition curve, computed via logistic regression, as described in Section 2.5. We notice that for very sparse problems (ρ small), the algorithm is successful at recovering (a good approximation to) α0 , while for less sparse problems (ρ large), the algorithm fails. It is interesting to contrast the performance of Stomp with the performance of `1 minimization (via Homotopy) and Omp. This we do in Figure 3.7, which displays the measured phase transitions for the three methods. This comparison clearly illustrates that Stomp’s performance is comparable with its popular stepwise ‘cousins’. It is particularly interesting since Stomp runs much more rapidly than stepwise methods when applied to large-scale problems.


Figure 3.7: Comparison of the performance of Stomp (solid curve), `1 minimization (dashed curve), and Omp (dash-dotted curve) in recovering the sparsest solution.

3.4

Computational Analysis

Now that we have demonstrated that Stomp often recovers the sparsest solution to an underdetermined linear system, we turn to investigate its computational efficiency. In this section, we present empirical evidence, augmented by a formal operation count, showing that Stomp runs significantly faster than traditional methods for recovery of sparse solutions.

3.4.1

Complexity Analysis

We begin with a formal analysis of the asymptotic complexity of Stomp. In this analysis, we consider two scenarios. • Dense Matrices. In this scenario, the matrix A defining an underdetermined linear system is an explicit d×n dense matrix stored in memory. Thus, applying A to an n-vector x involves dn flops.


• Fast Operators. Here, the linear operator A is not explicitly represented in matrix form. Rather, it is implemented as an operator taking a vector x, and returning Ax. Classical examples of this type include the Fourier transform, Hadamard transform, and Wavelet transform, just to name a few; all of these operators are usually implemented without matrix multiplication. Such fast operators are of key importance in large-scale applications. In our analysis below, we let V denote the cost of one application of a linear operator or its adjoint (corresponding to one matrix-vector multiplication). In fact, as we will now see, the structure of Stomp makes it a prime choice when fast operators are available, as nearly all its computational effort is invested in solving partial least-squares systems involving A and AT . In detail, assume we are at the s-th stage of execution. Stomp starts by applying matched filtering to the current residual, which amounts to one application of AT , at a cost of dn flops. Next, it applies hard-thresholding to the residual correlations and updates the active set accordingly, using at most 2n additional flops. The core of the computation lies in calculating the projection of y onto the subset of columns AIs , to get a new approximation xs . This is implemented via a Conjugate Gradient (CG) solver [43]. Each CG iteration involves application of AIs and ATIs , costing at most 2dn + O(n) flops. The number of CG iterations used is a small constant, independent of d and n, which we denote ν. In our implementation we use ν = 10. Finally, we compute the new residual by applying A to the new approximation, requiring an additional dn flops. Summarizing, the total operation count per Stomp stage amounts to (ν + 2)dn + O(n). The total number of Stomp stages, S, is a prescribed constant, independent of the data; in the simulations conducted for this thesis we set S = 10. Recalling the discussion of Omp from Chapter 2, we note the evident parallelism in the algorithmic structure of Stomp and Omp. Indeed, much like Stomp, at each stage Omp computes residual correlations and solves a least-squares problem for the new solution estimate. Yet, unlike Stomp, Omp builds up the active set one element at a time. Hence, an efficient implementation would necessarily maintain a Cholesky factorization of the active set matrix and update it at each stage, thereby reducing the cost of solving the least-squares system. In total, k steps of Omp would


take at most 4k^3/3 + kdn + O(dn) flops. Without any sparsity assumptions on the data, Omp takes at most d steps; thus, its worst-case performance is bounded by 4d^3/3 + d^2 n + O(dn) operations. A key difference between Stomp and Omp is that the latter needs to store the Cholesky factor of the active set matrix in its explicit form, taking up to d^2/2 memory elements. When d is large, as is often the case when the data is two- or three-dimensional, this greatly hinders the applicability of Omp. In contrast, Stomp has very modest storage requirements. At any given point of the algorithm execution, one need only store the current estimate x_s, the current residual vector r_s, and the current active set I_s. This makes Stomp very attractive for use in large-scale applications. Table 3.1 summarizes our discussion so far, offering a comparison of the computational complexity of Stomp, Omp and `1 minimization. For the purposes of this discussion, we assume that the `1 minimization problem is converted to a linear program and solved with Pdco, a primal-dual log-barrier method developed by Michael Saunders [69]. The estimates listed in the table all assume worst-case behavior. Examining the bounds in the dense matrix case closely, we notice that Stomp is the only algorithm of the three admitting quadratic order complexity estimates. In contrast, Omp and Pdco require cubic order estimates for their worst-case performance bound. In the case where fast operators are applicable, Stomp yet again prevails; it is the only algorithm of the three requiring a constant number (S · (ν + 2)) of matrix-vector products to reach a solution.

Algorithm           | Dense Matrices           | Fast Operators
Stomp               | S(ν + 2)dn + O(n)        | S(ν + 2) · V + O(n)
Omp                 | 4d^3/3 + d^2 n + O(dn)   | 4d^3/3 + 2d · V + O(dn)
`1 min. with Pdco   | S(2n)^3/3 + O(dn)        | 2n · V + O(dn)

Table 3.1: Worst-case complexity bounds for Stomp, Omp and Pdco. S denotes the maximum number of stages, ν denotes the maximum number of CG iterations employed per stage of Stomp, and V stands for the cost of one matrix-vector product (implemented as a fast operator). To make the comparison still more vivid, we point ahead to an imaging example that is discussed in detail in Chapter 4. There an image of dimensions m×m is viewed


as a vector x of length n = m2 . We consider a system y = Ax, where A is drawn from the partial Fourier ensemble. In other words, (A, y) is an instance of the problem suite S(PFE; d, n, k), with d = δn. Matrices in the partial Fourier ensemble can be implemented by application of a Fast Fourier Transform (FFT) followed by a coordinate selection. Thus, one matrix-vector product costs V = 4n log n = 8m2 log m. How do the three algorithms compare in this setting? Plugging-in S = 10, ν = 10, and V as above, we see that the leading term in the complexity bound for Stomp is 960·m2 log m. In contrast, for Omp the leading term in the worst-case bound becomes 4δ 3 6 m 3

+ 16δm4 log m, and for `1 minimization the leading term is 16m4 log m. The

computational gains with Stomp are indeed substantial. Moreover, to run Omp in this setting, we may need up to

δ2 4 m 2

memory elements to store the Cholesky

factorization, which renders it unusable for anything but the smallest m.
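To make the cost accounting above concrete, the following minimal sketch (in Python with NumPy/SciPy, not the implementation used for the experiments reported in this thesis) spells out a single Stomp stage driven by fast operators; the threshold rule t·σ with t = 2.5, the default ν = 10 CG iterations, and the boolean active-set representation are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def stomp_stage(apply_A, apply_At, y, x, active, t=2.5, nu=10):
    """One Stomp stage: matched filter, hard thresholding, and a few CG steps on the active set."""
    d, n = len(y), len(x)
    r = y - apply_A(x)                              # current residual (one application of A)
    c = apply_At(r)                                 # matched filtering (one application of A^T)
    sigma = np.linalg.norm(r) / np.sqrt(d)          # formal noise level of the residual
    active = active | (np.abs(c) > t * sigma)       # threshold residual correlations
    I = np.flatnonzero(active)
    if I.size == 0:
        return x, active

    def gram(v):                                    # v -> A_I^T A_I v, two operator applications
        full = np.zeros(n)
        full[I] = v
        return apply_At(apply_A(full))[I]

    G = LinearOperator((I.size, I.size), matvec=gram, dtype=float)
    xI, _ = cg(G, apply_At(y)[I], maxiter=nu)       # partial least-squares solve
    x_new = np.zeros(n)
    x_new[I] = xI
    return x_new, active
```

Each stage thus touches A and A^T only a constant number of times, which is the source of the S(ν+2)·V bound in Table 3.1.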

3.4.2 Running Times

To complement the complexity analysis above, we now present actual execution times measured for Stomp, Omp, and Pdco. Table 3.2 shows the running times for the three algorithms, solving an instance of the problem suite S(USE; d, n, k), with various configurations of (k, d, n). The simulations used to generate the figures in the table were all executed on a 3GHz Xeon workstation, comparable with current desktop CPUs.

(k, d, n)               `1         Omp       Stomp
(10, 100, 1000)         0.97       0.37      0.02
(100, 500, 1000)        22.79      0.79      0.42
(100, 1000, 10000)      482.22     7.98      5.24
(500, 2500, 10000)      7767.16    151.81    126.36

Table 3.2: Comparison of execution times (seconds) for instances of the problem suite S(USE; d, n, k).

Table 3.2 suggests that a tremendous saving in computation time is achieved when using the Stomp scheme over traditional `1 minimization. The conclusion is that the proposed iterative thresholding method has a large domain of applicability for


sparsely solving random underdetermined systems, while running much faster than other methods in problem sizes of current interest. To convey the scale of computational benefits in large problems, we conduct a simple experiment in a setting where A can be implemented as a fast operator. We consider a problem instance drawn from the suite S(PFE; d, n, k). We apply Stomp, as well as Omp and Pdco, to this problem instance, and record execution times (all three algorithms successfully recover the sparsest solution). The results are summarized in Table 3.3. The savings achieved by using Stomp are clearly evident.

Problem Suite         (k, d, n)                `1        Omp       Stomp
S(PFE; d, n, k)       (500, 10000, 20000)      237.69    53.64     2.07
S(PFE; d, n, k)       (1000, 20000, 50000)     810.39    299.54    5.63

Table 3.3: Comparison of execution times (seconds) in the random partial Fourier suite S(PFE; d, n, k).

3.5 Variations

3.5.1 Partial Orthogonal Ensembles

Our attention so far has focused on the case where A is a random matrix, generated from the uniform spherical ensemble. In our discussion of the Homotopy algorithm in Chapter 2, we mentioned that other classes of random matrices, namely the partial orthogonal ensembles, play an important role in many applications of the sparse representation problem. In this section, we investigate the performance of Stomp applied to problem instances from the suite S(URPE; d, n, k). First, we wish to examine whether the Gaussian disturbance model holds when the underlying matrix is drawn from the URPE. Evidence to this assumption is given in Figure 3.8, which displays QQ-plots comparing the disturbance with a Gaussian distribution. The plots all appear nearly linear, indicating that the disturbance has a nearly Gaussian distribution. We conducted similar studies for the PFE and the PHE, with similar findings. We conclude that Stomp is applicable to matrices drawn from these ensembles as well.

[Figure 3.8 appears here: three QQ-plot panels, (a) k = 32, (b) k = 48, (c) k = 64, plotting empirical quantiles of the disturbance z against N(0, 1/n) quantiles.]

Figure 3.8: QQ plots comparing MAI with Gaussian distribution for the problem suite S(URPE,Gauss; d, n, k). Left column: k/d = .125, middle column: k/d = .1875, right column: k/d = .25.

Assured that the Gaussian model holds for the URPE as well, we turn to investigate the performance of Stomp applied to this ensemble. To do so, we repeated the simulation study described in Section 3.3, drawing problem instances from the suite S(URPE,Gauss; d, n, k), and recording the proportion of trials in which Stomp has the k-sparse solution property. Figure 3.9 displays the empirical phase diagram resulting from this study, with the empirical phase transition curve superposed. The results suggest that Stomp successfully recovers the sparsest solution at a significantly wider range of signal sparsities, compared with its performance for the USE suite. We conducted parallel studies for the partial Fourier/Hadamard ensembles, reaching similar conclusions. Recall that we have observed similar behavior for Homotopy and Lars in Section 2.9, namely that the two algorithms exhibit better performance when the underlying matrix is a partial orthogonal matrix. This finding is very encouraging, since partial orthogonal ensembles find many applications (see Section 2.6 for a discussion).
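For readers wishing to reproduce diagrams of this kind, the following schematic sketch outlines the experiment loop; the solver and urpe arguments are placeholders for any sparse solver and any routine producing a random partial orthogonal matrix, and the grid sizes, trial count, and success tolerance are illustrative choices rather than the exact settings used for the figures.

```python
import numpy as np

def phase_diagram(solver, urpe, n=400, deltas=np.linspace(0.1, 0.9, 9),
                  rhos=np.linspace(0.1, 0.9, 9), trials=20, tol=1e-4, seed=0):
    """Fraction of trials in which `solver` recovers the planted k-sparse solution."""
    rng = np.random.default_rng(seed)
    success = np.zeros((len(rhos), len(deltas)))
    for a, delta in enumerate(deltas):
        d = int(delta * n)
        for b, rho in enumerate(rhos):
            k = max(1, int(rho * d))
            for _ in range(trials):
                A = urpe(d, n, rng)                    # d x n random partial orthogonal matrix
                x0 = np.zeros(n)
                x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
                x_hat = solver(A, A @ x0)
                success[b, a] += np.linalg.norm(x_hat - x0) <= tol * np.linalg.norm(x0)
    return success / trials
```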

3.5.2 Other Coefficient Ensembles

The phase transitions displayed in previous sections were computed with the nonzero coefficients following a Gaussian distribution. We now examine the performance of Stomp when the nonzeros are drawn from different distributions, including a uniform distribution over [0, 1], and an equi-probable ±1 distribution. Figure 3.10 displays


Figure 3.9: Phase diagram of the k-sparse solution property of Stomp applied to the problem suite S(URPE,Gauss; d, n, k). Overlaid red curve is the empirical phase transition.

phase diagrams and phase transitions of the k-sparse solution property of Stomp applied to the problem suites S(USE,Uniform; d, n, k) and S(USE,Bernoulli; d, n, k). We notice that the performance of Stomp is affected by the underlying distribution of the nonzeros. In particular, the worst behavior is recorded when the underlying nonzeros have a fixed amplitude and random signs (we observed similar behavior for Omp in Section 2.9).

3.5.3 Noisy Data

A key feature of Stomp is that it extends quite naturally to the case of data contaminated with white Gaussian noise. Indeed, suppose that our observations y obey y = Aα0 + ξ, where ξ denotes an iid N (0, η 2 ) noise. Following the heuristic reasoning of Section 3.2.3, it is not hard to see that the disturbance vector will obey the same normal


Figure 3.10: Performance of Stomp when the nonzeros have a non-Gaussian distribution. Panel (a): Suite S(USE,Uniform; d, n, k); Panel (b): Suite S(USE,Bernoulli; d, n, k). Overlaid red curve is the empirical phase transition for the k-sparse solution property.

approximation, with a different variance. Hence, to the extent that our approach was applicable before, it remains applicable. In other words, since Stomp amounts to a sequence of Gaussian denoising procedures, the presence of ‘external’ Gaussian noise in the model does not change the basic approach. An example of Stomp’s operation in the presence of Gaussian noise is given in the next section.

3.6 Example: Component Separation

We conclude the discussion of Stomp with an example showcasing its performance when applied to a simple problem, of separating a signal into its harmonic and impulsive components. The experimental setup is similar to the one described in Section 2.10. In detail, assume we have a signal y of length d, known to admit sparse synthesis by a few selected sinusoids and a few spikes, with a total of k such components. Formally, we have y = [I F ]x, where I is a d × d identity matrix, F is a d × d Fourier matrix, and x is a 2d coefficient vector. The paper [15] established bounds on the sparsity k, under which `1 minimization successfully recovers the coefficient vector x in this underdetermined setting. Here we investigate the performance of Stomp as an alternative to direct `1 minimization.
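A small sketch of this experimental setup is given below; to keep everything real-valued, an orthonormal DCT stands in for the Fourier block F, and the spike count, amplitudes, and harmonic positions are illustrative choices rather than the exact ones used for the figures.

```python
import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(0)
d = 512
F = idct(np.eye(d), axis=0, norm='ortho')            # orthonormal "frequency" atoms as columns
A = np.hstack([np.eye(d), F])                        # d x 2d time-frequency dictionary [I F]

x = np.zeros(2 * d)
spikes = rng.choice(d, 32, replace=False)
x[spikes] = rng.standard_normal(32) * np.sqrt(0.5)   # 32 spikes with N(0, 1/2) amplitudes
x[d + np.array([40, 90])] = 1.0                      # two harmonic terms (positions illustrative)
y = A @ x                                            # the observed signal to be separated
```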

[Figure 3.11 appears here: top row (a) Original Signal, d = 512, (b) Time Component, (c) Frequency Component; bottom row (d) STOMP Solution, (e) Time Component of Solution, (f) Frequency Component of Solution.]

Figure 3.11: Time-frequency separation with Stomp. Top row: Original signal, with its time and frequency components; Bottom row: Corresponding output of Stomp. The procedure resulted in perfect reconstruction of the synthesis coefficients.

Figure 3.11(a) shows a signal y of length d = 512, consisting of 2 harmonic terms perturbed by 32 spikes, with amplitudes drawn at random from a normal distribution N (0, 1/2), for a total of k = 34 nonzero synthesis coefficients in the time-frequency dictionary. The spike and sinusoid components of y are plotted individually in panels (b) and (c). We solved the resulting underdetermined system using Stomp to recover the synthesis coefficients. Results are portrayed in the bottom row of Figure 3.11. In this example, Stomp perfectly recovered the synthesis coefficients in 4 stages. In the previous section, we commented that Stomp may be easily adapted to treat data contaminated by white Gaussian noise, by simply adapting the threshold calculation to take into account the Gaussian interference. We now illustrate the use of this strategy, repeating the experiment described above, this time with y contaminated by additive white Gaussian noise. Formally, we measure data y = Aα0 + z, where z is a vector of iid Gaussian entries, with variance σz2 . The generated signal is depicted in the top row of Figure 3.12, where σz was chosen so that the signal-to-noise ratio (SNR) is equal to 5. We again apply Stomp to the data, in order to recover the synthesis coefficients. The results are displayed in the bottom row of Figure 3.12. Visually, we see that Stomp faithfully separated the signal into its time and frequency

components. The measured error between the synthesis coefficients and the solution of Stomp is ‖α̂ − α0‖2/‖α0‖2 = 0.19.

[Figure 3.12 appears here: top row (a) Noisy Signal, d = 512, (b) Time Component, (c) Frequency Component, (d) Gaussian Noise, SNR = 5; bottom row (e) STOMP Solution, (f) Time Component of Solution, (g) Frequency Component of Solution, (h) Residual.]

Figure 3.12: Time-frequency separation with Stomp, in the presence of Gaussian noise. Top row: Noisy signal, with its time and frequency components; Bottom row: Corresponding output of Stomp. The reconstruction error is ‖α̂ − α0‖2/‖α0‖2 = 0.19.

3.7 Discussion

We described a simple algorithm for obtaining sparse approximate solutions of certain underdetermined systems of linear equations. The Stomp algorithm iteratively thresholds, selects and projects. It selects nonzeros by thresholding the output of the matched filter applied to the current residual signal, where the threshold is set above the mutual interference level. It projects the residual on a lower-dimensional subspace complementary to the span of the selected nonzeros. Stomp is effective when the matrix A and/or the object α0 render the residual disturbance vector approximately Gaussian. Relying on such Gaussianity, we argued that we can view the problem as a sequence of Gaussian denoising problems, and proposed a natural strategy for threshold selection, based on false alarm rate control. Utilizing an empirical analysis framework, we examined the performance of this


approach in recovering sparse solutions, and demonstrated that when the underlying matrix is a uniform spherical matrix or a partial orthogonal matrix, the ‘success phase’ of Stomp is comparable in size to the ‘success phase’ of `1 minimization. We further showed that Stomp has very modest computational demands, and it often delivers results comparable to `1 minimization with dramatically less computation. Overall, our two proposed schemes for sparse solution of underdetermined linear systems, namely Homotopy and Stomp, form an arsenal of solution techniques allowing researchers to tackle a diverse set of sparse approximation problems. As we demonstrated in Chapter 2, Homotopy is an efficient stepwise algorithm that shares the benefits of `1 minimization in recovering sparse solutions. In particular, when the k-step solution property holds, Homotopy finds the sparsest solution very rapidly, at a cost comparable to that of an approximate iterative scheme such as Stomp. Thus, Homotopy is most suitable when operating within the bounds of the k-step region, particularly when solution accuracy is paramount. In contrast, Stomp’s modest computational cost and its simple structure make it an attractive choice when dealing with very large-scale problems, such as the ones encountered in image processing applications [34, 36, 55, 56, 70, 71, 83, 86]. In the second part of this thesis, we discuss several application scenarios in which these two complementary approaches play a major role.

Part II

Applications

Chapter 4

Extensions of Compressed Sensing

In the first chapter, we briefly described the notion of Compressed Sensing (CS), pioneered by Donoho [21] and Candès et al. [10, 14]. The notion proposes that a signal or image, unknown but supposed to be compressible by a known transform, can be subjected to fewer measurements than the nominal number of data points, and yet be accurately reconstructed. The 'compressed' samples are nonadaptive and measure 'random' linear combinations of the transform coefficients. Approximate reconstruction is obtained by solving an `1 minimization problem for the transform coefficients. In this chapter, we study the practical implications of compressed sensing and discuss some interesting extensions and applications. Most of the material in this chapter has appeared in the manuscript [83], published in the special issue on sparse approximations in signal and image processing in the EURASIP Signal Processing Journal. It is copyright 2006 by Elsevier and is reused with permission. Some of the material has also appeared in [29, 30], which have been submitted for publication in IEEE Transactions on Information Theory.

4.1 Introduction

In the modern multimedia-saturated world, fantastic volumes of digital imagery, video, and sound are generated daily. A crucial fact facilitating this multimedia


revolution is that most humanly-intelligible data are highly compressible. In exploiting this fact, the dominant approach is to first sample the data, and then eliminate redundancy using various compression schemes. This raises the question: why is it necessary to sample the data in a pedantic way and then later to compress it? Can’t one directly acquire a compressed representation? Clearly, if this were possible, there would be implications in a range of different fields, extending from faster data acquisition, to higher effective sampling rates, and lower communications burden. In a paper entitled “Compressed Sensing” [21], Donoho, building on work of Cand`es et al. [10], developed theory establishing that under various assumptions, it is possible to acquire directly a form of compressed representation. We now offer a brief exposition of the main ideas underlying compressed sensing. The approach is rather abstract, but has the advantage of factoring across many potential applications areas. We assume the object of interest is a vector x0 ∈ Rn representing the n sampled values of a digital signal or image. A key assumption is that the object is compressible; there exists some linear transformation Ψ so that the resulting vector ΨT x0 has most of its information contained in a relatively small number of coordinates. This compressibility constraint is a weaker form of sparsity, allowing for a wider class of signals. For example, if x0 represents an image, common transform coding techniques include DCT (used in the image compression standard JPEG [49]), and Wavelets (used in JPEG-2000 [52]). Mathematically, we let Ψ denote the n × n matrix of the orthogonal transform in question, having columns ψi , i = 1, . . . , n. By ‘compressible’ we mean that for some p ≤ 1 and moderate R > 0, kΨT x0 kp ≤ R. In words, we require that most of the energy of the signal be concentrated in a relatively small number of coefficients; see [21] for further explanation of this condition. To make our compressed measurements, we start from a d×n matrix Φ with d < n satisfying certain conditions called CS1–CS3 in [21]. We form the matrix Ξ = ΦΨT , also d × n. We take the d-pieces of measured information y = (y(i) : 1 ≤ i ≤ d) according to the linear system y = Ξx0 . Since d < n this amounts to having fewer measurements than degrees of freedom for the signal x0 . The individual measured values are of the form y(i) = hξi , x0 i, i.e., each can be obtained by integrating the object x0 against a measurement kernel ξi . Here ξi denotes the i-th row of Ξ; from


the formula Ξ = ΦΨT and the known properties of CS-matrices Φ, we may view each measurement kernel as a kind of ‘random’ linear combination of basis elements ψj . For the purposes of our discussion, it suffices to say that matrices drawn from the Uniform Spherical Ensemble were shown to satisfy conditions CS1–CS3 with high probability; see [21] for a thorough discussion of the properties of the sampling matrix Φ. To reconstruct an approximation to x0 we solve (CS1 )

min_x ‖Ψ^T x‖1 subject to y = Ξx.        (4.1)

Call the result x̂1. We recognize (CS1) as familiar `1 minimization on the transform coefficients Ψ^T x. The paper [21] proved error bounds showing that despite the apparent undersampling (d < n), good accuracy reconstruction is possible for compressible objects. Specifically, Donoho showed that for large d, n, the reconstruction error is bounded by

‖x̂1 − x0‖2 ≤ Cp · R · (d/log(n))^{1/2−1/p},    d, n > d0.        (4.2)

As p < 2, this bound guarantees that reconstruction is accurate for large d, with a very weak dependence on n. The interpretation of this bound offered in [21] is quite remarkable. It states that if the object x0 has most of its information concentrated in the N largest coefficients of ΨT x0, then x0 can be recovered with high accuracy from d = O(N log n) measurements, rather than the n measurements needed with traditional methods. When the object of interest x0 is highly compressible, and N ≪ n, compressed sensing can indeed offer significant savings over traditional sampling. From a practitioner's point of view, the result (4.2) raises several questions, including:

• How large do n and m have to be? Is (4.2) meaningless for practical problem sizes?

• How large is the constant Cp? Even if (4.2) applies, perhaps the constant is miserable?

• When the object is not perfectly reconstructed, what sorts of errors occur? Perhaps the error term, though controlled in size by (4.2), has an objectionable structure?

• How should the CS framework be applied to realistic signals? In [21], stylized models of spectroscopy and imaging were considered; for such models, it was proposed to deploy CS in a hybrid strategy, with the relatively small number of measurements at coarse scales obtained by classical linear sampling, and the bulk of measurements at fine scales obtained by the CS strategy. Are such ideas, originally derived for the purpose of enabling mathematical analysis, actually helpful in a concrete setting?

• What happens if there is noise in the measurements? Perhaps the framework falls apart if there's any noise in the observations – even just the small errors of floating-point representation.

In this chapter, we extend the work in [21], and attempt to answer these questions with empirical studies of compressed sensing. While simulations alone cannot definitively answer some of these questions, they do provide valuable evidence that may inspire future theoretical studies, and suggest directions for future experimental and applications work. In Section 4.2, we give examples applying CS to signals that have only a small number of truly nonzero coefficients. Section 4.3 considers signals that have all coefficients nonzero, but still have sparse coefficients as measured by `p norms, 0 < p ≤ 1. We analyze the empirical results, obtain numerical evidence for the constant Cp, and compare the empirical errors with the theoretical bound (4.2). In Section 4.4, we explore the freedom available in the choice of CS matrices, describing several different matrix ensembles that seem to give equivalent results. While [21] focused on uniform spherical matrices, we have found that several other ensembles work well, including the partial Fourier ensemble (PFE) and the partial Hadamard ensemble (PHE).


We observe that the results in the `p setting are in some sense ‘noisy’ when the number of sensed samples is small, even when the data measurements are noiseless. To alleviate that, in Section 4.5 we consider an extension of CS by post-processing to remove the ‘noise’ in CS reconstructions. Section 4.6 considers noisy observations, and develops a noise-aware reconstruction method. In Section 4.7, we consider an extension of CS mentioned in [21]: a hybrid scheme, using conventional linear sampling and reconstruction for coarse-scale information and compressed sampling on fine-scale information. The ‘noise’ in reconstructions can be dramatically lower for a given number of samples. Section 4.8 pushes this approach farther, deploying CS in a fully multiscale fashion. Different scales are segregated and CS is applied separately to each one. Again, the ‘noise’ can be dramatically lower. In the previous chapter, we introduced Stomp as an alternative to `1 minimization for finding approximate sparse solutions of underdetermined systems. Section 4.9 examines the performance of Stomp in a compressed sensing framework. Section 4.10 then considers a potential application of Compressed Sensing to fast acquisition of MR Imaging. A final section offers a discussion and some concluding remarks.

4.2 `0 Sparsity

We start with a warm-up question. Suppose we have an object x0 with a very high degree of transform sparsity: only k nonzeros out of n coefficients; how big does d have to be for the CS scheme to work well? To study this question, we conduct a small-scale experiment. For fixed n and a series of different values for k, we construct signals with k nonzeros, located in random positions in the transform domain, all of equal amplitude. We then apply the CS framework, while varying d, the number of samples used in the sensing scheme. The CS matrix Φ is drawn from the USE, i.e., the columns are drawn independently at random from a uniform distribution on the unit sphere S^{d−1} in Euclidean d-space. In addition, we set Ψ = I. We repeat this experiment 20 times (independently), and record the worst-case behavior for each (d, n, k) instance, i.e., the maximal reconstruction error ‖x̂1 − x0‖2.
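The following bare-bones sketch reproduces the structure of this warm-up experiment for a single (d, n, k) instance; the use of a generic linear-programming solver (SciPy's HiGHS interface) for the `1 problem is an assumption made purely for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def use_matrix(d, n, rng):
    """Columns drawn uniformly at random from the unit sphere in R^d."""
    Phi = rng.standard_normal((d, n))
    return Phi / np.linalg.norm(Phi, axis=0)

def l1_solve(Phi, y):
    """min ||x||_1 s.t. Phi x = y, via the split x = u - v with u, v >= 0."""
    d, n = Phi.shape
    res = linprog(np.ones(2 * n), A_eq=np.hstack([Phi, -Phi]), b_eq=y,
                  bounds=(0, None), method='highs')
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(0)
n, d, k = 1024, 200, 20
Phi = use_matrix(d, n, rng)
x0 = np.zeros(n)
x0[rng.choice(n, k, replace=False)] = 1.0       # k equal-amplitude nonzeros at random sites
x_hat = l1_solve(Phi, Phi @ x0)
print(np.linalg.norm(x_hat - x0))               # essentially zero once d is a few multiples of k
```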

[Figure 4.1 appears here: three panels, (a) k = 20, (b) k = 50, (c) k = 100, plotting ‖x1,n − x0‖2 against the number of samples, 0–500.]

Figure 4.1: Error of reconstruction versus number of samples for (a) k = 20 nonzeros; (b) k = 50; (c) k = 100.

[Figure 4.2 appears here: (a) Signal Blocks, n = 2048; (b) Wavelet analysis.]

Figure 4.2: (a) Signal Blocks; (b) its expansion in a Haar wavelet basis. There are 77 nonzero coefficients.

Figure 4.1, panels (a)–(c), display the `2 reconstruction error as a function of d, with n = 1024, k = 20, 50 and 100. In each case we see that as d increases beyond about 2k, the error starts to drop, and eventually becomes negligible at some multiple of k. As a more 'signal-oriented' instance of this phenomenon, we considered the object Blocks from the Wavelab package [7]. As Figure 4.2 shows, the object is piecewise constant, and its Haar wavelet transform has relatively few nonzero coefficients. In fact, Blocks has k = 77 nonzero coefficients out of n = 2048 samples. Figures 4.3(a),(b) show the reconstruction results with d = 340 and 256 compressed measurements, respectively. Clearly, the results are better for larger d, and somewhere around d = 340 ≈ 4k the method works well, while for d = 256 ≈ 3k the results are somewhat 'noisy'. The results suggest a rule of thumb for these combinations of n, d and k: if an object has a representation using only k nonzeros at randomly-chosen sites, something like d ≈ 4k measurements would typically be needed. Indeed, this is essentially a reinterpretation of the rule d ≈ ck log(n) derived from (4.2) earlier. It suggests that in practice, since the range of problem dimension n is limited in scale, the log(n) factor has little impact on the number of measurements needed for accurate reconstruction.

[Figure 4.3 appears here: (a) CS Reconstruction of Blocks, d = 340; (b) CS Reconstruction of Blocks, d = 256; (c) T-I denoising of reconstruction with d = 256.]

Figure 4.3: CS reconstructions of Blocks from (a) d = 340 and (b) d = 256 measurements. (a) shows perfect reconstruction (modulo numerical effects), while (b) is visibly noisy. Panel (c) shows Translation-Invariant denoising of (b).

[Figure 4.4 appears here: (a) Original Signal, n = 1000; (b) CS Reconstruction, d = 100, ‖x1,n − x0‖2 = 0.166.]

Figure 4.4: CS reconstruction of a signal with controlled `p norm, p = 3/4. (a) Original signal, n = 1000; (b) CS reconstruction with d = 100 samples, ‖x̂1 − x0‖ = 0.166. The most significant coefficients are recovered accurately.

The notion of `0 sparsity — while powerful — is inherently limited in its applicability. It provides an easily-understandable way to introduce the idea of sparsity to 'newcomers'; but in practice, real signals will not typically have exact zeros anywhere in the transform. Hence, examples of the type just shown serve merely as a warm-up.

4.3 `p Sparsity

We now consider a more widely applicable notion of sparsity, based on controlling the `p norm ‖x‖p = (Σi |x(i)|^p)^{1/p} with 0 < p ≤ 1. Equating 'sparsity' with 'small `p norm' greatly expands the class of signals we may consider sparse, since objects can have all entries nonzero, but still have small `p norm; at the same time, signals sparse in the `0 case are (in an appropriate sense) also sparse in the `p sense. There is an intimate connection between `p norms, 0 < p ≤ 1 and the number of nonzeros (a.k.a. the `0 norm); for example, as p → 0, ‖x‖p^p → ‖x‖0. See [21] and citations there for examples of `p constraints obeyed for natural classes of signals. For the simulations in this section, we generated random signals with controlled `p norm in the following manner. A signal θ is created with coefficients having random signs, and the amplitude, |θ|(k), of the k-th largest-sized coefficient obeys

|θ|(k) = (k · log n)^{−1/p},    k = 1, 2, . . . .        (4.1)

Signals constructed in this manner have the property that

‖θ‖p^p = Σ_{k=1}^{n} (k · log n)^{−1} = (1/log n) Σ_{k=1}^{n} 1/k = 1 + o(1),    n → ∞.

Clearly, all of the coefficients θ(k) are nonzero, i.e. kθk0 = n. Figure 4.4 panel (a) shows such a spiky object with n = 1000, p = 3/4, and panel (b) shows the reconstruction with d = 100 sensed samples. Note that the CS scheme recovers most of the ‘important’ coefficients well, but fails to recover some of the very small ones. With that, we turn to questions posed in the introduction regarding the applicability of the bound (4.2). Namely, for this class of ‘`p -sparse signals’, how does the measured `2 error in the CS reconstruction compare to the theoretical predictions? How large is Cp in (4.2)? This question is theoretical in nature, as it refers to an inequality covering all possible x-vectors with kxkp ≤ R, rather than typical ones seen in practice. Nevertheless empirical studies conducted with typical signals can be instructive. As a starter, we would like to have empirical evidence about the dependence of the reconstruction error on d, n, and p. This should reflect the behavior of the constant Cp . To that end, consider the following simulation study. For a range of d, n, p, generate a matrix Φd×n from the uniform spherical ensemble, set Ψ = I, and create a random signal x0 of length n with a controlled `p norm in the manner described above. Compute the sensed measurement vector y = ΦΨT x0 and solve (CS1 ). Finally, measure the reconstruction error kˆ x1 − x0 k2 . Repeat this basic procedure several times for each (d, n, p) triplet and record the largest error. We ran this study with p in the range [0.25, 1], n in the range [1000, 4000], and d in the range [0.05n, 0.75n]. We ran 20 instances of the experiment for each (d, n, p) triplet.
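A short sketch of the signal model (4.1) is given below; placing the coefficients at random positions and the particular random seed are illustrative assumptions.

```python
import numpy as np

def lp_signal(n, p, rng):
    """Random signs, k-th largest amplitude (k * log n)^(-1/p), positions scattered at random."""
    amp = (np.arange(1, n + 1) * np.log(n)) ** (-1.0 / p)
    x = rng.choice([-1.0, 1.0], size=n) * amp
    rng.shuffle(x)
    return x

rng = np.random.default_rng(1)
x0 = lp_signal(1000, 0.75, rng)
print(np.sum(np.abs(x0) ** 0.75))   # close to 1, as the computation above predicts
```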


To summarize our results, we developed an empirical substitute C̃p for the theoretical constant Cp in (4.2). For each p, we maximized the error over all (d, n) pairs, and compared to the right-hand side of (4.2), obtaining:

C̃p = max_{d,n} max_{x0} ‖x̂1 − x0‖2 / (‖x0‖p · (log(n)/d)^{1/p−1/2}).

Here the maxima are taken over the full range of (d, n) and over the full collection of signals x0 generated in our experiments. The results are given in Table 4.1.

p      0.25     0.3      0.35     0.4      0.45     0.5      0.55     0.6
C̃p     0.127    0.133    0.122    0.0755   0.0525   0.0402   0.0253   0.0467

p      0.65     0.7      0.75     0.8      0.85     0.9      0.95     1
C̃p     0.0619   0.0821   0.0905   0.113    0.159    0.177    0.191    0.216

Table 4.1: Empirically-derived constant C̃p.
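For concreteness, C̃p can be computed from the logged simulation results as in the small sketch below; the array arguments (the worst observed error for each experiment, the corresponding signal `p norms, and the values of d and n) are placeholders for quantities recorded during the study.

```python
import numpy as np

def empirical_constant(errs, lp_norms, ds, ns, p):
    """Largest observed ratio of reconstruction error to the right-hand side of (4.2)."""
    errs, lp_norms = np.asarray(errs), np.asarray(lp_norms)
    ds, ns = np.asarray(ds), np.asarray(ns)
    return np.max(errs / (lp_norms * (np.log(ns) / ds) ** (1.0 / p - 0.5)))
```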

Examining the table, we observe that very modest values of C̃p arise for the range of d, n we considered. This suggests that the theoretical value Cp may indeed not be of catastrophic proportions. Clearly, our estimate C̃p is at best a lower bound for Cp, since it simply indicates the worst values encountered in our simulations rather than the worst values possible. Still, it gives us practical information reflecting actual behavior at plausible '`p sparse' signals, such as the one portrayed in Figure 4.4(a). With this estimate in hand, we wish to see if the empirical reconstruction error behaves in the manner (4.2) predicts, as d varies. For this purpose, we conducted a simulation study, this time keeping n, p constant as d grows. Results, in the form of error plots versus the number of samples d, are shown in Figure 4.5, panels (a)–(d), for n = 2000 and p = 0.4, 0.6, 0.8, 1, respectively. Each plot shows the measured empirical error, alongside a plot of the quasibound derived from the putative relationship

‖x̂1 − x0‖2 ≲ C̃p · ‖x0‖p · (d/log(n))^{1/2−1/p};        (4.2)

this is similar to (4.2) with Cp replaced by our empirical quantity C̃p. We employ the quasi-inequality symbol ≲ to remind us that this is a tool for summarizing observed error behavior rather than a mathematical formula like (4.2).

Figure 4.5 shows how (4.2) compares to the observed error for practical signal lengths. As d increases, the error decreases like a power law depending on p. Moreover, we observe that for smaller p, the error tends to drop much more rapidly, as the bound predicts. To examine the behavior of the constant C̃p as a function of the signal length n, we conducted a similar study, this time keeping d, the number of sensed samples, constant, and varying n, the signal length. Results are displayed in Figure 4.6, panels (a),(b), for d = 200 random measurements and p = 0.5, 1. Again, it appears that (4.2) reflects the behavior of the reconstruction error ‖x̂1 − x0‖2 as n grows. Finally, we wish to tie the evidence from this section with the results presented in Section 4.2, by comparing the bounds (4.2) with observed error behavior for '`0 sparse' signals. For signals of length n with k nonzeros of equal amplitude A, the `p norm is A · k^{1/p}. This value may be used in the quasibound (4.2) along with the estimates for C̃p in Table 4.1, and contrasted with the empirical reconstruction error displayed in Figure 4.1. Figure 4.7 displays results for several values of p. Clearly, `0-sparse signals yield better error behavior than any of the bounds (4.2).

[Figure 4.5 appears here: four panels, (a) n = 2000, p = 0.4, (b) p = 0.6, (c) p = 0.8, (d) p = 1, each plotting the observed error and the bound against the number of measurements.]

Figure 4.5: Reconstruction error versus number of measurements d, for fixed signal length n = 2000: (a) p = 0.4, (b) p = 0.6, (c) p = 0.8, (d) p = 1. Solid curves show maximum error in 20 pseudo-random replications at each (d, n) combination. Dashed curves show quasi-bound (4.2) using constants C̃p from Table 4.1.

[Figure 4.6 appears here: two panels, (a) d = 200, p = 0.5, (b) d = 200, p = 1, plotting the observed error and the bound against the signal length n.]

Figure 4.6: Reconstruction error versus signal length n, for fixed number of measurements d = 200: (a) p = 0.5, (b) p = 1. Solid curves show error in 20 pseudo-random replications. Dashed curves show quasi-bound (4.2) using constants C̃p from Table 4.1.

[Figure 4.7 appears here: two panels, (a) k = 50, (b) k = 100, plotting the observed error and the bounds (4.2) for p = 0.4, 0.6, 0.8, 1.]

Figure 4.7: Reconstruction error versus number of measurements d, for fixed signal length n = 1024 and (a) k = 50 nonzeros, (b) k = 100 nonzeros. Dashed curve: observed error. Solid curve: error bounds (4.2) for various p.

4.4 Different Ensembles of CS Matrices

In the examples shown so far, we have constructed CS matrices Φ from the uniform spherical ensemble. As we have seen in the previous chapter, numerous other possibilities exist for construction of CS matrices. In this section, we examine whether the error bound (4.2) holds when the matrix Φ is drawn from different ensembles. In particular, we consider the following three ensembles:

• Random Signs Ensemble (RSE). Here Φij has entries ±1/√d with signs chosen independently and both signs equally likely.

• Partial Fourier Ensemble (PFE). We select at random d rows out of the n × n Fourier matrix, getting a d × n partial Fourier matrix.

• Partial Hadamard Ensemble (PHE). We select at random d rows out of the n × n Hadamard matrix, getting a d × n partial Hadamard matrix.

In Figure 4.8, we compare the quasibound (4.2) with actual errors for the different matrix ensembles. In doing so, we follow the procedure of the previous section. Specifically, for each ensemble, we consider an object defined as in Section 4.3, i.e. an n-vector whose k-th largest amplitude coefficient |θ|(k) obeys (4.1). A typical example is shown in Figure 4.4(a). We set p = 3/4, n = 2048, and consider families of experiments where d, the number of measurements, varies. For each d, we apply the CS framework and measure the `2 reconstruction error. We repeat this experiment 20 times for each (d, n, p) triplet, and record the maximal error. Figure 4.8 depicts error versus sampling d for the four different ensembles described above. We also display the error quasi-bound (4.2). Here the constant C̃p was taken from Table 4.1, namely C̃3/4 ≈ 0.09. Figure 4.8 prompts several observations. First, the simulation results for the different ensembles are all qualitatively in agreement with the theoretical form of the error behavior in (4.2). Moreover, the relationship (4.2) gives a fairly good description

[Figure 4.8 appears here: four panels, (a) Uniform Spherical Ensemble, (b) Random Signs Ensemble, (c) Partial Hadamard Ensemble, (d) Partial Fourier Ensemble, each plotting the observed error and the bound against d.]

Figure 4.8: Error versus number of measurements d for p = 3/4, n = 2048: (a) Uniform Spherical Ensemble (b) Random Signs Ensemble (c) Partial Hadamard Ensemble; (d) Partial Fourier Ensemble.


of the true behavior observed in practice. Perhaps most importantly, we observe that the different ensembles all exhibit similar behavior. This suggests that all such ensembles are equally good in practice.
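The sketch below illustrates how the three ensembles described above can be instantiated; explicit matrices are formed for clarity, although in practice the partial Fourier and partial Hadamard matrices would be applied as fast operators, and the power-of-two requirement on n is inherited from the Hadamard construction.

```python
import numpy as np
from scipy.linalg import hadamard

def rse(d, n, rng):
    """Random Signs Ensemble: entries +/- 1/sqrt(d)."""
    return rng.choice([-1.0, 1.0], size=(d, n)) / np.sqrt(d)

def pfe(d, n, rng):
    """Partial Fourier Ensemble: d rows drawn at random from the orthonormal DFT matrix."""
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)
    return F[rng.choice(n, size=d, replace=False), :]

def phe(d, n, rng):
    """Partial Hadamard Ensemble: d rows drawn at random from the Hadamard matrix (n a power of 2)."""
    H = hadamard(n) / np.sqrt(n)
    return H[rng.choice(n, size=d, replace=False), :]
```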

4.5 Denoising CS Results

When the object is undersampled, CS reconstructions are typically noisy. We considered the object Bumps from the Wavelab package [7], rendered with n = 2048. As Figure 4.9 panel (a) shows, the object is a superposition of smooth bumps. Panel (b) shows that the large coefficients at each scale happen near the bump locations, and panel (c) shows the decreasing rearrangement of the wavelet coefficients on a log scale. The linear appearance of this display is indicative of power-law behavior. We applied the CS framework, with Φ drawn, as usual, from a uniform spherical distribution, and Ψ an orthogonal wavelet basis, with Daubechies ‘symmlet8’ filters. Panel (a) of Figure 4.10 shows the results of the reconstruction with d = 256 measurements. Panel (b) shows the result with d = 512. Clearly both results are ‘noisy’, with, understandably, the ‘noisier’ one coming at the lower sampling rate. Indeed, the results in Figures 4.3 and 4.10 show that ‘noise’ sometimes appears in reconstructions as the number of measurements decreases (even though the data are not noisy in these examples). To alleviate this phenomenon, we considered the test cases shown earlier, namely Blocks and Bumps, and applied translation-invariant wavelet de-noising [16] to the reconstructed ‘noisy’ signals. Results are shown in panel (c) of Figure 4.3 and panels (b) and (d) of Figure 4.10. At least visually, there is a great deal of improvement.
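A rough sketch of such a post-processing step is given below, assuming the PyWavelets package; the universal threshold with a MAD noise estimate and the number of circular shifts are illustrative choices, not the exact procedure of [16].

```python
import numpy as np
import pywt

def ti_denoise(x, wavelet='sym8', shifts=16):
    """Translation-invariant soft-threshold denoising by averaging over circular shifts."""
    n = len(x)
    detail = pywt.wavedec(x, wavelet)[-1]                      # finest-scale coefficients
    thresh = np.sqrt(2 * np.log(n)) * np.median(np.abs(detail)) / 0.6745
    out = np.zeros(n)
    for s in range(shifts):
        c = pywt.wavedec(np.roll(x, s), wavelet)
        c = [c[0]] + [pywt.threshold(a, thresh, mode='soft') for a in c[1:]]
        out += np.roll(pywt.waverec(c, wavelet)[:n], -s)
    return out / shifts
```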

4.6 Noise-Aware Reconstruction

So far we have not allowed for the possibility of measurement noise, digitization errors, etc. That is, we have assumed that the raw measurements y are perfectly observed linear combinations of the underlying object. Even in that case, the reconstructions can appear noisy, but not because of any ’noise’ in the system. Now we consider the

[Figure 4.9 appears here: (a) Signal Bumps, n = 2048; (b) Wavelet analysis; (c) Coefficient decay (log scale).]

Figure 4.9: (a) Signal Bumps, n = 2048; (b) its wavelet coefficients; (c) Decay of coefficient magnitudes on a log scale. Rearranged coefficients exhibit roughly exponential decay.

[Figure 4.10 appears here: (a) CS Reconstruction of Bumps, d = 256; (b) T-I denoising of (a); (c) CS Reconstruction of Bumps, d = 512; (d) T-I denoising of (c).]

Figure 4.10: CS reconstruction of Bumps from (a) d = 256 and (c) d = 512 measurements. Translation-invariant denoising in (b) and (d).


case where the data actually are noisy. We remark that the theory allows for a small amount of noise already, through the `1-stability property proved in [21]. To go further, assume that rather than measuring y = ΦΨ^T x, we measure yn = ΦΨ^T x + z, where z obeys ‖z‖2 ≤ ε. To accommodate this noise, our primary adjustment would be to relax the equality constraint in CS1, replacing it with

(CS_{1,ε})    min ‖Ψ^T x‖1 subject to ‖yn − ΦΨ^T x‖2 ≤ ε.

Indeed, we have encountered this problem earlier, by the name Lasso or Basis Pursuit DeNoising (BPDN). In Chapter 2 we remarked that the Homotopy algorithm may be easily adapted to solve (CS_{1,ε}), by following the Homotopy path, stopping when the norm of the residual reaches ε. To examine the performance of (L_{1,ε}), we conducted the following experiment. We took the test signals Blocks and Bumps, shown in Figures 4.2 and 4.9, respectively, and added zero-mean white gaussian noise to them. The noise was rescaled to enforce a specific noise level ‖z‖2 = 0.2. We applied the compressed sensing scheme with denoising (CSDN) to the noisy wavelet expansions, solving (CS_{1,ε}) with the Homotopy method. For comparison, we also attempted regular CS reconstruction. Results are shown in Figures 4.11 and 4.12. We chose a signal length n = 2048, and attempted reconstructions with d = 512. Indeed, the reconstruction achieved with CSDN is far superior to the CS reconstruction in both cases. We note that in these simulations, we had prior knowledge of the noise level, which we used in the reconstruction procedure. In practice, the noise level might not always be known; in such cases an estimate of the noise level can be used in the reconstruction.
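As a simple stand-in for the Homotopy-based solver, the following sketch approximates (CS_{1,ε}) by solving the penalized Lasso over a sequence of penalties and keeping the most heavily penalized solution whose residual stays within ε; the use of scikit-learn's Lasso and the log-scale bisection rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bpdn(A, y, eps, lam_lo=1e-6, lam_hi=10.0, iters=25):
    """Approximate min ||x||_1 s.t. ||y - A x||_2 <= eps by bisecting on the Lasso penalty."""
    best = np.zeros(A.shape[1])
    for _ in range(iters):
        lam = np.sqrt(lam_lo * lam_hi)                 # bisect on a log scale
        x = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(A, y).coef_
        if np.linalg.norm(y - A @ x) <= eps:
            best, lam_lo = x, lam                      # feasible: try a larger penalty
        else:
            lam_hi = lam                               # residual too large: shrink the penalty
    return best
```

The bisection exploits the fact that the Lasso residual norm grows monotonically with the penalty, so the constrained and penalized formulations trace out the same solution path.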

4.7 Two-Gender Hybrid CS

In [21] model spectroscopy and imaging problems were considered from a theoretical perspective. CS was deployed differently there than so far in this thesis — in particular, it was not proposed that CS alone ‘carry all the load’. In that deployment, CS

[Figure 4.11 appears here: (a) Noisy Signal, (b) Noisy wavelet coefficients, (c) CS reconstruction, d = 512, (d) CS reconstruction coefficients, (e) CSDN reconstruction, d = 512, (f) CSDN reconstruction coefficients.]

Figure 4.11: CSDN reconstruction of Blocks, n = 2048, d = 512. Signal and reconstructions are shown on the left panel, with corresponding wavelet expansions on the right panel.

[Figure 4.12 appears here: (a) Noisy Signal, (b) Noisy wavelet coefficients, (c) CS reconstruction, d = 512, (d) CS reconstruction coefficients, (e) CSDN reconstruction, d = 512, (f) CSDN reconstruction coefficients.]

Figure 4.12: CSDN reconstruction of Bumps, n = 2048, d = 512. Signal and reconstructions are shown on the left panel, with corresponding wavelet expansions on the right panel.


was applied to measuring only fine-scale properties of the signal, while ordinary linear measurement and reconstruction were used to obtain the coarse-scale properties of the signal. In more detail, the proposal was as follows; we spell out the ideas for dimension 1 only. Expand the object x0 in the wavelet basis

x0 = Σ_k β_{j0,k} φ_{j0,k} + Σ_{j=j0}^{j1} Σ_k α_{j,k} ψ_{j,k},

where j0 is some specified coarse scale, j1 is the finest scale, φ_{j0,k} are male wavelets at coarse-scale, and ψ_{j,k} are fine-scale female wavelets. Let α = (α_{j,k} : j0 ≤ j ≤ j1, 0 ≤ k < 2^j) denote the grouping together of all wavelet coefficients, and let β = (β_{j0,k} : 0 ≤ k < 2^{j0}) denote the male coefficients. Now consider a scheme where different strategies are used for the two genders α and β. For the male coarse-scale coefficients, we simply take direct measurements β̂ = (⟨φ_{j0,k}, x0⟩ : 0 ≤ k < 2^{j0}). For the female fine-scale coefficients, we apply the CS scheme. Let n = 2^{j1} − 2^{j0}, and let the 2^{j1} × n matrix Ψ have, for columns, the vectors ψ_{j,k} in some standard order. Given a d by n CS matrix Φ, define Ξ = ΦΨ^T, so that, in some sense, the columns of Ξ are 'noisy' linear combinations of columns of Ψ. Now make d measurements y = Ξx0. To reconstruct from these observations, define Ω = ΞΨ and consider the familiar `1 minimization problem

(P1)    min ‖α‖1 subject to y = Ωα;        (4.1)


Call the answer α̂. The overall reconstruction is

x̂_hy = Σ_k β̂_{j0,k} φ_{j0,k} + Σ_{j=j0}^{j1} Σ_k α̂_{j,k} ψ_{j,k},

a combination of linear reconstruction at coarse scales and nonlinear reconstruction based on undersampling at fine scales. When we speak of this scheme, of course, the total number of samples dhy = 2j0 +d is the number of linear samples plus the number of compressed samples. The theory derived in [21] shows that this hybrid scheme, when applied to objects obeying certain constraints (e.g. bounded variation) gets accuracy comparable to linear sampling at scale 2−j1 , only using many fewer total samples. The gender-segregated deployment of CS was suggested in [21] for mathematical convenience, but in experiments, it substantially outperforms straightforward genderblind deployment. To see this, consider Figure 4.13. Panel (a) shows a blocky signal of original length n = 16384 reconstructed from d = 512 linear samples, where we set a coarsest scale j0 = 5 and a finest scale j1 = 9. Panel (b) shows reconstruction with dhy = 248 hybrid compressed samples (32 male samples, 216 compressed female samples). As before, we used a sampling matrix Φ drawn from a uniform spherical distribution, and Ψ a Haar wavelet basis. The accuracy is evidently comparable. It is far better than panel (b) of Figure 4.3, which shows the reconstruction from 256 standard CS samples, for n = 2048. Now consider Figure 4.14. Panel (a) shows a bumpy signal of original length n = 16384 reconstructed from d = 1024 linear samples, where we set a coarsest scale j0 = 5, and a finest scale j1 = 10. We applied hybrid reconstruction, again using a sampling matrix Φ drawn from the USE, and Ψ an orthogonal wavelet basis, with Daubechies’ ‘symmlet8’ filters. Panel (b) shows the reconstruction result, with dhy = 640 hybrid compressed samples (32 male samples, 608 compressed female samples). Again the reconstruction accuracy is comparable. It is far better than panel (b) of Figure 4.10, which shows the reconstruction from 512 standard CS samples, for n = 2048.
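The measurement side of the two-gender scheme can be sketched as follows, assuming the PyWavelets package; the Haar wavelet, the coarse scale j0 = 5, the signal length, and the number of compressed samples are illustrative, and in practice Ξ would be applied as a fast operator rather than formed explicitly.

```python
import numpy as np
import pywt

def fine_scale_synthesis(n_signal, wavelet='haar', j0=5):
    """Columns of Psi: the fine-scale (female) wavelets, one per detail coefficient."""
    level = int(np.log2(n_signal)) - j0
    template = pywt.wavedec(np.zeros(n_signal), wavelet, level=level)
    cols = []
    for j in range(1, len(template)):                  # skip the coarse (male) block
        for k in range(len(template[j])):
            c = [np.zeros_like(a) for a in template]
            c[j][k] = 1.0
            cols.append(pywt.waverec(c, wavelet)[:n_signal])
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n_signal, d = 2048, 200
Psi = fine_scale_synthesis(n_signal)                   # n_signal x n, n = number of fine coefficients
Phi = rng.standard_normal((d, Psi.shape[1]))
Phi /= np.linalg.norm(Phi, axis=0)                     # columns on the unit sphere (USE)
Xi = Phi @ Psi.T                                       # rows of Xi are the d measurement kernels
# The coarse-scale (male) coefficients would simply be read off directly from the wavelet transform.
```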

[Figure 4.13 appears here: (a) Linear reconstruction, d = 512; (b) Hybrid CS reconstruction, d = 248; (c) Multiscale CS reconstruction, d = 208.]

Figure 4.13: Reconstruction of signal Blocks, n = 16384. (a) Linear reconstruction from 512 samples, ‖x̂lin − x0‖2 = 0.091; (b) Hybrid CS reconstruction from 248 samples (Gender-Segregated), ‖x̂hy − x0‖2 = 0.091; (c) Multiscale CS reconstruction from 208 samples, ‖x̂ms − x0‖2 = 0.091.

[Figure 4.14 appears here: (a) Linear reconstruction, d = 1024; (b) Hybrid CS reconstruction, d = 640; (c) Multiscale CS reconstruction, d = 544.]

Figure 4.14: Reconstruction of Signal Bumps, n = 16384. (a) Linear reconstruction from 1024 samples, ‖x̂lin − x0‖2 = 0.0404; (b) Hybrid CS reconstruction from 640 samples (Gender-Segregated), ‖x̂hy − x0‖2 = 0.0411; (c) Multiscale CS reconstruction from 544 samples, ‖x̂ms − x0‖2 = 0.0425.


4.8 Multiscale Compressed Sensing

Encouraged by the apparent usefulness of Hybrid CS, we now consider a fully multiscale deployment of CS. The simplest way to explain the concept is to use for our multiscale system a standard 1-D orthogonal wavelet system. The same idea can be applied with other multiscale systems and in higher dimensions. Expand the object x0 in the wavelet basis

x0 = Σ_k β_{j0,k} φ_{j0,k} + Σ_{j=j0}^{j1} Σ_k α_{j,k} ψ_{j,k}.

Consider now a multilevel stratification of the object in question, partitioning the coefficient vector as [(β_{j0,·}), (α_{j0,·}), (α_{j0+1,·}), . . . , (α_{j1−1,·})]. We then apply ordinary linear sampling to measure the coefficients (β_{j0,·}) directly, and then separately apply compressed sensing scale-by-scale, sampling data yj about the coefficients (α_{j,·}) at level j using a dj × 2^j CS matrix Φj. We obtain thereby a total of d_ms = 2^{j0} + d_{j0} + d_{j0+1} + · · · + d_{j1−1} samples, compared to n = 2^{j0} + 2^{j0} + 2^{j0+1} + · · · + 2^{j1−1} = 2^{j1} coefficients in total. To obtain a reconstruction, we then solve the sequence of problems

(P_{1,j})    min ‖α‖1 subject to yj = Φj α,    j = j0, . . . , j1 − 1,

calling the obtained solutions α̂^{(j)}; to reconstruct, we set

x̂_ms = Σ_k β_{j0,k} φ_{j0,k} + Σ_{j=j0}^{j1−1} Σ_k α̂_k^{(j)} ψ_{j,k}.
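A schematic sketch of this per-scale deployment is given below, assuming the PyWavelets package and a small LP-based `1 solver defined inline; the choice dj ≈ 4kj, echoing the rule of thumb observed in Section 4.2, is an illustrative heuristic rather than part of the scheme itself.

```python
import numpy as np
import pywt
from scipy.optimize import linprog

def l1_solve(Phi, y):
    """min ||x||_1 s.t. Phi x = y, via the split x = u - v with u, v >= 0."""
    d, n = Phi.shape
    res = linprog(np.ones(2 * n), A_eq=np.hstack([Phi, -Phi]), b_eq=y,
                  bounds=(0, None), method='highs')
    return res.x[:n] - res.x[n:]

def multiscale_cs(x, wavelet='haar', j0=5, rng=None):
    """Measure each detail scale with its own USE matrix and reconstruct it by l1 minimization."""
    rng = rng or np.random.default_rng(0)
    level = int(np.log2(len(x))) - j0                  # assumes len(x) is a power of two
    coeffs = pywt.wavedec(x, wavelet, level=level)
    rec = [coeffs[0]]                                  # coarse scale: sampled directly
    for alpha in coeffs[1:]:                           # detail scales, coarse to fine
        nj = len(alpha)
        kj = int(np.sum(np.abs(alpha) > 1e-8))
        dj = min(nj, 4 * max(kj, 1))                   # ~4k samples per scale (rule of thumb)
        if dj >= nj:                                   # no saving possible: keep the scale as-is
            rec.append(alpha)
            continue
        Phi = rng.standard_normal((dj, nj))
        Phi /= np.linalg.norm(Phi, axis=0)
        rec.append(l1_solve(Phi, Phi @ alpha))         # compressed measurements y_j, then (P_{1,j})
    return pywt.waverec(rec, wavelet)[:len(x)]
```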


(Of course, variations are possible; we might group together several coarse scales j0 , j0 + 1, . . . , j0 + ` to get a larger value of n.) For an example of results obtained using Multiscale CS, consider Figure 4.13(c). It shows the signal Blocks reconstructed from dms = 208 compressed samples (dj0 = 32 coarse-scale samples, dj = min{2j−1 , 48} compressed samples at each detail scale j, j0 < j ≤ j1 ). Indeed, the reconstruction accuracy is comparable to the linear and gender-segregated results. Similarly, Figure 4.14(c) has multiscale reconstruction of Bumps from dms = 544 compressed samples (dj0 = 32 coarse-scale samples, dj = min{2j−1 , 144} compressed samples at each detail scale j, j0 < j ≤ j1 ). Again the reconstruction accuracy is comparable to that achieved by the other methods. In [21] the ideas of gender-segregated sampling and multiscale sampling were extended to higher dimensions in considering the class of images of bounded variation. The ideas are a straightforward extension of the 1-D case, and we shall not repeat them here. We investigate the performance of multiscale CS sampling applied to image data in the following experiment. Panel (a) of Figure 4.15 displays an image of a painting by Piet Mondrian, the Dutch neo-plasticist. This image has a relatively sparse expansion in a tensor wavelet basis, and therefore is suitable for use in a CS framework. However, we note that while this image has a relatively simple geometric structure, it still presents a challenging trial, as its wavelet expansion is not very sparse. In fact, out of 262144 wavelet coefficients, there are only 798 coefficients with magnitude smaller than 10−2 . Panel (b) of Figure 4.15 shows the reconstruction results using the Haar wavelet expansion, which is naturally suited to images of this type, with a coarsest scale j0 = 5 and a finest scale j1 = 9. A total of 69834 sensed samples were used overall. Since n = 5122 = 262144, this stands for roughly one quarter of the original dimension of the data. Consider now an example working with a frame rather than an ortho-basis, in this case the Curvelets frame [8]. Theory supporting the possible benefits of using this frame for cartoon-like images was developed in [21]. Like the wavelet basis, there is a scale parameter j that specifies the size of the curvelet frame element; we considered a deployment of multiscale compressed sensing


that used different dj at each level. In Figure 4.16 we give the results of this scheme applied to the familiar Shepp-Logan Phantom image. To help the reader gain more insight into the level-by-level performance, we compare in Figure 4.17 the image coefficients and the reconstructed ones. Clearly, there is additional noise in the reconstruction. Nonetheless, this noise is not visually evident in the overall CS reconstruction.

Figure 4.15: Reconstruction of Mondrian image with Multiscale CS. Panel (a): Mondrian painting, 512 × 512 pixels; Panel (b): Multiscale CS reconstruction, ‖x̂ms − x‖2/‖x‖2 = 0.32.

[Figure 4.16 appears here: 512 × 512 images, (a) Original Image, (b) Linear Reconstruction, (c) CS Reconstruction.]

Figure 4.16: (a) Shepp-Logan Phantom image, 742400 Curvelet coefficients; (b) Reconstruction from 480256 Linear Samples, ‖x̂lin − x0‖2 = 0.142; (c) Reconstruction from 103218 Multiscale Compressed Samples using the Curvelets Frame, ‖x̂ms − x0‖2 = 0.134.

4.9 Compressed Sensing with StOMP

The theoretical work underlying compressed sensing [21, 10, 14] is based on properties of minimum-`1 solutions to underdetermined problems. Thus, for the reconstruction procedure, we employ a linear programming solver such as Pdco [69], or the more efficient Homotopy method of Chapter 2. Yet, many exciting applications of compressed sensing involve large problem dimensions, and full-blown `1 minimization becomes prohibitively slow, even for a fast algorithm like Homotopy. This led researchers to seek alternative approaches to reconstruction from compressed measurements. Tropp and Gilbert [81] showed that in the setting of Section 4.2 — where one posits only a few nonzeros — one can typically use orthogonal matching pursuit

(Omp). Yet, their results suggest that recovery by Omp demands higher degrees of sparsity than recovery by `1 methods. In addition, while Omp is an efficient stepwise method, it is still far too costly to apply on large-scale problems. Candès and Romberg [9] tackled the reconstruction problem from a different direction, by using Projection onto Convex Sets (POCS) to reconstruct the sensed object approximately. They present empirical evidence suggesting that good accuracy reconstruction is obtained with their method. While they do not report execution times, our own experiments suggest that their method is still too cumbersome to be applied in practice. In Chapter 3, we introduced Stagewise Orthogonal Matching Pursuit (Stomp), and demonstrated via extensive empirical studies that it successfully recovers the sparsest solution to an underdetermined system in certain instances. By computing its phase transition we saw that the range of signal sparsities for which Stomp recovers the sparsest solution (the region underneath the phase curve) is comparable with that of `1 minimization. Thus, it is reasonable to expect that Stomp would provide faithful reconstructions of objects from compressed measurements, particularly for highly compressible objects. In this section, we present evidence showing that a

[Figure 4.17 appears here: (a) Original signal at scale 7, (b) CS Reconstruction.]

Figure 4.17: (a) Exact Curvelet coefficients of Shepp-Logan at scale 7, (b) Reconstructed coefficients in CS framework.

[Figure 4.18 appears here: (a) Hybrid CS with Homotopy, (b) Hybrid CS with OMP, (c) Hybrid CS with STOMP.]

Figure 4.18: Reconstruction of Bumps with hybrid CS and Stomp. Panel (a): Reconstruction with Homotopy, ‖x̂Hom − x0‖2/‖x0‖2 = 0.04, tHom = 37 sec.; Panel (b): Reconstruction with Omp, ‖x̂OMP − x0‖2/‖x0‖2 = 0.041, tOMP = 21 sec.; Panel (c): Reconstruction with Stomp, ‖x̂STOMP − x0‖2/‖x0‖2 = 0.041, tSTOMP = 5.6 sec.

Our first example uses the object Bumps, considered in earlier sections and shown in panel (a) of Figure 4.9. We repeated the experiment described in Section 4.7: we employed a gender-segregated CS scheme to measure 640 sensed samples (32 male samples, 608 compressed female samples). We compared the performance of Stomp with ℓ1 minimization (implemented with Homotopy) and with Omp. The results are summarized in Figure 4.18. Evidently, the accuracy of the reconstruction obtained with Stomp is similar to that obtained with ℓ1 minimization or Omp. However, the speed of the Stomp implementation is unmatched by Homotopy or Omp: compare the 5.6 seconds Stomp required to generate a solution with the 21 seconds needed by Omp, or the 37 seconds of computation entailed by ℓ1 minimization (execution times were recorded on a 3 GHz Xeon workstation).


Figure 4.19: Reconstruction of 2-D imagery with Multiscale CS and Stomp. ‖x_StOMP − x‖_2/‖x‖_2 = 0.36, t_StOMP = 64 sec.

To further emphasize Stomp's efficiency, we now consider a larger-scale example in two dimensions: reconstruction of the 'Mondrian' image displayed in panel (a) of Figure 4.15. Again, we repeated the experiment described in Section 4.8, deploying CS in a multiscale fashion applied to the 4 finest scales of the wavelet expansion. As before, a total of 69834 sensed measurements were used overall. Figure 4.19 shows the reconstruction result achieved with Stomp. The reconstructed image is comparable in accuracy to the ℓ1 reconstruction; compare panel (b) of Figure 4.15. Notice, however, the difference in computational cost: the ℓ1 reconstruction (via Pdco) took over 30 hours of computation to complete, whereas Stomp produced a result of comparable accuracy in just over a minute. Such dramatic time savings make CS with Stomp practical in very large-scale applications. We find it instructive to take a closer look at the reconstruction results in the wavelet domain. To that end, we zoomed in on a horizontal slice of 100 wavelet coefficients at one scale below the finest, as displayed in panel (a) of Figure 4.20.


Figure 4.20: Zoom in on a horizontal slice of wavelet coefficients. Panel (a): Original horizontal slice of coefficients, at scale 7; Panel (b): ℓ1 reconstruction with Pdco; Panel (c): Stomp reconstruction.

Comparing the reconstruction results with the original slice of coefficients reveals a great deal about the two algorithms' behavior when the underlying signal is not sufficiently sparse. Stomp successfully recovered the significant coefficients while keeping the remaining coefficients at zero. In fact, it makes sense to view the small coefficients as the result of digitization noise, in which case the thresholding algorithm is actually rejecting noise while remaining faithful to the original signal. In contrast, ℓ1 minimization tends to exacerbate the noise under insufficient sparsity conditions, as we witnessed in Section 4.5, and as the close-up view in panel (b) of Figure 4.20 suggests.

4.10 Compressed MRI Acquisition

In previous sections we presented several experiments investigating the performance of compressed sensing applied to 1-D and 2-D signals. Some of these examples were inspired by applications in signal and image processing. For instance, in Section 4.8 we described a compressed sensing scenario in which 'random' combinations of Curvelet coefficients were used as sensed samples of a synthesized image. Unfortunately, in order to generate these linear combinations of Curvelet coefficients, one needs to compute the Curvelet transform of the image, which in turn requires knowledge of the image itself. Put differently, there is currently no way to selectively measure Curvelet coefficients of an image without prior knowledge of the entire image (future developments might prove otherwise). Similar arguments can be made for the other examples presented. In this section, we describe an application scenario in which the compressed sensing approach is more naturally applicable.

Magnetic Resonance Imaging (MRI) is a technique used primarily in medical settings to produce high-quality images of internal organs of the human body. Acquisition of an MR image involves adapting the gradients of an external magnetic field to obtain a spatial-frequency map of the exposed organ. The theory behind MR imaging is well beyond the scope of this text, and we shall not attempt to explore it here; we are, however, interested in the mathematical model used to recover an image from MR measurements. In a nutshell, to selectively image different voxels of a subject, an MRI scanner applies an RF pulse in the presence of orthogonal magnetic gradients. By controlling the gradients, the magnetic field can be modified at a specific location, and the response is read by an RF receiver. These spatially encoded phases are recorded in a 2-D or 3-D matrix; the recorded data represent spatial frequencies of the imaged object. In other words, the output of an MRI scanner is the frequency representation (at a discrete set of frequencies) of the scanned subject. To recover a spatial representation of the object, an inverse Fourier transform is typically applied to the scanned data.

The connections with compressed sensing ideas are immediate. From a purely mathematical viewpoint, an MRI scanner is a device that selectively records frequency information about an object. By selecting a subset of frequencies at random, we essentially


obtain the physical equivalent of a partial Fourier operator. Our discussion in Section 4.4 suggested that CS works just as well when the matrix Φ is drawn from the partial Fourier ensemble. Therefore, we expect that by applying CS in this scenario, we shall be able to reconstruct an object from partial MRI data. As MRI often deals with enormous volumes of data, acquisition time is a key factor in the success of the technique, so the ability to reconstruct a faithful image from partial frequency information is of great benefit. This natural application of compressed sensing has excited several research groups [9, 55, 56], and in fact originally motivated the development of compressed sensing ideas [10]. In this section we suggest using previously discussed algorithms, namely Lars and Stomp, to rapidly obtain a faithful reconstruction of a 3-D object from partial MR data.

Here is the experimental setup. Assume the object is represented as a 3-D matrix, which we treat as an array of 2-D slices. Let x_j denote one such slice, column-stacked into a vector of length n. We assume that each slice x_j is compressible in a wavelet basis, i.e., α_j = W x_j has most of its energy concentrated in a small number of coefficients, with W a 2-D wavelet analysis operator (a reasonable assumption for the type of imagery used in MRI). At the sampling stage, we collect data y_j = Φ x_j, where Φ is a d × n matrix drawn from the partial Fourier ensemble. In words, we sample d frequencies of x_j at random. Note that in practice this frequency sampling is done physically by the MRI scanner; its operation is simulated here by applying the partial Fourier operator Φ. At the reconstruction stage, we solve

$$(\widetilde{CS}_1) \qquad \min \|\alpha\|_1 \quad \text{subject to} \quad y_j = \Phi W^T \alpha,$$

a variant of the compressed sensing problem (CS_1). Call the solution α̂_j. In words, we look for an image x̂_j whose wavelet expansion α̂_j has the smallest ℓ1 norm among all candidates x_j that satisfy the relation y_j = Φ x_j. In writing $(\widetilde{CS}_1)$ we used the fact that W is an orthogonal operator, whence x_j = W^T α_j. We repeat this procedure for each slice x_j until the entire 3-D volume is reconstructed.

In our simulations, we used real data provided to us courtesy of Neal Bangerter of the MRSRL Lab at Stanford University [3]. It is a 3-D volume depicting a human leg, containing 240 slices of dimensions 128 × 128. Figure 4.21, panels (a), (d) and (g), show X- and Y-axis projections of the data along with one horizontal slice. In this example, we sampled 50% of the data at random, i.e., d = n/2 with n = 16,384. Hence, each sub-problem $(\widetilde{CS}_1)$ has dimensions 8,192 by 16,384. Clearly, such large dimensions imply that full-blown ℓ1 minimization would be too costly to apply, and so we must resort to more economical alternatives.

We consider two approaches to approximately solve $(\widetilde{CS}_1)$. First, we employ Lars with a fixed computational budget, i.e., we limit the number of Lars steps to a prescribed budget; in our simulations, we used 1200 iterations of Lars per slice. Clearly, since we apply Lars rather than Homotopy, and since we do not follow the entire solution path, we will not obtain the minimum-ℓ1 solution to $(\widetilde{CS}_1)$. However, if each slice is sufficiently compressible in the wavelet basis, 1200 wavelet coefficients may yield a good-quality reconstruction. The second approach we investigate is to apply Stomp to approximately solve $(\widetilde{CS}_1)$; since we are dealing with a large volume of data, Stomp seems a natural choice.

The results are depicted in the two rightmost columns of Figure 4.21. Comparing the reconstruction results, we conclude that, visually at least, Stomp produced a more faithful reconstruction than Lars. More importantly, Stomp produced a complete 3-D reconstruction in a fraction of the time it took the iteration-constrained Lars: Stomp took a little over 5 hours to obtain a solution (executed on a 3 GHz desktop computer), or about 80 seconds per 2-D slice, whereas Lars needed on average 10 minutes to reconstruct each 2-D slice, for a total of 39.3 hours. As we have remarked before, Stomp is better suited than stepwise algorithms such as Lars when the data dimensions are large; this example illustrates the point clearly.
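To make the per-slice procedure concrete, the following sketch simulates partial Fourier sampling of a single 1-D signal that is compressible in a Haar wavelet basis, and reconstructs it with an iterative soft-thresholding (ISTA) loop standing in for the Lars and Stomp solvers used in our experiments. The Haar construction, the test signal, and the regularization parameter are illustrative assumptions, not part of the original experiment; this is a minimal sketch, not the SparseLab code.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar analysis matrix W (n a power of 2); rows are wavelets."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                    # scaling (coarse) part
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # detail part
    return np.vstack([top, bottom]) / np.sqrt(2.0)

rng = np.random.default_rng(0)
n, d = 256, 128                                     # sample 50% of the frequencies

# Piecewise-constant test 'slice': compressible in the Haar basis (illustrative).
x0 = np.zeros(n)
x0[40:90], x0[150:160] = 1.0, -2.0

W = haar_matrix(n)                                  # alpha = W x,  x = W^T alpha
F = np.fft.fft(np.eye(n)) / np.sqrt(n)              # unitary DFT matrix
Phi = F[rng.choice(n, size=d, replace=False), :]    # partial Fourier operator
y = Phi @ x0                                        # 'scanner' output: d random frequencies

# ISTA for min 0.5*||Phi W^T alpha - y||^2 + lam*||alpha||_1, a Lagrangian
# stand-in for (CS~1); step size 1 is valid because Phi W^T has spectral norm 1.
A = Phi @ W.T
lam = 0.01 * np.max(np.abs(A.conj().T @ y))
alpha = np.zeros(n, dtype=complex)
for _ in range(300):
    z = alpha - A.conj().T @ (A @ alpha - y)
    alpha = z * np.maximum(1.0 - lam / np.maximum(np.abs(z), 1e-12), 0.0)

x_hat = np.real(W.T @ alpha)
print("relative error:", np.linalg.norm(x_hat - x0) / np.linalg.norm(x0))
```

In the actual experiments, W is a 2-D wavelet analysis operator and the role of Φ is played by the scanner itself; the sketch only mimics the mathematical structure of one sub-problem.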

4.11 Discussion

In this chapter, we have reviewed the basic Compressed Sensing framework, in which we gather d pieces of information about an object that nominally has n degrees of freedom, with d < n. When the object is compressible, then by measuring d essentially 'random' linear functionals of the object and reconstructing by ℓ1 minimization on the transform coefficients, we get, in theory, a faithful reconstruction.


Figure 4.21: CS reconstruction of 3-D MRI data: Left column: X- and Y-axis projections of the original data, alongside one vertical slice; Middle column: corresponding reconstructions with Lars; Right column: corresponding reconstructions with Stomp.


Through a series of experiments and simulation studies, we demonstrated that the performance of compressed sensing observed in practice is in accord with theoretical predictions. In particular, we studied the constant hiding in the error bound (4.2) and demonstrated that in practice it is small even for modestly sized problems, with n in the low thousands and d in the few hundreds. We also presented simulation studies suggesting that the CS framework works well when used in conjunction with other matrix ensembles, namely the RSE, PFE, and PHE.

Our main effort in this chapter has been to go beyond these initial observations by considering several extensions of the CS framework. We suggested the use of post-processing noise removal via translation-invariant de-noising with a level-dependent threshold, in order to defeat the appearance of visual noise in CS reconstructions. More significantly, we considered the use of CS not as the only way to gather information about an object, but as a tool to be used in conjunction with classical linear sampling at coarse scales. We considered a hybrid method, already discussed in [21] from a theoretical perspective, in which very coarse-scale linear sampling was combined with CS applied at all finer scales. We also introduced a fully multiscale method in which CS was applied at each scale separately from all others, and linear sampling only at the coarser scale. Both approaches can significantly outperform linear sampling, giving comparable reconstruction accuracy with far fewer samples in our examples. This tends to support the theoretical claim in [21] that compressed sampling, deployed in a hybrid fashion, can, for certain kinds of objects, produce an order-of-magnitude improvement over classical linear sampling.

Compressed sensing thus seems a promising strategy for sampling signals characterized by large numbers of nominal samples yet exhibiting high compressibility under known transforms such as the wavelet or Fourier transform. We illustrated this with a natural application of the CS framework to fast acquisition of MRI data, demonstrating that we can achieve a faithful reconstruction using only half the number of measurements needed by standard methods.

Chapter 5

Error Correction in a Mixed Interference Channel

In this chapter, we apply ideas and algorithms developed in earlier chapters to a classical problem in communications: designing codes that can correct errors in a noisy communication environment. We consider a digital channel model corrupted by two types of interference: additive white Gaussian noise and additive impulse noise. We propose a class of 'random' codes for robust transmission over this communication channel. Our proposed scheme employs ℓ1 minimization at the decoding stage. We present simulation results demonstrating that, for certain channel configurations, the rate at which we can reliably transmit information with a negligible probability of error using the proposed codes is comparable to the fundamental limits set by Shannon's capacity. Parts of the material in this chapter appeared in the manuscripts [29, 30], which have been submitted to IEEE Transactions on Information Theory.

5.1 Introduction

Error correction and detection have great practical importance in maintaining information integrity across noisy channels. Virtually every digital communication system employs error-control coding as an integral part of its operation. Recent advances in digital communications and the proliferation of digital multimedia content have led to a phenomenal growth in data volumes, and with it a driving need for reliable codes and fast decoding schemes.


The literature in IEEE Transactions on Information Theory on turbo codes and LDPC codes is too voluminous to survey here. Recently, Candès et al. [12, 11] and Rudelson et al. [67] pointed out that sparse approximation ideas might have a role to play in error-control coding. In fact, the idea of coupling sparse representations and error correction was explored in earlier work by Donoho and Huo [27]. These authors considered a scenario in which a vector of real numbers is transmitted over a channel that corrupts a fraction of the entries at random. They showed that, as long as the corrupted sites are sufficiently sparse, ℓ1 minimization can recover the transmitted information with very low probability of error. The ideas in [12, 67] are of great theoretical interest, and they inspired us to apply sparse modeling, coupled with the algorithms discussed in earlier chapters, to a practical digital communication scenario.

The communication system we consider in this chapter is depicted in Figure 5.1. The setting is a typical digital communication scenario: we wish to transmit digital information, represented as a series of bits, over a noisy communication channel. The channel is a discrete memoryless channel (DMC) corrupted by a mixture of two interferences. The first interference is modeled as additive white Gaussian noise U with variance σ_U². The other interference is an additive impulse noise, which we model as a random variable V with a prescribed probability density p_V(v), multiplied by a Bernoulli random variable B with success probability q. Thus, with probability 1 − q the channel is simply a Gaussian channel, and with probability q the channel exhibits a mixture of the two interferences characterized by U and V. Formally, the conditional probability density of the channel is given by

$$p_{Y|X}(y \mid x) = (1-q)\,p_U(y-x) + q\,(p_U \star p_V)(y-x), \qquad (5.1)$$

where $p_U(\cdot)$ is the normal density with variance σ_U² and ⋆ denotes convolution.

Figure 5.1: A mixed Gaussian/impulse noise channel model. The input X is corrupted by additive Gaussian noise U ∼ N(0, σ_U²) and by the impulse term B·V, with B ∼ Bernoulli(q) and V ∼ P(V), producing the channel output Y.
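As a concrete illustration of the channel model (5.1), the short sketch below draws samples from the mixed interference. The Gaussian impulse distribution V ∼ N(0, 10) matches the simulations reported later in this chapter, while the remaining parameter values are only illustrative.

```python
import numpy as np

def mixed_channel(x, sigma_u=0.2, q=0.2, impulse_var=10.0, rng=None):
    """Pass the codeword x through the mixed Gaussian/impulse channel of (5.1).

    Each entry is hit by Gaussian noise U ~ N(0, sigma_u^2); with probability q
    an additional impulse V (here Gaussian with variance impulse_var) is added.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = sigma_u * rng.standard_normal(x.shape)                # constant interference U
    b = rng.random(x.shape) < q                               # Bernoulli(q) indicator B
    v = np.sqrt(impulse_var) * rng.standard_normal(x.shape)   # impulse amplitude V
    return x + u + b * v

# Example: a block of n = 512 channel uses.
rng = np.random.default_rng(1)
y = mixed_channel(np.zeros(512), rng=rng)
```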

5.2 Encoding

The encoding process begins by representing the information bits to be transmitted as an antipodal sequence θ of length m, with entries ±β. To encode θ, we construct a random code of rate 0 < R < 1 in the following manner. Generate a 'random' orthogonal matrix Q of size n × n, with n = ⌊m/R⌋. Partition Q = [G  H^T], so that G is the n × m matrix consisting of the first m columns of Q, and H is the (n − m) × n matrix constructed from the remaining columns. With G and H in hand, the encoding stage of our proposed coding scheme amounts to computing x = Gθ, with x the coded message of length n. The rate R indicates the number of information bits transmitted per channel use; since R < 1, we have n > m, and some redundancy is introduced in the encoding process.

The setup just described is reminiscent of classical syndrome coding. Indeed, we may think of G as the 'generator matrix' of the code, and of H as the 'parity-check matrix'. In particular, the following three properties of G and H are important in our coding scheme:

1. G^T x = θ. In words, we can recover the information bits from the coded message by applying the adjoint operator G^T.


2. HG = 0, so that Hx = 0 for every codeword x = Gθ. This is an immediate implication of the orthogonality of Q: the rows of H are orthogonal to the columns of G.

3. H ∈ URPE. Indeed, H^T is constructed by sampling a subset of the columns of an orthogonal matrix.

The computational effort in encoding the information bits amounts to a single application of an n by m matrix, at a cost of mn flops. In fact, the encoding process can be even more economical. Recall from our discussion in previous chapters that matrices from the URPE exhibit properties similar to those of partial Hadamard and partial Fourier matrices. Thus, by choosing Q to be an n by n Hadamard matrix, we may use a Fast Hadamard Transform (FHT) to encode θ without ever forming Q explicitly, lowering the cost of the encoding stage to O(n log n) flops.
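The construction is straightforward to prototype. The sketch below is a minimal illustration in which the 'random' orthogonal matrix is drawn via a QR factorization of a Gaussian matrix; it is not the FHT-based implementation mentioned above, and the specific dimensions are only examples.

```python
import numpy as np

rng = np.random.default_rng(0)
m, R = 256, 0.5
n = int(np.floor(m / R))                 # code length n = floor(m/R) > m

# A 'random' orthogonal matrix Q via QR factorization of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = Q[:, :m]                             # n x m 'generator' (first m columns)
H = Q[:, m:].T                           # (n-m) x n 'parity check' (remaining columns, transposed)

beta = 1.0
theta = beta * rng.choice([-1.0, 1.0], size=m)   # antipodal information sequence
x = G @ theta                            # coded message of length n

# The two algebraic properties used by the decoder:
assert np.allclose(G.T @ x, theta)               # G^T x = theta
assert np.allclose(H @ G, np.zeros((n - m, m)))  # rows of H orthogonal to columns of G
```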

5.3 Decoding, σ_U = 0

Our initial description of the decoding process assumes that σ_U = 0; equivalently, the channel is corrupted solely by the impulsive interference. In this case, it is not hard to see that the communication limit set by the capacity of the channel is unbounded: as long as q < 1, there are time slots in which we may transmit any finite number of bits and have them received error-free at the decoder. This setting is therefore not of much practical interest, but it simplifies the description of the decoding process, and, as we show below, the coding process changes little when the Gaussian component of the noise is reintroduced.

With σ_U = 0, at the decoder we receive Y = X + W, where W = B · V is the impulse noise. Hence, in a given transmitted block of n entries, a fraction q, on average, is corrupted, while the remaining entries are completely error-free. If q is small, we may equivalently think of the channel as adding a sparse error pattern w to the transmitted vector x, and thus apply sparse approximation ideas to recover the transmitted information. Specifically, at the decoding stage we solve the ℓ1 minimization problem

$$(EC_1) \qquad \min \|\alpha\|_1 \quad \text{subject to} \quad H\alpha = Hy.$$

Call the solution α̂. Notice that the right-hand side of the linear constraint in (EC_1) reads Hy = H(Gθ + w) = Hw, since the rows of H are orthogonal to the columns of G. Thus, (EC_1) looks for the vector α of least ℓ1 norm among all vectors satisfying Hα = Hw; in other words, (EC_1) solves for the sparse error pattern. With an estimate α̂ of the noise pattern, we may recover the transmitted information by computing

$$\hat{\theta} = \beta \cdot \mathrm{sgn}\big(G^T(y - \hat{\alpha})\big). \qquad (5.2)$$
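The following fragment simulates the impulse-only channel and decodes by posing (EC_1) as a linear program; scipy's generic LP solver stands in for the Homotopy algorithm used in our experiments, and the parameters match the example that follows. This is a minimal sketch, not the implementation used to produce the results in this chapter.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, beta = 256, 512, 1.0                        # rate R = 1/2, as in the example below
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
G, H = Q[:, :m], Q[:, m:].T

theta = beta * rng.choice([-1.0, 1.0], size=m)
x = G @ theta

q, impulse_var = 0.2, 10.0                        # impulse-only channel (sigma_U = 0)
w = (rng.random(n) < q) * (np.sqrt(impulse_var) * rng.standard_normal(n))
y = x + w

# (EC_1): min ||alpha||_1 s.t. H alpha = H y, posed as an LP with alpha = a - b, a, b >= 0.
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([H, -H]), b_eq=H @ y,
              bounds=[(0, None)] * (2 * n), method="highs")
alpha_hat = res.x[:n] - res.x[n:]

theta_hat = beta * np.sign(G.T @ (y - alpha_hat))  # decode via equation (5.2)
print("bit error rate:", np.mean(theta_hat != theta))
```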

For an example of this coding scheme at work, consider Figure 5.2. Panel (a) of the figure displays an antipodal sequence θ of length m = 256 that we wish to transmit over the channel. A realization of the impulse noise corrupting the communication system is shown in panel (b), with q = 0.2, so that on average 20% of the entries passing through the channel are corrupted. We choose a code rate R = 1/2, meaning that for each information bit we transmit two numbers, for a total of n = 512 channel uses. To encode, we generate matrices G and H as described above and compute x = Gθ. The encoded sequence is displayed in panel (c) of Figure 5.2, and the channel output y is displayed in panel (d). Notice that at the noisy sites the impulse noise power overwhelms the signal power. To recover the transmitted information, we solve (EC_1) with the Homotopy algorithm and obtain an estimate θ̂ as in (5.2). The recovered sequence θ̂ is plotted overlaid on the transmitted θ in panel (e) of Figure 5.2. The measured Bit Error Rate (BER) in this instance was 0, meaning that ℓ1 minimization perfectly recovered the impulse noise pattern.

We investigated the performance of this coding scheme over a wider range of channel parameters. We fixed the block length m = 256 as before, and the code rate R = 1/4. For a range of probabilities q, we generated multiple independent realizations of the impulse noise w and recorded the average BER at the output of the decoder. The results of this simulation are summarized in Figure 5.3. The left panel displays the average BER versus the impulse probability q when the impulse noise V follows a zero-mean Gaussian distribution with variance 10.


Figure 5.2: Example of the ℓ1-minimization-based coding scheme when σ_U = 0 and V ∼ N(0, 10). (a) Information bits encoded as a sign train θ, m = 256; (b) Impulse noise pattern w, q = 0.2; (c) Coded message x, at rate R = 1/2; (d) Channel output y; (e) Recovered bits θ̂ (BER = 0) overlaid on the original message θ. The decoding stage resulted in perfect reconstruction.


Figure 5.3: Performance of the coding scheme (average BER versus impulse probability q) when σ_U = 0 and V ∼ N(0, 10) (left panel) or V ∼ U(0, 10) (right panel). The dashed vertical line delineates the range of sparsities q in which Homotopy recovered the transmitted signal in k steps, where k is the number of corrupted sites.

The right panel has a similar configuration, with V distributed uniformly with mean zero and variance 10. The results indicate that ℓ1 minimization was able to recover the transmitted message with negligible BER when up to about 70% of the channel entries were corrupted by impulse noise. Moreover, as long as q < 0.2, solving (EC_1) with the Homotopy method led to a solution in k steps, where k is the number of corrupted sites. In other words, for a range of impulse probabilities, Homotopy recovered the transmitted information accurately and efficiently, using O(k³ + kn²) flops.

5.4 Decoding, General Case

We now consider the case where both components of the noise are present. Thus, at the channel output we receive y = x + u + w, where u is a white Gaussian noise process with variance σ_U². At the encoding stage, we compute x = Gθ as before. Note that in the presence of the Gaussian interference we have Hy = H(Gθ + u + w) = Hw + Hu, since the noise pattern is a superposition of a sparse component and a Gaussian component. Thus, at the decoding stage we solve

$$(EC_{1,\epsilon}) \qquad \min \|\alpha\|_1 \quad \text{subject to} \quad \|H\alpha - Hy\|_2 \le \epsilon.$$

In words, we look for a sparse representation of the impulse pattern, allowing for some noise in the representation, with noise power bounded by ε. Work by Donoho et al. [26, 23] showed that ℓ1 minimization is still effective in recovering a sparse representation from noisy measurements when (EC_1) is replaced with (EC_{1,ε}). We have encountered (EC_{1,ε}) in a slightly different form before: it is equivalent to the Lasso problem (D_λ) introduced in Chapter 2. Recall that to solve (EC_{1,ε}) we may simply apply the Homotopy algorithm, tracing the optimal path and stopping when the residual norm falls below ε. Since we know the variance of the noise process u, we can compute an estimate of the Gaussian noise power and set ε accordingly (because H has orthonormal rows, Hu is Gaussian with covariance σ_U² I_{n−m}):

$$\epsilon = E\big(\|Hu\|_2\big) = \sqrt{n-m}\;\sigma_U.$$
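The ε-based stopping rule is easy to emulate. The sketch below sets ε as above and then runs a simple iterative soft-thresholding solver over a decreasing sequence of λ values, stopping as soon as the residual norm drops below ε. This continuation loop is only a stand-in for the Homotopy path-following used in the text; the solver, the λ schedule, and the channel parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, beta, sigma_u, q = 256, 512, 1.0, 0.2, 0.1
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
G, H = Q[:, :m], Q[:, m:].T

theta = beta * rng.choice([-1.0, 1.0], size=m)
x = G @ theta
u = sigma_u * rng.standard_normal(n)
w = (rng.random(n) < q) * (np.sqrt(10.0) * rng.standard_normal(n))
y = x + u + w

eps = np.sqrt(n - m) * sigma_u                    # noise-power estimate for (EC_{1,eps})
b = H @ y

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

alpha = np.zeros(n)
for lam in np.geomspace(np.max(np.abs(H.T @ b)), 1e-3, 30):   # decreasing lambda schedule
    for _ in range(50):                           # soft-thresholding steps at the current lambda
        alpha = soft(alpha - H.T @ (H @ alpha - b), lam)       # step size 1: ||H||_2 = 1
    if np.linalg.norm(H @ alpha - b) <= eps:      # stop once the residual fits the noise level
        break

theta_hat = beta * np.sign(G.T @ (y - alpha))
print("bit error rate:", np.mean(theta_hat != theta))
```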

To evaluate the performance of the coding scheme, we computed the achievable rates of the code for various configurations of the channel parameters. In detail, we fixed the block length at m = 256, and for each instance of q and σ_U we simulated the coding scheme at different code rates R, recording the BER at each rate, averaged over 1000 independent trials. The achievable rate was then selected as the highest rate R that achieved an average BER below a threshold, set at 10^{-4}. To benchmark the performance of the code, we compared the achievable code rates with the fundamental communication limits set by Shannon's information theory. Since an analytical expression for the capacity is not readily available for this mixed channel model, we computed analytical lower and upper bounds on the channel capacity at each configuration.

The results of our simulations are summarized in Figure 5.4, for V Gaussian distributed with mean zero and variance 10, and in Figure 5.5, for V uniformly distributed with mean zero and variance 10. Each plot shows the rate as a function of a channel parameter (q or σ_U), with the measured BER displayed under each achievable rate.


Figure 5.4: Performance of the coding scheme with V ∼ N(0, 10), for different configurations of σ_U and q (panels: q = 0.2, q = 0.4, σ_U = 0.2, σ_U = 0.4). Overlaid on each plot are the upper bound (dashed) and lower bound (dash-dotted) on the channel capacity.

For reference, upper and lower bounds on the capacity of the channel are superposed on each panel. Inspection of the results indicates that the coding scheme achieves rates comparable with the channel capacity for certain channel configurations. In particular, when the Gaussian noise variance is not too high, our ℓ1 coding scheme achieves rates quite close to the fundamental communication limit. We further notice that, in general, the coding scheme performs better when the impulse noise follows a Gaussian distribution than when it has a uniform distribution.


Figure 5.5: Performance of the coding scheme with V ∼ U(0, 10), for different configurations of σ_U and q (panels: q = 0.2, q = 0.4, σ_U = 0.2, σ_U = 0.4). Overlaid on each plot are the upper bound (dashed) and lower bound (dash-dotted) on the channel capacity.

5.5 Discussion

In this chapter, we introduced a coding scheme for robust information transmission over a noisy channel contaminated by a mixture of Gaussian noise and impulse noise. The encoding stage amounts to an application of a random coding matrix to the information bits. At the decoding stage, an ℓ1 minimization problem is solved via the Homotopy method to estimate the sparse impulse noise and recover the transmitted information. We conducted extensive simulations of the proposed coding scheme and demonstrated that, in certain cases, the rate at which we can reliably transmit information over the channel with a negligible probability of error is comparable to the fundamental limits set by digital communication theory. We conclude this chapter with the following remarks.

• The coding scheme described offers great flexibility in choosing a block length m and a desired rate R. Since the encoding matrix is generated by partitioning a random orthogonal matrix, we can instantly generate a code for any block length m and any rate R < 1.

• The computational requirements of the coding scheme are quite modest. The complexity of encoding amounts to the cost of one matrix multiplication, and the cost of decoding is given by the favorable complexity estimates of the Homotopy algorithm; see Section 2.3. In fact, as we mentioned above, both the encoding and decoding complexity can be significantly reduced by replacing the random orthogonal matrix with a Hadamard operator and using a Fast Hadamard Transform (FHT) to implement the matrix operations at the encoder and decoder. In addition, under a stringent computation budget we may consider using Stomp to approximately solve (EC_{1,ε}) instead of Homotopy. While we have not investigated the use of Stomp in this coding context, the analysis presented in Chapter 3 indicates that Stomp would offer significant computational savings, albeit at the price of a slight reduction in accuracy.

• The experimental results presented in this chapter indicate that the coding scheme exhibits good performance already at modest block lengths. In contrast, current state-of-the-art coding techniques such as turbo codes or LDPC codes require significantly longer block lengths to attain capacity-achieving information rates, making them harder to use in practice.


• The description of the coding scheme above assumed that the constant interference on the channel is Gaussian distributed, and our simulations considered two different distributions of the impulse component. We remark, however, that the proposed coding scheme would work well for other channel configurations; it is suitable in any scenario featuring a mixture of a constant interference bounded in power and an impulsive interference occurring at a sparse subset of time slots.

• At the encoding stage, we used an antipodal sequence to represent the information bits to be transmitted; that is, each bit was represented as a coefficient of fixed amplitude whose sign corresponds to its binary value. Other representations can be used in conjunction with the proposed coding scheme and may lead to increased code performance. For instance, we can represent pairs of bits as a single element by allowing coefficients to take one of four different values instead of the two values currently used. While we have not explored this approach here, it may prove a promising direction for future research.

Appendix A

Reproducible Research with SparseLab

The growth of the Internet during the past few years has led to the emergence of so-called open source software. The term "open source" commonly refers to a software program or set of software technologies that are made widely available by an individual or group in source code form for use, modification, and redistribution. Within the research community, a parallel trend, reproducible research, has become increasingly popular, as more and more researchers in the computational sciences augment research papers with freely distributed software to reproduce their results [7].

In that spirit, the research work described in this thesis is reproducible with the SparseLab toolbox for Matlab [73]. SparseLab is a software library for sparse approximation, developed at Stanford University, containing over 500 Matlab files, including solvers, utilities, data sets, examples, and demonstration scripts. SparseLab has been used by the author to create the figures and tables appearing in this thesis, and the toolbox contains scripts that will reproduce all the calculations in this thesis and the accompanying papers [83, 30, 29].

We will not give a detailed description of the structure of SparseLab here. We invite the interested reader to download the toolbox, available freely on the web [73], or to consult the supplementary documentation [72]. We do, however, wish to point out a few features of the toolbox in connection with the research work presented here.

A.1 Solvers

All of the algorithms for finding sparse solutions to underdetermined systems of linear equations discussed in this thesis were implemented in SparseLab, and reside in the subdirectory Solvers of the main directory tree. Most of the available solvers take data (A, y) and a set of parameters, returning a sparse solution to the underdetermined system Ax = y. The matrix A can be represented explicitly as a dense matrix, or can be passed implicitly via a linear operator that computes, for a given vector v, the products Av or A^T v.

The routine SolveLasso implements the Homotopy algorithm and its variant Lars, analyzed in Chapter 2. Our implementation follows the derivation in [32], and can also be applied to solve (P_1) with the solution constrained to be non-negative. The routine accommodates noise in the model by allowing termination at any point on the Homotopy path, through a condition on the ℓ2 norm of the residual, or by setting an explicit value for the Lagrange multiplier λ in (D_λ). Further, a user may opt to record the entire solution path of the algorithm by setting a parameter. At the heart of the implementation, SolveLasso maintains the R factor of a QR factorization of the active-set matrix A_I. This allows for fast updating and downdating of the solution estimate with every modification of the active set. We note that Efron and Hastie developed a similar implementation for the R / S-Plus environment [31].

In addition, the library includes an implementation of ℓ1 minimization with Pdco, an interior-point method developed by Michael Saunders at the Systems Optimization Lab at Stanford [69]. Pdco can be configured to solve either (P_1) or its noise-aware variant (D_λ), and again, it allows A to be an explicit sparse matrix or a fast operator. Also included are implementations of Matching Pursuit (SolveMP), Omp (SolveOMP) and Pfp (SolvePFP). The approximate scheme Stomp is implemented in the routine SolveStOMP. At the core of our implementation, we used LSQR, a fast iterative method for linear least squares [64], to solve the least-squares problem at each iteration of Stomp. The current implementation (used to generate the simulation results appearing in this thesis) sets the number of LSQR iterations at 6.

A.2 Figures

All of the figures appearing in this thesis and the accompanying papers [83, 30, 29] can be reproduced using SparseLab. Scripts for the figures in Chapters 2 and 5 are available in the subdirectory Papers/FastL1. The figures appearing in Chapter 3 are reproduced with scripts residing in the directory Papers/StOMP. The figures in Chapter 4 were generated with scripts available in the directory Papers/ExtCS. We note that the 3-D MRI data used in Section 4.10 is the property of the Magnetic Resonance Systems Research Laboratory (MRSRL) at Stanford, and was not included in the distribution.

A.3 Data Sets

The empirical analysis framework in Chapters 2 and 3 used phase diagrams and phase transitions derived from extensive simulation data to investigate the performance of the different algorithms discussed. We attempted to include all the simulation data leading to our results in a separate data supplement. In cases where the data was too voluminous, we opted for a more succinct description due to space restrictions. In addition, SparseLab includes routines to compute empirical phase transitions from simulation data, using the logistic regression model introduced in Section 2.5. See the SparseLab documentation for more information [72].

Bibliography

[1] F. Abramovich, Y. Benjamini, D.L. Donoho, and I.M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Annals of Statistics, 34(2):584–653, 2006.

[2] Waheed Bajwa, Jarvis Haupt, Akbar Sayeed, and Rob Nowak. Compressive wireless sensing. In Proc. Int. Conf. on Information Processing in Sensor Networks (IPSN), Nashville, TN, April 2006.

[3] N.K. Bangerter, B.A. Hargreaves, J.H. Brittain, B. Hu, S.S. Vasanawala, and D.G. Nishimura. 3D fluid-suppressed T2-prep flow-independent angiography using balanced SSFP. In Proc. of the 12th ISMRM, page 11, 2004.

[4] Michel Berkelaar, Kjell Eikland, and Peter Notebaert. lp_solve: (mixed-integer) linear programming system. http://lpsolve.sourceforge.net.

[5] P. Bofill and M. Zibulevsky. Underdetermined blind source separation using sparse representations. Signal Processing, 81(11):2353–2362, 2001.

[6] William L. Briggs and Van Emden Henson. The DFT: An Owner's Manual for the Discrete Fourier Transform. SIAM, Philadelphia, 1995.

[7] Jon Buckheit and David L. Donoho. Wavelab and reproducible research. Wavelets and Statistics, 1995.

[8] Emmanuel J. Candès and David L. Donoho. New tight frames of curvelets and optimal representations of objects with piecewise C² singularities. Comm. Pure Appl. Math., 57:219–266, 2004.


[9] Emmanuel J. Cand`es and Justin Romberg. Practical signal recovery from random projections. In Wavelet Applications in Signal and Image Processing XI, Proc. SPIE Conf. 5914, 2005. [10] Emmanuel J. Cand`es, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, February 2006. [11] Emmanuel J. Cand`es, Mark Rudelson, Terence Tao, and Roman Vershynin. Error correction via linear programming. In Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 295–308, 2005. [12] Emmanuel J. Cand`es and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, December 2005. [13] Emmanuel J. Cand`es and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. Manuscript, 2006. [14] Emmanuel J. Cand`es and Terence Tao. Stable signal recovery from incomplete and inaccurate observations. Comm. Pure Appl. Math., 59(8):1207–1223, August 2006. [15] Scott S. Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci Comp., 20(1):33–61, 1999. [16] Ronald R. Coifman and David L. Donoho. Translation-invariant de-noising. Wavelets and Statistics, pages 125–150, 1995. [17] Ronald R. Coifman, Yves Meyer, Stephen R. Quake, and Mladen V. Wickerhauser. Signal processing and compression with wavelet packets. Wavelets and Their Applications, 1992. [18] Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math., 57(11):1413–1541, August 2004.


[19] Geoff Davis, St`ephane Mallat, and Marco Avellaneda. Adaptive greedy approximations. J. Constructive Approximation, 13:57–98, 1997. [20] David L. Donoho. Neighborly polytopes and sparse solution of underdetermined linear equations. Technical report, Dept. of Statistics, Stanford University, 2005. [21] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006. [22] David L. Donoho. For most large underdetermined systems of linear equations, the minimal `1 solution is also the sparsest solution. Comm. Pure Appl. Math., 59(7):907–934, July 2006. [23] David L. Donoho. For most underdetermined systems of linear equations, the minimal `1 -norm near-solution approximates the sparsest near-solution. Comm. Pure Appl. Math., 59(6):797–829, June 2006. [24] David L. Donoho. High-dimensional centrosymmetric polytopes with neighborliness proportional to dimension. Discrete and Computational Geometry, 35(4):617–652, May 2006. [25] David L. Donoho and Michael Elad. Optimally sparse representation from overcomplete dictionaries via `1 norm minimization. Proc. Natl. Acad. Sci. USA, 100(5):2197–2002, March 2003. [26] David L. Donoho, Michael Elad, and Vladimir N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, January 2006. [27] David L. Donoho and Xiaoming Huo. Uncertainty principles and ideal atomic decomposition. IEEE Trans. Info. Theory, 47(7):2845–2862, 2001. [28] David L. Donoho and Jared Tanner. Counting faces of randomly-projected polytopes when the projection radically lowers dimension, 2006.


[29] David L. Donoho and Yaakov Tsaig. Fast solution of ℓ1 minimization problems when the solution may be sparse. Submitted, 2006.

[30] David L. Donoho, Yaakov Tsaig, Iddo Drori, and Jean-Luc Starck. Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit. Submitted, 2006.

[31] Bradley Efron and Trevor Hastie. Lars software web site. http://www-stat.stanford.edu/~hastie/Papers/LARS/.

[32] Bradley Efron, Trevor Hastie, Iain M. Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

[33] M. Elad, B. Matalon, and M. Zibulevsky. Coordinate and subspace optimization methods for linear least squares with non-quadratic regularization. Appl. Comp. Harm. Anal., to appear.

[34] M. Elad, J.-L. Starck, P. Querre, and D. L. Donoho. Simultaneous cartoon and texture image inpainting using morphological component analysis (MCA). Appl. Comp. Harm. Anal., 19:340–358, November 2005.

[35] Michael Elad. Why simple shrinkage is still relevant for redundant representations. IEEE Transactions on Information Theory, 52(12):5559–5569, December 2006.

[36] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, December 2006.

[37] Michael Elad and Alfred M. Bruckstein. A generalized uncertainty principle and sparse representations in pairs of bases. IEEE Transactions on Information Theory, 48(9):2558–2567, September 2002.

[38] J. H. Friedman and W. Stuetzle. Projection pursuit regressions. J. Amer. Statist. Soc., 76:817–823, 1981.


[39] Jean J. Fuchs. On sparse representations in arbitrary redundant bases. IEEE Transactions on Information Theory, 50(6):1341–1344, June 2004. [40] A. Y. Garnaev and E. D. Gluskin. On widths of the Euclidean ball. Soviet Mathematics – Doklady, 30:200–203, 1984. (In English). [41] I. Gent and T. Walsh. The SAT phase transition. In Proc. ECAI-94, pages 105–109, 1994. [42] Anna C. Gilbert, S. Guha, P. Indyk, S. Muthukrishnan, and M. Strauss. Nearoptimal sparse fourier representations via sampling. In Proc. 34th ACM Symposium on Theory of Computing, pages 152–161. ACM Press, 2002. [43] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996. [44] R´emi Gribonval and Morten Nielsen. Sparse representations in unions of bases. IEEE Trans. Info. Theory, 49(12):1320–1325, 2003. [45] R´emi Gribonval and Morten Nielsen. Highly sparse representations from dictionaries are unique and independent of the sparseness measure. Applied and Computational Harmonic Analysis, 22(3):335–355, 2007. [46] M. Harwit and N.J.A. Sloane. Hadamard Transform Optics. Academic Press, 1979. [47] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001. [48] K. Herrity, A. C. Gilbert, and J. Tropp. Sparse approximation via iterative thresholding. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, May 2006. [49] ISO/IEC10918-1. Digital compression and coding of continuous-tone still images, part 1, requirements and guidelines, 1991.


[50] ISO/IEC11172. Information technology – coding of moving pictures and associated audio for digital storage media at up to about 1.5 mbit/s., 1993. [51] ISO/IEC13818. Information technology – generic coding of moving pictures and associated audio information., 1994. [52] ISO/IEC15444-1:2004. Information technology – JPEG 2000 image coding system: Core coding system, 2004. [53] Boris S. Kashin. Diameters of certain finite-dimensional sets in classes of smooth functions. Izv. Akad. Nauk SSSR, Ser. Mat., 41(2):334–351, 1977. [54] Sami Kirolos, Jason Laska, Michael Wakin, Marco Duarte, Dror Baron, Tamer Ragheb, Yehia Massoud, and Richard Baraniuk. Analog-to-information conversion via random demodulation. In Proc. IEEE Dallas Circuits and Systems Workshop (DCAS), Dallas, TX, 2006. [55] Michael Lustig, Jin H. Lee, David L. Donoho, and John M. Pauly. Faster imaging with randomly perturbed spirals and `1 reconstruction. In Proc. of the 13th ISMRM, 2004. [56] Michael Lustig, Juan M. Santos, David L. Donoho, and John M. Pauly. Rapid MR angiography with randomly under-sampled 3DFT trajectories and non-linear reconstruction. In Proc. of the 9th SCMR, 2006. [57] Dmitry M. Malioutov, M¨ ujdat C ¸ etin, and Alan S. Willsky. Homotopy continuation for sparse signal representation. In IEEE Int. Conf. Acoustics, Speech and Signal Processing, Philadelphia, PA, volume 5, pages 733–736, March 2005. [58] St`ephane Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999. [59] St`ephane Mallat and Zhifeng Zhang. Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993. [60] D. Mitchell, B. Selman, and H.J. Levesque. Generating hard satisfiability problems. Artificial Intelligence, 81(1), 1996.


[61] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Comput., 24:227–234, 1995.

[62] Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. A new approach to variable selection in least squares problems. IMA J. Numerical Analysis, 20:389–403, 2000.

[63] Michael R. Osborne, Brett Presnell, and Berwin A. Turlach. On the lasso and its dual. J. Computational and Graphical Statistics, 9:319–337, 2000.

[64] Christopher C. Paige and Michael A. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software, 8(1):43–71, 1982.

[65] Mark D. Plumbley. Geometry and homotopy for ℓ1 sparse representations. In Proceedings of SPARS'05, Rennes, France, pages 206–213, November 2005.

[66] Mark D. Plumbley. Recovery of sparse representations by polytope faces pursuit. In Proceedings of the 6th International Conference on Independent Component Analysis and Blind Source Separation (ICA 2006), Charleston, SC, USA, pages 206–213, March 2006.

[67] Mark Rudelson and Roman Vershynin. Geometric approach to error-correcting codes and reconstruction of signals. International Mathematical Research Notices, 64:4019–4041, 2005.

[68] Sylvain Sardy, Andrew G. Bruce, and Paul Tseng. Block coordinate relaxation methods for nonparametric wavelet denoising. J. Computational and Graphical Statistics, 9(2):361–379, 2000.

[69] Michael A. Saunders and Byunggyoo Kim. PDCO: Primal-dual interior method for convex objectives. http://www.stanford.edu/group/SOL/software/pdco.html.


[70] Jean-Luc Starck, Michael Elad, and David L. Donoho. Image decomposition: Separation of texture from piecewise smooth content. In SPIE annual meeting, San Diego, CA, December 2003. [71] Jean-Luc Starck, Michael Elad, and David L. Donoho. Redundant multiscale transforms and their application for morphological component analysis. Advances in Imaging and Electron Physics, 132:287–348, 2004. [72] Victoria Stodden, Yaakov Tsaig, and David L. Donoho. Sparselab architecture. Technical report, Dept. of Statistics, Stanford University, 2006. [73] Victoria Stodden, Yaakov Tsaig, Iddo Drori, and David L. Donoho. Sparselab web site. http://sparselab.stanford.edu/. [74] Thomas Strohmer and Robert Heath Jr. Grassmannian frames with applications to coding and communcations. Appl. Comp. Harm. Anal., 14(3):257–275, 2003. [75] Vladimir N. Temlyakov. Greedy algorithms and m-term approximation. J. Approximation Theory, 98(1):117–145, 1999. [76] Vladimir N. Temlyakov. Weak greedy algorithms. Advanced Computational Math, 5:173–187, 2000. [77] Vladimir N. Temlyakov. Nonlinear methods of approximation. Foundations of Computational Math, 3(1):33–107, July 2003. [78] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58(1):267–288, 1996. [79] Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(11):2231–2242, October 2004. [80] Joel A. Tropp. Just relax: Convex programming methods for subset selection and sparse approximation. IEEE Transactions on Information Theory, 51(3):1030– 1051, March 2006.


[81] Joel A. Tropp and Anna C. Gilbert. Signal recovery from partial information by orthogonal matching pursuit. Manuscript, 2005. [82] Yaakov Tsaig and David L. Donoho. Breakdown of equivalence between the minimal `1 -norm solution and the sparsest solution. Signal Processing, 86(3):533– 548, March 2006. [83] Yaakov Tsaig and David L. Donoho. Extensions of compressed sensing. Signal Processing, 86(3):549–571, March 2006. [84] Alexander Vardy. Algorithmic complexity in coding theory and the minimum distance problem. In STOC ’97: Proceedings of the twenty-ninth annual ACM symposium on Theory of Computing, pages 92–109, New York, NY, USA, 1997. ACM Press. [85] S. Verd´ u. Multiuser Detection. Cambridge University Press, 1998. [86] Michael Wakin, Jason Laska, Marco Duarte, Dror Baron, Shriram Sarvotham, Dharmpal Takhar, Kevin Kelly, and Richard Baraniuk. An architecture for compressive imaging. In Proc. Int. Conf. on Image Processing (ICIP), Atlanta, GA, October 2006. [87] J. M. Yeomans. Statistical Mechanics of Phase Transitions. Oxford University Press, 1992.
