
Lecture Notes: An Introduction to the Theory of Statistical Communication

Germain Drolet
Department of Electrical & Computer Engineering, Royal Military College of Canada,
P.O. Box 17000, Station Forces, Kingston, Ontario, CANADA K7K 7B4
Tel: (613) 541-6000, extension: 6192
Fax: (613) 544-8107
Email: [email protected]

Copyright © 2006 by G. Drolet. All rights reserved. Permission is granted to make and distribute verbatim copies of these notes provided the copyright notice and this permission notice are preserved on all copies.

Preface

The main purpose of these notes is to complement and guide a student through the study of Chapters 1, 2, 3, 4 in the textbook J.M. Wozencraft, I.M. Jacobs, Principles of Communications Engineering, Waveland Press Inc., U.S.A., 1965 (reprinted 1990), ISBN 0-88133-554-1. These notes may not be considered a replacement for the textbook since many proofs and discussions are incomplete. Instead we present sketches of the proofs and illustrate them with some diagrams in order to help understand the formal proofs presented in the textbook. The more concise format of the notes may additionally be helpful in quickly locating a concept in the textbook. The textbook is required for a complete understanding of Chapters 1, 2, 3, 4. Some examples and results borrowed from lecture notes by Dr. G.E. Séguin are included in the notes. Specifically, the general approach of Dr. Séguin's notes is used in the presentation of random processes in Chapter 3.

The first objective of the course for which these notes are written is to provide a foundation in the Theory of Probability, Random Variables, Random Processes for the Electrical and Computer Engineer. The second objective of the course is to apply the concepts of probability to the Theory of Detection in communication systems and to present the foundation for Coding Theory.

The notes are organized as follows:
Chapter 2 up to §2.12: covers W&J Chapter 2.
Chapter 2 §2.12 to §2.13: covers specific results on Random Vectors and Gaussian Random Vectors. This material is contained in W&J Chapter 3, but we feel that it is beneficial to present the results before the definition of Random Processes; some of the results are taken from Dr. Séguin's notes.
Chapter 3: follows the approach of Dr. Séguin's notes and covers the remainder of W&J Chapter 3.
Chapter 4: covers W&J Chapter 4.


Chapter 5: presents the foundation of Coding Theory. The material is taken from Dr. Séguin's lecture notes for a course taught in 1984 at the Université Laval, Ste-Foy, Québec, QC, CANADA. Some of the results are contained in W&J Chapter 5. This chapter is self-contained and no references are made to the textbook by W&J.

Suggestions and comments to improve the notes are welcome.

August 2006 Germain Drolet Department of Electrical & Computer Engineering, Royal Military College of Canada, P.O. Box 17000, Station Forces, Kingston, Ontario, CANADA K7K 7B4 Tel: (613) 541-6000, extension: 6192 Fax: (613) 544-8107 Email: [email protected]

Contents

Preface

2 Probability Theory
  2.1 Fundamental Definitions
  2.2 Communication problem
  2.3 Random variables and distribution functions
  2.4 Transformation of random variables
    2.4.1 Calculate Fy(α) first, then py(α) = dFy(α)/dα
    2.4.2 y = f(x) where f( ) is differentiable and non-constant on any interval
    2.4.3 Reversible transformation of random variables
    2.4.4 Transformation of a uniformly distributed random variable
    2.4.5 Impulses in px(α) (Wozencraft & Jacobs, pages 64, 65)
  2.5 Conditional probability density
  2.6 Statistical Independence
  2.7 Mixed probability expressions
  2.8 Statistical independence
  2.9 Communication example
    2.9.1 Details of calculation of P(C)
    2.9.2 Design of the optimal receiver for a channel with additive Gaussian noise
    2.9.3 Performance of the optimal receiver for a channel with additive Gaussian noise
  2.10 Expected Value
    2.10.1 Moments of a Gaussian random variable
    2.10.2 Moments of some specific random variables
  2.11 Limit Theorems
    2.11.1 Weak law of large numbers
    2.11.2 Central Limit Theorem
    2.11.3 Chernoff Bound
  2.12 Moments (mixed) of Random Vectors
  2.13 Gaussian Random Vectors
    2.13.1 Definition
    2.13.2 Properties of jointly Gaussian Random Variables
    2.13.3 Joint Gaussian Density Function
  2.14 Problems

3 Random Waveforms
  3.1 Fundamental Definitions
  3.2 Stationariness
  3.3 Moment Functions of a Random Process
  3.4 Correlation Functions and Power Spectra
  3.5 Ergodicity
  3.6 Gaussian random processes
    3.6.1 Jointly Gaussian Processes
    3.6.2 White Gaussian Noise
  3.7 Problems

4 Optimum Receiver Principles
  4.1 Basic Approach
  4.2 Vector Channels
  4.3 Waveform Channels
    4.3.1 Waveform synthesis
    4.3.2 Recovery of signal vectors
    4.3.3 Irrelevant Data
    4.3.4 Theorem of Irrelevance
  4.4 Receiver Implementation
  4.5 Probability of Error
    4.5.1 Rectangular Signal Sets
    4.5.2 Orthogonal and Related Signal Sets
    4.5.3 Completely Symmetric Signal Sets and a priori Knowledge
    4.5.4 Union Bound on the Probability of Error
  4.6 Bit Error Rate and Signal-to-Noise Ratio per Bit
  4.7 Problems

5 Reliable Communication on an Unreliable Channel
  5.1 Codes of size M = 2
    5.1.1 Memoryless channel
    5.1.2 Memoryless channel and A = {0, 1}
  5.2 P(E) averaged over all possible C ⊂ A^N, |C| = M = 2
    5.2.1 Memoryless channel
    5.2.2 Special case 2: memoryless channel and A = {0, 1}
  5.3 P(E) averaged over all possible C ⊂ A^N, |C| = M

A Table of the Q( ) and erf( ) functions

B Orthonormal Expansion and Vector Representation of Continuous Signals
  B.1 Three-dimensional vectors in R³
  B.2 Vectors obtained by sampling of a waveform
  B.3 Effect of sampling frequency
  B.4 Continuous time signals

Chapter 2

Probability Theory

Probability is at the essence of communication, and this explains why the study of communication systems must be preceded by that of probability theory. Specifically,

• A communication system comprises, among other things, a source that sends a message. As explained in W&J, chap 1, page 5, the message produced by the source is random; if we knew what the message was going to be, then there would be no need for communication. Thus randomness is inherent to the study of communication systems.

• Also, considering the transmission of a large number of bits through an imperfect channel, we know from experience that a fraction of the bits are received incorrectly. This fraction is referred to as the bit error rate. In chapter 4 we study, among other things, the probability of bit error P(E) of some specific communication systems, and we would expect P(E) to be the theoretical model of the experimental bit error rate.¹ Thus again the study of probability is seen as necessary to understand communication systems.

Probability is concerned with making precise numerical statements and predictions about experiments/procedures whose outcome cannot be predicted. Each realization of the experiment usually produces a different outcome, which can only be obtained by performing the experiment. Our goal is to analyze the unpredictable nature of communication systems and design the best receiver, i.e. the receiver that will most often correctly estimate the transmitted message.

How can one make precise numerical statements about an unpredictable experiment? Even though the outcome of the experiment is unpredictable, the rate of occurrence of a specific outcome may be predictable. Consider for example: if a die is tossed a very large number N of times, there will likely be about N/6 occurrences of the outcome 1 and about N/3 occurrences of an outcome that is a multiple of 3 (i.e. 3 or 6).

¹ Although wrong in general, this statement holds in some special cases.


2.1 Fundamental Definitions

Refer to Wozencraft & Jacobs pp 13 - 20 for the details. The following definitions are intuitive and serve as a basis for the formal axiomatic definition of probability system that will follow.

Definition 2.1.1. A random experiment is a procedure producing an outcome that cannot be predicted accurately and with certainty at the same time. We have:
1. outcome: produced by a random experiment,
2. set of all outcomes: which may be infinite,
3. result: set of some (or all or none) of the outcomes,
4. a sequence of length N is a list of N outcomes obtained by repeating a random experiment N times,
5. N(seq, A) is the number of occurrences of event A in sequence seq,
6. fN(seq, A) = N(seq, A)/N is the relative frequency of occurrence of result A in sequence seq of length N,
7. the relative frequency of occurrence of a result A, denoted by f∞(A), is defined by
   f∞(A) ≜ lim_{N→∞} fN(seq, A) = lim_{N→∞} N(seq, A)/N

Remarks.
1. We need not specify the sequence in f∞(A) since, as N → ∞, all sequences will converge to the same ratio; this is called statistical regularity and will be discussed in more detail in section 2.11.1.
2. Notice that Dom(f∞) = {all results} ≠ {all outcomes} (W&J, page 14).
3. As shown in W&J, pp 14 - 15, f∞(A ∪ B) = f∞(A) + f∞(B) when A, B are mutually exclusive results.

Example 2.1.1.
random experiment: tossing of a die,
set of all outcomes: {1, 2, 3, 4, 5, 6},
result: {1} and {3, 6} ≡ {multiples of 3} are two examples of results,


relative frequency of occurrence:
f∞({1}) = lim_{N→∞} N(seq, {1})/N = 1/6
f∞({1, 3}) = lim_{N→∞} N(seq, {1, 3})/N
           = lim_{N→∞} [N(seq, {1}) + N(seq, {3})]/N
           = lim_{N→∞} N(seq, {1})/N + lim_{N→∞} N(seq, {3})/N
           = f∞({1}) + f∞({3}) = 1/3
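Statistical regularity is easy to observe numerically. The short Python sketch below is my own illustration (not part of the original notes; the function name is arbitrary): it simulates N tosses of a fair die and prints fN(seq, {1}) and fN(seq, {1, 3}), which settle near 1/6 and 1/3 as N grows.

```python
import random

def relative_frequency(seq, result):
    """f_N(seq, A): fraction of outcomes in seq that fall in the result A."""
    return sum(outcome in result for outcome in seq) / len(seq)

random.seed(0)
for N in (100, 10_000, 1_000_000):
    seq = [random.randint(1, 6) for _ in range(N)]    # N tosses of a fair die
    print(N,
          round(relative_frequency(seq, {1}), 4),      # approaches 1/6
          round(relative_frequency(seq, {1, 3}), 4))   # approaches 1/3
```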

The above definitions are adequate to describe the physical concept but too loose and imprecise to describe a mathematical concept. We next give the formal axiomatic definition of probability system. Under certain physical conditions a random experiment can be modeled as a probability system and obeys the same laws. After giving the definition and its immediate consequences, we illustrate the relationship between "probability system" and "random experiment". This abstract definition should be well understood; this will make it easier to grasp the concepts of random variables and random processes later.

Definition 2.1.2. (Axiomatic definition) A probability system consists of a nonempty collection of objects Ω, a non-empty collection F of subsets of Ω and a function P : F → R satisfying the following (axioms):
1. Ω ∈ F
2. A ∈ F ⇒ Ā ∈ F
3. A1, A2, . . . , An, An+1, . . . ∈ F ⇒ A1 ∪ A2 ∪ A3 ∪ . . . ∈ F, i.e. F is closed under countable unions.
4. P(Ω) = 1.
5. P(A) ≥ 0, ∀A ∈ F.
6. A1, A2, . . . ∈ F and Ai ∩ Aj = ∅, ∀j ≠ i (pairwise disjoint) ⇒ P(A1 ∪ A2 ∪ A3 ∪ . . .) = P(A1) + P(A2) + . . .

Ω is called the sample space and its elements are called sample points. The elements of F are called events (F is called the class of events). A probability space will be denoted by (Ω, F, P : F → [0, 1] ⊂ R) or simply (Ω, F, P).

Remarks.
1. The definitions given by Wozencraft & Jacobs are limited to the case |F| ≠ ∞ (W&J, page 20).

2. If F = 2^Ω when |Ω| = ∞ there may not exist a function P satisfying the axioms (4), (5) and (6). It can be shown that for any Ω, finite or not, there is at least one F ⊂ 2^Ω for which a probability measure exists (satisfies the axioms (1) . . . (6)).

Theorem 1. Axioms (1) to (5) imply the following:
1. A1, A2, . . . , An, An+1, . . . ∈ F ⇒ any set obtained by a Boolean function on the events Ai also lies in F.
2. ∅ ∈ F and P(∅) = 0
3. P(Ā) = 1 − P(A), ∀A ∈ F
4. 0 ≤ P(A) ≤ 1, ∀A ∈ F
5. A ⊂ B ⇒ P(A) ≤ P(B), ∀A, B ∈ F
6. P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
7. A1 ⊂ A2 ⊂ A3 . . . ∈ F ⇒ P(∪_{i=1}^∞ Ai) = lim_{n→∞} P(An)
8. A1 ⊃ A2 ⊃ A3 . . . ∈ F ⇒ P(∩_{i=1}^∞ Ai) = lim_{n→∞} P(An)

Sketch of proof: (2) to (6) only. (1) is hard to prove. (7) and (8) are called continuity properties and are also difficult to prove; they will be used to prove the continuity of cumulative distribution functions (defined later). (3) follows from A ∪ Ā = Ω and A ∩ Ā = ∅. (2) follows from axiom (3). (4) follows from axioms (3) and (4). (5) follows from B = A ∪ (B − A) and property (1). (6) follows from A ∪ B = A ∪ (Ā ∩ B) and B = (A ∩ B) ∪ (Ā ∩ B).

Examples of probability system: Refer to Wozencraft & Jacobs pages 21 - 23 for some examples with
1. finite sample space,
2. real line sample space.

In the example "Real-line sample space,"² F is the set of opened and closed intervals of [0, 1] = Ω, plus (enumerable) unions, intersections and complements of such intervals. Clearly F ≠ 2^Ω since I = {x ∈ [0, 1] : no "7" in the decimal representation of x} ⊂ Ω but I ∉ F. As stated in remark 2 on page 4, F must be carefully specified to ensure the axioms are all satisfied, and it may be necessary to use F ⊊ 2^Ω.

² Wozencraft & Jacobs, page 21


The correspondence between "probability system" and "random experiment" is summarized in the following table:

Probability System      Random Experiment
Sample space            Set of all outcomes
Sample point            Outcome
Event                   Result
Probability measure     Relative frequency of occurrence

Remarks. 1. The axiomatic definition has the following advantages: (a) It is very precise and formal and uses objects/concepts suitable to the development of a strong theory. (b) It does not rely on the existence of a physical random experiment.3 We can construct many examples of probability systems more freely. 2. The axiomatic definition has the disadvantage of being suitable to the modeling of a random experiment only under very specific controlled conditions. We have to be careful when using a specific law from the probability theory into the context of a real world random experiment. We will often refer to simple real world random experiments for example/illustration purposes. Conversely, for every random experiment we can construct a corresponding probability system to be used as idealized mathematical model. This is done as follows: 1. Ω , set of all possible outcomes, 2. F , set of all results, the probability of which may have to be calculated, plus any other required subset of Ω, required to satisfy the axioms of a class of event (if Ω is finite it is most convenient to take all the subsets of Ω), 3. P (A) , f∞ (A), ∀A ∈ F . Relation of the Model to the Real World: Wozencraft & Jacobs pages 24 - 29. This is a difficult paragraph and may be viewed as comprising two interconnected parts: 1. Of primary importance are the definitions of compound experiment and binomial distribution (this is used in the problems) from the bottom of page W&J 25 after equation (2.15c) up to the paragraph starting with “Our primary interest . . . ” on page W&J 27. 3 refer to the examples in “Real line sample space”, W&J, page 21, for a purely mathematical construction.

2. Of secondary importance (remainder of this paragraph from page W&J 27 to the bottom of page W&J 29): W&J illustrates that in a long sequence of M (independent) outcomes of a random experiment, the fraction of these outcomes that lie in an event A of probability p = P(A) is very likely close to p, or in other words f∞(A) = P(A). This will be explained again with the Weak Law of Large Numbers in §2.11.1 and may be skipped for now.

Definition 2.1.3. A compound experiment is a random experiment which consists of a sequence of M independent (defined later) trials of a simpler experiment.

If A is an event of the simpler experiment with P(A) = p ≠ 0, P(Ā) = 1 − p, then the following set is a sample space of the compound experiment (M trials):
Ω_M = {(A, A, . . . , A), (Ā, A, A, . . . , A), (A, Ā, A, A, . . . , A), (Ā, Ā, A, A, . . . , A), . . . , (Ā, Ā, . . . , Ā)}
Let F_M be the set of all subsets of Ω_M, i.e. F_M = 2^{Ω_M}. We also define the mapping:
P_M : F_M → [0, 1]
P_M : {x} ↦ p^{# of A in x} (1 − p)^{# of Ā in x}
for any sequence x ∈ Ω_M, and this together with the axioms of a probability measure completely determines P_M for every event E ∈ F_M:
P(E) = Σ_{x∈E} p^{# of A in x} (1 − p)^{# of Ā in x}

One can easily verify that (Ω_M, F_M, P_M) is a probability system, i.e. F_M is a valid class of events and P_M is a valid probability measure. The following M + 1 events defined below are of special interest:
A^i = {x ∈ Ω_M : x has i occurrences of A and M − i occurrences of Ā} ∈ F_M, A^i ⊂ Ω_M,
for i = 0, 1, . . . , M. It is then seen (bottom of page 26) that
P_M(A^i) = (M choose i) p^i (1 − p)^{M−i},   i = 0, 1, . . . , M,
where we recall that (M choose i) = M!/(i!(M − i)!). This is called the binomial distribution (other distributions will be defined later). We see that
A^i ∩ A^j = ∅ whenever i ≠ j,   and   ∪_{i=0}^M A^i = Ω_M,
so we expect [from axioms (4) and (5)] Σ_{i=0}^M P_M(A^i) = P_M(Ω_M) = 1. This is verified by the binomial theorem (top of page 27).

Example 2.1.2. The probability of obtaining five heads when throwing a coin 15 times is (15 choose 5) (1/2)⁵ (1/2)¹⁰ ≈ 0.0916.
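A minimal numerical check of the binomial distribution and of example 2.1.2 (my own Python sketch, not part of the notes):

```python
from math import comb

# P_M(A^i) = C(M, i) p^i (1 - p)^(M - i): probability of i occurrences of A in M trials
def binomial_pmf(M, i, p):
    return comb(M, i) * p**i * (1 - p)**(M - i)

print(binomial_pmf(15, 5, 0.5))                            # ≈ 0.0916 (example 2.1.2)
print(sum(binomial_pmf(15, i, 0.5) for i in range(16)))    # 1.0, as the binomial theorem guarantees
```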

Conditional probability: Wozencraft & Jacobs pages 29 - 31.

Definition 2.1.4. If A, B ∈ F and P(B) ≠ 0, then P(A|B) = P(A ∩ B)/P(B) is called the probability of event A conditioned on event B.

This definition is consistent since by axioms (2) and (3), A ∩ B ∈ F. (Ω, F, P( |B)) is easily seen to be a probability system. Conditional probability is the model for the relative frequency of occurrence of an event A given that an event B occurs. To illustrate this we consider the random experiment, the outcome of which is the sum obtained by the throw of two fair dice. We define the following results:
A = {sum of dice is either 4, 5 or 6}
B = {sum of dice is an even number}
A sequence of N outcomes may look like:
seq = 2, 8, 9, 5, 10, 4, 7, 8, . . . , 6
In order to obtain the relative frequency of occurrence of A given that B has occurred, we first extract from seq the outcomes that are even numbers:
seqshortened = 2, 8, 10, 4, 8, . . . , 6
and calculate the fraction of times the outcome is 4, 5 or 6 in seqshortened:

relative frequency of "sum is 4, 5 or 6" given "sum is even"
= N(seqshortened, A) / (length of seqshortened)
= N(seq, A ∩ B) / N(seq, B)
= [N(seq, A ∩ B)/N] / [N(seq, B)/N]
= fN(seq, A ∩ B) / fN(seq, B)

The latter expression is of the same form as the expression used to define conditional probability. This intuitively explains that P(A|B) as defined above is a mathematical model for the relative frequency of occurrence of a result given that another result has occurred. If we define FB = {A ∩ B | A ∈ F}, one can also verify that (B, FB, P( |B)) is another probability system and corresponds to the hatched portion of figure 2.8 (Wozencraft & Jacobs, page 30).
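The same ratio of relative frequencies can be estimated by simulation. The sketch below (mine, not from the notes) repeats the two-dice experiment and compares fN(seq, A ∩ B)/fN(seq, B) to the exact conditional probability P(A|B) = (8/36)/(18/36) = 4/9.

```python
import random

random.seed(1)
N = 1_000_000
seq = [random.randint(1, 6) + random.randint(1, 6) for _ in range(N)]  # sums of two fair dice

in_A = [s in (4, 5, 6) for s in seq]    # A: sum is 4, 5 or 6
in_B = [s % 2 == 0 for s in seq]        # B: sum is even

f_AB = sum(a and b for a, b in zip(in_A, in_B)) / N    # f_N(seq, A ∩ B)
f_B = sum(in_B) / N                                    # f_N(seq, B)
print(f_AB / f_B)   # estimate of P(A|B); exact value is 4/9 ≈ 0.444
```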


Theorem 2.
1. If A1, A2, . . . ∈ F and Ai ∩ Aj ∩ B = ∅, ∀j, ∀i ≠ j, then:
   P(A1 ∪ A2 ∪ . . . |B) = Σ_{i=1}^∞ P(Ai|B).
   If moreover A1 ∪ A2 ∪ . . . ⊃ B then Σ_{i=1}^∞ P(Ai|B) = 1; this is known as the theorem of total probability.⁴
2. Let A1, A2, . . . , An ∈ F be pairwise disjoint and P(Aj) ≠ 0, ∀j. If B ∈ F satisfies P(B) ≠ 0 and B ⊂ ∪_{j=1}^n Aj, then
   (a) P(B) = Σ_{j=1}^n P(Aj) P(B|Aj),
   (b) P(Ai|B) = P(Ai) P(B|Ai) / Σ_{j=1}^n P(Aj) P(B|Aj), for any i = 1, 2, . . . , n; this is known as Bayes theorem.⁵

Proof. We prove the second part of the theorem only; the proof of the first part is left as an exercise.
1. B ⊂ ∪_{j=1}^n Aj ⇒ B = B ∩ (∪_{j=1}^n Aj) = ∪_{j=1}^n (B ∩ Aj) and all the B ∩ Aj are pairwise disjoint. It follows that P(B) = Σ_{j=1}^n P(B ∩ Aj), from which the result follows.
2. P(Ai|B) = P(Ai) P(B|Ai)/P(B) from equation (2.21) in W&J. The desired result follows from part 1 of the theorem.

Bayes theorem is useful in situations where P(B|Aj) is given for every j but P(Aj|B) is not directly known. The following example illustrates this.

Example 2.1.3. (Refer to figure 2.1) We are given three boxes A1, A2, A3 containing coloured balls as follows:
A1: 2 red balls and 3 black balls,
A2: 3 red balls and 5 black balls,
A3: 4 red balls and 4 black balls.
The (random) experiment consists in first choosing a box at random among the three boxes (equiprobably, i.e. each has a probability 1/3 of being chosen) and then drawing a ball at random from the box chosen. Let B denote the event "a red ball has been drawn". Calculate the probability that the ball was drawn from box number 2 if it is red, i.e. P(A2|B).

⁴ cf: Wozencraft & Jacobs page 31
⁵ cf: Wozencraft & Jacobs problem 2.6


Solution: From the data given in the problem we have that P(Ai) = 1/3, i = 1, 2, 3, and clearly B ⊂ ∪_{j=1}^3 Aj. It follows from the theorem that:
P(A2|B) = P(A2) P(B|A2) / Σ_{j=1}^3 P(Aj) P(B|Aj) = (1/3)(3/8) / [(1/3)(2/5 + 3/8 + 4/8)] = 15/51 ≈ 0.294118

Figure 2.1:

Example 2.1.4. A binary source transmits at random one of two messages m0 or m1, with probabilities P(m0) = 1/3, P(m1) = 2/3. The message is fed through a random (noisy) channel with input and output respectively denoted as TX and RX, and with the following transition probabilities:
P(RX = 0 | TX = m0) = 0.99
P(RX = 1 | TX = m0) = 0.01
P(RX = 0 | TX = m1) = 0.01
P(RX = 1 | TX = m1) = 0.99
Calculate
1. P(error | RX = 0) = P(TX = m1 | RX = 0),
2. P(error | RX = 1) = P(TX = m0 | RX = 1).


Solution. This problem is equivalent to the following. We are given two boxes labeled as m0 , m1 which contain balls labeled 0 or 1 as follows: • box m0 contains 99 balls labeled 0 and 1 ball labeled 1, • box m1 contains 1 ball labeled 0 and 99 balls labeled 1, A random experiment consists in first taking one of the two boxes at random with probabilities P (m0 ) = 1/3, P (m1 ) = 2/3 followed by drawing one ball from the box chosen. Calculate: 1. the probability that the ball was drawn from box m1 if it is labeled 0, 2. the probability that the ball was drawn from box m0 if it is labeled 1. We see that the problem is solved similarly to example 2.1.3. We easily find (verify this): 1. P (error | RX = 0) = 2/101 ≈ 19.8 × 10−3 , 2. P (error | RX = 1) = 1/199 ≈ 5.03 × 10−3 .
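A quick numerical check of these two posteriors, following the same Bayes-theorem calculation as example 2.1.3 (a Python sketch of mine; the variable names are arbitrary):

```python
from fractions import Fraction

# a priori message probabilities and channel transition probabilities P(RX = r | TX = m)
prior = {"m0": Fraction(1, 3), "m1": Fraction(2, 3)}
trans = {("m0", 0): Fraction(99, 100), ("m0", 1): Fraction(1, 100),
         ("m1", 0): Fraction(1, 100),  ("m1", 1): Fraction(99, 100)}

def posterior(m, r):
    """P(TX = m | RX = r) by Bayes theorem."""
    p_r = sum(prior[mi] * trans[(mi, r)] for mi in prior)   # total probability P(RX = r)
    return prior[m] * trans[(m, r)] / p_r

print(posterior("m1", 0))   # P(error | RX = 0) = 2/101
print(posterior("m0", 1))   # P(error | RX = 1) = 1/199
```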

Notice that an error is roughly four times more likely on reception of 0 than on reception of 1, even though the transmission of 1 is only twice as likely as that of 0.

Statistical independence: Wozencraft & Jacobs pages 31-32.

Definition 2.1.5.
1. A, B ∈ F are (statistically) independent if P(A ∩ B) = P(A)P(B) (equivalently P(A|B) = P(A) or P(B|A) = P(B) when P(A)P(B) ≠ 0).
2. In general A1, A2, . . . , An ∈ F are (statistically) independent if P(∩_{i∈I} Ai) = ∏_{i∈I} P(Ai), ∀I ⊂ {1, 2, . . . , n}, I ≠ ∅.
3. A1, A2, . . . , An ∈ F are pairwise (statistically) independent if P(Ai ∩ Aj) = P(Ai)P(Aj), for every 1 ≤ i ≠ j ≤ n.

Theorem 3. Let A1, A2, . . . , Am, B1, B2, . . . , Bn ∈ F be (statistically) independent and define:
A ≡ event obtained from the Ai using ∪, ∩ and complementation,
B ≡ event obtained from the Bj using ∪, ∩ and complementation.
Then A and B are statistically independent.

Example 2.1.5. Without using the result of the above theorem, show that A and B̄ are statistically independent if A and B are two statistically independent events.


Solution: we first notice that A = (A ∩ B) ∪ (A ∩ B̄). Next,
P(A ∩ B̄) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)(1 − P(B)) = P(A)P(B̄)

2.2 Communication problem

Refer to Wozencraft & Jacobs, pages 33 to 37. In this example, a probability system is constructed by combining a random experiment (message source), a random transformation (digital communication channel) and a deterministic transformation (decision element).

1. Source: We define:
Sample space: Ωsource = {m0, m1}
Class of events: 2^Ωsource = {∅, {m0}, {m1}, {m0, m1}}
Probability function: PS(∅) = 0, PS({m0}) = Pm0, PS({m1}) = Pm1, PS(Ωsource) = PS({m0, m1}) = Pm0 + Pm1 = 1.
Pm0, Pm1 are called a priori message probabilities. (Ωsource, 2^Ωsource, PS) satisfies all the axioms of a probability system.

2. Discrete communication channel: is a transformation with unpredictable (random) characteristics. It is determined by its transition probabilities P[rj|mi] ≥ 0 as in figure 2.10 (W&J). To be valid it must satisfy:
Σ_{all j} P[rj|mi] = 1,   for all i.
Combining the source and the discrete communication channel we define:
Sample space: ΩDCC = {(m0, r0), (m0, r1), (m0, r2), (m1, r0), (m1, r1), (m1, r2)}
Class of events: 2^ΩDCC = {∅, {(m0, r0)}, {(m0, r1)}, . . . , ΩDCC}

Probability function: PDCC({(mi, rj)}) = PS({mi}) P[rj|mi], for every i and every j. One easily verifies that (you should verify this):
PDCC(ΩDCC) = Σ_{all i} Σ_{all j} PDCC({(mi, rj)}) = 1.

(ΩDCC, 2^ΩDCC, PDCC) forms a probability system.

3. The decision element is a deterministic (non-random) mapping:
m̂ : {r0, r1, r2} → {m0, m1}
m̂ : r0 ↦ m0
m̂ : r1 ↦ m1
m̂ : r2 ↦ m1
(This is just one example; there are 8 possible mappings!)
This mapping m̂ can be used together with the probability system (ΩDCC, 2^ΩDCC, PDCC) to define the probability system (ΩD, 2^ΩD, PD) described below:
Sample space: ΩD = {(m0, m0), (m0, m1), (m1, m0), (m1, m1)} where the first component of each pair denotes the message transmitted and the second component denotes the decision made.
Events: 2^ΩD = {∅, {(m0, m0)}, {(m0, m1)}, . . . , ΩD}
Probability function: completely specified by the probability function PDCC( ), the decision mapping m̂( ), the axioms and the following:
PD({(m0, m0)}) = PDCC({(m0, r0)})
PD({(m0, m1)}) = PDCC({(m0, r1), (m0, r2)})
PD({(m1, m0)}) = PDCC({(m1, r0)})
PD({(m1, m1)}) = PDCC({(m1, r1), (m1, r2)})
(this corresponds to the above example of mapping m̂( )).

The probability system (ΩD, 2^ΩD, PD) describes the overall operation of the discrete communication channel. Its performance is measured by its probability of correct decision P(C) (page 35, W&J). We call C = {(m0, m0), (m1, m1)} ⊂ ΩD the correct decision event. For the above example of mapping m̂( ), the correct decision event C ⊂ ΩD corresponds to C̃ ⊂ ΩDCC given by:
C̃ = {(m0, r0), (m1, r1), (m1, r2)}
In general, C̃ is given by:
C̃ = {(mi, rj) : m̂(rj) = mi, ∀i ∈ {0, . . . , M − 1}, ∀j ∈ {0, . . . , J − 1}} = {(m̂(rj), rj) : j ∈ {0, . . . , J − 1}}


The probability of the event C is:
P(C) = P(C̃) = PDCC({(m0, r0)}) + PDCC({(m1, r1)}) + PDCC({(m1, r2)})
     = Σ_{j=0}^{2} PDCC({(m̂(rj), rj)})
     = Σ_{j=0}^{2} PS({m̂(rj)}) P[rj|m̂(rj)]    (2.1)

From equation (2.1) it is clear that the probability of correct decision depends on the assignment made by the decision element m̂( ). The remainder of this paragraph (page 35 in W&J) shows how the mapping (decision element) can be chosen in order to maximize P(C) given a priori message probabilities P({mi}) = Pmi and channel transition probabilities P[rj|mi]. It is helpful to first consider a numerical example.

Example 2.2.1. Consider the discrete communication channel of figure 2.2. The a priori and transition probabilities are shown in the figure and we consider the specific decision element also shown in the figure. We want to calculate the probability of correct decision P(C), where
C = {(m0, m0), (m1, m1)} ∈ FD
From the paragraph above, we have P(C) = P(C̃) where
C̃ = {(m0, r0), (m1, r1), (m1, r2)} ∈ FDCC
We then obtain
P(C) = P(C̃) = PS(m0) P[r0|m0] + PS(m1) P[r1|m1] + PS(m1) P[r2|m1]
     = 0.6 × 0.9 + 0.4 × 0.15 + 0.4 × 0.8 = 0.54 + 0.06 + 0.32 = 0.92

Finally, the probability of error is P (E ) = 1 − P (C ) = 0.08. We leave it as an exercise to show that no other decision element would lead to a larger probability of correct decision; in other words, the decision element of figure 2.2 is optimum. The important results and terminology are summarized in definition 2.2.1 and theorem 4. Definition 2.2.1. 1. An optimum decision (element) is a decision mapping that yields the largest P (C ) over any other choice of decision mapping.


Figure 2.2:

2. The a posteriori probability of message mi given the received symbol rj is defined by the conditional probability:
P({mi} | {rj}) ≜ PDCC({(mi, rj)}) / P({rj}),
where P({rj}) = Σ_{i=0}^{M−1} PDCC({(mi, rj)}), as long as P({rj}) ≠ 0.

Theorem 4. 1. An optimum decision mapping is obtained by mapping a received symbol rj to that message mi that maximizes the product P [rj | mi ]P ({mi }). 2. The decision mapping that maps a received symbol rj (P ({rj }) ̸= 0) to that message mi that maximizes the a posteriori probability P ({mi } | {rj }) is an optimum receiver.
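To make Theorem 4 concrete, the Python sketch below is my own illustration using the numbers of example 2.2.1. Only P[r0|m0] = 0.9, P[r1|m1] = 0.15, P[r2|m1] = 0.8 and the priors 0.6/0.4 are quoted in the text; the split P[r1|m0] = P[r2|m0] = 0.05 is an assumption of mine for figure 2.2. It builds the decision mapping rj ↦ argmax_{mi} P[rj|mi] P({mi}) and compares its P(C) against all 8 possible mappings.

```python
from itertools import product

prior = {"m0": 0.6, "m1": 0.4}
# Transition probabilities; the 0.05/0.05 split for m0 is an assumed value, see above.
P = {"m0": {"r0": 0.90, "r1": 0.05, "r2": 0.05},
     "m1": {"r0": 0.05, "r1": 0.15, "r2": 0.80}}
symbols = ["r0", "r1", "r2"]

def prob_correct(mapping):
    # P(C) = sum_j PS({m̂(rj)}) P[rj | m̂(rj)], equation (2.1)
    return sum(prior[mapping[r]] * P[mapping[r]][r] for r in symbols)

# MAP rule of Theorem 4: map rj to the message maximizing P[rj|mi] P({mi})
map_rule = {r: max(prior, key=lambda m: prior[m] * P[m][r]) for r in symbols}
print(map_rule, prob_correct(map_rule))   # {'r0': 'm0', 'r1': 'm1', 'r2': 'm1'} 0.92

# Exhaustive check over all 2^3 = 8 decision mappings
best = max((dict(zip(symbols, ms)) for ms in product(["m0", "m1"], repeat=3)),
           key=prob_correct)
print(best, prob_correct(best))           # the MAP mapping achieves the largest P(C)
```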

2.3 Random variables and distribution functions

Refer to Wozencraft & Jacobs, pages 37 to 44. We now describe a mathematical concept that is used to model physical random experiments, the outcomes of which are real numbers. The definition appears more complicated and abstract than necessary, but this is required and will later naturally extend to the definition of random processes in chapter 3. Moreover the abstract definition 1. leads to a simpler definition of joint random variables i.e. many random variables (such as in definition 2.3.3), 2. extends to the definition of transformation of random variables (section 2.4),


3. indicates that there is an underlying random physical phenomenon, the result of which is an observed real number.

Definition 2.3.1. Consider a probability system (Ω, F, P : F → [0, 1]) and the mapping
x : Ω → R ∪ {±∞},   x : ω ↦ x(ω)
such that {ω : x(ω) ≤ α} ∈ F, ∀α ∈ R ∪ {±∞}, and {ω : x(ω) = ±∞} ∈ F are events of probability 0. The mapping x is a random variable. We often use the shorter notation {x ≤ α} ≜ {ω : x(ω) ≤ α} whenever there is no confusion.

Proposition 5.
1. For every β ∈ R ∪ {±∞} such that β < α we have
   i. {x < α} ∈ F, {x = α} ∈ F,
   ii. {x ≥ α} ∈ F, {x > α} ∈ F,
   iii. {β < x ≤ α} ∈ F.
2. For any interval I ⊂ R, the set {ω : x(ω) ∈ I} = {x ∈ I} is an event. If I1, I2 are two disjoint intervals in R then the events {x ∈ I1} and {x ∈ I2} are disjoint.

The book gives some examples of random variables. The following describes other examples of random and "non random" variables.

1. (Ω = {1, 2, 3, 4, 5, 6}; F = 2^Ω; P(E) = |E|/6, ∀E ∈ F). This is a probability space and one easily verifies that
   x : Ω → R ∪ {±∞},   x : ω ↦ ω² − 1
   is a random variable. The function
   y : Ω → R ∪ {±∞},   y : ω ↦ 1/(ω − 5)
   is not a random variable: the function is not defined (infinite) at ω = 5, and since P({5}) = 1/6 ≠ 0 the axiom P({ω : y(ω) = ∞}) = 0 and P({ω : y(ω) = −∞}) = 0 is not satisfied.

2. (Ω = {1, 2, 3, 4, 5, 6}; F = {∅, Ω, {2, 4, 6}, {1, 3, 5}}; P(E) = |E|/6, ∀E ∈ F). This is a probability space. The function
   x : Ω → R ∪ {±∞},   x : ω ↦ ω² − 1
   is not a random variable since {ω : x(ω) ≤ 9} = {1, 2, 3} ∉ F.

CHAPTER 2. PROBABILITY THEORY 3. Ω = [0, 1]; F ≡ sub-intervals + countable unions and intersections of such sub-intervals; P ≡ total length of the sub-interval(s). Consider: x: Ω x: ω

→ R ∪ {±∞} 7 → drop all 7’s in the decimal representation of ω

This is not a random variable since {ω : x(ω) ≤ 1} does not lie in F .6 1 Notice that y : ω 7→ ω−1/2 is a random variable; in this case P (y = ∞) = P (y = −∞) = 0. Definition 2.3.2. Let x : Ω → R ∪ {±∞} be a random variable for the probability system (Ω, F , P : F → [0, 1]). The function Fx : R ∪ {±∞} Fx : α

→ [0, 1] 7→ P ({ω : x(ω) ≤ α})

is called probability distribution function of x. Example 2.3.1. Refer to W&J, page 38. Example 2.3.2. The random experiment is that of throwing a fair die and the probability space consists in: Ω F

= {1, 2, 3, 4, 5, 6} =

2Ω

The probability is determined by: P ({1}) = P ({2}) = . . . = P ({6}) = 1/6 {z } | ⇒P (E)=|E|/6, ∀E∈F

We define a function x as follows: x: Ω x: ω

→ R ∪ {±∞} 7→ ω2 + 1

One easily verifies that the function x satisfies the axioms of a random variable. In order to sketch Fx (α) we evaluate it at a few points: Fx (0) Fx (1)

= =

P ({x ≤ 0}) = P ({ω ∈ Ω : ω 2 + 1 ≤ 0}) = P (∅) = 0 P ({x ≤ 1}) = P ({ω ∈ Ω : ω 2 + 1 ≤ 1}) = P (∅) = 0

Fx (2) Fx (5) lim− Fx (α)

= = =

P ({x ≤ 2}) = P ({ω ∈ Ω : ω 2 + 1 ≤ 2}) = P ({1}) = 1/6 P ({x ≤ 5}) = P ({ω ∈ Ω : ω 2 + 1 ≤ 5}) = P ({1, 2}) = 1/3 P ({x < 2}) = P ({ω ∈ Ω : ω 2 + 1 < 2}) = P (∅) = 0

lim Fx (α) ...

= P ({x < 5}) = P ({ω ∈ Ω : ω 2 + 1 < 5}) = P ({1}) = 1/6 ... ...

α→2

α→5−

Finally we obtain the function sketched in figure 2.3. 6 refer

to remark 1 on page 3

2.3. RANDOM VARIABLES AND DISTRIBUTION FUNCTIONS

Figure 2.3:

Theorem 6. (properties of probability distribution functions) 1. 0 ≤ Fx (α) ≤ 1, ∀α ∈ R ∪ {±∞}, 2. Fx (−∞) = 0; Fx (∞) = 1, 3. a > b ⇒ Fx (a) − Fx (b) = P ({ω|b < x(ω) ≤ a}), 4. a > b ⇒ Fx (a) ≥ Fx (b), 5. ∀a ∈ R, Fx (α) is continuous on the right at the point α = a, 6. Fx (a) − limα→a− Fx (α) = P ({ω|x(ω) = a}) proof of 5.: Define Ai = {ω : x(ω) ≤ a + 1/i}, i = 1, 2, . . .. We then have A1 ⊃ A2 ⊃ A3 ⊃ . . . ∩∞ i=1 Ai = {ω : x(ω) ≤ a} It then follows from property 8 of theorem 1, the continuity property, that Fx (a) = P ({ω : x(ω) ≤ a}) = P (∩∞ i=1 Ai ) = lim P (An ) = lim Fx (a + 1/n) = lim Fx (α) n→∞

n→∞

α→a+

17

18

CHAPTER 2. PROBABILITY THEORY

We now turn our attention to a more general situation involving more than one random variable. For example, consider the random experiment of throwing two fair dice labelled as #1 and #2. The first random number is the outcome of die #1 and the second random number is the sum of both outcomes of dice #1, #2. We will detail the tools required to calculate the probability of events defined by multiple random variables. Given two random variables x1 , x2 we can construct the two probability distribution functions Fx1 (α1 ), Fx2 (α2 ). As we have seen the probability of any event involving x1 only and the probability of any event involving x2 only can respectively be calculated from the functions Fx1 (α1 ) and Fx2 (α2 ). But the two distributions functions are not sufficient to calculate the probability of an event involving both variables x1 , x2 , such as P ({2 < x1 ≤ 4} ∩ {−1 < x2 ≤ 1}). In order to calculate the probability of such an event a new distribution function is defined: Definition 2.3.3. Let x1 : Ω → R ∪ {±∞}, x2 : Ω → R ∪ {±∞} be random variables for the probability system (Ω, F , P : F → [0, 1]). The function: Fx1 ,x2 : Fx1 ,x2 :

(R ∪ {±∞})2 (α1 , α2 )

→ 7→

[0, 1] P ({ω : x1 (ω) ≤ α1 } ∩ {ω : x2 (ω) ≤ α2 })

is called joint probability distribution function of the random variables x1 and x2 . Remarks.

( ) 1. For any region I ∈ R2 , the set {ω : x1 (ω), x2 (ω) ∈ I} = {(x1 , x2 ) ∈ I} is an event. If I1 , I2 are two disjoint regions in R2 then the events {(x1 , x2 ) ∈ I1 } and {(x1 , x2 ) ∈ I2 } are disjoint. 2. The properties of joint probability distribution functions are immediate generalizations of the properties of (1-dimensional) probability distribution functions (refer to page 40, W&J). 3. Property VI on page 40, W&Y, is similar to the proof of property 3, page 17: first show that Fx1 ,x2 (a1 , a2 ) =

Fx1 ,x2 (a1 , b2 ) + P ({ω : x1 (ω) ≤ a1 } ∩ {ω : b2 < x2 (ω) ≤ a2 })

Fx1 ,x2 (a1 , a2 ) =

Fx1 ,x2 (b1 , a2 ) + P ({ω : b1 < x1 (ω) ≤ a1 } ∩ {ω : x2 (ω) ≤ a2 })

Definition 2.3.4. Let xi : Ω → R ∪ {±∞}, i = 1, 2, 3, . . . , k be k random variables for the probability system (Ω, F , P : F → [0, 1]). 1. The mapping x: Ω x: ω

k

→ (R ∪ {±∞}) 7→ (x1 (ω), x2 (ω), . . . , xk (ω)) , x(ω)

is called random vector.

2.3. RANDOM VARIABLES AND DISTRIBUTION FUNCTIONS

19

2. Let a = (a1 , a2 , . . . , ak ), b = (b1 , b2 , . . . , bk ) be two elements of k (R ∪ {±∞}) . We write a ≤ b to signify that ai ≤ bi , ∀i = 1, 2, . . . , k. 3. The function Fx : (R ∪ {±∞}) Fx : α

k

→ 7 →

[0, 1] P ({ω : x(ω) ≤ α})

is called joint probability distribution function of the random vector x. Example 2.3.3 (bottom of page 42 in W&J). Recall that Ω F P

= [0, 1] , ≡ set of intervals of Ω and “countable” unions and intersections of such intervals, ≡ total length of the union of the intervals making the event.

The remainder of the example is easy to follow. Remark. “The use of probability distribution functions is inconvenient in computations”.7 For this reason we introduce the probability density function defined below. Definition 2.3.5 (page 45, W&J). The function px (α) = ability density function of the random variable x.

dFx (α) dα

is called prob-

Proposition 7. Properties of the probability density function [pages 44 - 47, W&J] 1. When I ⊂ R ∪ {±∞}∫ is such that {ω : x(ω) ∈ I} ∈ F we can show that P ({ω : x(ω) ∈ I}) = I px (α)dα. 2. If lim+ Fx (α) = lim− Fx (α) and lim+ Fx′ (α) ̸= lim− Fx′ (α) then px (a) = α→a

α→a

α→a

α→a

lim Fx′ (α), i.e. the probability density function px ( ) is continuous on the

α→a+

right. 3. If lim+ Fx (α) − lim− Fx (α) = Pa ̸= 0 then px (α) contains the impulse α→a

Pa δ(α − a).8

α→a

7 bottom 8 δ(

ϵ > 0:

of page 43, W&J ) is the Dirac impulse function and it is defined on page 46, W&J. For all t0 ∈ R and ∫

t0 +ϵ

t0 −ϵ

f (t)δ(t − t0 )dt = f (t0 )

if f (t) is continuous at t = t0 . We also have: f (t)δ(t − t0 )

=

f (t0 )δ(t − t0 )

f (t) ∗ δ(t − t0 )

=

f (t − t0 )

if f (t) is continuous at t = t0 (∗ denotes the convolution).

20

CHAPTER 2. PROBABILITY THEORY 4. Fx (α) =

∫α −∞

px (β)dβ

5. px (α) ≥ 0, ∀α ∈ R ∫∞ 6. −∞ px (α)dα = 1 Theorem 8. For every function f (α) : R ∪ {±∞} → R ∪ {±∞} satisfying: f (α) ≥ 0, ∀α, ∫ ∞ f (α)dα = 1, −∞

there exists a probability system (Ω, F , P : R ∪ {±∞} → [0, 1]) and a random variable x : Ω → R ∪ {±∞} such that px (α) = f (α). Sketch of proof. The probability system is constructed as follows: • Ω = R, • F is the set of intervals of R = Ω, plus (enumerable) unions, intersections and complements of such intervals, ∫ • P (I) , I f (α)dα, ∀I ∈ F . Next we define the random variable x: Ω=R x: ω

→ R 7 → ω

One can then easily show that px (α) = f (α) (this is somewhat similar to the construction of page 23 in W&J). Example 2.3.4. The probability density function of the random variable described in example 2.3.2 and distribution function as in figure 2.3 is shown in figure 2.4.

Figure 2.4:

Example 2.3.5 (some probability density functions – refer to Wozencraft & Jacobs pp 47 - 49 for the definitions of the exponential, Rayleigh, Cauchy and Gaussian probability density functions).

2.3. RANDOM VARIABLES AND DISTRIBUTION FUNCTIONS

21

1. uniform: b > a   px (α) =

Fx (α) =

1 b−a

 0  0          

;a ≤ α < b ; elsewhere ;α < a

α−a b−a

;a ≤ α < b

1

;b ≤ α

2. Binomial: N ∈ N∗ , 0 < p < 1 px (α) = (N )

N ( ) ∑ N k=0

k

pk (1 − p)N −k δ(α − k)

! . A typical binomial probability density function is where k , k!(NN−k)! sketched in figure 2.5.

Figure 2.5: Typical binomial probability density function with n = 5, p = 0.4.

3. Poisson: ρ > 0 px (α) =

∞ ∑ ρk k=0

k!

e−ρ δ(α − k)

A typical Poisson probability density function is sketched in figure 2.6.

22

CHAPTER 2. PROBABILITY THEORY

Figure 2.6: First 6 impulses of a typical Poisson probability density function with λ = 1.5. The Poisson probability density function contains an infinite number of impulses: 1 for every x ∈ N.

Remarks. 1. All of the above functions satisfy 2. It can be shown that ∫



−∞

This follows from

∫∞ −∞



1 2πσ 2

−∞

px (α)dα = 1 (verify).

e−(α−µ)

e−x dx = 2

∫∞

2



/(2σ 2 )

dα = 1.

π.

3. There is no analytical form for the Gaussian probability distribution function. It is tabulated (refer to Appendix A) as:
Q(α) = (1/√(2π)) ∫_α^∞ e^{−β²/2} dβ = (1/2)[1 − erf(α/√2)]
erf(α) = (2/√π) ∫_0^α e^{−β²} dβ = 1 − 2Q(√2 α)
represented graphically by the hatched areas in figures 2.7 and 2.8.⁹ The function erf( ) is defined in the C math library math.h and most computer packages (MAPLE, MATHEMATICA, MATLAB) also include an implementation of the function. The erf( ) and Q( ) functions are plotted in figure 2.9.

Figure 2.7: Definition of the Q( ) function

Figure 2.8: Definition of the erf( ) function

4. If x is a Gaussian random variable with probability density function:¹⁰
px(α) = (1/√(2πσ²)) e^{−(α−µ)²/(2σ²)}
then:
P(x ≥ a) = Q((a − µ)/σ)
for any a ∈ R. More generally for any a, b ∈ R, a < b:
P(a ≤ x < b) = P(x ≥ a) − P(x ≥ b) = Q((a − µ)/σ) − Q((b − µ)/σ)
and in particular P(x < b) = 1 − Q((b − µ)/σ) = Q((µ − b)/σ).

⁹ Tables generally only list the values of Q(α) for α ≥ 0. If α < 0, we use Q(α) = 1 − Q(−α).
¹⁰ The parameters µ and σ² will respectively be called mean and variance of the random variable x; refer to section 2.10.


Figure 2.9: Graphs of the erf(α) function (red) and Q(α) function (blue)
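As a quick numerical companion to remarks 3 and 4 (a sketch of mine, not from the notes), Q( ) can be evaluated through the standard math.erfc function, and Gaussian interval probabilities then follow directly:

```python
from math import erfc, sqrt

def Q(alpha):
    """Q(α) = (1/2)·erfc(α/√2) = (1/√(2π)) ∫_α^∞ e^(−β²/2) dβ."""
    return 0.5 * erfc(alpha / sqrt(2))

print(Q(0.0), Q(1.0))          # 0.5 and ≈ 0.1587
print(Q(-1.0), 1 - Q(1.0))     # checks the identity Q(α) = 1 − Q(−α)

def prob_interval(a, b, mu, sigma):
    """P(a ≤ x < b) = Q((a−µ)/σ) − Q((b−µ)/σ) for x Gaussian with mean µ, variance σ²."""
    return Q((a - mu) / sigma) - Q((b - mu) / sigma)

print(prob_interval(-1.0, 1.0, 0.0, 1.0))   # ≈ 0.6827, the usual "one sigma" probability
```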

5. The Taylor series of the function Q( ) is given by:¹¹
Q(α) = 1/2 − α/√(2π) + Σ_{n=3, n odd}^{∞} (−1)^{(n−3)/2} [∏_{i=0}^{(n−3)/2} (2i + 1)] / (n! √(2π)) · α^n
     = 1/2 − α/√(2π) + α³/(6√(2π)) − α⁵/(40√(2π)) + α⁷/(336√(2π)) − α⁹/(3456√(2π)) + O(α¹¹)

6. The first derivative of the function Q(α) is given by:
dQ(α)/dα = −(1/√(2π)) e^{−α²/2}

Definition 2.3.6 (Wozencraft & Jacobs, page 49). Let x1 , x2 be two random variables with joint probability distribution function Fx1 , x2 (α1 , α2 ). The function ∂2 px1 ,x2 (α1 , α2 ) = Fx ,x (α1 , α2 ) ∂α1 ∂α2 1 2 is called joint probability density function of x1 , x2 . Proposition 9. Properties of the joint probability density function [Wozencraft & Jacobs, pp 50 - 55] 1. When I ⊂ R2 is such that {ω : (x1 (ω), x2 (ω)) ∈ I} ∈ F we can show that: ∫∫ P ({ω : (x1 (ω), x2 (ω)) ∈ I}) = px1 ,x2 (α1 , α2 )dα1 dα2 I 11 this

is required to solve problem 2.39 - more details to follow

2.3. RANDOM VARIABLES AND DISTRIBUTION FUNCTIONS ∫

2.

α1



25

α2

px1 ,x2 (β1 , β2 )dβ2 dβ1 Fx1 ,x2 (α1 , α2 ) = −∞ −∞ ∫ α1 ∫ ∞ Fx1 (α1 ) = px1 ,x2 (β1 , β2 )dβ2 dβ1 −∞ −∞ ∫ ∞ ∫ α2 px1 ,x2 (β1 , β2 )dβ2 dβ1 Fx2 (α2 ) = −∞

−∞

3. px1 ,x2 (α1 , α2 ) ≥ 0, ∀α1 , ∀α2 . ∫∫ ∞ 4. −∞ px1 ,x2 (α1 , α2 )dα2 dα2 = 1. 5. Discontinuities are treated in the same way as in the one dimensional case. Example 2.3.6 (Wozencraft & Jacobs, pp 53 - 54). 1. Computer generated plots (using MAPLE ) of equation W&J (2.58) are presented in figure 2.10. A more general expression for the probability density function of two (and more) jointly Gaussian random variables will be presented in §2.13.3. 2. The joint probability density function of equation (2.59) in W&J is purely impulsive because the term 1 δ(α1 − i) δ(α2 − j) 36 is 0 everywhere except at α1 = i and α2 = j where it is an infinitely high and infinitely narrow spike of volume 1/36. 3. Equation W&J (2.60) should be px1 ,x2 (α1 , α2 ) =

[ ] (α2 − i)2 1 δ(α1 − i) √ exp − 2 2 2π

2 ∑ 1 i=1

and, as explained, represents two infinitely high fences at α1 = 1 and α1 = 2. For any other value of α1 the joint probability density function is 0 since the impulse δ(α1 − 1) is 0 everywhere except at α1 = 1 and similarly δ(α1 − 2) is 0 everywhere except at α1 = 2. 4. The change of variable at the bottom of page 54 in W&J is the conversion from rectangular to polar coordinates. It is understood that θ = tan−1 (

α2 ) ± π u(−α1 ) α1

where u( ) is the unit step function, i.e. π radians (180◦ ) must be added (or subtracted) whenever α1 is negative since the tan−1 ( ) function only covers angles from −π/2 to +π/2.

26

CHAPTER 2. PROBABILITY THEORY This example shows that it is sometimes possible to obtain an expression for the volume under a 2-dimensional Gaussian probability density function even though it is not possible for a 1-dimensional Gaussian probability density function. We will also see that the Q( ) and erf( ) functions may also sometimes be used (problem 3.2 in W&J).

(a) ρ = 0

(b) ρ = 0.9

(c) ρ = −0.9

(d) ρ = −0.7

Figure 2.10: Plots of the joint Gaussian probability density function of 2 random variables Theorem 10. px1 (α1 ) =

∫∞ −∞

px1 ,x2 (α1 , α2 )dα2 .

Proof. We have: ∫ Fx1 (α1 ) = Fx1 ,x2 (α1 , ∞) =

α1

−∞



∫ |



−∞

px1 ,x2 (β1 , β2 )dβ2 dβ1 {z } h(β1 )

α1

= −∞

h(β1 )dβ1

(2.2)

which follow from property V page 40 (W&J) and equation W&J (2.56). We

2.3. RANDOM VARIABLES AND DISTRIBUTION FUNCTIONS ∫

also have: Fx1 (α1 ) =

27

α1 −∞

px1 (β1 )dβ1

(2.3)

from equation W&J (2.40 b). Both RHS of 2.2 and 2.3 give Fx1 (α1 ) for all values of α1 . The integrands must be the same, i.e. ∫ ∞ px1 (α1 ) = h(α1 ) = px1 ,x2 (α1 , β2 )dβ2 . −∞

Remark. This proof∫ is not 100 % rigorous; neither is Wozencraft & Jacobs’. α d They show that dα h(β)dβ = h(α) when h(β) is continuous at β = α. The −∞ result remains valid even if this condition is not satisfied, and this follows from the way in which the discontinuities of Fx (α) were handled in the definition of x (α) px (α) = dFdα . Example 2.3.7 (Wozencraft & Jacobs, pp 56 - 57). If x1 , x2 are two jointly Gaussian random variables then x1 is Gaussian and x2 is Gaussian. Therefore the volume under the surface of figure 2.27 in W&J, inside the strip bounded by a, b, is obtained with the Q( ) function; there is no need of a 2-dimensional version of the Q( ) function in this case. Definition 2.3.7. Let x = (x1 , x2 , . . . , xk ) be a random vector comprising k random variables with joint probability distribution function Fx (α). The ∂k function px (α) = ∂α Fx (α) is called joint probability density function of the random vector x. Proposition 11. Properties of the joint probability density function [Wozencraft & Jacobs pp 57 - 58] 1. When I ⊂ Rk is such that {ω : x(ω) ∈ I} ∈ F we can show that: ∫ P ({ω : x(ω) ∈ I}) = px (α)dα I

2. Discontinuities of Fx (α) are treated in the same way as the 1-dimensional and 2-dimensional cases. 3. (Elimination of random variables, W&J, pp 55) If x (x1 , x2 , . . . , xk ), x′ = (x1 , x2 , . . . , xi−1 , xi+1 , . . . , xk ), α = (α1 , α2 , . . . , αk ), α′ = (α1 , α2 , . . . , αi−1 , αi+1 , . . . , αk ) then ∫ ∞ ′ ′ px (α ) = px (α)dαi .

=

−∞

Definition 2.3.8. Let x1 , x2 be random variables for the probability system (Ω, F , P : F → [0, 1]). We write x1 = x2 if P ({ω : x1 (ω) ̸= x2 (ω)}) = 0.

28

CHAPTER 2. PROBABILITY THEORY

In particular x1 (ω) = x2 (ω), ∀ω ∈ Ω ⇒ x1 = x2 . Indeed x1 (ω) = x2 (ω), ∀ω ∈ Ω ⇒ {ω : x1 (ω) ̸= x2 (ω)} = ∅, and P (∅) = 0. We usually (i.e. from now on) exclude from Ω the sample points of probability 0 and in this case we find that x1 = x2 ⇔ {ω : x1 (ω) ̸= x2 (ω)} = ∅ ⇔ x1 (ω) = x2 (ω), ∀ω ∈ Ω. Under this assumption any one of the above two statements can be used to define equality of random variables.

2.4

Transformation of random variables

Refer to Wozencraft & Jacobs, pp 58 - 65. Let x : Ω → R ∪ {±∞} be a random variable for the probability system (Ω, F , P : F → [0, 1]), and let y: Ω y: ω

→ 7 →

R ∪ {±∞} f (x(ω))

where f : R → R is a piecewise differentiable function. For many such functions, y satisfies the axioms of a random variable. We say that y is defined by a transformation of random variable(s) and we write y = f (x). We want to calculate py (α) from px (α) and the function f . This generalizes to random vectors. Remark (Wozencraft & Jacobs, p 64). If px (α) contains impulses, the contribution of each impulse is calculated separately and added to the non-impulsive part. This will be illustrated in §2.4.5.

2.4.1

Calculate Fy (α) first, then py (α) =

dFy (α) dα

The following are obtained (when px (α) is non-impulsive) [refer to W&J]: 1. y = bx + a where a, b ∈ R, b ̸= 0: ⇒ py (α) = { 2. Half-wave (linear) rectifier y =

1 px |b|

(

α−a b

)

x ; x≥0 0 ; x0 ; α 0 =  ∅ ; α 0 (2.5) ; if α < 0

Reversible transformation of random variables

Refer to Wozencraft & Jacobs, pp 111 - 114. In this case we have a invertible transformation of random vectors and the scaling factor in equation (2.4) becomes the Jacobian of the transformation. Specifically for random vectors of length 2 we would have: y1 = f1 (x1 , x2 )

x1 = g1 (y1 , y2 )

y2 = f2 (x1 , x2 )

x2 = g2 (y1 , y2 )

The joint probability density function py1 , y2 (β1 , β2 ) is given by: py1 , y2 (β1 , β2 ) = px1 , x2 (g1 (β1 , β2 ), g2 (β1 , β2 )) × |Jg (β1 , β2 )| where Jg (β1 , β2 ) is given by: ∂g1 (β1 , β2 ) ∂β1 ∂g2 (β1 , β2 ) ∂β 1

∂g1 (β1 , β2 ) ∂β2 ∂g2 (β1 , β2 ) ∂β2



More generally we have: Theorem 12. Let x be a random vector for the probability system (Ω, F , P : F → [0, 1]) and let y be defined by: y: y:

Ω ω

→ 7→

(R ∪ {±∞})k f (x(ω)) = (f1 (x(ω)), . . . , fk (x(ω)))

where f : Rk → Rk is a reversible differentiable transformation. Then y is a random vector and py (β) = px (g(β))|Jg (β)| where g = f −1 and Jg (β) is the Jacobian associated with the transformation g. Remarks. 1. f : Rk → Rk is reversible and differentiable if and only if: (a) ∃g : Rk → Rk such that:

and

g ◦ f : Rk g◦f : α

→ Rk 7 → g(f (α)) = α

f ◦ g : Rk f ◦g : β

→ 7 →

Rk f (g(β)) = β

2.4. TRANSFORMATION OF RANDOM VARIABLES (b)

∂fi ∂xj

and

∂gi ∂xj

31

exist 1 ≤ ∀i, j ≤ k.

2. Wozencraft & Jacobs gives a very good interpretation of this result following equation (2A.7) on page W&J 113. Proof. (refer to Wozencraft & Jacobs, page 112) = P ({ω : y(ω) ≤ β})

Fy (β)

= P ({ω : f (x(ω)) ≤ β}) = P ({ω : x(ω) ∈ I}) ∫ = px (α)dα

(2.6)

I

where (refer to Wozencraft & Jacobs, equation 2A.3b) I

= {α|f (α) ≤ β} = {g(γ)|γ ≤ β}

represented graphically as in figure 2.11. We then apply the change of variables α = g(γ) to equation (2.6): px (α) −→ px (g(γ)) dα −→ |Jg (γ)|dγ α∈I

−→ γ ≤ β

and it follows that

(α = g(γ) ∈ I ⇒ γ ≤ β)

∫ Fy (β) =

px (g(γ))|Jg (γ)|dγ,

(2.7)

γ≤β

(Wozencraft & Jacobs (2A.6)). Identification of (2.7) with Fy (β) = ∫ p (γ)dγ yields γ≤β y py (γ) = px (g(γ))|Jg (γ)| (Wozencraft & Jacobs (2A.7)) since equality is verified for every β ∈ Rk .

Example 2.4.2. Let x, y be 2 random variables with joint probability density function px y (α, β) and define the random variable z = x + y. We want to find the probability density function pz (γ). Solution: Because the transformation from x, y to z is not ( invertible we) introduce a random variable w = y. The transformation f = f1 ( , ), f2 ( , ) takes the form: z w

= =

x+y y

= f1 (x, y) = f2 (x, y)

32

CHAPTER 2. PROBABILITY THEORY

Figure 2.11: change of variable used in proof ( ) and the inverse transformation g = g1 ( , ), g2 ( , ) is given by: x = z−w y = w

= g1 (z, w) = g2 (z, w)

The Jacobian of the inverse transformation is: ∂g1 (γ, β) ∂g1 (γ, β) ∂β Jg (γ, β) = ∂g2∂γ (γ, β) ∂g2 (γ, β) ∂γ ∂β 1 −1 =1 = 0 1



and we may directly write: ( ) pz w (γ, β) = px y g1 (γ, β), g2 (γ, β) |Jg (γ, β)| = px y (γ − β, β) × 1 Finally the probability density function of z is obtained by elimination of the random variable w: ∫ ∞ ∫ ∞ pz (γ) = pz w (γ, β) dβ = px y (γ − β, β) dβ −∞

−∞

this is equation (2.87) in W&J. Example 2.4.3. Let x, y be 2 random variables with joint probability density function px y (α, β) and define the random variable z = xy. We want to find the probability density function pz (γ).

2.4. TRANSFORMATION OF RANDOM VARIABLES

33

Solution: (Assume that P (x = 0) = P (y = 0) = 0) Because the transformation from x, y to z is not ( invertible we) introduce a random variable w = y. The transformation f = f1 ( , ), f2 ( , ) takes the form: z w

= xy = y

= f1 (x, y) = f2 (x, y)

( ) and the inverse transformation g = g1 ( , ), g2 ( , ) is given by: x = z/w y = w

= g1 (z, w) = g2 (z, w)

(notice that P (w = 0) = P (y = 0) = 0). The Jacobian of the inverse transformation is: ∂g1 (γ, β) ∂g1 (γ, β) ∂γ ∂β Jg (γ, β) = ∂g2 (γ, β) ∂g2 (γ, β) ∂γ ∂β 1/β −γ/β 2 = 1/β = 0 1 and we may directly write: pz w (γ, β)

( ) = px y g1 (γ, β), g2 (γ, β) |Jg (γ, β)| 1 = px y (γ/β, β) × |β|

Finally the probability density function of z is obtained by elimination of the random variable w: ∫ ∞ ∫ ∞ px y (γ/β, β) pz (γ) = pz w (γ, β) dβ = dβ |β| −∞ −∞ this is equation (2.89) in W&J. Example 2.4.4. The example of pages 113 - 114 (W&J) is a classic! Equation Wozencraft & Jacobs(2A.9a) should read y2 = f2 (x) = tan−1

(

x2 x1

) ± πu(−x1 )

where u( ) is the unit step function. If ρ = 0 in equation Wozencraft & Jacobs(2A.11b) then r is Rayleigh distributed, θ is uniformly distributed in the interval [0, 2π] and r, θ are statistically independent.

34

CHAPTER 2. PROBABILITY THEORY

2.4.4

Transformation of a uniformly distributed random variable

Theorem 13. Let y be a random variable and f( ) a valid cumulative distribution function¹² such that the random variable x defined by the transformation x = f(y) is uniformly distributed between 0 and 1. The probability density function of y is then given by:

    py(β) = df(β)/dβ

Proof. Refer to problem 2.16 in W & J.

Example 2.4.5. Consider the random variable y ≥ 0 defined by the transformation y = √(−2 ln(1 − x)) in which x is a random variable uniformly distributed between 0 and 1. Then

    x = (1 − e^{−y²/2}) u(y) = f(y)

and from the sketch in figure 2.12, f( ) is clearly seen to be a valid cumulative distribution function. Therefore:

    py(β) = f′(β) = β e^{−β²/2} u(β),   ∀β

≡ Rayleigh probability density function.

Figure 2.12:

¹² f( ) is a monotonically increasing function from 0 to 1.
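A simple Monte Carlo experiment illustrates example 2.4.5: transforming uniform samples through y = √(−2 ln(1 − x)) should produce a histogram that matches the Rayleigh density β e^{−β²/2} u(β). The Python sketch below is only an illustration of the theorem; the notes themselves obtain the result analytically.

    # Monte Carlo illustration of example 2.4.5: y = sqrt(-2 ln(1 - x)), x uniform(0, 1),
    # should be Rayleigh distributed with density beta * exp(-beta^2 / 2), beta >= 0.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=200_000)
    y = np.sqrt(-2.0 * np.log(1.0 - x))

    hist, edges = np.histogram(y, bins=50, range=(0.0, 4.0), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    rayleigh = centers * np.exp(-centers**2 / 2.0)
    print(np.max(np.abs(hist - rayleigh)))   # small residual (histogram vs. density only)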


Example 2.4.6. Consider the random variable −10 ≤ y ≤ 10 defined by the transformation y = 10 sin(π(x − 1/2)) in which x is a random variable uniformly distributed between 0 and 1. Then

    x = (1/π) arcsin(y/10) + 1/2 = f(y)

and from the sketch shown in figure 2.13, f( ) is clearly seen to be a valid cumulative distribution function. Therefore:

    py(β) = f′(β) = 1/(π √(100 − β²)),   −10 ≤ β ≤ 10

Figure 2.13:

Corollary. If x, y are two random variables uniformly distributed between 0 and 1 such that pxy(α, β) = px(α) py(β), then the random variable z defined by the transformation

    z = sin(π(x − 1/2)) √(−2 ln(y))

is a Gaussian random variable.

Proof. Recognizing that both y and 1 − y are uniformly distributed between 0 and 1, the result follows from the previous two examples and problem 2.19 in W & J.

A uniformly distributed random variable is often readily available on a computer system and we next show how theorem 13 can be used to transform this random variable into another random variable y having any desired probability density function py(β). We first calculate the corresponding desired cumulative distribution function Fy(β) of the random variable y. Next we define a function f^{−1} : [0, 1] → R as follows. Over the range of input values where the function Fy( ) is invertible, we let f^{−1}( ) = Fy^{−1}( ). If Fy( ) has a step at β0 then we define f^{−1}(α) = β0 for every α ∈ [Fy(β0−), Fy(β0)]. If Fy( ) is constant over the semi-open interval [β0, β1[ then f^{−1}(Fy(β0)) = β0. Finally f^{−1}(0) ≡ largest β such that Fy(β) = 0. The function f^{−1}( ) is well-defined over the entire range [0, 1]. The probability density function of the random variable defined by the transformation y = f^{−1}(x), where x is a random variable uniformly distributed in the interval from 0 to 1, is the desired probability density function py(β). This is illustrated in the example that follows.

Example 2.4.7. We want to generate a random variable y with the probability density function shown in figure 2.14. Clearly this is a valid probability density function and by theorem 7 the random variable y exists. We propose to apply a transformation of random variable on a random variable x which is uniformly distributed between 0 and 1.

Figure 2.14:

Step 1:

Calculate the cumulative distribution function Fy (β).

Case 1 (0 < β < 1).

    Fy(β) = ∫ py(β) dβ = ∫ β dβ = β²/2 + K

We must have Fy(0) = 0 ⇒ K = 0. Therefore:

    Fy(β) = β²/2  ⇒  Fy(1) = 1/2 = Fy(2).

Case 2 (2 < β < 3).

    Fy(β) = ∫ py(β) dβ = ∫ (1/2) dβ = β/2 + K

We must have Fy(2) = 1/2 ⇒ 2/2 + K = 1/2 ⇒ K = −1/2. Therefore

    Fy(β) = (β − 1)/2  ⇒  Fy(3) = 1.

Fy(β) remains constant at 1 for any β ≥ 3. In summary we have:

    Fy(β) = 0            ; β ≤ 0
          = β²/2         ; 0 ≤ β ≤ 1
          = 1/2          ; 1 ≤ β ≤ 2
          = (β − 1)/2    ; 2 ≤ β ≤ 3
          = 1            ; β ≥ 3

For the binary decision problem of §2.9.2 the optimum decision rule is

    m̂ : ρ ↦ m0  ⇔  ρ ≥ (s0 + s1)/2 + (σ²/(s0 − s1)) ln( P({m1})/P({m0}) ) ≜ a,

assuming s0 > s1; otherwise the inequality is reversed in the last expression. If moreover P({m0}) = P({m1}) = 1/2, the optimum decision rule naturally becomes:

    m̂ : ρ ↦ m0  ⇔  ρ ≥ (s0 + s1)/2

2.9.3

Performance of the optimal receiver for a channel with additive Gaussian noise

The performance of the receiver is measured by its probability of error (same as the communication example of §2.2):

    P(E) = 1 − P(C) = P(E|{m0}) P({m0}) + P(E|{m1}) P({m1})

Assuming WLOG¹⁶ that s0 > s1 (similarly to §2.9.2), we obtain:

    P(E|{m0}) = P(r ∈ I1 |{m0}) = P(r < a |{m0})
              = P(s0 + n < a |{m0}) = P(n < a − s0 |{m0}) = P(n < a − s0)
              = Q( (0 − (a − s0))/σ ) = Q( (s0 − a)/σ )

Similarly we find

    P(E|{m1}) = Q( (a − s1)/σ )

¹⁶ without loss of generality

and it follows that

    P(E) = P({m0}) Q( (s0 − a)/σ ) + P({m1}) Q( (a − s1)/σ )

If moreover P({m0}) = P({m1}) = 1/2, then a = (s0 + s1)/2, (s0 − a)/σ = (a − s1)/σ = (s0 − s1)/(2σ), and the expression for the probability of error simplifies to P(E) = Q((s0 − s1)/(2σ)).
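Since Q(x) = (1/2) erfc(x/√2), the probability of error for equally likely messages is easy to evaluate numerically. The short Python sketch below is an illustration only; the signal values and σ are arbitrary choices, not taken from W&J.

    # P(E) = Q((s0 - s1)/(2*sigma)) for equally likely messages, with Q(x) = 0.5*erfc(x/sqrt(2)).
    from math import erfc, sqrt

    def Q(x):
        return 0.5 * erfc(x / sqrt(2))

    s0, s1, sigma = 1.0, -1.0, 0.5        # arbitrary illustrative values
    print(Q((s0 - s1) / (2 * sigma)))     # Q(2) ~= 0.0228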

Theorem 18. For any a priori probabilities P({m0}), P({m1}) and corresponding optimal threshold a = (s0 + s1)/2 + (σ²/(s0 − s1)) ln(P({m1})/P({m0})) we have:

    Q((s0 − s1)/(2σ))  ≥  P({m0}) Q((s0 − a)/σ) + P({m1}) Q((a − s1)/σ)

where the left-hand side is a useful upper bound.

Proof. Refer to the last paragraph of page 81 and top of page 82 in W & J.

The significance of the upper bound is that it is possible to design a receiver for which the probability of error is no larger than Q((s0 − s1)/(2σ)). It also follows that the probability of error of an optimal receiver is largest when the messages are equiprobable, i.e. when the source is most unpredictable. The probability of error of an optimal receiver can be made arbitrarily small by making the messages unequally likely. For example a source with a priori message probabilities P({m0}) = 1, P({m1}) = 0 leads to a trivial receiver that always guesses message m0; no information is however communicated from such a trivial source.

Theorem 19. Let Eb denote the maximum allowable transmitted energy per message. The optimal receiver will achieve P(E) ≤ Q(√(2 Eb/(2σ²))).

Definition 2.9.2. The ratio Eb/(2σ²) is the signal-to-noise ratio in the system.

Remarks.
1. W&J uses Eb/σ² as the definition of the signal-to-noise ratio. We will later see why it is preferable to use Eb/(2σ²); this is not important for now. The probability of error is plotted as a function of the signal-to-noise ratio in figure 2.25. In the worst case (low signal-to-noise ratio), the probability of error is 0.5, i.e. equivalent to guessing between m0 and m1 without even looking at the channel output r. For large signal-to-noise ratios the probability of error becomes smaller but is never 0. It would therefore seem that we cannot guarantee the correct transmission of an arbitrarily long sequence; this was indeed the belief of engineers until the work of Shannon in 1948.

Figure 2.25:

2. This example illustrates the relationship between the various factors affecting the error performance of a Continuous Communication Channel:
• addition of noise on the channel and in the receiver,
• randomness of the source,
• maximum transmitted energy per message,
• choice of signals.
In chapter 5 (W&J) other factors are added to the list, such as bandwidth of the channel, rate of transmission, . . .

2.10

Expected Value

Refer to Wozencraft & Jacobs pp 84 - 94 for the details. The probability density function px (α) of a random variable x is required in order to make probability calculations of events involving x. Obtaining the probability density function of a random variable may however be difficult, especially when modelling a physical random experiment. Moments, when they exist, are numerical values that characterize a probability density function and they are calculated using an operator called expected value. They sometimes have a physical meaning and may be easy to calculate and/or measure. In many situations the probability density function is obtained by first finding its moments. Definition 2.10.1. Let x denote a random variable with probability density

function px(α) and n > 0 an integer.

1. The n-th moment of px(α) is given by:

    ∫_{−∞}^{∞} αⁿ px(α) dα

when it exists (i.e. when the integral converges).

2. The expected value of x, denoted by E(x) or by x̄, is the first moment of its probability density function:

    E(x) ≜ ∫_{−∞}^{∞} α px(α) dα

when it exists; E(x) is also called the mean of x.

For a random variable x = g(y) defined in terms of a transformation of random variables, E(x) = E(g(y)) is the first moment of the probability density function px(α), which must first be calculated using py(β) and the function g( ) following the techniques detailed in §2.4. The fundamental theorem of expectation that follows shows that the first moment of px(α), and more generally any moment of px(α), can be calculated directly from the probability density function py(β) and the function g( ).

Theorem 20 (Fundamental theorem of expectation (W&J, page 85)). Let y be a k-dimensional random vector with joint probability density function py(β) and define the random variable x by the transformation x = g(y) where g( ) is a function from R^k into R. Then

    E[x] = E[g(y)] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(β) py(β) dβ        (2.16)

(refer to equation W&J (2.126b)).

Remark. It is usually faster to calculate E[x] by equation (2.16) rather than deriving px(α) from py(β) and g : R^k → R, and then calculating ∫_{−∞}^{∞} α px(α) dα.
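As a small numerical illustration of equation (2.16), the Python sketch below uses an arbitrarily chosen g and py (not an example from W&J): y uniform on [0, 1] and g(y) = y², for which E[g(y)] = 1/3 whether computed from py directly or estimated by averaging samples of x = g(y).

    # E[g(y)] = integral g(beta) p_y(beta) d beta, checked against a Monte Carlo average.
    import numpy as np

    g = lambda beta: beta**2
    beta = np.linspace(0.0, 1.0, 100_001)
    p_y = np.ones_like(beta)                              # y uniform on [0, 1]
    print(np.trapz(g(beta) * p_y, beta))                  # 1/3 by equation (2.16)

    rng = np.random.default_rng(1)
    print(np.mean(g(rng.uniform(0.0, 1.0, 1_000_000))))   # ~ 1/3, first moment of x = g(y)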

Sketch of proof. The outline of the proof can be represented as in figure 2.26.

It then follows that:

    E[x] = ∫_{−∞}^{∞} α px(α) dα
         ≈ Σ_{i=−∞}^{∞} ai P({ω | x(ω) ∈ Ii})
         = Σ_{i=−∞}^{∞} ai P({ω | g(y(ω)) ∈ Ii})
         = Σ_{i=−∞}^{∞} g(β_i) P({ω | y(ω) ∈ Bi})
         ≈ ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(β) py(β) dβ

if we assume that all the Ii's and Bi's are very small.

Figure 2.26: Proof of the fundamental theorem of expectation (g maps the sets Bi in the domain of y onto the intervals Ii in the domain of x)

In light of the theorem of expectation we have

    ∫_{−∞}^{∞} αⁿ px(α) dα = E(xⁿ)

and the n-th moment of the probability density function px(α) is then simply the expected value of the random variable xⁿ. From now on we naturally extend the


definition of moment of a probability density function to moment of a random variable: the n-th moment of x is simply the n-th moment of the probability density function of x.

Proposition 21 (Properties of expected value, W&J, pp 87 - 88). Let x, y be two random variables with mean values E(x), E(y) respectively. Then:
1. E(ax + by) = a E(x) + b E(y), ∀a, b ∈ R.
2. x, y statistically independent ⇒ E(xy) = E(x) E(y).

Remark. The converse of property (2) is in general not true.

Definition 2.10.2 (W&J, p 88). Let x, y be random variables with probability density functions px(α), py(β) respectively.
1. Whenever finite, ∫_{−∞}^{∞} (α − x̄)ⁿ px(α) dα = E[(x − x̄)ⁿ] is called the n-th central moment of x.
2. The second central moment (if it exists), denoted by σx² or Var(x), is called the variance of x.
3. The standard deviation σx is the square root of the variance (if it exists).

Remark (W&J, p 89). The first and second moments can respectively be interpreted as center of gravity and moment of inertia.

The following properties are easy to prove:

Proposition 22 (Properties of 2nd moments).
1. E(x²) = σx² + x̄².
2. z = ax + by in which x, y are statistically independent ⇒ σz² = a²σx² + b²σy².
3. z = ax ⇒ σz² = a²σx².

Definition 2.10.3. A complex random variable is a mapping:

    w : Ω → C
    w : ω ↦ x(ω) + j y(ω)

where x : Ω → R, y : Ω → R are random variables. The complex random variable w is completely described by the joint probability density function px,y(α, β) of x and y. Using the fundamental theorem of expectation we naturally extend the definition of expected value to complex random variables:

    E(w) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (α + jβ) px,y(α, β) dα dβ = x̄ + j ȳ


We obtain a new interpretation of the characteristic function of a random variable x:

    Mx(ν) = ∫_{−∞}^{∞} px(α) e^{jνα} dα = E[e^{jνx}]

(refer to W&J, equation (2.136)). This, together with statement 4 of section 2.8 on page 49 of these notes, leads to a one-line proof that “if z = x + y and x, y are statistically independent, then Mz(ν) = Mx(ν) My(ν).” We also have:

Theorem 23. Let px(α) and Mx(ν) respectively denote the probability density function and characteristic function of a random variable x of which all the moments are finite.

1. Mx^{(n)}(0) = (jⁿ) E[xⁿ], where Mx^{(n)}(0) denotes the n-th derivative of Mx(ν) evaluated at ν = 0.
2. The moments completely specify the probability density function px(α).

Sketch of proof.
1. Refer to W & J.
2. The Taylor series of e^{jνx} (around ν = 0) is:

    e^{jνx} = Σ_{n=0}^{∞} (jνx)ⁿ/n!

It follows that Mx(ν) = E[e^{jνx}] = Σ_{n=0}^{∞} ((jν)ⁿ/n!) E[xⁿ]. Finally, px(α) = (1/(2π)) ∫_{−∞}^{∞} Mx(ν) e^{−jνα} dν.

2.10.1

Moments of a Gaussian random variable

Definition 2.10.4. A random variable x is Gaussian if its characteristic function is of the form

    Mx(ν) = e^{jνµ} e^{−ν²σ²/2}

where µ, σ² are constants in R. We then find (using MAPLE or another math package):

    Mx^{(1)}(ν) = e^{jνµ} e^{−ν²σ²/2} (jµ − νσ²)
    Mx^{(2)}(ν) = −e^{jνµ} e^{−ν²σ²/2} (µ² + 2jµσ²ν + σ² − ν²σ⁴)

Substituting ν = 0 we then have:

    Mx^{(1)}(0) = jµ
    Mx^{(2)}(0) = −µ² − σ²

Finally we obtain:

    x̄ = Mx^{(1)}(0)/j = µ
    E[x²] = Mx^{(2)}(0)/j² = µ² + σ²   ⇒   σx² = σ²

This shows that the parameters µ and σ² respectively represent E[x] and E[(x − x̄)²] = Var(x), and we consequently write the characteristic function of a Gaussian random variable as Mx(ν) = e^{jν x̄} e^{−ν²σx²/2}.

Theorem 24. The general form of the Gaussian probability density function is:

    px(α) = (1/√(2πσx²)) e^{−(α − x̄)²/(2σx²)}

Proof. Calculate F^{−1}[Mx(−ν)] with the above Mx(ν) using MAPLE, another math package or a table of Fourier transform pairs.

Theorem 25.
1. If x is a Gaussian random variable, then:

    E[(x − x̄)ⁿ] = 0                              ; n odd
                 = n! σxⁿ / (2^{n/2} (n/2)!)     ; n even

2. The sum of N statistically independent Gaussian random variables is also Gaussian.

Sketch of proof.
1. Let y = x − x̄. Then y is a Gaussian random variable with ȳ = 0 and σy² = σx² (easy to prove). It follows that My(ν) = e^{−ν²σx²/2}. From theorem 23 on page 63 we obtain (using MAPLE or another math package):

    ȳ     = −j My^{(1)}(0) = 0            E[y²]  = −My^{(2)}(0)  = σx²
    E[y³] =  j My^{(3)}(0) = 0            E[y⁴]  =  My^{(4)}(0)  = 3(σx²)²
    E[y⁵] = −j My^{(5)}(0) = 0            E[y⁶]  = −My^{(6)}(0)  = 15(σx²)³
    E[y⁷] =  j My^{(7)}(0) = 0            E[y⁸]  =  My^{(8)}(0)  = 105(σx²)⁴
    E[y⁹] = −j My^{(9)}(0) = 0            E[y¹⁰] = −My^{(10)}(0) = 945(σx²)⁵
    ...

2. Refer to W&J page 93.
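Theorem 25(1) can also be checked symbolically. The sympy sketch below (an illustration in Python; the notes themselves use MAPLE) differentiates My(ν) = e^{−ν²σ²/2} and compares My^{(n)}(0)/jⁿ with n!σⁿ/(2^{n/2}(n/2)!) for even n, reproducing the values 3(σ²)², 15(σ²)³, 105(σ²)⁴, . . . listed above.

    # Symbolic check of the even central moments of a Gaussian random variable.
    import sympy as sp

    nu, sigma = sp.symbols('nu sigma', positive=True)
    M = sp.exp(-nu**2 * sigma**2 / 2)        # characteristic function of y = x - xbar

    for n in range(2, 11, 2):
        moment = sp.simplify(sp.diff(M, nu, n).subs(nu, 0) / sp.I**n)
        closed = sp.factorial(n) * sigma**n / (2**(n // 2) * sp.factorial(n // 2))
        print(n, moment, sp.simplify(moment - closed))   # the difference is 0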

2.10.2

Moments of some specific random variables

1. Exponential probability density function: c > 0

    px(α) = (1/c) e^{−α/c} u(α)

where u( ) denotes the unit step function. We then have

    E[x] = c,   E[x²] = 2c²,   E[(x − x̄)²] = c² = σx²

To obtain the characteristic function, we start with the Fourier transform pair:

    e^{−at} u(t)  ←→  1/(a + jω)

It follows that:

    px(α) = (1/c) e^{−α/c} u(α)  ←→  Px(ω) = (1/c)/(1/c + jω) = 1/(1 + jcω)

Finally

    Mx(ν) = Px(−ν) = 1/(1 − jcν)

We easily find:

    x̄ = Mx′(0)/j = c,   E[x²] = Mx′′(0)/j² = 2c²

as already established.

2. Rayleigh probability density function: b > 0

    px(α) = (2α/b) e^{−α²/b} u(α)

where u( ) denotes the unit step function.¹⁷ We then have

    E[x] = (1/2)√(bπ),   E[x²] = b,   E[(x − x̄)²] = b(1 − π/4) = σx² ≈ 0.2146 b

¹⁷ Equation Wozencraft & Jacobs (7.75b) is not quite the same as equation Wozencraft & Jacobs (2.46a), but it is equivalent.

The characteristic function is given by:

    Mx(ν) = ∫_{−∞}^{∞} px(α) e^{jνα} dα
          = ∫_{0}^{∞} (2α/b) e^{−α²/b} e^{jνα} dα
          = ∫_{0}^{∞} (2α/b) e^{−α²/b + jνα} dα
          = 1 + (jν√(bπ)/2) e^{−ν²b/4} (1 + erf(jν√b/2))

where the last equality is obtained by MAPLE. We easily find (using MAPLE again):

    x̄ = Mx′(0)/j = √(bπ)/2,   E[x²] = Mx′′(0)/j² = b

as already established.

3. Uniform probability density function: b > a

    px(α) = 1/(b − a)   ; a ≤ α ≤ b
          = 0           ; elsewhere

We then have:

    E[x] = (b + a)/2,   E[x²] = (a² + b² + ab)/3,   E[(x − x̄)²] = (b − a)²/12 = σx²

The characteristic function is given by:

    Mx(ν) = ∫_{a}^{b} e^{jνα}/(b − a) dα = (1/(a − b)) (j/ν) (e^{jνb} − e^{jνa})

where the last equality is obtained by MAPLE. Multiplying the last equation by e^{−jν(b+a)/2} e^{jν(b+a)/2} we obtain:

    Mx(ν) = (e^{jν(b+a)/2}/(b − a)) (1/(jν)) (e^{jν(b−a)/2} − e^{−jν(b−a)/2})
          = (e^{jν(b+a)/2}/(b − a)) (2/ν) sin(ν(b − a)/2)
          = e^{jν(b+a)/2} sin(πν(b − a)/(2π)) / (πν(b − a)/(2π))
          = e^{jν(b+a)/2} sinc(ν(b − a)/(2π))

We easily find (using MAPLE):

    x̄ = Mx′(0)/j = (a + b)/2,   E[x²] = Mx′′(0)/j² = (a² + ab + b²)/3

as already established.

4. Cauchy probability density function: b > 0

    px(α) = (b/π)/(b² + α²)

We then have:

    E[x]: does not exist,   E[x²]: does not exist,   E[(x − x̄)²]: does not exist

The characteristic function is given by:

    Mx(ν) = e^{−b|ν|}

(refer to problem 2.35 in W&J). The characteristic function is not differentiable at ν = 0, as expected, since the first and second moments (in fact any moment) do not exist.

5. Binomial probability density function: N ∈ N*, 0 < p < 1

    px(α) = Σ_{k=0}^{N} (N choose k) p^k (1 − p)^{N−k} δ(α − k)

We then have

    E[x] = Np,   E[x²] = Np(Np + 1 − p),   E[(x − x̄)²] = Np(1 − p) = σx²

The characteristic function is given by:

    Mx(ν) = E[e^{jνx}] = Σ_{k=0}^{N} e^{jνk} (N choose k) p^k (1 − p)^{N−k} = (1 − p + p e^{jν})^N

where the last equality is obtained by MAPLE. We easily find:

    x̄ = Mx′(0)/j = Np,   E[x²] = Mx′′(0)/j² = p²N² + Np − p²N

as already established.

6. Poisson probability density function: ρ > 0

    px(α) = Σ_{k=0}^{∞} (ρ^k/k!) e^{−ρ} δ(α − k)

We then have

    E[x] = ρ,   E[x²] = ρ(ρ + 1),   E[(x − x̄)²] = ρ = σx²

Refer to problem 2.34 in W&J for the calculation of the characteristic function. We find:

    Mx(ν) = e^{ρ(e^{jν} − 1)} = e^{ρ(cos(ν) − 1)} e^{jρ sin(ν)}

We easily find:

    x̄ = Mx′(0)/j = ρ,   E[x²] = Mx′′(0)/j² = ρ² + ρ

as already established. The probability density function px (α) of a random variable x leads to the probability calculation of events involving x. Through the E( ) operator we also obtain the moments (when they exist) and the characteristic function Mx (ν). The moments and/or the characteristic function may conversely be used to perform probability calculations of events on x by first recovering the probability density function px (α). In the next section we see that the moments can directly be used (i.e. without first calculating px (α)) to approximate some probabilities; Chebyshev’s inequality (lemma 26) is one such example.
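A small numerical cross-check of two entries in the list above can be done by integrating αⁿ px(α) directly; the Python sketch below is only an illustration (the notes use MAPLE), with arbitrary parameter values.

    # Cross-check of the Rayleigh and uniform moments by numerical integration.
    import numpy as np
    from scipy.integrate import quad

    b = 2.0                                     # Rayleigh parameter (arbitrary choice)
    rayleigh = lambda a: (2 * a / b) * np.exp(-a**2 / b)
    m1, _ = quad(lambda a: a * rayleigh(a), 0, np.inf)
    m2, _ = quad(lambda a: a**2 * rayleigh(a), 0, np.inf)
    print(m1, 0.5 * np.sqrt(b * np.pi))         # E[x]   = (1/2) sqrt(b*pi)
    print(m2, b)                                # E[x^2] = b
    print(m2 - m1**2, b * (1 - np.pi / 4))      # Var(x) = b(1 - pi/4)

    lo, hi = -1.0, 3.0                          # uniform on [a, b] with a = -1, b = 3
    uni = lambda a: 1.0 / (hi - lo)
    u1, _ = quad(lambda a: a * uni(a), lo, hi)
    u2, _ = quad(lambda a: a**2 * uni(a), lo, hi)
    print(u1, (lo + hi) / 2)                    # E[x]   = (a+b)/2
    print(u2, (lo**2 + hi**2 + lo * hi) / 3)    # E[x^2] = (a^2+b^2+ab)/3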

2.11

Limit Theorems

Refer to Wozencraft & Jacobs, pages 94 to 111.

Motivating problem: You walk into a classroom of 100+ students you have never seen before. The students just received a test for which the class average is 75% and the standard deviation is 10 (i.e. the variance is 100). For this game, you are asked to pick a group of N students at once and you win if this group's average is within 10 points of the class average, i.e. between 65% and 85%. The prize is $100. Ideally, you could take the whole class and you would certainly win, but for each of the N students you pick, you have to pay $1; thus, taking the whole class would cost you more than you would win. How many students should you take to be 95% certain of winning, i.e. a 95% chance of making a net 100 − N dollars and a 5% chance of losing N dollars?

Answer: With the weak law of large numbers we will find N = 20. That means you have at least a 95% chance of making $80 and at most a 5% chance of losing $20. With the Central Limit Theorem we will find that N > 3.84, i.e. 4 students may be sufficient.¹⁸

Definition 2.11.1. Let xi, i = 1, 2, . . . , N denote N statistically independent identically distributed random variables. The random variable m defined by:

    m = (1/N) Σ_{i=1}^{N} xi

is called the sample mean of x1, x2, . . . , xN.

We first note that

    E(m) = x̄i = x̄,    Var(m) = σ²_{xi}/N = σx²/N

In this section we study some formal probability statements, referred to as limit theorems, about the sample mean. In particular we will see that it is fair to assume that m is close to E(x) as N becomes large.

Remark. If xi is a random variable defined from the i-th trial of a compound experiment, then the sample mean (1/N) Σ_{i=1}^{N} xi is a statistical estimate of the expected value x̄. The definition of the sample mean does not require that the xi be statistically independent repeated trials of the same experiment, but in practice this is often the case.

¹⁸ This assumes that the sample mean with only 4 random variables has an approximately Gaussian density function.


Lemma 26 (Chebyshev's inequality, W&J, page 95). Let y be a random variable with finite variance σy². Then for any positive ϵ we have P(|y − ȳ| ≥ ϵ) ≤ σy²/ϵ².

Proof. We have

    σy² = ∫_{−∞}^{∞} (α − ȳ)² py(α) dα          (2.17)
        = ∫_{−∞}^{∞} β² py(β + ȳ) dβ            (2.18)
        ≥ ∫_{|β|≥ϵ} β² py(β + ȳ) dβ
        ≥ ϵ² ∫_{|β|≥ϵ} py(β + ȳ) dβ             (2.19)
        = ϵ² ∫_{|α−ȳ|≥ϵ} py(α) dα               (2.20)
        = ϵ² P(|y − ȳ| ≥ ϵ)

Equation (2.18) follows from (2.17) using the change of variable β = α − ȳ, and equation (2.20) follows from (2.19) using the change of variable α = β + ȳ.

In concrete words the lemma says that “The variance is a measure of the randomness of a random variable”.¹⁹

2.11.1

Weak law of large numbers

An immediate application of Chebyshev's inequality is that “the probability that the sample mean differs from the true mean by more than ϵ > 0 approaches 0 as N → ∞”.²⁰ This is the first of our limit theorems and it is formally stated in theorem 27.

Theorem 27 (Weak law of large numbers, W&J, pp 96 - 97). Let x1, x2, . . . , xN be statistically independent identically distributed random variables with mean x̄ and variance σx². Then ∀ϵ > 0 we have

    P( |(1/N) Σ_{i=1}^{N} xi − x̄| ≥ ϵ ) ≤ σx²/(Nϵ²).        (2.21)

Proof. Easy; follows immediately from Chebyshev's inequality.

In the last two paragraphs of The Weak Law of Large Numbers, Wozencraft & Jacobs present a very important special case that confirms the validity of the estimation of the probability of an event A by Monte Carlo simulations.

¹⁹ W&J, page 94
²⁰ W&J, page 96

71

This is what we meant in item #2 described under Relation of the Model to the Real World on page 6 of these notes. This also leads to a (conservative) bound on the confidence intervals given a certain number of repetitions.21

2.11.2

Central Limit Theorem

Theorem 28 (Central Limit Theorem, W&J, pp 106 - 111). Let y1 , y2 , . . . , yN denote N statistically independent, identically distributed random variables with mean y¯ and (finite) variance σy2 . We define the random variable: N √ 1 ∑ z=√ (yi − y¯) = N (m − y¯) N i=1

and denote by Fz (α) the probability distribution function of z. Then: ∫ α 2 2 1 √ lim Fz (α) = e−β /2σy dβ N →∞ 2πσy −∞ In concrete terms, the distribution function of the sample mean becomes Gaussian as N increases. Remarks. (pp 108 - 109): 1. The result does not imply that pz (α) is Gaussian (refer to Wozencraft&Jacobs, page 108 for a very good discussion). 2. As N becomes large the shape of the distribution function of the sum approaches the shape of the Gaussian distribution function. “Large N ” depends on the shape of the density function of each yi . Obviously if the yi ’s are all Gaussian to start with, then the result is true for any N . In fact if the yi are all Gaussian they don’t need to be statistically independent, nor do they need to have the same variance nor the same mean. This general result will be shown in chapter 3 (W&J),22 but the special case of two jointly Gaussian random variables of equal unity variance and equal zero-mean can be shown using equations (2.87) (W&J) and (2.58) (W&J). This is useful, if not required to solve problems 2.25 (W&J) and 3.2(c) (W&J). 3. The result explains why we often model the noise in a system with a Gaussian probability density function (more on this in problems 3.15, 3.16). 4. The result finds applications similar to the weak law of large number: ( ) ( ) N 1 ∑ a P √ (yi − y¯) ≥ a ≈ Q σy N i=1 when N is “large” and |a/σy | is “relatively small”. 21 W&J,

page 97. 4 on pages 166 - 168

22 property

72

CHAPTER 2. PROBABILITY THEORY 5. The approximation: (( P

) ( ) ) N N 1 ∑ 1 ∑ yi − y¯ ≥ ϵ = P (yi − y¯) ≥ ϵ N i=1 N i=1 ( ) N √ 1 ∑ = P √ (yi − y¯) ≥ ϵ N N i=1 ( √ ) ϵ N ≈ Q (2.22) σy

is not normally used. The Chernoff bound (§2.11.3) is much tighter for large values of N and is also much tighter than the weak law of large numbers. The above approximation is nonetheless sometimes used to estimate the confidence intervals of simulation results, because it is easier to use than the Chernoff bound. Sketch of proof. Assume y¯ = 0: (1) W&J equation (2.174) [ ( )]N N 1 ∑ ν z=√ yi ⇒ Mz (ν) = My √ N i=1 N (2) W&J equation (2.175b) [ ( )] ν ln [Mz (ν)] = N ln My √ N (3) Using the Taylor series of My (ν) with y¯ = 0, as in the proof of part 2 of theorem 23, we obtain (W&J equation (2.176)): My (ν) = 1 −

ν 2 σy2 + ν 3 f (ν) |{z} 2 goes to

−jy 3 3!

as ν → 0

and it follows that: ( My

ν √ N

)

ν 2 σy2 ν3 =1− + 3/2 f 2N N |

(

) ν √ N {z }

goes to

−jy 3 3!

as N → ∞

2.11. LIMIT THEOREMS

73

(2) & (3) together with the Taylor series of ln(1 + w) (refer to equations (2.176), (2.177) in W&J):  ln [Mz (ν)]



[  ( )]   ν 2 σy2 ν3 ν   = N ln 1 − − 3/2 f √  2N N N   | {z } −w

= N ln(1 + w) N w3 N w2 + − ··· = Nw − 2 3 ≈ Nw if |w| ≪ 1

when |w| < 1

* The latter condition |w| ≪ 1 is achieved when N is large enough. We then have:

ln (Mz (ν)) ≈ N w =

−ν 2 σy2 ν3 +√ f 2 N

(

ν √ N

)

** If N is even larger, then ln [Mz (ν)] = − 12 ν 2 σy2 , from which we obtain Mz (ν) = e− 2 ν 1

2

σy2

In concrete words the above says that if My (ν) is the characteristic function ∑N of y1 , y2 , ..., yN , then the characteristic function of z = √1N i=1 yi is Mz (ν) = √ 1 2 2 My (ν/ N )N . When N becomes large enough this approaches e− 2 ν σy . This is illustrated in the following example. Example 2.11.1. Let y be a uniformly distributed random variable with probability density function sketched in figure 2.27. The characteristic function of y is given by: sin(ν/2) My (ν) = ν/2 and σy2 = V ar(y) = 1/12 (refer to §2.10.2). The plots shown in figure 2.28 √ 1 2 2 illustrate that MyN (ν/ N ) approaches e− 2 ν σy as N increases.

2.11.3

Chernoff Bound

Refer to W&J, pp 97 - 106.

74

CHAPTER 2. PROBABILITY THEORY py    1





1 2

1 2

Figure 2.27: Probability density function of y in example 2.11.1.

Theorem 29. Let x1 , x2 , . . . , xn be statistically independent, identically distributed random variables and let x denote any one of the xi ’s. If E[xeλ0 x ] and E[eλ0 x ] are both finite then:  ( ∑ ) N 1  if d > x ¯   P N i=1 xi ≥ d E[eλ0 (x−d) ]N ≥

( )    P 1 ∑N xi ≤ d i=1 N

if d < x ¯

where λ0 is given implicitly by the equation: E[xeλ0 x ] =d E[eλ0 x ] Remarks (W&J, page 101). 1. The conditions of the Chernoff bound are always verified if the xi ’s are discrete random variables taking only a finite number of values. 2.

d>x ¯ ⇒

λ0 ≥ 0

d x ¯. Then we have P (m ≥ 1/2) ≤ E[eλ0 (x−1/2) ]N where λ0 is obtained by solving the equation: E[xeλ0 x ] 1 = λ x 0 E[e ] 2 We find: E[xeλ0 x ] = = E[eλ0 x ] = = Therefore

(eλ0 −e−λ0 ) (eλ0 +e−λ0 )

=

1 2



(1/2)eλ0 + (1/2)(−e−λ0 ) (eλ0 − e−λ0 ) 2 (1/2)eλ0 + (1/2)(e−λ0 ) (eλ0 + e−λ0 ) 2 √ λ0 = ln( 3) . It follows that

E[eλ0 (x−1/2) ] = (1/2)eλ0 (1−1/2) + (1/2)eλ0 (−1−1/2) = =

eλ0 /2 + e−3λ0 /2 2 2 33/4

2.11. LIMIT THEOREMS

77

Finally we obtain: ( P (m ≥ 1/2) ≤

2

)N ≈ 0.877383N

33/4

(b) Let d = −1/2 < x ¯. Then we have P (m ≤ −1/2) ≤ E[eλ0 (x+1/2) ]N where λ0 is obtained by solving the equation: E[xeλ0 x ] 1 =− λ x 0 E[e ] 2 We find (as before): (1/2)eλ0 + (1/2)(−e−λ0 ) (eλ0 − e−λ0 ) = 2 λ0 x E[e ] = (1/2)eλ0 + (1/2)(e−λ0 ) (eλ0 + e−λ0 ) = 2

E[xeλ0 x ] =

Therefore

(eλ0 −e−λ0 ) (eλ0 +e−λ0 )

E[eλ0 (x+1/2) ]

√ = − 12 ⇒ λ0 = − ln( 3) . It follows that = (1/2)eλ0 (1+1/2) + (1/2)eλ0 (−1+1/2) = =

e3λ0 /2 + e−λ0 /2 2 2 same as before 33/4

Finally we obtain: ( P (m ≤ 1/2) ≤

2 33/4

)N ≈ 0.877383N

The events {m ≤ −1/2} and {m ≥ 1/2} are disjoint. Consequently P (|m| ≥ 1/2) =

P (m ≥ 1/2) + P (m ≤ −1/2) ( )N 2 = 2 ≈ 2 × 0.877383N 33/4

Once again we see that the probability goes to 0 as N → ∞.
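The three quantities compared in the summary and in figure 2.29 below — the exact probability, the weak-law bound 4/N and the Chernoff bound 2(2/3^{3/4})^N — can be recomputed directly; the Python sketch below is only an illustrative re-computation for the fair-coin example with xi = ±1.

    # Exact P(|m| >= 1/2), weak-law bound 4/N, and Chernoff bound 2*(2/3**0.75)**N,
    # for a fair coin thrown N times (N a multiple of 4).
    from math import comb

    def exact(N):        # 2 * P(at least 3N/4 heads) = (1/2**(N-1)) * sum_{k=3N/4}^{N} C(N,k)
        return sum(comb(N, k) for k in range(3 * N // 4, N + 1)) / 2 ** (N - 1)

    for N in (4, 8, 16, 32, 64):
        print(N, exact(N), 4 / N, 2 * (2 / 3 ** 0.75) ** N)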

78

CHAPTER 2. PROBABILITY THEORY In summary: 1. Exact probability value:

1 2N −1

(N )

∑N k=3N/4

k

2. Weak Law of Large Numbers (upper) bound:

4 N

3. Chernoff (upper) bound: 2 × 0.877383N The graphs shown in figures 2.29 illustrate that the probability goes to 0 as N → ∞. The weak law of large numbers bound becomes looser as N increases. Whenever possible the Chernoff bound should be used because it is the tightest exponential bound. Example 2.11.3 (W&J, pp 102 - 106). As a second application of the Chernoff bound, consider the compound experiment of problem 2.10(c). In order to transmit information on the binary symmetric channel of transition probability p = 1/6, a symbol is repeatedly transmitted N times (N odd) and the receiver estimates the transmitted message by taking a majority vote on the received symbols. If messages “0” and “1” are equally likely the exact probability of error P (E ) is given by: ( ) N ∑ N P (E ) = (1/6)k (5/6)N −k k N +1 k=

2

In order to better see the behaviour of this probability as N becomes large (N → ∞) we calculate the Chernoff bound of the above probability expression. To be more general we consider the binary symmetric channel of transition probability p where we assume WLOG p < 1/2 and as before each equally likely message (“0” or “1”) is transmitted N times (N odd) and the receiver estimates the message by majority vote on the received symbols. In order to apply the Chernoff bound we rephrase the problem in terms of random variables x1 , x2 , . . . , xN defined by: { 0 ; if the i-th digit is received correctly xi = 1 ; if the i-th digit is not received correctly It is then easy to show that the error event E is equivalent to: E



N ∑

xi ≥

N +1 2

xi ≥

N 2

i=1



N ∑ i=1



N 1 ∑ 1 xi ≥ N i=1 2

2.11. LIMIT THEOREMS

79 (

since by assumption N is odd. Therefore P (E ) = P x ¯ = E[x2 ] = σx2

=

1 N

∑N i=1

xi ≥

1 2

) , where:

(1 − p)(0) + (p)(1) = p  (1 − p)(0)2 + (p)(1)2 = p  p − p = p(1 − p) 2

not required to apply  the Chernoff bound

and x denotes any one of the random variables xi . The Chernoff bound (d = 1/2 > x ¯ = p) then states that: ) ( N [ ]N 1 ∑ 1 ≤ E eλ0 (x−1/2) P xi ≥ N i=1 2 ( ) where λ0 = ln 1−p > 0 since by assumption p < 1/2 (refer W&J, page 103 p for the details). It follows that: [ ] √ √ E eλ0 (x−1/2) = 2p 2(1 − p) √ = 2 p(1 − p) < 1 (since p < 1/2) ( ∑ ) ( √ )N N So P N1 i=1 xi ≥ 12 ≤ 2 p(1 − p) and we immediately see that the probability of error goes to 0 exponentially as N → ∞. The graphs in figure 2.30 show the behaviour of the Chernoff bound in comparison to the exact probability of error (obtained in problem 2.10(c)) when p = 1/6.

80

CHAPTER 2. PROBABILITY THEORY





  



  



  

























N (a) linear (vertical) scale

 

  

 

 

    

  

 

 























N (b) logarithmic (vertical) scale

Figure 2.29: Probability of obtaining 3N/4 times the same face with N throws of a coin

2.11. LIMIT THEOREMS

81

 



  





   















N (a) linear (vertical) scale

 

   





  















N (b) logarithmic (vertical) scale

Figure 2.30: Probability of error as a function of N



82

CHAPTER 2. PROBABILITY THEORY

Additional Topics in Probability Theory

We now focus on the special case of Gaussian random vectors. We give a formal definition and state the properties which will later be most important to the analysis of Gaussian processes. Before considering Gaussian random vectors, we study in more generality the (mixed) moments of random vectors.

2.12

Moments (mixed) of Random Vectors

Let x = (x1 , . . . , xk ) be a random vector. From the theorem of expectation we have ∫ ∞ ∫ ∞ E[g(x)] = ... g(α)px (α)dα . −∞

−∞

In particular we have: ∫ E[xi11 xi22 . . . xikk ] =







... −∞

−∞

α1i1 α2i2 . . . αkik px (α)dα

Definition 2.12.1. 1. If the integrals converge, E[xi11 . . . xikk ] is called n-th mixed moment of x where n = i1 + i2 + . . . + ik . [∏ ] k ij (x − 2. If the integrals converge, E x ) is called n-th mixed central j j j=1 moment of x. 3. The second mixed central moment of two random variables x1 , x2 is called covariance and denoted Cov(x1 , x2 ). Clearly Cov(x1 , x2 ) = E[x1 x2 ] − x1 x2 . In general there will be many different n-th mixed moments of a random vector x. This is illustrated in the example that follows. The practical meaning of mixed moments is to give information about the joint probability density function of a random vector, in applications for which the joint probability 83

84

CHAPTER 2. PROBABILITY THEORY

density function is not known but statistical estimates of the mixed moments are available. Lemma 30. The total number of different n-th moments of a set of v random variables is given by the expression: ) v ( )( ∑ v n−1

where, by convention,

(n) k

k−1

k

k=1

= 0 whenever k > n.

Sketch of proof. This is equivalent to the number of ways in which a line segment of length n can be divided into v or fewer sub-segments (all lengths are integer).
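A brute-force count confirms lemma 30 for small cases. The Python sketch below (an illustration, not part of the notes) counts the monomials x1^{i1} · · · xv^{iv} of total degree n and compares with Σ_{k=1}^{v} (v choose k)(n−1 choose k−1); for v = 2, n = 2 it gives 3, matching example 2.12.1 below.

    # Brute-force check of lemma 30: count the distinct n-th moments of v random variables.
    from math import comb
    from itertools import product

    def brute(v, n):
        return sum(1 for e in product(range(n + 1), repeat=v) if sum(e) == n)

    def lemma30(v, n):
        return sum(comb(v, k) * comb(n - 1, k - 1) for k in range(1, v + 1))

    for v in range(1, 5):
        for n in range(1, 6):
            assert brute(v, n) == lemma30(v, n)
    print(lemma30(2, 2))   # 3 second moments of (x1, x2), as in example 2.12.1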

Example 2.12.1. Let x = (x1 , x2 ). There are 3 second moments of x: x21 , x22 , x1 x2 . The second central moments are V ar(x1 ) = σx21 , V ar(x2 ) = σx22 and Cov(x1 , x2 ). If x1 and x2 are statistically independent then: Cov(x1 , x2 ) = E[x1 x2 ] − x1 x2 = E[x1 ]E[x2 ] − x1 x2 =

0

If x1 and x2 are related by x2 = ax1 , a ̸= 0, then: Cov(x1 , x2 ) = E[(x1 − x1 )a(x1 − x1 )] = aV ar(x1 ) { √ √V ar(x1 )V ar(x2 ) ; = − V ar(x1 )V ar(x2 ) ;

a>0 a 0 and |Λy | is consequently a real number as required. Indeed Λy = A × Λx × AT where A is an invertible matrix and Λx = diag[σ12 , σ22 , . . . , σk2 ]. It follows that: det(Λy ) = det(A) det(Λx ) det(AT ) = det(A)2

k ∏

σi2 > 0

i=1

Example 2.13.1. (from W&J, page 171) Consider 2 jointly Gaussian random variables y1 , y2 such that: E[y1 ] = E[y12 ] = E[y1 y2 ] = i.e. the correlation coefficient is √ my Λy

ρσ 2

Cov(y1 ,y2 ) V ar(y1 )V ar(y2 )

= ρ. Then we easily find

= (0, 0) [ ] [ 2 V ar(x1 ) Cov(x1 , x2 ) σ = = Cov(x2 , x1 ) V ar(x2 ) ρσ 2

|Λy | = σ 4 (1 − ρ2 ) Λ−1 y

E[y2 ] = 0 E[y22 ] = σ 2

=

1 σ 4 (1 − ρ2 )

[

σ2 −ρσ 2

−ρσ 2 σ2

[

] =

ρσ 2 σ2

1 σ 2 (1−ρ2 ) −ρ σ 2 (1−ρ2 )

]

−ρ

σ 2 (1−ρ2 ) 1 σ 2 (1−ρ2 )

]

94

CHAPTER 2. PROBABILITY THEORY

which leads to

    (β1, β2) × Λy^{−1} × (β1, β2)^T = (β1² − 2ρβ1β2 + β2²)/(σ²(1 − ρ²))

Finally we obtain:

    py(β1, β2) = (1/(2πσ²√(1 − ρ²))) exp( −(β1² − 2ρβ1β2 + β2²)/(2σ²(1 − ρ²)) ).
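The closed form above can be cross-checked by building Λy numerically and evaluating the general Gaussian-vector expression py(β) = (1/((2π)^{k/2}|Λy|^{1/2})) exp(−½ β Λy^{−1} βᵀ). The Python sketch below is only an illustration; σ² = 2 and ρ = 0.6 are arbitrary values.

    # Bivariate Gaussian density from the covariance matrix versus the closed form.
    import numpy as np

    sigma2, rho = 2.0, 0.6
    Lam = np.array([[sigma2, rho * sigma2],
                    [rho * sigma2, sigma2]])
    beta = np.array([0.7, -0.3])

    quad = beta @ np.linalg.inv(Lam) @ beta
    general = np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Lam)))
    closed = np.exp(-(beta[0]**2 - 2 * rho * beta[0] * beta[1] + beta[1]**2)
                    / (2 * sigma2 * (1 - rho**2))) / (2 * np.pi * sigma2 * np.sqrt(1 - rho**2))
    print(general, closed)    # identical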

Example 2.13.2. Problem 3.3, W&J, page 201.

2.14

Problems

1. A random experiment consists in drawing two balls from four urns, identified by the letter A and the numbers 1, 2, 3, as follows. The first ball is drawn from urn A, which contains 75 balls marked 1, 24 balls marked 2 and 1 ball marked 3. The number on the ball drawn from urn A indicates the urn from which the second ball must be drawn. The contents of urns 1, 2, 3 are as follows: urn 1 contains 1 black ball, 15 white balls and 14 red balls; urn 2 contains 1 black ball, 9 white balls and 5 red balls; urn 3 contains 9 black balls, 1 white ball and 5 red balls.
(a) Calculate the probability that the second ball is black.
(b) From which urn was the second ball most probably drawn if it is black?

2. A random experiment consists in drawing two balls out of three urns labeled A, B, C in the following manner. The first ball is drawn from the urn labeled A which contains 1 ball labeled B and 9 balls labeled C, the letter B or C indicating from which urn the second ball is to be drawn. Urn B contains 9 red balls and 1 black ball, and urn C contains 1 red ball and 4 black balls. A valid sample space for the experiment is:

    Ω = {(B, red), (B, black), (C, red), (C, black)}

where the first component of each sample point is the result of the first ball (drawn from urn A) and the second component is the colour of the second ball (drawn from either urn B or C). The class of events is assumed to be the set of all 16 subsets of Ω.
(a) Calculate the probability of the four events consisting of a single sample point, i.e. P[{(B, red)}], P[{(B, black)}], P[{(C, red)}], P[{(C, black)}].

2.14. PROBLEMS

95

(b) Show that P [{the second ball is red}] = 0.27. It obviously follows from this result that P [{the second ball is black}] = 0.73 (you need not show the latter). (c) Calculate the probability that the second ball is drawn from urn B if it is red. Calculate the probability that the second ball is drawn from urn C if it is red. (d) Based on your result in part (2c), which urn would you guess the second ball is drawn from if it is red. Justify your answer. What is the probability that your guess is correct. 3. The probability density function of a Gaussian random variable x is given by: 2 1 px (α) = √ e−(α−2) /16 4 π Calculate the following: (a) P ({ω : x(ω) ≥ 4}) (b) P ({ω : x(ω) ≥ 0}) (c) P ({ω : 3 ≤ x(ω) < 4}) 4. The joint probability density function of two random variables x and y is given by:  1  2 e−|β−α| ; 0 ≤ α < 1 pxy (α, β) =  0 ; elsewhere (a) Show that the value of the probability density function of the random variable y∫ at 2 is py (2) = e−1 2e2 . Recall: eu du = eu (b) Calculate the probability that the value of x be between 0.5 and 1 if it known that y = 2. 5. Let y denote a random variable defined by the transformation y = |x| + 1 where x is a random variable with probability density function:  1 ; −2 ≤ α < −1  2     1 ; 1≤α 4.375 Calculate P (m(r) b = m0 ), P (m(r) b = m1 ), P (m(r) b = m2 ).

2.14. PROBLEMS

97

(c) Calculate P (E ) with the decision rule (2.24). Hint: It is easier to first calculate P (C ).

(a)

(b)

(c)

Figure 2.31: 8. A communication system is used to transmit a voltage s to a decision device as shown in Fig. 2.32. Thus the decision device has available two received voltages, r1 and r2 , in which r 1 = s + n1 ,

r 2 = n1 + n2

Assume that n1 , n2 are zero-mean Gaussian random variables with variances σ12 = σ22 = σ 2 and that s, n1 and n2 are jointly statistically independent. The system is used to communicate one of two messages m0 and m1 with a priori probabilities P [m0 ] = P [m1 ] = 1/2. For message ml the voltage is √ sl = (−1)l E; l = 0, 1. The optimum decision rule seeks to determine that l for which the mixed form joint density function pr1 , r2 (ρ1 , ρ2 , {s = sl }) is maximum. (a) Determine the structure of the optimum decision device. Hint: Start from pr1 , r2 (ρ1 , ρ2 , {s = sl })

= P [ml ] pr1 (ρ1 |{s = sl }) pr2 (ρ2 |{s = sl }, r1 = ρ1 ) = P [ml ] pn1 (ρ1 − sl ) pn2 (ρ2 − ρ1 + sl )

and show that m(ρ ˆ 1 , ρ2 ) =

  m0

; 2ρ1 − ρ2 > 0



; otherwise

m1

98

CHAPTER 2. PROBABILITY THEORY (b) Calculate the probability of error. Hint: By symmetry P (E ) = P (E |{s = s0 }) = P (E |{s = s1 }) (you need not show this). Starting from the above optimum √ decision device, show that P (E |{s = s1 }) reduces to P ({n > 2 E}), where n = n1 − n2 is a Gaussian random variable. Calculate the mean and variance of n and finally express the probability of error in terms of the Q( ) function.

Figure 2.32:

9.

23

We wish to simulate a communication system on a digital computer and estimate the error probability P (E ) by measuring the relative frequency of error. Let N denote the number of independent uses of the channel in the simulation and xi , i = 1, 2, . . . , N denote the random variable such that:   0 ; no error on the i-th use of the channel xi =  1 ; there is an error on the i-th use of the channel The xi are identically distributed and statistically independent random variables. The relative frequency of error, i.e. the estimate of P (E ) is given by the sample mean: m=

N 1 ∑ xi N i=1

and clearly E[m] = E[xi ] = P (E ). Approximate by means of (a) the Weak Law of Large Numbers, (b) the Central Limit Theorem, 23 This

problem is based on problem 2.38, page 127 in W&J.

2.14. PROBLEMS

99

how many independent uses of the channel we must simulate in order to be 99.9% certain that the observed relative frequency lies within 5% of the true P [E ], which we may assume to be approximately 0.01. Hint: For the Weak Law of Large Numbers, use equation (2.21) on page 70 in the notes (in theorem 27). For the Central Limit Theorem, use the approximation: ( ( ) ( √ ) ) N 1 ∑ ϵ N P xi − x ¯ ≥ ϵ ≈ 2Q N σx i=1

which follows from equation (2.22) on page 72 in the notes. 10. The probability density function of a random variable x is given by:   e−(α−a) ; α ≥ a px (α) =  0 ; α 0 (delay), • r(t) = s(t) + n(t), ∫∞ • y(t) = −∞ h(α)x(t − α)dα, for some fixed h(t) (filtering). In the above we define two processes x(t) and y(t) to be equal if and only if the probability of the event {ω ∈ Ω : x(ω, t) ̸= y(ω, t)} is equal to 0; the concept is similar to that of equality of random variables as per definition 2.3.8. Example 3.1.1. Let Ω = {Head, Tail}, P (Head) = P (Tail) = random process x(t) by: x: Ω×R → x : (Head, t) 7→ x : (Tail, t) 7→ 2 W&J 3 W&J,

page 133 pages 133 - 135

R 2 sin(t)

1 2

and define the

3.1. FUNDAMENTAL DEFINITIONS

103

x(0) is a random variable and we see that P (x(0) = 2)

=

P (x(0) = 0)

=

1 2 1 2

x(π/2) is also a random variable and we have P (x(π/2) = 2) = P (x(π/2) = 1) =

1 2 1 2

The probability density functions of x(0), x(π/2) are sketched in figures 3.1, 3.2. Their joint probability density function is sketched in figure 3.3.

Figure 3.1:

Figure 3.2:

Example 3.1.2. Let x(t) = 10 sin(2πt + θ) where θ is a random variable uniformly distributed between 0 and 2π, i.e.   1/(2π) ; 0 ≤ θ < 2π pθ (β) =  0 ; elsewhere

104

CHAPTER 3. RANDOM WAVEFORMS

Figure 3.3:

Then x(t0 ) = 10 sin(2πt0 +θ) and the probability density function of the random variable x(t0 ) is obtained using the technique of §2.4.2: px(t0 ) (α) =

∑ β∈S(α)

pθ (β) |g ′ (β)|

where g(β)

= 10 sin(2πt0 + β)



g (β) = 10 cos(2πt0 + β) |g ′ (β)| = |10 cos(2πt0 + β)| S(α)

= {β ∈ [0, 2π] : 10 sin(2πt0 + β) = α}

As can be seen from figure 3.4, S(α) = {β0 , β1 } (i.e. S(α) contains two elements) for any −10 < α < 10, and β0 is given by: β0 = arcsin(α/10) − 2πt0 The expression for β1 is more complicated but it is not required since pθ (β0 ) = pθ (β1 ) = 1/(2π) and |g ′ (β1 )| = |g ′ (β0 )| = |10 cos(arcsin(α))| (easy to see on the

3.1. FUNDAMENTAL DEFINITIONS

105

graph of figure 3.4). We then obtain directly (as long as −10 < α < 10): px(t0 ) (α)

pθ (β0 ) pθ (β1 ) + ′ ′ |g (β0 )| |g (β1 )| 1/(2π) = 2 |10 cos(2πt0 + (arcsin(α/10) − 2πt0 ))| 1 = 10π| cos(arcsin(α/10))| 1 √ = π 100 − α2

=

for any t0 . px(0) (α) is sketched on figure 3.5.

Figure 3.4:

Figure 3.5:

Definition 3.1.3. A random process for which all sample functions are periodic with (same) period T > 0 is called periodic random process. The random process of example 3.1.2 above is periodic and its period is 1 [second].

106

3.2

CHAPTER 3. RANDOM WAVEFORMS

Stationariness

(W&J pp. 135 - 144) Definition 3.2.1. A random process x(t) for which px(t+T ) (α) = px(t) (α) for all α ∈ Rk , t ∈ Rk , k ∈ N∗ and for any T = (T, T, . . . , T ) where T ∈ R, is called stationary. Example 3.2.1. In example 3.1.1, we found that px(0) (α) ̸= px(π/2) (α). Therefore x(t) is not a stationary process. Example 3.2.2. (W&J pp 136 - 139) Let x(t) = f (t + τ ) where f (t) is the periodic triangular wave of period T sketched in figure 3.6 and τ is a random variable uniformly distributed between 0 and T . We see that x(t) is specified by method 3 of page 102. We show that x(t) is stationary in two steps: first we prove that px(t1 ) (α) = px(t1 +T ) (α) for all α, t1 , T ∈ R. Next we use the result to prove that px(t) (α) = px(t+T ) (α), ∀t, T = (T, T, . . . , T ), α ∈ Rk and for all k ∈ N∗ .

Figure 3.6:

1. Using the technique of §2.4.2, the probability density function of the random variable x(t1 ) = f (t1 + τ ) is given by: px(t1 ) (α) =

∑ β∈S(α)

pτ (β) |g ′ (β)|

where the transformation g( ) and the set of roots S(α) are respectively given by: g(τ ) = f (t1 + τ ) S(α) = {β ∈ [0, T ] : g(β) = f (t1 + β) = α} We then have:

g ′ (β) ≡ slope of f ( ) = a/T

3.2. STATIONARINESS

107

and one easily sees that ∀α ∈ [0, a] there exists a unique β0 ∈ S(α) (refer to figure 3.7). We obtain directly: px(t1 ) (α) =

pτ (β0 ) 1/T 1 = = ′ |g (β0 )| |a/T | a

as long as 0 ≤ α < a. The latter is equation (3.14) in W&J and it clearly

Figure 3.7: indicates that the probability is independent of t1 ∈ R. Therefore px(t1 +T ) (α) = px(t1 ) (α) for every t1 , T, α ∈ R. 2. We factor px(t) (α) into: px(t) (α) = px′ (t′ ) (α′ |x(t1 ) = α1 )px(t1 ) (α1 ) in which x′ (t′ ) = (x(t2 ), x(t3 ), . . . , x(tk )) t′ = (t2 , t3 , . . . , tk ) α′

= (α2 , α3 , . . . , αk )

and we notice that the knowledge of x(t1 ) = α1 implies the knowledge of τ which completely specifies the values of x(t2 ), x(t3 ), . . . , x(tk ). Indeed from t1 and x(t1 ) = α1 , 0 ≤ α1 < a we define h(t) = f (t − t1 + α1 T /a) where 0 ≤ α1 T /a < T . Clearly h(t) is a delayed version of f (t) and one easily sees that h(t1 ) = f (t1 − t1 + α1 T /a) = f (α1 T /a) = α1 T /a × a/T = α1 = x(t1 ) .

108

CHAPTER 3. RANDOM WAVEFORMS It follows that x(t2 ) = ··· x(tk ) =

α1 T ) , a2 a ··· α1 T h(tk ) = f (tk − t1 + ) , ak a h(t2 ) = f (t2 − t1 + ··· ··· ···

This is equation (3.15a) in W&J and it follows that px(t) (α)

= px(t1 ) (α1 )

k ∏

δ(αi − ai )

i=2

= px(t1 ) (α1 )

k ∏

δ(αi − f (ti − t1 +

i=2

= px(t1 +T ) (α1 )

k ∏

α1 T )) a

δ(αi − f ((ti + T ) − (t1 + T ) +

i=2

α1 T )) a

= px(t+T ) (α) for every T ∈ R. Example 3.2.2 above generalizes to the following: Theorem 38. Let g(t) be a periodic function of period T and define the periodic random process y(t) = g(t+τ ) where τ is a random variable uniformly distributed in the interval [0, T ]. y(t) is stationary. Sketch of proof. (W&J pp. 140 - 141). Clearly py(t) (α) = py(t+T ) (α) ⇔ ⇔

Fy(t) (α) = Fy(t+T ) (α) Fy(t) (α) = Fz(t) (α)

where z(t) = y(t + T ) (the process z(t) is specified by method 4 of page 102). Let Ωy ≡ Ωz ≡

set of sample functions of y(t) set of sample functions of z(t)

Then we easily see that Ωy = Ωz (any function in Ωz lies in Ωy and vice versa). Moreover the sample functions are all equiprobable. It follows that if I K

= {a(t) ∈ Ωy : a(t) ≤ α} = {b(t) ∈ Ωz : b(t) ≤ α}

then I = K and Fy(t) (α) = P (I) = P (K) = Fz(t) (α) = Fy(t+T ) (α) . y(t) is therefore stationary.

3.3. MOMENT FUNCTIONS OF A RANDOM PROCESS

109

Theorem 39. (W&J pp 142). Let g(W , t) be a periodic function of t of period T and J parameters W1 , W2 , . . . , WJ , none of which affect the period T of g(W , t). Define x(t) = g(γ, t + τ ) where τ is a random variable, γ is a random vector and τ , γ are statistically independent. If τ is uniformly distributed in the interval [0, T ] then x(t) is stationary. Proof. Refer to W&J p. 142. Example 3.2.3. It follows immediately from theorem 38 that the random process x(t) = 10 sin(2πt + θ) from example 3.1.2 is stationary. Also theorem 39 implies that the random process defined by equation (3.5) on W&J p. 133 is stationary when pr,θ (α, β) is defined by equation (3.6). This is illustrated on W&J pp. 142 - 143. Remark. (W&J p. 143) “It is readily verified that 2 1 px(ti ) (α) = px(0) (α) = √ e−α /2 ”. 2π

This follows from W&J Appendix 2A in which we found that if (ρ = 0)  2 1 −α1 /2  α ; α1 ≥ 0, 0 ≤ α2 < 2π 2π e pr,θ (α1 , α2 ) =  0 ; elsewhere and we define the random variables x = r cos θ,4 y = r sin θ = x(0), then pxy (β1 , β2 ) =

1 −(β12 +β22 )/2 e 2π

By elimination of the random variable x we obtain ∫ ∞ 2 1 py (β2 ) = pxy (β1 , β2 )dβ1 = √ e−β2 /2 , 2π −∞ and by the change of variables α = β2 the desired result is obtained.

3.3

Moment Functions of a Random Process

(not in W&J) Definition 3.3.1. 1. Let x(t) be a random process and n ∈ N∗ . E[x(t1 )x(t2 ) . . . x(tn )] is a function of t = (t1 , t2 , ..., tn ) ∈ Rn called n-th moment function of x(t) if it exists. 2. The first moment function of x(t) always exists and is called mean function and denoted mx (t) , E[x(t)]. 4 this

variable x is not to be confused with the random process x(t).

110

CHAPTER 3. RANDOM WAVEFORMS

3. The second moment function of x(t), if it exists, is called autocorrelation function and denoted Rx (t, s) , E[x(t)x(s)]. 4. Similarly, the n-th central moment function is defined by E[(x(t1 ) − mx (t1 ))(x(t2 ) − mx (t2 )) . . . (x(tn ) − mx (tn ))] if it exists. The second central moment function of x(t), if it exists, is called autocovariance function and denoted Lx (t, s). Remarks. 1. Lx (t, s) =

E[(x(t) − mx (t))(x(s) − mx (s))]

= E[(x(t)x(s)] − mx (t)mx (s) = Rx (t, s) − mx (t)mx (s) If x(t) is statistically independent of x(s) the covariance reduces to 0 at (t, s), i.e. x(t) and x(s) are uncorrelated. Also, if mx (t) = 0 or mx (s) = 0 then Lx (t, s) = Rx (t, s). 2. V ar(x(t0 )) = Lx (t0 , t0 ) = Rx (t0 , t0 ) − m2x (t0 ). 3. If x(t) is stationary then the mean function is a constant function of time, i.e. mx (t1 ) = mx (t2 ), ∀t1 , t2 ∈ R, as shown below. Proof. mx (t1 ) = E[x(t1 )] ∫ ∞ = αpx(t1 ) (α)dα −∞ ∫ ∞ = αpx(t2 ) (α)dα −∞

= E[x(t2 )] = mx (t2 )

Also, the autocorrelation (autocovariance) function is independent of the time origin, i.e. Rx (t, s) = Rx (t + T, s + T ), ∀t, s, T ∈ R, as shown below. Proof. Rx (t, s) = E[x(t)x(s)] ∫∫ ∞ = αβpx(t),x(s) (α, β)dαdβ −∞ ∫∫ ∞ = αβpx(t+T ),x(s+T ) (α, β)dαdβ −∞

= E[x(t + T )x(s + T )] = Rx (t + T, s + T )

3.3. MOMENT FUNCTIONS OF A RANDOM PROCESS

111

When x(t) is stationary we see that Rx (t, s) is not function of both t, s, since by the subtraction of t (or s) from both arguments we obtain Rx (t, s) = Rx (0, s − t) = Rx (t − s, 0). We consequently drop one argument and write Rx (s − t) or simply Rx (τ ) where τ represents s − t: Rx (τ ) , E[x(t)x(t + τ )] Similarly, we denote by mx the mean function of a stationary process x(t). Definition 3.3.2. A random process x(t) for which the mean function is independent of time and the autocorrelation function is independent of time origin is said to be wide sense stationary. Obviously, a (strict sense) stationary process is also wide sense stationary, but the converse is not necessarily true. An example of a wide sense stationary process that is not (strict sense) stationary is presented on page 177 in W&J. It is also a lot easier to prove (or disprove) the wide sense stationariness of a process than the (strict sense) stationariness, especially if we only dispose of experimental data. Example 3.3.1. Consider the following random process specified from the experiment consisting in the throw of a coin (P (Head) = P (Tail) = 1/2) as in example 3.1.1: x: Ω×R → R x : (Head, t) 7→ 2 x : (Tail, t) 7→ sin t 1. Mean function: mx (t)

= E[x(t)] = (1/2) sin t + (1/2)(2) sin t = 1+ 2

The mean function mx (t) is sketched in figure 3.8(a). 2. Autocorrelation function: Rx (t, s)

= E[x(t)x(s)] = (1/2)(2 · 2) + (1/2)(sin t sin s) = 2 + 1/2 sin t sin s = 2 + 1/4 cos(t − s) − 1/4 cos(t + s)

recalling that 2 sin u sin v = cos(u − v) − cos(u + v).

112

CHAPTER 3. RANDOM WAVEFORMS

3. Autocovariance function: Lx (t, s) =

Rx (t, s) − mx (t)mx (s)

= 2 + 1/2 sin t sin s − (1 + 1/2 sin t)(1 + 1/2 sin s) = (1 − 1/2 sin t)(1 − 1/2 sin s) The function Lx (t, s) is sketched in figure 3.8(b). Clearly this process is not wide sense stationary (⇒ nor strict sense stationary).

(a) mx (t) of example 3.3.1

(b) Lx (t1 , t2 ) of example 3.3.1

Figure 3.8: Mean and autocovariance functions for the process of example 3.3.1

Example 3.3.2. Let x(t) = 10 sin(2πt + θ) where θ is a random variable uniformly distributed between 0 and 2π.

3.3. MOMENT FUNCTIONS OF A RANDOM PROCESS

113

1. Mean function: From the fundamental theorem of expectation we obtain: mx (t) = E[x(t)] = E[10 sin(2πt + θ)] ∫ ∞ = 10 sin(2πt + α)pθ (α)dα ∫

−∞ 2π

= 0

10 sin(2πt + α)dα 2π

= 0 independently of the time variable t. 2. Autocorrelation function: Again using the fundamental theorem of expectation we obtain: Rx (t, s)

= E[x(t)x(s)] = E[100 sin(2πt + θ) sin(2πs + θ)] = 50 cos(2π(t − s)) − 50 E[cos(2π(t + s) + 2θ)] ∫ 2π cos(2π(t + s) + 2α) = 50 cos(2π(t − s)) − 50 dα 2π {z } |0 0

= 50 cos(2π(t − s)) We find that Rx (t, s) is independent of the time origin. It is also periodic with respect to t − s and has the same period as x(t). Rx (t, s) is sketched in figure 3.9. The process is clearly wide sense stationary; in fact it is stationary in the strict sense by theorem 39. Example 3.3.3. Let x(t) = r sin(2πt + θ) where r and θ are two random variables with joint density function  α −α2 /2  2π e ; 0 ≤ β < 2π and 0 ≤ α prθ (α, β) =  0 ; elsewhere 1. Mean function: mx (t) = E[x(t)] ∫ ∞ ∫ 2π = α sin(2πt + β)prθ (α, β)dβdα 0

=

0

hence independent of time.

0

114

CHAPTER 3. RANDOM WAVEFORMS

Figure 3.9: Autocorrelation function for the random process of example 3.3.2

2. Autocorrelation function: Rx (t, s) = E[x(t)x(s)] ∫ ∞ ∫ 2π = α2 sin(2πt + β) sin(2πs + β)prθ (α, β)dβdα 0

=

0

cos(2π(t − s))

hence independent of time origin.5 Again we see that Rx (t, s) = Rx (t − s) is periodic (with respect to t − s) and has same period as x(t). The process is wide sense stationary; it is also stationary in the strict sense by theorem 39. Example 3.3.4. Consider a stationary process x(t) with mean function mx = 2 [volts], and autocorrelation function Rx (τ ) = 24 sinc(500 Hz τ ) + 4 [volts2 ]. 1. Use Chebyshev’s inequality to estimate the range of x(1 ms) with a probability of 95%. 2. Let y = x(1 ms) + 2x(2 ms) − 3x(5 ms) + 5 volts. Calculate y¯, V ar(y). Solution: 1. We need to find a, b such that P (a < x(1 ms) < b) ≥ 95% = 0.95, or in other words: P (x(1 ms) ≤ a or x(1 ms) ≥ b) ≤ 0.05 5 The above is easily verified with MAPLE : use two nested integrations and the function combine( , trig) to simplify the trigonometric expression.

3.3. MOMENT FUNCTIONS OF A RANDOM PROCESS

115

The Chebyshev’s inequality states that: 2 ( ) σx(1 ms) P |x(1 ms) − x(1 ms)| ≥ ϵ ≤ ϵ2

which may be rewritten as: ( ) P x(1 ms) − x(1 ms) ≥ ϵ or x(1 ms) − x(1 ms) ≥ ϵ ( ) P x(1 ms) ≥ x(1 ms) + ϵ or x(1 ms) ≤ x(1 ms) − ϵ | {z } | {z }

≤ ≤

a

b

2 σx(1 ms)

ϵ2 2 σx(1 ms) ϵ2

| {z } 0.05

It is given that x(1 ms) = mx

= 2 volts

2 σx(1 ms) = Lx (0) =

Rx (0) − m2x

= 28 − 4 = 24 volts2 It follows that

24 volts2 ⇒ ϵ = 21.91 volts. 0.05 The probability of x(1 ms) being outside the range [−19.91, 23.91] volts is therefore smaller than 0.05%; in other words, we are 95% certain that x(1 ms) lies between -19.91 volts and 23.91 volts. This is a very conservative estimate of the range; if the process was specified, a much narrower range could be determined from the probability density function px(1 ms) (α). ϵ2 =

2. Define the random vector x , [x(1 ms), x(2 ms), x(5 ms)]. We first find its mean vector ( ) mx = E [x(1 ms), x(2 ms), x(5 ms)] = [2, 2, 2] and its covariance matrix  Λx

 Lx (0) Lx (1 ms) Lx (4 ms) Lx (0) Lx (3 ms)  =  Lx (1 ms) Lx (4 ms) Lx (3 ms) Lx (0)   24 15.279 0 24 −5.093  =  15.279 0 −5.093 24

Using lemma 32 and noticing that: 

 1 y =x× 2 +5 −3

116

CHAPTER 3. RANDOM WAVEFORMS we easily obtain:

y¯ = = σy2

= =

[

2 2

2

]



 1 × 2 +5 −3

5 [

−3

1 2

]



24 15.279 24 ×  15.279 0 −5.093

   1 0 −5.093  ×  2  24 −3

458.23

Example 3.3.5. Binary Random Process: Let

x(t) =

∞ ∑

ai g(t − iT + τ )

i=−∞

where the ai ’s and τ are statistically independent random variables such that pai (α) pτ (α)

= 1/2δ(α − 1) + 1/2δ(α + 1), ∀i ∈ Z   1/T ; 0 ≤ α < T =  0 ; elsewhere

and g(t) is a deterministic “pulse” which is 0 outside of the interval [0, T ].

1. Mean function: mx (t) = =

=

=

E[x(t)] ∞ ∑ E[ ai g(t − iT + τ )] i=−∞ ∞ ∑

E[ai g(t − iT + τ )]

i=−∞ ∞ ∑ i=−∞

= independent of time.

0

E[ai ] E[g(t − iT + τ )] | {z } 0

3.3. MOMENT FUNCTIONS OF A RANDOM PROCESS

117

2. Autocorrelation function: Rx (t, s) =

E[x(t)x(s)] ∞ ∞ )( ∑ )] [( ∑ ai g(t − iT + τ ) ai g(s − iT + τ ) = E i=−∞

= E

=

=

=

∞ [ ∑

i=−∞

] ai aj g(t − iT + τ )g(s − jT + τ )

∞ ∑

i=−∞ j=−∞ ∞ ∑

∞ ∑

E[ai aj ]E[g(t − iT + τ )g(s − jT i=−∞ j=−∞ ∞ ∑ E[a2i ] E[g(t − iT + τ )g(s − iT + τ )] | {z } i=−∞ 1 ∞ ∑

+ τ )]

E[g(t − iT + τ )g(s − iT + τ )]

i=−∞

The expected value inside the above sum is expanded as follows: E[g(t − iT + τ )g(s − iT + τ )] ∫ T = g(t − iT + α)g(s − iT + α)pτ (α)dα 0

1 = T



t−(i−1)T

g(u)g(s − t + u)du t−iT

where, in the last line, we use the change of variables u = t − iT + α ⇔ α = u − t + iT dα = du α = 0 ⇒ u = t − iT α = T ⇒ u = t − (i − 1)T It follows that Rx (t, s)

∫ ∞ ∑ 1 t−(i−1)T = g(u)g(s − t + u)du T t−iT i=−∞ ∫ 1 ∞ = g(u)g(s − t + u)du T −∞ 1 Rg (s − t) = T

where we define:

∫ Rg (t)



= −∞

g(u)g(u − t)du

= g(t) ∗ g(−t)

Hence R_x(t, s) is independent of the time origin.

Remark. The function R_g(t) is called the (deterministic) autocorrelation function of the energy signal g(t). We have thus shown that the autocorrelation function R_x(t, s) of the binary random process x(t) is proportional to the deterministic autocorrelation function R_g(t) evaluated at s − t. Even though they bear the same name, R_x(t, s) and R_g(t) are not to be confused; R_x(t, s) is used with random processes and R_g(t) is used with deterministic (energy) signals. (The same name, autocorrelation function, is used with both deterministic and random signals because both functions relate to similar physical quantities when the random signal is wide-sense stationary; see definition 3.3.3 below and the Wiener-Khinchine theorem in section 3.4 below.)

Case 1 (Square pulse). The pulse is shown in figure 3.10(a). Typical sample functions are sketched in figures 3.10(b)-3.10(e), and the corresponding autocorrelation function is plotted in figure 3.10(f).

Case 2 (Raised-sine pulse). The pulse is shown in figure 3.11(a). Typical sample functions are sketched in figures 3.11(b)-3.11(e), and the corresponding autocorrelation function is plotted in figure 3.11(f). Notice that the maximum value 0.375 of the autocorrelation function corresponds to:
\[
\frac{1}{T}\int_0^T g^2(t)\, dt
= \frac{1}{1\ \mathrm{ms}}\int_0^{1\ \mathrm{ms}} \left(\frac{1 - \cos\big(2\pi (1\ \mathrm{kHz}) t\big)}{2}\right)^2 dt
= 0.375
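As a quick numerical check of the value 0.375 (and of the proportionality $R_x = R_g(s-t)/T$), the deterministic autocorrelation $R_g$ can be evaluated by numerical integration for an assumed raised-sine pulse. This is only an illustrative sketch of my own.

```python
import numpy as np

T = 1e-3                      # pulse duration, 1 ms
dt = T / 10000                # integration step
u = np.arange(0.0, T, dt)     # the pulse is zero outside [0, T]

g = (1.0 - np.cos(2.0 * np.pi * u / T)) / 2.0     # raised-sine pulse on [0, T]

# R_x(0) = (1/T) * integral of g^2 over [0, T]  -> expected 0.375
Rx0 = np.sum(g**2) * dt / T
print(Rx0)                    # ~0.375

# (1/T) R_g(tau) = (1/T) * integral g(u) g(u - tau) du, on a grid of tau values
taus = np.linspace(-T, T, 81)
Rx = [np.sum(g * np.interp(u - tau, u, g, left=0.0, right=0.0)) * dt / T
      for tau in taus]
print(max(Rx))                # maximum occurs at tau = 0 and equals ~0.375
```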

The autocorrelation function is related to the average bandwidth of a random process, as detailed in the next section, but we first present its properties.

Theorem 40 (Properties of $R_x(\tau)$). Let x(t) be a wide-sense stationary process. The autocorrelation function $R_x(\tau) = E[x(t)x(t + \tau)]$ of x(t) satisfies the following:

1. $R_x(0) \ge |R_x(\tau)|$, $\forall \tau \in \mathbb{R}$.
2. $R_x(\tau) = R_x(-\tau)$, $\forall \tau \in \mathbb{R}$.
3. If x(t) is periodic then $R_x(\tau)$ is periodic.
4. $R_x(0) = E[x^2(t)]$.

Sketch of proof.

1. It has been shown that $|\mathrm{Cov}(x_1, x_2)| \le \sigma_{x_1}\sigma_{x_2}$ for any two (finite-variance) random variables $x_1$, $x_2$; this property can also be proved directly by expanding $0 \le E[(x(t) \pm x(t + \tau))^2]$.

Figure 3.10: Binary random process with square pulses, T = 1 ms: (a) square pulse; (b)-(e) typical sample functions; (f) autocorrelation function.

Figure 3.11: Binary random process with raised-sine pulses, T = 1 ms: (a) raised-sine pulse; (b)-(e) typical sample functions; (f) autocorrelation function.

Consequently, if x(t) is wide-sense stationary, then
\[
|L_x(s - t)| \le \sqrt{\underbrace{Var(x(t))}_{L_x(0)}\ \underbrace{Var(x(s))}_{L_x(0)}} = L_x(0)
\]
which implies that
\[
|L_x(s - t)| + m_x^2 \le L_x(0) + m_x^2 = R_x(0).
\]
We also have $|L_x(s - t)| + m_x^2 \ge |L_x(s - t) + m_x^2| = |R_x(s - t)|$. Therefore $|R_x(\tau)| \le R_x(0)$ for any $\tau = s - t$.

2.
\[
R_x(\tau) = E[x(t_0)x(t_0 + \tau)],\ \forall t_0
= E[x(t_0 + \tau)x(t_0)],\ \forall t_0
= E[x(t)x(t - \tau)],\ \forall t = t_0 + \tau
= R_x(-\tau)
\]

3. Let T be the period of x(t). Then x(t) = x(t + T) for every t. It follows that:
\[
R_x(\tau) = E[x(t_0)x(t_0 + \tau)],\ \forall t_0
= E[x(t_0)x(t_0 + \tau + T)],\ \forall t_0
= R_x(\tau + T)
\]

4. Obvious from the definitions.

The definition of the autocorrelation function was motivated by the concept of moments. The following definition and discussion show that the autocorrelation function is surprisingly also related to some physical quantities in a wide-sense stationary random process.

Definition 3.3.3. Let x(t) be a wide-sense stationary process with mean function $m_x$ and autocorrelation function $R_x(\tau)$ such that $R_x(0)$ is finite.

1. $m_x$ ≡ average DC value of x(t).
2. $R_x(0)$ ≡ average total (DC + AC) power of x(t).
3. $L_x(0)$ ≡ average AC power of x(t).

Discussion: Consider the set of all the sample functions x(ω, t), ∀ω ∈ Ω. In the applications to communication, x(ω, t) represents an electric signal (voltage or current), for which
\[
\begin{aligned}
\text{DC value} &\equiv \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(\omega, t)\, dt = \langle x(\omega, t)\rangle \\
\text{Total power} &\equiv \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(\omega, t)^2\, dt = \langle x(\omega, t)^2\rangle \\
\text{AC power} &\equiv \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} \big(x(\omega, t) - \langle x(\omega, t)\rangle\big)^2\, dt
= \big\langle \big(x(\omega, t) - \langle x(\omega, t)\rangle\big)^2\big\rangle
\end{aligned}
\]
The sample functions x(ω, t) do not all have the same DC value, total power nor AC power. It can however be shown that, whenever $R_x(0)$ is finite, the integrations and expectations can be interchanged. We then obtain:
\[
\begin{aligned}
\text{average DC value} \equiv E[\langle x(\omega, t)\rangle]
&= E\Big(\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(\omega, t)\, dt\Big)
= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} E\big(x(\omega, t)\big)\, dt \\
&= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} m_x\, dt
= m_x\,\underbrace{\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} dt}_{1}
= m_x
\end{aligned}
\]
since $E(x(\omega, t)) = m_x(t) = m_x$ whenever the process is wide-sense stationary. Similarly we have:
\[
\begin{aligned}
\text{average total power} \equiv E[\langle x(\omega, t)^2\rangle]
&= E\Big(\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(\omega, t)^2\, dt\Big)
= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} E\big(x(\omega, t)^2\big)\, dt \\
&= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} R_x(0)\, dt
= R_x(0)\,\underbrace{\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} dt}_{1}
= R_x(0)
\end{aligned}
\]
since $E(x(\omega, t)^2) = R_x(0)$ whenever the process is wide-sense stationary. In the same manner we can show that
\[
E\big[\big\langle \big(x(\omega, t) - \langle x(\omega, t)\rangle\big)^2\big\rangle\big] = L_x(0) \longrightarrow \text{average AC power.}
\]
Finally, $m_x^2$ represents the average DC power, as shown below:
\[
\underbrace{R_x(0)}_{\text{average total power}} = \underbrace{L_x(0)}_{\text{average AC power}} + \underbrace{m_x^2}_{\text{average DC power}}
\]
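A quick numerical illustration of these time averages for the random-phase sinusoid $x(t) = 10\sin(2\pi t + \theta)$ of example 3.3.2 (a sketch of my own): the DC value, total power and AC power of one sample function can be estimated by replacing the limits with averages over a long but finite observation window.

```python
import numpy as np

rng = np.random.default_rng(1)

# One sample function of x(t) = 10 sin(2*pi*t + theta), theta uniform on [0, 2*pi)
theta = rng.uniform(0.0, 2.0 * np.pi)
t = np.arange(0.0, 1000.0, 1e-3)          # long observation window (0 to 1000 s)
x = 10.0 * np.sin(2.0 * np.pi * t + theta)

dc_value = x.mean()                        # <x(omega, t)>      -> ~0  (= m_x)
total_power = (x**2).mean()                # <x(omega, t)^2>    -> ~50 (= R_x(0))
ac_power = ((x - dc_value)**2).mean()      # AC power           -> ~50 (= L_x(0))

print(dc_value, total_power, ac_power)
```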

3.4 Correlation Functions and Power Spectra

W&J, pp. 179 - 185.

Definition 3.4.1. Let x(t) be a wide-sense stationary process and $R_x(\tau)$ its autocorrelation function. Then
\[
S_x(f) \triangleq \mathcal{F}\big(R_x(\tau)\big) = \int_{-\infty}^{\infty} R_x(\tau)\, e^{-j2\pi f\tau}\, d\tau
\]
is called the power spectral density of x(t).

Remarks.

1. $R_x(\tau)$ can be recovered from $S_x(f)$ by:
\[
R_x(\tau) = \int_{-\infty}^{\infty} S_x(f)\, e^{j2\pi f\tau}\, df
\]

2. The sample functions x(ω, t) can be used to define a random process in the variable f ∈ ℝ as follows: consider the sample functions
\[
X(\omega, f) = \lim_{T\to\infty}\frac{1}{2T}\left|\int_{-T}^{T} x(\omega, t)\, e^{-j2\pi f t}\, dt\right|^2 .
\]
For each ω ∈ Ω we obtain a (possibly) different function X(ω, f), so X(f) is a new random process indexed by frequency (real and non-negative, although it is built from the complex Fourier transform of the sample functions). It can be shown that if $R_x(0)$ is finite then $E[X(f)] = S_x(f)$. This result is known as the Wiener-Khinchine theorem and means that “$S_x(f)$ represents the average frequency content of the sample functions of the process x(t)”. The physical meaning of $S_x(f)$ will be illustrated later in theorem 43 (also W&J pp. 182 - 183).

Sketch of proof - Wiener-Khinchine theorem. We assume that integrations and expectation are freely interchangeable. We show that $\mathcal{F}^{-1}(E(X(f))) = R_x(\tau)$; given that by definition $\mathcal{F}^{-1}(S_x(f)) = R_x(\tau)$, this implies that $E(X(f)) = S_x(f)$. First we have:
\[
\begin{aligned}
X(\omega, f) &= \lim_{T\to\infty}\frac{1}{2T}\Big(\int_{-T}^{T} x(\omega, t)\, e^{-j2\pi f t}\, dt\Big)\Big(\int_{-T}^{T} x(\omega, s)\, e^{-j2\pi f s}\, ds\Big)^{\!*} \\
&= \lim_{T\to\infty}\frac{1}{2T}\Big(\int_{-T}^{T} x(\omega, t)\, e^{-j2\pi f t}\, dt\Big)\Big(\int_{-T}^{T} x(\omega, s)\, e^{j2\pi f s}\, ds\Big)
= \lim_{T\to\infty}\frac{1}{2T}\iint_{-T}^{T} x(\omega, t)\, x(\omega, s)\, e^{-j2\pi f (t-s)}\, dt\, ds
\end{aligned}
\]
It follows that:
\[
E\big(X(\omega, f)\big) = \lim_{T\to\infty}\frac{1}{2T}\iint_{-T}^{T} \underbrace{E\big(x(t)x(s)\big)}_{R_x(t-s)}\, e^{-j2\pi f (t-s)}\, dt\, ds
\]
Therefore
\[
\begin{aligned}
\mathcal{F}^{-1}\big(E(X(f))\big)
&= \int_{-\infty}^{\infty}\Big(\lim_{T\to\infty}\frac{1}{2T}\iint_{-T}^{T} R_x(t-s)\, e^{-j2\pi f (t-s)}\, dt\, ds\Big)\, e^{j2\pi f\tau}\, df \\
&= \lim_{T\to\infty}\frac{1}{2T}\iint_{-T}^{T} R_x(t-s)\,\underbrace{\int_{-\infty}^{\infty} e^{j2\pi f (s-t+\tau)}\, df}_{\delta(s-t+\tau)}\, dt\, ds \\
&= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T}\underbrace{\int_{-T}^{T} R_x(t-s)\,\delta(s-t+\tau)\, dt}_{R_x((s+\tau)-s)\,=\,R_x(\tau)}\, ds
= R_x(\tau)\,\lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} ds = R_x(\tau)
\end{aligned}
\]
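The Wiener-Khinchine relation can be checked numerically: averaging the periodogram $\frac{1}{2T}\big|\int x\, e^{-j2\pi f t}\,dt\big|^2$ over many sample functions of the binary random process of example 3.3.5 should approach $S_x(f) = |G(f)|^2/T$. The sketch below is my own illustration (it assumes a square pulse of duration T = 1 ms and approximates the random delay by a circular shift); the FFT plays the role of the finite-window Fourier transform.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 1e-3                    # pulse duration (1 ms)
fs = 100e3                  # sampling rate
n_sym = 512                 # number of pulses per sample function
spp = int(T * fs)           # samples per pulse
N = n_sym * spp             # samples per sample function
t_obs = N / fs              # observation window length

def sample_function():
    """One sample function of the binary process with square pulses."""
    a = rng.choice([-1.0, 1.0], size=n_sym)
    x = np.repeat(a, spp)                     # square pulses of duration T
    shift = rng.integers(0, spp)              # random delay tau over one pulse
    return np.roll(x, shift)

# Average periodogram over many sample functions
n_avg = 200
psd = np.zeros(N)
for _ in range(n_avg):
    X = np.fft.fft(sample_function()) / fs    # approximates int x e^{-j 2 pi f t} dt
    psd += np.abs(X)**2 / t_obs
psd /= n_avg
freqs = np.fft.fftfreq(N, d=1.0/fs)

# Theoretical S_x(f) = |G(f)|^2 / T with |G(f)| = T |sinc(f T)| for the square pulse
Sx_theory = T * np.sinc(freqs * T)**2
print(psd[0], Sx_theory[0])                   # both ~T = 1e-3 at f = 0
```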

Example 3.4.1. We take the Fourier transform of the autocorrelation functions of the binary random processes sketched in figures 3.10 and 3.11 to estimate the bandwidth requirements of an average sample function. From example 3.3.5 we have:
\[
R_x(\tau) = \frac{1}{T}\, R_g(\tau) = \frac{1}{T}\, g(\tau) * g(-\tau)
\]
By definition and the properties of the Fourier transform we have:
\[
S_x(f) = \mathcal{F}\big(R_x(\tau)\big) = \frac{1}{T}\,|G(f)|^2
\]
where $G(f) = \mathcal{F}(g(t))$ denotes the Fourier transform of the pulse g(t). In the following we use:
\[
\mathrm{sinc}(x) \triangleq \frac{\sin(\pi x)}{\pi x}, \qquad
\mathrm{rect}(x) \triangleq \begin{cases} 1 & ; -1/2 < x < 1/2 \\ 0 & ; \text{elsewhere} \end{cases}, \qquad
\Delta(x) \triangleq \begin{cases} 1 - 2|x| & ; -1/2 < x < 1/2 \\ 0 & ; \text{elsewhere} \end{cases}
\]

• For the rectangular pulse we use the Fourier transform pair
\[
\mathrm{rect}\big(t/(1\ \mathrm{ms})\big) \longleftrightarrow (1\ \mathrm{ms})\,\mathrm{sinc}\big((1\ \mathrm{ms})\, f\big) = (1\ \mathrm{ms})\,\mathrm{sinc}(f/1000).
\]
It follows that $|G(f)| = (1\ \mathrm{ms})\,|\mathrm{sinc}(f/1000)|$ for the rectangular pulse. (Recall that the pulses in example 3.3.5 go from 0 to T = 1 ms, whereas the above pulse goes from -0.5 ms to +0.5 ms. The time delay affects the phase of the Fourier transform only; it has no effect on the magnitude, which is all that is required here.)

• For the triangular pulse we use the Fourier transform pair
\[
\Delta\big(t/(1\ \mathrm{ms})\big) \longleftrightarrow (0.5\ \mathrm{ms})\,\mathrm{sinc}^2\big((0.5\ \mathrm{ms})\, f\big).
\]
This is G(f) for the triangular pulse.

• For the raised-cosine pulse we notice that
\[
g(t + 0.5\ \mathrm{ms}) = \mathrm{rect}\big(t/(1\ \mathrm{ms})\big)\,\frac{1 + \cos\big(2\pi t/(1\ \mathrm{ms})\big)}{2}
= \frac{1}{2}\mathrm{rect}\big(t/(1\ \mathrm{ms})\big) + \frac{1}{2}\mathrm{rect}\big(t/(1\ \mathrm{ms})\big)\cos\big(2\pi t/(1\ \mathrm{ms})\big).
\]
Using the property of modulation on the Fourier transform pair
\[
\frac{1}{2}\mathrm{rect}\big(t/(1\ \mathrm{ms})\big) \longleftrightarrow (0.5\ \mathrm{ms})\,\mathrm{sinc}\big((1\ \mathrm{ms})\, f\big) = (0.5\ \mathrm{ms})\,\mathrm{sinc}(f/1000)
\]
we obtain
\[
\frac{1}{2}\mathrm{rect}\big(t/(1\ \mathrm{ms})\big)\cos\big(2\pi t/(1\ \mathrm{ms})\big) \longleftrightarrow
(0.25\ \mathrm{ms})\,\mathrm{sinc}\big((f - 1000)/1000\big) + (0.25\ \mathrm{ms})\,\mathrm{sinc}\big((f + 1000)/1000\big).
\]
It follows that
\[
|G(f)| = (0.5\ \mathrm{ms})\,\mathrm{sinc}(f/1000) + (0.25\ \mathrm{ms})\,\mathrm{sinc}\big((f - 1000)/1000\big) + (0.25\ \mathrm{ms})\,\mathrm{sinc}\big((f + 1000)/1000\big)
\]
for the raised-cosine pulse.

Figure 3.12 shows sketches of the power spectral densities of the binary random processes of example 3.3.5 for the cases of the rectangular pulse (green curve), triangular pulse (blue curve) and raised-cosine pulse (red curve).

Figure 3.12: Power spectral densities of the binary random processes (rectangular, triangular and raised-cosine pulses).

Experimental measurements were taken which confirm the above results; refer to figure 3.13.

Figure 3.13: Measured spectra of the binary random processes.

Theorem 41. (W&J pp. 179 - 180) Let z(t) be a random process with mean function $m_z(t)$ and autocorrelation function $R_z(t, s)$ which is finite for all t, s ∈ ℝ. Define the random process:
\[
y(t) = \int_{-\infty}^{\infty} z(\alpha)\, h(t - \alpha)\, d\alpha = z(t) * h(t),
\]
i.e. y(t) is the output of a (bounded-input-bounded-output stable, linear, invariant) filter of impulse response h(t) with input z(t). Then:
\[
\begin{aligned}
m_y(t) &= \int_{-\infty}^{\infty} m_z(\alpha)\, h(t - \alpha)\, d\alpha = m_z(t) * h(t) \\
R_y(t, s) &= \iint_{-\infty}^{\infty} R_z(\alpha, \beta)\, h(t - \alpha)\, h(s - \beta)\, d\alpha\, d\beta \\
L_y(t, s) &= \iint_{-\infty}^{\infty} L_z(\alpha, \beta)\, h(t - \alpha)\, h(s - \beta)\, d\alpha\, d\beta
\end{aligned}
\]

Sketch of proof: Assuming that expectations and integrals are freely interchangeable we have:
\[
m_y(t) = E(y(t)) = E\Big(\int_{-\infty}^{\infty} z(\alpha)\, h(t - \alpha)\, d\alpha\Big)
= \int_{-\infty}^{\infty} E\big(z(\alpha)\big)\, h(t - \alpha)\, d\alpha
= \int_{-\infty}^{\infty} m_z(\alpha)\, h(t - \alpha)\, d\alpha
= m_z(t) * h(t)
\]
\[
\begin{aligned}
R_y(t, s) = E(y(t)y(s)) &= E\Big(\int_{-\infty}^{\infty} z(\alpha)\, h(t - \alpha)\, d\alpha \int_{-\infty}^{\infty} z(\beta)\, h(s - \beta)\, d\beta\Big) \\
&= E\Big(\iint_{-\infty}^{\infty} z(\alpha) z(\beta)\, h(t - \alpha)\, h(s - \beta)\, d\alpha\, d\beta\Big) \\
&= \iint_{-\infty}^{\infty} \underbrace{E\big(z(\alpha) z(\beta)\big)}_{R_z(\alpha, \beta)}\, h(t - \alpha)\, h(s - \beta)\, d\alpha\, d\beta
= \iint_{-\infty}^{\infty} R_z(\alpha, \beta)\, h(t - \alpha)\, h(s - \beta)\, d\alpha\, d\beta
\end{aligned}
\]
By definition and using the above we then have:
\[
\begin{aligned}
L_y(t, s) &= R_y(t, s) - m_y(t)\, m_y(s) \\
&= \iint_{-\infty}^{\infty} R_z(\alpha, \beta)\, h(t - \alpha)\, h(s - \beta)\, d\alpha\, d\beta
- \iint_{-\infty}^{\infty} m_z(\alpha)\, h(t - \alpha)\, m_z(\beta)\, h(s - \beta)\, d\alpha\, d\beta \\
&= \iint_{-\infty}^{\infty} \underbrace{\big(R_z(\alpha, \beta) - m_z(\alpha)\, m_z(\beta)\big)}_{L_z(\alpha, \beta)}\, h(t - \alpha)\, h(s - \beta)\, d\alpha\, d\beta
= \iint_{-\infty}^{\infty} L_z(\alpha, \beta)\, h(t - \alpha)\, h(s - \beta)\, d\alpha\, d\beta
\end{aligned}
\]

Theorem 42 (W&J, pp. 181 - 182). Let z(t) be a wide-sense stationary random process with mean $m_z$, autocorrelation function $R_z(\tau)$ and power spectral density $S_z(f)$. Define the random process
\[
y(t) = \int_{-\infty}^{\infty} z(\alpha)\, h(t - \alpha)\, d\alpha = z(t) * h(t).
\]
Then y(t) is a wide-sense stationary random process and, if $R_z(0)$ is finite,
\[
\begin{aligned}
m_y(t) = m_y &= m_z\int_{-\infty}^{\infty} h(t)\, dt \\
R_y(\tau) &= \iint_{-\infty}^{\infty} R_z(\tau + \alpha - \beta)\, h(\alpha)\, h(\beta)\, d\alpha\, d\beta \\
L_y(\tau) &= \iint_{-\infty}^{\infty} L_z(\tau + \alpha - \beta)\, h(\alpha)\, h(\beta)\, d\alpha\, d\beta \\
S_y(f) &= S_z(f)\,|H(f)|^2
\end{aligned}
\]
where $H(f) = \int_{-\infty}^{\infty} h(t)\, e^{-j2\pi f t}\, dt$.

Sketch of proof. Top of page 182 in W&J; easy to read.

Theorem 43 (Properties of $S_x(f)$). (W&J pp. 184 - 185)

1. $S_x(f) \in \mathbb{R}$, $\forall f$,
2. $S_x(f) = S_x(-f)$, $\forall f$,
3. $S_x(f) \ge 0$, $\forall f$,
4. $R_x(0) = \int_{-\infty}^{\infty} S_x(f)\, df$.

Proof.

1. Follows from $R_x(\tau)$ being an even function.
2. Follows from $R_x(\tau)$ and $S_x(f)$ being real functions.
3. Consider the situation depicted in figure 3.14 (x(t) drives an ideal band-pass filter H(f) of bandwidth $\Delta f$ centred on $\pm f_0$, and y(t) denotes the output). We have:
\[
0 \le R_y(0) = \int_{-\infty}^{\infty} S_y(f)\, e^{j2\pi f\tau}\, df\Big|_{\tau=0}
= \int_{-\infty}^{\infty} S_y(f)\, df
= \int_{-\infty}^{\infty} S_x(f)\,|H(f)|^2\, df
\approx S_x(-f_0)\,\Delta f + S_x(f_0)\,\Delta f = 2\, S_x(f_0)\,\Delta f .
\]
It follows that:
\[
S_x(f_0) \approx \frac{R_y(0)}{2\Delta f} \ge 0, \qquad \forall f_0 .
\]

130

CHAPTER 3. RANDOM WAVEFORMS

Figure 3.14:

The above proof of property (3) illustrates the physical meaning of the power spectral density $S_x(f)$:
\[
S_x(f_0) \equiv \frac{\text{Average power of } y(t)}{2\Delta f}
\equiv \frac{\text{Average power of } x(t) \text{ in the band } |f \pm f_0| \le \Delta f/2}{2\Delta f}
\]
In concrete words (W&J, pp. 182 - 183): “$S_x(f)$ describes the distribution of mean power with frequency in the process x(t)”.
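Theorem 42 and the interpretation above can be illustrated with a discrete-time analogue (a sketch of my own, using an assumed short FIR impulse response): the output power equals the integral of $S_z(f)|H(f)|^2$ over frequency.

```python
import numpy as np

rng = np.random.default_rng(3)

N = 2**16
sigma2 = 2.0                                   # S_z(f) = sigma2 on -1/2 < f < 1/2
z = rng.normal(0.0, np.sqrt(sigma2), size=N)   # discrete-time "white" input

h = np.array([0.5, 0.3, 0.2])                  # any stable impulse response
y = np.convolve(z, h, mode="full")[:N]         # y = z * h

# Discrete-time analogue of theorem 42: S_y(f) = S_z(f) |H(f)|^2
f = np.fft.rfftfreq(N)                         # normalized frequency, 0 .. 1/2
H = np.fft.rfft(h, n=N)
Sy_theory = sigma2 * np.abs(H)**2

# Output power = integral of S_y over one period of frequency
Ry0_theory = 2.0 * np.trapz(Sy_theory, f)      # factor 2 folds in negative frequencies
print(np.var(y), Ry0_theory, sigma2 * np.sum(h**2))   # all ~0.76
```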

3.5

Ergodicity

Up to now, the moments of random processes have been calculated with expressions such as:
\[
\begin{aligned}
\overline{x(t)} &= E[x(t)] = \int_{-\infty}^{\infty} \alpha\, p_{x(t)}(\alpha)\, d\alpha \\
\overline{\big(x(t) - \overline{x(t)}\big)^2} &= Var\big(x(t)\big) = \int_{-\infty}^{\infty} \big(\alpha - \overline{x(t)}\big)^2\, p_{x(t)}(\alpha)\, d\alpha \\
R_x(s, t) &= \iint_{-\infty}^{\infty} \alpha\beta\, p_{x(t)x(s)}(\alpha, \beta)\, d\alpha\, d\beta \\
&\ \ \vdots
\end{aligned}
\]

−∞

Operations of this form are called ensemble averages: the expectation is calculated across all the sample functions at fixed time value(s). This is the true expectation on a random process. It may however be impractical for experimental measurements, since it requires a statistically large number of sample


functions, as opposed to the following time averages:
\[
\begin{aligned}
\langle x(\omega, t)\rangle &= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(\omega, t)\, dt \\
\big\langle \big(x(\omega, t) - \langle x(\omega, t)\rangle\big)^2\big\rangle &= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} \big(x(\omega, t) - \langle x(\omega, t)\rangle\big)^2\, dt \\
\langle x(\omega, t)\, x(\omega, t + \tau)\rangle &= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(\omega, t)\, x(\omega, t + \tau)\, dt \\
&\ \ \vdots
\end{aligned}
\]
In the above expressions, x(ω, t) is a sample function of x(t) and different ω's generally yield different values in the above integrals. An ergodic process yields the same values for all sample functions x(ω, t).

Definition 3.5.1. A stationary random process x(t) is ergodic if
\[
E\big[g\big(x(t), x(t + \tau_1), \ldots, x(t + \tau_{n-1})\big)\big]
= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} g\big(x(\omega, t), x(\omega, t + \tau_1), \ldots, x(\omega, t + \tau_{n-1})\big)\, dt
\]
for every function $g : \mathbb{R}^n \to \mathbb{R}$, for all $\tau_1, \tau_2, \ldots, \tau_{n-1} \in \mathbb{R}$, for all $n \in \mathbb{N}^*$ and for all ω ∈ Ω.

In other words, each sample function of an ergodic process contains all the statistical information of the random process; time and ensemble averages consequently yield the same values. In particular all sample functions have the same DC value, average power, AC power and frequency content:
\[
\begin{aligned}
E[x(t)] &= m_x(t) = m_x = \langle x(\omega, t)\rangle \\
E[x(t)x(t + \tau)] &= R_x(\tau) = \langle x(\omega, t)\, x(\omega, t + \tau)\rangle \\
S_x(f) &= \lim_{T\to\infty}\frac{1}{2T}\left|\int_{-T}^{T} x(\omega, t)\, e^{-j2\pi f t}\, dt\right|^2
\end{aligned}
\]

since by definition an ergodic process is necessarily stationary (but the converse is not always true).

Remark. If x(t) is wide-sense stationary but not ergodic, we still refer to $m_x$ as the average DC value, to $R_x(0)$ as the average power, and to $L_x(0)$ as the average AC power.

As mentioned in the discussion following definition 3.3.3, the quantities may differ from one sample function to another and mx , Rx (0), Lx (0) are then the expected values of the quantities (average DC value, average power, average AC power) calculated through the ensemble of all the sample functions. Similarly, if x(t) is wide sense stationary but not ergodic, Sx (f ) represents the ensemble average of the frequency content of the sample functions (Wiener-Khinchine theorem).

132

CHAPTER 3. RANDOM WAVEFORMS

Example 3.5.1. The random process of example 3.1.1 is not ergodic (obviously, since the process is not stationary; this example illustrates this by calculating the time averages). We recall that:
\[
x : \Omega \times \mathbb{R} \to \mathbb{R}, \qquad
x : (\mathrm{Head}, t) \mapsto 2, \qquad
x : (\mathrm{Tail}, t) \mapsto \sin t .
\]
Indeed, it was found that $m_x(t) = E[x(t)] = 1 + \sin(t)/2$. We also have:
\[
\langle x(\mathrm{Head}, t)\rangle = \langle 2\rangle = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} 2\, dt = 2,
\qquad
\langle x(\mathrm{Tail}, t)\rangle = \langle \sin t\rangle = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} \sin t\, dt = 0 .
\]

sin tdt −T

All of $m_x(t)$, $\langle x(\mathrm{Head}, t)\rangle$, $\langle x(\mathrm{Tail}, t)\rangle$ (and a whole lot more) would have to be equal for the process to be ergodic.

Example 3.5.2. Let x(t) = 10 sin(2πt + θ) where θ is a random variable uniformly distributed between 0 and 2π. It was found in example 3.3.2 that $m_x(t) = E[x(t)] = 0$, ∀t. We also have:
\[
\langle x(\omega, t)\rangle = \langle 10\sin(2\pi t + \theta(\omega))\rangle
= \lim_{T\to\infty}\frac{1}{2T}\underbrace{\int_{-T - \theta(\omega)/2\pi}^{T - \theta(\omega)/2\pi}}_{\text{covers an interval of length } 2T} 10\sin\big(2\pi t + \theta(\omega)\big)\, dt = 0
\]
for any value of θ(ω). It was also found that $R_x(\tau) = 50\cos(2\pi\tau)$. We now find:
\[
\langle x(\omega, t)\, x(\omega, t + \tau)\rangle
= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} 100\sin\big(2\pi t + \theta(\omega)\big)\,\sin\big(2\pi (t + \tau) + \theta(\omega)\big)\, dt
\]

−T

= 50 cos(2πτ ) for any value of θ(ω). The above indicates (but does not guarantee) that the process may be ergodic. In the following section we investigate Gaussian random processes, some of which are ergodic.
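A numerical sanity check of this example (a sketch of my own, not part of the formal development): the time-averaged autocorrelation of a single sample function of x(t) = 10 sin(2πt + θ) should be close to 50 cos(2πτ) regardless of the realized phase.

```python
import numpy as np

rng = np.random.default_rng(4)

theta = rng.uniform(0.0, 2.0 * np.pi)        # one realization of the random phase
dt = 1e-3
t = np.arange(0.0, 2000.0, dt)               # long observation window
x = 10.0 * np.sin(2.0 * np.pi * t + theta)

def time_avg_autocorr(x, lag_samples):
    """<x(t) x(t + tau)> estimated over the observation window."""
    if lag_samples == 0:
        return np.mean(x * x)
    return np.mean(x[:-lag_samples] * x[lag_samples:])

for tau in (0.0, 0.1, 0.25, 0.5):
    est = time_avg_autocorr(x, int(round(tau / dt)))
    print(tau, est, 50.0 * np.cos(2.0 * np.pi * tau))   # estimate vs 50 cos(2 pi tau)
```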

3.6. GAUSSIAN RANDOM PROCESSES

3.6

133

Gaussian random processes

(refer to W&J, pp. 171 - 179, 186 - 192) Definition 3.6.1. (W&J, p. 171) A random process x(t) is Gaussian if the random vector x(t) is Gaussian for all t = (t1 , t2 , . . . , tk ) ∈ Rk and for all k ∈ N∗ . In many practical situations, the conditions of the central limit theorem are satisfied which makes “Gaussian processes of the utmost practical (as well as mathematical) importance”.10 In light of sections 2.13 and 3.3, we see that a Gaussian random process is specified by its mean function mx (t) and its autocorrelation function Rx (t, s) or its autocovariance function Lx (t, s).11 The following example illustrates that the roles played by the mean and autocovariance functions are similar to that of the mean vector and covariance matrix of a Gaussian random vector. Example 3.6.1. (W&J, page 174) Let x(t) be a Gaussian random process with: mx (t) = sin(πt)

(W&J, eq. 3.96a)
\[
L_x(t, s) = e^{-|t - s|} \qquad \text{(W&J, eq. 3.96b)}
\]

Notice that x(t) is not (wide-sense) stationary (nor ergodic). We let $x = (x_1, x_2) = (x(1), x(3/2))$ and find:
\[
m_x = \big(\sin\pi,\ \sin(3\pi/2)\big) = (0, -1),
\qquad
\Lambda_x = \begin{bmatrix} 1 & e^{-1/2} \\ e^{-1/2} & 1 \end{bmatrix},
\]
\[
|\Lambda_x| = 1 - e^{-1} = \frac{e - 1}{e},
\qquad
\Lambda_x^{-1} = \frac{1}{1 - e^{-1}}\begin{bmatrix} 1 & -e^{-1/2} \\ -e^{-1/2} & 1 \end{bmatrix}
= \begin{bmatrix} \dfrac{e}{e-1} & \dfrac{e^{1/2}}{1-e} \\[6pt] \dfrac{e^{1/2}}{1-e} & \dfrac{e}{e-1} \end{bmatrix}.
\]
Finally we obtain from equation (W&J, 3.90):
\[
p_x(\alpha) = \frac{1}{2\pi\sqrt{|\Lambda_x|}}\exp\Big(\frac{-1}{2}\,(\alpha - m_x)\,\Lambda_x^{-1}\,(\alpha - m_x)^T\Big)
\]

( −1 ) 1 T √ exp (α − mx ) × Λ−1 x × (α − mx ) 2 2π |Λx |

Similarly any joint probability density function px(t) (α) can be calculated for any k ∈ N∗ and any t = (t1 , t2 , . . . , tk ) ∈ Rk . Remarks. 1. Λx(t) is symmetrical as required since Rx (t, s) = E[x(t)x(s)] = E[x(s)x(t)] = Rx (s, t). 10 W&J 11 We

top of page 72. recall that Lx (t, s) = Rx (t, s) − mx (t)mx (s).

134

CHAPTER 3. RANDOM WAVEFORMS

2. Random processes which are not Gaussian are in general not specified by mx (t) and Rx (t, s) or Lx (t, s) alone; other moment functions are required. Indeed W&J shows two random processes y(t) and z(t) having the same mean and autocorrelation functions (⇒ same covariance function) at the top of page 175, but having different probability density functions. This means that in order to specify these non-Gaussian random processes, more information is required than that carried by the mean and autocorrelation functions. 3. (W&J, pp. 175 - 177) A Gaussian random process x(t) is (strict sense) stationary if and only if it is wide sense stationary. This follows from the previous remark that any joint probability density function px(t) (α) is completely determined by the functions mx (t) and Rx (t, s): if x(t) is wide sense stationary, the functions mx (t) and Rx (t, s) are independent of the time origin and any px(t) (α) is also consequently independent of the time origin. Wide sense stationariness of non-Gaussian random processes does not guarantee their strict sense stationariness. For example, W&J shows at the top of page 177 a (non-Gaussian) random process which is wide sense stationary, but not strict sense stationary. Theorem 44. Let x(t) be a stationary Gaussian random process with autocorrelation function Rx (τ ). Then: ∫

∞ −∞

|Rx (τ )|dτ < ∞ ⇒ x(t) is ergodic

Remark. The condition of the theorem is not satisfied if mx ̸= 0; Rx (τ ) cannot be replaced by Lx (τ ) in the theorem. Theorem 45. (W&J, p. 179) Let x(t), y(t) be random processes such that: x(t) ∗ h(t) ∫ ∞ = h(α)x(t − α)dα

y(t) =

−∞

i.e. y(t) is the output of a bounded-input-bounded-output stable linear invariant filter of impulse response h(t) with input x(t). Then y(t) is Gaussian whenever x(t) is Gaussian. Sketch of proof. Suppose h(t) = 0, ∀|t| > T , for some 0 < T ∈ R. Then ∫ y(t)

T

= −T



k ∑ i=1

h(α)x(t − α)dα

h(αi )x(t − αi )∆αi

3.6. GAUSSIAN RANDOM PROCESSES

135

where αi ∈ [−T, T ], ∆i ∈ R+ , i = 1, 2, . . . , k determine k disjoint intervals [αi , αi + ∆i ] covering [−T, T ], i.e. [αi , αi + ∆i ] ∩ [αj , αj + ∆j ] = ∅, ∀i ̸= j ∪ki=1 [αi , αi + ∆i ] = [−T, T ] The random variables y(t1 ), y(t2 ), . . . , y(tN ) can therefore be written in matrix notation as follows:   x(t1 − α1 ) ..     .    x(t1 − αk )        x(t − α )   2 1 y(t1 ) h 0 ... 0    ..  y(t2 )   0 h . . . 0     .       = ×     .. .. .. . . ..     . . . .   x(t2 − αk )  .    .. y(tN ) 0 0 ... h   .    x(tN − α1 )      ..   . x(tN − αk ) where h = [h(α1 ) ∆α1 , h(α2 ) ∆α2 , . . . , h(αk ) ∆αk ]. It follows from property 4, W&J p. 166 that the random vector [y(t1 ), y(t2 ), . . . , y(tN )] is Gaussian. As stated in W&J, “the formal proof is mathematically involved” and deals with infinite weighted sums of Gaussian random variables. In general, the probability density functions of the output random process are not of the same nature as those of the input random process, but it works for the Gaussian process. Combining this result with those of theorems 41 and 42, we see that if x(t) is Gaussian and specified then y(t) is also Gaussian and specified.

3.6.1

Jointly Gaussian Processes

(W&J, pp. 186 - 188) Although some of the definitions and results of this subsection apply to any random processes, we limit the discussion to Gaussian random processes. Definition 3.6.2. Let y(t) and z(t) be Gaussian random processes. If w = (y(t1 ), y(t2 ), . . . , y(tk ), z(tk+1 ), z(tk+2 ), . . . , z(tk+l )) is a Gaussian random vector ∀k, l ∈ N, not both 0, and ∀t = (t1 , t2 , . . . , tk+l ) ∈ Rk+l then y(t) and z(t) are called jointly Gaussian random processes. Example 3.6.2. (W&J, p. 186) Consider the following situation depicted in Fig. 3.25, W&J, and figure 3.15. The sketch of proof of theorem 45 can be

136

CHAPTER 3. RANDOM WAVEFORMS

Figure 3.15:

generalized to the present situation to show that any random vector formed by samples taken from y(t) and z(t) is Gaussian. Thus y(t) and z(t) are jointly Gaussian random processes. We recall that the random processes y(t) and z(t) are specified by their respective mean and autocorrelation functions since they are Gaussian. But the joint probability density function of the random vector denoted as w in definition 3.6.2 cannot be written without the knowledge of E[y(tj )z(si )] or Cov(y(tj ), z(si )) = E[y(tj )z(si )] − my (tj )mz (si ). This is contained in the crosscorrelation function defined next. Definition 3.6.3. 1. The function Ryz (t, s) , E[y(t)z(s)] is called cross-correlation function of the random processes y(t) and z(t). 2. The function Lyz (t, s) , Cov(y(t), z(s)) is called cross-covariance function of the random processes y(t) and z(t). One easily shows that Lyz (t, s) = Ryz (t, s) − my (t)mz (s). The joint probability density function of any random vector such as w in definition 3.6.2 can be written from the knowledge of the functions my (t), mz (t), Ry (t, s), Rz (t, s), Ryz (t, s). In such cases y(t) and z(t) are said jointly specified, meaning that any joint probability density function of the form py(t),z(s) (α, β) can be calculated. If the mean and correlation functions are independent of time origin, y(t) and z(t) are said jointly stationary (jointly wide sense stationary if the processes are not jointly Gaussian) and as usual one argument of the functions can be dropped and we write: my , mz , Ry (τ ), Rz (τ ), Ryz (τ ) , Ryz (t, t + τ ) in which τ = s − t. Remark. In the case of the autocorrelation, we have Ry (τ ) = Ry (t, t + τ ) = Ry (t + τ, t) since the autocorrelation function is even as shown in theorem 40.

3.6. GAUSSIAN RANDOM PROCESSES

137

For the cross-correlation function we assume, unless otherwise stated, that: Ryz (τ )

, Ryz (t, t + τ ) = Rzy (t + τ, t) , Rzy (−τ )

and the functions Ryz ( ), Rzy ( ) are in general not even. Definition 3.6.4. Let y(t), z(t) be jointly stationary Gaussian random processes with cross-correlation function Ryz (τ ). Then the function ∫ ∞ Ryz (τ )e−j2πf τ dτ Syz (f ) = −∞

is called cross-power spectra or cross-power spectral density. Ryz (τ ) can be recovered uniquely from Syz (f ) by: ∫ ∞ Ryz (τ ) = Syz (f )ej2πf τ df −∞

In contrast to the functions Rx (τ ), Sx (f ), no physical interpretation can easily be made of the functions Ryz (τ ), Syz (f ); they do not relate to the average total power or average frequency content. Ryz (τ ), Syz (f ) are used to write the moments of random vectors obtained by sampling random process. The following properties of the cross-correlation and cross-spectra are easy to show and stated without proof. Proposition 46 (Properties of Ryz (f ) and Syz (f )). 1. Ryz (τ ) = Rzy (−τ ) 2. Syz (f ) = Szy (−f ) 3. Syz (−f ) = Syz (f )∗ Theorem 47. Let x(t) be a Gaussian random process and define the jointly Gaussian random processes y(t), z(t) by: ∫ ∞ y(t) = x(t) ∗ hy (t) = x(t − α)hy (α)dα −∞ ∫ ∞ z(t) = x(t) ∗ hz (t) = x(t − α)hz (α)dα −∞

If Rx (t, s) is finite for all t, s then: 1. (W&J, eq 3.127a)

∫∫

Ryz (t, s)

=

Lyz (t, s)

=



−∞

∫∫



−∞

Rx (t − α, s − β)hy (α)hz (β)dαdβ Lx (t − α, s − β)hy (α)hz (β)dαdβ

138

CHAPTER 3. RANDOM WAVEFORMS

2. If x(t) is stationary then y(t) and z(t) are jointly stationary and: ∫∫ ∞ Ryz (τ ) = Rx (τ + β − α)hy (α)hz (β)dαdβ −∞

∫∫ Lyz (τ ) = Syz (f ) = |Syz (f )|2

=



−∞

Lx (τ + β − α)hy (α)hz (β)dαdβ

Sx (f )Hy (f )Hz (f )∗ Sy (f )Sz (f )

Proof. 1. Similar to what has been done in the sketch of proof of theorem 41 by interchanging the order of expectation and integration. 2. The cross-correlation formula follows from the general formula by replacing Rx (t − α, s − β) with Rx (s − t + α − β) and identifying τ = s − t. The cross-power spectra formula directly follows from the properties of the Fourier transform applied to both sides of the formula for the crosscorrelation.

Remark. If Hy (f ) and Hz (f ) are non-overlapping (as in W&J, Fig. 3.26), i.e. Hy (f )Hz (f ) = 0, ∀f and x(t) is stationary then Syz (f ) = 0, ∀f and this implies that Ryz (τ ) = 0, ∀τ . We also have from theorem 42 that ∫ ∞ my = mx hy (t)dt = mx Hy (0) −∞

mz

= mx Hz (0)

and finally my mz = m2x Hy (0)Hz (0) = 0. Therefore: Lyz (τ )

= Ryz (τ ) − my mz = 0

“Thus any covariance involving both y(t) and z(t) must be zero”. It follows that (W&J, eq 3.130): py(t)z(s) (α, β) = py(t) (α)pz(s) (β)

(3.1)

recalling that x(t) must be both Gaussian and stationary. Definition 3.6.5. Two random processes y(t) and z(t) for which the joint probability density functions satisfy equation (3.1) for all t, s ∈ Rk+l and for all k, l ∈ N (not both 0) are called statistically independent. Theorem 48. Two jointly Gaussian processes x(t), y(t) are statistically independent if and only if Lxy (t, s) = 0, ∀t, s.

3.6. GAUSSIAN RANDOM PROCESSES

3.6.2

139

White Gaussian Noise

(W&J, pp. 188 - 192) Definition 3.6.6. (W&J, p. 189) A zero-mean stationary Gaussian random process nw (t) for which the power spectral density Sw (f ) and the autocorrelation function Rw (τ ) are of the form: Sw (f ) = Rw (τ ) =

N0 , ∀f 2 N0 δ(τ ) 2

is called white Gaussian noise. Remarks. 1. Lw (τ ) = Rw (τ ) since nw (t) is zero-mean. 2. White Gaussian noise is ergodic by theorem 44. 3. Rw (τ ) = N20 δ(τ ) means that “white Gaussian noise represents the ultimate in randomness”.12 ∫∞ The total average power of white Gaussian noise is Rw (0) = −∞ Sw (f )df = ∞. As a consequence white noise is fictitious; it is a mathematical construction/model that is not found in nature. Its usefulness comes from the fact that filtered white Gaussian noise produces an ergodic (⇒ stationary) zero-mean Gaussian noise (random process) that is meaningful. The next theorem, stated without proof, results from section 3.4, properties of the Fourier transform and Parseval’s theorem: Theorem 49. Let nw (t) be a white Gaussian noise with power spectral density Sw (f ) = N0 /2 and define: n(t)

in which

∫∞ −∞

= nw (t) ∗ h(t) ∫ ∞ = nw (t − α)h(α)dα −∞

h2 (t)dt is finite. Let H(f ) , F [h(t)] =

∫∞ −∞

h(t)e−j2πf t dt. Then:

1. (W&J, eq. 3.137) Rn (τ ) = Ln (τ ) = = 2. (W&J, eq. 3.134a) Sn (f ) = 12 W&J,

p. 190.

N0 h(τ ) ∗ h(−τ ) 2 ∫ N0 ∞ h(β − τ )h(β)dβ 2 −∞

N0 2 2 |H(f )|

140

CHAPTER 3. RANDOM WAVEFORMS

3. The total average power of n(t) is given by (W&J, eq. 3.134b): ∫ N0 ∞ 2 n2 (t) = Rn (0) = h (α)dα 2 −∞ ∫ N0 ∞ = |H(f )|2 df 2 −∞ Proof of part 1: From theorem 42 we have: ∫∫ ∞ Rn (τ ) = Rnw (τ + α − β)h(α)h(β)dαdβ −∞ ∫∫ ∞ N0 = δ(τ + α − β)h(α)h(β)dαdβ −∞ 2 ∫ ∞ N0 = h(β)h(β − τ )dβ −∞ 2 N0 = Rh (τ ) 2 where Rh (τ ) denotes the deterministic autocorrelation function of h(t). Example 3.6.3. (W&J, pp. 190 - 192) We consider the filtering of a white Gaussian noise w(t) with an ideal low-pass filter of transfer function W (f ) given by: { 1 ; |f | < W W (f ) = 0 ; elsewhere Denoting by n(t) the output of the filter, we have: 1. n(t) is an ergodic (⇒ stationary) Gaussian random process. ∫∞ 2. mw = 0 ⇒ mn = mw −∞ h(t)dt = 0. { 3. Sn (f ) =

N0 2

0

; |f | < W ; elsewhere

4. Rn (τ ) = Ln (τ ) = F −1 [Sn (f )] = W N0 sinc(2τ W ).13 Rn (τ ) is sketched on figure 3.16. 5. The total average power of n(t) is W N0 . Since n(t) is ergodic the total average power of any sample function of n(t) is also W N0 . 6. Define the jointly Gaussian random variables ni = n 13 Recall

that sinc(x) ,

sin(πx) . πx

( i ) + T , i = 1, 2, . . . , k, 2W

3.6. GAUSSIAN RANDOM PROCESSES

141

Figure 3.16:

where T ∈ R. In order to calculate pn (α) of the Gaussian vector n = (n1 , n2 , . . . , nk ) we first calculate the mean vector mn and covariance matrix Λn of n:

mn = E[n] = (n1 , n2 , . . . , nk ) = (mn , mn , . . . , mn ) = 0 [ Λn

=

] Cov(ni , nj )

i=1,...,k j=1,...,k

[ ( ( ) ( j ))] i = Cov n + T ,n +T i=1,...,k 2W 2W j=1,...,k  1 1 ) Ln ( W ) . . . Ln ( k−1 Ln (0) Ln ( 2W 2W ) 1 k−2  Ln ( 1 ) L (0) L ( ) . . . L ( n n n 2W 2W 2W )  k−3  Ln ( 1 ) Ln ( 1 ) L (0) . . . L ( n n W 2W 2W ) =   . . . . . . . . . . . . . . .   ... ... ... ... ... k−3 Ln ( k−1 ) Ln ( k−2 Ln (0) 2W ) Ln ( 2W ) . . . [ 2W ] = diag N0 W, N0 W, . . . , N0 W

       

and the ni ’s are consequently statistically independent. Therefore:

pn (α) =

k ∏

pni (αi )

i=1

=

1 exp (2πN0 W )k/2

(

) k −1 ∑ 2 αi . 2N0 W i=1

142

CHAPTER 3. RANDOM WAVEFORMS

3.7 1.

Problems 14

Consider two zero-mean jointly Gaussian noise processes, n1 (t) and n2 (t) such that: sin πτ ; πτ sin πτ . R12 (τ ) , n1 (t)n2 (t − τ ) = 2πτ Ri (τ ) , ni (t)ni (t − τ ) =

i = 1, 2,

(a) Write the covariance matrix of the random vector x , (x1 , x2 , x3 ) where: x1 x2 x3 Remark: lim

τ →0

sin πτ πτ

, n1 (t) t=0 , , n1 (t) t=1 , , n2 (t) t=0 .

= 1.

(b) Calculate the variance of the random variable y , x1 + x2 + x3 . 2. Let x(t), y(t) denote two 0-mean jointly Gaussian jointly stationary processes such that: y(t) = x(t) ∗ h(t) as shown in figure 3.17. The autocorrelation and cross-correlation functions are shown in figure 3.18. Note that the cross-correlation Rxy (τ ) denotes Rxy (t, t + τ ) = E[x(t)y(t + τ )]. (a) Is the process x(t) ergodic given that:

Rx (τ ) =

  1 − 2|τ | ; −1/2 < τ < 1/2 

0

; elsewhere

Justify your answer. (b) Calculate the average (total) power of x(t) and of y(t). (c) Write the expressions px(1/2) (α), py(0) (β).

of

the

probability

density

functions

(d) Calculate the probability of the event described by x(1/2)+2y(0) > 5. 14 Based

on W&J, problem 3.4

3.7. PROBLEMS

143

(a)

(b)

Figure 3.17:

Figure 3.18:

3. Consider the random process x(t) = 10 sin(ωt + θ) where ω, θ are two statistically independent random variables with probability density functions: pω (α) =

pθ (α) =

1 1 δ(α − 1) + δ(α − 2) 2 2  1  2π ; 0 ≤ α < 2π 

0

; elsewhere

i.e. ω can take the values 1 or 2 with equal probabilities and θ is uniformly distributed between 0 and 2π. Calculate the mean function mx (t) and the autocorrelation Rx (t1 , t2 ) of x(t). Based on your answer, is the processes wide-sense stationary?

144

CHAPTER 3. RANDOM WAVEFORMS Hint: If you need to calculate an expected value by integrating with the joint probability density function pω, θ (α, β), you should carry out the first integration with respect to β; it’s a lil’ faster but it would work either way. The following trigonometric identity may also be useful: 2 sin(u) sin(v) = cos(u − v) − cos(u + v)

4. Let z(t) be a wide-sense stationary random process with mean mz , autocorrelation function Rz (τ ) and power spectral density Sz (f ): mz

= 0 AW Wτ sinc2 ( ) 2 2  ; −W A + 2Af  W 2 ≤f ≤0     = A − 2Af ; 0≤f ≤ W W 2      0 ; elsewhere

Rz (τ ) =

Sz (f )

The power spectral density of z(t) is sketched in figure 3.19. As shown in figure 3.20, z(t) is fed through a linear invariant filter (ideal low-pass filter), the frequency response of which is given by:   1 ; |f | ≤ W 4 H(f ) =  0 ; elsewhere The filter’s output is labelled y(t). (a) Express or sketch on a labelled graph the mean function my and the power spectral density Sy (f ) of the wide-sense stationary random process y(t) as functions of A and W . Hint: Do not use the expression for Rz (τ ) and do not use any table of Fourier transform pairs. (b) Express the total average power of y(t) as a function of A and W . 5. Calculate the autocorrelation function Rx (t, s) = E[x(t)x(s)] of the random process: x(t) = r sin(2πf t + θ) + y(t) where • f ∈ R, f > 0 is not random,

3.7. PROBLEMS

145

Figure 3.19:

Figure 3.20:

• r is a Rayleigh distributed random variable (b > 0),  −α2 /b  2α ; α≥0 b e pr (α) =  0 ; α r1 , r0 > r2 , . . . , r0 > rM −1 , √ 2. r0 = Es + n0 and rj = nj , j = 1, 2, . . . , M − 1, 3. symmetry ⇒ P (C |m0 ) = P (C |mj ), j = 1, 2, . . . M − 1. The remainder of the proof is straightforward (W&J, page 258). √ Even though it is not clear from the expression, P (E ) is function of the ratio ES /N0 only; it does not depend of both ES √ and N0 independently. P (E ) is plotted on figure 4.15 as a function of SN R = ES /N0 . M −1 Definition 4.5.3. Let {sj }j=0 be a set of M = N equally likely, equal energy, −1 orthogonal signal vectors. The signal vector set {s′j = sj − a}M j=0 where

a=s=

M −1 1 ∑ si M i=0

is a simplex signal vector set. Remarks. 8 The notation is changed in that j ranges from 0 to M − 1 instead of from 1 to N . We also write s = (s0 , s1 , . . . , sM −1 ), n = (n0 , n1 , . . . , nM −1 ), r = (r0 , r1 , . . . , rM −1 ).

4.5. PROBABILITY OF ERROR

177

Figure 4.15:

1. The vector a in the above is the center of gravity (equation (4.97a) in W&J). The center of gravity of a simplex signal set is therefore at the origin. The M signal vectors of a simplex are linearly dependent; they span a vector space of dimension M − 1. 2. As previously mentioned in the proof of theorem 54, P (E ) is not affected by translation of the signal set and consequently the simplex set {s′j = −1 M −1 sj − a}M j=0 yields the same P (E ) as the signal vector set {sj }j=0 . Less energy is required for the transmission of the simplex signal vectors (page 261 in W&J). P (E ) for a given SN R can be obtained from figure 4.37 in W&J by rescaling the horizontal axis with the factor 1 − 1/M . P (E ) is much better than vertices of hypercube, rectangular decision regions such as QAM-16 plotted in figure 4.13; we expand on this below. 3. “A simplex is the optimum set of M signals for use in white Gaussian noise when the energy is constrained and P (mi ) = 1/M, ∀i” 9 . (s)

The signal vectors of a simplex set of energy Es 9 W&J,

page 260.

and dimension N = M − 1

178

CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

are of the form: √ s′i = −

(s)

Es (+1, +1, . . . , +1, (1 − M ), +1, . . . , +1) {z } M (M − 1) | M components

where the component (1 − M ) is in position i + 1 where 0 ≤ i ≤ M − 1. The square of the distance between two signal vectors s′i and s′j , i ̸= j, is ′ |si − s′j |2 where: √ (s) Es s′i − s′j = − (0, 0, . . . , 0, −M, 0, 0, . . . , 0, +M, 0, . . . , 0) M (M − 1) In the above, the component −M is in position i and the component +M is in position j. The square of the distance is then: |s′i − s′j |2

(s)

= =

Es × 2M 2 M (M − 1) 2M E (s) M −1 s (o)

An orthogonal set of signals of equal energy Es

=

M Es(s) M −1

is also such that the (s)

square of the distance between any two distinct signals is M2M −1 Es . Both sets consequently have the same P (E ) and P (C ). For example, a ternary (M = 3) (s) simplex set of dimension N = 2 with Es = 1 V2 has the same P (E ) as the (o) corresponding orthogonal signal set with energy Es = 1.5 V2 . In other words, the probability of error of a ternary simplex set at a signal-to-noise ratio of 0 dB (SN R = 1) is equal to the probability of error of a ternary orthogonal set at a signal-to-noise ratio of√1.76 dB (SN R = 1.5). P (E ) is plotted on figure 4.15 as a function of SN R = ES /N0 . −1 Definition 4.5.4. (W&J, page 261) Let {sj }N j=0 be a set of equal energy or−1 thogonal signal vectors. The signal vector set {sj , −sj }N j=0 is called biorthogonal set and clearly M = 2N . −1 Theorem 58. Let {sj , −sj }N j=0 contain M = 2N equiprobable biorthogonal signal vectors with with same energy Es = Em . Then (equation (4.104) in W&J): (∫ ∞ )N −1 ∫ ∞ √ P (C ) = pn (α − Es ) pn (β)dβ dα



−α

0 ∞

= ∫

pn (α −



∞ 0

pn (α −



)N −1



α

0

=

( ∫ Es ) 1 − 2

pn (β)dβ



( ( α ))N −1 Es ) 1 − 2Q √ dα N0 /2

4.5. PROBABILITY OF ERROR

179

Proof. Similar to orthogonal signal vectors. If s0 is transmitted, then:  r closer to s0 than to −s0     and  r closer to s0 than to sj , j = 1, 2, . . . , N − 1 no errors ⇔   and    r closer to s0 than to −sj , j = 1, 2, . . . , N − 1 This becomes:

 r0 > 0      and no errors ⇔      r0 > ri and r0 > −ri , i = 1, 2, . . . , N − 1

The remainder of the proof is straightforward. The expressions of P (C ) in equations (4.96a), (4.104) in W&J can be shown to depend on SN R = Em /N0 = Es /N0 only and √ not on both Es , N0 . P (E ) is plotted on figure 4.16 as a function of SN R = ES /N0 .

Figure 4.16:

180

4.5.3

CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

Completely Symmetric Signal Sets and a priori Knowledge

Pages 263 - 264 in W&J. Congruent Decision Region Receiver Definition 4.5.5. Consider a receiver m \ CDR ( ) and corresponding decision −1 regions {Ij }M . If all I ’s are rotations and translations of each others, the j j=0 decision regions are said to be congruent and the receiver m \ CDR ( ) is called congruent decision region receiver. We easily see that a symmetrical signal vector set with equiprobable messages leads to a congruent decision region receiver by the MAP decision rule (page 263 in W&J). Equivalently, when the signal set is symmetrical, a maximum likelihood receiver is a congruent decision region receiver. A congruent decision region receiver need not be optimal, but over an additive white Gaussian noise channel it satisfies (equation (4.106a) in W&J): P (C |mi ) = P (C |mj ) for all i, j, from which it follows that (equation (4.106b) in W&J): P (C ) =

M −1 ∑

P (mi )P (C |mi )

i=0

=

P (C |mi ), ∀i.

This proves the following (page 263 in W&J): Theorem 59. The error performance of a congruent decision region receiver is invariant to the actual (a priori) source statistics. A congruent decision region receiver is only guaranteed to be optimum when the (congruent) decision regions correspond to the MAP decision rule for given a priori message probabilities; if the a priori probabilities are not all equal, the receiver still achieves the same P (E ) but a better receiver could be designed. This can be advantageous from a designer’s point of view (page 264 in W&J). Minimax Receiver It has been observed that the P (E ) of a receiver depends on the a priori message probabilities (problem 2.12 in W&J), and reaches a maximum for a certain choice of the P (mi )’s. Definition 4.5.6. A receiver having smallest maximum P (E ) over all possible choices of P (mi )’s is called minimax receiver.

4.5. PROBABILITY OF ERROR

181

Theorem 60. A congruent decision region receiver which is optimal for a certain choice of the P (mi )’s is minimax. In particular a maximum likelihood receiver with symmetrical signal vector set is minimax. Proof. Refer to page 264 in W&J; this proves part 4 of theorem 50.

4.5.4

Union Bound on the Probability of Error

Pages 264 - 266 in W&J. The following theorem leads to useful simple expressions bounding the P (E ) of equations (4.96), (4.104) in W&J. −1 Theorem 61. Let {sj }M j=0 be a set of M equally likely signal vectors. Then (equation (4.109) in W&J)

P (E |mi ) ≤

M −1 ∑

P2 [si , sk ]

k=0 k ̸= i where P2 [si , sk ] denotes the P (E ) of a binary system that would use signals si and sk to communicate one of two equally likely messages. For the additive white Gaussian noise channel we have (equations (4.110) and previously (4.76b) in W&J) ( ) |si − sk | P2 [si , sk ] = Q √ 2N0 and therefore P (E |mi ) ≤

M −1 ∑

k=0 k ̸= i

( ) |si − sk | Q √ 2N0

Proof. Refer to page 265 in W&J – easy to read. Corollary. 1. For orthogonal signals (equation (4.111) in W&J): ) (√ P (E ) ≤ (M − 1)Q Es /N0 2. For simplex signals:

10

(√( )( )) M Es P (E ) ≤ (M − 1)Q M −1 N0 )

( 10 Recall

(s)

that Es

(o)

= Es

1−

1 M

(o)

⇒ Es

=

(s) M E , M −1 s

(s)

(o)

where Es , Es

denote the average signal energy of the simplex and orthogonal signal sets.

respectively

182

CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

3. For biorthogonal signals (equation (4.112) in W&J): √ √ P (E ) ≤ (M − 2)Q( Es /N0 ) + Q( 2Es /N0 ) Proof. 1. For orthogonal signals we have: √ √ s0 − sk = ( Es , 0, 0, . . . , 0, − Es , 0, 0, . . . , 0) √ √ where the components Es , − E √s are respectively in positions 0 and k ̸= 0. It follows that |s0 − sk | = 2Es . Therefore: P (E |m0 ) ≤

M −1 ∑ k=1

=

M −1 ∑

( ) |s0 − sk | Q √ 2N0 ) (√ Q Es /N0

k=1

=

(√ ) (M − 1)Q Es /N0

and by symmetry (congruent decision region receiver) we obtain: √ P (E ) = P (E |m0 ) = (M − 1)Q( Es /N0 ) 2. For a simplex signal set we start with the latter bound, and recalling that (page 261 in W&J): M (s) Es(o) = Es M −1 (s)

where Es

= Em , we obtain: (√( )( )) M Em P (E ) ≤ (M − 1)Q M −1 N0

3. For biorthogonal signals, we have P (E |m0 ) =

N −1 ∑ k=1

( ( ) N∑ ) ) −1 ( |s0 − sk | |s0 + sk | |s0 + s0 | Q √ + Q √ +Q √ 2N0 2N0 2N0 k=1

and we easily find: √ 2Es √ |s0 + sk | = 2Es √ |s0 + s0 | = 2 Es |s0 − sk | =

4.6. BIT ERROR RATE AND SIGNAL-TO-NOISE RATIO PER BIT

183

It follows (by symmetry) that: P (E )

= P (E |m0 )

(√ ) (√ ) = 2(N − 1)Q Es /N0 + Q 2Es /N0 (√ ) (√ ) = (M − 2)Q Es /N0 + Q 2Es /N0

These bounds are plotted in figure 4.17 for the case M = 16.

Figure 4.17:

4.6

Bit Error Rate and Signal-to-Noise Ratio per Bit

The graphs presented in the previous section do not present a fair comparison between signal sets of different sizes because of the following: 1. A set of M signals carries log2 (M ) bits for each transmission of a message. The signal energy should be calculated as an average energy per information bit. 2. Each message error results in a number of bit errors that can be anywhere between 1 and log2 (M ).

184

CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

It may be very difficult to predict the bit error rate of a signal set as a function of the average signal-to-noise ratio per bit, mainly because it depends on the mapping of information bit sequences to signal vectors. A simple approximation may be obtained by noticing that at a low signal-to-noise ratio one expects that half of the log2 (M ) bits of a message will be incorrect and at a high signal-to-noise ratio only one of the log2 (M ) bits of a message will be incorrect (using a Gray code to map the information bit sequences to signal vectors). After rescaling the horizontal axis to reflect the average signal-to-noise ratio per bit, some of the curves of the previous section are redrawn as two curves each, one of which corresponding to “half of the information bits are incorrect” and the other curve corresponding to “only one bit is incorrect”; the true performance curve should lie somewhere in between the two. Some of the curves are presented in figures 4.18 and 4.19.

Figure 4.18:

4.7. PROBLEMS:

185

Figure 4.19:

4.7

Problems:

1. Consider the signals in figure 4.20. (a) Calculate an orthonormal basis for s1 (t), s2 (t). Hint: you will find two orthonormal functions ϕ1 (t) and ϕ2 (t). (b) Calculate the coordinates of s1 (t) and s2 (t) in the basis {ϕ1 (t), ϕ2 (t)}. (c) To which function ϕ1 (t) or ϕ2 (t) does s1 (t) resemble the most? To which function ϕ1 (t) or ϕ2 (t) does s2 (t) resemble the most? (d) The signal  2  t ; 0≤t≤1 g(t) = t2 (u(t) − u(t − 1)) =  0 ; elsewhere does not lie in the space generated by ϕ1 (t), ϕ2 (t). Calculate the d in the space generated by ϕ1 (t), ϕ2 (t) that resembles the signal g(t) d on the same graph. Does most the signal g(t). Sketch g(t) and g(t) your answer appear to make sense? (e) Calculate a third orthonormal function ϕ3 (t) such that all of s1 (t), s2 (t) and g(t) lie in the space generated by ϕ1 (t), ϕ2 (t) and ϕ3 (t).

186

CHAPTER 4. OPTIMUM RECEIVER PRINCIPLES

Figure 4.20:

Chapter 5

Reliable Communication on an Unreliable Channel the situation depicted in figure 5.1 where a source is transmitting a message to a receiver via a communication channel. A denotes the source Consider source

A

channel

B

receiver

Figure 5.1: alphabet, A , |A | > 1 is the number of source alphabet symbols. The source is characterized by the a priori message probabilities P (A), ∀a ∈ A , that we assume known. B denotes the channel output alphabet. The channel is described by the conditional transition probabilities P (b|a), ∀a ∈ A , ∀b ∈ B. The receiver attempts to recover the transmitted symbol from the received symbol using the a priori probabilities and the channel transition probabilities. In chapter 4 we calculated the probability of error when one symbol from A is transmitted over the channel. Using the analogy between probabilities and relative frequency of occurrence of an event, we expect the error rate to equal the probability of error when multiple bits are transmitted over the channel (this is the basis of Monte Carlo simulations to estimate the probability of error by repeated transmission of symbols over a channel). Because the receiver has a non-zero probability of making an error, this scheme does not allow the reliable transmission of an arbitrarily long sequence over the channel. This is the problem that we now address: to transmit an arbitrarily long sequence over the channel and guarantee (with an arbitrarily small probability of error) that the sequence is received correctly in its entirety. The event “an error has occurred” is realized whenever the receiver makes a single or more mistakes in the received sequence. For a long time it was believed that the probability of this event could 187

188CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL never be made arbitrarily small. Shannon showed how this can be accomplished in a famous paper published in 1948. In the following we denote: A N = {(a1 , a2 , . . . , aN )|ai ∈ A } in which N > 1 is an integer. A sequence of N source alphabet symbols is an element of A N . Definition 5.0.1. 1. A block code C of (block) length N and rate R over the alphabet A is a subset of A N of size (cardinality) M = A RN . The elements of C are called codewords. It is convenient to number the codewords and without loss of generality we write: C = {y 1 , y 2 , . . . , y M }. 2. RN = logN2 M is the bit rate. RN may be larger than 1 and RN = R when A = 2. 3. If RN ∈ N, RN = K > 1, an encoder is a bijective mapping ϵ : A K → C . In the above definition the rate could equivalently be defined as: logA M ≤ 1. N The number ⌈logA M ⌉ represents the minimum number of components that are required in order to represent M messages with different sequences of symbols from A . It follows that the rate can be interpreted as the (average) fraction of alphabet symbol sent over the channel per channel use when a codeword taken at random is transmitted. Similarly, the bit rate RN represents the equivalent average number of bits that are transmitted per channel use. R=

Example 5.0.1. log2 40 = 5.322 bits which means that 40 given binary vectors of length N ≥ 6, can be used to represent all 32 binary vectors of length 5 plus some 8 binary vectors of length 6. If N = 10 is used then an average of 0.5322 bits is sent each time the channel is used to transmit the symbols of one of the 40 codewords taken at random. Example 5.0.2. C = {(0, 1, 0, 1), (1, 1, 1, 1), (1, 0, 1, 0), (1, 1, 1, 0)} is a block 1 code of length 4 and rate 21 over the alphabet A = {0, 1}; there are 4 = 2(4)( 2 ) codewords. The mapping: ϵ : {0, 1}2 ϵ : (0, 0) ϵ : (0, 1) ϵ : (1, 0) ϵ : (1, 1)

→ 7 → 7→ 7→ 7→

C (0, 1, 0, 1) (1, 1, 1, 1) (1, 0, 1, 0) (1, 1, 1, 0)

is one of the 4! = 24 possible encoders. Clearly an average of 21 bit from the source alphabet A = {0, 1} is sent every time the channel is used to transmit a codeword taken at random.

189 In order to communicate over the (possibly unreliable) channel, the source and receiver (the receiver is called a decoder in the context of coding for the correction of errors – see definition below) agree on a block code C and the source transmits a codeword y i ∈ C ⊂ A N taken at random over the channel as depicted in figure 5.2. The source is characterized by the a priori probabilities yi ∈ C

source

source

RN symbols

z∈B

channel

N

decoder yˆ

yˆ ( z ) ∈ C

encoder

If RN is an integer

Figure 5.2: P (y i ), ∀y i ∈ C , (we suppose the codewords are equally likely). The channel is described by the conditional transition probabilities PN (z|y i ), ∀y i ∈ C , ∀z ∈ B N . The decoder yb : B N → C attempts to recover the transmitted codeword from the received word z ∈ B N using the a priori probabilities and the channel transition probabilities and makes an error if and only if yb(z) ̸= y i . Definition 5.0.2. 1. A decoder is a mapping yb : B N → C . 2. Let I1 , I2 , . . . , IM ⊂ B N such that: Ii , {z ∈ B N : yb(z) = y i }. Ii is called the decision region of y i , i = 1, 2, . . . , M . N Clearly we have ∪M and Ii ∩ Ij = ∅, ∀i ̸= j. It follows that: i=1 Ii = B ∑ P (E |y i ) = P (z ∈ / Ii |y i ) = PN (z|y i ) z ∈I / i

which implies P (E )

=

M ∑

P (y i )P (E |y i )

i=1

=

M ∑ i=1

P (y i )



PN (z|y i )

z ∈I / i

The last expression shows that the probability is a function of the a priori probabilities P (y i ), the channel probabilities PN (z|y i ), the code C and the decoders decision regions Ii , i = 1, 2, . . . M .

190CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL Remark. the last expression assumes that z is a discrete ∑variable; if∫ it is continuous then all PN (z|y i ) become pz (γ|y = y i ) and all z become dγ. Definition 5.0.3. 1. An optimal decoder minimizes P (E ) so that any other decoder achieves a probability of error bigger or equal to that of an optimal decoder. 2. A maximum a posteriori decoder is such that: yb(z) = y i ⇔ P (y i )PN (z|y i ) ≥ P (y j )PN (z|y j ), ∀j ∈ {1, . . . , M }. 3. A maximum likelihood decoder is such that: yb(z) = y i ⇔ PN (z|y i ) ≥ PN (z|y j ), ∀j ∈ {1, . . . , M }. It is well known that a maximum a posteriori decoder is optimal and that a maximum likelihood receiver is optimal when the codewords are equally likely 1 (P (y i ) = M , ∀i). In the following we assume that the codewords are equally 1 likely .

5.1

Codes of size M = 2

We write C = {y 1 , y 2 } and the rate is R =

logA 2 N .

The decision regions are:

I1

=

{z ∈ B N : PN (z|y 1 ) ≥ PN (z|y 2 )},

I2

=

{z ∈ B N : PN (z|y 2 ) > PN (z|y 1 )} = B N − I1 .

Example 5.1.1. Let A = {0, 1}, A = 2 and the (memoryless) Binary Symmetric Channel (BSC) shown in figure 5.3 with p < 12 . We have: PN (z|y i )

= (1 − p)N −dH (z,yi )pdH (z,yi ) ( p )dH (z,yi = (1 − p)N ) 1−p

where dH (z, y i ), called the Hamming distance between z and y i , denotes the p < 1 and number of positions in which z and y i differ. Clearly p < 12 ⇒ 1−p PN (z|y i ) is maximized by minimizing dH (z, y i ). So if C = {(0, 0, 1, 0), (1, 0, 0, 1)}, 1 This

can be seen as a worst case scenario – refer to W&J page 81

5.1. CODES OF SIZE M = 2

191

then we find: I1

= {(0, 0, 0, 0), (0, 0, 1, 0), (0, 0, 1, 1), (0, 1, 0, 0), (0, 1, 1, 0), (0, 1, 1, 1), (1, 0, 1, 0), (1, 1, 1, 0)}

I2

= {(0, 0, 0, 1), (0, 1, 0, 1), (1, 0, 0, 0), (1, 0, 0, 1), (1, 0, 1, 1), (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 1, 1)}

Clearly I1 ∪ I2 = {0, 1}4 and I1 ∩ I2 = ∅. { 0,1}

4

I1 I2

y1= 0,0,1,0 

y2 =1,0,0,1

Figure 5.3:

For the code of size M = 2 and the decision regions of the maximum likelihood decoder we have: ∑ P (E |y 1 ) = P (z ∈ I2 |y 1 ) = PN (z|y 1 ) z∈I2

We then notice that for any z ∈ I2 we have PN (z|y 1 ) < PN (z|y 2 ), i.e. ( )s PN (z|y 2 ) PN (z|y 2 ) ≥1⇒ ≥ 1, ∀s ≥ 0 PN (z|y 1 ) PN (z|y 1 ) The above P (E |y 1 ) is bounded by: P (E |y 1 )



∑ z∈I2

=



( )s PN (z|y 2 ) PN (z|y 1 ) PN (z|y 1 ) PN (z|y 1 )1−s PN (z|y 2 )s

z∈I2





PN (z|y 1 )1−s PN (z|y 2 )s ,

z∈BN

∀s ≥ 0. The above bound is convenient because it does not involve the decision regions; it can be calculated or computed without first having to find the decision

192CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL regions. We also find similarly: P (E |y 2 ) ≤



PN (z|y 2 )1−s PN (z|y 1 )s

z∈BN

∀s ≥ 0. It follows that: P (E ) = P (E |y 1 )P (y 1 ) + P (E |y 2 )P (y 2 ) ∑ PN (z|y 1 )1−s PN (z|y 2 )s ≤ P (y 1 ) z∈BN



+P (y 2 )

PN (z|y 2 )1−s PN (z|y 1 )s

z∈BN

∀s ≥ 0. This bound is seen to be valid for any a priori probabilities. If P (y 1 ) = P (y 2 ) = 21 it is seen that the value s = 12 gives the smallest (tightest) upper bound which reduces to: ∑ √ PN (z|y 1 )PN (z|y 2 ) , g(y 1 , y 2 ) P (E ) ≤ z∈BN

This bound only depends on the code C = {y 1 , y 2 } and the discrete communication channel probabilities PN (z|y), with or without memory; it assumes equally likely messages and a maximum likelihood decoder.

5.1.1

Memoryless channel

The bound g(y 1 , y 2 ) can be expanded as follows: ∑ √ g(y 1 , y 2 ) = PN (z|y 1 )PN (z|y 2 ) z∈BN

=

v uN ∑ u∏ t P (zn |y1n )P (zn |y2n )

z∈BN

=



n=1 N ∏



P (zn |y1n )P (zn |y2n )

z∈BN n=1

=

∑ ∑

···

z1 ∈B z2 ∈B

=

N √ ∑ ∏

P (zn |y1n )P (zn |y2n )

zN ∈B n=1

N ∑ √ ∏ P (zn |y1n )P (zn |y2n ) n=1 zn ∈B

Given a code C = {y 1 , y 2 }, a memoryless DCC with transition probabilities P (zn |yn ), equally likely messages and maximum likelihood decoder we have: P (E ) ≤

N ∑ √ ∏ P (zn |y1n )P (zn |y2n ) n=1 zn ∈B

5.1. CODES OF SIZE M = 2

193

The subscript n of zn may be omitted in the above. We notice that if y1n = y2n ∑ ∑√ P (z|y1n )P (z|y2n ) = P (z|y1n ) = 1 z∈B

z∈B

and those values do not affect the product. We can then rewrite the product in the above bound as: N ∏

P (E ) ≤

∑√

n=1 y1n ̸= y2n

5.1.2

P (z|y1n )P (z|y2n )

z∈B

Memoryless channel and A = {0, 1}

Since there are now only two possibilities for y1n , y2n and we insist that y1n ̸= y2n for some n, the bound becomes: P (E )

N ∏



∑√ P (z|0)P (z|1)

n=1 y1n ̸= y2n

z∈B

= γ dH (y1 ,y2 )

√ ∑ where γ , P (z|0)P (z|1). z∈B ∫∞ √ pr (ρ|0)pr (ρ|1)dρ. −∞ Example 5.1.2.

If z is a continuous variable then γ =

2

1. Binary symmetric channel with transition probability p. 3 We find: ∑ √ √ γ= P (z|0)P (z|1) = 2 p(1 − p) z∈{0,1}

and P (E ) is bounded by ( )dH (y1 ,y2 ) √ P (E ) ≤ 2 p(1 − p) . −3

If for example p = 10

√ then γ ≈ 2 p and P (E ) ≤

( )dH (y1 ,y2 ) √ 2 p . The

repetition code of block length 11 gives the bound P (E ) ≤ (211 )(10−3 )11/2 = 6.48 × 10−14 2 The bounds obtained in problem 2.39 in W&J are identical to the results obtained in these examples; the situation described in problem 2.39 corresponds to dH (y 1 , y 2 ) = N . 3 Throughout this chapter we assume that the BSC consists in an AWGN channel used with (binary) antipodal modulation and a maximum likelihood receiver. We then have p = ) (√ 2EN /N0 , where EN and N0 /2 respectively denote the energy transmitted per channel Q use and the double sided noise power spectral density.

194CHAPTER 5. RELIABLE COMMUNICATION ON AN UNRELIABLE CHANNEL The exact probability of error is given by: 11 ( ) ∑ 11 t P (E ) = p (1 − p)11−t = 4.6 × 10−15 t t=6 and the bound is found to be 14 times larger than the exact value in this specific case. 2. AWGN channel with binary antipodal modulation, maximum likelihood (ML) receiver and as before EN and N0 /2 respectively denote the energy transmitted per channel use and the double sided noise power spectral density: ∫ ∞√ √ √ γ= pr (ρ| EN )pr (ρ| − EN )dρ −∞

In the above we have pr (ρ|si ) = pn (ρ − si ) = √ and σ 2 = γ

1 2πσ 2

e−(ρ−si )

2

/2σ 2

N0 2 .

= = = = =

It follows that: ∫ ∞√ √ √ 1 1 √ e−(ρ− EN )2 /2σ2 √ e−(ρ+ EN )2 /2σ2 dρ 2 2 2πσ 2πσ −∞ √ √ ( ) ∫ ∞ 2 1 −(ρ − EN ) − (ρ + EN )2 √ exp dρ 4σ 2 2πσ 2 −∞ ( ) ∫ ∞ ρ2 + EN 1 √ exp − dρ 2σ 2 2πσ 2 −∞ ( )∫ ∞ 2 2 EN 1 √ exp − 2 e−ρ /2σ dρ 2 2σ 2πσ −∞ −EN /N0 e

Refer to the graph in figure 5.4 for a comparison of the γ corresponding to: 1. BSC, 2. AWGN channel/antipodal modulation/ML receiver. Since a smaller value of γ is desirable we notice that channel 2 is better (≈ 2 dB more energy efficient).

5.2

P (E ) averaged over all possible C A N , |C | = M = 2



In the following we calculate (a bound on) the average P (E ) that is obtained when we chose at random a code of size 2 and block length N . Specifically we

5.2. P (E ) AVERAGED OVER ALL POSSIBLE C ⊂ A N , |C | = M = 2 195 1

0.8

gamma for BSC gamma for AWGN

gamma

0.6

0.4

0.2

0 -15

-10

-5

0

5

10

15

SNR EN/N0 [dB]

Figure 5.4:

consider the probability system (Ω = A^N, F = 2^Ω, Q_N : F → [0, 1]), in which Q_N( ) is a probability measure and⁴
∑_{y∈A^N} Q_N({y}) = 1.
The selection of a code C of size 2 and block length N is a compound random experiment: the 2 codewords are selected by 2 independent trials⁵ of the random experiment corresponding to the probability system (A^N, 2^{A^N}, Q_N). It follows that the probability of a code C = {y_1, y_2} is
P(C) = P({y_1, y_2}) = Q_N(y_1) Q_N(y_2)
and the probability of error corresponding to code C is a random variable ξ given by, and bounded by:
ξ = ∑_{z∈I_2} P_N(z|y_1) P(y_1) + ∑_{z∈I_1} P_N(z|y_2) P(y_2) ≤ g(y_1, y_2)
where we recall that
g(y_1, y_2) = ∑_{z∈B^N} √( P_N(z|y_1) P_N(z|y_2) ).

⁴For simplicity of notation we let Q_N(y) ≜ Q_N({y}).
⁵We do not insist that the codewords y_1 and y_2 be different.

The average probability of error, denoted by P(E), is given by:
P(E) = ∑_C ξ P(C) ≤ ∑_{y_1∈A^N} ∑_{y_2∈A^N} g(y_1, y_2) Q_N(y_1) Q_N(y_2) ≜ g
g is simplified as follows:
g = ∑_z ∑_{y_1} ∑_{y_2} Q_N(y_1) √(P_N(z|y_1)) · Q_N(y_2) √(P_N(z|y_2))
  = ∑_z ( ∑_{y_1} Q_N(y_1) √(P_N(z|y_1)) ) ( ∑_{y_2} Q_N(y_2) √(P_N(z|y_2)) )
  = ∑_z ( ∑_{y∈A^N} Q_N(y) √(P_N(z|y)) )²
The average probability of error for any discrete communication channel (with or without memory) with 2 equally likely messages and a maximum likelihood decoder is therefore bounded by:
P(E) ≤ g = ∑_{z∈B^N} ( ∑_{y∈A^N} Q_N(y) √(P_N(z|y)) )²

5.2.1  Memoryless channel

g takes the form
g = ∑_z ( ∑_{y∈A^N} Q_N(y) ∏_{n=1}^{N} √(P(z_n|y_n)) )²
  = ∑_z ( ∑_{y∈A^N} ∏_{n=1}^{N} Q(y_n) √(P(z_n|y_n)) )²
where we choose Q_N(y) = ∏_{n=1}^{N} Q(y_n), in which Q( ) is a probability measure on A (this corresponds to drawing the symbols of y out of the alphabet A by independent trials, similarly to a compound experiment). It follows that:
g = ∑_{z∈B^N} ∏_{n=1}^{N} ( ∑_{y∈A} Q(y) √(P(z_n|y)) )²
  = ∏_{n=1}^{N} ∑_{z∈B} ( ∑_{y∈A} Q(y) √(P(z|y)) )²
  = ( ∑_{z∈B} ( ∑_{y∈A} Q(y) √(P(z|y)) )² )^N
  = 2^{−N R(Q)}
where we define
R(Q) ≜ −log₂( ∑_{z∈B} ( ∑_{y∈A} Q(y) √(P(z|y)) )² )

Remark. Clearly the above is only useful if
g < 1 ⇔ log₂(g) = −N R(Q) < 0 ⇔ log₂(1/g) = N R(Q) > 0.
As long as Q( ) and the channel probabilities P(z|y) are such that R(Q) > 0 (bits per channel use), g can be made as small as we wish by taking N large enough; there must then exist at least one pair of codewords {y_1, y_2} ⊂ A^N for which the probability of error is less than g. In order to obtain the “tightest” bound, the probability measure Q( ) maximizing R(Q) is chosen.

Definition 5.2.1. R_0 ≜ max_{Q( )} R(Q) is called the exponential bound parameter or computational cut-off rate, and we let g_0 ≜ 2^{−N R_0}. Equivalently we have:
R_0 = −log₂( min_{Q( )} ∑_{z∈B} ( ∑_{y∈A} Q(y) √(P(z|y)) )² )
and P(E) ≤ g_0 = 2^{−N R_0}. When R_0 > 0, the average probability of error decreases exponentially with N.


5.2.2  Special case 2: memoryless channel and A = {0, 1}

The optimal choice for Q( ) is Q(0) = Q(1) = 1/2 and we find:
R_0 = −log₂( ∑_{z∈B} ( (1/2)√(P(z|0)) + (1/2)√(P(z|1)) )² )
    = −log₂( ∑_{z∈B} ( P(z|0)/4 + P(z|1)/4 + √(P(z|0)P(z|1))/2 ) )
    = −log₂( (1/2)(1 + γ) )
    = 1 − log₂(1 + γ)
where we recall that γ = ∑_{z∈B} √(P(z|0)P(z|1)).

Example 5.2.1. Refer to the graphs on figure 5.5 for a comparison of the R_0 corresponding to: 1. the BSC, 2. the AWGN channel/antipodal modulation/ML receiver. Since a larger R_0 is characteristic of a better channel, we notice that channel 2 is the better channel (≈ 2 dB more energy efficient), as remarked earlier on the plots of γ.

Figure 5.5: R_0 for the BSC and for the AWGN channel (antipodal modulation, ML receiver) as a function of the SNR E_N/N_0 [dB].
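The curves of figures 5.4 and 5.5 are easily regenerated numerically. The sketch below is illustrative code only (not from the textbook), assuming the same two channel models as above:

import math

def Q(x):
    return 0.5 * math.erfc(x / math.sqrt(2))

def R0_bsc(EN_over_N0):
    # BSC obtained from antipodal signalling + ML receiver (footnote 3)
    p = Q(math.sqrt(2 * EN_over_N0))
    return 1 - math.log2(1 + 2 * math.sqrt(p * (1 - p)))

def R0_awgn(EN_over_N0):
    # unquantized AWGN channel: gamma = exp(-EN/N0)
    return 1 - math.log2(1 + math.exp(-EN_over_N0))

for snr_db in range(-15, 16, 5):
    snr = 10 ** (snr_db / 10)
    print(f"{snr_db:>4} dB   R0(BSC) = {R0_bsc(snr):.4f}   R0(AWGN) = {R0_awgn(snr):.4f}")

The printed values reproduce the roughly 2 dB horizontal offset between the two curves noted above.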

5.3  P(E) averaged over all possible C ⊂ A^N, |C| = M



We now calculate the average P(E) that is obtained when a code of size M and block length N is chosen at random. Extending the construction of the previous section, the M codewords are selected by M repeated (independent) trials of the random experiment corresponding to the probability system (A^N, 2^{A^N}, Q_N), where we recall that ∑_{y∈A^N} Q_N(y) = 1. The probability of code C = {y_1, y_2, ..., y_M} is:
P(C) = ∏_{i=1}^{M} Q_N(y_i)

We define the following regions of B N for code C = {y 1 , y 2 , . . . , y M }:

Ii,j = {z ∈ B N : PN (z|y j ) ≥ PN (z|y i )} ,

i, j = 1, 2, . . . , M, i ̸= j (there are M (M − 1) such regions). If the M codewords are equally likely, we can use the union bound (W & J, p 265) to bound the probability of error achieved by a maximum likelihood decoder:

P(E) = P(E|y_1) = P( z ∈ ∪_{j≠1} I_{1j} | y_1 )
     ≤ ∑_{j≠1} P( z ∈ I_{1j} | y_1 )
     ≤ ∑_{j≠1} g(y_1, y_j)
where the first equality may become ≤ depending on how ties are broken, and we recall that g(y_1, y_j) = ∑_{z∈B^N} √( P_N(z|y_1) P_N(z|y_j) ) is the upper bound for the code {y_1, y_j} of size 2 (equally likely messages and maximum likelihood decoder). When the code is chosen at random, the above probability of error is a

random variable, the expected value of which is given by:
∑_C P(E|y_1) P(C) ≤ ∑_C P(C) ∑_{j≠1} g(y_1, y_j)
  = ∑_{y_1∈A^N} ∑_{y_2∈A^N} ··· ∑_{y_M∈A^N} ( ∑_{j≠1} g(y_1, y_j) ) ∏_{l=1}^{M} Q_N(y_l)
  = ∑_{j=2}^{M} [ ∑_{y_1∈A^N} ∑_{y_j∈A^N} g(y_1, y_j) Q_N(y_1) Q_N(y_j) ] [ ∑_{y_2∈A^N} ··· ∑_{y_M∈A^N} Q_N(y_2) ··· Q_N(y_M) ]
    (the second bracketed sum runs over all codewords but not y_1 nor y_j, and it equals 1)
  = ∑_{j=2}^{M} ∑_{y_1∈A^N} ∑_{y_j∈A^N} Q_N(y_1) Q_N(y_j) g(y_1, y_j)
  = (M − 1) g ≤ M g

Since the above bound is independent of y_1, we obtain, when all messages y_1 ∈ A^N are equally likely,
P(E) = ∑_{y_1} P(y_1) P(E|y_1) ≤ M g
for the probability of error averaged over the ensemble of codes. We recall that
g = ∑_{z∈B^N} ( ∑_{y∈A^N} Q_N(y) √(P_N(z|y)) )²
and, when the channel is memoryless, g = 2^{−N R(Q)} where
R(Q) = −log₂( ∑_{z∈B} ( ∑_{y∈A} Q(y) √(P(z|y)) )² ).
In order to obtain the tightest possible bound on P(E) we take max_Q R(Q):
P(E) ≤ M g_0 = 2^{N R_N} 2^{−N R_0} = 2^{−N(R_0 − R_N)}


where
R_N = (log₂ M)/N
R_0 = max_Q R(Q) = −log₂( min_Q ∑_{z∈B} ( ∑_{y∈A} Q(y) √(P(z|y)) )² )
When A = {0, 1} the above expression reduces to R_0 = 1 − log₂(1 + γ) where:
γ = ∑_{z∈B} √(P(z|0)P(z|1))   in general;
γ = 2√(p(1 − p))   for the BSC with transition probability p;
γ = e^{−E_N/N_0}   for the AWGN channel/antipodal modulation/ML receiver.

Interpretation. There exists at least one code C ⊂ A^N of rate R_N < R_0 for which
P(E) ≤ 2^{−N(R_0 − R_N)}.

Any sequence of K source alphabet symbols can be transmitted and received correctly in its entirety with an arbitrarily small probability of error δ > 0, as long as R_N < R_0 and N is large enough, regardless of how unreliable the channel may be; for given values of R_N and R_0 with R_N < R_0, the longer the sequence, the more likely it will be received correctly. The units of R_0 and R_N are the same, i.e. bits per channel use. R_N represents the average number of equivalent source bits transmitted every time the channel is used. R_0 suggests the existence of an upper limit on the rate of transmission. This limit, denoted by C, is called the capacity and C ≥ R_0. For the special cases of the BSC and of the AWGN channel with antipodal signals it is given by:
C = 1 − H_2(p)   for the BSC with transition probability p;
C = (1/2) log₂(1 + 2E_N/N_0)   for the AWGN channel/antipodal modulation/ML receiver;
where H_2( ) is the binary entropy function.⁶ Shannon showed that
1. R_N < C ⇒ P(E) goes to 0 as N → ∞,
2. R_N > C ⇒ P(E) goes to 1 as N → ∞.
Refer to W & J, bottom of page 321, bottom of page 322 and the discussion on page 341.

⁶H_2(x) ≜ −x log₂(x) − (1 − x) log₂(1 − x), ∀x ∈ [0, 1].
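As a quick numerical illustration of the relation C ≥ R_0, the sketch below evaluates the formulas just given at an arbitrarily chosen operating point of 0 dB (illustrative code, not part of the notes):

import math

def Q(x):
    return 0.5 * math.erfc(x / math.sqrt(2))

def H2(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

EN_over_N0 = 10 ** (0.0 / 10)                 # 0 dB
p = Q(math.sqrt(2 * EN_over_N0))              # BSC transition probability (footnote 3)

R0_bsc  = 1 - math.log2(1 + 2 * math.sqrt(p * (1 - p)))
C_bsc   = 1 - H2(p)
R0_awgn = 1 - math.log2(1 + math.exp(-EN_over_N0))
C_awgn  = 0.5 * math.log2(1 + 2 * EN_over_N0)

print(f"BSC : R0 = {R0_bsc:.3f}  C = {C_bsc:.3f}")
print(f"AWGN: R0 = {R0_awgn:.3f}  C = {C_awgn:.3f}")
# In both cases C >= R0, as stated above.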

Example 5.3.1.
1. There exists a code of rate R_N = 0.6 that can be used to transmit a 160 bit sequence over a BSC with p = 1.25 × 10⁻² with a P(E) < 10⁻⁸.⁷ This is because the channel has R_0 ≈ 0.7105, N = 160/0.6 ≈ 267 and 2^{−N(R_0−R_N)} ≈ 2^{−267×0.1} ≈ 9 × 10⁻⁹ < 10⁻⁸. The size of the code is M = 2¹⁶⁰ ≈ 1.5 × 10⁴⁸ codewords. Practical design of such codes/encoders/decoders exceeds the scope of this course.

2. The results presented below are obtained by Monte Carlo simulation for the transmission of a 5000 bit sequence over the AWGN channel with binary antipodal signalling and maximum likelihood receiver at a signal-to-noise ratio E_N/N_0 = 0 dB. This means that the channel's probability of bit error is Q(√2) = 7.865 × 10⁻² and the channel's computational cut-off rate is R_0 ≈ 0.548. We distinguish the cases:
(a) No coding: the packet-error-rate (PER) ≈ 1, since every packet/sequence contains an average of 393 bit errors; it is very unlikely that a packet be received without a single bit error:
P(C) = (1 − 7.865 × 10⁻²)⁵⁰⁰⁰ ≈ 1.33 × 10⁻¹⁷⁸ ≈ 0
(b) Convolutional coding with R_N = 1/2 < R_0: we see in Table 5.1 that as the constraint length is increased from 7 to 13 the PER goes to 0, even though the rate R_N remains fixed at 1/2.

Code constraint length   packets   packet errors   bit errors   PER           BER
(no coding)              500       500             196592       1             7.86 × 10⁻²
7                        500       122             1063         2.44 × 10⁻¹   2.13 × 10⁻⁴
9                        500       29              185          5.80 × 10⁻²   7.40 × 10⁻⁵
11                       500       12              72           2.40 × 10⁻²   2.88 × 10⁻⁵
12                       500       10              65           2.00 × 10⁻²   2.60 × 10⁻⁵
13                       263       0               0            0             0
(In the uncoded case there were 196592 bit errors during the transmission of 2 500 000 bits at a signal-to-noise ratio of 0.0 dB.)

Table 5.1: simulation results, rate 1/2 convolutional codes over an AWGN channel with binary antipodal modulation, 0 dB signal-to-noise ratio and maximum likelihood receiver; Viterbi decoding of the convolutional code.

⁷This corresponds to a signal-to-noise ratio of E_N/N_0 ≈ 4.0 dB if the BSC is implemented with binary antipodal signalling over an AWGN channel and a maximum likelihood receiver.
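The figures quoted in Example 5.3.1 can be checked with a few lines of Python; this is only a sanity check of the arithmetic, not part of the original example:

import math

def Q(x):
    return 0.5 * math.erfc(x / math.sqrt(2))

# Item 1: BSC with p = 1.25e-2, code rate RN = 0.6, K = 160 source bits
p  = 1.25e-2
R0 = 1 - math.log2(1 + 2 * math.sqrt(p * (1 - p)))
RN = 0.6
N  = math.ceil(160 / RN)
print("R0 =", R0)                              # ~0.7105
print("bound =", 2 ** (-N * (R0 - RN)))        # ~1.3e-9 (the rounding R0-RN ~= 0.1 gives ~9e-9)

# Item 2(a): uncoded 5000-bit packet at EN/N0 = 0 dB
pb = Q(math.sqrt(2 * 1.0))                     # ~7.87e-2
print("P(no bit error in a packet) =", (1 - pb) ** 5000)   # ~1.3e-178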

Appendix A

Table of the Q( ) and erf( ) functions

The approximation
Q(x) ≈ (1/(x√(2π))) (1 − 0.7/x²) e^{−x²/2}
may be used when x > 2.

 x     erf(x)   Q(x)         x     erf(x)   Q(x)          x     erf(x)   Q(x)
0.00   0        0.5         1.70   0.9838   0.04457      3.40   1        0.0003369
0.10   0.1125   0.4602      1.80   0.9891   0.03593      3.50   1        0.0002326
0.20   0.2227   0.4207      1.90   0.9928   0.02872      3.60   1        0.0001591
0.30   0.3286   0.3821      2.00   0.9953   0.02275      3.70   1        0.0001078
0.40   0.4284   0.3446      2.10   0.997    0.01786      3.80   1        7.235 × 10⁻⁵
0.50   0.5205   0.3085      2.20   0.9981   0.0139       3.90   1        4.810 × 10⁻⁵
0.60   0.6039   0.2743      2.30   0.9989   0.01072      4.00   1        3.167 × 10⁻⁵
0.70   0.6778   0.242       2.40   0.9993   0.008198     4.10   1        2.066 × 10⁻⁵
0.80   0.7421   0.2119      2.50   0.9996   0.00621      4.20   1        1.335 × 10⁻⁵
0.90   0.7969   0.1841      2.60   0.9998   0.004661     4.30   1        8.540 × 10⁻⁶
1.00   0.8427   0.1587      2.70   0.9999   0.003467     4.40   1        5.413 × 10⁻⁶
1.10   0.8802   0.1357      2.80   0.9999   0.002555     4.50   1        3.398 × 10⁻⁶
1.20   0.9103   0.1151      2.90   1        0.001866     4.60   1        2.112 × 10⁻⁶
1.30   0.934    0.0968      3.00   1        0.00135      4.70   1        1.301 × 10⁻⁶
1.40   0.9523   0.08076     3.10   1        0.0009676    4.80   1        7.933 × 10⁻⁷
1.50   0.9661   0.06681     3.20   1        0.0006871    4.90   1        4.792 × 10⁻⁷
1.60   0.9763   0.0548      3.30   1        0.0004834    5.00   1        2.867 × 10⁻⁷
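The table and the approximation can be cross-checked against an exact evaluation of Q( ), here obtained from the complementary error function of the Python standard library (a verification sketch only):

import math

def Q_exact(x):
    return 0.5 * math.erfc(x / math.sqrt(2))

def Q_approx(x):
    # approximation quoted above, intended for x > 2
    return (1.0 / (x * math.sqrt(2 * math.pi))) * (1 - 0.7 / x**2) * math.exp(-x**2 / 2)

for x in (2.0, 3.0, 4.0, 5.0):
    print(f"x = {x}:  exact = {Q_exact(x):.4e}   approx = {Q_approx(x):.4e}")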


Appendix B

Orthonormal Expansion and Vector Representation of Continuous Signals

We define a scalar product for finite energy signals. This scalar product essentially corresponds to the projection described in the motivating problem below, and is used to obtain a geometric view of a given finite set of (finite energy) signals. Additionally, we obtain a distance measure between signals. Signal bases are defined and we show how to perform a basis change. Finally we prove Parseval's relationships. We start with a review of the concepts with 3-dimensional vectors over the real numbers.

Motivating problem: two teams are sent to measure the positions of three objects with respect to a fixed reference point in 3 dimensions using a meter stick. The first team obtains the coordinates
(1, 1, 1), (0, 1, 1), (0, 1, −1)    (objects measured by team 1 with NSEW reference)
with an orthogonal axis grid based on North, South, East, West, centered on the specified fixed reference point. The second team obtains the coordinates
(√3, 0, 0), (2/√3, √(2/3), 0), (0, 0, √2)    (objects measured by team 2 with shore line reference)
with an orthogonal axis grid based on a shore line and centered on the same specified fixed reference point. The objects are no longer available and we want to determine if the measurements from the two teams are in agreement or not. A third team is then sent to measure the coordinates of the reference axis grid of the second team with respect to the NSEW grid used by the first team, and they found
(1/√3, 1/√3, 1/√3), (−√(2/3), 1/√6, 1/√6), (0, 1/√2, −1/√2)    (reference grid of the shore line used by team 2, expressed in the NSEW reference used by team 1).
In order to determine whether or not the measurements agree, we use the measurements of team 3 to draw the axis system of team 2 on top of the axis system used by team 1. We next take the measurements from team 1 and project them on the axis system used by team 2 to calculate what team 2 should have measured. We then only have to compare with what team 2 actually measured to see if the measurements agree or not. The situation is sketched on figure B.1.

Figure B.1:

B.1  Three-dimensional vectors in R³

Definition B.1.1. A basis of R³ is a set of three vectors {g⃗1, g⃗2, g⃗3} such that any vector v⃗ ∈ R³ may be written uniquely in the form a1 g⃗1 + a2 g⃗2 + a3 g⃗3, ai ∈ R.

Let v⃗ denote a vector in R³. We can express v⃗ as
v⃗ = v1 i⃗ + v2 j⃗ + v3 k⃗
where v1, v2, v3 ∈ R and i⃗, j⃗, k⃗ are the standard unit vectors, which form a basis of R³. Since the above expansion is unique, the triplet (v1, v2, v3) can consequently be used to represent v⃗ with respect to the basis and we write:
v⃗ ←→ (v1, v2, v3)
Similarly another vector w⃗ may be written as w⃗ ←→ (w1, w2, w3), meaning that w⃗ = w1 i⃗ + w2 j⃗ + w3 k⃗.

Definition B.1.2. Let v⃗ = v1 i⃗ + v2 j⃗ + v3 k⃗ and w⃗ = w1 i⃗ + w2 j⃗ + w3 k⃗.
1. The bilinear map defined by
⟨ , ⟩ : R³ × R³ → R,   (v⃗, w⃗) ↦ ⟨v⃗, w⃗⟩ ≜ v1 w1 + v2 w2 + v3 w3

is called dot product (aka scalar product or standard inner product) with respect to the basis {⃗i, ⃗j, ⃗k}. One easily shows that: ⟨av⃗1 + bv⃗2 , v⃗3 ⟩

= a⟨v⃗1 , v⃗3 ⟩ + b⟨v⃗2 , v⃗3 ⟩

⟨v⃗1 , av⃗2 + bv⃗3 ⟩

= a⟨v⃗1 , v⃗2 ⟩ + b⟨v⃗1 , v⃗3 ⟩

for every a, b ∈ R and for every v⃗1, v⃗2, v⃗3 ∈ R³.
2. The distance between v⃗ and w⃗, denoted by d(v⃗, w⃗), is defined by
d(v⃗, w⃗) = √( (v1 − w1)² + (v2 − w2)² + (v3 − w3)² )
3. The length or norm of v⃗ is the distance between v⃗ and the origin; it is denoted by |v⃗|.

The following are stated without proof (easy to do in 2 dimensions, and it generalizes to 3 dimensions).

Theorem 62.
1. d(v⃗, w⃗) = √( ⟨v⃗ − w⃗, v⃗ − w⃗⟩ ).
2. |v⃗| = √( ⟨v⃗, v⃗⟩ ).
3. ⟨v⃗, w⃗⟩ = |v⃗| |w⃗| cos(∠v⃗ − ∠w⃗), where ∠v⃗ − ∠w⃗ is the angle between vectors v⃗ and w⃗.
4. ⟨v⃗, w⃗⟩ / |w⃗| is the length of the projection of vector v⃗ on vector w⃗. In particular, if |w⃗| = 1 then ⟨v⃗, w⃗⟩ is the length of the projection of vector v⃗ on vector w⃗.

Definition B.1.3. A vector v⃗ is called normal if ⟨v⃗, v⃗⟩ = 1, and two vectors v⃗, w⃗ are called orthogonal if ⟨v⃗, w⃗⟩ = 0. A set of vectors {v⃗1, v⃗2, ..., v⃗n} is called orthonormal if ⟨v⃗i, v⃗i⟩ = 1 ∀i and ⟨v⃗i, v⃗j⟩ = 0 ∀i ≠ j.

In light of theorem 62, we see that a vector is normal if its length is 1, and v⃗ and w⃗ are orthogonal if the projection of v⃗ on w⃗ (or vice-versa) has a length of 0, as one would expect intuitively. The basis {i⃗, j⃗, k⃗} used above is by definition orthonormal.¹ The expansion of v⃗ with respect to i⃗, j⃗, k⃗ is represented graphically in figure B.2.

Figure B.2:

Remark. Not every basis of R³ is orthonormal or even orthogonal; orthonormality is a property that some bases satisfy. The following forms another orthonormal basis of R³:²
w⃗1 = i⃗/√3 + j⃗/√3 + k⃗/√3
w⃗2 = −i⃗ √(2/3) + j⃗/√6 + k⃗/√6
w⃗3 = j⃗/√2 − k⃗/√2
One easily verifies that any two of w⃗1, w⃗2, w⃗3 are orthogonal and that each one is normalized.

Example B.1.1. Let i⃗, j⃗, k⃗ denote the standard unit vectors and define:³
v⃗1 = i⃗ + j⃗ + k⃗ ←→ (1, 1, 1)
v⃗2 = j⃗ + k⃗ ←→ (0, 1, 1)
v⃗3 = j⃗ − k⃗ ←→ (0, 1, −1)
(with respect to the basis {i⃗, j⃗, k⃗}).

¹This could correspond to the basis used by team #1 in our motivating example.
²This could correspond, in our motivating example, to the measurements taken by team #3 of the basis used by team #2, expressed in the coordinate system used by team #1.
³This corresponds to the measurements taken by team #1 in our motivating example.


Since {w⃗1, w⃗2, w⃗3} forms a basis of R³, they can also be used to expand v⃗1, v⃗2, v⃗3. We would find (this will be explained below):⁴
v⃗1 = √3 w⃗1 ←→ (√3, 0, 0)
v⃗2 = (2/√3) w⃗1 + √(2/3) w⃗2 ←→ (2/√3, √(2/3), 0)
v⃗3 = √2 w⃗3 ←→ (0, 0, √2)
(with respect to the basis {w⃗1, w⃗2, w⃗3}).
Not surprisingly, the representation of the same vectors v⃗1, v⃗2, v⃗3 depends on the basis used to represent them, but the dot products are independent of the orthonormal basis used (and this is not a coincidence, as we will see): ⟨v⃗i, v⃗j⟩ computed with respect to {i⃗, j⃗, k⃗} equals ⟨v⃗i, v⃗j⟩ computed with respect to {w⃗1, w⃗2, w⃗3}, ∀ i, j = 1, 2, 3.

Why do we want a basis to be orthonormal? Because the orthonormality makes it very easy to find the coordinates of a vector, and because the dot products are invariant under a change of orthonormal basis. A vector v⃗ is expanded in an orthonormal basis {w⃗1, w⃗2, w⃗3} as follows:
v⃗ = ⟨v⃗, w⃗1⟩w⃗1 + ⟨v⃗, w⃗2⟩w⃗2 + ⟨v⃗, w⃗3⟩w⃗3
and the dot products in the above can be computed with respect to any orthonormal basis.

Example B.1.2. The coordinates of v⃗1, v⃗2, v⃗3 in the orthonormal basis {w⃗1, w⃗2, w⃗3} from the previous example are given by:
⟨v⃗1, w⃗1⟩ = √3      ⟨v⃗1, w⃗2⟩ = 0        ⟨v⃗1, w⃗3⟩ = 0
⟨v⃗2, w⃗1⟩ = 2/√3    ⟨v⃗2, w⃗2⟩ = √(2/3)   ⟨v⃗2, w⃗3⟩ = 0
⟨v⃗3, w⃗1⟩ = 0       ⟨v⃗3, w⃗2⟩ = 0        ⟨v⃗3, w⃗3⟩ = √2
This proves that the measurements taken by teams #1 and #2 are in agreement. In general the expansion of a vector v⃗ in a basis that is not orthonormal is not as easily done.

Why can't we just use the standard unit vector basis, since it is orthonormal? Why do we need other orthonormal bases? In R³ not much is to be gained by using different orthonormal bases, but if we have a small number of vectors with many components, then some orthonormal bases may give a very simple representation of the vectors. This is illustrated in the next example.

⁴This could correspond to the measurements taken by team #2 in our motivating example, assuming that their measurements are in agreement with those made by team #1.

Why do we want a basis to be orthonormal? Because the orthonormality makes it very easy to find the coordinates of a vector and because the dot products and invariant under a change of orthonormal basis. A vector ⃗v is expanded in an orthonormal basis {w⃗1 , w⃗2 , w⃗3 } as follows: ⃗v = ⟨⃗v , w⃗1 ⟩w⃗1 + ⟨⃗v , w⃗2 ⟩w⃗2 + ⟨⃗v , w⃗3 ⟩w⃗3 and the dot products in the above can be expanded with respect to any orthonormal basis. Example B.1.2. The coordinates of v⃗1 , v⃗2 , v⃗3 in the orthonormal basis {w⃗1 , w⃗2 , w⃗3 } from the previous example are given by: √ ⟨v⃗1 , w⃗1 ⟩ = 3 ⟨v⃗3 , w⃗1 ⟩ = 0 ⟨v⃗1 , w⃗2 ⟩ = 0 ⟨v⃗3 , w⃗2 ⟩ = √ 0 ⟨v⃗1 , w⃗3 ⟩ = 0 √ ⟨v⃗3 , w⃗3 ⟩ = 2 ⟨v⃗2 , w⃗1 ⟩ = 2/ 3 √ ⟨v⃗2 , w⃗2 ⟩ = 2/3 ⟨v⃗2 , w⃗3 ⟩ = 0 This proves that the measurements taken by teams #1 and #2 are in agreement. In general the expansion of a vector ⃗v in a basis that is not orthonormal is not as easily done. Why can’t we just use the standard unit vector basis since it is orthonormal? Why do we need other orthonormal bases? In R3 not much is to be gained by using different orthonormal bases, but if we have a small number of vectors with many components then some orthonormal bases may give a very simple representation of the vectors. This is illustrated in the next example. 4 This could correspond to the measurements taken by team #2 in our motivating example, assuming that their measurements are in agreement with those made by team #1.

210APPENDIX B. ORTHONORMAL EXPANSION AND VECTOR REPRESENTATION OF CON Example B.1.3. Let v⃗1 v⃗2 v⃗3

= (−1, 2, 1, 0, 0, 2, −1, 0, −2, 0) = (1, −2, 0, 1, 2, 2, −2, −1, −2, 2) = (−3, 6, 1, −2, −4, −2, 3, 2, 2, −4)

with respect to the standard unit vector basis {i⃗1 , i⃗2 , . . . , i⃗10 } in R10 . The vectors √ w⃗1 = (−1, 2, 1, 0, 0, 2, −1, 0, −2, 0)/ 15 √ w⃗2 = (2, −4, −1/2, 3/2, 3, 2, −5/2, −3/2, −2, 3)/ 57 form an orthonormal basis (they obviously generate a subspace of R10 ) and we have: √ v⃗1 = ( 15, 0) √ √ v⃗2 = ( 15/3, 2 57/3) ≈ (1.291, 5.033) √ √ v⃗3 = ( 15/3, −4 57/3) ≈ (1.291, −10.066) with respect to the basis {w⃗1 , w⃗2 } and we may represent the vectors graphically as in figure B.3. Clearly the representation of v⃗1 , v⃗2 , v⃗3 in the basis {w⃗1 , w⃗2 } is much simpler than the representation in the standard unit vector basis.

Figure B.3:

The Gram-Schmidt orthonormalization procedure is used to compute an orthonormal basis from a given set of vectors. It involves the following steps, starting from a finite set of vectors v⃗1 , v⃗2 , . . . , v⃗N represented in an orthonormal basis. Step 1 w⃗1 = normalize v⃗1 (assuming that v⃗1 ̸= 0). Step 2 w⃗2 = normalize v⃗2 − ⟨v⃗2 , w⃗1 ⟩w⃗1 if different than 0. This operation is represented graphically in figure B.4.

B.2. VECTORS OBTAINED BY SAMPLING OF A WAVEFORM

211

Figure B.4:

Step 3 w⃗3 = normalize v⃗3 − ⟨v⃗3 , w⃗1 ⟩w⃗1 − ⟨v⃗3 , w⃗2 ⟩w⃗2 if different than 0. Step 4 . . . and so on . . . There are as many steps as there are (non-zero) vectors in the given set. The vectors w⃗1 , w⃗2 , . . . are orthonormal. The number of (non-zero) orthonormal vectors is less than or equal to the number of vectors in the given set.
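This sketch is illustrative only (plain Python, names chosen here); it implements the steps above and checks the coordinates quoted in example B.1.3:

import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(vectors, tol=1e-9):
    """Return an orthonormal list w_1, w_2, ... spanning the given vectors."""
    basis = []
    for v in vectors:
        r = list(v)
        for w in basis:                      # subtract the projections on the basis found so far
            c = dot(v, w)
            r = [ri - c * wi for ri, wi in zip(r, w)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:                       # keep only non-zero remainders
            basis.append([ri / norm for ri in r])
    return basis

v1 = (-1, 2, 1, 0, 0, 2, -1, 0, -2, 0)
v2 = (1, -2, 0, 1, 2, 2, -2, -1, -2, 2)
v3 = (-3, 6, 1, -2, -4, -2, 3, 2, 2, -4)

W = gram_schmidt([v1, v2, v3])
print("number of orthonormal vectors:", len(W))   # 2: the set has dimensionality 2
for v in (v1, v2, v3):
    print([round(dot(v, w), 3) for w in W])
# Expected (cf. figure B.3): [3.873, 0], [1.291, 5.033], [1.291, -10.066]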

B.2  Vectors obtained by sampling of a waveform

Signals sampled on a time interval of finite duration produce vectors possibly consisting of a very large number of components. The vector components are sketched on a discretized time axis, indicating the approximate shape of the signal from which the samples were taken.

Example B.2.1. From a previous example with v⃗1, v⃗2, v⃗3 ∈ R¹⁰, the components are sketched on a discretized time axis (with linear interpolation) in figure B.5(a).

Let v⃗i ∈ R^N, i = 1, 2, ..., M denote M vectors obtained by sampling of waveforms and represented as v⃗i = (vi1, vi2, ..., viN) with respect to a standard unit vector basis. We define the functions
vi : {1, 2, ..., N} → R,   n ↦ vi(n) = vin
so that we may write v⃗i = (vi(1), vi(2), ..., vi(N)) in the standard unit vector basis. The Gram-Schmidt procedure applied to the v⃗i returns L ≤ M orthonormal vectors
w⃗j = (wj1, wj2, ..., wjN) = (wj(1), wj(2), ..., wj(N))
in the (same) standard unit vector basis such that:⁵
v⃗i = ∑_{j=1}^{L} ⟨v⃗i, w⃗j⟩ w⃗j,   i = 1, 2, ..., M.
This can equivalently be written as
vi(n) = ∑_{j=1}^{L} cij wj(n),   n = 1, ..., N,
where cij = ⟨v⃗i, w⃗j⟩ = ∑_{k=1}^{N} vi(k) wj(k). In this context, the Gram-Schmidt procedure obtains L ≤ M orthonormal discrete signals wj(n) from M discrete signals vi(n) such that any vi(n) can be expressed as a (unique) linear combination of the wj(n). As previously remarked in example B.1.1, the dot product of vectors is invariant under a change of orthonormal basis. In the present context we have
∑_{n=1}^{N} vi(n) vk(n)  (wrt standard unit vectors)  =  ∑_{j=1}^{L} cij ckj  (wrt the basis {w⃗j}).

⁵The functions wj( ) are defined similarly to the functions vi( ).

Example B.2.2. Three signals v1(n), v2(n), v3(n) are sketched on figure B.5(a). The signals w1(n), w2(n) sketched in figure B.5(b) form an orthonormal basis suitable to represent v1(n), v2(n), v3(n) (they have been obtained by applying the Gram-Schmidt procedure). v1(n), v2(n), v3(n) can then be represented as points in a Cartesian plane, similarly to the sketch in figure B.3. By a proper choice of the linear combination, v1(n), v2(n), v3(n) can be recovered from w1(n), w2(n), as shown in figure B.5(c) by the sketches of w1(n)√15, w1(n)√15/3 + 2w2(n)√57/3 and w1(n)√15/3 − 4w2(n)√57/3. Finally, the invariance of the dot product to the choice of the orthonormal basis is illustrated below with signals v1(n), v2(n):
⟨v1(n), v2(n)⟩ = ∑_{n=1}^{10} v1(n) v2(n) = 5
⟨(√15, 0), (√15/3, 2√57/3)⟩ = 15/3 + 0 = 5

B.3  Effect of sampling frequency

A normalization of the dot product is required in order to make it independent of the sampling frequency (as it should be, as long as the Nyquist sampling criterion is not violated). To see why, consider a continuous signal s(t) and the two discrete sampled signals

Figure B.5: Example for the Gram-Schmidt procedure with the sampling of a waveform. (a) Sketch of three signals obtained from the sampling of three waveforms; (b) orthonormal basis signals; (c) signal recovery from the orthonormal basis.

v⃗1 = ( s(n/fs) )_{n=1}^{N}
v⃗2 = ( s(n/(2fs)) )_{n=1}^{2N}
Both v⃗1, v⃗2 contain samples of s(t) for 0 < t < N/fs, but v⃗2 contains twice as many samples. Consequently:
⟨v⃗1, v⃗1⟩ ≈ ⟨v⃗2, v⃗2⟩ / 2
In order for the dot product of discrete signals to be independent of the sampling frequency we define
⟨s1(n), s2(n)⟩ ≜ (1/fs) ∑_n s1(n) s2(n)

as the dot product of two discrete signals sampled at the rate fs (both signals need to be sampled at the same rate fs so that they both have the same number of samples in any given time interval). We now have two different dot products, and there is no ambiguity given the context:
• the dot product through the (discrete) time n, which requires normalization by the sampling frequency fs;
• the dot product through the basis {wj(n)}_{j=1}^{L}, which does not require the normalization by fs.
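The effect of the 1/fs normalization can be checked on a simple pair of signals; the sketch below (illustrative code, with an arbitrary choice of s1 and s2) shows that the normalized dot product is essentially independent of the sampling rate:

import math

def dot_sampled(s1, s2, fs, T):
    """<s1, s2> for signals sampled at rate fs over (0, T], normalized by fs."""
    N = int(T * fs)
    return sum(s1(n / fs) * s2(n / fs) for n in range(1, N + 1)) / fs

s1 = lambda t: math.sin(2 * math.pi * t)
s2 = lambda t: math.cos(2 * math.pi * t) + math.sin(2 * math.pi * t)

for fs in (100, 200, 1000):
    print(fs, round(dot_sampled(s1, s2, fs, 1.0), 4))
# The normalized dot product stays close to the integral of s1*s2 over one period (= 0.5)
# independently of fs; without the 1/fs factor it would grow linearly with fs.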

B.4  Continuous time signals

Intuitively we take the limit as fs → ∞ ⇒ 1/fs ≈ dt → 0 and ∑ → ∫:
⟨s1(t), s2(t)⟩ = ∫_{−∞}^{∞} s1(t) s2(t) dt
is the dot product between two signals. The concepts of distance, norm and projection introduced in definition B.1.2 and theorem 62 naturally extend to signals. We then have:
1. d(s1(t), s2(t)) = √( ⟨s1(t) − s2(t), s1(t) − s2(t)⟩ );
2. |s1(t)| = √( ⟨s1(t), s1(t)⟩ );
3. ⟨s1(t), s2(t)⟩ / |s2(t)| is the length of the projection of signal s1(t) on signal s2(t). In particular, if |s2(t)| = 1 then ⟨s1(t), s2(t)⟩ is the length of the projection of signal s1(t) on signal s2(t);
4. s1(t) and s2(t) are orthonormal if:
∫_{−∞}^{∞} s1(t) s2(t) dt = 0
∫_{−∞}^{∞} s1²(t) dt = ∫_{−∞}^{∞} s2²(t) dt = 1

The Gram-Schmidt orthonormalization procedure for continuous signals follows the same steps as those previously explained on page 210:
Step 1: ϕ1(t) = normalize s1(t), if s1(t) ≠ 0.
Step 2: ϕ2(t) = normalize s2(t) − ⟨s2(t), ϕ1(t)⟩ϕ1(t), if different from 0.
Step 3: ϕ3(t) = normalize s3(t) − ⟨s3(t), ϕ1(t)⟩ϕ1(t) − ⟨s3(t), ϕ2(t)⟩ϕ2(t), if different from 0.
Step 4: ... and so on ...

Example B.4.1. (W&J, pp 269-272) Find an orthonormal basis for the signals s1(t), s2(t), s3(t) and s4(t) sketched in figure B.6 and restricted to the interval t ∈ [0, 3].

Figure B.6:

Solution. Apply the Gram-Schmidt orthonormalization procedure to the signals.

Step 1. Since s1(t) ≠ 0 we let θ1(t) = s1(t) and ϕ1(t) = θ1(t)/√(⟨θ1(t), θ1(t)⟩), where:
⟨θ1(t), θ1(t)⟩ = ⟨s1(t), s1(t)⟩ = ∫_0^3 s1²(t) dt = ∫_0^1 (2)² dt + ∫_1^2 (−2)² dt + ∫_2^3 (2)² dt = 12
ϕ1(t) is sketched on figure B.7.

Figure B.7:

Step 2. We need to calculate θ2(t) = s2(t) − ⟨s2(t), ϕ1(t)⟩ϕ1(t):
⟨s2(t), ϕ1(t)⟩ = ∫_0^1 (−1/√3) dt + ∫_1^2 (−3/√3) dt + ∫_2^3 (1/√3) dt = −3/√3 = −√3
It follows that:
θ2(t) = s2(t) − ⟨s2(t), ϕ1(t)⟩ϕ1(t) = s2(t) + √3 ϕ1(t) ≠ 0
θ2(t) is sketched on figure B.8 together with ϕ2(t) = θ2(t)/√(⟨θ2(t), θ2(t)⟩), where we easily calculate ⟨θ2(t), θ2(t)⟩ = 4 + 4 = 8.

Step 3. We need to calculate θ3(t) = s3(t) − ⟨s3(t), ϕ1(t)⟩ϕ1(t) − ⟨s3(t), ϕ2(t)⟩ϕ2(t):
⟨s3(t), ϕ1(t)⟩ = 1/√3 + 2/√3 + 0 = √3
⟨s3(t), ϕ2(t)⟩ = 0 + (−√2) + 0 = −√2
and as illustrated on figure B.9 we obtain θ3(t) = 0. It is then not necessary to introduce a third function ϕ3(t) at this point.


Figure B.8:

Step 4. We need to calculate θ4(t) = s4(t) − ⟨s4(t), ϕ1(t)⟩ϕ1(t) − ⟨s4(t), ϕ2(t)⟩ϕ2(t):
⟨s4(t), ϕ1(t)⟩ = −1/√3 + 1/√3 − √3 = −√3
⟨s4(t), ϕ2(t)⟩ = 0 − 1/√2 − 3/√2 = −2√2
and as illustrated on figure B.10 we obtain θ4(t) = 0. Again, it is not necessary to introduce a third function ϕ3(t). The signals s1(t), s2(t), s3(t), s4(t) are represented in the orthonormal basis {ϕ1(t), ϕ2(t)} as in figure B.11.

The findings are summarized in the following.

Theorem 63. Any set of M finite energy signals si(t), i = 0, 1, ..., M−1 can be expressed as
si(t) = ∑_{j=1}^{N} sij ϕj(t),   i = 0, 1, ..., M−1,
where the ϕj(t) are N appropriately chosen signals and N ≤ M. The set of signals {ϕj(t)} is not unique (for example, the Gram-Schmidt procedure may give a different set of orthonormal functions if the signals are simply reordered). The Gram-Schmidt procedure yields a minimal set of orthonormal functions, the cardinality of which is called the dimensionality of the set of signals. The coefficients sij in the above equation are given by:
sij = ⟨si(t), ϕj(t)⟩ = ∫_{−∞}^{∞} si(t) ϕj(t) dt


Figure B.9:

We also have
⟨si(t), sk(t)⟩ = ∑_{j=1}^{N} ⟨si(t), ϕj(t)⟩ ⟨sk(t), ϕj(t)⟩,   ∀ i, k ∈ {0, 1, ..., M−1},
and in particular (Parseval's relationship)
⟨si(t), si(t)⟩ = ∑_{j=1}^{N} ⟨si(t), ϕj(t)⟩² = ∫_{−∞}^{∞} si²(t) dt ≡ energy of si(t).
Moreover, for any finite energy signal r(t) we have
⟨r(t), si(t)⟩ = ∑_{j=1}^{N} ⟨r(t), ϕj(t)⟩ ⟨si(t), ϕj(t)⟩
and more generally, for any finite energy signal s(t) lying in the space generated by the orthonormal functions ϕ1(t), ..., ϕN(t), we also have:
⟨r(t), s(t)⟩ = ∑_{j=1}^{N} ⟨r(t), ϕj(t)⟩ ⟨s(t), ϕj(t)⟩
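These relations are easily verified numerically. The sketch below uses two rectangular orthonormal pulses and arbitrary illustrative signals (none of this is from W&J); it checks the last relation and Parseval's relationship by brute-force integration:

def integral(f, a, b, n=20000):
    # simple midpoint rule, sufficient for this check
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

phi1 = lambda t: 1.0 if 0 <= t < 1 else 0.0          # unit energy on [0,1)
phi2 = lambda t: 1.0 if 1 <= t < 2 else 0.0          # unit energy on [1,2)
s    = lambda t: 2 * phi1(t) - 1 * phi2(t)           # lies in span{phi1, phi2}
r    = lambda t: t                                   # arbitrary finite-energy signal on [0,2]

ip = lambda f, g: integral(lambda t: f(t) * g(t), 0.0, 2.0)

lhs = ip(r, s)
rhs = ip(r, phi1) * ip(s, phi1) + ip(r, phi2) * ip(s, phi2)
print(lhs, rhs)                                      # both ~ -0.5
print(ip(s, s), ip(s, phi1) ** 2 + ip(s, phi2) ** 2) # energy of s: both = 5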


Figure B.10:

Theorem 64. Let nw(t) be a 0-mean white Gaussian noise with power spectral density Snw(f) = N0/2 and define
nj = ⟨nw(t), ϕj(t)⟩,  j = 1, 2, ..., N
n(t) = ∑_{j=1}^{N} nj ϕj(t)
r2(t) = nw(t) − n(t)
where ϕj(t), j = 1, 2, ..., N are N orthonormal functions. Then n(t) and r2(t) are jointly Gaussian, statistically independent, 0-mean random processes.

Sketch of proof. We assume that expectations and integrals are freely interchangeable.


Figure B.11:

1. We have
nj = ∫_{−∞}^{∞} nw(t) ϕj(t) dt,
a linear operation on the Gaussian process nw(t). By theorem 44 (or a special case of theorem 44) we conclude that nj is a Gaussian random variable, ∀ j = 1, 2, ..., N.
2. The nj's are Gaussian random variables resulting from linear operations on the same Gaussian process nw(t). It follows that the random vector (n1, n2, ..., nN) is a Gaussian random vector.
3. Samples of the process n(t) at times t1, ..., tk may be expressed as
(n(t1), ..., n(tk)) = (n1, ..., nN) × [ ϕj(tl) ]
where [ϕj(tl)] is the N × k matrix of constants whose (j, l) entry is ϕj(tl), and (n1, ..., nN) is a Gaussian random vector. It follows that (n(t1), n(t2), ..., n(tk)) is a Gaussian random vector ∀ t1, t2, ..., tk and ∀ k; n(t) is therefore a Gaussian random process.
4. r2(t) = nw(t) − n(t) where nw(t), n(t) are both Gaussian random processes, and therefore r2(t) is also a Gaussian random process.


5. Both n(t), r2 (t) result from operations on the same random process nw (t); n(t) and r2 (t) are then jointly Gaussian random processes.

6. In order to show that n(t) and r2 (t) are 0-mean we first calculate :

E(nj ) = E(⟨nw (t), ϕj (t)⟩) = ⟨E(nw (t)), ϕj (t)⟩ = ⟨0, ϕj (t)⟩ = 0

It follows that

E(n(t))

= E

(∑ N

) nj ϕj (t)

j=1

=

N ∑ j=1

=

E(nj ) ϕj (t) | {z } 0, ∀j

0

E(r2 (t)) = E(nw (t) − n(t)) = E(nw (t)) − E(n(t)) =

0−0=0

Thus n(t) and r2 (t) are jointly Gaussian 0-mean random processes. Even though this is not required at this time, we next calculate E(ni nj ); we’ll

222APPENDIX B. ORTHONORMAL EXPANSION AND VECTOR REPRESENTATION OF CON need this shortly: E(ni nj )

( ) = E ⟨nw (t), ϕi (t)⟩⟨nw (t), ϕj (t)⟩ (∫ ∞ ) ∫ ∞ = E nw (t)ϕi (t)dt nw (s)ϕj (s)ds −∞ −∞ ( ∫∫ ∞ ) = E nw (t)nw (s)ϕi (t)ϕj (s)dtds −∞ ∫∫ ∞ ( ) = E nw (t)nw (s) ϕi (t)ϕj (s)dtds {z } −∞ | =

N0 2



Rnw (t−s)=

(∫

∞ −∞

|

N0 2

δ(t−s)

) ϕi (t)ϕj (s)δ(t − s)ds dt −∞ {z } ∞

ϕi (t)ϕj (t)

∫ N0 ∞ ϕi (t)ϕj (t)dt = 2 −∞ N0 = ⟨ϕi (t), ϕj (t)⟩ 2  0 ; if i ̸= j =  N0 ; if i = j 2 N0 = δi, j 2 {

where we define: δi, j ,

0 ; if i ̸= j 1 ; if i = j

for convenience. 7. By theorem 48 processes n(t) and r2 (t) are statistically independent if and only if Ln, r2 (t, s) = 0, ∀ t, s, or equivalently if and only if Rn, r2 (t, s) = 0, ∀ t, s, since n(t) and r2 (t) are 0-mean; this is proved below. First we have: ( ) Rn, r2 (t, s) = E n(t) r2 (s) (( ∑ ) N )( ) = E nj ϕj (t) nw (s) − n(s) j=1

=

E

N (∑

( )) ϕj (t)nj nw (s) − n(s)

j=1

=

N ∑ j=1

( ( )) ϕj (t)E nj nw (s) − n(s)

(B.1)

B.4. CONTINUOUS TIME SIGNALS

223

The expected value in equation (B.1) is evaluated below: ( ( )) E nj nw (s) − n(s) = nj nw (s) − nj n(s) = ⟨nw (t), ϕj (t)⟩nw (s) − nj

N (∑

) ni ϕi (s)

i=1

= ⟨nw (t)nw (s), ϕj (t)⟩ −

N ∑

nj ni ϕi (s)

i=1 N ∑ N0 N0 δ(t − s), ϕj (t)⟩ − δj, i ϕi (s) 2 2 i=1 ∫ N0 ∞ N0 = ϕj (s) δ(t − s) ϕj (t)dt − 2 −∞ 2 N0 N0 = ϕj (s) − ϕj (s) 2 2 = 0

= ⟨

Substituting the latter in equation (B.1) yields Rn, r2 (t, s) = 0, ∀ t, s. This concludes the proof that n(t), r2 (t) are jointly Gaussian statistically independent 0-mean random processes.

Smile Life

When life gives you a hundred reasons to cry, show life that you have a thousand reasons to smile

Get in touch

© Copyright 2015 - 2024 PDFFOX.COM - All rights reserved.