Using motion data to detect cardiac dysfunctions
A machine learning approach

Ole Kristian Stumpf
Master's Thesis, Spring 2016

Using motion data to detect cardiac dysfunctions
A machine learning approach

Ole Kristian Stumpf
May 2, 2016

The Interventional Centre
Oslo University Hospital, Rikshospitalet
Faculty of Medicine
University of Oslo
Oslo, Norway

Department of Informatics
Faculty of Mathematics and Natural Sciences
University of Oslo
Oslo, Norway

Abstract

This thesis investigates the possibility of predicting four different cardiac (heart) dysfunctions using machine learning and cardiac motion data. The intended goal is to integrate a motion sensor into a pacemaker, thus enabling continuous monitoring of cardiac motion during and after cardiac surgery. The cardiac motion data for the four dysfunctions was provided for this thesis. The measurements of the cardiac motion had been collected from 20 pigs at the Intervention Centre at Oslo University Hospital, Oslo, Norway. Each measurement was split into sequences representing one complete heart beat. Six different sets of features were created from the sequences. These feature sets were tested on three different classifiers: decision tree, SVM, and AdaBoost. Results show that predicting one out of the five cardiac functions (the four dysfunctions and baseline) is not feasible using the current dataset. The sparse dataset and the overlap in the data make it difficult for the classifiers to discriminate between the various dysfunctions. Predicting one cardiac function from another, however, shows good results. Some combinations of two cardiac functions are relevant for clinical use because they are often expected either during or after cardiac surgery; these are the combinations of baseline and a dysfunction. Baseline versus adrenaline, ischemia, and fluid loading, respectively, all achieved a precision and recall of more than 0.80. These results demonstrate the feasibility of using machine learning and motion data from the heart to predict certain cardiac functions.

Contents

1 Introduction
  1.1 Outline

2 Background
  2.1 Medical background
      2.1.1 The heart
      2.1.2 The cardiac cycle
      2.1.3 Cardiac motion
      2.1.4 Monitoring the heart
  2.2 Thesis background
      2.2.1 The novel technique
      2.2.2 Cardiac dysfunctions
      2.2.3 Past research regarding the novel technique
      2.2.4 Accelerometer
  2.3 Machine learning algorithms
      2.3.1 Machine learning
      2.3.2 Classifiers
  2.4 Features
      2.4.1 Curse of dimensionality
      2.4.2 Feature extraction
      2.4.3 Feature scaling
      2.4.4 Feature selection
  2.5 Model validation
      2.5.1 Cross-validation
      2.5.2 No free lunch theorem
      2.5.3 Hyperparameter optimization
      2.5.4 Overfitting
  2.6 Statistics
      2.6.1 Performance measures
      2.6.2 Significance test

3 Implementation
  3.1 Choice of implementation environment and language
  3.2 Motion data segmentation
      3.2.1 The dataset
      3.2.2 Segmentation
      3.2.3 Achieving a fixed length of the sequences
      3.2.4 Graphs of the motion sequences
  3.3 Feature extraction
      3.3.1 Domain knowledge
      3.3.2 Machine learning knowledge
  3.4 Training and testing
  3.5 Choice of grid search
  3.6 Approach

4 Experiments and results
  4.1 Two-class experiments and results
      4.1.1 Analysis
  4.2 Further improving the best two-class models
      4.2.1 Experiments
      4.2.2 Results
      4.2.3 Analysis
  4.3 Details of three two-class models
      4.3.1 Best performing model
      4.3.2 Model using only three features
      4.3.3 Model using the raw data as features
  4.4 Optimizing certain combinations of two classes
      4.4.1 Experiments
      4.4.2 Results
      4.4.3 Analysis
  4.5 Adding classes to the classification task
      4.5.1 Experiments
      4.5.2 Three classes - results
      4.5.3 Four classes - results
      4.5.4 Five classes - results
      4.5.5 Analysis

5 Discussion
  5.1 General discussion
  5.2 Conclusion
  5.3 Future work

Appendices

A MATLAB's decision tree split criteria
B MATLAB's SVM kernels
C Statistics
  C.1 Feature selection methods
  C.2 P-value

List of Figures

2.1 Diagram of the heart [4] showing the four chambers and the blood vessels connected to the heart.
2.2 Figure showing the difference between diastole and systole [63].
2.3 Figure showing the decomposition of cardiac movement into radial, circumferential and longitudinal axes [3].
2.4 Figure showing the characteristic ECG trace for a cardiac cycle [2, 8].
2.5 Figure showing a heart examination using transthoracic echocardiogram [49].
2.6 View of the heart showing the 3-axis accelerometer (motion sensor). Arrow indicates the circumferential motion [30].
2.7 Figure showing myocardial ischemia [5].
2.8 Illustration of the left ventricle, showing the basis and apex of the heart (also referred to as basal and apical regions), the LAD and the location of the occlusion and the motion sensor (accelerometer) [28].
2.9 A conceptual drawing of an accelerometer [1].
2.10 A machine learning model: a machine learning algorithm/classifier takes labeled feature vectors as inputs and produces a classifier model which can predict the label of new feature vectors [11].
2.11 Figure showing a two-class classification problem.
2.12 An example of a decision tree that, based on the outlook, humidity and windy attributes, classifies a day as belonging to either the P or N class [51].
2.13 Figure showing two iterations of AdaBoost.
2.14 Figure showing the optimal margin for a two-dimensional two-class classification problem [16].
2.15 Figure showing the transformation, often denoted φ, transforming the data into a linearly separable space [6].
2.16 Figure showing leave-one-out cross-validation (LOOCV) used on a dataset with ten examples. The lighter grey example represents a sample used in the testing set while the darker squares are samples in the training set. At each iteration, a new example is chosen as the testing example. When ten iterations have been done, all examples have been used once in the testing set and nine times in the training set.
2.17 Figure showing two classifiers and a training set consisting of two classes. The smooth line represents the good classifier, as it does not take into account the noisy data points, i.e. the data points that are surrounded by points from the other class. The jagged line represents a classifier subject to overfitting; it tries to maximize the accuracy on the training set, but it will not achieve a good accuracy on new and unseen data points [7].
2.18 Figure showing a classifier predicting cancer. Patients are represented as circles. All patients lying within the circle are predicted as cancer.
2.19 Figure illustrating the two-sample Kolmogorov-Smirnov statistic. Red and blue lines each correspond to an empirical distribution function, and the black arrow denotes the test statistic.
3.1 Figure showing the motion data collected from one animal during a cardiac function over a period of 10 seconds. The timeline is shown on the x-axis. The upper graph shows the acceleration while the next two are the velocity and displacement, marked in the top right corner of each graph. The acceleration, velocity, and displacement are shown as the calculated resultant of all three directions. The bottom graph is the ECG where the R-peak is shown as a red cross. The graph turns green during the first 150 ms after the R-peak, which is also indicated in the motion data. The reason for showing the first 150 ms after the R-peak in green is given in section 3.3.
3.2 Figure showing how a cycle that is represented with more than 300 samples is interpolated and decimated in order to be represented by exactly 300 samples.
3.3 Figure showing the average acceleration of all cycles in the circumferential direction.
3.4 Figure showing the average velocity of all cycles in the circumferential direction.
3.5 Figure showing the average acceleration of all cycles in the longitudinal direction.
3.6 Figure showing the average velocity of all cycles in the longitudinal direction.
3.7 Figure showing the average acceleration of all cycles in the radial direction.
3.8 Figure showing the average velocity of all cycles in the radial direction.
3.9 Figure showing velocity and displacement in the circumferential direction during baseline and ischemia. Thick lines indicate the middle of systole, as approximated by Halvorsen et al. [30].
3.10 Figure showing average peak midsystolic velocity, denoted Vsys. The bottom figure is from an accelerometer mounted in the basal region of the heart (close to the top), but these measurements were discarded. The top figure is from an accelerometer mounted on the apical region (close to the bottom). Vsys is shown during all four interventions (LAD occlusion is ischemia) in addition to baseline (healthy). The ± standard deviation is also shown [23].
3.11 Representative left ventricular curves from a single patient of accelerometer circumferential velocity and acceleration during baseline and preconditioning LAD occlusion (ischemia) [28].
3.12 Figure showing the 2-dysfunction-classification task, which aims to find the best feature set and classifier. The diamond boxes represent loops, while the produce-result block is specified in figure 3.13. The parallelograms represent an input and output.
3.13 Figure showing the steps in creating a model and achieving a result. Rectangles represent a process that takes an input and produces an output. The dashed rectangle is a process that is sometimes skipped. White ellipses represent a result from a process. The parallelograms represent an input and output.
4.1 Figure showing how many times each sample number from the acceleration in the circumferential direction is chosen to be used as a feature in the two-class classification task.
4.2 Figure showing how many times each sample number from the velocity in the circumferential direction is chosen to be used as a feature in the two-class classification task.
C.1 Figure showing the basic idea of the Relieff feature selection algorithm with k set to one.

List of Tables

2.1 A confusion matrix showing the results of a 3-class classification task. It shows that the classifier got all the instances that belonged to class C1 correct, but misclassified 2 instances of class C2 as C1 and is having difficulties getting the C3 class correct. Naturally, the desirable result is to have low numbers on the off-diagonal.
3.1 Overview of how many subjects there are for each cardiac function. The various cardiac functions are described in section 2.2.2.
3.2 Table showing the number of cycles and the standard deviation for each class.
3.3 Table showing the average number of samples used for each cardiac dysfunction. The results were scaled into the range [0, 1].
3.4 Table showing the hyperparameters used in the grid search for decision tree.
3.5 Table showing the hyperparameters used in the grid search for AdaBoost.
3.6 Table showing the hyperparameters used in the grid search for SVM.
4.1 Table showing the results after following the approach shown in figures 3.12 and 3.13. The three best results are marked with bold text. The various feature sets are specified in section 3.3.1.
4.2 Table showing the hyperparameters used on the most promising models which use AdaBoost as their classifier.
4.3 Table showing the hyperparameters used in the finer grid search for decision tree.
4.4 Table showing the results of the experiments specified in section 4.2. The best result is shown in bold. Experiments five and six are based on the underlined results from experiment four.
4.5 Table specifying the models that will be further investigated.
4.6 Table showing the accuracy and confusion matrix of each class for the model that performed best, which was AdaBoost using all features, achieving an average accuracy of 0.84. Regarding the confusion matrix, columns represent the true class, while rows represent the predicted class.
4.7 Table showing the accuracy and confusion matrix of each class for the model that performed second best, which was decision tree using all features and Relieff feature selection. In addition, the features chosen by Relieff are also shown. The numbers point to the list of features shown in 4.3.
4.8 Table showing the accuracy and confusion matrix of a model only using the raw data as features, in combination with the CFS feature selection method.
4.9 Table showing which models will be created for certain combinations of two classes. Note that a grid search will be done for each of the combinations. All features are summarized in section 4.2.3.
4.10 Table showing accuracy, precision, recall and true negative rate for the clinically most important combinations of classes. The latter class in the leftmost column is used as the positive class in the calculations.
4.11 Table specifying the models used to achieve the results shown in table 4.10.
4.12 Table showing which models will be created when adding classes to the classification task.
4.13 Table showing the details of the best performing models using three classes.
4.14 Table showing the decomposition of the average accuracy for the best model, shown in table 4.13.
4.15 Table showing the details of the best performing models using four classes.
4.16 Table showing the decomposition of the average accuracy for the best model, shown in table 4.15.
4.17 Table showing the details of the best performing models using five classes.
4.18 Table showing the model achieving the best accuracy when including all classes.
4.19 Table showing the model achieving the second best accuracy when including all classes.
B.1 MATLAB's SVM kernels. σ, γ and d are hyperparameters which must be set beforehand, outside of the training process. Hyperparameters are reviewed in section 2.5.3.

Acknowledgment

I would like to express my deepest gratitude to my three supervisors: Ole Jacob Elle, Espen Remme and Jim Tørresen, for their continuous guidance and support. A special thanks to PhD candidate Ole Marius Hoen Rindal from the DSP and Image Analysis Research Group, who voluntarily acted as a fourth, and highly valuable, supervisor. My sincere thanks to all the fellow students at the Robotics and Intelligent Systems research group for making a great learning environment since the very start of the bachelor studies. Special thanks to Joakim Myrvoll Johansen for the valuable discussions regarding the structure of the thesis. I would also like to thank my friends and family, and especially my significant other Cecilie N. Michaelsen, for the endless support and her feedback on the thesis.

Chapter 1

Introduction

Recent decades have shown that a vast number of problems can be solved using machine learning. This is a result of an improved understanding of machine learning and an increase in computational resources. When the dimensionality of the data is too large for manual inspection, machine learning has often proved useful.

Research in the field of motion sensors and machine learning has mostly focused on activity classification, that is, classifying the activity currently performed, such as walking, running and climbing stairs, while wearing a motion sensor. This has been done with promising results [50, 53], which suggests that the approach could be applied to more fine-grained motion.

A novel technique for detecting heart diseases using cardiac (heart) motion has been developed at The Intervention Centre, Oslo University Hospital (OUS), Rikshospitalet, Oslo, Norway. The intended goal is to use a combined miniaturized accelerometer and temporary pacemaker lead attached to the heart during surgery and withdrawn before the patient is discharged. Such a sensor would permit continuous monitoring of ventricular function during and after cardiac surgery, enabling early detection of complications and treatment guidance [22, 29]. However, the substantial amount of data produced by the sensor makes it difficult to analyze. The goal of this thesis is to use machine learning to create a predictor which can label the heart function of patients.

One of the cardiac dysfunctions is myocardial ischemia, which is one of the most prominent heart diseases. Patients who have undergone cardiac surgery have an increased risk of experiencing ischemia: 2-6% of arterial and 10-20% of venous grafts (transplanted blood vessels) are occluded after one year [32]. Early detection of ischemia is therefore highly desirable. However, the tools for continuously monitoring and detecting ischemia have room for improvement. The most prominent tool is electrocardiography (ECG), which monitors the electrical activity of the heart. The ECG trace will exhibit a specific abnormal activity when ischemia is present, but this is not always the case [36]. There are also cases where the abnormal trace is exhibited but ischemia is not present [62]. Another tool is echocardiography, but it requires a skilled technician to interpret the images and is therefore too cumbersome to use as a continuous monitoring technique. Hence, a sensor that continuously monitors the heart and automatically detects ischemia during and after heart surgery would be beneficial [54].

The machine learning model that will be built makes use of a dataset created at OUS. Several pigs underwent interventions to simulate ischemia. Three additional interventions were done to alter the heart function using the same pigs. These include injection of the drugs esmolol and adrenaline, which decrease and increase the heart rate, respectively, as well as fluid loading, which alters the stroke volume. A model which can classify these cardiac dysfunctions increases the potential clinical applicability of the accelerometer in cardiac surgery [23]. The intended approach is to test several machine learning classifiers and feature sets. In addition, all two-, three- and four-class combinations of the cardiac functions will be tested.

1.1 Outline

The thesis is divided into four additional chapters: (i) background, (ii) implementation, (iii) experiments and results, and (iv) discussion. The background chapter elaborates further on the novel monitoring technique and introduces the relevant machine learning concepts. The implementation chapter explains and justifies the various choices made regarding the machine learning model. The experiments and results chapter outlines the experiments conducted and presents the results along with a short analysis where appropriate. Lastly, the discussion chapter contains a general discussion along with a conclusion of the thesis and suggestions for future work.


Chapter 2

Background

This chapter gives an introduction to the theory relevant to the thesis. A medical background is presented first, which is intended to give an understanding of the heart and its motion. Next is a part which describes the novel technique and the past research that led to this thesis. The following section, section 2.3, introduces machine learning in general and the three machine learning algorithms used.

2.1 Medical background

2.1.1 The heart

The heart (cardiac), seen in figure 2.1, is a muscular organ roughly shaped as a cone, with a wide basis and a pointed apex (tip). It is located behind and slightly left of the breastbone, with the apex pointing downwards and to the left. The heart contains four chambers: the upper left and right atria and the lower left and right ventricles. The atria work as blood collection chambers for the ventricles, which expel blood out of the heart. Blood enters the right atrium, which pumps it downwards to the right ventricle. The right ventricle pumps blood to the pulmonary system, i.e., the lungs, which fill the blood with oxygen, and the left atrium then collects it. The left atrium pumps it downwards to the left ventricle, which pumps it back out to the circulatory system, i.e., the body [58].

2.1.2 The cardiac cycle

The cardiac cycle refers to a complete heartbeat from its generation to the beginning of the next beat. A normal resting heart rate for adults ranges from 60 to 100 beats a minute, which implies that a cardiac cycle lasts from 0.6 to 1 second. The cardiac cycle is split into two main phases: the diastole and the systole. The diastole refers to the part of the cycle where the ventricles are relaxed in preparation for refilling with circulating blood, whereas the systole refers to the ejection of blood into an adjacent chamber or vessel. This is seen in figure 2.2. During most of the diastole, blood is passively flowing from the atria to the ventricles. The atria contract at the end of diastole, which propels an additional amount of blood into the ventricles. Systole represents the time in which the myocardium (cardiac muscle) contracts. In the first phase of systole, blood is flowing from the atria to the ventricles while the myocardium contracts. In the second phase of systole, blood is expelled out of the ventricles to the body and lungs while the myocardium still contracts.

Figure 2.1: Diagram of the heart [4] showing the four chambers and the blood vessels connected to the heart.

Figure 2.2: Figure showing the difference between diastole and systole [63].

Several other phases within the cardiac cycle are defined. The two most interesting ones in this project are mid-systole and the isovolumetric relaxation (IVR) phase, as reasoned in section 3.3. Formally, mid-systole begins after the first of two heart sounds and ends before the second heart sound [52]. Mid-systole is the phase where the valves which keep the blood from flowing towards the lungs and body are open, and blood is ejected through them. The IVR phase occurs at the end of systole, where the ventricular muscle relaxes and decreases its tension without lengthening, so that the ventricular volume remains the same [21].

2.1.3 Cardiac motion

This section aims to provide an overview of how the heart moves during the cardiac cycle. The motion will be decomposed into three axes, shown in figure 2.3, because the motion data at hand is decomposed in the same way. Hence, it can give a better understanding of the data.

Figure 2.3: Figure showing the decomposition of cardiac movement into radial, circumferential and longitudinal axes [3].

Circumferential motion

Circumferential motion is the rotation of the heart, that is, its movement upon itself. Usually, the rotation is measured from the left ventricle in the apical region (lower part of the heart) and is given in degrees. The rotation of the left ventricular apical region is about 11 ± 5° in the in situ healthy human heart [54], reaching peak velocities at the beginning of ventricular systole [13]. The rotation is counterclockwise as seen from the apex. By the middle of ventricular systole, the basal and mid-ventricular segments reverse their circumferential motion, reaching peak clockwise velocities by the end of this phase. The apical segments, however, continue their counterclockwise circumferential motion until ventricular repolarization occurs.

Longitudinal motion

Regarding movement in the longitudinal direction, there is little apical displacement during systole. All segments move downwards towards the left ventricular apex. Before the end of systole, a fast recoil motion occurs, moving all segments upwards. The movement continues until the blood stops being expelled from the left ventricle.

Radial motion

The radial motion on the epicardial surface is reduced due to the wall thickening and thinning that occur during the phases when the endocardial surface moves inward and outward, respectively. Consequently, the dominant motion of the left ventricular epicardial apical region, where the motion sensor is placed, is circumferential displacement or rotation [54].


2.1.4 Monitoring the heart

The following sections give an introduction to the various techniques used for cardiac monitoring and their weaknesses.

Figure 2.4: Figure showing the characteristic ECG trace for a cardiac cycle [2, 8].

Electrocardiography

Electrocardiography (ECG) is the process of monitoring the electrical activity of the heart over time. The monitoring is achieved by placing electrodes on the skin that detect the tiny electrical changes that arise from the depolarization of the heart muscle during each heart beat. The electrical changes are displayed as a graph that shows voltage over time. A healthy heart will show the characteristic ECG tracing where each wave is named, seen in figure 2.4. A typical ECG tracing is a repeating cycle of three electrical entities: a P wave, which shows the atrial depolarization (contraction), a QRS complex, which shows ventricular depolarization, and finally a T wave, which shows ventricular repolarization (expansion/relaxation). To the trained clinician, an ECG conveys a large amount of information about the structure of the heart and the function of its electrical conduction system.

Ischemia is a decreased blood flow to the muscle tissue of the heart. This causes a depolarization of the resting membrane potential of the ischemic region relative to that of the normal region, which is manifested as either an elevated or depressed ST segment in the ECG, though not always. There are several drawbacks of using ECG as a detection method for ischemia. These include: (i) the ST depression or elevation may not be present in the ECG trace, (ii) baseline drift, (iii) varying ST-T patterns in the ECG of the same patient, (iv) noise, (v) power line interference, and (vi) patient-dependent abnormal ST depression levels [44].

Echocardiography

Echocardiography uses high-pitched sound waves sent through a device called a transducer to create images of the heart, seen in figure 2.5. A standard echocardiogram is also known as a transthoracic echocardiogram or cardiac ultrasound. Views of the heart are obtained by moving the transducer to different locations on the chest. It is the most widely used diagnostic technique in cardiology, as it can provide helpful information about the heart regarding its size, shape and pumping capacity, and the location and extent of tissue damage. It can also provide an accurate assessment of myocardial ischemia [39]. However, it requires a skilled technician to operate the transducer and interpret the images and is therefore too cumbersome to use for continuous monitoring [28].

Figure 2.5: Figure showing a heart examination using transthoracic echocardiogram [49].

2.2 Thesis background

Having introduced the most relevant medical theory, the past research leading to this project can be presented. First, the novel technique for monitoring the motion of the heart is given. The motion data produced by this technique is used in this project. Then, the various cardiac functions that the technique is intended to monitor are presented. Lastly, a summary of the preceding studies regarding the technique is given.

2.2.1 The novel technique

This thesis is based on a novel technique for automatic detection of various cardiac dysfunctions developed at The Intervention Centre, Oslo University Hospital, Rikshospitalet, Oslo, Norway. The technique uses a motion sensor attached to the heart, seen in figure 2.6. The motion sensor is a 3-axis accelerometer, measuring acceleration in the three directions specified in figure 2.3.

Figure 2.6: View of the heart showing the 3-axis accelerometer (motion sensor). Arrow indicates the circumferential motion [30].

Several pigs underwent interventions to simulate the various cardiac dysfunctions the technique was intended to monitor. During the interventions, measurements from the motion sensor were collected into a dataset. It is this dataset the project uses to build the machine learning model. The reason for attaching a motion sensor is that myocardial (heart muscle) dysfunction is associated with motion abnormalities [64, 66]. A miniaturized motion sensor attached to the cardiac wall may continuously monitor myocardial motion and automatically detect abnormalities. The proposed use of the technique is during and after cardiac surgery [22], as patients undergoing cardiac surgery may experience myocardial dysfunction or ischemia during surgery or in the first few days following the operation [15]. Additionally, the present alternatives, ECG and echocardiography, have the shortcomings mentioned above.

2.2.2 Cardiac dysfunctions

The following paragraphs introduce the four different cardiac dysfunctions the model will try to classify from each other and from baseline (healthy). Consequently, they will also be denoted as classes. The paragraphs explain their consequences as well as how they can be induced. Note that these functions cannot appear simultaneously.


Figure 2.7: Figure showing myocardial ischemia. [5]

Ischemia/Occlusion

Ischemia is the term for what happens when the myocardium (heart muscle) does not receive enough blood, usually caused by a narrowing or blockage of one or more of the blood vessels supplying blood to the myocardium, as seen in figure 2.7. When this happens, the muscle cells which do not receive enough blood start to hibernate. When hibernating, the muscle cells remain viable, but contraction is chronically depressed because the cells try to reduce energy consumption, leading to a change in motion. If the blood flow to the hibernating muscle cells is too low, the muscle cells will die and cannot be recovered. Common symptoms are chest pain or discomfort which may travel into the shoulder, arm, back, neck or jaw. The pain tends to get worse with activity and go away with rest. Other symptoms are shortness of breath, sweating and a feeling of nausea. The most severe complication is a heart attack, which happens if the blood vessel becomes completely blocked. The part of the heart muscle dependent on blood going through the obstructed blood vessel will die. A heart attack can be fatal if immediate emergency services are not received [14].

Myocardial ischemia is the most common cause of death in most Western countries and a major cause of hospital admissions [9]. It might be treated with medications that improve blood flow to the heart muscle. If the patient does not respond to the medications, surgery is needed. The most common surgical procedures are (i) angioplasty and stenting and (ii) coronary artery bypass surgery. The former procedure forces the vessel open by inserting a stent, while the latter creates a graft around the narrowed area by using a vessel from another body part.

Ischemia is inflicted on the animals by using a vascular occluder around a blood vessel. The vascular occluder can be thought of as a pinch that hinders the blood flowing through the vessel. The blood vessel occluded is the left anterior descending artery (LAD), as seen in figure 2.8. The LAD and its branches supply the left ventricle with blood, and it is the vessel most commonly occluded for research purposes. The left ventricle is the thickest of the heart's chambers and is responsible for pumping oxygen-rich blood to tissues all over the body.

Figure 2.8: Illustration of the left ventricle, showing the basis and apex of the heart (also referred to as basal and apical regions), the LAD and the location of the occlusion and the motion sensor (accelerometer) [28]

Low contractility/Esmolol

Myocardial contractility refers to the ability of the myocardium to contract. An injection of the drug esmolol will cause a decrease in the force and rate of heart contraction [35]. It is commonly used in patients during surgery to prevent the heart rate from increasing past the normal rate, a condition known as tachycardia. Injecting too much esmolol can cause the heart to become ischemic.

High contractility/Adrenaline

Epinephrine, also known as adrenaline, is a hormone that can be used as a stimulant in cardiac arrest. Cardiac arrest is a sudden stop in effective blood circulation due to a stop in myocardial contraction. Epinephrine will increase blood flow as well as increase the heart rate, muscle strength and blood pressure [38]. Epinephrine is also used to treat anaphylactic shock, an allergic reaction causing the heart to be unable to pump enough blood. The details of how both esmolol and adrenaline were administered to the animals can be found in [22].

Fluid loading

The intervention regarding fluid loading will enhance preload [47]. Preload is the measure of the initial stretching of the cardiac muscle cells at the end of diastole. Increasing preload will cause an increase in stroke volume, while decreasing preload will cause a decrease in stroke volume¹. To determine how the accelerometer measurements responded to an elevation in volume loading (preload) of the heart, rapid intravenous infusion of body-tempered isotonic saline² was performed.

¹ The volume of blood pumped from the left ventricle per beat.
² Sterile sodium chloride solution at physiological concentration (0.87%); used as a wound cleansing agent or irrigating fluid.

2.2.3 Past research regarding the novel technique

A brief overview of the research regarding the novel technique is given here³. Elle et al. [19] describe the first study of how accelerometers can be used to recognize cardiac ischemia. An accelerometer was sutured onto the left ventricle in the region supplied by the left anterior descending coronary artery (LAD) on three different pigs. The LAD was totally occluded, and it was shown that the positive-to-negative peak systolic acceleration was reduced by approximately 40% from baseline to occlusion. A method to detect the early changes in the measured acceleration just after occlusion and during incipient ischemia was proposed, using a difference value based on the FFT frequency pattern.

The next study, by Halvorsen et al. [28], concluded that “...accelerometer detects myocardial ischemia with great accuracy. This novel technique has potential to improve monitoring of myocardial ischemia during cardiac surgery”. Further development was made by Grymyr et al. [23]. A method for automatically calculating a 3D velocity vector from accelerations in the circumferential, longitudinal and radial directions was made. The method facilitates insertion of the accelerometer, as no precise alignment is required.

The weakness of these studies is that the data is manually analyzed. When a substantial amount of data is collected, creating a rule-based system for prediction is difficult and poses a risk of missing out on valuable pieces of data. This weakness is alleviated with machine learning, which, instead of following rule-based models, automatically identifies discriminative values in the measurements to predict the outcome of new motion data. There has been a limited amount of research using machine learning and a motion sensor attached to the heart to detect various cardiac dysfunctions.

³ A detailed description of the technique can be found in the following research papers (sorted by date of publication): [19, 22, 23, 28–30, 54].

2.2.4 Accelerometer

The motion sensor used by the described technique is an accelerometer. An accelerometer is a device that measures proper acceleration, meaning the physical acceleration experienced by an object. A conceptual drawing is shown in figure 2.9.

Figure 2.9: A conceptual drawing of an accelerometer. It consists of a mass m, that can move relative to the accelerometer’s housing, a spring that connects the mass to the housing k, and a damper that dissipates energy and keeps the spring-mass system from vibrating forever. The other constants are: the damping constant c, the position of the moving block relative to an inertial (stationary) frame of reference x, and the position of the housing relative to the same inertial frame y [1]. One can think of an accelerometer as a device that behaves like a damped mass attached at the end of a spring. When the device experiences an acceleration, the mass is displaced to the point that the spring is able to accelerate the mass at the same rate as the casing. The displacement is then measured to give the acceleration. Most studies using an accelerometer and machine learning have been in the field of activity detection and fall detection. Ravi et al. [53] uses an accelerometer worn near the pelvic region and manages to detect different activities such as standing, walking and running with high accuracy. The measurements from the accelerometer were used to calculate the average, standard deviation, energy and correlation, which in turn were used to classify the activities. Albert et al. used the embedded smartphone accelerometer and a machine learning model to detect if the subject had fallen, as well as classifying the type of fall, were four types of fall were defined. The model achieved 98% accuracy in detecting fall and 99% accuracy in detecting the type of fall. These results demonstrate the potential use of accelerometer, suggests that it could be used to classify more fine-grained motions, such as cardiac function. No past research using an accelerometer to predict cardiac function, except for the studies mentioned in the previous paragraph were found.

2.3

Machine learning algorithms

The remainder of this chapter will introduce machine learning and its various concepts relevant to this thesis. This section will attempt to introduce machine learning in general and shortly present three different 12

2.3. MACHINE LEARNING ALGORITHMS

machine learning algorithms.

2.3.1

Machine learning

Figure 2.10 shows the general procedure of implementing and using a machine learning model to predict the true label of new and unseen data.

Figure 2.10: A machine learning model consists of a machine learning algorithm/classifier takes labeled feature vectors as inputs (seen in a) and produces a classifier model which can predict the label of new feature vectors (seen in b) [11]. Machine learning is the process of building a model from a dataset in order to make data-driven predictions or decisions. The dataset consists of pairs, seen as training examples. Each pair consists of an input/feature vector and the corresponding output, also referred to as the class. The training process consists of analyzing the pairs and producing an inferred function, which then can be used for mapping new and unseen data to a class. More precisely, given training set of N example input-output pairs

( x1 , y1 ), ( x2 , y2 ), ...( x N , y N ), where y j was generated by an unknown function y = f(x), the training/learning task is to discover a function h that approximates the true function f, also denoted the decision boundary. When the number of inputoutput pairs in the dataset grows, the difference between h to the true function f decreases, hence, we say that the model is learning the true function from the input. We say that h is a good approximation of f, if the model generalizes well. That is, that the model often predicts the correct label to new and unseen inputs. The learning process inherent in any machine learning model is what separates machine learning from an algorithm that follow strictly static program instructions. Figure 2.11 shows an example of a training set, a machine learning algorithms approximation h of the true function f, and the prediction of a new and unseen input. 13

2.3. MACHINE LEARNING ALGORITHMS

Figure 2.11: Figure showing a two-class classification problem. All squares belongs to the training set. Each square is a two-dimensional feature vector with values according to x1 and x2 . The black and grey color indicates the label of the feature vector. The classifier has approximated the true function and is showed as the solid line (decision boundary). The white triangle represents a new and unseen inputs that the classifier will predict. Since it is placed above the decision boundary, the classifier will predict that it belongs to the black class.

2.3.2

Classifiers

The classifier, or the machine learning algorithm, as seen in figure 2.10 are the component which produces the inferred function h=f(x). There are a vast number of classifiers and the three next paragraphs will give a summary of the classifiers intended to be used in this thesis and their strengths and weaknesses and why they have been chosen. Decision trees The decision tree classifier is one of the most simple classifiers, hence the first to be introduced. It was first presented by JR Quinlan in 1986 [51]. The learning phase involves constructing a tree consisting of nodes and edges using the dataset. A node represents a test on one of the features, and the edges represent an outcome of the test. Leaf nodes represent the outcome class. Predicting then consists of following a path from the root node to a leaf node. An example of a decision tree is seen in figure 2.12. An important aspect of decision trees is how to create one, as different trees give rise to different classifiers. MATLABs (used in this project) method for creating classification tree is known as a top-down induction of decision trees (TDIDT) and is an example of a greedy algorithm. It works recursively in the following manner: 1. Start with all input data and examines all possible binary splits on every feature. 2. Select a split with the best optimization criterion. Information about 14

2.3. MACHINE LEARNING ALGORITHMS

Figure 2.12: An example of a decision tree that based on the outlook, humidity and windy classifies a day to belong to either the P or N class [51]. the split criterion is given in appendix A. 3. Impose the split 4. Repeat recursively for the two child nodes. There are several reasons for the choice of testing decision tree in this project. It is conceptually simple and deals well with noise [43] (data that is not representative of the true function). It is a white box model, meaning that the decisions that led to the specific prediction can be seen. This is desirable if knowledge about the dataset is sought. One of the disadvantages stems from its greedy strategy. Finding the right root of the tree can greatly impact the final result. I.e., small changes early on can have big effects later. Furthermore, if the true underlying function is truly linear, decision tree does usually not work well. Lastly, decision tree does not function well if there are large numbers of uncorrelated features since it works by finding the interactions between variables. Ensemble learning Ensemble learning is the process of combining multiple classifiers into one to obtain better performance. The assumptions behind combining classifiers are: (i) if every classifier is wrong in one unique subset of the dataset, the model will still predict correctly since the rest of the classifiers are correct on the same subset, and (ii) this holds for every subset. The question arises on how to combine the classifiers used. The most widely used way of combining classifiers is through a technique called boosting. In 1989, Robert E. Schapire introduced the first boosting algorithm [59], in the paper “The strength of weak learnability”. The general idea is to collect a set of poor classifiers, where each classifier performs slightly better than chance. By combining them together, they will perform arbitrarily well. An improvement of boosting was introduced by Yoav Freund and the same Schapire in a paper from 1999 [20], where AdaBoost (Adaptive boosting) was presented. The fundamental difference 15

2.3. MACHINE LEARNING ALGORITHMS

was to assign weights to each feature vector according to how difficult previous classifiers have found to get it correct. In short, AdaBoost works iteratively until some number of iterations is reached or if the error rate is beneath some threshold. At each iteration, a new classifier is trained on the weighted training set. Initially, all weights are set equally, but in each round, the weights of incorrectly classified feature vectors are increased. The next classifier will then have to focus to get these feature vectors correct. The change of weights is usually set to decrease with a fixed number as the number of iterations grows. This is because the uncertainty increases with the number of iterations. This fixed number is known as the learning rate. The general idea can be seen in figure 2.13.

Figure 2.13: Figure showing two iterations of AdaBoost, labeled 1, 2 and 3. In figure 1, a point belonging to the triangle class is misclassified. The point gets a bigger weight, indicated by its bigger size and is thus correctly classified in figure two. This leads to a misclassification of a point in the star class. The point gets a bigger weight resulting in a perfect classification in figure 3.

AdaBoost is proven to work on small datasets which are the main reason it is chosen to be tested [45]. AdaBoost uses the mentioned decision trees as its weak learners. In practice, the trees will favor splitting feature vectors with high weights. Consequently, noisy feature vectors will be given higher weights. Hence, AdaBoost will not work well when the number of outliers is to be considered high. Additionally, training can be time-consuming when the dataset is large, and convergence time can be slow for complex problems [60]. This weakness is alleviated since the dataset at hand is relatively sparse

Support vector machines (SVM) The original SVM algorithm was first presented by Vladimir N. Vapnik and Alexey Ya. Chervonekis in 1963 while the standard used today was introduced by Corinna Cortes and Vladimir Vapnik in 1995 [16]. It is considered to be a complex classifier. Hence, the presentation of it will be limited to the general idea. 16

2.3. MACHINE LEARNING ALGORITHMS

Figure 2.14: Figure showing the optimal margin for a two dimensional two-class classification problem [16]. The circles and crosses are feature vectors with values according to the x- and y-axis. The circle and cross represent its belonging class. The examples surrounded with a grey square denotes the support vectors. The dotted line is the decision boundary/hyperplane created by the SVM. Consider the 2-class classification problem in figure 2.14 where the classes are represented as crosses and circles and the approximation h = f(x) i.e. the decision boundary is represented as a dotted line. The prediction of new data consists of investigating which side of the decision boundary it lies on. If it lies above the dotted line, it is classified as belonging to the cross-class, and it is classified as the circle class if it lies below. If the predicted data lies close to the decision boundary, one can see that if the decision boundary was moved only a small amount, there is a risk of misclassifying that data. SVMs however, will take this into account and try to find the best decision boundary. Formally, the best decision boundary is the one that lies furthermost away from any of the training examples. That decision boundary is said to be maximizing the margin. Hence, the training process consists of creating a decision boundary which maximizes the margin. Cortes et al [16] showed that the process of maximizing the margin with respect to the input vectors only depends on the dot product of them. This is a key observation as an apparent weakness of SVMs is how it tries to create a straight line between the classes in the 2-dimensional case or a hyperplane in the high-dimensional case. Consequently, SVMs apparently only feasible if the classes are linearly separable, as the problem in figure 2.14. In order to overcome this issue, one can transform the data to a higher dimensional space where the data is linearly separable, as seen in figure 2.15. The decision boundary was dependent only on the dot-product of the feature vectors. This means that the transformation itself is not needed, but only a function that implicitly computes the high-dimensional dot-product. This function is referred to as a kernel function and is a vital component of the 17


Figure 2.15: Figure showing the transformation, often denoted φ, transforming the data into a linearly separable space. The solid line is the decision boundary, and the circles are data from the training set [6].

The linear, polynomial and Gaussian kernels are provided by MATLAB. A more detailed description of them is given in appendix B. The choice of kernel depends on the dataset, but the RBF (Gaussian) kernel may be the most used kernel [12]; hence it is the only kernel tested in this project, which also eases the computational complexity. The reason for including SVM as one of the three classifiers is that it often provides significantly better classification performance than other classifiers on reasonably sized datasets [45]. It has strong theoretical foundations and excellent empirical successes [37]. Furthermore, SVM is very robust against ill-behaved distributions, high dimensionality and a low ratio of training examples to the dimensionality of the input data [10]. One drawback is the lack of transparency in the results, as the computation might be done in a very high-dimensional space. SVM is also computationally expensive and thus runs slowly for large datasets [40]. As with AdaBoost, this weakness is alleviated since the dataset at hand is relatively sparse.
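As a minimal sketch, an SVM with an RBF kernel can be trained with the toolbox function fitcsvm, which is also the function used later in this thesis. The variable names X, y and Xtest, as well as the hyperparameter values, are assumptions.

    % Train a 2-class SVM with an RBF (Gaussian) kernel
    mdl = fitcsvm(X, y, 'KernelFunction', 'rbf', ...
                  'BoxConstraint', 1, 'KernelScale', 1);
    % Predict the class of new feature vectors
    yhat = predict(mdl, Xtest);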

2.4 Features

The choice of features, the components of a feature vector, is crucial when building a model. They are the inputs to the model and are used to discriminate between the classes. For this reason, they need to be informative about the classes. For instance, if the model's task is to detect spam mail, examples of informative features could be the frequency of certain terms and the grammatical correctness of the text. The following paragraphs will elaborate on the choice of features.

2.4.1 Curse of dimensionality

The curse of dimensionality refers to the problem that arises with high-dimensional input vectors: more data is needed to enable the model to generalize adequately. When the dimensionality of the input


vector increases, the complexity of the underlying pattern might increase. Regardless, the model needs more examples to uncover the underlying pattern, which is not always possible, as in this project.

2.4.2 Feature extraction

Feature extraction is the process of computing the feature vectors that are used as input to the classifier. This is also the “feature extractor” block in figure 2.10. The features extracted should be values that separate the classes well. Informally, a good set of features will make it easy for the classifier to discriminate between the various classes, but finding these features is domain specific. The choice of features in this project is elaborated on and reasoned about in the next chapter, in section 3.3.

2.4.3 Feature scaling

Different features often take on values in very different ranges, which is not desirable. Consider two features, one ranging from 1 to 10 and the other from 1 to 1000. If the classifier uses the distance between data points as a way of classifying, it will be governed by the latter feature, as the differences in the former are always small in comparison. Ensuring standardized feature values implicitly weights all features equally in their representation.
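A minimal MATLAB sketch of such standardization, where the mean and standard deviation are estimated on the training set and then applied to the test set (the variable names Xtrain and Xtest are assumptions):

    % Estimate mean and standard deviation per feature on the training set
    mu    = mean(Xtrain, 1);
    sigma = std(Xtrain, 0, 1);
    % Apply the same transformation to both sets
    Xtrain_s = (Xtrain - repmat(mu, size(Xtrain, 1), 1)) ./ repmat(sigma, size(Xtrain, 1), 1);
    Xtest_s  = (Xtest  - repmat(mu, size(Xtest, 1), 1))  ./ repmat(sigma, size(Xtest, 1), 1);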

2.4.4 Feature selection

Feature selection is the process of selecting a subset of features that are relevant. There are several benefits of doing feature selection [25]: (i) increasing classifier performance, (ii) reducing storage requirements, (iii) reducing training time, and (iv) defying the curse of dimensionality. Feature selection methods are usually divided into three categories: filters, wrapper methods and embedded methods. Filters do not involve the classifier but analyze intrinsic properties of the data to find useful features. Wrappers, on the other hand, use a classifier in the selection process and tend to perform better, but are also computationally more expensive for large datasets [31]. Additionally, wrappers tend to overfit when the number of examples is insufficient. Embedded methods are essentially a classifier with an embedded feature selection; selecting features is therefore an inherent part of the learning process. They tend to be more computationally efficient than wrappers [57]. Note that a feature that is completely useless by itself can provide a significant performance improvement when used together with other features [25]. Two feature selection methods will be tried and tested, one filter method and one wrapper method. The filter method is the Correlation-based Feature Selection (CFS), and the wrapper method is the Relieff algorithm. An introduction and their strengths and weaknesses are given in appendix C.1.
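As a minimal sketch, Relieff is available in MATLAB as the function relieff, whereas CFS would come from a third-party implementation. The variable names, the number of neighbours and the number of features kept below are assumptions used only for illustration:

    k = 5;                                 % number of nearest neighbours used by Relieff
    [ranked, weights] = relieff(X, y, k);  % ranked(1) is the most relevant feature
    Xsel = X(:, ranked(1:10));             % keep, e.g., the ten highest-ranked features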



Lastly, it is important to note that the no free lunch theorem that is described in section 2.5.2 also holds for feature selection. Consequently, there is no optimal feature selection method for all problems. More than one method will therefore be used. Guo et al. [24] reported that Relieff and CFS performed consistently better than seven other feature selection methods. Additionally, using a filter and a wrapper will ensure that two conceptually different methods are tried and tested.

2.5 Model validation

This section intends to introduce concepts regarding testing of machine learning models. The first paragraph will describe how a machine learning model is commonly tested. The next paragraph states why more than one classifier is tested in this project. Next, an approach on how to find the best parameters for a classifier is described. The final paragraph, 2.5.4 will present the concept of overfitting, which needs to be addressed when training and testing a machine learning model.

2.5.1 Cross-validation

Techniques must be used to assess the quality of the model. A common technique is to split the dataset into a training set and a test set and use a model validation technique known as cross-validation. The model is built using the training set and its quality assessed using the test set. The test set is assumed to approximate the typical unseen data that a model will encounter. The weakness is that this approach only works for large datasets, and not for the dataset in this project. If the dataset is not large enough, the two sets will not represent the true function properly, since they become more sensitive to noise. A common workaround is known as k-fold cross-validation. Here, the dataset is randomly partitioned into k subsets, where one subset is used as the test set and the rest as the training set. The model is trained using the training set and then assessed using the test set. A new partition is then selected as the test set and the old test set becomes part of the training set. This process is repeated until all partitions have been used as a test set. The results can be averaged to produce a single error estimate. This procedure has the advantage that all examples in the dataset are eventually used for both training and testing. If k is set to the number of feature vectors in the training set, it is known as leave-one-out cross-validation (LOOCV), and is shown in figure 2.16.


Figure 2.16: Figure showing leave-one-out cross-validation (LOOCV) used on a dataset with ten examples. The lighter grey square represents the sample used in the test set while the darker squares are samples in the training set. At each iteration, a new example is chosen as the test example. When ten iterations have been done, all examples have been used once in the test set, and nine times in the training set.

Here, exactly one feature vector is used as the test set and the rest as the training set. LOOCV can be an alternative when the dataset is small, since the focus should be on training on as many examples as possible at once.
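A minimal MATLAB sketch of LOOCV using cvpartition and a decision tree (the variable names are assumptions, and numeric class labels are assumed):

    n   = size(X, 1);
    cvp = cvpartition(n, 'LeaveOut');      % one test example per fold
    err = zeros(cvp.NumTestSets, 1);
    for i = 1:cvp.NumTestSets
        tr  = training(cvp, i);            % logical index of the training examples
        te  = test(cvp, i);                % logical index of the single test example
        mdl = fitctree(X(tr, :), y(tr));
        err(i) = predict(mdl, X(te, :)) ~= y(te);
    end
    looError = mean(err);                  % averaged error estimate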

2.5.2 No free lunch theorem

The no free lunch theorem [67] states that there is no classifier that works best for every problem as the assumptions of a great model for one problem may not hold for another problem. Therefore, it is common to build several models and choose the best model for the specified problem.

2.5.3 Hyperparameter optimization

Every classifier needs some hyperparameters to be set. In machine learning, hyperparameters are values that must be specified outside the training procedure. The setting of a hyperparameter can influence the accuracy of the trained model. Optimal hyperparameter settings often differ between datasets, and it is difficult to know beforehand which hyperparameter values are the best. Hsu et al. [34] recommend what is known as grid search as a strategy for finding the best parameters. Grid search is simply an exhaustive search through a manually specified subset of the parameter space of a classifier. Grid search is the safest way to find the best hyperparameters, in contrast to search by approximations or heuristics, which may get stuck in local optima. The biggest weakness of grid search is its computational complexity. The problem quickly becomes intractable when the number of hyperparameters increases, although there are ways to alleviate this weakness. Not all hyperparameters are dependent, which means that the search can be parallelized. Additionally, Hsu et al. [34] found that trying exponentially growing sequences of the hyperparameters of an SVM is a practical method to identify good


parameters, since it tests a wide range of hyperparameter values. When using such a coarse, exponentially growing grid, one may quickly skip past the best region of the grid. To avoid this, one can proceed with an additional, finer grid search on the most promising region of the grid.
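As a minimal sketch, a coarse grid search with exponentially growing sequences for the two RBF-SVM hyperparameters could look as follows in MATLAB. The variable names, the ranges and the use of 5-fold cross-validation to score each setting are assumptions for illustration:

    bestAcc = 0;
    for C = 2.^(-5:2:15)                       % exponentially growing BoxConstraint values
        for gamma = 2.^(-15:2:3)
            sigma = sqrt(1 / (2 * gamma));     % MATLAB's KernelScale parameter
            mdl = fitcsvm(X, y, 'KernelFunction', 'rbf', ...
                          'BoxConstraint', C, 'KernelScale', sigma, ...
                          'CrossVal', 'on', 'KFold', 5);
            acc = 1 - kfoldLoss(mdl);          % cross-validated accuracy
            if acc > bestAcc
                bestAcc = acc; bestC = C; bestSigma = sigma;
            end
        end
    end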

2.5.4 Overfitting

Overfitting occurs if there is not enough training data or if the learning process ends up approximating the noise in the data, as opposed to the underlying true function. If there is not enough training data, the dataset will not capture the underlying function, and hence the classifier will not be able to generalize to new data. If the learning process is carried out too extensively, it might lead the classifier to learn the inherent noise in the training set instead of the true underlying function. This is seen in figure 2.17. Noise has no causal relation to the true function; consequently, the model will not be able to generalize well. As noise is inherent in measurements of any real-world process, overfitting must always be taken into consideration. The cross-validation techniques average the performance over all training and test sets and hence help assess the presence of overfitting.

Figure 2.17: Figure showing two classifiers and a training set consisting of two classes. The smooth line represents the good classifier, as it does not take the noisy data points into account, i.e. the data points that are surrounded by points from the other class. The jagged line represents a classifier subject to overfitting. It tries to maximize the accuracy on the training set, but it will not achieve a good accuracy on new and unseen data points [7].

2.6 Statistics

This section is divided into two subsections. The former subsection introduces the various performance measures used while the latter subsection introduces the test statistic used.


2.6.1 Performance measures

Accuracy Accuracy can be used as both a two-class and a multiclass performance measure. Accuracy is the number of correct predictions divided by the total number of predictions. Care must be taken if one class is by far the most represented one in the dataset, since classifying every example as this class will still yield a high accuracy. In this dataset, however, the number of examples in each class is fairly balanced, with a standard deviation of 2.2.

2-class performance measures Precision (also referred to as positive predictive value) and recall (also referred to as sensitivity or true positive rate) are common 2-class performance measures:

Precision = \frac{TP}{TP + FP}

Recall = \frac{TP}{TP + FN}

Some research papers also include the specificity (also referred to as true negative rate) measure:

Specificity = \frac{TN}{TN + FP}

The four different outcomes of a binary classifier are true positive (TP), false positive (FP), true negative (TN) and false negative (FN). True and false refer to whether the classifier was correct in its prediction, while positive and negative are general names for the two outcome classes. A true positive is therefore the correct (true) classification of an example belonging to the positive class. A machine learning model which predicts cancer or not cancer is used as an example of precision and recall, seen in figure 2.18. Patients predicted to have cancer lie inside the circle. Precision is the fraction of patients predicted to have cancer who also had cancer. Recall is the fraction of patients having cancer who were also predicted as such. Specificity is similar to recall, but with respect to the not-cancer class. Different problems may wish to focus on achieving either a high precision or a high recall. When predicting a disease, as in the example mentioned above and in this project, recall can be viewed as more important than precision: as long as further assessment is done before treatment, it is important to recognize as many patients as possible as having the disease, even though it might come at the cost of assessing patients who were not diseased. Precision and recall are further addressed in the chapter discussing the results.

Multiclass performance measures In the multiclass case, it is interesting to see whether some classes were easier for the classifier to predict than others. Confusion matrices yield such information.



Figure 2.18: Figure showing a classifier predicting cancer. Patients are represented as circles; all patients lying within the circle are predicted to have cancer.

A confusion matrix is a square matrix with one row for each predicted class and one column for each true class (or vice versa). Table 2.1 shows an example of a confusion matrix.

                 Actual C1   Actual C2   Actual C3
Predicted C1        10           2           4
Predicted C2         2           8           4
Predicted C3         0           0           4

Table 2.1: A confusion matrix showing the results of a 3-class classification task. It shows that the classifier got all the instances that belonged to class C1 correct, but misclassified 2 instances of class C2 as C1 and is having difficulties getting the C3 class correct. Naturally, the desirable result is to have low numbers on the off-diagonal.
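As an illustration, the measures above can be computed from predicted and true labels in MATLAB with confusionmat. The variable names ytrue and ypred are assumptions, and the positive class is assumed to be encoded as 1 and the negative class as 0:

    C  = confusionmat(ytrue, ypred);   % rows: true class, columns: predicted class
    TN = C(1,1); FP = C(1,2); FN = C(2,1); TP = C(2,2);
    precision   = TP / (TP + FP);
    recall      = TP / (TP + FN);
    specificity = TN / (TN + FP);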

2.6.2 Significance test

Two-sample Kolmogorov-Smirnov test The two-sample Kolmogorov-Smirnov test (KS-test) tests whether two datasets differ significantly. It has the advantage of making no assumption about the distribution of the data; that is, it is non-parametric and distribution-free. Other tests, like Student's t-test, make stronger assumptions about the distribution of the data, which can make them more sensitive than the KS-test when those assumptions hold. Specifically, the KS-test statistic is

D_{n,n'} = \sup_x | F_{1,n}(x) - F_{2,n'}(x) |    (2.1)

where F_{1,n} and F_{2,n'} are the empirical distribution functions, that is, the distribution functions associated with the empirical measure of the data, and where n and n' are the sample sizes of the two datasets, respectively.


Regarding the null hypothesis (see appendix C.2 for an overview of the null hypothesis and the p-value), it is rejected at level α if

D_{n,n'} > c(\alpha) \sqrt{\frac{n + n'}{n n'}}    (2.2)

where c(α) is a known function of α. Informally, the KS-test computes the empirical cumulative distribution function (CDF) of the two datasets and finds their maximum CDF difference, denoted D. This is seen in figure 2.19.

Figure 2.19: Figure illustrating the two-sample Kolmogorov-Smirnov statistic. The red and blue lines each correspond to an empirical distribution function, and the black arrow denotes the test statistic.
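In MATLAB, the two-sample KS-test is available as kstest2. A minimal sketch, where x1 and x2 are assumed to hold the two sets of measurements:

    % h = 1 means the null hypothesis (same distribution) is rejected at the 5% level
    [h, p, ksstat] = kstest2(x1, x2);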


Chapter 3

Implementation

This chapter intends to explain the various choices that needed to be considered when creating the machine learning models. The first section reasons about the choice of implementation environment and language. Section 3.2 explains the segmentation of the dataset. The features extracted are presented in section 3.3. Section 3.4 explains how the models are trained and tested. The second to last section presents the grid search performed. The final section presents how the machine learning models are created and how the optimal models will be identified.

3.1 Choice of implementation environment and language

Some requirements were taken into account in the choice of implementation environment and language. It is beneficial to use libraries to reduce development time; hence, the range of machine learning libraries available was important. For this reason, MATLAB was chosen. Its Statistics and Machine Learning Toolbox offers a wide variety of classifiers, including the tree classifiers intended to be tested in this project. Furthermore, the toolbox provides a straightforward calculation of all the performance measures. Some feature selection methods are included, but perhaps not sufficiently many. However, several 3rd-party libraries exist, as MATLAB is a very popular choice for performing machine learning tasks. Some care needs to be taken with the 3rd-party libraries, though: no official quality check has necessarily been done, so the code might contain bugs and errors. Another reason for the choice is the fact that vectorized operations can be done effortlessly in MATLAB and will be used extensively since the dataset is represented as a matrix. The wide range of libraries available means that it will be easy to try and test a vast number of models and modifications. Other options were deliberately not used for various reasons. C/C++ is fast but has fewer machine learning libraries. Java has a broad range of machine learning libraries but is not as high-level as Python and MATLAB and would hence require more time to create the models. One of the most popular alternatives to MATLAB is Python combined with


scikit-learn. Scikit-learn is an open source library containing a vast number of classification algorithms. The differences between the two alternatives are limited: both are reasonably fast, contain ready-to-use machine learning libraries and are easy to use. Hence, the decision of which to use was not a vital one.

3.2 Motion data segmentation

This section will first describe the dataset at hand. Next, it will present how the data is segmented into sequences and used as the basis for creating feature vectors. The word cycle is defined, and it is reasoned why it is used as a sequence of motion data. Lastly, graphs of the data using the described sequences are presented.

3.2.1 The dataset

The dataset is the crucial part of the model, and its quality and size are decisive for the quality of the model. The dataset was created as a result of a research project at Oslo University Hospital, Rikshospitalet, Oslo, Norway. 20 pigs underwent five interventions which simulate the different heart functions. Measurements of the motion were done for ten seconds at a time during the interventions. Note that not all 20 subjects underwent all five interventions. The number of subjects in each cardiac function/class is summarized in table 3.1.

Class           Number of animals
Baseline        20
Adrenaline      17
Esmolol         15
Ischemia        20
Fluid loading   17

Table 3.1: Overview of how many subjects there are for each cardiac function. The various cardiac functions are described in section 2.2.2.

An example of the data available from one animal during a cardiac dysfunction is shown in figure 3.1. Note that the velocity and position (displacement) in the second and third graph from the top are derived by integrating and double integrating the acceleration. The bottom graph shows the ECG, where the R-peaks have been marked with red crosses. Section 2.1.4 presented an overview of the ECG and its R-peak. The ECG is synchronized with the motion data. Consequently, the location of the R-peaks in the motion data is known.



Figure 3.1: Figure showing the motion data collected from one animal during a cardiac function over a period of 10 seconds. The timeline is shown on the x-axis. The upper graph shows the acceleration while the next two are the velocity and displacement, marked in the top right corner of each graph. The acceleration, velocity, and displacement are shown as the calculated resultant of all three directions. The bottom graph is the ECG, where each R-peak is shown as a red cross. The graph turns green during the first 150 ms after the R-peak, which is also indicated in the motion data. The reason for showing the first 150 ms after the R-peak in green is given in section 3.3.



3.2.2 Segmentation

A cycle is the motion data from one complete cardiac cycle, and is either the acceleration, velocity or displacement in either the circumferential, radial or longitudinal direction. Formally, a cardiac cycle starts when the sinoatrial node fires, which is manifested through the P-wave on the ECG, seen in figure 2.4. However, since the location of the R-peak in the cycle is known, a cycle is here defined to last from one R-peak to the next. Cycles have been extracted and used as the basis for creating feature vectors. This step is shown in figure 3.13 in the summary in the final section of the chapter. There are several reasons for using cycles as a basis for creating feature vectors:

• Every animal underwent each intervention once, for approximately ten seconds each time. Since the sample rate is either 500 Hz or 250 Hz, the motion data would yield 5000 or 2500 samples per intervention, respectively. The curse of dimensionality, mentioned in section 2.4, would be imminent if such a substantial number of samples were used directly as the feature vectors.

• Figure 3.1 reveals that the variation from one cycle to the next during an intervention is almost nonexistent, meaning that the pulse is approximately constant. Therefore, the loss of information is limited when investigating a single cycle instead of all cycles from the ten-second reading.

• The same strategy has been applied with success on a similar problem before. Holst et al. [33] successfully classified the three different techniques of skate skiing based on an accelerometer placed on the chest of several skiers. The extensive length of the time series collected and the repetitive nature of the techniques made cycle extraction natural. This project also deals with extensive time series which exhibit a repetitive behavior (the heartbeats), hence the two problems are similar.

• As each animal's intervention is measured for ten seconds, cycle extraction gives approximately 12 cycles, depending on the subject and the cardiac function. Instead of having one example per animal per function, cycle extraction gives approximately 12 examples/inputs per animal per function, which is a great benefit. It is important to note that the extra cycles would optimally have come from different animals. Still, the increased dataset will help to alleviate the curse of dimensionality.

• The access to the synchronized ECG, which shows where the R-peaks occur in the motion data, makes cycle extraction trivial and accurate.

A common alternative approach is the use of sliding windows. Using a sliding window technique where the step size is half the number of samples


in the cycle would perhaps find cycles/windows that could discriminate better. However, the next section will show that moving from R-peak to R-peak has been used in previous research with success.
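A minimal MATLAB sketch of the cycle extraction, where acc is assumed to hold a ten-second motion signal and rPeaks the synchronized R-peak sample indices:

    nCycles = numel(rPeaks) - 1;
    cycles  = cell(nCycles, 1);
    for i = 1:nCycles
        % one complete heartbeat: from one R-peak up to (but not including) the next
        cycles{i} = acc(rPeaks(i):rPeaks(i+1) - 1);
    end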

3.2.3 Achieving a fixed length of the sequences

The cycles have been interpolated and decimated in order to achieve a constant length of 300 samples. This is because the cycles will be used as feature vectors, and thus need to have the same dimensionality. The choice of 300 was not obvious, since the sampling rate was either 500 Hz or 250 Hz. Since the measurements were always done for ten seconds, the number of samples used to represent a full cycle was highly dependent on the sampling rate. The average number of samples used to represent a cycle was 323 and 141, respectively, depending on the sampling rate. Since 80% of the animals had a sampling rate of 500 Hz, the constant number was chosen close to 323, namely 300. A weakness of this choice is that the interpolation of the measurements with a 250 Hz sampling rate has to approximately double the number of samples. The technique for representing every cycle with 300 samples is the following: when a cycle consists of fewer than 300 samples, MATLAB's interp1 is used, which performs linear interpolation. When there are more than 300 samples, the procedure shown in figure 3.2 is used.



[Figure 3.2 flowchart: a cycle represented with more than 300 samples is processed as follows: L = ceil(length(cycle)/300); interp the cycle with factor L, giving signal1; M = ceil(length(signal1)/300); decimate signal1 with factor M, giving signal2; interp1 signal2, giving a cycle represented with exactly 300 samples.]

Figure 3.2: Figure showing how a cycle that is represented with more than 300 samples is interpolated and decimated in order to be represented by exactly 300 samples. The decimate, interp and interp1 routines are all MATLAB functions. Interp1 is explained above. Interp inserts 0s into the original signal and then applies a lowpass interpolating filter to the expanded sequence. The decimate function lowpass filters the input to guard against aliasing and downsamples the result. The ceil function rounds upwards to the closest integer. Length returns the number of samples of a signal (cycle). Interpolation is done before decimation in order to avoid removing original samples.
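A minimal MATLAB sketch of this resampling procedure; the variable name cycle and the surrounding control flow are assumptions, and the else-branch covers cycles with fewer than 300 samples, which are handled by linear interpolation alone:

    target = 300;
    if length(cycle) > target
        L = ceil(length(cycle) / target);
        signal1 = interp(cycle, L);          % upsample by the integer factor L
        M = ceil(length(signal1) / target);
        signal2 = decimate(signal1, M);      % lowpass filter and downsample by M
    else
        signal2 = cycle;
    end
    % final linear interpolation onto exactly 300 evenly spaced points
    cycle300 = interp1(linspace(0, 1, length(signal2)), signal2, linspace(0, 1, target));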



3.2.4 Graphs of the motion sequences

The following graphs intend to give further insight into the motion data; hence, the average acceleration and velocity over all cycles for each output class are shown. Figures 3.3 and 3.4 show the acceleration and velocity in the circumferential direction. The acceleration exhibits more diversity than the velocity. Regarding the acceleration, adrenaline and baseline are distinctive throughout the full cycle. The other classes show some diversity at the beginning of the cycle, meaning immediately after the R-peak/mid-systole, and some diversity towards the end of the cycle. The velocity of the occlusion shows some diversity, as does adrenaline. The next two figures, figures 3.5 and 3.6, show the acceleration and velocity in the longitudinal direction. The only apparent diversity is in the acceleration at the beginning of the cycle. Lastly, the average radial acceleration and velocity are shown in figures 3.7 and 3.8. Here, it is evident in both figures that the beginning of the cycle and the beginning of the last half show diversity, especially for adrenaline. It is worth repeating that it is the average of the motion for each class that is shown. Some of the measurements can be greatly influenced by noise and hence shift the average further away. Noisy signals imply a high variance, and if the variance of the motion is high, the information gained from the figures is rather low. Thus, the standard deviation and the number of cycles are shown in table 3.2. The standard deviation of the adrenaline class is considerably higher than for the rest. This is also expected, as this class increases the heart rate and the force of the contraction.

Class           Number of cycles   Standard deviation
Baseline        295                2.18
Adrenaline      413                5.64
Esmolol         222                2.35
Ischemia        308                2.25
Fluid loading   255                2.34

Table 3.2: Table showing the number of cycles and the standard deviation for each class.



Figure 3.3: Figure showing the average acceleration of all cycles in the circumferential direction.

Figure 3.4: Figure showing the average velocity of all cycles in the circumferential direction.



Figure 3.5: Figure showing the average acceleration of all cycles in the longitudinal direction.

Figure 3.6: Figure showing the average velocity of all cycles in the longitudinal direction.



Figure 3.7: Figure showing the average acceleration of all cycles in the radial direction.

Figure 3.8: Figure showing the average velocity of all cycles in the radial direction.



3.3 Feature extraction

The choice of features is a vital decision; a good classifier depends on good features. As mentioned in section 2.4, the main approach will be to create features based on domain knowledge, and this approach is therefore described first. A secondary approach is to combine domain knowledge features and let automatic feature selection methods guide the choice of features in the right direction. The following paragraphs will elaborate on and reason about the five feature sets created in this project. The first three are based on domain knowledge, while the final two make use of feature selection methods.

3.3.1 Domain knowledge

These feature sets make use of the previous research done at The Intervention Centre, Oslo, summarized in section 2.2.3.

Feature set one - mid-systole Halvorsen et al. [30] studied the effects of esmolol, infusion and ischemia. When ischemia is present, it was stated that there were “significant decreases in accelerometer measurements of midsystolic velocity. (...) The greatest change in the accelerometer measurements, however, was found in midsystolic displacement, which demonstrated negative values”. This can be seen in figure 3.9. Regarding esmolol, Halvorsen et al. stated that “automated accelerometer measurements, midsystolic velocity (...) and midsystolic displacement all decreased, but reductions were smaller than seen with LAD occlusion (ischemia).”

Figure 3.9: Figure showing velocity and displacement in the circumferential direction during baseline and ischemia. Thick lines indicate the middle of systole, as approximated by Halvorsen et al. [30].



In particular, the middle of systole, mentioned in section 2.1.2, shows the best diversity between these heart functions (baseline, esmolol and ischemia). Additionally, the direction showing the best diversity is the circumferential direction. Similar results were found by Grymyr et al. [23], stating that “circumferential peak mid-systole demonstrated high sensitivity and specificity for detecting regional myocardial ischemia.” This is also seen in figure 3.10, where epinephrine (adrenaline), fluid (fluid loading) and esmolol are included as well. Epinephrine, fluid and esmolol are marked in the figure as global interventions, as they do not target a distinct region of the heart, while ischemia is simulated through an occlusion of a specific blood vessel of the heart. All dysfunctions are described in section 2.2.2.

Figure 3.10: Figure showing average peak midsystolic velocity, denoted Vsys. The bottom figure is from an accelerometer mounted in the basal region of the heart (close to the top), but these measurements were discarded. The top figure is from an accelerometer mounted on the apical region (close to the bottom). Vsys is shown during all four interventions (LAD occlusion is ischemia) in addition to baseline (healthy). ± standard deviation is also shown [23].

Since the effects of ischemia and esmolol are more prominent in the middle of systole and in the circumferential direction, interesting features are (a code sketch of how these can be computed is given below, after feature set two):

• Peak acceleration within the first 150 ms from the R-peak

• Peak velocity within the first 150 ms from the R-peak


• Average acceleration of the first 150 ms from the R-peak

• Average velocity of the first 150 ms from the R-peak

• Difference in displacement from the R-peak to 150 ms after the R-peak

• Minimum acceleration within the first 150 ms from the R-peak

The middle of systole is approximated by Halvorsen et al. [30] to start at the R-peak (see 2.1.4) and last for 150 ms. Notice in figure 3.10 that the peak midsystolic velocity in the circumferential direction, from high to low, goes: epinephrine (adrenaline), fluid (fluid loading), baseline, esmolol, and LAD occlusion (ischemia). The velocity in baseline, fluid and esmolol shows little to no difference between them; hence not all classes are expected to achieve a great accuracy using the aforementioned features.

Feature set two - IVR-phase This drawback can be alleviated by including the isovolumetric relaxation (IVR) phase, described in section 2.1.2. According to Halvorsen et al. [28], increases in accelerometer velocities in the isovolumetric relaxation (IVR) phase were observed in all directions during ischemia. A significant change, however, was only found in the circumferential direction. This can be seen in figure 3.11. It would therefore be interesting to look at the six features listed for feature set one, but in the IVR phase:

• Peak acceleration in the IVR phase

• Peak velocity in the IVR phase

• Average acceleration in the IVR phase

• Average velocity in the IVR phase

• Difference in displacement from the R-peak to the middle of the IVR phase

• Minimum acceleration in the IVR phase

The IVR phase is approximated to start 290 ms after the R-peak and last 100 ms. The approximation of the IVR phase is somewhat vague, which is a weakness of these features. Looking at figure 3.10, the circumferential direction is not the only direction showing diversity among the classes. Both the longitudinal and radial directions show some diversity between dysfunctions. It would therefore be interesting to investigate these directions in the same manner as the circumferential direction, that is, feature sets one and two applied to the longitudinal and radial directions. Another reason for inspecting all directions is that it is of interest to see which direction and phase (midsystolic or IVR phase) is most prominent.
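The following MATLAB sketch shows how the six mid-systolic features could be computed from one cycle. The variable names (accC, velC and dispC for the circumferential acceleration, velocity and displacement, and fs for the sampling rate in Hz) are assumptions; the IVR features would be computed analogously over the 290-390 ms window.

    nSys = round(0.150 * fs);                % the first 150 ms after the R-peak
    idx  = 1:min(nSys, length(accC));        % the cycle is assumed to start at the R-peak
    f1 = max(accC(idx));                     % peak mid-systolic acceleration
    f2 = max(velC(idx));                     % peak mid-systolic velocity
    f3 = mean(accC(idx));                    % average acceleration
    f4 = mean(velC(idx));                    % average velocity
    f5 = dispC(idx(end)) - dispC(1);         % difference in displacement over the phase
    f6 = min(accC(idx));                     % minimum acceleration
    midsystolicFeatures = [f1 f2 f3 f4 f5 f6];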


Figure 3.11: Representative left ventricular curves from a single patient of accelerometer circumferential velocity and acceleration during baseline and preconditioning LAD occlusion (ischemia) [28].

That is, which direction and phase change most when a dysfunction is present. It is already hypothesized that the circumferential direction changes most, although it would be interesting to see if the model built concludes likewise.

Feature set three - Frequency It would also be interesting to investigate features regarding the frequency content of the cycles. In the adrenaline intervention, the heart rate will increase, while the opposite happens with esmolol, fluid loading and ischemia, though to varying degrees. Consequently, one can expect some characteristic frequency pattern in the classes. The created features are the following (a code sketch is given below, after table 3.3):

• The 3 highest frequency components of the acceleration in the circumferential direction. The adrenaline dysfunction is expected to have higher frequency components than the rest. These features were also used by Preece et al. [50] when doing activity classification using an accelerometer. It was concluded that these features were the best among four other time- and frequency-domain features and five other wavelet features.

• Elle et al. [19] proposed using a difference value based on the FFT frequency pattern because of the difference in the frequency domain between baseline and ischemia. It would therefore be interesting to accumulate the magnitude of the frequency components in the circumferential direction.


• Number of samples used to represent a cycle. The number of samples used to represent a cycle is highly dependent on the dysfunction, since the dysfunction influences the heart rate. The average number of samples used to represent a cycle was calculated and can be seen in table 3.3. The number of samples was scaled into the [0, 1] range since the sampling rate was either 250 Hz or 500 Hz.

Class           Average
Baseline        .52
Adrenaline      .10
Esmolol         .58
Ischemia        .47
Fluid loading   .52

Table 3.3: Table showing the average (scaled) number of samples used to represent a cycle for each cardiac dysfunction. The results were scaled into the range [0, 1].
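A minimal MATLAB sketch of the frequency features; the variable names accC and fs are assumptions, and the “3 highest frequency components” are interpreted here as the three frequencies with the largest spectral magnitude:

    Y = abs(fft(accC));
    Y = Y(1:floor(numel(Y)/2));                  % single-sided magnitude spectrum
    f = (0:numel(Y)-1) * fs / numel(accC);       % corresponding frequency axis in Hz
    [~, order] = sort(Y, 'descend');
    top3Freqs = f(order(1:3));                   % the 3 strongest frequency components
    accumulatedMagnitude = sum(Y);               % accumulated magnitude of the spectrum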

3.3.2 Machine learning knowledge

Three feature sets using domain knowledge have been introduced. The next two paragraphs will introduce feature sets where automatic feature selection is invoked in order to extract the best subset of features. Feature selection enables the creation of a substantial number of features, as the methods will discard those they deem unhelpful to the classification.

Feature set four - Feature set 1, 2, 3 The best features might come from different feature sets. All previously mentioned feature sets will therefore be collected in a single set. Three of these sets will be created, one for each direction. For each of these three sets, feature selection methods will be used to extract the most promising features. Initially, CFS feature selection, mentioned in appendix C.1, will be tested.

Feature set five - Raw data A different approach is to use a feature selection method on the raw data. It has already been reasoned that the IVR phase and the midsystolic phase show good diversity between the classes. Instead of manually approximating discriminating phases of a cycle, feature selection methods can automatically find discriminative samples which can be used as features. Initially, the acceleration and velocity will be used. The displacement will also be tested if the feature set shows good results.

3.4 Training and testing

This section will introduce and reason about how the models are trained and tested.


A reasonable assumption is that the cycles of the subjects the model is trying to predict have not been previously seen and labeled. This assumption significantly influences the choice of model validation strategy. A straightforward cross-validation strategy with random initialization of training and testing folds/cycles would violate the assumption, as cycles from the same animal could be included in both the training and test set. A more suitable strategy is a variant of the leave-one-out cross-validation approach introduced in section 2.5.1. Here, all cycles from one subject are used as the test set, while the rest are used as the training set. In the next iteration, the cycles from a new subject are chosen as the test set, while the model trains on the rest. The process repeats until every subject has been used in the test set. The performance measures are averaged over all iterations. A weakness of this strategy is that the test set will only contain examples from a single animal. Additionally, it is important that every class is represented in the test sets. Since some subjects did not undergo all interventions, this will limit the number of test sets. Still, there are some reasons why this is the preferred strategy:

• If the measurements from one animal are greatly influenced by noise, this is easy to spot by investigating the error produced for each test set.

• The dataset is very sparse. Preferably, both the training set and the test set should have been larger. When having to deal with a sparse dataset, a cross-validation approach is usually preferred over a static training and testing split.

• A cross-validation approach makes it straightforward to get an insight into how the model generalizes.

LOOCV is a special case of k-fold cross-validation, where k is equal to the number of examples. One can also use a slightly lower k. The most important aspect is not to violate the assumption, and secondly, to use a suitable ratio between the test size and the training size. Since the dataset is small, it is important to train on as many examples as possible, hence a high k is preferred. Ideally, a part of the dataset should have been held out from both the training and testing process and instead used to produce a final generalization error. Because of the sparse dataset, this has been omitted, and the performance is averaged over the test sets.
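A minimal MATLAB sketch of this leave-one-animal-out strategy. The variable names are assumptions (X holds one feature vector per cycle, y the class labels, and animalId the subject each cycle was measured from), as is the use of accuracy as the reported measure:

    subjects = unique(animalId);
    acc = zeros(numel(subjects), 1);
    for s = 1:numel(subjects)
        te = (animalId == subjects(s));     % all cycles from one animal form the test set
        tr = ~te;                           % all other cycles form the training set
        mdl  = fitctree(X(tr, :), y(tr));
        yhat = predict(mdl, X(te, :));
        acc(s) = mean(yhat == y(te));       % assuming numeric class labels
    end
    meanAccuracy = mean(acc);               % averaged over all test sets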

3.5 Choice of grid search

This section will present the grid search performed on each classifier. Grid search and hyperparameter optimization were introduced in section 2.5.3. The grid search will be coarse, as reasoned in the same section. More specifically, the series is chosen to grow exponentially. A finer grid search will be done on the most promising regions.



Note that all classifiers used are from MATLAB's Statistics and Machine Learning Toolbox. Secondly, note that any hyperparameters not mentioned are set to MATLAB's default values.

Decision tree Table 3.4 shows the hyperparameters used in the grid search for the decision tree. The method's name is fitctree. MaxNumSplits is the constraint on how deep the tree can be, i.e. how many decisions can be taken before a prediction must take place. It is chosen to stop at 126 in order to limit the risk of overfitting. The SplitCriterion defines which node to create next. The ones tried are MATLAB's default options, all described in appendix A.

MATLAB name      Values
SplitCriterion   gdi, deviance, twoing
MaxNumSplits     1, 2, 4, 8, 64, 126

Table 3.4: Table showing the hyperparameters used in the grid search for decision tree.

AdaBoost Table 3.5 shows the hyperparameters used in the grid search for AdaBoost. The method's name is fitensemble. NLearn is the number of ensemble learning cycles. The search for the NLearn hyperparameter was limited to four values because of the computational complexity, and it was decided to stop at 200 to limit the risk of overfitting. Ideally, the learning rate, mentioned in section 2.3.2, would also have been searched for, but this was not feasible due to the computational complexity of AdaBoost. MATLAB states that 0.1 is a popular choice; hence, it is used here.

MATLAB name   Values
NLearn        25, 50, 100, 200

Table 3.5: Table showing the hyperparameters used in the grid search for AdaBoost.

Support vector machine Table 3.6 shows the hyperparameters used in the grid search for SVM. The method's name is fitcsvm. As mentioned in the previous chapter regarding the SVM, only the RBF kernel is tested in order for the grid search not to be too computationally expensive. The specification of the RBF kernel can be seen in appendix B. BoxConstraint controls the maximum penalty imposed on margin-violating examples, and aids in preventing overfitting. The kernel scale is the sigma value of the Gaussian kernel. The same grid search for the sigma values was used with success by Hsu et al., although γ instead of σ was used. Their relationship is σ = \sqrt{\frac{1}{2γ}}. Note that the two parameters are independent; hence the KernelScale was fixed while the best BoxConstraint was searched for, and vice versa.

MATLAB name     Values
BoxConstraint   2^{-5}, 2^{-3}, ..., 2^{15}
Kernel scale    \sqrt{\frac{1}{2γ}} for γ = 2^{-15}, 2^{-13}, ..., 2^{3}

Table 3.6: Table showing the hyperparameters used in the grid search for SVM.

3.6 Approach

This section will present the approach taken in order to identify the optimal machine learning model, namely the investigation of a superior feature set and classifier. Finding either of these at an early stage will help to ease the computational complexity. The approach is shown in figure 3.12. For every classifier and for every feature set, the steps in figure 3.13 will be taken. In more detail, the steps are:

1. For every classifier: decision tree, AdaBoost and SVM, all mentioned in section 2.3.2.

2. For every feature set: feature set one to five, all mentioned in section 3.3.

3. For all combinations of two classes: baseline - adrenaline, baseline - esmolol, etc.

(a) Produce a result: that is, proceed through the steps in figure 3.13, where the feature set, the classifier, and the raw data have all been specified in the above steps.

The investigation will be conducted using two dysfunctions at a time. This is the easiest classification task and is expected to yield the best results. With five dysfunctions and two dysfunctions at a time, this will in total yield ten different combinations. Having a vast number of results makes any optimal classifier and feature set, if any, evident. Consequently, the underachieving alternatives can be discarded. Additionally, the graphs in section 3.2.4 show the overlap between the dysfunctions. Hence, the 5-class classification task is expected to be challenging. The focus is therefore on the easier two-class classification task. Furthermore, some combinations are clinically more important than others. This is because certain cardiac functions are expected during or after surgery. The interesting combinations are the ones involving baseline versus the others, as baseline is either the initial or the desirable heart function. Also, esmolol versus ischemia is clinically interesting because patients can be given esmolol to decrease their heart rate, but too much esmolol would cause the heart to become ischemic.


A two-class machine learning model trained on esmolol and ischemia examples can assess whether the heart is in the “esmolol state” or is ischemic. In addition, an assessment of whether or not feature scaling aids the classification will be done. The training dataset will be standardized, meaning it will have zero mean and unit variance. The mean and standard deviation of the training dataset will then be applied to the test set in the following manner:

x = \frac{x - \mu}{\sigma}    (3.1)

where µ and σ are the mean and standard deviation, respectively. Having possibly identified the optimal classifier and feature set, the process in figure 3.12 will be left. The next step will be to experiment with the best models in an attempt to improve them further. Lastly, classes will be added to the best machine learning models. The results are expected to drop for each addition of a class, as the feature spaces of the classes are somewhat overlapping. This is seen in section 3.2.4.



[Figure 3.12 flowchart: Start → for all classifiers (decision tree / AdaBoost / SVM) → for all feature sets (feature sets 1, 2, ..., 5) → for all combinations of 2 classes (baseline and adrenaline / baseline and esmolol / ischemia and volume / ...) → produce result as shown in fig. 3.13 → Stop.]

Figure 3.12: Figure showing the 2-dysfunction classification task which aims to find the best feature set and classifier. The diamond boxes represent loops, while the produce-result block is specified in figure 3.13. The parallelograms represent input and output.



[Figure 3.13 flowchart: Raw data → extract cycles → extract features → feature selection (sometimes skipped) → loop: train/test split → training set → train classifier → model → test set → test classifier → partial result → average results → final result.]

Figure 3.13: Figure showing the steps in creating a model and achieving a result. Rectangles represent a process that takes an input and produces an output. The dashed rectangle is a process that is sometimes skipped. White ellipses represent a result from a process. The parallelograms represent input and output.


Details of produce result

1. Raw data: the dataset as it was when received/obtained, described in section 3.2.1.

2. Extract cycles: this step extracts cycles from the raw data, as described and reasoned about in section 3.2.2.

3. Extract features: this step creates one of the feature sets mentioned in section 3.3.

4. Feature scaling: this step takes all features as input and standardizes them, as specified previously in this section and reasoned about in section 2.4.3.

5. Feature selection: feature selection chooses the best subset of features. It is only applied to feature sets four and five.

6. Loop: the following steps are repeated:

(a) Train and test split: here, the dataset is split into a training set and a test set, as described in section 3.4. That is, the heartbeats from one animal that has not previously been used as a test set in the current realization are used as such, and the rest are used as the training set.

(b) Train classifier: this step takes the training set as input and trains a classifier.

(c) Model: the training phase outputs the final model, that is, a model that can now make predictions on new motion data.

(d) Test classifier: this step takes the model and the test set as inputs and uses the model to predict the classes in the test set.

(e) Partial result: the true dysfunction is known; hence the result from the previous step is stored. If every animal has been used in the test set, the next step is number seven. If not, the loop starts over again from step (a).

7. Average result: all partial results stored in the step above are averaged to produce a final result, which is also the output.

8. Final result: the final result consists of several measures, specified in section 2.6.1.


Chapter 4

Experiments and results

This chapter intends to present the results of all the models described in the previous chapter. The results of the two-class classification task will be investigated first, in section 4.1. The best results will be used to further improve the best models, as described in section 4.2. The results of the best performing models will then be presented in section 4.3. These models will be used when adding classes to the classification task, starting from section 4.5.

4.1 Two-class experiments and results

Table 4.1 shows the results achieved when carrying out the approach shown in figures 3.12 and 3.13.


Feature set                              Classifier      Direction   Accuracy (not scaled)   Accuracy (scaled)
Feature set one (mid-systole)            Decision tree   long.       0.67                    0.67
                                         Decision tree   circ.       0.73                    0.73
                                         Decision tree   radi.       0.66                    0.66
                                         SVM             long.       0.54                    0.66
                                         SVM             circ.       0.57                    0.61
                                         SVM             radi.       0.61                    0.64
                                         AdaBoost        long.       0.70                    0.70
                                         AdaBoost        circ.       0.73                    0.73
                                         AdaBoost        radi.       0.65                    0.65
Feature set two (IVR-phase)              Decision tree   long.       0.58                    0.58
                                         Decision tree   circ.       0.64                    0.64
                                         Decision tree   radi.       0.61                    0.61
                                         SVM             long.       0.54                    0.56
                                         SVM             circ.       0.59                    0.64
                                         SVM             radi.       0.63                    0.59
                                         AdaBoost        long.       0.60                    0.60
                                         AdaBoost        circ.       0.65                    0.65
                                         AdaBoost        radi.       0.61                    0.61
Feature set three (frequency)            Decision tree   circ.       0.66                    0.66
                                         SVM             circ.       0.54                    0.62
                                         AdaBoost        circ.       0.62                    0.62
Feature set four (1, 2, 3 + CFS sel.)    Decision tree   long.       0.68                    0.68
                                         Decision tree   circ.       0.76                    0.76  (*)
                                         Decision tree   radi.       0.69                    0.69
                                         SVM             long.       0.59                    0.64
                                         SVM             circ.       0.61                    0.72
                                         SVM             radi.       0.56                    0.68
                                         AdaBoost        long.       0.70                    0.70
                                         AdaBoost        circ.       0.77                    0.77  (*)
                                         AdaBoost        radi.       0.71                    0.71
Feature set five (raw data + CFS sel.)   Decision tree   long.       0.58                    0.58
                                         Decision tree   circ.       0.75                    0.75
                                         Decision tree   radi.       0.70                    0.70
                                         SVM             long.       0.54                    0.54
                                         SVM             circ.       0.54                    0.54
                                         SVM             radi.       0.54                    0.54
                                         AdaBoost        long.       0.59                    0.59
                                         AdaBoost        circ.       0.78                    0.78  (*)
                                         AdaBoost        radi.       0.72                    0.72

Table 4.1: Table showing the results after following the approach shown in figures 3.12 and 3.13. The three best results are marked with (*). The various feature sets are specified in section 3.3.



Note that the accuracy is the average accuracy over all combinations of two classes (baseline versus adrenaline, baseline versus esmolol, etc.). All experiments were done twice, both with and without feature scaling. The three best results are marked with (*). The results form the basis for what to pursue next.

4.1.1 Analysis

Initially, it is seen that almost all models perform better than chance (50%). This implies that there is a difference in motion which, to some degree, can be discriminated by certain machine learning models. Though most of the models performed only slightly better than chance, there are some models that should be investigated further. How the models are improved is described in section 4.2. An analysis of the initial results is given in the paragraphs below.

Regarding feature scaling, it had an effect only on the SVM classifier. The decision tree, and AdaBoost, which uses decision trees as its weak learners, do not, in contrast to the SVM, rely on the distance between feature vectors. Instead, they rely on the intrinsic values of the features; hence feature scaling did not affect their results and will be discarded in future experiments.

Regarding the three different directions, the circumferential direction performed better than the longitudinal direction (P < 0.05), but was not significantly better than the radial direction (P = 0.066). However, in order to limit the number of models created in further experiments, features in the longitudinal and radial directions will still be discarded. Furthermore, the three models with the highest accuracy all used feature selection. This shows that feature selection aids the models. Feature selection will thus be inspected further. It is worth noticing that feature set one, which uses features from the midsystolic phase, was amongst the highest performing feature sets. This supports the cited articles in sections 3.3.1 and 2.2.3, which favor the motion in the same phase.

There were some differences in performance when comparing the classifiers. The mean accuracies of decision tree, SVM and AdaBoost were 0.66, 0.61 and 0.63, respectively. Both decision tree and AdaBoost performed significantly better than SVM (P < 0.005 for both). Further experiments will therefore not use SVM as the classifier. The results using AdaBoost and decision tree are similar (P = 0.83), but AdaBoost is considerably more computationally expensive due to its iterative behavior. Decision tree will therefore be used in further experiments, and AdaBoost will be tested in the final experiments on only the best models. Experiments were not conducted with features from different directions or


the raw position data. This is further addressed in section 4.2.3.

4.2 Further improving the best two-class models

Before combinations of three-, four- and five classes are considered, further experiments will be devoted to see if the results from the models created in the previous section can be further increased. Section 4.2.1 will specify the experiments conducted. The following section, section 4.2.2 will show the results of the experiments, before an analysis of the results is given in section 4.2.3.

4.2.1 Experiments

The following six experiments are largely based on the best performing models, shown in table 4.1. The experiments are described in detail in their respective paragraphs. The aim of the first two experiments is to see if Relieff can outperform CFS as a feature selection method. Relieff does not output the optimal number of features; a scheme for doing so is described in section 2.4.4.

First experiment Use feature set four, but replace CFS with Relieff as the feature selection method.

Second experiment Use feature set five, but replace CFS with Relieff as the feature selection method.

Third experiment Feature set four, the concatenation of feature sets one to three followed by feature selection, gave the second best results. Adding extra features might therefore aid the model. The following features will be added to feature set four:

• Variance of the circumferential acceleration in the midsystolic phase. This feature is expected to discriminate between the classes because of the differences in heart rate and force of contraction.

• The correlation coefficients calculated between the longitudinal, circumferential and radial directions in the midsystolic phase. Preece et al. [50] used these features with success for human activity classification using an accelerometer. Note that the correlation coefficients are given as a 2x2 matrix. In order to be used as features, the matrix is reshaped into a 4x1 vector.

The original coarse grid search will be performed, as seen in table 3.4.


Fourth experiment Involving feature selection aids the model, hence adding all features before executing feature selection might further improve the results. More specifically, all features mentioned in chapter 3 will be appended into a set and tested twice. Once using Relieff and once using CFS. Note that a summary of all features is given after the results of these experiments is shown. Also note that if the previous experiment, which added even more features to feature set four, improves the results, this experiment will include these features. It was mentioned in section 3.5 that a finer grid search would be done over the most promising region of hyperparameters on the most promising models. This is performed in the two best models, and is specified in the two upcoming paragraphs.

Fifth experiment The learning rate hyperparameter, named LearnRate in MATLB and used by AdaBoost, were not searched for in the grid search but instead fixed to 0.1. This hyperparameter will be searched for on model achieving the best result using AdaBoost. The full grid search is shown in table 4.2. Name NLearn LearnRate

Values 25, 50, 100, 200 0.1, 0.2, ...., 1

Table 4.2: Table showing the hyperparameters used on the most promising models which uses AdaBoost as its classifier.

Sixth experiment The initial grid search for hyperparameters for the decision tree classifier is seen in table 3.4. That search leaves a relatively large gap between 8 and 64 in MaxNumSplits. The best performing model using decision tree as its classifier will therefore search over this interval. The complete description of the finer grid search is seen in table 4.3.

Name            Values
SplitCriterion  gini index, deviance, twoing
MaxNumSplits    10, 15, 20, ..., 60

Table 4.3: Table showing the hyperparameters used in the finer grid search for decision tree.
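As an illustration only (not the thesis code), the finer grid search of table 4.2 could be implemented in MATLAB roughly as follows. Xtrain, ytrain, Xval and yval are assumed to be prepared feature matrices and numeric label vectors; for brevity a single validation split is used here, whereas the thesis evaluates every candidate with leave-one-animal-out testing.

    bestAcc = 0;
    for nLearn = [25 50 100 200]
        for learnRate = 0.1:0.1:1
            % AdaBoost with decision trees as weak learners (MATLAB's fitensemble)
            mdl = fitensemble(Xtrain, ytrain, 'AdaBoostM1', nLearn, 'Tree', ...
                              'LearnRate', learnRate);
            acc = mean(predict(mdl, Xval) == yval);   % assumes numeric labels
            if acc > bestAcc
                bestAcc  = acc;
                bestPars = struct('NLearn', nLearn, 'LearnRate', learnRate);
            end
        end
    end
    % The analogous search for the decision tree (table 4.3) loops over
    % MaxNumSplits and SplitCriterion using fitctree, e.g.
    % fitctree(Xtrain, ytrain, 'MaxNumSplits', 20, 'SplitCriterion', 'deviance').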

4.2.2 Results

A summary of the results of the previously described six experiments is shown in table 4.4.


Exp. nr  Feature set     Classifier     Feature selection  Old result  New result
1        4               Decision tree  Relieff            0.76        0.78
2        5               Decision tree  Relieff            0.75        0.74
3        Extended set 4  Decision tree  CFS                0.76        0.76
4        All features    AdaBoost       CFS                ~           0.81
4        All features    AdaBoost       Relieff            ~           0.80
4        All features    Decision tree  CFS                ~           0.78
4        All features    Decision tree  Relieff            ~           0.78
5        All features    AdaBoost       CFS                0.81        0.84
6        All features    Decision tree  Relieff            0.78        0.78

Table 4.4: Table showing the results of the experiments specified in section 4.2. The best result is the 0.84 obtained in experiment five. Experiments five and six are based on the AdaBoost/CFS and decision tree/Relieff results from experiment four, respectively.

4.2.3 Analysis

Experiments one and two from table 4.4 indicate that Relieff can outperform CFS as a feature selection method. Both of them will therefore be tested when performing further experiments. Experiment three, which added new features to feature set four, did not see an improvement. Regardless, the new features will be permanently added to feature set four, as they might be deemed useful in further experiments. The drawback of adding features is limited, as both feature selection methods evidently filter away any redundant features.

Experiment four, which appended all features created, gave good results. Amongst the combinations of classifier and feature selection, AdaBoost with CFS feature selection gave the best result yet. This combination is tested with a finer grid search in experiment five. Regarding decision tree, the two feature selection methods gave the same results. The combination of decision tree and Relieff is tested with a finer grid search, since CFS is already chosen in experiment five. This is seen in experiment six.

Experiments five and six performed a finer grid search over the two most promising models in order to further increase the results. The finer grid search further increased the result of AdaBoost and gave the best result achieved so far. The finer grid search did not aid the decision tree classifier, hence this search will be discarded in further experiments.

As a summary, the best results are achieved with the following approach:


• Add all features created (summarized in the list below).
• Invoke either CFS or Relieff feature selection to extract the best subset of features.
• As a classifier, use decision tree with the initial grid search seen in table 3.4, or AdaBoost with the grid search seen in table 4.2.

The number of neighbors used by Relieff was fixed to five, due to the computational complexity involved in finding the optimal number of features to include. Testing with several numbers of neighbours might have increased the performance of the model, but would also have increased the computational complexity. However, Robnik et al. [56] state that Relieff is robust in the number of nearest neighbors as long as it remains relatively small. Additionally, the search for the optimal number of features for Relieff was done with respect to the average accuracy of all combinations of classes. Ideally, the search would have been done for each combination of two classes. This is, however, done for the most relevant combinations of dysfunctions, and is elaborated in section 4.4. A sketch of this incremental Relieff search is given after the feature summary below.

Because adding a substantial number of features and invoking a feature selection method gave good results, additional experiments were conducted which added more features. They were:

• Adding the displacement/position motion to feature set four, which uses the acceleration and velocity (all in the circumferential direction).
• Adding the raw displacement data to all features.
• Adding all features in all three directions.

The experiments were conducted using both CFS and Relieff as the feature selection method and decision tree as the classifier. None of the experiments saw an increase in the results; hence the position data and the features from the longitudinal and radial directions are discarded.

Summary of all features

Since many of the best performing models used all features created, a summary of all features is given.

1. Feature set one - Features from the midsystolic phase:
   1.1. Peak acceleration within the first 150 ms from R-peak
   1.2. Peak velocity within the first 150 ms from R-peak
   1.3. Mean acceleration of the first 150 ms from R-peak
   1.4. Mean velocity of the first 150 ms from R-peak
   1.5. Difference in displacement from R-peak to 150 ms after R-peak
   1.6. Minimum acceleration within the first 150 ms from R-peak


2. Feature set two - Features from the IVR phase:
   2.1. Peak acceleration in the IVR phase
   2.2. Peak velocity in the IVR phase
   2.3. Mean acceleration in the IVR phase
   2.4. Mean velocity in the IVR phase
   2.5. Difference in displacement from R-peak to the middle of the IVR phase
   2.6. Minimum acceleration in the IVR phase
3. Feature set three - Various features:
   3.1. The three most prominent feature components of the acceleration in the circumferential direction
   3.2. The number of samples used to represent a cycle
   3.3. The accumulated magnitude of the frequency components in the circumferential direction
4. Feature set four - The raw data, interpolated and/or decimated to 300 samples:
   4.1. All acceleration samples from the raw data in the circumferential direction
   4.2. All velocity samples from the raw data in the circumferential direction
5. Features added in the further experiments:
   5.1. The correlation coefficients between the longitudinal acceleration and circumferential acceleration
   5.2. The correlation coefficients between the longitudinal acceleration and radial acceleration
   5.3. The correlation coefficients between the circumferential acceleration and radial acceleration
   5.4. Variance of circumferential acceleration in the midsystolic phase
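The incremental Relieff scheme referenced above can be sketched as follows (an illustration, not the thesis code). X is assumed to be an N-by-d matrix of the features listed above, y a numeric class vector, and evalAccuracy a hypothetical helper that trains a classifier and returns its leave-one-animal-out accuracy.

    k = 5;                                  % number of neighbours, fixed as described above
    [ranked, ~] = relieff(X, y, k);         % MATLAB's Relieff: feature indices, best first

    bestAcc = 0; bestNum = 0;
    for m = 1:numel(ranked)
        acc = evalAccuracy(X(:, ranked(1:m)), y);   % hypothetical helper function
        if acc > bestAcc
            bestAcc = acc;
            bestNum = m;                    % number of top-ranked features to keep
        end
    end
    selected = ranked(1:bestNum);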

4.3 Details of three two-class models

This section presents details of three of the models created in sections 4.1 and 4.2. These models either achieved a high accuracy or possess some interesting feature characteristics. The models are specified in table 4.5, and are further assessed in the following three subsections in the same order as they appear in the table.


Classifier     Hyperparameters                            Feature set   FS       Accuracy
AdaBoost       NLearn: 100, LearnRate: 0.2                All features  CFS      0.84
Decision tree  MaxNumSplits: 4, SplitCriterion: deviance  All features  Relieff  0.76
AdaBoost       NLearn: 25, LearnRate: 0.7                 Raw data      CFS      0.79

Table 4.5: Table specifying the models that will be further investigated.

4.3.1 Best performing model

This model achieved the highest average accuracy in the two-class classification task, hence a more detailed description of it is given here. A decomposition of the average accuracy can be seen in table 4.6.

Dysfunctions                Accuracy  Confusion matrix
Baseline - Adrenaline       0.90      [207 44; 18 369]
Baseline - Esmolol          0.69      [146 75; 57 147]
Baseline - Ischemia         0.89      [245 17; 50 291]
Baseline - Fluid loading    0.63      [164 119; 61 136]
Adrenaline - Esmolol        0.99      [322 7; 1 215]
Adrenaline - Ischemia       0.99      [408 0; 5 234]
Adrenaline - Fluid loading  0.94      [401 26; 12 229]
Esmolol - Ischemia          0.73      [150 46; 72 164]
Esmolol - Fluid loading     0.81      [178 39; 44 174]
Ischemia - Fluid loading    0.80      [186 52; 48 203]

Table 4.6: Table showing the accuracy and confusion matrix of each class combination for the model that performed best, which was AdaBoost using all features, achieving an average accuracy of 0.84. Regarding the confusion matrices, columns represent the true class and rows the predicted class, with the classes ordered as listed in the Dysfunctions column.

As expected, combinations involving adrenaline perform best. Adrenaline is the only intervention which increases the heart rate compared to baseline. The two combinations with the lowest accuracy are the ones involving baseline versus esmolol and fluid loading, respectively.

Since the data comes from real-world measurements, the influence of noise and inaccuracies in the data must be assessed. If the prediction errors often come from the same animals, it suggests that the measurements from those animals are particularly noisy. Analyzing the best model reveals one animal with an average error rate of 30% and two animals with an average error rate of 23%. The mean error rate across all animals is 15%. This shows that some animals were somewhat noisy and did have a negative effect on the results. However, noise is inherent in every real-world measurement, so discarding the noisy animals would give a result that does not reflect the true quality of the models.

The average number of features chosen by CFS was 23. The next subsection introduces a model using only three features, hence a more detailed analysis of which features are most discriminative is given there.


4.3.2 Model using only three features

Decision tree using all features and Relieff achieved an average accuracy of 0.78, as seen in table 4.4. Relieff used 13 features, but limiting the model to three features gives only a small decrease, down to 0.77 in average accuracy. It is interesting to see which features are deemed most relevant by Relieff in the different classification tasks. Hence, table 4.7 decomposes the average accuracy and shows which features are used.

Dysfunctions                Accuracy  Features      Confusion matrix
Baseline - Adrenaline       0.77      1.1 5.2 5.2   [127 44; 98 369]
Baseline - Esmolol          0.50      4.2 4.2 1.1   [105 113; 98 109]
Baseline - Ischemia         0.74      5.2 5.2 4.2   [188 47; 107 261]
Baseline - Fluid loading    0.72      1.1 5.2 5.2   [170 79; 55 176]
Adrenaline - Esmolol        0.90      1.1 5.1 5.1   [287 21; 36 201]
Adrenaline - Ischemia       0.90      1.5 1.1 4.2   [375 25; 38 209]
Adrenaline - Fluid loading  0.94      1.1 5.4 1.6   [391 18; 22 237]
Esmolol - Ischemia          0.74      5.2 5.2 4.2   [191 82; 31 128]
Esmolol - Fluid loading     0.63      5.3 5.3 4.2   [144 83; 78 130]
Ischemia - Fluid loading    0.79      4.2 4.2 4.2   [203 71; 31 184]

Table 4.7: Table showing the accuracy and confusion matrix of each class combination for the model that performed second best, which was the decision tree using all features and Relieff feature selection. In addition, the features chosen by Relieff are shown; the numbers point to the list of features in section 4.2.3.

The numbers in the third column refer to the list of features shown in section 4.2.3. The features are listed according to their rank, where the leftmost feature is the highest ranked. It is evident that samples from the raw acceleration and velocity data (both from the circumferential direction) are often chosen. This illustrates the usefulness of feature selection: using all samples from the raw data would create a high-dimensional feature space and make the curse of dimensionality problem (section 2.4.1) imminent. Moreover, finding the most discriminative sample numbers without such methods is a demanding task.

The most common feature that is not from the raw data is the feature denoted 1.1, that is, the peak circumferential acceleration from the midsystolic phase. This is also the feature deemed most useful by Halvorsen et al. [30].


4.3.3 Model using the raw data as features

As seen in the table of initial experiments, some of the best performing models used the raw data as features. It is therefore interesting to see whether there is a range of sample numbers that is deemed useful by the feature selection method. This can give insight into which parts of a cycle discriminate best between the classes. The best performing model using the raw data used CFS feature selection, where the raw data are the acceleration and velocity in the circumferential direction. Note that the data are interpolated or decimated to 300 samples, as described in section 3.2.3. The finer grid search, seen in table 4.2, was performed. The achieved average accuracy was 0.79, and the decomposition of the average accuracy can be seen in table 4.8.

Dysfunctions                Accuracy  Confusion matrix
Baseline - Adrenaline       0.90      [215 55; 10 358]
Baseline - Esmolol          0.64      [128 80; 75 142]
Baseline - Ischemia         0.86      [236 23; 59 285]
Baseline - Fluid loading    0.41      [114 172; 111 83]
Adrenaline - Esmolol        0.95      [312 18; 11 204]
Adrenaline - Ischemia       0.97      [395 1; 18 253]
Adrenaline - Fluid loading  0.82      [355 61; 58 194]
Esmolol - Ischemia          0.72      [153 51; 69 159]
Esmolol - Fluid loading     0.78      [178 52; 44 161]
Ischemia - Fluid loading    0.83      [203 52; 31 203]

Table 4.8: Table showing the accuracy and confusion matrix for a model using only the raw data as features, in combination with the CFS feature selection method.

Figure 4.1 shows how many times each sample number from the circumferential acceleration was chosen by CFS to be used as a feature, while figure 4.2 shows the same for the circumferential velocity. There were in total ten classification tasks, so a sample number could be selected at most ten times.


It was reasoned that the acceleration in the midsystolic phase discriminates well between classes. The midsystolic phase was approximated to lie in the range [0, 150] ms. A heartbeat lasts approximately 0.6-1 seconds, thus the sample numbers in the range [0, 45-75] are in the midsystolic phase, and this is indeed a range frequently chosen by CFS, as seen in figures 4.1 and 4.2. It was also reasoned that the IVR phase could discriminate well; it lies in the sample number range of approximately [109, 146] when using 0.8 seconds as the duration of the heartbeat. These samples are not chosen as often as the samples from the midsystolic phase. This is expected, as the initial results show that models using features from the midsystolic phase performed better than those using features from the IVR phase. Instead, samples lying in the range of sample number 200 ± 25 are often chosen. Note that the average number of samples chosen from the acceleration is 22, while the average from the velocity is 17.6. However, samples from the acceleration were not chosen significantly more often than samples from the velocity (P = 0.31).
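The sample-number ranges quoted above follow directly from the resampling to 300 samples per cycle; a minimal illustration (not the thesis code), with the heart-beat duration as an assumed parameter:

    nSamples = 300;                          % samples per resampled cycle
    T        = 0.8;                          % assumed heart-beat duration in seconds
    msToIdx  = @(ms) round(ms/1000 / T * nSamples);
    idx150   = msToIdx(150);                 % = 56 for T = 0.8; 45 for T = 1.0, 75 for T = 0.6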

Figure 4.1: Figure showing how many times each sample number from the acceleration in the circumferential direction was chosen to be used as a feature in the two-class classification tasks.


Figure 4.2: Figure showing how many times each sample number from the velocity in the circumferential direction was chosen to be used as a feature in the two-class classification tasks.

4.4 Optimizing certain combinations of two classes

Previously, the best hyperparameters were searched for using the average accuracy of all combinations as the performance measure. In section 3.6, it was mentioned and reasoned that certain combinations of two classes are clinically useful to discriminate from each other. The results for these clinically useful combinations can be further improved by performing a separate grid search for each combination. The subsection below specifies the experiments.

4.4.1 Experiments

Table 4.9 shows the four models that will be created for each of the interesting combinations of classes. The grid searches for decision tree and AdaBoost are shown in tables 3.4 and 4.2, respectively.

Classifier     Features      Feature selection
Decision tree  All features  CFS
Decision tree  All features  Relieff
AdaBoost       All features  CFS
AdaBoost       All features  Relieff

Table 4.9: Table showing which models will be created for certain combinations of two classes. Note that a grid search will be done for each of the combinations. All features are summarized in section 4.2.3.


4.4.2 Results

Table 4.10 shows the accuracy, precision, recall/sensitivity and specificity.

Classes                   Accuracy  Precision  Recall/Sensitivity  Specificity
Baseline - Adrenaline     0.95      0.97       0.96                0.94
Baseline - Esmolol        0.73      0.78       0.67                0.79
Baseline - Ischemia       0.90      0.87       0.94                0.85
Baseline - Fluid loading  0.80      0.80       0.81                0.78
Esmolol - Ischemia        0.74      0.71       0.85                0.67

Table 4.10: Table showing accuracy, precision, recall and true negative rate for the clinically most important combinations of classes. The latter class in the leftmost column is used as the positive class in the calculations.
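For reference, the measures in table 4.10 can be computed from a 2x2 confusion matrix as sketched below (illustrative code with hypothetical counts, not the thesis code). Following the thesis convention, columns hold the true class and rows the predicted class, with the dysfunction as the positive class.

    %        true negative  true positive
    CM = [        180             30;        % predicted negative (baseline)
                   25            190 ];      % predicted positive (dysfunction)
    TN = CM(1,1); FN = CM(1,2); FP = CM(2,1); TP = CM(2,2);

    accuracy    = (TP + TN) / sum(CM(:));
    precision   = TP / (TP + FP);
    recall      = TP / (TP + FN);            % also called sensitivity
    specificity = TN / (TN + FP);            % true negative rate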

Classes                   Classifier     Hyperparameters                            Feature selection
Baseline - Adrenaline     AdaBoost       NLearn: 100, LearnRate: 1                  Relieff
Baseline - Esmolol        AdaBoost       NLearn: 200, LearnRate: 0.9                CFS
Baseline - Ischemia       AdaBoost       NLearn: 100, LearnRate: 0.4                CFS
Baseline - Fluid loading  Decision tree  MaxNumSplits: 4, SplitCriterion: deviance  Relieff
Esmolol - Ischemia        AdaBoost       NLearn: 25, LearnRate: 0.6                 CFS

Features selected across the five models (numbering as in section 4.2.3): 1.1 5.2 5.3 1.5 5.4 2.2 1.6 4.1 2.4 4.1 4.2 1.4 4.2 4.2 4.2 4.2 4.2 2.1 4.2 2.2 4.1 4.2 4.2 4.2 4.1 4.1 4.1 4.1 4.2 4.2 4.2 4.2 4.2 4.2 4.2 5.4 1.4 2.6 4.1 4.1 4.1 4.1 4.1 4.1 22x4.2 1.1 5.2 5.2 5.4 5.4 9x4.2 5.3 5x4.2 1.1 1.4 2.2 2.3 2.6 11x4.1 18x4.2 5.4

Table 4.11: Table specifying the models used to achieve the results shown in table 4.10.

4.4.3 Analysis

The results for the most interesting combinations, shown in table 4.10, reflect the highest results achieved in this project. In general, all combinations saw a slight increase in accuracy compared to performing a grid search on the average accuracy of all combinations.

Regarding the features used, shown in table 4.11, features 1.1, 4.1 and 4.2 are often deemed useful. They are the peak circumferential acceleration from the midsystolic phase, and the raw circumferential acceleration and velocity data, respectively. All features are named in section 4.2.3. Note that Relieff ranks the features, and the first feature is the highest ranked. As already mentioned in section 4.3.2, the usefulness of the feature selection methods is clear, given the substantial number of features chosen from the raw data.

4.5 Adding classes to the classification task

Having identified the optimal classifiers (decision tree and AdaBoost) and the optimal features (all features with feature selection), classes can be added to the classification task. Until this section, only two classes have been considered at a time. This is not always fully realistic, as any class can occur during and after cardiac surgery. The multi-class classification task (models trained on more than two classes) poses a bigger challenge than the two-class classification task. The figures in section 3.2.4 show overlap in the motion data from the different classes, so it is expected that the models will find it harder to discriminate between the classes. More classes mean more potential overlap; hence the final task, which adds all five classes to the classification task, is expected to yield the poorest performance. A collective analysis is given at the end of the chapter.

4.5.1 Experiments

A reason for the extensive work done on the two-class classification task is to identify the optimal model. Table 4.12 shows which models will be created for the multi-class classification task. The choice is largely based on the best models found in the two-class classification task, elaborated in section 4.2.3.

Classifier     Features      Feature selection
Decision tree  All features  CFS
Decision tree  All features  Relieff
AdaBoost       All features  CFS
AdaBoost       All features  Relieff

Table 4.12: Table showing which models will be created when adding classes to the classification task.

A sketch of how such a multi-class model can be trained is given below. The following three subsections then present the results of combinations of three, four and five classes.
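As an illustration only (not the thesis code), one of the AdaBoost models in table 4.12 could be trained for the multi-class task with MATLAB's AdaBoostM2, the multi-class variant of AdaBoost; the thesis does not name the exact variant used. X and y are assumed to hold the selected features and the class labels, and the hyperparameter values are only examples.

    % Multi-class AdaBoost (AdaBoostM2) with decision trees as weak learners.
    mdl  = fitensemble(X, y, 'AdaBoostM2', 200, 'Tree', 'LearnRate', 0.8);

    yhat = predict(mdl, Xtest);              % Xtest: cycles from the held-out animal
    C    = confusionmat(ytest, yhat);        % note: rows are true classes here,
                                             % i.e. the transpose of the thesis tables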

4.5.2 Three classes - results

The best performing model is shown in the table below.


Classifier  Feature selection          Hyperparameters              Average accuracy
AdaBoost    Relieff using 16 features  NLearn: 200, LearnRate: 0.6  0.67

Table 4.13: Table showing the details of the best performing model using three classes.

Classes                                Accuracy
Adrenaline - Esmolol - Ischemia        0.73
Adrenaline - Esmolol - Fluid loading   0.72
Adrenaline - Ischemia - Fluid loading  0.80
Baseline - Adrenaline - Esmolol        0.64
Baseline - Adrenaline - Ischemia       0.74
Baseline - Adrenaline - Fluid loading  0.75
Baseline - Esmolol - Ischemia          0.60
Baseline - Esmolol - Fluid loading     0.67
Baseline - Ischemia - Fluid loading    0.47
Esmolol - Ischemia - Fluid loading     0.57

Table 4.14: Table showing the decomposition of the average accuracy for the best model, shown in table 4.13.

4.5.3 Four classes - results

The best performing model is shown in the table below.

Classifier     Feature selection  Hyperparameters                          Average accuracy
Decision tree  CFS                MaxNumSplits: 4, SplitCriterion: twoing  0.57

Table 4.15: Table showing the details of the best performing model using four classes.

The decomposition of the average accuracy can be seen in table 4.16.

Classes                                     Accuracy
Adrenaline - Esmolol - Ischemia - Volume    0.58
Baseline - Adrenaline - Esmolol - Ischemia  0.61
Baseline - Adrenaline - Esmolol - Volume    0.52
Baseline - Adrenaline - Ischemia - Volume   0.64
Baseline - Esmolol - Ischemia - Volume      0.50

Table 4.16: Table showing the decomposition of the average accuracy for the best model, shown in table 4.15.

The next five tables show the confusion matrices for the various combinations of classes. Columns hold the actual class and rows the predicted class.


Adrenaline - Esmolol - Ischemia - Fluid loading:

                          Adrenaline  Esmolol  Ischemia  Fluid loading
Predicted Adrenaline          207        0        18           2
Predicted Esmolol              20      101        51          38
Predicted Ischemia             77       55       134          44
Predicted Fluid loading        19       66         7         119

Baseline - Adrenaline - Esmolol - Ischemia:

                          Baseline  Adrenaline  Esmolol  Ischemia
Predicted Baseline            83        56         61        38
Predicted Adrenaline          17       236          0         0
Predicted Esmolol             79         4        159        64
Predicted Ischemia            24        27          2       108

Baseline - Adrenaline - Esmolol - Fluid loading:

                          Baseline  Adrenaline  Esmolol  Fluid loading
Predicted Baseline            83       109         64          48
Predicted Adrenaline          31       148          0           0
Predicted Esmolol             73        47        121          22
Predicted Fluid loading       15        19         37         143

Baseline - Adrenaline - Ischemia - Fluid loading:

                          Baseline  Adrenaline  Ischemia  Fluid loading
Predicted Baseline           135       138         35          31
Predicted Adrenaline          31       237         13           8
Predicted Ischemia            33        38        180          48
Predicted Fluid loading       26         0          6         168

Baseline - Esmolol - Ischemia - Fluid loading:

                          Baseline  Esmolol  Ischemia  Fluid loading
Predicted Baseline           117       23        10          47
Predicted Esmolol             35       89        88          33
Predicted Ischemia            29       58       112          26
Predicted Fluid loading       22       52         0         107

4.5.4 Five classes - results

Lastly, the results of the five-class classification task are presented. Table 4.17 shows the details of the two best performing models. Since the best model performed poorly in predicting esmolol, the second best model was also included; its confusion matrix is shown in table 4.19.


Classifier     Feature selection         Hyperparameters                          Average accuracy
AdaBoost       Relieff using 6 features  LearnRate: 0.8, NLearn: 200              0.54
Decision tree  CFS                       MaxNumSplits: 8, SplitCriterion: twoing  0.50

Table 4.17: Table showing the details of the best performing models using five classes.

                          Actual Baseline  Actual Adrenaline  Actual Esmolol  Actual Ischemia  Actual Fluid loading
Predicted Baseline               87               37                17               10                43
Predicted Adrenaline             41              249                14               15                 3
Predicted Esmolol                 0                0                 5                0                 0
Predicted Ischemia               63               27               141              185                59
Predicted Fluid loading          12               10                45                0               108

Table 4.18: Table showing the model achieving the best accuracy when including all classes.

                          Actual Baseline  Actual Adrenaline  Actual Esmolol  Actual Ischemia  Actual Fluid loading
Predicted Baseline              110               76                32               10                40
Predicted Adrenaline             59              172                 0               46                14
Predicted Esmolol                12                0                68                7                 7
Predicted Ischemia               11               61                76              136                53
Predicted Fluid loading          11               14                46               11                99

Table 4.19: Table showing the model achieving the second best accuracy when including all classes.


4.5.5 Analysis

Three classes The average accuracy dropped from 0.84 to 0.67 when adding a class. This is a substantial drop and demonstrates that most combinations of classes perform substantially below a satisfactory level. However, the decomposition of the average accuracy, seen in table 4.14, shows that the model can discriminate reasonably well between adrenaline, ischemia and fluid loading, achieving an accuracy of 0.80. The more clinically interesting combinations are the ones involving baseline, as reasoned in section 3.6. Two combinations involving baseline achieve reasonably good results: (i) baseline, adrenaline and ischemia, and (ii) baseline, adrenaline and fluid loading, with accuracies of 0.74 and 0.75, respectively. The remaining combinations did not perform close to satisfactory, though all performed better than chance (33%).

Four classes In the four-class classification task, the average accuracy dropped further, down to 0.57. The decomposition of the average accuracy reveals that no combination achieved a satisfactory result.

Five classes The final classification task, involving all classes at once, saw a further decrease in accuracy, by 0.03 to 0.54. The model has difficulties predicting esmolol and often mistakes it for ischemia.


Chapter 5

Discussion

This chapter concludes the thesis with a discussion of the results and a concluding summary, and lastly describes the possibilities for future work.

5.1 General discussion

Five-class classification The results achieved in the five-class classification task, seen in table 4.17, establish that predicting one class out of the five is not feasible with the current dataset. The accuracy of 0.54 is better than chance (choosing a cardiac function at random among five classes gives an accuracy of approximately 0.2), meaning that there is some characteristic motion for each of the cardiac functions which is noticed by the classifier. Still, the accuracy is substantially lower than satisfactory. Instead of predicting all classes, the model can to some degree distinguish the adrenaline class from the rest: the confusion matrix in table 4.18 shows that the accuracy with respect to the adrenaline class is 0.77. Its clinical use is discussed in the next paragraph.

A significant reason for the moderate accuracy is the overlap of data. As mentioned in section 3.2.4, it was difficult to identify any clear diversity between the heart functions, thus making the classification task challenging. The overlap is most prominent between ischemia and esmolol. This is confirmed by the confusion matrix in table 4.18, where the classifier often predicts ischemia when the actual cardiac function is esmolol. Moreover, having access to measurements from no more than 20 animals limits the accuracy. Even though segmenting the motion data into cycles increased the size of the dataset, the cycles should preferably come from different animals. The optimal number of animals is likely to be higher, though it is difficult to state exactly how many animals are required to achieve satisfactory results. Inflicting the various cardiac dysfunctions on more animals is a demanding task and is thus a weakness of the approach.

Two-class classification Neither the four- nor the three-class classification task achieves adequate results. The best performing classifier on all combinations of two classes, though, achieves an average accuracy of 0.84. This demonstrates that there is enough difference in the motion of certain heart functions to be noticed by a machine learning model. As mentioned in section 3.6, a two-class classifier is relevant for clinical use since certain cardiac functions are expected during and after cardiac surgery. Five combinations are relevant, four of which involve baseline, as this is either the initial or the desirable cardiac function. Esmolol versus ischemia is also of clinical use, since esmolol is often injected to decrease the heart rate, but injecting too much might make the heart ischemic. The results of the useful combinations are shown in table 4.10, where three out of the five combinations achieved a precision and recall of more than 0.80.

Baseline versus adrenaline was the best combination out of the five, achieving a precision and recall of 0.97 and 0.96, respectively. These results demonstrate that the model can be used as a monitoring technique when the adrenaline dysfunction is expected.

Baseline versus ischemia achieved a precision and recall of 0.87 and 0.94, respectively. This result is an improvement compared to the study by Manocha et al. [44], which investigated 22 different machine learning models for detecting ischemia using electrocardiography and reported an average precision and recall of 0.81 and 0.83, respectively. Because of the inherent drawbacks of the ECG as an ischemia detection technique, Manocha et al. proposed that a consolidation of two or more techniques may result in a better classifier for detection of ischemia. The technique described in this project has been demonstrated to be a suitable candidate for such a consolidation. A difference between this study and the one mentioned above is the use of the pig heart rather than the human heart. However, Douglas [18] states that the structure of the cardiac blood vessels of the pig very closely duplicates that of humans. Furthermore, it was stated that pigs may be useful in investigating the treatment of arrhythmias (adrenaline, esmolol), myocardial infarction (ischemia) and the effects of increased wall tension (fluid loading).

Esmolol versus ischemia achieved a precision and recall of 0.71 and 0.85, respectively. A recall of 0.85 implies that a high fraction of those with ischemia will be correctly classified as such, which is clinically important. The lower precision of 0.71 suggests that some of those classified with ischemia are misclassified. However, this combination can still be relevant for clinical use if further assessment is done before treatment.

Baseline versus fluid loading achieved a precision and recall of 0.80 and 0.81, respectively. The results can be of clinical use: frequently used standard preload indexes such as central venous pressure (CVP) or pulmonary capillary wedge pressure (PCWP) often fail to provide reliable information on cardiac preload and are not capable of predicting a cardiac response to fluid therapy [41, 48]. It is important to note, however, that the accelerometer technique was only tested on an increase in preload and not a decrease.


5.2 Conclusion

This thesis has investigated the possibility of predicting four different cardiac dysfunctions. The approach was machine learning on motion data from an accelerometer mounted on the heart wall of several pigs. The four dysfunctions (output classes) were esmolol, ischemia, adrenaline and fluid loading, in addition to a healthy baseline. The intended goal is to use a combined miniaturized accelerometer and temporary pacemaker lead to improve postoperative monitoring of cardiac surgery patients. Such a sensor would permit continuous monitoring of ventricular function during and after cardiac surgery, enabling early detection of complications and treatment guidance [22]. The goal stems from a research project at The Intervention Centre, Oslo University Hospital, Rikshospitalet, Oslo, Norway (OUS). In their research project, a dataset was built using at most 20 pigs. The pigs underwent interventions in order to simulate the various classes. The measurements were done for ten seconds at a time, or approximately 12 heartbeats.

Using the given dataset, motion data from each heartbeat were extracted and formed the basis for the features used to discriminate between the different output classes. Five different feature sets were created. Three of the sets were created using domain knowledge based on the previous research done at OUS. Of the final two sets, the former concatenated all previous sets while the latter used the raw data. Both made use of automatic feature selection methods in order to extract the best subset of features.

The training and testing of the models was done iteratively. Motion data from every animal except one were used to train the model, while the motion data from the single remaining animal were used to test the model. In the next iteration, a new animal was chosen as the test animal, while the model was trained on the rest. This was repeated until all animals had been used as the test animal. This technique upheld the assumption that the model will not have been trained on any motion data from a patient it is trying to predict.

Initially, all combinations of two classes were investigated. An exhaustive approach was conducted, testing the following: (i) three different classifiers, (ii) all five feature sets, some of which used feature selection, and (iii) all combinations of two classes. The poorest performing models were identified and discarded. This enabled the output classes to be tested on a smaller number of models. The final results presented were the two best performing models trained and tested on all five output classes. The best models consisted of the following: (i) all features created in the circumferential direction, combined with either the CFS or the Relieff feature selection method, and (ii) either AdaBoost or decision tree as the classifier. Samples from the raw acceleration and velocity data in the circumferential direction were deemed useful by the feature selection methods. In addition, the peak circumferential acceleration in the midsystolic phase was also deemed useful.

The results achieved when training and testing on combinations of two classes at a time showed that the various classes could be discriminated with good accuracy. The best average accuracy over all combinations of two classes was 0.84. Certain combinations of two classes are relevant for clinical use. These are combinations including baseline, as this is either the initial or the desirable cardiac function, as well as the combination of esmolol and ischemia. Their results are shown in table 4.10. All combinations including baseline, except baseline versus esmolol, achieved a precision and recall of more than 0.80. Best performing was baseline versus adrenaline, achieving a precision and recall of 0.97 and 0.96. Esmolol versus ischemia achieved a precision and recall of 0.71 and 0.85, respectively. These results demonstrate the feasibility of using machine learning and motion data from the heart to predict certain cardiac functions.

As anticipated, the accuracy dropped when adding classes. When training and testing on all five classes, the best model achieved an accuracy of 0.54. One of the biggest factors in the drop in accuracy is the overlap of data, seen in the graphs of section 3.2.4. The overlap was most prominent between ischemia and esmolol, and is confirmed by the confusion matrix in table 4.18. Secondly, at most 20 animals underwent each intervention. Though splitting the motion data by heartbeats increased the dataset, having motion data from more animals would have improved the results.

The result regarding baseline versus ischemia is an improvement over the performance reported in research combining machine learning [44] and electrocardiography (ECG). Because of the inherent drawbacks of using ECG, it was proposed that the technique described in this project could be used in combination with ECG. Regarding the features used, the same dataset was previously investigated at OUS, where it was also concluded that the midsystolic phase discriminated best [30]. Additionally, the circumferential direction was also shown to discriminate better than the longitudinal and radial directions [28].

The results show that discriminating between two cardiac functions is feasible through machine learning. In the best case, a machine learning model which predicts on new and unseen motion data could be used as a way of continuously monitoring patients during and after cardiac surgery.

5.3 Future work

In order for the model to be used for more than two cardiac functions, further research is needed. Collecting motion data from more animals would ensure a more robust and presumably better performing model. Additionally, attaching a gyroscope to the pacemaker lead would give information about the rotation of the heart. Having more information could further aid the model.

It was proposed that the technique described in this project could be used in conjunction with ECG. Such a combination would need further research in order to validate the technique. This project has not investigated the use of ensemble learners, that is, combining several classifiers and predicting based on what the majority of the classifiers predict. A different approach could be to use classifiers which output probabilities. The probability could then be interpreted as how close the heart is to being in one of the cardiac functions. An approach where the classifier at all times knew the baseline, and instead predicted the transition from baseline to another dysfunction, would be of clinical use, since baseline is often the initial cardiac function.

Lastly, it is worth mentioning that an anomaly detector would be beneficial. When motion data do not belong to any of the classes, the model should not attempt to predict one of them, but should instead report that an abnormality was seen.


Appendices


Appendix A

MATLAB's decision tree split criteria

MATLAB's split criterion functions for decision trees are described here. The split criterion is what defines which node to create next. Let p(i|t) denote the fraction of inputs belonging to class i at a given node t.

Gini's Diversity Index The Gini's Diversity Index (gdi) is the most broadly used [65] and is given by:

    Gini(t) = 1 - \sum_{i=1}^{c} [p(i|t)]^2                                   (A.1)

where c is the number of classes. A node with just one class (a pure node) has Gini index 0; otherwise the Gini index is positive.

Deviance The deviance is given by:

    Deviance(t) = - \sum_{i=1}^{c} p(i|t) \log_2 p(i|t)                        (A.2)

Twoing rule Unlike the gdi, the twoing rule searches for two classes that together make up more than 50% of the data. It maximizes the following change-of-impurity measure:

    \Delta i(t) = \frac{P_l P_r}{4} \left[ \sum_{i=1}^{c} |p(i|t_l) - p(i|t_r)| \right]^2    (A.3)

where P_l and P_r are the fractions of observations that split to the left and right respectively, and t_l and t_r refer to the left and right child nodes.

Little research has been devoted to understanding which of these criteria works best for different kinds of problems. As in many other areas of machine learning, all three must be tried and tested to see which one performs best.
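A minimal numerical illustration of the three criteria (not MATLAB's internal implementation), for a hypothetical node with class proportions p:

    p = [0.7 0.2 0.1];                       % class proportions p(i|t) at node t
    gini = 1 - sum(p.^2);                    % Gini's diversity index, eq. (A.1)

    p_nz     = p(p > 0);                     % avoid log2(0) for empty classes
    deviance = -sum(p_nz .* log2(p_nz));     % deviance, eq. (A.2)

    % Twoing, eq. (A.3): class proportions in the left/right children and the
    % fractions Pl, Pr of observations sent to each side (hypothetical values).
    pl = [0.9 0.05 0.05];  pr = [0.1 0.65 0.25];
    Pl = 0.6;  Pr = 0.4;
    twoing = (Pl*Pr/4) * sum(abs(pl - pr))^2;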

Appendix B

MATLAB's SVM kernels

The kernels provided by MATLAB for its SVM classifier are shown in the table below.

Name                                   Kernel function
Gaussian radial basis function (RBF)   exp(-||x - x'||^2 / (2\sigma^2)), or equivalently exp(-\gamma ||x - x'||^2)
Polynomial                             (1 + x \cdot x')^d
Linear                                 x \cdot x'

Table B.1: MATLAB's SVM kernels. \sigma, \gamma and d are hyperparameters which must be set beforehand, outside of the training process. Hyperparameters are reviewed in section 2.5.3.

Kernels can also be thought of as a similarity measure. Naturally, if the problem is linearly separable in the original space, no transformation is needed; this is known as the linear kernel. The polynomial kernel will, in contrast with the linear kernel, calculate feature conjunctions up to order d. Setting the order too high will fit the training data well, but will not generalize. The RBF kernel will, intuitively, create bell-shaped curves centered around every support vector, where a support vector is a feature vector touching the margin, seen in figure 2.14. Sigma controls the width of the bell: a small sigma gives a narrow, pointed bump, while a large sigma gives a softer and broader bump. Consequently, setting sigma too large will make the classifier unable to recognize the pattern in the data, while setting it too small will make the classifier overfit, meaning that it only represents the current training data and cannot generalize to new and unseen data.

Here, d denotes the number of dimensions (the order) of the feature conjunctions in the polynomial kernel.
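For illustration (not the thesis code), the three kernels in table B.1 correspond to the following fitcsvm calls, where X is assumed to be a feature matrix and y a label vector:

    mdlLinear = fitcsvm(X, y, 'KernelFunction', 'linear');
    mdlPoly   = fitcsvm(X, y, 'KernelFunction', 'polynomial', 'PolynomialOrder', 3);
    mdlRbf    = fitcsvm(X, y, 'KernelFunction', 'rbf', 'KernelScale', 2.0);  % KernelScale plays the role of sigma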

Appendix C

Statistics

C.1 Feature selection methods

Relieff Relieff randomly selects examples, denoted R_i, and searches for their k nearest neighbours from the same class, called nearest hits H, and k nearest neighbours from the other class, called nearest misses M. Relieff updates the rank of all features depending on their values for R_i, M and H. If instance R_i and H have different values of a feature F, then the feature F separates two instances with the same class, which is not desirable, so the rank of that feature is decreased. On the other hand, if instance R_i and M have different values of the feature F, then the feature separates two instances with different class values, which is desirable, so the rank of the feature is increased.

Figure C.1: Figure showing the basic idea of the Relieff feature selection algorithm with k set to one.

Relieff is one of the most successful feature selection methods [17]. Robnik et al. [55] (1997) have shown that Relieff is effective at detecting relevant features, even when these features are highly dependent on other features. Robnik et al. [56] (2003) also showed that it is robust in the number of nearest neighbors as long as it remains relatively small. A drawback of Relieff is its randomness: since it chooses examples at random, the results can vary between runs of the algorithm. Additionally, a scheme for selecting the number of features to include is necessary, as Relieff only ranks the features. One way to solve this is to sort the features by their assigned rank, and then iteratively add the highest-ranked feature not yet included, classifying each time. This is done until all features have been included, and the number of features that yielded the best result is used.

Correlation-based feature selection (CFS) Correlation-based Feature Selection (CFS) was introduced by Hall [26] and is built on the following hypothesis: "A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other." CFS is a filter approach and uses a function that gives a feature subset S consisting of k features a merit according to:

    Merit_{S_k} = \frac{k \bar{r}_{cf}}{\sqrt{k + k(k-1)\bar{r}_{ff}}}          (C.1)

Here, \bar{r}_{cf} is the average feature-class correlation and \bar{r}_{ff} is the average feature-feature correlation. CFS starts with an empty set of features and adds the feature giving the highest merit according to equation C.1. This is known as a best-first search. It does so until five consecutive fully expanded non-improving subsets have been created. Consequently, CFS outputs the best subset found instead of ranking all features, thus eliminating the need for a scheme to select the number of features to include, as in Relieff. Because CFS makes use of all the training data at once, it tends to give better results than the wrapper used by Hall et al. on small datasets [27]. The weakness of CFS is the best-first search approach: when a feature is added, it cannot be removed at later stages, thus CFS is susceptible to local optima. However, the number of features created here is not substantial, thus the probability of getting stuck in a poor local optimum is limited.
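A sketch of the merit computation in equation (C.1) for one candidate subset (illustrative, not the thesis code); X is assumed to be an N-by-d feature matrix and y a numeric class vector:

    subset = [1 3 7];                         % hypothetical feature indices
    k      = numel(subset);
    Xs     = X(:, subset);

    rcf = mean(abs(corr(Xs, y)));             % average feature-class correlation
    R   = abs(corr(Xs));                      % pairwise feature-feature correlations
    rff = mean(R(~eye(k)));                   % average off-diagonal correlation

    merit = (k * rcf) / sqrt(k + k*(k-1)*rff);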

C.2 P-value

When investigating a result from hypothesis testing, the p-value states how consistent the observed data are with the null hypothesis. More specifically, in hypothesis testing, some event is tested on sample data to see if the event actually has an effect. An example of an event and sample data can be some drug applied to a group of rats. With the result in hand, one would like to know whether the difference seen in the sample data arose by chance, or because the event actually had an effect, i.e. whether the drug is actually working. This is solved by first creating a null hypothesis, denoted H0, which is the hypothesis stating that there is no effect. An alternative hypothesis H1 is also created, which is the hypothesis stating that there is an effect. The p-value is defined as the probability, under the assumption that H0 is true, of obtaining a result equal to or more extreme than what was actually observed. Examples of test statistics for calculating the p-value are the z-test [42], Student's t-test [61], and the Kolmogorov-Smirnov test [46]. If the p-value is below some predefined significance level α, usually 0.05, the null hypothesis is rejected, indicating that the event did have an effect on the group.
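As a small illustration of such a test (not the thesis code), two sets of per-animal accuracies could be compared in MATLAB with a two-sample t-test; the accuracy values here are hypothetical:

    accA = [0.66 0.71 0.64 0.69 0.62];   % hypothetical per-animal accuracies, model A
    accB = [0.60 0.63 0.59 0.65 0.58];   % hypothetical per-animal accuracies, model B
    [h, p] = ttest2(accA, accB);         % h = 1 means H0 is rejected at alpha = 0.05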


Bibliography [1] accelerometer. https://www.flickr.com/photos/drdrang/ 7277161810/. Accessed: 2-3-2016. [2] Ecg. https://upload.wikimedia.org/wikipedia/commons/9/9e/ SinusRhythmLabels.svg. Accessed: 2-3-2016. [3] http://diagnosoft.com/strain/cardiac-strain. http: //diagnosoft.com/wp-content/uploads/2014/05/ Screen-Shot-2014-05-09-at-5.12.02-PM.png. Accessed: 10-92015. [4] Human physiology. https://en.wikibooks.org/wiki/Human_ Physiology/The_cardiovascular_system#/media/File:Diagram_ of_the_human_heart_(cropped).svg. Accessed: 3-2-2016. [5] Ischemia. https://upload.wikimedia.org/wikipedia/commons/d/ d1/Blausen_0257_CoronaryArtery_Plaque.png. Accessed: 4-3-2016. [6] kernel-trick. https://en.wikipedia.org/wiki/Kernel_method# /media/File:Kernel_Machine.png. Accessed: 3-3-2016. [7] Overfitting. https://commons.wikimedia.org/wiki/File: Overfitting.svg. Accessed: 26-2-2016. [8] Wiggers diagram. https://upload.wikimedia.org/wikipedia/ commons/f/f4/Wiggers_Diagram.svg. Accessed: 2-3-2016. [9] World health organization department of health statistics and informatics in the information, evidence and research cluster. the global burden of disease 2004 update. page 11, 2004. [10] AI Belousov, SA Verzakov, and J Von Frese. A flexible classification approach with optimal generalisation performance: support vector machines. Chemometrics and Intelligent Laboratory Systems, 64(1):15–25, 2002. [11] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. " O’Reilly Media, Inc.", 2009. [12] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear svm. The Journal of Machine Learning Research, 11:1471–1490, 2010.


[13] Ion Codreanu, Matthew D Robson, Stephen J Golding, Bernd A Jung, Kieran Clarke, and Cameron J Holloway. Longitudinally and circumferentially directed movements of the left ventricle studied by cardiovascular magnetic resonance phase contrast velocity mapping. Journal of Cardiovascular Magnetic Resonance, 12(1):48, 2010. [14] Dennis VP Cokkinos, Constantinos Pantos, Gerd Heusch, and Heinrich Taegtmeyer. Myocardial ischemia: from mechanisms to therapeutic potentials, volume 21. Springer Science & Business Media, 2006. [15] Mark E Comunale, Simon C Body, Catherine Ley, Colleen Koch, Gary Roach, Joseph P Mathew, Ahvie Herskowitz, and Dennis T Mangano. The concordance of intraoperative left ventricular wallmotion abnormalities and electrocardiographic st segment changes association with outcome after coronary revascularization. The Journal of the American Society of Anesthesiologists, 88(4):945–954, 1998. [16] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995. [17] Thomas G Dietterich. Machine-learning research. AI Magazine, 18(4):97, 1997.

[18] William Richard Douglas. Of pigs and men and research. Space life sciences, 3(3):226–234, 1972. [19] Ole Jakob Elle, Steinar Halvorsen, Martin Gunnar Gulbrandsen, Lars Aurdal, Andre Bakken, Eigil Samset, Harald Dugstad, and Erik Fosse. Early recognition of regional cardiac ischemia using a 3-axis accelerometer sensor. Physiological measurement, 26(4):429, 2005. [20] Yoav Freund, Robert Schapire, and N Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771780):1612, 1999. [21] Derek G Gibson and Darrel P Francis. Clinical assessment of left ventricular diastolic function. Heart, 89(2):231–238, 2003. [22] Ole-Johannes HN Grymyr, Anh-Tuan T Nguyen, Fjodors Tjulkins, Andreas Espinoza, Espen W Remme, Helge Skulstad, Erik Fosse, Kristin Imenes, and Per S Halvorsen. Continuous monitoring of cardiac function by 3-dimensional accelerometers in a closed-chest pig model. Interactive CardioVascular and Thoracic Surgery, page ivv191, 2015. [23] Ole-Johannes HN Grymyr, Espen W Remme, Andreas Espinoza, Helge Skulstad, Ole J Elle, Erik Fosse, and Per S Halvorsen. Assessment of 3d motion increases the applicability of accelerometers for monitoring left ventricular function. Interactive cardiovascular and thoracic surgery, 20(3):329–337, 2015. 88


[24] Gongde Guo, Daniel Neagu, and Mark TD Cronin. A study on feature selection for toxicity prediction. In Fuzzy Systems and Knowledge Discovery, pages 31–34. Springer, 2005. [25] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003. [26] Mark A Hall and Lloyd A Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper. In FLAIRS conference, volume 1999, pages 235–239, 1999. [27] Mark A Hall and Lloyd A Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper. In FLAIRS conference, volume 1999, pages 235–239, 1999. [28] Per Steinar Halvorsen, Andreas Espinoza, Lars Albert Fleischer, Ole Jakob Elle, Lars Hoff, Runar Lundblad, Helge Skulstad, Thor Edvardsen, Halfdan Ihlen, and Erik Fosse. Feasibility of a three-axis epicardial accelerometer in detecting myocardial ischemia in cardiac surgical patients. The Journal of thoracic and cardiovascular surgery, 136(6):1496–1502, 2008. [29] Per Steinar Halvorsen, LA Fleischer, A Espinoza, OJ Elle, L Hoff, H Skulstad, T Edvardsen, and E Fosse. Detection of myocardial ischaemia by epicardial accelerometers in the pig. British journal of anaesthesia, 102(1):29–37, 2009. [30] Per Steinar Halvorsen, Espen W Remme, Andreas Espinoza, Helge Skulstad, Runar Lundblad, Jacob Bergsland, Lars Hoff, Kristin Imenes, Thor Edvardsen, Ole Jakob Elle, et al. Automatic real-time detection of myocardial ischemia by epicardial accelerometer. The Journal of thoracic and cardiovascular surgery, 139(4):1026–1032, 2010. [31] Zena M Hira and Duncan F Gillies. A review of feature selection and feature extraction methods applied on microarray data. Advances in bioinformatics, 2015, 2015. [32] Per Kristian Hol, Per Snorre Lingaas, Runar Lundblad, Kjell Arne Rein, Karleif Vatne, Hans-Jørgen Smith, Sigurd Nitter-Hauge, and Erik Fosse. Intraoperative angiography leads to graft revision in coronary artery bypass surgery. The Annals of Thoracic Surgery, 78(2):502 – 505, 2004. [33] Anders Holst and Arndt Jonasson. Classification of movement patterns in skiing. In SCAI, pages 115–124, 2013. [34] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. A practical guide to support vector classification. 2003. [35] Abdulmassih S Iskandrian, Charles E Bemis, A-Hamid Hakki, Ioannis Panidis, Jaekyeong Heo, J Gerald Toole, Tsushung A Hua, Douglas 89


Allin, and Sally Kane-Marsch. Effects of esmolol on patients with left ventricular dysfunction. Journal of the American College of Cardiology, 8(1):225–231, 1986. [36] Uday Jain, CJ Laflamme, Anil Aggarwal, James G Ramsay, Mark E Comunale, Sudhanshu Ghoshal, Long Ngo, Krzysztof Ziola, Milton Hollenberg, and Dennis T Mangano. Electrocardiographic and hemodynamic changes and their association with myocardial infarction during coronary artery bypass surgery. a multicenter study. multicenter study of perioperative ischemia (mcspi) research group. Anesthesiology, 86(3):576–591, 1997. [37] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. Springer, 1998. [38] SVEN ROLAND KJELLBERG, ULF RUDHE, and TORGNY SJÖSTRAND. The effect of adrenaline on the contraction of the human heart under normal circulatory conditions. Acta Physiologica Scandinavica, 24(4):333–349, 1952. [39] JD Kneeshaw. Transoesophageal echocardiography (toe) in the operating room. British journal of anaesthesia, 97(1):77–84, 2006. [40] Ravindra Koggalage and Saman Halgamuge. Reducing the number of training samples for fast support vector machine classification. Neural Information Processing-Letters and Reviews, 2(3):57–65, 2004. [41] Anand Kumar, Ramon Anel, Eugene Bunnell, Kalim Habet, Sergio Zanotti, Stephanie Marshall, Alex Neumann, Amjad Ali, Mary Cheang, Clifford Kavinsky, et al. Pulmonary artery occlusion pressure and central venous pressure fail to predict ventricular filling volume, cardiac performance, or the response to volume infusion in normal subjects. Critical care medicine, 32(3):691–699, 2004. [42] DN Lawley. A generalization of fisher’s z test. Biometrika, 30(1/2):180– 187, 1938. [43] Ying Liu, Dengsheng Zhang, and Guojun Lu. Region-based image retrieval with high-level semantics using decision tree learning. Pattern Recognition, 41(8):2554–2570, 2008. [44] Amit Kumar Manocha and Mandeep Singh. An overview of ischemia detection techniques. International Journal of Scientific & Engineering Research, 2(11), 2011. [45] Stephen Marsland. Machine learning: an algorithmic perspective. CRC press, 2015. [46] Frank J Massey Jr. The kolmogorov-smirnov test for goodness of fit. Journal of the American statistical Association, 46(253):68–78, 1951. 90


[47] Alain Mercat, Jean-Luc Diehl, Guy Meyer, Jean-Louis Teboul, and Herve Sors. Hemodynamic effects of fluid loading in acute massive pulmonary embolism. Critical care medicine, 27(3):540–544, 1999. [48] Frédéric Michard and Jean-Louis Teboul. Predicting fluid responsiveness in icu patients: a critical analysis of the evidence. CHEST Journal, 121(6):2000–2008, 2002. [49] medical illustrator Patrick J. Lynch. Transthoracic echocardiogram. https://commons.wikimedia.org/w/index.php?curid=1490817. Accessed: 7-4-2016. [50] Stephen J Preece, John Yannis Goulermas, Laurence PJ Kenney, and David Howard. A comparison of feature extraction methods for the classification of dynamic activities from accelerometer data. Biomedical Engineering, IEEE Transactions on, 56(3):871–879, 2009. [51] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81– 106, 1986. [52] Narasimhan Ranganathan, Vahe Sivaciyan, and Franklin B Saksena. The art and science of cardiac physical examination: with heart sounds and pulse wave forms on CD. Springer Science & Business Media, 2007. [53] Nishkam Ravi, Nikhil Dandekar, Preetham Mysore, and Michael L Littman. Activity recognition from accelerometer data. In AAAI, volume 5, pages 1541–1546, 2005. [54] Espen W Remme, Lars Hoff, Per Steinar Halvorsen, Edvard Nærum, Helge Skulstad, Lars A Fleischer, Ole Jakob Elle, and Erik Fosse. Validation of cardiac accelerometer sensor measurements. Physiological measurement, 30(12):1429, 2009. [55] Marko Robnik-Šikonja and Igor Kononenko. An adaptation of relief for attribute estimation in regression. In Machine Learning: Proceedings of the Fourteenth International Conference (ICML’97), pages 296–304, 1997. [56] Marko Robnik-Šikonja and Igor Kononenko. Theoretical and empirical analysis of relieff and rrelieff. Machine learning, 53(1-2):23–69, 2003. [57] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. bioinformatics, 23(19):2507– 2517, 2007. [58] Olav Sand, Egil Haug, Kari C Toverud, and Øystein V Sjaastad. Menneskets fysiologi. Gyldendal akademisk, 2014. [59] Robert E Schapire. The strength of weak learnability. Machine learning, 5(2):197–227, 1990. 91


[60] Mojtaba Seyedhosseini, António RC Paiva, and Tolga Tasdizen. Fast adaboost training using weighted novelty selection. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1245–1250. IEEE, 2011. [61] Student. The probable error of a mean. Biometrika, pages 1–25, 1908. [62] A Taddei, G Costantino, R Silipo, M Emdin, and C Marchesi. A system for the detection of ischemic episodes in ambulatory ecg. In Computers in Cardiology 1995, pages 705–708. IEEE, 1995. [63] Yuan Tao. Analyse statistique de la fonction cardiaque chez des patients atteints de cardiopathies et des triathl‘etes. 2014. [64] Robert Tennant and Carl J Wiggers. The effect of coronary occlusion on myocardial contraction. American Journal of Physiology–Legacy Content, 112(2):351–361, 1935. [65] Roman Timofeev. Classification and regression trees (cart) theory and applications. 2004. [66] Stig Urheim, Thor Edvardsen, Hans Torp, Bjørn Angelsen, and Otto A Smiseth. Myocardial strain by doppler echocardiography validation of a new method to quantify regional myocardial function. Circulation, 102(10):1158–1164, 2000. [67] David H Wolpert and William G Macready. No free lunch theorems for optimization. Evolutionary Computation, IEEE Transactions on, 1(1):67– 82, 1997.

