
Successful Data Mining in Practice: Where do we Start? Richard D. De Veaux

Department of Mathematics and Statistics Williams College Williamstown MA, 01267 [email protected]

http://www.williams.edu/Mathematics/rdeveaux JSM 2-day Course SF August 2-3, 2003


Outline
• What is it?
• Why is it different?
• Types of models
• How to start
• Where do we go next?
• Challenges


Reason for Data Mining

Data = $$


Data Mining Is…
• "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." --- Fayyad
• "finding interesting structure (patterns, statistical models, relationships) in data bases." --- Fayyad, Chaudhuri and Bradley
• "a knowledge discovery process of extracting previously unknown, actionable information from very large data bases." --- Zornes
• "a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." --- Edelstein


What is Data Mining?


Paralyzed Veterans of America
• KDD 1998 cup
• Mailing list of 3.5 million potential donors
• Lapsed donors
  – Made their last donation to PVA 13 to 24 months prior to June 1997
  – 200,000 (training and test sets)

• Who should get the current mailing?
• Cost-effective strategy?

Results for PVA Data Set
• If the entire list (100,000 donors) is mailed, the net donation is $10,500
• Using data mining techniques, this was increased by 41.37%


KDD CUP 98 Results


KDD CUP 98 Results 2


Why Is This Hard?
• Size of data set
• Signal/noise ratio
• Example #1 – PVA


Why Is It Taking Off Now?
• Because we can
  – Computer power
  – The price of digital storage is near zero
• Data warehouses already built
  – Companies want return on data investment


What’s Different?
• Users
  – Domain experts, not statisticians
  – Have too much data
  – Want automatic methods
  – Want useful information
• Problem size
  – Number of rows
  – Number of variables


Data Mining Data Sets
• Massive amounts of data
• UPS
  – 16 TB -- the Library of Congress
  – Mostly tracking
• Low signal to noise
  – Many irrelevant variables
  – Subtle relationships
  – Variation


Financial Applications
• Credit assessment
  – Is this loan application a good credit risk?
  – Who is likely to declare bankruptcy?
• Financial performance
  – What should a portfolio product mix be?


Manufacturing Applications
• Product reliability and quality control
• Process control
  – What can I do to improve batch yields?
• Warranty analysis
  – Product problems
  – Fraud
  – Service assessment


Medical Applications
• Medical procedure effectiveness
  – Who are good candidates for surgery?
• Physician effectiveness
  – Which tests are ineffective?
  – Which physicians are likely to overprescribe treatments?
  – What combinations of tests are most effective?


E-commerce
• Automatic web page design
• Recommendations for new purchases
• Cross-selling


Pharmaceutical Applications
• Clinical trial databases
• Combine clinical trial results with an extensive medical/demographic database to explore:
  – Prediction of adverse experiences
  – Who is likely to be non-compliant or drop out?
  – What are alternative (i.e., non-approved) uses supported by the data?


Example: Screening Plates
• Biological assay
  – Samples are tested for potency
  – 8 x 12 arrays of samples
  – Reference compounds included
• Questions:
  – Correct for drift
  – Recognize clogged dispensing tips


Pharmaceutical Applications
• High-throughput screening
  – Predict actions in assays
  – Predict results in animals or humans
• Rational drug design
  – Relating chemical structure with chemical properties
  – Inverse regression to predict chemical properties from desired structure
• DNA SNPs ("snips")


Pharmaceutical Applications
• Genomics
  – Associate genes with diseases
  – Find relationships between genotype and drug response (e.g., dosage requirements, adverse effects)
  – Find individuals most susceptible to placebo effect


Fraud Detection
• Identify false:
  – Medical insurance claims
  – Accident insurance claims
• Which stock trades are based on insider information?
• Whose cell phone number has been stolen?
• Which credit card transactions are from stolen cards?


Case Study I
• Ingot cracking
  – 953 30,000-lb. ingots
  – 20% cracking rate
  – $30,000 per recast
  – 90 potential explanatory variables
    · Water composition (reduced)
    · Metal composition
    · Process variables
    · Other environmental variables


Case Study II – Car Insurance
• 42,800 mature policies
• 65 potential predictors
  – Tree model found industry, vehicle age, number of vehicles, usage and location


Data Mining and OLAP
• On-line analytical processing (OLAP): users deductively analyze data to verify hypotheses
  – Descriptive, not predictive
• Data mining: software uses data to inductively find patterns
  – Predictive or descriptive
• Synergy
  – OLAP helps users understand data before mining
  – OLAP helps users evaluate the significance and value of patterns


Data Mining vs. Statistics

                           Data Mining                       Statistics
Large amount of data       1,000,000 rows, 3,000 columns     1,000 rows, 30 columns
Data collection            Happenstance data                 Designed surveys, experiments
Sample?                    Why bother? We have big,          You bet! We even get error
                           parallel computers                estimates.
Reasonable price for       $1,000,000 a year                 $599 with coupon from
software                                                     Amstat News
Presentation medium        PowerPoint, what else?            Overhead foils, of course!
Nice place for a meeting   Aspen in January, Maui in         Indianapolis in August, Dallas in
                           February, …                       August, Baltimore in August,
                                                             Atlanta in August, …

Data Mining vs. Statistics

Data mining:
• Flexible models
• Prediction often most important
• Computation matters
• Variable selection and overfitting are problems

Statistics:
• Particular model and error structure
• Understanding, confidence intervals
• Computation not critical
• Variable selection and model selection are still problems

What’s the Same?
• George Box
  – All models are wrong, but some are useful
  – Statisticians, like artists, have the bad habit of falling in love with their models
• The model is no better than the data
• Twyman’s law
  – If it looks interesting, it’s probably wrong
• De Veaux’s corollary
  – If it’s not wrong, it’s probably obvious


Knowledge Discovery Process
1. Define business problem
2. Build data mining database
3. Explore data
4. Prepare data for modeling
5. Build model
6. Evaluate model
7. Deploy model and results
Note: This process model borrows from CRISP-DM: CRoss Industry Standard Process for Data Mining

Data Mining Myths
• Find answers to unasked questions
• Continuously monitor your database for interesting patterns
• Eliminate the need to understand your business
• Eliminate the need to collect good data
• Eliminate the need to have good data analysis skills


Beer and Diapers
• Made-up story?
• Unrepeatable -- happened once
• Lessons learned?
  – Imagine being able to see nobody coming down the road, and at such a distance
  – De Veaux’s theory of evolution


Beer and Diapers

Picture from a Tandem™ ad


Successful Data Mining
• The keys to success:
  – Formulating the problem
  – Using the right data
  – Flexibility in modeling
  – Acting on results
• Success depends more on the way you mine the data than on the specific tool


Types of Models
• Descriptions
• Classification (categorical or discrete values)
• Regression (continuous values)
  – Time series (continuous values)
• Clustering
• Association


Data Preparation
• Build data mining database
• Explore data
• Prepare data for modeling

60% to 95% of the time is spent preparing the data


Data Challenges
• Data definitions
  – Types of variables
• Data consolidation
  – Combine data from different sources
  – NASA Mars lander
• Data heterogeneity
  – Homonyms
  – Synonyms
• Data quality


Data Quality


Missing Values
• Random missing values
  – Delete row?
    · Paralyzed Veterans
  – Substitute value
    · Imputation
    · Multiple imputation
• Systematic missing data
  – Now what?
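Single-value imputation can be sketched in a few lines. This is a minimal illustration with invented ages and a hypothetical `impute_mean` helper; multiple imputation would instead draw several plausible values to reflect the extra uncertainty a single substituted mean hides.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

# Invented data: two respondents did not report their age
ages = [34, None, 41, 29, None, 36]
print(impute_mean(ages))  # [34, 35.0, 41, 29, 35.0, 36]
```

Note this only makes sense for randomly missing values; systematic missingness (next slide) biases the observed mean itself.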


Missing Values -- Systematic
• Ann Landers: 90% of parents said they wouldn’t do it again!!
• Wharton Ph.D. student questionnaire on survey attitudes
• Bowdoin College applicants have a mean SAT verbal score above 750


The Depression Study
• Designed to study antidepressant efficacy
  – Measured via the Hamilton Rating Scale
• Side effects
  – Sexual dysfunction
  – Misc. safety and tolerability issues
• Late '97 and early '98
• 692 patients
• Two antidepressants + placebo


The Data
• Background info
  – Age
  – Sex
• Each patient received either
  – Placebo
  – Antidepressant 1
  – Antidepressant 2
• Dosages
• At time points 7 and 14 days we also have:
  – Depression scores
  – Sexual dysfunction indicators
  – Response indicators


Example #2
• Depression Study data
• Examine data for missing values


Build Data Mining Database
• Collect data
• Describe data
• Select data
• Build metadata
• Integrate data
• Clean data
• Load the data mining database
• Maintain the data mining database


Data Warehouse Architecture
• Reference: Data Warehouse: From Architecture to Implementation by Barry Devlin, Addison-Wesley, 1997
• Three-tier data architecture
  – Source data
  – Business data warehouse (BDW): the reconciled data that serves as a system of record
  – Business information warehouse (BIW): the data warehouse you use


[Diagram: data sources feed the Business DW, which in turn feeds the Geographic, Subject, and Data Mining BIWs]

Metadata
• The data survey describes the data set contents and characteristics
  – Table name
  – Description
  – Primary key/foreign key relationships
  – Collection information: how, where, conditions
  – Timeframe: daily, weekly, monthly
  – Cosynchronous: every Monday or Tuesday


Relational Data Bases
• Data are stored in tables

Items
ItemID   ItemName    Price
C56621   top hat     34.95
T35691   cane         4.99
RS5292   red shoes   22.95

Shoppers
PersonID   PersonName   ZIPCode   ItemBought
135366     Lyle         19103     T35691
135366     Lyle         19103     C56621
259835     dick         01267     RS5292
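The two tables relate through the item ID, so "what did each shopper buy, and at what price?" is a join. A minimal sketch using Python's built-in sqlite3 module; the lower-case table and column names are my own, chosen to mirror the slide, and the values are copied from it.

```python
import sqlite3

# Build the two tables from the slide in an in-memory database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (item_id TEXT, item_name TEXT, price REAL)")
con.execute("CREATE TABLE shoppers (person_id TEXT, person_name TEXT,"
            " zipcode TEXT, item_bought TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("C56621", "top hat", 34.95),
    ("T35691", "cane", 4.99),
    ("RS5292", "red shoes", 22.95),
])
con.executemany("INSERT INTO shoppers VALUES (?, ?, ?, ?)", [
    ("135366", "Lyle", "19103", "T35691"),
    ("135366", "Lyle", "19103", "C56621"),
    ("259835", "dick", "01267", "RS5292"),
])

# Join shoppers to items on the item ID to see who bought what
rows = con.execute("""
    SELECT s.person_name, i.item_name, i.price
    FROM shoppers AS s JOIN items AS i ON s.item_bought = i.item_id
    ORDER BY i.price DESC
""").fetchall()
print(rows)
# [('Lyle', 'top hat', 34.95), ('dick', 'red shoes', 22.95), ('Lyle', 'cane', 4.99)]
```

For data mining, such joins are typically run once to flatten the relational tables into a single analysis table.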

RDBMS Characteristics
• Advantages
  – All major DBMSs are relational
  – Flexible data structure
  – Standard language
  – Many applications can directly access RDBMSs
• Disadvantages
  – May be slow for data mining
  – Physical storage required
  – Database administration overhead


Data Selection
• Compute time is determined by the number of cases (rows), the number of variables (columns), and the number of distinct values for categorical variables
  – Reducing the number of variables
  – Sampling rows
• Extraneous columns can result in overfitting your data
  – Employee ID as a predictor of credit risk


Sampling Is Ubiquitous
• The database itself is almost certainly a sample of some population
• Most model building techniques require separating the data into training and testing samples


Model Building
• Model building
  – Train
  – Test
• Evaluate


Overfitting in Regression
• Classical overfitting:
  – Fit a 6th-order polynomial to 6 data points

[Figure: a high-order polynomial fit through 6 data points]

Overfitting
• Fitting non-explanatory variables to data
• Overfitting is the result of
  – Including too many predictor variables
  – Lack of regularization of the model
    · Neural net run too long
    · Decision tree too deep
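The classical example above is easy to reproduce: with 6 points, a polynomial of degree 5 or more achieves zero training error yet behaves wildly away from the data. A pure-Python sketch via Lagrange interpolation; the data values are invented for illustration, and `lagrange_fit` is a hypothetical helper.

```python
def lagrange_fit(xs, ys):
    """Return the unique degree-(n-1) polynomial through the n points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 1.2, 0.9, 1.1, 1.0, 1.05]   # nearly flat, slightly noisy data
p = lagrange_fit(xs, ys)

# Zero error at every training point...
print(all(abs(p(x) - y) < 1e-9 for x, y in zip(xs, ys)))  # True
# ...but just beyond the data the "perfect" fit is absurd
print(round(p(7), 2))  # 27.95, for data that hovers around 1
```

The zero training error is exactly the symptom of overfitting: the model has memorized the noise, not the signal.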


Avoiding Overfitting
• Avoiding overfitting is a balancing act
  – Fit fewer variables rather than more
  – Have a reason for including a variable (other than that it is in the database)
  – Regularize (don’t overtrain)
  – Know your field

All models should be as simple as possible but no simpler than necessary
-- Albert Einstein


Evaluate the Model
• Accuracy
  – Error rate
  – Proportion of explained variation
• Significance
  – Statistical
  – Reasonableness
  – Sensitivity
  – Compute value of decisions
    · The “so what” test


Simple Validation
• Method: split the data into a training data set and a testing data set. A third data set for validation may also be used
• Advantages: easy to use and understand; a good estimate of prediction error for reasonably large data sets
• Disadvantages: lose up to 20%-30% of the data from model building
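The split itself is a one-liner in most tools; here is a dependency-free sketch. The `train_test_split` name, the 25% hold-out fraction, and the fixed seed are illustrative choices of mine.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Randomly hold out a fraction of the rows for testing."""
    rng = random.Random(seed)      # fixed seed makes the split repeatable
    shuffled = rows[:]             # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))            # stand-in for 100 data records
train, test = train_test_split(rows)
print(len(train), len(test))       # 75 25
```

Shuffling before splitting matters: databases are often sorted (by date, by region), and a non-random split would make the test set unrepresentative.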


Train vs. Test Data Sets

[Figure: the data set partitioned into a training portion and a testing portion]

N-fold Cross Validation
• If you don’t have a large amount of data, build a model using all the available data.
  – What is the error rate for the model?
• Divide the data into N equal-sized groups and build a model on the data with one group left out.

[Figure: the data divided into 10 numbered folds]

N-fold Cross Validation
• The missing group is predicted and a prediction error rate is calculated
• This is repeated for each group in turn, and the average over all N repeats is used as the model error rate
• Advantages: good for small data sets; uses all the data to calculate the prediction error rate
• Disadvantages: lots of computing
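The procedure above can be sketched directly. To keep the sketch minimal, the "model" is just the training-set mean and the error is mean squared error; in practice any model-fitting routine goes in its place. The function name and data are mine.

```python
def cross_val_error(y, n_folds=5):
    """Average prediction error over n_folds leave-one-group-out fits."""
    folds = [y[i::n_folds] for i in range(n_folds)]   # N roughly equal groups
    errors = []
    for i in range(n_folds):
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        test = folds[i]
        prediction = sum(train) / len(train)          # "fit" on N-1 folds
        mse = sum((v - prediction) ** 2 for v in test) / len(test)
        errors.append(mse)
    return sum(errors) / len(errors)                  # average over the N folds

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(cross_val_error(data))  # 9.375
```

Every observation is predicted exactly once while held out, which is why all the data contributes to the error estimate.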


Regularization
• A model can be built to closely fit the training set but not the real data
• Symptom: the errors in the training set are reduced, but increased in the test or validation sets
• Regularization minimizes the residual sum of squares adjusted for model complexity
• Accomplished by using a smaller decision tree or by pruning it; in neural nets, by avoiding over-training


Example #3
• Depression Study data
• Fit a tree to DRP using all the variables
  – Continue until the model won’t let you fit any more
• Predict on the test set


Transparent Data Mining Tools
• Visualization
• Regression
  – Logistic regression
• Decision trees
• Clustering methods


Black Box Data Mining Tools
• Neural networks
• K nearest neighbor
• K-means
• Support vector machines
• Genetic algorithms (not a modeling tool)


“Toy” Problem

[Figure: a matrix of scatterplots of the response train2$y against each predictor train2[, i]]

Linear Regression

Term        Estimate   Std Error   t Ratio   Prob>|t|
            -0.900     0.482       -1.860    0.063
Intercept   4.658      0.292       15.950

[Figure: a classification tree splitting into Low and High branches, with leaf counts 1,400 Y / 100 N, 6,000 Y / 0 N, and 0 Y / 1,000 N]

Class Assignment
• The tree is applied to new data to classify it
• A case or instance will be assigned to the largest (or modal) class in the leaf to which it goes
• Example: a leaf containing 1,400 Y and 100 N
• All cases arriving at this node would be given a value of “yes”
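The modal-class rule is one line of code. A minimal sketch with the leaf counts from the example above; `leaf_label` is a hypothetical helper name.

```python
from collections import Counter

def leaf_label(leaf_classes):
    """Return the most common class among the training cases in a leaf."""
    return Counter(leaf_classes).most_common(1)[0][0]

# The leaf above: 1,400 "Y" training cases and 100 "N" cases
leaf = ["Y"] * 1400 + ["N"] * 100
print(leaf_label(leaf))  # Y
```

Note that every case reaching this leaf gets "Y", including the roughly 7% that are actually "N"; the class proportions in the leaf double as an estimate of that error rate.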


Tree Algorithms
• CART (Breiman, Friedman, Olshen, Stone)
• C4.5, C5.0, Cubist (Quinlan)
• CHAID
• SLIQ (IBM)
• QUEST (SPSS)


Decision Trees
• Find the split in a predictor variable that best separates the data into homogeneous groups
• Build the tree inductively, basing future splits on past choices (greedy algorithm)
• Classification trees (categorical response)
• Regression trees (continuous response)
• Size of tree often determined by cross-validation
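The greedy split search can be sketched for one numeric predictor using Gini impurity, the criterion CART uses (C4.5 uses information gain instead). The data and the `best_split`/`gini` helper names are invented for illustration.

```python
def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(x, y):
    """Scan candidate thresholds; keep the one with lowest weighted impurity."""
    best = None
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        if not left or not right:          # skip degenerate splits
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

income = [10, 20, 30, 40, 50, 60]          # illustrative predictor
default = ["N", "N", "N", "Y", "Y", "Y"]   # illustrative response
print(best_split(income, default))  # (30, 0.0): a perfectly pure split
```

A full tree builder simply applies this search to every predictor at every node and recurses on the two resulting groups, which is exactly the greedy behavior described above.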


Geometry of Decision Trees

[Figure: Y and N cases plotted on Debt vs. Household Income; the tree’s axis-aligned splits partition the plane into rectangles]

Two-Way Tables -- Titanic

                     Ticket Class
Survival    Crew    First   Second   Third    Total
Lived        212      202      118     178      710
Died         673      123      167     528     1491
Total        885      325      285     706     2201

[Figure: bar chart of survivors and non-survivors by class (Crew, First, Second, Third)]
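The two-way table already answers the classic question: survival depended strongly on ticket class. A quick computation of the row rates, with the counts copied from the table:

```python
# Survival counts by class, from the Titanic two-way table
lived = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
died = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

rates = {c: lived[c] / (lived[c] + died[c]) for c in lived}
for c, r in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{c}: {r:.0%}")   # First 62%, Second 41%, Third 25%, Crew 24%
```

These marginal rates are what a tree's first split has to work with; the tree diagram that follows refines them by also splitting on sex and age.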


Mosaic Plot

[Figure: mosaic plot of Titanic survival by sex (F/M), age (adult/child), and class (1, 2, 3, Crew)]

Tree Diagram

[Figure: classification tree for Titanic survival; the first split is on sex (F/M), with further splits on adult/child and ticket class; leaf survival rates range from 14% to 100%]

Regression Tree Price
