VLDB Database School (China) 2010 August 3-7, 2010, Shenyang
Lecture Notes Part 1 Mining and Searching Complex Structures
Anthony K.H. Tung(邓锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/~atung
Contents
Chapter 1: Introduction
Chapter 2: High Dimensional Data
Chapter 3: Similarity Search on Sequences
Chapter 4: Similarity Search on Trees
Chapter 5: Graph Similarity Search
Chapter 6: Massive Graph Mining
Mining and Searching Complex Structures
Chapter 1 Introduction
Mining and Searching Complex Structures Introduction Anthony K. H. Tung(鄧锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html Social Network Link: http://www.renren.com/profile.do?id=313870900
What is data mining?
Really nothing different from what scientists have been doing for a long time: the real world generates data; scientists collect the data and verify or construct a model of the real world; a correct, useful model wins a Nobel Prize.
What's new? In data mining, we feed the data in and systematically and efficiently test many statistical models, outputting the most likely model based on some statistical measure.
Components of data mining
• Structure of the model: e.g., "geneA = high and geneB = low ⟹ cancer", or "geneA, geneB and geneC exhibit strong correlation".
• Statistical score for the model: e.g., "the accuracy of rule 1 is 90%".
• Similarity function: is a group of records sufficiently similar to support a certain model or hypothesis?
• Search method for the correct model parameters: given 200 genes, there could be 2^200 rules — which rule gives the best prediction power?
• Database access method: given 1 million records, how to quickly find the relevant records to compute the accuracy of a rule?
The Apriori Algorithm
• Bottom-up, breadth-first search
• Only reads are performed on the database
• Store candidates in memory to simulate the lattice search
• Iteratively follow the two steps (see the sketch below):
 – generate candidates
 – count and get the actual frequent itemsets
[Figure: the itemset lattice searched bottom-up, from {} through the single items a, b, c, e, the pairs a,b / a,c / a,e / b,c / b,e / c,e, the triples a,b,c / a,b,e / a,c,e / b,c,e, up to a,b,c,e.]
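The generate-and-count loop can be written compactly. Below is a minimal Python sketch of the idea — our own illustration, not code from the lecture; the function name, arguments and toy transactions are all made up:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Bottom-up, breadth-first frequent itemset mining (sketch)."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([i]) for t in transactions for i in t}
    frequent, k = [], 1
    while level:
        # Count step: one scan of the database for the whole level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c for c in level if counts[c] >= min_support}
        frequent.extend((set(c), counts[c]) for c in survivors)
        # Generate step: join frequent k-itemsets into (k+1)-candidates,
        # pruning any candidate that has an infrequent k-subset.
        level = {a | b for a in survivors for b in survivors
                 if len(a | b) == k + 1
                 and all(frozenset(s) in survivors
                         for s in combinations(a | b, k))}
        k += 1
    return frequent

# With min_support = 2, itemsets such as {a, b, e} survive to level 3.
print(apriori([{'a','b','e'}, {'a','c'}, {'a','b','c','e'}, {'b','e'}], 2))
```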
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4 steps:
 1. Partition the objects into k nonempty subsets.
 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e. mean point, of the cluster).
 3. Assign each object to the cluster with the nearest seed point.
 4. Go back to step 2; stop when there are no more new assignments.
The K-Means Clustering Method
• Example:
[Figure: a worked 2-D example (both axes from 0 to 10) showing the initial partition, centroid computation, reassignment of points, and the final stable clusters.]
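A minimal sketch of the 4-step loop, assuming 2-D points; this helper and its names are ours, not from the slides:

```python
import random

def kmeans(points, k, max_iter=100):
    """K-means on a list of (x, y) tuples."""
    centroids = random.sample(points, k)            # step 1: initial seeds
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster of the nearest seed.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p[0]-centroids[j][0])**2
                                          + (p[1]-centroids[j][1])**2)
            clusters[i].append(p)
        # Step 2: recompute each centroid as the mean of its cluster.
        new_centroids = [
            (sum(x for x, _ in c)/len(c), sum(y for _, y in c)/len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)]
        if new_centroids == centroids:              # step 4: stable -> stop
            break
        centroids = new_centroids
    return centroids, clusters
```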
Training Dataset (Decision Tree)

Day | Outlook | Temp | Humid | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
Selecting the Next Attribute
S = [9+, 5−], E = 0.940.
Splitting on Humidity:
 – High: [3+, 4−], E = 0.985
 – Normal: [6+, 1−], E = 0.592
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Splitting on Wind:
 – Weak: [6+, 2−], E = 0.811
 – Strong: [3+, 3−], E = 1.0
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
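The entropy and gain figures above can be reproduced with a few lines of Python; this helper is our own illustration, not from the slides:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a [pos+, neg-] split; empty classes contribute 0."""
    total = pos + neg
    return sum(-p/total * log2(p/total) for p in (pos, neg) if p)

def gain(parent, splits):
    """Information gain; parent and each split are (pos, neg) counts."""
    n = sum(p + q for p, q in splits)
    remainder = sum((p + q)/n * entropy(p, q) for p, q in splits)
    return entropy(*parent) - remainder

print(gain((9, 5), [(3, 4), (6, 1)]))   # Humidity -> about 0.151
print(gain((9, 5), [(6, 2), (3, 3)]))   # Wind     -> about 0.048
```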
Selecting the Next Attribute
S = [9+, 5−], E = 0.940.
Splitting on Outlook:
 – Sunny: [2+, 3−], E = 0.971
 – Overcast: [4+, 0−], E = 0.0
 – Rain: [3+, 2−], E = 0.971
Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
ID3 Algorithm
Root split on Outlook over [D1, D2, ..., D14] = [9+, 5−]:
 – Sunny: Ssunny = [D1, D2, D8, D9, D11] = [2+, 3−] → to be split further
 – Overcast: [D3, D7, D12, D13] = [4+, 0−] → Yes
 – Rain: [D4, D5, D6, D10, D14] = [3+, 2−] → to be split further
Choosing the attribute for the Sunny branch:
Gain(Ssunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Ssunny, Temp) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
Decision Tree for PlayTennis

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes
Can we fit what we learn into the framework?

| | Apriori | K-means | ID3 |
| task | rule/pattern discovery | clustering | classification |
| structure of the model | association rules or patterns | clusters | decision tree |
| search space | lattice of all possible combinations of items, size = 2^m | choice of any k points as centers, size = infinity | all possible decision trees, size = potentially infinite |
| score function | support, confidence | square error | accuracy, information gain |
| search/optimization method | breadth-first with pruning | gradient descent | greedy |
| data management technique | TBD | TBD | TBD |
Components of data mining (II)
[Diagram: the models and the enumeration algorithm sit on top of three supporting components — the statistical score function, the similarity/search function, and the database access method — which in turn operate on the database.]
Background knowledge
• We assume you have some basic knowledge about data mining; the following slides will be useful for this purpose:
 – Association rule mining: http://www.comp.nus.edu.sg/~atung/Renmin56.pdf
 – Classification and regression: http://www.comp.nus.edu.sg/~atung/Renmin67.pdf
 – Clustering: http://www.comp.nus.edu.sg/~atung/Renmin78.pdf
IT Trend
• Processors are cheap and will become cheaper (multi-core processors, graphics cards).
• Storage will be cheap but might not be fast.
• Bandwidth will keep growing.
• What can we do with this? Play more realistic games! Not exactly a joke, since any technology that speeds up games can speed up scientific simulation.
• Smarter (more intensive) computation.
• We can store more personal semantics/ontology.
• People can collaborate more over the Internet (Flickr, Wikipedia) to make things more intelligent.
• The AI dream now has the support of much better hardware. Essentially, data mining can be made much simpler for the man on the street. Data mining should be human-centered, not machine-centered.
What is complex data?
• What is "simple" data? A tabular table with a small number of attributes (of the same type) and no missing values.
• What are complex data?
 – High dimensional data: lots of attributes with different data types and missing values, e.g.

Test1 | Gene1 | Progress | comments
Pos | 2.0 | Fever | ...
Neg | -0.3 | Unconscious | ...
N.A. | 5.7 | ... | ...

 – Sequences / time series
 – Trees
 – Graphs
Why complex data?
• They come naturally in many applications, bringing research nearer to the real world.
• Lots of challenges, which means more fun!
• Some fundamental challenges:
 – How do you compare complex objects effectively and efficiently?
 – How do you find a special subset of the data that is interesting?
 – What new types of models and score functions must you use?
 – How do you handle noise and error?
[Figure: a mixed-type table and two labelled trees, T1 and T2, as examples of complex data.]
Personalized Semantics for Personal Data Management
• Everyone will own terabytes of data soon.
• Improve the query/search interface by mining and extracting personalized semantics — entities and their relationships, etc. — by comparing personal data against high-quality tagged databases.
[Diagram: personal data (documents, audio/music, video, photographs/images, webpages/blogs/bookmarks) can be queried by document, audio/music, video, or photograph; a semantic layer of entities (singers, authors, actors/actresses, songs, papers, places, movies) links it to high-quality data sources such as Wikipedia.]
Integrated Approach to Mining Software Engineering Data
• Software engineering data: code bases, change histories, bug reports, runtime traces.
• Integrated into a data warehouse to support decision making and mining. Example: which code module should I modify to create a new function? Which modules need maintenance?
[Diagram: software engineering data (code bases, change history, program states, structural entities, bug reports/natural language, ...) feeds a data warehouse; mining techniques (classification, association/patterns, clustering, ...) support software engineering tasks such as programming, defect detection, testing, debugging and maintenance.]
WikiScience
• Web 2.0: Facebook for scientists.
• A collaborative platform for scientists to build scientific models/hypotheses and share data and applications.
[Diagram: "This is my model of the solar system, based on my supporting dataset" — Model A, with a supporting dataset tagged to it. "Based on some articles, I make some changes to Model A to create Model B" — Model B, with supporting articles tagged to it. The system then constructs a centralized, hybrid Model C.]
Hey, why not Cloud Computing, Map/Reduce?
• These are platforms for scaling services up to large numbers of users on large amounts of data.
• But what exactly do you want to scale up? Services that provide useful and semantically correct information to the users.
• We have too many scalable data mining algorithms that find nothing, or find too many things.
• Let's focus on finding useful things first (assuming we have lots of processing power) and then try to scale up.
Schedule of the Course

Lesson 1: Introduction
Lesson 2: Mining and Searching High Dimensional Data I
Lesson 3: Mining and Searching High Dimensional Data II
Lesson 4: Mining and Searching High Dimensional Data III
Lesson 5: Similarity Search for Sequences and Trees I
Lesson 6: Similarity Search for Sequences and Trees II
Lesson 7: Similarity Search for Graphs I
Lesson 8: Similarity Search for Graphs II
Lesson 9: Similarity Search for Graphs III
Lesson 10: Mining Massive Graphs I
Lesson 11: Mining Massive Graphs II
Lesson 12: Mining Massive Graphs III
Focus of the course
• Techniques that can handle high dimensional, complex structures:
 – Providing semantics to similarity search
 – Shotgun and assembly: column/feature-wise processing using an inverted index
 – Row-wise enumeration
 – Using local properties to infer global properties
• Throughout the course, please try to think about how these techniques apply across different types of complex structures.
Database Queries
• To start off, we will consider something very basic called ranking queries, since we need ranking in any similarity search (usually from most similar to most dissimilar).
• In a relational database, SQL returns all results in one go.
• How many tuples can fit on one screen? How many tuples can you remember?
• Options: summarize the results, or display representative tuples.
• How do we select representative tuples?
Retrieve Relevant Information
• Search for videos related to the Shanghai Expo.
• Too many results: as long as you keep clicking "next", there are 20 more new results.
• Are we interested in all results? No, only the most relevant ones.
• Search engines have to rank the results — which is how they make money.
Question: How to Select a Small Result Set
• Selecting the most representative or most interesting results is not trivial.
• Find an apartment with rental cheaper than 1000, the cheaper the better:
 – The result tuples can be sorted in ascending order of rental price; those in front are more favorable.
• Find an apartment with rental cheaper than 1000 near NEU, the cheaper the better and the nearer the better:
 – An apartment with lower rent may not be near; a nearer one may not be cheap.
 – Order by price? Order by distance?
Top-k Queries
• Define a scoring function, which maps a tuple to a real number: its score.
• The more favorable the tuple, the better its score.
• Define an integer k.
• Answer: the k objects with the best scores.
• Different scoring functions may give different top-k results:

 | Price | Distance to NEU
Apartment A | $800 | 500 meters
Apartment B | $1200 | 200 meters

• Given k = 1, if the scoring function is defined as the sum of price and distance, the first tuple is better; if it is defined as the product, the second tuple is better.
Brute Force Top-k
• Compute the score of every result tuple.
• Sort the tuples in order of score.
• Select the first k tuples (see the sketch below).
• What if the number of tuples is unlimited? Search engines can give an unlimited number of results.
• Even if the number of tuples is limited, it is too slow to compute a score for every tuple.
• We have to do it efficiently.
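A minimal sketch of the brute-force approach on the apartment example. Here a smaller score is treated as more favorable, to match the price/distance example above; the helper and data layout are ours:

```python
import heapq

apartments = [("A", 800, 500), ("B", 1200, 200)]    # (id, price, distance)

def topk(tuples, k, score):
    # Brute force: score every tuple, keep the k best (lowest score).
    return heapq.nsmallest(k, tuples, key=score)

print(topk(apartments, 1, lambda t: t[1] + t[2]))   # sum     -> apartment A
print(topk(apartments, 1, lambda t: t[1] * t[2]))   # product -> apartment B
```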
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
Monotonicity
• A scoring function f is monotone if f(x1, x2, ..., xm) ≤ f(y1, y2, ..., ym) whenever xi ≤ yi for every i.
• Select the top-3 students with the highest total score in mathematics, physics and computer science:

select name, math+phys+comp as score
from student
order by score desc
limit 3

• sum is monotone: sum(x.math, x.phys, x.comp) ≤ sum(y.math, y.phys, y.comp) if x.math ≤ y.math, x.phys ≤ y.phys and x.comp ≤ y.comp.
Sorted Lists
• We shall think of the database as consisting of m sorted lists L1, L2, ..., Lm (here Lmath, Lphys, Lcomp):

Lmath | Lphys | Lcomp
Ann 98 | Hugh 97 | Kurt 96
Ben 96 | Ryan 94 | Ann 95
Kurt 93 | Ann 92 | Jane 95
Hugh 91 | Kurt 91 | Ben 93
Carl 90 | Jane 89 | Hugh 92
... | ... | ...
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
Fagin's Algorithm (I)
• Do sequential accesses on all lists until there are at least k matches, i.e. k objects seen in every list.
(Same sorted lists as above.)
• With k = 3, sequential accesses stop once 3 students — Ann, Hugh and Kurt — have been seen in all three lists.
Fagin's Algorithm (II)
• For each object that has been seen, do random accesses on the other lists to compute its score.
• Here, random accesses need to be done for Ben, Carl, Jane and Ryan.
Fagin's Algorithm (III)
• Select the k objects with the highest scores as the top-k result.
Why is FA correct? (I)
• When sequential access stops, at least k objects have been seen on all attributes.
• By monotonicity, the objects that have not been seen cannot have a higher score than these k objects.
Why is FA correct? (II)
• For every object that has been seen, either all of its attributes have been seen, or random accesses are performed to obtain the missing ones.
• The k objects with the highest scores are therefore the top-k result.
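A minimal sketch of FA in Python — our own illustration, not the original algorithm pseudocode. The toy lists mirror the slides' first five rows; the tail entries below depth 5 are invented so that the random accesses have something to look up:

```python
def fagin(lists, k, score=sum):
    """FA over m lists of (object, value) pairs sorted descending."""
    index = [dict(L) for L in lists]          # random access per list
    seen = [set() for _ in lists]
    depth = 0
    # Phase 1: sequential accesses until k objects occur in every list.
    while len(set.intersection(*seen) if depth else set()) < k:
        for L, s in zip(lists, seen):
            s.add(L[depth][0])
        depth += 1
    # Phase 2: random accesses for every object seen anywhere.
    candidates = set.union(*seen)
    scored = {o: score(idx[o] for idx in index) for o in candidates}
    return sorted(scored.items(), key=lambda kv: -kv[1])[:k]

lists = [
    [("Ann", 98), ("Ben", 96), ("Kurt", 93), ("Hugh", 91), ("Carl", 90),
     ("Ryan", 88), ("Jane", 87)],    # tails beyond the slides are made up
    [("Hugh", 97), ("Ryan", 94), ("Ann", 92), ("Kurt", 91), ("Jane", 89),
     ("Ben", 88), ("Carl", 85)],
    [("Kurt", 96), ("Ann", 95), ("Jane", 95), ("Ben", 93), ("Hugh", 92),
     ("Carl", 88), ("Ryan", 86)],
]
print(fagin(lists, 3))   # -> Ann 285, then Kurt and Hugh at 280
```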
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
The Threshold Algorithm (I)
• Do sequential accesses on all lists in parallel. When an object is seen, do random accesses on the other lists to compute its score.
• Here, random accesses are done for Ann, Hugh and Kurt first, then for Ben and Ryan.
The Threshold Algorithm (II)
• Remember the k objects with the highest scores, together with their scores.
• Score(Ann) = 285, Score(Hugh) = 280, Score(Kurt) = 280.
The Threshold Algorithm (III)
• Let the threshold value τ be the value of the scoring function on the last-seen values of all sorted lists.
• Halt as soon as at least k objects have score at least τ.
• Here τ(1) = 291, τ(2) = 285, τ(3) = 280, so with k = 3 the algorithm halts at depth 3.
Why is TA correct?
• By monotonicity, no unseen object can have a score higher than τ.
• For the objects that have been seen, random accesses give their exact scores, so the k objects with the highest scores are the top-k result.
(τ(1) = 291, τ(2) = 285, τ(3) = 280.)
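A matching sketch of TA, reusing the `lists` variable from the FA sketch above (again our own illustration):

```python
def threshold_algorithm(lists, k, score=sum):
    """TA over descending (object, value) lists with random access."""
    index = [dict(L) for L in lists]
    best = {}                                 # object -> exact score
    for depth in range(len(lists[0])):
        # One sorted access per list; random accesses for each new object.
        for L in lists:
            obj = L[depth][0]
            best[obj] = score(idx[obj] for idx in index)
        # Threshold: the scoring function on the last-seen values.
        tau = score(L[depth][1] for L in lists)
        topk = sorted(best.items(), key=lambda kv: -kv[1])[:k]
        if len(topk) == k and topk[-1][1] >= tau:
            return topk                       # k objects already reach tau

print(threshold_algorithm(lists, 3))   # halts at depth 3 with tau = 280
```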
Comparing TA with FA
• Number of sequential accesses: by the time FA stops its sequential accesses, τ is guaranteed to be no higher than the scores of the k objects seen on all sorted lists, so TA stops no later.
• Number of random accesses: TA requires m−1 random accesses for each object seen, but FA is expected to random-access more objects.
• Size of buffers used: the buffer used by FA can be unbounded; TA only needs to remember k objects with their k scores, plus the threshold value τ.
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
Random Access
• Random accesses can be impossible: in text retrieval, the sorted lists are the results of search engines.
• Random accesses can be expensive: sequential accesses on disk are orders of magnitude faster than random accesses.
• We need to consider not using random accesses at all, or using as few of them as possible.
No Random Access
• Without random access, all we know are upper bounds.
(Same sorted lists Lmath, Lphys, Lcomp as above.)
• Example: Carl's scores on physics and computer science are not higher than 89 and 92 respectively.
Lower and Upper Bounds
• If an object has not been seen on some attribute:
 – its lower bound on that attribute is 0;
 – its upper bound is the last value seen on that list.
• Example: the lower bound of Carl's score on physics is 0; the upper bound is 89.
Worst and Best Scores (I)
• W(R): the worst possible score of tuple R (unseen attributes at their lower bounds).
• B(R): the best possible score of tuple R (unseen attributes at their upper bounds).
• Example: W(Carl) = 90 and B(Carl) = 90 + 89 + 92 = 271.
Worst and Best Scores (II)
• W(R) ≤ score of R ≤ B(R); W(R) and B(R) get updated as values are sequentially accessed.
• After depth 1: W(Ann) = 98, W(Hugh) = 97, W(Kurt) = 96; B = 291 for all three.
• After depth 2: W(Ann) = 98→193, W(Hugh) = 97, W(Kurt) = 96, W(Ben) = 96, W(Ryan) = 94; B(Ann) = 291→287, B(Hugh) = 291→288, B(Kurt) = 291→286, B(Ben) = 285, B(Ryan) = 285.
• After depth 3: W(Ann) = 193→285, W(Hugh) = 97, W(Kurt) = 96→189, W(Ben) = 96, W(Ryan) = 94, W(Jane) = 95; B(Ann) = 287→285, B(Hugh) = 288→285, B(Kurt) = 286→281, B(Ben) = 285→283, B(Ryan) = 285→282, B(Jane) = 280.
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
No Random Access Algorithm (I)
• Maintain the last-seen values x1, x2, ..., xm.
• For every seen object, maintain its worst possible score together with its known attributes and their values.
• Example after two rounds: xmath = 96, xphys = 94, xcomp = 95; Ann is recorded with worst score 193 and known attributes {math = 98, comp = 95}.
No Random Access Algorithm (II)
• Why not also maintain the best possible score for each object? Because B values change whenever any last-seen value changes — they would be updated too frequently!
No Random Access Algorithm (III)
• Let M be the kth largest W value. An object R is viable if B(R) ≥ M.
• After depth 4: W(Ann) = 285, W(Hugh) = 97→188, W(Kurt) = 189→280, W(Ben) = 96→189, W(Ryan) = 94, W(Jane) = 95; B(Ann) = 285, B(Hugh) = 285→280, B(Kurt) = 281→280, B(Ben) = 283→280, B(Ryan) = 282→278, B(Jane) = 280→277; M = 189.
• After depth 5: W(Ann) = 285, W(Hugh) = 188→280, W(Kurt) = 280, W(Ben) = 189, W(Ryan) = 94, W(Jane) = 95→184; B(Ann) = 285, B(Hugh) = 285→280, B(Kurt) = 280, B(Ben) = 280→278, B(Ryan) = 278→276, B(Jane) = 277→274; M = 280.
No Random Access Algorithm (IV)
• Let the set T contain the objects with W(R) ≥ M.
• Halt when:
 – at least k objects have been seen on all sorted lists, and
 – no viable object is left outside T.
• Here T = {Ann, Hugh, Kurt}: every other object's B has fallen below M = 280, so the algorithm stops.
Why is NRA correct?
• W(R) ≤ score of R ≤ B(R) always holds.
• If an object R is not viable, then score of R ≤ B(R) < M, so there are at least k objects with scores not lower than R's.
• Therefore, if there is no viable object outside T and T contains at least k objects, T is the top-k result.
Comparing NRA with TA
• Number of sequential accesses: NRA makes at least as many sequential accesses as the last position of a top-k result on any attribute.
• Number of random accesses: NRA obviously makes 0.
• Size of buffers used: TA remembers k objects with their k scores plus the threshold value τ; NRA remembers all viable objects with their scores on all seen attributes, plus the last-seen value on every attribute.
How deep can NRA go?

L1: Ann 98, Hugh 97, Ben 60, Ryan 60, Carl 60, ..., Jane 60, Kurt 0
L2: Hugh 97, Kurt 96, Ryan 60, Ben 60, Jane 60, ..., Carl 60, Ann 0
L3: Kurt 96, Ann 95, Jane 60, Ben 60, Carl 60, ..., Ryan 60, Hugh 0

• The set T can be identified quickly, but the exact scores of its members only become certain at the very end of the lists.
• If we allow a relatively small number of random accesses, scanning the entire lists can be avoided.
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
The Combined Algorithm (I)
• CA combines TA and NRA.
• cR: the cost of a random access; cS: the cost of a sequential access.
• Set h = ⌊cR/cS⌋.
• Run NRA, but every h steps run random accesses, as in TA.
• h = ∞ → never do random accesses: CA is then exactly NRA.
The Combined Algorithm (II)
(Same lists as in "How deep can NRA go?".)
• Random accesses for Ann, Hugh and Kurt quickly pin down their exact scores, so scanning to the bottom of the lists is avoided.
The Combined Algorithm (III)
• In CA, by doing random accesses we wish to either confirm that an object is a top-k result, or prune a viable object.
• As the number of random accesses in CA is limited, various heuristics can be used to optimize CA in terms of total cost.
Reference • Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
Mining and Searching Complex Structures
Chapter 2 High Dimensional Data
Mining and Searching Complex Structures High Dimensional Data Anthony K. H. Tung(鄧锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html Social Network Link: http://www.renren.com/profile.do?id=313870900
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Sources of High Dimensional Data
• Microarray gene expression
• Text documents
• Images
• Features of sequences, trees and graphs
• Audio, video, human motion databases (spatio-temporal as well!)
Challenges of High Dimensional Data
• Indistinguishability: the distance between the two nearest points and the two furthest points could be almost the same.
• Sparsity: as a result, the data distribution is very sparse, giving no obvious indication of where the interesting knowledge is.
• Large number of combinations:
 – Efficiency: how to test the huge number of combinations?
 – Effectiveness: how do we understand and interpret so many combinations?
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Similarity Search: The Traditional Approach
• Objects are represented by multidimensional vectors, e.g.

Elevation | Aspect | Slope | Hillshade (9am) | Hillshade (noon) | Hillshade (3pm) | ...
2596 | 51 | 3 | 221 | 232 | 148 | ...

• The traditional approach to similarity search is the kNN query. Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1):

ID | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10 | Dist
P1 | 1.1 | 1 | 1.2 | 1.6 | 1.1 | 1.6 | 1.2 | 1.2 | 1 | 1 | 0.93
P2 | 1.4 | 1.4 | 1.4 | 1.5 | 1.4 | 1 | 1.2 | 1.2 | 1 | 1 | 0.98
P3 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1.73
P4 | 20 | 20 | 21 | 20 | 22 | 20 | 20 | 19 | 20 | 20 | 57.7
P5 | 19 | 21 | 20 | 20 | 20 | 21 | 18 | 20 | 22 | 20 | 60.5
P6 | 21 | 21 | 18 | 19 | 20 | 19 | 21 | 20 | 20 | 20 | 59.8
Deficiencies of the Traditional Approach
• The distance is dominated by a few dimensions with high dissimilarity.
• Partial similarities cannot be discovered.
• Example: change one attribute of each near point to an outlying value of 100 (d2 of P1, d6 of P2, d8 of P3); their distances to Q jump from 0.93, 0.98 and 1.73 to about 99.0, so the three near-matches become indistinguishable from the far points P4-P6.
Thoughts
• Aggregating too many dimensional differences into a single value results in too much information loss. Can we try to reduce that loss?
• While high dimensional data typically causes problems when it comes to similarity search, can we turn what is against us into an advantage?
• Our approach: since we have so many dimensions, we can compute more complex statistics over them to overcome some of the "noise" introduced by dimension scaling, outliers, etc.
The N-Match Query: Warm-Up
• Description: match two objects in n of the d dimensions (n ≤ d); the n dimensions are chosen dynamically to make the two objects match best.
• How to define a "match": exact match, or match with tolerance δ.
• In the similarity search example with n = 6, the n-match differences become P1: 0.2, P2: 0.4, P3: 0 (despite the outlying values of 100), and P4, P5, P6: 19 — the partially similar points P1-P3 now rank first.
The N-Match Query: The Definition
• The n-match difference: given two d-dimensional points P(p1, p2, ..., pd) and Q(q1, q2, ..., qd), let δi = |pi − qi|, i = 1, ..., d. Sort the array {δ1, ..., δd} in increasing order to get {δ1', ..., δd'}. Then δn' is the n-match difference between P and Q.
• The n-match query: given a d-dimensional database DB, a query point Q and an integer n (n ≤ d), find the point P ∈ DB that has the smallest n-match difference to Q. P is called the n-match of Q.
[Figure: 2-D points A-E around query Q; A is the 1-match of Q and B is the 2-match.]
• In the similarity search example with n = 8, P1's 8-match difference rises to 0.6 and P3's to 1, while P4-P6 stay far away.
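The definition translates directly into code. A minimal sketch using the slides' data (the function names and the dictionary layout are ours):

```python
def n_match_diff(p, q, n):
    """n-match difference: the n-th smallest per-dimension difference."""
    return sorted(abs(a - b) for a, b in zip(p, q))[n - 1]

def k_n_match(db, q, k, n):
    """The k points whose n-match difference to q is smallest."""
    return sorted(db, key=lambda name: n_match_diff(db[name], q, n))[:k]

q = (1,) * 10
db = {"P1": (1.1, 100, 1.2, 1.6, 1.1, 1.6, 1.2, 1.2, 1, 1),
      "P3": (1, 1, 1, 1, 1, 1, 2, 100, 2, 2)}
print(n_match_diff(db["P1"], q, 6))   # ~0.2: the 100 outlier is ignored
print(n_match_diff(db["P3"], q, 6))   # 0
print(k_n_match(db, q, 1, 6))         # ['P3']
```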
The N-Match Query: Extensions
• The k-n-match query: given a d-dimensional database DB, a query point Q, an integer k and an integer n, find a set S of k points from DB such that for any point P1 ∈ S and any point P2 ∈ DB−S, P1's n-match difference is smaller than P2's. S is called the k-n-match of Q.
• The frequent k-n-match query: given DB, Q, an integer k and an integer range [n0, n1] within [1, d], let S0, ..., Si be the answer sets of the k-n0-match, ..., k-n1-match queries; find a set T of k points such that for any P1 ∈ T and any P2 ∈ DB−T, P1's number of appearances in S0, ..., Si is larger than or equal to P2's.
[Figure: in the 2-D example, the 2-1-match of Q is {A, D} and the 2-2-match is {A, B}.]
• In the running example with n = 6, the 2-6-match of Q is {P3, P1} (differences 0 and 0.2).
Cost Model
• The multiple-system information retrieval model: objects are stored in different systems and scored by each system; each system can sort the objects according to their scores; a query retrieves the scores of objects from the different systems and combines them using some aggregation function.
• Q: color = "red" & shape = "round" & texture = "cloud". Each system lists objects in ascending order of score:

System 1 (Color): 1:0.4, 2:2.8, 5:3.5, 3:6.5, 4:9.0
System 2 (Shape): 1:1.0, 5:1.5, 2:5.5, 3:7.8, 4:9.0
System 3 (Texture): 1:1.0, 2:2.0, 3:5.0, 5:8.0, 4:9.0

• The cost: retrieval of scores, proportional to the number of scores retrieved.
• The goal: to minimize the number of scores retrieved.
The AD Algorithm
• The AD algorithm for the k-n-match query:
 – Locate the query's attribute value in every dimension.
 – Retrieve the objects' attribute values, moving away from the query's attribute in both directions.
 – The objects' attributes are retrieved in Ascending order of their Differences to the query's attributes; an n-match is found when an object has appeared n times.
• Example: the 2-2-match of Q = (3.0, 7.0, 4.0) over the three systems above. Auxiliary structures: g[2d] holds the next attribute to retrieve in each direction, appear[c] counts each object's appearances, and S is the answer set. Object 3 reaches two appearances first, then object 2, giving S = {3, 2}.
The AD Algorithm: Extensions
• The AD algorithm for the frequent k-n-match query:
 – Given an integer range [n0, n1], find the k-n0-match, k-(n0+1)-match, ..., k-n1-match of the query: S0, S1, ..., Si.
 – Find the k objects that appear most frequently in S0, S1, ..., Si.
 – This retrieves the same number of attributes as processing a single k-n1-match query.
• Disk-based solutions for the (frequent) k-n-match query:
 – Disk-based AD algorithm: sort each dimension and store it sequentially on disk; when reaching the end of a disk page, read the next page.
 – Existing indexing techniques: tree-like structures (R-trees, k-d-trees); mapping-based indexing (space-filling curves, iDistance); sequential scan; compression-based approaches (VA-file).
Experiments: Effectiveness
• Searching by k-n-match: COIL-100 database, 54 extracted features such as color histograms and area moments.

k-n-match query, k = 4:
n = 5: images 36, 42, 78, 94
n = 10: images 27, 35, 42, 78
n = 15: images 3, 38, 42, 78
n = 20: images 27, 38, 42, 78
n = 25: images 35, 40, 42, 94
n = 30: images 10, 35, 42, 94
n = 35 to 50: images 35, 42, 94, 96

Frequent k-n-match query (10 images returned): 13, 35, 36, 40, 42, 64, 85, 88, 94, 96.

• Searching by frequent k-n-match: UCI Machine Learning Repository data; competitors are IGrid and Human-Computer Interactive NN search (HCINN).

Data set (d) | IGrid | HCINN | Freq. k-n-match
Ionosphere (34) | 80.1% | 86% | 87.5%
Segmentation (19) | 79.9% | 83% | 87.3%
Wdbc (30) | 87.1% | N.A. | 92.5%
Glass (9) | 58.6% | N.A. | 67.8%
Iris (4) | 88.9% | N.A. | 89.6%

Experiments: Efficiency
• Disk-based algorithms for the frequent k-n-match query: Texture dataset (68,040 records) and a uniform dataset (100,000 records); competitors: the AD algorithm, the VA-file, and sequential scan. [figures]
Experiments: Efficiency (continued)
• Comparison with other similarity search techniques on the Texture and synthetic datasets; competitors: the frequent k-n-match query using the AD algorithm, IGrid, and sequential scan. [figures]
Future Work (I)
• We now have a natural way to handle similarity search for data with both categorical and numerical attributes. Investigating k-n-match performance on such mixed-type data is currently under way.
• Likewise, applying k-n-match to data with missing or uncertain attributes will be interesting.
• Query = {1, 1, 1, 1, 1, 1, 1, M, No, R}:

ID | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10
P1 | 1.1 | 1 | 1.2 | 1.6 | 1.1 | 1.6 | 1.2 | M | Yes | R
P2 | 1.4 | 1.4 | 1.4 | 1.5 | 1.4 | 1 | 1.2 | F | No | B
P3 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | M | No | B
P4 | 20 | 20 | 21 | 20 | 22 | 20 | 20 | M | Yes | G
P5 | 19 | 21 | 20 | 20 | 20 | 21 | 18 | F | Yes | R
P6 | 21 | 21 | 18 | 19 | 20 | 19 | 21 | F | Yes | Y
Future Work (I) (continued)
[The same query and table, now shown with several attribute values blanked out, illustrating search over data with missing values.]
Future Work (II)
• In general, three things affect the result of a similarity search: noise, scaling, and axis orientation. k-n-match reduces the effect of noise; the ultimate aim is a similarity function that is robust to all three.
• Eventually we will look at creating mining algorithms based on k-n-match.
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Motivation
[Diagram: queries against large data sets return results through a compressed summary.]
• Ever-increasing data collection rates of modern enterprises, and the need for effective, guaranteed-quality approximate answers to queries.
• Concern: compress as much as possible.
Conventional Compression Methods
• Try to find the optimal encoding of arbitrary strings in the input data: Huffman coding, Lempel-Ziv coding (gzip).
• View the whole table as one large byte string.
• Statistical or dictionary based; operate at the byte level.
Why not just "syntactic"?
• Syntactic methods do not exploit the complex dependency patterns in the table.
• Individual retrieval of tuples is difficult.
• They do not utilize lossy compression.
Semantic compression methods
• Derive a descriptive model M.
• Identify which data values can be derived from M (within some error tolerance), which are essential for deriving, and which are the outliers.
• Derived values need not be stored; only the outliers do.
Advantages
• More complex analysis, e.g. detecting correlation among columns.
• Fast retrieval: tuple-wise access.
• Query enhancement: it is possible to answer queries directly from the discovered semantics, and to compress in a way that enhances the answering of some complex queries, e.g. "Go Green: Recycle and Reuse Frequent Patterns", C. Gao, B. C. Ooi, K. L. Tan and A. K. H. Tung, ICDE 2004.
• Choose a combination of compression methods based on semantic and syntactic information.
Fascicles
• Key observation: often, numerous subsets of records in T have similar values for many attributes.

Protocol | Duration | Bytes | Packets
http | 12 | 20K | 3
http | 16 | 24K | 5
http | 15 | 20K | 8
http | 19 | 40K | 11
http | 26 | 58K | 18
ftp | 27 | 100K | 24
ftp | 32 | 300K | 35
ftp | 18 | 80K | 15

• Compress the data by storing representative values (e.g., a "centroid") only once for each attribute cluster.
• Lossy compression: the information loss is controlled by the notion of "similar values" for attributes (user-defined).
ItCompress: Compression Format

Representative rows (patterns):
RRid | age | salary | credit | sex
1 | 30 | 90k | good | F
2 | 70 | 35k | poor | M

Error tolerance: age 5, salary 25k, credit 0, sex 0.

Original table → compressed table (RRid, bitmap, outlying values):
(20, 30k, poor, M) → 2, 0111, {age = 20}
(25, 76k, good, F) → 1, 1111
(30, 90k, good, F) → 1, 1111
(40, 100k, poor, M) → 1, 0100, {age = 40, credit = poor, sex = M}
(50, 110k, good, F) → 1, 0111, {age = 50}
(60, 50k, good, M) → 1, 0010, {age = 60, salary = 50k, sex = M}
(70, 35k, poor, F) → 2, 1110, {sex = F}
(75, 15k, poor, M) → 2, 1111
Some definitions
• Error tolerance:
 – Numeric attributes: the upper bound on how much the compressed value x' may differ from the actual value x, i.e. x ∈ [x' − ei, x' + ei].
 – Categorical attributes: the upper bound on the probability that the compressed value differs from the actual value; given an actual value x and error tolerance ei, the compressed value x' should satisfy Prob(x = x') ≥ 1 − ei.
Some definitions
• Coverage: let R be a row in the table T and Pi a pattern; the coverage of Pi on R is
 cov(Pi, R) = the number of attributes Xi for which R[Xi] is matched by Pi[Xi].
• Total coverage: let P be a set of patterns P1, ..., Pk and let the table T contain n rows R1, ..., Rn; then
 totalcov(P, T) = Σ_{i=1..n} cov(Pmax(Ri), Ri),
 where Pmax(Ri) is the pattern that covers Ri the most.
ItCompress: basic algorithm
• First randomly choose k rows as the initial patterns.
• Phase 1 — scan the table T: for each row R, compute the coverage of each pattern on it, find Pmax(R), and allocate R to its most-covering pattern.
• Phase 2 — after each iteration, recompute every pattern's attribute values, always using the most frequent values in its group.
• Iterate until the total coverage stops increasing.
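A minimal sketch of the two-phase loop, with simplifications we chose ourselves: a fixed number of iterations instead of the convergence test, the first k rows as initial patterns instead of random ones, and the plain mode for numeric attributes (the paper works with value intervals):

```python
from collections import Counter

def covers(pattern, row, tol):
    """Per-attribute match flags of a row against a pattern."""
    return [abs(r - p) <= t if isinstance(p, (int, float)) else r == p
            for p, r, t in zip(pattern, row, tol)]

def itcompress(rows, k, tol, iters=20):
    patterns = [list(r) for r in rows[:k]]        # simple initial choice
    for _ in range(iters):
        # Phase 1: assign each row to the pattern covering it the most.
        groups = [[] for _ in range(k)]
        for row in rows:
            best = max(range(k),
                       key=lambda i: sum(covers(patterns[i], row, tol)))
            groups[best].append(row)
        # Phase 2: rebuild each pattern from the most frequent values.
        for i, g in enumerate(groups):
            if g:
                patterns[i] = [Counter(col).most_common(1)[0][0]
                               for col in zip(*g)]
    return patterns
```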
Example: the 1st iteration begins
• Two rows, (20, 30k, poor, M) and (25, 76k, good, F), are chosen as the initial representative rows (patterns 1 and 2).
Example: Phase 1
• Each row is allocated to the pattern that covers it the most: (20, 30k, poor, M), (40, 100k, poor, M), (60, 50k, good, M), (70, 35k, poor, F) and (75, 15k, poor, M) go to pattern 1, while (25, 76k, good, F), (30, 90k, good, F) and (50, 110k, good, F) go to pattern 2.
Example: Phase 2
• Each pattern's attribute values are recomputed from the rows allocated to it, always taking the most frequent values: pattern 1's age changes from 20 to 70, and pattern 2's salary changes from 76k to 90k.
Convergence (I)
• Phase 1: when we assign the rows to their most-covering patterns, each row's coverage increases or stays the same, so the total coverage also increases or stays the same.
• Phase 2: when we recompute the attribute values of the patterns, each pattern's coverage increases or stays the same, so again the total coverage increases or stays the same.
Convergence (II)
• In both phases the total coverage either increases or stays the same, and it has an obvious upper bound (covering the whole table), so the algorithm eventually converges.
Complexity
• Phase 1: over l iterations, we go through the n rows of the table and match each row against the k patterns (2m comparisons each), giving a running time of O(kmnl), where m is the number of attributes.
• Phase 2: computing each new pattern Pi requires going through all the domain values/intervals of each attribute; assuming the total number of domain values/intervals is d, the running time is O(kdl).
• Total time complexity: O(kmnl + kdl).
Advantages of ItCompress
• Simplicity and directness: Fascicles and SPARTAN use a two-phase process (find rules/patterns, then compress the database using them); ItCompress optimizes the compression directly, without finding rules/patterns that may not be useful (a.k.a. the microeconomic approach).
• Fewer constraints: it does not require patterns to be matched completely, or rules that apply globally.
• Easily tuned parameters.
Performance Comparison
• Algorithms: ItCompress, ItCompress+gzip; Fascicles, Fascicles+gzip; SPARTAN+gzip.
• Platform: ItCompress and Fascicles on an AMD Duron 700 MHz with 256 MB memory; SPARTAN on four 700 MHz Pentium CPUs with 1 GB memory.
• Datasets:
 – Corel: 32 numeric attributes, 35,000 rows, 10.5 MB
 – Census: 7 numeric and 7 categorical attributes, 676,000 rows, 28.6 MB
 – Forest-cover: 10 numeric and 44 categorical attributes, 581,000 rows, 75.2 MB
Effectiveness (Corel) [figure]
Effectiveness (Census) [figure]
Effectiveness (Forest Cover) [figure]
Efficiency [figure]
Varying k [figure]
Varying Sample Ratio [figure]
Adding Noise (Census) [figure]
Effect of Corruption
[Figures: compression results on attributes A1-A12 with 20% of the values corrupted.]
Findings
• ItCompress is more efficient than SPARTAN, more effective than Fascicles, insensitive to parameter settings, and robust to noise.
Future work
• Can we perform mining on the compressed datasets using only the patterns and the bitmaps? Example: building a Bayesian belief network.
• Is ItCompress a good "bootstrap" semantic compression algorithm?
[Diagram: database → ItCompress → compressed database → other semantic compression algorithms.]
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Metric vs. Non-Metric
• Euclidean distance dominates DB queries, but similarity in human perception works differently.
• Metric distance is not enough!
Bregman Divergence
[Figure: a convex function f(x) with points (p, f(p)) and (q, f(q)); the Bregman divergence Df(p, q) is the vertical gap at p between f and the tangent of f at q, while the Euclidean distance between p and q lies along the x-axis.]
Bregman Divergence
• Mathematical interpretation: the divergence from q to p is defined as the difference at p between f(p) and the first-order Taylor expansion of f at q:

 Df(p, q) = f(p) − f(q) − ⟨∇f(q), p − q⟩
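The definition is only a few lines of code. A sketch of our own, checking that f(x) = ||x||² recovers the squared Euclidean distance and that f(x) = Σ x log x recovers the KL-divergence:

```python
import numpy as np

def bregman(f, grad_f, p, q):
    """D_f(p, q) = f(p) - f(q) - <grad f(q), p - q>."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return f(p) - f(q) - grad_f(q) @ (p - q)

sq = lambda x: float(x @ x)                     # f(x) = ||x||^2
sq_grad = lambda x: 2 * x
print(bregman(sq, sq_grad, [1.0, 2.0], [0.0, 0.0]))   # 5.0 = squared L2

kl = lambda x: float(np.sum(x * np.log(x)))     # f(x) = sum x log x
kl_grad = lambda x: np.log(x) + 1
print(bregman(kl, kl_grad, [.2, .8], [.5, .5]))       # KL(p || q)
```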
Bregman Divergence
• General properties:
 – Non-negativity: Df(p, q) ≥ 0 for any p, q.
 – Identity of indiscernibles: Df(p, p) = 0 for any p.
 – Symmetry and the triangle inequality do NOT hold any more.
Examples

Distance | f(x) | Df(p, q) | Usage
KL-divergence | x log x | p log(p/q) | distributions, color histograms
Itakura-Saito distance | −log x | p/q − log(p/q) − 1 | signal, speech
Squared Euclidean | x² | (p − q)² | Euclidean space
Von Neumann entropy | tr(X log X − X) | tr(X log X − X log Y − X + Y) | symmetric matrices
Why in a DB system?
• Database applications: retrieval of similar images, speech signals, or time series; optimization over matrices in machine learning. Efficiency is important!
• Query types: nearest neighbor queries and range queries.
Euclidean Space
• How to answer the queries in Euclidean space: the R-tree. [figure]
• Or the VA file. [figure]
Our goal
• Re-use the infrastructure of existing DB systems to support Bregman divergences: storage management, indexing structures, and query processing algorithms.
Basic Solution
• Extended space: for the convex function f(x) = x², append an extra dimension holding the value of f at each point:

point | D1 | D2 | → | point | D1 | D2 | D3
p | 0 | 1 | → | p+ | 0 | 1 | 1
q | 0.5 | 0.5 | → | q+ | 0.5 | 0.5 | 0.5
r | 1 | 0.8 | → | r+ | 1 | 0.8 | 1.64
t | 1.5 | 0.3 | → | t+ | 1.5 | 0.3 | 3.15
Basic Solution
• After the extension: index the extended points with an R-tree or VA file, and re-use the existing algorithms with lower and upper bounds on the rectangles.
How to improve?
• Reformulate the Bregman divergence and derive tighter bounds, with no change to index construction or the query processing algorithms.
A New Formulation
[Figure: with respect to a query vector vq, the reformulated divergence D*f(p, q) differs from Df(p, q) only by a query-dependent shift Δ.]
Mathematical Interpretation
• Reformulation of the similarity search queries:
 – k-NN query: given query q, data set P and divergence Df, find the point p minimizing Df(p, q).
 – Range query: given query q and threshold θ, return every point p with Df(p, q) ≤ θ.
Naïve Bounds
• Check the corners of the bounding rectangles. [figure]
Tighter Bounds
• Take the curvature of f(x) into consideration. [figure]
Query distribution
• Distortion of a rectangle: the difference between the maximum and minimum distances from inside the rectangle to the query.
Can we improve it more?
• When building an R-tree in Euclidean space, we minimize the volume/edge length of the MBRs — does this remain valid here?
Query distribution
• The distortion of a bounding rectangle is invariant in Euclidean space (by the triangle inequality), but query-dependent for a Bregman divergence.
Utilizing the Query Distribution
• Summarize the query distribution with O(d) real numbers.
• Estimate the expected distortion of any bounding rectangle in O(d) time.
• This allows better indexes to be constructed for both the R-tree and the VA file.
Experiments
• Data sets:
 – KDD'99 data: network data; the proportion of packets in 72 different TCP/IP connection types.
 – DBLP data: the co-authorship graph is used to generate the probabilities of authors being related to 8 different areas.
Experiments
• Data sets (continued):
 – Uniform synthetic data: generated with a uniform distribution.
 – Clustered synthetic data: generated with a Gaussian mixture model.
Experiments
• Methods compared:

 | Basic | Improved bounds | Query distribution
R-tree | R | R-B | R-BQ
VA file | V | V-B | V-BQ
Linear scan | LS | |
BB-tree | BBT | |
Existing Solution
• The BB-tree (L. Cayton, ICML 2008): a memory-based indexing tree constructed with k-means clustering; hard to update and ineffective in high-dimensional space.
Experiments
• Index construction time. [figure]
Experiments
• Varying dimensionality. [figures]
Experiments
• Varying data cardinality. [figure]
Conclusion
• A general technique for similarity search under Bregman divergences.
• All techniques build on the existing infrastructure of commercial databases.
• Extensive experiments compare the performance of the R-tree and VA file under the different optimizations.
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Motivation
• Probabilistic data is ubiquitous: it represents data uncertainty (WSN, RFID, moving-object monitoring) and is used to compress data (image processing).
• Histograms are a good way to represent probabilistic data: easy to capture, and very useful in image representation (colors, textures, gradients, depth).
Motivation
• Similarity search is important for managing probabilistic data: given a threshold θ, we can answer which sensors' readings are similar to sensor A's (range query), or which k pictures are most similar (top-k query).
• The similarity function for probabilistic data should be carefully chosen:
 – Bin-by-bin methods: L1 and L2 norms; χ² distance.
 – Cross-bin methods: Earth Mover's Distance (EMD); quadratic form distance.
Outline
• Motivation
• Introduction to Earth Mover's Distance (EMD)
• Related work
• Indexing probabilistic data based on EMD
• Experimental results
• Conclusion and future work
Introduction to Earth Mover's Distance
• Bin-by-bin vs. cross-bin matching (figure)
 – Bin-by-bin: not good, cannot handle a distribution shift
 – Cross-bin: good, can handle a distribution shift
Introduction to Earth Mover's Distance
• What is EMD?
 – Earth, Mover, Distance
 – Can be understood as the cost of moving earth from one place to another
• See the following example…
Moving Earth (figures)
• One histogram is viewed as piles of earth, the other as holes to be filled; two histograms match when the earth can be moved to fill the holes exactly (=), otherwise they differ (≠)
• The Difference? Counting only the (amount moved) does not distinguish matchings; the work of each movement is (amount moved) * (distance moved)
Linear Programming (figures)
• P is a histogram with m bins (clusters); Q is a histogram with n bins (clusters)
• A movement plan assigns a flow f_ij from bin i of P to bin j of Q; each movement costs (distance moved) * (amount moved), i.e. d_ij * f_ij
• Over all movements, EMD minimizes the total cost Σ_i Σ_j d_ij * f_ij
Constraints
1. Move "earth" only from P to Q: f_ij ≥ 0
2. Cannot send more "earth" than there is: Σ_j f_ij ≤ p_i for every bin i of P
3. Q cannot receive more "earth" than it can hold: Σ_i f_ij ≤ q_j for every bin j of Q
4. As much "earth" as possible must be moved: Σ_i Σ_j f_ij = min(Σ_i p_i, Σ_j q_j)
The Formal Definition of EMD
• Earth Mover's Distance (EMD): the minimum amount of work, subject to the constraints above, needed to change one histogram into another
• Challenge of EMD: solving this linear program takes O(N³ log N) time in the number of bins N
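EMD is itself a small linear program, so the definition can be made concrete with an off-the-shelf LP solver. Below is a minimal sketch under the four constraints above, using SciPy's linprog; the histograms p and q, the ground-distance matrix D, and all names are illustrative, and real systems use specialized transportation solvers rather than a dense LP.

```python
import numpy as np
from scipy.optimize import linprog

def emd(p, q, D):
    """EMD between histograms p (m bins) and q (n bins) with ground distances D."""
    m, n = len(p), len(q)
    c = D.reshape(m * n)                      # minimize sum_ij d_ij * f_ij
    A_row = np.zeros((m, m * n))              # sum_j f_ij <= p_i  (constraint 2)
    for i in range(m):
        A_row[i, i * n:(i + 1) * n] = 1.0
    A_col = np.zeros((n, m * n))              # sum_i f_ij <= q_j  (constraint 3)
    for j in range(n):
        A_col[j, j::n] = 1.0
    total = min(p.sum(), q.sum())             # constraint 4: move as much as possible
    res = linprog(c,
                  A_ub=np.vstack([A_row, A_col]),
                  b_ub=np.concatenate([p, q]),
                  A_eq=np.ones((1, m * n)), b_eq=[total],
                  bounds=(0, None),           # constraint 1: f_ij >= 0
                  method="highs")
    return res.fun / total                    # work normalized by total flow

p, q = np.array([0.4, 0.6]), np.array([0.5, 0.5])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
print(emd(p, q, D))                           # 0.1: move 0.1 unit a distance of 1
```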
Related Work
• Filter-and-refine framework
 – [1] Approximation Techniques for Indexing the Earth Mover's Distance in Multimedia Databases. ICDE 2006 (cannot handle high-dimensional histograms)
 – [2] Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction. SIGMOD 2008 (based on a scan framework, which limits scalability)
• Both use a scanning scheme to process queries
 – Merit: obtains a good access order when executing k-NN queries and thus minimizes the number of candidates
 – Demerit: must scan the whole dataset to obtain the order, hence low algorithmic scalability
Related Work vs. Our Work
• Related work
 – Based on the filter-and-refine framework
 – Based on a scanning method, hence low scalability
• Our work
 – Also based on the filter-and-refine method
 – But avoids scanning the whole data set by using B+ trees, and thus achieves high scalability
• Our contributions
 – To the best of our knowledge, the first work to index high-dimensional probabilistic data based on EMD
 – Proposed algorithms for processing similarity queries based on a B+ tree filter
 – Improved the efficiency and scalability of EMD-based similarity search
Indexing the Probabilistic Data Based on EMD
• Our intuition: the primal-dual theory of linear programming
• Primal problem (EMD): minimize the total work subject to the flow constraints
• Dual problem: maximize a lower bound on the work using dual variables (π, Φ)
Indexing the Probabilistic Data Based on EMD
• Good properties of the dual space
 – The constraints of the dual space are independent of the data points (i.e., p and q in this example)
  • Thus, given any feasible solution (π, Φ) in the dual space, we can derive a lower bound for EMD(p, q)
  • The lower bound helps filter out histograms that cannot be hits
 – Given any feasible solution (π, Φ) in the dual space, a histogram p can be mapped to a single value (its key)
  • Histograms can then be indexed using a B+ tree
Indexing the Probabilistic Data Based on EMD
• 1. Mapping construction: key and counter key
 – Assuming p is a histogram in the database, given a feasible solution (π, Φ), we calculate the key for each record in the database
 – We index those keys using a B+ tree
 – For each feasible solution (π, Φ), one B+ tree is constructed
Answering Range Queries
• Range query based on the B+ tree index
 – Given any feasible solution (π, Φ), we construct a B+ tree using the keys of the histograms
 – Given a query histogram, we calculate its counter key
 – Given a similarity-search threshold θ, we have proved that every candidate histogram's key is bounded by an interval determined by the counter key and θ
 – To further filter the candidates, we use L B+ trees and intersect their candidate results
Answering k-NN Queries
• k-NN query based on the B+ tree index
 – Given a query q, we issue a search on each B+ tree T_l with key(q, Φ_l)
 – We create two cursors for each tree and let them fetch records in opposite directions (one left, one right)
 – Whenever a record r has been accessed by all B+ trees, it is output as a candidate for the k-NN query (see the sketch below)
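A minimal sketch of the two-cursor candidate generation, with each B+ tree T_l simulated by a list of (key, id) pairs sorted by key; names and the round-robin schedule are illustrative.

```python
import bisect

def knn_candidates(trees, query_keys):
    """trees: one sorted list of (key, id) pairs per dual solution;
    query_keys: key(q, phi_l) for each tree. Yields ids in the order
    they have been fetched by *all* trees."""
    L = len(trees)
    seen = {}                                  # id -> number of trees that fetched it
    cursors = []
    for t, qk in zip(trees, query_keys):
        pos = bisect.bisect_left([k for k, _ in t], qk)
        cursors.append([pos - 1, pos])         # [left cursor, right cursor]
    while True:
        progressed = False
        for l, t in enumerate(trees):
            left, right = cursors[l]
            qk = query_keys[l]
            # Fetch the record whose key is closer to the query key.
            if left >= 0 and (right >= len(t) or qk - t[left][0] <= t[right][0] - qk):
                rid, cursors[l][0] = t[left][1], left - 1
            elif right < len(t):
                rid, cursors[l][1] = t[right][1], right + 1
            else:
                continue                       # this tree is exhausted
            progressed = True
            seen[rid] = seen.get(rid, 0) + 1
            if seen[rid] == L:
                yield rid
        if not progressed:
            return
```

Yielded ids are only candidates: each is refined with the exact EMD, and the scan stops once the key-distance lower bounds show that no unseen record can beat the current k-th best (omitted here).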
Experimental Setup
• Three real data sets
 – RETINA1: an image data set consisting of 3932 feline retina scans labeled with various antibodies
 – IRMA: contains 10000 radiography images from the Image Retrieval in Medical Applications (IRMA) project
 – DBLP, with corresponding parameter settings
Experimental Results on Query CPU Time (figure)
Experimental Results on Scalability (figure: our method vs. the SIGMOD 2008 scan-based method)
Conclusions
• We present a new indexing scheme for general-purpose similarity search on Earth Mover's Distance
• Our index method relies on primal-dual theory to construct mapping functions from the original probabilistic space to a one-dimensional domain
• Our B+ tree-based index framework has
 – High scalability
 – High efficiency
 – The ability to handle high-dimensional data
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity Function on k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
A Microarray Dataset
• 1000–100,000 columns (genes), 100–500 rows (samples), plus a Class column:

             Gene1  Gene2  Gene3  Gene4  Gene5  Gene6  ...  Class
  Sample1    ...                                             Cancer
  Sample2    ...                                             Cancer
  ...
  SampleN-1  ...                                             ~Cancer
  SampleN    ...                                             ~Cancer

• Find closed patterns which occur frequently among the genes
• Find rules which associate certain combinations of the columns with the class of the rows
 – Gene1, Gene10, Gene1001 -> Cancer
Challenge I
• Large number of patterns/rules: the number of possible column combinations is extremely high
• Solution: the concept of a closed pattern
 – Patterns that are found in exactly the same set of rows are grouped together and represented by their upper bound
• Example: the patterns aeh, ae, ah, eh, e, h are all found in rows 2, 3 and 4; aeh is the upper bound (closed pattern) and e, h are the lower bounds; "a", however, is not part of the group (it also occurs in row 1)

  i   r_i                Class
  1   a,b,c,l,o,s        C
  2   a,d,e,h,p,l,r      C
  3   a,c,e,h,o,q,t      C
  4   a,e,f,h,p,r        ~C
  5   b,d,f,g,l,q,s,t    ~C
Challenge II
• Most existing frequent-pattern discovery algorithms search the column/item enumeration space, i.e., systematically test various combinations of columns/items
• For datasets with 1000–100,000 columns, this search space is enormous
• Instead, we adopt a novel row/sample enumeration approach; CARPENTER (SIGKDD'03) is the first algorithm to adopt it
Column/Item Enumeration Lattice
• Each node in the lattice represents a combination of columns/items
• An edge exists from node A to node B if A is a subset of B and A differs from B by only one column/item
• Search can be done breadth-first, starting from {} and growing the itemsets one item at a time (lattice figure over the example table above)
Column/Item Enumeration Lattice
• Search can also be done depth-first
• Keep an edge from parent to child only if the child is a prefix of the parent (lattice figure over the same example table)
General Framework for Column/Item Enumeration

                                  Read-based             Write-based                                    Point-based
  Association Mining              Apriori [AgSr94], DIC  Eclat, MaxClique [Zaki01], FPGrowth [HaPe00]   H-mine
  Sequential Pattern Discovery    GSP [AgSr96]           SPADE [Zaki98, Zaki01], PrefixSpan [PHPC01]
  Iceberg Cube                    Apriori [AgSr94]       BUC [BeRa99], H-Cubing [HPDW01]
A Multidimensional View (figure)
• Types of data or knowledge: associative patterns, sequential patterns, iceberg cubes, others
• Pruning methods: constraints, compression (closed/max patterns), other interest measures
• Lattice traversal / main operations: read, write, point
Sample/Row Enumeration Algorithms
• To avoid searching the large column/item enumeration space, our mining algorithms search for patterns/rules in the sample/row enumeration space
• Our algorithms do not fit into the column/item enumeration framework
• They are not YAARMA (Yet Another Association Rules Mining Algorithm)
• Column/item enumeration algorithms simply do not scale to microarray datasets
Existing Row/Sample Enumeration Algorithms
• CARPENTER (SIGKDD'03): finds closed patterns using row enumeration
• FARMER (SIGMOD'04): finds interesting rule groups and builds classifiers based on them
• COBBLER (SSDBM'04): combines row and column enumeration for tables with large numbers of rows and columns
• Topk-IRG (SIGMOD'05): finds top-k covering rules for each sample and builds a classifier directly
• Efficiently Finding Lower Bound Rules (TKDE 2010): Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. What is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data. Accepted in TKDE
Concepts of CARPENTER

Example table:
  i   r_i                Class
  1   a,b,c,l,o,s        C
  2   a,d,e,h,p,l,r      C
  3   a,c,e,h,o,q,t      C
  4   a,e,f,h,p,r        ~C
  5   b,d,f,g,l,q,s,t    ~C

Transposed table TT (item i_j -> row ids R(i_j), split by class):
  i_j   C        ~C
  a     1,2,3    4
  b     1        5
  c     1,3
  d     2        5
  e     2,3      4
  f              4,5
  g              5
  h     2,3      4
  l     1,2      5
  o     1,3
  p     2        4
  q     3        5
  r     2        4
  s     1        5
  t     3        5

Conditional transposed table TT|{2,3} (items appearing in both rows 2 and 3):
  i_j   C        ~C
  a     1,2,3    4
  e     2,3      4
  h     2,3      4
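Building the transposed table is a single pass over the rows; a small sketch on the example table above, with row and item names as in the slides:

```python
from collections import defaultdict

rows = {1: set("abclos"), 2: set("adehplr"), 3: set("acehoqt"),
        4: set("aefhpr"), 5: set("bdfglqst")}
classes = {1: "C", 2: "C", 3: "C", 4: "~C", 5: "~C"}

tt = defaultdict(set)                 # item -> ids of rows containing it
for rid, items in rows.items():
    for item in items:
        tt[item].add(rid)

# Conditional table TT|{2,3}: items shared by rows 2 and 3.
cond = {item: rids for item, rids in tt.items() if {2, 3} <= rids}
print(sorted(cond))                   # ['a', 'e', 'h']
```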
Row Enumeration (figure)
• Rows are enumerated depth-first from {}: 1, 12, 123, 1234, 12345, 1235, 124, 1245, 125, 13, ...
• Each node X is labeled with the itemset shared by all rows in X (the closed pattern of TT|X):
  {1}:abclos  {2}:adehplr  {3}:acehoqt  {4}:aefhpr  {5}:bdfglqst
  {12}:al  {13}:aco  {14}:a  {15}:bls  {23}:aeh  {24}:aehpr  {25}:dl  {34}:aeh  {35}:q  {45}:f
  {123}:a  {124}:a  {125}:l  {134}:a  {234}:aeh  {1234}:a  (all other nodes: {})
• Conditional transposed tables are maintained along each branch, e.g. TT|{1} contains a, b, c, l, o, s; TT|{12} contains a, l; TT|{124} contains a
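A compact sketch of the row-enumeration idea: depth-first search over row sets, intersecting the rows' item sets; the intersection at a node is the closed pattern supported by exactly that row set. CARPENTER's pruning methods and conditional-table machinery are omitted, and names are illustrative.

```python
def enumerate_rows(rows, minsup):
    """rows: dict row_id -> frozenset of items. Returns closed pattern -> row set."""
    ids = sorted(rows)
    results = {}

    def recurse(chosen, items, start):
        if len(chosen) >= minsup and items:
            key = frozenset(items)
            # Keep the largest row set supporting this pattern (its closure).
            if key not in results or len(chosen) > len(results[key]):
                results[key] = tuple(chosen)
        for k in range(start, len(ids)):
            rid = ids[k]
            new_items = items & rows[rid] if chosen else rows[rid]
            if new_items:                       # prune empty conditional tables
                recurse(chosen + [rid], new_items, k + 1)

    recurse([], frozenset(), 0)
    return results

rows = {1: frozenset("abclos"), 2: frozenset("adehplr"),
        3: frozenset("acehoqt"), 4: frozenset("aefhpr"),
        5: frozenset("bdfglqst")}
for pat, sup in enumerate_rows(rows, minsup=3).items():
    print("".join(sorted(pat)), "rows:", sup)
```

On the example table with minsup = 3 this reports {a} (rows 1,2,3,4), {aeh} (rows 2,3,4) and {l} (rows 1,2,5), matching the enumeration tree above.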
Pruning Method 1
• Rows that appear in all tuples of the conditional transposed table can be removed without affecting the result
• Example: in TT|{2,3} (a: 1,2,3 / 4; e: 2,3 / 4; h: 2,3 / 4), row 4 appears in every tuple, i.e. r4 has 100% support in the conditional table of "r2 r3"; both {r2,r3} and {r2,r3,r4} yield {aeh}, so the branch "r2 r3 r4" is pruned
Pruning Method 2 (figure: the same row enumeration tree)
• If the closed pattern at a node has been discovered before, we can prune the enumeration below this node
 – Because all patterns below this node must also have been discovered before
 – For example, at node {34}, TT|{3,4} yields {aeh}; since {aeh} was already found (at node {23}), all branches below {34} are pruned
Pruning Method 3: Minimum Support
• Example: from TT|{1} (below), the support of any pattern below node {1} can be at most 5 rows; if the minimum support threshold exceeds what the remaining rows can provide, the branch is pruned

  TT|{1}:
  i_j   C      ~C
  a     1,2,3  4
  b     1      5
  c     1,3
  l     1,2    5
  o     1,3
  s     1      5
From CARPENTER to FARMER
• What if classes exist? What more can we do?
• Pruning with interestingness measures
 – Minimum confidence
 – Minimum chi-square
• Generate lower bounds for classification/prediction
Interesting Rule Groups
• Concept of a rule group / equivalence class
 – Rules supported by exactly the same set of rows are grouped together
• Example: the following rules are derived from rows 2, 3 and 4 with 66% confidence (same example table as above)
 – aeh -> C (66%) is the upper bound
 – ae -> C, ah -> C, eh -> C (66%) belong to the group
 – e -> C, h -> C (66%) are the lower bounds
 – a -> C, however, is not in the group (it is also supported by row 1)
Pruning by Interestingness Measures
• In addition, find only interesting rule groups (IRGs) based on some measures:
 – minconf: the rules in the rule group predict the class on the RHS with high confidence
 – minchi: there is a high correlation between the LHS and RHS of the rules based on a chi-square test
• Other measures such as lift, entropy gain, conviction, etc. can be handled similarly
Ordering of Rows: All Class C before ~C (figure: row enumeration tree)
• Rows of class C (1, 2, 3) are enumerated before rows of class ~C (4, 5), so conditional tables such as TT|{1}, TT|{12} and TT|{124} are built with the C rows first
• This ordering enables the confidence-based pruning described next
Pruning Method: Minimum Confidence
• Example: in TT|{2,3} below, the maximum confidence of all rules below node {2,3} is at most 4/5: the largest class-C row list has 4 rows (a: 1,2,3,6), and row 4 (~C) appears in every tuple, so at least one ~C row is always in the support

  TT|{2,3}:
  i_j   C          ~C
  a     1,2,3,6    4,5
  e     2,3,7      4,9
  h     2,3        4
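One way to arrive at the 4/5 bound in this example is sketched below: the numerator is at most the largest class-C row list, while the denominator must include every ~C row that appears in all items of the conditional table. This reproduces the slide's number, but FARMER's exact bound may be computed differently.

```python
def max_confidence(cond_tt):
    """cond_tt: item -> (set of class-C rows, set of class-~C rows)."""
    best_c = max(len(c) for c, _ in cond_tt.values())
    # ~C rows present in every item's list can never leave the support.
    common_nc = set.intersection(*(nc for _, nc in cond_tt.values()))
    return best_c / (best_c + len(common_nc))

tt23 = {"a": ({1, 2, 3, 6}, {4, 5}),
        "e": ({2, 3, 7}, {4, 9}),
        "h": ({2, 3}, {4})}
print(max_confidence(tt23))           # 0.8, i.e. the 4/5 bound on the slide
```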
Pruning Method: Minimum Chi-square
• Bounded in the same way as the maximum confidence, using a 2x2 contingency table derived from TT|{2,3}:

          C          ~C         Total
  A       max=5      min=1      computed
  ~A      computed   computed   computed
  Total   constant   constant   constant

• The row and column totals are constant, so the remaining cells follow from the bounded ones, giving a bound on the chi-square value
Finding Lower Bounds: MineLB (figure: lattice over {a, b, c, d, e})
• Example: an upper-bound rule with antecedent A = abcde, and two rows (r1: abcf) and (r2: cdeg)
• Initialize the lower bounds to {a, b, c, d, e}
• Add "abcf": the new lower bounds are {d, e}
• Add "cdeg": the candidate lower bounds are ad, ae, bd, be, cd, ce; ad, ae, bd and be are kept since no other lower bound overrides them, while cd and ce are removed as overridden; the new lower bounds are {ad, bd, ae, be}
Implementation
• In general, CARPENTER/FARMER can be implemented in many ways:
 – FP-tree
 – Vertical format
• In our case, we assume the dataset fits in main memory and use a pointer-based algorithm similar to BUC
Experimental Studies
• Efficiency of FARMER
 – On five real-life datasets: lung cancer (LC), breast cancer (BC), prostate cancer (PC), ALL-AML leukemia (ALL), colon tumor (CT)
 – Varying minsup, minconf, minchi
 – Benchmarked against CHARM [ZaHs02] ICDM'02 and Bayardo's algorithm (ColumnE) [BaAg99] SIGKDD'99
• Usefulness of IRGs: classification
Example Results: Prostate (figure)
• Runtime (log scale, 1–100000 s) vs. minimum support (3–9) for FARMER, ColumnE and CHARM
Example Results: Prostate (figure)
• Runtime (0–1200 s) vs. minimum confidence (0–99%) for FARMER with minsup=1, with and without minchi=10
Top-k Covering Rule Groups
• Rank rule groups (upper bounds) according to
 – Confidence, then
 – Support
• Top-k covering rule groups for row r
 – The k highest-ranking rule groups whose support set contains row r and whose support exceeds the minimum support
• Top-k covering rule groups = TopkRGS for each row
Usefulness of Rule Groups
• Rules for every row
• The top-1 covering rule groups are sufficient to build a CBA classifier
• No minimum confidence threshold, only minimum support
• #TopkRGS = k x #rows
Top-k Covering Rule Groups
• For each row, we find the k most significant rule groups:
 – based on confidence first
 – then support
• Given minsup=1, the top-1 rule groups are:
 – row 1: abc -> C1 (sup = 2, conf = 100%)
 – row 2: abc -> C1 (abcd -> C1 has sup = 1, conf = 100%)
 – row 3: cd -> C1 (sup = 2, conf = 66.7%); what if minconf = 80%?
 – row 4: cde -> C2 (sup = 1, conf = 50%)

  Row   Items     Class
  1     a,b,c     C1
  2     a,b,c,d   C1
  3     c,d,e     C1
  4     c,d,e     C2
Main Advantages of Top-k Covering Rule Groups
• Their number is bounded by the product of k and the number of samples
• Each sample is treated equally -> a complete description is provided for each (small) row
• The minimum confidence parameter is replaced by k
• Sufficient to build classifiers while avoiding excessive computation
Top-k Pruning
• At node X, the maximal set of rows covered by rules to be discovered below X consists of the rows containing X and the rows ordered after X
 – minconf <- the minimum confidence of the discovered TopkRGS over all rows in the above set
 – minsup <- the corresponding minimum support
• Pruning
 – If the estimated upper bound of confidence below X < minconf -> prune
 – If the confidence is the same and the support is smaller -> prune
• Optimizations
Classification Based on Association Rules
• Step 1: Generate the complete set of association rules for each class (given minimum support and minimum confidence)
 – The CBA algorithm adopts an Apriori-like algorithm, which fails at this step on microarray data
• Step 2: Sort the set of generated rules
• Step 3: Select a subset of rules from the sorted rule sets to form the classifier
Features of RCBT Classifiers
• Problem: discovering, storing, retrieving and sorting a large number of rules
 – RCBT: mine only the rules to be used for classification, e.g. the top-1 rule group is sufficient to build a CBA classifier
• Problem: a default class is not convincing for biologists
 – RCBT: a main classifier plus some back-up classifiers
• Problem: how to integrate rules with the same discriminating ability (upper-bound rules are specific; lower-bound rules are general)
 – RCBT: a subset of lower-bound rules, integrated using a score that considers both confidence and support
Experimental Studies
• Datasets: 4 real-life datasets
• Efficiency of top-k rule mining
 – Benchmarks: FARMER, CHARM, CLOSET+
• Classification methods:
 – CBA (built using the top-1 rule group)
 – RCBT (our proposed method)
 – IRG classifier
 – Decision trees (single, bagging, boosting)
 – SVM
Runtime vs. Minimum Support on the ALL-AML Dataset (figure)
• Runtime (log scale, 0.01–10000 s) vs. minimum support (17–25) for FARMER, FARMER (minconf=0.9), FARMER+prefix (minconf=0.9), TOP1 and TOP100
Scalability with k (figure)
• Runtime (log scale, 0.1–100 s) vs. k (100–1000) on the PC and ALL datasets
Biological Meaning: Prostate Cancer Data (figure)
• Frequency of occurrence (0–1800) vs. gene rank (0–1600) for genes W72186, AF017418, AI635895, X14487, AB014519, M61916 and Y13323
Classification Results (figures: classification accuracy comparisons)
References
• Anthony K. H. Tung, Rui Zhang, Nick Koudas, Beng Chin Ooi. "Similarity Search: A Matching Based Approach". VLDB 2006.
• H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, Anthony K. H. Tung. "ItCompress: An Iterative Semantic Compression Algorithm". ICDE 2004, Boston, 2004.
• Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung. "Similarity Search on Bregman Divergence: Towards Non-Metric Indexing". VLDB 2009, Lyon, France, August 24-28, 2009.
• Jia Xu, Zhenjie Zhang, Anthony K. H. Tung, Ge Yu. "Efficient and Effective Similarity Search over Probabilistic Data Based on Earth Mover's Distance". VLDB 2010; a preliminary version appears as Technical Report TRA5-10, National University of Singapore. [Codes & Data]
• Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD 2005, Baltimore, Maryland, 2005.
• Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. "What is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data". Accepted in TKDE.
Optional References:
• Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed Zaki. "CARPENTER: Finding Closed Patterns in Long Biological Datasets". KDD 2003, Washington, DC, USA, August 24-27, 2003.
• Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang. "FARMER: Finding Interesting Rule Groups in Microarray Datasets". SIGMOD 2004, June 13-18, 2004, Paris, France.
• Feng Pan, Anthony K. H. Tung, Gao Cong, Xin Xu. "COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery". SSDBM 2004, Santorini Island, Greece.
• Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Feng Pan. "Mining Frequent Closed Patterns in Microarray Data". ICDM 2004.
• Xin Xu, Ying Lu, Anthony K. H. Tung, Wei Wang. "Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles". ICDE 2006.
Mining and Searching Complex Structures
Chapter 3 Similarity Search on Sequences
Searching and Mining Complex Structures Similarity Search on Sequences Anthony K. H. Tung(鄧锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html Social Network Link: http://www.renren.com/profile.do?id=313870900
Types of Sequences
• Symbolic vs. numeric: we only touch discrete symbols here; sequences of numbers are called time series, which is a huge topic by itself!
• Single-dimensional vs. multi-dimensional. Example: Yueguo Chen, Shouxu Jiang, Beng Chin Ooi, Anthony K. H. Tung. "Querying Complex Spatial-Temporal Sequences in Human Motion Databases". ICDE 2008.
• Single long sequence vs. multiple sequences
Outline
• Searching based on a disk-based suffix tree
• Approximate matching using inverted lists (VGRAM)
• Approximate matching based on a B+ tree (BED-tree)
Suffix
Suffixes of acacag$:
1. acacag$
2. cacag$
3. acag$
4. cag$
5. ag$
6. g$
7. $
Suffix Trie
E.g. consider the string S = acacag$. The suffix trie is a trie of all possible suffixes of S (figure):

  i   Suffix
  1   acacag$
  2   cacag$
  3   acag$
  4   cag$
  5   ag$
  6   g$
  7   $
Suffix Tree (I)
• The suffix tree for S = acacag$ is obtained from the suffix trie by merging nodes that have only one child (figure)
• The path label of a node v, denoted α(v), is the concatenation of edge labels from the root to v, e.g. α(v) = "aca"
• "ca" is an edge label; an edge ending at a leaf is a leaf edge
Suffix Tree (II)
• A suffix tree has exactly n leaves and at most 2n edges
• The label of each edge can be represented using 2 indices into S
• Thus, the suffix tree can be represented using O(n log n) bits
• (figure: the tree for S = acacag$ with edge labels written as index pairs, e.g. (2,3) for "ca")
• Note: the end index of every leaf edge is 7, the last index of S; thus, for leaf edges we only need to store the start index
Generalized Suffix Tree
• Build a single suffix tree for two or more strings
• E.g. S1 = acgat#, S2 = cgt$ (figure: one tree containing all suffixes of both strings, with the terminators # and $ distinguishing them)
Straightforward Construction of a Suffix Tree
Consider S = s_1 s_2 ... s_n where s_n = $.
Algorithm:
  Initialize the tree with only a root
  For i = n down to 1:
    Insert the suffix S[i..n] into the tree
Time: O(n^2)
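A minimal sketch of this straightforward construction, built as an uncompressed suffix trie with nested dicts; compacting single-child paths into a suffix tree would be a separate pass.

```python
def build_suffix_trie(s):
    root = {}
    for i in range(len(s) - 1, -1, -1):    # insert suffixes S[i..n]
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1                # 1-based start of this suffix
    return root

trie = build_suffix_trie("acacag$")
# Follow "aca": both suffix 1 (acacag$) and suffix 3 (acag$) pass through it.
node = trie
for ch in "aca":
    node = node[ch]
print(sorted(node.keys()))                  # ['c', 'g']
```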
Example of Construction, S = acca$ (figure)
• Init: a tree with only the root
• For-loop: insert the suffixes from shortest to longest, $ (step I5), a$ (I4), ca$ (I3), cca$ (I2), acca$ (I1), growing the tree one suffix at a time
Construction of a Generalized Suffix Tree (figure)
• Starting from the suffix tree of S = acca$, insert the suffixes of S' = c# (steps J2, J1) into the same tree
Property of the Suffix Tree
Fact: For any internal node v in the suffix tree, if the path label of v is α(v) = ap (a single character a followed by a string p), then there exists another node w in the suffix tree such that α(w) = p.
Proof: skipped.
Definition (suffix link): For any internal node v, define its suffix link sl(v) = w.
Suffix Link Example (figure)
• S = acacag$: the node with path label "aca" links to the node with path label "ca", which links to the node with path label "a"
Can We Construct a Suffix Tree in O(n) Time?
Yes. We can construct it in O(n) time and O(n) space:
• Weiner's algorithm [1973]: linear time for a constant-size alphabet, but much space
• McCreight's algorithm [JACM 1976]: linear time for a constant-size alphabet, quadratic space
• Ukkonen's algorithm [Algorithmica, 1995]: online, linear time for a constant-size alphabet, less space
• Farach's algorithm [FOCS 1997]: linear time for a general alphabet
• Hon, Sadakane and Sung's algorithm [FOCS 2003]: O(n)-bit space, O(n log^ε n) time for 0 < ε < 1

For i, j > 0, the alignment score V(i, j) is given by

  V(i, j) = max { V(i-1, j-1) + δ(S[i], T[j]),   (match/mismatch)
                  V(i-1, j)   + δ(S[i], _),      (delete)
                  V(i, j-1)   + δ(_, T[j]) }     (insert)
In the alignment, the last pair must be a match/mismatch, a delete, or an insert:

  xxx…xx          xxx…xx        xxx…x_
  xxx…yy          yyy…y_        yyy…yy
  match/mismatch  delete        insert
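A minimal sketch of the dynamic program; the scoring (match = +2, mismatch and indel = -1) is inferred from the example tables that follow.

```python
def align(S, T, match=2, mismatch=-1, gap=-1):
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap                       # delete S[1..i]
    for j in range(1, m + 1):
        V[0][j] = j * gap                       # insert T[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if S[i - 1] == T[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + d,  # match/mismatch
                          V[i - 1][j] + gap,    # delete
                          V[i][j - 1] + gap)    # insert
    return V[n][m]

print(align("ACAATCC", "AGCATGC"))              # 7, as in the example tables
```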
Example (I) (figure)
• S = ACAATCC (rows) and T = AGCATGC (columns); the table is initialized with V(i, 0) = -i and V(0, j) = -j
Example (II) (figure)
• The table is filled row by row using the recurrence, e.g. V(2, 3) = V(1, 2) + 2 = 3 since S[2] = C matches T[3] = C

Example (III): the completed table

  V    _   A   G   C   A   T   G   C
  _    0  -1  -2  -3  -4  -5  -6  -7
  A   -1   2   1   0  -1  -2  -3  -4
  C   -2   1   1   3   2   1   0  -1
  A   -3   0   0   2   5   4   3   2
  A   -4  -1  -1   1   4   4   3   2
  T   -5  -2  -2   0   3   6   5   4
  C   -6  -3  -3   0   2   5   5   7
  C   -7  -4  -4  -1   1   4   4   7

The optimal alignment score is V(7, 7) = 7.
"q-grams" of Strings (figure)
• The 2-grams of "universal": un, ni, iv, ve, er, rs, sa, al
q-gram Inverted Lists

  id   string
  0    rich
  1    stick
  2    stich
  3    stuck
  4    static

  2-gram   inverted list
  at       4
  ch       0, 2
  ck       1, 3
  ic       0, 1, 2, 4
  ri       0
  st       1, 2, 3, 4
  ta       4
  ti       1, 2, 4
  tu       3
  uc       3
Searching Using Inverted Lists
• Query: "shtick", find strings with ED(shtick, ?) ≤ 1
• The query's 2-grams are sh, ht, ti, ic, ck; probe the inverted lists above and count occurrences per string id
• Count filter: a true answer must share at least (6 - 2 + 1) - 1×2 = 3 grams with the query, so only "stick" (sharing ti, ic, ck) survives as a candidate
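A minimal sketch of the inverted-list search with the count filter, over the same five strings; names are illustrative.

```python
from collections import defaultdict

def grams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

strings = ["rich", "stick", "stich", "stuck", "static"]
index = defaultdict(list)                 # gram -> ids of strings containing it
for sid, s in enumerate(strings):
    for g in set(grams(s)):
        index[g].append(sid)

def candidates(query, k, q=2):
    lower = (len(query) - q + 1) - k * q  # count-filter threshold
    counts = defaultdict(int)
    for g in set(grams(query, q)):
        for sid in index.get(g, []):
            counts[sid] += 1
    return [strings[sid] for sid, c in counts.items() if c >= lower]

print(candidates("shtick", k=1))          # ['stick']: 3 shared grams survive
```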
2-grams -> 3-grams?
• Query: "shtick", ED(shtick, ?) ≤ 1; the query's 3-grams are sht, hti, tic, ick
• Count filter: # of common grams ≥ (6 - 3 + 1) - 1×3 = 1

  3-gram   inverted list
  ati      4
  ich      0, 2
  ick      1
  ric      0
  sta      4
  sti      1, 2
  stu      3
  tat      4
  tic      1, 2, 4
  tuc      3
  uck      3
Observation 1: The Dilemma of Choosing q
• Increasing q causes:
 – longer grams
 – shorter inverted lists
 – a smaller number of common grams between similar strings
(figure: the 2-gram inverted lists shown earlier)
Observation 2: Skewed Distribution of Gram Frequencies
• DBLP: 276,699 article titles
• Popular 5-grams: ation (>114K occurrences), tions, ystem, catio
VGRAM: Main Idea
• Use grams with variable lengths (between qmin and qmax)
 – zebra: "ze" (frequency 123)
 – corrasion: "co" (5213), "cor" (859), "corr" (171)
• Advantages
 – Reduces index size ☺
 – Reduces running time ☺
 – Adoptable by many algorithms ☺
Challenges
1. How to generate variable-length grams?
2. How to construct a high-quality gram dictionary?
3. What is the relationship between string similarity and gram-set similarity?
4. How to adopt VGRAM in existing algorithms?
Challenge 1: String -> Variable-Length Grams?
• Fixed-length 2-grams: "universal" -> un, ni, iv, ve, er, rs, sa, al
• Variable-length grams with a [2,4]-gram dictionary: "universal" -> ni, ivr, sal, uni, vers
Representing the Gram Dictionary as a Trie (figure)
• The dictionary {ni, ivr, sal, uni, vers} is stored in a trie for longest-match lookups
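A rough sketch of variable-length gram generation against a trie dictionary: at each position, emit the longest dictionary gram that matches, falling back to a qmin-gram. This follows the general VGRAM idea; the published algorithm's details (e.g. how positions are advanced after a long match) may differ.

```python
class Trie:
    def __init__(self, grams):
        self.root = {}
        for g in grams:
            node = self.root
            for ch in g:
                node = node.setdefault(ch, {})
            node["$"] = True                   # end-of-gram marker

    def longest_match(self, s, i):
        node, best = self.root, 0
        for j in range(i, len(s)):
            if s[j] not in node:
                break
            node = node[s[j]]
            if "$" in node:
                best = j - i + 1
        return best

def vgrams(s, trie, qmin=2):
    out, i = [], 0
    while i + qmin <= len(s):
        L = trie.longest_match(s, i) or qmin   # fall back to a qmin-gram
        out.append(s[i:i + L])
        i += 1                                  # slide by one position
    return out

trie = Trie(["ni", "ivr", "sal", "uni", "vers"])
print(vgrams("universal", trie))
```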
Challenge 2: Constructing the Gram Dictionary
• Step 1: Collect the frequencies of grams with lengths in [qmin, qmax] in a gram trie (figure), e.g. st: {0,1,3}, sti: {0,1}, stu: {3}, stic: {0,1}, stuc: {3}
Step 2: Selecting Grams
• Prune the trie using a frequency threshold T (e.g., T = 2)
Step 2: Selecting Grams (cont.) (figure: pruning with threshold T = 2)
Final Gram Dictionary (figure): the selected [2,4]-grams
Challenge 3: An Edit Operation's Effect on Grams
• With fixed-length q-grams of "universal", one edit operation can affect up to q grams (those overlapping the edited position), so k operations can affect up to k * q grams
Deletion Affects Variable-Length Grams (figure)
• A deletion at position i can only affect grams starting in the window [i - qmax + 1, i + qmax - 1]; grams entirely outside this window are not affected
Grams Affected by a Deletion (figure)
• Within the window [i - qmax + 1, i + qmax - 1], whether a gram of "universal" (from the [2,4]-gram dictionary {ni, ivr, sal, uni, vers}) is actually affected depends on which dictionary grams can still be generated after the deletion
Grams Affected by a Deletion (cont.) (figure)
• The affected grams inside the window can be determined using the trie of grams together with a trie of reversed grams
Number of Grams Affected by Each Operation (figure)
• For each position of "universal" (including the gaps, _u_n_i_v_e_r_s_a_l_), precompute how many grams a deletion/substitution or an insertion at that position can affect:
  0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
Maximum Number of Grams Affected by k Operations
• For each string s, precompute a vector whose k-th entry is the maximum number of grams that k edit operations can affect, e.g. with 2 edit operations, at most 4 grams can be affected
• This is called the NAG vector (number of affected grams); it is precomputed and stored
Summary of the VGRAM Index (figure)

Challenge 4: Adopting VGRAM
• Easily adoptable by many algorithms
• Basic interfaces:
 – String s -> its set of grams
 – Strings s1, s2 such that ed(s1, s2) ≤ k -> a lower bound on the number of their common grams (derived from the NAG vectors)