VLDB Database School (China) 2010 August 3-7, 2010, Shenyang
Lecture Notes Part 1 Mining and Searching Complex Structures
Anthony K.H. Tung(邓锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/~atung
Contents
Chapter 1: Introduction
Chapter 2: High Dimensional Data
Chapter 3: Similarity Search on Sequences
Chapter 4: Similarity Search on Trees
Chapter 5: Graph Similarity Search
Chapter 6: Massive Graph Mining
Mining and Searching Complex Structures
Chapter 1 Introduction
Mining and Searching Complex Structures Introduction Anthony K. H. Tung(鄧锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html Social Network Link: http://www.renren.com/profile.do?id=313870900
What is data mining?
Really nothing different from what scientists have been doing for a long time: the real world generates data; scientists collect the data and verify or construct a model of the real world; a correct, useful model wins a Nobel Prize.
What's new? In data mining, we feed the data in and systematically and efficiently test many statistical models, outputting the most likely model based on some statistical measure.
Components of data mining
• Structure of the model: e.g., "geneA = high and geneB = low ⟹ cancer", or "geneA, geneB and geneC exhibit strong correlation".
• Statistical score for the model: e.g., "the accuracy of rule 1 is 90%".
• Similarity function: is a group of records sufficiently similar to support a certain model or hypothesis?
• Search method for the correct model parameters: given 200 genes, there could be 2^200 rules — which rule gives the best prediction power?
• Database access method: given 1 million records, how to quickly find the relevant records to compute the accuracy of a rule?
The Apriori Algorithm
• Bottom-up, breadth-first search
• Only reads are performed on the database
• Store candidates in memory to simulate the lattice search
• Iteratively follow the two steps (see the sketch below):
 – generate candidates
 – count and get the actual frequent itemsets
[Figure: the itemset lattice searched bottom-up, from {} through the single items a, b, c, e, the pairs a,b / a,c / a,e / b,c / b,e / c,e, the triples a,b,c / a,b,e / a,c,e / b,c,e, up to a,b,c,e.]
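The generate-and-count loop can be written compactly. Below is a minimal Python sketch of the idea — our own illustration, not code from the lecture; the function name, arguments and toy transactions are all made up:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Bottom-up, breadth-first frequent itemset mining (sketch)."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([i]) for t in transactions for i in t}
    frequent, k = [], 1
    while level:
        # Count step: one scan of the database for the whole level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c for c in level if counts[c] >= min_support}
        frequent.extend((set(c), counts[c]) for c in survivors)
        # Generate step: join frequent k-itemsets into (k+1)-candidates,
        # pruning any candidate that has an infrequent k-subset.
        level = {a | b for a in survivors for b in survivors
                 if len(a | b) == k + 1
                 and all(frozenset(s) in survivors
                         for s in combinations(a | b, k))}
        k += 1
    return frequent

# With min_support = 2, itemsets such as {a, b, e} survive to level 3.
print(apriori([{'a','b','e'}, {'a','c'}, {'a','b','c','e'}, {'b','e'}], 2))
```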
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4 steps:
 1. Partition the objects into k nonempty subsets.
 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e. mean point, of the cluster).
 3. Assign each object to the cluster with the nearest seed point.
 4. Go back to step 2; stop when there are no more new assignments.
The K-Means Clustering Method
• Example:
[Figure: a worked 2-D example (both axes from 0 to 10) showing the initial partition, centroid computation, reassignment of points, and the final stable clusters.]
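A minimal sketch of the 4-step loop, assuming 2-D points; this helper and its names are ours, not from the slides:

```python
import random

def kmeans(points, k, max_iter=100):
    """K-means on a list of (x, y) tuples."""
    centroids = random.sample(points, k)            # step 1: initial seeds
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster of the nearest seed.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p[0]-centroids[j][0])**2
                                          + (p[1]-centroids[j][1])**2)
            clusters[i].append(p)
        # Step 2: recompute each centroid as the mean of its cluster.
        new_centroids = [
            (sum(x for x, _ in c)/len(c), sum(y for _, y in c)/len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)]
        if new_centroids == centroids:              # step 4: stable -> stop
            break
        centroids = new_centroids
    return centroids, clusters
```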
Training Dataset (Decision Tree)

Day | Outlook | Temp | Humid | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
Selecting the Next Attribute
S = [9+, 5−], E = 0.940.
Splitting on Humidity:
 – High: [3+, 4−], E = 0.985
 – Normal: [6+, 1−], E = 0.592
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Splitting on Wind:
 – Weak: [6+, 2−], E = 0.811
 – Strong: [3+, 3−], E = 1.0
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
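The entropy and gain figures above can be reproduced with a few lines of Python; this helper is our own illustration, not from the slides:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a [pos+, neg-] split; empty classes contribute 0."""
    total = pos + neg
    return sum(-p/total * log2(p/total) for p in (pos, neg) if p)

def gain(parent, splits):
    """Information gain; parent and each split are (pos, neg) counts."""
    n = sum(p + q for p, q in splits)
    remainder = sum((p + q)/n * entropy(p, q) for p, q in splits)
    return entropy(*parent) - remainder

print(gain((9, 5), [(3, 4), (6, 1)]))   # Humidity -> about 0.151
print(gain((9, 5), [(6, 2), (3, 3)]))   # Wind     -> about 0.048
```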
Selecting the Next Attribute
S = [9+, 5−], E = 0.940.
Splitting on Outlook:
 – Sunny: [2+, 3−], E = 0.971
 – Overcast: [4+, 0−], E = 0.0
 – Rain: [3+, 2−], E = 0.971
Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
ID3 Algorithm
Root split on Outlook over [D1, D2, ..., D14] = [9+, 5−]:
 – Sunny: Ssunny = [D1, D2, D8, D9, D11] = [2+, 3−] → to be split further
 – Overcast: [D3, D7, D12, D13] = [4+, 0−] → Yes
 – Rain: [D4, D5, D6, D10, D14] = [3+, 2−] → to be split further
Choosing the attribute for the Sunny branch:
Gain(Ssunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(Ssunny, Temp) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
Decision Tree for PlayTennis

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes
Can we fit what we learn into the framework?

| | Apriori | K-means | ID3 |
| task | rule/pattern discovery | clustering | classification |
| structure of the model | association rules or patterns | clusters | decision tree |
| search space | lattice of all possible combinations of items, size = 2^m | choice of any k points as centers, size = infinity | all possible decision trees, size = potentially infinite |
| score function | support, confidence | square error | accuracy, information gain |
| search/optimization method | breadth-first with pruning | gradient descent | greedy |
| data management technique | TBD | TBD | TBD |
Components of data mining (II)
[Diagram: the models and the enumeration algorithm sit on top of three supporting components — the statistical score function, the similarity/search function, and the database access method — which in turn operate on the database.]
Background knowledge
• We assume you have some basic knowledge about data mining; the following slides will be useful for this purpose:
 – Association rule mining: http://www.comp.nus.edu.sg/~atung/Renmin56.pdf
 – Classification and regression: http://www.comp.nus.edu.sg/~atung/Renmin67.pdf
 – Clustering: http://www.comp.nus.edu.sg/~atung/Renmin78.pdf
IT Trend
• Processors are cheap and will become cheaper (multi-core processors, graphics cards).
• Storage will be cheap but might not be fast.
• Bandwidth will keep growing.
• What can we do with this? Play more realistic games! Not exactly a joke, since any technology that speeds up games can speed up scientific simulation.
• Smarter (more intensive) computation.
• We can store more personal semantics/ontology.
• People can collaborate more over the Internet (Flickr, Wikipedia) to make things more intelligent.
• The AI dream now has the support of much better hardware. Essentially, data mining can be made much simpler for the man on the street. Data mining should be human-centered, not machine-centered.
What is complex data?
• What is "simple" data? A tabular table with a small number of attributes (of the same type) and no missing values.
• What are complex data?
 – High dimensional data: lots of attributes with different data types and missing values, e.g.

Test1 | Gene1 | Progress | comments
Pos | 2.0 | Fever | ...
Neg | -0.3 | Unconscious | ...
N.A. | 5.7 | ... | ...

 – Sequences / time series
 – Trees
 – Graphs
Why complex data?
• They come naturally in many applications, bringing research nearer to the real world.
• Lots of challenges, which means more fun!
• Some fundamental challenges:
 – How do you compare complex objects effectively and efficiently?
 – How do you find a special subset of the data that is interesting?
 – What new types of models and score functions must you use?
 – How do you handle noise and error?
[Figure: a mixed-type table and two labelled trees, T1 and T2, as examples of complex data.]
Personalized Semantics for Personal Data Management
• Everyone will own terabytes of data soon.
• Improve the query/search interface by mining and extracting personalized semantics — entities and their relationships, etc. — by comparing personal data against high-quality tagged databases.
[Diagram: personal data (documents, audio/music, video, photographs/images, webpages/blogs/bookmarks) can be queried by document, audio/music, video, or photograph; a semantic layer of entities (singers, authors, actors/actresses, songs, papers, places, movies) links it to high-quality data sources such as Wikipedia.]
Integrated Approach to Mining Software Engineering Data
• Software engineering data: code bases, change histories, bug reports, runtime traces.
• Integrated into a data warehouse to support decision making and mining. Example: which code module should I modify to create a new function? Which modules need maintenance?
[Diagram: software engineering data (code bases, change history, program states, structural entities, bug reports/natural language, ...) feeds a data warehouse; mining techniques (classification, association/patterns, clustering, ...) support software engineering tasks such as programming, defect detection, testing, debugging and maintenance.]
WikiScience
• Web 2.0: Facebook for scientists.
• A collaborative platform for scientists to build scientific models/hypotheses and share data and applications.
[Diagram: "This is my model of the solar system, based on my supporting dataset" — Model A, with a supporting dataset tagged to it. "Based on some articles, I make some changes to Model A to create Model B" — Model B, with supporting articles tagged to it. The system then constructs a centralized, hybrid Model C.]
Hey, why not Cloud Computing, Map/Reduce?
• These are platforms for scaling services up to large numbers of users on large amounts of data.
• But what exactly do you want to scale up? Services that provide useful and semantically correct information to the users.
• We have too many scalable data mining algorithms that find nothing, or find too many things.
• Let's focus on finding useful things first (assuming we have lots of processing power) and then try to scale up.
Schedule of the Course

Lesson 1: Introduction
Lesson 2: Mining and Searching High Dimensional Data I
Lesson 3: Mining and Searching High Dimensional Data II
Lesson 4: Mining and Searching High Dimensional Data III
Lesson 5: Similarity Search for Sequences and Trees I
Lesson 6: Similarity Search for Sequences and Trees II
Lesson 7: Similarity Search for Graphs I
Lesson 8: Similarity Search for Graphs II
Lesson 9: Similarity Search for Graphs III
Lesson 10: Mining Massive Graphs I
Lesson 11: Mining Massive Graphs II
Lesson 12: Mining Massive Graphs III
Focus of the course
• Techniques that can handle high dimensional, complex structures:
 – Providing semantics to similarity search
 – Shotgun and assembly: column/feature-wise processing using an inverted index
 – Row-wise enumeration
 – Using local properties to infer global properties
• Throughout the course, please try to think about how these techniques apply across different types of complex structures.
Database Queries
• To start off, we will consider something very basic called ranking queries, since we need ranking in any similarity search (usually from most similar to most dissimilar).
• In a relational database, SQL returns all results in one go.
• How many tuples can fit on one screen? How many tuples can you remember?
• Options: summarize the results, or display representative tuples.
• How do we select representative tuples?
Retrieve Relevant Information
• Search for videos related to the Shanghai Expo.
• Too many results: as long as you keep clicking "next", there are 20 more new results.
• Are we interested in all results? No, only the most relevant ones.
• Search engines have to rank the results — which is how they make money.
Question: How to Select a Small Result Set
• Selecting the most representative or most interesting results is not trivial.
• Find an apartment with rental cheaper than 1000, the cheaper the better:
 – The result tuples can be sorted in ascending order of rental price; those in front are more favorable.
• Find an apartment with rental cheaper than 1000 near NEU, the cheaper the better and the nearer the better:
 – An apartment with lower rent may not be near; a nearer one may not be cheap.
 – Order by price? Order by distance?
Top-k Queries
• Define a scoring function, which maps a tuple to a real number: its score.
• The more favorable the tuple, the better its score.
• Define an integer k.
• Answer: the k objects with the best scores.
• Different scoring functions may give different top-k results:

 | Price | Distance to NEU
Apartment A | $800 | 500 meters
Apartment B | $1200 | 200 meters

• Given k = 1, if the scoring function is defined as the sum of price and distance, the first tuple is better; if it is defined as the product, the second tuple is better.
Brute Force Top-k
• Compute the score of every result tuple.
• Sort the tuples in order of score.
• Select the first k tuples (see the sketch below).
• What if the number of tuples is unlimited? Search engines can give an unlimited number of results.
• Even if the number of tuples is limited, it is too slow to compute a score for every tuple.
• We have to do it efficiently.
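A minimal sketch of the brute-force approach on the apartment example. Here a smaller score is treated as more favorable, to match the price/distance example above; the helper and data layout are ours:

```python
import heapq

apartments = [("A", 800, 500), ("B", 1200, 200)]    # (id, price, distance)

def topk(tuples, k, score):
    # Brute force: score every tuple, keep the k best (lowest score).
    return heapq.nsmallest(k, tuples, key=score)

print(topk(apartments, 1, lambda t: t[1] + t[2]))   # sum     -> apartment A
print(topk(apartments, 1, lambda t: t[1] * t[2]))   # product -> apartment B
```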
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
Monotonicity
• A scoring function f is monotone if f(x1, x2, ..., xm) ≤ f(y1, y2, ..., ym) whenever xi ≤ yi for every i.
• Select the top-3 students with the highest total score in mathematics, physics and computer science:

select name, math+phys+comp as score
from student
order by score desc
limit 3

• sum is monotone: sum(x.math, x.phys, x.comp) ≤ sum(y.math, y.phys, y.comp) if x.math ≤ y.math, x.phys ≤ y.phys and x.comp ≤ y.comp.
Sorted Lists
• We shall think of the database as consisting of m sorted lists L1, L2, ..., Lm (here Lmath, Lphys, Lcomp):

Lmath | Lphys | Lcomp
Ann 98 | Hugh 97 | Kurt 96
Ben 96 | Ryan 94 | Ann 95
Kurt 93 | Ann 92 | Jane 95
Hugh 91 | Kurt 91 | Ben 93
Carl 90 | Jane 89 | Hugh 92
... | ... | ...
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
Fagin's Algorithm (I)
• Do sequential accesses on all lists until there are at least k matches, i.e. k objects seen in every list.
(Same sorted lists as above.)
• With k = 3, sequential accesses stop once 3 students — Ann, Hugh and Kurt — have been seen in all three lists.
Fagin's Algorithm (II)
• For each object that has been seen, do random accesses on the other lists to compute its score.
• Here, random accesses need to be done for Ben, Carl, Jane and Ryan.
Fagin's Algorithm (III)
• Select the k objects with the highest scores as the top-k result.
Why is FA correct? (I)
• When sequential access stops, at least k objects have been seen on all attributes.
• By monotonicity, the objects that have not been seen cannot have a higher score than these k objects.
Why is FA correct? (II)
• For every object that has been seen, either all of its attributes have been seen, or random accesses are performed to obtain the missing ones.
• The k objects with the highest scores are therefore the top-k result.
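A minimal sketch of FA in Python — our own illustration, not the original algorithm pseudocode. The toy lists mirror the slides' first five rows; the tail entries below depth 5 are invented so that the random accesses have something to look up:

```python
def fagin(lists, k, score=sum):
    """FA over m lists of (object, value) pairs sorted descending."""
    index = [dict(L) for L in lists]          # random access per list
    seen = [set() for _ in lists]
    depth = 0
    # Phase 1: sequential accesses until k objects occur in every list.
    while len(set.intersection(*seen) if depth else set()) < k:
        for L, s in zip(lists, seen):
            s.add(L[depth][0])
        depth += 1
    # Phase 2: random accesses for every object seen anywhere.
    candidates = set.union(*seen)
    scored = {o: score(idx[o] for idx in index) for o in candidates}
    return sorted(scored.items(), key=lambda kv: -kv[1])[:k]

lists = [
    [("Ann", 98), ("Ben", 96), ("Kurt", 93), ("Hugh", 91), ("Carl", 90),
     ("Ryan", 88), ("Jane", 87)],    # tails beyond the slides are made up
    [("Hugh", 97), ("Ryan", 94), ("Ann", 92), ("Kurt", 91), ("Jane", 89),
     ("Ben", 88), ("Carl", 85)],
    [("Kurt", 96), ("Ann", 95), ("Jane", 95), ("Ben", 93), ("Hugh", 92),
     ("Carl", 88), ("Ryan", 86)],
]
print(fagin(lists, 3))   # -> Ann 285, then Kurt and Hugh at 280
```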
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
The Threshold Algorithm (I)
• Do sequential accesses on all lists in parallel. When an object is seen, do random accesses on the other lists to compute its score.
• Here, random accesses are done for Ann, Hugh and Kurt first, then for Ben and Ryan.
The Threshold Algorithm (II)
• Remember the k objects with the highest scores, together with their scores.
• Score(Ann) = 285, Score(Hugh) = 280, Score(Kurt) = 280.
The Threshold Algorithm (III)
• Let the threshold value τ be the value of the scoring function on the last-seen values of all sorted lists.
• Halt as soon as at least k objects have score at least τ.
• Here τ(1) = 291, τ(2) = 285, τ(3) = 280, so with k = 3 the algorithm halts at depth 3.
Why is TA correct?
• By monotonicity, no unseen object can have a score higher than τ.
• For the objects that have been seen, random accesses give their exact scores, so the k objects with the highest scores are the top-k result.
(τ(1) = 291, τ(2) = 285, τ(3) = 280.)
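A matching sketch of TA, reusing the `lists` variable from the FA sketch above (again our own illustration):

```python
def threshold_algorithm(lists, k, score=sum):
    """TA over descending (object, value) lists with random access."""
    index = [dict(L) for L in lists]
    best = {}                                 # object -> exact score
    for depth in range(len(lists[0])):
        # One sorted access per list; random accesses for each new object.
        for L in lists:
            obj = L[depth][0]
            best[obj] = score(idx[obj] for idx in index)
        # Threshold: the scoring function on the last-seen values.
        tau = score(L[depth][1] for L in lists)
        topk = sorted(best.items(), key=lambda kv: -kv[1])[:k]
        if len(topk) == k and topk[-1][1] >= tau:
            return topk                       # k objects already reach tau

print(threshold_algorithm(lists, 3))   # halts at depth 3 with tau = 280
```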
Comparing TA with FA
• Number of sequential accesses: by the time FA stops its sequential accesses, τ is guaranteed to be no higher than the scores of the k objects seen on all sorted lists, so TA stops no later.
• Number of random accesses: TA requires m−1 random accesses for each object seen, but FA is expected to random-access more objects.
• Size of buffers used: the buffer used by FA can be unbounded; TA only needs to remember k objects with their k scores, plus the threshold value τ.
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
Random Access
• Random accesses can be impossible: in text retrieval, the sorted lists are the results of search engines.
• Random accesses can be expensive: sequential accesses on disk are orders of magnitude faster than random accesses.
• We need to consider not using random accesses at all, or using as few of them as possible.
No Random Access
• Without random access, all we know are upper bounds.
(Same sorted lists Lmath, Lphys, Lcomp as above.)
• Example: Carl's scores on physics and computer science are not higher than 89 and 92 respectively.
Lower and Upper Bounds
• If an object has not been seen on some attribute:
 – its lower bound on that attribute is 0;
 – its upper bound is the last value seen on that list.
• Example: the lower bound of Carl's score on physics is 0; the upper bound is 89.
Worst and Best Scores (I)
• W(R): the worst possible score of tuple R (unseen attributes at their lower bounds).
• B(R): the best possible score of tuple R (unseen attributes at their upper bounds).
• Example: W(Carl) = 90 and B(Carl) = 90 + 89 + 92 = 271.
Worst and Best Scores (II)
• W(R) ≤ score of R ≤ B(R); W(R) and B(R) get updated as values are sequentially accessed.
• After depth 1: W(Ann) = 98, W(Hugh) = 97, W(Kurt) = 96; B = 291 for all three.
• After depth 2: W(Ann) = 98→193, W(Hugh) = 97, W(Kurt) = 96, W(Ben) = 96, W(Ryan) = 94; B(Ann) = 291→287, B(Hugh) = 291→288, B(Kurt) = 291→286, B(Ben) = 285, B(Ryan) = 285.
• After depth 3: W(Ann) = 193→285, W(Hugh) = 97, W(Kurt) = 96→189, W(Ben) = 96, W(Ryan) = 94, W(Jane) = 95; B(Ann) = 287→285, B(Hugh) = 288→285, B(Kurt) = 286→281, B(Ben) = 285→283, B(Ryan) = 285→282, B(Jane) = 280.
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
No Random Access Algorithm (I)
• Maintain the last-seen values x1, x2, ..., xm.
• For every seen object, maintain its worst possible score together with its known attributes and their values.
• Example after two rounds: xmath = 96, xphys = 94, xcomp = 95; Ann is recorded with worst score 193 and known attributes {math = 98, comp = 95}.
No Random Access Algorithm (II)
• Why not also maintain the best possible score for each object? Because B values change whenever any last-seen value changes — they would be updated too frequently!
No Random Access Algorithm (III)
• Let M be the kth largest W value. An object R is viable if B(R) ≥ M.
• After depth 4: W(Ann) = 285, W(Hugh) = 97→188, W(Kurt) = 189→280, W(Ben) = 96→189, W(Ryan) = 94, W(Jane) = 95; B(Ann) = 285, B(Hugh) = 285→280, B(Kurt) = 281→280, B(Ben) = 283→280, B(Ryan) = 282→278, B(Jane) = 280→277; M = 189.
• After depth 5: W(Ann) = 285, W(Hugh) = 188→280, W(Kurt) = 280, W(Ben) = 189, W(Ryan) = 94, W(Jane) = 95→184; B(Ann) = 285, B(Hugh) = 285→280, B(Kurt) = 280, B(Ben) = 280→278, B(Ryan) = 278→276, B(Jane) = 277→274; M = 280.
No Random Access Algorithm (IV)
• Let the set T contain the objects with W(R) ≥ M.
• Halt when:
 – at least k objects have been seen on all sorted lists, and
 – no viable object is left outside T.
• Here T = {Ann, Hugh, Kurt}: every other object's B has fallen below M = 280, so the algorithm stops.
Why is NRA correct?
• W(R) ≤ score of R ≤ B(R) always holds.
• If an object R is not viable, then score of R ≤ B(R) < M, so there are at least k objects with scores not lower than R's.
• Therefore, if there is no viable object outside T and T contains at least k objects, T is the top-k result.
Comparing NRA with TA
• Number of sequential accesses: NRA makes at least as many sequential accesses as the last position of a top-k result on any attribute.
• Number of random accesses: NRA obviously makes 0.
• Size of buffers used: TA remembers k objects with their k scores plus the threshold value τ; NRA remembers all viable objects with their scores on all seen attributes, plus the last-seen value on every attribute.
How deep can NRA go?

L1: Ann 98, Hugh 97, Ben 60, Ryan 60, Carl 60, ..., Jane 60, Kurt 0
L2: Hugh 97, Kurt 96, Ryan 60, Ben 60, Jane 60, ..., Carl 60, Ann 0
L3: Kurt 96, Ann 95, Jane 60, Ben 60, Carl 60, ..., Ryan 60, Hugh 0

• The set T can be identified quickly, but the exact scores of its members only become certain at the very end of the lists.
• If we allow a relatively small number of random accesses, scanning the entire lists can be avoided.
Outline
• Two well-known top-k algorithms:
 – Fagin's Algorithm (FA)
 – The Threshold Algorithm (TA)
• Taking random access into consideration:
 – No Random Access Algorithm (NRA)
 – The Combined Algorithm (CA)
The Combined Algorithm (I)
• CA combines TA and NRA.
• cR: the cost of a random access; cS: the cost of a sequential access.
• Set h = ⌊cR/cS⌋.
• Run NRA, but every h steps run random accesses, as in TA.
• h = ∞ → never do random accesses: CA is then exactly NRA.
The Combined Algorithm (II)
(Same lists as in "How deep can NRA go?".)
• Random accesses for Ann, Hugh and Kurt quickly pin down their exact scores, so scanning to the bottom of the lists is avoided.
The Combined Algorithm (III)
• In CA, by doing random accesses we wish to either confirm that an object is a top-k result, or prune a viable object.
• As the number of random accesses in CA is limited, various heuristics can be used to optimize CA in terms of total cost.
Reference • Ronald Fagin, Amnon Lotem, Moni Naor: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4): 614-656 (2003)
Mining and Searching Complex Structures
Chapter 2 High Dimensional Data
Mining and Searching Complex Structures High Dimensional Data Anthony K. H. Tung(鄧锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html Social Network Link: http://www.renren.com/profile.do?id=313870900
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Sources of High Dimensional Data
• Microarray gene expression
• Text documents
• Images
• Features of sequences, trees and graphs
• Audio, video, human motion databases (spatio-temporal as well!)
Challenges of High Dimensional Data
• Indistinguishability: the distance between the two nearest points and the two furthest points could be almost the same.
• Sparsity: as a result, the data distribution is very sparse, giving no obvious indication of where the interesting knowledge is.
• Large number of combinations:
 – Efficiency: how to test the huge number of combinations?
 – Effectiveness: how do we understand and interpret so many combinations?
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Similarity Search: The Traditional Approach
• Objects are represented by multidimensional vectors, e.g.

Elevation | Aspect | Slope | Hillshade (9am) | Hillshade (noon) | Hillshade (3pm) | ...
2596 | 51 | 3 | 221 | 232 | 148 | ...

• The traditional approach to similarity search is the kNN query. Q = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1):

ID | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10 | Dist
P1 | 1.1 | 1 | 1.2 | 1.6 | 1.1 | 1.6 | 1.2 | 1.2 | 1 | 1 | 0.93
P2 | 1.4 | 1.4 | 1.4 | 1.5 | 1.4 | 1 | 1.2 | 1.2 | 1 | 1 | 0.98
P3 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1.73
P4 | 20 | 20 | 21 | 20 | 22 | 20 | 20 | 19 | 20 | 20 | 57.7
P5 | 19 | 21 | 20 | 20 | 20 | 21 | 18 | 20 | 22 | 20 | 60.5
P6 | 21 | 21 | 18 | 19 | 20 | 19 | 21 | 20 | 20 | 20 | 59.8
Deficiencies of the Traditional Approach
• The distance is dominated by a few dimensions with high dissimilarity.
• Partial similarities cannot be discovered.
• Example: change one attribute of each near point to an outlying value of 100 (d2 of P1, d6 of P2, d8 of P3); their distances to Q jump from 0.93, 0.98 and 1.73 to about 99.0, so the three near-matches become indistinguishable from the far points P4-P6.
Thoughts
• Aggregating too many dimensional differences into a single value results in too much information loss. Can we try to reduce that loss?
• While high dimensional data typically causes problems when it comes to similarity search, can we turn what is against us into an advantage?
• Our approach: since we have so many dimensions, we can compute more complex statistics over them to overcome some of the "noise" introduced by dimension scaling, outliers, etc.
The N-Match Query: Warm-Up
• Description: match two objects in n of the d dimensions (n ≤ d); the n dimensions are chosen dynamically to make the two objects match best.
• How to define a "match": exact match, or match with tolerance δ.
• In the similarity search example with n = 6, the n-match differences become P1: 0.2, P2: 0.4, P3: 0 (despite the outlying values of 100), and P4, P5, P6: 19 — the partially similar points P1-P3 now rank first.
The N-Match Query: The Definition
• The n-match difference: given two d-dimensional points P(p1, p2, ..., pd) and Q(q1, q2, ..., qd), let δi = |pi − qi|, i = 1, ..., d. Sort the array {δ1, ..., δd} in increasing order to get {δ1', ..., δd'}. Then δn' is the n-match difference between P and Q.
• The n-match query: given a d-dimensional database DB, a query point Q and an integer n (n ≤ d), find the point P ∈ DB that has the smallest n-match difference to Q. P is called the n-match of Q.
[Figure: 2-D points A-E around query Q; A is the 1-match of Q and B is the 2-match.]
• In the similarity search example with n = 8, P1's 8-match difference rises to 0.6 and P3's to 1, while P4-P6 stay far away.
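The definition translates directly into code. A minimal sketch using the slides' data (the function names and the dictionary layout are ours):

```python
def n_match_diff(p, q, n):
    """n-match difference: the n-th smallest per-dimension difference."""
    return sorted(abs(a - b) for a, b in zip(p, q))[n - 1]

def k_n_match(db, q, k, n):
    """The k points whose n-match difference to q is smallest."""
    return sorted(db, key=lambda name: n_match_diff(db[name], q, n))[:k]

q = (1,) * 10
db = {"P1": (1.1, 100, 1.2, 1.6, 1.1, 1.6, 1.2, 1.2, 1, 1),
      "P3": (1, 1, 1, 1, 1, 1, 2, 100, 2, 2)}
print(n_match_diff(db["P1"], q, 6))   # ~0.2: the 100 outlier is ignored
print(n_match_diff(db["P3"], q, 6))   # 0
print(k_n_match(db, q, 1, 6))         # ['P3']
```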
The N-Match Query: Extensions
• The k-n-match query: given a d-dimensional database DB, a query point Q, an integer k and an integer n, find a set S of k points from DB such that for any point P1 ∈ S and any point P2 ∈ DB−S, P1's n-match difference is smaller than P2's. S is called the k-n-match of Q.
• The frequent k-n-match query: given DB, Q, an integer k and an integer range [n0, n1] within [1, d], let S0, ..., Si be the answer sets of the k-n0-match, ..., k-n1-match queries; find a set T of k points such that for any P1 ∈ T and any P2 ∈ DB−T, P1's number of appearances in S0, ..., Si is larger than or equal to P2's.
[Figure: in the 2-D example, the 2-1-match of Q is {A, D} and the 2-2-match is {A, B}.]
• In the running example with n = 6, the 2-6-match of Q is {P3, P1} (differences 0 and 0.2).
Cost Model
• The multiple-system information retrieval model: objects are stored in different systems and scored by each system; each system can sort the objects according to their scores; a query retrieves the scores of objects from the different systems and combines them using some aggregation function.
• Q: color = "red" & shape = "round" & texture = "cloud". Each system lists objects in ascending order of score:

System 1 (Color): 1:0.4, 2:2.8, 5:3.5, 3:6.5, 4:9.0
System 2 (Shape): 1:1.0, 5:1.5, 2:5.5, 3:7.8, 4:9.0
System 3 (Texture): 1:1.0, 2:2.0, 3:5.0, 5:8.0, 4:9.0

• The cost: retrieval of scores, proportional to the number of scores retrieved.
• The goal: to minimize the number of scores retrieved.
The AD Algorithm
• The AD algorithm for the k-n-match query:
 – Locate the query's attribute value in every dimension.
 – Retrieve the objects' attribute values, moving away from the query's attribute in both directions.
 – The objects' attributes are retrieved in Ascending order of their Differences to the query's attributes; an n-match is found when an object has appeared n times.
• Example: the 2-2-match of Q = (3.0, 7.0, 4.0) over the three systems above. Auxiliary structures: g[2d] holds the next attribute to retrieve in each direction, appear[c] counts each object's appearances, and S is the answer set. Object 3 reaches two appearances first, then object 2, giving S = {3, 2}.
The AD Algorithm: Extensions
• The AD algorithm for the frequent k-n-match query:
 – Given an integer range [n0, n1], find the k-n0-match, k-(n0+1)-match, ..., k-n1-match of the query: S0, S1, ..., Si.
 – Find the k objects that appear most frequently in S0, S1, ..., Si.
 – This retrieves the same number of attributes as processing a single k-n1-match query.
• Disk-based solutions for the (frequent) k-n-match query:
 – Disk-based AD algorithm: sort each dimension and store it sequentially on disk; when reaching the end of a disk page, read the next page.
 – Existing indexing techniques: tree-like structures (R-trees, k-d-trees); mapping-based indexing (space-filling curves, iDistance); sequential scan; compression-based approaches (VA-file).
Experiments: Effectiveness
• Searching by k-n-match: COIL-100 database, 54 extracted features such as color histograms and area moments.

k-n-match query, k = 4:
n = 5: images 36, 42, 78, 94
n = 10: images 27, 35, 42, 78
n = 15: images 3, 38, 42, 78
n = 20: images 27, 38, 42, 78
n = 25: images 35, 40, 42, 94
n = 30: images 10, 35, 42, 94
n = 35 to 50: images 35, 42, 94, 96

Frequent k-n-match query (10 images returned): 13, 35, 36, 40, 42, 64, 85, 88, 94, 96.

• Searching by frequent k-n-match: UCI Machine Learning Repository data; competitors are IGrid and Human-Computer Interactive NN search (HCINN).

Data set (d) | IGrid | HCINN | Freq. k-n-match
Ionosphere (34) | 80.1% | 86% | 87.5%
Segmentation (19) | 79.9% | 83% | 87.3%
Wdbc (30) | 87.1% | N.A. | 92.5%
Glass (9) | 58.6% | N.A. | 67.8%
Iris (4) | 88.9% | N.A. | 89.6%

Experiments: Efficiency
• Disk-based algorithms for the frequent k-n-match query: Texture dataset (68,040 records) and a uniform dataset (100,000 records); competitors: the AD algorithm, the VA-file, and sequential scan. [figures]
Experiments: Efficiency (continued)
• Comparison with other similarity search techniques on the Texture and synthetic datasets; competitors: the frequent k-n-match query using the AD algorithm, IGrid, and sequential scan. [figures]
Future Work (I)
• We now have a natural way to handle similarity search for data with both categorical and numerical attributes. Investigating k-n-match performance on such mixed-type data is currently under way.
• Likewise, applying k-n-match to data with missing or uncertain attributes will be interesting.
• Query = {1, 1, 1, 1, 1, 1, 1, M, No, R}:

ID | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10
P1 | 1.1 | 1 | 1.2 | 1.6 | 1.1 | 1.6 | 1.2 | M | Yes | R
P2 | 1.4 | 1.4 | 1.4 | 1.5 | 1.4 | 1 | 1.2 | F | No | B
P3 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | M | No | B
P4 | 20 | 20 | 21 | 20 | 22 | 20 | 20 | M | Yes | G
P5 | 19 | 21 | 20 | 20 | 20 | 21 | 18 | F | Yes | R
P6 | 21 | 21 | 18 | 19 | 20 | 19 | 21 | F | Yes | Y
Future Work (I) (continued)
[The same query and table, now shown with several attribute values blanked out, illustrating search over data with missing values.]
Future Work (II)
• In general, three things affect the result of a similarity search: noise, scaling, and axis orientation. k-n-match reduces the effect of noise; the ultimate aim is a similarity function that is robust to all three.
• Eventually we will look at creating mining algorithms based on k-n-match.
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Motivation
[Diagram: queries against large data sets return results through a compressed summary.]
• Ever-increasing data collection rates of modern enterprises, and the need for effective, guaranteed-quality approximate answers to queries.
• Concern: compress as much as possible.
Conventional Compression Methods
• Try to find the optimal encoding of arbitrary strings in the input data: Huffman coding, Lempel-Ziv coding (gzip).
• View the whole table as one large byte string.
• Statistical or dictionary based; operate at the byte level.
Why not just "syntactic"?
• Syntactic methods do not exploit the complex dependency patterns in the table.
• Individual retrieval of tuples is difficult.
• They do not utilize lossy compression.
Semantic compression methods
• Derive a descriptive model M.
• Identify which data values can be derived from M (within some error tolerance), which are essential for deriving, and which are the outliers.
• Derived values need not be stored; only the outliers do.
Advantages
• More complex analysis, e.g. detecting correlation among columns.
• Fast retrieval: tuple-wise access.
• Query enhancement: it is possible to answer queries directly from the discovered semantics, and to compress in a way that enhances the answering of some complex queries, e.g. "Go Green: Recycle and Reuse Frequent Patterns", C. Gao, B. C. Ooi, K. L. Tan and A. K. H. Tung, ICDE 2004.
• Choose a combination of compression methods based on semantic and syntactic information.
Fascicles
• Key observation: often, numerous subsets of records in T have similar values for many attributes.

Protocol | Duration | Bytes | Packets
http | 12 | 20K | 3
http | 16 | 24K | 5
http | 15 | 20K | 8
http | 19 | 40K | 11
http | 26 | 58K | 18
ftp | 27 | 100K | 24
ftp | 32 | 300K | 35
ftp | 18 | 80K | 15

• Compress the data by storing representative values (e.g., a "centroid") only once for each attribute cluster.
• Lossy compression: the information loss is controlled by the notion of "similar values" for attributes (user-defined).
ItCompress: Compression Format

Representative rows (patterns):
RRid | age | salary | credit | sex
1 | 30 | 90k | good | F
2 | 70 | 35k | poor | M

Error tolerance: age 5, salary 25k, credit 0, sex 0.

Original table → compressed table (RRid, bitmap, outlying values):
(20, 30k, poor, M) → 2, 0111, {age = 20}
(25, 76k, good, F) → 1, 1111
(30, 90k, good, F) → 1, 1111
(40, 100k, poor, M) → 1, 0100, {age = 40, credit = poor, sex = M}
(50, 110k, good, F) → 1, 0111, {age = 50}
(60, 50k, good, M) → 1, 0010, {age = 60, salary = 50k, sex = M}
(70, 35k, poor, F) → 2, 1110, {sex = F}
(75, 15k, poor, M) → 2, 1111
Some definitions
• Error tolerance:
 – Numeric attributes: the upper bound on how much the compressed value x' may differ from the actual value x, i.e. x ∈ [x' − ei, x' + ei].
 – Categorical attributes: the upper bound on the probability that the compressed value differs from the actual value; given an actual value x and error tolerance ei, the compressed value x' should satisfy Prob(x = x') ≥ 1 − ei.
Some definitions
• Coverage: let R be a row in the table T and Pi a pattern; the coverage of Pi on R is
 cov(Pi, R) = the number of attributes Xi for which R[Xi] is matched by Pi[Xi].
• Total coverage: let P be a set of patterns P1, ..., Pk and let the table T contain n rows R1, ..., Rn; then
 totalcov(P, T) = Σ_{i=1..n} cov(Pmax(Ri), Ri),
 where Pmax(Ri) is the pattern that covers Ri the most.
ItCompress: basic algorithm
• First randomly choose k rows as the initial patterns.
• Phase 1 — scan the table T: for each row R, compute the coverage of each pattern on it, find Pmax(R), and allocate R to its most-covering pattern.
• Phase 2 — after each iteration, recompute every pattern's attribute values, always using the most frequent values in its group.
• Iterate until the total coverage stops increasing.
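A minimal sketch of the two-phase loop, with simplifications we chose ourselves: a fixed number of iterations instead of the convergence test, the first k rows as initial patterns instead of random ones, and the plain mode for numeric attributes (the paper works with value intervals):

```python
from collections import Counter

def covers(pattern, row, tol):
    """Per-attribute match flags of a row against a pattern."""
    return [abs(r - p) <= t if isinstance(p, (int, float)) else r == p
            for p, r, t in zip(pattern, row, tol)]

def itcompress(rows, k, tol, iters=20):
    patterns = [list(r) for r in rows[:k]]        # simple initial choice
    for _ in range(iters):
        # Phase 1: assign each row to the pattern covering it the most.
        groups = [[] for _ in range(k)]
        for row in rows:
            best = max(range(k),
                       key=lambda i: sum(covers(patterns[i], row, tol)))
            groups[best].append(row)
        # Phase 2: rebuild each pattern from the most frequent values.
        for i, g in enumerate(groups):
            if g:
                patterns[i] = [Counter(col).most_common(1)[0][0]
                               for col in zip(*g)]
    return patterns
```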
Example: the 1st iteration begins
• Two rows, (20, 30k, poor, M) and (25, 76k, good, F), are chosen as the initial representative rows (patterns 1 and 2).
Example: Phase 1
• Each row is allocated to the pattern that covers it the most: (20, 30k, poor, M), (40, 100k, poor, M), (60, 50k, good, M), (70, 35k, poor, F) and (75, 15k, poor, M) go to pattern 1, while (25, 76k, good, F), (30, 90k, good, F) and (50, 110k, good, F) go to pattern 2.
Example: Phase 2
• Each pattern's attribute values are recomputed from the rows allocated to it, always taking the most frequent values: pattern 1's age changes from 20 to 70, and pattern 2's salary changes from 76k to 90k.
Convergence (I)
• Phase 1: when we assign the rows to their most-covering patterns, each row's coverage increases or stays the same, so the total coverage also increases or stays the same.
• Phase 2: when we recompute the attribute values of the patterns, each pattern's coverage increases or stays the same, so again the total coverage increases or stays the same.
Convergence (II)
• In both phases the total coverage either increases or stays the same, and it has an obvious upper bound (covering the whole table), so the algorithm eventually converges.
Complexity
• Phase 1: over l iterations, we go through the n rows of the table and match each row against the k patterns (2m comparisons each), giving a running time of O(kmnl), where m is the number of attributes.
• Phase 2: computing each new pattern Pi requires going through all the domain values/intervals of each attribute; assuming the total number of domain values/intervals is d, the running time is O(kdl).
• Total time complexity: O(kmnl + kdl).
Advantages of ItCompress
• Simplicity and directness: Fascicles and SPARTAN use a two-phase process (find rules/patterns, then compress the database using them); ItCompress optimizes the compression directly, without finding rules/patterns that may not be useful (a.k.a. the microeconomic approach).
• Fewer constraints: it does not require patterns to be matched completely, or rules that apply globally.
• Easily tuned parameters.
Performance Comparison
• Algorithms: ItCompress, ItCompress+gzip; Fascicles, Fascicles+gzip; SPARTAN+gzip.
• Platform: ItCompress and Fascicles on an AMD Duron 700 MHz with 256 MB memory; SPARTAN on four 700 MHz Pentium CPUs with 1 GB memory.
• Datasets:
 – Corel: 32 numeric attributes, 35,000 rows, 10.5 MB
 – Census: 7 numeric and 7 categorical attributes, 676,000 rows, 28.6 MB
 – Forest-cover: 10 numeric and 44 categorical attributes, 581,000 rows, 75.2 MB
Effectiveness (Corel) [figure]
Effectiveness (Census) [figure]
Effectiveness (Forest Cover) [figure]
Efficiency [figure]
Varying k [figure]
Varying Sample Ratio [figure]
Adding Noise (Census) [figure]
Effect of Corruption
[Figures: compression results on attributes A1-A12 with 20% of the values corrupted.]
Findings
• ItCompress is more efficient than SPARTAN, more effective than Fascicles, insensitive to parameter settings, and robust to noise.
Future work
• Can we perform mining on the compressed datasets using only the patterns and the bitmaps? Example: building a Bayesian belief network.
• Is ItCompress a good "bootstrap" semantic compression algorithm?
[Diagram: database → ItCompress → compressed database → other semantic compression algorithms.]
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Metric vs. Non-Metric
• Euclidean distance dominates DB queries, but similarity in human perception works differently.
• Metric distance is not enough!
Bregman Divergence
[Figure: a convex function f(x) with points (p, f(p)) and (q, f(q)); the Bregman divergence Df(p, q) is the vertical gap at p between f and the tangent of f at q, while the Euclidean distance between p and q lies along the x-axis.]
Bregman Divergence
• Mathematical interpretation: the divergence from q to p is defined as the difference at p between f(p) and the first-order Taylor expansion of f at q:

 Df(p, q) = f(p) − f(q) − ⟨∇f(q), p − q⟩
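The definition is only a few lines of code. A sketch of our own, checking that f(x) = ||x||² recovers the squared Euclidean distance and that f(x) = Σ x log x recovers the KL-divergence:

```python
import numpy as np

def bregman(f, grad_f, p, q):
    """D_f(p, q) = f(p) - f(q) - <grad f(q), p - q>."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return f(p) - f(q) - grad_f(q) @ (p - q)

sq = lambda x: float(x @ x)                     # f(x) = ||x||^2
sq_grad = lambda x: 2 * x
print(bregman(sq, sq_grad, [1.0, 2.0], [0.0, 0.0]))   # 5.0 = squared L2

kl = lambda x: float(np.sum(x * np.log(x)))     # f(x) = sum x log x
kl_grad = lambda x: np.log(x) + 1
print(bregman(kl, kl_grad, [.2, .8], [.5, .5]))       # KL(p || q)
```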
Bregman Divergence
• General properties:
 – Non-negativity: Df(p, q) ≥ 0 for any p, q.
 – Identity of indiscernibles: Df(p, p) = 0 for any p.
 – Symmetry and the triangle inequality do NOT hold any more.
Examples

Distance | f(x) | Df(p, q) | Usage
KL-divergence | x log x | p log(p/q) | distributions, color histograms
Itakura-Saito distance | −log x | p/q − log(p/q) − 1 | signal, speech
Squared Euclidean | x² | (p − q)² | Euclidean space
Von Neumann entropy | tr(X log X − X) | tr(X log X − X log Y − X + Y) | symmetric matrices
Why in a DB system?
• Database applications: retrieval of similar images, speech signals, or time series; optimization over matrices in machine learning. Efficiency is important!
• Query types: nearest neighbor queries and range queries.
Euclidean Space
• How to answer the queries in Euclidean space: the R-tree. [figure]
• Or the VA file. [figure]
Our goal
• Re-use the infrastructure of existing DB systems to support Bregman divergences: storage management, indexing structures, and query processing algorithms.
Basic Solution
• Extended space: for the convex function f(x) = x², append an extra dimension holding the value of f at each point:

point | D1 | D2 | → | point | D1 | D2 | D3
p | 0 | 1 | → | p+ | 0 | 1 | 1
q | 0.5 | 0.5 | → | q+ | 0.5 | 0.5 | 0.5
r | 1 | 0.8 | → | r+ | 1 | 0.8 | 1.64
t | 1.5 | 0.3 | → | t+ | 1.5 | 0.3 | 3.15
Basic Solution
• After the extension: index the extended points with an R-tree or VA file, and re-use the existing algorithms with lower and upper bounds on the rectangles.
How to improve?
• Reformulate the Bregman divergence and derive tighter bounds, with no change to index construction or the query processing algorithms.
A New Formulation
[Figure: with respect to a query vector vq, the reformulated divergence D*f(p, q) differs from Df(p, q) only by a query-dependent shift Δ.]
Mathematical Interpretation
• Reformulation of the similarity search queries:
 – k-NN query: given query q, data set P and divergence Df, find the point p minimizing Df(p, q).
 – Range query: given query q and threshold θ, return every point p with Df(p, q) ≤ θ.
Naïve Bounds
• Check the corners of the bounding rectangles. [figure]
Tighter Bounds
• Take the curvature of f(x) into consideration. [figure]
Query distribution
• Distortion of a rectangle: the difference between the maximum and minimum distances from inside the rectangle to the query.
Can we improve it more?
• When building an R-tree in Euclidean space, we minimize the volume/edge length of the MBRs — does this remain valid here?
Query distribution
• The distortion of a bounding rectangle is invariant in Euclidean space (by the triangle inequality), but query-dependent for a Bregman divergence.
Utilizing the Query Distribution
• Summarize the query distribution with O(d) real numbers.
• Estimate the expected distortion of any bounding rectangle in O(d) time.
• This allows better indexes to be constructed for both the R-tree and the VA file.
Experiments
• Data sets:
 – KDD'99 data: network data; the proportion of packets in 72 different TCP/IP connection types.
 – DBLP data: the co-authorship graph is used to generate the probabilities of authors being related to 8 different areas.
Experiments
• Data sets (continued):
 – Uniform synthetic data: generated with a uniform distribution.
 – Clustered synthetic data: generated with a Gaussian mixture model.
Experiments
• Methods compared:

 | Basic | Improved bounds | Query distribution
R-tree | R | R-B | R-BQ
VA file | V | V-B | V-BQ
Linear scan | LS | |
BB-tree | BBT | |
Existing Solution
• The BB-tree (L. Cayton, ICML 2008): a memory-based indexing tree constructed with k-means clustering; hard to update and ineffective in high-dimensional space.
Experiments
• Index construction time. [figure]
Experiments
• Varying dimensionality. [figures]
Experiments
• Varying data cardinality. [figure]
Conclusion
• A general technique for similarity search under Bregman divergences.
• All techniques build on the existing infrastructure of commercial databases.
• Extensive experiments compare the performance of the R-tree and VA file under the different optimizations.
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity function: k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distances
• Earth Mover's Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
Motivation
• Probabilistic data is ubiquitous: it represents data uncertainty (WSN, RFID, moving-object monitoring) and is used to compress data (image processing).
• Histograms are a good way to represent probabilistic data: easy to capture, and very useful in image representation (colors, textures, gradients, depth).
Motivation
• Similarity search is important for managing probabilistic data: given a threshold θ, we can answer which sensors' readings are similar to sensor A's (range query), or which k pictures are most similar (top-k query).
• The similarity function for probabilistic data should be carefully chosen:
 – Bin-by-bin methods: L1 and L2 norms; χ² distance.
 – Cross-bin methods: Earth Mover's Distance (EMD); quadratic form distance.
Outline
• Motivation
• Introduction to Earth Mover's Distance (EMD)
• Related work
• Indexing probabilistic data based on EMD
• Experimental results
• Conclusion and future work
Introduction to Earth Mover's Distance
• Bin-by-bin vs. cross-bin matching (figure)
 – Bin-by-bin: not good, cannot handle a distribution shift
 – Cross-bin: good, can handle a distribution shift
Introduction to Earth Mover's Distance
• What is EMD?
 – Earth, Mover, Distance
 – Can be understood as the cost of moving earth from one place to another
• See the following example…
Moving Earth (figures)
• One histogram is viewed as piles of earth, the other as holes to be filled; two histograms match when the earth can be moved to fill the holes exactly (=), otherwise they differ (≠)
• The Difference? Counting only the (amount moved) does not distinguish matchings; the work of each movement is (amount moved) * (distance moved)
Linear Programming (figures)
• P is a histogram with m bins (clusters); Q is a histogram with n bins (clusters)
• A movement plan assigns a flow f_ij from bin i of P to bin j of Q; each movement costs (distance moved) * (amount moved), i.e. d_ij * f_ij
• Over all movements, EMD minimizes the total cost Σ_i Σ_j d_ij * f_ij
Constraints
1. Move "earth" only from P to Q: f_ij ≥ 0
2. Cannot send more "earth" than there is: Σ_j f_ij ≤ p_i for every bin i of P
3. Q cannot receive more "earth" than it can hold: Σ_i f_ij ≤ q_j for every bin j of Q
4. As much "earth" as possible must be moved: Σ_i Σ_j f_ij = min(Σ_i p_i, Σ_j q_j)
The Formal Definition of EMD
• Earth Mover's Distance (EMD): the minimum amount of work, subject to the constraints above, needed to change one histogram into another
• Challenge of EMD: solving this linear program takes O(N³ log N) time in the number of bins N
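EMD is itself a small linear program, so the definition can be made concrete with an off-the-shelf LP solver. Below is a minimal sketch under the four constraints above, using SciPy's linprog; the histograms p and q, the ground-distance matrix D, and all names are illustrative, and real systems use specialized transportation solvers rather than a dense LP.

```python
import numpy as np
from scipy.optimize import linprog

def emd(p, q, D):
    """EMD between histograms p (m bins) and q (n bins) with ground distances D."""
    m, n = len(p), len(q)
    c = D.reshape(m * n)                      # minimize sum_ij d_ij * f_ij
    A_row = np.zeros((m, m * n))              # sum_j f_ij <= p_i  (constraint 2)
    for i in range(m):
        A_row[i, i * n:(i + 1) * n] = 1.0
    A_col = np.zeros((n, m * n))              # sum_i f_ij <= q_j  (constraint 3)
    for j in range(n):
        A_col[j, j::n] = 1.0
    total = min(p.sum(), q.sum())             # constraint 4: move as much as possible
    res = linprog(c,
                  A_ub=np.vstack([A_row, A_col]),
                  b_ub=np.concatenate([p, q]),
                  A_eq=np.ones((1, m * n)), b_eq=[total],
                  bounds=(0, None),           # constraint 1: f_ij >= 0
                  method="highs")
    return res.fun / total                    # work normalized by total flow

p, q = np.array([0.4, 0.6]), np.array([0.5, 0.5])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
print(emd(p, q, D))                           # 0.1: move 0.1 unit a distance of 1
```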
Related Work
• Filter-and-refine framework
 – [1] Approximation Techniques for Indexing the Earth Mover's Distance in Multimedia Databases. ICDE 2006 (cannot handle high-dimensional histograms)
 – [2] Efficient EMD-based Similarity Search in Multimedia Databases via Flexible Dimensionality Reduction. SIGMOD 2008 (based on a scan framework, which limits scalability)
• Both use a scanning scheme to process queries
 – Merit: obtains a good access order when executing k-NN queries and thus minimizes the number of candidates
 – Demerit: must scan the whole dataset to obtain the order, hence low algorithmic scalability
Related Work vs. Our Work
• Related work
 – Based on the filter-and-refine framework
 – Based on a scanning method, hence low scalability
• Our work
 – Also based on the filter-and-refine method
 – But avoids scanning the whole data set by using B+ trees, and thus achieves high scalability
• Our contributions
 – To the best of our knowledge, the first work to index high-dimensional probabilistic data based on EMD
 – Proposed algorithms for processing similarity queries based on a B+ tree filter
 – Improved the efficiency and scalability of EMD-based similarity search
Indexing the Probabilistic Data Based on EMD
• Our intuition: the primal-dual theory of linear programming
• Primal problem (EMD): minimize the total work subject to the flow constraints
• Dual problem: maximize a lower bound on the work using dual variables (π, Φ)
Indexing the Probabilistic Data Based on EMD
• Good properties of the dual space
 – The constraints of the dual space are independent of the data points (i.e., p and q in this example)
  • Thus, given any feasible solution (π, Φ) in the dual space, we can derive a lower bound for EMD(p, q)
  • The lower bound helps filter out histograms that cannot be hits
 – Given any feasible solution (π, Φ) in the dual space, a histogram p can be mapped to a single value (its key)
  • Histograms can then be indexed using a B+ tree
Indexing the Probabilistic Data Based on EMD
• 1. Mapping construction: key and counter key
 – Assuming p is a histogram in the database, given a feasible solution (π, Φ), we calculate the key for each record in the database
 – We index those keys using a B+ tree
 – For each feasible solution (π, Φ), one B+ tree is constructed
Answering Range Queries
• Range query based on the B+ tree index
 – Given any feasible solution (π, Φ), we construct a B+ tree using the keys of the histograms
 – Given a query histogram, we calculate its counter key
 – Given a similarity-search threshold θ, we have proved that every candidate histogram's key is bounded by an interval determined by the counter key and θ
 – To further filter the candidates, we use L B+ trees and intersect their candidate results
Answering k-NN Queries
• k-NN query based on the B+ tree index
 – Given a query q, we issue a search on each B+ tree T_l with key(q, Φ_l)
 – We create two cursors for each tree and let them fetch records in opposite directions (one left, one right)
 – Whenever a record r has been accessed by all B+ trees, it is output as a candidate for the k-NN query (see the sketch below)
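A minimal sketch of the two-cursor candidate generation, with each B+ tree T_l simulated by a list of (key, id) pairs sorted by key; names and the round-robin schedule are illustrative.

```python
import bisect

def knn_candidates(trees, query_keys):
    """trees: one sorted list of (key, id) pairs per dual solution;
    query_keys: key(q, phi_l) for each tree. Yields ids in the order
    they have been fetched by *all* trees."""
    L = len(trees)
    seen = {}                                  # id -> number of trees that fetched it
    cursors = []
    for t, qk in zip(trees, query_keys):
        pos = bisect.bisect_left([k for k, _ in t], qk)
        cursors.append([pos - 1, pos])         # [left cursor, right cursor]
    while True:
        progressed = False
        for l, t in enumerate(trees):
            left, right = cursors[l]
            qk = query_keys[l]
            # Fetch the record whose key is closer to the query key.
            if left >= 0 and (right >= len(t) or qk - t[left][0] <= t[right][0] - qk):
                rid, cursors[l][0] = t[left][1], left - 1
            elif right < len(t):
                rid, cursors[l][1] = t[right][1], right + 1
            else:
                continue                       # this tree is exhausted
            progressed = True
            seen[rid] = seen.get(rid, 0) + 1
            if seen[rid] == L:
                yield rid
        if not progressed:
            return
```

Yielded ids are only candidates: each is refined with the exact EMD, and the scan stops once the key-distance lower bounds show that no unseen record can beat the current k-th best (omitted here).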
Experimental Setup
• Three real data sets
 – RETINA1: an image data set consisting of 3932 feline retina scans labeled with various antibodies
 – IRMA: contains 10000 radiography images from the Image Retrieval in Medical Applications (IRMA) project
 – DBLP, with corresponding parameter settings
Experimental Results on Query CPU Time (figure)
Experimental Results on Scalability (figure: our method vs. the SIGMOD 2008 scan-based method)
Conclusions
• We present a new indexing scheme for general-purpose similarity search on Earth Mover's Distance
• Our index method relies on primal-dual theory to construct mapping functions from the original probabilistic space to a one-dimensional domain
• Our B+ tree-based index framework has
 – High scalability
 – High efficiency
 – The ability to handle high-dimensional data
Outline
• Sources of HDD
• Challenges of HDD
• Searching and Mining Mixed-Type Data
 – Similarity Function on k-n-match
 – ItCompress
• Bregman Divergence: Towards Similarity Search on Non-metric Distance
• Earth Mover Distance: Similarity Search on Probabilistic Data
• Finding Patterns in High Dimensional Data
A Microarray Dataset
• 1000–100,000 columns (genes), 100–500 rows (samples), plus a Class column:

             Gene1  Gene2  Gene3  Gene4  Gene5  Gene6  ...  Class
  Sample1    ...                                             Cancer
  Sample2    ...                                             Cancer
  ...
  SampleN-1  ...                                             ~Cancer
  SampleN    ...                                             ~Cancer

• Find closed patterns which occur frequently among the genes
• Find rules which associate certain combinations of the columns with the class of the rows
 – Gene1, Gene10, Gene1001 -> Cancer
Challenge I
• Large number of patterns/rules: the number of possible column combinations is extremely high
• Solution: the concept of a closed pattern
 – Patterns that are found in exactly the same set of rows are grouped together and represented by their upper bound
• Example: the patterns aeh, ae, ah, eh, e, h are all found in rows 2, 3 and 4; aeh is the upper bound (closed pattern) and e, h are the lower bounds; "a", however, is not part of the group (it also occurs in row 1)

  i   r_i                Class
  1   a,b,c,l,o,s        C
  2   a,d,e,h,p,l,r      C
  3   a,c,e,h,o,q,t      C
  4   a,e,f,h,p,r        ~C
  5   b,d,f,g,l,q,s,t    ~C
Challenge II
• Most existing frequent-pattern discovery algorithms search the column/item enumeration space, i.e., systematically test various combinations of columns/items
• For datasets with 1000–100,000 columns, this search space is enormous
• Instead, we adopt a novel row/sample enumeration approach; CARPENTER (SIGKDD'03) is the first algorithm to adopt it
Column/Item Enumeration Lattice
• Each node in the lattice represents a combination of columns/items
• An edge exists from node A to node B if A is a subset of B and A differs from B by only one column/item
• Search can be done breadth-first, starting from {} and growing the itemsets one item at a time (lattice figure over the example table above)
Column/Item Enumeration Lattice
• Search can also be done depth-first
• Keep an edge from parent to child only if the child is a prefix of the parent (lattice figure over the same example table)
General Framework for Column/Item Enumeration

                                  Read-based             Write-based                                    Point-based
  Association Mining              Apriori [AgSr94], DIC  Eclat, MaxClique [Zaki01], FPGrowth [HaPe00]   H-mine
  Sequential Pattern Discovery    GSP [AgSr96]           SPADE [Zaki98, Zaki01], PrefixSpan [PHPC01]
  Iceberg Cube                    Apriori [AgSr94]       BUC [BeRa99], H-Cubing [HPDW01]
A Multidimensional View (figure)
• Types of data or knowledge: associative patterns, sequential patterns, iceberg cubes, others
• Pruning methods: constraints, compression (closed/max patterns), other interest measures
• Lattice traversal / main operations: read, write, point
Sample/Row Enumeration Algorithms
• To avoid searching the large column/item enumeration space, our mining algorithms search for patterns/rules in the sample/row enumeration space
• Our algorithms do not fit into the column/item enumeration framework
• They are not YAARMA (Yet Another Association Rules Mining Algorithm)
• Column/item enumeration algorithms simply do not scale to microarray datasets
Existing Row/Sample Enumeration Algorithms
• CARPENTER (SIGKDD'03): finds closed patterns using row enumeration
• FARMER (SIGMOD'04): finds interesting rule groups and builds classifiers based on them
• COBBLER (SSDBM'04): combines row and column enumeration for tables with large numbers of rows and columns
• Topk-IRG (SIGMOD'05): finds top-k covering rules for each sample and builds a classifier directly
• Efficiently Finding Lower Bound Rules (TKDE 2010): Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. What is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data. Accepted in TKDE
Concepts of CARPENTER

Example table:
  i   r_i                Class
  1   a,b,c,l,o,s        C
  2   a,d,e,h,p,l,r      C
  3   a,c,e,h,o,q,t      C
  4   a,e,f,h,p,r        ~C
  5   b,d,f,g,l,q,s,t    ~C

Transposed table TT (item i_j -> row ids R(i_j), split by class):
  i_j   C        ~C
  a     1,2,3    4
  b     1        5
  c     1,3
  d     2        5
  e     2,3      4
  f              4,5
  g              5
  h     2,3      4
  l     1,2      5
  o     1,3
  p     2        4
  q     3        5
  r     2        4
  s     1        5
  t     3        5

Conditional transposed table TT|{2,3} (items appearing in both rows 2 and 3):
  i_j   C        ~C
  a     1,2,3    4
  e     2,3      4
  h     2,3      4
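Building the transposed table is a single pass over the rows; a small sketch on the example table above, with row and item names as in the slides:

```python
from collections import defaultdict

rows = {1: set("abclos"), 2: set("adehplr"), 3: set("acehoqt"),
        4: set("aefhpr"), 5: set("bdfglqst")}
classes = {1: "C", 2: "C", 3: "C", 4: "~C", 5: "~C"}

tt = defaultdict(set)                 # item -> ids of rows containing it
for rid, items in rows.items():
    for item in items:
        tt[item].add(rid)

# Conditional table TT|{2,3}: items shared by rows 2 and 3.
cond = {item: rids for item, rids in tt.items() if {2, 3} <= rids}
print(sorted(cond))                   # ['a', 'e', 'h']
```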
Row Enumeration (figure)
• Rows are enumerated depth-first from {}: 1, 12, 123, 1234, 12345, 1235, 124, 1245, 125, 13, ...
• Each node X is labeled with the itemset shared by all rows in X (the closed pattern of TT|X):
  {1}:abclos  {2}:adehplr  {3}:acehoqt  {4}:aefhpr  {5}:bdfglqst
  {12}:al  {13}:aco  {14}:a  {15}:bls  {23}:aeh  {24}:aehpr  {25}:dl  {34}:aeh  {35}:q  {45}:f
  {123}:a  {124}:a  {125}:l  {134}:a  {234}:aeh  {1234}:a  (all other nodes: {})
• Conditional transposed tables are maintained along each branch, e.g. TT|{1} contains a, b, c, l, o, s; TT|{12} contains a, l; TT|{124} contains a
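A compact sketch of the row-enumeration idea: depth-first search over row sets, intersecting the rows' item sets; the intersection at a node is the closed pattern supported by exactly that row set. CARPENTER's pruning methods and conditional-table machinery are omitted, and names are illustrative.

```python
def enumerate_rows(rows, minsup):
    """rows: dict row_id -> frozenset of items. Returns closed pattern -> row set."""
    ids = sorted(rows)
    results = {}

    def recurse(chosen, items, start):
        if len(chosen) >= minsup and items:
            key = frozenset(items)
            # Keep the largest row set supporting this pattern (its closure).
            if key not in results or len(chosen) > len(results[key]):
                results[key] = tuple(chosen)
        for k in range(start, len(ids)):
            rid = ids[k]
            new_items = items & rows[rid] if chosen else rows[rid]
            if new_items:                       # prune empty conditional tables
                recurse(chosen + [rid], new_items, k + 1)

    recurse([], frozenset(), 0)
    return results

rows = {1: frozenset("abclos"), 2: frozenset("adehplr"),
        3: frozenset("acehoqt"), 4: frozenset("aefhpr"),
        5: frozenset("bdfglqst")}
for pat, sup in enumerate_rows(rows, minsup=3).items():
    print("".join(sorted(pat)), "rows:", sup)
```

On the example table with minsup = 3 this reports {a} (rows 1,2,3,4), {aeh} (rows 2,3,4) and {l} (rows 1,2,5), matching the enumeration tree above.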
Pruning Method 1
• Rows that appear in all tuples of the conditional transposed table can be removed without affecting the result
• Example: in TT|{2,3} (a: 1,2,3 / 4; e: 2,3 / 4; h: 2,3 / 4), row 4 appears in every tuple, i.e. r4 has 100% support in the conditional table of "r2 r3"; both {r2,r3} and {r2,r3,r4} yield {aeh}, so the branch "r2 r3 r4" is pruned
Pruning Method 2 (figure: the same row enumeration tree)
• If the closed pattern at a node has been discovered before, we can prune the enumeration below this node
 – Because all patterns below this node must also have been discovered before
 – For example, at node {34}, TT|{3,4} yields {aeh}; since {aeh} was already found (at node {23}), all branches below {34} are pruned
Pruning Method 3: Minimum Support
• Example: from TT|{1} (below), the support of any pattern below node {1} can be at most 5 rows; if the minimum support threshold exceeds what the remaining rows can provide, the branch is pruned

  TT|{1}:
  i_j   C      ~C
  a     1,2,3  4
  b     1      5
  c     1,3
  l     1,2    5
  o     1,3
  s     1      5
From CARPENTER to FARMER
• What if classes exist? What more can we do?
• Pruning with interestingness measures
 – Minimum confidence
 – Minimum chi-square
• Generate lower bounds for classification/prediction
Interesting Rule Groups
• Concept of a rule group / equivalence class
 – Rules supported by exactly the same set of rows are grouped together
• Example: the following rules are derived from rows 2, 3 and 4 with 66% confidence (same example table as above)
 – aeh -> C (66%) is the upper bound
 – ae -> C, ah -> C, eh -> C (66%) belong to the group
 – e -> C, h -> C (66%) are the lower bounds
 – a -> C, however, is not in the group (it is also supported by row 1)
Pruning by Interestingness Measures
• In addition, find only interesting rule groups (IRGs) based on some measures:
 – minconf: the rules in the rule group predict the class on the RHS with high confidence
 – minchi: there is a high correlation between the LHS and RHS of the rules based on a chi-square test
• Other measures such as lift, entropy gain, conviction, etc. can be handled similarly
Ordering of Rows: All Class C before ~C (figure: row enumeration tree)
• Rows of class C (1, 2, 3) are enumerated before rows of class ~C (4, 5), so conditional tables such as TT|{1}, TT|{12} and TT|{124} are built with the C rows first
• This ordering enables the confidence-based pruning described next
Pruning Method: Minimum Confidence
• Example: in TT|{2,3} below, the maximum confidence of all rules below node {2,3} is at most 4/5: the largest class-C row list has 4 rows (a: 1,2,3,6), and row 4 (~C) appears in every tuple, so at least one ~C row is always in the support

  TT|{2,3}:
  i_j   C          ~C
  a     1,2,3,6    4,5
  e     2,3,7      4,9
  h     2,3        4
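One way to arrive at the 4/5 bound in this example is sketched below: the numerator is at most the largest class-C row list, while the denominator must include every ~C row that appears in all items of the conditional table. This reproduces the slide's number, but FARMER's exact bound may be computed differently.

```python
def max_confidence(cond_tt):
    """cond_tt: item -> (set of class-C rows, set of class-~C rows)."""
    best_c = max(len(c) for c, _ in cond_tt.values())
    # ~C rows present in every item's list can never leave the support.
    common_nc = set.intersection(*(nc for _, nc in cond_tt.values()))
    return best_c / (best_c + len(common_nc))

tt23 = {"a": ({1, 2, 3, 6}, {4, 5}),
        "e": ({2, 3, 7}, {4, 9}),
        "h": ({2, 3}, {4})}
print(max_confidence(tt23))           # 0.8, i.e. the 4/5 bound on the slide
```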
Pruning Method: Minimum Chi-square
• Bounded in the same way as the maximum confidence, using a 2x2 contingency table derived from TT|{2,3}:

          C          ~C         Total
  A       max=5      min=1      computed
  ~A      computed   computed   computed
  Total   constant   constant   constant

• The row and column totals are constant, so the remaining cells follow from the bounded ones, giving a bound on the chi-square value
Finding Lower Bounds: MineLB (figure: lattice over {a, b, c, d, e})
• Example: an upper-bound rule with antecedent A = abcde, and two rows (r1: abcf) and (r2: cdeg)
• Initialize the lower bounds to {a, b, c, d, e}
• Add "abcf": the new lower bounds are {d, e}
• Add "cdeg": the candidate lower bounds are ad, ae, bd, be, cd, ce; ad, ae, bd and be are kept since no other lower bound overrides them, while cd and ce are removed as overridden; the new lower bounds are {ad, bd, ae, be}
Implementation
• In general, CARPENTER/FARMER can be implemented in many ways:
 – FP-tree
 – Vertical format
• In our case, we assume the dataset fits in main memory and use a pointer-based algorithm similar to BUC
Experimental Studies
• Efficiency of FARMER
 – On five real-life datasets: lung cancer (LC), breast cancer (BC), prostate cancer (PC), ALL-AML leukemia (ALL), colon tumor (CT)
 – Varying minsup, minconf, minchi
 – Benchmarked against CHARM [ZaHs02] ICDM'02 and Bayardo's algorithm (ColumnE) [BaAg99] SIGKDD'99
• Usefulness of IRGs: classification
Example Results: Prostate (figure)
• Runtime (log scale, 1–100000 s) vs. minimum support (3–9) for FARMER, ColumnE and CHARM
Example Results: Prostate (figure)
• Runtime (0–1200 s) vs. minimum confidence (0–99%) for FARMER with minsup=1, with and without minchi=10
Top-k Covering Rule Groups
• Rank rule groups (upper bounds) according to
 – Confidence, then
 – Support
• Top-k covering rule groups for row r
 – The k highest-ranking rule groups whose support set contains row r and whose support exceeds the minimum support
• Top-k covering rule groups = TopkRGS for each row
Usefulness of Rule Groups
• Rules for every row
• The top-1 covering rule groups are sufficient to build a CBA classifier
• No minimum confidence threshold, only minimum support
• #TopkRGS = k x #rows
Top-k Covering Rule Groups
• For each row, we find the k most significant rule groups:
 – based on confidence first
 – then support
• Given minsup=1, the top-1 rule groups are:
 – row 1: abc -> C1 (sup = 2, conf = 100%)
 – row 2: abc -> C1 (abcd -> C1 has sup = 1, conf = 100%)
 – row 3: cd -> C1 (sup = 2, conf = 66.7%); what if minconf = 80%?
 – row 4: cde -> C2 (sup = 1, conf = 50%)

  Row   Items     Class
  1     a,b,c     C1
  2     a,b,c,d   C1
  3     c,d,e     C1
  4     c,d,e     C2
Main Advantages of Top-k Covering Rule Groups
• Their number is bounded by the product of k and the number of samples
• Each sample is treated equally -> a complete description is provided for each (small) row
• The minimum confidence parameter is replaced by k
• Sufficient to build classifiers while avoiding excessive computation
Top-k Pruning
• At node X, the maximal set of rows covered by rules to be discovered below X consists of the rows containing X and the rows ordered after X
 – minconf <- the minimum confidence of the discovered TopkRGS over all rows in the above set
 – minsup <- the corresponding minimum support
• Pruning
 – If the estimated upper bound of confidence below X < minconf -> prune
 – If the confidence is the same and the support is smaller -> prune
• Optimizations
Classification Based on Association Rules
• Step 1: Generate the complete set of association rules for each class (given minimum support and minimum confidence)
 – The CBA algorithm adopts an Apriori-like algorithm, which fails at this step on microarray data
• Step 2: Sort the set of generated rules
• Step 3: Select a subset of rules from the sorted rule sets to form the classifier
Features of RCBT Classifiers
• Problem: discovering, storing, retrieving and sorting a large number of rules
 – RCBT: mine only the rules to be used for classification, e.g. the top-1 rule group is sufficient to build a CBA classifier
• Problem: a default class is not convincing for biologists
 – RCBT: a main classifier plus some back-up classifiers
• Problem: how to integrate rules with the same discriminating ability (upper-bound rules are specific; lower-bound rules are general)
 – RCBT: a subset of lower-bound rules, integrated using a score that considers both confidence and support
Experimental Studies
• Datasets: 4 real-life datasets
• Efficiency of top-k rule mining
 – Benchmarks: FARMER, CHARM, CLOSET+
• Classification methods:
 – CBA (built using the top-1 rule group)
 – RCBT (our proposed method)
 – IRG classifier
 – Decision trees (single, bagging, boosting)
 – SVM
Runtime vs. Minimum Support on the ALL-AML Dataset (figure)
• Runtime (log scale, 0.01–10000 s) vs. minimum support (17–25) for FARMER, FARMER (minconf=0.9), FARMER+prefix (minconf=0.9), TOP1 and TOP100
Scalability with k (figure)
• Runtime (log scale, 0.1–100 s) vs. k (100–1000) on the PC and ALL datasets
Biological Meaning: Prostate Cancer Data (figure)
• Frequency of occurrence (0–1800) vs. gene rank (0–1600) for genes W72186, AF017418, AI635895, X14487, AB014519, M61916 and Y13323
Classification Results (figures: classification accuracy comparisons)
References
• Anthony K. H. Tung, Rui Zhang, Nick Koudas, Beng Chin Ooi. "Similarity Search: A Matching Based Approach". VLDB 2006.
• H. V. Jagadish, Raymond T. Ng, Beng Chin Ooi, Anthony K. H. Tung. "ItCompress: An Iterative Semantic Compression Algorithm". ICDE 2004, Boston, 2004.
• Zhenjie Zhang, Beng Chin Ooi, Srinivasan Parthasarathy, Anthony K. H. Tung. "Similarity Search on Bregman Divergence: Towards Non-Metric Indexing". VLDB 2009, Lyon, France, August 24-28, 2009.
• Jia Xu, Zhenjie Zhang, Anthony K. H. Tung, Ge Yu. "Efficient and Effective Similarity Search over Probabilistic Data Based on Earth Mover's Distance". VLDB 2010; a preliminary version appears as Technical Report TRA5-10, National University of Singapore. [Codes & Data]
• Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD 2005, Baltimore, Maryland, 2005.
• Ruichu Cai, Anthony K. H. Tung, Zhenjie Zhang, Zhifeng Hao. "What is Unequal among the Equals? Ranking Equivalent Rules from Gene Expression Data". Accepted in TKDE.
Optional References:
• Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed Zaki. "CARPENTER: Finding Closed Patterns in Long Biological Datasets". KDD 2003, Washington, DC, USA, August 24-27, 2003.
• Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang. "FARMER: Finding Interesting Rule Groups in Microarray Datasets". SIGMOD 2004, June 13-18, 2004, Paris, France.
• Feng Pan, Anthony K. H. Tung, Gao Cong, Xin Xu. "COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery". SSDBM 2004, Santorini Island, Greece.
• Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Feng Pan. "Mining Frequent Closed Patterns in Microarray Data". ICDM 2004.
• Xin Xu, Ying Lu, Anthony K. H. Tung, Wei Wang. "Mining Shifting-and-Scaling Co-Regulation Patterns on Gene Expression Profiles". ICDE 2006.
Mining and Searching Complex Structures
Chapter 3 Similarity Search on Sequences
Searching and Mining Complex Structures Similarity Search on Sequences Anthony K. H. Tung(鄧锦浩) School of Computing National University of Singapore www.comp.nus.edu.sg/~atung
Research Group Link: http://nusdm.comp.nus.edu.sg/index.html Social Network Link: http://www.renren.com/profile.do?id=313870900
Types of Sequences
• Symbolic vs. numeric: we only touch discrete symbols here; sequences of numbers are called time series, which is a huge topic by itself!
• Single-dimensional vs. multi-dimensional. Example: Yueguo Chen, Shouxu Jiang, Beng Chin Ooi, Anthony K. H. Tung. "Querying Complex Spatial-Temporal Sequences in Human Motion Databases". ICDE 2008.
• Single long sequence vs. multiple sequences
Outline
• Searching based on a disk-based suffix tree
• Approximate matching using inverted lists (VGRAM)
• Approximate matching based on a B+ tree (BED-tree)
Suffix
Suffixes of acacag$:
1. acacag$
2. cacag$
3. acag$
4. cag$
5. ag$
6. g$
7. $
Suffix Trie
E.g. consider the string S = acacag$. The suffix trie is a trie of all possible suffixes of S (figure):

  i   Suffix
  1   acacag$
  2   cacag$
  3   acag$
  4   cag$
  5   ag$
  6   g$
  7   $
Suffix Tree (I)
• The suffix tree for S = acacag$ is obtained from the suffix trie by merging nodes that have only one child (figure)
• The path label of a node v, denoted α(v), is the concatenation of edge labels from the root to v, e.g. α(v) = "aca"
• "ca" is an edge label; an edge ending at a leaf is a leaf edge
Suffix Tree (II)
• A suffix tree has exactly n leaves and at most 2n edges
• The label of each edge can be represented using 2 indices into S
• Thus, the suffix tree can be represented using O(n log n) bits
• (figure: the tree for S = acacag$ with edge labels written as index pairs, e.g. (2,3) for "ca")
• Note: the end index of every leaf edge is 7, the last index of S; thus, for leaf edges we only need to store the start index
Generalized Suffix Tree
• Build a single suffix tree for two or more strings
• E.g. S1 = acgat#, S2 = cgt$ (figure: one tree containing all suffixes of both strings, with the terminators # and $ distinguishing them)
Straightforward Construction of a Suffix Tree
Consider S = s_1 s_2 ... s_n where s_n = $.
Algorithm:
  Initialize the tree with only a root
  For i = n down to 1:
    Insert the suffix S[i..n] into the tree
Time: O(n^2)
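A minimal sketch of this straightforward construction, built as an uncompressed suffix trie with nested dicts; compacting single-child paths into a suffix tree would be a separate pass.

```python
def build_suffix_trie(s):
    root = {}
    for i in range(len(s) - 1, -1, -1):    # insert suffixes S[i..n]
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1                # 1-based start of this suffix
    return root

trie = build_suffix_trie("acacag$")
# Follow "aca": both suffix 1 (acacag$) and suffix 3 (acag$) pass through it.
node = trie
for ch in "aca":
    node = node[ch]
print(sorted(node.keys()))                  # ['c', 'g']
```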
Example of Construction, S = acca$ (figure)
• Init: a tree with only the root
• For-loop: insert the suffixes from shortest to longest, $ (step I5), a$ (I4), ca$ (I3), cca$ (I2), acca$ (I1), growing the tree one suffix at a time
Construction of a Generalized Suffix Tree (figure)
• Starting from the suffix tree of S = acca$, insert the suffixes of S' = c# (steps J2, J1) into the same tree
Property of the Suffix Tree
Fact: For any internal node v in the suffix tree, if the path label of v is α(v) = ap (a single character a followed by a string p), then there exists another node w in the suffix tree such that α(w) = p.
Proof: skipped.
Definition (suffix link): For any internal node v, define its suffix link sl(v) = w.
Suffix Link Example (figure)
• S = acacag$: the node with path label "aca" links to the node with path label "ca", which links to the node with path label "a"
Can We Construct a Suffix Tree in O(n) Time?
Yes. We can construct it in O(n) time and O(n) space:
• Weiner's algorithm [1973]: linear time for a constant-size alphabet, but much space
• McCreight's algorithm [JACM 1976]: linear time for a constant-size alphabet, quadratic space
• Ukkonen's algorithm [Algorithmica, 1995]: online, linear time for a constant-size alphabet, less space
• Farach's algorithm [FOCS 1997]: linear time for a general alphabet
• Hon, Sadakane and Sung's algorithm [FOCS 2003]: O(n)-bit space, O(n log^ε n) time for 0 < ε < 1

For i, j > 0, the alignment score V(i, j) is given by

  V(i, j) = max { V(i-1, j-1) + δ(S[i], T[j]),   (match/mismatch)
                  V(i-1, j)   + δ(S[i], _),      (delete)
                  V(i, j-1)   + δ(_, T[j]) }     (insert)
In the alignment, the last pair must be a match/mismatch, a delete, or an insert:

  xxx…xx          xxx…xx        xxx…x_
  xxx…yy          yyy…y_        yyy…yy
  match/mismatch  delete        insert
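A minimal sketch of the dynamic program; the scoring (match = +2, mismatch and indel = -1) is inferred from the example tables that follow.

```python
def align(S, T, match=2, mismatch=-1, gap=-1):
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap                       # delete S[1..i]
    for j in range(1, m + 1):
        V[0][j] = j * gap                       # insert T[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if S[i - 1] == T[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + d,  # match/mismatch
                          V[i - 1][j] + gap,    # delete
                          V[i][j - 1] + gap)    # insert
    return V[n][m]

print(align("ACAATCC", "AGCATGC"))              # 7, as in the example tables
```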
Example (I) (figure)
• S = ACAATCC (rows) and T = AGCATGC (columns); the table is initialized with V(i, 0) = -i and V(0, j) = -j
Example (II) (figure)
• The table is filled row by row using the recurrence, e.g. V(2, 3) = V(1, 2) + 2 = 3 since S[2] = C matches T[3] = C

Example (III): the completed table

  V    _   A   G   C   A   T   G   C
  _    0  -1  -2  -3  -4  -5  -6  -7
  A   -1   2   1   0  -1  -2  -3  -4
  C   -2   1   1   3   2   1   0  -1
  A   -3   0   0   2   5   4   3   2
  A   -4  -1  -1   1   4   4   3   2
  T   -5  -2  -2   0   3   6   5   4
  C   -6  -3  -3   0   2   5   5   7
  C   -7  -4  -4  -1   1   4   4   7

The optimal alignment score is V(7, 7) = 7.
"q-grams" of Strings (figure)
• The 2-grams of "universal": un, ni, iv, ve, er, rs, sa, al
q-gram Inverted Lists

  id   string
  0    rich
  1    stick
  2    stich
  3    stuck
  4    static

  2-gram   inverted list
  at       4
  ch       0, 2
  ck       1, 3
  ic       0, 1, 2, 4
  ri       0
  st       1, 2, 3, 4
  ta       4
  ti       1, 2, 4
  tu       3
  uc       3
Searching Using Inverted Lists
• Query: "shtick", find strings with ED(shtick, ?) ≤ 1
• The query's 2-grams are sh, ht, ti, ic, ck; probe the inverted lists above and count occurrences per string id
• Count filter: a true answer must share at least (6 - 2 + 1) - 1×2 = 3 grams with the query, so only "stick" (sharing ti, ic, ck) survives as a candidate
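A minimal sketch of the inverted-list search with the count filter, over the same five strings; names are illustrative.

```python
from collections import defaultdict

def grams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

strings = ["rich", "stick", "stich", "stuck", "static"]
index = defaultdict(list)                 # gram -> ids of strings containing it
for sid, s in enumerate(strings):
    for g in set(grams(s)):
        index[g].append(sid)

def candidates(query, k, q=2):
    lower = (len(query) - q + 1) - k * q  # count-filter threshold
    counts = defaultdict(int)
    for g in set(grams(query, q)):
        for sid in index.get(g, []):
            counts[sid] += 1
    return [strings[sid] for sid, c in counts.items() if c >= lower]

print(candidates("shtick", k=1))          # ['stick']: 3 shared grams survive
```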
2-grams -> 3-grams?
• Query: "shtick", ED(shtick, ?) ≤ 1; the query's 3-grams are sht, hti, tic, ick
• Count filter: # of common grams ≥ (6 - 3 + 1) - 1×3 = 1

  3-gram   inverted list
  ati      4
  ich      0, 2
  ick      1
  ric      0
  sta      4
  sti      1, 2
  stu      3
  tat      4
  tic      1, 2, 4
  tuc      3
  uck      3
Observation 1: The Dilemma of Choosing q
• Increasing q causes:
 – longer grams
 – shorter inverted lists
 – a smaller number of common grams between similar strings
(figure: the 2-gram inverted lists shown earlier)
Observation 2: Skewed Distribution of Gram Frequencies
• DBLP: 276,699 article titles
• Popular 5-grams: ation (>114K occurrences), tions, ystem, catio
VGRAM: Main Idea
• Use grams with variable lengths (between qmin and qmax)
 – zebra: "ze" (frequency 123)
 – corrasion: "co" (5213), "cor" (859), "corr" (171)
• Advantages
 – Reduces index size ☺
 – Reduces running time ☺
 – Adoptable by many algorithms ☺
Challenges
1. How to generate variable-length grams?
2. How to construct a high-quality gram dictionary?
3. What is the relationship between string similarity and gram-set similarity?
4. How to adopt VGRAM in existing algorithms?
Challenge 1: String -> Variable-Length Grams?
• Fixed-length 2-grams: "universal" -> un, ni, iv, ve, er, rs, sa, al
• Variable-length grams with a [2,4]-gram dictionary: "universal" -> ni, ivr, sal, uni, vers
Representing the Gram Dictionary as a Trie (figure)
• The dictionary {ni, ivr, sal, uni, vers} is stored in a trie for longest-match lookups
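A rough sketch of variable-length gram generation against a trie dictionary: at each position, emit the longest dictionary gram that matches, falling back to a qmin-gram. This follows the general VGRAM idea; the published algorithm's details (e.g. how positions are advanced after a long match) may differ.

```python
class Trie:
    def __init__(self, grams):
        self.root = {}
        for g in grams:
            node = self.root
            for ch in g:
                node = node.setdefault(ch, {})
            node["$"] = True                   # end-of-gram marker

    def longest_match(self, s, i):
        node, best = self.root, 0
        for j in range(i, len(s)):
            if s[j] not in node:
                break
            node = node[s[j]]
            if "$" in node:
                best = j - i + 1
        return best

def vgrams(s, trie, qmin=2):
    out, i = [], 0
    while i + qmin <= len(s):
        L = trie.longest_match(s, i) or qmin   # fall back to a qmin-gram
        out.append(s[i:i + L])
        i += 1                                  # slide by one position
    return out

trie = Trie(["ni", "ivr", "sal", "uni", "vers"])
print(vgrams("universal", trie))
```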
Challenge 2: Constructing the Gram Dictionary
• Step 1: Collect the frequencies of grams with lengths in [qmin, qmax] in a gram trie (figure), e.g. st: {0,1,3}, sti: {0,1}, stu: {3}, stic: {0,1}, stuc: {3}
Step 2: Selecting Grams
• Prune the trie using a frequency threshold T (e.g., T = 2)
Step 2: Selecting Grams (cont.) (figure: pruning with threshold T = 2)
Final Gram Dictionary (figure): the selected [2,4]-grams
Challenge 3: An Edit Operation's Effect on Grams
• With fixed-length q-grams of "universal", one edit operation can affect up to q grams (those overlapping the edited position), so k operations can affect up to k * q grams
Deletion Affects Variable-Length Grams (figure)
• A deletion at position i can only affect grams starting in the window [i - qmax + 1, i + qmax - 1]; grams entirely outside this window are not affected
Grams Affected by a Deletion (figure)
• Within the window [i - qmax + 1, i + qmax - 1], whether a gram of "universal" (from the [2,4]-gram dictionary {ni, ivr, sal, uni, vers}) is actually affected depends on which dictionary grams can still be generated after the deletion
Grams Affected by a Deletion (cont.) (figure)
• The affected grams inside the window can be determined using the trie of grams together with a trie of reversed grams
Number of Grams Affected by Each Operation (figure)
• For each position of "universal" (including the gaps, _u_n_i_v_e_r_s_a_l_), precompute how many grams a deletion/substitution or an insertion at that position can affect:
  0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0
Maximum Number of Grams Affected by k Operations
• For each string s, precompute a vector whose k-th entry is the maximum number of grams that k edit operations can affect, e.g. with 2 edit operations, at most 4 grams can be affected
• This is called the NAG vector (number of affected grams); it is precomputed and stored
Summary of the VGRAM Index (figure)

Challenge 4: Adopting VGRAM
• Easily adoptable by many algorithms
• Basic interfaces:
 – String s -> its set of grams
 – Strings s1, s2 such that ed(s1, s2) ≤ k -> a lower bound on the number of their common grams (derived from the NAG vectors)