Evaluating Entity Resolution Results
Duplicate Detection – Winter Term 10/11 Cindy Fähnrich
Outline
1. Introduction
2. Current ER Measures
3. Generalized Merge Distance
4. Conclusion
Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011
Authors
■ Hector Garcia-Molina
  □ Professor, Stanford University
■ Steven Whang, David Menestrina
  □ (former) PhD students, Stanford University
■ SERF – a generic infrastructure for Entity Resolution (ER)
  □ Performance
  □ Blocking
  □ ER measures
  □ …
■ Presented at VLDB 2010 (available as a technical report from 2009)
ER Measures – Motivation
■ Implemented duplicate detection can be erroneous
  □ False negatives
  □ False positives
[Figure: Venn diagram of the data set, the real duplicates, and the declared duplicates]
■ Compare the computed result to the Gold Standard
  □ No false positives, no false negatives
  □ Extracted manually or by exhaustive algorithms
■ Different ER measures determine the similarity
Paper – Contributions
■ The authors …
  □ … present a survey of current ER measures
  □ … introduce the new ER measure "Generalized Merge Distance" (GMD)
    – Slice algorithm for its computation
  □ … conduct various experiments with ER measures
    – Conflicts between current ER measures
    – Configuration possibilities of GMD
    – Scalability of the Slice algorithm
Outline
1. Introduction
2. Current ER Measures
3. Generalized Merge Distance
4. Conclusion
Current ER Measures – Classification
[Figure: a duplicate detection result R compared against the Gold Standard S]
■ Pairwise comparison
  □ Which pairs of records match in R and S?
■ Cluster-level comparison
  □ Which clusters in R and S match or are similar?
■ Edit distance
  □ What is edited during a transformation from R to S?
Current ER Measures – Short Reminder
■ Precision – have records been mistakenly declared as duplicates?
  Precision = true positives / declared duplicates
■ Recall – have all duplicates been found?
  Recall = true positives / real duplicates
■ F1 Measure – harmonic mean of precision and recall:
  F1 = (2 · Precision · Recall) / (Precision + Recall)
Current ER Measures – Pairwise Comparison
■ Pairwise F1 Measure (pF1)
  □ Counts the number of correctly found duplicate pairs
  □ Harmonic mean of pairwise precision and recall
  □ Dominant measure in current literature
Example: R = {ac, b, de}, S = {ab, c, de}
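As an illustrative sketch (Python, not the authors' code), pF1 for the example can be computed by comparing the sets of intra-cluster record pairs of R and S:

```python
from itertools import combinations

def pair_set(P):
    """All unordered record pairs that share a cluster in partition P."""
    return {frozenset(p) for c in P for p in combinations(sorted(c), 2)}

def pairwise_f1(R, S):
    r_pairs, s_pairs = pair_set(R), pair_set(S)
    tp = len(r_pairs & s_pairs)          # correctly found duplicate pairs
    if not r_pairs or not s_pairs:
        return 0.0
    prec, rec = tp / len(r_pairs), tp / len(s_pairs)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

R1 = [{"a", "c"}, {"b"}, {"d", "e"}]
S = [{"a", "b"}, {"c"}, {"d", "e"}]
print(pairwise_f1(R1, S))  # 0.5
```

Here R yields the pairs {a,c} and {d,e}, S yields {a,b} and {d,e}; only {d,e} matches, giving precision = recall = 1/2 and pF1 = 0.5.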
Current ER Measures – Cluster-level Comparison
■ Cluster F1 Measure (cF1)
  □ Counts clusters that match exactly
  □ Harmonic mean of cluster precision and recall
Example: R = {ac, b, de}, S = {ab, c, de}
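A minimal sketch of cF1 on the same example (illustrative only): count the clusters that appear identically in both partitions, then take the harmonic mean of the two ratios.

```python
def cluster_f1(R, S):
    """Cluster F1: precision/recall over exactly matching clusters."""
    exact = sum(1 for r in R if r in S)   # clusters identical in both
    prec, rec = exact / len(R), exact / len(S)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

R1 = [{"a", "c"}, {"b"}, {"d", "e"}]
S = [{"a", "b"}, {"c"}, {"d", "e"}]
print(round(cluster_f1(R1, S), 2))  # 0.33
```

Only {d,e} matches exactly, so precision = recall = 1/3 and cF1 ≈ 0.33, consistent with the rounded 0.3 in the conflicts table later in the talk.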
Current ER Measures – Cluster-level Comparison
■ K Measure
  □ Sums similarities of all cluster pairs
  □ Geometric mean of average cluster purity and average author purity
Example: R = {ac, b, de}, S = {ab, c, de}
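Assuming the standard purity formulation, (1/n) · Σ over cluster pairs of |overlap|² / |cluster|, the K measure for the example can be sketched as follows (illustrative, not the authors' code):

```python
from math import sqrt

def k_measure(R, S):
    """K measure: geometric mean of average cluster purity (R vs. S)
    and average author purity (S vs. R)."""
    n = sum(len(c) for c in R)

    def avg_purity(A, B):
        # (1/n) * sum over clusters a of (1/|a|) * squared overlaps with B
        return sum(len(a & b) ** 2 / len(a) for a in A for b in B) / n

    return sqrt(avg_purity(R, S) * avg_purity(S, R))

R1 = [{"a", "c"}, {"b"}, {"d", "e"}]
S = [{"a", "b"}, {"c"}, {"d", "e"}]
print(round(k_measure(R1, S), 6))  # 0.8
```

For R1 both purities come out to 0.8, so K = 0.8, matching the conflicts table later in the talk.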
Current ER Measures – Cluster-level Comparison
■ Closest Cluster F1 Measure (ccF1)
  □ Sums the similarities of the most similar cluster pairs
  □ Jaccard index as similarity measure
  □ Harmonic mean of closest cluster precision and recall
Example: R = {ac, b, de}, S = {ab, c, de}
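One common formulation of ccF1, normalizing each direction by the number of clusters, can be sketched as below. This is an assumption on my part: the paper's exact normalization may differ, so these values need not reproduce the conflicts table exactly.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def closest_cluster_f1(R, S):
    # Each cluster contributes the Jaccard similarity of its closest partner.
    prec = sum(max(jaccard(r, s) for s in S) for r in R) / len(R)
    rec = sum(max(jaccard(s, r) for r in R) for s in S) / len(S)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

R1 = [{"a", "c"}, {"b"}, {"d", "e"}]
S = [{"a", "b"}, {"c"}, {"d", "e"}]
print(closest_cluster_f1(S, S))  # 1.0
print(closest_cluster_f1(R1, S))
```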
Current ER Measures – Edit Distance
■ Basic Merge Distance (BMD)
  □ Counts all splits and merges during the transformation
  □ All splits must precede any merges
Example: R = {ac, b, de}, S = {ab, c, de}
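Since every cluster of R is first split into the fragments it contributes to S and those fragments are then merged, BMD can be sketched by counting fragments (illustrative Python, assuming both partitions cover the same records):

```python
def bmd(R, S):
    """Basic Merge Distance: minimum number of splits + merges from R to S."""
    s_of = {rec: i for i, c in enumerate(S) for rec in c}
    splits = merges = 0
    started = set()   # S clusters that already received a fragment
    for r in R:
        frags = {s_of[rec] for rec in r}
        splits += len(frags) - 1          # k fragments need k-1 splits
        for s_id in frags:
            if s_id in started:
                merges += 1               # fragment merges into existing piece
            started.add(s_id)
    return splits + merges

R1 = [{"a", "c"}, {"b"}, {"d", "e"}]
S = [{"a", "b"}, {"c"}, {"d", "e"}]
print(bmd(R1, S))  # 2
```

For R1, one split ({a,c} → {a}, {c}) and one merge ({a} ∪ {b} → {ab}) suffice, so BMD = 2.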
Current ER Measures – Edit Distance
■ Variation of Information (VI)
  □ How much information is lost and gained during the conversion from R to S?
  □ Combines the entropies of the clusterings with their mutual information:
    VI(R, S) = H(R) + H(S) − 2·I(R, S)
    where H(·) is the entropy of a clustering and I(·,·) the mutual information between both clusterings
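The standard VI definition (Meila [4]) can be sketched in Python; identical clusterings give a distance of 0:

```python
from math import log

def vi(R, S):
    """Variation of Information between two clusterings of the same records:
    VI(R, S) = H(R) + H(S) - 2*I(R, S)."""
    n = sum(len(c) for c in R)

    def entropy(P):
        return -sum(len(c) / n * log(len(c) / n) for c in P)

    def mutual_information():
        mi = 0.0
        for r in R:
            for s in S:
                k = len(r & s)
                if k:  # joint probability p(r, s) = k / n
                    mi += k / n * log(k * n / (len(r) * len(s)))
        return mi

    return entropy(R) + entropy(S) - 2 * mutual_information()

R1 = [{"a", "c"}, {"b"}, {"d", "e"}]
S = [{"a", "b"}, {"c"}, {"d", "e"}]
print(abs(vi(R1, R1)) < 1e-9)  # True
print(vi(R1, S) > 0)           # True
```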
Current ER Measures – Experiments
■ Experiment on real data
  □ Hotel data
  □ 5 000 randomly chosen entities
■ Gold Standard for the data set
  □ Manually generated
■ Two algorithms for ER
  □ R-Swoosh
  □ Monge-Elkan
Current ER Measures – Conflicts

                      BMD   pF1    cF1    K      ccF1
Hotel: R-Swoosh       427   0.34   0.88   0.95   0.93
Hotel: Monge-Elkan    435   0.61   0.86   0.96   0.93
Example: R1           2     0.5    0.3    0.8    0.72
Example: R2           2     0.8    0.3    0.73   0.66

R1 = {ac, b, de}, R2 = {ade, b, c}, S = {ab, c, de}
Outline
1. Introduction
2. Current ER Measures
3. Generalized Merge Distance
4. Conclusion
The Generalized Merge Distance (GMD)

"The fm, fs generalized merge distance GMD_fm,fs(R, S) from a partition R to another partition S is the minimum cost of a legal path from R to S, where fs(x, y) is the cost of a split operation and fm(x, y) the cost of a merge operation."

■ Generalization of BMD
  □ Relaxes the split/merge constraint
  □ Configurable costs for split (fs) and merge (fm) operations
■ Legal path: no creation of invalid clusters during the transformation
  □ Invalid: a cluster that is not contained in the Gold Standard
■ A minimum-cost legal path exists in which all splits precede any merges
GMD – Cost Functions
■ fs(|c1|, |c2|): cost of splitting a cluster c into c1 and c2
■ fm(|c1|, |c2|): cost of merging clusters c1 and c2 into c
■ Required properties
  □ No negative costs
  □ Symmetry
  □ Monotone increase with the parameters
  □ Independence of the operation order
■ The Slice algorithm computes a minimum-cost legal path
GMD – Slice Algorithm
Example: R = {acd, be}, S = {abd, c, e}
1. Configure the cost functions: fs(x, y) = x·y, fm(x, y) = 1
2. Split off all fragments from clusters in R that are needed to build S, adding the costs:
   R = {ad, c, be}    cost fs(2, 1) = 2·1 = 2
   R = {ad, c, b, e}  cost fs(1, 1) = 1·1 = 1
3. Merge the fragments into the clusters of S, adding the costs:
   R = {abd, c, e}    cost fm(2, 1) = 1
4. Total cost = fs(2, 1) + fs(1, 1) + fm(2, 1) = 4
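The walkthrough above can be sketched in Python (illustrative, not the paper's implementation; helper names are my own). Because valid cost functions are order independent, the result does not depend on iteration order:

```python
def gmd(R, S, fs, fm):
    """Generalized Merge Distance from partition R to partition S.

    fs(x, y) / fm(x, y) give the cost of splitting off / merging
    fragments of sizes x and y."""
    # Map every record to the id of its cluster in S.
    s_of = {rec: i for i, c in enumerate(S) for rec in c}
    cost = 0
    grown = {}  # S-cluster id -> size of the fragment assembled so far
    for r in R:
        # Group the records of r by their target cluster in S.
        parts = {}
        for rec in r:
            parts[s_of[rec]] = parts.get(s_of[rec], 0) + 1
        # Split phase: peel off k-1 fragments from a cluster with k parts.
        remaining = len(r)
        sizes = list(parts.values())
        for p in sizes[:-1]:
            cost += fs(remaining - p, p)
            remaining -= p
        # Merge phase: each fragment joins the growing target cluster.
        for s_id, p in parts.items():
            if s_id in grown:
                cost += fm(grown[s_id], p)
                grown[s_id] += p
            else:
                grown[s_id] = p
    return cost

R = [{"a", "c", "d"}, {"b", "e"}]
S = [{"a", "b", "d"}, {"c"}, {"e"}]
print(gmd(R, S, lambda x, y: x * y, lambda x, y: 1))  # 4, as on the slide
print(gmd(R, S, lambda x, y: 1, lambda x, y: 1))      # 3, the BMD configuration
```

With fs = fm = 1 the same function computes BMD; for the earlier example R1 = {ac, b, de} vs. S = {ab, c, de} it returns 2.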
GMD – Relationship to Existing Measures
■ GMD can compute some of the existing ER measures

Measure                    fs(x,y)                 fm(x,y)
BMD                        1                       1
Variation of Information   h(x+y) − h(x) − h(y)    h(x+y) − h(x) − h(y)
Pairwise Precision         x·y                     0
Pairwise Recall            0                       x·y

(h is an entropy-based helper function; pF1 follows from pairwise precision and recall.)
Experiments – GMD
■ Experiments on synthetic data
  □ Different configurations for GMD
  □ Performance of the Slice algorithm

Variant   fs(x,y)                 fm(x,y)
BMD       1                       1
GMDV      h(x+y) − h(x) − h(y)    h(x+y) − h(x) − h(y)
GMDH      x·y + 1                 x·y + 1
GMDP      x·y                     0
GMDR      0                       x·y
GMD – Synthetic Data for Experiments
■ 10 000 entities
■ Classification of errors
  □ Glued entities (wrong merges)
  □ Broken entities (wrong splits)
  □ Misplaced entities (wrong clustering)
■ Generation of 10 possible ER results per error type
  □ Artificially implanting errors into a generated Gold Standard
  □ Increasing number of errors
GMD – Experiments
■ GMDV with logarithmic cost functions yields distances near 0
GMD – Cost Configurations
■ The configuration of GMD's cost functions depends on the preferred focus
  □ Different sensitivities of the cost functions are adjustable
■ Sensitivity to error types
  □ fm is weighted more strongly for broken-entity errors (repaired by merges)
  □ fs is weighted more strongly for glued-entity errors (repaired by splits)
■ Sensitivity to cluster size
  □ fm/fs return values that depend on the cluster size
GMD – Performance
■ Evaluating the Slice algorithm against straightforward implementations of other ER measures
■ Own experiment (DuDe results for CORA, 1 879 entities):
  □ pF1 via GMD (Slice): 0.31 s
  □ pF1 (native): 51 s
■ The Slice algorithm performs
  □ better than other straightforward measure implementations
  □ in linear time
Outline
1. Introduction
2. Current ER Measures
3. Generalized Merge Distance
4. Conclusion
Conclusion
■ Evaluating ER results is not a solved problem!
  □ Every ER measure has a different focus and therefore conflicts with others
  □ More than one ER measure should be used
■ The Generalized Merge Distance …
  □ … has configurable cost functions for split and merge operations
  □ … can compute VI, pF1, and BMD among the current ER measures
  □ … combines the application of different ER measures
■ The Slice algorithm …
  □ … computes its result in linear time
Conclusion – Paper
☺ Well-structured argumentation
☹ Shift of important content to the appendix
☺ Comprehensible experiments and examples
☹ Graphics confusing in parts
☺ Code well explained
Sources
[1] An efficient domain-independent algorithm for detecting approximately duplicate database records, Monge, Elkan, Proc. of the ACM-SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997
[2] A heuristic-based hierarchical clustering method for author name disambiguation in digital libraries, Cota et al., Proc. of the 22nd Brazilian Symposium on Databases, 2007
[3] An Introduction to Duplicate Detection, Naumann, Herschel, Synthesis Lectures on Data Management, Morgan & Claypool, 2010
[4] Comparing clusterings by the variation of information, Meila, 16th Annual Conference on Learning Theory and 7th Kernel Workshop, 2003
[5] Efficient name disambiguation for large-scale databases, Huang et al., PKDD, 2006
[6] Evaluating Entity Resolution Results, Menestrina et al., Proceedings of the VLDB Endowment, 2010
[7] Evaluating Entity Resolution Results (Extended Version), Menestrina et al., Stanford InfoLab, 2009
[8] Grouping search-engine returned citations for person-name queries, Al-Kamha, Proc. of the 6th Annual ACM International Workshop on Web Information and Data Management, 2004
[9] Information Retrieval, Van Rijsbergen, Butterworths, 1979
[10] Introduction to Information Retrieval, Manning et al., Cambridge University Press, 2008
[11] Swoosh: a generic approach to entity resolution, Benjelloun et al., The VLDB Journal, 2009
[12] SERF, http://infolab.stanford.edu/serf/, Stanford InfoLab, December 15 2010