Evaluating Entity Resolution Results [PDF]

Jan 25, 2011 - [7] Evaluating Entity Resolution Results (Extended Version), Menestrina et al., Stanford InfoLab, 2009. [

0 downloads 4 Views 872KB Size

Recommend Stories


Entity Resolution
Keep your face always toward the sunshine - and shadows will fall behind you. Walt Whitman

Entity Resolution in Healthcare
Don't be satisfied with stories, how things have gone with others. Unfold your own myth. Rumi

Query-Time Entity Resolution
Almost everything will work again if you unplug it for a few minutes, including you. Anne Lamott

Adaptive Progressive Approach to Relational Entity Resolution
The greatest of richness is the richness of the soul. Prophet Muhammad (Peace be upon him)

Entity-Centric Coreference Resolution with Model Stacking
So many books, so little time. Frank Zappa

Query-Driven Approach to Entity Resolution
Make yourself a priority once in a while. It's not selfish. It's necessary. Anonymous

Machine Learning for Entity Coreference Resolution
Be grateful for whoever comes, because each has been sent as a guide from beyond. Rumi

Robust Entity Resolution Using a CrowdOracle
You often feel tired, not because you've done too much, but because you've done too little of what sparks

Query-Driven Approach to Entity Resolution
Don't count the days, make the days count. Muhammad Ali

Question Selection for Crowd Entity Resolution
If you want to become full, let yourself be empty. Lao Tzu

Idea Transcript


Evaluating Entity Resolution Results

Duplicate Detection – Winter Term 10/11 Cindy Fähnrich

Outline 2

1. Introduction 2. Current ER Measures 3. Generalized Merge Distance 4. Conclusion

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Authors 3

■ Hector Garcia-Molina □ Professor, Stanford University ■ Steven Wang, David Menestrina □ (former) PhD Students, Stanford University ■ SERF - a generic infrastructure for Entity Resolution (ER) □ Performance □ Blocking □ ER ER Measures Measures □ … ■ presented at VLDB 2010 (available as technical report from 2009) Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

ER Measures – Motivation 4

■ Implemented duplicate detection can be erroneous □ False negatives □ False positives Declared duplicates

Real duplicates

Data set

■ Compare computed result to the Gold Standard □ No false positives, no false negatives □ Extracted manually or by exhaustive algorithms  Different ER Measures determine the similiarity Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Paper - Contributions 5

■ The authors… … present a survey on current ER measures

… introduce the new ER measure „Generalized Merge Distance“ □ Slice algorithm for computation

… conduct various experiments with ER measures □ Conflicts of current ER measures □ Configuration possibilities of GMD □ Scalability of Slice algorithm

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Outline 6

1. Introduction 2. Current ER Measures 3. Generalized Merge Distance 4. Conclusion

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures - Classification 7 Duplicate Detection Result

■ Pairwise comparison

Gold Standard

□ What pairs of records match in R and S? ■ Cluster-level comparison □ What clusters in R and S match or are similar? ■ Edit distance □ What is edited during a transformation from R to S?

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures – Short Reminder 8 Have records been mistakenly declared as duplicates?

Precision =

Have all duplicates been found?

True positives Declared duplicates

Recall =

True positives Real duplicates

Harmonic mean

F1 Measure =

2 * Precision * Recall Precision + Recall

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures – Pairwise Comparison 9

■ Pairwise F1 Measure (pF1) □ Counts the number of correctly found duplicates □ Harmonic mean of pairwise precision and recall □ Dominant measure in current literature R = {ac, b, de}

S = {ab, c, de}

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures – Cluster-level Comparison 10

■ Cluster F1 Measure (cF1) □ Counts clusters that exactly match □ Harmonic mean of cluster recall and precision R = {ac, b, de}

S = {ab, c, de}

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures – Cluster-level Comparison 11

■ K Measure □ Sums similarities of all cluster pairs □ Geometric mean of average cluster purity and average author purity R = {ac, b, de}

S = {ab, c, de}

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures – Cluster-level Comparison 12

■ Closest Cluster F1 Measure (ccF1) □ Sums similarities of cluster pairs most similar □ Jaccard index as similarity measure □ Harmonic mean of closest cluster recall and precision R = {ac, b, de}

S = {ab, c, de}

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures – Edit Distance 13

■ Basic Merge Distance (BMD) □ Counts all splits and merges during transformation □ All splits must precede any merges

R = {ac, b, de}

S = {ab, c, de}

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures – Edit Distance 14

■ Variation of Information (VI) □ What information is lost and gained during the conversion from R to S? □ Adds the total entropies of single clusters and mutual information

entropy of set

mutual information between both sets

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures - Experiments 15

■ Experiment on real data □ Hotel data □ 5 000 randomly chosen entities ■ Gold Standards for data set □ Manually generated ■ Two algorithms for ER □ R-Swoosh □ Monge-Elkan

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures - Conflicts 16

BMD

pF1

cF1

K

ccF1

Hotel R-Swoosh

427

0.34

0.88

0.95

0.93

Monge-Elkan

435

0.61

0.86

0.96

0.93

Examples R1

2

0.5

0.3

0.8

0.72

R2

2

0.8

0.3

0.73

0.66

R1 = {ac, b, de} R2 = {ade, b, c}

S = {ab, c, de}

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Outline 17

1. Introduction 2. Current ER Measures 3. Generalized Merge Distance 4. Conclusion

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

The Generalized Merge Distance (GMD) 18

The fm, fs generalized merge distance GMDfm,fs(R,S) from a partition R to another partition S is the minimum cost of a legal path from R to S where fs(x,y) is the cost for a split operation, fm(x,y) is the cost of a merge operation. ■ Generalization of BMD □ Relaxing the split/merge constraint □ Configurable costs for split (fs) and merge (fm) operations ■ Legal path: No creation of invalid clusters during transformation □ Invalid: Cluster is not contained in Gold Standard ■ Minimum cost legal path: Legal path where all splits precede any merges Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

GMD - Cost Functions 19

fs(|c1|,|c2|) = Costs for split operation c  c1,c2

fm(|c1|,|c2|) = Costs for merge operation c1,c2  c ■ Properties □ No negative costs □ Symmetry □ Monotone increase with parameters □ Operation order independence  Slice Algorithm to compute a minimum cost legal path Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

GMD – Slice Algorithm 20

R = {acd, be} S = {abd, c, e} 1. Configure cost functions fs(x,y) = x*y

fm(x,y) = 1

2. Split off all fragments from clusters in R that are needed to build S and add costs R = {ad, c, be}

fs(2,1) = 2*1

R = {ad, c, b, e}

fs(1,1) = 1*1

3. Merge fragments to S and add costs R = {abd, c, e}

fm(2,1) = 1

4. Costs = 4 = fs(2,1) + fs(1,1) + fm(2,1) Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

GMD - Relationship to Existing Measures 21

■ GMD can compute some of the existing ER measures

Measure

fs(x,y)

fm(x,y)

BMD

1

Variation of Information

h(x + y) – h(x) – h(y) h(x + y) – h(x) – h(y) h(z)= h(z)=

1

pF1 Pairwise Precision

xy

0

Pairwise Recall

0

xy

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Experiments – GMD 22

■ Experiments on synthetic data □ Different Configurations for GMD □ Performance of Slice algorithm

fs(x,y)

fm(x,y)

BMD

1

1

GMDV

h(x + y) – h(x) – h(y) h(x)=

h(x + y) – h(x) – h(y) h(x)=

GMDH

xy + 1

xy + 1

GMDP

xy

0

GMDR

0

xy

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

BMD VI

pF1

GMD - Synthetic Data for Experiments 23

■ 10 000 entities ■ Classification of errors □ Glued entities (wrong merges) □ Broken entities (wrong splits) □ Misplaced entities (wrong clustering) ■ Generation of 10 possible ER results for each error type □ Artificially implanting errors into a generated gold standard □ Increasing number of errors

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

GMD - Experiments 24

■ GMDV with logaritmic cost functions has distances near 0

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

GMD - Cost Configurations 25

■ Configuration of GMD‘s cost functions depends on preferred focus □ Different sensitivities of cost functions adjustable ■ Sensitivity to error types □ fm is weighted stronger for broken entity errors □ fs is weighted stronger for glued entity errors ■ Sensitivity to cluster size □ fm/fs return values depending on cluster size

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

GMD - Performance 26

■ Evaluating Slice algorithm against straightforward implementations of other ER measures

Own experiment: pF1 (GMD)

pF1 (native)

0.31 s

51 s

DuDe-Results for CORA 1879 Entities

 Slice algorithm performs □ better than other straightforward measure implementations □ in linear time Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Outline 27

1. Introduction 2. Current ER Measures 3. Generalized Merge Distance 4. Conclusion

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Conclusion 28

Not solved! ■ When evaluating ER Results … □ … every ER measure has different foci and therefore conflicts with others □ … more than one ER measure should be used ■ The Generalized Merge Distance … □ … has configurable cost functions for split and merge operations □ … can compute VI, pF1 and BMD as current ER measures □ … combines the application of different ER measures ■ The Slice algorithm… □ … computes results in linear time Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Conclusion – Paper 29



Well structured argumentation



Shift of important content to appendix



Comprehensible experiments and examples



Graphics confusing in parts

☺ Code well explained

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Sources 30 [1] An efficient domain-independent for detecting approximately duplicate database records, Monge, Elkan, Proc. of the ACM-SIGMOD Workshop on Research Issues in on Knowledge Discovery and Data Mining, 1997 [2] A heuristic-based hierarchical clustering method for author name disambiguation in digital libraries, Cota et al., Proc. of the 22nd Brazilian Symposium on Databases, 2007 [3] An Introduction to Duplicate Detection, Naumann, Herschel, Synthesis Lectures on Data Management, Morgan & Claypool 2010 [4] Comparing clusterings by the variation of information, Meila, 16th Annual Conference on Learning Theory and 7th Kernel Workshop, 2003 [5] Efficient name disambiguation for large-scale databases, Huang et al., PKDD, 2006 [6] Evaluating Entity Resolution Results, Menestrina et al., Proceedings of the VLDB Endowment, 2010 [7] Evaluating Entity Resolution Results (Extended Version), Menestrina et al., Stanford InfoLab, 2009 [8] Grouping search-engine returned citations for person-name queries, Al-Kamha, Proceedings of the 6th annual ACM international workshop on Web information and data management, 2004 [9] Information retrieval, Van Rijsbergen, Butterworths, 1979 [10] Introduction to Information Retrieval, Manning et al., Cambridge University Press, 2008 [11] Swoosh: a generic approach to entity resolution, Benjelloun et al., The VLDB Journal, 2009 [12] SERF, http://infolab.stanford.edu/serf/, Stanford InfoLab, December 15 2010

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Current ER Measures – Edit Distance 31

■ Variation of Information (VI) □ What information is lost and gained during the conversion from R to S? □ Adds the total entropies of single clusters and mutual information

Evaluating Entity Resolution Results | Cindy Fähnrich | January 25 2011

Smile Life

When life gives you a hundred reasons to cry, show life that you have a thousand reasons to smile

Get in touch

© Copyright 2015 - 2024 PDFFOX.COM - All rights reserved.