As a little side project, I’ve been working on analyzing the finishing times in Ironman Triathlons. (If you’re interested, please head over to my Triathlon Rating site.) This involves getting race results and trying to match athlete names in order to identify different results from the same athlete.
Finding Candidates
In a first, relatively simple implementation, I used Excel to group results from athletes with exactly matching names and only corrected some obvious issues. One example is German umlauts (äöü): there is an athlete named “Mühlbauer”, and there are a number of different spellings (Muehlbauer, Muhlbauer, and other strange representations that seem to indicate encoding problems, like M¨lbauer). Another typical issue is abbreviated first names (Timothy DeBoom and Tim DeBoom).
However, this was a completely manual process, and while it’s next to impossible to build a completely automated solution, I wanted some automated help. So I built a simple implementation that looks for duplicates within my data by performing the following checks:
- Is one athlete’s name a substring of another athlete’s name? (Example: “Osborne, Steven” and “Osborne, Steve”)
- Are two athletes’ names very similar (using the Levenshtein distance)? (Example: “Csomor, Erica” and “Csomor, Erika”)
Both checks are performed by matching firstname with firstname and lastname with lastname, as well as “crosswise” (firstname with lastname and lastname with firstname) to catch swapped fields. (Example: “Gilles, Reboul” and “Reboul, Gilles”) A sketch of these checks follows below. I should add that checks using just these two fields will not be sufficient for a typical business scenario; other fields have to be taken into account (for example, birth date or address).
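Here is a minimal sketch of what these checks could look like. My actual implementation differs; the function names, the edit-distance threshold of 1, and the sample data are purely illustrative:

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                current[j - 1] + 1,            # insertion
                previous[j] + 1,               # deletion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def similar(x: str, y: str, max_distance: int = 1) -> bool:
    """Substring check ("Steve" in "Steven") or small edit distance
    ("Erica" vs. "Erika")."""
    x, y = x.lower(), y.lower()
    return x in y or y in x or levenshtein(x, y) <= max_distance


def candidate_pair(a: tuple[str, str], b: tuple[str, str]) -> bool:
    """a and b are (firstname, lastname) tuples; compare the fields
    straight and crosswise to catch swapped first/last names."""
    straight = similar(a[0], b[0]) and similar(a[1], b[1])
    crosswise = similar(a[0], b[1]) and similar(a[1], b[0])
    return straight or crosswise


athletes = [("Steven", "Osborne"), ("Steve", "Osborne"),
            ("Erica", "Csomor"), ("Erika", "Csomor"),
            ("Gilles", "Reboul"), ("Reboul", "Gilles")]

for a, b in combinations(athletes, 2):
    if candidate_pair(a, b):
        print(f"candidate: {a} / {b}")
```

A real implementation would want at least a minimum length for the substring check, since very short names otherwise match far too often.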
After implementing these relatively simple checks, I found a couple of “pairs”, ranging from pretty obvious to borderline cases (Chris Brown and Christopher Brown could be the same person, but could also be two different athletes). All in all, I was able to identify 11 pairs that are in fact duplicates, representing 1.7% of my 632 athletes. I found this number surprisingly large – given the small number of athletes, the fact that just one person (me) added athletes to the database, and the manual checks I had already performed, I would have guessed it would be smaller. The typical situation in a business would be even more conducive to duplicates (a larger pool of records, a lot of people adding data, users not quite as diligent as I thought I was).
Once the pairs are identified, there has to be a manual step to determine whether the pairs are indeed duplicates of one another. This is very hard to do algorithmically; there are just too many scenarios to consider.
Survivorship
Once the duplicates are determined, the issue of survivorship comes up – i.e. identifying the “best” record that should be used in the future. In a business context, some steps can be automated (for example, collecting values from fields that are filled in only one of the records). When having to decide between conflicting values, there may be some more help available in limited areas (for example, address validation). But typically, when deciding which value is right, a human has to be involved (Erica or Erika?).
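Here is a sketch of the automatic part of this step, assuming records are simple dictionaries (the names and structure are hypothetical): values present in only one record can be taken over automatically, while conflicting values are collected for a human to decide.

```python
def merge_records(records: list[dict]) -> tuple[dict, dict]:
    """Build a survivor from a group of duplicates: take each field's
    first non-empty value, and collect conflicts for manual review."""
    survivor: dict = {}
    conflicts: dict = {}
    for record in records:
        for field, value in record.items():
            if not value:
                continue                     # ignore empty fields
            if field not in survivor:
                survivor[field] = value      # uncontested value: take it
            elif survivor[field] != value:
                # Erica vs. Erika: an algorithm can't pick the right one
                conflicts.setdefault(field, {survivor[field]}).add(value)
    return survivor, conflicts


survivor, conflicts = merge_records([
    {"firstname": "Erica", "lastname": "Csomor", "country": ""},
    {"firstname": "Erika", "lastname": "Csomor", "country": "HUN"},
])
# survivor:  {'firstname': 'Erica', 'lastname': 'Csomor', 'country': 'HUN'}
# conflicts: {'firstname': {'Erica', 'Erika'}}  -> needs a human decision
```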
What to do with the Duplicates?
Once the survivor is clear, what to do with the duplicates is still open. In my case, I could just change the race results to point to the “right” athlete record, and the other record could be deleted. In business cases, this may be a bit more difficult. For example, changing the owner of an account in a banking system is quite an involved procedure (and may even involve some communication with the customer). Also, deleting partners may not always be possible – in the banking example, you probably want to preserve the fact that for a time some other “record” was the owner of an account. In these cases, the minimum you want to do is to “mark” the duplicates in a certain way (so that they are not used anew) and to point to the survivor (so people know which is the right partner to use, and some systems may provide a unified view of these two partners).
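One way this marking could look in a record store (again a hypothetical data model, not a specific product): each duplicate carries a status flag plus a pointer to its survivor, and lookups follow the pointers so they always end up at the record that should actually be used.

```python
def retire_duplicate(records: dict[int, dict],
                     dup_id: int, survivor_id: int) -> None:
    """Mark a duplicate instead of deleting it."""
    records[dup_id]["status"] = "duplicate"       # must not be used anew
    records[dup_id]["survivor_id"] = survivor_id  # where to look instead


def resolve(records: dict[int, dict], record_id: int) -> int:
    """Follow survivor pointers (duplicates may be merged in several
    rounds) to find the record that is currently the right one to use."""
    seen = set()
    while records[record_id].get("survivor_id") is not None:
        if record_id in seen:
            raise ValueError("cycle in survivor pointers")
        seen.add(record_id)
        record_id = records[record_id]["survivor_id"]
    return record_id
```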
Most of the time, correctly updating the system to reflect that duplicates and survivors have been identified is a highly manual process, especially if there is a complicated IT infrastructure with a number of different applications involved. I’m not aware of any general tools that help with this.
Summary
In my example, a relatively simple algorithm already provided good results in identifying duplicate candidates. The 1.7% duplicate rate I found in my data is probably a low estimate for bigger, commercial data sets. I was able to deal with the candidates pretty easily; in a commercial environment, this would have been a lot more complicated.