Duplicate Contacts – Where do they come from? (Or: McFly, you’re a slacker)

March 29, 2012
Filed under: DataQuality, iOS, SmarterContacts 

When dealing with data quality issues such as duplicates, the question people usually focus on is “how do I get rid of the duplicate records?” While this is important, it does not remove the cause of the problem and usually leads to ongoing or recurring cleaning efforts. Therefore, if you really want to resolve a data quality issue, you have to ask the question “where do the duplicates come from?”

As with almost all data quality issues, there are easy answers to this question:

“McFly, you’re a slacker”
Strickland
(image from http://images4.wikia.nocookie.net)

Named after the teacher who keeps needling Marty, I’m calling this the “Strickland Explanation”.

or a bit less harsh

“I don’t have time!”
outatime
(image from http://images.forum-auto.com/)

While these explanations may be true in some cases, they are not very helpful: They insult the people that enter the data, making them a lot less willing to help you resolve the problem. Also, they show that you haven’t thought about the problem or discussed the issue with the users – as there are always other explanations that take a bit of effort to unearth.

Here’s a list of issues that I found lead to duplicates in your address book:

  1. Technical Limitations
    Older versions of address books were limited in the number of fields you had available. One typical issue is that you could only have one email address or phone number. If you wanted to store multiple phone numbers for a person (e.g. the home number, the work number and the cell phone) you had to create multiple records that have the same name, but different details. This is especially true for address book programs from older phones.
  2. Synching Gone Wrong
    Synching is a surprisingly difficult problem, and almost everyone has their own horror stories of synchs that have gone wrong. Typical issues are an extra copy of each record, information showing up in unexpected places (for example a zip code being stored in a phone number) or some information being lost during synch (one of my pet peeves is the birthdate).
  3. Information Hoarders
    Some email programs had the option of storing the email address of each person that sent you an email in your address book, resulting in a large number of “sparse” records (records that may consist of only an email address or just a phone number, but not a proper name). Also, the large number of resulting records makes it hard to figure out if multiple records belong to the same contact or if you already have a record for a person.
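
To make causes 1 and 3 a bit more concrete, here is a minimal Python sketch that looks for records sharing a name and for “sparse” records without a proper name. The record layout (dictionaries with `name`, `phones` and `emails`) and the sample data are assumptions made up for illustration; a real address book will look different.

```python
from collections import defaultdict

# Hypothetical contact records; the field names and values are made up.
contacts = [
    {"name": "Marty McFly", "phones": ["555-0100"], "emails": []},
    {"name": "Marty McFly", "phones": ["555-0199"], "emails": []},
    {"name": "", "phones": [], "emails": ["doc@example.com"]},
    {"name": "Jennifer Parker", "phones": ["555-0123"], "emails": ["jen@example.com"]},
]


def is_sparse(contact):
    """A record without a proper name (e.g. only an email address)."""
    return not contact["name"].strip()


def same_name_groups(contacts):
    """Group named records by lower-cased name; keep names used more than once."""
    groups = defaultdict(list)
    for contact in contacts:
        if not is_sparse(contact):
            groups[contact["name"].strip().lower()].append(contact)
    return {name: recs for name, recs in groups.items() if len(recs) > 1}


sparse = [c for c in contacts if is_sparse(c)]
split_records = same_name_groups(contacts)
```

Records that share a name are only candidates: two different people can have the same name, so the groups still need a look by a human.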

I’m sure that there are even more causes for duplicate contacts. Please note that these causes require completely different approaches than the “Strickland Explanation”. I will have a closer look at these in a future post.

Where do the duplicates in your address book come from?

If you want to find duplicate contacts, please give my iPhone app “SmarterContacts” a try. You can find it on the App Store. Please let me know when you identify other causes for duplicates so I can update this post and provide additional functionality in my app.

Low Data Quality in your iPhone Address Book – Why care?

February 9, 2012
Filed under: DataQuality, SmarterContacts 

As with all data quality issues, it is important to understand the consequences of a low quality address book. Too often this discussion is skipped, resulting in only half-hearted cleanup efforts and a quick slide back into old habits.

Here are a few scenarios that show the impact of low data quality in your address book – for the sake of this discussion I equate a low quality address book with having lots of duplicate records, i.e. more than one record for a single contact.

Which is the right record to use?

When you want to use the information in your address book, it is difficult to pick “the right” record to use. Consider the following excerpt from (fictional) iPhone contacts:

IMG_0183

If you want to send an email to McFly, which address should you use?

IMG_0184

There are a couple of contacts, but you can’t see the context of each email address. Some might be from his school (is he still going to school?) or from a college, some seem work-related. Just from looking at this list of potential addresses, there is no way to figure out which one is still valid, let alone which one to use.

(This example is not unusual, as older programs were only able to store one email address per contact.)

While you may have some additional information to help you decide (you know that Marty has already left high school and college, and that he has been “terminated” from his job at Fujitsu Enterprises), this is next to impossible for Siri, the iPhone’s digital assistant. Here’s a look at what happens if you ask Siri to “Call McFly”:

IMG_0178

If you try to narrow it down with “Martin” or “Martin McFly”, you’ll still have to choose between two records (not necessarily the right ones) – and then you’re stuck and can’t even get Siri to pick one of them:

IMG_0181

Which is the right record to update?

A similar problem arises when you want to update the contact information for Marty. If Marty lets you know he’s got a new job and had to move to a new postal address, will you remember to delete the old address? If you just add the new address as a new record or just update one of the records, after a month or two you will have no way to decide which is the right address. If you have multiple records for Marty and decide to finally write down his birthday after missing it a few times – which record should you update? Just one (at least you’ve committed the info to your digital memory) or all (better safe than sorry)?

There is no good answer if you have more than one record for a person. Things get easy if you have just one record for Marty – only one place to store his birthday, and you’ll see what other addresses you already have for Marty when you enter a new one, so you can delete those that are no longer valid.

What’s the data quality in your address book?

If you want to find out how many duplicates you have in your address book, please give my iPhone app “SmarterContacts” a try. You can find it on the App Store.

Duplicates in Address Book: Some Progress on my next iPhone App

April 5, 2011
Filed under: DataQuality, iOS 

After trying out a few things during my last train rides to my consulting engagement, my next iPhone app is slowly taking shape.

Development progress

Here is a very rough, “in work” screenshot of what I have for now (using some test data):

image

Each cell in the TableView represents one contact in the address book. The number in braces shows how “similar” another contact record is. (The number shown is for the “closest” record.) I want to make things a bit easier to spot by coloring the cells (e.g. red = 100% match, yellow between 90 and 100% match). I’ll also add some way to filter the displayed records (i.e. display only records above a certain threshold).
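
The coloring and filtering rules can be sketched in a few lines. This is an illustration in Python, not the app's actual code; the function names, the color name `"none"` for uncolored cells and the representation of similarity as a value between 0.0 and 1.0 are assumptions.

```python
def cell_color(similarity, yellow_from=0.90):
    """Map a similarity score (0.0 to 1.0) to a cell color name."""
    if similarity >= 1.0:
        return "red"      # 100% match
    if similarity >= yellow_from:
        return "yellow"   # between 90% and 100% match
    return "none"         # below the threshold: leave the cell uncolored


def filter_by_threshold(scored_contacts, threshold):
    """Keep only (name, similarity) pairs at or above the threshold."""
    return [(name, score) for name, score in scored_contacts if score >= threshold]
```

For example, `filter_by_threshold([("A", 1.0), ("B", 0.92), ("C", 0.4)], 0.9)` keeps the first two entries and drops the third.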

In the detail view, I want to display the “closest” records for one address and provide some way to look at the details of the involved contacts. (As this might require quite some back and forth between the different contact records, an iPad with additional screen real estate may be a good idea here.) The detail view should also have some functionality to “remove” unwanted records (duplicates) or “merge” two records (collect all different phone numbers, email addresses etc. into one contact record and remove the other one).

Obviously, there is still a lot of work to do, but I’m making some progress. While I’m still looking for a good name, I already have an icon for the app.

Emerging Tiers

I’m also getting a clearer picture of which different modules I can offer:

  • Comparison – reading the address book and figuring out which records are close
  • Removing exact duplicates – Often, there are exact duplicates in the address book (for example after some technical problems with synching). These should be relatively simple to identify and delete.
  • Merging contact records – This is a lot more complicated as there are quite a few different scenarios. For example, you have to pick a “surviving” name (as there can be only one name), but there can be multiple addresses, phone numbers or email addresses.
  • Standardization – For some fields, different content can mean the same thing, e.g. “1231234567” is the same phone number as “123 123 4567” or “(123) 123-4567” or “+1-123-123 4567”. I’m not sure how much can be done here, especially in addresses (Memory Lane = Memory Ln).
  • Once I get started, there’ll probably be more modules that make sense in the given context.
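
As an illustration of the standardization module, here is a rough Python sketch that maps the phone number spellings from the list to one form and expands a street abbreviation. It assumes North American style numbers with country code 1, and the abbreviation table is a tiny made-up sample; neither is meant as a complete solution.

```python
import re

# Tiny, made-up sample of street abbreviations; a real table would be much larger.
STREET_ABBREVIATIONS = {"ln": "lane", "st": "street", "rd": "road"}


def normalize_phone(raw, default_country_code="1"):
    """Reduce a phone number to its digits and drop an assumed country code."""
    digits = re.sub(r"\D", "", raw)  # remove everything that is not a digit
    if len(digits) == 11 and digits.startswith(default_country_code):
        digits = digits[len(default_country_code):]
    return digits


def normalize_street(raw):
    """Lower-case a street name and expand known abbreviations."""
    words = raw.lower().replace(".", "").split()
    return " ".join(STREET_ABBREVIATIONS.get(word, word) for word in words)
```

Two values are then treated as “the same” if their normalized forms are equal, while the original spelling stays untouched in the address book.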

However, this looks like a good way to provide different tiers of the app:

  • entry level (low cost, maybe even free) – just offering the base comparison module
  • mid level – comparison plus dealing with exact duplicates
  • pro level – all modules as outlined above

This will also be a good scenario for “in app purchases” that I want to look into.

Idea for my next iPhone App: Duplicates in your Address Book

March 19, 2011
Filed under: DataQuality, iOS 

My tax app “SmarterSteuer” has been in the iTunes App Store for almost a month now. Of course, the sales are not even remotely close to allowing me to quit my day job – at 1.59€ and about 25 sold apps I haven’t even recovered the cost of my membership in the Apple development program. Nonetheless, I’ve been impressed with the number of sales and the pretty level number of sales from day to day: The tax app is quite limited to a certain type of user (freelancers that have recently started and are trying to figure out their taxes) and it’s only applicable to Germany (and it’s only available in the German App Store). I’ve been trying to guesstimate the market size for a general purpose app that would be sold worldwide.

Guessing a Market Size

I’ve been trying to find out how many iPhones have been sold worldwide and in Germany. Apparently, Apple releases quarterly numbers (compiled in a nice sales graph by Wikipedia) which would put the number of iPhone sales worldwide at about 90 million at the end of 2010. I couldn’t find any matching numbers for Germany; the latest number I could find was 1.5 million at the end of 2009, when the worldwide number was at about 40 million. That would put the worldwide market at about 27 times the German market, which is consistent with the 4% market share that AdMob is estimating for Germany. Let’s work with a nice round factor of 25.

It’s even trickier to estimate what part of the current German market would consist of freelancers. In the general German population you have roughly 1 million freelancers (with 81 million Germans total). I’m assuming that that percentage would be much higher among the owners of an iPhone. Just for argument’s sake, I’m working with a percentage of 10% of freelancers in the whole iPhone market in Germany.

If we multiply these two factors (25 for worldwide versus Germany, 10 for all iPhone owners versus freelancers), we come up with a factor of 250 – i.e. the market size of a general purpose worldwide app is about 250 times as large as the market that I’m currently addressing with my tax app. This may not translate to 250 times as many sales or 250 times the revenue, but even 100 times the sales of my current app would be quite impressive.
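
The estimate boils down to a few lines of arithmetic, written out here in Python using only the numbers quoted above (the variable names are made up):

```python
# Figures quoted in the text (end of 2009)
worldwide_iphones = 40_000_000
german_iphones = 1_500_000

country_ratio = worldwide_iphones / german_iphones   # about 26.7, i.e. roughly 27
german_share = german_iphones / worldwide_iphones    # 0.0375, i.e. roughly 4%

# Rounded working assumptions from the text
country_factor = 25        # worldwide market vs. German market
freelancer_share = 0.10    # assumed share of freelancers among German iPhone owners
freelancer_factor = 1 / freelancer_share             # 10

total_factor = country_factor * freelancer_factor    # 250
```

Both inputs are rough guesses, so the result is an order of magnitude rather than a forecast.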

I’m not sure if these numbers are reasonable (please let me know if you have better numbers or estimates) – but it’s certainly encouraging enough to contemplate writing a general purpose app that can be used worldwide.

App Idea

I’ve got a few ideas that come close to the description of a general purpose app with a worldwide appeal, but the one that fits best is an app that analyzes the iPhone’s Address Book and looks for duplicates (either created because of synching problems or by manually entering the same person twice). So here is a short description of that idea:

Level 1:

  • look in your address book for potential duplicates
  • visualize the “duplication” in a nice way so the user has a chance to follow up on the results by manually editing/deleting the records

Level 2:

  • offer a way to “automagically” delete complete duplicates (i.e. records identical in all fields)
  • offer a way to merge records (i.e. if the names are the same but each address is different create a record with both addresses so the user can manually edit the “survivor”)

Level 3:

  • add-on services such as address validation or other ideas that come up in the meantime
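
The two Level 2 steps could look roughly like the following Python sketch. It is only an illustration of the idea, not code from the app: the record layout (a `name` plus lists of `addresses`, `phones` and `emails`) and the function names are assumptions.

```python
DETAIL_FIELDS = ("addresses", "phones", "emails")


def remove_exact_duplicates(contacts):
    """Keep the first occurrence of each record that is identical in all fields."""
    seen = set()
    unique = []
    for contact in contacts:
        key = (contact["name"],) + tuple(tuple(contact[f]) for f in DETAIL_FIELDS)
        if key not in seen:
            seen.add(key)
            unique.append(contact)
    return unique


def merge_records(survivor, other):
    """Return a new record with the survivor's name and the details of both."""
    merged = {"name": survivor["name"]}
    for field in DETAIL_FIELDS:
        combined = list(survivor[field])
        for value in other[field]:
            if value not in combined:   # do not store the same detail twice
                combined.append(value)
        merged[field] = combined
    return merged
```

The merge keeps every distinct value so that the user can manually edit the “survivor” afterwards and delete details that are no longer valid.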

I think that this may be a valuable app, something that people are willing to pay for. I’m not sure about the effort required to develop such an app, but I’m ready to start. Most of the technology should be pretty straightforward (accessing the address book, comparing records/strings etc.). I’m going to mull this over for a few more days, but I think I’ll get started as soon as I have some time available.

Looking for duplicates: Results of a simple algorithm

February 10, 2011
Filed under: DataQuality 

As a little side project, I’ve been working on analyzing the finishing times in Ironman Triathlons. (If you’re interested, please head over to my Triathlon Rating site.) This involves getting race results and trying to match the athlete names in order to figure out different results from the same athlete.

Finding Candidates

In a first, relatively simple implementation, I’ve used Excel to group results from athletes with exactly matching names and only corrected some obvious issues. One example is German umlauts (äöü). There is an athlete named “Mühlbauer”, and there are a number of different spellings (Muehlbauer, Muhlbauer, other strange representations that seem to indicate encoding problems like M¨lbauer). Another typical issue is abbreviations of first names (Timothy DeBoom and Tim DeBoom).
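
The corrections described here were made by hand in Excel. If one wanted to script the umlaut part, a crude comparison key could look like the following Python sketch. The function name is made up, and the rule of collapsing “ae/oe/ue” to the plain vowel is an assumption that will also shorten names that never had an umlaut, so the key is only suitable for comparing names, not for displaying them.

```python
import unicodedata


def name_key(name):
    """Build a crude comparison key that ignores umlaut spelling variants."""
    # Decompose accented letters (ü becomes u plus a combining mark) and drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = stripped.lower()
    # Treat the transliterations ae/oe/ue like the plain vowel as well.
    for digraph, vowel in (("ae", "a"), ("oe", "o"), ("ue", "u")):
        key = key.replace(digraph, vowel)
    return key
```

With this key, “Mühlbauer”, “Muehlbauer” and “Muhlbauer” all compare as equal. Spellings damaged by encoding problems are not repaired by it.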

However, this was a completely manual process, and while it’s next to impossible to build a completely automated solution, I wanted some automated help. So I’ve built a simple implementation looking for duplicates within my data that performs the following checks:

  • Is one athlete’s name a substring of another athlete’s name? (Example: “Osborne, Steven” and “Osborne, Steve”)
  • Are two athletes’ names very similar (using the Levenshtein distance)? (Example: “Csomor, Erica” and “Csomor, Erika”)

Both checks are performed matching firstname and lastname with each other and “crosswise” (firstname with lastname and lastname with firstname). (Example: “Gilles, Reboul” and “Reboul, Gilles”) I should add that checks using just these two fields will not be sufficient for a typical business scenario; other fields have to be taken into account (for example birth date or address).
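
The two checks, including the crosswise comparison, can be sketched in Python as follows. This is a minimal reconstruction under assumptions, not the implementation used for the results below: applying the checks per field, ignoring case and allowing a maximum Levenshtein distance of 1 are all choices made for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar(x, y, max_distance=1):
    """True if one value contains the other or both are within the edit distance."""
    x, y = x.lower(), y.lower()
    if not x or not y:
        return False  # an empty field should not match everything
    return x in y or y in x or levenshtein(x, y) <= max_distance


def is_candidate_pair(a, b, max_distance=1):
    """a and b are (lastname, firstname) tuples; compare straight and crosswise."""
    straight = similar(a[0], b[0], max_distance) and similar(a[1], b[1], max_distance)
    crosswise = similar(a[0], b[1], max_distance) and similar(a[1], b[0], max_distance)
    return straight or crosswise
```

Comparing every athlete with every other athlete is quadratic in the number of records, which is fine for a few hundred names but would need some pre-grouping for larger data sets.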

After implementing these relatively simple checks, I found a couple of “pairs”, ranging from pretty obvious to borderline cases (Chris Brown and Christopher Brown could be the same person, but could also be two different athletes). All in all, I was able to identify 11 pairs that are in fact duplicates, representing 1.7% of my 632 athletes. I found this number surprisingly large – I would have guessed that it would be smaller, given the small number of athletes, just one person (me) adding athletes to the database and the manual checks I had already performed. The typical situation in a business would be more conducive to adding duplicates (larger pool of records, a lot of people adding data, users not quite as diligent as I thought I was).

Once the pairs are identified, there has to be a manual step to determine if the pairs are indeed duplicates of one another. This is very hard for an algorithm to do; there are just too many scenarios to consider.

Survivorship

Once the duplicates are determined, the issue of survivorship comes up – i.e. identifying the “best” record that should be used in the future. In a business context, there are some automatic steps that can be performed (for example collecting the different fields that are filled in the records). When having to decide between different values, there may be some more help available for limited areas (for example when identifying valid addresses). But typically, when deciding which value is right, a human has to be involved (Erica or Erika?).
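
The automatic part (collecting the fields that are filled) and the hand-over to a human for conflicting values might be sketched like this in Python. The function name, the record layout and the `club` field in the usage note are made up for illustration.

```python
def build_survivor(records):
    """Combine duplicate records field by field.

    Returns (survivor, conflicts): fields with exactly one distinct filled
    value go into the survivor, fields with several distinct values are
    returned as conflicts for a human to decide.
    """
    survivor, conflicts = {}, {}
    fields = sorted({field for record in records for field in record})
    for field in fields:
        values = []
        for record in records:
            value = record.get(field)
            if value and value not in values:   # skip empty fields and repeats
                values.append(value)
        if len(values) == 1:
            survivor[field] = values[0]
        elif len(values) > 1:
            conflicts[field] = values
    return survivor, conflicts
```

Given one record with first name “Erica” and an empty club and another with first name “Erika” and a filled club, the club ends up in the survivor automatically, while the first name is reported as a conflict.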

What to do with the Duplicates?

Once the survivor is clear, what to do with the duplicates is still open. In my case, I could just change the race results to point to the “right” athlete record and the other record could be deleted. In business cases, this may be a bit more difficult to do. For example, changing the owner of an account in a banking system is quite an involved procedure (and may even involve some communication with the customer). Also, deleting partners may not always be possible – in the banking example, you probably want to preserve the fact that for a time some other “record” was the owner of an account. In these cases, the minimum you want to do is to “mark” the duplicates in a certain way (so that they are not used anew) and you also want to point to the survivor (so people know which is the right partner to use and some systems may provide a unified view on these two partners).
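
A minimal Python sketch of the “mark and point to the survivor” approach: duplicates are kept, flagged with a pointer to the surviving record, and references are re-pointed. The `duplicate_of` field, the function names and the sample data are assumptions made up for this example.

```python
# Hypothetical data: athlete records keyed by id, and results that reference them.
athletes = {
    1: {"name": "Csomor, Erika", "duplicate_of": None},
    2: {"name": "Csomor, Erica", "duplicate_of": None},
}
results = [
    {"race": "Race A", "athlete_id": 1},
    {"race": "Race B", "athlete_id": 2},
]


def mark_duplicate(athletes, duplicate_id, survivor_id):
    """Keep the duplicate record, but mark it and point it to the survivor."""
    athletes[duplicate_id]["duplicate_of"] = survivor_id


def resolve(athletes, athlete_id):
    """Follow duplicate_of pointers until the surviving record is reached."""
    while athletes[athlete_id]["duplicate_of"] is not None:
        athlete_id = athletes[athlete_id]["duplicate_of"]
    return athlete_id


def repoint_results(results, athletes):
    """Let every result reference the surviving athlete record."""
    for result in results:
        result["athlete_id"] = resolve(athletes, result["athlete_id"])


mark_duplicate(athletes, duplicate_id=2, survivor_id=1)
repoint_results(results, athletes)
```

Because the duplicate is only marked and not deleted, the history stays visible, and any code that still holds the old id can find the right record through `resolve`.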

Most of the time, correctly updating the system to reflect that duplicates and survivors have been identified is a highly manual process, especially if there is a complicated IT infrastructure with a number of different applications involved. I’m not aware of any general tools that help with this.

Summary

Using my example, a relatively simple algorithm already provided good results in identifying duplicate candidates. The percentage of 1.7% that I found in my data is probably a low estimate for bigger, commercial data. I was able to deal with the candidates pretty easily; in a commercial environment this would have been a lot more complicated.