Fuzzy string comparison

pibbur who · February 16, 2018

At work we are currently replacing 4 different PACS (digital X-Ray) installations with a new system serving all 10 hospitals in the region. My group is responsible for migrating data from the 4 systems into the new one (10 000 000 examinations and 500 000 patients). One important task before migration is cleansing of data. I have a list of 40000 patients, where 2 and 2 have the same patient ID, but their names differ. 49% of the cases are due to different encoding of the special Norwegian characters ('Æ', 'Ø' and 'Å'), around the same for other differences in spelling.

In most cases the differences are small. But in some cases they are significant, and in these case we may have different patients sharing the same ID. Not good.

To separate those cases from insignificant spelling errors, I'm doing fuzzy comparison of the names. An example: "Ronald Reagan", and "Rnald R aegn" is most likely the same person, and I need to filter away patients like that. (Due to doctor patient confidentiality, I can't tell you if mr. Reagan actually was a patient of ours).

Currently I'm using the Levenshtein Distance algorithm, which for the example above gives a score of 4. While the score for a few distinctly different patient pairs I've found so far is in the 18-20 range. Observe that not all high score name pairs are really different. "Reagan, Ronald " and "Ronald Wilson Reagan" score 18, but is very likely the same person. So, some manual work will be necessary. But, I have 10 000 pairs of names with spelling differences, so the more I can reduce the number of candidates for manual check, the better. For those having to do the manual work, that is. I'm not one of them.

Now, here's my question: There are many different algorithms available, the Levenshtein approach is not necessarily the best one. Have any of you worked with things like this, and if so, can you recommend a better algorithm?

pibbur who assumes that most watchers don't know what he's talking about, partly due to his own (dis)ability to explain himself. And typos.

PS. We're using C#. DS.

Ripper · February 16, 2018

I see what you're up to, but I've only done comparatively trivial things like this, sorting through jumbled file archives. I didn't get into the weeds with the algorithms.

bkrueger · February 16, 2018

pibbur who said:
49% of the cases are due to different encoding of the special Norwegian characters ('Æ', 'Ø' and 'Å'), around the same for other differences in spelling.

Does this mean that you can encode the same speech sound in different ways? e.g. in German the "Umlaute" ä, ö, ü are sometimes written as ae oe or ue if the special Characters are not available. So Müller might be the same person as Mueller.

For such differences normalization is better than distance: define one method of writing the sound as the normal one and compare the normalization of the names.

Example: Norm(Müller) ==Mueller ==Norm(Mueller) but
Norm(Möller)==Moeller != Mueller == Norm(Müller).

So Müller and Mueller are probably the same but Müller and Möller are different.

The same for "Reagan, Ronald " and "Ronald Wilson Reagan": If you define
Norm (Lastname, Firstname) = Firstname Lastname and Norm (Firstname Middlename Lastname) = Firstname Lastname, then

Norm (Reagan, Ronald) == Ronald Reagan == Norm (Ronald Wilson Reagan).

To separate those cases from insignificant spelling errors, I'm doing fuzzy comparison of the names. An example: "Ronald Reagan", and "Rnald R aegn" is most likely the same person, and I need to filter away patients like that. (Due to doctor patient confidentiality, I can't tell you if mr. Reagan actually was a patient of ours).

For Random spelling errors a distance metric is probably the only way to do it. theoretically you could try to weight frequent spelling errors (e.g. nearby characters on a keyboard, or exchanging consecutive characters) higher than less frequent ones but I don't know if somebody has made a library for that and doing it from scratch might be to much work.

Edit: You might try to use a spellchecker to convert names with spelling errors to ones without spelling errors and compare them afterwards. However, in most languages names can be more random than normal words, so this might not help much.

Currently I'm using the Levenshtein Distance algorithm, which for the example above gives a score of 4. While the score for a few distinctly different patient pairs I've found so far is in the 18-20 range. Observe that not all high score name pairs are really different. "Reagan, Ronald " and "Ronald Wilson Reagan" score 18, but is very likely the same person. So, some manual work will be necessary. But, I have 10 000 pairs of names with spelling differences, so the more I can reduce the number of candidates for manual check, the better. For those having to do the manual work, that is. I'm not one of them.

Now, here's my question: There are many different algorithms available, the Levenshtein approach is not necessarily the best one. Have any of you worked with things like this, and if so, can you recommend a better algorithm?

pibbur who assumes that most watchers don't know what he's talking about, partly due to his own (dis)ability to explain himself. And typos.

PS. We're using C#. DS.

Again: For random errors a distance metric is probably best, for systematic variants of the same name normalization is probably better.

Zloth · February 18, 2018

Are those ID numbers something people are typing in or are they internally generated? I'm wondering if they could also have typos or possibly scanning errors.

We have a similar issue with company names and trying to stop people from entering duplicates. We were pretty much doing Soundex searches with exact matches put in bold. "Were" because, even when we showed people flat out that they were making a duplicate, they insisted on making it anyway because…. reasons? I had to let them because "Wendy's" could be the hamburger joint or it could also be a local hardware store. They may even be on opposite corners of 101st & Johnson Dr. (or Johnson Drive, or Johnson Dr, or just Johnson…) so I couldn't even block them if the address matched.

Fuzzy string comparison

pibbur who

Guest

Ripper

Зичу Вам успіхів

bkrueger

Nothing to see here.

Zloth

I smell a... wumpus!?