|
Your donations keep RPGWatch running!
RPGWatch Forums » General Forums » Off-Topic » Fuzzy string comparison

Default Fuzzy string comparison

February 16th, 2018, 15:09
At work we are currently replacing 4 different PACS (digital X-Ray) installations with a new system serving all 10 hospitals in the region. My group is responsible for migrating data from the 4 systems into the new one (10 000 000 examinations and 500 000 patients). One important task before migration is cleansing of data. I have a list of 40000 patients, where 2 and 2 have the same patient ID, but their names differ. 49% of the cases are due to different encoding of the special Norwegian characters ('Æ', 'Ø' and 'Å'), around the same for other differences in spelling.

In most cases the differences are small. But in some cases they are significant, and in these case we may have different patients sharing the same ID. Not good.

To separate those cases from insignificant spelling errors, I'm doing fuzzy comparison of the names. An example: "Ronald Reagan", and "Rnald R aegn" is most likely the same person, and I need to filter away patients like that. (Due to doctor patient confidentiality, I can't tell you if mr. Reagan actually was a patient of ours).

Currently I'm using the Levenshtein Distance algorithm, which for the example above gives a score of 4. While the score for a few distinctly different patient pairs I've found so far is in the 18-20 range. Observe that not all high score name pairs are really different. "Reagan, Ronald " and "Ronald Wilson Reagan" score 18, but is very likely the same person. So, some manual work will be necessary. But, I have 10 000 pairs of names with spelling differences, so the more I can reduce the number of candidates for manual check, the better. For those having to do the manual work, that is. I'm not one of them.

Now, here's my question: There are many different algorithms available, the Levenshtein approach is not necessarily the best one. Have any of you worked with things like this, and if so, can you recommend a better algorithm?

pibbur who assumes that most watchers don't know what he's talking about, partly due to his own (dis)ability to explain himself. And typos.

PS. We're using C#. DS.
--
d++a63e++TU4567'!S'!89!A!WM!LuC++++u+++uF+++nR——nS ++++wC—-o++++wS——uLB++++

1. The cat is alive! And pissed!!!
2. It's been 82 years. The cat is dead, and the stench is unbearable!!!
Last edited by pibbur who; February 16th, 2018 at 15:30.
pibbur who is online now

pibbur who

pibbur who's Avatar
Number 14
RPGWatch Donor
Original Sin 1 & 2 Donor

#1

Join Date: May 2012
Location: Bergen, Norway
Posts: 3,487
Mentioned: 11 Post(s)

Default 

February 16th, 2018, 15:32
I see what you're up to, but I've only done comparatively trivial things like this, sorting through jumbled file archives. I didn't get into the weeds with the algorithms.
--
"Orwell was almost exactly wrong in a strange way. He thought the world would end with Big Brother watching us, but it ended with us watching Big Brother." Alan Moore
Ripper is online now

Ripper

Ripper's Avatar
Ngikufisela iwela

#2

Join Date: Nov 2014
Posts: 6,668
Mentioned: 26 Post(s)

Default 

February 16th, 2018, 19:21
Originally Posted by pibbur who View Post
49% of the cases are due to different encoding of the special Norwegian characters ('Æ', 'Ø' and 'Å'), around the same for other differences in spelling.
Does this mean that you can encode the same speech sound in different ways? e.g. in German the "Umlaute" ä, ö, ü are sometimes written as ae oe or ue if the special Characters are not available. So Müller might be the same person as Mueller.

For such differences normalization is better than distance: define one method of writing the sound as the normal one and compare the normalization of the names.

Example: Norm(Müller) ==Mueller ==Norm(Mueller) but
Norm(Möller)==Moeller != Mueller == Norm(Müller).

So Müller and Mueller are probably the same but Müller and Möller are different.

The same for "Reagan, Ronald " and "Ronald Wilson Reagan": If you define
Norm (Lastname, Firstname) = Firstname Lastname and Norm (Firstname Middlename Lastname) = Firstname Lastname, then

Norm (Reagan, Ronald) == Ronald Reagan == Norm (Ronald Wilson Reagan).



To separate those cases from insignificant spelling errors, I'm doing fuzzy comparison of the names. An example: "Ronald Reagan", and "Rnald R aegn" is most likely the same person, and I need to filter away patients like that. (Due to doctor patient confidentiality, I can't tell you if mr. Reagan actually was a patient of ours).
For Random spelling errors a distance metric is probably the only way to do it. theoretically you could try to weight frequent spelling errors (e.g. nearby characters on a keyboard, or exchanging consecutive characters) higher than less frequent ones but I don't know if somebody has made a library for that and doing it from scratch might be to much work.

Edit: You might try to use a spellchecker to convert names with spelling errors to ones without spelling errors and compare them afterwards. However, in most languages names can be more random than normal words, so this might not help much.

Currently I'm using the Levenshtein Distance algorithm, which for the example above gives a score of 4. While the score for a few distinctly different patient pairs I've found so far is in the 18-20 range. Observe that not all high score name pairs are really different. "Reagan, Ronald " and "Ronald Wilson Reagan" score 18, but is very likely the same person. So, some manual work will be necessary. But, I have 10 000 pairs of names with spelling differences, so the more I can reduce the number of candidates for manual check, the better. For those having to do the manual work, that is. I'm not one of them.

Now, here's my question: There are many different algorithms available, the Levenshtein approach is not necessarily the best one. Have any of you worked with things like this, and if so, can you recommend a better algorithm?

pibbur who assumes that most watchers don't know what he's talking about, partly due to his own (dis)ability to explain himself. And typos.

PS. We're using C#. DS.
Again: For random errors a distance metric is probably best, for systematic variants of the same name normalization is probably better.
Last edited by bkrueger; February 16th, 2018 at 19:29. Reason: Spelling error (no pun intended)
bkrueger is offline

bkrueger

Nothing to see here.

#3

Join Date: Dec 2007
Posts: 845
Mentioned: 0 Post(s)
+1:

Default 

February 18th, 2018, 02:00
Are those ID numbers something people are typing in or are they internally generated? I'm wondering if they could also have typos or possibly scanning errors.

We have a similar issue with company names and trying to stop people from entering duplicates. We were pretty much doing Soundex searches with exact matches put in bold. "Were" because, even when we showed people flat out that they were making a duplicate, they insisted on making it anyway because…. reasons? I had to let them because "Wendy's" could be the hamburger joint or it could also be a local hardware store. They may even be on opposite corners of 101st & Johnson Dr. (or Johnson Drive, or Johnson Dr, or just Johnson…) so I couldn't even block them if the address matched.
--
The very powerful and the very stupid have one thing in common: instead of altering their views to fit the facts, they alter the facts to fit their views….
-- Doctor Who in "Face of Evil"
Zloth is offline

Zloth

Zloth's Avatar
I smell a… wumpus!?

#4

Join Date: Aug 2008
Location: Kansas City
Posts: 5,585
Mentioned: 11 Post(s)
RPGWatch Forums » General Forums » Off-Topic » Fuzzy string comparison
Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump

All times are GMT +2. The time now is 15:14.
Powered by vBulletin® Version 3.8.10
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
User Alert System provided by Advanced User Tagging (Lite) - vBulletin Mods & Addons Copyright © 2018 DragonByte Technologies Ltd.
Copyright by RPGWatch