pibbur who
Guest
At work we are currently replacing 4 different PACS (digital X-ray) installations with a new system serving all 10 hospitals in the region. My group is responsible for migrating the data from the 4 old systems into the new one (10 000 000 examinations and 500 000 patients). One important task before migration is data cleansing. I have a list of 40 000 patients where records, two and two, share the same patient ID but have different names. 49% of the cases are due to different encodings of the special Norwegian characters ('Æ', 'Ø' and 'Å'), and roughly the same share is due to other spelling differences.
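To give an idea of the encoding half of the problem: a minimal sketch of the kind of canonicalisation we do before comparing names. It's in Python for brevity (our production code is C#), and the mojibake table is an assumption for illustration — it covers the common "UTF-8 bytes read as Windows-1252" case, but the real mapping has to be built from what the 4 source systems actually emit:

```python
# Assumed mapping: Æ/Ø/Å whose UTF-8 bytes were decoded as Windows-1252.
# The real table must be derived from the actual source systems.
MOJIBAKE_FIXES = {
    "Ã†": "Æ", "Ã˜": "Ø", "Ã…": "Å",
    "Ã¦": "æ", "Ã¸": "ø", "Ã¥": "å",
}

def canonicalise(name: str) -> str:
    """Repair assumed mis-encodings, then fold case and collapse whitespace."""
    for bad, good in MOJIBAKE_FIXES.items():
        name = name.replace(bad, good)
    return " ".join(name.upper().split())

print(canonicalise("Ã…se  Ã˜stgÃ¥rd"))  # -> ÅSE ØSTGÅRD
```

Pairs that become identical after this step can be auto-merged; only the rest need fuzzy matching.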
In most cases the differences are small. But in some cases they are significant, and in those cases we may have two different patients sharing the same ID. Not good.
To separate those cases from insignificant spelling errors, I'm doing fuzzy comparison of the names. An example: "Ronald Reagan" and "Rnald R aegn" are most likely the same person, and I need to filter away pairs like that. (Due to doctor-patient confidentiality, I can't tell you whether Mr. Reagan actually was a patient of ours.)
Currently I'm using the Levenshtein distance algorithm, which for the example above gives a score of 4, while the scores for the few distinctly different patient pairs I've found so far are in the 18-20 range. Observe that not all high-scoring name pairs are really different: "Reagan, Ronald " and "Ronald Wilson Reagan" score 18, but are very likely the same person. So some manual work will be necessary. But I have 10 000 pairs of names with spelling differences, so the more I can reduce the number of candidates for manual checking, the better. For those having to do the manual work, that is. I'm not one of them.
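For reference, what I'm computing is the standard dynamic-programming edit distance (insertions, deletions and substitutions, each costing 1). A sketch in Python for brevity — the C# version is the same algorithm:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance, kept to two rows of memory."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free on match)
        prev = cur
    return prev[-1]

print(levenshtein("Ronald Reagan", "Rnald R aegn"))  # -> 4
```

Note that the raw score favours short names: a distance of 4 means more on a 30-character name than on a 10-character one, which is part of why the threshold is awkward to pick.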
Now, here's my question: there are many different algorithms available, and the Levenshtein approach is not necessarily the best one. Have any of you worked with things like this, and if so, can you recommend a better algorithm?
pibbur who assumes that most watchers don't know what he's talking about, partly due to his own (dis)ability to explain himself. And typos.
PS. We're using C#. DS.