Machine learning model for identifying duplicates

Nick Reilingh $organization over 7 years ago

Hey all,

Just wondering if this idea has ever been considered before. After spending a lot of my own time building a utility for creating "fuzzy match" scores between external and Tessitura records, I've finally come to terms with the fact that, while useful, it still requires a lot of handholding, and making it as effective as a human operator would require far more complexity and branching logic. So why not train a machine learning model to encapsulate all that complexity? This would also capitalize on all of the human work that people are already doing to merge dupes within Tessitura: that work could be leveraged to train such an ML model.
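For anyone who hasn't built one, here is a rough sketch of what a "fuzzy match" score can look like. It is not the utility described above. The field names and weights are made up for illustration, and Python's standard-library `difflib` stands in for whatever string comparison a real tool would use.

```python
from difflib import SequenceMatcher

# Hypothetical fields and weights; real ones would depend on the data.
WEIGHTS = {"first_name": 0.2, "last_name": 0.3, "email": 0.5}


def field_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity, ignoring case and surrounding whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0  # a missing value gives no evidence of a match
    return SequenceMatcher(None, a, b).ratio()


def match_score(external: dict, internal: dict) -> float:
    """Weighted sum of per-field similarities (weights add up to 1)."""
    return sum(
        weight * field_similarity(external.get(field, ""), internal.get(field, ""))
        for field, weight in WEIGHTS.items()
    )
```

The hand-tuned weights are exactly the part that needs handholding. A trained model would, in effect, learn them (and their interactions) from past merge decisions.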

Tom Brown (Past Member) $organization over 7 years ago
Nick Reilingh

There are really two parts to the problem you are describing:

1. Record linkage
2. Duplicate resolution

Record linkage is finding the matching records. Duplicate resolution is deciding what to do about the duplicates once you have found them.

For the latter, some of the questions are: how sure are you that you have a good record linkage? How certain must you be before allowing the computer to do the deduplication? Which record do we keep and which do we delete? And so forth.

Once we have a model or method that has found linked records, the approach would be to allow the computer to automatically de-dup records that have a high certainty of a match, then ask humans to supervise a group of lower-certainty matches, with the understanding that at some point you just have to give up. I do agree that for the second problem, duplicate resolution, we might be able to create a model that decides what to do about a duplicate based on what humans have done about duplicates, as stored in the Tessitura database, if we can reconstruct the state of the records at the time of the merge.
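The three-way split described above can be sketched in a few lines. This is only an illustration: the threshold values and the function name are assumptions, and real cutoffs would have to be tuned against merges that humans have already reviewed.

```python
def triage(score: float, auto_merge_at: float = 0.95, review_at: float = 0.75) -> str:
    """Route a candidate pair by match certainty (score between 0 and 1).

    The default thresholds are placeholders, not recommendations.
    """
    if score >= auto_merge_at:
        return "auto_merge"    # high certainty: let the computer de-dup
    if score >= review_at:
        return "human_review"  # lower certainty: queue for a person
    return "ignore"            # below this, give up


# Example: scored candidate pairs as (record_id_a, record_id_b, score).
scored_pairs = [(101, 202, 0.98), (101, 303, 0.81), (404, 505, 0.40)]
queues = {"auto_merge": [], "human_review": [], "ignore": []}
for id_a, id_b, score in scored_pairs:
    queues[triage(score)].append((id_a, id_b))
```

Raising `auto_merge_at` trades fewer wrong automatic merges for a longer human review queue.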

For the first problem, record linkage, I believe that the folks at BAM looked into this and found an R library that looks useful. I'm not sure whether this is the library or not: https://cran.r-project.org/web/packages/RecordLinkage/index.html https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf

The tool KNIME has a deduplication sample or two among its tutorial demonstrations. It is not mature enough to handle a database of hundreds of thousands or millions of customers. If I remember correctly, the record blocking approach was not sophisticated enough to work with large datasets.
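For context on why blocking matters at that scale: comparing every record with every other record grows quadratically, so blocking only compares records that share a cheap key. Here is a minimal sketch in plain Python. The key (first letter of last name plus postal code) and the field names are assumptions for illustration, not what KNIME or any other tool actually does.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: first letter of last name plus postal code."""
    last = record.get("last_name", "").strip().lower()
    postal = record.get("postal_code", "").strip()
    return f"{last[:1]}|{postal}"


def candidate_pairs(records):
    """Yield pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        yield from combinations(sorted(ids), 2)


records = [
    {"id": 1, "last_name": "Reilingh", "postal_code": "12345"},
    {"id": 2, "last_name": "Reiling", "postal_code": "12345"},
    {"id": 3, "last_name": "Brown", "postal_code": "67890"},
]
pairs = list(candidate_pairs(records))
```

A key this crude misses duplicates whose key fields differ (a typo in the postal code, say), which is why more sophisticated approaches usually combine several keys.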

Freely Extensible Biomedical Record Linkage (FEBRL) is a Python application for record linkage.

Python Record Linkage Toolkit documentation: https://recordlinkage.readthedocs.io/en/latest/index.html

There are a number of other libraries out there that can be used for such projects: https://medium.com/@louis_amon/how-to-build-a-machine-learning-powered-record-linkage-workflow-b1890a0eb4ae

I have done more work on this problem than I have time to write up here.

If others would like to jump in, I'd love to work with you on this issue. I'll be on the phone tomorrow, 6/14, at 12 noon EDT for Analytic Coffee! if you want to discuss further. That may be a great place to take the conversation a bit further.

--Tom

P.S. This kind of project would be a lot less scary if we had a standard way to reverse merges done by Tessitura.