Machine learning model for identifying duplicates

Hey all,

Just wondering if this idea has ever been considered before. After spending a lot of my own time building a utility for creating "fuzzy match" scores between external and Tessitura records, I've finally come to terms with the fact that, while useful, it still requires a lot of handholding, and making it as effective as a human operator would require far more complexity and branching logic. So why not train a machine learning model to encapsulate all that complexity? This would also capitalize on the human work people are already doing to merge dupes within Tessitura: those merges could serve as training data for such a model.
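
To make the idea a bit more concrete, here is a minimal sketch of per-field "fuzzy match" scores that could double as input features for a trained model. It is not my actual utility: the field names and the plain averaging are assumptions for illustration, and it only uses `difflib` from the Python standard library. A model trained on past merges would replace the averaging step with learned weights.

```python
from difflib import SequenceMatcher

# Hypothetical field names for illustration; real Tessitura columns will differ.
FIELDS = ("first_name", "last_name", "email", "postal_code")


def field_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity, ignoring case and surrounding whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0  # a missing value gives no evidence of a match
    return SequenceMatcher(None, a, b).ratio()


def pair_features(rec_a: dict, rec_b: dict) -> list:
    """One similarity per field; this vector is what a model would be trained on."""
    return [field_similarity(rec_a.get(f, ""), rec_b.get(f, "")) for f in FIELDS]


def fuzzy_score(rec_a: dict, rec_b: dict) -> float:
    """Naive hand-built score: the unweighted mean of the field similarities."""
    feats = pair_features(rec_a, rec_b)
    return sum(feats) / len(feats)
```

Each pair that humans already merged would become a positive training example, and pairs that were reviewed and left alone would become negatives.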

  • There are really two parts to the problem you are describing:

    1. Record linkage
    2. Duplicate resolution

    Record linkage is finding the matching records, and duplicate resolution is deciding what to do about the duplicates once you have found them.

    For the latter, some of the questions are: How sure are you that the record linkage is good? How certain must you be before allowing the computer to do the deduplication? Which record do we keep, which do we delete, and so forth?

    Once we have a model or method that has found linked records, the approach would be to let the computer automatically de-dup records that have a high certainty of a match, then ask humans to review a group of lower-certainty matches, with the understanding that at some point you just have to give up. I do agree that for the second problem, duplicate resolution, we might be able to create a model that decides what to do about duplicates based on what humans have done about them, as stored in the Tessitura database, if we can reconstruct the state of the records at the time of the merge.

    For the first problem, I have done more work, but more than I've got time to write out here.

    If others would like to jump in, I'd love to work together on this issue. I'll be on the phone tomorrow, 6/14, at 12 noon EDT for Analytic Coffee if you want to discuss further. That may be a great place to take the conversation a bit further.

    --Tom

    P.S. This kind of project would be a lot less scary if we had a standard way to reverse merges done by Tessitura. 
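
The three-way triage described in the reply above (auto-merge the high-certainty matches, send the lower-certainty ones to a person, and give up on the rest) can be sketched roughly as follows. The cutoff values, the `Action` names, and the `(id_a, id_b, probability)` tuple shape are all assumptions for illustration; real thresholds would have to be tuned against reviewed merges.

```python
from enum import Enum


class Action(Enum):
    AUTO_MERGE = "auto_merge"
    HUMAN_REVIEW = "human_review"
    IGNORE = "ignore"


# Assumed cutoffs, not recommendations; tune them against known merges.
AUTO_MERGE_AT = 0.95
REVIEW_AT = 0.70


def triage(match_probability: float) -> Action:
    """Map a match probability in [0, 1] to one of the three actions."""
    if not 0.0 <= match_probability <= 1.0:
        raise ValueError("match_probability must be between 0 and 1")
    if match_probability >= AUTO_MERGE_AT:
        return Action.AUTO_MERGE
    if match_probability >= REVIEW_AT:
        return Action.HUMAN_REVIEW
    return Action.IGNORE  # the "at some point you just have to give up" bucket


def triage_pairs(scored_pairs):
    """Group (id_a, id_b, probability) tuples into one queue per action."""
    queues = {action: [] for action in Action}
    for id_a, id_b, probability in scored_pairs:
        queues[triage(probability)].append((id_a, id_b))
    return queues
```

Keeping the auto-merge cutoff high matters all the more while there is no standard way to reverse a merge, since every false positive in that bucket is hard to undo.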
