Hello all,
I am hoping to get a sense of how different orgs manage their merging of duplicates. I have read on the forums about some great examples from a couple of years ago of people scheduling the Identify Duplicates procedure, and having a daily task of manually putting them into the Merge queue.
Some people appeared to have been confident enough to merge constituents automatically if they meet the criteria.
I would love to hear how many people are automatically merging duplicates, and how many people are doing it manually, which is how I was always taught.
At ACMI, we have thousands of duplicate accounts from importing e-news subscribers over time. The bulk of these accounts only have First Name, Last Name and email. We are working with Tessitura consultants to get our Identify Duplicates logic to pick theses up, and would very much like to merge automatically as many as possible.
Thanks all!
Nicholas
I think that the best way to de-dupe would be to use distance calculations and cataloging functions
While not perfect it's probably going to be as-close-to perfect as you can get without a serious algorithm designed for Tess's database (and since we all use it differently, this may not even be possible?)
If we create table and LTX_CONST_DISTANCE:
Unknown said: customer_no1 int, customer_no2 int, table varchar(128), columns varchar(128), distance int, type int
customer_no1 int,
customer_no2 int,
table varchar(128),
columns varchar(128),
distance int,
type int
and table LTR_DISTANCE_TYPE:
Unknown said: id int, description varchar(256)
id int,
description varchar(256)
We can then populate it based on tables like T_ACCOUNT_DATA, T_ADDRESS and T_EADDRESS over time using functions such as Jaro-Winkler (probably the single best for data-entry errors) on the most error-prone fields. Note this should be done over time until a database is set, possibly on separate days depending on how long each table/column(s) matching takes. Once completed it won't have to be run unless there is a change to the constituent's data for that table.
Also I would highly suggest using a cataloging function (soundex, metaphone) for names so that names like 'Thompson' and 'Tompson' etc. are matched; this can be useful for cases where a BO associate my put in a name vs a person spelling it themselves. What's great about these cases as well is we usually can tell how a name/field was added in Tessitura via modified/created by values in tables which would be an interesting twist.
In all I would if we were going to automatically merge constituents I would probably require at least:
Use of same/nearly same credit card (via T_ACCOUNT_DATA) (jaro winkler score of > 85 on t_act with matching values on the other fields) I like to think of this as a best-way to matching; limiting conditionals and relying on a verified field it could probably be okay to skip if everything else matched with a JW>90+ Jaro Winkler matched email or postal address (JW >75-85 as base) Jaro Winkler/Soundex/Metaphone matched first name (JW > 90) Jaro Winkler/Soundex/Metaphone matched last name (JW > 90)
I'm on a RAMP server right now and don't have access to installing the MDS database so I haven't had a chance to map out the best base Jaro-Winkler values, but the ones would probably be a good starting point IMHO. I have used similar functions in the past to generate numerous matching schemes across databases which would usually have different values, Tess shouldn't be much different.
Hi Chris
That's an interesting idea or two.
We've extended the identify-dupes in various ways, including matching across name+address, name+email, and name+phone, and cross-matching as well ( ie if A matches B on address, and B matches C on email, then we say that A matches C as well), and we have been Brave enough to auto-merge across sets ( ie where we identify a Set of dupes, selecting which one to keep and merging all the others into it), based on fairly rigorous conditions. As I may have mentioned elsewhere, our first auto-merge merged 100,000 records, so that's the scale of our problem, which puts it a bit beyond manual solution.
But I don't think i'd be keen about using any approximate-matching methods on email addresses or cc numbers. We only ever use exact-matching on emails - since they're effectively forced to be unique, they're really a good identifier (except where people share them of course - that's a trap.) Obviously approximate-matching would increase the chance of a match in the case of typos, but I would only want to use that for identifying potential dupes for manual checking, not rely on it for auto-merge.
And of course cc numbers are stored encrypted, and since about version 10, they can't be unencrypted by us (thanks to PCI rules), so we can only access the masked version - I don't think I'd want to rely on them at all - even an exact match.
Ken
Ken-
I couldn't agree with you more. Personally I think any auto-deduping in Tess is probably a horrible idea since accounts are sometimes just separate to be separate which was a disclaimer I put on a draft of that post (must have lost it in the actual post). it should be something done on a one time basis and it should be thoroughly checked before being deployed or committed in a live database.
One thing I wouldn't agree with is throwing out distance functions for de-duping. I would take a highly approximate match on multiple fields over a single field perfect match since the likelihood they are not the same person goes down over each field matched. If they meet a minimum criteria set then we don't include them.
I do however believe that approximate matching on anything that we know for certain (CC numbers) is just dumb and maybe I wasn't clear when I was explaining I would look at the act_name for a match of some sort (much like a person would do when checking for a dupe) along with the 10 out of 16 numbers of the PAN we know along with expiration date which are all stored in the DB. This was in addition to matching on 3 or 4 other fields with fairly high certainty.
OFC any code could be built on the base I had suggested; even ones that look for perfect matches to even consider an automatic merge.
So I suppose FWIW- I agree, but still would love to see a more semantic approach to the duplication matching features of Tess
Hi Ken,
I've been reading this thread from last year as we're still working on our duplicate issue at ACMI! I wondered if you could please let me know if and how you set further business rules regarding which record to keep? For example, once duplicates are identified we'd like to give preference to accounts that are attached to a household or have a more recent activity date. Is this something that's managed in SQL code?
Thanks,
Lydia