Want to Learn R

I just attended a nice intro to R and RStudio webinar.

Here is a link to the article.

https://thomasmock.netlify.com/post/a-gentle-guide-to-tidy-statistics-in-r/

I'll try to add a link to the video in 3-4 days when it is posted.

In the meantime, here is a good place to start for videos:

https://resources.rstudio.com/webinars

You can get free access to RStudio in the cloud for learning at https://rstudio.cloud.

Parents
  • Former Member
    Former Member $organization

    I'm an intermediate level R user with machine learning and data visualization experience if anyone ever wants to use me as an additional resource! Really happy to see more people using R and look forward to hearing any success stories. Thanks Tom!

Children
  • How have you been using R at the Aquarium?

  • Former Member
    Former Member $organization in reply to Tom Brown (Past Member)

    I started my position in April and have been working to clean up our data before I get to have any real fun with it, so nothing yet, unfortunately. But, I am looking forward to using R as a way to quickly summarize data and make visualizations of more complex datasets.  You have exponentially more power in R for exploratory data analysis than you do in, say, Excel. It's amazing if you want advanced statistics, such as a cluster analysis. It can be especially useful when your data becomes too large to read in Excel, too.

  •  Are you using R for any of your clean up work?  Over on the Developers Group here on TessituraNetwork.com we have started a conversation about "Machine learning model for identifying duplicates".  Have you tried anything like this with R?  There are a few libraries out there for Record Linkage.  

    However, then there is the problem of what you do about the linked records once you have found them.  Which one do you keep?  Which to delete, and a host of other questions.

  • Former Member
    Former Member $organization in reply to Tom Brown (Past Member)

    What an exciting question! I'm going to consolidate all of the info and experiences I've had on this and get something fleshed out by Monday. My initial thoughts are that machine learning models will always have some margin of error associated with them, so unless we are willing to have some percentage of our data merged by mistake, or not merged when it should have been, a human will need to clean up any residuals that the cleanup model missed. Machine learning can really help with optimization, but it rarely gives a product that has 100% accuracy. That said, a machine learning model for deduplication could still be useful; it just might become exceedingly complex to produce. This ultimately depends on what information you have about your constituents, how high their error rates are, and how inconsistent their responses are (such as how people choose to enter their phone number). If you have an additional database to compare against, that is ideal. For instance, I use the California Directory of Schools to find out which schools are the correct ones, and what should be merged.
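    To make the phone number point concrete, here is a minimal Python sketch of normalizing entries before comparing them. It assumes US-style 10-digit numbers, and the sample values and the `normalize_phone` helper are made up for illustration; the same idea carries over to R.

```python
import re

def normalize_phone(raw: str) -> str:
    """Reduce a phone entry to bare digits (assumption: US-style numbers)."""
    digits = re.sub(r"\D", "", raw)  # strip everything that is not a digit
    # Drop a leading country code "1" on an 11-digit number.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

# Hypothetical ways the same number might be typed in.
variants = ["(562) 555-0100", "562.555.0100", "+1 562 555 0100", "5625550100"]
normalized = {normalize_phone(v) for v in variants}
print(normalized)
```

    All four spellings collapse to a single value, so a later exact-match comparison no longer treats them as different phones.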

    I know that may have been a mouthful but I would love to continue this discussion. I hope to make the meeting on the 28th my first Analytic Coffee! session. Have a great weekend!

  • I agree that an ML model will always have some error.  This is why I wish there were a standard way to un-merge accounts in Tessitura.  However, we don't have one at this point. 

    Regardless, there are some sites that are automatically scheduling merges when there is very high certainty of record linkage.  (To date I think that most groups are looking at things like exact email address matches.)  I've also thought of using additional features in the record linkage process like partial credit card numbers as part of match criteria. 
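    As a rough illustration of the exact-email approach, here is a small Python sketch. The record layout, IDs, and the `candidate_merges` helper are hypothetical and not anything from Tessitura; it only shows the grouping logic.

```python
from collections import defaultdict

# Hypothetical constituent records; field names are assumptions.
records = [
    {"id": 101, "email": "Pat.Lee@example.org"},
    {"id": 102, "email": " pat.lee@example.org "},
    {"id": 103, "email": "sam@example.org"},
    {"id": 104, "email": ""},
]

def candidate_merges(records):
    """Group record IDs by normalized email; keep groups with 2+ records."""
    groups = defaultdict(list)
    for rec in records:
        key = rec["email"].strip().lower()  # ignore case and stray spaces
        if key:  # blank emails must never match each other
            groups[key].append(rec["id"])
    return {email: ids for email, ids in groups.items() if len(ids) > 1}

merges = candidate_merges(records)
print(merges)
```

    Skipping blank emails matters here: otherwise every account without an email would land in one giant "match" group.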

    Another challenge comes when an account has multiple phones, email addresses, postal codes, and past credit cards used.  How do we do feature pairing without blowing up computational complexity too badly? 
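    One common way to keep the pair count down is blocking: only compare accounts that share at least one contact value. Below is a minimal Python sketch of that idea. The account structure and the `candidate_pairs` helper are assumptions for illustration, and blocking is a general record linkage technique, not something specific to Tessitura.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical accounts, each with several emails and phones.
accounts = {
    1: {"emails": {"a@example.org", "b@example.org"}, "phones": {"5625550100"}},
    2: {"emails": {"b@example.org"}, "phones": {"5625550199"}},
    3: {"emails": {"c@example.org"}, "phones": {"5625550100"}},
    4: {"emails": {"d@example.org"}, "phones": {"5625550142"}},
}

def candidate_pairs(accounts):
    """Return account pairs that share at least one email or phone."""
    # Inverted index: (field, value) -> set of account IDs holding that value.
    index = defaultdict(set)
    for acct_id, fields in accounts.items():
        for field, values in fields.items():
            for value in values:
                index[(field, value)].add(acct_id)
    # Only accounts inside the same index bucket become candidate pairs.
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

pairs = candidate_pairs(accounts)
print(sorted(pairs))
```

    With four accounts there are six possible pairs, but only the two that share a value are kept for detailed comparison. Very common values (a shared switchboard number, for example) can still create large buckets, so those may need to be excluded.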

    That said I'm inclined to try to work on an MVP and see how far we get.

    For me, the first step will be to extract customer records from Tessitura and get the data into an analytics database outside Tessitura.  I will likely use the List and Output Set method through the REST API to get my data, because I'm on RAMP and don't have direct database access from my data science environment, which includes PostgreSQL, Jupyter Notebooks, and R.

  • Former Member
    Former Member $organization in reply to Tom Brown (Past Member)

    One thing I can say is that we have had specific use cases that have presented some trouble.  One of them is siblings that are in our summer programs.  They often have the same address, use a family email, same home phone number, etc.  Sometimes we even have trouble with siblings with names like 'Roberto' and 'Roberta'. Fringe cases like these have often slipped under the radar for my department and I have only recently discovered some of these issues.  Would be interested to hear if you have dealt with similar issues as well. 
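    To show why a case like 'Roberto' and 'Roberta' is tricky, here is a small Python sketch using the standard library's `difflib`. The `classify` rule and the 0.8 threshold are assumptions chosen for illustration, not a recommended setting.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity between two names, from 0.0 (different) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(first_a, first_b, same_contact_info, threshold=0.8):
    """Toy rule: near-identical names on shared contact info go to a human."""
    if same_contact_info and first_a.lower() == first_b.lower():
        return "likely duplicate"
    if same_contact_info and name_similarity(first_a, first_b) >= threshold:
        return "needs human review"
    return "distinct"

print(round(name_similarity("Roberto", "Roberta"), 2))
print(classify("Roberto", "Roberta", same_contact_info=True))
```

    The two names score high enough that a naive similarity cutoff would merge the siblings, which is why this sketch routes near matches to a person instead of merging automatically.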

    Good luck on your journey!

  • The layered complexities of householding I understand.  However, I don't have good ways to sort out these issues.  I did a bit of playing with the problems over the weekend.