Just attended a nice intro to R and RStudio webinar.
Here is a link to the article.
https://thomasmock.netlify.com/post/a-gentle-guide-to-tidy-statistics-in-r/
I'll try to add a link to the video in 3-4 days when it is posted.
In the meantime, here is a good place to start on videos:
https://resources.rstudio.com/webinars
You can get free access to RStudio in the cloud for learning at https://rstudio.cloud.
Thanks for this, Tom!
Just what the Dr ordered
I'm an intermediate-level R user with machine learning and data visualization experience, if anyone ever wants to use me as an additional resource! Really happy to see more people using R, and I look forward to hearing any success stories. Thanks Tom!
How have you been using R at the Aquarium?
I started my position in April and have been working to clean up our data before I get to have any real fun with it, so nothing yet, unfortunately. But I am looking forward to using R as a way to quickly summarize data and make visualizations of more complex datasets. You have far more power in R for exploratory data analysis than you do in, say, Excel. It's great if you want advanced statistics, such as a cluster analysis. It can be especially useful when your data becomes too large to open in Excel, too.
Here is the link to the video.
The recording of the webinar is available to watch here. Additionally, you can access the code that was presented on RStudio Cloud.
Let me know if you give it a try and if you run into any problems. Would be glad to give you a hand if I can.
Are you using R for any of your cleanup work? Over on the Developers Group here on TessituraNetwork.com, we have started a conversation about "Machine learning model for identifying duplicates." Have you tried anything like this with R? There are a few libraries out there for record linkage.
However, then there is the problem of what you do about the linked records once you have found them. Which one do you keep? Which do you delete? And there are a host of other questions.
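One possible answer to the "which one do you keep" question is a simple survivorship rule. Here is a minimal sketch in base R, assuming the linked records already share a group label and that the most recently updated record in each group wins. The data, the column names (`link_group`, `last_updated`), and the rule itself are hypothetical, not something from the webinar or any particular package.

```r
# Hypothetical linked records: rows sharing a link_group were judged to be
# the same constituent by some earlier record linkage step.
records <- data.frame(
  id = c(101, 102, 103, 104, 105),
  link_group = c("A", "A", "B", "C", "C"),
  last_updated = as.Date(c("2018-01-15", "2018-06-02", "2017-11-30",
                           "2018-03-10", "2016-09-21")),
  stringsAsFactors = FALSE
)

# Sort so that the newest record in each group comes first.
ordered <- records[order(records$link_group,
                         -as.numeric(records$last_updated)), ]

# The first row seen for each group is the survivor; the rest are only
# flagged, so a person can review them before anything is merged or deleted.
ordered$keep <- !duplicated(ordered$link_group)

survivors <- ordered[ordered$keep, ]
to_review <- ordered[!ordered$keep, ]

print(survivors)
print(to_review)
```

Flagging instead of deleting keeps the decision reversible, which seems safer while the rule is still being tuned.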
cc: Nick Reilingh
Here is a YouTube video showing the use of the RecordLinkage package.
https://www.youtube.com/watch?v=Msl1Q5Yv8Ow
What an exciting question! I'm going to consolidate all of the info and experiences I've had on this and get something fleshed out by Monday. My initial thought is that machine learning models will always have some margin of error, so unless we are willing to have some percentage of our data merged by mistake, or not merged when it should have been, a human will need to clean up whatever the model misses. Machine learning can really help with optimization, but it rarely gives a product that has 100% accuracy. That doesn't mean a machine learning model for deduplication wouldn't be useful; it just might become exceedingly complex to produce. This ultimately depends on what information you have about your constituents, how high the error rates are, and how inconsistent the entries are (such as how people choose to enter their phone number). If you have an additional database to compare against, that is ideal. For instance, I use the California Directory of Schools to find out which school names are the correct ones and which records should be merged.
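To make two of those ideas concrete, here is a rough sketch in base R: normalizing phone numbers that were typed in different ways, and matching free-text school names against a reference list. The sample data, the `normalize_phone` and `match_to_reference` helpers, and the distance cutoff of 3 are all made up for illustration; `adist()` is base R's edit distance function. This is a simple rule-based approach, not a machine learning model.

```r
# Hypothetical helper: keep only digits, then drop a leading US country code
# so "1-555-..." and "(555) ..." compare as equal.
normalize_phone <- function(x) {
  digits <- gsub("[^0-9]", "", x)
  ifelse(nchar(digits) == 11 & substr(digits, 1, 1) == "1",
         substr(digits, 2, 11),
         digits)
}

phones <- c("(555) 123-4567", "555.123.4567", "1-555-123-4567", "555 987 6543")
normalized <- normalize_phone(phones)
phone_is_repeat <- duplicated(normalized)  # TRUE where a number repeats an earlier one

# Hypothetical helper: for each entered name, find the closest reference name
# by edit distance and accept it only if it is within max_distance edits.
match_to_reference <- function(entered, reference, max_distance = 3) {
  d <- adist(tolower(entered), tolower(reference))
  best <- apply(d, 1, which.min)
  best_distance <- d[cbind(seq_along(entered), best)]
  data.frame(
    entered = entered,
    matched = ifelse(best_distance <= max_distance,
                     reference[best], NA_character_),
    distance = best_distance,
    stringsAsFactors = FALSE
  )
}

# Made-up reference list standing in for an external directory.
reference <- c("Lincoln Elementary School",
               "Washington Middle School",
               "Roosevelt High School")
entered <- c("Lincoln Elementry School",
             "washington middle school",
             "Springfield Academy")

result <- match_to_reference(entered, reference)
print(result)
```

Anything that comes back with `NA` in `matched` would go to a person for review, which is the human cleanup step mentioned above.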
I know that may have been a mouthful, but I would love to continue this discussion. I hope to make the meeting on the 28th my first Analytic Coffee! session. Have a great weekend!