Attendance tracking data - weather

Tom previously posted a question about weather and attendance from a forecasting point of view (https://community.tessituranetwork.com/topical_groups/analytics-coffee/f/discussions/22444/weather-and-attendance). Our question is more about tracking weather data to help analyze past attendance.

Currently, the Museum records the temperature at a certain point of the day and an indicator of the type of weather (cloudy, rainy, partly cloudy). I don't think this is quite what we need for analysis (e.g., we don't need to determine how partly cloudy and 73 degrees differs from cloudy and 69 degrees). My goal is to come up with a single indicator, preferably with a small number of potential values (fewer than 6), to help quantify whether the weather was a positive or negative factor in attendance for that day. A key requirement is that determining the day's value be as objective as possible, since it will be recorded by different people on different days. (I'd prefer not to use a general "on a scale of 1 to 5, did the weather positively or negatively affect attendance today?" rating.)

Is anyone else doing this sort of analysis? If so, how do you quantify the effect of the weather? Thanks!
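
To make the "objective, small number of values" idea concrete, here is a rough sketch of a rule-based indicator. The inputs (daily high in Fahrenheit, precipitation in inches), the category names, and every threshold are hypothetical placeholders, not anything the Museum currently uses. Whether each category helps or hurts attendance is left to the analysis.

```python
def weather_indicator(high_temp_f: float, precip_inches: float) -> str:
    """Map two measurable readings to one of four categories.

    All thresholds below are illustrative assumptions and would need
    tuning against local climate and attendance history.
    """
    # Heavy precipitation or extreme temperatures dominate everything else.
    if precip_inches >= 0.5 or high_temp_f < 20 or high_temp_f > 95:
        return "severe"
    # Any measurable rain or snow short of "severe".
    if precip_inches >= 0.1:
        return "wet"
    # Dry, but cold or hot enough to discourage being outdoors.
    if high_temp_f < 40 or high_temp_f > 85:
        return "uncomfortable"
    # Everything else is treated as one bucket on purpose.
    return "pleasant"


# 73 and partly cloudy vs. 69 and cloudy land in the same bucket.
same_bucket = weather_indicator(73, 0.0) == weather_indicator(69, 0.0)
```

Because the value comes from recorded measurements and fixed cut-offs, two different staff members would arrive at the same category for the same day.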

  • Random forest yields the best results for us. We are predicting at the hour level, then we calculate the total for a given day.

  • Using this approach, are you comfortable sharing any accuracy metrics?

    What sorts of periods are giving you the hardest time predicting? 

    For us, Spring Break (Easter, Passover...) can be quite challenging because these dates move around the calendar a lot from year to year.

  • Hi Tom, you've probably already thought of this, but have you considered adding a categorical variable to your model for holidays that move? That is, every day would have a column IsSpringBreak set to 0 or 1. An ML algorithm would then be able to take the break into account.
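
A minimal sketch of that flag, using only the Python standard library. The function name `add_break_flag` and the break window shown are made up for illustration; real break dates would come from whatever calendar source you trust.

```python
from datetime import date, timedelta


def add_break_flag(days, break_ranges):
    """Return one row per day with an IsSpringBreak column set to 0 or 1."""
    rows = []
    for d in days:
        # A day is flagged if it falls inside any (start, end) range, inclusive.
        flag = int(any(start <= d <= end for start, end in break_ranges))
        rows.append({"date": d, "IsSpringBreak": flag})
    return rows


# Hypothetical break window, for illustration only.
break_ranges = [(date(2024, 3, 29), date(2024, 4, 7))]
days = [date(2024, 3, 27) + timedelta(days=i) for i in range(14)]
rows = add_break_flag(days, break_ranges)
```

The resulting column can be joined onto the daily (or hourly) training data like any other feature.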

  • I can get the actual dates of Easter, Passover, Chinese New Year and Ramadan. These are some of the “holidays” that are based on a lunar cycle, not our Gregorian calendar. And you can calculate the weekends adjacent to these dates.

    The problem for our prediction of spring break numbers arises from how various organizations, like schools and municipalities, choose to celebrate and adapt their break calendars. In the New York City / New Jersey area these decisions are made hyper-locally, and often vary wildly depending on the cultural makeup of the local communities. We have not found a reproducible way to inexpensively model this variability 18 or more months in advance. I’ve wondered about proxies. Currently our best method is manually looking at a lot of school calendars and hoping our sample reflects the broader distribution of schools.

    Thoughts?
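
For reference, the date arithmetic for one of these holidays can be sketched as follows. This covers Western (Gregorian) Easter only, using the well-known anonymous Gregorian computus; Passover, Chinese New Year and Ramadan follow other calendars and are not handled here. The helper `nearby_weekend_days` and its 7-day window are assumptions about what "adjacent weekends" should mean.

```python
from datetime import date, timedelta


def easter_sunday(year: int) -> date:
    """Western Easter via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day_index = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day_index + 1)


def nearby_weekend_days(holiday: date, window: int = 7):
    """Saturdays and Sundays within `window` days of the holiday, excluding it."""
    candidates = (holiday + timedelta(days=k) for k in range(-window, window + 1))
    return [d for d in candidates if d.weekday() >= 5 and d != holiday]


easter_2024 = easter_sunday(2024)
weekend_days = nearby_weekend_days(easter_2024)
```

This only fixes the holiday dates themselves; it does not address the harder problem of when individual schools actually schedule their breaks.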

  • Hi Tom, I understand your difficulties now; that's quite a challenge. Here in the UK, schools often don't know themselves exactly when their holidays will be 18 months in advance, but it's at least quite standardised! I'm afraid I can't think of a better method than what you're already doing, taking a sample of schools. It would be possible to automate this (without getting schools to submit their data directly) by modelling that distribution for future years based on a historical sample and perhaps some national exam dates that can be known in advance. However, I doubt it would be possible to get much accuracy like this. One last thought is looking into whether any school authorities (governing bodies, examination authorities, local governments who give holiday approvals) have calendars that combine data from lots of schools. It would only be a little better than what you already do, but it might save you a small chunk of time.
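
One way to turn a sample of school calendars into a model input is a per-day share of sampled schools on break, rather than a single 0/1 flag. The sketch below is only an illustration of that idea: the school names and date ranges are invented, and `share_on_break` is a hypothetical helper, not an existing tool.

```python
from datetime import date, timedelta


def share_on_break(school_breaks, days):
    """For each day, the fraction of sampled schools whose break covers it."""
    total = len(school_breaks)
    shares = {}
    for d in days:
        closed = sum(1 for start, end in school_breaks.values() if start <= d <= end)
        shares[d] = closed / total
    return shares


# Invented sample of school break windows (start, end), inclusive.
sample = {
    "School A": (date(2024, 3, 29), date(2024, 4, 5)),
    "School B": (date(2024, 4, 22), date(2024, 4, 26)),
    "School C": (date(2024, 3, 29), date(2024, 4, 1)),
    "School D": (date(2024, 4, 1), date(2024, 4, 5)),
}
days = [date(2024, 3, 28) + timedelta(days=i) for i in range(10)]
shares = share_on_break(sample, days)
```

A fractional feature like this degrades gracefully when districts disagree, though it is only as representative as the sample behind it.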

  • We don’t even have national exam dates.

    At the moment I don’t know who to ask about the approval of holiday schedules, and I expect I would have to investigate both the NY and NJ state governments.