Data Focused Decision Making for Organizations: A DSI Case Study

Tyler Richards
Mar 12

In late Spring of 2018, I was elected President of DSI (Data Science and Informatics), which is the Data Science student group at the University of Florida.
We teach workshops (Python, R, NLP, ML, you name it) and grow the Data Science community.
Soon after my election came this thought:

“What kind of idiot would I be if I ran a Data Science organization without applying Data Science to it?”

The rest of this post elaborates how, throughout the Fall of 2018, we brought the organization from the state of “we have very little data and the data we do have is unusable” to “we have an organized and useful source of data and have begun to take action from our generated insights.” Over the years of reading data science-related posts, I’ve often felt like this sort of data engineering/collection/synthesizing work is underrepresented, so here we go!

Data Sourcing

Thankfully, DSI has a history of creating sign-in sheets for our workshops with detailed information about the participants, including names, emails, majors, and the extent of their programming experience.
However, the data from the past 3 years was not collected with analytics in mind, and before the data cleaning process, the set of auto-generated Google Sheets looked a bit like this.
One year, DSI had tracked participants’ class as a string (Freshman, Sophomore, etc.), another year had years at UF as integers (1, 2, etc.) with no way to distinguish first-year graduate students from first-year undergrads, and a third year tracked only academic standing (Undergrad vs. Grad school).
We had tracked email 5 (!!!!) different ways: Email, email, e-mail, Contact (email) and E mail.
These discrepancies were clearly created over the years as the executive board turned over and new people were creating sign-in sheets, which makes sense and comes from good instincts! But the end data is partially unusable because it was not collected analytics-first.
The lesson here is: Any time spent on data intentionality is compounded 10x as the data grow.
DSI has become a pillar of analytics and teaching at UF, and as the organization matured, serving over a thousand students per year, the data’s issues grew alongside.
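
To make the cleanup concrete, here is a minimal sketch of how the header and class-year discrepancies above could be reconciled. This is not the code we actually ran (that was in R); the function names and the canonical labels are hypothetical, and only the five email spellings and the Freshman/Undergrad/Grad values come from our sheets.

```python
import re

# The five email header spellings found in the old sign-in sheets,
# lowercased. "Email" and "email" collapse to the same key.
HEADER_ALIASES = {
    "email": "email",
    "e-mail": "email",
    "e mail": "email",
    "contact (email)": "email",
}

# Assumption: these class names all imply undergraduate standing.
UNDERGRAD_CLASSES = {"freshman", "sophomore", "junior", "senior"}


def canonical_header(raw):
    """Lowercase a header, collapse whitespace, and apply known aliases."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return HEADER_ALIASES.get(key, key)


def standardize_row(row):
    """Return a copy of a sign-in row with canonical header names."""
    return {canonical_header(k): v.strip() for k, v in row.items()}


def academic_standing(value):
    """Map a recorded class value to 'Undergrad', 'Grad', or None.

    Integer years at UF (e.g. '1') return None: as noted above, they
    cannot separate first-year grad students from first-year undergrads.
    """
    key = value.strip().lower()
    if key in UNDERGRAD_CLASSES or key == "undergrad":
        return "Undergrad"
    if key in {"grad", "grad school", "graduate"}:
        return "Grad"
    return None
```

Returning None for the integer years is deliberate: guessing would hide exactly the ambiguity that made that year's data partially unusable.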

Standardizing and Automating

Standardizing data is equivalent to asking, in our case, what do we want to learn about people who come to our workshops? The easiest way to tell what an org/company cares about is to find out what they track.
Is it user growth? Repeat attendance? Demographic characteristics? Once your organization has come together and figured that out, standardization comes more naturally.
Our solution? Templated forms and the R package googlesheets.
The form ensures that the same data is kept time after time, and the package automatically scrapes the sheets and pulls the data together.
The new executive board is creating a better solution using login information and databases, but Google Sheets and a couple of good R (or Python) scripts should do in a pinch.
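
As a rough illustration of the "pull the data together" step, here is a small Python sketch. It is a stand-in, not our actual pipeline: it assumes each workshop's sheet has already been exported as CSV text rather than read through the googlesheets package, and the column names in TEMPLATE_COLUMNS are hypothetical.

```python
import csv
import io

# Hypothetical template: the columns every sign-in form is expected to have.
TEMPLATE_COLUMNS = ["name", "email", "major", "experience"]


def combine_sheets(sheets):
    """Merge per-workshop CSV exports into one list of row dicts.

    `sheets` maps a workshop name to the CSV text of its sign-in sheet.
    A sheet whose header differs from the template raises ValueError,
    so drift in the form is caught early instead of cleaned up later.
    """
    combined = []
    for workshop, csv_text in sheets.items():
        reader = csv.DictReader(io.StringIO(csv_text))
        if reader.fieldnames != TEMPLATE_COLUMNS:
            raise ValueError(
                f"{workshop}: unexpected columns {reader.fieldnames}"
            )
        for row in reader:
            row["workshop"] = workshop  # remember where each row came from
            combined.append(row)
    return combined
```

Failing loudly on a mismatched header is the point of the template: the form keeps the data consistent, and the script checks that it stayed that way.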
Finally, at this point, we had a relatively clean dataset with DSI’s history over the years, and we could attempt to use this data for mission-driven organization change.
This is, in my opinion, the hardest part of Data Science because you never really know if anything you’re working towards will be useful.
What if we spent all this time, all this effort, for nothing? There is no a priori way to know the value of data, only a posteriori.
This is why discussing data collection and cleaning is so crucial: it comprises 80% of the workflow.
There is no business end to DSI; we teach and help because we enjoy it and find it fulfilling, and these lessons we’re learning as an organization, however cheap now, are invaluable to young data scientists in the workforce.

Exploratory Data Analysis

Back to the analysis: using data to further DSI’s mission.
This raises the question, what is DSI’s mission? For the first few years of DSI, it was to learn and teach as fast as we could.
This has worked quite well: over its history, DSI has had ~2,500 attendees.
This graph shows cumulative attendance, but it’s pretty clear that as our content and outreach continue to improve (thanks to a much shorter feedback loop than nearly any other group at UF), students will want to learn programming skills.
The breakdown of DSI attendees is expected, with the plurality being technology majors but with significant numbers from social studies, engineering (formal sciences), business, etc.
This next visualization, a histogram of return attendees, was the one that really struck the DSI exec board.
A huge percentage of attendees (85% in total) came to only one or two workshops.
This is rather unsurprising, as there isn’t a strong reason to attend an intro to Python workshop multiple times.
DSI has done a tremendous job of being a place for learning data science at UF but hasn’t approached a different problem: creating a data science community.
This created a leaky bucket for user retention, which wasn’t our intention.
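
For anyone who wants to compute the same return-attendance numbers for their own org, here is a minimal sketch. It assumes the cleaned data is a list of (email, workshop) sign-in pairs and uses email as the person identifier; the function names are hypothetical and this is not the code behind our histogram.

```python
from collections import Counter


def workshops_per_person(sign_ins):
    """Count distinct workshops attended per person.

    `sign_ins` is an iterable of (email, workshop) pairs. Emails are
    lowercased and stripped, and duplicate sign-ins for the same
    workshop are counted once.
    """
    unique = {(email.strip().lower(), workshop) for email, workshop in sign_ins}
    return Counter(email for email, _ in unique)


def attendance_histogram(per_person):
    """Map 'number of workshops attended' to 'number of people'."""
    return Counter(per_person.values())


def share_attending_at_most(per_person, max_workshops):
    """Fraction of people who attended `max_workshops` or fewer."""
    if not per_person:
        return 0.0
    low = sum(1 for count in per_person.values() if count <= max_workshops)
    return low / len(per_person)
```

Calling share_attending_at_most(per_person, 2) gives the "one or two workshops" share discussed above, and tracking its complement semester over semester is one simple way to watch the leak.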

Creating Community

Starting a community is difficult for a few reasons, one being there are really only proxies for good metrics.
Is having a high return rate all the evidence needed for a community? Certainly not.
It seems like a necessary but not sufficient proxy for community.
We took three main initiatives in the Fall of 2018 to try and build this community.
First, we created Data Gator, UF’s first data science competition in collaboration with UF libraries (if you’re reading this as a UF student, you should enter!!).
Then we came up with a biweekly event called Data Science Wednesday, where our theory was: Community = Data + Coffee + Food + Time.
We give students food and interesting datasets, and see what they come up with.
One dataset focused on detecting poisonous mushrooms, another on playing Fortnite while high, others on bike share rides, and even one on statistics about Pokemon which produced the graph below.
And finally, after looking at the breakdown of our workshops, we found that industry-specific workshops had higher percentages of first-time attendees (in our first Natural Language Processing workshop, we had a plurality of the linguistics Ph.D. students because there were no Python classes taught by the department).
We then continued to develop more niche workshops, like Statistics for Data Science, a wonderful Tableau workshop, and even an Actuary workshop, to attract different parts of campus.
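
The first-time-attendee comparison can be sketched as below. This is an illustration under assumptions, not our original analysis: it expects the workshops in chronological order with unique names, identifies people by email, and the function name is hypothetical.

```python
def first_timer_share(workshops):
    """Share of first-time attendees at each workshop.

    `workshops` is a chronologically ordered list of
    (workshop_name, attendee_emails) pairs. Someone counts as a
    first-timer if their email has not appeared at an earlier workshop.
    """
    seen = set()
    shares = {}
    for name, emails in workshops:
        attendees = {e.strip().lower() for e in emails}
        newcomers = attendees - seen
        shares[name] = len(newcomers) / len(attendees) if attendees else 0.0
        seen |= attendees  # everyone here is a returning attendee from now on
    return shares
```

The earliest workshop in the data always scores 1.0, so the comparison is only meaningful between workshops held after there is some attendance history.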

Results

It’s difficult, and probably a bad idea, to evaluate changes to an organization after less than a semester.
With that in mind, a few numbers popped out of the last semester.
Fall 2018 was the highest-attended DSI semester ever, and the percentage of students who came to more than two workshops doubled.
I’m really excited about these results, not only as a potential success story of data-focused decision making at an organizational level but also because of where the exec board is bound to take the org after my tenure.
Anyway, that’s the more complete story of how we did all the boring but highest-leverage data work at a student org and saw some great preliminary results by looking at some nice graphs, clearly defining what we wanted, and making some changes we could measure.
The exec team has some other projects they’re working on, including building a login system and proper databases for the organization and trying to make a ‘how much pizza should we order’ model to optimize our budget.
There is no group at UF that I’m more excited about (find them here and attend the annual symposium at the end of the month), so keep an eye on this space! As a plug, if you’re a hiring manager and you’ve made it this far, congrats! Your prize is this advice: hire these people early, because I’m sure there will be a bidding war for all of them soon enough.
Special thanks are due to Delaney Gomen, who was instrumental in the data cleaning and also was behind a lot of the visualizations.
Some of this post, in presentation form, can be found here from a talk I gave at UF in the Fall of 2018 and a portion of the code can be found here.
See more analytics work like this on my website or follow me on twitter.