The Data Science Behind the New York Times’ Dialect Quiz, Part 2

(I have attempted to reach Josh Katz and find the code for his project, but have been unsuccessful.)

Some Additional Dialect-Quiz Background

From listening to Katz’s talk at NYC Data Science Academy and reading his interview with Ryan Graff, I have gathered the following:

- Katz created a 142-question pilot dialect quiz that contained the original 122 questions that Vaux and Golder used in their survey, plus 20 more that Katz came up with via input from the RStudio community (the same community to which he posted his original dialect map visualizations, which got him noticed by the Times).
- For each of these 142 questions, users could also select an answer of “other” and write in a custom response.
- In the pilot study, in addition to linguistic and locational questions, Katz also surveyed people on age and gender.
- In total, 350k people participated in the pilot quiz.
- The data from the pilot is what Katz used to build his final model for the NYT version.
- Katz pared down the questions for the final Dialect Quiz from 142 to 35 based on which ones he found most revealing. Only 25 of those 35 are fed to a user in a single session, making the quiz slightly different each time someone takes it.

Supervised v. Unsupervised ML

As mentioned in the first part of this series, K-Nearest Neighbors (K-NN), the algorithm that Katz used in his dialect quiz, is a supervised ML algorithm. This means that K-NN learns how to do its job by being fed data that has both questions and answers. Unlike unsupervised algorithms, K-NN and algorithms like it are given a set of problems along with their solutions, so that they can easily see what type of output is expected of them in the future.

Claudio Masolo does a great job describing the differences between the two types of ML in his blog post “Supervised, Unsupervised, and Deep Learning”:

“With supervised learning, a set of examples, the training set, is submitted as input to the system during the training phase. Each input is labeled with a desired output value, in this way the system knows [how the output should be, depending on the input] . . . . [In u]nsupervised learning, on the other hand, the training examples provide[d] by the system are not labelled with the belonging class. So the system develops and organizes the data, searching common characteristics among them, and changing based on internal knowledge.”

[Figure: Supervised learning schema from Masolo’s post “Supervised, Unsupervised, and Deep Learning”]

[Figure: Unsupervised learning schema from Masolo’s post “Supervised, Unsupervised, and Deep Learning”]

In summary:

- Supervised ML (e.g. K-NN) = feeding your model data containing questions and answers so that it can make accurate predictions.
- Unsupervised ML = feeding your model data containing only questions and asking it to tease out patterns from those questions that it can then use to make accurate predictions.

More jargon!

So, we know that K-NN is a supervised ML algorithm, and now we know what that means. Before we move on, there’s just a bit more jargon we have to tackle.

In the world of ML, we data scientists train our algorithms by feeding them training data (usually about 80% of our dataset). This training data consists of things called feature vectors and labels. Let’s go over both of these concepts, in addition to a couple more terms, to jumpstart our understanding.

Features

Features are essentially your dataset’s column headers. They are your independent variables: a change in any one of them may or may not result in a change in your dependent variable (or “target”). (If you’ve ever heard of a data scientist doing “feature engineering,” these are the things they’re adding or deleting in order to optimize their model.)

In the case of the Dialect Quiz, some features were likely the different questions in the quiz, age, and sex.

Feature Vectors

Feature vectors are essentially your dataset’s rows.
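To make the jargon concrete, here is a minimal sketch of features, feature vectors, labels, an 80% training split, and a K-NN prediction. It is not Katz’s code (which, as noted above, I could not obtain). The feature names, the ten rows of data, and the helper functions `train_test_split` and `knn_predict` are all invented for illustration, and the sketch uses only yes/no question features, leaving out age and sex to keep the distances simple.

```python
import math
import random
from collections import Counter

# Features = column headers. These three question names are hypothetical.
FEATURES = ["says_yall", "says_pop", "says_sneakers"]

# Each row pairs a feature vector (one value per feature) with a label.
# The label is the "answer" that makes this a supervised problem.
# All rows below are made-up toy data, not real survey responses.
DATA = [
    ([1, 0, 0], "South"),
    ([1, 0, 1], "South"),
    ([1, 0, 0], "South"),
    ([1, 1, 0], "South"),
    ([0, 1, 0], "Midwest"),
    ([0, 1, 0], "Midwest"),
    ([0, 1, 1], "Midwest"),
    ([0, 0, 1], "Northeast"),
    ([0, 0, 1], "Northeast"),
    ([0, 0, 1], "Northeast"),
]


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def euclidean(p, q):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_rows, query, k=3):
    """Predict a label by majority vote among the k nearest training rows."""
    by_distance = sorted(train_rows, key=lambda row: euclidean(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hold out 20% of the rows, then predict each held-out row from the other 80%.
train, test = train_test_split(DATA)
for vector, label in test:
    print(vector, "actual:", label, "predicted:", knn_predict(train, vector))
```

Because the held-out rows still carry their true labels, comparing the “actual” and “predicted” columns is how you would check whether the model has learned anything useful.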
