A journey into supervised machine learning

The data set is rather straightforward, and trying to classify these data points into different letters seemed like a great use case for the data. Code from this analysis can be found here.

I started by reading the data into a Pandas dataframe, investigating a few of the data points, and then separating them into features and labels as NumPy arrays. Aside from being super easy to separate into arrays from the dataframe, I found another really cool Sklearn module called preprocessing, which allowed me to scale the values of the features in one line. I found later that, since my data set was rather small (~20k rows), the preprocessing only ended up saving me a few seconds during training and predicting, but it seems like a useful step when working with much larger data sets. Finally, I used Sklearn's train_test_split to create my training and testing data sets, leveraging the default 75%-25% split.

My next step was to consider the types of models I wanted to try given my problem and data. I knew I needed something good at classification that could separate data across a fair number of dimensions, didn't need to be super fast, and had parameters to control overfitting. To me, SVM and Random Forest fit the bill, so I went ahead with testing those.

Support Vector Machine

My first pass at an SVM resulted in 94.6% accuracy and took a total of only 4.5 seconds to train and predict. I thought this was pretty good given the minimal effort, but I moved on to tuning a few of the parameters to see if I could bump the accuracy up. Experimenting with the kernel and C parameters, I was able to get the accuracy to over 97%. Bumping the C value up to 1000 seemed to give me the biggest boost in accuracy, while the kernel didn't seem to have much positive impact (something I need to investigate more). Another peculiar thing with the kernels was that some of them, such as the sigmoid kernel, reduced accuracy to as little as 43.44%. I'm not well versed in how to apply the "kernel trick", so that's probably a bit more interesting to me than to someone who knows the ins and outs of kernels.

With other data sets, I found the SVM to be very slow, but with this smaller data set and without time as a constraint, the SVM turned out to be a great choice. With more time, I'd like to do some more validation to ensure that my model isn't overfit, and to see if I can improve accuracy by adjusting gamma.

Random Forest

The other model I wanted to try was a Random Forest.
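The data-preparation steps described earlier can be sketched roughly like this. The file name and column names from the original analysis aren't given in the post, so this sketch uses a tiny made-up dataframe in place of the real ~20k-row data set:

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# A tiny stand-in dataframe; the real data set has ~20k rows of numeric
# features plus a letter label (these column names are assumptions).
df = pd.DataFrame({
    "x-box": [2, 5, 4, 7, 2, 4, 1, 2],
    "y-box": [8, 12, 11, 11, 1, 11, 1, 2],
    "width": [3, 3, 6, 6, 3, 5, 3, 4],
    "letter": ["T", "I", "D", "N", "G", "S", "B", "A"],
})

# Separate into features and labels as NumPy arrays.
X = df.drop(columns=["letter"]).to_numpy(dtype=float)
y = df["letter"].to_numpy()

# Scale the feature values in one line with Sklearn's preprocessing module.
X = preprocessing.scale(X)

# train_test_split defaults to a 75%-25% train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)
```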
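The SVM experiment, with the C=1000 setting that gave the biggest accuracy boost, can be sketched as below. Since the letter data set isn't bundled with sklearn, this uses the built-in digits data as a stand-in, so the accuracy it prints won't match the post's numbers:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data set: sklearn's bundled digits (the letter data isn't included).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default RBF kernel with the larger C value; C trades margin width
# against training error, so large C fits the training data more tightly.
clf = SVC(kernel="rbf", C=1000)
clf.fit(X_train, y_train)
print(f"accuracy: {clf.score(X_test, y_test):.3f}")
```

Swapping `kernel="rbf"` for `"sigmoid"` or `"linear"`, or varying `gamma`, reproduces the kind of kernel experiments described above.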
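A Random Forest comparison can be sketched the same way; the post doesn't give its Random Forest settings, so `n_estimators=100` here is just the sklearn default, not a value from the analysis:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data set again: sklearn's bundled digits.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators and max_depth are the main knobs for controlling overfitting.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(f"accuracy: {forest.score(X_test, y_test):.3f}")
```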
