Here’s what I did to transform the training data below.
```r
# Read in JSON data: train metadata for profile pictures
library(jsonlite)    # fromJSON
library(data.table)  # rbindlist
library(stringr)     # string helpers

fileNames <- list.files("./input/train_metadata", pattern = "*-1.json", full.names = TRUE)

# Create an empty list to fill with data tables of JSON data to bind later
filelist <- list()
for (file in fileNames) {
  json <- fromJSON(file, flatten = TRUE)  # turn the JSON file into an R object
  dat <- rbindlist(json, fill = TRUE)     # make a data table
  dat$PetID <- basename(file)             # add a filename column
  filelist[[file]] <- dat                 # add it to the list
}

# Bind the data tables together
train_metadata <- rbindlist(filelist, fill = TRUE)

# Split filenames into PetID and image number, and create columns for both
train_metadata$imgNumber <- gsub(".*-", "", train_metadata$PetID)
train_metadata$imgNumber <- gsub("\\.json.*", "", train_metadata$imgNumber)
train_metadata$PetID <- gsub("-.*", "", train_metadata$PetID)
rm(fileNames, filelist, json, dat, file)
```

And for concatenating image keywords and determining whether the image had a clear subject (the subject was considered clear if the first keyword contained "cat" or "dog"):

```r
imageDescriptions <- subset(combined_metadata, select = c(2, 7))
imageDescriptions <- imageDescriptions %>%
  na.omit() %>%
  group_by(PetID) %>%
  mutate(image.keywords = paste0(description, collapse = ", ")) %>%
  filter(row_number() == 1) %>%
  select(-c(description))

# If cat/dog isn't the first image keyword, mark the subject as unclear.
# (Base R's startsWith() has no ignore.case argument, so lowercase first.)
imageDescriptions$isCat <- startsWith(tolower(imageDescriptions$image.keywords), "cat")
imageDescriptions$isDog <- startsWith(tolower(imageDescriptions$image.keywords), "dog")
imageDescriptions$clear.subject <- ifelse(imageDescriptions$isCat | imageDescriptions$isDog, "Yes", "No")
imageDescriptions <- subset(imageDescriptions, select = c(1:2, 5))
```

The end result!

Find Image Object Focus’ Dominant Color

This also makes use of the JSON image metadata.
I wanted another rudimentary way to tell whether the pet in the image was the focus of the picture. There are far more robust ways to do this, but given the competition's memory and runtime constraints I couldn't deploy the solutions I found, and there may be alternatives that would have worked on Kaggle that I'm simply not aware of.
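The core idea, in miniature: map an RGB triple from the metadata to a hex string with base R's `rgb()`, then to the nearest named R color with `plotrix::color.id()`. A small sketch (note that `color.id` can return more than one name for a given hex, which is why duplicates need handling later):

```r
library(plotrix)

rgb2hex <- function(r, g, b) rgb(r, g, b, maxColorValue = 255)

hex <- rgb2hex(128, 128, 128)
hex            # "#808080"
color.id(hex)  # nearest named colors, e.g. both "gray" and "grey"
```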
```r
library(plotrix)  # color.id

# Subset colors and ID into a new data frame
imgColors <- subset(combined_metadata, select = c(5, 7, 24))

# Remove null rows
imgColors <- imgColors[imgColors$dominantColors != "NULL", ]

# Unnest the data frames nested within dominantColors
imgColors <- imgColors %>% unnest(dominantColors)

# Rename the color columns
names(imgColors)[5] <- "red"
names(imgColors)[6] <- "green"
names(imgColors)[7] <- "blue"

# Keep the highest-scoring color on profile images (imgNumber == 1) for modeling
imgDominantColors <- imgColors %>%
  group_by(PetID, imgNumber) %>%
  top_n(1, score) %>%
  filter(imgNumber == 1)

# Add a hex column
rgb2hex <- function(r, g, b) rgb(r, g, b, maxColorValue = 255)
imgDominantColors$hex <- rgb2hex(imgDominantColors$red, imgDominantColors$green, imgDominantColors$blue)

# Add a color name column
imgDominantColors$color <- sapply(imgDominantColors$hex, color.id)
imgDominantColors <- imgDominantColors %>% unnest(color)

# Rename the color column
colnames(imgDominantColors)[colnames(imgDominantColors) == "color"] <- "dominantColor"

# Thanks to grey/gray, we have to remove duplicated rows for these colors
imgDominantColors <- subset(imgDominantColors, select = -imgNumber)
imgDominantColors <- imgDominantColors %>% group_by(PetID) %>% filter(row_number() == 1)
rm(imgColors)

# Show a tally of pictures by dominant color
imgDominantColors %>% group_by(dominantColor) %>% tally(sort = TRUE)
```

At the very least this is a far more interesting look at colors than the seven the competition used to categorize the pets! I also determined image size and dimensions using the OpenImageR package to add to the features.
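For the size and dimension features, a minimal sketch of what that looks like with OpenImageR (the file path and variable names here are illustrative, not the competition's actual ones):

```r
library(OpenImageR)

# Read one profile image; the path is hypothetical
img <- readImage("./input/train_images/example-1.jpg")

# Height, width, and total pixel count become candidate features
imgHeight <- dim(img)[1]
imgWidth  <- dim(img)[2]
imgPixels <- imgHeight * imgWidth
```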
Sentiment Score of Description

```r
library(tidytext)  # unnest_tokens, stop_words, get_sentiments

# combined is the merge of the training and testing datasets; it includes
# the description that came along with each pet profile
trainWords <- subset(combined, select = c("PetID", "Description"))

# Replace all numbers and underscores with empty strings
trainWords$Description <- gsub("[0-9]+", "", trainWords$Description)
trainWords$Description <- gsub("_", "", trainWords$Description)
trainWords$Description <- as.character(trainWords$Description)

trainWords <- trainWords %>%
  unnest_tokens(word, Description) %>%                          # break the text into individual words
  anti_join(stop_words) %>%                                     # stop word list provided by tidytext
  mutate(word = iconv(word, from = "latin1", to = "ASCII")) %>% # remove special characters
  filter(!is.na(word))

sentimentScore <- trainWords %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(PetID) %>%
  summarise(sentimentScore = round(mean(score), 2))

head(sentimentScore)
```

Topic Modeling for Description and Keywords

I wanted to calculate the probability of a description or set of keywords belonging to a particular adoption speed (0–4, 0 being the fastest).
I used the topicmodels package to do this. Below is an example using the keywords from the image metadata.
```r
library(topicmodels)  # LDA
library(tidytext)     # unnest_tokens, cast_dtm

keywordSimilarity <- imageDescriptions %>% select(PetID, image.keywords)

# Replace all numbers and underscores with empty strings
keywordSimilarity$image.keywords <- gsub("[0-9]+", "", keywordSimilarity$image.keywords)
keywordSimilarity$image.keywords <- gsub("_", "", keywordSimilarity$image.keywords)

# Drop observations that are only empty strings
keywordSimilarity <- keywordSimilarity[keywordSimilarity$image.keywords != "", ]

keywordSimilarity <- keywordSimilarity %>%
  unnest_tokens(word, image.keywords) %>%                       # break into individual words
  anti_join(stop_words) %>%                                     # remove stop words
  mutate(word = iconv(word, from = "latin1", to = "ASCII")) %>% # remove special characters
  filter(!is.na(word))

word_counts <- keywordSimilarity %>%
  count(PetID, word, sort = TRUE) %>%
  na.omit() %>%
  ungroup()

keywordDTM <- word_counts %>% cast_dtm(PetID, word, n)

# Remove empty documents from the DTM for LDA
rowTotals <- apply(keywordDTM, 1, sum)    # the sum of words in each description
keywordDTM <- keywordDTM[rowTotals > 0, ] # remove all descriptions without words

lda <- LDA(keywordDTM, k = 5, control = list(seed = 1234))

# Extract PetID and gamma information from the LDA
document <- as.data.frame(lda@documents)
gamma <- as.data.frame(lda@gamma)
keywordTopics <- cbind(document, gamma)

# Rename columns for readability and for merging with the dataset
names(keywordTopics) <- c("PetID", "keyword_topic_1", "keyword_topic_2",
                          "keyword_topic_3", "keyword_topic_4", "keyword_topic_5")
rm(keywordSimilarity, word_counts, keywordDTM, imageDescriptions, document, gamma)
head(keywordTopics)
```

Other features added: pet description word count, breed and color counts, and replacing the raw name data with a has/has-no-name flag.
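Those remaining features are simple to derive. A hedged sketch of how the word count and name flag might look, assuming the `combined` data frame has `Description` and `Name` columns (the new column names are illustrative):

```r
library(stringr)

# Description word count: count whitespace-separated tokens
combined$descWordCount <- str_count(combined$Description, "\\S+")

# Replace the raw name with a has/has-no-name flag
combined$hasName <- ifelse(is.na(combined$Name) | combined$Name == "", "No", "Yes")
```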
Model Time

Mistake #2: When I built my first model, I didn't bother to factor or one-hot encode my data, since I knew I was going to use H2O's AutoML, which I understood would take categorical data without issue. This is how I ended up with a public score of .283 in the competition. What I DID NOT know was that H2O would skip over models that can't take categorical data and simply return the best results from the models that CAN process it. Once I learned this, I encoded the data, creating quantile variables for columns like age, fee, and state, and binning columns like word count. I also scaled the Gabor image features.
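A sketch of the kind of encoding I mean, using dplyr's `ntile()` for quantile bins and base R's `model.matrix()` for one-hot encoding (the column names are illustrative):

```r
library(dplyr)

# Quantile-bin a numeric column such as Age into four groups
combined$ageQuantile <- ntile(combined$Age, 4)

# One-hot encode a categorical column such as Type, assuming it is a factor;
# the -1 drops the intercept so every level gets its own indicator column
typeDummies <- model.matrix(~ Type - 1, data = combined)
combined <- cbind(combined, typeDummies)
```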
Mistake #3: I didn't think to check which H2O version lives on Kaggle, so when I tried to enter my revised, encoded data, the model errored out because of a package bug that has since been fixed. I thought I could work around it by enabling internet access to at least see how my model would do with the updated package (which you can do directly in the kernel), but the competition doesn't allow internet access, so I couldn't submit my new predictions. What. A. Bummer.
The revised model gave me the following results:

```
Extract training frame with `h2o.getFrame("automl_training_trainSet_sid_b5ce_1")`
MSE: (Extract with `h2o.mse`) 0.3218763
RMSE: (Extract with `h2o.rmse`) 0.5673414
Logloss: (Extract with `h2o.logloss`) 0.85676
Mean Per-Class Error: 0.3051386
Null Deviance: (Extract with `h2o.nulldeviance`) 37022.39
Residual Deviance: (Extract with `h2o.residual_deviance`) 21717.15
AIC: (Extract with `h2o.aic`) NaN

Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, train = TRUE)`
=========================================================================
Row labels: Actual class; Column labels: Predicted class
           0     1     2     3     4   Error             Rate
0          1    16   259     3    37  0.9968 =   315 /    316
1          0  2158   336     0   106  0.1700 =   442 /  2,600
2          0   218  2607   229   414  0.2483 =   861 /  3,468
3          0     3   121  2587   103  0.0807 =   227 /  2,814
4          0    63    23    18  3372  0.0299 =   104 /  3,476
Totals     1  2458  3346  2837  4032  0.1538 = 1,949 / 12,674

Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, train = TRUE)`
=======================================================================
Top-5 Hit Ratios:
  k hit_ratio
1 1  0.846221
2 2  0.949740
3 3  0.973331
4 4  0.981616
5 5  1.000000
```

Since there are very few 0s in the data set, my model was completely lost trying to predict any of them correctly.
I don't know what score this model would have produced because of the package conflicts I mentioned earlier. I'd like to think it improved things a teeny bit. I may come back to this using the caret package and see what I can accomplish.
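If I do, the starting point would look something like this, assuming a prepared training frame `trainSet` with a factor target `AdoptionSpeed` (a sketch, not a tuned model):

```r
library(caret)

set.seed(1234)
ctrl <- trainControl(method = "cv", number = 5)

# Train a gradient boosted model with 5-fold cross-validation
fit <- train(AdoptionSpeed ~ ., data = trainSet,
             method = "gbm", trControl = ctrl, verbose = FALSE)

predict(fit, newdata = trainSet)
```

Unlike AutoML, caret exposes the resampling scheme, the tuning grid, and even the summary metric, which is exactly the kind of control that helped other competitors.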
In the End

I learned a ton during this competition. Here are a few key points:

- After EDA and BEFORE any feature engineering, build a baseline model to make sure your model works with the data you're given.
- Always encode your categorical data!
- Check the package versions you're using on Kaggle, especially when you're working on a local machine and then uploading your work. They are not always up to date and can do strange things! This opened my eyes to the importance of reproducibility, and I hope to learn more about how to safeguard my code so that it works forever and always.
- Computer vision is HARD. This was probably a poor choice for my first competition, since I had never done it before and had no idea how to use pretrained neural networks, which would have done wonders for image processing and analysis. I did learn how to turn images into an MNIST-like pixel matrix, so that's something!
- Imputing data: there were options to use models to fill in the blanks for cells missing values, as well as quick-and-dirty methods like entering the mean or mode. I opted for the mean, but I know this is definitely not the best way to impute in the real world, and I'd like to use model-based imputation going forward when it makes sense to do so.
- Know when to use dimensionality reduction, PCA, regularization, standardization, etc. I had to google when these tactics would be appropriate, but seeing as this is only my third stab at ML, I'll forgive myself.
- AutoML is still a work in progress. Packages like caret gave some R users better results in the competition because those users had more control over their models in ways that maximized results.

I was really impressed by Erik Bruin's baseline model. He was able to create a function that uses Quadratic Weighted Kappa (which, in this case, measures the agreement between predicted and actual adoption speeds) as his model's evaluation metric. This is the same metric used to score entries in the competition.
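Quadratic Weighted Kappa itself is straightforward to compute from its definition; below is a minimal sketch (not Erik's actual function), assuming integer ratings 0–4:

```r
quadratic_weighted_kappa <- function(actual, predicted, min_rating = 0, max_rating = 4) {
  ratings <- min_rating:max_rating
  n <- length(ratings)

  # Observed agreement matrix, normalized to proportions
  O <- table(factor(actual, levels = ratings), factor(predicted, levels = ratings))
  O <- O / sum(O)

  # Expected agreement matrix under independence of the two raters
  E <- outer(as.numeric(table(factor(actual, levels = ratings))) / length(actual),
             as.numeric(table(factor(predicted, levels = ratings))) / length(predicted))

  # Quadratic weights: the penalty grows with the squared distance between ratings
  W <- outer(ratings, ratings, function(i, j) (i - j)^2 / (n - 1)^2)

  1 - sum(W * O) / sum(W * E)
}
```

Perfect predictions put all mass on the zero-weight diagonal of O, so the kappa is 1; random predictions make O match E and drive it toward 0.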
He ended up with a private score of 0.30182 while doing very little feature engineering with this technique. I didn't even know you could do this, so I'm excited to dig into caret and get better at tuning models.

Thanks for reading!