Predicting geographic origin of fish samples using Random Forest models

I used a discrete Wavelet transformation (similar to the commonly used Fourier transformation) to describe the shape of the outline of each bone (Figure 1) in terms of cosine waves.

The transformation gave me 64 coefficients that describe the morphological nuances of each fish’s ear bone; these are the features.
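To make the idea of turning an outline into 64 coefficients concrete, here is a minimal sketch in plain Python. The post does not say which wavelet family was used, so the Haar wavelet, the `haar_dwt` helper, and the synthetic radius signal below are all assumptions chosen for illustration rather than the method behind the analysis.

```python
import math

def haar_dwt(signal):
    """Full Haar wavelet decomposition of a signal whose length is a power of two.

    Returns a list of the same length: the overall approximation coefficient
    followed by detail coefficients ordered from the coarsest to the finest scale.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Pairwise differences capture detail at the current scale;
        # prepend them so coarser scales end up first in the output.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        # Pairwise sums give a half-length smoothed copy for the next pass.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

# Hypothetical outline: distance from the bone's centre at 64 equally spaced angles.
angles = [2 * math.pi * k / 64 for k in range(64)]
radii = [1.0 + 0.2 * math.cos(2 * t) + 0.05 * math.sin(5 * t) for t in angles]

coefs = haar_dwt(radii)
print(len(coefs))  # 64
```

Each sample would contribute one such row of coefficients to the feature table.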

Figure 1: Ear bone and the derived outline using a Wavelet transformation

Response: Geographic origin
Features: Wavelet coefficients describing bone shape

Data Processing

I had unequal sampling among demographic groups; without getting into the biological details, I needed to isolate the impact of geographic origin on bone shape without demographic details (i.e. length, age, etc.) contributing to the relationship.

I used repeated ANCOVAs to eliminate features where demographic variables significantly covaried with geographic region, and applied a Bonferroni adjustment to limit the buildup of Type I error from repeated analyses.
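The Bonferroni adjustment itself is simple arithmetic: each test is compared against the significance level divided by the number of tests. The sketch below assumes a significance level of 0.05; the `bonferroni_flags` helper and the p-values are hypothetical and are not taken from the study.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Return the indices of tests that stay significant after Bonferroni adjustment."""
    m = len(p_values)
    threshold = alpha / m  # each test is held to alpha divided by the number of tests
    return [i for i, p in enumerate(p_values) if p < threshold]

# Hypothetical p-values for four features (illustrative only).
p_values = [0.0004, 0.03, 0.2, 0.011]
flagged = bonferroni_flags(p_values)
print(flagged)  # [0, 3]
```

If all 64 coefficients were tested at an overall level of 0.05, the per-test threshold would be 0.05 / 64, or roughly 0.00078.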

    # function that will derive the p value for covariates
    # ('Pr(>F)3' is the third model term, the pred:cov interaction)
    ancova.pvalues <- function(resp, pred, cov){
      return(unlist(summary(aov(resp~pred*cov)))['Pr(>F)3'])
    }

    # apply to each feature (given predictor of region (pops) and covariate of length)
    p.values <- 0 # stores values
    for (i in 1:length(colnames(Wavecoefs))){
      p.values[i] <- ancova.pvalues(unlist(Wavecoefs[i]), pops, length_cm)
    }
    which(p.values<0.05) # which features should we omit

Some features were skewed, so I applied a Box-Cox transformation to those with skew greater than 0.75.
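As a rough illustration of the skew check and the transformation, the sketch below computes sample skewness and applies a Box-Cox transform with a fixed lambda. The helper names, the feature values, and the choice of lambda = 0 (the log transform) are assumptions; in practice lambda is usually estimated from the data, and Box-Cox requires strictly positive values, so coefficients that can be negative would need shifting first.

```python
import math

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

# Hypothetical right-skewed feature (illustrative only).
feature = [1.0, 1.2, 1.5, 1.7, 2.0, 2.4, 3.1, 9.5]

# Transform only when the skew exceeds the 0.75 cutoff used in the text.
if abs(skewness(feature)) > 0.75:
    feature_t = box_cox(feature, lam=0)
else:
    feature_t = feature

print(round(skewness(feature), 2), round(skewness(feature_t), 2))
```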

Very few features contained NAs; I replaced NAs with the mean value for each feature.
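Mean imputation for a single feature can be sketched as follows; the `impute_mean` helper and the values are hypothetical.

```python
import math

def impute_mean(column):
    """Replace NaN entries with the mean of the observed values in the column."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in column]

column = [0.5, float("nan"), 1.5, 1.0]  # hypothetical feature values
print(impute_mean(column))  # [0.5, 1.0, 1.5, 1.0]
```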

I split the data set into training and test sets, with 80% and 20% of the data respectively, using the handy function “train_test_split”.

    from sklearn.model_selection import train_test_split

    train_X, test_X, train_y, test_y = train_test_split(df, resp, random_state=0, test_size=.2)

My response variable, geographic region, was sampled unequally (Figure 2).

I wanted to show that the features I engineered can predict the origin of fish from multiple different regions and I wanted to minimize the impact of the variation in sample size on model prediction.

To do this, I randomly undersampled the most common class (Gulf, in blue).
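Random undersampling of the most common class can be sketched in plain Python as below. This is not the library implementation: the `undersample_majority` helper is hypothetical, and cutting the largest class down to the size of the second-largest class is an assumption made for illustration.

```python
import random
from collections import Counter

def undersample_majority(labels, seed=0):
    """Return row indices after randomly thinning the most common class.

    The largest class is cut down to the size of the second-largest class;
    all other classes are kept whole. Indices are returned in ascending order.
    """
    counts = Counter(labels).most_common()
    majority = counts[0][0]
    target = counts[1][1]
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    keep = set(rng.sample(majority_idx, target))
    return [i for i, y in enumerate(labels) if y != majority or i in keep]

# Hypothetical class labels, coded as in the figures (1-Gulf, 2-West Atlantic, 0-East Atlantic).
labels = [1] * 10 + [2] * 4 + [0] * 3
kept = undersample_majority(labels)
balanced_counts = Counter(labels[i] for i in kept)
print(balanced_counts)
```

The returned indices can then be used to subset both the feature table and the response.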

Figure 2: Sample size for each geographic region (1-Gulf of Mexico, 2-West Atlantic, 0-East Atlantic)

The package imbalanced-learn has some useful methods for this purpose; I used “RandomUnderSampler” to create a more balanced training dataset on which to fit my model (Figure 3).

    from imblearn.under_sampling import RandomUnderSampler

    rus = RandomUnderSampler()
    # older imbalanced-learn releases used RandomUnderSampler(return_indices=True) and rus.fit_sample(...)
    X_rus, y_rus = rus.fit_resample(train_X, train_y)
    id_rus = rus.sample_indices_  # indices of the training rows that were kept

Figure 3: Sample size for each geographic region after undersampling

Modeling

I used a random forest classifier to predict the region the sample came from given the features describing bone shape.

First, I determined the optimal hyperparameter values:

max_features: the maximum number of features to consider at each split
max_depth: the maximum depth (number of levels) of any tree
min_samples_split: the minimum number of samples required to split a node
min_samples_leaf: the minimum number of samples required at each leaf node
bootstrap: whether each tree is fit to a bootstrap sample or to the whole dataset
criterion: the function used to assess the quality of each split

    import numpy as np

    # 'auto' behaves like 'sqrt' for classifiers; recent scikit-learn releases no longer accept it
    max_features = ['auto', 'sqrt', 'log2']
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    min_samples_split = [2, 5, 10]
    min_samples_leaf = [1, 2, 4]
    bootstrap = [True, False]
    criterion = ['gini', 'entropy']

    grid_param = {'max_features': max_features,
                  'max_depth': max_depth,
                  'min_samples_split': min_samples_split,
                  'min_samples_leaf': min_samples_leaf,
                  'bootstrap': bootstrap,
                  'criterion': criterion}

The scikit-learn library has a handy class “GridSearchCV” to find optimal hyperparameter values through cross-validation.

I used k-fold cross-validation with 5 folds.
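To get a feel for how much work an exhaustive search over this grid involves, the number of candidate settings and cross-validation fits can be counted directly. The sketch below rebuilds the grid listed above with the standard library only; the variable names are illustrative.

```python
from itertools import product

grid = {
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": list(range(10, 111, 10)) + [None],  # 10, 20, ..., 110, plus no limit
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}
n_folds = 5

# Every combination of one value per hyperparameter is a candidate model.
combinations = list(product(*grid.values()))
print(len(combinations))            # 1296 candidate settings
print(len(combinations) * n_folds)  # 6480 cross-validation fits
```

With this many fits, running the search in parallel (n_jobs=-1) makes a noticeable difference.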

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    RFC = RandomForestClassifier()  # assumed: the original snippet does not show how RFC was created
    gd_sr = GridSearchCV(estimator=RFC, param_grid=grid_param, scoring='accuracy', cv=5, n_jobs=-1)
    gd_sr.fit(X_rus, y_rus)
    print(gd_sr.best_params_)

After identifying the best hyperparameters, I fit the model with the maximum number of trees my machine could compute in a relatively short amount of time.

    from sklearn.ensemble import RandomForestClassifier

    Best_RFC = RandomForestClassifier(n_estimators=8000, max_features='auto', max_depth=20,
                                      min_samples_split=5, min_samples_leaf=1,
                                      bootstrap=True, criterion='gini')

    # fit best model to training dataset
    Best_RFC.fit(X_rus, y_rus)

Finally, I predicted the origin of fish samples from the test set and calculated the accuracy of the model:

    from sklearn import metrics

    # predict test Y values
    ypred = Best_RFC.predict(test_X)
    print("Accuracy:", metrics.accuracy_score(test_y, ypred))

The model predicted the geographic origin of each sample in the test set with 89% accuracy.

My prediction accuracy was higher than that reported in studies that classified the origin of similar fish species.

This exercise was limited by a small sample size; the species I studied are not caught often.

A classification matrix gives us insight into how the model predictions relate to the observed classes.

The random forest model predicted fish samples from the Gulf of Mexico with much greater accuracy than those from the East Atlantic and West Atlantic.

This suggests bone shape is more distinctive in the Gulf of Mexico than in other regions.
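A classification matrix and the per-region accuracy it implies can be computed as in the sketch below. The observed and predicted labels are hypothetical, made up to show the mechanics, and do not reproduce the study's results; the helper names are illustrative.

```python
def confusion_matrix(observed, predicted, classes):
    """Count matrix with observed classes as rows and predicted classes as columns."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for obs, pred in zip(observed, predicted):
        matrix[index[obs]][index[pred]] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each observed class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(matrix)]

# Hypothetical labels (0-East Atlantic, 1-Gulf of Mexico, 2-West Atlantic).
observed = [1, 1, 1, 1, 0, 0, 2, 2]
predicted = [1, 1, 1, 1, 0, 2, 2, 1]

cm = confusion_matrix(observed, predicted, classes=[0, 1, 2])
print(cm)                    # [[1, 0, 1], [0, 4, 0], [0, 1, 1]]
print(per_class_recall(cm))  # [0.5, 1.0, 0.5]
```

Reading along a row shows where samples from one region were misassigned, which overall accuracy alone hides.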

This exercise shows the value of machine learning in fisheries science, and that the technique I suggest can predict the origin of fish samples.

Given the small sample size, I think the features I engineered offer a strong predictive capacity.

I appreciate any feedback and constructive criticism.

The code associated with this analysis can be found on github.com/njermain.
