Georgia (and Machine Learning) on my mind

That’s Okefenokee Swamp.

Also note the distribution bars at the top and right.

Let’s also take a look at the distributions of the condition ratings and sufficiency ratings, the variables we are trying to predict.

Condition Rating is a discrete score from 0 to 9 that is given to each of three locations on a bridge: the Deck, the Superstructure, and the Substructure.

Think of the Deck as what a car drives on, the Superstructure as mainly the beams that hold the deck up, and the Substructure as the piers, columns, and abutments that everything else sits on.

Sufficiency rating, on the other hand, is a continuous score from 0 to 100 that is based heavily on condition rating, but also on other geometric factors.

We can see that all three locations have very similar distributions, and the superstructure seems to be in slightly better shape than the other two.

Condition Rating

Machine Learning

In order to predict the bridge ratings, we have to prepare the data to be used in the machine learning models.

This is called Preprocessing.

Most of the time you do not simply dump the data into a black box model and hope for the best.

In our case, the features that were selected are the following:

Features: Latitude, Longitude, Elevation, Age, Structure Length, Design Load, Roadway Width, Average Daily Traffic (ADT), Percent Trucks, Degrees Skew, Horizontal Clearance

These features were selected mainly through domain knowledge and are not generated artificially from the dataset.

They all represent fairly accessible information about a bridge that an engineer could obtain.

All of the features are on different scales, and many have outliers, as we saw in the Bridge Length plot above.

These outliers, and the relative scale of each feature, could have an outsized effect on the model produced from the data.

Since this is not desirable, we use the QuantileTransformer from sklearn.preprocessing to prepare the data.
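To make the effect concrete, here is a minimal sketch of what QuantileTransformer does to a skewed feature with an extreme outlier. The data is synthetic (a made-up stand-in for something like structure length), and the n_quantiles and output_distribution settings are assumptions for illustration, not necessarily the ones used in the full analysis.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in for a skewed feature such as structure length.
# These values are made up for illustration, not taken from the bridge data.
rng = np.random.default_rng(0)
lengths = rng.lognormal(mean=3.0, sigma=0.5, size=200)
lengths[0] = 10_000.0  # one extreme outlier
X = lengths.reshape(-1, 1)

# Map each value to its position in the empirical distribution.
# n_quantiles must not exceed the number of samples.
qt = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=0
)
X_scaled = qt.fit_transform(X)

print("raw range:   ", X.min(), X.max())
print("scaled range:", X_scaled.min(), X_scaled.max())
```

Because the transform works on ranks rather than raw magnitudes, the outlier simply lands at the top of the [0, 1] range instead of stretching the scale for every other bridge.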

Machine Learning Pipeline

For most models in this analysis, a pipeline was constructed with a transformer and GridSearch over hyperparameters.

The models were fitted to the data using nested cross-validation and evaluated on a hold-out set.

Here is sample code for the Ridge Regression model.
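As a rough illustration of that setup, the sketch below wires a QuantileTransformer and a Ridge model into a Pipeline, tunes it with GridSearchCV, scores it with nested cross-validation, and finally checks it on a hold-out set. The data is synthetic, and the alpha grid, fold counts, and split size are all assumed values, so treat this as one plausible arrangement rather than the exact code from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import (
    GridSearchCV, KFold, cross_val_score, train_test_split
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic placeholder data: 11 features and a continuous target.
# This is NOT the bridge dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 11))
y = X @ rng.normal(size=11) + rng.normal(scale=0.1, size=300)

# Keep a hold-out set that the model never sees during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("transform", QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=42)),
    ("model", Ridge()),
])

# Hypothetical hyperparameter grid, chosen only for illustration.
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes alpha
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates error

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="r2")

# Nested cross-validation: each outer fold runs its own grid search.
nested_scores = cross_val_score(
    search, X_train, y_train, cv=outer_cv, scoring="r2"
)

# Refit on the full training set, then evaluate once on the hold-out set.
search.fit(X_train, y_train)
holdout_r2 = search.score(X_test, y_test)

print("nested CV R^2:", nested_scores.mean())
print("best alpha:   ", search.best_params_["model__alpha"])
print("hold-out R^2: ", holdout_r2)
```

Putting the transformer inside the pipeline means it is refit on each training fold, so no information from the validation folds leaks into the scaling step.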

Model Performance

The model results after all steps were performed are shown below.

Regression models

Classification Models

Overall, we can see that the classification models outperform the regression models.

This is mainly due to the Design Load feature.

Design Load, as used in our model, is not truly a continuous variable, but it does carry new information for the model, so a classifier is better able to use that information than a regression model.

The Logistic Regression binary classifier has an advantage over all the other models, which is discussed later.

Model Evaluation

One of the best ways to evaluate a classifier model is to develop a confusion matrix.

For any number n of classes, the confusion matrix is an n x n matrix in which the diagonal entries represent correct predictions and any off-diagonal value is an error in prediction.

In the case of binary classification (is something, is not something), the values off the diagonal of the matrix represent false positives and false negatives.
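To see how such a matrix is put together, here is a small plain-Python sketch for the two-class case. The labels and predictions are made up for illustration, and the helper below is a hand-rolled stand-in rather than the library routine used in the analysis. Rows are the true class and columns are the predicted class.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Build an n x n matrix: rows are true classes, columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Made-up labels for six bridges (illustrative only).
labels = ["Good", "Poor"]
y_true = ["Good", "Good", "Poor", "Poor", "Good", "Poor"]
y_pred = ["Good", "Poor", "Poor", "Good", "Good", "Poor"]

matrix = confusion_matrix(y_true, y_pred, labels)

# Diagonal entries are correct predictions; everything else is an error.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(y_true)

for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy)
```

Treating Poor as the positive class, the top-right entry counts false positives (Good bridges flagged as Poor) and the bottom-left entry counts false negatives (Poor bridges that were missed).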

The confusion matrix for the Random Forest classifier is a good example of the output, shown below.

Random Forest, non-random results

Best Performing Model

We notice that the 2-class Logistic Regression model outperforms all other models, but why is it head and shoulders above the rest? The reason is that we mapped the 10 different condition rating values to 2 classes, Poor and Good.

If any of the three locations has a condition rating of 4 or less, then the bridge is rated as Poor.
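That rule is simple enough to write down directly. The function name and argument names below are hypothetical, chosen only to illustrate the mapping:

```python
def bridge_class(deck, superstructure, substructure, threshold=4):
    """Return "Poor" if any location is rated at or below the threshold."""
    lowest = min(deck, superstructure, substructure)
    return "Poor" if lowest <= threshold else "Good"

print(bridge_class(7, 6, 5))  # every location is above 4
print(bridge_class(7, 4, 8))  # the superstructure is rated 4
```

The worst of the three ratings decides the class, so a single weak location is enough to mark the whole bridge as Poor.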

In general, as you reduce the number of classes, the model tends to perform better, and this holds true in our case.

Conclusion

With this analysis, we have shown that we can meaningfully predict the outcomes of bridge sufficiency ratings and bridge condition ratings in the state of Georgia using easily obtainable data from the FHWA.

The most successful model from this analysis is the logistic regression binary classifier, which can predict whether a bridge will pass or fail inspection with an accuracy of 94%.

These models could be deployed to aid the DOT in its asset management work, providing value to the department and the taxpayers.

This article only scratches the surface of the full analysis.

Check out the full notebook and supporting code on GitHub here.

Cheers!