Quality Control with Machine Learning

We have removed the noise caused by Other_Faults and now the classes are well separated.

SECOND MODEL

At this point we try a model that does not take the Other_Faults class into account.

We fit a Gradient Boosting classifier as above.

X_train2, y_train2 = X_train[y_train != 'Other_Faults'], y_train[y_train != 'Other_Faults']
X_test2, y_test2 = X_test[y_test != 'Other_Faults'], y_test[y_test != 'Other_Faults']
gbc2 = GradientBoostingClassifier(n_estimators=500)
gbc2.fit(X_train2, y_train2)

Now the ACCURACY is 0.909, improving by 10 percentage points.
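As a self-contained illustration of this filtering step, here is a minimal sketch on synthetic data; `X_demo`, `y_demo`, the class names and the cluster offsets are all invented for the example and do not come from the steel-plates dataset:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the steel-plates data: three separable fault
# classes plus an ambiguous Other_Faults class (data is fake, the names
# only mirror the post).
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(400, 5))
y_demo = np.array(['Bumps', 'Scratches', 'Stains', 'Other_Faults'] * 100)
X_demo[y_demo == 'Bumps'] += 3
X_demo[y_demo == 'Scratches'] -= 3
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# Drop the ambiguous class from both splits before fitting, then score
# only on the unambiguous test samples.
gbc_demo = GradientBoostingClassifier(n_estimators=100)
gbc_demo.fit(Xtr[ytr != 'Other_Faults'], ytr[ytr != 'Other_Faults'])
acc = accuracy_score(yte[yte != 'Other_Faults'],
                     gbc_demo.predict(Xte[yte != 'Other_Faults']))
print(f"accuracy without Other_Faults: {acc:.3f}")
```

On this toy data the classes are well separated, so the accuracy is high by construction; the point is only the mechanics of excluding the ambiguous class.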

This is certainly a good result and confirms our reasoning, but this second model reproduces an unrealistic scenario.

In this way we are assuming that the Other_Faults class doesn’t exist and that all the faults are easy to distinguish and label.

With our first model we have shown that this is not our case.

So we need a way to translate into machine learning language the uncertainty that appears when people try to classify an ambiguous steel plate.

IMPOSE A THRESHOLD

I have encoded this uncertainty by imposing a threshold on each class in our final predictions.

To build this threshold I made predictions with our second model on the Other_Faults samples and stored them, keeping them separated by predicted class (as shown below).

def predict(feature, threshold_map=None):
    conf = np.max(gbc2.predict_proba(feature))
    label = gbc2.predict(feature)[0]
    if threshold_map and label in threshold_map:
        if conf >= threshold_map[label]:
            return {"label": label, "confidence": conf}
        elif conf < threshold_map[label]:
            return {"label": "Other_Faults", "confidence": conf}
    return {"label": label, "confidence": conf}

pred_lab, pred_conf = [], []
# 'label' here is the Series of true classes for df
for row in df[label == 'Other_Faults'].values:
    pred_lab.append(predict([row])['label'])
    pred_conf.append(predict([row])['confidence'])

other_pred = pd.DataFrame({'label': pred_lab, 'pred': pred_conf})
diz_score = other_pred.groupby('label')['pred'].apply(list).to_dict()
plt.boxplot(diz_score.values(), labels=diz_score.keys())
plt.show()

probability scores and associated predicted labels in Other_Faults

Next I calculated a moving threshold for each predicted class: the 30th percentile (red squares) of each class’s score distribution.

threshold_p = {}
for lab in diz_score.keys():
    threshold_p[lab] = np.percentile(diz_score[lab], 30)

plt.boxplot(list(diz_score.values()), labels=list(diz_score.keys()))
plt.plot(range(1, len(diz_score.keys()) + 1), list(threshold_p.values()), 'rs')
plt.show()

threshold for every probability score distribution

Practically, we use this threshold to decide whether a steel plate belongs with certainty to a given fault class.

If our prediction is below the threshold, we do not have enough confidence to classify the sample, so we label it as Other_Faults.
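The decision rule itself can be sketched in isolation; `apply_threshold` and the `demo_thresholds` values below are hypothetical placeholders, not the fitted 30th percentiles:

```python
def apply_threshold(label, conf, threshold_map):
    """Keep the predicted label only when the confidence clears the
    per-class threshold; otherwise fall back to the indecision class."""
    if label in threshold_map and conf < threshold_map[label]:
        return 'Other_Faults'
    return label

# Illustrative per-class thresholds (invented for the example).
demo_thresholds = {'Bumps': 0.60, 'Scratches': 0.55}

print(apply_threshold('Bumps', 0.85, demo_thresholds))  # confident -> 'Bumps'
print(apply_threshold('Bumps', 0.40, demo_thresholds))  # uncertain -> 'Other_Faults'
```

A class with no entry in the map (for example one never predicted on the Other_Faults samples) keeps its label unchanged.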

Adopting this technique we achieve an ACCURACY of 0.861 (on test data without Other_Faults).

If we increase the threshold we lose points in accuracy, but we gain a higher precision, and vice versa.
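A toy binary sweep shows the same trade-off; the confidence scores here are synthetic (drawn from two overlapping normals), not the output of our model:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Synthetic confidences: positives tend to score higher than negatives.
rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, 1000)
scores = np.where(y_true == 1,
                  rng.normal(0.7, 0.2, 1000),
                  rng.normal(0.4, 0.2, 1000))

# Raising the decision threshold keeps only confident positives:
# precision rises while accuracy can fall.
for thr in (0.3, 0.5, 0.7):
    y_pred = (scores >= thr).astype(int)
    print(f"thr={thr:.1f}  accuracy={accuracy_score(y_true, y_pred):.2f}  "
          f"precision={precision_score(y_true, y_pred, zero_division=0):.2f}")
```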

Red: Accuracy, Blue: Precision

Regarding the Other_Faults class, we are assuming that it exists in the form of an ‘indecision class’, which contains all the samples classified by the model with low confidence.

At the same time we are assuming that all the samples of the original Other_Faults class belong to the class indicated by the model, if the confidence is higher than the threshold (we trust this).

In the end, if we plot our original data again, adopting our resizing of the Other_Faults class, we can see a noise reduction (pink dot concentration).

final_pred = []
for row in df.values:
    final_pred.append(predict([row], threshold_map=threshold_p)["label"])

encoder_final = LabelEncoder().fit_transform(final_pred)

tsne = TSNE(n_components=2, random_state=42, n_iter=300, perplexity=5)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(np.array(df))

fig, ax = plt.subplots(figsize=(16, 9))
colors = {0: 'red', 1: 'blue', 2: 'green', 3: 'pink', 4: 'black', 5: 'orange', 6: 'cyan'}
ax.scatter(T.T[0], T.T[1], c=[colors[i] for i in encoder_final])
plt.show()

TSNE on ALL the data with threshold

SUMMARY

In this post I propose a workflow for fault classification in Quality Control.

I received as input some samples of steel plates and started to analyze them in order to classify faults correctly.

After the first step I noticed some ambiguous behaviours in the data structure, so I started to investigate… Based on my human insight, I suggested a new vision of the problem, tried to solve it, and here propose my personal solution.

Keep in touch: Linkedin
