Predicting Invasive Ductal Carcinoma using Convolutional Neural Network (CNN) in Keras

Predicting Invasive Ductal Carcinoma using Convolutional Neural Network (CNN) in KerasClassifying histopathology slides as malignant or benign using CNNBikram BaruahBlockedUnblockFollowFollowingJan 3In this blog, we will learn how to use CNN in a real world histopathology dataset.

Real-world data requires a lot more preprocessing than standard datasets such as MNIST, and we will go through the process of making the data ready for classification and then use CNN to classify the images.

I will also discuss the CNN architecture used and some of the hyperparameters tuned when building the model.

The different sections of this blog are:IntroductionUnderstanding the datasetLoading the datasetData preprocessingTackling data imbalance by random undersamplingModel architectureCompiling the modelData AugmentationTraining the modelMaking predictionsEvaluating the models performanceI assume that the reader knows what convolutional neural networks are and the their basic working mechanism.

In this blog, we will only discuss the code which is important in the context of deep learning, and also avoid repeated explanation of code used more than once.

The Github repository for this post can be found here.

I’d suggest you to follow along the Jupyter Notebook while going through this tutorial.

IntroductionOver the past couple of years, there has been a rise in using deep learning for medical image analysis with increasing success.

Deep learning in the field of healthcare is used to identify patterns, classify and segment medical images.

As with most image related tasks, convolutional neural networks are used to do this.

The classification problem tackled here is to classify histopathology slides of Invasive Ductal Carcinoma (IDC) as either malignant or benign.

Histopathology slide of a malignant tumour.

Image credits: Department of Pathology at Johns Hopkins UniversityIDC is a type of breast cancer, where the cancer has spread to the surrounding breast tissue.

Cancer tumours can be classified into two types: malignant and benign.

A benign tumor is one which does not invade its surrounding tissues whereas a malignant tumor is one which may spread to its surrounding tissues or other parts of the body.

Understanding the datasetThe dataset we will use can be downloaded here.

Scroll down to the Dataset Description section of the page and download the 1.

6 GB zip file.

The dataset consists of 162 whole mount slide images of breast cancer specimens scanned at 40x.

From that,277,524 patches of size 50 x 50 were extracted, out of which 198,738 are IDC negative (benign) and 78,786 are IDC positive (malignant).

Each patch’s file name is of the format:u_xX_yY_classC.

png → example 10253_idx5_x1351_y1101_class0.

pngWhere u is the patient ID (10253_idx5), X is the x-coordinate of where this patch was cropped from, Y is the y-coordinate of where this patch was cropped from, and C indicates the class where 0 is non-IDC (benign) and 1 is IDC (malignant)Loading the dataset:imagePatches = glob('C:/Users/asus/Documents/Breast cancer classification/**/*.

png', recursive=True)patternZero = '*class0.

png'patternOne = '*class1.

png'#saves the image file location of all images with file name 'class0' classZero = fnmatch.

filter(imagePatches, patternZero) #saves the image file location of all images with file name 'class1'classOne = fnmatch.

filter(imagePatches, patternOne)The dataset consists of 279 folders, with sub-folders 0 and 1 inside each of the 279 folders.

We first create two variables classZero and classOne, which saves the image locations of all class 0 and class 1 images respectivelydef process_images(lowerIndex,upperIndex): """ Returns two arrays: x is an array of resized images y is an array of labels """ height = 50 width = 50 channels = 3 x = [] #list to store image data y = [] #list to store corresponding class for img in imagePatches[lowerIndex:upperIndex]: full_size_image = cv2.

imread(img) image = (cv2.

resize(full_size_image, (width,height), interpolation=cv2.

INTER_CUBIC)) x.

append(image) if img in classZero: y.

append(0) elif img in classOne: y.

append(1) else: return return x,yWe then create a function process_images which takes as input the starting and end index of the images.

This function first reads the image using OpenCV’s cv2.

imread() and also resizes the image.

Resizing is done because few of the images in the dataset are not 50x50x3.

The function returns two arrays: X, which is an array of the resized image data and Y, which is an array of the corresponding labels.

X, Y = process_images(0,100000)For this tutorial, we will only analyze images from index 0 to 60,000.

The image data (pixel values) are now stored in a list X and their corresponding classes in a list Y.

Data preprocessing:X = np.

array(X)X = X.

astype(np.

float32)X /= 255.

The list X is first converted to a numpy array and then casted to the type float32 to save space.

The images are first normalized by dividing it by 255.

This ensures that all the values are between 0 and 1.

This helps us to train the model faster and also prevents us from falling into the vanishing and exploding gradients problem (insert link to article of vanishing/ exploding gradients).

from sklearn.

model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.

15)The dataset is split into training and test set, with 15% of the entire dataset reserved for testing.

For a dataset of 60,000 , this means 51000 images are reserved for training and 9000 for testing.

Tackling data imbalance by random undersamplingy_train.

count(1) #counting the number of 1y_train.

count(0) #counting the number of 0Counting the number of 1’s and 0’s in the array Y, we find that there are 44478 images of class 0 and 15522 images of class 1.

This problem is known as data imbalance and can cause our model to be more biased towards one particular class, usually the one which has more samples.

Particularly in fields such as healthcare, classifying the minority class (malignant in this case) as majority class (benign in this case) can be very risky.

We will deal with data imbalance by randomly undersampling the majority class, i.

e removing samples of the majority class to make the number of samples of the majority and minority class equal.

y_train = to_categorical(y_train)y_test = to_categorical(y_test)Before that, we need to one-hot-encode the output variable y_train and y_test.

X_trainShape = X_train.

shape[1]*X_train.

shape[2]*X_train.

shape[3]X_testShape = X_test.

shape[1]*X_test.

shape[2]*X_test.

shape[3]X_trainFlat = X_train.

reshape(X_train.

shape[0], X_trainShape)X_testFlat = X_test.

reshape(X_test.

shape[0], X_testShape)We also need to reshape X_train and X_test to use the Random Under Sampler.

from imblearn.

under_sampling import RandomUnderSamplerrandom_under_sampler = RandomUnderSampler(ratio='majority')X_trainRos, Y_trainRos = random_under_sampler.

fit_sample(X_trainFlat, y_train)X_testRos, Y_testRos = random_under_sampler.

fit_sample(X_testFlat, y_test)The parameter ‘ratio=majority’ states the random under sampler to under sample the majority class.

Checking the number of samples of each class again after performing random undersampling, we find that we have equal number of samples of both the classes.

The image data is then converted back to its original shape of 50 x 50 x 3.

Model ArchitectureWe use a similar architecture to the one discussed in this paper.

batch_size = 256num_classes = 2epochs = 80model = Sequential()model.

add(Conv2D(32, kernel_size=(3,3), activation='relu', input_shape=(50,50,3)))model.

add(MaxPooling2D(pool_size=(2, 2)))model.

add(Conv2D(64, (3,3), activation='relu'))model.

add(MaxPooling2D(pool_size=(2,2)))model.

add(Conv2D(128, (3, 3), activation='relu'))model.

add(Conv2D(256, (3, 3), activation='relu'))model.

add(Flatten()) #this converts our 3D feature maps to 1D feature vectors for the dense layer belowmodel.

add(Dropout(0.

5))model.

add(Dense(128, activation='relu'))model.

add(Dropout(0.

5))model.

add(Dense(128, activation='relu'))model.

add(Dense(num_classes, activation='sigmoid'))The model is a sequential which allows us to create the model layer-by-layer.

The architecture consists of convolutional layers, max pooling layers, dropout layers and fully connected layers.

The first layer is a convolutional layer with 32 filters each of size 3 x 3.

We are also required to specify the input shape in the first layer, which is 50 x 50 x 3 in our case.

We will be using the Rectified linear unit (ReLU) activation function for all the layers except the final output layer.

ReLU is the most common choice for activation function in the hidden layers and has shown to work pretty well.

The second layer is a pooling layer.

The pooling layers are used to reduce dimension.

Max Pooling with a 2×2 window only considers the maximum value in the 2×2 window.

The third layer is again a convolutional layer of 64 filters each of size 3 x 3 followed by another max pooling layer of 2×2 window.

Usually, the number of filters in the convolutional layer grows after each layer.

The first layers with lower number of filter learns simple features of the images whereas the deeper layers learn more complex features.

The next two layers are again convolutional layers with the same filter size but increasing number of filters; 128 and 256.

We need to flatten the 3D feature map output from the convolutional layer to 1D feature vectors before adding in the fully connected layers.

This is where the flatten layer comes in.

The next layer is a dropout layer with a dropout rate of 0.

5 .

A dropout layer with dropout rate of 0.

5 means 50% of the neurons will be turned off randomly.

This helps prevent overfitting by making all the neurons learn something about the data and not rely on just a few neurons.

Randomly dropping neurons during training means other neurons will have to do the work of the turned-off neurons, thus generalizing better and prevent overfitting.

The value 0.

5 is taken from the original paper by Hinton (2012), which has proved to be very effective.

These dropout layers are added after each of the fully connected layers before the output.

Dropout also reduces the training time for each epoch.

The following dense layer (fully connected layer) has 128 neurons.

This is followed by another dropout layer with a dropout rate of 0.

5The next layer is another dense layer with 128 neurons.

The final output layer is another dense layer which has number of neurons equal to the number of classes.

The activation function in this layer is sigmoid because the problem in hand is a binary classification problem.

For multi-class classification problem, the activation function should be set to softmax.

Compiling the modelmodel.

compile(loss=keras.

losses.

binary_crossentropy, optimizer=keras.

optimizers.

Adam(lr=0.

00001), metrics=['accuracy'])The model is compiled with binary cross entropy loss function and the Adam optimizer is used.

The ‘accuracy’ metric is used to evaluate the model.

Adam is an optimization algorithm which updates the network weights in an iterative manner.

Although the initial learning rate for Adam can be set (we have set it to 0.

00001 in our case), this is the initial learning rate and the learning rate for each parameter is adapted as training begins.

This is how Adam (short for adaptive moment estimation) is different from stochastic gradient descent, which maintains a single learning rate for all weight updates.

A detailed explanation of the Adam optimization algorithm can be found hereThe learning rate determines how fast we are adjusting the weights of our network towards the local minima.

Too high of a learning rate can result in such high weight changes that it might result in overshooting the local minima.

This causes the training or validation error to fluctuate drastically between consecutive epochs.

Too low of a learning rate can result in taking longer time to train our network.

Thus, the learning rate is one of the most important hyperparameters that needs to be tuned when building the model.

Data augmentationdatagen = ImageDataGenerator( featurewise_center=True, featurewise_std_normalization=True, rotation_range=180, horizontal_flip=True,vertical_flip = True)Generally, the more data we have, the better deep learning tends to work.

Keras ImageDataGenerator generates real time images during training using data augmentation.

Transformations are performed on the mini-batches on-the-fly.

Data augmentation helps generalize the model by reducing the network’s capacity to overfit the training data.

Rotation, vertical and horizontal flipping are some of the common data augmentation techniques used.

Keras ImageDataGenerator provides a variety of data augmentation techniques.

However, we will only use few of them.

A histopathology slide labelled as malignant is still malignant if it is rotated 20 degrees and flipped vertically.

Training the modelTraining the model using a GPU speeds up the training process.

You will need a NVIDIA GPU to do so.

I followed this tutorial to enable GPU training.

We set the number of epochs to a high number , in our case 80, and use a regularization method called Early Stoppingearly_stopping_monitor = EarlyStopping(monitor='val_loss', patience=3, mode='min')Early stopping is a method used to avoid overfitting by stopping the training process when the parameter set to observe does not improve for a certain number of epochs.

In our case, we tell EarlyStopping to monitor val_loss and if it does not improve for 3 epochs continuously, stop the training process.

”The batch size is usually set to a power of 2 because it is more computationally efficient.

We have set it to 256.

We use another Keras callback called ModelCheckpointmodel_checkpoint = ModelCheckpoint('best_model.

h5', monitor='val_loss', mode='min', verbose=1, save_best_only=True)Model checkpoint is used to save the model.

The monitor parameter allow us to set a metric which we want to keep an eye on.

In our case, we only save the model when the validation loss is the minimum.

We save the best model to be used later to make predictions and thus evaluate the model’s performance.

Combining both these callbacks, the best model (which has the minimum validation loss) is saved and following that, if validation loss does not improve (decrease) over the next 3 epochs (set by EarlyStopping), the model training stops.

training = model.

fit_generator(datagen.

flow(X_trainRosReshaped,Y_trainRosHot,batch_size=batch_size),steps_per_epoch=len(X_trainRosReshaped) / batch_size, epochs=epochs,validation_data=(X_testRosReshaped, Y_testRosHot), verbose=1, callbacks=[early_stopping_monitor, model_checkpoint])Because we are using ImageDataGenerator on the fly, we use model.

fit_generator to train the model.

We set it to a variable ‘training’ as we will later plot the training loss and validation loss.

This helps us get an idea of the variance, i.

e the difference between training error and validation set error.

For validation, we will use X_testRosReshaped and Y_testRosHot which we obtained after under-sampling the x_test and y_test set.

Training stops after 37 epochs due to Early Stopping.

Hence, the best model saved is the one during epoch 34, with a validation accuracy of 79.

10%val_loss stopped improving after Epoch 34Plotting the training set and validation set loss, we find that the variance is very low.

This plot ensures that our model is not overfitting.

Training and testing set lossMaking predictionsfrom keras.

models import load_modelfrom sklearn import metricsmodel = load_model('best_model.

h5')y_pred_one_hot = model.

predict(X_testRosReshaped)y_pred_labels = np.

argmax(y_pred_one_hot, axis = 1)y_true_labels = np.

argmax(Y_testRosHot,axis=1)We load the best model saved by ModelCheckpoint and use the predict function to predict the classes of the images in the array X_testRosReshaped.

The predictions are now stored in the list y_pred_labels.

Evaluating the model’s performanceconfusion_matrix = metrics.

confusion_matrix(y_true=y_true_labels, y_pred=y_pred_labels)We use a confusion matrix to evaluate the models performance.

The confusion matrix in a binary classification matrix has four quadrants; false positives, false negatives, true positives and true negatives.

For our case, the four quadrants of the confusion matrix can be simplified as follows:Confusion matrix with description of the 4 quadrants for our caseThe confusion matrix we get is:Resultant confusion matrixIn cases such as this, having a lower false negative is better than having a lower false positive.

This is because identifying a malignant tumour as benign is more dangerous than identifying a benign tumour as malignant, since the former will result in the patient receiving a different treatment due to misdiagnosis, and the latter is likely to go through further tests anyway.

We can see that our model performs well with an accuracy of 79.

10% on the test set.

The confusion matrix only is also favourable for us and we have a model with low variance.

Applying deep learning is an iterative process.

You can try and improve this model further by tuning the hyperparameters such as learning rate of the optimization algorithm, changing the batch size, changing the filters of the convolutional layers, adding in more layers or using more data.

Thanks for reading.

If you have any questions, comment them down below.

I will be writing regularly on topics such as deep learning, so follow me here on Medium.

I am also available on LinkedIn! 🙂 Happy coding.

.. More details

Leave a Reply