Deep learning to identify Malaria cells using CNN on Kaggle

Karan Bhanot, Apr 5

Photo by Kendal James on Unsplash

Deep learning has wide-ranging applications, and its use in the healthcare industry has always fascinated me.

As a keen learner and a Kaggle noob, I decided to work on the Malaria Cells dataset to get some hands-on experience and learn how to work with Convolutional Neural Networks, Keras and images on the Kaggle platform.

One of the many things I like about Kaggle is the immense knowledge it holds in the form of Kernels and Discussions.

Taking cues and references from various kernels and experts really helped me get better at producing highly accurate results.

Do look at other kernels and understand their approach to gain more insights for your own development and knowledge building.

Quick sources

Dataset: https://www.[…]com/iarunava/cell-images-for-detecting-malaria

Kaggle notebook: https://www.[…]com/bhanotkaran22/keras-cnn-data-augmentation

Import libraries and dataset

I began by importing numpy, pandas, and matplotlib.

I decided to use Keras with the TensorFlow backend to implement the CNN model.

So, I imported a number of layers from keras.layers, including Convolution2D, MaxPooling2D, Flatten, Dense, BatchNormalization, and Dropout.

I used the Sequential model.

To work with the images in the dataset, I imported the os and cv2 packages and the Image module from PIL.

Import dataset

On Kaggle, all data files are located inside the input folder, which is one level up from where the notebook is located.

The images are inside the cell_images folder.

Thus, I set up the data directory as DATA_DIR to point to that location.

To store the features, I used the variable dataset and for labels I used label.

For this project, I set each image size to be 64×64.


    DATA_DIR = '[…]/input/cell_images/cell_images/'
    SIZE = 64
    dataset = []
    label = []

The next step was to import the data.

The parasitized (infected) cell images are inside the Parasitized folder and uninfected images are inside the Uninfected folder.

For both folders, I iterated through all files with extension png.

For parasitized cell images, I read each image using cv2.imread(), converted it from an array to an image using Image.fromarray() and resized it to 64×64.

Finally, I appended it to the dataset variable and appended 0 to label for each of these images.

I repeated the same process for uninfected cell images but set the label as 1 this time.
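The loading loop described above might look roughly like the sketch below. The helper names list_png and load_folder and the data_dir variable in the usage comment are hypothetical; only the calls to cv2.imread(), Image.fromarray() and resize() come from the description. OpenCV and Pillow are imported inside the function so that the file-listing helper can be tried on its own.

```python
import os

import numpy as np

SIZE = 64  # target width and height, as in the article


def list_png(folder):
    """Return the sorted names of all .png files in `folder`."""
    return sorted(f for f in os.listdir(folder) if f.lower().endswith(".png"))


def load_folder(folder, label_value, size=SIZE):
    """Read every PNG in `folder`, resize it and pair it with `label_value`."""
    # Imported here so that list_png() works without OpenCV or Pillow.
    import cv2
    from PIL import Image

    images, labels = [], []
    for name in list_png(folder):
        pixels = cv2.imread(os.path.join(folder, name))  # NumPy array, or None
        if pixels is None:  # unreadable file: skip it
            continue
        image = Image.fromarray(pixels).resize((size, size))
        images.append(np.array(image))
        labels.append(label_value)
    return images, labels


# Hypothetical usage, with data_dir pointing at the cell_images folder:
# dataset, label = load_folder(os.path.join(data_dir, "Parasitized"), 0)
# more_images, more_labels = load_folder(os.path.join(data_dir, "Uninfected"), 1)
```

Filtering on the file extension keeps any non-image files in the folders out of the dataset.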

Visualizing data

I used matplotlib to plot 5 randomly chosen parasitized cells and 5 randomly chosen uninfected cells.

[Figure: Parasitized cells]

[Figure: Uninfected cells]

Applying CNN

The Convolutional Neural Network is one of the most effective neural network architectures for working with images and classifying them.

I used Keras to create the model.

Convolution2D

This layer creates convolution kernels that are applied to the layer input.

I set a few properties as defined below:

filters: The first parameter defines the number of output filters, that is, the depth of the layer's output.
In this case, for both layers I kept the value as 32.

kernel_size: It defines the size of the window that traverses the image.
I set it as 3×3.

input_shape: It defines the input size of each image.
In this project, I am using images of size 64×64, and the images are coloured, i.e. they are composed of red, green and blue channels.
The number of channels is thus 3, so the parameter input_shape will be (64, 64, 3).
We need to define input_shape only for the first layer.

activation: The activation function is defined in this parameter.
I used relu, the Rectified Linear Unit, as the activation function.
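To make the parameters above concrete, here is a minimal NumPy sketch of what a single convolution layer with 32 filters, a 3×3 kernel and relu computes on one 64×64×3 image. It is not how Keras implements the layer: the function name conv2d_relu is made up for illustration, the bias term is left out, and the kernels are random rather than learned. It assumes no padding and a stride of 1.

```python
import numpy as np


def conv2d_relu(image, kernels):
    """Valid (no padding), stride-1 convolution followed by ReLU.

    image:   (height, width, channels)
    kernels: (k, k, channels, filters)
    returns: (height - k + 1, width - k + 1, filters)
    """
    k, _, _, filters = kernels.shape
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w, filters))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + k, j:j + k, :]  # (k, k, channels)
            # Multiply the window with every kernel and sum over k, k, channels.
            out[i, j, :] = np.tensordot(window, kernels, axes=3)
    return np.maximum(out, 0.0)  # ReLU: negative values become 0


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))            # one 64x64 RGB image
kernels = rng.normal(size=(3, 3, 3, 32))   # 32 filters of size 3x3
features = conv2d_relu(image, kernels)
print(features.shape)  # (62, 62, 32)
```

A 3×3 window fits into a 64-pixel side in 62 positions, which is why the output is 62×62 rather than 64×64.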

MaxPool2D

It is used to downscale the outputs, and I used the following parameters:

pool_size: It defines the size of the window whose pixel values are reduced to a single value, the maximum.
I used 2×2, so an input of size 62×62 will be converted to 31×31.

data_format: It describes whether the channels come first or last in the input.
Since the channel is the third value in (64, 64, 3), I set data_format as channels_last.
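The 62×62 to 31×31 reduction can be reproduced with a few lines of NumPy. This is only a sketch of 2×2 max pooling on a channels_last array, with a made-up helper name max_pool_2x2, not the Keras implementation.

```python
import numpy as np


def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a channels_last array (h, w, c)."""
    h, w, c = x.shape
    h2, w2 = h // 2, w // 2
    x = x[:h2 * 2, :w2 * 2, :]            # drop an odd last row or column
    blocks = x.reshape(h2, 2, w2, 2, c)   # split into 2x2 blocks
    return blocks.max(axis=(1, 3))        # keep the largest value per block


x = np.arange(62 * 62 * 32, dtype=float).reshape(62, 62, 32)
pooled = max_pool_2x2(x)
print(pooled.shape)  # (31, 31, 32)
```

The number of channels is unchanged; only the height and width are halved.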

BatchNormalization

It normalizes the output from the previous activation function, and I modified just one parameter:

axis: It defines the axis to be normalized.
As I used channels_last, I set the value as -1.
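As a rough illustration of what normalizing along axis -1 means, the sketch below standardizes each channel of a batch using the mean and variance taken over all other axes. It is a simplification: the real layer also learns a scale and a shift per channel and keeps moving averages for use at inference time, none of which is shown here. The function name batch_norm and the eps value are choices made for this example.

```python
import numpy as np


def batch_norm(x, axis=-1, eps=1e-3):
    """Normalize x so each slice along `axis` has mean ~0 and variance ~1."""
    axis = axis % x.ndim
    # Statistics are computed over every axis except the chosen one.
    reduce_axes = tuple(a for a in range(x.ndim) if a != axis)
    mean = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero


rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(8, 31, 31, 32))  # (n, h, w, c)
normalized = batch_norm(batch, axis=-1)
print(normalized.shape)  # (8, 31, 31, 32)
```

With channels_last data, axis=-1 therefore gives one mean and one variance per channel.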

Dropout

It sets a randomly selected fraction of its inputs to 0 during training to help prevent overfitting, and I used only the rate parameter:

rate: Fraction of the inputs to be dropped.
I kept the rate as 0.[…]
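The effect of the rate parameter can be sketched in NumPy as follows. The rate of 0.5 used here is only an illustrative value, not the one from the notebook, and the function name dropout is made up for this example. Kept values are scaled up by 1 / (1 - rate), so the expected sum of the activations stays the same, and nothing is dropped outside of training.

```python
import numpy as np


def dropout(x, rate, rng, training=True):
    """Zero out roughly `rate` of the entries of x and rescale the rest."""
    if not training or rate == 0.0:
        return x  # dropout is only active during training
    keep = rng.random(x.shape) >= rate   # True for the entries that survive
    return x * keep / (1.0 - rate)       # rescale so the expected sum is unchanged


rng = np.random.default_rng(0)
activations = np.ones(10000)
dropped = dropout(activations, rate=0.5, rng=rng)  # assumed rate, for illustration
print((dropped == 0).mean())  # fraction of zeroed entries, close to 0.5
```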


Flatten

It flattens the complete n-dimensional matrix into a single array.

So, if its size was 64×64×3, it will be converted to an array of size 12,288.

It acts as the input for the dense layer ahead.
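The 64×64×3 example can be checked with a NumPy reshape. This sketch assumes a batch of 10 images and keeps the batch axis, so only the three image axes are merged.

```python
import numpy as np

batch = np.zeros((10, 64, 64, 3))          # 10 images, channels_last
flat = batch.reshape(batch.shape[0], -1)   # keep the batch axis, merge the rest
print(flat.shape)  # (10, 12288)
```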

Dense

It defines a densely connected neural network layer, and I defined the following parameters:

activation: It defines the activation function, which I set as relu for every layer except the last (output) layer.
For the last dense layer, I set the activation as sigmoid.

units: It defines the number of neurons in the given layer.
I created three layers with 512, 256 and 2 neurons respectively.

Structure of the CNN Model in this project

I created a Sequential model for the CNN.

I created a Convolution layer followed by a MaxPooling layer.

These are followed by BatchNormalization, which normalizes the output from the previous layers, and then by Dropout for regularization.

Another set of these layers is then appended.

I then Flatten the outputs.

The flattened outputs are then passed to a fully connected network made up of three dense layers with 512, 256 and 2 nodes.

The last layer is the output layer with the activation function sigmoid.

You can read more about activation functions here.

The last step is to compile the model.

The optimizer is adam and, since this is a classification problem with categorical (one-hot) labels, I used categorical_crossentropy as the loss and accuracy as the evaluation metric.
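Put together, the structure described above could be written roughly as follows. This is a sketch based only on the description in this article, not a copy of the notebook. It uses tensorflow.keras and Conv2D (Convolution2D is an alias of the same layer), declares the input with an Input layer instead of the input_shape argument, and takes the dropout rate as a parameter whose default of 0.2 is an assumed value. The function name build_model is hypothetical.

```python
def build_model(dropout_rate=0.2):
    """Sketch of the described CNN; dropout_rate is an assumed value."""
    # Imported inside the function so this file loads without TensorFlow.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(64, 64, 3)),
        # First block: convolution, pooling, normalization, dropout.
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), data_format="channels_last"),
        layers.BatchNormalization(axis=-1),
        layers.Dropout(dropout_rate),
        # Second block with the same settings.
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), data_format="channels_last"),
        layers.BatchNormalization(axis=-1),
        layers.Dropout(dropout_rate),
        # Classifier: flatten, then three dense layers.
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(2, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Hypothetical usage:
# model = build_model()
# model.summary()
```

Calling model.summary() prints the output shape of every layer, which is a quick way to check the size going into the first dense layer.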

Training and accuracy

I split the dataset into 80% training data and 20% testing data.

Using the fit method, I trained the model with X_train and y_train.

I trained for 50 epochs, that is, 50 passes over the complete training data, with a batch size of 64.

I also set a validation split of 0.1, so the model trained on 90% of the training data and validated on the remaining 10%.
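The article does not show how the split was made, so the sketch below is a plain NumPy stand-in for it: it one-hot encodes the labels (which categorical_crossentropy expects) and shuffles the data into 80% training and 20% testing parts. The helper names one_hot and split are made up, the zero-filled arrays stand in for the real images, and the fit call is left as a comment because it needs a compiled model.

```python
import numpy as np


def one_hot(labels, num_classes=2):
    """Turn integer labels such as [0, 1, 1] into rows like [1, 0], [0, 1]."""
    return np.eye(num_classes)[np.asarray(labels)]


def split(features, targets, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into training and testing parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(len(features) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (features[train_idx], features[test_idx],
            targets[train_idx], targets[test_idx])


features = np.zeros((100, 64, 64, 3))      # placeholder for the image array
targets = one_hot([0] * 50 + [1] * 50)     # placeholder labels
X_train, X_test, y_train, y_test = split(features, targets)
print(X_train.shape, X_test.shape)  # (80, 64, 64, 3) (20, 64, 64, 3)

# With a compiled Keras model, training as described would then be:
# model.fit(X_train, y_train, batch_size=64, epochs=50, validation_split=0.1)
```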

The model achieved an accuracy of 95.[…]%.


Data augmentation and accuracy improvement

Data augmentation helps increase the size of the dataset and train the model on more, and more varied, data.

The more data the model has to learn from, the better it tends to perform.

Keras provides the ImageDataGenerator class, which can create this data.

Data augmentation

For the training data, I rescaled the images by dividing by 255, zoomed them with a range of 0.3, flipped them horizontally and rotated them within a range of 30 degrees.

For the testing data, I just rescaled the images.

The train_generator and test_generator are created with a batch size of 64.
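The settings above map onto ImageDataGenerator arguments roughly as in the sketch below. The function name make_generators is hypothetical, and the use of flow() on in-memory arrays in the usage comment is an assumption, since the article does not say how the generators were fed. Newer TensorFlow releases mark ImageDataGenerator as deprecated, so treat this as an illustration of the described setup rather than a recommendation.

```python
def make_generators():
    """Build the training and testing generators described in the text."""
    # Imported inside the function so this file loads without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_datagen = ImageDataGenerator(
        rescale=1 / 255,        # scale pixel values to the range [0, 1]
        zoom_range=0.3,         # random zoom
        horizontal_flip=True,   # random left-right flips
        rotation_range=30,      # random rotations of up to 30 degrees
    )
    # The testing data is only rescaled, never augmented.
    test_datagen = ImageDataGenerator(rescale=1 / 255)
    return train_datagen, test_datagen


# Hypothetical usage with arrays X_train, y_train, X_test, y_test:
# train_datagen, test_datagen = make_generators()
# train_generator = train_datagen.flow(X_train, y_train, batch_size=64)
# test_generator = test_datagen.flow(X_test, y_test, batch_size=64, shuffle=False)
```

The augmented images are generated on the fly for each batch, so the dataset on disk stays the same size.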

Calculating new accuracy

I then trained the classifier using fit_generator and calculated the new accuracy.

The model achieved an accuracy of 96.41% with data augmentation.

As we can see, with data augmentation, I was able to increase the model accuracy while still having the same data to begin with.

At first glance, it might look like the accuracy hasn’t increased much but in the medical domain a single percent increase can be really useful and can identify more patients correctly.

Conclusion

In this article, I discussed the use of Convolutional Neural Networks and data augmentation for Malaria cell images and achieved a test accuracy of 96.41%.


Thanks for reading.

Please share your thoughts, ideas and suggestions.

