
Convolutional Neural Network (CNN) for Cell Classification

Intro

In 2018, Paul Mooney uploaded a series of augmented blood-cell images (JPEG) together with their cell labels (CSV). There are approximately 3,000 images for each type of cell: eosinophil, lymphocyte, monocyte, and neutrophil. The main purpose of this project is to build a Convolutional Neural Network (CNN) that classifies each image. The most popular Kaggle post offering a solution (also uploaded by P. Mooney) is cited in the references section. This project focuses on developing automated methods to improve the efficiency and speed of blood-based disease diagnosis.


Objectives

* Build an efficient architecture that successfully classifies the 4 cell types in the given data with more than 85% accuracy.


Data Preprocessing

[Figure: RGB pixel-intensity distribution (color.png)]

First of all, as in any data science project, preprocessing the data is fundamental to obtaining correct output from the model. In this case, all images must have the same dimensions. These functions were provided by Paul Mooney, who also indicates which commands can be used to visualize the pixel intensity of the RGB spectrum, shown in the accompanying figure.
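As an illustrative sketch (not Mooney's exact code, and with an assumed target image size), checking that every image shares the same dimensions and summarizing the per-channel RGB intensity could look like this:

```python
import numpy as np

# Assumed target size for every image (the real value depends on the dataset)
TARGET_SHAPE = (60, 80, 3)

# Simulated RGB image standing in for one resized blood-cell picture
img = np.random.randint(0, 256, size=TARGET_SHAPE, dtype=np.uint8)
assert img.shape == TARGET_SHAPE  # all inputs to the CNN must match

# Mean intensity per RGB channel, the quantity visualized in the figure
channel_means = img.reshape(-1, 3).mean(axis=0)
print(dict(zip("RGB", channel_means.round(1))))
```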

CNN architecture

This section is the most important of the entire project: the architecture is what defines a successful classification. The last published model (Paul Mooney's) is a sequential CNN with several layers; nonetheless, I consider an Inception model more appropriate for this type of problem, because we need the model to notice the differences between specific traits of the cells. The block architecture is the following:

[Figure: modified Inception block (BlockIncep.PNG)]

Blocks

Each block is built from several convolutions. The specifications are the following (each Conv2D is listed with its kernel size and number of filters):

Tower 1: Conv2D, 1×1 kernel, 32 filters

Tower 2: Conv2D, 1×1 kernel, 16 filters
                Conv2D, 3×3 kernel, 32 filters

Tower 3: Conv2D, 1×1 kernel, 8 filters
                Conv2D, 3×3 kernel, 16 filters
                Conv2D, 5×5 kernel, 32 filters

Tower 4: MaxPooling, 5×5 pool
                 Conv2D, 1×1 kernel, 32 filters
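In Keras's functional API, one such block could be sketched as follows. This is a minimal sketch: the `same` padding, ReLU activations, and pooling stride are assumptions not specified in the tower list above.

```python
from tensorflow.keras import layers

def inception_block(x):
    # Tower 1: single 1x1 convolution, 32 filters
    t1 = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(x)
    # Tower 2: 1x1 bottleneck (16 filters) then a 3x3 convolution (32 filters)
    t2 = layers.Conv2D(16, (1, 1), padding="same", activation="relu")(x)
    t2 = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(t2)
    # Tower 3: 1x1 (8 filters) -> 3x3 (16 filters) -> 5x5 (32 filters)
    t3 = layers.Conv2D(8, (1, 1), padding="same", activation="relu")(x)
    t3 = layers.Conv2D(16, (3, 3), padding="same", activation="relu")(t3)
    t3 = layers.Conv2D(32, (5, 5), padding="same", activation="relu")(t3)
    # Tower 4: 5x5 max pooling (stride 1 assumed) then a 1x1 conv, 32 filters
    t4 = layers.MaxPooling2D((5, 5), strides=(1, 1), padding="same")(x)
    t4 = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(t4)
    # Concatenate the four towers along the channel axis
    return layers.Concatenate(axis=-1)([t1, t2, t3, t4])

inp = layers.Input(shape=(60, 80, 3))  # illustrative input size
out = inception_block(inp)
print(out.shape)  # the channel axis is 32 + 32 + 32 + 32 = 128
```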

Final architecture

The final architecture is shown below: 5 modified Inception blocks, with MaxPooling after each of the first 2 blocks, followed by the processing of the final output: a convolution with 64 filters, 1×1 kernel, and ReLU activation; a dropout of 0.25 to deal with overfitting; a flatten layer; and finally a dense layer with the number of classes.
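The classification head alone could be sketched like this (the feature-map shape feeding into it is an assumption, since it depends on the input size and the pooling layers):

```python
from tensorflow.keras import layers, Model

# Assumed shape of the feature map leaving the last Inception block
inp = layers.Input(shape=(15, 20, 128))
x = layers.Conv2D(64, (1, 1), activation="relu")(inp)  # 1x1 conv, 64 filters, ReLU
x = layers.Dropout(0.25)(x)                            # dropout against overfitting
x = layers.Flatten()(x)
out = layers.Dense(4, activation="softmax")(x)         # one unit per cell class
head = Model(inp, out)
```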

[Figure: final model architecture (Architecture.PNG)]

Model specifications

Optimizer: ADADELTA (this optimizer is a little faster than ADAM and converges quicker; recommended for categorical cross-entropy)

Epochs: 30

Loss: Categorical Crossentropy

Parameters: 2,043,000

Early Stop: this callback stops training if the monitored metric does not improve between epochs.

                   Patience: 6 epochs

                   Monitored metric: validation accuracy

Model Checkpoint: this callback saves the best weights, so the model can be run independently afterwards.
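These settings could be sketched with Keras callbacks as follows (the monitored-metric name `val_accuracy` and the file name are assumptions; older Keras versions use `val_acc`):

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Stop when validation accuracy fails to improve for 6 consecutive epochs
early_stop = EarlyStopping(monitor="val_accuracy", patience=6, mode="max")

# Keep only the weights of the best epoch, so the model can be reloaded later
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                             save_best_only=True, mode="max")

# model.compile(optimizer="adadelta", loss="categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=30, callbacks=[early_stop, checkpoint])
```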


Results: Four Classes

Learning curve

During training the model reached the early-stop condition when the validation accuracy could not improve past 93%, so training stopped at epoch 18. Two graphs are shown: the first indicating the accuracy of the model and the second its loss. It is worth mentioning that the loss curve shows an ascending tendency, so reducing the patience should be considered in order to avoid overfitting.

[Figures: accuracy and loss curves, four-class model (bothFOUR.png, AccFour.png)]

Describing the accuracy curve a little more: the maximum accuracy reached on the training data was 97%, and when the best model was used to predict the test labels, the accuracy was over 91%. This is better than the result proposed by Paul Mooney back in 2018, improving the accuracy by 6.7% and reducing the number of training epochs from 30 to 18.

The following figure shows the general performance of the model, overall and per class; as will be mentioned afterwards, the main problem lies in the identification of the eosinophil.

[Figure: classification report, four-class model (summaryFOUR.PNG)]

Confusion matrix

The results from the modified Inception model are quite good compared to what is currently published on Kaggle. The main problem in this model is false neutrophil classification: as shown below, the confusion between eosinophil and neutrophil accounts for about 18% of the total evaluation, a problem that also affects the sequential methods. This is due to the fact that both neutrophils and eosinophils are polynuclear, so the challenge now is to find a way to make the model discriminate between these two classes.

[Figure: confusion matrix, four-class model (CONFFOUR.png)]

Results: Two Classes

Learning curve

During training the model reached the early-stop condition when the validation accuracy could not improve past 98%, so training stopped at epoch 10. Two graphs are shown: the first indicating the accuracy of the model and the second its loss. Again, the loss curve shows an ascending tendency, so reducing the patience should be considered in order to avoid overfitting.

Describing the accuracy curve a little more: the maximum accuracy reached on the training data was 98%, and when the best model was used to predict the test labels, the accuracy was over 94%. This is slightly below the result proposed by Paul Mooney back in 2018 (about 1% lower), but reduces the number of training epochs from 30 to 10.

[Figures: accuracy and loss curves, two-class model (LEARN DOS.png, BOTHDOS.png)]

The following figure shows the general performance of the model, overall and per class. It is worth mentioning that for this experiment the recall coefficient is somewhat low, meaning that the mononuclear classification has problems with certain positions or points of view of the captured image; in other words, the model classifies more cells as polynuclear than it should.

[Figure: classification report, two-class model (SummaryDOS.PNG)]

Confusion matrix

The main problem in this model is false mononuclear classification, as observed in the confusion matrix and the recall coefficient. The explanation is that the captured image may contain another specific pattern that causes the misclassification.

[Figure: confusion matrix, two-class model (ConfDOS.png)]

Conclusions and final thoughts

To deal with the misclassification between eosinophil and neutrophil, a pre-filter might be the solution: first, classify only the polynuclear cells, and then apply another type of architecture to identify how many nuclear bodies the cell has. After this, the model should present a better recall coefficient for eosinophil cells.

[Figure: proposed two-stage pipeline (conclusion.PNG)]

For this idea, my personal proposal is to use two Inception models. The first model, just as shown in this section, identifies polynuclear versus mononuclear cells. The second model might also be an Inception-type model, but applying other methods such as average pooling instead of max pooling, which might help it better recognize the number of nuclear bodies.
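The proposed cascade could be sketched as follows. Both models and their output conventions are hypothetical here: `stage1` is assumed to score the probability of a polynuclear cell, and `stage2` to separate the two polynuclear classes.

```python
import numpy as np

def classify(image, stage1, stage2):
    """Two-stage cascade: mononuclear vs polynuclear first, then the
    harder eosinophil/neutrophil split on the polynuclear branch."""
    p_poly = stage1.predict(image[None])[0, 0]   # assumed: P(polynuclear)
    if p_poly < 0.5:
        return "mononuclear"
    scores = stage2.predict(image[None])[0]      # assumed: [eosinophil, neutrophil]
    return ("eosinophil", "neutrophil")[int(np.argmax(scores))]
```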


Convolutional networks offer vast possibilities for image identification and classification. Even though the architecture of the neural network is vital for the result, the Keras library provides straightforward design and easy implementation for any type of project. Furthermore, this specific approach in the biological field is vital for developing more robust analyses, primarily by reducing the time doctors and technicians spend analyzing each sample and, last but not least, by contributing to the welfare of the patient.

Liked the post?

Any comments or ideas are more than welcome. If you want to obtain code like this, be sure to contact me!

References

+ P. Mooney, "Blood Cell Images" dataset and classification kernel, Kaggle, 2018.

©2020 Data Science: Models and Applications. Created with Wix.com
