COVID-19 radiography detection

Diego Felipe Quijano Zuñiga
May 7, 2021 · 10 min read

This project aims to detect pneumonia caused by COVID-19 (SARS-CoV-2) in chest X-ray images, using data and image analysis tools from the field of Machine Learning such as convolutional neural networks and deep neural networks.

By applying these neural networks, the project expects to reach an accuracy greater than 90% in the detection of pneumonia caused by COVID-19 (SARS-CoV-2).

This project is aimed at the medical community, to provide an additional diagnostic aid for the detection of pneumonia caused by COVID-19 (SARS-CoV-2). We will collect data from Kaggle datasets so that we have radiographs of COVID-19-positive patients along with Normal, Non-COVID lung infection, and Viral Pneumonia images.

Ethics

The most important ethical consideration is that this tool is meant to be used as a support for COVID-19 diagnosis. It is not intended to be the primary source for deciding whether a patient has the disease. The broader analysis must include a COVID-19 test, a review of symptoms, and the patient's medical record, and every hospital or doctor has to make sure this full analysis is completed before diagnosing a patient with COVID-19.

Data Preprocessing.

The data will be sourced from Kaggle, a Google-owned community that gathers hundreds of data scientists and machine learning professionals. It will not be necessary to use extra data from other places or studies, since the datasets to be used are complete and very well organized.

The data is in PNG format, a lossless bitmap format. Converting the file format may not be necessary, but it can be done if needed to improve performance during image analysis, for example by re-encoding to a lighter format such as JPG or simply resizing the images to a smaller size.
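
As an illustration of this kind of transformation, here is a minimal sketch using OpenCV (the file name example_xray.png is hypothetical and used only for illustration):

import cv2

# Hypothetical input file, used only for illustration.
img = cv2.imread("example_xray.png")

if img is not None:
    # Resize to a smaller fixed size to speed up later processing.
    small = cv2.resize(img, (150, 150))
    # Optionally re-encode as JPG (lossy) if storage or throughput matters.
    cv2.imwrite("example_xray.jpg", small, [cv2.IMWRITE_JPEG_QUALITY, 90])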

Features and data exploration

Currently the data present two very important groups: the first is the set of images of people with healthy lungs who do not present any symptoms such as pneumonia; the other group of images shows pneumonia caused by COVID-19 (SARS-CoV-2), where the infection is visible.

For data exploration, a deeper analysis is planned using automated tools such as convolutional neural networks.

Pre-existing hypotheses about the data and how to verify them

Initially we can hypothesize that the radiographs of infected patients show changes in texture, such as blurrier spots in the lungs, that allow us to identify that the patient is not healthy. These are only hypotheses; to clarify them, in the first stage of the analysis a small group of images of healthy patients will be compared against images of patients with pneumonia caused by COVID-19 (SARS-CoV-2).

As we can see in images number one and two, the lungs of these patients are more opaque compared to patient number three, who is healthy and free of pneumonia.

Handling missing data or outliers

This study has the support of many health institutions worldwide, so there is a large volume of data collected during the different stages of the pandemic. These data were provided by teams of researchers from Qatar University, Doha, Qatar, and the University of Dhaka, Bangladesh, together with their collaborators from Pakistan and Malaysia, in collaboration with medical doctors. The dataset consists of more than 22 thousand images of healthy people and people with pneumonia caused by COVID-19 (SARS-CoV-2).

For the handling of atypical data, we will use not only a dataset of pneumonia caused by COVID-19 (SARS-CoV-2) but also a dataset of pneumonia caused by other conditions; in this way we expect to determine which cases are directly related to COVID-19 (SARS-CoV-2). It is important to note that in both cases pneumonia presents as a lung infection, so other symptoms must be analyzed by a doctor, who must give the final opinion on whether or not it is a case of COVID-19.

Data in training / validation / test sets

The data will be split into training, validation, and testing sets in such a way that the training set is large enough to produce statistically significant results, while remaining disjoint from the validation and testing data.

For this we will take 50% of the data for training, 25% for validation, and 25% for testing. We will first use a function to split the data into a training set and a held-out set, because there is no single function that splits training, validation, and testing at the same time; we will then split the held-out portion into the validation and test sets.
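
A minimal sketch of this two-step split with scikit-learn's train_test_split (here X and y are placeholders for the processed image and label arrays):

from sklearn.model_selection import train_test_split

# First split: 50% for training, 50% held out.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

# Second split: divide the held-out half evenly, giving 25% validation and 25% test overall.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)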

The data set is unbiased

We will ensure that the dataset is unbiased because it includes images from 46 different publications and sites from various countries, including Italy, Spain, Germany, and the USA. Among these databases, the COVID-19 database was built by the authors from collected and publicly available databases, while the normal and viral pneumonia databases were created from publicly available Kaggle databases. We will also guarantee that the number of images used for training, validation, and testing is very similar and balanced between patients with pneumonia caused by COVID-19 (SARS-CoV-2) and patients who do not have it.
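
If the classes turned out not to be balanced, random under-sampling (or over-sampling) could even them out. Below is a minimal sketch with imbalanced-learn, assuming X_train and y_train are the grayscale image and label arrays produced later in the notebook; the sampler needs a 2-D matrix, so the images are flattened and then reshaped back:

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Flatten each 150x150 grayscale image into a row vector for the sampler.
n, h, w = X_train.shape
X_flat = X_train.reshape(n, h * w)

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_flat, y_train)

# Restore the image shape expected by the convolutional network.
X_balanced = X_res.reshape(-1, h, w)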

The features in the training of the model

The features we will include in the training of the model will allow it to differentiate when the images show characteristics that indicate the type of morbidity present in the radiographs, such as pneumonia caused by COVID-19 (SARS-CoV-2) versus a healthy lung. This will be done using relevant characteristics such as the dimensions and contrasts of the images, while discarding irrelevant data.

Type of data

We will handle categorical data, because the data we collect represent a characteristic: whether a patient has pneumonia caused by COVID-19 (SARS-CoV-2) or not. In this case, the categorical data take numerical values: 1 indicating a patient with pneumonia caused by COVID-19 (SARS-CoV-2), and 0 if the patient does not have it.

Transforming the data for the model

We will transform the data as previously mentioned, resizing the images to 150x150 pixels so the convolutional neural network can process them correctly. We will also use rgb2gray to convert every X-ray image to grayscale (from 3 channels to 1).

How the data will be stored

We are going to store the data in two arrays, one for the images (X) and one for the labels (y). The X array will contain NumPy arrays, one per image, after the processing described above. In the end we will have separate arrays for training and testing to feed into the model.

Code implementation in Python.

Imports

from __future__ import absolute_import, division, print_function, unicode_literals
import pandas as pd
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import numpy as np
import os
from tqdm import tqdm
import cv2
from glob import glob
import sklearn
import skimage
from skimage.transform import resize
import random
import datetime
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from skimage.color import rgb2gray

Data transformation

train_dir = "chest_xray/train/"
test_dir = "chest_xray/test/"LOAD_FROM_IMAGES = Truedef get_data(folder):
X = []
y = []
for folderName in os.listdir(folder):
if not folderName.startswith('.'):
if folderName in ['NORMAL']:
label = 0
elif folderName in ['PNEUMONIA']:
label = 1
else:
label = 2
for image_filename in tqdm(os.listdir(folder + folderName)):
img_file = cv2.imread(folder + folderName + '/' + image_filename)
if img_file is not None:
img_file = skimage.transform.resize(img_file, (150, 150, 3),mode='constant',anti_aliasing=True)
img_file = rgb2gray(img_file)
img_arr = np.asarray(img_file)
X.append(img_arr)
y.append(label)
X = np.asarray(X)
y = np.asarray(y)
return X,yif LOAD_FROM_IMAGES:
X_train, y_train = get_data(train_dir)
X_test, y_test= get_data(test_dir)

np.save('xtrain.npy', X_train)
np.save('ytrain.npy', y_train)
np.save('xtest.npy', X_test)
np.save('ytest.npy', y_test)
else:
X_train = np.load('xtrain.npy')
y_train = np.load('ytrain.npy')
X_test = np.load('xtest.npy')
y_test = np.load('ytest.npy')

Model

X_trainReshaped = X_train.reshape(len(X_train), 150, 150, 1)
X_testReshaped = X_test.reshape(len(X_test), 150, 150, 1)

model = models.Sequential()
model.add(layers.Conv2D(64, (3, 3), activation='relu', input_shape=(150, 150, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(2, activation='softmax'))
model.summary()

Image visualization

multipleImages = glob('chest_xray/train/NORMAL/**')
i_ = 0
plt.rcParams['figure.figsize'] = (20.0, 20.0)
plt.subplots_adjust(wspace=0, hspace=0)
for l in multipleImages[:25]:
    im = cv2.imread(l)
    im = cv2.resize(im, (128, 128))
    plt.subplot(5, 5, i_ + 1)
    plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB)); plt.axis('off')
    i_ += 1

chest_xray/train/NORMAL/

multipleImages = glob('chest_xray/train/PNEUMONIA/**')
i_ = 0
plt.rcParams['figure.figsize'] = (20.0, 20.0)
plt.subplots_adjust(wspace=0, hspace=0)
for l in multipleImages[:25]:
    im = cv2.imread(l)
    im = cv2.resize(im, (128, 128))
    plt.subplot(5, 5, i_ + 1)
    plt.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB)); plt.axis('off')
    i_ += 1

chest_xray/train/PNEUMONIA/

Verification of the amount of data

import seaborn as sns
"""
Label 0: without PNEUMONIA
Label 1: Whit PNEUMONIA
"""plt.figure(figsize=(8,4))
map_characters = {0: 'without PNEUMONIA', 1: ' Whit PNEUMONIA'}
dict_characters=map_charactersdf = pd.DataFrame()
df["labels"]=y_train
lab = df['labels']
dist = lab.value_counts()
sns.countplot(lab)print(dict_characters)

Model compile

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir, histogram_freq=1)

model.fit(X_trainReshaped,
          y_train,
          epochs=35,
          validation_data=(X_testReshaped, y_test),
          callbacks=[tensorboard_callback])
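
After training, the model can be evaluated on the held-out test set. The following is a minimal sketch using the Keras objects defined above (model, X_testReshaped, y_test); it is illustrative rather than part of the original notebook:

# Overall loss and accuracy on the test set.
test_loss, test_acc = model.evaluate(X_testReshaped, y_test, verbose=2)
print("Test accuracy:", test_acc)

# Predicted class per image: index of the highest softmax output (0 = normal, 1 = pneumonia).
predictions = np.argmax(model.predict(X_testReshaped), axis=1)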

Result

Epoch 1/35
7/7 [==============================] - 5s 719ms/step - loss: 0.9381 - accuracy: 0.4779 - val_loss: 0.6935 - val_accuracy: 0.5000
Epoch 2/35
7/7 [==============================] - 4s 569ms/step - loss: 0.6936 - accuracy: 0.5441 - val_loss: 0.6924 - val_accuracy: 0.6000
Epoch 3/35
7/7 [==============================] - 4s 592ms/step - loss: 0.6892 - accuracy: 0.6187 - val_loss: 0.6887 - val_accuracy: 0.6500
Epoch 4/35
7/7 [==============================] - 4s 604ms/step - loss: 0.6781 - accuracy: 0.7168 - val_loss: 0.6988 - val_accuracy: 0.5000
Epoch 5/35
7/7 [==============================] - 4s 544ms/step - loss: 0.6482 - accuracy: 0.5975 - val_loss: 0.6871 - val_accuracy: 0.5500
Epoch 6/35
7/7 [==============================] - 4s 565ms/step - loss: 0.5812 - accuracy: 0.7397 - val_loss: 0.7176 - val_accuracy: 0.5500
Epoch 7/35
7/7 [==============================] - 4s 569ms/step - loss: 0.6259 - accuracy: 0.6630 - val_loss: 0.6835 - val_accuracy: 0.5833
Epoch 8/35
7/7 [==============================] - 4s 581ms/step - loss: 0.5448 - accuracy: 0.7342 - val_loss: 0.6443 - val_accuracy: 0.6500
Epoch 9/35
7/7 [==============================] - 4s 566ms/step - loss: 0.5119 - accuracy: 0.7898 - val_loss: 0.6214 - val_accuracy: 0.6167
Epoch 10/35
7/7 [==============================] - 4s 594ms/step - loss: 0.5190 - accuracy: 0.7445 - val_loss: 0.5973 - val_accuracy: 0.5833
Epoch 11/35
7/7 [==============================] - 4s 542ms/step - loss: 0.5396 - accuracy: 0.7155 - val_loss: 0.6151 - val_accuracy: 0.7000
Epoch 12/35
7/7 [==============================] - 4s 554ms/step - loss: 0.4559 - accuracy: 0.7962 - val_loss: 0.5664 - val_accuracy: 0.7000
Epoch 13/35
7/7 [==============================] - 4s 555ms/step - loss: 0.4133 - accuracy: 0.8038 - val_loss: 0.6769 - val_accuracy: 0.7000
Epoch 14/35
7/7 [==============================] - 4s 575ms/step - loss: 0.4183 - accuracy: 0.8215 - val_loss: 0.5820 - val_accuracy: 0.7333
Epoch 15/35
7/7 [==============================] - 4s 593ms/step - loss: 0.3948 - accuracy: 0.8177 - val_loss: 0.5052 - val_accuracy: 0.7167
Epoch 16/35
7/7 [==============================] - 4s 576ms/step - loss: 0.3229 - accuracy: 0.8661 - val_loss: 0.4191 - val_accuracy: 0.7333
Epoch 17/35
7/7 [==============================] - 4s 588ms/step - loss: 0.3235 - accuracy: 0.8501 - val_loss: 0.4015 - val_accuracy: 0.7667
Epoch 18/35
7/7 [==============================] - 4s 619ms/step - loss: 0.3115 - accuracy: 0.8913 - val_loss: 0.5566 - val_accuracy: 0.6833
Epoch 19/35
7/7 [==============================] - 4s 650ms/step - loss: 0.3257 - accuracy: 0.8512 - val_loss: 0.3918 - val_accuracy: 0.8167
Epoch 20/35
7/7 [==============================] - 4s 584ms/step - loss: 0.3460 - accuracy: 0.8119 - val_loss: 0.3435 - val_accuracy: 0.8667
Epoch 21/35
7/7 [==============================] - 4s 606ms/step - loss: 0.2737 - accuracy: 0.8790 - val_loss: 0.3651 - val_accuracy: 0.8333
Epoch 22/35
7/7 [==============================] - 4s 562ms/step - loss: 0.2587 - accuracy: 0.8741 - val_loss: 0.2720 - val_accuracy: 0.9000
Epoch 23/35
7/7 [==============================] - 4s 607ms/step - loss: 0.2355 - accuracy: 0.8798 - val_loss: 0.4026 - val_accuracy: 0.8333
Epoch 24/35
7/7 [==============================] - 4s 606ms/step - loss: 0.2178 - accuracy: 0.9103 - val_loss: 0.2065 - val_accuracy: 0.9333
Epoch 25/35
7/7 [==============================] - 4s 564ms/step - loss: 0.1550 - accuracy: 0.9356 - val_loss: 0.3326 - val_accuracy: 0.8333
Epoch 26/35
7/7 [==============================] - 4s 583ms/step - loss: 0.1722 - accuracy: 0.9416 - val_loss: 0.3908 - val_accuracy: 0.8167
Epoch 27/35
7/7 [==============================] - 4s 605ms/step - loss: 0.1935 - accuracy: 0.9252 - val_loss: 0.2397 - val_accuracy: 0.9167
Epoch 28/35
7/7 [==============================] - 4s 616ms/step - loss: 0.1489 - accuracy: 0.9692 - val_loss: 0.1458 - val_accuracy: 0.9500
Epoch 29/35
7/7 [==============================] - 4s 612ms/step - loss: 0.1395 - accuracy: 0.9405 - val_loss: 0.7241 - val_accuracy: 0.7000
Epoch 30/35
7/7 [==============================] - 4s 604ms/step - loss: 0.4520 - accuracy: 0.8178 - val_loss: 0.2306 - val_accuracy: 0.9167
Epoch 31/35
7/7 [==============================] - 4s 568ms/step - loss: 0.2291 - accuracy: 0.9099 - val_loss: 0.2700 - val_accuracy: 0.9000
Epoch 32/35
7/7 [==============================] - 4s 557ms/step - loss: 0.1771 - accuracy: 0.9559 - val_loss: 0.1595 - val_accuracy: 0.9667
Epoch 33/35
7/7 [==============================] - 4s 566ms/step - loss: 0.1002 - accuracy: 0.9788 - val_loss: 0.1153 - val_accuracy: 0.9500
Epoch 34/35
7/7 [==============================] - 4s 578ms/step - loss: 0.1221 - accuracy: 0.9451 - val_loss: 0.1058 - val_accuracy: 0.9667
Epoch 35/35
7/7 [==============================] - 4s 567ms/step - loss: 0.1011 - accuracy: 0.9804 - val_loss: 0.1458 - val_accuracy: 0.9500

Conclusions.

We can conclude that the current model reaches an accuracy above 90%, which means that the proposed model satisfactorily meets the general objective of the project: exceeding 90% accuracy in detecting pneumonia caused by COVID-19. Throughout the iterations it became clear that, to conclude the project successfully, a validation method had to be applied; this is achieved by dividing the data into two parts, the training set and the validation set, where the latter allowed us to validate the performance of the model trained on the training data. Finally, if there is a large difference between the training accuracy and the validation accuracy, the model is considered overfitted; if instead the validation accuracy (val_accuracy) is equal to or slightly below the training accuracy (accuracy), the model is considered a good one.
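
One way to compare training and validation accuracy across epochs, instead of reading the log, is to plot the History object returned by model.fit. This is a minimal sketch assuming the return value of the fit call above is stored in a variable named history (a name not used in the original code):

history = model.fit(X_trainReshaped, y_train, epochs=35,
                    validation_data=(X_testReshaped, y_test),
                    callbacks=[tensorboard_callback])

# A large, persistent gap between the two curves suggests overfitting.
plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()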
