3 changes: 3 additions & 0 deletions your-project/CNN_Models_Weights/Links_to_Datasets.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
## Dataset
* These models and weights are also quite large, so I've provided a link to a Google Drive folder where you can download them.
[GoogleDrive](https://drive.google.com/drive/folders/1e8LTtOGAbw-qvxBggcZUs1M5p9QNZ_U-?usp=sharing)
7 changes: 7 additions & 0 deletions your-project/Dataset/Links_to_Datasets.md
@@ -0,0 +1,7 @@
## Dataset
* The filesize of the Kaggle dataset is roughly 1.2 GB, so I can't upload it to GitHub. You can access it here.
* [Kaggle](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia) - large Kaggle dataset.
* You can find the forked repo for the COVID CT scans here.
* [GitHub](https://github.com/peiriant/covid-chestxray-dataset) - COVID ChestXRAY data, a public open dataset of X-Ray and CT images of patients with suspected or confirmed COVID-19 pneumonia.
* [GitHub](https://github.com/peiriant/COVID19) - images extracted from various radiology sources.

1,628 changes: 1,628 additions & 0 deletions your-project/Notebooks/01 Image Processing_CNN Model_Pneumonia.ipynb

Large diffs are not rendered by default.

1,634 changes: 1,634 additions & 0 deletions your-project/Notebooks/02 Image Processing_CNN Pneumonia_and_COVID19.ipynb

Large diffs are not rendered by default.

110 changes: 74 additions & 36 deletions your-project/README.md
@@ -1,9 +1,9 @@
<img src="https://bit.ly/2VnXWr2" alt="Ironhack Logo" width="100"/>

# Title of My Project
*[Your Name]*
# Detecting Pneumonia with Machine Learning and Neural Networks
*[Gareth Hughes]*

*[Your Cohort, Campus & Date]*
*[Data Analytics, Barcelona, May 2020]*

## Content
- [Project Description](#project-description)
@@ -19,54 +19,92 @@
- [Links](#links)

## Project Description
Write a short description of your project: 3-5 sentences about what your project is about, why you chose this topic (if relevant), and what you are trying to show.
In this project, my main objective is to successfully differentiate between pneumonia, COVID-19
and normal lungs using a convolutional neural network (CNN) for image classification. This is
achieved by analysing X-Ray scans of the lungs taken from an anteroposterior (AP), i.e. front-to-back, perspective.
Ultimately, the aim of the project is to develop a means of rapidly detecting pneumonia, and pneumonia induced by COVID-19, thereby saving time
and offering radiologists and doctors a second opinion.
The datasets were obtained from Kaggle and GitHub and are referenced below. The first
dataset contains X-Ray scans of both pneumonia-infected and normal lungs, used for training, testing and validation.
The other datasets contain a mixture of X-Ray scans of lungs infected by pneumonia caused by COVID-19. All the scans
in the datasets have been verified by radiologists.


## Hypotheses / Questions
* What data/business/research/personal question you would like to answer?
* What is the context for the question and the possible scientific or business application?
* What are the hypotheses you would like to test in order to answer your question?
Frame your hypothesis with statistical/data languages (i.e. define Null and Alternative Hypothesis). You can use formulas if you want but that is not required.
* Can we identify the presence of pneumonia using image recognition?
* How accurately can we identify pneumonia and COVID?
* Are we able to differentiate between pneumonia and COVID induced pneumonia?

## Dataset
* Where did you get your data? If you downloaded a dataset (either public or private), describe where you downloaded it and include the command to load the dataset.
* Did you build your own dataset? If so, did you use an API or a web scraper? Provide the relevant scripts in your repo.
* For all types of datasets, provide a description of the size, complexity, and data types included in your dataset, as well as a schema of the tables if necessary.
* If the question cannot be answered with the available data, why not? What data would you need to answer it better?
## Datasets
* [GitHub](https://github.com/UCSD-AI4H/COVID-CT) - This repo contains COVID-19 X-Ray images which I incorporated into the dataset.
* [GitHub](https://github.com/peiriant/COVID19) - This repo also contains COVID-19 X-Ray images.
* [Kaggle](https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia) - This Kaggle dataset contains a library of normal and pneumonia X-Rays.

## Cleaning
Describe your full process of data wrangling and cleaning. Document why you chose to fill missing values, extract outliers, or create the variables you did as well as your reasoning behind the process.

## Analysis
* Overview the general steps you went through to analyze your data in order to test your hypothesis.
* Document each step of your data exploration and analysis.
* Include charts to demonstrate the effect of your work.
* If you used Machine Learning in your final project, describe your feature selection process.
* The images obtained from the GitHub datasets needed to be categorised based on the type of illness
each patient suffered from, in order to ensure only COVID-19 images were being analysed.
* Additionally, only the AP view of the X-Ray scans was to be analysed, so additional filtering was required.
* All the images were resized to a standard size, decolourised and normalized before being fed into the
CNN.
* Various parameters for the importing, normalizing and mean reduction of the images were investigated.
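The notebooks perform the resize / decolourise / normalize step with CV2; as a rough, library-free sketch of the same idea (the function name, nearest-neighbour resizing and the 150-pixel target size are my own illustrative assumptions, not the notebooks' exact settings):

```python
import numpy as np

def preprocess(img: np.ndarray, size: int = 150) -> np.ndarray:
    """Decolourise, resize and normalize one X-Ray image.

    img is an (H, W, 3) uint8 RGB array; returns a (size, size) float array in [0, 1].
    """
    gray = img.mean(axis=2)                      # collapse RGB to a single grayscale channel
    rows = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    resized = gray[np.ix_(rows, cols)]           # nearest-neighbour resize to size x size
    return resized / 255.0                       # scale pixel values to [0, 1]
```

Standardising the shape and scale like this is what lets every image be stacked into one array and fed to the CNN.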

## Model Training and Evaluation
*Include this section only if you chose to include ML in your project.*
* Describe how you trained your model, the results you obtained, and how you evaluated those results.
The CNN model is described in more detail below. The final accuracies were:
1. Classification of pneumonia vs. normal lungs: 90% accurate
2. Classification of pneumonia / COVID-19 vs. normal lungs: 85% accurate

## Conclusion
* Summarize your results. What do they mean?
* What can you say about your hypotheses?
* Interpret your findings in terms of the questions you try to answer.
* After tailoring the convolutional neural network, by altering its parameters and exploring several means of
filtering the data, a model with an accuracy of 85% for the analysis of normal vs. pneumonia AP X-Ray scans
was obtained.
* Following this, due to the imbalance of the dataset, I decided to reduce the number of pneumonia images to be
more in line with the number of normal images. This increased the accuracy to 88%.
* The model was then applied to the analysis of the COVID-19-induced pneumonia AP CT scans, and an accuracy of
84% was obtained.
* Similarly, I balanced the dataset further by reducing the number of pneumonia images. This increased the
accuracy to 87.5%.


## Future Work
Address any questions you were unable to answer, or any next steps or future extensions to your project.
* Increase the size of the dataset for normal, pneumonia and COVID-19 induced pneumonia datasets.
* Apply the CNN to distinguish between bacterial, viral and fungal pneumonia.
* Apply the CNN to differentiate other lung based diseases.
* Fine tune the hyperparameters of the CNN further.
* Centre the images in the dataset to focus on the lungs.

## Workflow
Outline the workflow you used in your project. What were the steps?
How did you test the accuracy of your analysis and/or machine learning algorithm?
I began my work by looking for the datasets available to me. I discovered that many of the COVID-19 datasets
do not provide images of normal lung X-Ray scans as a reference. They also contain X-Ray images taken from several
different viewpoints.
As such, I decided to investigate both the Kaggle pneumonia images and the COVID-19 images that I found on several GitHub repositories.
In order to utilise the Kaggle dataset, I extracted all the photos, converted them to
NumPy arrays of RGB values using CV2, and then assigned them to the correct class based on the folder they were
located in (e.g. Normal, Pneumonia, Test, Train). This data was then normalized in order to reduce the
amount of information that needed to be committed to memory when running the CNN, normalizing the grayscale images
in the process. The data was then fed into the CNN, and the accuracy and loss of the model were evaluated. A confusion matrix was also
generated in order to determine the distribution of the CNN's predictions with regard to the prognosis (i.e. normal vs. pneumonia lungs).
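The assign-class-by-folder step can be sketched as follows (the folder names follow the Kaggle layout; the helper name and integer encoding are my own illustrative assumptions):

```python
from pathlib import Path

# Map each class folder name, as in the Kaggle chest_xray layout, to an integer label.
CLASS_LABELS = {"NORMAL": 0, "PNEUMONIA": 1}

def label_for(image_path: str) -> int:
    """Derive the class label of an image from the name of its parent folder."""
    folder = Path(image_path).parent.name.upper()
    return CLASS_LABELS[folder]
```

Because the Kaggle dataset encodes the diagnosis in the directory structure, this is all the labelling the pneumonia/normal task needs.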

The process was repeated with the addition of the COVID-19 image sets. One dataset provided only X-Ray images, so it was easy to utilise.
The other image set contained numerous images taken from different positions, so, using the metadata, only the correct AP (anteroposterior) X-Ray images were utilised.
I explored various methods of splitting and training the data in order to improve the accuracy of the model, whilst trying to avoid overfitting and underfitting.
In particular, I utilised ImageDataGenerator, a Keras class which automatically skews, rotates and alters the zoom of images fed into the training and validation sets. Because the model continuously sees newly transformed data,
this helps avoid overfitting. I also investigated reducing the size of the overall dataset, which led to an improvement in the accuracy of the CNN.
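The augmentation idea behind ImageDataGenerator can be sketched in plain NumPy (the flip probability and shift range here are illustrative choices of mine, not the notebooks' settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray, max_shift: int = 10) -> np.ndarray:
    """Return a randomly flipped and shifted copy of a 2-D grayscale image array."""
    out = img.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                      # random horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, (dy, dx), axis=(0, 1))   # random vertical/horizontal shift
    return out
```

Each epoch thus presents slightly different versions of the same scans, which is why the network cannot simply memorise the training images.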

Finally, I utilised the models to predict the class of the images contained within the testing dataset. The predictions were then plotted as confusion matrices in order to deduce
how well the CNN performed when "seeing" new images. Ultimately, both the classification of pneumonia/normal and pneumonia/normal/COVID-19 X-Rays resulted in accuracies above
85%, which suggests that the CNN is robust, accurate and avoiding overfitting / underfitting.
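A confusion matrix of that kind can be built in a few lines (this is a minimal sketch, assuming integer class labels with 0 = normal and 1 = pneumonia; the function name is my own):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes: int = 2) -> np.ndarray:
    """Count predictions per class: rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```

The diagonal holds the correct predictions, so overall accuracy is the trace divided by the total count, and the off-diagonal cells show which classes the CNN confuses.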

## Organization
How did you organize your work? Did you use any tools like a trello or kanban board?

What does your repository look like? Explain your folder and file structure.
- I utilised a Trello board to map out my overall plan and for check-points along the way.
- The work is organised into two notebooks which, read chronologically, describe the work process.
- The first notebook (01 Image Processing_CNN Model_Pneumonia.ipynb) details the development of an image classifier for the presence of pneumonia in X-Ray scans.
- The second notebook (02 Image Processing_CNN Pneumonia_and_COVID19.ipynb) details the development of a multi-class image classifier for COVID-19, pneumonia and normal lungs.
- The CNN_Models_Weights and Dataset folders contain text files with links to the models I utilised, as they're both too
big to be uploaded to GitHub.
- The CNN_Models_Weights folder contains a link to a Google Drive which holds the model and the weights for each notebook.

## Links
Include links to your repository, slides and trello/kanban board. Feel free to include any other links associated with your project.


[Repository](https://github.com/)
[Slides](https://slides.com/)
[Trello](https://trello.com/en)
[Repository](https://github.com/peiriant/Project-Week-8-Final-Project/tree/master/your-project)
[Slides](https://docs.google.com/presentation/d/1EOBTjrrSqtab0Yp7QVxBku-6PXEwTSGyHm1JzJkU20Q/edit?usp=sharing)
[Trello](https://trello.com/b/CDl7EYhV/project-5)