Doodle Prediction using CNN, TensorFlow, and Keras.
https://www.youtube.com/watch?v=SHCC5mYxUUI&feature=youtu.be
preprocess.ipynb
preprocess_analysis.ipynb
data_model.ipynb
The data used for this project is the crowdsourced Quick Draw dataset, which Google amassed through its online game Quick, Draw!. The dataset contains over 50 million drawings, or doodles, submitted by users. The drawings are classified into 345 different classes ranging from 'aircraft carrier' to 'zigzag'.
For data analysis and wrangling, only 20 classes of drawings were considered, and only the first 10,000 image samples were used for each category. This keeps the analysis phase reasonably quick.
For the actual CNN modelling, we decided to be ambitious and use all 345 classes to build a Convolutional Neural Network. However, only the first 10,000 image samples were used for each category. As we will see later, 10,000 images per class are not enough to make predictions with great accuracy.
The entire dataset is available here: https://github.com/googlecreativelab/quickdraw-dataset
https://console.cloud.google.com/storage/browser/quickdraw_dataset/full/numpy_bitmap
For this project, we are working with the preprocessed dataset provided by Google Creative Lab.
All the drawings have been rendered into 28x28 grayscale bitmaps stored in NumPy .npy format.
Step 1: Install gsutil: https://cloud.google.com/storage/docs/gsutil_install?hl=fr
Step 2: Download the .npy files: gsutil -m cp "gs://quickdraw_dataset/full/numpy_bitmap/*.npy" .
Since the goal of this project is to build a CNN model using all 345 image classes, the data wrangling steps to achieve this were as follows:
- Created a dictionary of URLs and class labels to download the entire dataset as .npy files.
- Downloaded the dataset into a dictionary keyed by class label.
- Normalized the image pixels to values between 0 and 1, and added class labels.
- Aggregated the first 10,000 arrays for every class into one array of doodles.
- Split the data into features and target labels.
- Saved the train and test data as serialised pickle files.
The steps listed above are described in the Jupyter notebook titled preprocess.ipynb.
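A minimal sketch of this pipeline, assuming the downloaded .npy files sit in the working directory and using illustrative constants and output file names (the actual implementation is in preprocess.ipynb):

```python
# Rough sketch of the wrangling steps above (file layout, constants, and
# output file names are assumptions for illustration).
import glob
import os
import pickle

import numpy as np
from sklearn.model_selection import train_test_split

SAMPLES_PER_CLASS = 10_000

# Load each class's .npy file into a dictionary keyed by class name.
doodles = {}
for path in sorted(glob.glob("*.npy")):
    class_name = os.path.splitext(os.path.basename(path))[0]
    doodles[class_name] = np.load(path)[:SAMPLES_PER_CLASS]

# Normalise pixels to [0, 1], attach numeric labels, and aggregate everything
# into one feature array and one target array.
X = np.concatenate([arr.astype("float32") / 255.0 for arr in doodles.values()])
y = np.concatenate([np.full(len(arr), i) for i, arr in enumerate(doodles.values())])

# Split into train/test sets and serialise with pickle for the modelling notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
with open("train_test_data.pkl", "wb") as f:
    pickle.dump((X_train, X_test, y_train, y_test), f)
```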
I kept getting a 'MemoryError' when trying to create a dictionary object from the downloaded data. See below:
Running the code in Colab crashed the application and produced a RAM error as well. We will have to take another approach to set up the data for all 345 classes to create our CNN model.
The following piece of code generates PNG image files from the .npy files and splits the image files into train and test sets:
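A minimal sketch of such a conversion (the output directory names and the 80/20 split are assumptions, not the exact code used):

```python
# Sketch: convert each class's .npy bitmaps into PNG files arranged in
# class-named train/ and test/ folders (paths and split ratio are assumptions).
import glob
import os

import numpy as np
from PIL import Image

SAMPLES_PER_CLASS = 10_000
TRAIN_FRACTION = 0.8

for path in sorted(glob.glob("*.npy")):
    class_name = os.path.splitext(os.path.basename(path))[0]
    bitmaps = np.load(path)[:SAMPLES_PER_CLASS]
    split = int(len(bitmaps) * TRAIN_FRACTION)
    for i, row in enumerate(bitmaps):
        subset = "train" if i < split else "test"
        out_dir = os.path.join("images", subset, class_name)
        os.makedirs(out_dir, exist_ok=True)
        # Each row is a flattened 784-pixel bitmap; reshape to 28x28 and save.
        Image.fromarray(row.reshape(28, 28).astype("uint8")).save(
            os.path.join(out_dir, f"{i}.png"))
```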
These images will be used to train our CNN model to classify the 345 classes.
Plotting an intensity histogram for some of the classes:
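For example, something along these lines (the class names and sample count are arbitrary choices for illustration):

```python
# Sketch: overlaid pixel-intensity histograms for a few example classes.
import numpy as np
import matplotlib.pyplot as plt

for class_name in ["cat", "apple", "bicycle"]:
    pixels = np.load(f"{class_name}.npy")[:10_000].astype("float32") / 255.0
    plt.hist(pixels.ravel(), bins=50, alpha=0.5, label=class_name)

plt.xlabel("Pixel intensity (normalised to [0, 1])")
plt.ylabel("Pixel count")
plt.title("Intensity histograms for selected classes")
plt.legend()
plt.show()
```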
We attempted to build a CNN model with all 345 classes. The architecture of our CNN model is shown below:
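As a rough illustration of such an architecture (the exact layers come from data_model.ipynb; the sizes below are assumptions):

```python
# Sketch: a representative Keras CNN for 28x28 grayscale doodles with 345
# output classes (layer sizes are illustrative, not the exact model used).
from tensorflow.keras import layers, models

NUM_CLASSES = 345

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```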
I did not have a GPU, so fitting the model took about 1 day and 50 minutes to complete.
As we can see, the CNN model did not perform particularly well: we ended with a validation accuracy of about 70%. Including all 345 classes was likely too ambitious for only 10,000 sample images per class.
One way to possibly improve the existing CNN model is through transfer learning: more data can be fed to the existing CNN model to retrain it from its saved weights. This could make training more efficient and improve accuracy.
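A minimal sketch of that idea, assuming the trained model was saved to a file and using placeholder arrays in place of the additional data:

```python
# Sketch: continue training from saved weights with additional data
# (the model file name and the placeholder arrays are assumptions).
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("doodle_cnn.h5")  # previously trained model and weights

# In practice these would be doodles beyond the first 10,000 per class,
# preprocessed the same way as the original training data.
X_more = np.random.rand(1024, 28, 28, 1).astype("float32")
y_more = np.random.randint(0, 345, size=1024)

# Training resumes from the saved weights rather than from scratch.
model.fit(X_more, y_more, epochs=5, batch_size=128, validation_split=0.1)
model.save("doodle_cnn_retrained.h5")
```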
Another approach to building a better-performing model is to use a Recurrent Neural Network instead of a CNN.
We now believe it may have been a little over-ambitious to try to train a model to classify 345 classes with only 10,000 samples each and no GPU available to train models quickly.