Data preprocessing

Part I - Image data preprocessing

In this part, you will use the popular package “skimage” to preprocess and augment an image before sending it to a neural network coded in Keras.

import numpy as np
import pandas as pd
from keras.models import Sequential
from keras import optimizers
from keras.utils import np_utils
from keras.layers import Dense, Conv2D, Embedding, Activation, MaxPooling2D, Dropout
from keras.layers import Flatten, LSTM, ZeroPadding2D, BatchNormalization

%matplotlib inline
import matplotlib.pyplot as plt

Question 1: Use skimage to load your “iguana.jpg” and display it in your notebook.

from skimage.metrics import structural_similarity as ssim  # compare_ssim was renamed and moved out of skimage.measure
from skimage import io
from skimage.transform import resize

# Loading the image
###START CODE HERE ###

###END CODE HERE ###

Question 2: Use skimage to zoom on the face of the iguana. Display the image.

#Zoom image
###START CODE HERE ###

###END CODE HERE ###
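
One possible way to approach the zoom, sketched on a synthetic array because "iguana.jpg" is not available here: crop the region of interest with NumPy slicing, then use resize to scale the crop back up to the original size. The bounding box coordinates below are hypothetical and would need to be adjusted by eye for the real picture.

```python
import numpy as np
from skimage.transform import resize

# Synthetic stand-in for the iguana photo (assumption: RGB uint8, rows x cols x 3).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(200, 300, 3), dtype=np.uint8)

# Hypothetical bounding box around the face: rows 50..149, columns 100..249.
top, bottom, left, right = 50, 150, 100, 250
face = image[top:bottom, left:right]

# Scale the crop back up to the original height and width so it appears zoomed in.
# resize returns a float image with values in [0, 1].
zoomed = resize(face, image.shape[:2], anti_aliasing=False)
```

In the notebook, plt.imshow(zoomed) followed by plt.show() would display the result.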

Question 3: Use skimage to rescale the image to 20% of the initial size of the image. Display the image. Rescaling means lowering the resolution of the image. Remember that in class we talked about finding the computation/accuracy trade-off by showing different resolutions of the same image to humans and figuring out what is the minimum resolution leading to the maximum human accuracy.

#Rescale image to 20% of the initial size
###START CODE HERE ###

###END CODE HERE ###
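
A minimal sketch of the rescaling step, assuming a synthetic RGB array in place of the real photo. It computes the target height and width from a 0.2 scale factor and passes them to resize with anti-aliasing turned on, which smooths the image before downsampling. skimage.transform.rescale is an alternative that takes the factor directly.

```python
import numpy as np
from skimage.transform import resize

# Synthetic stand-in for the iguana photo (assumption: RGB uint8, rows x cols x 3).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(200, 300, 3), dtype=np.uint8)

scale = 0.2  # keep 20% of each spatial dimension
new_shape = (round(image.shape[0] * scale), round(image.shape[1] * scale))

# Only rows and columns are given, so the number of channels is preserved.
small = resize(image, new_shape, anti_aliasing=True)
```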

Question 4: Use skimage to add random noise to the image. Display the image.

#Add random noise
###START CODE HERE ###

###END CODE HERE ###
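
As a hedged illustration, skimage.util.random_noise can add several kinds of noise. The sketch below uses Gaussian noise on a flat gray test image; the variance value is an arbitrary choice for the example, not a recommended setting.

```python
import numpy as np
from skimage.util import random_noise

# Flat gray test image standing in for the photo (assumption: RGB uint8).
image = np.full((64, 64, 3), 128, dtype=np.uint8)

# The input is converted to float in [0, 1], noise is added, and the result is clipped.
noisy = random_noise(image, mode='gaussian', var=0.01)
```

Other modes such as 's&p' (salt and pepper) or 'speckle' can be tried by changing the mode argument.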

Question 5: Use skimage to rotate the image by 45 degrees.

#Rotate
###START CODE HERE ###

###END CODE HERE ###
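
A short sketch of the rotation, again on a synthetic array. skimage.transform.rotate takes the angle in degrees (counter-clockwise). By default the output keeps the input shape, so the corners are cut off; resize=True enlarges the canvas so the whole rotated image fits.

```python
import numpy as np
from skimage.transform import rotate

# Synthetic stand-in for the iguana photo (assumption: RGB uint8, rows x cols x 3).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(100, 150, 3), dtype=np.uint8)

rotated = rotate(image, 45)                    # same shape, corners cropped
rotated_full = rotate(image, 45, resize=True)  # larger canvas, nothing cropped
```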

Question 6: Use skimage to flip the image horizontally and vertically. Display the image.

#Horizontal flip
###START CODE HERE ###

###END CODE HERE ###
#Vertical flip
###START CODE HERE ###

###END CODE HERE ###
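
Images loaded with skimage are plain NumPy arrays, so one common way to flip them is to reverse an axis with slicing. This sketch uses a tiny array so the effect is easy to check; it is only one of several equivalent options (np.fliplr and np.flipud do the same thing).

```python
import numpy as np

# Tiny stand-in image: 2 rows, 3 columns, 3 channels.
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

flipped_h = image[:, ::-1]  # horizontal flip: reverse the column order
flipped_v = image[::-1, :]  # vertical flip: reverse the row order
```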

Question 7: (Optional) Use skimage to (i) blur the image, (ii) enhance its contrast, (iii) convert to grayscale, (iv) invert colors…

#Blur image
###START CODE HERE ###

###END CODE HERE ###

#Convert to grayscale
###START CODE HERE ###

###END CODE HERE ###

#Enhance contrast
###START CODE HERE ###

###END CODE HERE ###

#Color inversion
###START CODE HERE ###

###END CODE HERE ###
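
The four optional operations could be sketched as follows, assuming a synthetic RGB image. To keep the example independent of how different skimage versions handle color channels, the blur and the contrast enhancement are applied to the grayscale version here; histogram equalization is just one of several contrast methods in skimage.exposure.

```python
import numpy as np
from skimage import color, exposure, filters, util

# Synthetic stand-in for the iguana photo (assumption: RGB uint8).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

gray = color.rgb2gray(image)               # 2-D float image in [0, 1]
blurred = filters.gaussian(gray, sigma=2)  # Gaussian blur; a larger sigma blurs more
contrast = exposure.equalize_hist(gray)    # spread intensities over the full range
inverted = util.invert(image)              # for uint8 input this is 255 - value
```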

Skimage is a popular package for customized data preprocessing and augmentation. However, deep learning frameworks such as Keras often incorporate functions to help you preprocess data in a few lines of code.

Question 8: Read and run the Keras code for image preprocessing. It will save augmented images in a folder called “preview” in the notebook’s directory.

Image preprocessing in Keras

#Image preprocessing in Keras

from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

datagen = ImageDataGenerator(
        rotation_range=45,
        width_shift_range=0.3,
        height_shift_range=0.3,
        shear_range=0.3,
        zoom_range=0.3,
        horizontal_flip=True,
        fill_mode='nearest')

img = load_img('iguana.jpg')  # this is a PIL image
x = img_to_array(img)  # convert image to numpy array 
x = x.reshape((1,) + x.shape)  # reshape image to (1, ..,..,..) to fit keras' standard shape

#Use flow() to apply random augmentations as configured in the data generator
#and save the results to the `preview/` directory (created here if it is missing)
import os
os.makedirs('preview', exist_ok=True)
num_image_generated = 0
for batch in datagen.flow(x, batch_size=1, save_to_dir='preview', save_prefix='cat', save_format='jpeg'):
    num_image_generated += 1
    if num_image_generated > 20:
        break # flow() yields batches indefinitely, so stop once more than 20 images are saved

Question 9: (Optional) Train the CNN coded for you in the notebook (See Appendix below) on any of the pictures you created. Evaluate the model.

Part II - Text data preprocessing

Question 1: Go to any static website. Right-click and select “View Page Source”. Copy a complicated part of the HTML code and paste it into the notebook in the variable “html_txt”.

###START CODE HERE ###
html_txt = """ """
###END CODE HERE ###

print(html_txt)

Question 2: Use BeautifulSoup to parse the html_txt. Print the html_txt.

from bs4 import BeautifulSoup

#Parse the html input
###START CODE HERE ###

###END CODE HERE ###
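
A minimal sketch of the parsing step, assuming a short made-up HTML snippet in place of a real page. It builds a BeautifulSoup object with Python's built-in "html.parser" and extracts the visible text with get_text; the variable name text is only used for this illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the copied page source.
html_txt = "<html><body><h1>Iguanas</h1><p>Iguanas are <b>lizards</b>.[1]</p></body></html>"

soup = BeautifulSoup(html_txt, "html.parser")

# Keep only the text nodes, separated by spaces, with surrounding whitespace stripped.
text = soup.get_text(separator=" ", strip=True)
```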

print(html_txt)

Question 3: Use re to remove meta-characters such as square brackets and anything between them. Print the html_txt.

import re, string, unicodedata
#Remove meta characters and things between them.
###START CODE HERE ###

###END CODE HERE ###
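
One way to remove square brackets and their content is a single re.sub call. The pattern below matches an opening bracket, any run of characters that are not a closing bracket, then the closing bracket. It is a sketch on a made-up sentence and does not handle nested brackets.

```python
import re

# Made-up sentence with citation-style markers.
text = "Iguanas are lizards.[1] They are herbivores [citation needed]."

# \[ and \] match literal brackets; [^\]]* matches everything in between.
cleaned = re.sub(r"\[[^\]]*\]", "", text)
```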

print(html_txt)

Question 4: Using the Natural Language ToolKit (nltk), separate the text into a list of words.

import nltk
from nltk import word_tokenize, sent_tokenize

#Separate text into words
###START CODE HERE ###

###END CODE HERE ###

Question 5: (Optional) Remove non-ASCII characters. Convert to lowercase. Remove punctuation, stopwords, …

###START CODE HERE ###

###END CODE HERE ###
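
These cleaning steps can be sketched with the standard library alone. The helper name normalize and the tiny STOPWORDS set are hypothetical, chosen for this example; in practice a fuller stopword list (for instance the one shipped with nltk) would likely be used.

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "is", "are", "of"}

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def normalize(words):
    """Return the words with accents stripped, lowercased, and without punctuation or stopwords."""
    cleaned = []
    for word in words:
        # Decompose accented characters, then drop anything that is not ASCII.
        word = unicodedata.normalize("NFKD", word).encode("ascii", "ignore").decode("ascii")
        word = word.lower().translate(PUNCTUATION_TABLE)
        if word and word not in STOPWORDS:
            cleaned.append(word)
    return cleaned

words = ["The", "Iguana", "café", "is", "GREEN", "!"]
normalized = normalize(words)  # ['iguana', 'cafe', 'green']
```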

A machine cannot read this list of strings directly: you need to build a vocabulary and convert your words into integer tokens.

Question 6: Build the vocabulary from the list of words.

#Build Vocabulary
###START CODE HERE ###

###END CODE HERE ###

Question 7: Build word to integer mapping in Python. It should be sorted.

#Build word to integer mapping in Python. It should be sorted.
###START CODE HERE ###

###END CODE HERE ###

Question 8: Tokenize your text.

#Convert list of words into list of tokens using this mapping
###START CODE HERE ###

###END CODE HERE ###
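
Questions 6 to 8 fit together as a three-step pipeline, sketched here on a made-up word list. Starting the indices at 1 is an assumption made for this example: it leaves 0 free for padding, which mirrors the convention of the Keras Tokenizer used later.

```python
# Made-up list of words standing in for the tokenized page text.
words = ["the", "iguana", "eats", "the", "leaf"]

# Question 6: the vocabulary is the set of unique words, sorted alphabetically.
vocabulary = sorted(set(words))

# Question 7: map each word to an integer, following the sorted order.
word_to_int = {word: index for index, word in enumerate(vocabulary, start=1)}

# Question 8: replace every word by its integer token.
tokens = [word_to_int[word] for word in words]  # [4, 2, 1, 4, 3]
```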

Question 9: Read and run the Keras code for text preprocessing. It uses the Tokenizer class.

#Preprocess text with Keras for Sentiment classification
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

examples = ['You are amazing!','It is so bad','Congratulations','You suck bro','Awesome dude!']
Y = [1, 0, 1, 0, 1]

#Define Tokenizer
t = Tokenizer()
#Fit Tokenizer on text (Build vocab etc..)
t.fit_on_texts(examples)
#Convert texts to sequences of integers
X = t.texts_to_sequences(examples)
#Pad sequences of integers
X = pad_sequences(X, padding = 'post')

#Get the vocabulary size, useful for the embedding layer.
vocab_size = len(t.word_index) + 1
print(vocab_size)
print(X)
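
To make the padding step concrete, here is a small NumPy sketch of what post-padding does when no maximum length is given: every sequence is extended with zeros on the right until it matches the longest one. The helper pad_post is hypothetical and written for this illustration; it is not the Keras implementation.

```python
import numpy as np

def pad_post(sequences, value=0):
    """Right-pad integer sequences with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

padded = pad_post([[1, 2, 3], [4], [5, 6]])
# [[1 2 3]
#  [4 0 0]
#  [5 6 0]]
```

The rectangular array is what allows the examples to be stacked into a single batch for the embedding layer.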

Question 10: (Optional) Train the RNN coded for you in the notebook on the sentiment classification task (with 5 examples). Evaluate the model.

Appendix: Models and training codes

#CNN
model_CNN = Sequential()
model_CNN.add(Conv2D(32, (7, 7), strides = (1, 1), name = 'conv0', input_shape = image.shape))
model_CNN.add(BatchNormalization(axis = 3, name = 'bn0'))
model_CNN.add(Activation('relu'))
model_CNN.add(MaxPooling2D((2, 2), name='max_pool'))
model_CNN.add(Flatten())
model_CNN.add(Dense(1, activation='sigmoid', name='fc'))

#RNN
model_RNN = Sequential()
model_RNN.add(Embedding(vocab_size, 128))
model_RNN.add(LSTM(128))
model_RNN.add(Dense(1, activation='sigmoid'))

#training code for CNN
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model_CNN.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
model_CNN.fit(np.expand_dims(image, axis=0), np.array([1]), epochs=2)

#training code for RNN
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model_RNN.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
model_RNN.fit(np.array(X), np.array(Y), epochs=1000)

#testing code for CNN
model_CNN.predict(np.expand_dims(image_blured, axis=0))