Optical Character Recognition In Python

This is a study of a topic that's been addressed many times before.

$\S 1:$ Introduction: Optical Character Recognition

OCR is a topic in machine learning that has been widely studied. Using part of the (also well-known) Chars74K dataset, I develop multiple classifier models for street-view characters obtained from Google Maps imagery.

This project is rather code-heavy, but if you're familiar with the way scikit-learn works it shouldn't require much explanation. If you're not, feel free to reach out to me directly with questions/comments at derekjanni@gmail.com.

For a basic roadmap of what's in this notebook:

  1. Introduction
  2. Importing and Cleaning the Data
  3. Setting Up Some Models
  4. Testing the Models with Real Images
  5. Histogram of Oriented Gradients as a Feature Matrix
  6. Sending the Data to JSON for Visualization
In [29]:
# sklearn models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import linear_model
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import scale

# sklearn metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve 
from sklearn.metrics import auc
from sklearn.metrics import classification_report
from sklearn.learning_curve import learning_curve      # moved to sklearn.model_selection in newer scikit-learn
from sklearn.cross_validation import cross_val_score   # likewise deprecated in favor of sklearn.model_selection
from sklearn.cross_validation import train_test_split

# graphs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# images
from scipy.ndimage import convolve
from skimage.feature import hog
from skimage import draw, data, io, segmentation, color, exposure
from skimage.measure import regionprops
from skimage.filters import threshold_otsu
from skimage.transform import resize 
from skimage.transform import warp 
from PIL import Image

# basics
import pickle
import pandas as pd
import numpy as np
from pprint import pprint

$\S 2:$ Importing and Cleaning the Data

The data comes in as a folder of raw images, and a .csv with accompanying labels. The code below provides the following functionality:

  • Import raw image by filename (needs to be .Bmp)
  • Convert to Grayscale
  • Use Otsu's Thresholding method to reduce noise
  • Define "Nudging" to widen dataset and account for variance in signal location in image (used below on training data)
  • Convert images to 1D np array of values

In the future, the following functionality should be added:

  • Add Skewed images to training data
  • Increase size of dataset
In [15]:
import math
import cv2

def img_round(x, base=75):
    """
    Legacy helper that rounded pixel intensities down to multiples of `base`;
    superseded by the Otsu binarization below.
    """
    return base * math.floor(float(x) / base)
vround = np.vectorize(img_round)

def get_img(i, size):
    """
    Loads training image i+1 (the files are 1-indexed) as grayscale, resizes
    it to (size x size), and binarizes it with Otsu's threshold.
    """
    img = Image.open('/users/derekjanni/pyocr/train/'+ str(i+1) + '.Bmp')
    img = img.convert("L")
    img = img.resize((size, size))
    image = np.asarray(img)
    image.setflags(write=True)
    thresh = threshold_otsu(image)
    binary = image > thresh
    return binary

def nudge_dataset(X, Y, size):
    """
    This produces a dataset 5 times bigger than the original one,
    by moving the (size x size) images around by 1px to left, right, down, up
    """
    direction_vectors = [
        [[0, 1, 0],
         [0, 0, 0],
         [0, 0, 0]],

        [[0, 0, 0],
         [1, 0, 0],
         [0, 0, 0]],

        [[0, 0, 0],
         [0, 0, 1],
         [0, 0, 0]],

        [[0, 0, 0],
         [0, 0, 0],
         [0, 1, 0]]]

    shift = lambda x, w: convolve(x.reshape((size, size)), mode='constant',
                                  weights=w).ravel()
    X = np.concatenate([X] +
                       [np.apply_along_axis(shift, 1, X, vector)
                        for vector in direction_vectors])
    Y = np.concatenate([Y for _ in range(5)], axis=0)
    return X, Y

def show_img(img):
    width = 5.0
    height = img.shape[0]*width/img.shape[1]
    f = plt.figure(figsize=(width, height))
    plt.imshow(img)
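As a quick sanity check (a hypothetical snippet, not part of the original run), nudging should multiply the number of rows by five while leaving the feature width alone:

X_demo = (np.random.rand(100, 2500) > 0.5).astype(float)  # fake binary "images"
Y_demo = np.repeat(np.arange(10), 10)
X_big, Y_big = nudge_dataset(X_demo, Y_demo, 50)
print X_big.shape, Y_big.shape   # expect (500, 2500) (500,)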
In [16]:
df = pd.read_csv('trainLabels.csv', header=0)
raw_y = np.asarray(df['Class'])
raw_x = np.asarray([get_img(i, 50) for i in df.index]).astype(float)
x = np.asarray([i.ravel() for i in raw_x])
y = raw_y
print x.shape, y.shape
(6283, 2500) (6283,)

$\S 3:$ Setting Up Some Models

I've run just about every classifier algorithm known to man on this dataset, with varying results. For the sake of runtime I've limited the selection here, since we'll be pushing a lot of images through each model. Below are three of the best and most intuitive methods. A few notes:

  • Bernoulli Naive Bayes is an intuitive choice, since it models binary features (which is exactly what Otsu's method gives us) and, per class, asks a binary question (this image either is or is not the letter "A", for instance). It doesn't exploit nudging as well as its counterparts, as we'll see below.

  • If you disable the line below that performs the "nudging", you'll see that the performance of KNN and Random Forest decreases rather dramatically! That makes intuitive sense: shifting each image one pixel in the four directions makes the models robust to small translations of the character within the frame.

A necessary improvement would be to account for skew in the input images, or to deskew them automatically on import.
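Deskewing isn't implemented here, but a minimal sketch of the classic moment-based approach (my own illustration, along the lines of OpenCV's digits example; `deskew` and its parameters are not part of this notebook) might look like this, using the cv2 module already imported above:

def deskew(img, size=50):
    # hypothetical helper: shear the character so its principal axis is vertical
    m = cv2.moments(img.astype(np.uint8))
    if abs(m['mu02']) < 1e-2:
        return img                      # almost no vertical spread; leave as-is
    skew = m['mu11'] / m['mu02']
    M = np.float32([[1, skew, -0.5 * size * skew],
                    [0, 1, 0]])
    return cv2.warpAffine(img.astype(np.uint8), M, (size, size),
                          flags=cv2.WARP_INVERSE_MAP | cv2.INTER_LINEAR)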

In [17]:
# declare models for explicit-ness
models = {'(5) K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
          'Bernoulli Naive Bayes': BernoulliNB(),
          'Random Forest Classifier': RandomForestClassifier()
         }

from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y)

#nudge dataset improves performance: test and see!
X_train, Y_train = nudge_dataset(X_train, Y_train, 50)

def precision_report(model):
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    return classification_report(Y_test, Y_pred)

for i in models:
    print i + ':\n' + str(precision_report(models[i]))
Bernoulli Naive Bayes:
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        16
          1       0.03      0.24      0.06        17
          2       0.04      0.17      0.06        12
          3       0.08      0.50      0.14        10
          4       0.03      0.38      0.05         8
          5       0.00      0.00      0.00        12
          6       0.00      0.00      0.00        18
          7       0.18      0.75      0.29         8
          8       0.01      0.20      0.02         5
          9       0.00      0.00      0.00         9
          A       0.00      0.00      0.00       129
          B       0.00      0.00      0.00        14
          C       0.00      0.00      0.00        40
          D       0.00      0.00      0.00        40
          E       0.00      0.00      0.00       104
          F       0.09      0.06      0.07        17
          G       0.00      0.00      0.00        30
          H       0.67      0.11      0.20        35
          I       0.00      0.00      0.00        67
          J       0.04      0.06      0.05        17
          K       0.00      0.00      0.00        12
          L       0.00      0.00      0.00        51
          M       0.00      0.00      0.00        37
          N       0.00      0.00      0.00        80
          O       0.00      0.00      0.00        74
          P       0.00      0.00      0.00        27
          Q       0.00      0.00      0.00         7
          R       0.40      0.16      0.23        86
          S       0.50      0.12      0.19        78
          T       0.00      0.00      0.00        63
          U       0.08      0.17      0.10        24
          V       0.00      0.00      0.00        15
          W       0.02      0.09      0.03        11
          X       0.00      0.00      0.00         9
          Y       0.00      0.00      0.00        12
          Z       0.00      0.00      0.00        10
          a       0.17      0.03      0.04        39
          b       0.05      0.40      0.09         5
          c       0.00      0.00      0.00        10
          d       0.00      0.00      0.00        20
          e       0.20      0.03      0.05        36
          f       0.00      0.00      0.00         6
          g       0.06      0.20      0.09         5
          h       0.03      0.62      0.06         8
          i       0.10      0.04      0.05        27
          j       0.00      0.00      0.00         5
          k       0.01      0.50      0.01         4
          l       0.00      0.00      0.00        18
          m       0.10      0.09      0.10        11
          n       0.00      0.00      0.00        18
          o       0.12      0.05      0.07        22
          p       0.00      0.00      0.00         5
          q       0.00      0.00      0.00         8
          r       0.11      0.04      0.06        26
          s       0.08      0.03      0.05        31
          t       0.00      0.00      0.00        25
          u       0.00      0.00      0.00         8
          v       0.00      0.00      0.00         9
          w       0.00      0.00      0.00         3
          x       0.02      0.50      0.04         6
          y       0.00      0.00      0.00         6
          z       0.00      0.00      0.00         6

avg / total       0.08      0.05      0.04      1571

(5) K-Nearest Neighbors:
             precision    recall  f1-score   support

          0       0.14      0.38      0.20        16
          1       0.37      0.41      0.39        17
          2       0.50      0.42      0.45        12
          3       0.80      0.40      0.53        10
          4       0.78      0.88      0.82         8
          5       0.28      0.42      0.33        12
          6       0.89      0.44      0.59        18
          7       0.54      0.88      0.67         8
          8       0.75      0.60      0.67         5
          9       0.80      0.44      0.57         9
          A       0.90      0.84      0.87       129
          B       0.32      0.43      0.36        14
          C       0.45      0.55      0.49        40
          D       0.61      0.42      0.50        40
          E       0.78      0.72      0.75       104
          F       0.56      0.59      0.57        17
          G       0.59      0.33      0.43        30
          H       0.64      0.60      0.62        35
          I       0.36      0.61      0.46        67
          J       0.42      0.29      0.34        17
          K       0.50      0.67      0.57        12
          L       0.74      0.84      0.79        51
          M       0.79      0.70      0.74        37
          N       0.78      0.78      0.78        80
          O       0.41      0.53      0.46        74
          P       0.56      0.81      0.67        27
          Q       0.00      0.00      0.00         7
          R       0.68      0.62      0.65        86
          S       0.56      0.51      0.54        78
          T       0.75      0.83      0.79        63
          U       0.48      0.50      0.49        24
          V       0.41      0.47      0.44        15
          W       0.78      0.64      0.70        11
          X       0.38      0.33      0.35         9
          Y       0.46      0.50      0.48        12
          Z       0.50      0.10      0.17        10
          a       0.48      0.41      0.44        39
          b       0.67      0.40      0.50         5
          c       0.21      0.30      0.25        10
          d       0.89      0.40      0.55        20
          e       0.58      0.53      0.55        36
          f       0.40      0.33      0.36         6
          g       0.00      0.00      0.00         5
          h       0.57      0.50      0.53         8
          i       0.46      0.44      0.45        27
          j       1.00      0.20      0.33         5
          k       0.50      0.50      0.50         4
          l       0.12      0.17      0.14        18
          m       0.80      0.36      0.50        11
          n       0.33      0.50      0.40        18
          o       0.09      0.09      0.09        22
          p       0.50      0.20      0.29         5
          q       0.67      0.25      0.36         8
          r       0.41      0.65      0.51        26
          s       0.47      0.29      0.36        31
          t       0.50      0.20      0.29        25
          u       0.20      0.12      0.15         8
          v       0.20      0.11      0.14         9
          w       0.33      0.33      0.33         3
          x       0.00      0.00      0.00         6
          y       0.09      0.17      0.12         6
          z       0.00      0.00      0.00         6

avg / total       0.58      0.56      0.56      1571

Random Forest Classifier:
             precision    recall  f1-score   support

          0       0.18      0.44      0.26        16
          1       0.24      0.24      0.24        17
          2       0.28      0.42      0.33        12
          3       0.50      0.70      0.58        10
          4       0.40      0.75      0.52         8
          5       0.20      0.25      0.22        12
          6       0.44      0.22      0.30        18
          7       0.75      0.75      0.75         8
          8       0.00      0.00      0.00         5
          9       0.50      0.67      0.57         9
          A       0.63      0.83      0.72       129
          B       0.27      0.50      0.35        14
          C       0.50      0.60      0.55        40
          D       0.44      0.57      0.50        40
          E       0.69      0.79      0.74       104
          F       0.59      0.59      0.59        17
          G       0.52      0.53      0.52        30
          H       0.55      0.51      0.53        35
          I       0.42      0.66      0.51        67
          J       0.00      0.00      0.00        17
          K       0.46      0.50      0.48        12
          L       0.75      0.86      0.80        51
          M       0.69      0.59      0.64        37
          N       0.72      0.70      0.71        80
          O       0.56      0.62      0.59        74
          P       0.75      0.78      0.76        27
          Q       0.50      0.14      0.22         7
          R       0.71      0.62      0.66        86
          S       0.58      0.54      0.56        78
          T       0.58      0.78      0.67        63
          U       0.64      0.38      0.47        24
          V       0.37      0.47      0.41        15
          W       0.56      0.45      0.50        11
          X       0.50      0.11      0.18         9
          Y       0.50      0.33      0.40        12
          Z       0.00      0.00      0.00        10
          a       0.52      0.44      0.47        39
          b       0.75      0.60      0.67         5
          c       0.14      0.10      0.12        10
          d       0.67      0.10      0.17        20
          e       0.58      0.53      0.55        36
          f       1.00      0.33      0.50         6
          g       0.00      0.00      0.00         5
          h       0.38      0.38      0.38         8
          i       0.50      0.52      0.51        27
          j       0.00      0.00      0.00         5
          k       0.50      0.25      0.33         4
          l       0.00      0.00      0.00        18
          m       0.62      0.45      0.53        11
          n       0.47      0.44      0.46        18
          o       0.18      0.14      0.15        22
          p       0.00      0.00      0.00         5
          q       0.00      0.00      0.00         8
          r       0.56      0.58      0.57        26
          s       0.31      0.13      0.18        31
          t       0.45      0.20      0.28        25
          u       0.25      0.12      0.17         8
          v       0.00      0.00      0.00         9
          w       1.00      0.33      0.50         3
          x       0.00      0.00      0.00         6
          y       0.17      0.17      0.17         6
          z       0.00      0.00      0.00         6

avg / total       0.52      0.54      0.52      1571

//anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.py:958: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

So we see that KNN and Random Forest are our best predictors on the whole. Let's look at their accuracy scores, as opposed to precision/recall, to get a feel for their actual performance.

In [19]:
def accuracy_report(model):
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    return accuracy_score(Y_test, Y_pred)

for i in models:
    print i + ':\n' + str(accuracy_report(models[i]))
Bernoulli Naive Bayes:
0.0477402928071
(5) K-Nearest Neighbors:
0.558243157225
Random Forest Classifier:
0.532781667728

To make my own life easier, I pickled the models here so I can pull them up later.

In [20]:
# models
knn = KNeighborsClassifier(n_neighbors=5)
rfc = RandomForestClassifier()
bnb = BernoulliNB()

# fit
knn.fit(X_train, Y_train)
rfc.fit(X_train, Y_train)
bnb.fit(X_train, Y_train)

# pickle
import pickle
with open('knn.pkl', 'w') as picklefile:
    pickle.dump(knn, picklefile)
with open('rfc.pkl', 'w') as picklefile:
    pickle.dump(rfc, picklefile)
with open('bnb.pkl', 'w') as picklefile:
    pickle.dump(bnb, picklefile)
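Text mode ('w') works here because this is Python 2 and pickle's default protocol is ASCII; under Python 3 the same calls would need the files opened in binary mode:

with open('knn.pkl', 'wb') as picklefile:   # Python 3: binary mode required
    pickle.dump(knn, picklefile)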

KNN is an interesting (and intuitive) solution to this problem, given the nudging technique. Let's see how the algorithm performs for varying k.

In [21]:
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.33, random_state=42)

#nudge dataset improves performance: test and see!
X_train, Y_train = nudge_dataset(X_train, Y_train, 50)

def accuracy_knn(k):
    neighbors = KNeighborsClassifier(n_neighbors=k)
    neighbors.fit(X_train, Y_train)
    return accuracy_score(Y_test, neighbors.predict(X_test))

k = [i for i in range(1, 20)]
acc_knn = [accuracy_knn(i) for i in range(1, 20)]
plt.figure(figsize=(10,7)).suptitle("Accuracy Score vs. K in  KNN Classification", fontsize='15')
plt.xlabel('k', fontsize='15')
plt.ylabel('Accuracy Score', fontsize='15')
plt.plot(k, acc_knn)
Out[21]:
[<matplotlib.lines.Line2D at 0x1029e7b10>]
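To read the best value off programmatically rather than eyeballing the plot (a trivial snippet, not in the original run):

best_k = k[int(np.argmax(acc_knn))]
print 'best k:', best_k, 'accuracy:', max(acc_knn)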

$\S 4:$ Testing the models with real images

Sometimes seeing really is believing, and it's fun to watch a model work for yourself! Use the code below to pull a random test image, check the classifier's guess, and notice where it goes wrong.

In [22]:
with open('knn.pkl') as picklefile:
    knn = pickle.load(picklefile)
with open('rfc.pkl') as picklefile:
    rfc = pickle.load(picklefile)
In [23]:
def get_test_img(i):
    """
    Loads test image i as grayscale, resizes it to 50x50, and binarizes it
    with Otsu's threshold.
    """
    img = Image.open('/users/derekjanni/pyocr/test/'+ str(i) + '.Bmp')
    img = img.convert("L")
    img = img.resize((50, 50))
    image = np.asarray(img)
    image.setflags(write=True)
    thresh = threshold_otsu(image)
    binary = image > thresh
    return binary
In [24]:
from random import randint

random_image = randint(6284, 12503)
# reshape to (1, n_features): newer scikit-learn requires 2D input to predict
print("My Guess for this file is:" + str(knn.predict(get_test_img(random_image).ravel().reshape(1, -1))).strip('[]\''))
show_img(get_test_img(random_image))
My Guess for this file is:I

The code below writes a submission for the relevant Kaggle competition. As of 8/5/2015 my model is among the top 10!

In [25]:
with open('submission2.csv', 'w') as outfile:
    outfile.write('ID,Class\n')
    for i in range(6284, 12504):
        # reshape to (1, n_features) for the same 2D-input reason as above
        pre = str(rfc.predict(get_test_img(i).ravel().reshape(1, -1))).strip('[]\'')
        outfile.write(str(i) + ',' + pre + '\n')

$\S 5:$ Histogram of Oriented Gradients as a feature matrix

It might look like I've done really well here; however, part of this is cooked up: because the images are nudged, the training set contains a lot of overlapping, repetitively labeled samples. It would be more interesting if a HOG approach could define the letters better! Let's see how KNN, Bernoulli Naive Bayes and Random Forest perform on a Histogram of Oriented Gradients (a LinearSVC would be another natural candidate; see the note below).
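If you want to try the linear SVM as well, it drops straight into the models dict defined in the next cell (a hypothetical addition, not part of the original run):

models['Linear SVC'] = LinearSVC()   # imported above; faster than kernel SVC at this sample count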

In [30]:
def get_hog(img):
    # note: this uses the rescaled HOG *visualization image* as the feature
    # matrix, rather than the descriptor vector fd that hog() also returns
    fd, hog_image = hog(img, orientations=10, pixels_per_cell=(5, 5), cells_per_block=(2, 2), visualise=True)
    return exposure.rescale_intensity(hog_image, in_range=(0, 0.9))
    
def get_img_hog(i, size):
    """
    Returns the HOG image of the binarized training image with index i
    """
    img = Image.open('/users/derekjanni/pyocr/train/'+ str(i+1) + '.Bmp')
    img = img.convert("L")
    img = img.resize((size,size))
    image = np.asarray(img)
    image.setflags(write=True)
    thresh = threshold_otsu(image)
    binary = image > thresh
    return get_hog(binary)

df = pd.read_csv('trainLabels.csv', header=0)
raw_y = np.asarray(df['Class'])
raw_x = np.asarray([get_img_hog(i, 50) for i in df.index]).astype(float)
x = np.asarray([i.ravel() for i in raw_x])
y = raw_y

# declare models for explicit-ness
models = {'(5) K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
          'Bernoulli Naive Bayes': BernoulliNB(),
          'Random Forest Classifier': RandomForestClassifier()
         }

from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y)

#nudge dataset improves performance: test and see!
X_train, Y_train = nudge_dataset(X_train, Y_train, 50)

def precision_report(model):
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    return classification_report(Y_test, Y_pred)

for i in models:
    print i + ':\n' + str(precision_report(models[i]))
Bernoulli Naive Bayes:
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        20
          1       0.03      0.06      0.04        16
          2       0.00      0.00      0.00        14
          3       0.00      0.00      0.00         9
          4       0.00      0.00      0.00         8
          5       0.00      0.00      0.00        12
          6       0.00      0.00      0.00        12
          7       0.50      0.27      0.35        11
          8       0.00      0.00      0.00         7
          9       0.00      0.00      0.00        11
          A       0.60      0.38      0.47       117
          B       0.07      0.27      0.11        11
          C       0.34      0.21      0.26        47
          D       0.44      0.22      0.30        49
          E       0.21      0.35      0.26        81
          F       0.11      0.57      0.19        21
          G       0.25      0.05      0.08        20
          H       0.35      0.20      0.25        41
          I       0.35      0.30      0.32        63
          J       0.60      0.23      0.33        13
          K       0.00      0.00      0.00        15
          L       0.55      0.68      0.61        41
          M       0.06      0.17      0.09        30
          N       0.37      0.32      0.34        62
          O       0.35      0.30      0.32        76
          P       0.28      0.16      0.20        32
          Q       0.00      0.00      0.00         4
          R       0.35      0.11      0.16        74
          S       0.37      0.17      0.23        82
          T       0.31      0.46      0.37        56
          U       0.60      0.15      0.24        20
          V       0.15      0.15      0.15        13
          W       0.14      0.53      0.22        17
          X       0.00      0.00      0.00        15
          Y       0.00      0.00      0.00        17
          Z       0.00      0.00      0.00         8
          a       0.11      0.09      0.10        33
          b       0.00      0.00      0.00         2
          c       0.04      0.07      0.05        15
          d       0.10      0.45      0.17        11
          e       0.19      0.33      0.24        54
          f       0.00      0.00      0.00         6
          g       0.00      0.00      0.00        10
          h       1.00      0.18      0.31        11
          i       0.15      0.16      0.16        31
          j       0.00      0.00      0.00         4
          k       0.00      0.00      0.00         7
          l       0.13      0.29      0.18        17
          m       0.04      0.23      0.07        13
          n       0.16      0.15      0.15        34
          o       0.00      0.00      0.00        29
          p       0.00      0.00      0.00         6
          q       0.00      0.00      0.00         5
          r       0.36      0.22      0.27        37
          s       0.04      0.08      0.06        25
          t       0.21      0.15      0.17        34
          u       0.00      0.00      0.00         9
          v       0.00      0.00      0.00         7
          w       0.00      0.00      0.00         5
          x       0.00      0.00      0.00         6
          y       0.00      0.00      0.00        11
          z       0.00      0.00      0.00         4

avg / total       0.27      0.22      0.22      1571

(5) K-Nearest Neighbors:
             precision    recall  f1-score   support

          0       0.12      0.30      0.17        20
          1       0.12      0.25      0.17        16
          2       0.28      0.50      0.36        14
          3       0.12      0.11      0.12         9
          4       0.50      0.25      0.33         8
          5       0.14      0.25      0.18        12
          6       0.50      0.33      0.40        12
          7       0.38      0.45      0.42        11
          8       0.00      0.00      0.00         7
          9       0.00      0.00      0.00        11
          A       0.67      0.90      0.77       117
          B       0.08      0.18      0.11        11
          C       0.39      0.60      0.47        47
          D       0.44      0.49      0.46        49
          E       0.68      0.78      0.73        81
          F       0.53      0.43      0.47        21
          G       0.36      0.20      0.26        20
          H       0.56      0.54      0.55        41
          I       0.33      0.68      0.44        63
          J       1.00      0.38      0.56        13
          K       0.31      0.27      0.29        15
          L       0.73      0.88      0.80        41
          M       0.65      0.57      0.61        30
          N       0.74      0.79      0.77        62
          O       0.38      0.55      0.45        76
          P       0.62      0.56      0.59        32
          Q       0.00      0.00      0.00         4
          R       0.70      0.62      0.66        74
          S       0.60      0.65      0.62        82
          T       0.66      0.88      0.75        56
          U       0.58      0.35      0.44        20
          V       0.42      0.77      0.54        13
          W       0.55      0.35      0.43        17
          X       0.60      0.40      0.48        15
          Y       0.20      0.12      0.15        17
          Z       0.67      0.25      0.36         8
          a       0.60      0.45      0.52        33
          b       0.00      0.00      0.00         2
          c       0.00      0.00      0.00        15
          d       0.50      0.36      0.42        11
          e       0.95      0.39      0.55        54
          f       1.00      0.50      0.67         6
          g       1.00      0.10      0.18        10
          h       0.62      0.45      0.53        11
          i       1.00      0.23      0.37        31
          j       0.00      0.00      0.00         4
          k       0.00      0.00      0.00         7
          l       0.12      0.06      0.08        17
          m       0.50      0.46      0.48        13
          n       0.55      0.32      0.41        34
          o       0.08      0.03      0.05        29
          p       1.00      0.50      0.67         6
          q       0.00      0.00      0.00         5
          r       0.63      0.51      0.57        37
          s       0.19      0.12      0.15        25
          t       0.91      0.29      0.44        34
          u       1.00      0.11      0.20         9
          v       0.00      0.00      0.00         7
          w       0.50      0.20      0.29         5
          x       0.00      0.00      0.00         6
          y       0.50      0.09      0.15        11
          z       0.00      0.00      0.00         4

avg / total       0.54      0.51      0.49      1571

Random Forest Classifier:
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        20
          1       0.06      0.12      0.08        16
          2       0.00      0.00      0.00        14
          3       0.00      0.00      0.00         9
          4       0.00      0.00      0.00         8
          5       0.00      0.00      0.00        12
          6       0.17      0.17      0.17        12
          7       0.22      0.18      0.20        11
          8       0.00      0.00      0.00         7
          9       0.00      0.00      0.00        11
          A       0.35      0.74      0.48       117
          B       0.00      0.00      0.00        11
          C       0.29      0.32      0.31        47
          D       0.32      0.27      0.29        49
          E       0.37      0.65      0.47        81
          F       0.33      0.19      0.24        21
          G       0.24      0.20      0.22        20
          H       0.40      0.34      0.37        41
          I       0.32      0.46      0.37        63
          J       0.25      0.08      0.12        13
          K       0.00      0.00      0.00        15
          L       0.63      0.63      0.63        41
          M       0.35      0.20      0.26        30
          N       0.37      0.48      0.42        62
          O       0.28      0.32      0.29        76
          P       0.57      0.25      0.35        32
          Q       0.00      0.00      0.00         4
          R       0.33      0.47      0.39        74
          S       0.33      0.33      0.33        82
          T       0.48      0.71      0.58        56
          U       0.27      0.15      0.19        20
          V       0.12      0.08      0.10        13
          W       0.67      0.12      0.20        17
          X       0.00      0.00      0.00        15
          Y       0.00      0.00      0.00        17
          Z       0.00      0.00      0.00         8
          a       0.18      0.18      0.18        33
          b       0.00      0.00      0.00         2
          c       0.00      0.00      0.00        15
          d       0.00      0.00      0.00        11
          e       0.41      0.22      0.29        54
          f       0.00      0.00      0.00         6
          g       1.00      0.10      0.18        10
          h       0.00      0.00      0.00        11
          i       0.27      0.13      0.17        31
          j       0.00      0.00      0.00         4
          k       0.00      0.00      0.00         7
          l       0.00      0.00      0.00        17
          m       0.20      0.08      0.11        13
          n       0.45      0.15      0.22        34
          o       0.18      0.10      0.13        29
          p       0.00      0.00      0.00         6
          q       1.00      0.20      0.33         5
          r       0.44      0.30      0.35        37
          s       0.00      0.00      0.00        25
          t       0.71      0.15      0.24        34
          u       0.00      0.00      0.00         9
          v       0.00      0.00      0.00         7
          w       0.00      0.00      0.00         5
          x       0.00      0.00      0.00         6
          y       1.00      0.09      0.17        11
          z       0.00      0.00      0.00         4

avg / total       0.30      0.30      0.28      1571

Again, the accuracy scores are of interest:

In [31]:
for i in models:
    print i + ':\n' + str(accuracy_report(models[i]))
Bernoulli Naive Bayes:
0.222151495863
(5) K-Nearest Neighbors:
0.50732017823
Random Forest Classifier:
0.301718650541
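One caveat: as noted above, these models were trained on the rescaled HOG visualization image. Feeding the classifier the raw HOG descriptor is the more conventional choice; a minimal sketch of that variant (my own illustration; get_hog_fd is not part of this notebook):

def get_hog_fd(i, size=50):
    # hypothetical variant of get_img_hog that returns the HOG descriptor vector
    img = Image.open('/users/derekjanni/pyocr/train/' + str(i + 1) + '.Bmp')
    img = img.convert("L").resize((size, size))
    image = np.asarray(img)
    binary = (image > threshold_otsu(image)).astype(float)
    # with visualise left off, hog() returns just the feature vector
    return hog(binary, orientations=10, pixels_per_cell=(5, 5),
               cells_per_block=(2, 2))

Note that nudge_dataset assumes each row reshapes to (size, size), so any nudging would have to happen on the binary image before the HOG transform.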

$\S 6:$ Sending the data to JSON for visualization

In my opinion, the confusion matrix for this problem is one of the most interesting visuals I can generate! It's a great summary of where the KNN model fails and might help show why it fails as well.

In [ ]:
with open('knn.pkl') as picklefile:
    knn = pickle.load(picklefile)

#test/train split    
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.33, random_state=42)

#nudge dataset improves performance: test and see!
X_train, Y_train = nudge_dataset(X_train, Y_train, 50)

knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
In [ ]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
In [ ]:
# normalize the confusion matrix so each row (true class) sums to 1
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print cm.shape
In [ ]:
columns = ["0",
        "1",
        "2",
        "3",
        "4", 
        "5", 
        "6", 
        "7", 
        "8", 
        "9", 
        "A",
        "B", 
        "C", 
        "D", 
        "E", 
        "F", 
        "G", 
        "H", 
        "I", 
        "J", 
        "K", 
        "L",
        "M",
        "N", 
        "O", 
        "P", 
        "Q", 
        "R", 
        "S", 
        "T", 
        "U", 
        "V", 
        "W", 
        "X",
        "Y",
        "Z", 
        "a", 
        "b", 
        "c", 
        "d", 
        "e", 
        "f", 
        "g", 
        "h", 
        "i", 
        "j",
        "k",
        "l",
        "m",
        "n",
        "o",
        "p",
        "q", 
        "r",
        "s",
        "t",
        "u",
        "v",
        "w",
        "x",
        "y",
        "z"
        ];
In [ ]:
rows = ["0",
        "1",
        "2",
        "3",
        "4", 
        "5", 
        "6", 
        "7", 
        "8", 
        "9", 
        "A",
        "B", 
        "C", 
        "D", 
        "E", 
        "F", 
        "G", 
        "H", 
        "I", 
        "J", 
        "K", 
        "L",
        "M",
        "N", 
        "O", 
        "P", 
        "Q", 
        "R", 
        "S", 
        "T", 
        "U", 
        "V", 
        "W", 
        "X",
        "Y",
        "Z", 
        "a", 
        "b", 
        "c", 
        "d", 
        "e", 
        "f", 
        "g", 
        "h", 
        "i", 
        "j",
        "k",
        "l",
        "m",
        "n",
        "o",
        "p",
        "q", 
        "r",
        "s",
        "t",
        "u",
        "v",
        "w",
        "x",
        "y",
        "z"
        ];
In [ ]:
data = list(list(i) for i in cm)
In [ ]:
knn_data = {
    "columns": [list(["R", i]) for i  in columns],
    "index": [list(i) for i in rows],
    "data": data,
}

knn_numbers = {
    "columns": [list(["R", i]) for i  in columns[:10]],
    "index": [list(i) for i in rows[:10]],
    "data": [i[:10] for i in data[:10]],
}

knn_caps = {
    "columns": [list(["R", i]) for i  in columns[10:36]],
    "index": [list(i) for i in rows[10:36]],
    "data": [i[10:36] for i in data[10:36]],
}

knn_lower = {
    "columns": [list(["R", i]) for i  in columns[36:]],
    "index": [list(i) for i in rows[36:]],
    "data": [i[36:] for i in data[36:]],
}
In [ ]:
import json

with open('knn_data.json', 'w') as outfile:
    json.dump(knn_data, outfile)

with open('knn_numbers.json', 'w') as outfile:
    json.dump(knn_numbers, outfile)
      
with open('knn_caps.json', 'w') as outfile:
    json.dump(knn_caps, outfile)

with open('knn_lower.json', 'w') as outfile:
    json.dump(knn_lower, outfile)
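As a quick check (a hypothetical snippet, not part of the original run), the export can be reloaded and inspected:

with open('knn_data.json') as infile:
    loaded = json.load(infile)
print len(loaded['columns']), len(loaded['index'])   # expect 62 62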