This is a study of a topic that has been addressed many times: optical character recognition on street-view images.
$\S 1:$ Introduction: Optical Character Recognition
OCR is a widely studied problem in machine learning. Using a subset of the (also well-known) Chars74K dataset, I develop several classifier models for street-view characters taken from Google Maps.
This project is rather code-heavy, but if you're familiar with how scikit-learn works it shouldn't require much explanation. If you're not, feel free to reach out with questions or comments directly at derekjanni@gmail.com.
For a basic roadmap of what's in this notebook: §2 imports and cleans the data, §3 sets up and compares several classifiers, §4 tests the models on real test images and writes a Kaggle submission, §5 swaps in a Histogram of Oriented Gradients feature matrix, and §6 exports confusion-matrix data to JSON for visualization.
# sklearn models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import linear_model
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import scale
# sklearn metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import classification_report
from sklearn.learning_curve import learning_curve
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import train_test_split
# graphs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# images
from scipy.ndimage import convolve
from skimage.feature import hog
from skimage import draw, data, io, segmentation, color, exposure
from skimage.measure import regionprops
from skimage.filters import threshold_otsu
from skimage.transform import resize
from skimage.transform import warp
from PIL import Image
# basics
import pickle
import pandas as pd
import numpy as np
from pprint import pprint
$\S 2:$ Importing and Cleaning the Data
The data comes in as a folder of raw images and a .csv of accompanying labels. The code below loads each training image, converts it to grayscale, resizes it, and binarizes it with Otsu's method; it also defines a helper that augments the dataset by nudging each image one pixel in each direction, plus a small utility for displaying images.
In the future, deskewing the input images before binarization would be a worthwhile addition (see the note in §3).
import math
import cv2  # not actually used below
def img_round(x, base=75):
    """
    Now-unused helper (replaced by Otsu binarization) for flattening image data.
    """
    return base * math.floor(float(x) / base)

vround = np.vectorize(img_round)
def get_img(i, size):
    """
    Returns a binary image from my file directory with index i
    """
    img = Image.open('/users/derekjanni/pyocr/train/' + str(i+1) + '.Bmp')
    img = img.convert("L")
    img = img.resize((size, size))
    image = np.asarray(img)
    image.setflags(write=True)
    thresh = threshold_otsu(image)
    binary = image > thresh
    return binary
def nudge_dataset(X, Y, size):
    """
    Produces a dataset 5 times bigger than the original one by moving
    the (size x size) images around by 1px to left, right, down and up.
    """
    direction_vectors = [
        [[0, 1, 0],
         [0, 0, 0],
         [0, 0, 0]],

        [[0, 0, 0],
         [1, 0, 0],
         [0, 0, 0]],

        [[0, 0, 0],
         [0, 0, 1],
         [0, 0, 0]],

        [[0, 0, 0],
         [0, 0, 0],
         [0, 1, 0]]]

    shift = lambda x, w: convolve(x.reshape((size, size)), mode='constant',
                                  weights=w).ravel()
    X = np.concatenate([X] +
                       [np.apply_along_axis(shift, 1, X, vector)
                        for vector in direction_vectors])
    Y = np.concatenate([Y for _ in range(5)], axis=0)
    return X, Y
def show_img(img):
    width = 5.0
    height = img.shape[0] * width / img.shape[1]
    f = plt.figure(figsize=(width, height))
    plt.imshow(img)
df = pd.read_csv('trainLabels.csv', header=0)
raw_y = np.asarray(df['Class'])
raw_x = np.asarray([get_img(i, 50) for i in df.index]).astype(float)
x = np.asarray([i.ravel() for i in raw_x])
y = raw_y
print x.shape, y.shape
$\S 3:$ Setting Up Some Models
I've run just about every classifier algorithm known to man on this dataset, with varying results. To keep the runtime reasonable (we'll be pushing a lot of images through this pipeline), I've limited the comparison to three of the best and most intuitive methods. A few notes:
Bernoulli Naive Bayes is an intuitive choice because it expects binary features, which is exactly what Otsu's method returns, and its per-class decision is essentially binary (this image either is or is not the letter "A", for instance). It doesn't benefit from nudging as much as its counterparts, as we'll see below.
If you disable the "nudging" line below, you'll see that the performance of KNN and Random Forest drops rather dramatically! That's pretty interesting, and it should also be intuitive why shifting the images in each direction helps; the small sketch after these notes shows exactly what a single nudge does.
A necessary improvement would be to account for skew in the input images, i.e. to deskew them as they are loaded.
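To make the nudging concrete, here is a tiny toy sketch (not part of the pipeline; the 3x3 "image" is made up) showing what the first direction vector in nudge_dataset does to a single image:
import numpy as np
from scipy.ndimage import convolve

# toy 3x3 "image": a single bright pixel in the center
toy = np.array([[0., 0., 0.],
                [0., 1., 0.],
                [0., 0., 0.]])

# this is the first kernel in nudge_dataset's direction_vectors
up = [[0, 1, 0],
      [0, 0, 0],
      [0, 0, 0]]

print convolve(toy, weights=up, mode='constant')
# the bright pixel lands one row higher; doing this with all four kernels
# yields four shifted copies of every character, hence the 5x dataset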
# declare models for explicitness
models = {'(5) K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
          'Bernoulli Naive Bayes': BernoulliNB(),
          'Random Forest Classifier': RandomForestClassifier()}
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y)
#nudge dataset improves performance: test and see!
X_train, Y_train = nudge_dataset(X_train, Y_train, 50)
def precision_report(model):
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    return classification_report(Y_test, Y_pred)

for i in models:
    print i + ':\n' + str(precision_report(models[i]))
So we see that KNN and Random Forest are our best predictors on the whole. Let's look at their accuracy scores, as opposed to precision/recall, to get a feel for their overall performance.
def accuracy_report(model):
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    return accuracy_score(Y_test, Y_pred)

for i in models:
    print i + ':\n' + str(accuracy_report(models[i]))
To make my own life easier, I pickle the models here so I can pull them up later.
# models
knn = KNeighborsClassifier(n_neighbors=5)
rfc = RandomForestClassifier()
bnb = BernoulliNB()
# fit
knn.fit(X_train, Y_train)
rfc.fit(X_train, Y_train)
bnb.fit(X_train, Y_train)
# pickle
import pickle
with open('knn.pkl', 'w') as picklefile:
    pickle.dump(knn, picklefile)
with open('rfc.pkl', 'w') as picklefile:
    pickle.dump(rfc, picklefile)
with open('bnb.pkl', 'w') as picklefile:
    pickle.dump(bnb, picklefile)
KNN is an interesting (and intuitive) solution to this problem, especially given the nudging technique. Let's see how the algorithm performs for varying k.
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.33, random_state=42)
#nudge dataset improves performance: test and see!
X_train, Y_train = nudge_dataset(X_train, Y_train, 50)
def accuracy_knn(k):
    neighbors = KNeighborsClassifier(n_neighbors=k)
    neighbors.fit(X_train, Y_train)
    return accuracy_score(Y_test, neighbors.predict(X_test))
k = [i for i in range(1, 20)]
acc_knn = [accuracy_knn(i) for i in range(1, 20)]
plt.figure(figsize=(10,7)).suptitle("Accuracy Score vs. K in KNN Classification", fontsize='15')
plt.xlabel('k', fontsize='15')
plt.ylabel('Accuracy Score', fontsize='15')
plt.plot(k, acc_knn)
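If you'd rather read the best k off programmatically than eyeball the plot, np.argmax on the accuracy list does it (a small optional addition):
# index of the highest accuracy maps back into the list of k values
best_k = k[int(np.argmax(acc_knn))]
print 'best k:', best_k, 'accuracy:', max(acc_knn)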
$\S 4:$ Testing the models with real images
Sometimes seeing really is believing, and it's fun to check the models' performance on data yourself. Use the code below to pull a random test image, see the model's guess, and notice where it goes wrong.
with open('knn.pkl') as picklefile:
    knn = pickle.load(picklefile)
with open('rfc.pkl') as picklefile:
    rfc = pickle.load(picklefile)
def get_test_img(i):
    """
    Returns image from my file directory with corresponding index i
    """
    img = Image.open('/users/derekjanni/pyocr/test/' + str(i) + '.Bmp')
    img = img.convert("L")
    img = img.resize((50, 50))
    image = np.asarray(img)
    image.setflags(write=True)
    thresh = threshold_otsu(image)
    binary = image > thresh
    return binary
from random import randint
random_image = randint(6284, 12503)
# reshape to a (1, n_features) row so sklearn treats it as a single sample
print("My Guess for this file is: " + str(knn.predict(get_test_img(random_image).ravel().reshape(1, -1))).strip('[]\''))
show_img(get_test_img(random_image))
The code below writes a submission for the relevant Kaggle competition. As of 8/5/2015 my model is among the top 10!
with open('submission2.csv', 'w') as outfile:
    outfile.write('ID,Class\n')
    for i in range(6284, 12504):
        pre = str(rfc.predict(get_test_img(i).ravel().reshape(1, -1))).strip('[]\'')
        outfile.write(str(i) + ',' + pre + '\n')
$\S 5:$ Histogram of Oriented Gradients as a feature matrix
It might look like I've done really well here, but part of that is cooked up: since the images are nudged, there is a lot of overlap between training examples that share the same label. It would be more interesting to see whether a HOG (Histogram of Oriented Gradients) representation can capture the letters better. Let's see how KNN, Random Forest and Bernoulli Naive Bayes perform on HOG features.
def get_hog(img):
    fd, hog_image = hog(img, orientations=10, pixels_per_cell=(5, 5),
                        cells_per_block=(2, 2), visualise=True)
    return exposure.rescale_intensity(hog_image, in_range=(0, 0.9))

def get_img_hog(i, size):
    """
    Returns the HOG visualization of the binarized image with index i
    from my file directory
    """
    img = Image.open('/users/derekjanni/pyocr/train/' + str(i+1) + '.Bmp')
    img = img.convert("L")
    img = img.resize((size, size))
    image = np.asarray(img)
    image.setflags(write=True)
    thresh = threshold_otsu(image)
    binary = image > thresh
    return get_hog(binary)
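One thing worth noting: get_img_hog feeds the rescaled HOG visualization image to the classifiers. A common alternative, sketched below but not used in this notebook, is to use the HOG descriptor vector fd itself as the feature row (get_hog_features is a hypothetical helper, not defined anywhere else here):
def get_hog_features(i, size=50):
    """
    Hypothetical variant of get_img_hog that returns the 1-D HOG
    descriptor instead of the visualization image.
    """
    img = Image.open('/users/derekjanni/pyocr/train/' + str(i + 1) + '.Bmp')
    img = img.convert("L").resize((size, size))
    image = np.asarray(img)
    binary = image > threshold_otsu(image)
    # without visualise=True, hog() returns only the descriptor vector
    fd = hog(binary.astype(float), orientations=10, pixels_per_cell=(5, 5),
             cells_per_block=(2, 2))
    return fd
One caveat with this route: the descriptor fd is no longer a (50 x 50) image, so the pixel-shift nudging used below would not apply to it directly.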
df = pd.read_csv('trainLabels.csv', header=0)
raw_y = np.asarray(df['Class'])
raw_x = np.asarray([get_img_hog(i, 50) for i in df.index]).astype(float)
x = np.asarray([i.ravel() for i in raw_x])
y = raw_y
# declare models for explicitness
models = {'(5) K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
          'Bernoulli Naive Bayes': BernoulliNB(),
          'Random Forest Classifier': RandomForestClassifier()}
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y)
#nudge dataset improves performance: test and see!
X_train, Y_train = nudge_dataset(X_train, Y_train, 50)
def precision_report(model):
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    return classification_report(Y_test, Y_pred)

for i in models:
    print i + ':\n' + str(precision_report(models[i]))
Again, the accuracy scores are of interest
for i in models:
    print i + ':\n' + str(accuracy_report(models[i]))
$\S 6:$ Sending the data to JSON for visualization
In my opinion, the confusion matrix for this problem is one of the most interesting visuals I can generate! It's a great summary of where the KNN model fails and might help show why it fails as well.
with open('knn.pkl') as picklefile:
    knn = pickle.load(picklefile)
#test/train split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.33, random_state=42)
#nudge dataset improves performance: test and see!
X_train, Y_train = nudge_dataset(X_train, Y_train, 50)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
# normalize the confusion matrix so each row sums to 1 (per-class rates)
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print cm.shape
# class labels in sorted order (digits, uppercase, then lowercase),
# matching the row/column order of the confusion matrix
import string
columns = [str(d) for d in range(10)] + list(string.ascii_uppercase) + list(string.ascii_lowercase)
rows = list(columns)
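The JSON export below is meant for an external visualization, but since seaborn is already imported, a quick in-notebook look at the same matrix is easy; this is just an optional sketch:
# quick heatmap of the row-normalized confusion matrix
plt.figure(figsize=(12, 10))
sns.heatmap(cm, xticklabels=columns, yticklabels=rows, cmap='Blues')
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.title('KNN confusion matrix (row-normalized)')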
data = list(list(i) for i in cm)
knn_data = {
    "columns": [list(["R", i]) for i in columns],
    "index": [list(i) for i in rows],
    "data": data,
}
knn_numbers = {
    "columns": [list(["R", i]) for i in columns[:10]],
    "index": [list(i) for i in rows[:10]],
    "data": [i[:10] for i in data[:10]],
}
knn_caps = {
    "columns": [list(["R", i]) for i in columns[10:36]],
    "index": [list(i) for i in rows[10:36]],
    "data": [i[10:36] for i in data[10:36]],
}
knn_lower = {
    "columns": [list(["R", i]) for i in columns[36:]],
    "index": [list(i) for i in rows[36:]],
    "data": [i[36:] for i in data[36:]],
}
import json
with open('knn_data.json', 'w') as outfile:
    json.dump(knn_data, outfile)
with open('knn_numbers.json', 'w') as outfile:
    json.dump(knn_numbers, outfile)
with open('knn_caps.json', 'w') as outfile:
    json.dump(knn_caps, outfile)
with open('knn_lower.json', 'w') as outfile:
    json.dump(knn_lower, outfile)