Great strides have been made in pattern recognition in recent years, and especially striking results have been attained in handwritten digit recognition. This rapid progress has resulted from a combination of developments, including the proliferation of powerful, inexpensive computers, the invention of new algorithms that take advantage of these computers, and the availability of large databases of characters that can be used for training and testing the algorithms. In this article we will compare several algorithms developed by the engineers of AT&T Laboratories for handwritten digit recognition, considering not only accuracy but also practical implementation measures such as run time and memory usage.
Databases for Handwritten Digit Recognition
Let’s start by describing the databases they used for training and testing. When they first began research in character recognition, they assembled a database of 1,200 digits consisting of 10 examples of each digit from 12 different writers. They soon realised that this database was “too easy” (accuracy quickly reached 99%), so it was abandoned. Below we describe the databases subsequently used by AT&T Laboratories for handwritten digit recognition.
A Zip Code Database
To obtain a database more typical of real-world applications, they acquired a zip-code database with 7,064 training and 2,097 test digits clipped from images of handwritten zip codes. The digits were machine-segmented from the zip-code strings using an automatic algorithm. As is typical, the segmented characters sometimes included extraneous ink and sometimes omitted critical fragments, often resulting in characters that were unidentifiable or appeared mislabelled. Such distorted data were removed from the training set to improve accuracy, but kept in the test set to maintain objectivity. This worked well, and the database was in fact used as a standard at AT&T for some time. Eventually, however, drawbacks such as its relatively small size led to its replacement.
The NIST Test
Responding to the community’s need for better benchmarking, the US National Institute of Standards and Technology (NIST) provided a database of handwritten characters on two CD-ROMs and organised a competition using it. The competitors were dismayed by the results: although they achieved error rates of less than 1% on validation sets drawn from the training data, their performance on the test set was much worse. NIST later disclosed that the training set consisted of characters written by US census workers, while the test set was collected from characters written by uncooperative high-school students.
Modified NIST (MNIST)
The NIST data was therefore repartitioned to provide large training and test sets that share the same distribution. Let’s discuss how the new dataset was created. The original NIST test set contains 58,527 digit images written by 500 different writers. Unlike the training set, the data in the test set is scrambled; however, the writer identities were available and were used to unscramble it. The test set was then split into two sets of about 30,000 examples each.
All images were size-normalised to fit in a 20×20 pixel box (while preserving the aspect ratio), and the small images were also deslanted and straightened.
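The size-normalisation step can be sketched as follows. This is an illustrative NumPy approximation (simple bounding-box cropping and nearest-neighbour resampling), not the exact MNIST preprocessing pipeline, and the deslanting step is omitted:

```python
import numpy as np

def normalize_digit(img, box=20):
    """Crop a digit image to its bounding box and scale it to fit in a
    box x box area, preserving the aspect ratio.
    Illustrative sketch only, not the exact MNIST preprocessing."""
    ys, xs = np.nonzero(img)
    cropped = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = cropped.shape
    scale = box / max(h, w)                  # fit the longer side into the box
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    return cropped[np.ix_(rows, cols)]       # nearest-neighbour resample

# a tall, narrow "digit": a 40 x 10 block of ink in a 64 x 64 image
digit = np.zeros((64, 64))
digit[10:50, 20:30] = 1.0
out = normalize_digit(digit)
print(out.shape)   # (20, 5): the longer side is scaled to 20 pixels
```

Because the aspect ratio is preserved, a tall narrow digit stays tall and narrow after normalisation; only the longer side exactly fills the 20-pixel box.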
The Classifiers for Handwritten Digit Recognition
Baseline Linear Classifier
One of the simplest classifiers is a linear classifier: the value of every input pixel contributes to a weighted sum for every output unit, and the output unit with the highest sum (including the contribution of a bias constant) indicates the class of the input character. A classifier of this kind has 10N weights plus 10 biases, where N denotes the number of input pixels. For this experiment, the deslanted 20×20 pixel MNIST images were used, so the network has 4,010 free parameters. Because this is a linear problem, the weight values can be determined separately. The deficiencies of the linear classifier are well documented, and it serves as a basis for comparison: on the MNIST data, the linear classifier achieved an 8.4% error rate on the test set.
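The structure of this classifier can be sketched in a few lines of NumPy. The weights below are random placeholders rather than trained values; the point is the decision rule and the parameter count:

```python
import numpy as np

N = 400                     # 20x20 input pixels
CLASSES = 10

rng = np.random.default_rng(0)
W = rng.normal(size=(CLASSES, N))   # 10N weights (illustrative random values)
b = np.zeros(CLASSES)               # 10 biases

def predict(x, W, b):
    """Linear classifier: the output unit with the highest weighted
    sum (plus its bias) gives the class of the input character."""
    return int(np.argmax(W @ x + b))

print(W.size + b.size)      # 4010 free parameters, as stated above
x = rng.normal(size=N)      # stand-in for a flattened 20x20 image
print(predict(x, W, b))     # a class label between 0 and 9
```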
Baseline Nearest Neighbor Classifier
Another simple classification technique is a k-nearest-neighbour classifier with a Euclidean distance measure between input images. This classifier has the advantage that no training time is required. However, the memory and recognition-time requirements are large: the complete set of 60,000 20×20 pixel training images must be available at run time. Much more compact representations could be devised, at the cost of a modest increase in recognition time and error rate.
As in the previous case, the deslanted 20×20 images were used. The MNIST test error for k = 3 is 2.4%. Naturally, a realistic Euclidean-distance nearest-neighbour system would operate on feature vectors rather than directly on the pixels, but since all of the other systems presented here operate directly on the pixels, this result is useful for baseline comparison.
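The decision rule can be sketched as follows. A tiny synthetic "training set" of two clusters stands in for the 60,000 MNIST images, so that the example is self-contained:

```python
import numpy as np

def knn_predict(x, train_X, train_y, k=3):
    """k-nearest-neighbour classification with Euclidean distance:
    vote among the labels of the k closest training images."""
    d = np.linalg.norm(train_X - x, axis=1)    # distance to every prototype
    nearest = train_y[np.argsort(d)[:k]]       # labels of the k closest
    return int(np.bincount(nearest).argmax())  # majority vote

# two clusters of 400-dimensional points standing in for two digit classes
rng = np.random.default_rng(1)
train_X = np.vstack([rng.normal(0, 0.1, (20, 400)),
                     rng.normal(1, 0.1, (20, 400))])
train_y = np.array([0] * 20 + [1] * 20)

print(knn_predict(rng.normal(1, 0.1, 400), train_X, train_y))  # → 1
```

Note that "training" consists only of storing `train_X` and `train_y`; all the work happens at recognition time, which is exactly why the memory and run-time costs are high.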
Large Fully Connected Multi-Layer Neural Network
Another classifier tested was a fully connected multilayer neural network with two layers of weights. The network was implemented on a supercomputer and trained with various numbers of hidden units; deslanted 20×20 images were used as input. The best result, an error rate of about 1.6% on the MNIST test set, was obtained with a 400-300-10 network containing approximately 123,300 weights. Classical feed-forward multilayer networks seem to have a built-in “self-regularisation” mechanism, due to the fact that the origin of weight space is a saddle point that is attractive in almost every direction. Small weights cause the sigmoids to operate in their quasi-linear region, making the network essentially equivalent to a low-capacity single-layer network. As learning proceeds, the weights grow, which progressively increases the effective capacity of the network.
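The parameter count and forward pass of the 400-300-10 network can be checked with a short sketch (untrained random weights, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Two weight layers: 400 inputs -> 300 sigmoid hidden units -> 10 outputs."""
    h = sigmoid(W1 @ x + b1)
    return W2 @ h + b2

rng = np.random.default_rng(0)
# small initial weights keep the sigmoids in their quasi-linear region
W1, b1 = rng.normal(0, 0.01, (300, 400)), np.zeros(300)
W2, b2 = rng.normal(0, 0.01, (10, 300)), np.zeros(10)

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)   # 123310, matching the roughly 123,300 quoted above
print(mlp_forward(rng.normal(size=400), W1, b1, W2, b2).shape)  # (10,)
```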
Tangent Distance Classifier (TDC)
The TDC is a memory-based k-nearest-neighbour classifier in which test patterns are compared to labelled prototype patterns in the training set. The key to its performance is determining what “close” means for character images. In the naive approach, nearest-neighbour classifiers use the Euclidean distance: we simply sum the squares of the differences between the values of corresponding pixels in the test image and the prototype pattern. The flaw in this approach is apparent: a slight misalignment between otherwise identical images can lead to a large distance. The standard way of dealing with this problem is to use a hand-crafted feature extractor to enhance dissimilarity between patterns of different classes and decrease variability within each class. The tangent distance instead makes the distance measure itself insensitive to small transformations of the image, such as translations and rotations.
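The misalignment flaw is easy to demonstrate: shifting a stroke by a single pixel can make it farther, in Euclidean distance, from the original than a completely blank image is.

```python
import numpy as np

def euclidean(a, b):
    """Naive pixel-space distance: sum of squared pixel differences."""
    return float(np.sum((a - b) ** 2))

# a vertical stroke, and the same stroke shifted one pixel to the right
img = np.zeros((20, 20)); img[4:16, 9] = 1.0
shifted = np.roll(img, 1, axis=1)
blank = np.zeros((20, 20))

print(euclidean(img, shifted))  # 24.0: the strokes no longer overlap at all
print(euclidean(img, blank))    # 12.0: an empty image is "closer"!
```

A one-pixel shift leaves no overlapping ink, so every ink pixel in both images contributes to the distance, which is why a transformation-insensitive distance measure is needed.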
Optimal Margin Classifier (OMC)
The Optimal Margin Classifier (OMC) is a method for constructing complex decision rules for two-class pattern classification problems. For digit recognition, several such classifiers are constructed, each of which tries to identify one particular digit.
One approach is to transform the input patterns into higher-dimensional vectors and then use a simple linear classifier in the transformed space.
One classical transformation computes the products of k or fewer input variables; a linear classifier in that space corresponds to a polynomial decision surface of degree k in the input space. Unfortunately, this is not practical when the dimension of the input is large, even for small k. The OMC is based on the idea that only certain linear decision surfaces in the transformed space need be considered,
namely the ones that are at the maximum distance from the convex hulls of the two classes. The resulting architecture can be viewed as a two-layer “neural network” in which the weights of the first-layer units are the support vectors.
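A modern descendant of the OMC is the support vector machine. As an illustrative sketch (using scikit-learn's built-in 8×8 digits dataset rather than MNIST), a polynomial-kernel SVM evaluates the degree-k product space implicitly, and a one-vs-rest scheme builds one two-class classifier per digit, as described above:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)      # 8x8 digit images, flattened
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# kernel='poly', degree=4: a degree-4 polynomial decision surface in pixel
# space; the fitted support vectors play the role of the first-layer weights
clf = SVC(kernel='poly', degree=4,
          decision_function_shape='ovr').fit(Xtr, ytr)

print(round(clf.score(Xte, yte), 3))     # accuracy on the held-out digits
print(clf.support_vectors_.shape[0])     # number of support vectors retained
```

Note that, like the OMC, this classifier uses no geometric knowledge of the images: permuting the pixel order consistently across the dataset would leave its accuracy unchanged.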
Conclusion
Performance depends on many factors, including accuracy, run time, memory usage, and training time. As computer technology improves, larger-capacity recognisers become feasible, and larger recognisers in turn require larger training sets. The OMC achieves excellent accuracy, which is most remarkable because, unlike the other high-performance classifiers, it does not incorporate knowledge about the geometry of the problem: in fact, the classifier would do just as well if the image pixels were encrypted by a fixed random permutation. It is still much slower and more memory-hungry than convolutional networks. However, the technique is relatively new, so there is room for improvement.
Written By: Anurag Mukherjee
Reviewed By: Shivani Yadav