Spelling Correction using TensorFlow 2.x

Arthur Flor
19 min read · Nov 1, 2020

Study and application of Spelling Correction in offline Handwritten Text Recognition Systems

Welcome back, my dear reader. In this post, I bring a continuation of my master's degree project: Spelling Correction. You can find the first part here.

Firstly, I would like to thank everyone who contacted me and sent feedback on the other posts/projects, thank you! It keeps me motivated, and now I present a new project.

This post presents my journey in the area of Natural Language Processing: mostly my experience, some concepts that I learned, a few small code sketches, some results, and conclusions.

Just to highlight, the focus of spelling correction here is to use it in conjunction with Handwritten Text Recognition, so the goals and metrics may differ from the conventional ones (Grammatical Error Correction or Machine Translation, for example).

If you’re a code lover, here is the project with a tutorial Jupyter Notebook and Python files:

Introduction

Beginning with my mission here: my goal was to study and present some alternatives for post-processing text in HTR, because currently in this research area, the Kaldi toolkit with SRILM is applied to the HTR output through a statistical approach, improving the text recognized by the optical model. ~This was all the information I had~

Text Correction or Spelling Correction. What do you think? In my first search, I found some statistical approaches, such as the Norvig and SymSpell techniques, and ran into several “trivial” doubts.

What is a token? An n-gram? Filtering? Lemmatization? Language Modeling? Where do Kaldi and SRILM fit in? And later on, <EOS>? <SOS>? Encoders? Decoders? Attention? Multi-head attention?

Along the way, I found the excellent post Deep Spelling and this GitHub repository. So I saw that Neural Networks have been growing in this area too. ~Cheers~

In fact, I had more doubts than I expected, but on the other hand, I knew I would use character-level tokens and HTR metrics (such as Character Error Rate and Word Error Rate), instead of word-level tokens and GEC metrics (such as Precision and BLEU). Finally, my goal would be simpler too, unlike the Machine Translation and Grammatical Error Correction approaches.

At the end of a long period of research and study, I summarized the existing techniques that I would use in the project (there are more, of course):

  • Statistical: Similarity (n-gram), Norvig, SymSpell, Kaldi + SRILM.
  • Neural Network: Seq2Seq (Bahdanau and Luong attention), Transformer.

To an expert, these methods are “common” and nothing new. I agree. Newer approaches already deal with Transformer-XL and the Evolved Transformer, but I preferred to keep it basic and simple for this task. Next, I’ll try to quickly explain each of them, along with the terms and methods used.

Statistical Approach

Interestingly, there are many statistical approaches to spelling correction. I will only mention a few that I saw and used. However, first it is necessary to explain what Tokens and N-grams are, since they are common terms in this area.

Just to mention, the explanations here come from what I understood from readings and articles, after all the work, code, and various errors that I went through. So I share my point of view, since some works just skip over these subjects, or only explain them in general terms, assuming we already know.

Tokens and N-grams

Well, Tokens are pieces of our working material, that is, texts. Simple, right? No! What does “pieces of the text” mean? Here we have two approaches: they can be pieces of words (word level) or pieces of characters (char level).

So, before coding, we have to know what context we want to work in. For instance, statistical methods can use the word level, building a dictionary of word occurrences; in this case, the words become tokens, and we can combine them with the n-gram approach to make new tokens, if we want. However, the pure char level can be hard to correct, and we can’t make a single char into a token (can you see why?). We first need to add some information, and in statistical scenarios, we must combine it with the n-gram approach to make our char-level tokens.

But now, what is the n-gram approach? Imagine partitioning (segmenting) our text into several sections of the same size without leaving any combination out. That is an n-gram, where N is the size of the partition we want. I really like the example figure of the N-gram definition on Wikipedia; it is simple, objective, and intuitive:

You can see the underscores “_”, right? Remember that the partitions must all be the same size, so we can use an underscore or any other “special character” as padding.
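As a minimal sketch of the idea (a generic helper of my own, not from the project), we pad the text with underscores and slide a window of size N:

```python
# Minimal sketch: character-level n-grams with "_" padding (illustrative only).
def char_ngrams(text, n=3, pad="_"):
    padded = f"{pad * (n - 1)}{text}{pad * (n - 1)}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("code", n=3))
# ['__c', '_co', 'cod', 'ode', 'de_', 'e__']
```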

Similarity

This technique is the simplest of all. It just builds n-grams from our dataset and uses a similarity measure to search for the possible “correct text”. The similarity measure simply checks how many n-grams are shared between the texts, and the candidate sharing the most is the potential correction.
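To make the idea concrete, here is a rough, self-contained sketch of the concept (not the API of the module mentioned below):

```python
# Score candidates by shared n-grams (Jaccard-like); the highest wins.
def ngram_set(text, n=2):
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def most_similar(noisy, vocabulary, n=2):
    query = ngram_set(noisy, n)

    def score(word):
        grams = ngram_set(word, n)
        return len(query & grams) / len(query | grams)

    return max(vocabulary, key=score)

print(most_similar("speling", ["spelling", "speaking", "spilling"]))
# spelling
```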

I used the N-gram Python module. It is simple and has fuzzy search for performance:

PyPI: https://pypi.org/project/ngram

Github: https://github.com/gpoulter/python-ngram

Norvig

Norvig’s approach is better known in research, but from what I understand, it is already somewhat dated. However, it is still interesting and uses a different approach. It works at word level and uses the Levenshtein Distance (Edit Distance) to compute scores for the potential corrections. In addition, it compares all permutations (insertions, deletions, substitutions, and transpositions) with the words in the dataset to arrive at that score. However, these permutations come with a computational and time cost, so it is the slowest.
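To give an idea of why it is slow, here is the well-known candidate generation from Norvig’s essay (every string one edit away from a word); this is a generic sketch, not the internals of the module below:

```python
# All candidates at edit distance 1: deletions, transpositions,
# substitutions and insertions (as in Norvig's spelling corrector essay).
import string

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print(len(edits1("code")))  # hundreds of candidates for a 4-letter word
```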

I used the Pyspellchecker Python module:

PyPI: https://pypi.org/project/pyspellchecker/

Github: https://github.com/barrust/pyspellchecker

SymSpell

This one was my surprise. SymSpell was the first technique that I found, through these awesome posts: 1000x faster Spelling Correction and SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking. Yes, from the titles you may have noticed that it is fast.

The interesting part is that it uses the symmetric delete algorithm to reduce the processing complexity (different from the one presented by Norvig) and makes use of the Damerau-Levenshtein distance for the scores. I really don’t understand how it can be so fast. ~haha~

Just so you can see a speed comparison, here is an image from the author, Wolf Garbe:

Anyway, I used the Symspellpy Python module (you can find it in several other languages as well):

PyPI: https://pypi.org/project/symspellpy/

Github: https://github.com/mammothb/symspellpy

Original GitHub: https://github.com/wolfgarbe/SymSpell
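Here is a small usage sketch of symspellpy, written from its documentation as I remember it, so treat the names and the tiny in-memory dictionary as assumptions and check the module’s README:

```python
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# Tiny in-memory dictionary: (term, frequency).
for term, count in [("code", 100), ("need", 80), ("spelling", 50)]:
    sym_spell.create_dictionary_entry(term, count)

suggestions = sym_spell.lookup("nned", Verbosity.CLOSEST, max_edit_distance=2)
for suggestion in suggestions:
    print(suggestion.term, suggestion.distance, suggestion.count)
```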

Kaldi + SRILM

OK, so… this is the approach that I understood the least and found the most complicated; however, it is the most used in HTR (at least that’s what they told me), and also the fastest (of course, the only one in C++) and with better results.

Let’s take it step by step.

Kaldi is a speech recognition toolkit (yeah, speech recognition) and SRILM is a toolkit for building and applying statistical Language Models (LMs).

So, we need to download and compile the source code of the two tools and integrate them (Kaldi uses SRILM). After that, we need to execute some Kaldi commands to decode the output of the optical model (now a list of prediction matrices, that is, the model output without going through the CTC decoder). In general, Kaldi will decode the predictions, create the Language Model using SRILM, create a statistical structure using Hidden Markov Models, create lexical structures, correct the predictions and, finally, decode everything into text using the dataset’s character dictionary. Uff!!!

Just so you know, Kaldi is a very good choice, but I think it is a workaround in the HTR context (sorry about that). Why, you ask? I honestly don’t know; I just feel that it breaks the process…

Well, I made the project Handwritten Text Recognition with TensorFlow 2, and we can run the whole project in the browser. The output is already the final text. So, when I had to use Kaldi’s approach, I had to create an alternative output, in which I literally cut off the end of the decoding process, so that we no longer have the final text.

This output is a series of standardized files to ease the Kaldi input (you can get these files through the kaldi_assets parameter :]).

And finally, I used a script in the Spelling Correction project to take these standardized files and the Kaldi + SRILM toolkits (already compiled by yourself) and run an automatic execution/correction. ~Did you think I would do it manually?~

I know this is a huge and complicated process, but I just tried to automate this integration, no matter how much it strayed from the project’s pattern.

Neural Network Approach

Recently (from 2014 to today), I have seen an increase in the use of Neural Networks for Machine Translation and Grammatical Error Correction. The results really are incredible, but they come with gigantic models, datasets of billions of sentences, and days or weeks of training. Yes, I know, kind of demotivating for those just starting out. So I tried to go with a simple approach, with the smallest model possible and hours or days of training (not ideal yet, but OK).

Just so you know, the token process is the same; just forget n-grams for now, because here we have to feed the model with numbers instead of text. So, I will complement this subject with One Hot Encoding and Embeddings.

Encode/Decode Tokens

To work with Neural Networks, we must work with numbers. So, if we work with char-level tokens, we need to use One Hot Encoding. Word-level tokens, on the other hand, work better with the Embeddings approach.

One Hot Encoding, briefly, transforms the text into a binary matrix, mapping the tokens. For instance:

Where ABC = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

Embeddings, briefly, transform the text into numbers and then use an Embedding layer in the model to compute more information from the texts. For instance:

Where ABC = [1, 2, 3]
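As a minimal TensorFlow 2.x sketch of the difference (shapes and sizes here are placeholders, not the project’s code):

```python
import tensorflow as tf

token_ids = tf.constant([[1, 2, 3]])               # "ABC" encoded as integers
one_hot = tf.one_hot(token_ids, depth=4)           # char level: binary matrix
embedding = tf.keras.layers.Embedding(input_dim=4, output_dim=8)
dense_vectors = embedding(token_ids)               # word level: learned vectors

print(one_hot.shape)        # (1, 3, 4)
print(dense_vectors.shape)  # (1, 3, 8)
```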

I really, really simplified this, but that is the concept. In addition, in both scenarios you have special tokens, predefined and mandatory in this context. Special tokens can be any unique token in the entire dataset, such as:

Start sentence: to indicate the beginning of the sentence. We can use <SOS> or <B> at word level, or some special character at char level.

End sentence: to indicate the end of the sentence. We can use <EOS> or <E> at word level, or some special character at char level.

Space: to indicate a space in the sentence. We can use <SPACE> or <S>, at word level only.

Padding: to indicate sentence filling (remember that in the neural network, the input is a matrix of encoded texts). We can use <PAD>, PAD, or <P> at word level, or some special character at char level. Another interesting detail is the convention of always putting this token at the first index, that is, index 0.

Unknown: to indicate an unexpected element, never seen in the dataset. We can use <UNK>, UNK, or <U> at word level, or some special character at char level.

To make it clearer, here are some examples for both levels, encoding the sentences “I code” and “You code”.

First, for one hot encoding (char level), we will use a character dictionary that matches all character possibilities: [¶, t, d, I, Y, , u, e, c, o, «, ¤, »]. ~can you see the special tokens?~

So, what we really encode is:

«I code»¶¶ → 10 tokens

«You code» → 10 tokens

Then:

One hot encoding (char level) example

Now, for embeddings (word level), we will use a word dictionary that matches all word possibilities: [<P>, I, You, <S>, code, <B>, <U>, <E>]. ~again, can you see the special tokens?~

So, what we really encode is:

<B> I <S> code <E> → 6 tokens

<B> You <S> code <E> → 6 tokens

Then:

Embedding encoding (word level) example

Simpler, right? This is because the Embedding layers will also transform the input into matrices and add more information, with decimal values instead of integer values (0, 1). But we don’t need to worry about that.

As for the decoding process: well, with the dictionary in hand, just do the reverse.
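Putting the pieces together, here is a hedged sketch of char-level encode/decode with special tokens, in the spirit of the example above (“¶” as padding, “«”/“»” as start/end, and “¤” as my assumed unknown token):

```python
PAD, SOS, EOS, UNK = "¶", "«", "»", "¤"
charset = [PAD, SOS, EOS, UNK] + list("IYoucde ")   # PAD kept at index 0
char_to_id = {c: i for i, c in enumerate(charset)}
id_to_char = {i: c for c, i in char_to_id.items()}

def encode(text, max_len=10):
    tokens = [SOS] + list(text) + [EOS]
    tokens += [PAD] * (max_len - len(tokens))
    return [char_to_id.get(c, char_to_id[UNK]) for c in tokens]

def decode(ids):
    text = "".join(id_to_char[i] for i in ids)
    return text.strip(PAD).strip(SOS + EOS)          # drop padding and markers

ids = encode("I code")
print(ids)          # [1, 4, 11, 8, 6, 9, 10, 2, 0, 0]
print(decode(ids))  # "I code"
```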

Sequence to Sequence

The most popular approach in this area is Sequence to Sequence, or Seq2Seq for the most intimate. This method consists of using Recurrent Neural Networks to work with text. “Work with text” because you can classify, extract information, analyze sentiment, correct, generate dialogues, among other activities, all from text. Powerful, I know.

A little bit of history: the highlights in this area initially emerged with LSTM layers, for which researchers studied how to improve learning. For instance, there is an article that presents the process of reversing the input sentences, as it was found to improve training.

Anyway, the concept here is the same. Seq2Seq has an Encoder and a Decoder model. The encoder obtains the states of the input, which are passed to the decoder; the decoder processes the input states and the ground-truth states in order to produce the prediction output. Complicated.

If you followed the above, you noticed that there is more than one input. Therefore, as input for our model, we have: (i) the encoder input; (ii) the decoder input (ground truth); and, to calculate the loss function: (iii) the ground truth. The detail is that each of these inputs has its own encoding (text → number). Then:

Encoder input: <S> your_text <E> <P> <P>…

Decoder input: <S> your_text <P> <P>…

Ground truth: your_text <E> <P> <P>…

There is more than one approach to encoding the encoder inputs. However, the one I presented covers the current models (which use Bidirectional Recurrent Neural Networks #spoiler).

In this post and in the image below, you can see a different approach to encoding the inputs, with no Start/End token in the encoder input sentence.

Seq2Seq workflow example (with no Start/End token in encoder input sentence)
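As a hedged sketch of this encoder/decoder flow with teacher forcing (a minimal Keras model, not the project’s exact architecture; vocab_size, units, and max_len are placeholders):

```python
import tensorflow as tf

vocab_size, units, max_len = 64, 256, 128

# Encoder: reads the (noisy) sentence and returns its final state.
encoder_in = tf.keras.Input(shape=(max_len, vocab_size))        # one-hot chars
_, encoder_state = tf.keras.layers.GRU(
    units, return_sequences=True, return_state=True)(encoder_in)

# Decoder: receives the shifted ground truth (teacher forcing) plus the
# encoder state, and predicts the next token at each step.
decoder_in = tf.keras.Input(shape=(max_len, vocab_size))
decoder_seq = tf.keras.layers.GRU(
    units, return_sequences=True)(decoder_in, initial_state=encoder_state)
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(decoder_seq)

model = tf.keras.Model([encoder_in, decoder_in], outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```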

Bahdanau Attention

Everything changed with the arrival of the Attention layer proposed by Bahdanau (originally in 2014). Models jumped from around 60% to 95% accuracy, just by applying the attention process between the encoder and decoder states.

And what does the attention process mean? Briefly and in general, Attention is an “alignment” between two vectors/matrices, based on scores. Bahdanau proposed calculating these scores based on concatenation. This makes processing, memory usage, and training time an aggravating factor, even though it provides beautiful results and is now the most used by the best-known NLP tools.

Bahdanau Attention Mechanism
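For reference, a minimal sketch of the additive (Bahdanau) score, score(query, values) = v·tanh(W1·query + W2·values), in the style of the TensorFlow tutorials:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: (batch, units) decoder state; values: (batch, time, units).
        query = tf.expand_dims(query, 1)                    # (batch, 1, units)
        scores = self.V(tf.nn.tanh(self.W1(query) + self.W2(values)))
        weights = tf.nn.softmax(scores, axis=1)             # over time steps
        context = tf.reduce_sum(weights * values, axis=1)   # (batch, units)
        return context, weights
```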

Luong Attention

As an improvement, Luong (2015) proposed a new approach to calculating the attention scores, now using the dot product. In this way, the model became extremely light and fast in training, while delivering a result just below Bahdanau’s (at least in my experiments).

Luong Attention Mechanism
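And the corresponding multiplicative (Luong) version, where the score is just a dot product between the decoder state and the encoder outputs (a minimal sketch, not the project’s exact layer):

```python
import tensorflow as tf

def luong_attention(query, values):
    # query: (batch, units) decoder state; values: (batch, time, units).
    scores = tf.matmul(values, tf.expand_dims(query, -1))   # (batch, time, 1)
    weights = tf.nn.softmax(scores, axis=1)                 # over time steps
    context = tf.reduce_sum(weights * values, axis=1)       # (batch, units)
    return context, weights
```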

Transformer

Finally, the Transformer model, or Multi-Head Attention, by Vaswani et al. (2017), with the best paper title: Attention Is All You Need.

I really built up high expectations for this model and followed, step by step, Google’s excellent tutorial (Transformer Model for Language Understanding).

The concept is really interesting. It also uses the dot-product process, but there are no Recurrent layers, just attention. Yes, the process here is based only on attention applied over the sequence itself, that is, self-attention.

Encoder and Decoder Transformer Model
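At the core of it all is the scaled dot-product attention, Attention(Q, K, V) = softmax(Q·Kᵀ/√dₖ)·V; a minimal sketch, close to the one in Google’s tutorial:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    matmul_qk = tf.matmul(q, k, transpose_b=True)      # (..., seq_q, seq_k)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(d_k)
    if mask is not None:
        logits += mask * -1e9                          # hide masked positions
    weights = tf.nn.softmax(logits, axis=-1)
    return tf.matmul(weights, v), weights
```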

The bad thing here is the decoding process, because decoding is done token by token. My context was sentences with 128 tokens (a high value compared to the usual), so imagine self-attention decoding in a token-by-token loop, sentence by sentence: it takes a lot of time. That is a huge trade-off for me, because the model is extremely fast to train and gives a very nice result, but it is slow to decode. ~I need to write some code in C++!~

Anyway, today there are studies aimed at solving this problem while also improving the results (Transformer-XL and the Evolved Transformer, for example).

I know I explained it very quickly and superficially, but I don’t want to go into the code details of each implementation and operation. All three approaches follow the same encoding and decoding logic, the same flow of Encoder Model and Decoder Model. In addition, you can follow the char or word level, depending on the objective.

Datasets

This step is complicated for the HTR area. Let me explain: both Machine Translation and Grammatical Error Correction have huge datasets to work with text. I can list BEA2019, CoNLL13, CoNLL14, and One Billion by Google (the project has support for them :]). These datasets have no partitioning methodology, so we can split them into 10% for validation, 10% for testing, and the rest for training. The training works fine and, after a long time, our model will predict the correction of a sentence. Everything OK so far.

However, I want to work with HTR datasets and correct them using a small model. Here I list Bentham, IAM, Rimes, Saint Gall, and Washington. The original purpose of these datasets is only the recognition of text in images for the digital environment. Here we run into some problems.

The first problem is the scarcity of data. BEA2019 has about 35 thousand sentences, while the largest HTR dataset I had access to had 11 thousand.

The second problem, the most serious, is the poor distribution of tokens across the partitions of the HTR datasets. And why the most serious? Well, we can’t train a model that has only learned “I do” and then give it the sentence “I nned” to predict. The result will be “I do” (wrong), because the model doesn’t know about the “need” token and therefore cannot correct it properly.

It took me a long time to realize this, and the only solution (not ideal) I found was to generate new sentences from the dataset using word-level n-grams (yes!!). The process is somewhat similar, but instead of using a single value for N, we use several values and some restrictions. I called it multigrams to make it easier to understand. To exemplify:

Multigram Process example
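As a rough sketch of how I understand the idea (the sizes below are illustrative assumptions, not the project’s exact rules):

```python
# Generate new "sentences" as word-level n-grams of several sizes.
def multigrams(sentence, sizes=(4, 5, 6)):
    words = sentence.split()
    generated = set()
    for n in sizes:
        for i in range(len(words) - n + 1):
            generated.add(" ".join(words[i:i + n]))
    return sorted(generated)

for chunk in multigrams("the quick brown fox jumps over the lazy dog"):
    print(chunk)
```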

With the partitioned dataset, I redistributed the multigrams over the training and validation partitions (to maintain the 10% ratio) and kept the original test partition. This increased the volume of data considerably.

Remember: if you use a proper text dataset, you don’t need this step. For the HTR context, the ideal scenario would be a text-image dataset for recognition in which the tokens appear in equal proportions in the three partitions (training, validation, and testing).

You can also find more details and some images of these datasets in this previous post. However, in this project only their texts are used.

Bentham

The Bentham database is a collection of manuscripts written by the English philosopher Jeremy Bentham (1748–1832).

IAM

The Institut für Informatik und Angewandte Mathematik (IAM) database contains forms with English manuscripts.

Rimes

The Reconnaissance et Indexation de données Manuscrites et de fac similÉS (Rimes) database is a collection of text written in French.

Saint Gall

The Saint Gall database contains manuscripts in Latin from the 9th century.

Washington

Finally, the Washington database (the smallest) was created from the George Washington Papers (in English) from the 18th century.

Experiment Setup

At this stage, the usual process was used for the statistical approach: creating the corpus from the dataset; creating the dictionary/Language Model with the occurrences; finding the best N value for that dataset + technique; and, finally, getting the possible correction.

For the Neural Network approach, the training step is used to adjust the weights, which are then used to predict the possible correction.

Specifically for this context of Neural Networks, Spelling Correction, and HTR, I used the char level (one hot encoding) and kept the original partitioning of the datasets (with the addition of multigrams in training and validation).

In addition, I didn’t bring the results using Bahdanau Attention and the Transformer model, as they didn’t meet my expectations (training time or decoding, for example), but I kept their implementations in the project repository as well.

Thus, I used a combination of Bahdanau’s model (which brings an encoder step with a Bidirectional layer), using Gated Recurrent layers (instead of conventional LSTM), and Luong Attention followed by Layer Normalization (this last one I learned from the Transformer model). ~yes, a great mix~

Finally, I kept the number of units at 512 and the dropout at 0.2 (the usual for small models), using batches of 128 sentences. ReduceLROnPlateau and Early Stopping were also applied, after 15 and 20 epochs without improvement in the validation loss, respectively.

In summary, the training was set up to finish as quickly as possible. If someone needs to train for longer, or increase the model’s capacity, feel free to increase the parameters.
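A hedged sketch of that training setup (the patience and batch values come from the text; the ReduceLROnPlateau factor and everything else are placeholders, not the project’s exact code):

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.2, patience=15, verbose=1),
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=20, restore_best_weights=True, verbose=1),
]

# model.fit(train_data, validation_data=valid_data,
#           batch_size=128, epochs=1000, callbacks=callbacks)
```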

Noises

In the end, how will the model learn to correct sentences if we only have the ground truth?

Yeah, this step is really important. We generate random errors in the data: for the validation partition, only once (of course); for the training data, at each iteration.

It is a simple function that generates a noisy sentence before encoding. I found this great post and repository from Tal Weiss, where he uses a conventional process of randomly adding/removing/transposing/replacing some tokens in the sentence.

In addition, I had to go a little further, because this project is connected to my HTR project, so I added two more possibilities in remove mode: accentuation and repeated letters in sequence. I kept the generated error rate between 10% and 15%, as you can see in the example below:

Sample of the noise generation process
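A minimal sketch of this kind of noise injection (a generic version of the add/remove/transpose/replace idea; the exact rates and the HTR-specific removals are not reproduced here):

```python
import random
import string

def add_noise(sentence, error_rate=0.125):
    chars = list(sentence)
    n_errors = max(1, int(len(chars) * error_rate))
    for _ in range(n_errors):
        i = random.randrange(len(chars))
        op = random.choice(["add", "remove", "transpose", "replace"])
        if op == "add":
            chars.insert(i, random.choice(string.ascii_lowercase))
        elif op == "remove" and len(chars) > 1:
            chars.pop(i)
        elif op == "transpose" and i < len(chars) - 1:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        else:
            chars[i] = random.choice(string.ascii_lowercase)
    return "".join(chars)

print(add_noise("I need to code", error_rate=0.15))
```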

Metrics

Unlike Machine Translation and GEC metrics (such as Precision or BLEU), which aim to evaluate the correction a little more deeply and semantically, here I made another choice, in line with the Optical Character Recognition (OCR) area. So, I used the metrics Character Error Rate (CER), Word Error Rate (WER), and Sequence Error Rate (SER).

In addition to the OCR metrics, I also measured the time in seconds to execute the correction/prediction on each item.

I know these aren’t the best metrics for this type of task (at least I haven’t seen any articles with this evaluation approach), but my goal was to integrate with the HTR system, so I kept the same metrics throughout the workflow.
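For reference, a hedged sketch of how CER/WER/SER can be computed (the editdistance package is an assumption; any Levenshtein implementation works):

```python
import editdistance

def ocr_metrics(predictions, ground_truths):
    cer_num = cer_den = wer_num = wer_den = ser = 0
    for pred, gt in zip(predictions, ground_truths):
        cer_num += editdistance.eval(list(pred), list(gt))      # char edits
        cer_den += len(gt)
        wer_num += editdistance.eval(pred.split(), gt.split())  # word edits
        wer_den += len(gt.split())
        ser += int(pred != gt)                                  # whole sentence
    return cer_num / cer_den, wer_num / wer_den, ser / len(ground_truths)

print(ocr_metrics(["I need to code"], ["I nned to code"]))
```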

Google Colab: Tutorial Mode

Finally, all training and predictions were conducted on the Google Colab platform. The platform offers a Linux operating system with 12GB of RAM and an Nvidia Tesla P100 GPU with 16GB of memory. ~thank you again, Google ❤❤~

Results

First, to explain: all inputs are the outputs of the first project (HTR). That said, we have 5 datasets and 3 optical model architectures (Bluche, Flor, Puigcerver). So there are 15 types of input for the 5 approaches used in this experiment (4 statistical and 1 Seq2Seq). ~a lot of results~

In general, the results were favorable to Seq2Seq, especially when the input has a low error rate. However, the best runtime stayed with Kaldi + SRILM (of course!). But it is interesting to note that SymSpell comes close in speed and result (remember, C++ vs Python!!). In addition, the Norvig approach had the slowest execution time and Similarity the worst result.

Bentham

The Bentham dataset had very close results. The Bluche inputs have almost 13% error, while the model was trained with 10%–15% noise, so Seq2Seq lost to Kaldi on this architecture (CER value only). However, Seq2Seq reached an incredible 2.91% CER, 6.75% WER, and 33.12% SER.

Measures in the Bentham dataset

Some examples of corrections:

Sample results in the Bentham dataset

IAM

The IAM dataset was the most complicated because of the tokens in its sentences. Again, the Bluche inputs have almost 13% error, so Seq2Seq lost to Kaldi on this architecture (CER value only); Seq2Seq reached 2.85% CER, 6.41% WER, and 32.65% SER.

Measures in the IAM dataset

Some examples of corrections:

Sample results in the IAM dataset

Rimes

The Rimes dataset starts to become interesting, as the sentences have a certain context and better sharing of tokens. Seq2Seq reached 1.71% CER, 4.88% WER, and 23.71% SER. ~fantastic~

Measures in the Rimes dataset

Some examples of corrections:

Sample results in the Rimes dataset

Saint Gall

The Saint Gall dataset was also easy: because it is in Latin, it has no punctuation or accentuation marks. The number of tokens was the smallest. Seq2Seq reached 2.11% CER, 5.03% WER, and 24.53% SER.

Measures in the Saint Gall dataset

Some examples of corrections:

Sample results in the Saint Gall dataset

Washington

Finally, Washington is the smallest database, but not as complicated as IAM or Bentham. Seq2Seq had 1.29% CER, 2.97% WER, and 16.88% SER. Most errors were concentrated in a few sentences.

Measures in the Washington dataset

Some examples of corrections:

Sample results in the Washington dataset

Even though I don’t go into much of the discussion here, it is important to highlight the variation of the N values for each statistical approach, on each architecture and each base, whereas with Seq2Seq we don’t have to worry about that.

Another interesting point is the relationship between the error rate of the inputs and the one used in training. First, it can cause incompatibility in the correction (as seen with the Bluche architecture on the Bentham and IAM bases); second, as the amount of noise increases, the model may overfit, unless we also increase the number of neurons.

Conclusion

Dear reader, if you have read this far, congratulations! If you jumped ahead to here, well, let’s go on anyway. After all the study, implementation, and experimentation within the context of Handwritten Text Recognition, I arrived at a few points:

  • SymSpell is a great option for quick, practical implementations and good results. ~nothing to complain about here~
  • Kaldi + SRILM is a good alternative and a powerful tool; however, I found it quite laborious and complicated (especially for newbies like me). From that point of view, it would be better to invest a little more time and dedication in the performance of the Seq2Seq/Transformer models for this task.
  • I realized that encoder-decoder models are still at the beginning in the HTR area; even with good results, they haven’t shown all the potential they can offer. Maybe we can say that the HTR research field hasn’t left the statistical approach behind yet, which makes it hard to work with incompatible/restricted datasets, for example. ~I know it’s not easy, but nobody said it would be~

Therefore, this implementation is a simple example of the power of Seq2Seq in a simple spelling correction task. If you want to improve performance (datasets with more tokens, or multi-language), you can try more training epochs, a higher number of units, or even the Transformers (they are amazing).

I wrote this post almost a year ago, and we have now been able to publish the article “Towards natural language processing as a spell check for offline handwriting recognition systems” as a result of my master’s project. The concepts used are the same, but for more complete content and updated results, I also recommend reading the article. Finally, feel free to improve and reference the project. ~Cheers!~
