
How to Train and Test an AI Language Translation System

19 Apr 2021 · CPOL · 3 min read
In this article, we’ll train and test the model we created in the previous entry in the series.
Here we'll create a Keras tokenizer that will build an internal vocabulary out of the words found in the parallel corpus, use a Jupyter notebook to train and test our model, and try running our model with self-attention enabled.

Introduction

Google Translate works so well, it often seems like magic. But it’s not magic — it’s deep learning!

In this series of articles, we’ll show you how to use deep learning to create an automatic translation system. This series can be viewed as a step-by-step tutorial that helps you understand and build a neural machine translation system.

This series assumes that you are familiar with the concepts of machine learning: model training, supervised learning, neural networks, as well as artificial neurons, layers, and backpropagation.

In the previous article, we built a deep learning-based model for automatic translation from English to Russian. In this article, we’ll train and test this model.

Training and Testing Using LSTM Cells

We’ll start by training and testing the core of our model: the LSTM cells, without self-attention or a custom word embedding. The standard Keras embedding layer will provide the encoding from words to vectors.

Model training includes several specific tasks:

  • Tokenizing the input data (preprocessing)
  • Deciding the training/self-test data ratio
  • Training the model

First, we need to prepare the input (source) data and output (target) data as numerical, fixed-size arrays. Until this is in place, we cannot feed sentences or words into our Keras neural network (NN) model.

We'll start by creating a Keras tokenizer that will build an internal vocabulary out of the words found in the parallel corpus. Let's wrap it in a function:

Python
### tokenizer ###
from keras.preprocessing.text import Tokenizer

def tokenization(lines):
    # Build an internal vocabulary (word -> index) from a list of sentences
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

First, we must call the fit_on_texts function. It accepts a list of sentences as its argument and builds a mapping from words to integer indices, with the most frequently encountered words receiving the lowest indices. It doesn’t encode the sentences themselves; it prepares the tokenizer to do so.
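For example (a toy illustration, not the article's corpus), fitting the tokenizer on two short sentences produces a word-to-index mapping like this:

Python
# Toy illustration only; uses the tokenization() helper defined above
sample = ["the cat sat on the mat", "the dog sat on the log"]
tok = tokenization(sample)
print(tok.word_index)
# e.g. {'the': 1, 'sat': 2, 'on': 3, 'cat': 4, 'mat': 5, 'dog': 6, 'log': 7}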

Then, we have to provide a way to encode our input sentences. Let's create another function to do that:

Python
### encode ###
from keras.preprocessing.sequence import pad_sequences

def encode_sequences(tokenizer, length, lines):
    # integer encode the sentences
    seq = tokenizer.texts_to_sequences(lines)
    # pad each sequence with trailing 0 values up to the fixed length
    seq = pad_sequences(seq, maxlen=length, padding='post')
    return seq
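As a quick sanity check, the two helpers can be chained on a toy corpus (the sentences and the fixed length of 5 below are made up purely for illustration):

Python
# Illustrative only: a tiny English "corpus" and an arbitrary fixed length of 5
demo_lines = ["i am cold", "she is very happy today"]
demo_tok = tokenization(demo_lines)
print(encode_sequences(demo_tok, 5, demo_lines))
# Each sentence becomes a fixed-length row of word indices, padded with zeros, e.g.:
# [[1 2 3 0 0]
#  [4 5 6 7 8]]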

Once we have initialized the tokenizer, we call its texts_to_sequences function to do the actual encoding. Going in the other direction, the following code converts a vector of word indices back into a sentence. The article shows only the decoding loop, so it is wrapped here in a helper function, with the get_word reverse lookup reconstructed for completeness:

Python
def get_word(n, tokenizer):
    # Reverse lookup (reconstructed helper): return the word mapped to index n
    for word, index in tokenizer.word_index.items():
        if index == n:
            return word
    return None

def get_sentence(i, ru_tokenizer):
    # Turn a vector of word indices back into a sentence, emitting an empty
    # string for padding positions and immediate repetitions
    temp = []
    for j in range(len(i)):
        t = get_word(i[j], ru_tokenizer)
        if j > 0:
            if t == get_word(i[j - 1], ru_tokenizer) or t is None:
                temp.append('')
            else:
                temp.append(t)
        else:
            if t is None:
                temp.append('')
            else:
                temp.append(t)
    return ' '.join(temp)
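As a quick check, decoding one of the toy sequences from the earlier example reproduces the original words, with padding positions turning into empty strings:

Python
# Decode the first toy sequence from the sanity check above (illustrative only)
demo_seq = encode_sequences(demo_tok, 5, demo_lines)
print(get_sentence(demo_seq[0], demo_tok))
# -> "i am cold" followed by blanks for the two padding positions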

Let’s use a Jupyter notebook to train and test our model. If you're running on a machine that doesn't have a GPU, you may want to run the notebook on Colab, as it provides free GPU-enabled notebook instances.
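Inside the notebook, training and testing boil down to a few standard Keras calls. The sketch below is illustrative rather than the article's exact notebook: the 80/20 split, batch size, and epoch count are placeholder values; eng_encoded, rus_encoded, and ru_tokenizer are assumed to be the padded index matrices and Russian tokenizer built with the helpers above; and model is the network from the previous article (depending on how it was compiled, the target array may need reshaping or one-hot encoding).

Python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder hyperparameters; eng_encoded, rus_encoded, ru_tokenizer and model
# are assumed to come from the preprocessing steps above and the previous article
trainX, testX, trainY, testY = train_test_split(
    eng_encoded, rus_encoded, test_size=0.2, random_state=12)

history = model.fit(trainX, trainY,
                    validation_split=0.1,
                    epochs=30, batch_size=512)

# Translate the test sentences: take the most likely target word at each position
preds = np.argmax(model.predict(testX), axis=-1)
translations = [get_sentence(row, ru_tokenizer) for row in preds]
print(translations[:5])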

Processing the first entries of our dataset gives an exact result for the entries that fall within the training data, and an approximate translation for the rest. This lets us check that the translator is working correctly.

The table below shows the input data in English, then the ideal translation to Russian, and finally the model translation:

[Image: table of sample English inputs with their reference Russian translations and the model's translations]

The Russian translator is surprisingly good, probably because we’ve trained the model with more than 400,000 inputs.

Of course, it’s still not as good as a professional automatic translation system, which demonstrates how hard the challenge is. Some flaws become immediately apparent. For example, the sentence "you've been very good to me" is translated as "ты был для мне очень мне" ("you were for me very me").

We can also create a reverse translation (Russian to English) by simply swapping the output and input data. Or, to experiment with other languages, we can load any other training set from the Tatoeba project.
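Assuming the parallel corpus is loaded as a two-column array with English in column 0 and Russian in column 1 (this layout, along with the tokenizer and length variables below, is an assumption carried over from the previous article), reversing the direction only means swapping which column feeds the input side and which feeds the target side:

Python
# Hypothetical layout: dataset[:, 0] = English sentences, dataset[:, 1] = Russian sentences
# English -> Russian (as above):
#   X = encode_sequences(eng_tokenizer, eng_length, dataset[:, 0])
#   Y = encode_sequences(ru_tokenizer,  ru_length,  dataset[:, 1])
# Russian -> English: swap the roles of the two columns
X = encode_sequences(ru_tokenizer, ru_length, dataset[:, 1])
Y = encode_sequences(eng_tokenizer, eng_length, dataset[:, 0])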

Now With Self-Attention

Next, let’s try running our model with self-attention enabled. We see mixed results. In some cases the translation is close to perfect (highlighted in yellow in the image below), but in other cases the translation does not improve, or is even inferior to the translation without self-attention (grey).

Intuitively, this makes sense. Attention mechanisms can help a model focus on the importance of words within a sentence - and the longer a sentence is, the easier it is to determine which words are and are not important. A translation model that uses self-attention will often provide better results, but it won't always do so, especially on shorter inputs.

[Image: translations produced with self-attention enabled, compared against the baseline model]

Next Steps

In the next article we’ll analyze the results our model produced and discuss the potential of a DL-based approach for universal translators. Stay tuned!

This article is part of the series 'Using Deep Learning for Automatic Translation'.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


