Tools for Building AI Language Translation Systems

Martin_Rupp

Apr 15, 2021

CPOL

Tools and software required to build a DL-based automatic translation system

Introduction

Google Translate works so well, it often seems like magic. But it’s not magic — it’s deep learning!

In this series of articles, we’ll show you how to use deep learning to create an automatic translation system. This series can be viewed as a step-by-step tutorial that helps you understand and build a neural machine translation (NMT) system.

This series assumes that you are familiar with the concepts of machine learning: model training, supervised learning, neural networks, as well as artificial neurons, layers, and backpropagation.

In the previous article, we introduced the main theoretical concepts required for building an ML-based translator. In this article, we’ll examine the tools we'll need to use to build an AI language translator.

Tools and Versions

Multiple frameworks provide APIs for deep learning (DL). The TensorFlow + Keras combination is among the most popular, but competing frameworks, such as PyTorch, Caffe, and Theano, are also used.

These frameworks often take a black-box approach to neural networks (NNs): they perform most of their "magic" without requiring you to code the NN logic yourself. There are other ways to build NNs, for instance with deep learning compilers.

The code we're writing should work on any operating system, but note that we're using Python 3, so make sure you have it installed. If your system has both Python 2 and Python 3 installed, you'll need to run pip3 instead of pip in the install commands in this article.

To install a specific version of a module, append ==[version] to the package name in the pip command. For instance: "pip install tensorflow==2.3.1". The following table lists the versions of the Python modules we’ll use:

module	version
TensorFlow	2.3.1
Keras	2.1.0
numpy	1.18.1
pandas	1.1.3
word2vec	0.11.1
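
Once the modules are installed, the pinned versions from the table can be compared with what is actually present on the system. The following is only a minimal sketch, assuming Python 3.8 or later (where importlib.metadata is part of the standard library). The PINNED dictionary and the check_versions helper are names made up for this illustration; they are not part of the article's source files.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the table above (package names as published on PyPI).
PINNED = {
    "tensorflow": "2.3.1",
    "Keras": "2.1.0",
    "numpy": "1.18.1",
    "pandas": "1.1.3",
    "word2vec": "0.11.1",
}


def check_versions(pinned):
    """Return {name: (wanted, installed)}; installed is None if missing."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions(PINNED).items():
        status = "OK" if installed == wanted else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```

A mismatch is not necessarily a problem, but it is a useful first thing to look at if the code in this series behaves differently on your machine.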

TensorFlow

TensorFlow is a very popular Python framework for building NNs.

You can install TensorFlow by running:

pip install tensorflow

The download and installation of the TensorFlow package may take some time as it is pretty large — more than 400 MB.

To verify that TensorFlow has been installed successfully, run:

pip show tensorflow

The output should be similar to this:

Name: tensorflow
Version: 2.3.1
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib64/python3.6/site-packages
Requires: opt-einsum, tensorboard, termcolor, six, h5py, gast, tensorflow-estimator, 
google-pasta, astunparse, wrapt, grpcio, absl-py, numpy, keras-preprocessing, protobuf, wheel
Required-by:

Make sure you run python3 rather than plain python (and likewise pip3 rather than pip). Some systems, such as CentOS, usually ship with a default Python installation, and that installation must not be used for our project.
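
If you are unsure which interpreter a command actually starts, a short script can report it. This is a small sketch rather than part of the project code; the is_python3 helper is a hypothetical name introduced here for illustration.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3."""
    return version_info[0] == 3


if __name__ == "__main__":
    # sys.executable is the path of the interpreter running this script.
    print("Interpreter:", sys.executable)
    print("Version:", sys.version.split()[0])
    if not is_python3():
        raise SystemExit("Please run this project with python3.")
```

Saving this as a file and running it with both python and python3 shows whether the two commands point to different installations.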

Keras

Keras is a deep learning API that runs on top of TensorFlow. Keras can also run on top of other frameworks, such as Theano, but here we pair it with TensorFlow.

To install Keras, enter the following command:

pip install keras

To check if Keras has been successfully installed, type:

pip list | grep -i keras

Pandas

Pandas is a Python library for data manipulation and analysis. We will need it, among other things, to prepare the training data for our DL model.

The Pandas library can be installed using pip (or pip3, if both Python 2 and Python 3 are installed):

pip install pandas

Word2Vec

We need Word2Vec for word embedding, to support the creation of an embedding layer in our NN. Other tools, such as GloVe or BERT, can also do this job. BERT would likely give better results, but it is much more complex to integrate because its embeddings depend on context, so we'll use Word2Vec to keep things simple.

To install Word2Vec, run:

pip install word2vec
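
To make the idea of a word embedding concrete: an embedding maps each word of a vocabulary to a vector of numbers, and an embedding layer is essentially a lookup table over those vectors. The toy sketch below uses hand-written three-dimensional vectors, not real Word2Vec output, and does not call the word2vec package at all. The names EMBEDDINGS, UNKNOWN, embed, and cosine are hypothetical and exist only for this illustration.

```python
import math

# Toy embedding table: word -> vector. Real Word2Vec vectors are learned
# from a corpus and usually have far more than three dimensions.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for words outside the vocabulary


def embed(sentence):
    """Look up one vector per whitespace-separated, lowercased word."""
    return [EMBEDDINGS.get(word, UNKNOWN) for word in sentence.lower().split()]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(embed("Cat car"))
    # Words with related meanings should end up with similar vectors.
    print(cosine(EMBEDDINGS["cat"], EMBEDDINGS["dog"]))
    print(cosine(EMBEDDINGS["cat"], EMBEDDINGS["car"]))
```

In this toy table, "cat" and "dog" point in nearly the same direction while "cat" and "car" do not, which is the property a trained embedding is meant to capture.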

The tools we've installed will allow us to build out the machine translation (MT) software. To fit the pieces of our puzzle, we will:

Process an English dictionary and have Word2Vec create a custom word embedding system for the English language.
Build a sequence-to-sequence recurrent neural network (RNN) with long short-term memory (LSTM) cells using Keras, which has built-in support for everything that we need.
Add the embedding layer created by Word2Vec at the start of the LSTM sequence-to-sequence RNN.
Process the English/Russian parallel corpus: clean it, format it, and tokenize it.
Train our DL model with the processed English/Russian parallel corpus.
Test our NMT for translation accuracy with several English sentences.
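
The corpus-processing step in the list above (clean, format, tokenize) can be sketched in plain Python. This is an illustration under stated assumptions, not the code used later in the series: it assumes the parallel corpus is stored as lines of the form English text, a tab, then Russian text, and that stripping ASCII punctuation is sufficient cleaning. All function names here are hypothetical.

```python
import re
import string

# Assumption: removing ASCII punctuation is enough for this illustration.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def parse_pairs(lines):
    """Turn 'english<TAB>russian' lines into cleaned (en, ru) tuples."""
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip malformed lines
        en, ru = clean(parts[0]), clean(parts[1])
        if en and ru:
            pairs.append((en, ru))
    return pairs


def build_vocab(sentences):
    """Map each word to an integer id, starting at 1 (0 is left for padding)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(sentence, vocab):
    """Replace each known word with its id; unknown words are dropped."""
    return [vocab[word] for word in sentence.split() if word in vocab]


if __name__ == "__main__":
    sample = ["Hello, world!\tПривет, мир!\n", "How are you?\tКак дела?\n"]
    pairs = parse_pairs(sample)
    en_vocab = build_vocab(en for en, _ in pairs)
    print(pairs)
    print(encode("hello you", en_vocab))
```

The integer sequences produced this way are the kind of input an embedding layer expects, which is how this step connects to the model-building steps in the list.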

[Figure: Workflow of the neural machine translation system we are building.]

Keep in mind that this isn't the only way to build an AI language translation system. For instance, we could use gated recurrent unit (GRU) cells instead of LSTM cells. We've chosen the architecture above because it's easy to understand, easy to train, and works well. But once you've learned the basics, there's plenty more to explore.

Next Steps

In the next article, we’ll code an automatic translation system using TensorFlow and Keras. Stay tuned!