Machine translation is the process of using Machine Learning to automatically translate text from one language to another without any human intervention during the translation.
Neural machine translation has emerged in recent years and generally outperforms earlier approaches. In particular, attention-based neural networks called transformers do an outstanding job on this task.
This tutorial shows you how to perform machine translation without training anything yourself: we'll use pre-trained models through the Hugging Face Transformers library.
The Helsinki-NLP models we will use are primarily trained on OPUS, a freely available collection of translated texts from the web.
To get started, create a new Python notebook or file. You can also follow along with the Colab notebook linked from this article. First, let's install the required libraries:
$ pip install transformers==4.12.4 sentencepiece
Importing transformers:
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
Let's first get started with the library's pipeline API; we'll be using the models trained by Helsinki-NLP. You can check their page to see the available models they have:
# source & destination languages
src = "en"
dst = "de"
task_name = f"translation_{src}_to_{dst}"
model_name = f"Helsinki-NLP/opus-mt-{src}-{dst}"
translator = pipeline(task_name, model=model_name, tokenizer=model_name)
src and dst are the source and destination languages, respectively; feel free to change them to suit your needs. We build task_name and model_name dynamically from these two language codes, then initialize the pipeline, passing model_name as both the model and tokenizer arguments. Let's test it out:
translator("You're a genius.")[0]["translation_text"]
Output:
Du bist ein Genie.
The pipeline API is pretty straightforward; we get the output by simply passing the text to the translator pipeline object.
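If you plan to switch language pairs often, the naming logic above can be wrapped in small helpers. This is only a sketch: opus_mt_names and make_translator are hypothetical names introduced here, not part of the transformers library, and not every language pair has a published Helsinki-NLP model, so loading can fail for some combinations.

```python
def opus_mt_names(src, dst):
    """Return (task_name, model_name) for a language pair, built as in this tutorial."""
    task_name = f"translation_{src}_to_{dst}"
    model_name = f"Helsinki-NLP/opus-mt-{src}-{dst}"
    return task_name, model_name


def make_translator(src, dst):
    """Create a translation pipeline for the pair (downloads the model on first use)."""
    # Imported lazily so opus_mt_names() can be used without transformers installed.
    from transformers import pipeline

    task_name, model_name = opus_mt_names(src, dst)
    return pipeline(task_name, model=model_name, tokenizer=model_name)


print(opus_mt_names("en", "fr"))
# ('translation_en_to_fr', 'Helsinki-NLP/opus-mt-en-fr')
```

A pipeline object can also be called with a list of strings, for example make_translator("en", "fr")(["Hello.", "How are you?"]), in which case it returns one result dictionary per input.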
Let's pause for a moment here to discuss the Bilingual Evaluation Understudy (BLEU) score. BLEU is an algorithm for evaluating the quality of text that has been machine-translated from one language to another. It measures how many n-grams (contiguous word sequences) in the machine-generated translation also appear in a human reference translation, which makes it a precision-oriented metric. The BLEU score ranges from 0 to 1, with 1 being a perfect match to the reference translation. However, it's important to note that while BLEU is a widely accepted metric, it is not flawless: it doesn't account for semantic accuracy or linguistic nuances. Check this tutorial of ours to calculate the BLEU score in Python.
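To make the idea concrete, here is a minimal sentence-level BLEU sketch in plain Python. It assumes whitespace tokenization, a single reference, n-grams up to length 4, and no smoothing, so any missing n-gram order yields a score of 0. The function names are illustrative; for real evaluations, an established implementation such as NLTK or sacreBLEU is the safer choice.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # "Modified" precision: each n-gram is credited at most as often
        # as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))
print(simple_bleu("the cat sat on the mat", "the cat sat on a mat"))
```

The first call compares a sentence with itself and scores 1; the second differs by one word and scores lower, because the mismatch breaks several of the longer n-grams.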
Alright, let's test a longer text brought from Wikipedia:
article = """
Albert Einstein ( 14 March 1879 – 18 April 1955) was a German-born theoretical physicist, widely acknowledged to be one of the greatest physicists of all time.
Einstein is best known for developing the theory of relativity, but he also made important contributions to the development of the theory of quantum mechanics.
Relativity and quantum mechanics are together the two pillars of modern physics.
His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been dubbed "the world's most famous equation".
His work is also known for its influence on the philosophy of science.
He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum theory.
His intellectual achievements and originality resulted in "Einstein" becoming synonymous with "genius"
"""
translator(article)[0]["translation_text"]
Output:
Albert Einstein (* 14. März 1879 – 18. April 1955) war ein deutscher theoretischer Physiker, der allgemein als einer der größten Physiker aller Zeiten anerkannt wurde.
Einstein ist am besten für die Entwicklung der Relativitätstheorie bekannt, aber er leistete auch wichtige Beiträge zur Entwicklung der Quantenmechaniktheorie.
Relativität und Quantenmechanik sind zusammen die beiden Säulen der modernen Physik.
Seine Massenenergieäquivalenzformel E = mc2, die aus der Relativitätstheorie hervorgeht, wurde als „die berühmteste Gleichung der Welt" bezeichnet.
Seine Arbeit ist auch für ihren Einfluss auf die Philosophie der Wissenschaft bekannt.
Er erhielt 1921 den Nobelpreis für Physik „für seine Verdienste um die theoretische Physik und vor allem für seine Entdeckung des Gesetzes über den photoelektrischen Effekt", einen entscheidenden Schritt in der Entwicklung der Quantentheorie.
Seine intellektuellen Leistungen und Originalität führten dazu, dass „Einstein" zum Synonym für „Genius" wurde.
I have tested this output on Google Translate to get it back in English, and it seems to be an excellent translation!
Since the pipeline API doesn't give us much control over how the translation is generated, let's use the model and tokenizer directly:
def get_translation_model_and_tokenizer(src_lang, dst_lang):
    """
    Given the source and destination languages, returns the appropriate model and tokenizer
    See the language codes here: https://developers.google.com/admin-sdk/directory/v1/languages
    For the 3-character language codes, you can google for the code!
    """
    # construct our model name from the function arguments
    model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{dst_lang}"
    # initialize the tokenizer & model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # return them for use
    return model, tokenizer
The above function returns the appropriate model and tokenizer given src_lang and dst_lang, the source and destination languages, respectively. For a list of language codes, consider checking this page. For instance, let's try English to Chinese:
# source & destination languages
src = "en"
dst = "zh"
model, tokenizer = get_translation_model_and_tokenizer(src, dst)
To translate our previous paragraph, we first need to tokenize the text:
# encode the text into tensor of integers using the appropriate tokenizer
inputs = tokenizer.encode(article, return_tensors="pt", max_length=512, truncation=True)
print(inputs)
Output:
tensor([[32614, 53456, 22, 992, 776, 822, 4048, 8, 3484, 822,
820, 50940, 17, 43, 13, 8214, 16, 32941, 34899, 60593,
2, 5514, 7131, 9, 34, 141, 4, 3, 7680, 60593,
24, 4, 61, 220, 6, 53456, 32, 1109, 3305, 15,
320, 3, 19082, 4, 1294, 24030, 28453, 2, 187, 172,
81, 157, 435, 1061, 9, 3, 92, 4, 3, 19082,
4, 52682, 54813, 6, 45978, 28453, 7, 52682, 54813, 46,
1105, 3, 263, 12538, 4, 6683, 46089, 6, 1608, 3196,
3484, 45425, 50560, 14655, 509, 8, 6873, 4374, 149, 9132,
62, 22703, 51, 1294, 24030, 28453, 19082, 2, 66, 74,
16044, 18553, 258, 40, 1862, 431, 23, 24, 447, 23761,
47364, 10594, 1608, 119, 32, 81, 3305, 15, 45, 6748,
19, 3, 34857, 4, 4102, 6, 250, 948, 3, 912,
774, 38354, 33321, 11, 58505, 40, 4161, 175, 307, 9,
34899, 46089, 2, 7, 978, 15, 175, 34026, 4, 3,
191, 4, 3, 17952, 57867, 1766, 19622, 13, 29632, 2827,
11, 3, 92, 4, 52682, 19082, 6, 1608, 6875, 5710,
7, 5099, 2665, 3897, 11, 40, 338, 767, 40272, 480,
6588, 57380, 29, 40, 9994, 20506, 480, 0]])
The tokenizer.encode() method splits the text into tokens and converts them to IDs. We set return_tensors to "pt" so it returns a PyTorch tensor. We also set max_length to 512 and truncation to True, so any input longer than 512 tokens is cut off at that length.
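Because truncation simply drops everything past max_length, very long documents lose their tail. One possible workaround, sketched below, is to split the text into sentence groups that each fit the budget and translate them one by one. chunk_sentences and its count_tokens parameter are hypothetical names introduced for this example. The default counter just counts whitespace-separated words, which is only a rough stand-in for real token counts, and the regex-based sentence split is deliberately naive.

```python
import re


def chunk_sentences(text, max_tokens=512, count_tokens=lambda s: len(s.split())):
    """Group sentences into chunks whose estimated token count stays within max_tokens."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_sentences("One two three. Four five six. Seven eight.", max_tokens=6))
# ['One two three. Four five six.', 'Seven eight.']
```

With the tokenizer from this tutorial you could pass count_tokens=lambda s: len(tokenizer.encode(s)) to count real tokens. A single sentence longer than the budget still ends up in its own chunk and would still be truncated.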
Let's now use greedy search to generate the translation for this:
# generate the translation output using greedy search
greedy_outputs = model.generate(inputs)
# decode the output and ignore special tokens
print(tokenizer.decode(greedy_outputs[0], skip_special_tokens=True))
We simply use the model.generate() method to get the outputs. Since the outputs are token IDs, we need to decode them back to human-readable text. We also set skip_special_tokens to True so we don't see tokens such as <pad>. Here is the output:
阿尔伯特·爱因斯坦(1879年3月14日至1955年4月18日)是德国出生的理论物理学家,被广泛承认是有史以来最伟大的物理学家之一。爱因斯坦以发展相对论闻名,但他也为量子力学理论的发展做出了重要贡献。相对论和量子力学是现代物理学的两大支柱。他的质量 — — 能源等值公式E = mc2来自相对论,被称作“世界最著名的方程 ” 。 他的工作也因其对科学哲学的影响而著称。 他获得了1921年诺贝尔物理奖,“因为他对理论物理学的服务,特别是他发现了光电效应法 ”, 这是量子理论发展的关键一步。 他的智力成就和创举导致“Einstein”成为“genius”的同义词。
You can also use beam search instead of greedy search, which may generate better translations:
# generate the translation output using beam search
beam_outputs = model.generate(inputs, num_beams=3)
# decode the output and ignore special tokens
print(tokenizer.decode(beam_outputs[0], skip_special_tokens=True))
We set num_beams to 3. I suggest reading this blog post or our tutorials on text summarization and conversational AI chatbots for more information about beam search. The output:
阿尔伯特·爱因斯坦(1879年3月14日至1955年4月18日)是德国出生的理论物理学家,被广泛承认是有史以来最伟大的物理学家之一。爱因斯坦以发展相对论闻名,但他也为量子力学理论的发展做出了重要贡献。相对论和量子力学是现代物理学的两大支柱。来自相对论的其质量 — — 能源等值公式E=mc2被称作“世界上最著名的方程式 ” 。他的工作也因其对科学哲学的影响而著称。他获得了1921年诺贝尔物理奖,“因为他对理论物理学的服务,特别是他发现了光电效应法 ”, 这是量子理论发展的关键一步。他的智力成就和原创性导致了“Einstein”与“genius”的同义。
It was a slightly different translation, and both seemed to be good translations when I translated them back to English using Google Translate.
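To see why beam search can find a better sequence than greedy search, here is a toy sketch in plain Python. It does not use the translation model at all: TOY_MODEL is a made-up table of next-token probabilities, and beam_search is a simplified illustration rather than the algorithm as implemented in model.generate() (for instance, it applies no length normalization).

```python
import math

# Made-up next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"the": 0.5, "a": 0.4, "<eos>": 0.1},
    ("the",): {"cat": 0.4, "dog": 0.35, "<eos>": 0.25},
    ("a",): {"cat": 0.9, "<eos>": 0.1},
    ("the", "cat"): {"<eos>": 1.0},
    ("the", "dog"): {"<eos>": 1.0},
    ("a", "cat"): {"<eos>": 1.0},
}


def beam_search(model, num_beams=1, num_return_sequences=1, max_len=5):
    """Return the best finished sequences as (tokens, log_probability) pairs."""
    beams = [((), 0.0)]  # unfinished hypotheses
    finished = []
    for _ in range(max_len):
        # Expand every hypothesis by every possible next token.
        candidates = [
            (tokens + (token,), score + math.log(prob))
            for tokens, score in beams
            for token, prob in model[tokens].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep only the num_beams most probable candidates.
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "<eos>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Hypotheses still unfinished after max_len steps are discarded.
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:num_return_sequences]


for k in (1, 2):
    tokens, score = beam_search(TOY_MODEL, num_beams=k)[0]
    print(k, tokens, round(math.exp(score), 2))
```

With one beam (greedy search) the sketch commits to "the" because it is the most probable first token, and ends with a sequence of probability 0.2. With two beams it also keeps "a" alive and finds "a cat", whose overall probability of 0.36 is higher.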
We can also generate more than one translation in one go:
# let's change target language
src = "en"
dst = "ar"
# get en-ar model & tokenizer
model, tokenizer = get_translation_model_and_tokenizer(src, dst)
# yet another example
text = "It can be severe, and has caused millions of deaths around the world as well as lasting health problems in some who have survived the illness."
# tokenize the text
inputs = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)
# this time we use 5 beams and return 5 sequences and we can compare!
beam_outputs = model.generate(
    inputs,
    num_beams=5,
    num_return_sequences=5,
    early_stopping=True,
)
for i, beam_output in enumerate(beam_outputs):
    print(tokenizer.decode(beam_output, skip_special_tokens=True))
    print("="*50)
We set num_return_sequences to 5 to return the five most probable translations; make sure that num_beams >= num_return_sequences. The output:
ويمكن أن تكون حادة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض الذين نجوا من المرض.
==================================================
ويمكن أن تكون خطيرة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض الذين نجوا من المرض.
==================================================
ويمكن أن تكون حادة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة لدى بعض الذين نجوا من المرض.
==================================================
ويمكن أن تكون حادة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض من نجوا من المرض.
==================================================
ويمكن أن تكون حادة، وقد تسببت في وفاة ملايين الأشخاص في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض الذين نجوا من المرض.
==================================================
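The num_beams >= num_return_sequences constraint can be checked before calling model.generate(). Here is a minimal sketch; beam_search_kwargs is a hypothetical helper introduced for illustration, and it only builds a dictionary of the three keyword arguments used in this tutorial.

```python
def beam_search_kwargs(num_beams=5, num_return_sequences=1, early_stopping=True):
    """Build keyword arguments for model.generate(), checking the beam constraint first."""
    if num_return_sequences > num_beams:
        raise ValueError(
            f"num_return_sequences ({num_return_sequences}) must not exceed "
            f"num_beams ({num_beams})"
        )
    return {
        "num_beams": num_beams,
        "num_return_sequences": num_return_sequences,
        "early_stopping": early_stopping,
    }


print(beam_search_kwargs(num_beams=5, num_return_sequences=5))
# {'num_beams': 5, 'num_return_sequences': 5, 'early_stopping': True}
```

The result can then be unpacked into the call, as in model.generate(inputs, **beam_search_kwargs(5, 5)), which keeps the settings for different experiments in one place.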
That's it for this tutorial! I suggest you try your own language pair and text to see which model.generate() parameters work best for you.
As stated above, the model.generate() method has many parameters; most of them are explained in the Hugging Face blog post or in our tutorials on text summarization and conversational AI chatbots.
Also, there are 1300+ pre-trained models on the Helsinki-NLP page, so there is a good chance your native language is covered!
Check the complete code here.
Happy learning ♥