How to Make a Language Detector in Python

Explore different libraries for detecting natural languages such as langdetect, langid, googletrans, and language-detector in Python.
10 min read · Updated Sep 2022 · Application Programming Interfaces · Natural Language Processing

Python is a general-purpose language with applications in many fields. In this article, we will use Python to build a language detector.

A language detector is a tool that automatically identifies the language of a given text. This is useful in several situations: for example, categorizing or filtering the articles on your blog by language, or cleaning data in your data science projects.

Python has many packages for language detection; this article covers four of them: langdetect, langid, googletrans, and language_detector.

Throughout this article, we will be using these sentences to detect the languages:

I love programming, Python is my favorite language.
أحب البرمجة ، بايثون هي لغتي المفضلة.
我喜欢编程,Python 是我最喜欢的语言。
Me encanta programar, Python es mi lenguaje favorito.
Eu amo programar, Python é minha linguagem favorita.

These are the same sentence translated into different languages. To access all the language codes we will use in this article, visit this page.
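
Detectors do not all report languages in the same form: some return a bare two-letter code, while others append a region suffix (langdetect, for example, reports simplified Chinese as zh-cn). A minimal sketch of a helper that turns such a code into a readable name follows. The LANGUAGE_NAMES table and the describe_code() name are illustrative assumptions that cover only the five sample sentences; extend the table for your own data.

```python
# Illustrative lookup table (an assumption, not part of any library):
# ISO 639-1 codes for the five sample sentences used in this article.
LANGUAGE_NAMES = {
    "en": "English",
    "ar": "Arabic",
    "zh": "Chinese",
    "es": "Spanish",
    "pt": "Portuguese",
}


def describe_code(code):
    """Return a readable name for a language code such as 'en' or 'zh-cn'."""
    # Keep only the part before any region suffix and ignore letter case
    base = code.strip().lower().split("-")[0]
    # Fall back to the raw code when the table has no entry for it
    return LANGUAGE_NAMES.get(base, code)


print(describe_code("en"))     # English
print(describe_code("zh-CN"))  # Chinese
print(describe_code("xx"))     # xx
```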

Installing the Required Packages

Before anything else, we need to install the required packages, as they are not part of Python's standard library. We will first create a virtual environment and then install the packages in it:

$ python -m venv project

Activate it with this command on Windows:

$ .\project\Scripts\activate

or Linux/macOS:

$ source project/bin/activate

Now that the virtual environment is up and running, let us install all the packages that we are going to use:

$ pip install langdetect langid googletrans==3.1.0a0 language-detector
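
To confirm that the installation worked inside the active virtual environment, a quick check like the one below may help. It is only a sketch: missing_modules() is a hypothetical helper, and the module names are the import names used later in this article (the language-detector distribution is imported as language_detector).

```python
from importlib.util import find_spec

# Import names of the four packages used in this article
MODULES = ["langdetect", "langid", "googletrans", "language_detector"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages are installed")
```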

Building a Language Detector Command Line Tool

In this section, we will build the language detector command line tool using one package at a time. In your project folder, create two files named language_detector_cli_1.py and sentences.txt.

You can name the files whatever you like, as long as the names are meaningful. The sentences.txt file holds the sentences whose language we want to detect, so open it and paste these lines:

I love programming, Python is my favorite language.
أحب البرمجة ، بايثون هي لغتي المفضلة.
我喜欢编程,Python 是我最喜欢的语言。
Me encanta programar, Python es mi lenguaje favorito.
Eu amo programar, Python é minha linguagem favorita.

You can add as many sentences as you want to this file.
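
If pasting scrambles the right-to-left Arabic line, or your editor saves the file in an unexpected encoding, the file can also be created from Python. This is a minimal sketch that assumes UTF-8 is the encoding you want; write_sentences() is a hypothetical helper name.

```python
from pathlib import Path

SENTENCES = [
    "I love programming, Python is my favorite language.",
    "أحب البرمجة ، بايثون هي لغتي المفضلة.",
    "我喜欢编程,Python 是我最喜欢的语言。",
    "Me encanta programar, Python es mi lenguaje favorito.",
    "Eu amo programar, Python é minha linguagem favorita.",
]


def write_sentences(path="sentences.txt"):
    """Write one sentence per line, explicitly encoded as UTF-8."""
    target = Path(path)
    target.write_text("\n".join(SENTENCES) + "\n", encoding="utf-8")
    return target


write_sentences()
```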

langdetect

Our first implementation of the language detector command line tool uses the langdetect package. According to its documentation, it supports 55 languages and is a Python port of Google's language-detection library.

Open the .py file we have just created and paste the following code:

# import the detect function from langdetect
from langdetect import detect
# opening the txt file in read mode; UTF-8 is set explicitly for the non-Latin sentences
with open('sentences.txt', 'r', encoding='utf-8') as sentences_file:
    # creating a list of sentences using the readlines() function
    sentences = sentences_file.readlines()

Here we import the detect() function from the langdetect package; we will use it to detect the language of words or sentences. Then we open the sentences.txt file in read mode and read all the sentences from it into a list.

Let us now create the function for detecting languages; we will call it detect_language(). Paste this code:

# a function for detecting the language of the sentence at index n
def detect_language(sentences, n):
    """Print the detected language of sentences[n], handling common errors."""
    try:
        # strip() removes the trailing newline and surrounding whitespace
        new_sentence = sentences[n].strip()
        print(f'The language for the sentence "{new_sentence}" is {detect(new_sentence)}')
    # raised when there is no sentence at index n
    except IndexError:
        print('Sentence does not exist')
    # anything else, for example langdetect failing on an empty line
    except Exception as error:
        print(f'Could not detect the language: {error}')

Let us break down the code inside the detect_language() function so that we are on the same page. The function takes two arguments: sentences, a list of str, and n, an int index into that list. Inside the function, a try/except block handles errors.

Inside the try statement, we look up the sentence at index n, remove the newline character from it, and detect its language. The first except catches the IndexError raised when there is no sentence at that index, and the second reports any other error, such as langdetect failing on an empty line. Finally, just below the function, paste this code:

# printing the number of sentences in sentences.txt
print(f'You have {len(sentences)} sentences')
# this will prompt the user to enter an integer
number_of_sentence = int(input('Which sentence do you want to detect?(Provide an integer please):'))
# calling the detect_language function with the list of sentences
detect_language(sentences, number_of_sentence)

We have a print() call, an input() call for getting data from the user, and a call to our function. Let us now test the program; we will detect the language of the first sentence in the sentences.txt file, whose index is 0.

Now let's run it:

$ python language_detector_cli_1.py

The output will be as follows after providing 0 as input:

You have 5 sentences
Which sentence do you want to detect?(Provide an integer please):0
The language for the sentence "I love programming, Python is my favorite language." is en

If you check the language codes, en is for English.
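
The script converts the typed value with int() directly, so input such as abc stops the program with a ValueError before the detector is ever reached. One possible way to validate the input first is sketched below; parse_index() is a hypothetical helper, not part of the original script.

```python
def parse_index(raw, count):
    """Return raw as a valid index into a list of `count` items, or None."""
    try:
        index = int(raw)
    except ValueError:
        # raw was not a whole number, for example "abc" or "1.5"
        return None
    # Accept only 0 .. count - 1, so negative numbers are rejected as well
    return index if 0 <= index < count else None


print(parse_index("0", 5))    # 0
print(parse_index("7", 5))    # None
print(parse_index("abc", 5))  # None
```

In the script, the result could be checked for None before calling the detection function, and the user asked again otherwise.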

Note: As mentioned in the documentation, the langdetect package uses a non-deterministic algorithm, which means that you might get different results every time you try to detect a short or ambiguous text.
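
One way to smooth out that randomness is to run the detector several times and keep the most frequent answer. The sketch below works with any detection function; most_common_guess() and its runs parameter are illustrative assumptions, not part of langdetect.

```python
from collections import Counter


def most_common_guess(detect_fn, text, runs=5):
    """Call detect_fn(text) several times and return the most frequent answer."""
    guesses = Counter(detect_fn(text) for _ in range(runs))
    # most_common(1) returns a list holding the top (answer, count) pair
    return guesses.most_common(1)[0][0]


# Usage with langdetect (requires the package to be installed):
#   from langdetect import detect
#   print(most_common_guess(detect, "Python is my favorite language."))
```

The langdetect README also documents a simpler option: importing DetectorFactory from langdetect and setting DetectorFactory.seed = 0 before the first detection makes the results reproducible.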

langid

The second package that we can use for language detection is langid. Create a new Python file, name it language_detector_cli_2.py, and make it look like this:

import langid

# opening the txt file in read mode with an explicit UTF-8 encoding
with open('sentences.txt', 'r', encoding='utf-8') as sentences_file:
    # creating a list of sentences using the readlines() function
    sentences = sentences_file.readlines()
# looping through all the sentences in the sentences.txt file
for sentence in sentences:
    # detecting the language; classify() returns a (language code, score) tuple
    lang = langid.classify(sentence)
    # formatting the sentence by removing the newline characters
    formatted_sentence = sentence.strip('\n')
    print(f'The sentence "{formatted_sentence}" is in {lang[0]}')

In the above code snippet, we import langid, open the sentences.txt file, and read the five sentences.

After that, we loop through the sentences and detect the language of each one using the langid.classify() function. It takes a sentence as an argument and returns a tuple whose first element is the language code, which we print in the formatted result.

Let's run it:

$ python language_detector_cli_2.py

The output we get is this:

The sentence "I love programming, Python is my favorite language." is in en
The sentence "أحب البرمجة ، بايثون هي لغتي المفضلة." is in ar
The sentence "我喜欢编程,Python 是我最喜欢的语言。" is in zh
The sentence "Me encanta programar, Python es mi lenguaje favorito." is in es
The sentence "Eu amo programar, Python é minha linguagem favorita." is in gl

All the predictions are correct except the last one: gl is the code for Galician, a language closely related to Portuguese, whereas the expected result is pt.
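
Spotting such mistakes by eye gets harder as sentences.txt grows. A rough sketch of how a detector could be scored against known answers follows; evaluate() is a hypothetical helper, and the stand-in detector simply replays two lines of the langid output shown above so that the example runs without langid installed.

```python
def evaluate(detect_fn, labelled_sentences):
    """Return (accuracy, mistakes) for a detector over (text, expected) pairs."""
    mistakes = []
    for text, expected in labelled_sentences:
        predicted = detect_fn(text)
        if predicted != expected:
            mistakes.append((text, expected, predicted))
    total = len(labelled_sentences)
    return (total - len(mistakes)) / total, mistakes


# Stand-in detector: replays the output reported in this article
article_output = {
    "I love programming, Python is my favorite language.": "en",
    "Eu amo programar, Python é minha linguagem favorita.": "gl",
}
labelled = [
    ("I love programming, Python is my favorite language.", "en"),
    ("Eu amo programar, Python é minha linguagem favorita.", "pt"),
]

accuracy, mistakes = evaluate(article_output.get, labelled)
print(accuracy)  # 0.5
print(mistakes)
```

With langid installed, the detector argument could be lambda text: langid.classify(text)[0]. The langid README also describes langid.set_languages(), which restricts the candidate languages and may help with close pairs such as Portuguese and Galician.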

googletrans

Third on the list is the googletrans package, which can be used for both translation and language detection. We have used it in the text translation tutorial if you want to check it out.

The documentation describes it as free and unlimited. Keep in mind that it is an unofficial client for the Google Translate web service, so it needs an internet connection. Now let us use it for detecting languages; open a new Python file, name it language_detector_cli_3.py, and add the following:

# importing the Translator class from googletrans
from googletrans import Translator
# translator object
translator = Translator()

We import the Translator class from googletrans and create an instance of it. Since we will be getting sentences from the sentences.txt file, we need to open it and read the sentences:

# opening the txt file in read mode with an explicit UTF-8 encoding
with open('sentences.txt', 'r', encoding='utf-8') as sentences_file:
    # creating a list of sentences using the readlines() function
    sentences = sentences_file.readlines()

And for detecting the languages, let us use this function:

# a function for detecting the language of the sentence at index n
def detect_language(sentences, n):
    """Print the detected language of sentences[n], handling common errors."""
    try:
        # strip() removes the trailing newline and surrounding whitespace
        new_sentence = sentences[n].strip()
        # detecting the sentence language using translator.detect();
        # .lang extracts the language code
        detected_sentence_lang = translator.detect(new_sentence).lang
        print(f'The language for the sentence "{new_sentence}" is {detected_sentence_lang}')
    # raised when there is no sentence at index n
    except IndexError:
        print('Sentence does not exist')
    # anything else, most likely a network problem
    except Exception as error:
        print(f'Detection failed, make sure you have an internet connection: {error}')

The above function is similar to the one we used for the langdetect package; the difference is the detection line inside the try block.

We use translator.detect() to detect the language and read the language code from the lang attribute of the result.

And finally, paste these lines of code after the function:

print(f'You have {len(sentences)} sentences')
# this will prompt the user to enter an integer
number_of_sentence = int(input('Which sentence do you want to detect?(Provide an integer please):'))
# calling the detect_language function with the list of sentences
detect_language(sentences, number_of_sentence)

To run the program, use this command:

$ python language_detector_cli_3.py

You will be prompted to choose the sentence to detect, and this is the output you get after providing valid input:

You have 5 sentences
Which sentence do you want to detect?(Provide an integer please):1
The language for the sentence "أحب البرمجة ، بايثون هي لغتي المفضلة." is ar
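
Because googletrans talks to a web service, a detection can fail for reasons that have nothing to do with the sentence, such as a dropped connection. A simple retry wrapper is one way to make the tool more forgiving. This is a sketch only: detect_with_retry() and its attempts and delay parameters are assumptions for illustration, not part of googletrans.

```python
import time


def detect_with_retry(detect_fn, text, attempts=3, delay=1.0):
    """Call detect_fn(text), retrying after a pause when it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return detect_fn(text)
        except Exception as error:
            # Give up and re-raise once the last attempt has failed
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({error}), retrying in {delay}s")
            time.sleep(delay)


# Usage with googletrans (needs the package and an internet connection):
#   from googletrans import Translator
#   translator = Translator()
#   print(detect_with_retry(lambda t: translator.detect(t).lang, "Hola"))
```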

language_detector

Our final package for language detection is language_detector. Without further ado, open a new Python file, name it language_detector_cli_4.py, and import the package:

from language_detector import detect_language

Now we will create a function for handling language detection and name it detectLanguage().

Note that we have imported detect_language from language_detector, and our own function must not reuse that name, or it would shadow the import; that is why we have named it detectLanguage(). The function takes text as an argument:

def detectLanguage(text):
    # detecting the language using the detect_language function
    language = detect_language(text)
    print(f'"{text}" is written in {language}')

In the above function, the text passed to detectLanguage() is handed on to detect_language(), and the result is printed.

Just after the detectLanguage() function, paste this code:

# an infinite while loop
while True:
    # this will prompt the user to enter an option
    option = input('Enter 1 to detect language or 0 to exit:')
    if option == '1':
        # this will prompt the user to enter the text
        data = input('Enter your sentence or word here:')
        # calling the detectLanguage function
        detectLanguage(data)
    # if option is 0, break the loop
    elif option == '0':
        print('Quitting........\nByee!!!')
        break
    # if option is neither 1 nor 0, it is invalid
    else:
        print('Wrong input, try again!!!')

In the above code snippet, we have an infinite while loop in which the user is prompted to enter one of two options, 1 or 0. If the option is 1, the user is prompted for the text to detect; if it is 0, the loop ends; and if it is neither, the user is told that the input was wrong.

To test this program, run:

$ python language_detector_cli_4.py

The output:

Enter 1 to detect language or 0 to exit:1
Enter your sentence or word here:J'adore programmer, Python est mon langage préféré
"J'adore programmer, Python est mon langage préféré" is written in French
Enter 1 to detect language or 0 to exit:0
Quitting........
Byee!!!
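
The interactive loop handles one sentence at a time. To run a detector over every line of sentences.txt instead, a small batch helper along these lines could be used. It is a sketch under the assumption that the file is UTF-8 encoded; detect_file() is a hypothetical name, and it accepts any detection function.

```python
from pathlib import Path


def detect_file(path, detect_fn):
    """Return (sentence, language) pairs for every non-empty line in a file."""
    results = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        sentence = line.strip()
        if sentence:  # skip blank lines
            results.append((sentence, detect_fn(sentence)))
    return results


# Usage with language_detector (requires the package to be installed):
#   from language_detector import detect_language
#   for sentence, language in detect_file("sentences.txt", detect_language):
#       print(f'"{sentence}" is written in {language}')
```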

Conclusion 

This article has shown you how to make a language detector using Python. We have not covered every package that can detect languages, but we hope that you now know how to detect languages using Python.

Some of the packages we used were not always precise (langid, for instance, labeled the Portuguese sentence as Galician), but the Python ecosystem offers other packages for the job. I invite you to experiment with the libraries and see which one fits you best.

You can get all the scripts here.

Learn also: How to Translate Text in Python.

Happy coding ♥
