Google Cloud Translation API with Python

Let's walk through integrating Google's Cloud Translation API into Python applications.

Google provides both out-of-the-box Neural Machine Translation (NMT) and custom translation through training an AutoML model. In this blog, we're only focusing on the default NMT model, which excels in general use cases but might encounter challenges with technical domain-specific terms. However, we can enhance its performance by incorporating glossaries—custom dictionaries utilized by the Cloud Translation API to ensure consistent translation of the customer's domain-specific terminology.

Setup

First, go to the GCP Console (console.cloud.google.com) and create a project.

Then search for "Cloud Translation API" in the console search bar and enable the API for that project.

Authentication

Before we can use this API, we need to set up authentication.

We are going to use Cloud Translation - Advanced (v3), which does not support API keys for authentication. Instead, we will authenticate with our user credentials through ADC (Application Default Credentials).

To provide your user credentials to ADC, we need to use the Google Cloud CLI:

  1. Install and initialize the gcloud CLI

    The following commands are for Debian 9+ or Ubuntu 18.04+. Refer to this guide for other distros.

    1. First, update the package index and make sure apt-transport-https and curl are installed.

       sudo apt-get update
       sudo apt-get install apt-transport-https ca-certificates gnupg curl sudo
      
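    2. Import the Google Cloud public key, then add the gcloud CLI distribution URI as a package source.

       curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
       echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list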
      
    3. Update and install the gcloud CLI:

       sudo apt-get update && sudo apt-get install google-cloud-cli
      
    4. Run gcloud init to get started:

       gcloud init
      

      You will be asked to log in and select a project.

  2. Create local authentication credentials for your Google Account:

     gcloud auth application-default login
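
As a quick sanity check, the sketch below looks for the credentials file that the login command writes. It assumes the default gcloud locations (~/.config/gcloud/application_default_credentials.json on Linux and macOS, %APPDATA%\gcloud\application_default_credentials.json on Windows) and the GOOGLE_APPLICATION_CREDENTIALS override. The function name find_adc_file is made up for this post, and the check only confirms that a file exists, not that the credentials are valid.

```python
import os
from pathlib import Path
from typing import Optional

ADC_FILENAME = "application_default_credentials.json"


def find_adc_file(env: Optional[dict] = None, home: Optional[Path] = None) -> Optional[Path]:
    """Return the path of an existing ADC credentials file, or None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    candidates = []
    # The environment variable override is checked first.
    override = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if override:
        candidates.append(Path(override))
    # Default location written by `gcloud auth application-default login`.
    if os.name == "nt" and env.get("APPDATA"):
        candidates.append(Path(env["APPDATA"]) / "gcloud" / ADC_FILENAME)
    else:
        candidates.append(home / ".config" / "gcloud" / ADC_FILENAME)

    for path in candidates:
        if path.is_file():
            return path
    return None


found = find_adc_file()
print(found if found else "No ADC file found; run: gcloud auth application-default login")
```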
    

Install the Python library

Now we need to install the Python client library. Using a venv to isolate dependencies is recommended.

pip install --upgrade google-cloud-translate

Now we are all set.

Translate Text

To translate text, set the project ID, the text to translate, the source language, and the target language.

The supported language codes can be found here.

from google.cloud import translate

def translate_text(
        text: str,
        project_id: str,
        source_language: str,
        target_language: str
    ) -> translate.TranslateTextResponse:

    client = translate.TranslationServiceClient()

    location = "us-central1"

    parent = f"projects/{project_id}/locations/{location}"

    # https://cloud.google.com/translate/docs/supported-formats
    response = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",  # mime types: text/plain, text/html
            "source_language_code": source_language,
            "target_language_code": target_language,
        }
    )

    return response


project_id = "your-project-id"
text_to_translate = "Hey, How are you?"
source_language = "en-US"
target_language = "fr"

response = translate_text(text_to_translate, project_id, source_language, target_language)
print(response)

Output:

translations {
  translated_text: "Hey comment allez-vous?"
}
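
Printing the whole response is handy for a first look, but most applications only need the translated strings. Here is a minimal sketch, assuming the response exposes a translations list whose items carry translated_text, as the output above suggests. The helper name extract_translations is my own, and a SimpleNamespace stands in for a real response so the snippet runs without calling the API.

```python
from types import SimpleNamespace
from typing import List


def extract_translations(response) -> List[str]:
    """Collect translated_text from every entry in response.translations."""
    return [t.translated_text for t in response.translations]


# Stand-in object shaped like the response printed above (not a real API call).
fake_response = SimpleNamespace(
    translations=[SimpleNamespace(translated_text="Hey comment allez-vous?")]
)
print(extract_translations(fake_response))
```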

Batch Translate

Batch translation lets you translate large amounts of text offline in a single request: up to 100 input files per batch, into up to 10 target languages. The total content size must be at most 100M Unicode code points, and the files must be UTF-8 encoded.
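
Since a batch job can take a while to fail, it may be worth checking these limits locally first. The sketch below is one way to do that, assuming you have the file contents available as Python strings. The function name check_batch_limits and the constants are hypothetical helpers for this post, not part of the client library.

```python
from typing import List

MAX_INPUT_FILES = 100            # input files per batch
MAX_TARGET_LANGUAGES = 10        # target languages per batch
MAX_CODE_POINTS = 100_000_000    # total Unicode code points


def check_batch_limits(texts: List[str], target_language_codes: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the batch is within limits."""
    problems = []
    if len(texts) > MAX_INPUT_FILES:
        problems.append(f"too many input files: {len(texts)} > {MAX_INPUT_FILES}")
    if len(target_language_codes) > MAX_TARGET_LANGUAGES:
        problems.append(
            f"too many target languages: {len(target_language_codes)} > {MAX_TARGET_LANGUAGES}"
        )
    # len() on a Python str counts Unicode code points, not UTF-8 bytes.
    total = sum(len(text) for text in texts)
    if total > MAX_CODE_POINTS:
        problems.append(f"too much text: {total} > {MAX_CODE_POINTS} code points")
    return problems


print(check_batch_limits(["I am a computer science student."], ["bn", "fr"]))  # prints []
```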

Batch translation needs somewhere to store the input and output files, so let's create two Cloud Storage buckets, one for input and one for output. Search for "buckets" in the search box, or go to Cloud Storage > Buckets from the navigation menu, and click +Create.

  • Give your bucket a name. Names must be globally unique.

  • Select a single region to minimize cost.

  • Choose the standard storage class.

  • Leave other options as they are and create the bucket.

Similarly, create another bucket for storing output files.

Now create a .txt file with the text that you want to translate and upload it to the input bucket.

In the output bucket, create a folder to store the translated files. For batch translation, the target folder must be empty.

Now adjust the parameters in the following script as needed.

from google.cloud import translate

def batch_translate_text(
    input_uris: list,
    output_uri: str,
    project_id: str,
    source_language_code: str,
    target_language_codes: list,
    timeout: int = 180,
) -> translate.BatchTranslateResponse:
    """Translates a batch of texts on GCS and stores the result in a GCS location.

    Args:
        input_uris: The gs:// URIs of the input files to be translated.
        output_uri: The output URI of the translated texts.
        project_id: The ID of the project that owns the destination bucket.
        source_language_code: The language code of the input text.
        target_language_codes: The language codes of the output texts (up to 10).
        timeout: The timeout for this batch translation operation.

    Returns:
        The batch response with character counts; the translated files are written to output_uri.
    """

    client = translate.TranslationServiceClient()

    location = "us-central1"  # batch translation does not support the "global" location

    input_configs_element = [
        {
            "gcs_source": {"input_uri": uri},
            "mime_type": "text/plain",
        }
        for uri in input_uris
    ]


    gcs_destination = {"output_uri_prefix": output_uri}
    output_config = {"gcs_destination": gcs_destination}
    parent = f"projects/{project_id}/locations/{location}"

    operation = client.batch_translate_text(
        request={
            "parent": parent,
            "source_language_code": source_language_code,
            "target_language_codes": target_language_codes,  # Up to 10 language codes here.
            "input_configs": input_configs_element,
            "output_config": output_config,
        }
    )

    print("Operation ID: {}".format(operation.operation.name)) # needed if we want to cancel the operation
    print("Waiting for operation to complete...")
    response = operation.result(timeout)

    print(f"Total Characters: {response.total_characters}")
    print(f"Translated Characters: {response.translated_characters}")

    return response

project_id = "your-project-id"
input_uris = ["gs://your-input-bucket/path-to-file"] #list of your input files
output_uri = "gs://your-output-bucket/output-folder/"
source_language_code = "en"
target_language_codes = ["bn", "fr"] #list of your target languages (up to 10)

batch_translate_text(input_uris, output_uri, project_id, source_language_code, target_language_codes)

You can get the gs:// URI of each file from its object details in the Cloud Storage console.
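
If you would rather build the URIs in code than copy them from the console, a small helper is enough. This is only a sketch: gcs_uri and as_output_prefix are hypothetical names, and the trailing slash on the output prefix reflects my understanding that the API expects the output location to be a folder-style prefix ending in "/".

```python
def gcs_uri(bucket: str, object_path: str = "") -> str:
    """Build a gs:// URI from a bucket name and an object path inside it."""
    return f"gs://{bucket.strip('/')}/{object_path.lstrip('/')}"


def as_output_prefix(uri: str) -> str:
    """Make sure an output URI ends with a slash so it reads as a folder prefix."""
    return uri if uri.endswith("/") else uri + "/"


input_uris = [gcs_uri("your-input-bucket", "path-to-file")]
output_uri = as_output_prefix(gcs_uri("your-output-bucket", "output-folder"))
print(input_uris, output_uri)
```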

Here I used one input file and two target languages, bn and fr. After the batch process finishes, the translated output files are stored in my output bucket.

input

I am a computer science student.
I am not good at cloud tech.
I am learning google's Cloud Translation API.
I needed to create a storage bucket to store the file I want to translate.
I am also learning linear algebra.

output-bn

আমি একজন কম্পিউটার বিজ্ঞানের ছাত্র।
আমি মেঘ প্রযুক্তিতে ভাল নই।
আমি গুগলের ক্লাউড অনুবাদ API শিখছি।
আমি যে ফাইলটি অনুবাদ করতে চাই তা সঞ্চয় করার জন্য আমাকে একটি স্টোরেজ বালতি তৈরি করতে হবে।
আমি লিনিয়ার বীজগণিতও শিখছি।

output-fr

Je suis étudiant en informatique.
Je ne suis pas doué en technologie cloud.
J'apprends l'API Cloud Translation de Google.
J'avais besoin de créer un bucket de stockage pour stocker le fichier que je souhaite traduire.
J'apprends également l'algèbre linéaire.

Glossary

The outputs above have a problem: some words are translated to Bengali by their literal meaning. (I do not speak French, so I am going to skip it for now.)

  • Computer science is also called 'computer science' in Bengali, so 'science' does not need to be translated. The same goes for linear algebra.

  • 'Cloud' is translated as a literal cloud.

  • 'Cloud Translation API' should not be translated word by word.

  • 'Bucket' should not be translated to বালতি either.

This is where Glossary comes in.

A glossary is a custom dictionary the Cloud Translation API uses to consistently translate the customer's domain-specific terminology. This typically involves specifying how to translate a named entity.

You might use a glossary for the following use cases:

  • Product names: For example, "Google Home" must translate to "Google Home".

  • Ambiguous words: For example, the word "bat" can mean a piece of sports equipment or an animal. If you know that you are translating words about sports, you might want to use a glossary to feed the Cloud Translation API the sports translation of "bat", not the translation for the animal.

  • Borrowed words: For example, "bouillabaisse" in French translates to "bouillabaisse" in English. English borrowed the word "bouillabaisse" from French in the 19th century. An English speaker lacking French cultural context might not know that bouillabaisse is a fish stew dish. Glossaries can override a translation so that "bouillabaisse" in French translates to "fish stew" in English.

There are two types of glossaries: unidirectional glossaries and equivalent term sets.

We are going to use equivalent term sets here.

To use a glossary, we need to:

  • Create a glossary file

  • Create a glossary resource with Cloud Translation API

Let's first create a .csv file for the glossary and upload it to the input bucket.

glossary1.csv

en,bn,fr
Translation API, ট্রান্সলেসন API, Translation API
computer science, সাইন্স,
linear algebra, লিনিয়ার আলজেব্রা,
cloud, ক্লাউড, cloud 
bucket, বাকেট, bucket
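
Hand-typed CSV files easily pick up stray spaces around the commas. I have not verified how the API treats that whitespace, so one cautious option is to generate the file with Python's csv module and strip every cell. The sketch below does that for the header and two rows taken from glossary1.csv; build_glossary_csv is a hypothetical helper name.

```python
import csv
import io
from typing import List

# Header plus two rows taken from glossary1.csv above.
rows = [
    ["en", "bn", "fr"],
    ["cloud", " ক্লাউড", " cloud "],
    ["bucket", " বাকেট", " bucket"],
]


def build_glossary_csv(rows: List[List[str]]) -> str:
    """Serialize glossary rows to CSV text, stripping whitespace around each term."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([cell.strip() for cell in row])
    return buffer.getvalue()


print(build_glossary_csv(rows))
```

Save the returned text as a UTF-8 file before uploading it to the bucket.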

Now we need to make this glossary available to the Cloud Translation API by creating a glossary resource.

from google.cloud import translate_v3 as translate

def create_glossary(
    project_id: str,
    input_uri: str,
    glossary_id: str,
    language_codes_arg: list,
    timeout: int = 180,
) -> translate.Glossary:
    """
    Create a equivalent term sets glossary. Glossary can be words or
    short phrases (usually fewer than five words).
    https://cloud.google.com/translate/docs/advanced/glossary#format-glossary
    """
    client = translate.TranslationServiceClient()

    location = "us-central1"  # The location of the glossary

    name = client.glossary_path(project_id, location, glossary_id)
    language_codes_set = translate.types.Glossary.LanguageCodesSet(
        language_codes=language_codes_arg
    )

    gcs_source = translate.types.GcsSource(input_uri=input_uri)

    input_config = translate.types.GlossaryInputConfig(gcs_source=gcs_source)

    glossary = translate.types.Glossary(
        name=name, language_codes_set=language_codes_set, input_config=input_config
    )

    parent = f"projects/{project_id}/locations/{location}"
    # glossary is a custom dictionary Translation API uses
    # to translate the domain-specific terminology.
    operation = client.create_glossary(parent=parent, glossary=glossary)

    result = operation.result(timeout)
    print(f"Created: {result.name}")
    print(f"Input Uri: {result.input_config.gcs_source.input_uri}")

    return result


project_id = "your-project-id"
input_uri = "gs://glossary-file-uri"
glossary_id = "glossary-id"  # give your glossary an ID; it must be unique within the project
language_codes = ["en", "bn", "fr"]

response = create_glossary(project_id, input_uri, glossary_id, language_codes)
print(response)

This will create a glossary that we can use for translation.

For more operations such as listing, adding, deleting, and updating glossaries and glossary entries, refer to this documentation.

Now let's translate text with the glossary.


from google.cloud import translate

def translate_text_with_glossary(
    text: str,
    project_id: str,
    glossary_id: str,
    source_language_code: str,
    target_language_code: str,
) -> translate.TranslateTextResponse:
    """Translates a given text using a glossary.

    Args:
        text: The text to translate.
        project_id: The ID of the GCP project that owns the glossary.
        glossary_id: The ID of the glossary to use.

    Returns:
        The translated text."""
    client = translate.TranslationServiceClient()
    location = "us-central1"
    parent = f"projects/{project_id}/locations/{location}"

    # The glossary must live in the same location used for the request.
    glossary = client.glossary_path(project_id, location, glossary_id)

    glossary_config = translate.TranslateTextGlossaryConfig(glossary=glossary)

    # Supported language codes: https://cloud.google.com/translate/docs/languages
    response = client.translate_text(
        request={
            "contents": [text],
            "target_language_code": target_language_code,
            "source_language_code": source_language_code,
            "parent": parent,
            "glossary_config": glossary_config,
        }
    )

    return response

project_id = "your-project-id"
glossary_id = "glossary-id" # the id used while creating the glossary
text_to_translate = "I am a computer science student. I am not good at cloud tech. I am learning linear algebra."
source_language_code = "en"
target_language_code = "bn"

response = translate_text_with_glossary(text_to_translate, project_id, glossary_id, source_language_code, target_language_code)
print(response)

output

translations {
  translated_text: "আমি একজন কম্পিউটার বিজ্ঞানের ছাত্র। আমি ক্লাউড প্রযুক্তিতে ভালো নই। আমি রৈখিক বীজগণিত শিখছি।"
}
glossary_translations {
  translated_text: "আমি একজন কম্পিউটার সায়েন্স ছাত্র। আমি ক্লাউড প্রযুক্তিতে ভালো নই। আমি লিনিয়ার আলজেব্রা শিখছি।"
  glossary_config {
    glossary: "my-glossary-resource"
  }
}

We can see it also returns the translation generated without using the glossary.
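
Because both versions come back, the calling code has to decide which one to use. A minimal sketch, assuming the response fields are named translations and glossary_translations as in the output above: prefer the glossary version when it is present and fall back to the plain one otherwise. The helper preferred_text is hypothetical, and SimpleNamespace objects stand in for a real response.

```python
from types import SimpleNamespace
from typing import List


def preferred_text(response) -> List[str]:
    """Return glossary translations when present, else the plain translations."""
    if len(response.glossary_translations) > 0:
        chosen = response.glossary_translations
    else:
        chosen = response.translations
    return [t.translated_text for t in chosen]


# Stand-in shaped like the output above (not a real API call).
fake_response = SimpleNamespace(
    translations=[SimpleNamespace(translated_text="plain translation")],
    glossary_translations=[SimpleNamespace(translated_text="glossary translation")],
)
print(preferred_text(fake_response))
```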

Batch translation with a glossary works the same way:

from google.cloud import translate

def batch_translate_text(
    input_uris: list,
    output_uri: str,
    project_id: str,
    glossary_id: str,
    source_language_code: str,
    target_language_codes: list,
    timeout: int = 180,
) -> translate.BatchTranslateResponse:
    """Translates a batch of texts on GCS and stores the result in a GCS location.

    Args:
        input_uris: The gs:// URIs of the input files to be translated.
        output_uri: The output URI of the translated texts.
        project_id: The ID of the project that owns the destination bucket.
        glossary_id: The ID of the glossary to use.
        source_language_code: The language code of the input text.
        target_language_codes: The language codes of the output texts (up to 10).
        timeout: The timeout for this batch translation operation.

    Returns:
        The batch response with character counts; the translated files are written to output_uri.
    """

    client = translate.TranslationServiceClient()

    location = "us-central1"

    input_configs_element = [
        {
            "gcs_source": {"input_uri": uri},
            "mime_type": "text/plain",
        }
        for uri in input_uris
    ]


    gcs_destination = {"output_uri_prefix": output_uri}
    output_config = {"gcs_destination": gcs_destination}
    parent = f"projects/{project_id}/locations/{location}"

    # glossary is a custom dictionary Translation API uses
    # to translate the domain-specific terminology.
    glossary_path = client.glossary_path(
        project_id, "us-central1", glossary_id  # The location of the glossary
    )

    glossary_config = translate.TranslateTextGlossaryConfig(glossary=glossary_path)

    # Apply the same glossary to every requested target language.
    glossaries = {code: glossary_config for code in target_language_codes}

    operation = client.batch_translate_text(
        request={
            "parent": parent,
            "source_language_code": source_language_code,
            "target_language_codes": target_language_codes,
            "input_configs": input_configs_element,
            "output_config": output_config,
            "glossaries": glossaries,
        }
    )

    print("Operation ID: {}".format(operation.operation.name)) # needed if we want to cancel the operation
    print("Waiting for operation to complete...")
    response = operation.result(timeout)

    print(f"Total Characters: {response.total_characters}")
    print(f"Translated Characters: {response.translated_characters}")

    return response

project_id = "your-project-id"
input_uris = ["gs://your-input-bucket/path-to-file"] #list of your input files
output_uri = "gs://your-output-bucket/output-folder/"
source_language_code = "en"
target_language_codes = ["bn", "fr"] #list of your target languages (up to 10)
glossary_id = "glossary-id"

response = batch_translate_text(input_uris, output_uri, project_id, glossary_id, source_language_code, target_language_codes)

Without glossary
আমি একজন কম্পিউটার বিজ্ঞানের ছাত্র।
আমি মেঘ প্রযুক্তিতে ভাল নই।
আমি গুগলের ক্লাউড অনুবাদ API শিখছি।
আমি যে ফাইলটি অনুবাদ করতে চাই তা সঞ্চয় করার জন্য আমাকে একটি স্টোরেজ বালতি তৈরি করতে হবে।
আমি লিনিয়ার বীজগণিতও শিখছি।

With glossary
আমি একজন কম্পিউটার সায়েন্স ছাত্র।
আমি ক্লাউড প্রযুক্তিতে ভাল নই।
আমি google এর Cloud ট্রান্সলেসন API শিখছি।
আমি যে ফাইলটি অনুবাদ করতে চাই তা সঞ্চয় করার জন্য আমাকে একটি স্টোরেজ বাকেট তৈরি করতে হবে।
আমি লিনিয়ার আলজেব্রা শিখছি।

Notice that in the 3rd line of the output with the glossary, "Cloud" is still in English, even though our glossary declares cloud, ক্লাউড, cloud. This is because glossary matching is case-sensitive. To ignore case, set ignore_case in the glossary config:

glossary_config = translate.TranslateTextGlossaryConfig(glossary=glossary_path, ignore_case=True)
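
To see why the capital "C" matters, here is a toy model of term matching in plain Python. It is only an illustration of case-sensitive versus case-insensitive lookup, not how the Cloud Translation API matches terms internally. The function apply_glossary and the sample sentence are made up.

```python
import re


def apply_glossary(text: str, glossary: dict, ignore_case: bool = False) -> str:
    """Replace whole-word glossary terms in text (toy model, not the API's algorithm)."""
    flags = re.IGNORECASE if ignore_case else 0
    for term, replacement in glossary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        # A function replacement avoids any escape handling in the replacement text.
        text = re.sub(pattern, lambda _match, r=replacement: r, text, flags=flags)
    return text


glossary = {"cloud": "ক্লাউড"}
sentence = "I am learning Cloud tech, not cloud shapes."
print(apply_glossary(sentence, glossary))                    # only lowercase "cloud" is replaced
print(apply_glossary(sentence, glossary, ignore_case=True))  # both occurrences are replaced
```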

Limitations

A glossary input file can't exceed 10.4 million (10,485,760) UTF-8 bytes for all terms in all languages combined. Any single glossary term must be less than 1024 UTF-8 bytes. Terms longer than 1024 bytes are ignored.
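
Oversized terms are dropped rather than rejected, so it can help to check a glossary file before uploading it. The sketch below is a rough local check under two assumptions: the whole file size is used as a conservative stand-in for the combined size of all terms, and the glossary is a plain comma-separated file. check_glossary_limits is a hypothetical helper.

```python
import csv
import io
from typing import List, Tuple

MAX_GLOSSARY_BYTES = 10_485_760  # whole input file, in UTF-8 bytes
MAX_TERM_BYTES = 1024            # a single term must be smaller than this


def check_glossary_limits(csv_text: str) -> Tuple[bool, List[str]]:
    """Return (file_size_ok, oversized_terms) for glossary CSV text."""
    file_size_ok = len(csv_text.encode("utf-8")) <= MAX_GLOSSARY_BYTES
    oversized = []
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            term = cell.strip()
            # Sizes are measured in UTF-8 bytes, so non-ASCII terms hit the limit sooner.
            if len(term.encode("utf-8")) >= MAX_TERM_BYTES:
                oversized.append(term)
    return file_size_ok, oversized


print(check_glossary_limits("en,bn\ncloud,ক্লাউড\n"))  # prints (True, [])
```

Note that each Bengali letter takes three bytes in UTF-8, so a Bengali term reaches the 1024-byte limit at roughly a third of the character count of an ASCII term.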

See also: Cloud Translation Pricing