Google Cloud Translation API with Python
Let's walk through the process of integrating Google's Cloud Translation API with Python applications.
Google provides both out-of-the-box Neural Machine Translation (NMT) and custom translation through training an AutoML model. In this blog, we're only focusing on the default NMT model, which excels in general use cases but might encounter challenges with technical domain-specific terms. However, we can enhance its performance by incorporating glossaries—custom dictionaries utilized by the Cloud Translation API to ensure consistent translation of the customer's domain-specific terminology.
Setup
First, go to the GCP Console (console.cloud.google.com) and create a project.
Then search for Cloud Translation API in the search bar and enable it for the project.
Authentication
Before we can use this API, we need to set up authentication.
We are going to use Cloud Translation - Advanced (v3), which does not support API keys for authentication. Instead, we will authenticate with our user credentials through ADC (Application Default Credentials).
To provide your user credentials to ADC, we need to use the Google Cloud CLI:
Install and initialize the gcloud CLI
The following commands are for Debian 9+ or Ubuntu 18.04+. Refer to this guide for other distros.
First, update the package index and make sure apt-transport-https and curl are installed.
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates gnupg curl sudo
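Import the Google Cloud public key.
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
Add the gcloud CLI distribution URI as a package source.
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list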
Update and install the gcloud CLI:
sudo apt-get update && sudo apt-get install google-cloud-cli
Run gcloud init to get started:
gcloud init
You will be asked to log in and select a project.
Create local authentication credentials for your Google Account:
gcloud auth application-default login
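As a quick sanity check, the sketch below looks for the credentials file that the login command writes. It assumes the documented default locations (~/.config/gcloud/application_default_credentials.json on Linux and macOS, %APPDATA%\gcloud\ on Windows) and the GOOGLE_APPLICATION_CREDENTIALS override. The function name find_adc_file is made up for this example, and a custom gcloud config directory is not handled.

```python
import os
from pathlib import Path
from typing import Optional


def find_adc_file(env: Optional[dict] = None) -> Optional[Path]:
    """Return the path of the ADC credentials file, or None if it is not found.

    Hypothetical helper: it only checks the env-var override and the
    default location, nothing else.
    """
    env = os.environ if env is None else env
    # An explicit override takes precedence over the default location.
    override = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if override:
        path = Path(override)
        return path if path.is_file() else None
    if os.name == "nt":
        base = Path(env.get("APPDATA", "")) / "gcloud"
    else:
        base = Path(env.get("HOME", str(Path.home()))) / ".config" / "gcloud"
    candidate = base / "application_default_credentials.json"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_adc_file()
    if found:
        print(f"ADC file: {found}")
    else:
        print("No ADC file found; run the login command above.")
```

If the function returns None, rerun gcloud auth application-default login before calling the API.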
Install the Python library
Now we need to install the Python client library. It is recommended to use venv to isolate dependencies.
pip install --upgrade google-cloud-translate
Now we are all set.
Translate Text
To translate text, set the project ID, the text to translate, the source language, and the target language.
The supported language codes can be found here.
from google.cloud import translate


def translate_text(
    text: str,
    project_id: str,
    source_language: str,
    target_language: str,
) -> translate.TranslateTextResponse:
    """Translate a single string with the default NMT model."""
    client = translate.TranslationServiceClient()
    location = "us-central1"
    parent = f"projects/{project_id}/locations/{location}"
    # Supported formats: https://cloud.google.com/translate/docs/supported-formats
    response = client.translate_text(
        request={
            "parent": parent,
            "contents": [text],
            "mime_type": "text/plain",  # mime types: text/plain, text/html
            "source_language_code": source_language,
            "target_language_code": target_language,
        }
    )
    return response


project_id = "your-project-id"
text_to_translate = "Hey, How are you?"
source_language = "en-US"
target_language = "fr"
response = translate_text(text_to_translate, project_id, source_language, target_language)
print(response)
Output:
translations {
translated_text: "Hey comment allez-vous?"
}
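Printing the whole response is handy for debugging, but an application usually wants only the translated strings. The sketch below reads the translated_text field of each entry in translations, as seen in the output above. The helper name extract_translations is my own, and the SimpleNamespace object is only a stand-in with the same shape as the response, so the example runs without calling the API.

```python
from types import SimpleNamespace
from typing import List


def extract_translations(response) -> List[str]:
    """Collect translated_text from every entry in response.translations."""
    return [t.translated_text for t in response.translations]


# Stand-in with the same shape as the printed output above (not a real API response).
fake_response = SimpleNamespace(
    translations=[SimpleNamespace(translated_text="Hey comment allez-vous?")]
)
print(extract_translations(fake_response))
```

With a real response, the same function can be called on the value returned by translate_text.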
Batch Translate
Batch translation lets you translate large amounts of text offline in a single request: up to 100 input files per batch, into up to 10 target languages. The total content size must be at most 100M Unicode codepoints, and the files must use UTF-8 encoding.
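Since a batch job runs as a long operation, it can save time to check these limits locally before submitting. The following is a minimal sketch under that assumption: the function name check_batch_request and the constants are hypothetical, the numbers are the limits quoted above, and the content-size limit is not checked because that would require reading the files.

```python
from typing import Sequence

MAX_INPUT_FILES = 100      # input files per batch, as quoted above
MAX_TARGET_LANGUAGES = 10  # target languages per batch, as quoted above


def check_batch_request(
    input_uris: Sequence[str], target_language_codes: Sequence[str]
) -> None:
    """Raise ValueError if the request would exceed the batch limits."""
    if not input_uris or len(input_uris) > MAX_INPUT_FILES:
        raise ValueError(f"Expected 1 to {MAX_INPUT_FILES} input files, got {len(input_uris)}")
    if not target_language_codes or len(target_language_codes) > MAX_TARGET_LANGUAGES:
        raise ValueError(
            f"Expected 1 to {MAX_TARGET_LANGUAGES} target languages, "
            f"got {len(target_language_codes)}"
        )
    for uri in input_uris:
        # Batch input files live in Cloud Storage, so each URI should be a gs:// path.
        if not uri.startswith("gs://"):
            raise ValueError(f"Not a Cloud Storage URI: {uri}")


check_batch_request(["gs://your-input-bucket/path-to-file"], ["bn", "fr"])
print("Batch request looks fine.")
```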
For batch translation we first need storage for the input and output files. Let's create two Cloud Storage buckets, one for input and one for output. Search for "buckets" in the search box, or go to Cloud Storage > Buckets in the navigation menu, and click +Create.
Give your bucket a name. Names must be globally unique.
Select single region to minimize cost.
Choose the standard storage class.
Leave other options as they are and create the bucket.
Similarly create another bucket for storing output files.
Now create a .txt file with the text that you want to translate and upload it to the input bucket.
In the output bucket, create a folder to store the translated files. For batch translation the target folder must be empty.
Now adjust the parameters in the following script as needed.
from google.cloud import translate


def batch_translate_text(
    input_uris: list,
    output_uri: str,
    project_id: str,
    source_language_code: str,
    target_language_codes: list,
    timeout: int = 180,
) -> translate.BatchTranslateResponse:
    """Translates a batch of texts on GCS and stores the result in a GCS location.
    Args:
        input_uris: The input URIs of the texts to be translated.
        output_uri: The output URI prefix for the translated texts.
        project_id: The ID of the project that owns the destination bucket.
        source_language_code: The language code of the input text.
        target_language_codes: The language codes of the output texts (up to 10).
        timeout: The timeout in seconds for this batch translation operation.
    Returns:
        The batch translation summary (character counts, timings).
    """
    client = translate.TranslationServiceClient()
    # Batch translation needs a regional location; "global" is not supported.
    location = "us-central1"
    input_configs_element = [
        {
            "gcs_source": {"input_uri": uri},
            "mime_type": "text/plain",
        }
        for uri in input_uris
    ]
    gcs_destination = {"output_uri_prefix": output_uri}
    output_config = {"gcs_destination": gcs_destination}
    parent = f"projects/{project_id}/locations/{location}"
    operation = client.batch_translate_text(
        request={
            "parent": parent,
            "source_language_code": source_language_code,
            "target_language_codes": target_language_codes,  # Up to 10 language codes here.
            "input_configs": input_configs_element,
            "output_config": output_config,
        }
    )
    # The operation name is needed if we want to cancel the operation.
    print(f"Operation ID: {operation.operation.name}")
    print("Waiting for operation to complete...")
    response = operation.result(timeout=timeout)
    print(f"Total Characters: {response.total_characters}")
    print(f"Translated Characters: {response.translated_characters}")
    return response


project_id = "your-project-id"
input_uris = ["gs://your-input-bucket/path-to-file"]  # list of your input files
output_uri = "gs://your-output-bucket/output-folder/"
source_language_code = "en"
target_language_codes = ["bn", "fr"]  # list of your target languages (up to 10)
batch_translate_text(input_uris, output_uri, project_id, source_language_code, target_language_codes)
You can copy a file's gs:// URI from its details page in the Cloud Storage console.
Here I used one input file and two target languages, bn and fr. After the batch process finishes, the translated output files are stored in my output bucket.
input
I am a computer science student.
I am not good at cloud tech.
I am learning google's Cloud Translation API.
I needed to create a storage bucket to store the file I want to translate.
I am also learning linear algebra.
output-bn
আমি একজন কম্পিউটার বিজ্ঞানের ছাত্র।
আমি মেঘ প্রযুক্তিতে ভাল নই।
আমি গুগলের ক্লাউড অনুবাদ API শিখছি।
আমি যে ফাইলটি অনুবাদ করতে চাই তা সঞ্চয় করার জন্য আমাকে একটি স্টোরেজ বালতি তৈরি করতে হবে।
আমি লিনিয়ার বীজগণিতও শিখছি।
output-fr
Je suis étudiant en informatique.
Je ne suis pas doué en technologie cloud.
J'apprends l'API Cloud Translation de Google.
J'avais besoin de créer un bucket de stockage pour stocker le fichier que je souhaite traduire.
J'apprends également l'algèbre linéaire.
Glossary
There is a problem in the outputs above: some words are translated to Bengali by their literal meaning (I do not speak French, so I will skip it for now).
Computer science is also called 'computer science' in Bengali, so 'science' does not need to be translated. The same goes for linear algebra.
'Cloud' is translated as a literal cloud.
'Cloud Translation API' should not be translated word by word.
'Bucket' should not be translated to বালতি either.
This is where a glossary comes in.
A glossary is a custom dictionary the Cloud Translation API uses to consistently translate the customer's domain-specific terminology. This typically involves specifying how to translate a named entity.
You might use a glossary for the following use cases:
Product names: For example, "Google Home" must translate to "Google Home".
Ambiguous words: For example, the word "bat" can mean a piece of sports equipment or an animal. If you know that you are translating words about sports, you might want to use a glossary to feed the Cloud Translation API the sports translation of "bat", not the translation for the animal.
Borrowed words: For example, "bouillabaisse" in French translates to "bouillabaisse" in English. English borrowed the word "bouillabaisse" from French in the 19th century. An English speaker lacking French cultural context might not know that bouillabaisse is a fish stew dish. Glossaries can override a translation so that "bouillabaisse" in French translates to "fish stew" in English.
There are two types of glossaries: unidirectional glossaries and equivalent term sets.
We are going to use equivalent term sets here.
To use a glossary we need to:
Create a glossary file
Create a glossary resource with the Cloud Translation API
Let's first create a .csv file for the glossary and upload it to the input bucket.
glossary1.csv
en,bn,fr
Translation API, ট্রান্সলেসন API, Translation API
computer science, সাইন্স,
linear algebra, লিনিয়ার আলজেব্রা,
cloud, ক্লাউড, cloud
bucket, বাকেট, bucket
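If the glossary grows, writing the file by hand gets error-prone, so here is a small sketch that generates it with Python's csv module. The helper name glossary_csv is hypothetical, and only the three fully filled rows from the file above are used. It strips the spaces around each term as a precaution; I have not verified how the API treats a space after a comma.

```python
import csv
import io
from typing import Iterable, Sequence


def glossary_csv(language_codes: Sequence[str], rows: Iterable[Sequence[str]]) -> str:
    """Build the CSV text: a header row of language codes, then one row per term set."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(language_codes)
    for row in rows:
        if len(row) != len(language_codes):
            raise ValueError(f"Expected {len(language_codes)} columns, got {len(row)}: {row}")
        # Drop stray whitespace so " cloud" and "cloud" end up as the same term.
        writer.writerow([term.strip() for term in row])
    return buffer.getvalue()


content = glossary_csv(
    ["en", "bn", "fr"],
    [
        ["Translation API", " ট্রান্সলেসন API", " Translation API"],
        ["cloud", " ক্লাউড", " cloud"],
        ["bucket", " বাকেট", " bucket"],
    ],
)
print(content)
```

The resulting string can be saved as glossary1.csv with UTF-8 encoding and uploaded to the input bucket as before.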
Now we need to make this glossary available to the Cloud Translation API by creating a glossary resource.
from google.cloud import translate_v3 as translate


def create_glossary(
    project_id: str,
    input_uri: str,
    glossary_id: str,
    language_codes_arg: list,
    timeout: int = 180,
) -> translate.Glossary:
    """
    Create an equivalent term sets glossary. Glossary entries can be words or
    short phrases (usually fewer than five words).
    https://cloud.google.com/translate/docs/advanced/glossary#format-glossary
    """
    client = translate.TranslationServiceClient()
    location = "us-central1"  # The location of the glossary
    name = client.glossary_path(project_id, location, glossary_id)
    language_codes_set = translate.types.Glossary.LanguageCodesSet(
        language_codes=language_codes_arg
    )
    gcs_source = translate.types.GcsSource(input_uri=input_uri)
    input_config = translate.types.GlossaryInputConfig(gcs_source=gcs_source)
    glossary = translate.types.Glossary(
        name=name, language_codes_set=language_codes_set, input_config=input_config
    )
    parent = f"projects/{project_id}/locations/{location}"
    # A glossary is a custom dictionary the Translation API uses
    # to translate domain-specific terminology.
    operation = client.create_glossary(parent=parent, glossary=glossary)
    result = operation.result(timeout=timeout)
    print(f"Created: {result.name}")
    print(f"Input Uri: {result.input_config.gcs_source.input_uri}")
    return result


project_id = "your-project-id"
input_uri = "gs://glossary-file-uri"
glossary_id = "glossary-id"  # give your glossary an ID; it must be unique within the project
language_codes = ["en", "bn", "fr"]
response = create_glossary(project_id, input_uri, glossary_id, language_codes)
print(response)
This will create a glossary that we can use for translation.
For more operations like listing, adding, deleting, and updating glossaries and glossary-entries refer to this documentation.
Now let's translate text with the glossary:
from google.cloud import translate


def translate_text_with_glossary(
    text: str,
    project_id: str,
    glossary_id: str,
    source_language_code: str,
    target_language_code: str,
) -> translate.TranslateTextResponse:
    """Translates a given text using a glossary.
    Args:
        text: The text to translate.
        project_id: The ID of the GCP project that owns the glossary.
        glossary_id: The ID of the glossary to use.
        source_language_code: The language code of the input text.
        target_language_code: The language code of the output text.
    Returns:
        The translated text."""
    client = translate.TranslationServiceClient()
    location = "us-central1"  # Must match the location of the glossary
    parent = f"projects/{project_id}/locations/{location}"
    glossary = client.glossary_path(project_id, location, glossary_id)
    glossary_config = translate.TranslateTextGlossaryConfig(glossary=glossary)
    # Supported language codes: https://cloud.google.com/translate/docs/languages
    response = client.translate_text(
        request={
            "contents": [text],
            "target_language_code": target_language_code,
            "source_language_code": source_language_code,
            "parent": parent,
            "glossary_config": glossary_config,
        }
    )
    return response


project_id = "your-project-id"
glossary_id = "glossary-id"  # the ID used while creating the glossary
text_to_translate = "I am a computer science student. I am not good at cloud tech. I am learning linear algebra."
source_language_code = "en"
target_language_code = "bn"
response = translate_text_with_glossary(
    text_to_translate, project_id, glossary_id, source_language_code, target_language_code
)
print(response)
output
translations {
translated_text: "আমি একজন কম্পিউটার বিজ্ঞানের ছাত্র। আমি ক্লাউড প্রযুক্তিতে ভালো নই। আমি রৈখিক বীজগণিত শিখছি।"
}
glossary_translations {
translated_text: "আমি একজন কম্পিউটার সায়েন্স ছাত্র। আমি ক্লাউড প্রযুক্তিতে ভালো নই। আমি লিনিয়ার আলজেব্রা শিখছি।"
glossary_config {
glossary: "my-glossary-resource"
}
}
We can see it also returns the translation generated without using the glossary.
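Because the response carries both lists, calling code has to pick one. Below is a small sketch that prefers glossary_translations and falls back to translations when no glossary result is present. The field names come from the output above, the helper name preferred_translations is made up, and the SimpleNamespace objects are stand-ins for the real response so the example runs offline.

```python
from types import SimpleNamespace
from typing import List


def preferred_translations(response) -> List[str]:
    """Return glossary translations if present, otherwise the default translations."""
    glossary_results = list(getattr(response, "glossary_translations", None) or [])
    chosen = glossary_results if glossary_results else list(response.translations)
    return [t.translated_text for t in chosen]


# Stand-in shaped like the output above (shortened, not a real API response).
fake_response = SimpleNamespace(
    translations=[SimpleNamespace(translated_text="আমি রৈখিক বীজগণিত শিখছি।")],
    glossary_translations=[SimpleNamespace(translated_text="আমি লিনিয়ার আলজেব্রা শিখছি।")],
)
print(preferred_translations(fake_response))
```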
For batch translation using the glossary:
from google.cloud import translate


def batch_translate_text(
    input_uris: list,
    output_uri: str,
    project_id: str,
    glossary_id: str,
    source_language_code: str,
    target_language_codes: list,
    timeout: int = 180,
) -> translate.BatchTranslateResponse:
    """Translates a batch of texts on GCS and stores the result in a GCS location.
    Args:
        input_uris: The input URIs of the texts to be translated.
        output_uri: The output URI prefix for the translated texts.
        project_id: The ID of the project that owns the destination bucket.
        glossary_id: The ID of the glossary to use.
        source_language_code: The language code of the input text.
        target_language_codes: The language codes of the output texts (up to 10).
        timeout: The timeout in seconds for this batch translation operation.
    Returns:
        The batch translation summary (character counts, timings).
    """
    client = translate.TranslationServiceClient()
    location = "us-central1"  # Must match the location of the glossary
    input_configs_element = [
        {
            "gcs_source": {"input_uri": uri},
            "mime_type": "text/plain",
        }
        for uri in input_uris
    ]
    gcs_destination = {"output_uri_prefix": output_uri}
    output_config = {"gcs_destination": gcs_destination}
    parent = f"projects/{project_id}/locations/{location}"
    # A glossary is a custom dictionary the Translation API uses
    # to translate domain-specific terminology.
    glossary_path = client.glossary_path(project_id, location, glossary_id)
    glossary_config = translate.TranslateTextGlossaryConfig(glossary=glossary_path)
    # Apply the same glossary to every requested target language.
    glossaries = {code: glossary_config for code in target_language_codes}
    operation = client.batch_translate_text(
        request={
            "parent": parent,
            "source_language_code": source_language_code,
            "target_language_codes": target_language_codes,
            "input_configs": input_configs_element,
            "output_config": output_config,
            "glossaries": glossaries,
        }
    )
    # The operation name is needed if we want to cancel the operation.
    print(f"Operation ID: {operation.operation.name}")
    print("Waiting for operation to complete...")
    response = operation.result(timeout=timeout)
    print(f"Total Characters: {response.total_characters}")
    print(f"Translated Characters: {response.translated_characters}")
    return response


project_id = "your-project-id"
input_uris = ["gs://your-input-bucket/path-to-file"]  # list of your input files
output_uri = "gs://your-output-bucket/output-folder/"
source_language_code = "en"
target_language_codes = ["bn", "fr"]  # list of your target languages (up to 10)
glossary_id = "glossary-id"
response = batch_translate_text(
    input_uris, output_uri, project_id, glossary_id, source_language_code, target_language_codes
)
Without glossary
আমি একজন কম্পিউটার বিজ্ঞানের ছাত্র।
আমি মেঘ প্রযুক্তিতে ভাল নই।
আমি গুগলের ক্লাউড অনুবাদ API শিখছি।
আমি যে ফাইলটি অনুবাদ করতে চাই তা সঞ্চয় করার জন্য আমাকে একটি স্টোরেজ বালতি তৈরি করতে হবে।
আমি লিনিয়ার বীজগণিতও শিখছি।
With glossary
আমি একজন কম্পিউটার সায়েন্স ছাত্র।
আমি ক্লাউড প্রযুক্তিতে ভাল নই।
আমি google এর Cloud ট্রান্সলেসন API শিখছি।
আমি যে ফাইলটি অনুবাদ করতে চাই তা সঞ্চয় করার জন্য আমাকে একটি স্টোরেজ বাকেট তৈরি করতে হবে।
আমি লিনিয়ার আলজেব্রা শিখছি।
Notice that in the 3rd line of the output with the glossary, Cloud is still in English, even though our glossary declares cloud, ক্লাউড, cloud.
This is because glossaries are case-sensitive by default. To ignore case, set ignore_case in the glossary config:
glossary_config = translate.TranslateTextGlossaryConfig(glossary=glossary_path, ignore_case=True)
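For the batch request, the same setting has to reach every target language. The sketch below builds that mapping with plain dictionaries, which keeps it runnable without the client library. It rests on two assumptions: that the request accepts a dict in place of TranslateTextGlossaryConfig (the scripts in this post already pass dicts for other nested fields), and that the glossary resource name follows the projects/.../locations/.../glossaries/... pattern that glossary_path produces. The name build_glossaries is hypothetical.

```python
from typing import Dict, Sequence


def build_glossaries(
    project_id: str,
    location: str,
    glossary_id: str,
    target_language_codes: Sequence[str],
    ignore_case: bool = False,
) -> Dict[str, dict]:
    """Map each target language code to a glossary config dict."""
    # Assumed resource-name format, mirroring client.glossary_path(...).
    glossary_name = f"projects/{project_id}/locations/{location}/glossaries/{glossary_id}"
    return {
        code: {"glossary": glossary_name, "ignore_case": ignore_case}
        for code in target_language_codes
    }


glossaries = build_glossaries(
    "your-project-id", "us-central1", "glossary-id", ["bn", "fr"], ignore_case=True
)
print(glossaries)
```

The result would be passed as the "glossaries" entry of the batch request.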
Limitations
A glossary input file can't exceed 10,485,760 UTF-8 bytes (about 10.4 million) in total, counting all terms in all languages combined. Any single glossary term must be less than 1,024 UTF-8 bytes; longer terms are ignored.
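These limits are in bytes, not characters, which matters for Bengali because each Bengali code point takes several bytes in UTF-8. The sketch below measures terms the same way. The helper name check_glossary_terms and the constants are my own, the numbers are the limits stated above, and the total is a plain sum over all terms, which may not match exactly how the service counts.

```python
from typing import Iterable, List, Tuple

MAX_TERM_BYTES = 1024          # a single term must be smaller than this
MAX_TOTAL_BYTES = 10_485_760   # all terms in all languages combined


def check_glossary_terms(terms: Iterable[str]) -> Tuple[int, List[str]]:
    """Return (total UTF-8 bytes, terms that are too long and would be ignored)."""
    total = 0
    oversized = []
    for term in terms:
        size = len(term.encode("utf-8"))
        total += size
        if size >= MAX_TERM_BYTES:
            oversized.append(term)
    return total, oversized


total, oversized = check_glossary_terms(["bucket", "বাকেট", "cloud", "ক্লাউড"])
print(f"Total bytes: {total} (limit {MAX_TOTAL_BYTES}), oversized terms: {len(oversized)}")
```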
See also: Cloud Translation Pricing