Split big posts for successful translation

ValdikSS · January 31, 2021, 8:19pm

I use the Translator plugin with the Microsoft service. If a post is long enough, it can’t be translated and fails with the following error:

This post is too long to be translated by the translator.

Could you please implement per-paragraph translation to work around this issue?

jharris1993 · January 31, 2021, 8:24pm

That’s a normal part of Google Translate and I believe the word limit is something like 500 words.

If you routinely get something larger than that, I see a few options:

Manually parse the content into blocks of text smaller than 500 words (or whatever the limit is).
Make use of another Google API that does document translation (I’m not sure, but I think they have one; you’d have to ask over there).
Make use of a different site that does document translation and hope they expose APIs.

Don’t forget to tell us what worked.

ValdikSS · January 31, 2021, 8:43pm

I use Microsoft API, not Google.
It seems that Microsoft has a limit of 10,000 characters per single request: Request limits - Translator - Azure Cognitive Services | Microsoft Docs

I suppose the easiest approach would be to split the post by paragraph ("\r\n\r\n" or <p>), assuming that no single paragraph is larger than 10,000 characters?
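
A minimal sketch of that idea, in Python rather than the plugin's Ruby, and not taken from the plugin itself. The names `split_paragraphs`, `oversized` and `MAX_CHARS` are made up for illustration, and the 10,000 figure is simply the limit quoted above. It splits on the blank-line separator and reports any paragraph that would still break the assumption:

```python
from typing import List

MAX_CHARS = 10_000  # assumption: the per-request limit quoted in this post


def split_paragraphs(raw: str, separator: str = "\r\n\r\n") -> List[str]:
    """Split raw post text on a blank-line separator, dropping empty pieces."""
    return [part for part in raw.split(separator) if part.strip()]


def oversized(paragraphs: List[str], limit: int = MAX_CHARS) -> List[int]:
    """Return the indexes of paragraphs that are still longer than the limit."""
    return [i for i, part in enumerate(paragraphs) if len(part) > limit]


post = "First paragraph.\r\n\r\nSecond paragraph.\r\n\r\n" + "x" * 10_001
parts = split_paragraphs(post)
print(len(parts), oversized(parts))
```

Any index returned by `oversized` marks a paragraph that would need a further split (by sentence, for example) before it could be sent on its own.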

jharris1993 · January 31, 2021, 8:54pm

I haven’t used Microsoft Translate so you’re ahead of me there - though I suspect that in theory the methods would be the same.

I like your idea of parsing for paragraph breaks, though I’m not sure I’d assume every document has a CR/LF line ending. 'nix uses just an LF character. Classic Mac OS used just a CR character (modern macOS uses LF). Windows uses both. Other documents might use something else again, such as a null byte, as the record separator.

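Unicode provides its own problems, since a character is not always the same number of bytes (one to four in UTF-8, two or four in UTF-16), so a character count and a byte count can differ.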

Possible solution: Look at the line ending in the first sentence or two, store that as a value, and then convert all the line-endings to just “\n” before parsing the document. After the document is complete, you could automatically reset to the correct line ending.
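
Roughly what I have in mind, as a Python sketch (the plugin itself is Ruby, and the function names here are invented for illustration). It looks at the first line break to guess the convention, normalizes everything to "\n", and can put the original ending back afterwards:

```python
def detect_newline(text: str) -> str:
    """Guess the line ending from the first line break found; default to LF."""
    for i, ch in enumerate(text):
        if ch == "\r":
            # CR followed by LF is a Windows ending; a lone CR is classic Mac.
            return "\r\n" if text[i + 1 : i + 2] == "\n" else "\r"
        if ch == "\n":
            return "\n"
    return "\n"


def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR to LF. CRLF goes first so it stays one break."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def restore_newlines(text: str, newline: str) -> str:
    """Convert LF back to the ending recorded by detect_newline."""
    return text if newline == "\n" else text.replace("\n", newline)


original = "First paragraph.\r\n\r\nSecond paragraph.\r\n"
newline = detect_newline(original)
normalized = normalize_newlines(original)
print(repr(newline), repr(normalized))
```

One caveat: a document with mixed line endings will come back with a single, uniform ending, which may or may not be what you want.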

One method would be to scan forward until you get to 10,000 characters and then scan backwards for a paragraph break. Put a head pointer at the start of the current block, scan forward, and when you find the last paragraph break before 10,000 characters, place a tail pointer there. Snip out that block, translate it, append the result to the result document, move the head pointer to the tail pointer position, and continue.
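
Here is a rough Python sketch of that head/tail scan, assuming line endings have already been normalized so that a paragraph break is "\n\n". The name `chunk_text` and the default limit are my own placeholders. The actual translation call is left out, and the hard cut for a paragraph with no break in range is a fallback I added, not something the service requires:

```python
from typing import List


def chunk_text(text: str, limit: int = 10_000, sep: str = "\n\n") -> List[str]:
    """Cut text into chunks of at most `limit` characters at paragraph breaks."""
    chunks = []
    head = 0
    while len(text) - head > limit:
        # Scan backwards from head + limit for the last paragraph break, so
        # that the chunk text[head:tail] is no longer than `limit`.
        tail = text.rfind(sep, head + 1, head + limit + len(sep))
        if tail == -1:
            # No paragraph break in range: fall back to a hard cut.
            tail = head + limit
            next_head = tail
        else:
            # Skip over the separator so the next chunk starts on text.
            next_head = tail + len(sep)
        chunks.append(text[head:tail])
        head = next_head
    if head < len(text):
        chunks.append(text[head:])
    return chunks


document = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
pieces = chunk_text(document, limit=100)
print([len(piece) for piece in pieces])
```

Each piece would then be translated separately and the results joined with the same separator. Note that `len` here counts Python string characters; whether that matches how the translation service counts is something to check against its documentation.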

==============================

As an aside, translation software can be abysmal because a lot of the time the source text is terribly context-sensitive, contains slang, or uses jargon. Likewise, technical terms or words specific to a particular trade or skill - which in many cases shouldn’t be translated at all - get mangled horribly. Legal, medical, and engineering/technical documents are classic examples.

I sent a complex medical document (the operative report for someone’s brain surgery) through two different translators - Google and Yandex - trying to translate it to Russian. The result of both translations was more like bad lasagna than a readable document!

ValdikSS · January 31, 2021, 9:10pm

@jharris1993, I assume you’re proposing that I implement the feature. Unfortunately, I don’t have experience with Ruby, so it would take me much longer than it would take an experienced person.
Sure, I can hack it up, but that won’t be merged.

On my forum, the most requested translation is of technical posts from Russian to English. Microsoft does a pretty good job here.

jharris1993 · January 31, 2021, 9:12pm

Cool beanies!

What forum is that, pray tell? If it can do English => Russian, I may send my next long/complex document through it!

ValdikSS · January 31, 2021, 9:36pm

The translation is performed with the discourse-translator plugin. I thought I had created this topic under the plugin category, but it was left uncategorised.
