Split big posts for successful translation

I use the Translator plugin with the Microsoft service. If a post is too big, the translation fails with the following error:

This post is too long to be translated by the translator.

Could you please implement per-paragraph translation to work around this issue?

That’s a normal limitation of Google Translate, and I believe the limit is something like 500 words.

If you routinely get something larger than that, I see a few options:

  1. Manually split the content into blocks of text smaller than 500 words (or whatever the limit is).
  2. Make use of another Google API that does document translation (I’m not sure, but I think they have one; you’d have to ask over there).
  3. Make use of a different site that does document translation and hope they expose APIs.

Don’t forget to tell us what worked.

I use the Microsoft API, not Google.
It seems that Microsoft has a limit of 10,000 characters per single request: Request limits - Translator - Azure Cognitive Services | Microsoft Docs

I suppose the easiest approach would be to split the post by paragraph ("\r\n\r\n" or <p>), assuming that no single paragraph is larger than 10,000 characters?
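To make the idea concrete, here is a rough sketch in Python (not the plugin's Ruby, and not how the plugin actually works). It assumes the raw post text is available as a string and that paragraphs are separated by blank lines; `split_paragraphs`, `oversized`, and `MAX_CHARS` are made-up names for illustration. Splitting on <p> would need an HTML parser and is not shown.

```python
import re

MAX_CHARS = 10_000  # assumed per-request limit, taken from the figure quoted above


def split_paragraphs(text: str) -> list[str]:
    """Split raw post text on blank lines (CRLF or LF), dropping empty pieces."""
    parts = re.split(r"(?:\r?\n){2,}", text)
    return [p for p in parts if p.strip()]


def oversized(paragraphs: list[str], limit: int = MAX_CHARS) -> list[int]:
    """Return the indexes of paragraphs that would still exceed the limit."""
    return [i for i, p in enumerate(paragraphs) if len(p) > limit]


post = "First paragraph.\r\n\r\nSecond paragraph.\n\nThird paragraph."
paragraphs = split_paragraphs(post)
print(paragraphs)
print(oversized(paragraphs))
```

The `oversized` check is there because the "no paragraph is larger than the limit" assumption may not hold for every post; any paragraph it reports would need a further split.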

I haven’t used Microsoft Translate so you’re ahead of me there - though I suspect that in theory the methods would be the same.

I like your idea of parsing for paragraph breaks, though I’m not sure I’d assume every document has a CR/LF line ending. 'nix uses just an LF character. Classic Mac OS used just a CR character (modern macOS uses LF). Windows uses both. Other documents might use a null byte as the EOL character.

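Unicode presents its own problems, since a character can take more than one byte depending on the encoding (one to four bytes in UTF-8, two or four in UTF-16), so a byte count and a character count are not the same thing.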
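A quick way to see how far character counts and byte counts can drift apart is to measure the same string both ways. This is only an illustration in Python; `sizes` is a made-up helper, and whether a given service counts characters, code points, or bytes is something to check in its own documentation.

```python
def sizes(text: str) -> dict[str, int]:
    """Report the length of a string in code points and in encoded bytes."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),  # "-le" avoids the 2-byte BOM
    }


for sample in ("hello", "Привет", "😀"):
    print(sample, sizes(sample))
```

For a Russian post, the UTF-8 byte count comes out at roughly double the character count, so a limit expressed in characters should be checked against the string length rather than the encoded size.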

Possible solution: Look at the line ending in the first sentence or two, store that as a value, and then convert all the line endings to just “\n” before parsing the document. After the translated document is complete, you could automatically restore the original line ending.
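Something like the following is what I have in mind, sketched in Python rather than Ruby. The function names are invented for the example, and it assumes the document uses one line-ending style consistently, which is not guaranteed for text pasted together from several sources.

```python
def detect_newline(text: str) -> str:
    """Return the first line ending found in the text, defaulting to LF."""
    for i, ch in enumerate(text):
        if ch == "\r":
            # CR followed by LF is the Windows style; a lone CR is classic Mac OS.
            return "\r\n" if text[i + 1 : i + 2] == "\n" else "\r"
        if ch == "\n":
            return "\n"
    return "\n"


def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def restore_newlines(text: str, newline: str) -> str:
    """Convert LF line endings back to the style detected earlier."""
    return text if newline == "\n" else text.replace("\n", newline)


original = "one\r\n\r\ntwo\r\n"
style = detect_newline(original)
working = normalize_newlines(original)
print(repr(style), repr(working))
print(restore_newlines(working, style) == original)
```

The order of the two `replace` calls in `normalize_newlines` matters: handling "\r\n" first keeps a Windows line ending from turning into two LF characters.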

One method would be to scan forward until you get to 10,000 characters (or whatever the limit is) and then scan backwards for a paragraph break. Put a head-pointer at the head of the current block, scan forward, and when you find the last paragraph break before the limit, place a tail-pointer. Snip out that block, translate it, move it to the result-document, move the head pointer to the tail pointer position and continue.
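The head-pointer/tail-pointer scan could look roughly like this in Python. It is a sketch under a few assumptions: line endings have already been converted to "\n", a paragraph break is a blank line ("\n\n"), and the limit is counted in characters. `chunk_by_paragraph` is a made-up name, and the actual call to the translation service is left out.

```python
def chunk_by_paragraph(text: str, limit: int = 10_000, sep: str = "\n\n") -> list[str]:
    """Cut text into blocks of at most `limit` characters, preferring paragraph breaks."""
    chunks = []
    head = 0
    while len(text) - head > limit:
        # Scan backwards from head + limit for the last paragraph break in range.
        tail = text.rfind(sep, head, head + limit)
        if tail <= head:
            # No usable break: fall back to a hard cut, which may land mid-word.
            tail = head + limit
            chunks.append(text[head:tail])
            head = tail
        else:
            chunks.append(text[head:tail])
            head = tail + len(sep)  # skip the break itself
    if head < len(text):
        chunks.append(text[head:])
    return chunks


sample = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
blocks = chunk_by_paragraph(sample, limit=100)
print([len(b) for b in blocks])
```

Each block would then go to the translator as its own request, and the results would be rejoined with the same separator. A paragraph longer than the limit gets a hard cut here, which is crude; splitting at a sentence boundary would probably translate better.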

==============================

As an aside, translation software can be abysmal because a lot of times the translation is terribly context sensitive, contains slang, or uses jargon. Likewise technical terms or words specific to a particular trade or skill - that shouldn’t be translated in many cases - get mangled horribly. Legal, medical, and engineering/technical documents are classics.

I sent a complex medical document, (the operative report for someone’s brain surgery), through two different translators - Google and Yandex - trying to translate it to Russian. The result of both translations was more like bad lasagna than a readable document!

@jharris1993, I assume you’re suggesting that I implement the feature myself. Unfortunately, I don’t have experience with Ruby, so it would take me much longer than it would take an experienced person.
Sure, I could hack something together, but that wouldn’t get merged.

On my forum, the most requested translation is from Russian to English, for technical posts. Microsoft does a pretty good job here.

Cool beanies!

What forum is that, pray tell? If it can do English => Russian, I may send my next long/complex document through it!

The translation is performed with the discourse-translator plugin. I thought I had created this topic under the plugin category, but it was left uncategorised.
