I haven’t used Microsoft Translate so you’re ahead of me there - though I suspect that in theory the methods would be the same.
I like your idea of parsing for paragraph breaks, though I’m not sure I’d assume every document has a CR/LF line ending. 'nix uses just an LF character. Classic Mac OS (before OS X) used just a CR character, while modern macOS uses LF like any other 'nix. Windows uses both. Other documents might use a null byte as the EOL character.
Unicode presents its own problems, since a character isn’t a fixed number of bytes: UTF-8 uses one to four bytes per character, and UTF-16 uses two or four.
Possible solution: Look at the line ending in the first sentence or two, store that as a value, and then convert all the line-endings to just “\n” before parsing the document. After the document is complete, you could automatically reset to the correct line ending.
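Here’s a rough Python sketch of that idea, just to make it concrete. The function names (detect_newline, normalize_newlines, restore_newlines) and the 2,000-character sample size are my own inventions, and it only knows about CR, LF and CR/LF - anything more exotic would need extra handling.

```python
def detect_newline(text, sample_size=2000):
    """Return the first line ending found near the start of the text.

    Assumption: whatever ending shows up first is the one the whole
    document uses. Falls back to "\n" if none is found in the sample.
    """
    for i in range(min(len(text), sample_size)):
        ch = text[i]
        if ch == "\r":
            # CR followed by LF is the Windows pair; a lone CR is classic Mac.
            return "\r\n" if text[i + 1:i + 2] == "\n" else "\r"
        if ch == "\n":
            return "\n"
    return "\n"


def normalize_newlines(text):
    """Convert CR/LF pairs and lone CRs to a plain "\n"."""
    # Replace the two-character pair first so it doesn't become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def restore_newlines(text, newline):
    """Put the remembered line ending back after processing."""
    return text if newline == "\n" else text.replace("\n", newline)
```

One thing to watch: if you read the file with Python’s open() in text mode, pass newline="" so Python doesn’t quietly translate the line endings before you get a chance to look at them.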
One method would be to scan forward until you get to 10,000 words and then scan backwards for a paragraph break. Put a head-pointer at the head of the current block, scan forward, and when you find the last paragraph break before 10,000 words, place a tail-pointer. Snip out that block, translate it, move it to the result-document, move the head pointer to the tail pointer position and continue.
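The head-pointer/tail-pointer scheme could look something like this in Python. Treat it as a sketch under a few assumptions of mine: the text has already been normalized to "\n" line endings, a paragraph break is a blank line, a "word" is any run of non-whitespace, and if a single paragraph runs past the limit it just gets cut at the word limit. The name split_blocks is made up for the example.

```python
import re

PARA_BREAK = re.compile(r"\n\s*\n")  # assumption: a blank line ends a paragraph
WORD = re.compile(r"\S+")            # assumption: a word is a run of non-whitespace


def split_blocks(text, max_words=10_000):
    """Split text into blocks of at most max_words words, cut at paragraph breaks."""
    blocks = []
    head = 0
    while head < len(text):
        # Scan forward from the head pointer until the word limit is used up.
        limit = len(text)
        for count, match in enumerate(WORD.finditer(text, head), start=1):
            if count > max_words:
                limit = match.start()  # first word that no longer fits
                break

        if limit == len(text):
            tail = limit  # everything that is left fits in one block
        else:
            # Scan back: take the last paragraph break before the limit.
            breaks = [m.end() for m in PARA_BREAK.finditer(text, head, limit)]
            # No break at all means one oversized paragraph; cut at the limit.
            tail = breaks[-1] if breaks else limit

        blocks.append(text[head:tail])  # snip out the block
        head = tail                     # move the head pointer to the tail
    return blocks
```

Each block keeps its trailing paragraph break, so "".join(blocks) gives back the original text exactly. In practice you would pass each block to whatever translation call you are using and append the result to the output document, probably skipping any block that is only whitespace.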
As an aside, translation software can be abysmal because a lot of the time the source text is terribly context-sensitive, contains slang, or uses jargon. Likewise technical terms or words specific to a particular trade or skill - which in many cases shouldn’t be translated at all - get mangled horribly. Legal, medical, and engineering/technical documents are classic examples.
I sent a complex medical document (the operative report for someone’s brain surgery) through two different translators - Google and Yandex - trying to translate it to Russian. The result of both translations was more like bad lasagna than a readable document!