Careful there, you can totally hammer search if you allow this unguarded: onebox a PDF containing a whole book and now every search hits your post. Finding the threshold for breaks would be a nightmare.
Maybe a checkbox beside the search field for including/excluding PDF content would be helpful.
Threshold for breaks? - Can you describe the issue in more detail?
I mean that if you attach a PDF of the full Lord of the Rings, including every word of it in our search index would be terrible from multiple perspectives.
So you would need an extra site setting for “max length of text in PDF to index” or something along those lines.
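A rough sketch of what such a cap could look like, in Python purely for illustration. The function name, the `max_chars` parameter, and the 5000 default are all made up here; none of them is an existing site setting or Discourse API. It assumes the PDF text has already been extracted to a plain string.

```python
def truncate_for_index(text: str, max_chars: int = 5000) -> str:
    """Return at most max_chars characters of text, cut back to a whole word.

    max_chars stands in for a hypothetical "max length of text in PDF to
    index" setting; the default value is arbitrary.
    """
    if max_chars <= 0:
        return ""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the cut lands in the middle of a word, back up to the previous space.
    if not text[max_chars].isspace():
        head, sep, _ = cut.rpartition(" ")
        if sep:
            cut = head
    return cut.rstrip()
```

Only the truncated string would be handed to the indexer, so a whole book attached to a post would contribute no more than the configured number of characters.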
Honestly, a much better approach here would be to have the onebox expand the first few paragraphs of the PDF (in this new rich PDF onebox plugin).
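As a minimal sketch of the "first few paragraphs" idea, again in Python for illustration only: `first_paragraphs` and its `count` parameter are hypothetical names, the text is assumed to be already extracted from the PDF, and a paragraph is assumed to be anything separated by a blank line, which real PDF text will not always respect.

```python
import re


def first_paragraphs(text: str, count: int = 3) -> str:
    """Return the first `count` non-empty paragraphs of text, joined by blank lines."""
    # Treat runs of text separated by one or more blank lines as paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n\n".join(paragraphs[:count])
```

The excerpt shown in the onebox stays small regardless of how long the document is.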
Does the OneBox-ing of PDF documents only work for external files? Not for locally uploaded files?
I have PDF files uploaded in topic posts, and they are not showing up as OneBox.
I also support getting the metadata if the document is local. That should avoid the need to download it from the external source, so there is no excuse not to process it.
How about something like Docsplit, which works on many different file formats?
I am closing this for now. We now support PDF oneboxes, so if there is a need to improve them, please open another feature request.