Oneboxing of PDFs and other attachments

Continuing the discussion from Custom visualization for specific attachment types:

Putting this out there as a feature request. I’d love to see the ability to onebox PDFs and other attachments along the lines of Google Docs oneboxing. Or perhaps even simply a file attachment appearance like you’d get using the file upload, ideally with the file type and size shown as well.

Right now, putting a PDF URL on its own line presents as a raw URL. Very 1990s. As my millennial colleague told me recently, “who needs to know what http is in this day and age?”

10 Likes

Kids these days. I’m so sure.

1 Like

Sure, oneboxing of PDFs is a reasonable idea. @techapj, can you add it to your list? At minimum, try to get the title of the document and a text summary. I would not worry about a thumbnail as that will be considerably harder; just use a generic (but pretty) PDF icon like we do for Google Docs.

7 Likes

Okay, we now support PDF onebox using PDF metadata.

The oneboxing works best when the PDF file’s metadata is complete, i.e. it contains “Title”, “Subject”, and “Author”.

Demo:

  1. PDF with complete metadata:
  2. PDF with only “Title” and “Author” as metadata:
  3. PDF with no metadata:

12 Likes

Thanks for this - super exciting to see PDF oneboxing. :rocket:

However I’m having a bit of trouble with it - my PDFs do not appear to be oneboxed, even here on meta. Here’s an example:

2 Likes

I had to revert this change, specifically getting information from “PDF metadata”.

The PDF metadata was being fetched using the pdf-reader gem, which introduced lots of its own dependencies. I just removed the dependency on the pdf-reader gem for onebox.

Now the onebox simply shows the PDF filename and file size. This significantly reduces the time required to onebox: instead of fetching the whole file and loading it into memory, we now just make a HEAD request to read “Content-Length” for the file size, and take the filename from the URL.
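Roughly, the idea looks like the sketch below. This is an illustration, not the actual onebox code: the method names `pdf_filename`, `human_size`, and `pdf_onebox_data` are hypothetical, and the response headers are passed in as a plain hash with lowercase keys (an assumption) so the example runs without a network call. In practice that hash would be filled from a HEAD response, for example one obtained with Ruby’s `Net::HTTP#head`.

```ruby
require "uri"

# Hypothetical helper: the filename is just the last segment of the URL path.
def pdf_filename(url)
  File.basename(URI.parse(url).path)
end

# Hypothetical helper: turn a byte count into a short human-readable string.
def human_size(bytes)
  return "#{bytes} B" if bytes < 1024

  size = bytes.to_f
  unit = "B"
  %w[KB MB GB].each do |next_unit|
    break if size < 1024
    size /= 1024
    unit = next_unit
  end
  format("%.1f %s", size, unit)
end

# Hypothetical helper: build the onebox data from the URL plus the
# headers of a HEAD response. No part of the file body is needed.
def pdf_onebox_data(url, headers)
  length = headers["content-length"]
  {
    filename: pdf_filename(url),
    # The server may omit Content-Length, so the size can be nil.
    filesize: length ? human_size(Integer(length, 10)) : nil
  }
end

data = pdf_onebox_data(
  "https://example.com/files/report.pdf",
  { "content-length" => "1536" }
)
# data[:filename] is "report.pdf", data[:filesize] is "1.5 KB"
```

The point of the design is that nothing here depends on the file body, so no PDF parsing library is involved.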

Here is a demo of the new PDF onebox:

I looked into this locally. The failure was caused by the PDF title not being able to be forced into UTF-8 encoding. The new onebox fixes this issue:

https://namati.org/wp-content/uploads/2017/01/4.Evidence_Land-Rights_-Myanmar-2017-Final.pdf
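For anyone curious, this kind of failure can be reproduced in plain Ruby. The sketch below is illustrative only: it is not the code that was in onebox, the sample bytes are made up rather than taken from the PDF above, and `safe_title` is a hypothetical helper. It shows that `force_encoding` only relabels bytes, so a title stored in another encoding stays invalid as UTF-8 unless it is scrubbed or properly converted.

```ruby
# "Résumé" as ISO-8859-1 bytes; a lone \xE9 is not valid UTF-8.
raw = "R\xE9sum\xE9".b

# force_encoding changes the label on the string, not the bytes themselves.
forced = raw.dup.force_encoding("UTF-8")
forced.valid_encoding? # false

# Hypothetical helper: relabel as UTF-8, then replace any invalid bytes.
def safe_title(bytes)
  bytes.dup.force_encoding("UTF-8").scrub("?")
end

# If the real source encoding is known, converting is better than scrubbing.
converted = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
```

Here `safe_title(raw)` returns a valid UTF-8 string with the bad bytes replaced, while `converted` keeps the accented characters.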

11 Likes

Fabulous. Confirmed working - thanks! :rocket:

2 Likes

What about PDFs that are mail attachments or have simply been uploaded as attachments? They are already on the server. Wouldn’t this make it easier to analyse the metadata?

You still need a library to read this metadata and to load the file into memory, which is more expensive.

2 Likes

I really thought that the metadata is stored at the very beginning of the file, so that just a stub would need to be loaded.

That may be correct (I am not sure), but to read the PDF file in Ruby we would have to depend on the pdf-reader gem. Hence an additional library and more memory.
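For context, in a conventionally structured PDF the entry point is at the end of the file rather than the beginning: the last lines hold a `startxref` offset pointing at the cross-reference data, whose trailer references the `/Info` dictionary, and that dictionary can sit anywhere in the file. The sketch below only illustrates that first lookup step. `startxref_offset` is a hypothetical helper, the 1024-byte tail size is an arbitrary assumption, and this is nowhere near a real parser.

```ruby
# Hypothetical helper: find the cross-reference offset recorded near the
# end of a PDF. Returns nil if no startxref marker is found in the tail.
def startxref_offset(pdf_bytes, tail_size = 1024)
  bytes = pdf_bytes.b # treat the data as raw bytes, not text
  start = [bytes.bytesize - tail_size, 0].max
  tail = bytes.byteslice(start, tail_size)

  match = tail.match(/startxref\s+(\d+)\s+%%EOF/)
  match && match[1].to_i
end

sample = "%PDF-1.4\n(objects omitted)\ntrailer\n<< /Info 2 0 R >>\n" \
         "startxref\n1234\n%%EOF\n"
offset = startxref_offset(sample)
# offset is 1234
```

Even with the offset in hand, the cross-reference data and the referenced objects still have to be read and decoded, which is the part a library like pdf-reader handles.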

3 Likes

Is this onebox information cached, or do you have to reprocess it every time the URL is shown?

If it is cached, then I can’t see why spending the time to read the file and extract the metadata would be a waste of resources.

1 Like

It’s cached – this information is baked into the HTML-version of the post :slight_smile:

4 Likes

We also have to consider the resources of install time and disk space - adding a whole bundle of other gems isn’t really helpful on that front.

7 Likes

This is a pity. Having the metadata displayed would be extremely useful, especially in an academic context, where a lot of PDFs are shared. I understand that this may not be the right setting to have enabled by default because it potentially uses a lot of resources, but is there a chance of bringing this back as a site setting? Or perhaps at least for locally uploaded PDFs, i.e. where the PDF doesn’t need to be downloaded?

4 Likes

I would prefer this to, at least initially, be a plugin.

I don’t want to worry about another gem dependency, and I don’t want to worry about it potentially causing memory bloat on our job processor. Putting it in a plugin that a third party maintains would shield us from this and allow you to nut out all the intricacies and edge cases with the bad metadata that is floating around out there in random PDFs.

7 Likes

FWIW and after rereading my OP above, I think my need is met by the current functionality. Providing extra info about the PDF contents is a “nice to have” not a requirement.

3 Likes

I don’t like this at all.

In my use case, PDF attachments are embedded directly into the text. Oneboxes cost space and the real-world benefit is extremely low. As I suggested about a year ago, I would prefer an HTML5-based PDF viewer and more capabilities to search inside these PDF documents with the Discourse search. Maybe it would be nice to automatically insert a PDF icon right before the linked file name. That signals more than enough that a PDF file has been placed at this location.

I’d be happy with this too, and suggested it in the OP.

But really my need here is met already and I wouldn’t want to see the Discourse team devoting too much more time to making PDFs more presentable in discussions. Bike shedding and all that. But I suppose making this change could be pr-welcome.

1 Like

That sounds like an awesome idea for a plugin! I’d guess a couple or three days of work for a programmer familiar with Discourse (which means, not me!), given that some HTML5-based viewer already exists.

1 Like