Searchable File Attachments

We need to search not only the forums but the attachments as well. We are currently enjoying a two-week preview of the product and like it immensely, but we just found out that search did not return any results from the sample PDFs we uploaded. :confused:

It would be incredibly nice if Discourse could index and search PDF files, as well as other standard office-type or text formats.

Could you please add this to the upcoming feature list? Much appreciated!!

6 Likes

Whoa there, do you mean attachment content? As in what the files contain? I am not sure I view that as within the scope of Discourse.

Search should match filenames, if your filenames are unique enough, because the filenames are part of the post body. But the contents of the files are not considered part of the Discourse posts…

1 Like

It is definitely something I would be supportive of in a plugin if someone feels like building it.

It would be nice to add this level of extensibility to search.

7 Likes

Correct… being able to search content within the attachments. To me, this is one of those features that is almost becoming standard nowadays, and our users are beginning to expect this kind of functionality. We run multiple organizations' websites, and have been providing this kind of functionality for many years using the Microsoft Indexing Service. We have other sites that have switched to the Sitefinity CMS product, and this functionality was a must-have there as well. Gmail also lets you do it across the attachments you have saved in your account. It's a tremendously valuable feature for those who provide and upload a lot of content within file attachments.

Anyways, please let me know if you reconsider, or if you do hear of a plugin that would be capable of doing something like this, I’m definitely interested!

It’d probably be a feature we only offer to enterprise hosted instances.

1 Like

This gem seems like a good candidate, with good compatibility across document types:

https://github.com/Erol/yomu/blob/master/README.md

However, running Java, adding a potentially very big column with search data, and creating the necessary plugin hooks in the search infrastructure is fairly involved.

Yeah… we're not that big. Even the Standard hosting model is overkill for a group of our size. :frowning:

A bit different spec, but it might be easier and less resource intensive to have a page that queries uploads and lists the files, if that would be sufficient and be a fair compromise.

Or perhaps, add another field in the upload file dialog box that asks for a description, and allows you to dump some content into there that would be searchable?

With that kind of code, I'd be very concerned about bugs in the file format parsing. Office files are prone to triggering bugs in the code that parses them, occasionally including remote code execution (RCE).

2 Likes

Duplicate of

?

1 Like

I think the idea would be to use something like Apache Tika (https://tika.apache.org/) and then make the extracted metadata searchable in Discourse.
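To make the "extract, then make searchable" idea concrete, here is a minimal Python sketch. It is not Discourse's search code and not a Tika integration: it assumes the text has already been extracted from each attachment by some external tool, and the `AttachmentIndex` class, `tokenize` helper, and post ids are all hypothetical names made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class AttachmentIndex:
    """Toy inverted index: token -> set of post ids whose attachments contain it.

    Hypothetical illustration only. The extracted text is assumed to come
    from an external extractor; no extraction happens here.
    """

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, post_id, extracted_text):
        """Record every token of an attachment's extracted text under its post."""
        for token in tokenize(extracted_text):
            self._postings[token].add(post_id)

    def search(self, query):
        """Return the post ids whose attachment text contains all query tokens."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        # Start from the first token's postings and intersect with the rest.
        results = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            results &= self._postings.get(token, set())
        return results


index = AttachmentIndex()
index.add(101, "Quarterly budget report for the board")
index.add(102, "Board meeting minutes")
matches = index.search("budget board")
```

With the two sample attachments above, a query has to match every word, so "budget board" only finds the first post. A real plugin would presumably feed the extracted text into the existing search index rather than keep a separate structure like this one.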

Going to close this as a dupe of: Index File Contents for Search

Very supportive of someone experimenting in a plugin, but there are no concrete plans from our side to integrate with a Tika server, etc.

1 Like