We need to search not only the forums, but the attachments as well. We’re currently enjoying a two-week preview of the product and liking it immensely, but we just found out that search did not return any results from the sample PDFs we uploaded.
It would be incredibly nice if Discourse could index and search PDFs, as well as other standard office-type or text formats.
Could you please add this to the upcoming feature list? Much appreciated!!
Whoa there, do you mean attachment content? As in what the files contain? I am not sure I view that as within the scope of Discourse.
Search should match filenames, if your filenames are unique enough, because the filenames are part of the post body. But the contents of the files are not considered part of the Discourse posts…
Correct… being able to search content within the attachments. To me, this is one of those features that is almost becoming a standard nowadays… and our users are beginning to expect this kind of functionality. We run multiple organizations’ websites, and have been providing this kind of functionality for many years using the Microsoft Indexing Service. We have other sites that have switched to the Sitefinity CMS product, and this functionality was a must-have there as well. Gmail lets you do it as well across the attachments you have saved in your account. It’s a tremendously valuable feature for those who provide and upload a lot of content within file attachments.
Anyway, please let me know if you reconsider. And if you do hear of a plugin that would be capable of doing something like this, I’m definitely interested!
However, running Java, adding a potentially very big column with search data, and creating the necessary plugin hooks in the search infrastructure is fairly involved.
It’s a slightly different spec, but it might be easier and less resource-intensive to have a page that queries uploads and lists the files, if that would be sufficient and a fair compromise.
Or perhaps add another field to the upload dialog that asks for a description and lets you dump some content in there that would be searchable?
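To make the description-field idea a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Upload` record, its `description` field, and the `search_uploads` helper are names made up for illustration and are not part of Discourse. It only shows that free text stored next to the filename can be matched by a plain keyword search, without ever parsing the file itself.

```python
from dataclasses import dataclass


@dataclass
class Upload:
    # Hypothetical record; this is not Discourse's actual upload model.
    filename: str
    description: str = ""  # free text typed into the assumed extra dialog field


def search_uploads(uploads, query):
    """Return uploads whose filename or description contains every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    results = []
    for upload in uploads:
        # Search the filename and the description together, case-insensitively.
        haystack = f"{upload.filename} {upload.description}".lower()
        if all(term in haystack for term in terms):
            results.append(upload)
    return results


uploads = [
    Upload("q3-report.pdf", "Quarterly budget summary and staffing forecast"),
    Upload("minutes.docx", "Board meeting minutes"),
    Upload("logo.png"),
]

matches = search_uploads(uploads, "budget forecast")
print([u.filename for u in matches])
```

The trade-off is that the uploader has to write the description by hand, so search is only as good as what people type in.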
With that kind of code, I’d be very concerned about bugs in the file format parsing. Parsers for Office files are prone to bugs, occasionally including remote code execution (RCE).
I think the idea would be to use something like Apache Tika (https://tika.apache.org/) and then make the extracted metadata searchable in Discourse.
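A rough Python sketch of that pipeline, under several assumptions: a Tika Server is running locally on its default port 9998, the file bytes are sent to its `/tika` endpoint with `Accept: text/plain` to get extracted text back, and the `AttachmentIndex` class is a toy in-memory index invented for this example. It indexes extracted text rather than metadata, and it is not how Discourse search works internally; it only illustrates the "extract, then index" shape of the idea.

```python
import re
import urllib.request
from collections import defaultdict

TIKA_URL = "http://localhost:9998/tika"  # assumption: a local Tika Server instance


def extract_text_with_tika(data: bytes, url: str = TIKA_URL, timeout: float = 30.0) -> str:
    """Send raw file bytes to a Tika Server and return the extracted plain text."""
    request = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


class AttachmentIndex:
    """Toy inverted index (hypothetical): lowercase word -> set of upload ids."""

    def __init__(self, extractor=extract_text_with_tika):
        self.extractor = extractor
        self.postings = defaultdict(set)

    def add(self, upload_id, data: bytes):
        # Extract text once at upload time, then record each word it contains.
        text = self.extractor(data)
        for word in re.findall(r"\w+", text.lower()):
            self.postings[word].add(upload_id)

    def search(self, query: str):
        # Return ids of uploads that contain every word in the query.
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


# Stand-in extractor so the sketch runs without a Tika server.
index = AttachmentIndex(extractor=lambda data: data.decode("utf-8"))
index.add(1, b"Annual budget report")
index.add(2, b"Meeting minutes and budget notes")
print(sorted(index.search("budget")))
print(sorted(index.search("budget minutes")))
```

Keeping the extractor as an injected function means the parsing step can run in a separate, isolated service, which is one way to limit the impact of the parser bugs raised earlier in the thread.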