Index File Contents for Search

I've been looking into this and putting together a wireframe (with AI) of how best to achieve it. A couple of ideas come to mind. The first is Apache Tika, which can extract text from almost any file type and, with Tesseract installed, OCR images as well. It would be a self-hosted option. The second, alongside or instead of Tika, is a multimodal model such as Gemini 1.5 Flash, which could not only perform OCR but also describe the images it analyzes; that output would then be stored in a PostgreSQL table or column for search. Of course, this requires a sizable upfront investment of tokens to rebake all posts with attachments/uploads, but it would be the most useful option. I suppose you get what you pay for?
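To make the self-hosted route a bit more concrete, here is a rough Python sketch of the Tika-to-PostgreSQL path. It assumes a Tika Server instance on its default port (9998), PostgreSQL 12 or newer for the generated `tsvector` column, and a DB-API cursor that uses `%s` placeholders (psycopg style). The table `upload_search_data`, its columns, and the helper names are made up for illustration and are not an existing schema. The `extractor` argument is a hypothetical hook where a vision-model call could be swapped in instead of Tika.

```python
import urllib.request

# Assumption: Tika Server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"

# Hypothetical table for extracted attachment text (names are illustrative).
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS upload_search_data (
    upload_id      bigint PRIMARY KEY,
    extracted_text text NOT NULL,
    search_vector  tsvector GENERATED ALWAYS AS
                   (to_tsvector('english', extracted_text)) STORED
);
CREATE INDEX IF NOT EXISTS upload_search_data_vector_idx
    ON upload_search_data USING gin (search_vector);
"""

# Re-indexing the same upload overwrites the previously stored text.
UPSERT_SQL = """
INSERT INTO upload_search_data (upload_id, extracted_text)
VALUES (%s, %s)
ON CONFLICT (upload_id) DO UPDATE SET extracted_text = EXCLUDED.extracted_text
"""

# Full-text lookup against the generated tsvector column.
SEARCH_SQL = """
SELECT upload_id FROM upload_search_data
WHERE search_vector @@ websearch_to_tsquery('english', %s)
"""


def extract_with_tika(data: bytes, url: str = TIKA_URL) -> str:
    """PUT raw file bytes to Tika Server and return the plain text it extracts."""
    request = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")


def normalize(text: str) -> str:
    """Collapse runs of whitespace so the stored text stays compact."""
    return " ".join(text.split())


def index_upload(cursor, upload_id: int, data: bytes, extractor=extract_with_tika) -> str:
    """Extract text from one upload and upsert it into the search table."""
    text = normalize(extractor(data))
    cursor.execute(UPSERT_SQL, (upload_id, text))
    return text
```

A rebake job would then loop over existing uploads and call `index_upload` for each one, and search would run `SEARCH_SQL` with the user's query string. Keeping the extractor pluggable means the token cost of the model-based option only applies to the files where plain extraction is not enough.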
