Will RAG Support PDF Files in the Future?

First, your AI stuff rocks!

Second, if we post PDF, Word, or PowerPoint files to our forum, will it also read those and chunk them into vectors for RAG?

2 Likes

Sadly, we do not have PDF support yet, though it is something we are thinking about. We do support TXT files in our Persona and Tool RAG implementation, so as long as you can convert the source material to text files, you can consume it in a persona.

3 Likes

Yes, that is what we did: we converted the attachments to text and associated them with each topic.

1 Like

We have seen this feedback a few times and are considering expanding file extension support in the future through our AI bot persona and Tool RAG implementation.

2 Likes

As a workaround for now, we just convert the PowerPoint, Word, or PDF file to text and attach it to the topic it belongs to.
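
For the Word case, the conversion can be sketched with only the Python standard library, since a .docx file is a zip archive whose main text lives in `word/document.xml`. This is a minimal sketch, not what Discourse does internally: the function name `docx_to_text` is made up for illustration, and it ignores headers, footers, and table layout (text boxes may also show up twice). PDF and PowerPoint files would need a different tool.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    `path` may be a filename or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph is split into runs; each w:t element holds a text fragment
        fragments = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(fragments))
    return "\n".join(lines)
```

The returned string can be written to a .txt file and uploaded alongside the original attachment.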

1 Like

PDF support would absolutely be a game changer for many communities! Since PDF is close to a universal standard for documents, we often find ourselves having to reformat material into .txt for RAG, which is time-intensive :face_with_spiral_eyes:

4 Likes

We are finishing some work on Embeddings, and as soon as that is complete, adding PDF support is next up.

4 Likes

Wow, that’s super nice. Kudos to the team for always keeping in mind what the community needs!

What about JSON files? I have a ton of exported Discord chats that we need to query with AI so we don’t lose this info :slight_smile:

I was thinking about fine-tuning models, but I think adding the files to Discourse should be better and simpler for everyone with a similar use case.

JSON is just text, so we already support it.

It is an inefficient representation for LLMs because the format contains a large amount of duplication, so it would waste a few tokens, but overall it will work. I would recommend running a script over it to reformat the data and improve RAG performance.

It is very hard to do this automatically because JSON can be deeply nested, and picking a good domain-specific text representation depends heavily on the domain.
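
As a rough illustration of such a script, here is a minimal sketch that flattens a chat export into one compact line per message. The schema is an assumption: the keys `messages`, `timestamp`, `author`, `name`, and `content` are hypothetical and need to be adjusted to whatever your export tool actually produces. The function name `messages_to_text` is also made up.

```python
import json


def messages_to_text(export):
    """Flatten a parsed chat export into '[timestamp] author: content' lines.

    Assumes a hypothetical schema:
    {"messages": [{"timestamp": ..., "author": {"name": ...}, "content": ...}]}
    """
    lines = []
    for msg in export.get("messages", []):
        content = (msg.get("content") or "").strip()
        if not content:
            # Skip empty messages (for example attachment-only posts)
            continue
        author = (msg.get("author") or {}).get("name", "unknown")
        timestamp = msg.get("timestamp", "")
        lines.append(f"[{timestamp}] {author}: {content}")
    return "\n".join(lines)


def convert_file(json_path, txt_path):
    """Read one exported JSON file and write the flattened text next to it."""
    with open(json_path, encoding="utf-8") as src:
        export = json.load(src)
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write(messages_to_text(export))
```

Dropping the repeated keys, IDs, and braces is where the token savings come from; which fields are worth keeping is exactly the domain-specific part.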

3 Likes

Thanks Sam, can I ask what you would suggest to balance performance and price when adding ~150 MB of JSON (or PDF)?

This is my first time using RAG on our data, and I will start learning about the process soon.

I appreciate any insight from the community as well.