I need to prepare an export or backup which I can then manually upload into a Perplexity.ai Workspace. I want to include all of the posts on the whole site. The allowed filetypes are:
.json
.md
.txt
.pdf
.csv
.xlsx
.docx
.pptx
I can upload up to 50 files, each 25 MB or smaller.
I do not need images included.
What would be the best method?
(Also, not necessary, but if possible, I would like the exported file(s) to include the contents of user-uploaded documents such as .pdf files in a format readable by the LLMs.)
Does anyone have any thoughts about the best way to prepare such an export?
I am self-hosting.
Here’s what I found so far…
Ask.Discourse.com said...
Exporting Content While Excluding Images:
- Using the Data Explorer Plugin (if available):
  - Discourse’s Data Explorer Plugin allows you to run custom SQL queries to extract different types of data in the format you want (e.g., JSON, CSV).
  - Example Query: Run a query to export all posts and topics. This will give you structured data.
- JSON Export through API:
  - Each topic or post on Discourse can be exported programmatically as JSON by appending .json to its URL.
  - For example: https://<your-discourse-site>/t/topic-title/<topic-id>.json
  - If you script this request for every topic ID on the site, you get each conversation as an individual .json file.
- Automated Command Line Option (if Supported):
  - For self-hosted sites, advanced users can work directly with the database using pg_dump commands or similar database query extraction steps.
  - However, hosted plans don’t allow running these tools directly.
- User-Uploaded Documents (e.g., PDFs):
  - Unfortunately, merging the content of uploaded documents directly into .json or any other output format like .txt isn’t natively supported. However, you can separately download and parse the documents locally using specialized tools to include their text into structured formats.
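To make the JSON-through-API route a bit more concrete, here is a minimal Python sketch of how one topic's JSON could be turned into plain Markdown text with images dropped, while also noting any linked documents for separate download. It assumes the topic JSON has a `title` and a `post_stream.posts` list whose entries carry `username`, `created_at` and `cooked` (the rendered HTML). The helper names (`topic_to_markdown`, `fetch_topic`) and the list of document extensions are my own placeholders, not anything Discourse provides. A private site would also need authentication, and as far as I know long topics are paginated in that JSON, so a full export would have to request the remaining posts as well.

```python
import json
import urllib.request
from html.parser import HTMLParser


class _PostText(HTMLParser):
    """Collect visible text and document links from a post's cooked HTML."""

    # Assumption: adjust to whatever file types users upload on your site.
    DOC_EXTENSIONS = (".pdf", ".docx", ".txt")

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.doc_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(self.DOC_EXTENSIONS):
                self.doc_links.append(href)
        elif tag in ("p", "br", "li", "blockquote"):
            # Keep rough paragraph breaks; <img> tags are simply ignored.
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def topic_to_markdown(topic):
    """Turn one parsed topic dict into (markdown_text, doc_links)."""
    lines = [f"# {topic['title']}", ""]
    doc_links = []
    for post in topic["post_stream"]["posts"]:
        parser = _PostText()
        parser.feed(post["cooked"])
        parser.close()
        text = "".join(parser.chunks).strip()
        lines += [f"## {post['username']} ({post['created_at']})", "", text, ""]
        doc_links += parser.doc_links
    return "\n".join(lines), doc_links


def fetch_topic(base_url, topic_id):
    """Download one topic as JSON. Needs network access; not called on import."""
    with urllib.request.urlopen(f"{base_url}/t/{topic_id}.json") as resp:
        return json.load(resp)
```

Something like `markdown, links = topic_to_markdown(fetch_topic("https://<your-discourse-site>", 123))` would then give the text of one topic plus the links to fetch and parse with a separate PDF tool.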
File Type Recommendation:
Given your upload constraints:
- For text extraction, export .json or .csv files, as these formats can best capture structured posts and topic details.
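Since a whole site will usually have far more than 50 topics, the per-topic output would need to be merged to fit the upload limits mentioned above. Below is a rough Python sketch of one way to do that: greedily concatenate the per-topic text into bundles that stay under the size limit, and fail loudly if more than 50 bundles would be needed. The function names, the `forum_export` file prefix and the separator are hypothetical, and 25 MB is read conservatively as 25,000,000 bytes.

```python
MAX_FILES = 50
MAX_BYTES = 25 * 1000 * 1000  # assumption: 25 MB read as decimal megabytes


def pack_documents(docs, max_bytes=MAX_BYTES, max_files=MAX_FILES):
    """Greedily concatenate text documents into bundles of at most max_bytes.

    docs is an iterable of strings (for example one Markdown string per topic).
    Returns a list of bundle strings; raises ValueError if the limits cannot be met.
    """
    separator = "\n\n---\n\n"
    sep_size = len(separator.encode("utf-8"))
    bundles, current, current_size = [], [], 0
    for doc in docs:
        size = len(doc.encode("utf-8"))
        if size > max_bytes:
            raise ValueError("a single document exceeds the per-file limit")
        extra = size + (sep_size if current else 0)
        if current and current_size + extra > max_bytes:
            # Current bundle is full: close it and start a new one.
            bundles.append(separator.join(current))
            current, current_size = [doc], size
        else:
            current.append(doc)
            current_size += extra
    if current:
        bundles.append(separator.join(current))
    if len(bundles) > max_files:
        raise ValueError(f"{len(bundles)} bundles needed, only {max_files} allowed")
    return bundles


def write_bundles(bundles, prefix="forum_export"):
    """Write each bundle to <prefix>_NN.md in the current directory."""
    for i, text in enumerate(bundles, start=1):
        with open(f"{prefix}_{i:02d}.md", "w", encoding="utf-8") as fh:
            fh.write(text)
```

Sizes are measured in UTF-8 bytes rather than characters, because the upload limit applies to the file on disk.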
If additional customization is needed, please let me know!
For more details, check out the Data Explorer Plugin documentation.