Dump all conversations in a file and structured data

Wall-E · September 1, 2021, 2:59pm

This question is similar to this one: Does Discourse support export conversations as an organized bulk of data?

but we are looking for a way to do some NLP on all conversations of our Discourse site. Someone on our team asked if this could be done by working at a low level in the backend, e.g. exporting the database but without the table, with something like pg_dump --schema-only. I didn’t fully understand what my colleague meant but I thought maybe you would.

pfaffman · September 1, 2021, 3:02pm

If you’re self-hosted, then they can do the pg_dump command that they think will help.

You can also dump data in various formats with the Data Explorer Plugin.

Wall-E · September 1, 2021, 8:44pm

This plugin seems to provide most of what we’re looking for! Thanks!

Wall-E · September 14, 2021, 10:40pm

So I installed the plugin and looked at all the queries posted at (Superseded) What cool data explorer queries have you come up with? but there isn’t anything that can export the actual conversations. For example, I asked for the top 100 active topics. I get database entries with topic IDs (see screenshot), but no conversations. Is this because the plugin only extracts data from the database and won’t pull the conversations themselves? If so, is there a way to use the topic IDs returned by the plugin to pull the corresponding conversations into JSON files?

pfaffman · September 14, 2021, 11:35pm

 SELECT * FROM posts WHERE topic_id = 425

That will give you the posts in the first topic in your query (given that I can type on this phone).

But if what you want is JSON, you could do something like

  https://meta.discourse.org/t/dump-all-conversations-in-a-file-and-structured-data/202351.json

Wall-E · September 15, 2021, 12:17am

I didn’t understand your 1st option, maybe a typo in your text? Did you mean I only get the 1st post of the topic?

Regarding the 2nd option with the .json extension, is there an alternative URL that uses the topic_id (or some other field), so that the conversation can be fetched as JSON programmatically without having to know the topic title?

pfaffman · September 15, 2021, 12:38am

Did you try the sql query? Was there an error? Edit: I checked. That query will return all posts in a topic.
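For the NLP use case, the rows returned by a query like that one can be regrouped into per-topic conversations once exported. The following is only a rough sketch: it assumes the result was downloaded as CSV and that the query selected the topic_id, post_number and raw columns (an assumption, not something shown in this thread); the function name and the sample data are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def group_posts_by_topic(csv_text):
    """Group exported post rows by topic_id, ordered by post_number."""
    topics = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        topics[int(row["topic_id"])].append(row)
    # Put each conversation in posting order.
    for posts in topics.values():
        posts.sort(key=lambda r: int(r["post_number"]))
    return dict(topics)


# Hypothetical export; a real one would come from the plugin's CSV download.
sample = """topic_id,post_number,raw
425,2,Second post
425,1,First post
426,1,Another topic
"""

conversations = group_posts_by_topic(sample)
for topic_id, posts in conversations.items():
    print(topic_id, [p["raw"] for p in posts])
```

Each value in conversations is then one conversation in order, which is usually a convenient shape to hand to an NLP pipeline.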

You can get any topic with only the topic id.

https://meta.discourse.org/t/-/202351.json
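A small sketch of how that URL pattern could be used from a script, given topic IDs pulled from the database. The helper names are hypothetical, and the JSON field names used here (post_stream, posts, post_number, username, cooked) are assumptions about the response layout that should be checked against a real response from your site. Long topics may not return every post in a single response, so treat this as a starting point.

```python
import json
import urllib.request


def topic_json_url(base_url, topic_id):
    """Build the ID-only topic URL, following the /t/-/<id>.json pattern."""
    return f"{base_url.rstrip('/')}/t/-/{topic_id}.json"


def fetch_topic(base_url, topic_id):
    """Download and parse one topic (needs network access; not called here)."""
    with urllib.request.urlopen(topic_json_url(base_url, topic_id)) as resp:
        return json.load(resp)


def extract_posts(topic):
    """Pull a few fields per post; the key names are assumed, not guaranteed."""
    posts = topic.get("post_stream", {}).get("posts", [])
    return [
        {
            "post_number": p.get("post_number"),
            "username": p.get("username"),
            "cooked": p.get("cooked"),
        }
        for p in posts
    ]


# Made-up payload in the assumed shape, so the sketch runs offline.
sample_topic = {
    "id": 202351,
    "post_stream": {
        "posts": [
            {"post_number": 1, "username": "Wall-E", "cooked": "<p>Hello</p>"},
            {"post_number": 2, "username": "pfaffman", "cooked": "<p>Hi</p>"},
        ]
    },
}

url = topic_json_url("https://meta.discourse.org/", 202351)
posts = extract_posts(sample_topic)
print(url, len(posts))
```

Looping fetch_topic over the topic IDs from a Data Explorer query, with a pause between requests, would give one JSON document per conversation.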

Wall-E · September 15, 2021, 1:03pm

The query was fine, I just misunderstood your explanation of what it actually provides. Thanks for double-checking. These are great solutions.

system · October 15, 2021, 1:04pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
