I’m using the hosted Discourse/CDCK to set up my site. I’m currently working on taking conversations from a Slack export, doing a bit of preprocessing on the text, generating a title with a Hugging Face transformer, then adding each one to Discourse via the API.
I saw on another forum (can’t find it, sorry) that using the API for bulk operations isn’t best practice. I also keep hitting rate limits; I’ve tried adjusting the relevant settings, but I still keep running into them. Do any of you have suggestions for bulk importing some kind of file I could produce from the Slack messages?
The processed (Discourse-friendly) Slack messages are in a database, and I have a script that pulls them out and pushes them to the Discourse API. I could easily change it to produce a file that Discourse could ingest to generate topics in bulk.
Typically you’d use an import script on a server that you control, or on a development instance. You’d need to put your site in read-only mode, download the backup from your hosted instance, do the work, make a new backup, upload it to your hosted instance, and ask that your backup be restored. If your site is live and a day of read-only time isn’t an issue, do this.
If read-only time is Really Bad and you have only a few thousand posts, write your script to add a delay between calls and to retry when a request is denied.
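The delay-between-calls-and-retry idea can be sketched roughly like this. This is a minimal sketch, not a definitive implementation: `post` stands in for whatever HTTP client you already use (e.g. `requests.post` with your `Api-Key`/`Api-Username` headers attached), and the retry/backoff numbers are placeholder assumptions to tune against your own rate limits.

```python
import time

def create_with_retry(post, url, payload, delay=1.0, max_retries=5):
    """POST `payload` to a Discourse create endpoint, sleeping between
    attempts and backing off when the server answers 429 (rate limited).

    `post` is any callable that performs the HTTP POST and returns a
    response object with a .status_code attribute.
    """
    resp = None
    for attempt in range(max_retries):
        resp = post(url, json=payload)
        if resp.status_code != 429:
            break
        # Rate limited: wait a bit longer before each retry.
        time.sleep(delay * (attempt + 1))
    # Throttle even successful calls so we stay under the limit.
    time.sleep(delay)
    return resp
```

You'd then call it in your export loop, e.g. `create_with_retry(session.post, f"{base_url}/posts.json", body)`, where `session` is a hypothetical client already carrying your API credentials.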
If you have tens of thousands of posts, you’ll need to convince yourself, and whoever else, that a day of read-only time is not Really Bad.
See scripts/import_scripts in the source code for examples.
I’m only pulling in a few hundred at a time, so I’ll probably just add delays rather than stand up a local instance. We’re slowly moving over the longer conversations with specific people in them to seed our Discourse site with a bunch of valuable conversations that are no longer searchable in Slack.
Will CDCK ever support some way to invoke these bulk import scripts or possibly expose a bulk import capability via the API?
I doubt it, as it’s a fairly niche request, and potentially very dangerous. Coincidentally, I am about to start work on a plugin that will accept the URL of a Google Docs spreadsheet with sheets for categories, users, topics, and posts, and import them, but I would expect it to be available only to self-hosters and enterprise customers.
Fair enough. I’m adding some exception handling and skipping any record whose external id already exists. This workflow works fine since we’re only importing a few hundred at a time and then manually going through and fixing them up. I just wanted to make sure I wasn’t doing anything that could get me into a bad state.
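For what it's worth, the skip-if-it-already-exists check can be done with a lookup before each create. This sketch assumes the `/t/external_id/<id>.json` topic-lookup route (available on reasonably recent Discourse versions; verify against yours), and `get` stands in for your authenticated HTTP client:

```python
def already_imported(get, base_url, external_id):
    """Return True when a topic with this external_id already exists,
    so the import loop can skip it instead of failing on a duplicate.

    `get` is any callable that performs an HTTP GET and returns a
    response with a .status_code attribute (e.g. a requests session
    that already carries the Api-Key / Api-Username headers).
    """
    resp = get(f"{base_url}/t/external_id/{external_id}.json")
    return resp.status_code == 200
```

In the loop you'd then do something like `if already_imported(session.get, base_url, msg_id): continue`, with `session` and `msg_id` being placeholders for your own client and Slack message identifier.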
That said, that plugin sounds like a great alternative! Hopefully you’d offer it for a fee to enable it for a month, after which we could go back to the Team plan.
If you’re using the API, then you have a bunch of protections provided by Rails that keep you from doing anything that would be really bad.
Just uploaded 500. It took a while, but all I did was shorten the wait times to 1s for posts and topics and add a one-second sleep after each create call, and that solved it. It took about 4 hours to upload the ~500 messages.
This is definitely the way. This will give us plenty to chew on and moderate for a bit.
BTW this is all currently in a private Discourse, so we’re gonna make it look pretty before we open the gates.
Thanks again for all of your help! Looking forward to that new plugin!
This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.