Pre-migration evaluation for a large Drupal 7 forum

Hi everyone, I own and administer what I believe is one of the largest Drupal-based forums on the internet, getting toward 2M posts. Drupal 7 is dying, and Drupal 8/9 are turning into more of a framework for web programmers than a ready-to-use content management system. The new Drupal versions simply do not offer the third-party modules I need for my forum to continue with its basic functions, and thanks to the joys of PHP and Drupal’s many other quirks, the upgrade would be every bit as hellish as migrating to a completely different platform. So I’m going to have to bite the bullet and migrate to something else.

I’m pretty sure it will have to be Discourse due to a unique aspect of my forum community’s style: I’m the only moderator, and it’s not my full-time job. So over the years I’ve used the flexible Rules and Flag frameworks in Drupal to create a piecemeal system of community moderation of spam and offensive posts. It automatically removes posts and/or closes user accounts at certain thresholds, based on how new the author is and how many users have flagged the post, and it also takes into account the newness and recent flagging activity of the users who flagged it. In other words, it’s almost exactly what Discourse has implemented. I’m really glad to see that Discourse has recognized the value of community moderation and has implemented such a comprehensive and well thought-out system out-of-the-box. Drupal 7 was and still is the only CMS flexible enough to allow this sort of custom functionality without being an experienced developer, which I am not. So it’s looking like I’ll be moving to Discourse. However, I do have some concerns.

  1. Community moderation system: Our forum is currently evaluating a playground installation of Discourse. I’m impressed with how comprehensive and well thought-out the whole system is. But the community has noticed some quirks:
    • I really don’t like how it hides automatically removed posts behind “View ignored content”. If a post is bad enough to get removed by the community it’s either highly offensive or pure spam, and I don’t want visitors or users to even have the option to view it. This is especially problematic in the case of topics that are spam or have an offensive title. And wouldn’t search engine crawlers see into the hidden spam content? Is it possible to configure the amount of time without user intervention before an automatically hidden spam post gets completely deleted from public view? And what about topics and posts that were community flagged as inappropriate?
    • I read here that “Note: All values mentioned above are the default settings. They can be changed by admins in site settings” regarding the thresholds that lead to post removal and/or user silencing, but I’m not seeing those granular settings in my Discourse test instance. All I can find is “hide post sensitivity” and “silence new user sensitivity”, but I don’t understand what that sensitivity actually relates to in concrete terms.
    • I would like to remove the “off-topic” reason for flagging a post. Our forum community is very laid back in that sense and has a forum culture where off-topic posts are very common and well accepted. Update: Looks like this might work.
  2. Private messages migration: The current forum has close to a million private message threads using the Drupal 7 privatemsg module, and the Drupal → Discourse migration script doesn’t handle it. This seems like a major omission, because despite being a third-party module (in typical Drupal fashion) it’s basically the de-facto private messaging functionality that Drupal 7 administrators use.
  3. Post format conversion: Unfortunately the current forum uses a mix of pure HTML and Textile formatted posts. I understand that the migration script can handle pure HTML (please correct me if I’m wrong) but not Textile. If possible I would like to convert the Textile posts to HTML or Markdown, whichever is easier. I have been told that Pandoc can be hooked into the migration script, but that it would also massively increase the migration time. I’ve looked for Drupal modules to convert the format of existing posts, but I only found this, which doesn’t support batch processing for the massive number of posts, and it doesn’t support Drupal “comments”, which make up the vast majority of the “posts” that need to be converted. So I’ve thought about just doing some kind of offline find/replace on the database dump file with sed, similar to what is described here. Suggestions or solutions would be welcome. I’m an experienced Linux user and I’ve worked off and on with regexp, but I’m still not good at it. Edit: This is an interesting option to find/replace once the raw data is in Discourse.
  4. Ads: I’m really glad to see that the Ad Plugin for Discourse seems to have matured a lot since I last looked into it. I understand that the in-house ads will allow me to place image banners in specific spots with a target link when clicked, and that if multiple ads are assigned to the same spot they will be selected at random, correct? However, I have no idea how to deal with the mobile paradigm. In my current forum I have one top horizontal banner and three vertical banners in the left sidebar, all of which wouldn’t be feasible for mobile users in Discourse’s responsive interface. Edit: Might have to modify the Ad Plugin for my needs, paid offer here.
  5. Permalinks: Drupal’s URL scheme has two major components: /node/XXXXXXX for nodes, and /comment/YYYYYYY#comment-YYYYYYY for links to specific comments within those nodes (YYYYYYY is the same in both occurrences). Will the Drupal 7 → Discourse migration script automatically maintain those links so that links in posts to other threads or posts still work and SEO is maintained? What about a sitemap.xml file for the search engines?
  6. Batch processing: Will the migration run in batches? If it hits an error, will it continue where it left off once I’ve fixed the problem, or will it have to start again from the beginning?
  7. Old Apple device users: I of course understand the dangers of using outdated browsers. For Windows and old Android devices there’s almost always a way to install a modern browser that is compatible with Discourse. But I’m concerned by one of my users that claims to have a 2015 Mac that doesn’t receive any updates and has no way to install anything besides the old version of Safari that is showing him deprecation notices with Discourse. I really know very little about Apple devices apart from the fact that they’re much more locked down. Is it really that hard to install other modern browsers on them?
  8. Image / Upload storage: My users and I love the ease of uploading images in Discourse, but I’m a bit worried about the storage space and costs. The best option down the road would probably be to mount a network storage volume to the VPS if needed. If I initially set up Discourse with the default uploads location, would it cause problems to move it to a different volume later on?
  9. Backups:
    • I wish there was a system for differential, or better still, deduplicated backups. I currently use Duplicity with Amazon S3 for my Drupal forum, and the costs are unbelievably low for a very long history of revisions. Does anybody know off the top of their head how soon after an S3 archive creation a rule can make it transition to Glacier?
    • Does the Discourse backup interface allow for deleting archives in Amazon S3? I know it’s a bit extreme, but I would want to disable that functionality, because I set up my S3 buckets with only PUT and GET and LIST permissions to prevent a hacker on the compromised system from deleting my remote backups. Then an S3 lifecycle rule kicks in and server-side deletes the older archives after a certain amount of time.
  10. Stop Forum Spam plugin: I don’t want to use Akismet, but I’ve always had good results with Stop Forum Spam to prevent a lot of spammer account creation. Does anybody know if the plugin for Discourse has configurable thresholds for how many hits a username or IP or email address should have in the database for it to be rejected? Edit: No it doesn’t. Requested here. Also, unfortunately, it doesn’t actually prevent account creation when a registrant has enough hits in the SFS database, like the Drupal module does.

Sorry for the long post. Thanks in advance to everyone for their insight, and many thanks to the entire Discourse project for this excellent product.

I just ran across this:

apply a set of regexp-based transformations, such as replacing BBCode tags with Markdown

It was last updated in 2016, not sure if it’s still a relevant option.
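
For a sense of what such regexp-based transformations could look like for Textile instead of BBCode, here is a minimal Ruby sketch. The method name textile_to_markdown and the handful of rules are my own illustration, not part of any Discourse import script, and they cover only a small subset of Textile (headings, block quotes, links, images, bold). Textile's _italic_ is left alone because Markdown reads it the same way. Real posts will have edge cases (nested markup, tables, trailing punctuation after URLs) that this does not handle.

```ruby
# Minimal, illustrative Textile -> Markdown conversion using ordered regexp rules.
# Hypothetical helper: covers only a few common Textile constructs.
def textile_to_markdown(text)
  out = text.dup

  # Headings: "h2. Title" -> "## Title"
  out.gsub!(/^h([1-6])\.[ \t]+(.+)$/) { ('#' * $1.to_i) + ' ' + $2 }

  # Block quotes: "bq. text" -> "> text"
  out.gsub!(/^bq\.[ \t]+(.+)$/) { '> ' + $1 }

  # Links: "text":http://example.com -> [text](http://example.com)
  out.gsub!(/"([^"\n]+)":(https?:\/\/[^\s<]+)/) { "[#{$1}](#{$2})" }

  # Images: !http://example.com/a.png! -> ![](http://example.com/a.png)
  out.gsub!(/!(https?:\/\/[^\s!]+)!/) { "![](#{$1})" }

  # Bold: *text* -> **text**. The lookarounds skip list bullets ("* item")
  # and asterisks that sit inside words.
  out.gsub!(/(?<![*\w])\*(?=\S)([^*\n]+?)(?<=\S)\*(?![*\w])/) { "**#{$1}**" }

  out
end
```

The rules run in a fixed order so that the link and image rules fire before the bold rule. A function like this could be called on each post body during import or in a later pass; which is better probably depends on how slow the extra step turns out to be.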

Is this still relevant? In the Drupal importer script I’m seeing code like:

  create_posts(results, total: total_count, offset: offset) do |row|
    topic_mapping = topic_lookup_from_imported_post_id("nid:#{row['nid']}")
    # ...

  def create_permalinks
    puts '', 'creating permalinks...'

    Topic.listable_topics.find_each do |topic|
      tcf = topic.custom_fields
      if tcf && tcf['import_id']
        node_id = tcf['import_id'][/nid:(\d+)/, 1]
        slug = "/topic/#{node_id}"
        Permalink.create(url: slug, topic_id:

The script typically pulls in 1000 posts at a time.

It keeps track of what has been processed, so subsequent runs can skip data that has already been imported. On scripts I’ve touched, I also include an import_after setting that further speeds up subsequent runs by loading only recent data (also useful for testing with just a small subset of the data).
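
Roughly, the batching and resume behaviour described above can be pictured like this. This is a self-contained sketch, not the actual importer: fetch_batch and import_all are hypothetical names, the source database is replaced by an in-memory array, and the record of imported ids is a plain Set (a real importer would persist it, so that it survives between runs).

```ruby
require 'set'

BATCH_SIZE = 1000 # matches the "1000 posts at a time" mentioned above

# Hypothetical stand-in for a SQL query with LIMIT/OFFSET and a date filter.
def fetch_batch(rows, offset, limit, import_after)
  rows.select { |r| r[:created_at] >= import_after }
      .sort_by { |r| r[:id] }
      .slice(offset, limit) || []
end

# Imports rows in batches, skipping ids recorded by an earlier run.
# Returns the ids imported during this run.
def import_all(rows, already_imported, import_after: Time.at(0), batch_size: BATCH_SIZE)
  imported_now = []
  offset = 0

  loop do
    batch = fetch_batch(rows, offset, batch_size, import_after)
    break if batch.empty?

    batch.each do |row|
      next if already_imported.include?(row[:id]) # done in a previous run

      # A real importer would create the post here.
      already_imported << row[:id]
      imported_now << row[:id]
    end

    offset += batch_size
  end

  imported_now
end
```

With this shape, a run that dies partway through can simply be started again: rows whose ids are already recorded are skipped, and import_after narrows the query so that later runs do not even have to read the old rows.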

I would need to look more closely to see about whether posts are included in permalinks. They typically are not, but it can be done.
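
As a sketch of how posts could be included, assuming the importer records comments with an import id of the form "cid:123" alongside the "nid:123" form used for nodes (the "cid:" prefix is my assumption here), the old paths could be derived like this. The helper permalink_rows is hypothetical and only builds plain hashes; in a real script each hash would be handed to something like Permalink.create, and whether that accepts a post_id in your Discourse version should be checked. The #comment-YYYYYYY part of the old URLs is a fragment, which browsers do not send to the server, so only the /comment/YYYYYYY path needs a redirect.

```ruby
# Builds permalink attributes from a map of import ids to Discourse ids.
# Hypothetical helper: returns plain hashes and touches no database.
def permalink_rows(import_ids)
  import_ids.filter_map do |import_id, discourse_id|
    case import_id
    when /\Anid:(\d+)\z/
      # Old Drupal node URL -> imported topic
      { url: "/node/#{$1}", topic_id: discourse_id }
    when /\Acid:(\d+)\z/
      # Old Drupal comment URL -> imported post
      { url: "/comment/#{$1}", post_id: discourse_id }
    end
  end
end
```

Import ids that match neither pattern are dropped, so unrelated custom field values would not produce bogus redirects.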

You’ll want all of your uploads on S3, so your backup will include only the database dump. You can’t really do anything to optimize that. You can either let Discourse keep a certain number of backups, or tell it not to clean up (or just set the number of backups to a big number) and let your rules handle it.


Oh, that’s a very good point. Now that I think about it, I’ll get charged either way for upload storage on S3, whether the uploads go directly (and only) to S3, or whether they’re inside multiple tarballs from the Discourse backup.

And what about using a bucket with no delete permissions for the Discourse backups?

But if they are on S3 then you have only a single copy.

I suspect that it will work if Discourse does not have permission to delete, though I don’t know.

Right, and with S3’s insane levels of data redundancy that would generally be considered a responsible way of storing uploads? I haven’t fiddled around with S3 options recently, but I believe they also have lifecycle rules to recover deleted files for a time period? I’m thinking in the event that the uploads somehow got deleted due to a mistaken call from Discourse, be it an (unlikely and massive) coding bug or user error. Or a hacking event, circling back to my original concern about delete permissions on the bucket.

Yes, you can turn on versioning so that files are not deleted when they are marked as deleted. If you don’t care how much space you’re paying for, you can do that. When Discourse deletes a file because it’s no longer used, it moves it to a tombstone folder for a while before deleting it. I recommend that you trust Discourse to manage the files. I don’t know if disallowing delete access will break anything.

You can put backups on a separate bucket with different permissions (but same credentials) if you want.


Question for @pfaffman or anybody else that puts uploads on S3 – I know it depends on a million factors, but do you have at least anecdotal information on the charges for bandwidth and S3 requests for a medium-large forum with its uploads on S3? Thanks a lot!


Little update here: So what I think I’m going to do is keep my uploads local; I should have enough local storage for now and the option to expand it with additional storage volumes if needed. I just don’t want to deal with the complexity and expense of a CDN, the unpredictable charges of object storage, and above all the transfer costs for live website image serving. Then I’m going to do automatic backups to Backblaze B2 through Discourse’s S3 backup settings, including uploads and with the s3 disable cleanup option enabled. Backblaze pricing is so cheap that it shouldn’t be a problem to keep a few weeks of daily backups even with the redundant uploads. It turns out that Backblaze B2 has two very simple options for buckets that are just what I need: 1) automatic lifecycle rules to delete files after X days, and 2) prevent deletion or modification of files for N days (to guard against the small possibility of the server getting hacked and the hacker using the stored credentials to delete my remote backups). I tested this and it seems to work fine; I tried to delete a backup archive from the Discourse GUI that was prohibited from deletion by Backblaze, and it simply did nothing.
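
To make the interplay of those two bucket options concrete, here is a small Ruby model of it. It is only a simplified simulation under my own assumptions (ages counted in whole days from upload, a hypothetical backup_status helper, made-up values for N and X), not a call to any Backblaze API, and the real lifecycle and lock rules have more detail than this.

```ruby
require 'date'

# Simplified model of one backup file's state on a given day.
#   lock_days   - N: days during which deletion/modification is refused
#   expire_days - X: days after which the lifecycle rule removes the file
def backup_status(uploaded_on, today, lock_days:, expire_days:)
  age = (today - uploaded_on).to_i # whole days since upload

  return :expired if age >= expire_days # lifecycle rule has removed it
  return :locked if age < lock_days     # stored credentials cannot delete it

  :deletable # still present and no longer protected by the lock
end
```

In this model, any backup older than N days but younger than X days could still be deleted by someone holding the stored credentials, so keeping N close to X keeps that window small.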

Just to clarify for me and for others: It is possible to automatically back up uploads on local storage to S3 if the backup with uploads option is enabled (default), right?


Yes. By default local uploads are included in the backup file.
