Redirecting old forum URLs to new Discourse URLs

import
(Dan Dascalescu) #21

I’ve just created a topic map from MyBB to Discourse automatically, using the migration script.

MyBB was set to use SEO-friendly URLs without IDs in them. Now for example when I navigate to /thread-foo-bar, nginx redirects to /t/foo-bar/12. Here’s how I did it:

  1. Patch the importer to output lines that can be turned into a map file for nginx’s map module. For the MyBB importer, I added this code in create_posts:

    parent = topic_lookup_from_imported_post_id(m['first_post_id'])
    if parent
      puts "\nXXX #{m['topic_id']}: #{parent[:topic_id]},"
    end
    

    After that, I grepped for lines starting with XXX, removed the XXX, and made the file a JSON object, which I pasted into this script. Change the URLs to your forums, run the script, and its output will be a series of nginx map lines. I saved it as /etc/nginx/mybb2discourse.map.

  2. Configure nginx to “run other websites on the same machine as Discourse”, while making the following modifications to the nginx config file (/etc/nginx/conf.d/discourse.conf) in order to point nginx to the map file:

    • insert this at the top of the file:
    map_hash_bucket_size 128;
    map_hash_max_size 50000;  # might have to increase this
    
    map $uri $new {
        include /etc/nginx/mybb2discourse.map;
    }
    
    • then in the server section, add:
    if ($new) {
        rewrite ^ $new permanent;
    }
    
  3. Complete the nginx reload and container rebuild steps from the end of the Configure nginx… post linked above.
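For anyone who would rather do the conversion from step 1 in Ruby, here is a minimal sketch of turning the XXX lines into nginx map entries. It is not the script linked above. The `old_slugs` hash (MyBB thread ID to old URL slug) and the `/thread-<slug>` path shape are assumptions based on the example URL in this post, and the method names are made up for illustration. The `/t/<id>` target relies on Discourse redirecting a bare topic ID to the full topic URL.

```ruby
# Parse importer output lines of the form "XXX <mybb_tid>: <discourse_topic_id>,"
# into a Hash of { mybb_tid => discourse_topic_id }. Other lines are ignored.
def parse_topic_map(log_text)
  log_text.each_line.with_object({}) do |line, map|
    match = line.match(/\AXXX (\d+): (\d+),/)
    map[match[1].to_i] = match[2].to_i if match
  end
end

# Build nginx map entries ("<old path> <new path>;").
# old_slugs is a hypothetical Hash of { mybb_tid => "old-url-slug" };
# threads without a known slug are skipped.
def nginx_map_lines(topic_map, old_slugs)
  topic_map.filter_map do |old_id, new_id|
    slug = old_slugs[old_id]
    next unless slug

    "/thread-#{slug} /t/#{new_id};"
  end
end

log = "importing posts\n\nXXX 7: 12,\n\nXXX 8: 13,\n"
puts nginx_map_lines(parse_topic_map(log), { 7 => "foo-bar" })
```

The output can be redirected into the map file, for example /etc/nginx/mybb2discourse.map as used above.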

Would be great if someone who’s better with Ruby patched the importer to output the topic IDs map (or even better, the nginx map directly).

5 Likes

(zh99998) #22

poor performance …
It will generate millions of regexps, and nginx has to process each of them on every request.

0 Likes

(Dan Dascalescu) #23

Do you have a better proposal?

Or a performance benchmark? I suspect that up to a pretty high number of regexps, the bottleneck is by far the entire Rails request processing + database lookup + response building stack, not nginx’s entirely memory-contained regexp matching.

0 Likes

(Fajfi) #24

After that, I grepped for lines starting with XXX, removed the XXX, and made the file a JSON object, which I pasted into this script…

I just copied this new piece of code into the script. How does it work, and how do I get this JSON object/file?

0 Likes

(Stefano Maffulli) #25

this doesn’t seem to capture the threads with 0 replies in my mybb database :frowning: Any idea what the issue could be? Or any suggestions on a cleaner way to get a map of old to new threads?

0 Likes

(Charles) #26

Permalinks and normalizers are the most frustrating, unclear, under-documented feature of Discourse that I’ve run into so far. I’m having a horrible time setting these up. Just wanted to vent my frustration here. I’ve read Problem with permalinks, or regex? as well as other posts on vbulletin-specific importers.

Great feature idea, I just wish I could figure out how to use it properly.

3 Likes

(Jay Pfaffman) #27

I understand your frustration. It took me a while to figure them out.

I think it’s because the feature is used infrequently (just when you do an import) and by relatively few people (people who write importers). And once you’ve figured it out for the current problem, you just move on.

3 Likes

(Charles) #29

I’m soon to be moving on… to another vbulletin import to Discourse :slight_smile: So I’ll share what I’m doing, and after a couple more of these i’ll compile all my lessons learned somewhere.

I wrote an importer for permalinks that solved my vbulletin4 redirects for old permalinks.

To get these redirects to work, add the following “permalink normalizations” in the Admin settings.

Example 1, you have urls like this:

/forums/f10/some-thread-title-here-51689/index1.html
/forums/f10/some-thread-title-here-51689/index15.html#323423

Normalization 1. This is for the above 2 examples. Add this normalization into the admin setting first (order of normalizations is important!)

/(forums)\/f[0-9]+\/.+-([0-9]+)\/index[0-9]*.html/\1\2

Example 2, your vbulletin also has permalinks like this:

/forums/f10/some-thread-title-here-51689/

Normalization 2. This is for the above example permalink. Add this normalization second (order of normalizations is important!)

/(forums)\/f[0-9]+\/.+-([0-9]+)\//\1\2

And then run this import script after completing the bulk-import or normal import scripts for vbulletin (btw, i had to use both official import scripts, and modified them because neither solved my needs alone: forum around ~1million posts)

5 Likes

(Jay Pfaffman) #30

Could you add that to the script and submit a PR?

Also, you can set the permalink normalizations in the script (rather than the web interface) something like this:

SiteSetting.permalink_normalizations='/topic/(.*t)\?.*/\1'

If you don’t know what logic is required to know which permalink normalization to use, just pick one and add the other one as a comment. People running the importer will see the code before they can find this thread. :slight_smile:

1 Like

(OG) #31

If you need to delete all Permalinks at once, use Permalink.all.each { |p| p.destroy } from rails console.

3 Likes

(Régis Hanol) #32

Permalink.destroy_all is shorter and more efficient :wink:

10 Likes

split this topic #33

7 posts were split to a new topic: Discourse to WordPress redirect questions

0 Likes

(Danny Goodall) #34

Apologies for opening an old thread but it seems like a good place for my question to sit as some of what I’m asking has been touched on but not fully answered.

I’m trying to ensure that I understand the workflow for permalink normalisation and, as others have said, there really doesn’t seem to be a great deal of documentation around this.

Can I just confirm my understanding / misunderstanding of the permalink normalisation process or at least the process that normalisation plays in redirects?

  1. URL comes in and isn’t matched to any route
  2. Before 404 is thrown - we check for a permalink rule matching our URL
  3. Before we attempt to match the URL, we apply a permalink_normalization regex on the inbound URL turning it into a new string
  4. We look for an exact match between the new string generated in 3. and the url column in the permalinks table
  5. If we find a match we redirect the visitor to the relevant category / topic / post described in the permalinks row.
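The flow in the list above can be sketched in plain Ruby. This is only a model of the steps as described here, not Discourse’s code: the `PERMALINKS` hash stands in for the permalinks table, the rule and URLs are hypothetical, and it assumes that the leading slash is dropped before matching and that every rule is applied in order to the output of the previous one.

```ruby
# Hypothetical stand-in for the permalinks table: url column => redirect target.
PERMALINKS = {
  "topic/mindgames-in-football" => "/t/mindgames-in-football/42"
}.freeze

# Normalization rules as [pattern, replacement] pairs (already split apart).
NORMALIZATIONS = [
  [%r{(chat)/(.*?)([#?].*)?$}, 'topic/\2']
].freeze

# Steps 3 to 5: normalize the unmatched path, then look for an exact match.
# Returns the redirect target, or nil when the request should end in a 404.
def resolve_permalink(path)
  url = path.delete_prefix("/")
  NORMALIZATIONS.each { |pattern, replacement| url = url.sub(pattern, replacement) }
  PERMALINKS[url]
end

p resolve_permalink("/chat/mindgames-in-football?page=2") # => "/t/mindgames-in-football/42"
p resolve_permalink("/chat/unknown-topic")                # => nil
```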

IF that is the correct flow, can I ask

  1. What strategies do people use to generate the new string from the regex? Presumably, regardless of the incoming url, we could just generate /topic/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1, /post/4a512429-0e2d-4437-826c-a7590144617c or /category/elephants (yes, MVCF does use UUID descriptors on the url for topics and posts!)
  2. As you can have multiple permalink_normalization entries, are they applied in order until a match is found or a 404 is raised?
  3. Any other gotchas?

Thanks

3 Likes

(Jay Pfaffman) #35

Yup, I think that’s it.

1 Like

(Danny Goodall) #36

Thanks for the advice/validation (AGAIN) @pfaffman, I did manage to get the redirects working.

Just wanted to circle back to this to mention a few of the gotchas that I found and perhaps leave some breadcrumbs for future travellers - because I found this hellishly difficult to debug.

Escaping in the permalink normalization string

The format of the permalink normalization string has two components

  1. the Regular Expression string
  2. the Replacement string

They appear, one immediately after the other, in the permalink normalization string like so

         Permalink Normalization
    Regular Expression       Replacement
<-------------------------><------------->
/(this)reallyis(intuitive)/\1reallyisn't\2

Importantly, slashes are treated differently in the different parts of the same string.

A slash (and other regex chars) in the Regular Expression part of the string must be escaped. Slashes in the Replacement part of the same string do not need to be escaped and are treated literally.
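To make the escaping rule concrete, here is a small Ruby sketch that splits a normalization string at the first unescaped slash after the opening one. It is an illustration of the format described above, not Discourse’s actual parser. The method name `split_rule` and the example rule are hypothetical.

```ruby
# Split "/<regex>/<replacement>" into [Regexp, replacement String].
# A backslash protects the next character, so "\/" stays inside the regex part.
# Returns nil if the string does not have the expected shape.
def split_rule(rule)
  return nil unless rule.start_with?("/")

  i = 1
  while i < rule.length
    case rule[i]
    when "\\"
      i += 2 # skip the escaped character
    when "/"
      return [Regexp.new(rule[1...i]), rule[(i + 1)..]]
    else
      i += 1
    end
  end
  nil
end

pattern, replacement = split_rule('/old\/(.*)/new/\1')
puts replacement                          # the part after the closing slash
puts "old/abc".sub(pattern, replacement)  # applies the rule to a sample path
```

Because the slash in `old\/` is escaped, the split happens at the slash before `new`, and the slash inside the replacement is kept as is.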

The Format of incoming URL strings

Secondly, and this took me a while to nail down, you match the URL as a relative path description from root but you will not receive the / as the first part of the string.

For example, if the URL that your old forum uses looked like this…

http://oldforum.com/chat/the-topic-title/post/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1

…then the URL that the regular expression in your permalink normalization will match against will look like this…

chat/the-topic-title/post/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1

i.e. a path description from root but without the leading / slash. (I guess that YMMV here depending on the structure of the URLs that you are redirecting - but I don’t think so).

Examples

Here are some examples from my migration project

CATEGORY_LINK_NORMALIZATION = '/(cat)\/(.*?)([#\?].*)?$/cat/\2'
POST_LINK_NORMALIZATION = '/chat\/(.*?)\/(post)\/(.+?)([#\?].*)?$/post/\3'
TOPIC_LINK_NORMALIZATION = '/(chat)\/(.*?)([#\?].*)?$/topic/\2'

The Process

| Old URL | Permalink Normalization | URL Match Text |
|---|---|---|
| http://oldsite.com/cat/history | /(cat)\/(.*?)([#\?].*)?$/cat/\2 | cat/history |
| http://oldsite.com/chat/topic-title/post/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1 | /chat\/(.*?)\/(post)\/(.+?)([#\?].*)?$/post/\3 | post/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1 |
| http://oldsite.com/chat/mindgames-in-football | /(chat)\/(.*?)([#\?].*)?$/topic/\2 | topic/mindgames-in-football |

The Old URL is as it sounds - the URL of the item in the old system.

The permalink normalization (recorded in the permalink_normalizations system setting) will grab the incoming URL (without the leading slash /) and apply the regex match. The resulting normalised URL is then used to match against the URL Match Text entered on the /admin/customize/permalinks screen.

3 Likes

(Marco) #37

Dear all,
I’m currently working on optimising an smf2 to Discourse guide while using it for my own smf2 forum migration.
The smf2.rb import script has a function to create permalinks in case the pretty URL plugin was installed on smf2.
That was not my case. On my forum the links are quite “ugly”:

  1. Link to topic: https://www.someforum.com/index.php?topic=NNN where NNN is the numeric id of a topic
  2. Link to message: https://www.someforum.com/index.php?topic=NNN.msgMMM#msgMMM where NNN is the numeric id of a topic and MMM is the numeric id of a post in that topic (not the incremental counter, but the real post id)

Now, I created a function in smf2.rb which seemingly works OK in both cases (I checked using the Data Explorer extension, the permalinks are created in the DB for both URL types).

It is a different story when requesting a URL from Discourse: case 1 works with no problems, and I get redirected correctly. Case 2 does not work, and I land on the 404 page. I was wondering what could be the cause of this. I’m thinking it may be that the URL scheme of case 2 contains dot and hash characters.

  • Could these characters somehow be breaking the permalink URL recognition?
  • Could it be that I need to assign a value to both topic_id and post_id?

Here is the code of the function I developed.

  def make_old_smf2_permalinks()
    puts 'creating permalinks for forumastronautico.it topics'
    begin
      Permalink.destroy_all # I want a clean slate

      fait_topics = query(<<-SQL, as: :array)
        SELECT t.id_topic, t.id_first_msg
        FROM smf_topics t;
      SQL
      fait_topics.each do |fait_t|
        begin
          t = topic_lookup_from_imported_post_id(fait_t[:id_first_msg])
          Permalink.create(url: "/index.php?topic=#{fait_t[:id_topic]}", topic_id: t[:topic_id]) unless t.nil?
        rescue Exception => e
          puts e.message
          next
        end
      end

      fait_messages = query(<<-SQL, as: :array)
        SELECT m.id_topic, m.id_msg
        FROM smf_messages m;
      SQL
      fait_messages.each do |fait_m|
        begin
          t = topic_lookup_from_imported_post_id(fait_m[:id_msg])
          m = post_id_from_imported_post_id(fait_m[:id_msg]) unless t.nil?
          Permalink.create(url: "/index.php?topic=#{fait_m[:id_topic]}.msg#{fait_m[:id_msg]}\#msg#{fait_m[:id_msg]}", post_id: m) unless t.nil?
        rescue Exception => e
          puts e.message
          next
        end
      end
    rescue Exception => e
      puts e.message
      puts e.backtrace.inspect
    end
  end

As you can see the permalinks to posts are created OK in the DB.

0 Likes

(Cameron:D) #38

I think part of your problem is that the # part of the URL is never sent to the server as part of the request, so maybe try removing that part of it?
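A quick way to see which part of an old link reaches the server is to take it apart with Ruby’s standard `uri` library. The topic and message IDs below (12 and 34) are made-up values in the URL format from the previous post.

```ruby
require "uri"

# Old-style SMF link to a single message, including the anchor.
old_link = URI.parse("https://www.someforum.com/index.php?topic=12.msg34#msg34")

# The fragment stays in the browser and is not part of the HTTP request.
puts old_link.fragment     # msg34

# Path plus query string is what the server receives and can match on.
puts old_link.request_uri  # /index.php?topic=12.msg34
```

So a permalink URL that still contains `#msg34` can never equal what the server is asked for.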

When I did my SMF import I just redirected the URL in the topic=123.msg456 format and used an nginx rewrite to clean up every alternate URL layout (i.e. topic=123.100 for a specific page in a topic, print view, etc.).

location /index.php {
   set $p 0;
   if ($arg_topic ~ "([0-9]+)\.(msg[0-9]+)") {
       set $p $2;
   }
   if ($arg_topic ~ "([0-9]+)(\.[0-9]+)?") {
       set $t $1;
       rewrite ^ /index.php?topic=$t.$p?;
   }
   proxy_set_header Host $http_host;
   proxy_set_header X-Real-IP $remote_addr;
   proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
   proxy_set_header X-Forwarded-Proto $thescheme;
   proxy_http_version 1.1;
   proxy_pass http://discourse;
   break;
}
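The effect of the two `if` blocks can be sketched in Ruby, which may help when generating matching permalink URLs in an importer. This is an approximation for illustration: the method name `canonical_smf_path` is hypothetical, and unlike the unanchored nginx patterns it anchors the match at the start of the `topic` argument.

```ruby
# Reduce any SMF "topic" argument to one canonical form:
#   "123"        -> topic=123.0       (plain topic link)
#   "123.100"    -> topic=123.0       (page offset is dropped)
#   "123.msg456" -> topic=123.msg456  (message links are kept)
# Returns nil when the argument does not start with a numeric topic id.
def canonical_smf_path(topic_arg)
  match = topic_arg.match(/\A(\d+)(?:\.(msg\d+|\d+))?/)
  return nil unless match

  suffix = match[2]&.start_with?("msg") ? match[2] : "0"
  "/index.php?topic=#{match[1]}.#{suffix}"
end

puts canonical_smf_path("123")        # /index.php?topic=123.0
puts canonical_smf_path("123.100")    # /index.php?topic=123.0
puts canonical_smf_path("123.msg456") # /index.php?topic=123.msg456
```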
1 Like

(Jay Pfaffman) #39

I think you might need redirects to handle the stuff after the hash.

1 Like

(Marco) #40

I’ll give it a try. At the end of the day this # part is used to get to that specific message in the page using an anchor that uses the post number as name. I’m afraid that the search engines have the complete URL (including the anchor) saved in their systems…

0 Likes

(Vincent) #41

I don’t think so. nginx will serve the requested page but Discourse won’t jump to the requested post.

4 Likes