Redirecting old forum URLs to new Discourse URLs

import
(Régis Hanol) #32

Permalink.destroy_all is shorter and more efficient :wink:

10 Likes
split this topic #33

7 posts were split to a new topic: Discourse to WordPress redirect questions

(Danny Goodall) #34

Apologies for opening an old thread but it seems like a good place for my question to sit as some of what I’m asking has been touched on but not fully answered.

I’m trying to ensure that I understand the workflow for permalink normalisation and, as others have said, there really doesn’t seem to be a great deal of documentation around this.

Can I just confirm my understanding / misunderstanding of the permalink normalisation process or at least the process that normalisation plays in redirects?

  1. URL comes in and isn’t matched to any route
  2. Before 404 is thrown - we check for a permalink rule matching our URL
  3. Before we attempt to match the URL, we apply a permalink_normalization regex on the inbound URL turning it into a new string
  4. We look for an exact match between the new string generated in 3. and the url column in the permalinks table
  5. If we find a match we redirect the visitor to the relevant category / topic / post described in the permalinks row.

IF that is the correct flow, can I ask

  1. What strategies do people use to generate the new string from the regex? Presumably, regardless of the incoming url, we could just generate /topic/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1, /post/4a512429-0e2d-4437-826c-a7590144617c or /category/elephants (yes, MVCF does use UUID descriptors on the url for topics and posts!)
  2. As you can have multiple permalink_normalization entries, are they applied in order until a match is found or a 404 is raised?
  3. Any other gotchas?

Thanks

3 Likes
(Jay Pfaffman) #35

Yup, I think that’s it.

1 Like
(Danny Goodall) #36

Thanks for the advice/validation (AGAIN) @pfaffman, I did manage to get the redirects working.

Just wanted to circle back to this to mention a few of the gotchas that I found and perhaps leave some breadcrumbs for future travellers - because I found this hellishly difficult to debug.

Escaping in the permalink normalization string

The format of the permalink normalization string has two components

  1. the Regular Expression string
  2. the Replacement string

They appear, one immediately after the other, in the permalink normalization string like so

         Permalink Normalization
    Regular Expression       Replacement
<-------------------------><------------->
/(this)reallyis(intuitive)/\1reallyisn't\2

Importantly, slashes are treated differently in the different parts of the same string.

A slash (and other regex chars) in the Regular Expression part of the string must be escaped. Slashes in the Replacement part of the same string do not need to be escaped and are treated literally.
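To make the escaping rule concrete, here is a minimal Ruby sketch of how such a string could be split into its two parts. `split_normalization` is a hypothetical helper written for illustration, not Discourse's own parser. It assumes the rule starts with a slash and that the first unescaped slash after it ends the Regular Expression part.

```ruby
# Hypothetical helper (not Discourse's own code): split a rule of the form
# "/regex/replacement" at the first unescaped slash after the leading one.
def split_normalization(rule)
  raise ArgumentError, "rule must start with /" unless rule.start_with?("/")

  escaped = false
  rule.each_char.with_index do |char, index|
    next if index.zero? # skip the opening slash

    if char == "/" && !escaped
      pattern = rule[1...index]
      replacement = rule[(index + 1)..-1] || ""
      return [Regexp.new(pattern), replacement]
    end

    # A backslash escapes the next character, unless it is itself escaped.
    escaped = (char == "\\" && !escaped)
  end

  raise ArgumentError, "no unescaped closing / found"
end

regex, replacement = split_normalization('/(cat)\/(.*?)([#\?].*)?$/cat/\2')
puts replacement                                   # cat/\2
puts "cat/history?page=2".sub(regex, replacement)  # cat/history
```

The escaped `\/` inside the regex is skipped by the scan, while the plain `/` in `cat/\2` lands in the replacement untouched, which is why only the first part needs escaping.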

The Format of incoming URL strings

Secondly, and this took me a while to nail down, you match the URL as a relative path description from root but you will not receive the / as the first part of the string.

For example, if the URL that your old forum uses looked like this…

http://oldforum.com/chat/the-topic-title/post/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1

…then the URL that the regular expression in your permalink normalization will match against will look like this…

chat/the-topic-title/post/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1

i.e. a path description from root but without the leading / slash. (I guess that YMMV here depending on the structure of the URLs that you are redirecting - but I don’t think so).

Examples

Here are some examples from my migration project

CATEGORY_LINK_NORMALIZATION = '/(cat)\/(.*?)([#\?].*)?$/cat/\2'
POST_LINK_NORMALIZATION = '/chat\/(.*?)\/(post)\/(.+?)([#\?].*)?$/post/\3'
TOPIC_LINK_NORMALIZATION = '/(chat)\/(.*?)([#\?].*)?$/topic/\2'

The Process

| Old URL | Permalink Normalization | URL Match Text |
|---|---|---|
| http://oldsite.com/cat/history | /(cat)\/(.*?)([#\?].*)?$/cat/\2 | cat/history |
| http://oldsite.com/chat/topic-title/post/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1 | /chat\/(.*?)\/(post)\/(.+?)([#\?].*)?$/post/\3 | post/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1 |
| http://oldsite.com/chat/mindgames-in-football | /(chat)\/(.*?)([#\?].*)?$/topic/\2 | topic/mindgames-in-football |

The Old URL is as it sounds - the URL of the item in the old system.

The permalink normalization (recorded in the permalink_normalizations system setting) will grab the incoming URL (without the leading slash /) and apply the regex match. The resulting normalised URL is then used to match against the URL Match Text entered on the /admin/customize/permalinks screen.
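The process described above can be sketched in plain Ruby. This is an illustration under assumptions, not Discourse's implementation: the `PERMALINKS` hash stands in for the permalinks table, its ids are made up, and applying every rule in order is my assumption about how multiple entries combine.

```ruby
require "uri"

# The three rules from the examples, as [regex, replacement] pairs.
NORMALIZATIONS = [
  [Regexp.new('(cat)\/(.*?)([#\?].*)?$'), 'cat/\2'],
  [Regexp.new('chat\/(.*?)\/(post)\/(.+?)([#\?].*)?$'), 'post/\3'],
  [Regexp.new('(chat)\/(.*?)([#\?].*)?$'), 'topic/\2'],
].freeze

# Hypothetical stand-in for the permalinks table (ids are invented).
PERMALINKS = {
  "cat/history" => { category_id: 5 },
  "post/d9aa09c3-19bd-4c6e-9d8d-a8f1008000a1" => { post_id: 1234 },
  "topic/mindgames-in-football" => { topic_id: 77 },
}.freeze

# Turn an old URL into the text that is compared with the permalink url column.
def match_text_for(old_url)
  uri = URI.parse(old_url)
  path = uri.path.sub(%r{\A/}, "") # the leading slash is not part of the input
  path += "?#{uri.query}" if uri.query
  # Assumption: each rule is applied in turn to the result of the previous one.
  NORMALIZATIONS.reduce(path) { |url, (regex, replacement)| url.sub(regex, replacement) }
end

# Exact-match lookup; nil here corresponds to falling through to the 404 page.
def lookup(old_url)
  PERMALINKS[match_text_for(old_url)]
end

puts match_text_for("http://oldsite.com/chat/mindgames-in-football")
# topic/mindgames-in-football
```

Running the old URLs through `match_text_for` before going live is a cheap way to check that each one produces exactly the text entered on the permalinks screen.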

3 Likes
(Marco) #37

Dear all,
I’m currently working on optimising an smf2 to Discourse guide while using it for my own smf2 forum migration.
The smf2.rb import script has a function to create permalinks if the pretty URL plugin was installed on smf2.
That was not my case. On my forum the links are quite “ugly”:

  1. Link to topic: https://www.someforum.com/index.php?topic=NNN where NNN is the numeric id of a topic
  2. Link to message: https://www.someforum.com/index.php?topic=NNN.msgMMM#msgMMM where NNN is the numeric id of a topic and MMM is the numeric id of a post in that topic (not the incremental counter, but the real post id)

Now, I created a function in smf2.rb which seemingly works OK in both cases (I checked using the Data Explorer extension, the permalinks are created in the DB for both URL types).

It is a different story when requesting a URL from Discourse: case 1 works with no problems, and I get redirected correctly. Case 2 does not work, and I land on the 404 page. I was wondering what could be causing this. I suspect it is because the URL scheme of case 2 contains dot and hash characters.

  • Could these characters somehow be breaking the permalink URL recognition?
  • Could it be that I need to assign a value to both topic_id and post_id?

Here is the code of the function I developed.

  def make_old_smf2_permalinks()
    puts 'creating permalinks for forumastronautico.it topics'
    begin
      Permalink.destroy_all # I want a clean slate

      fait_topics = query(<<-SQL, as: :array)
        SELECT t.id_topic, t.id_first_msg
        FROM smf_topics t;
      SQL
      fait_topics.each do |fait_t|
        begin
          t = topic_lookup_from_imported_post_id(fait_t[:id_first_msg])
          Permalink.create(url: "/index.php?topic=#{fait_t[:id_topic]}", topic_id: t[:topic_id]) unless t.nil?
        rescue Exception => e
          puts e.message
          next
        end
      end

      fait_messages = query(<<-SQL, as: :array)
        SELECT m.id_topic, m.id_msg
        FROM smf_messages m;
      SQL
      fait_messages.each do |fait_m|
        begin
          t = topic_lookup_from_imported_post_id(fait_m[:id_msg])
          m = post_id_from_imported_post_id(fait_m[:id_msg]) unless t.nil?
          Permalink.create(url: "/index.php?topic=#{fait_m[:id_topic]}.msg#{fait_m[:id_msg]}\#msg#{fait_m[:id_msg]}", post_id: m) unless t.nil?
        rescue Exception => e
          puts e.message
          next
        end
      end
    rescue Exception => e
      puts e.message
      puts e.backtrace.inspect
    end
  end

As you can see the permalinks to posts are created OK in the DB.

(Cameron:D) #38

I think part of your problem is that the # part of the URL is never sent to the server as part of the request, so maybe try removing that part of it?

When I did my SMF import I just redirected the URL in the topic=123.msg456 format and used an nginx rewrite to clean up every alternate URL layout (i.e. topic=123.100 for a specific page in a topic, print view, etc.).

location /index.php {
   set $p 0;
   if ($arg_topic ~ "([0-9]+)\.(msg[0-9]+)") {
       set $p $2;
   }
   if ($arg_topic ~ "([0-9]+)(\.[0-9]+)?") {
       set $t $1;
       rewrite ^ /index.php?topic=$t.$p?;
   }
   proxy_set_header Host $http_host;
   proxy_set_header X-Real-IP $remote_addr;
   proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
   proxy_set_header X-Forwarded-Proto $thescheme;
   proxy_http_version 1.1;
   proxy_pass http://discourse;
   break;
}
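
For readers who do not use nginx, here is a Ruby sketch of what that location block appears to do to the `topic` argument. This is my reading of the config, not something stated in the post, and `canonical_smf_topic` is a hypothetical name: a `.msgNNN` suffix is kept, and any other suffix (or none) becomes `.0`.

```ruby
# Hypothetical Ruby equivalent of the rewrite logic, based on my reading of
# the nginx config: keep a ".msgNNN" suffix, otherwise fall back to ".0".
def canonical_smf_topic(topic_arg)
  suffix = topic_arg[/[0-9]+\.(msg[0-9]+)/, 1] || "0" # mirrors `set $p`
  topic_id = topic_arg[/[0-9]+/]                       # mirrors `set $t`
  return nil unless topic_id

  "/index.php?topic=#{topic_id}.#{suffix}"
end

puts canonical_smf_topic("123.msg456") # /index.php?topic=123.msg456
puts canonical_smf_topic("123.100")    # /index.php?topic=123.0
```

Under that reading, the permalinks stored in Discourse only need to cover the `topic=N.msgM` and `topic=N.0` forms.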
1 Like
(Jay Pfaffman) #39

I think you might need redirects to handle the stuff after the hash.

1 Like
(Marco) #40

I’ll give it a try. At the end of the day this # part is used to get to that specific message in the page using an anchor that uses the post number as name. I’m afraid that the search engines have the complete URL (including the anchor) saved in their systems…

(Vincent) #41

I don’t think so. nginx will serve the requested page but Discourse won’t jump to the requested post.

4 Likes
(Jay Pfaffman) #42

It does get to the server. What you want to do, @marcozambi, is to make the permalink be something like
/oldforumpost/ID and then use permalink_normalizations to re-map /whatever/link.php?1234#ID to /oldforumpost/ID.

1 Like
(Cameron:D) #43

No it doesn’t, try making a request to yoursite.com/#anythingyouwant and looking at the logs (or the requests in your dev tools), the request will just be for /, and once the client loads / it will deal with the #anythingyouwant part, usually by scrolling to that part of the page, or handled with JS.

Alternatively you can test this with Discourse itself by making a new permalink for, say, test#404, and when you try to load that exact url it will 404. Add a new permalink for just test and it should load happily.
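
The split between what is requested and what stays in the browser can also be seen with Ruby's standard `uri` library. This is only an illustration of how a URL is divided; the URL itself is an example in the SMF style discussed above.

```ruby
require "uri"

uri = URI.parse("https://www.someforum.com/index.php?topic=123.msg456#msg456")

# The path and query are what a browser puts in the HTTP request line.
puts uri.request_uri # /index.php?topic=123.msg456

# The fragment is kept by the client and handled after the page loads.
puts uri.fragment    # msg456
```

So a permalink whose url column contains `#msg456` has nothing to match against on the server side.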

Now, the permalink normalisation is something new to me and may be worth looking at that instead of my rewrites.

3 Likes
(Marco) #44

Continuing my effort to translate old SMF2 forum URLs into Discourse ones, I’m now spending some time on the Permalink Normalization so that I can get rid of the unnecessary parts of the SMF2 links.

For example, given this SMF2 link

 https://www.myforum.it/index.php?topic=27962.msg305350#msg305350

I need to get rid of the superfluous part

.msg305350#msg305350

as I’m already successfully translating old topic ids into the discourse standard (see here).

In order to do that I’m trying to use this regexp (which works when tested in https://regex101.com/)

 | P1                       | P2  | 
 |--------------------------|-----|
/(\/index\.php\?topic=[0-9]+)(\..+)/\1

Basically I need to keep what is found in P1, while P2 can just be discarded.
If I understood correctly how to use Permalink Normalization, P1 should contain /index.php?topic=27962 while P2 is so generic that it will catch anything after P1 (which is fine).
Setting the replacement part (P3) to \1 should then return /index.php?topic=27962, but I still end up on the 404 error page.

For completeness, here’s the screenshot of my current Permalink Normalization setting.

What am I doing wrong?

2 Likes
(Gerhard Schlager) #45

The regex shouldn’t start with a slash. So, the following should work:

/(index\.php\?topic=[0-9]+)(\..+)/\1
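
A quick way to see the difference is to try both regexes in Ruby against the string as it arrives, i.e. without the leading slash. The variable names here are only for illustration.

```ruby
# The incoming string has no leading slash.
incoming = "index.php?topic=27962.msg305350"

with_slash    = Regexp.new('(\/index\.php\?topic=[0-9]+)(\..+)')
without_slash = Regexp.new('(index\.php\?topic=[0-9]+)(\..+)')

# The first regex never matches, so the string passes through unchanged.
puts incoming.sub(with_slash, '\1')    # index.php?topic=27962.msg305350

# The second one matches and drops everything from the dot onwards.
puts incoming.sub(without_slash, '\1') # index.php?topic=27962
```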
4 Likes
(Marco) #46

Perfect! It did the trick. Thank you very much! :medal_military:
I will add this to the SMF2 migration guide…

3 Likes
(Marco) #47

Back on this topic to report a really weird behavior of the old topic redirection:

  • if in one of the Discourse messages imported from the old forum I click on an internal link making use of the old SMF2 URL pattern, e.g. https://www.myforum.com/index.php?topic=123, even if I have (and I do have) a correct permalink set for this old topic pointing to the Discourse-type link, I get landed on the 404 page.
  • if I copy/paste the exact same old SMF2 link above https://www.myforum.com/index.php?topic=123 into the browser’s URL bar, or I simply press F5 from the 404 page where I was redirected in the first place, the redirection works like a charm and I get to the new Discourse-type link.

One of our beta testers (we’re still not operational with Discourse) has noticed that when we click on an internal old SMF2-style link, a GET request is generated (maybe for keeping track of the number of clicks on that link?) which is something like https://www.myforum.com/clicks/track?https://www.myforum.com/index.php?topic=123&post_id=56789 286778&topic_id=19650&redirect=false&_=1533275034978 , which gets an empty 200 OK response.

If I modify parameter redirect=false to redirect=true in the GET request above, then I finally get a 302 FOUND response whose header sends to the correct new discourse-like topic URL.

Any idea on how to avoid this?

(Jay Pfaffman) #48

What you really need to do is to replace the internal links with discourse urls in the importer (in the raw post).
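
As a rough sketch of that idea in Ruby: the regex, the `TOPIC_URLS` mapping and `rewrite_internal_links` below are all hypothetical, and in a real import the mapping would have to come from the importer's own lookup of old ids (for example via `topic_lookup_from_imported_post_id`, as used earlier in this thread).

```ruby
# Hypothetical mapping from old SMF2 topic id to new Discourse topic URL.
TOPIC_URLS = { 123 => "/t/some-topic/19650" }.freeze

# Matches old topic links, with an optional ".msgNNN" / ".NNN" suffix and anchor.
OLD_LINK = %r{https?://www\.myforum\.com/index\.php\?topic=(\d+)(?:\.(?:msg)?\d+)?(?:#msg\d+)?}

# Replace known old links in a post's raw text; unknown ones are left as-is.
def rewrite_internal_links(raw, topic_urls)
  raw.gsub(OLD_LINK) do |old_link|
    topic_urls.fetch(Regexp.last_match(1).to_i, old_link)
  end
end

raw = "See https://www.myforum.com/index.php?topic=123.msg456#msg456 for details."
puts rewrite_internal_links(raw, TOPIC_URLS)
# See /t/some-topic/19650 for details.
```

Leaving unmatched links untouched means they still fall back to the permalink redirect when opened directly.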

1 Like
(Marco) #49

Yes, that would certainly work.
The reason I went with this approach is that it saves me from doing a very difficult matching of the old SMF2 topic ids with the new Discourse topic ids, and it takes care of the pages already indexed by search engines.

I think that an option in the settings to switch off “counting” the clicks of internal links would be a possible solution…

(Jay Pfaffman) #50

The permalinks will handle the pages indexed by Google, but as you have learned, not the internal links.

The old topic ids should be in the post custom fields table, but it’s usually easier to do in the import rather than afterward, though I’ve done it both ways.

1 Like
#51

Since the permalinks table can get quite big for large forum migrations, I wonder if it could incorporate some mechanism to store the last time each permalink was used, and a count.

This would allow admins to get some notion of whether they can

  • delete entries from the permalink table (not accessed for a long time, or never)
  • take steps to update old links somewhere out there on the Internet, where possible (those that get used a lot and generate lots of redirects)

Maybe this is already there (I didn’t check, because I don’t know where to check…). Thanks for any comments anyone may have on this idea.

4 Likes