Redirecting old forum URLs to new Discourse URLs

It does get to the server. What you want to do, @marcozambi, is to make the permalink something like
/oldforumpost/ID and then use permalink_normalizations to re-map /whatever/link.php?1234#ID to /oldforumpost/ID.

1 Like

No, it doesn’t. Try making a request to yoursite.com/#anythingyouwant and looking at the logs (or the requests in your dev tools): the request will just be for /. Once the client loads /, it deals with the #anythingyouwant part itself, usually by scrolling to that part of the page or by handling it with JS.
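The same point can be checked without a server. Here is a minimal Ruby sketch, assuming only the standard `uri` library: the fragment is parsed as a separate component and is not part of the request target that a client sends. The URL is the example from above.

```ruby
require "uri"

# The example URL from above; everything after "#" is the fragment.
uri = URI.parse("https://yoursite.com/#anythingyouwant")

# What an HTTP client puts on the request line: path plus optional query only.
request_target = uri.request_uri

# The fragment stays in the browser and is handled client-side.
fragment = uri.fragment

puts "server sees: " + request_target
puts "client keeps: #" + fragment
```

So a permalink or rewrite rule that depends on the part after `#` has nothing to match against on the server side.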

Alternatively, you can test this with Discourse itself by making a new permalink for, say, test#404; when you try to load that exact URL it will 404. Add a new permalink for just test and it should load happily.

That said, permalink normalisation is new to me, and it may be worth looking at instead of my rewrites.

3 Likes

Continuing my effort to translate old SMF2 forum URLs into Discourse ones, I’m now spending some time on Permalink Normalization so that I can get rid of the unnecessary parts of the SMF2 links.

For example, given this SMF2 link

 https://www.myforum.it/index.php?topic=27962.msg305350#msg305350

I need to get rid of the useless part

.msg305350#msg305350

since I’m already successfully translating old topic ids into the Discourse standard (see here).

In order to do that I’m trying to use this regexp (which works when tested in https://regex101.com/)

/(\/index\.php\?topic=[0-9]+)(\..+)/\1

  • P1: (\/index\.php\?topic=[0-9]+)
  • P2: (\..+)
  • P3: \1

Basically, I need to keep what is matched by P1, while P2 can simply be discarded.
If I understand Permalink Normalization correctly, P1 should contain /index.php?topic=27962, while P2 is generic enough to catch anything after P1 (which is fine).
Setting P3 to \1 should then return /index.php?topic=27962, but I still end up on the 404 error page.

For completeness, here’s the screenshot of my current Permalink Normalization setting.

What am I doing wrong?

2 Likes

The regex shouldn’t start with a slash. So, the following should work:

/(index\.php\?topic=[0-9]+)(\..+)/\1
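To see why the leading slash matters, here is a minimal Ruby sketch. It assumes (based on the fix above, not on Discourse's source) that the path is matched without its leading slash, and that the `#msg305350` fragment never reaches the server. `apply_rule` is a hypothetical helper, not a Discourse API.

```ruby
# Hypothetical helper: apply one "regex -> replacement" normalization to a path.
def apply_rule(path, regex, replacement)
  path.sub(regex, replacement)
end

# Assumption: the path arrives without its leading slash and without the fragment.
path = "index.php?topic=27962.msg305350"

with_slash    = /(\/index\.php\?topic=[0-9]+)(\..+)/
without_slash = /(index\.php\?topic=[0-9]+)(\..+)/

# The pattern with the leading slash never matches, so the path is left unchanged.
unchanged = apply_rule(path, with_slash, '\1')

# The corrected pattern keeps group 1 and drops the ".msg..." suffix.
normalized = apply_rule(path, without_slash, '\1')

puts unchanged
puts normalized
```

Under these assumptions, only the second rule produces a path that an `index.php?topic=27962` permalink can match.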
4 Likes

Perfect! It did the trick. Thank you very much! :medal_military:
I will add this to the SMF2 migration guide…

3 Likes

Back on this topic to report a really weird behavior of the old topic redirection:

  • If, in one of the Discourse messages imported from the old forum, I click an internal link that uses the old SMF2 URL pattern, e.g. https://www.myforum.com/index.php?topic=123, I land on the 404 page, even though I do have a correct permalink for this old topic pointing to the Discourse-style link.
  • If I copy/paste the exact same old SMF2 link, https://www.myforum.com/index.php?topic=123, into the browser’s URL bar, or simply press F5 on the 404 page I was sent to, the redirection works like a charm and I get to the new Discourse-style link.

One of our beta testers (we’re not yet live on Discourse) noticed that when we click an internal old SMF2-style link, a GET request is generated (maybe to keep track of the number of clicks on that link?), something like https://www.myforum.com/clicks/track?https://www.myforum.com/index.php?topic=123&post_id=56789 286778&topic_id=19650&redirect=false&_=1533275034978 , which gets an empty 200 OK response.

If I change the parameter redirect=false to redirect=true in the GET request above, I finally get a 302 Found response whose Location header points to the correct new Discourse-style topic URL.

Any idea on how to avoid this?

What you really need to do is replace the internal links with Discourse URLs in the importer (in the raw post).

1 Like

Yes, that would certainly work.
The reason I went with this approach is that it saves me from a very difficult matching of the old SMF2 topic ids with the new Discourse topic ids, and it also takes care of the pages already indexed by search engines.

I think that an option in the settings to switch off “counting” the clicks of internal links would be a possible solution…

The permalinks will handle the pages indexed by Google, but as you have learned, not the internal links.

The old topic ids should be in the post custom fields table, but it’s usually easier to do in the import rather than afterward, though I’ve done it both ways.
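A rough idea of what rewriting links during the import could look like, as a Ruby sketch. Everything here is hypothetical: `TOPIC_URLS` stands in for whatever lookup the importer has from old SMF2 topic ids to new Discourse URLs, `rewrite_internal_links` is not part of any import script, and the new topic URL is made up.

```ruby
# Hypothetical mapping from old SMF2 topic ids to new Discourse topic URLs.
TOPIC_URLS = {
  "123" => "https://www.myforum.com/t/some-topic/456"
}.freeze

# Matches old-style links such as
# https://www.myforum.com/index.php?topic=123 or ...?topic=123.msg456#msg456
OLD_LINK = %r{https://www\.myforum\.com/index\.php\?topic=(\d+)(?:\.msg\d+)?(?:#msg\d+)?}

# Replace every old link in the raw post whose topic id is known;
# unknown ids are left as they are so permalinks can still catch them.
def rewrite_internal_links(raw)
  raw.gsub(OLD_LINK) { |old| TOPIC_URLS.fetch(Regexp.last_match(1), old) }
end

raw = "See https://www.myforum.com/index.php?topic=123 and https://www.myforum.com/index.php?topic=999."
rewritten = rewrite_internal_links(raw)
puts rewritten
```

Links rewritten this way no longer depend on the permalink redirect, so the click-tracking request is made for the final Discourse URL.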

1 Like

Since the permalinks table can get quite big for large forum migrations, I wonder if it could incorporate some mechanism to store the last time each permalink was used, along with a usage count.

This would allow admins to get some notion of whether they can

  • delete entries from the permalink table (not accessed for a long time, or never)
  • take steps to update old links somewhere out there on the Internet, where possible (those that get used a lot and generate lots of redirects)

Maybe this is already there (I didn’t check, because I don’t know where to check…). Thanks for any comments anyone may have on this idea.

4 Likes

I don’t know if this has been discussed before, but would the devs consider allowing active handling of permalink redirects?

What I mean is: I provide a block of Ruby code for Discourse to run while it’s handling the redirect. I could do my own normalization, look up the post’s import_id, use conditions, etc.

I feel this would allow me to keep things simple and direct while handling all the different kinds of URLs from my old forum. I’m not sure I can do everything with regexp normalizations, and even if I can, it’s way more obscure and obnoxious than straight Ruby code (ok, with a few regexps).

You are going to need a plugin here. I’m very uneasy about allowing admins to run arbitrary Ruby on the server directly from the admin UI.

4 Likes

I wasn’t thinking of doing it from the Admin UI, I was thinking just a file on the server’s file system.

I’m not sure what would be the best way to do it; I’m not familiar with how Discourse integrates external code (plugins, etc.).

Wouldn’t a well-known mechanism be enough? Like “if you place a file here with this or that characteristic”, Discourse will pick it up and use it?

To learn where to put such a file, you’d start with Beginner's Guide to Creating Discourse Plugins - Part 1

1 Like

I’m running into a problem with my permalink normalization. My regexp includes the | pipe character because it specifies alternative matches (the OR operator).

It turns out that Discourse uses that same character to join multiple normalization regexps into a single setting stored in the database. So if I add 3 regexps, foo, bar and baz, they will be stored as foo|bar|baz. This is really not a good idea because it keeps valid regexps from being used…

I suggest joining with a multiple character string like |||.

Two isn’t enough, because it can be used in a regexp: (foo||bar) matches foo, or empty string, or bar. But I can’t think of a case where you would use |||.

Should I open a GitHub issue for this? And can anyone suggest a workaround?

This is the full regexp made with much love and effort:

/(?:.*)(\/)(?'topicid'\d*.)-(.[^\/#\?]*)(?'parm'\?(\w*)[=](?'start'\d+))?(?:#|\/unread|\/reply|\/edit\/|\/)?(?'postid'\d+)?/normalized.${topicid}.${postid}

1 Like

I think I can work around my own problem already, but I’d still like to hear other people’s comments on this (essentially, to know if I should raise a GitHub issue for it). I believe Discourse would be a better app with a bit of improvement here. Thanks.

Did you try escaping the pipe with a backslash?

1 Like

Yes, I did. I am afraid it doesn’t work.

This is not an issue with the regexp parsing, but with Discourse code using Ruby join to concatenate several regexps into a single database entry.

So when it reads from the database, it uses split (with separator char |), and this cuts my regexp at the first | character, regardless of any attempts to “escape” it.

It breaks in the same way whether I do it from the UI or from my script with SiteSetting.permalink_normalizations.
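The effect is easy to reproduce in plain Ruby. This sketch only imitates the join/split round trip described above; it is not Discourse's code, the two rules are made-up examples, and the `|||` separator is the proposal from this thread, not an existing option.

```ruby
# Two made-up normalization rules; the first uses "|" for alternation.
rules = ['/(foo|bar)\/(\d+)/\2', '/baz/qux']

# Round trip with a single "|" as separator: the first rule is cut in two.
stored = rules.join("|")
broken = stored.split("|")

# Escaping does not help, because split looks only for the literal character.
escaped = ['/(foo\|bar)\/(\d+)/\2', '/baz/qux'].join("|").split("|")

# Round trip with the proposed "|||" separator: both rules survive intact.
intact = rules.join("|||").split("|||")

puts broken.inspect
puts escaped.inspect
puts intact.inspect
```

With a single-character separator, two rules come back as three fragments, and neither of the first two is a usable rule on its own.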

Bummer. My guess is that you’ll need to just make more regexes. Or write a custom plugin that will be difficult to maintain.

Or have an external nginx rule rewrite them. That might be what you want.

Since this is an OR operator, I can get by with multiple regexps: I just repeat the same regexp, each time with a different token.
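A minimal Ruby sketch of that workaround, assuming rules are applied one after another. The `t/123/...` paths and the `normalize` helper are hypothetical; they only illustrate that one rule per alternative behaves like the single rule containing `|`.

```ruby
# One rule with alternation; it contains "|", so it cannot be stored as-is.
combined = [/\A(t\/\d+)\/(?:unread|reply)\z/]

# The same rule repeated once per alternative, with no "|" in any of them.
separate = %w[unread reply].map { |token| /\A(t\/\d+)\/#{token}\z/ }

# Hypothetical helper: apply each rule in turn, keeping capture group 1.
def normalize(path, rules)
  rules.reduce(path) { |current, regex| current.sub(regex, '\1') }
end

%w[t/123/unread t/123/reply t/123/other].each do |path|
  puts "#{path} -> #{normalize(path, combined)} / #{normalize(path, separate)}"
end
```

The cost is one stored rule per alternative, which grows quickly when a pattern has several independent alternations.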

But thinking of the project, I don’t see any disadvantage if Discourse were to change its concatenation separator to |||, and it would allow for better compatibility with more regexps.

1 Like