"http" gets parsed incorrectly in posts

In a post that contains the string “http”, the text after it might not get rendered if a full URL is posted “near it”.

Example:

The string http with some more words after it.
http://www.discourse.org
More text

The string http with some more words after it.


More text

The same thing also happens to “https”.

12 « J'aime »

Because the cleverness of this isn’t very clear, the above post history reveals the issue (or the raw input).

4 « J'aime »

wow … great catch.

Does this bug come from ruby code or external library? If it’s ruby and there is a bug open in the tracker can I have link? I could have a look into it.

This is from our markdown parser, most likely from this file

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/autolink_dialect.js

OK. I can cope with JS. You want me to have a crack at it?

Sure would be more than happy for you to!

OK. Have looked at the code.

Initial thoughts:

I had a think about what the user posts and expects from their input, and what is likely to be valid. Anyone posting a valid link is always going to include http(s): with the colon. Always. Anyone leaving that off the link is going to put a dot after the www. Always. Well they are if they want to work if it is cut and pasted.

So based on that I would first suggest (before diving into the really hard regexp) a simple change to the tokens that you look for (lines 24-25). Very roughly:

Discourse.Dialect.inlineRegexp(_.merge({start: 'http:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'https:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'www.'}, urlReplacerArgs));

I am not sure if the start token can be a regex, or need to be escaped, but that is the suggestion.

(EDIT: Nope. Problem is deeper. )

To fix the regex itself the start token would have to be non-greedy. Possibly also a simple fix.

It might be a good idea to also look at

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/dialect.js

to make sure the spaceOrTagBoundary: true, doesn’t have something to do with the problem.

Curious that both
with some more word
and
://www.discourse.com
have 20 characters

1 « J'aime »

Preliminary testing indicates that the character count does indeed factor into it somehow.

Thank you for the file ref, and yes that is a useful clue.

Okay, through some trial and error and some looking around I found a regex that passes this edge case and seems to be simple enough but strong enough as well.

/^((?:https?):\/\/(?:-\.)?(?:[^\s\/?\.#-]+\.?)+(?:\/[^\s]*)?)/

I tested it live on here by replacing the existing regex with the above one and testing how it handled the example text from the OP.

Looking at what I can change with the existing regex to have the same properties.

Found the source regex for the above here listed under @imme_emosol: In search of the perfect URL validation regex

Edit: I’ve identified the g and m flags as causing the issue with the current regex.

Should be:

/^((?:https?:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'".,<>?«»“”‘’\s]))/

Edit 2: Just submitted a pull request to fix this issue.

https://github.com/discourse/discourse/pull/3175

4 « J'aime »