"http" gets parsed incorrectly in posts

In a post that contains the string “http”, the text after it might not get rendered if a full URL is posted “near it”.

Example:

The string http with some more words after it.
http://www.discourse.org
More text

The string http with some more words after it.


More text

The same thing also happens to “https”.

12 Likes

Because the cleverness of this isn’t very clear, the above post history reveals the issue (or the raw input).

4 Likes

wow … great catch.

Does this bug come from ruby code or external library? If it’s ruby and there is a bug open in the tracker can I have link? I could have a look into it.

This is from our markdown parser, most likely from this file

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/autolink_dialect.js

OK. I can cope with JS. You want me to have a crack at it?

Sure would be more than happy for you to!

OK. Have looked at the code.

Initial thoughts:

I had a think about what the user posts and expects from their input, and what is likely to be valid. Anyone posting a valid link is always going to include http(s): with the colon. Always. Anyone leaving that off the link is going to put a dot after the www. Always. Well they are if they want to work if it is cut and pasted.

So based on that I would first suggest (before diving into the really hard regexp) a simple change to the tokens that you look for (lines 24-25). Very roughly:

Discourse.Dialect.inlineRegexp(_.merge({start: 'http:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'https:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'www.'}, urlReplacerArgs));

I am not sure if the start token can be a regex, or need to be escaped, but that is the suggestion.

(EDIT: Nope. Problem is deeper. )

To fix the regex itself the start token would have to be non-greedy. Possibly also a simple fix.

It might be a good idea to also look at

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/dialect.js

to make sure the spaceOrTagBoundary: true, doesn’t have something to do with the problem.

Curious that both
with some more word
and
://www.discourse.com
have 20 characters

1 Like

Preliminary testing indicates that the character count does indeed factor into it somehow.

Thank you for the file ref, and yes that is a useful clue.

Okay, through some trial and error and some looking around I found a regex that passes this edge case and seems to be simple enough but strong enough as well.

/^((?:https?):\/\/(?:-\.)?(?:[^\s\/?\.#-]+\.?)+(?:\/[^\s]*)?)/

I tested it live on here by replacing the existing regex with the above one and testing how it handled the example text from the OP.

Looking at what I can change with the existing regex to have the same properties.

Found the source regex for the above here listed under @imme_emosol: In search of the perfect URL validation regex

Edit: I’ve identified the g and m flags as causing the issue with the current regex.

Should be:

/^((?:https?:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'".,<>?«»“”‘’\s]))/

Edit 2: Just submitted a pull request to fix this issue.

https://github.com/discourse/discourse/pull/3175

4 Likes