"http" gets parsed incorrectly in posts

anon42064797 · February 5, 2015, 1:55pm

In a post that contains the string “http”, the text after it might not get rendered if a full URL is posted “near it”.

Example:

The string http with some more words after it.
http://www.discourse.org
More text

The string http with some more words after it.

More text

The same thing also happens to “https”.

cpradio · February 5, 2015, 2:31pm

Because the rendered post above hides the problem, look at its edit history (or its raw input) to see the issue.

sam · February 5, 2015, 9:34pm

wow … great catch.

rhulse · February 6, 2015, 6:09am

Does this bug come from Ruby code or an external library? If it’s Ruby and there is a bug open in the tracker, can I have a link? I could have a look into it.

sam · February 6, 2015, 6:12am

This is from our markdown parser, most likely from this file

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/autolink_dialect.js

rhulse · February 6, 2015, 6:24am

OK. I can cope with JS. You want me to have a crack at it?

sam · February 6, 2015, 6:25am

Sure would be more than happy for you to!

rhulse · February 6, 2015, 6:42am

OK. Have looked at the code.

Initial thoughts:

I had a think about what users post, what they expect from their input, and what is likely to be valid. Anyone posting a valid link is always going to include http(s): with the colon. Always. Anyone leaving that off the link is going to put a dot after the www. Always. Well, they are if they want the link to work when it is cut and pasted.

So based on that I would first suggest (before diving into the really hard regexp) a simple change to the tokens that you look for (lines 24-25). Very roughly:

Discourse.Dialect.inlineRegexp(_.merge({start: 'http:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'https:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'www.'}, urlReplacerArgs));

I am not sure if the start token can be a regex, or need to be escaped, but that is the suggestion.
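
As a rough illustration of the token idea (not Discourse’s actual parser), the sketch below scans a post for candidate start tokens. The helper findTokenStarts and both token lists are hypothetical; the assumption is that the current tokens are the bare strings "http" and "www", which would explain why the plain word “http” gets picked up.

```javascript
// Hypothetical helper, not part of Discourse: returns every position where
// one of the start tokens occurs in the text.
function findTokenStarts(text, tokens) {
  const hits = [];
  for (const token of tokens) {
    let from = 0;
    let index;
    while ((index = text.indexOf(token, from)) !== -1) {
      hits.push({ token, index });
      from = index + 1;
    }
  }
  // Sort by position so the result reads left to right.
  return hits.sort((a, b) => a.index - b.index);
}

const post =
  "The string http with some more words after it.\nhttp://www.discourse.org\nMore text";

// Assumed current tokens: the bare word "http" counts as a candidate.
const looseHits = findTokenStarts(post, ["http", "www"]);

// Suggested tokens: only "http:", "https:" and "www." count.
const strictHits = findTokenStarts(post, ["http:", "https:", "www."]);

console.log(looseHits.map((h) => h.index));
console.log(strictHits.map((h) => h.index));
```

With the stricter tokens the bare word no longer shows up as a candidate, although that only narrows where the regex is tried; it does not change what the regex itself matches.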

(EDIT: Nope. Problem is deeper. )

rhulse · February 6, 2015, 6:43am

To fix the regex itself the start token would have to be non-greedy. Possibly also a simple fix.

Mittineague · February 6, 2015, 6:49am

It might be a good idea to also look at

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/dialect.js

to make sure that spaceOrTagBoundary: true doesn’t have something to do with the problem.

Curious that both
with some more word
and
://www.discourse.org
have 20 characters
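
A quick check of that count, assuming the space right after “http” is included in the first fragment and using the URL from the original post (the variable names are only for illustration):

```javascript
// Text that follows the bare word "http" in the example, up to "word",
// including the leading space (an assumption about how it was counted).
const afterBareHttp = " with some more word";

// Remainder of the URL from the original post after its "http" prefix.
const afterUrlHttp = "://www.discourse.org";

console.log(afterBareHttp.length, afterUrlHttp.length);

// Equivalently, "http" plus each fragment spans the same number of characters.
console.log(("http" + afterBareHttp).length === ("http" + afterUrlHttp).length);
```

Both fragments come out at 20 characters once the leading space is counted, so the span starting at the bare “http” is exactly as long as the full URL.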

riking · February 6, 2015, 6:52am

Preliminary testing indicates that the character count does indeed factor into it somehow.

rhulse · February 6, 2015, 6:58am

Thank you for the file ref, and yes that is a useful clue.

chapel · February 6, 2015, 8:30am

Okay, through some trial and error and some looking around I found a regex that passes this edge case and seems to be simple enough but strong enough as well.

/^((?:https?):\/\/(?:-\.)?(?:[^\s\/?\.#-]+\.?)+(?:\/[^\s]*)?)/

I tested it live on here by replacing the existing regex with the above one and testing how it handled the example text from the OP.
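
The same check can be reproduced outside the forum with a few lines of JavaScript. This is only a sketch: it applies the regex above to two hand-picked substrings of the example post, whereas the real parser decides where matching starts.

```javascript
// Regex quoted above, used without any flags.
const urlRegex = /^((?:https?):\/\/(?:-\.)?(?:[^\s\/?\.#-]+\.?)+(?:\/[^\s]*)?)/;

// Text as the parser might see it when it stops at the bare word "http".
const fromBareWord =
  "http with some more words after it.\nhttp://www.discourse.org\nMore text";

// Text as the parser might see it when it stops at the real URL.
const fromUrl = "http://www.discourse.org\nMore text";

const bareMatch = urlRegex.exec(fromBareWord);
const urlMatch = urlRegex.exec(fromUrl);

console.log(bareMatch); // the bare word is not followed by "://", so no match
console.log(urlMatch && urlMatch[1]); // the URL on its own, without the next line
```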

Looking at what I can change with the existing regex to have the same properties.

I found the source regex for the above listed under @imme_emosol in “In search of the perfect URL validation regex”.

Edit: I’ve identified the g and m flags as causing the issue with the current regex.

Should be:

/^((?:https?:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'".,<>?«»“”‘’\s]))/
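
To see why those two flags matter, here is a minimal sketch with a simplified stand-in pattern (not the real Discourse regex). With m, ^ also matches at the start of every line, so a pattern meant to match only at the current position can latch onto a URL further down; with g, the regex object remembers lastIndex between calls. How exactly the parser consumed such a match is not shown in this thread, so treat the link to the bug as an assumption.

```javascript
// Simplified stand-in for the URL pattern; an assumption for illustration.
const plain = /^(https?:\/\/[^\s]+)/;
const multiline = /^(https?:\/\/[^\s]+)/m;

const text =
  "http with some more words after it.\nhttp://www.discourse.org\nMore text";

// Without flags, ^ only matches at the very start of the string.
const plainMatch = plain.exec(text);

// With the m flag, ^ also matches after each newline, so the URL on the
// second line is found even though the string starts with a bare "http".
const multilineMatch = multiline.exec(text);

console.log(plainMatch);
console.log(multilineMatch && multilineMatch[1], multilineMatch && multilineMatch.index);

// With the g flag, test() advances lastIndex, so repeated calls on the same
// regex object can give different answers for the same input.
const globalRe = /^http/g;
const firstCall = globalRe.test("http://example.com");
const secondCall = globalRe.test("http://example.com");
console.log(firstCall, secondCall);
```

In this sketch the URL matched on the second line is 24 characters long, the same length as “http with some more word”. That would fit the character-count observation earlier in the thread if the parser consumed the match length starting from the bare “http”, though that part is a guess.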

Edit 2: Just submitted a pull request to fix this issue.

https://github.com/discourse/discourse/pull/3175
