"http" gets parsed incorrectly in posts

anon42064797 · February 5, 2015, 1:55

In a post that contains the string “http”, the text after it might not get rendered if a full URL is posted “near it”.

Example:

The string http with some more words after it.
http://www.discourse.org
More text

The string http with some more words after it.

More text

The same thing also happens to “https”.

cpradio · February 5, 2015, 2:31

Because the rendered example doesn’t make the problem very clear, look at the post’s edit history (or the raw input), which reveals the issue.

sam · February 5, 2015, 9:34

wow … great catch.

rhulse · February 6, 2015, 6:09

Does this bug come from Ruby code or an external library? If it’s Ruby and there is a bug open in the tracker, can I have a link? I could have a look into it.

sam · February 6, 2015, 6:12

This is from our markdown parser, most likely from this file:

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/autolink_dialect.js

rhulse · February 6, 2015, 6:24

OK. I can cope with JS. You want me to have a crack at it?

sam · February 6, 2015, 6:25

Sure, I would be more than happy for you to!

rhulse · February 6, 2015, 6:42

OK. Have looked at the code.

Initial thoughts:

I had a think about what users post and expect from their input, and what is likely to be valid. Anyone posting a valid link is always going to include http(s): with the colon. Always. Anyone leaving that off the link is going to put a dot after the www. Always. Well, they will if they want the link to work when it is cut and pasted.

So based on that I would first suggest (before diving into the really hard regexp) a simple change to the tokens that you look for (lines 24-25). Very roughly:

Discourse.Dialect.inlineRegexp(_.merge({start: 'http:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'https:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'www.'}, urlReplacerArgs));

I am not sure if the start token can be a regex, or needs to be escaped, but that is the suggestion.

(EDIT: Nope. Problem is deeper. )

rhulse · February 6, 2015, 6:43

To fix the regex itself the start token would have to be non-greedy. Possibly also a simple fix.

Mittineague · February 6, 2015, 6:49

It might be a good idea to also look at

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/dialect.js

to make sure that spaceOrTagBoundary: true doesn’t have something to do with the problem.

Curious that both
" with some more word" (counting the leading space)
and
"://www.discourse.org"
have 20 characters
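
A quick way to check that observation, as a sketch only: the strings below are copied from the example in the first post, and the variable names are purely illustrative. It compares the part of the URL after "http" with the same number of characters following the bare "http" in the sentence.

```javascript
// Strings taken from the example in the first post.
const fullUrl = "http://www.discourse.org";
const sentence = "The string http with some more words after it.";

// Everything after "http" in the URL.
const urlTail = fullUrl.slice("http".length);

// The same number of characters after the bare "http" in the sentence.
const start = sentence.indexOf("http") + "http".length;
const textTail = sentence.slice(start, start + urlTail.length);

console.log(urlTail.length);           // 20
console.log(JSON.stringify(textTail)); // " with some more word"
```

So the bare "http" plus the 20 characters that follow it is exactly as long as the full URL (24 characters).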

riking · February 6, 2015, 6:52

Preliminary testing indicates that the character count does indeed factor into it somehow.

rhulse · February 6, 2015, 6:58

Thank you for the file ref, and yes that is a useful clue.

chapel · February 6, 2015, 8:30

Okay, through some trial and error and some looking around I found a regex that passes this edge case and seems to be simple enough but strong enough as well.

/^((?:https?):\/\/(?:-\.)?(?:[^\s\/?\.#-]+\.?)+(?:\/[^\s]*)?)/
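
For anyone who wants to try this candidate outside the editor, here is a minimal sketch. The two input strings are assumptions about what the inline parser hands to the regex (the text starting at the bare "http", and the text starting at the real URL); they are not taken from the Discourse source.

```javascript
// Candidate regex quoted above, with no flags.
const candidate = /^((?:https?):\/\/(?:-\.)?(?:[^\s\/?\.#-]+\.?)+(?:\/[^\s]*)?)/;

// Hypothetical inputs: the text as seen from the bare "http" and from the real URL.
const fromBareHttp = "http with some more words after it.\nhttp://www.discourse.org\nMore text";
const fromRealUrl = "http://www.discourse.org\nMore text";

console.log(candidate.exec(fromBareHttp));   // null: "http" is not followed by "://"
console.log(candidate.exec(fromRealUrl)[1]); // "http://www.discourse.org"
```

Note that this pattern requires a scheme, so it would not pick up links that start with a bare www.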

I tested it live on here by replacing the existing regex with the above one and testing how it handled the example text from the OP.

I’m now looking at what I can change in the existing regex to give it the same properties.

I found the source regex for the above, listed under @imme_emosol, here: In search of the perfect URL validation regex

Edit: I’ve identified the g and m flags as causing the issue with the current regex.

Should be:

/^((?:https?:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'".,<>?«»“”‘’\s]))/
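
A rough sketch of why the two flags matter. Two things here are assumptions, not facts from the Discourse source: that the previous version was this same pattern with `gm` appended, and that the caller treats the returned match as starting at offset 0 of the text it passed in. With `m`, `^` also matches at the start of every line; with `g`, `String.prototype.match` returns a plain array of matched strings without an `index`.

```javascript
// The regex from the post above, with no flags.
const fixed = /^((?:https?:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'".,<>?«»“”‘’\s]))/;

// Assumed previous form: the same pattern with the g and m flags.
const flagged = new RegExp(fixed.source, "gm");

// Text as the inline parser might see it, starting at the bare "http".
const text = "http with some more words after it.\nhttp://www.discourse.org\nMore text";

const before = text.match(flagged); // matches at the start of line 2
const after = text.match(fixed);    // ^ only matches at offset 0, so no match

// If the caller assumes the match begins at offset 0, it swallows this much text:
const consumed = text.slice(0, before[0].length);

console.log(before);   // [ 'http://www.discourse.org' ]
console.log(after);    // null
console.log(consumed); // "http with some more word"
```

Under those assumptions, the bare "http" and the 20 characters after it get replaced by a link to the later URL, which would fit the character-count observation earlier in the thread.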

Edit 2: Just submitted a pull request to fix this issue.

github.com/discourse/discourse

FIX: Remove g and m flags from autolink regex

master ← chapel:fix-autolink-regex

merged 08:56AM - 06 Feb 15 UTC

chapel

+1 -1

Fixes issue where a non-link is falsely identified as a link and given the link …text of a later link. More info in thread: https://meta.discourse.org/t/http-gets-parsed-incorrectly-in-posts/24866 The `g` and `m` flags together caused the above case. How this regex works, there isn't any reason to use `g` as the regex should match at the start of the given text being tested, and for `m` that is for matching across lines, and links are not valid with multiple lines.
