"http" gets parsed incorrectly in posts


#1

In a post that contains the string “http”, the text after it might not get rendered if a full URL is posted “near it”.

Example:

The string http with some more words after it.
http://www.discourse.org
More text

The string http with some more words after it.


More text

The same thing also happens to “https”.


(cpradio) #2

Because the cleverness of this isn’t very clear, the above post history reveals the issue (or the raw input).


(Sam Saffron) #3

wow … great catch.


(Richard Hulse) #4

Does this bug come from ruby code or external library? If it’s ruby and there is a bug open in the tracker can I have link? I could have a look into it.


(Sam Saffron) #5

This is from our markdown parser, most likely from this file

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/autolink_dialect.js


(Richard Hulse) #6

OK. I can cope with JS. You want me to have a crack at it?


(Sam Saffron) #7

Sure would be more than happy for you to!


(Richard Hulse) #8

OK. Have looked at the code.

Initial thoughts:

I had a think about what the user posts and expects from their input, and what is likely to be valid. Anyone posting a valid link is always going to include http(s): with the colon. Always. Anyone leaving that off the link is going to put a dot after the www. Always. Well they are if they want to work if it is cut and pasted.

So based on that I would first suggest (before diving into the really hard regexp) a simple change to the tokens that you look for (lines 24-25). Very roughly:

Discourse.Dialect.inlineRegexp(_.merge({start: 'http:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'https:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'www.'}, urlReplacerArgs));

I am not sure if the start token can be a regex, or need to be escaped, but that is the suggestion.

(EDIT: Nope. Problem is deeper. )


(Richard Hulse) #9

To fix the regex itself the start token would have to be non-greedy. Possibly also a simple fix.


(Mittineague) #10

It might be a good idea to also look at

https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/dialect.js

to make sure the spaceOrTagBoundary: true, doesn’t have something to do with the problem.

Curious that both
with some more word
and
://www.discourse.com
have 20 characters


(Kane York) #11

Preliminary testing indicates that the character count does indeed factor into it somehow.


(Richard Hulse) #12

Thank you for the file ref, and yes that is a useful clue.


(Jacob Chapel) #14

Okay, through some trial and error and some looking around I found a regex that passes this edge case and seems to be simple enough but strong enough as well.

/^((?:https?):\/\/(?:-\.)?(?:[^\s\/?\.#-]+\.?)+(?:\/[^\s]*)?)/

I tested it live on here by replacing the existing regex with the above one and testing how it handled the example text from the OP.

Looking at what I can change with the existing regex to have the same properties.

Found the source regex for the above here listed under @imme_emosol: In search of the perfect URL validation regex

Edit: I’ve identified the g and m flags as causing the issue with the current regex.

Should be:

/^((?:https?:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'".,<>?«»“”‘’\s]))/

Edit 2: Just submitted a pull request to fix this issue.

https://github.com/discourse/discourse/pull/3175


(Régis Hanol) #15