In a post that contains the string “http”, the text after it might not get rendered if a full URL is posted “near it”.
Example:
The string http with some more words after it.
http://www.discourse.org
More text
The string http with some more words after it.
Discourse is modern forum software for your community. Use it as a mailing list, discussion forum, long-form chat room, and more!
More text
The same thing also happens to “https”.
12 Likes
cpradio
(cpradio)
February 5, 2015, 2:31pm
2
Because the cleverness of this isn’t very clear, the above post history reveals the issue (or the raw input).
4 Likes
rhulse
(Richard Hulse)
February 6, 2015, 6:09am
4
Does this bug come from ruby code or external library? If it’s ruby and there is a bug open in the tracker can I have link? I could have a look into it.
sam
(Sam Saffron)
February 6, 2015, 6:12am
5
rhulse
(Richard Hulse)
February 6, 2015, 6:24am
6
OK. I can cope with JS. You want me to have a crack at it?
sam
(Sam Saffron)
February 6, 2015, 6:25am
7
Sure would be more than happy for you to!
rhulse
(Richard Hulse)
February 6, 2015, 6:42am
8
OK. Have looked at the code.
Initial thoughts:
I had a think about what the user posts and expects from their input, and what is likely to be valid. Anyone posting a valid link is always going to include http(s): with the colon. Always. Anyone leaving that off the link is going to put a dot after the www. Always. Well they are if they want to work if it is cut and pasted.
So based on that I would first suggest (before diving into the really hard regexp) a simple change to the tokens that you look for (lines 24-25). Very roughly:
Discourse.Dialect.inlineRegexp(_.merge({start: 'http:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'https:'}, urlReplacerArgs));
Discourse.Dialect.inlineRegexp(_.merge({start: 'www.'}, urlReplacerArgs));
I am not sure if the start token can be a regex, or need to be escaped, but that is the suggestion.
(EDIT: Nope. Problem is deeper. )
rhulse
(Richard Hulse)
February 6, 2015, 6:43am
9
To fix the regex itself the start token would have to be non-greedy. Possibly also a simple fix.
It might be a good idea to also look at
https://github.com/discourse/discourse/blob/master/app/assets/javascripts/discourse/dialects/dialect.js
to make sure the spaceOrTagBoundary: true,
doesn’t have something to do with the problem.
Curious that both
with some more word
and
://www.discourse.com
have 20 characters
1 Like
riking
(Kane York)
February 6, 2015, 6:52am
11
Preliminary testing indicates that the character count does indeed factor into it somehow.
rhulse
(Richard Hulse)
February 6, 2015, 6:58am
12
Thank you for the file ref, and yes that is a useful clue.
chapel
(Jacob Chapel)
February 6, 2015, 8:30am
14
Okay, through some trial and error and some looking around I found a regex that passes this edge case and seems to be simple enough but strong enough as well.
/^((?:https?):\/\/(?:-\.)?(?:[^\s\/?\.#-]+\.?)+(?:\/[^\s]*)?)/
I tested it live on here by replacing the existing regex with the above one and testing how it handled the example text from the OP.
Looking at what I can change with the existing regex to have the same properties.
Found the source regex for the above here listed under @imme_emosol
: In search of the perfect URL validation regex
Edit: I’ve identified the g
and m
flags as causing the issue with the current regex.
Should be:
/^((?:https?:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'".,<>?«»“”‘’\s]))/
Edit 2: Just submitted a pull request to fix this issue.
https://github.com/discourse/discourse/pull/3175
4 Likes