Onebox incorrectly encodes urls with ampersands


(Gavin) #1

I’m finding that onebox is incorrectly encoding urls, e.g. ampersands in the following url are encoded so the url doesn’t work as expected:

The following is where the onebox link should go (using a url shortener service to work-around the issue):


Onebox links with ampersands don't encode correctly
(Robin Ward) #3

I looked at this, and the problem occurs because our onebox code currently escapes & characters as & in URLs.

It seems that this is the correct thing to do for security reasons and was done to our codebase intentionally because of this.

Most servers work with & so this hasn’t been a problem in the past. I think it would be better and more secure if topomap.nz.co adjusted their software to work with & encoding in URLs instead.


(Gavin) #4

By all means encode data for use in the model behind the page, but when rendering urls back out to html I would have thought they should be unencoded back to their original values, where they won’t be consumed by javascript and no longer present a security risk?

I also argue against your statement:

Most servers work with & so this hasn’t been a problem in the past.

I’ve not encountered any sites myself that you can substitute & for an & in their urls and continue to work as intended, not even Google caters to this e.g.

Discourse - Google Search = using &
vs.
Discourse - Google Search = using &

Note that the urls using parameters above don’t get touched by Onebox and so are left un-corrupted here. So I feel this is still a bug with how Onebox handles urls.

I guess this bug might be low priority though as more and more sites shy away from using url parameters for SEO purposes, etc.


(Eli the Bearded) #6

It’s not sites that should cater to it, it’s browsers

<p>Link with amp: <a href="https://www.google.com/search?q=Discourse&amp;tbs=qdr:m">Disourse</a></p>
<p>Link without amp: <a href="https://www.google.com/search?q=Discourse&tbs=qdr:m">Disourse</a></p>

In HTML ampersand is properly supposed to be &amp;, that some URLs work without that is because of loose parsing.


(cpradio) #7

Wait… what? Show me a browser that supports &amp; in the address bar. I’ve yet to find one. I just tried several on my desktop, all fail.

It seems like JavaScript doesn’t support utilizing &amp; for escaping &? Here is a codepen recreating the issue with pure HTML and JavaScript versions

Note how the first link is broken, but the second works. Both are escaped, the first is done so twice, producing bad output. The second is escape once producing the correct output. The third one fails with JavaScript being handed &amp; and the fourth is successful. It seems JavaScript writes a literal version of what you give it, it doesn’t seem to apply any HTML parsing of the attribute.

In fact, that referenced link in the stack overflow solution fails (5th example in codepen) in JavaScript. Because it isn’t being parsed like regular HTML is.

This is where your argument falls short. Yes, you are right about HTML, however, HTML is not what is running here. JavaScript is. We’re past the browser’s initial rendering of HTML and based on my codepen above, it seems JavaScript when setting an attribute, does not invoke anything in the browser to convert &amp; to & (like it does in the HTML example – same codepen).

On top of all of that, & is considered a reserve character when part of a URI (http://www.ietf.org/rfc/rfc2396.txt) and should only be escaped when the value it represents would conflict with the original intent of &.

Example:

If I wanted to query google for Tom & Jerry, I would need to escape the & because otherwise, it would assign q to Tom and a second parameter named Jerry, which would have no value.
https://www.google.com/search?q=Tom+%26+Jerry

versus
https://www.google.com/search?q=Tom&Jerry

Note the encoding used for the & in that first example, it isn’t &amp;; in fact, google hates that too
https://www.google.com/search?q=Tom+&+Jerry

Because & is not an appropriate encoding for a URI, as it is still utilizing the reserved character.


(Jeff Atwood) #8

It is not the address bar that requires the ampersand encoding in URLs, it is the HTML markup.


(cpradio) #9

Right. But html is not in charge in this as my codepen indicates.


(Gavin) #10

Correct, for html text content, but not for html tag attributes like href.


(Eli the Bearded) #11

Did I show an address bar? No, I showed code.

Build a page with those two lines of HTML and run it through a validator. The one without &amp; gives invalid entity errors:

Edit: Also, address bars work different has been true a very long time.

<a href="www.netscape.com"> always has gone to the www.netscape.com page on the current site, but in the address bar www.netscape.com goes to the home page of of the browser maker. (Or whatever is there these days.) In the address bar there is a different set of heuristics applied during the loose parsing.


(Gavin) #12

I think you may be using the validator incorrectly…

I created a full test page for you at Test page for Discourse to test your theory - it validates without issue. Here’s the result.


(Eli the Bearded) #13

Okay, you got me. It validates as HTML5. My test page was HTML2.0. It appears HTML5 has a much more lenient view of ampersands.

But note also that the &amp; version is not wrong and is less ambiguous.


(Gavin) #14

The real issue is that urls are treated differently by the edit box and Onebox so there’s an inconsistency as well as a bug.

If ampersands are present in the url and the page doesn’t provide any Open Graph tags for Onebox to use, then the url is treated correctly e.g.

https://www.google.com/search?q=Discourse&tbs=qdr:m

However, if ampersands are present in the url and the page also provides Open Graph tags, then Onebox becomes involved and corrupts the original url by html encoding the href attribute in the final html output e.g.

Hopefully that’s clarified things a little better.


(cpradio) #15

I was about to point that out too. Onebox is doing another step or is due to it being a JavaScript app it is behaving differently.

You can also see the behavior using your first example.

Test Url for Discourse - NZ Topo Map

The latter works.


(Leo McArdle) #16

This is still a bug:

http://www.britishmuseum.org/research/collection_online/collection_object_details.aspx?objectId=1612368&partId=1


(Guido Drehsen) #17

still a problem


(Jeff Atwood) #18

Anything we can do here @sam for 1.9?


(Sam Saffron) #19

probably sounds to me like a bug in the onebox gem @techAPJ can you see what is involved in fixing this, if it takes longer than 30 minutes skip it and report back here.


(Jeff Atwood) #20

Be very careful not to introduce security regressions with non-escaping of things, but maybe focus on the ampersand case and make sure that specifically works @techapj… since this has come up 4-5 times.


(Arpit Jalan) #21

FIxed via:


(Jeff Atwood) #22

ok @GuidoD can you get latest and confirm this now works for you?