Onebox breaks if there's Chinese text in the URL

discourse/lib/final_destination.rb

  def initialize(url, opts = nil)
    @url = url
    Rails.logger.info("=============================1  #{@url}")
    @uri =
      begin
        URI(escape_url) if @url
      rescue URI::InvalidURIError
      end
    Rails.logger.info("=============================2  #{@uri.to_s}")

Output logs:

I, [2017-08-03T15:45:31.697669 #112637]  INFO -- : =============================1  https://domain/%E6%B5%8B%E8%AF%95
I, [2017-08-03T15:45:31.757180 #112637]  INFO -- : =============================2  https://domain/%25E6%25B5%258B%25E8%25AF%2595

What exactly breaks? Do you have an exception and backtrace?

The input URL is https://domain/%E6%B5%8B%E8%AF%95

The URL I get in Onebox turns out to be: https://domain/%25E6%25B5%258B%25E8%25AF%2595

I added logging to discourse/lib/final_destination.rb (as shown in the post above) and found that it escapes the URL to the wrong value.

The two URLs are different, but they are expected to be the same.

Well, yes, of course they’re going to be different; if you’re escaping a URL, any % characters are always going to be converted to %25, that’s what URI escaping does.
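
To make the double round of escaping concrete, here is a minimal Ruby sketch. It assumes a generic RFC 2396 style escaper from the standard library (`URI::RFC2396_Parser`) as a stand-in; that is not necessarily what `escape_url` in final_destination.rb calls. The helper name `escape_everything_unsafe` is made up for illustration, and `domain` is the placeholder host from the logs above.

```ruby
require "uri"

# Assumption: a generic stdlib escaper stands in for whatever escape_url calls.
GENERIC_PARSER = URI::RFC2396_Parser.new

# Hypothetical helper: escape every character outside the allowed URL set.
# "%" is not in that set, so an existing "%E6" turns into "%25E6".
def escape_everything_unsafe(url)
  GENERIC_PARSER.escape(url)
end

already_escaped = "https://domain/%E6%B5%8B%E8%AF%95"
double_escaped  = escape_everything_unsafe(already_escaped)

puts double_escaped
# Unescaping once undoes only the extra layer and gives back the input URL.
puts GENERIC_PARSER.unescape(double_escaped)
```

Run against the URL from the report, the first line printed should match the second log entry, which suggests the input was already percent-encoded before it reached the escaping step.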

It’s not escaped by me. It’s escaped by discourse/lib/final_destination.rb.

It did not work this way before.

The scenario is:

  1. I wrote a Onebox plugin to handle URLs like https://domain/%E6%B5%8B%E8%AF%95

  2. The user enters the URL: https://domain/%E6%B5%8B%E8%AF%95

  3. The Onebox receives @url=https://domain/%E6%B5%8B%E8%AF%95 and does its work.

Previously, @url was exactly the URL the user entered: https://domain/%E6%B5%8B%E8%AF%95.

But now it becomes https://domain/%25E6%25B5%258B%25E8%25AF%2595

The change was made in https://github.com/discourse/discourse/commit/b534778f46ac310d9b59afa6f5390fced267f2f0. There’s no bug reference on the commit. @tgxworld, do you recall the purpose of that commit?

I noticed this with other languages.
Or is this another scenario?

https://en.wikipedia.org/wiki/Free_software

Russian version

https://ru.wikipedia.org/wiki/%D0%A1%D0%B2%D0%BE%D0%B1%D0%BE%D0%B4%D0%BD%D0%BE%D0%B5_%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D0%BD%D0%BE%D0%B5_%D0%BE%D0%B1%D0%B5%D1%81%D0%BF%D0%B5%D1%87%D0%B5%D0%BD%D0%B8%D0%B5


There’s nothing language-specific here. Any URL data that can’t be represented in the constrained set of characters permitted in URLs gets percent-escaped. In both cases here, the data is UTF-8 encoded character data that is possibly undergoing a double round of encoding. Unfortunately, because no backtrace has been provided, it’s impossible to see where the data is coming from, only that something is going wrong.

@tgxworld looks like there is a regression here:

Used to work as a Onebox in 1.8 and now no longer works.


Oops, not sure how I missed the reply, but I was fixing an error that was flooding our logs.

[1] pry(main)> URI("https://eviltrout.com?s=180&d=mm&r=g")
URI::InvalidURIError: bad URI(is not URI?): https://eviltrout.com?s=180&d=mm&r=g
from /home/tgxworld/.rbenv/versions/2.4.1/lib/ruby/2.4.0/uri/rfc3986_parser.rb:67:in `split'

Hmm, I’m not sure if it actually worked in 1.8, because FinalDestination was introduced before the 1.8 release and the URL wouldn’t have been resolved at all.

[2] pry(main)> URI("https://ru.wikipedia.org/wiki/Свободное_программное_обеспечение2")
URI::InvalidURIError: URI must be ascii only "https://ru.wikipedia.org/wiki/\u0421\u0432\u043E\u0431\u043E\u0434\u043D\u043E\u0435_\u043F\u0440\u043E\u0433\u0440\u0430\u043C\u043C\u043D\u043E\u0435_\u043E\u0431\u0435\u0441\u043F\u0435\u0447\u0435\u043D\u0438\u04352"
from /home/tgxworld/.rbenv/versions/2.4.1/lib/ruby/2.4.0/uri/rfc3986_parser.rb:21:in `split'

Hmm OK I see what is happening here. FinalDestination is given

https://ru.wikipedia.org/wiki/%D0%A1%D0%B2%D0%BE%D0%B1%D0%BE%D0%B4%D0%BD%D0%BE%D0%B5_%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D0%BD%D0%BE%D0%B5_%D0%BE%D0%B1%D0%B5%D1%81%D0%BF%D0%B5%D1%87%D0%B5%D0%BD%D0%B8%D0%B5

instead of

https://ru.wikipedia.org/wiki/Свободное_программное_обеспечение

For the first case we end up escaping the % sign... Hmm, I will need to figure out the format given to FinalDestination, because the URL passed to it is sometimes not escaped.
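
One way to cope with input that is sometimes escaped and sometimes not is to percent-encode only the characters that `URI()` rejects and leave existing `%XX` sequences untouched. The sketch below illustrates that idea only; it is not taken from the fix commit, the helper name `escape_non_ascii` is hypothetical, and it deliberately ignores other unsafe ASCII characters such as spaces.

```ruby
require "uri"

# Hypothetical helper: percent-encode only non-ASCII characters.
# Existing "%XX" sequences are plain ASCII, so they pass through unchanged.
def escape_non_ascii(url)
  url.gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format("%%%02X", byte) }.join
  end
end

raw_url     = "https://domain/测试"
escaped_url = "https://domain/%E6%B5%8B%E8%AF%95"

# Both forms should normalize to the same string, so URI() can parse either.
puts escape_non_ascii(raw_url)
puts escape_non_ascii(escaped_url)
puts URI(escape_non_ascii(raw_url)).path
```

Because the helper never touches `%`, applying it twice gives the same result as applying it once, which avoids the `%25` double encoding seen in the logs.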


Fixed in

https://github.com/discourse/discourse/commit/367fb1c524cff06a33c7a4144cd13a270a9f3489


Yay, encoded URLs are being properly oneboxed again.
