Onebox breaks if there's Chinese text in URL

(kiki) #1


  def initialize(url, opts = nil)
    @url = url
    Rails.logger.info("=============================1  #{@url}")
    @uri =
      begin
        URI(escape_url) if @url
      rescue URI::InvalidURIError
      end
    Rails.logger.info("=============================2  #{@uri.to_s}")

Output logs:

I, [2017-08-03T15:45:31.697669 #112637]  INFO -- : =============================1  https://domain/%E6%B5%8B%E8%AF%95
I, [2017-08-03T15:45:31.757180 #112637]  INFO -- : =============================2  https://domain/%25E6%25B5%258B%25E8%25AF%2595

(Matt Palmer) #2

What exactly breaks? Do you have an exception and backtrace?

(kiki) #3

The input URL is https://domain/%E6%B5%8B%E8%AF%95

The URL I get in the onebox turns out to be: https://domain/%25E6%25B5%258B%25E8%25AF%2595

I’ve added logs in discourse/lib/final_destination.rb (as shown in the post above) and found that it escapes to the wrong URL.

The two URLs differ, but I expected them to be the same.

(Matt Palmer) #4

Well, yes, of course they’re going to be different; if you’re escaping a URL, any % characters are always going to be converted to %25, that’s what URI escaping does.
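
To make the effect concrete, here is a minimal Ruby sketch of what happens when an already-escaped path segment is escaped a second time. It uses the standard library's URI.encode_www_form_component purely for illustration; that choice and the escape_segment helper name are assumptions, not necessarily what Discourse calls internally.

```ruby
require "uri"

# Illustration only: URI.encode_www_form_component stands in for whatever
# escaping routine is actually applied; escape_segment is a hypothetical name.
def escape_segment(segment)
  URI.encode_www_form_component(segment)
end

raw   = "测试"
once  = escape_segment(raw)   # non-ASCII bytes become %XX
twice = escape_segment(once)  # every "%" from the first pass becomes "%25"

puts once   # => %E6%B5%8B%E8%AF%95
puts twice  # => %25E6%25B5%258B%25E8%25AF%2595
```

The second value matches the URL reported in the logs above, which is consistent with one extra round of escaping.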

(kiki) #5

It’s not escaped by me. It’s escaped by discourse/lib/final_destination.rb.

It did not work this way before.

The scenario is:

  1. I wrote a onebox plugin to handle URLs like https://domain/%E6%B5%8B%E8%AF%95

  2. A user enters the URL https://domain/%E6%B5%8B%E8%AF%95

  3. The onebox gets @url=https://domain/%E6%B5%8B%E8%AF%95 and does its work

In the old days, @url was exactly the URL the user entered, https://domain/%E6%B5%8B%E8%AF%95.

But now it turns out to be https://domain/%25E6%25B5%258B%25E8%25AF%2595

(Matt Palmer) #6

The change was made in commit discourse/discourse@b534778 on GitHub ("FIX: Escape URL before attempting to resolve it."). There’s no bug reference on the commit. @tgxworld, do you recall the purpose of that commit?

(Evgeny) #7

I noticed this with other languages.
Or is this another scenario?

Russian version

(Matt Palmer) #8

There’s nothing language-specific here. Any URL data that can’t be represented in the constrained set of characters permitted in URLs gets percent-escaped. In both cases here, the data is UTF-8 encoded character data that is possibly undergoing a double round of encoding. Unfortunately, because no backtrace has been provided, it’s impossible to see where the data is coming from, only that something is going on.

(Sam Saffron) #9

@tgxworld looks like there is a regression here:

Used to work as a Onebox in 1.8 and now no longer works.

(Alan Tan) #11

Oops, not sure how I missed the reply, but I was fixing an error that was flooding our logs.

[1] pry(main)> URI("")
URI::InvalidURIError: bad URI(is not URI?):
from /home/tgxworld/.rbenv/versions/2.4.1/lib/ruby/2.4.0/uri/rfc3986_parser.rb:67:in `split'

(Alan Tan) #12

Hmm, I’m not sure if it actually worked in 1.8, because FinalDestination was introduced before the 1.8 release and the URL wouldn’t have been resolved at all.

[2] pry(main)> URI("Свободное_программное_обеспечение2")
URI::InvalidURIError: URI must be ascii only "\u0421\u0432\u043E\u0431\u043E\u0434\u043D\u043E\u0435_\u043F\u0440\u043E\u0433\u0440\u0430\u043C\u043C\u043D\u043E\u0435_\u043E\u0431\u0435\u0441\u043F\u0435\u0447\u0435\u043D\u0438\u04352"
from /home/tgxworld/.rbenv/versions/2.4.1/lib/ruby/2.4.0/uri/rfc3986_parser.rb:21:in `split'
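
The non-ASCII failure can be reproduced outside Discourse. Below is a small sketch with a hypothetical parse_or_nil helper (the name is mine) that mirrors the rescue in FinalDestination#initialize: a raw non-ASCII string is rejected by URI(), while the percent-encoded form parses.

```ruby
require "uri"

# Hypothetical helper: returns a URI object, or nil when Ruby rejects the string.
def parse_or_nil(url)
  URI(url)
rescue URI::InvalidURIError
  nil
end

p parse_or_nil("https://domain/测试")                        # => nil (URI must be ASCII only)
puts parse_or_nil("https://domain/%E6%B5%8B%E8%AF%95").path  # => /%E6%B5%8B%E8%AF%95
```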

(Alan Tan) #13

Hmm OK I see what is happening here. FinalDestination is given

instead of Свободное_программное_обеспечение

For the first case we end up escaping the % sign… Hmm, I will need to figure out the format given to FinalDestination, because the URL passed to it is sometimes not escaped.
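
One way to cope with input that is sometimes escaped and sometimes not is to make the escaping idempotent: percent-encode only the bytes that are not URL-safe, and leave existing %XX sequences alone. The following is only a sketch of that idea, not the change that was actually committed; escape_once is a hypothetical name.

```ruby
# Hypothetical helper, not Discourse's actual fix. Works on raw bytes so that
# each byte of a multi-byte UTF-8 character gets its own %XX escape.
def escape_once(url)
  escaped = url.b.gsub(/%(?![0-9A-Fa-f]{2})|[^A-Za-z0-9\-._~:\/?\[\]@!$&'()*+,;=%#]/) do |byte|
    format("%%%02X", byte.ord)
  end
  escaped.force_encoding(Encoding::UTF_8) # result is pure ASCII at this point
end

puts escape_once("https://domain/测试")
# => https://domain/%E6%B5%8B%E8%AF%95
puts escape_once("https://domain/%E6%B5%8B%E8%AF%95")
# => https://domain/%E6%B5%8B%E8%AF%95 (unchanged, no %25)
```

The trade-off is that a literal "%" followed by two hex digits can no longer be told apart from an existing escape, so this only suits callers that never pass such text unescaped.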

(Alan Tan) #15

Fixed in

(Alan Tan) #16

Yay encoded URLs are being properly oneboxed again.

(Alan Tan) #17