Automatic encoding of parsed URL params

mpalmer · August 1, 2017, 9:53am

The underlying problem which caused this bug is that query parameters, at least, are not being tagged with any particular encoding. Based on some reading, it’s my understanding that modern Rails versions do correctly indicate the encoding of parameters provided in request bodies (by looking at the Content-Type request header), and experimentation indicates that explicit query parameters (such as ?foo=bar) do get marked as being UTF-8, but as far as I can tell params which come from parsing the URL through route patterns, and format markers, are being marked as ASCII-8BIT.

I’ve played a bit of whack-a-mole with the problems reported in the bug topic above, but given that we found three separate instances of potentially needing to force an encoding in one place, I’m thinking the issue is fairly likely to be extremely widespread throughout the codebase.

So, the questions which come to mind are:

Does this constitute a bug in Rails itself which should be fixed?
How should this be worked around in Discourse, either until Rails itself gets fixed, or forever?
How has this not been a problem before? It’s not like Discourse suddenly got its first non-seven-bit-ASCII users in the last week or two…

sam · August 1, 2017, 2:49pm

My 2 cents on this is that we should be normalizing everything to utf-8, if rails has no option of doing so we should introduce one.

mpalmer · August 1, 2017, 11:01pm

OK, I’m a bit too snowed at the moment to be taking something this big through to completion; anyone who’s looking for a chance to contribute a PR to Discourse (and possibly Rails), go forth and hack!

Alavi1412 · August 3, 2017, 7:17am

I know rails a little and I have some Idea.

this code in topic controller is not working well. As I tested some puts in console. I have find out that this part of topic controller has some problem. And I found out that there is no extra whitespace after this.
params[:slug]

I have tested some codes to find out if there is extra whitespace after slug or not. so I added the following to topic_controller:

puts "'#{params[:slug]}'"
puts params[:slug].length
puts "'#{@topic_view.topic.slug}'"
puts @topic_view.topic.slug.length

and the out put was this:

'مشکل-پیدا-کردن-ترجمه-ی-خوب'
47
'مشکل-پیدا-کردن-ترجمه-ی-خوب'
26

I added a single quotation around the out put to see whatespace in console there is no whitespace between the slug and the single quotation. the are exactly the same but the length is diffrent.

UPDATE:
look at the code I have changed and the output:

code:

      puts "'#{params[:slug]}'"
      puts params[:slug].length
      puts params[:slug].encoding
      puts "'#{@topic_view.topic.slug}'"
      puts @topic_view.topic.slug.length
      puts @topic_view.topic.slug.encoding

output:

'مشکل-پیدا-کردن-ترجمه-ی-خوب'
47
ASCII-8BIT
'مشکل-پیدا-کردن-ترجمه-ی-خوب'
26
UTF-8

Alavi1412 · August 3, 2017, 7:33am

I have solved it but I don’t know the solution is good or not.

I just changed the slugs_do_not_match method at topic_controller like this:

def slugs_do_not_match
      params[:slug].force_encoding('UTF-8')      #new line added
      params[:slug] && @topic_view.topic.slug != params[:slug]
  end

mpalmer · August 3, 2017, 7:54am

The proper fix for this problem, across the entire codebase, is to monkey patch or add a before_filter that fixes the encoding of all parameters parsed out of route patterns to be UTF-8. Individual checks, like the one you propose, don’t scale, for the reasons I explained in my initial post.

Pad_Pors · August 6, 2017, 6:49am

is there any progress on this?

it’s becoming little bit annoying in our forum as you can’t open any of the topics in a new link.

mpalmer · August 6, 2017, 11:19am

As the topic says, “PR welcome”.

david · August 6, 2017, 11:23am

If topic links are a problem, you could try changing the “slug generation method” in your site settings to “none”.

You would lose the nice URLs for topics, but if your forum isn’t working it’s probably worth it.

Pad_Pors · August 6, 2017, 11:59am

thanks, but that doesn’t reproduce previous existed topic slugs. old topics need to be edited so that their slug changes.

btw, is there any way to do this via database? I mean with some lines of codes?

may you open the case a little more? @Alavi1412 would like to help, but he’s not sure if his knowledge of discourse can solve it.
I’d appreciate if you explain which file should be explored among discourse codebase or any other information rather than monkey patch and before_filter?

p.s: @david can I re-generate all the slugs using this command in the app:

rake posts:rebake

an old post slug will only change if it is edited after the slug-generation-method has been changed in admin setting. can the above command do this?

david · August 6, 2017, 12:13pm

I would guess that running posts:rebake will replace all of the links created in oneboxes and quotes. Probably won’t fix any links that were explicitly posted by a user.

However… since the issue you’re having is “too many redirects” (a redirect loop), I don’t think you even need to fix the links - the old ones should redirect correctly to the new ones. Have you tried it?

Pad_Pors · August 6, 2017, 12:15pm

yes, currently the slug-generation method is none, and only new posts are working fine.

david · August 6, 2017, 12:25pm

Does editing the title on old topics fix the problem?

If so, something like this might work on the rails console:

Topic.all.each do |topic|
    topic.title = topic.title
    topic.save!
end

(Assigning topic.title re-generates the slug)

Please please please take a backup before trying that, as I don’t know if it will have any unintended side effects.

It might also take a long time to run depending how many topics you have…

mpalmer · August 6, 2017, 9:42pm

We prioritise work for paying customers. At the moment, everyone on the team is already doing customer work. If a paying customer reports this particular problem is impacting them, we will reprioritise accordingly.

elijah · August 7, 2017, 5:26am

Digging through a bunch of code, I find that unescape() (by that or a similar name) has been defined in multiple locations throughout the RoR system. There’s even this one in ActiveSupport (active_support/core_ext/uri.rb) which includes an explicit test that things decode right:

require 'uri'
str = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" # Ni-ho-nn-go in UTF-8, means Japanese.
parser = URI::Parser.new

unless str == parser.unescape(parser.escape(str))
  URI::Parser.class_eval do
    remove_method :unescape
    def unescape(str, escaped = /%[a-fA-F\d]{2}/)
      # TODO: Are we actually sure that ASCII == UTF-8?
      # YK: My initial experiments say yes, but let's be sure please
      enc = str.encoding
      enc = Encoding::UTF_8 if enc == Encoding::US_ASCII
      str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(enc)
    end
  end
end

Very, very few of the unescapes make any reference to Encoding, relying on the default inside Ruby’s core uri:

  def self.decode_www_form_component(str, enc=Encoding::UTF_8)

But this bit in ActionDispatch (action_dispatch/journey/router/utils.rb) looks curious:

          def unescape_uri(uri)
            encoding = uri.encoding == US_ASCII ? UTF_8 : uri.encoding
            uri.gsub(ESCAPED) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
          end

If US-ASCII, force it to be UTF-8, otherwise leave as is. And the as-is we are seeing here is ASCII-8BIT. (That, incidentally, is the encoding used generally for the url escape functions, as it gives the nice byte-wise %encoding result.)

I haven’t reproduced this bug in my own environment, so I can’t test fixing that. And I’m unclear if this is the exact unescaper that get used here.

Pad_Pors · August 7, 2017, 5:37am

thanks for the hints . we already have the bug , we’ll test and report the progress or perhaps some new questions.

elijah · August 7, 2017, 4:59pm

Okay, new since yesterday, I’ve been able duplicate the issue of infinite redirects on non-ASCII slugs.

And I was able to fix it by modifying action_dispatch/journey/router/utils.rb.

Inside the container there are two copies of that file, I changed both:

--- vendor/bundle/ruby/2.3.0/gems/actionpack-4.2.9/lib/action_dispatch/journey/router/utils.orig        2017-08-07 16:54:17.726203797 +0000
+++ vendor/bundle/ruby/2.3.0/gems/actionpack-4.2.9/lib/action_dispatch/journey/router/utils.rb  2017-08-07 16:46:04.166411177 +0000
@@ -28,6 +28,7 @@
         class UriEncoder # :nodoc:
           ENCODE   = "%%%02X".freeze
           US_ASCII = Encoding::US_ASCII
+          ASCII_8  = Encoding::ASCII_8BIT
           UTF_8    = Encoding::UTF_8
           EMPTY    = "".force_encoding(US_ASCII).freeze
           DEC2HEX  = (0..255).to_a.map{ |i| ENCODE % i }.map{ |s| s.force_encoding(US_ASCII) }
@@ -56,7 +57,9 @@
           end

           def unescape_uri(uri)
-            encoding = uri.encoding == US_ASCII ? UTF_8 : uri.encoding
+            encoding = uri.encoding
+            encoding = UTF_8 if ( encoding == US_ASCII || encoding == ASCII_8 )
+
             uri.gsub(ESCAPED) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
           end

--- vendor/bundle/ruby/2.3.0/gems/actionpack-4.2.8/lib/action_dispatch/journey/router/utils.orig        2017-08-07 16:56:20.718104297 +0000
+++ vendor/bundle/ruby/2.3.0/gems/actionpack-4.2.8/lib/action_dispatch/journey/router/utils.rb  2017-08-07 16:47:21.338413239 +0000
@@ -26,6 +26,7 @@
         class UriEncoder # :nodoc:
           ENCODE   = "%%%02X".freeze
           US_ASCII = Encoding::US_ASCII
+          ASCII_8  = Encoding::ASCII_8BIT
           UTF_8    = Encoding::UTF_8
           EMPTY    = "".force_encoding(US_ASCII).freeze
           DEC2HEX  = (0..255).to_a.map{ |i| ENCODE % i }.map{ |s| s.force_encoding(US_ASCII) }
@@ -54,7 +55,8 @@
           end

           def unescape_uri(uri)
-            encoding = uri.encoding == US_ASCII ? UTF_8 : uri.encoding
+            encoding = uri.encoding
+            encoding = UTF_8 if ( encoding == US_ASCII || encoding == ASCII_8 )
             uri.gsub(ESCAPED) { [$&[1, 2].hex].pack('C') }.force_encoding(encoding)
           end

@@ -91,4 +93,3 @@
     end
   end
 end
-

Someone want to recommend where I can submit this officially?

Edit: Probably a pull request to Rails itself? https://github.com/rails/rails The code exists in that tree.

blake · August 7, 2017, 7:12pm

Just wanted to post this link here so that we have another data point to test:

https://community.wanikani.com/t/%E6%BC%A2%E5%AD%97%E3%81%B2%E3%82%89%E3%81%8C%E3%81%AA-in-topic-permalinks-breaks-navigation-directly-to-a-topic/19207

This url also is experiencing the same issues.

Alavi1412 · August 8, 2017, 12:43pm

Just a suggestion:

This problem can be because of anything but for a permanent solution, I think a little change in topic model or topic controller can solve it.

mpalmer · August 8, 2017, 11:16pm

Feel free to propose a PR demonstrating your approach. Given that we’ve already found incompatibilities in at least two different controllers, though, I’m not confident a “little change” in one controller (or model) will do the trick.

Topic		Replies	Views
Topic with Japanese in URL doesn't redirect if URL doesn't perfectly match Bug	72	4046	October 21, 2019
When slug generation method is encoded, It redirected many times when we open new tab opens in topic Bug	9	5309	November 16, 2017
Removing the concept of "slugs" for some languages Feature	27	10060	December 28, 2017
Force Lowercase slug URLs when set to "encoded" Support	23	4336	June 8, 2024
Category slug containing Unicode characters results in redirect loop Bug	7	1581	January 15, 2021

Automatic encoding of parsed URL params

Related topics