The underlying problem which caused this bug is that query parameters, at least, are not being tagged with any particular encoding. Based on some reading, it’s my understanding that modern Rails versions do correctly indicate the encoding of parameters provided in request bodies (by looking at the Content-Type request header), and experimentation indicates that explicit query parameters (such as ?foo=bar) do get marked as being UTF-8, but as far as I can tell params which come from parsing the URL through route patterns, and format markers, are being marked as ASCII-8BIT.
I’ve played a bit of whack-a-mole with the problems reported in the bug topic above, but given that we found three separate instances of potentially needing to force an encoding in one place, I’m thinking the issue is fairly likely to be extremely widespread throughout the codebase.
So, the questions which come to mind are:
Does this constitute a bug in Rails itself which should be fixed?
How should this be worked around in Discourse, either until Rails itself gets fixed, or forever?
How has this not been a problem before? It’s not like Discourse suddenly got its first non-seven-bit-ASCII users in the last week or two…
OK, I’m a bit too snowed at the moment to be taking something this big through to completion; anyone who’s looking for a chance to contribute a PR to Discourse (and possibly Rails), go forth and hack!
this code in topic controller is not working well. As I tested some puts in console. I have find out that this part of topic controller has some problem. And I found out that there is no extra whitespace after this. params[:slug]
I have tested some codes to find out if there is extra whitespace after slug or not. so I added the following to topic_controller:
I added a single quotation around the out put to see whatespace in console there is no whitespace between the slug and the single quotation. the are exactly the same but the length is diffrent.
UPDATE:
look at the code I have changed and the output:
The proper fix for this problem, across the entire codebase, is to monkey patch or add a before_filter that fixes the encoding of all parameters parsed out of route patterns to be UTF-8. Individual checks, like the one you propose, don’t scale, for the reasons I explained in my initial post.
thanks, but that doesn’t reproduce previous existed topic slugs. old topics need to be edited so that their slug changes.
btw, is there any way to do this via database? I mean with some lines of codes?
may you open the case a little more? @Alavi1412 would like to help, but he’s not sure if his knowledge of discourse can solve it.
I’d appreciate if you explain which file should be explored among discourse codebase or any other information rather than monkey patch and before_filter?
p.s: @david can I re-generate all the slugs using this command in the app:
rake posts:rebake
an old post slug will only change if it is edited after the slug-generation-method has been changed in admin setting. can the above command do this?
I would guess that running posts:rebake will replace all of the links created in oneboxes and quotes. Probably won’t fix any links that were explicitly posted by a user.
However… since the issue you’re having is “too many redirects” (a redirect loop), I don’t think you even need to fix the links - the old ones should redirect correctly to the new ones. Have you tried it?
We prioritise work for paying customers. At the moment, everyone on the team is already doing customer work. If a paying customer reports this particular problem is impacting them, we will reprioritise accordingly.
Digging through a bunch of code, I find that unescape() (by that or a similar name) has been defined in multiple locations throughout the RoR system. There’s even this one in ActiveSupport (active_support/core_ext/uri.rb) which includes an explicit test that things decode right:
require 'uri'
str = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" # Ni-ho-nn-go in UTF-8, means Japanese.
parser = URI::Parser.new
unless str == parser.unescape(parser.escape(str))
URI::Parser.class_eval do
remove_method :unescape
def unescape(str, escaped = /%[a-fA-F\d]{2}/)
# TODO: Are we actually sure that ASCII == UTF-8?
# YK: My initial experiments say yes, but let's be sure please
enc = str.encoding
enc = Encoding::UTF_8 if enc == Encoding::US_ASCII
str.gsub(escaped) { [$&[1, 2].hex].pack('C') }.force_encoding(enc)
end
end
end
Very, very few of the unescapes make any reference to Encoding, relying on the default inside Ruby’s core uri:
If US-ASCII, force it to be UTF-8, otherwise leave as is. And the as-is we are seeing here is ASCII-8BIT. (That, incidentally, is the encoding used generally for the url escape functions, as it gives the nice byte-wise %encoding result.)
I haven’t reproduced this bug in my own environment, so I can’t test fixing that. And I’m unclear if this is the exact unescaper that get used here.
Feel free to propose a PR demonstrating your approach. Given that we’ve already found incompatibilities in at least two different controllers, though, I’m not confident a “little change” in one controller (or model) will do the trick.