Weird encoding issue on categories page

hendersj · January 3, 2025, 8:10pm

I’m trying to track down a weird issue in a non-dockerized installation (I realize that support for this type of installation is limited/non-existent, so I’m just looking for some pointers as to what might be wrong here; our internal packaging team used the ‘developer build’ instructions to work out how to build the necessary packages). I have been able to confirm that the issue is specific to the way we’ve installed. My infra team is unwilling to use a Dockerized installation (they prefer to build everything themselves), so I run two sandbox instances, one dockerized and one non-dockerized, each with a copy of our database, in order to verify where an issue lies. This one is definitely an artifact of the way we’ve installed our setup.

After upgrading from 3.3.2 to 3.3.3, some of our non-English forum staff noticed that the “about” text for sections that use accented characters is not encoded correctly:

Interestingly, we can see that the heading as well as all other text is encoded properly. In fact, the message itself that’s used for the about message is properly encoded:

I confirmed that this is the same text by editing it and seeing the change on the categories page.

So it’s something specific to rendering that text on the categories page.

Looking at document.characterSet in my browser, it’s properly identified as UTF-8. The database also shows the format as UTF-8.

I’m wondering if anyone can point me to what is different about how this text is rendered on the categories page. My guess is that some Ruby package that isn’t built properly (missing UTF-8 support, maybe) is used in rendering that text but not other text on the system. Alternatively, it could be something that processes the about message text and truncates it. The text is truncated in this case; we also have a link to an external French forum whose message is not truncated, but I’m guessing it’s still evaluated by the same code.

Thanks for any pointers. I’m a bit stumped here.

supermathie · January 3, 2025, 8:22pm

I see sometimes it’s correct:

Pulling a raw categories.json shows it’s only wrong in the excerpt:

        "description": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",
        "description_text": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",
        "description_excerpt": "Esta secciÃ³n del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingÃ¼Ãstica castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto espaÃ±ol o cualquiera de las variedades latinoamericanas, etc.).",

Creating the same category on try.discourse.org and checking categories.json gives the correct result:

        "description": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",
        "description_text": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",
        "description_excerpt": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",

I’m not sure what the next step would be in tracking this down on your install, but focusing on the codepath that generates the excerpt may help, as may knowing that this arises from something interpreting the UTF-8 bytes as ISO-8859-1.
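If it helps to scan a whole site for this pattern, here is a minimal Python sketch. It assumes you have already loaded the list of category objects from categories.json and that each one carries the description_text and description_excerpt fields shown above. The helper names are made up for illustration and are not part of Discourse.

```python
def looks_like_utf8_read_as_latin1(text):
    """True if re-encoding as ISO-8859-1 and decoding as UTF-8 works and changes the text."""
    try:
        repaired = text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in ISO-8859-1, or not valid UTF-8 afterwards:
        # this is not the garbling pattern seen in the excerpt.
        return False
    return repaired != text


def find_mojibake_excerpts(categories):
    """Return ids of categories whose excerpt is garbled while description_text is not."""
    flagged = []
    for cat in categories:
        excerpt = cat.get("description_excerpt") or ""
        text = cat.get("description_text") or ""
        if looks_like_utf8_read_as_latin1(excerpt) and not looks_like_utf8_read_as_latin1(text):
            flagged.append(cat.get("id"))
    return flagged


# Hypothetical sample data; the garbled excerpt is generated rather than pasted.
good = "Witaj w polskiej sekcji społeczności openSUSE!"
bad = good.encode("utf-8").decode("iso-8859-1")
sample = [
    {"id": 1, "description_text": good, "description_excerpt": bad},
    {"id": 2, "description_text": good, "description_excerpt": good},
]
print(find_mojibake_excerpts(sample))
```

The check is a heuristic: text that legitimately contains sequences such as "Ã©" would also be flagged.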

hendersj · January 3, 2025, 8:29pm

Yeah, that’s my guess - whatever is generating the excerpt is probably the right place. Just not sure where that is in the code itself. But knowing to look for the term “excerpt” is definitely helpful - thanks!

It did look to me like it was coming across as iso-8859-1 at some point, so appreciate that confirmation as well (I wasn’t 100% sure that was the mis-encoding I was seeing, but it seemed right).

What you saw on try.discourse.org is what I saw in my dockerized installation as well (well, the end result of the encoding being correct).

Thanks!

supermathie · January 3, 2025, 8:33pm

You can easily check with:

○ → ipython3

In [1]: 'Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana
   ...: , de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualq
   ...: uiera de las variedades latinoamericanas, etc.).'.encode('utf-8').decode('iso-8859-1')
Out[1]: 'Esta secciÃ³n del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingÃ¼Ã\xadstica castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto espaÃ±ol o cualquiera de las variedades latinoamericanas, etc.).'
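If you are not sure which wrong encoding is involved, the same idea can be turned into a small comparison. This is only a sketch: the function name and the candidate list are my own choices, and the garbled strings are generated in the code rather than pasted, because copy and paste tends to drop the invisible control characters.

```python
def wrong_codecs_that_reproduce(correct, garbled,
                                candidates=("iso-8859-1", "cp1252", "cp437")):
    """Return the candidate codecs that turn the UTF-8 bytes of `correct` into `garbled`."""
    raw = correct.encode("utf-8")
    matches = []
    for codec in candidates:
        try:
            decoded = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # e.g. cp1252 leaves a few byte values undefined
        if decoded == garbled:
            matches.append(codec)
    return matches


spanish = "español"
polish = "społeczności"

# Simulated garbling, produced here so that control characters survive intact.
spanish_garbled = spanish.encode("utf-8").decode("iso-8859-1")
polish_garbled = polish.encode("utf-8").decode("iso-8859-1")

print(wrong_codecs_that_reproduce(spanish, spanish_garbled))
print(wrong_codecs_that_reproduce(polish, polish_garbled))
```

The Spanish word cannot tell ISO-8859-1 and Windows-1252 apart, since both decode its bytes identically. The Polish word can, because its UTF-8 bytes include values in the 0x80-0x9F range, which is where those two encodings differ.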

pfaffman · January 3, 2025, 8:54pm

My guess is that something is doing some kind of caching. That’s not much help, but that’s what I’d try to look for.

Unless they just hate Docker, they can build their own images with discourse_docker. Then they can see exactly what is happening and not have to trust anyone else’s images.

hendersj · January 3, 2025, 9:46pm

Michael Brown:

You can easily check with:

○ → ipython3

In [1]: 'Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana
   ...: , de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualq
   ...: uiera de las variedades latinoamericanas, etc.).'.encode('utf-8').decode('iso-8859-1')
Out[1]: 'Esta secciÃ³n del Foro se dedica a las personas usuarias de openSUSE que form

Cool, thanks for that.

I thought caching might be the cause as well, but changing the message resulted in an update, so I don’t think it’s caching - it’s something encoding incorrectly.

I suggested a few options, but in the end, the infra team opted to just go with packages built using the build service. I don’t think it’s a “we hate docker” thing (though podman would probably be more likely what they’d want to use), but more a way of using the configuration management tools they’re using to manage everything in the same manner. Having a one-off that uses Docker/podman would add other complexities to using the CI/CD setup they’re using (or so I understand).

So ultimately, I set up my two sandboxes to be able to determine where issues are rooted so I could report them to the proper place; unfortunately, that means that when it is something in how we’ve built things, I have to chase down what we’re doing differently from a standard Docker-based installation so we can fix it.

pfaffman · January 3, 2025, 10:00pm

I understand how they think that the Discourse way is crazy and they really want everything to be managed under one Unified System.

But. The last client I had that insisted on using their favorite tooling ended up paying me close to 20 hours work to get a usable backup to move to discourse.org hosting. The one before that paid a bunch more to tweak their custom setup to keep their site from crashing several times a week, and then a year later they paid me more still to move them to discourse.org hosting.

Good luck!

hendersj · January 3, 2025, 10:23pm

Appreciate the advice. The good news is that the backups from our prod system work in a dockerized installation just fine (I’ve tested that), so if/when they decide that using a Docker-based installation is the right way to go, we’ll be in good shape. We’ve got a lot of data (migrated from vBulletin a few years ago to Discourse), and generally things have worked pretty well, with a few odd hiccups here and there.

Consequently, we’ve learned a lot about how Discourse works, so not a bad thing all around.

It looks like /categories.json is an API endpoint rather than a static file that’s created and then read, so I think that helps limit the issue to either ruby or javascript. I’ve found where the schema for this endpoint is. I’m not particularly familiar with ruby (I’ve accumulated a lot of programming language experience over the years, so reading most languages isn’t a problem for me even if I can’t code in them - I can get the gist pretty easily), but it looks like the javascript is mostly executed in the browser and the ruby is executed on the server (though I note that nodejs is installed as well, so that generalization may not really hold).

If I can find the function that processes /categories (as it looks like .json on the end just tells the code how to format the output; I see similar behaviour in /top vs /top.rss, for example), then that should narrow down where I need to look in the code, and that’ll tell me which ruby gems (I’m fairly certain it’ll be ruby code) need to be checked to confirm they’re properly built.

hendersj · January 6, 2025, 11:30pm

It seems to be something specific to the excerpting functions - I’ve just noticed this happens on our search results pages as well:

(for example)

The text is:

I don’t see a “Share” button.

That is something I quoted myself in a response to a user (panorain). I happened on this by accident: when I try to look at my own activity, I get a 500 server error, and the output from /logs shows a runtime error, “input string cannot be empty”, in lib/excerpt_parser.rb.

Seems like a few things are leading back to something in excerpt processing, but only in the development-style installs.

In my docker-based installation, I can actually view my activity without error; weirdly, though, the database in that installation is restored from a recent production server backup - where the issue exists.

hendersj · January 13, 2025, 2:44am

It looks like we upgraded nokogiri to 1.17.2, and I see the Dockerized version is 1.16.7 - I suspect that’s the cause of this issue. Going to see about reverting that update (and anything else that was updated at the same time).
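To look for other gems that drifted at the same time, something like the following Python sketch could compare two lockfiles. It assumes both installs can produce a Gemfile.lock (or an equivalent listing in that format), which may not hold for a package-based install. The function names are hypothetical, and the racc version in the sample is made up; only the two nokogiri versions come from this thread.

```python
import re

# In Gemfile.lock, resolved gems sit under "specs:" indented by four spaces,
# e.g. "    nokogiri (1.16.7)"; their dependencies are indented by six.
SPEC_LINE = re.compile(r"^ {4}(\S+) \(([^)]+)\)$")


def locked_versions(lockfile_text):
    """Map gem name -> locked version string for every resolved spec."""
    versions = {}
    for line in lockfile_text.splitlines():
        match = SPEC_LINE.match(line)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def diff_locks(ours, reference):
    """Return {gem: (our_version, reference_version)} for gems that differ."""
    a, b = locked_versions(ours), locked_versions(reference)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


def sample_lock(nokogiri_version):
    # Hypothetical, heavily shortened lockfile for illustration only.
    return "\n".join([
        "GEM",
        "  remote: https://rubygems.org/",
        "  specs:",
        f"    nokogiri ({nokogiri_version})",
        "      racc (~> 1.4)",
        "    racc (1.8.1)",
    ])


OURS = sample_lock("1.17.2")
REFERENCE = sample_lock("1.16.7")
print(diff_locks(OURS, REFERENCE))
```

A gem missing from one side shows up with None for that side, and platform-specific entries such as a version with a platform suffix are compared as plain strings.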

darix · January 26, 2025, 8:38pm

So I downgraded our package to use nokogiri 1.16 again. What I don’t get: whenever I bump a gem to reduce duplicated packaging, I check whether there have been related changes in main, and there haven’t been any, unless I missed something.

        "description": "Witaj w polskiej sekcji społeczności openSUSE!",
        "description_text": "Witaj w polskiej sekcji społeczności openSUSE!",
        "description_excerpt": "Witaj w polskiej sekcji spoÅecznoÅci openSUSE!",

As you can see, we have the correct text twice; it is only broken once it runs through PrettyText.excerpt. How is that handled in main?

@hendersj I am already preparing a main package so we can test that with a copy of the DB.

I guess it was handled in “DEV: Update nokogiri to 1.18.1 (#30554)” (discourse/discourse@affe26f on GitHub),

but I wonder … in lib/retrieve_title.rb

doc = Nokogiri.HTML5(html, encoding:)

shouldn’t this be:

doc = Nokogiri.HTML5(html, encoding: Encoding::UTF_8)

darix · February 5, 2025, 12:43am

@pfaffman that weird code is also in the 3.4.0 release.

could you check if that should really be called with an empty encoding: setting?

supermathie · February 5, 2025, 1:25am

Any reason you think so? UTF-8 is the default.

[1] pry(main)> Nokogiri::VERSION
=> "1.18.2"

[2] pry(main)> t = '<div>Witaj w polskiej sekcji społeczności openSUSE!</div>'
=> "<div>Witaj w polskiej sekcji społeczności openSUSE!</div>"

[3] pry(main)> Nokogiri.HTML5(t).to_s
=> "<html><head></head><body><div>Witaj w polskiej sekcji społeczności openSUSE!</div></body></html>"

[4] pry(main)> Nokogiri.HTML5(t, encoding: Encoding::UTF_8).to_s
=> "<html><head></head><body><div>Witaj w polskiej sekcji społeczności openSUSE!</div></body></html>"

[5] pry(main)> Nokogiri.HTML5(t).to_s == Nokogiri.HTML5(t, encoding: Encoding::UTF_8).to_s
=> true

The retrieve_title function is used for extracting titles from external URLs (e.g. YouTube), and though I’m not intimately familiar with this codepath, I would be surprised to find that it is the source of your problem.

If you’re doing something else (e.g. using this function in a custom plugin) the encoding parameter there comes from the content-type header of the fetched resource:

        if !encoding && content_type = _response["content-type"]&.strip&.downcase
          if content_type =~ /charset="?([a-z0-9_-]+)"?/
            encoding = Regexp.last_match(1)
            encoding = nil if !Encoding.list.map(&:name).map(&:downcase).include?(encoding)
          end
        end

        max_size = max_chunk_size(uri) * 1024
        title = extract_title(current, encoding)

so one would suspect the responding webserver of reporting an incorrect content-type.
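To check what a given server's header would resolve to, here is a rough Python port of the charset logic above. It is an illustration rather than the Discourse code: the function name is made up, and codecs.lookup stands in for the Encoding.list check, so the set of accepted names will not match Ruby's exactly.

```python
import codecs
import re

CHARSET_RE = re.compile(r'charset="?([a-z0-9_-]+)"?')


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header, or None if absent or unknown."""
    if not content_type:
        return None
    match = CHARSET_RE.search(content_type.strip().lower())
    if not match:
        return None
    name = match.group(1)
    try:
        codecs.lookup(name)  # raises LookupError for names Python does not know
    except LookupError:
        return None
    return name


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type('text/html; charset="utf-8"'))
print(charset_from_content_type("text/html"))
```

A header that names the wrong charset passes this check without complaint, which is why a misreporting webserver would go unnoticed at this step.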

darix · February 5, 2025, 1:29am

Because all other calls in that patch have encoding: parameters specified.

Only the one in retrieve_title does not, which seems inconsistent, and not properly handling UTF-8 encoding was the whole discussion that led to this thread.

supermathie · February 5, 2025, 1:36am

Ah:

doc = Nokogiri.HTML5(html, encoding:)

is shorthand for:

doc = Nokogiri.HTML5(html, encoding: encoding)

Forcing UTF-8 there would break parsing of non-UTF-8 responses from webservers.
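As a rough analogy in Python (this is plain byte decoding, not Nokogiri's parser, and the page body is a made-up example), here is what happens when bytes from an ISO-8859-1 page are decoded as UTF-8 instead of with the charset the server declared:

```python
# Hypothetical response body from a server that really does send ISO-8859-1.
body = "<title>Foro en español</title>".encode("iso-8859-1")

# Forcing UTF-8: the lone 0xF1 byte is not valid UTF-8, so it gets replaced.
forced = body.decode("utf-8", errors="replace")

# Honouring the declared charset gives the original text back.
honoured = body.decode("iso-8859-1")

print(forced)
print(honoured)
```

Without errors="replace" the forced decode raises UnicodeDecodeError instead.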

darix · February 5, 2025, 1:41am

Thank you for the clarification. Back to packaging 3.4.0.
