Weird encoding issue on categories page

I’m trying to track down a weird issue in a non-Dockerized installation. I realize that support for this type of installation is limited or non-existent, so I’m just looking for pointers as to what might be wrong here; our internal packaging team used the ‘developer build’ instructions to work out how to build the necessary packages. I have been able to confirm that the issue is specific to the way we’ve installed: my infra team is unwilling to use a Dockerized installation (they prefer to build everything themselves), so I run two sandbox instances, one Dockerized and one not, each with a copy of our database, in order to verify where an issue lies. This one is definitely an artifact of the way we’ve installed our setup.

After upgrading from 3.3.2 to 3.3.3, some of our non-English forum staff noticed that the “about” text for sections that use accented characters is not encoded correctly:

Interestingly, we can see that the heading as well as all other text is encoded properly. In fact, the message itself that’s used for the about message is properly encoded:

I confirmed that this is the same text by editing it and seeing the change on the categories page.

So it’s something specific to rendering that text on the categories page.

Looking at document.characterSet in my browser, it’s properly identified as UTF-8. The database also shows the format as UTF-8.

I’m wondering if anyone can point me to what is different about how this text is rendered on the categories page. My guess is that the culprit is either a ruby package that’s not built properly (maybe missing UTF-8 support) and is used in rendering this text but not other text on the system, or something that processes the about message text and truncates it. The text is truncated in this case; we also have a link to an external French forum whose message is not truncated, but I’m guessing it’s still evaluated by the same code.

Thanks for any pointers. I’m a bit stumped here.

I see sometimes it’s correct:

Pulling a raw categories.json shows it’s only wrong in the excerpt:

        "description": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",
        "description_text": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",
        "description_excerpt": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",

Creating the same category on try.discourse.org and checking categories.json gives the correct result:

        "description": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",
        "description_text": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",
        "description_excerpt": "Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualquiera de las variedades latinoamericanas, etc.).",
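If it helps to check every category at once rather than eyeballing the JSON, here is a minimal Python sketch. It assumes the categories sit under `category_list` → `categories` in the payload (adjust if yours differs), and the function names are made up for illustration. It flags any field in which a space-separated token, re-read as ISO-8859-1 bytes, decodes cleanly as UTF-8 into something different, which is the usual signature of UTF-8 having been misread as ISO-8859-1.

```python
import json

# Fields compared in this thread; extend as needed.
FIELDS = ("description", "description_text", "description_excerpt")


def token_is_mojibake(token: str) -> bool:
    """True if the token looks like UTF-8 bytes that were decoded as ISO-8859-1."""
    try:
        repaired = token.encode("iso-8859-1").decode("utf-8")
    except UnicodeError:
        # Not representable in ISO-8859-1, or not valid UTF-8 afterwards:
        # correctly encoded accented text normally ends up here.
        return False
    return repaired != token


def has_mojibake(text: str) -> bool:
    # Split on plain spaces only, so mis-decoded bytes such as U+00A0 stay in their token.
    return any(token_is_mojibake(tok) for tok in text.split(" "))


def find_mojibake_fields(payload: dict) -> list:
    """Return (category id, field name) pairs whose text looks mis-decoded."""
    hits = []
    # ASSUMPTION: categories are nested under category_list -> categories.
    for cat in payload.get("category_list", {}).get("categories", []):
        for field in FIELDS:
            value = cat.get(field)
            if isinstance(value, str) and has_mojibake(value):
                hits.append((cat.get("id"), field))
    return hits


if __name__ == "__main__":
    good = "Esta sección del Foro"
    bad = good.encode("utf-8").decode("iso-8859-1")  # simulate the broken excerpt
    sample = json.loads(json.dumps({
        "category_list": {"categories": [
            {"id": 1, "description_text": good, "description_excerpt": bad},
        ]}
    }))
    print(find_mojibake_fields(sample))
```

To run it against a real install, save the output of /categories.json to a file and pass `json.load(...)` of that file to `find_mojibake_fields`. It is only a heuristic, so a hit is a hint to look closer rather than proof.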

I’m not sure what the next step would be in tracking this down on your install, but maybe focusing on the codepath that generates the excerpt will help, as will knowing that this arose from something interpreting the UTF-8 encoding as iso-8859-1.

Yeah, that’s my guess - whatever is generating the excerpt is probably the right place. Just not sure where that is in the code itself. But knowing to look for the term “excerpt” is definitely helpful - thanks!

It did look to me like it was coming across as iso-8859-1 at some point, so appreciate that confirmation as well (I wasn’t 100% sure that was the mis-encoding I was seeing, but it seemed right).

What you saw on try.discourse.org is what I saw in my dockerized installation as well (well, the end result of the encoding being correct :slight_smile: )

Thanks!

You can easily check with:

○ → ipython3

In [1]: 'Esta sección del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingüística castellana
   ...: , de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto español o cualq
   ...: uiera de las variedades latinoamericanas, etc.).'.encode('utf-8').decode('iso-8859-1')
Out[1]: 'Esta secciÃ³n del Foro se dedica a las personas usuarias de openSUSE que forman parte de la comunidad lingÃ¼Ã\xadstica castellana, de tal forma que dichas personas puedan consultar y participar en el foro en dicha lengua (sea el dialecto espaÃ±ol o cualquiera de las variedades latinoamericanas, etc.).'
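The same check can be run in the other direction to confirm the diagnosis: if a broken string really is UTF-8 read as ISO-8859-1, then re-encoding it as ISO-8859-1 and decoding as UTF-8 should give back the original. A minimal sketch (the helper names are just for illustration):

```python
def to_mojibake(text: str) -> str:
    """Simulate the bug: UTF-8 bytes read back as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(text: str) -> str:
    """Undo it: raises UnicodeError if the text was not mis-decoded this way."""
    return text.encode("iso-8859-1").decode("utf-8")


if __name__ == "__main__":
    # Each accented character is two bytes in UTF-8, so it shows up as two
    # ISO-8859-1 characters. For 'í' the second one is U+00AD (soft hyphen),
    # which repr() prints as \xad.
    for ch in "óüíñ":
        print(ch, ch.encode("utf-8").hex(), repr(to_mojibake(ch)))

    broken = to_mojibake("lingüística")
    print(repr(broken), "->", repair(broken))
```

If `repair` raises on a broken string taken from the site, the damage is probably some other encoding mix-up and not this one.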

My guess is that something is doing some kind of caching. That’s not much help, but that’s what I’d try to look for.

Unless they just hate Docker, they can build their own images with discourse_docker. Then they can see exactly what is happening and not have to trust anyone else’s images.

Cool, thanks for that.

I thought that might be the case, but changing the message resulted in an update, so I don’t think it’s caching - it’s something encoding incorrectly.

I suggested a few options, but in the end, the infra team opted to just go with packages built using the build service. I don’t think it’s a “we hate docker” thing (though podman would probably be more likely what they’d want to use), but more a way of using the configuration management tools they’re using to manage everything in the same manner. Having a one-off that uses Docker/podman would add other complexities to using the CI/CD setup they’re using (or so I understand).

So ultimately, I set up my two sandboxes to be able to determine where issues are rooted so I could report them to the proper place; unfortunately, that means that when it is something in how we’ve built things, I have to chase down what we’re doing differently from a standard Docker-based installation so we can fix it.

I understand how they think that the Discourse way is crazy and they really want everything to be managed under one Unified System.

But. The last client I had that insisted on using their favorite tooling ended up paying me for close to 20 hours of work to get a usable backup to move to discourse.org hosting. The one before that paid a bunch more to tweak their custom setup to keep their site from crashing several times a week, and then a year later they paid me more still to move them to discourse.org hosting. :slight_smile:

Good luck!

Appreciate the advice. The good news is that the backups from our prod system work in a dockerized installation just fine (I’ve tested that), so if/when they decide that using a Docker-based installation is the right way to go, we’ll be in good shape. We’ve got a lot of data (migrated from vBulletin a few years ago to Discourse), and generally things have worked pretty well, with a few odd hiccups here and there.

Consequently, we’ve learned a lot about how Discourse works, so not a bad thing all around. :slight_smile:

It looks like /categories.json is an API endpoint rather than a static file that’s created and then read, so I think that helps limit the issue to either ruby or javascript. I’ve found where the schema is for this endpoint. I’m not particularly familiar with ruby (I’ve accumulated a lot of programming language experience over the years, so reading most languages isn’t a problem for me even if I can’t code in them - I can get the gist pretty easily), but it looks like the javascript is mostly executed in the browser and the ruby is executed on the server (though I note that nodejs is installed as well, so that generalization may not really hold).

If I can find the function that processes /categories (as it looks like .json on the end just tells the code how to format the output; I see similar behaviour in /top vs /top.rss, for example), then that should narrow down where I need to look in the code, and that’ll tell me what ruby gems (I’m fairly certain it’ll be ruby code) need to be checked that they’re properly built.

It seems to be something specific to the excerpting functions - I’ve just noticed this happens on our search results pages as well:

(for example)

The text is:

I don’t see a “Share” button.

Which is something I quoted in a response to a user (panorain) myself. I happened on this by accident: while trying to look at my own activity, I get a 500 server error, and the output from /logs shows a runtime error “input string cannot be empty” in lib/excerpt_parser.rb.

Seems like a few things are leading back to something in excerpt processing, but only in the development-style installs.

In my docker-based installation, I can actually view my activity without error; weirdly, though, the database in that installation is restored from a recent production server backup - where the issue exists.

It looks like we upgraded nokogiri to 1.17.2, and I see the Dockerized version is 1.16.7 - I suspect that’s the cause of this issue. Going to see about reverting that update (and anything else that was updated at the same time).