Parsing RSS feed missing quotes + apostrophes


(Pat David) #1

Hi all!

We’ve recently upgraded to 2.0.0.beta10~git49.9f422c93f6 and our latest RSS pull for comment embedding nuked apostrophes and quotes from the posts:

The image example is from this page:

The feed it’s pulling from is here: PIXLS.US

Search on “an about page and help” to get to the relevant section.

y u hate typographical marks?! :smiley:


(Régis Hanol) #2

@techAPJ didn’t you make changes to the feed poller recently?


(Darix) #3

So this seems to be 2 separate issues:

  1. When upgrading from 2.0.0.beta9+git0 to the version mentioned above, Discourse decided it needed to refetch/rerender a lot of older posts from the RSS feed. This is how we noticed the second bug.
  2. It seems we lost at least all typographic marks, in our recent posts as well. The import category can be seen here: PIXLS.US - discuss.pixls.us

It smells a bit like “the RSS feed, despite being sent with the correct headers, is not seen/read as a UTF-8 encoded string, so the reencoding with the replace option strips the UTF-8 encoded chars.”
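
The hypothesis above can be reproduced in plain Ruby. This is only a minimal sketch: the reverted Discourse code is not shown in this thread, so it assumes the reencode was something like `String#encode` with `invalid: :replace, undef: :replace`. The sample text and all variable names are made up for illustration.

```ruby
# Hypothetical stand-in for a feed body containing typographic marks
# (U+2019 apostrophe, U+201C / U+201D curly quotes).
utf8_text = "it\u2019s \u201Cquoted\u201D"

# Same bytes, but tagged as binary, as if the UTF-8 label had been lost.
binary_text = utf8_text.b

# Assumed shape of the reencode: replace anything unconvertible with nothing.
opts = { invalid: :replace, undef: :replace, replace: "" }

# Binary -> UTF-8: every byte >= 0x80 has no defined conversion, so it is dropped.
stripped = binary_text.encode("UTF-8", **opts)

# UTF-8 -> UTF-8 on valid input: nothing to replace, the text survives.
untouched = utf8_text.encode("UTF-8", **opts)

# Relabeling the bytes instead of converting them keeps the marks.
relabeled = binary_text.dup.force_encoding("UTF-8")

puts stripped
puts untouched
puts relabeled.valid_encoding?
```

If the string is tagged as binary (or US-ASCII), the reencode removes the apostrophes and quotes. If it is already tagged as valid UTF-8, the same call leaves it alone.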


(Arpit Jalan) #4

Thanks for bringing this to our notice. I have reverted the UTF-8 encode related changes I pushed a few days ago.

Updating to the latest version will normalize the behaviour.


(Arpit Jalan) #7

Since the content was updated/changed the topic got updated as per this code. Now that I reverted the code, the topic will be updated again with proper formatting.


(Darix) #8

OK, I patched out the “not recently polled” check for a moment, triggered the sidekiq job, and all our posts are good again.

Why was this reencode added in the first place? I will debug later why the raw_feed string isn’t UTF-8 encoded. If it were, the reencode should have been a no-op, no?


(Arpit Jalan) #9

Because in some cases of bad (not supported) encoding we were seeing job exceptions in error logs.

Yes, if the string is already properly UTF-8 encoded, then the encode logic shouldn’t have kicked in.


(Darix) #10

It might depend on the locale/lang environment of the sidekiq job. Maybe I should add LC_ALL=en_US.UTF-8 and LANG=en_US.UTF-8 to my service file.
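
One way to check the locale idea, as a rough sketch: on Unix-like systems Ruby derives `Encoding.default_external` from the process locale, so printing it from inside the job's environment would show what the job actually sees. `encoding_report` is a hypothetical helper, not part of Discourse or sidekiq, and whether the feed string's encoding tag depends on the locale at all is exactly the open question here.

```ruby
# Hypothetical diagnostic helper: reports the process-wide default encodings
# and how a given string is currently tagged.
def encoding_report(sample)
  {
    default_external: Encoding.default_external, # derived from LANG / LC_ALL on Unix
    default_internal: Encoding.default_internal, # usually nil unless configured
    sample_encoding: sample.encoding,
    sample_valid: sample.valid_encoding?
  }
end

# A binary-tagged sample, standing in for a feed body with a lost UTF-8 label.
report = encoding_report("it\u2019s".b)
report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

Running this once with and once without the LC_ALL/LANG settings would show whether the defaults change between the two environments.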


(Arpit Jalan) #11

@gerhard sent a PR with a proper fix:

Closing this topic for now. Please create a new topic if the problem persists.

