Over on another Discourse forum, @sam suggested I make a bug report here: there seems to be (at least) an intermittent failure when oneboxing CNN articles. Since this is a high-profile site, it seems pretty significant. It's unclear whether it's a rate-limiting issue, a user-agent block, a Discourse-side issue, an actual problem with the oEmbed data, or something else.
Examples:
(Additional examples removed due to New Userness.)
The reported CNN links are not being oneboxed because the response size is 2.7 MB, which is greater than the current limit of 2 MB. We can fix this by raising the default size limit, but I would like to better understand the risk of doing so.
It depends on the onebox engine. Some onebox engines, like the Amazon one, scrape certain information, such as the item's price, from within the `<body>`. For OpenGraph engines, we should in theory only need certain tags, like the `<meta>` tags in `<head>`.
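For an OpenGraph engine the parsing side really is just a handful of `<meta>` lookups. A rough sketch with Nokogiri (the sample HTML is made up, and this isn't the actual engine code, just an illustration of how little of the document we need):

```ruby
require "nokogiri"

html = <<~HTML
  <html>
    <head>
      <meta property="og:title" content="Example article" />
      <meta property="og:description" content="A short summary." />
      <meta property="og:image" content="https://example.com/image.jpg" />
    </head>
    <body>lots of markup an OpenGraph onebox never looks at</body>
  </html>
HTML

doc = Nokogiri::HTML(html)

# Collect every og:* meta tag from the head into a hash.
og = doc.css('meta[property^="og:"]').each_with_object({}) do |tag, hash|
  hash[tag["property"]] = tag["content"]
end

puts og.inspect
# => {"og:title"=>"Example article", "og:description"=>"A short summary.", ...}
```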
The most efficient approach would be to parse the response for the necessary information as we stream it, but changing all of our engines to do this would take considerable effort and is a much more complex solution.
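For reference, the shape of the streaming idea is roughly this sketch (the constant and method names here are hypothetical, not what the engines currently do; it just stops reading once the cap is hit and keeps what was already downloaded):

```ruby
require "net/http"
require "uri"

# Hypothetical cap mirroring the current 2 MB onebox limit.
MAX_RESPONSE_BYTES = 2 * 1024 * 1024

# Stream the response body and stop reading once the cap is reached,
# returning the bytes downloaded so far instead of discarding them.
def fetch_capped(url, limit: MAX_RESPONSE_BYTES)
  uri = URI(url)
  body = +""

  catch(:oversized) do
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(Net::HTTP::Get.new(uri)) do |response|
        response.read_body do |chunk|
          body << chunk
          throw :oversized if body.bytesize >= limit
        end
      end
    end
  end

  body.byteslice(0, limit)
end
```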
Actually, Nokogiri, which we use to parse the HTML response, is capable of parsing incomplete HTML, so there is no need to throw away the entire response when it is too large. I think we can continue to limit the response body to 2 MB and, if the response exceeds that, just parse the first 2 MB.
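A quick sketch of what I mean (the sample document and truncation point are made up, but this is the Nokogiri behaviour we would be relying on: it repairs the unterminated markup and the `<head>` tags are still there):

```ruby
require "nokogiri"

full_html = <<~HTML
  <html>
    <head>
      <meta property="og:title" content="Example article" />
    </head>
    <body>#{"x" * 10_000}</body>
  </html>
HTML

# Simulate a response that was cut off partway through the <body>.
truncated = full_html[0, 200]

doc = Nokogiri::HTML(truncated)
puts doc.at('meta[property="og:title"]')&.[]("content")
# => "Example article"
```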