CNN oneboxing failure

Over on another Discourse forum, @sam suggested I make a bug report here: There seems to be (at least) intermittent failure of oneboxing of CNN articles. Since this is a high-profile site, it seems pretty significant. Unclear whether it’s a rate-limiting issue, a user-agent block, a Discourse-side issue, an actual problem with the oEmbed data, or what.

Examples:

(Additional examples removed due to New Userness.)

4 Likes

Hi @wazroth :wave: welcome to Meta. :slight_smile: Thanks for the report.

Yes, I have been able to repro this on my dev instance and a hosted site. CNN links are not oneboxing at all.

3 Likes

Thanks @wazroth, we have debugging this slotted for some time in the next 4 weeks.

4 Likes

@ted Do you happen to remember why we reduced max_download_kb for onebox from 10mb to 2mb in SECURITY: Prevent Onebox cache overflow by limiting downloads and URL… · discourse/discourse@95a82d6 · GitHub?

The reported CNN links are not being oneboxed because the response size is 2.7mb, which is greater than the current limit of 2mb. We can fix this by raising the default size, but I would like to better understand the risk of doing so.

4 Likes

hmmm also … to follow on, isn’t all the info we need in the first 2mb anyway?

2 Likes

See also Amazon Onebox broken, possibly related?

Is this max_download_kb setting hard coded? It’s not something I can change via the admin menu, is it?

Depends on the onebox engine. Some onebox engines, like Amazon, scrape for certain information, such as the price of the item, from within the <body>. For opengraph engines, we should in theory only need certain tags like <meta> in <head>.

The most efficient way will be to parse the response for the necessary information as we stream the response but changing all our engines to do this is likely going to take a considerable amount of effort and is a much more complex solution.
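To make the streaming idea concrete, here is a minimal sketch, not Discourse's actual code. It reads a response in chunks and stops as soon as the closing </head> tag has arrived or a byte limit is hit. The helper name `read_head` and its parameters are made up for illustration, and a `StringIO` stands in for the network stream.

```ruby
require "stringio"

# Hypothetical helper (not part of Discourse or onebox): read from an IO in
# chunks and stop once </head> has been seen or max_bytes have been read.
def read_head(io, max_bytes:, chunk_size: 16 * 1024)
  buffer = String.new # binary buffer, so partial multibyte characters are harmless

  while buffer.bytesize < max_bytes
    # Never request more than the remaining budget.
    chunk = io.read([chunk_size, max_bytes - buffer.bytesize].min)
    break if chunk.nil? # end of stream

    buffer << chunk

    # Re-scanning the whole buffer is wasteful but keeps the sketch short.
    if (idx = buffer.index(%r{</head>}i))
      return buffer[0, idx + "</head>".length]
    end
  end

  buffer
end
```

This only helps engines that need nothing beyond <head>. An engine that scrapes the <body> would need its own stopping condition, which is part of why doing this for every engine is a lot of work.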

Actually, Nokogiri, which we use to parse the HTML response, is capable of parsing incomplete HTML text, so there is no need to throw away the entire response when it is too large. I think we can just continue to limit the response body to 2mb, and if the response size exceeds that, we just try to parse the first 2mb.

1 Like

This is fixed by

Locally, the “problematic” URLs reported in this topic no longer display an error when we try to onebox them.

5 Likes

Fantastic, thanks @tgxworld :smiley:

2 Likes