Over on another Discourse forum, @sam suggested I make a bug report here: there seems to be (at least) an intermittent failure when oneboxing CNN articles. Since this is a high-profile site, it seems pretty significant. It's unclear whether it's a rate-limiting issue, a user-agent block, a Discourse-side issue, an actual problem with the oEmbed data, or something else.
Examples:
(Additional examples removed due to New Userness.)
The reported CNN links are not being oneboxed because the response size is 2.7 MB, which is greater than the current limit of 2 MB. We can fix this by raising the default size limit, but I would like to better understand the risk of doing so.
It depends on the onebox engine. Some onebox engines, like the Amazon one, scrape certain information, such as the item's price, from within the `<body>`. For OpenGraph engines, we should in theory only need certain tags, like the `<meta>` tags in `<head>`.
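For an OpenGraph engine the parsing side really is just a handful of `<meta>` lookups. A rough sketch with Nokogiri (the sample HTML is made up, and this isn't the actual engine code, just an illustration of how little of the document we need):

```ruby
require "nokogiri"

html = <<~HTML
  <html>
    <head>
      <meta property="og:title" content="Example article" />
      <meta property="og:description" content="A short summary." />
      <meta property="og:image" content="https://example.com/image.jpg" />
    </head>
    <body>lots of markup an OpenGraph onebox never looks at</body>
  </html>
HTML

doc = Nokogiri::HTML(html)

# Collect every og:* meta tag from the head into a hash.
og = doc.css('meta[property^="og:"]').each_with_object({}) do |tag, hash|
  hash[tag["property"]] = tag["content"]
end

puts og.inspect
# => {"og:title"=>"Example article", "og:description"=>"A short summary.", ...}
```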
The most efficient approach would be to parse the response for the necessary information as we stream it, but changing all of our engines to do this would take considerable effort and is a much more complex solution.
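For reference, the shape of the streaming idea is roughly this sketch (the constant and method names here are hypothetical, not what the engines currently do; it just stops reading once the cap is hit and keeps what was already downloaded):

```ruby
require "net/http"
require "uri"

# Hypothetical cap mirroring the current 2 MB onebox limit.
MAX_RESPONSE_BYTES = 2 * 1024 * 1024

# Stream the response body and stop reading once the cap is reached,
# returning the bytes downloaded so far instead of discarding them.
def fetch_capped(url, limit: MAX_RESPONSE_BYTES)
  uri = URI(url)
  body = +""

  catch(:oversized) do
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(Net::HTTP::Get.new(uri)) do |response|
        response.read_body do |chunk|
          body << chunk
          throw :oversized if body.bytesize >= limit
        end
      end
    end
  end

  body.byteslice(0, limit)
end
```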
Actually, Nokogiri, which we use to parse the HTML response, is capable of parsing incomplete HTML, so there is no need to throw away the entire response when it is too large. I think we can continue to limit the response body to 2 MB and, if the response exceeds that, just parse the first 2 MB.
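A quick sketch of what I mean (the sample document and truncation point are made up, but this is the Nokogiri behaviour we would be relying on: it repairs the unterminated markup and the `<head>` tags are still there):

```ruby
require "nokogiri"

full_html = <<~HTML
  <html>
    <head>
      <meta property="og:title" content="Example article" />
    </head>
    <body>#{"x" * 10_000}</body>
  </html>
HTML

# Simulate a response that was cut off partway through the <body>.
truncated = full_html[0, 200]

doc = Nokogiri::HTML(truncated)
puts doc.at('meta[property="og:title"]')&.[]("content")
# => "Example article"
```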