Onebox issue with a specific site

Continuing the discussion from Rich link previews with Onebox:

I’m having problems sharing with this site:

I’ve got an odd issue with oneboxing.

That one-boxes fine (as you can see).

But deep links are a problem: countless links on this site fail to onebox even though they preview fine on Facebook. Why?

e.g.:

https://seekingalpha.com/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield

This latter one fails to onebox when pasted as title.

The Facebook debugger works fine, and you can see the og: tags when you view source.

However, oneboxing fails. (Yes, I’m aware I need to clear the cache, and I do.)

FYI it also fails on the discourse/onebox test server

Any ideas?

OK I’ve got a bit further with this, but still at a loss:

In helper.rb of discourse/onebox:

response = (fetch_response(url, nil, nil, headers) rescue nil)

returns nil.
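For anyone debugging the same thing: the `rescue nil` on that line hides whatever exception was raised, so a rejected request and a timeout look identical. Below is a minimal sketch of a wrapper that keeps the error text. `fetch_with_diagnosis` is a hypothetical name of my own, not part of Onebox; only `fetch_response` comes from the line above.

```ruby
# Hypothetical debugging helper (not part of Onebox): runs the given block
# and reports why it failed instead of collapsing every failure into nil.
def fetch_with_diagnosis
  response = yield
  { ok: true, response: response, error: nil }
rescue StandardError => e
  # Keep the exception class and message so the cause is visible in the logs.
  { ok: false, response: nil, error: "#{e.class}: #{e.message}" }
end

# Intended use inside the helper, assuming the same arguments as above:
#   result = fetch_with_diagnosis { fetch_response(url, nil, nil, headers) }
#   puts result[:error] unless result[:ok]
```

This only makes the failure visible; it does not change whether the remote site accepts the request.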

Is it possible the site is somehow rejecting the request?

OK wanted to update this as someone might appreciate the insight at some point or be able to help me clarify what’s going on.

For the sake of control, I decided to take Discourse out of the picture for a second, as I was suspicious about running these Open Graph scrapes from my VPS provider.

So I found an alternative open source Open Graph (and Twitter) tag extractor library on GitHub.

Installing and running this on my local PC versus the VPS produced an interesting result.

First, my local PC

BBC News

$ node ogscraper.js
error: false
results: { data:
   { ogTitle: 'Home - BBC News',
     ogType: 'website',
     ogDescription:
      'Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.',
     ogSiteName: 'BBC News',
     ogLocale: 'en_GB',
     ogUrl: 'https://www.bbc.co.uk/news',
     twitterCard: 'summary_large_image',
     twitterSite: '@BBCNews',
     twitterTitle: 'Home - BBC News',
     twitterDescription:
      'Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.',
     twitterCreator: '@BBCNews',
     ogImage:
      { url:
         '//m.files.bbci.co.uk/modules/bbc-morph-news-waf-page-meta/2.3.0/bbc_news_logo.png',
        width: null,
        height: null,
        type: null },
     twitterImage:
      { url:
         '//m.files.bbci.co.uk/modules/bbc-morph-news-waf-page-meta/2.3.0/bbc_news_logo.png',
        width: null,
        height: null,
        alt: 'BBC News' } },
  success: true,
  requestUrl: 'http://news.bbc.co.uk/' }

Success, the script is working!

Next, the problem article!

SeekingAlpha (article which fails to Onebox on my Discourse forum)

$ node ogscraper.js
error: false
results: { data:
   { ogLocale: 'en_US',
     ogSiteName: 'Seeking Alpha',
     ogTitle:
      'Blackstone: This Blue Chip Private Equity Name Continues To Reward Investors With +7.7% Yield',
     ogUrl:
      'https://seekingalpha.com/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
     ogDescription:
      'Blackstone posted inflows and asset under management growth across all of its major segments in Q3. BX is still sitting on significant dry powder, giving its ma',
     ogType: 'article',
     twitterCard: 'summary_large_image',
     twitterSite: '@SeekingAlpha',
     twitterCreator: '@',
     twitterTitle:
      'Blackstone: This Blue Chip Private Equity Name Continues To Reward Investors With +7.7% Yield',
     twitterDescription:
      'Blackstone posted inflows and asset under management growth across all of its major segments in Q3.BX is still sitting on significant dry powder, giving its management team flexibility as it navigates',
     twitterAppNameiPhone: 'Seeking Alpha Portfolio',
     twitterAppIdiPhone: '552799694',
     twitterAppNameiPad: 'Seeking Alpha Portfolio',
     twitterAppIdiPad: '552799694',
     twitterAppNameGooglePlay: 'Seeking Alpha',
     twitterAppIdGooglePlay: 'com.seekingalpha.webwrapper',
     ogImage:
      { url:
         'https://static3.seekingalpha.com/uploads/2018/11/13/16392-15421307232575014.png',
        width: null,
        height: null,
        type: null },
     twitterImage:
      { url:
         'https://static3.seekingalpha.com/uploads/2018/11/13/16392-15421307232575014.png',
        width: null,
        height: null,
        alt: null } },
  success: true,
  requestUrl:
   'https://seekingalpha.com/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield' }

Awesome, so it works from my local PC!

Next, I logged into the development VPS that I’ve been using to develop TLP. It has Node.js installed, so let’s try it there:

My VPS server

BBC News

$ node ogscraper.js
error: false
results: { data:
   { ogTitle: 'Home - BBC News',
     ogType: 'website',
     ogDescription: 'Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.',
     ogSiteName: 'BBC News',
     ogLocale: 'en_GB',
     ogUrl: 'https://www.bbc.co.uk/news',
     twitterCard: 'summary_large_image',
     twitterSite: '@BBCNews',
     twitterTitle: 'Home - BBC News',
     twitterDescription: 'Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.',
     twitterCreator: '@BBCNews',
     ogImage:
      { url: '//m.files.bbci.co.uk/modules/bbc-morph-news-waf-page-meta/2.3.0/bbc_news_logo.png',
        width: null,
        height: null,
        type: null },
     twitterImage:
      { url: '//m.files.bbci.co.uk/modules/bbc-morph-news-waf-page-meta/2.3.0/bbc_news_logo.png',
        width: null,
        height: null,
        alt: 'BBC News' } },
  success: true,
  requestUrl: 'https://news.bbc.co.uk/' }

Great! That works

Next the problem article …

SeekingAlpha (article which fails to Onebox on my Discourse forum)

$ node ogscraper.js
error: true
results: { error: 'Page Not Found',
  success: false,
  requestUrl: 'https://seekingalpha.com/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
  errorDetails: 'Server Has Ran Into A Error',
  response:
   IncomingMessage {
     _readableState:
      ReadableState {
        objectMode: false,
        highWaterMark: 16384,
        buffer: [Object],
        length: 0,
        pipes: null,
        pipesCount: 0,
        flowing: false,
        ended: true,
        endEmitted: true,
        reading: false,
        sync: true,
        needReadable: false,
        emittedReadable: false,
        readableListening: false,
        resumeScheduled: false,
        destroyed: false,
        defaultEncoding: 'utf8',
        awaitDrain: 0,
        readingMore: false,
        decoder: null,
        encoding: null },
     readable: false,
     domain: null,
     _events: { end: [Array], close: [Function] },
     _eventsCount: 2,
     _maxListeners: undefined,
     socket:
      TLSSocket {
        _tlsOptions: [Object],
        _secureEstablished: true,
        _securePending: false,
        _newSessionPending: false,
        _controlReleased: true,
        _SNICallback: null,
        servername: null,
        npnProtocol: undefined,
        alpnProtocol: false,
        authorized: true,
        authorizationError: null,
        encrypted: true,
        _events: [Object],
        _eventsCount: 10,
        connecting: false,
        _hadError: false,
        _handle: null,
        _parent: null,
        _host: 'seekingalpha.com',
        _readableState: [Object],
        readable: false,
        domain: null,
        _maxListeners: undefined,
        _writableState: [Object],
        writable: false,
        allowHalfOpen: false,
        _bytesDispatched: 223,
        _sockname: null,
        _pendingData: null,
        _pendingEncoding: '',
        server: undefined,
        _server: null,
        ssl: null,
        _requestCert: true,
        _rejectUnauthorized: true,
        parser: null,
        _httpMessage: [Object],
        read: [Function],
        _consuming: true,
        _idleTimeout: -1,
        _idleNext: null,
        _idlePrev: null,
        _idleStart: 1615,
        _destroyed: false,
        [Symbol(asyncId)]: 9,
        [Symbol(bytesRead)]: 2329,
        [Symbol(asyncId)]: 22,
        [Symbol(triggerAsyncId)]: 18 },
     connection:
      TLSSocket {
        _tlsOptions: [Object],
        _secureEstablished: true,
        _securePending: false,
        _newSessionPending: false,
        _controlReleased: true,
        _SNICallback: null,
        servername: null,
        npnProtocol: undefined,
        alpnProtocol: false,
        authorized: true,
        authorizationError: null,
        encrypted: true,
        _events: [Object],
        _eventsCount: 10,
        connecting: false,
        _hadError: false,
        _handle: null,
        _parent: null,
        _host: 'seekingalpha.com',
        _readableState: [Object],
        readable: false,
        domain: null,
        _maxListeners: undefined,
        _writableState: [Object],
        writable: false,
        allowHalfOpen: false,
        _bytesDispatched: 223,
        _sockname: null,
        _pendingData: null,
        _pendingEncoding: '',
        server: undefined,
        _server: null,
        ssl: null,
        _requestCert: true,
        _rejectUnauthorized: true,
        parser: null,
        _httpMessage: [Object],
        read: [Function],
        _consuming: true,
        _idleTimeout: -1,
        _idleNext: null,
        _idlePrev: null,
        _idleStart: 1615,
        _destroyed: false,
        [Symbol(asyncId)]: 9,
        [Symbol(bytesRead)]: 2329,
        [Symbol(asyncId)]: 22,
        [Symbol(triggerAsyncId)]: 18 },
     httpVersionMajor: 1,
     httpVersionMinor: 1,
     httpVersion: '1.1',
     complete: true,
     headers:
      { allow: 'GET, POST, HEAD, PUT, PATCH, DELETE, OPTIONS',
        'content-encoding': 'gzip',
        'content-type': 'text/html; charset=UTF-8',
        'accept-ranges': 'bytes, bytes, bytes, bytes',
        'content-length': '1763',
        date: 'Sat, 05 Jan 2019 18:14:46 GMT',
        connection: 'close',
        'set-cookie': [Array],
        'x-served-by': 'cache-sea1031-SEA, cache-ams21050-AMS',
        'x-cache': 'MISS, MISS',
        'x-cache-hits': '0, 0',
        'x-timer': 'S1546712087.617278,VS0,VE182',
        vary: 'User-Agent, Accept-Encoding' },
     rawHeaders:
      [ 'Allow',
        'GET, POST, HEAD, PUT, PATCH, DELETE, OPTIONS',
        'Content-Encoding',
        'gzip',
        'Content-Type',
        'text/html; charset=UTF-8',
        'Accept-Ranges',
        'bytes',
        'Accept-Ranges',
        'bytes',
        'Accept-Ranges',
        'bytes',
        'Content-Length',
        '1763',
        'Accept-Ranges',
        'bytes',
        'Date',
        'Sat, 05 Jan 2019 18:14:46 GMT',
        'Connection',
        'close',
        'set-cookie',
        'machine_cookie=9628607654888; expires=Fri, 05 Jan 2024 18:14:46 GMT; path=/;',
        'X-Served-By',
        'cache-sea1031-SEA, cache-ams21050-AMS',
        'X-Cache',
        'MISS, MISS',
        'X-Cache-Hits',
        '0, 0',
        'X-Timer',
        'S1546712087.617278,VS0,VE182',
        'Vary',
        'User-Agent, Accept-Encoding' ],
     trailers: {},
     rawTrailers: [],
     upgrade: false,
     url: '',
     method: null,
     statusCode: 403,
     statusMessage: 'Forbidden',
     client:
      TLSSocket {
        _tlsOptions: [Object],
        _secureEstablished: true,
        _securePending: false,
        _newSessionPending: false,
        _controlReleased: true,
        _SNICallback: null,
        servername: null,
        npnProtocol: undefined,
        alpnProtocol: false,
        authorized: true,
        authorizationError: null,
        encrypted: true,
        _events: [Object],
        _eventsCount: 10,
        connecting: false,
        _hadError: false,
        _handle: null,
        _parent: null,
        _host: 'seekingalpha.com',
        _readableState: [Object],
        readable: false,
        domain: null,
        _maxListeners: undefined,
        _writableState: [Object],
        writable: false,
        allowHalfOpen: false,
        _bytesDispatched: 223,
        _sockname: null,
        _pendingData: null,
        _pendingEncoding: '',
        server: undefined,
        _server: null,
        ssl: null,
        _requestCert: true,
        _rejectUnauthorized: true,
        parser: null,
        _httpMessage: [Object],
        read: [Function],
        _consuming: true,
        _idleTimeout: -1,
        _idleNext: null,
        _idlePrev: null,
        _idleStart: 1615,
        _destroyed: false,
        [Symbol(asyncId)]: 9,
        [Symbol(bytesRead)]: 2329,
        [Symbol(asyncId)]: 22,
        [Symbol(triggerAsyncId)]: 18 },
     _consuming: true,
     _dumped: false,
     req:
      ClientRequest {
        domain: null,
        _events: [Object],
        _eventsCount: 6,
        _maxListeners: undefined,
        output: [],
        outputEncodings: [],
        outputCallbacks: [],
        outputSize: 0,
        writable: true,
        _last: true,
        upgrading: false,
        chunkedEncoding: false,
        shouldKeepAlive: false,
        useChunkedEncodingByDefault: false,
        sendDate: false,
        _removedConnection: false,
        _removedContLen: false,
        _removedTE: false,
        _contentLength: 0,
        _hasBody: true,
        _trailer: '',
        finished: true,
        _headerSent: true,
        socket: [Object],
        connection: [Object],
        _header: 'GET /article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield HTTP/1.1\r\nuser-agent: request.js\r\nhost: seekingalpha.com\r\naccept-encoding: gzip, deflate\r\nConnection: close\r\n\r\n',
        _onPendingData: [Function: noopPendingOutput],
        agent: [Object],
        socketPath: undefined,
        timeout: undefined,
        method: 'GET',
        path: '/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
        _ended: true,
        res: [Circular],
        aborted: undefined,
        timeoutCb: [Function: emitTimeout],
        upgradeOrConnect: false,
        parser: null,
        maxHeadersCount: null,
        [Symbol(outHeadersKey)]: [Object] },
     request:
      Request {
        domain: null,
        _events: [Object],
        _eventsCount: 5,
        _maxListeners: undefined,
        timeout: 2000,
        headers: [Object],
        gzip: true,
        encoding: null,
        followAllRedirects: true,
        maxRedirects: 20,
        callback: [Function],
        readable: true,
        writable: true,
        _qs: [Object],
        _auth: [Object],
        _oauth: [Object],
        _multipart: [Object],
        _redirect: [Object],
        _tunnel: [Object],
        setHeader: [Function],
        hasHeader: [Function],
        getHeader: [Function],
        removeHeader: [Function],
        method: 'GET',
        localAddress: undefined,
        pool: {},
        dests: [],
        __isRequestRequest: true,
        _callback: [Function],
        uri: [Object],
        proxy: null,
        tunnel: true,
        setHost: true,
        originalCookieHeader: undefined,
        _disableCookies: true,
        _jar: false,
        port: 443,
        host: 'seekingalpha.com',
        path: '/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
        httpModule: [Object],
        agentClass: [Object],
        agent: [Object],
        _started: true,
        href: 'https://seekingalpha.com/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
        req: [Object],
        ntick: true,
        timeoutTimer: null,
        response: [Circular],
        originalHost: 'seekingalpha.com',
        originalHostHeaderName: 'host',
        responseContent: [Object],
        _ended: true,
        _destdata: true,
        _callbackCalled: true },
     toJSON: [Function: responseToJSON],
     caseless: Caseless { dict: [Object] },
     read: [Function],
     body: <Buffer 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e 0a 3c 68 74 6d 6c 20 6c 61 6e 67 3d 22 65 6e 22 3e 0a 3c 68 65 61 64 3e 0a 20 20 3c 6d 65 74 61 20 63 68 ... > } }

Well, OK, so now it looks like we have a problem! If anyone is familiar with this kind of response and what it might mean, feel free to comment, but it looks like making such a request from my VPS host might be the issue …

However:

     statusCode: 403,
     statusMessage: 'Forbidden',

looks pretty clear!
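The dump also shows the request went out with `user-agent: request.js` and the response carried `vary: User-Agent`, so it may be worth checking whether the 403 depends on the user agent or only on the source IP. Here is a rough diagnostic sketch using Ruby's standard `net/http`. The helper names `build_request` and `status_for` are my own, and I have not confirmed that a different user agent changes anything for this site.

```ruby
require "net/http"
require "uri"

# Build a GET request for the URL that carries the given User-Agent header.
def build_request(url, user_agent)
  request = Net::HTTP::Get.new(URI.parse(url))
  request["User-Agent"] = user_agent
  request
end

# Send the request and return only the HTTP status code as a String.
def status_for(url, user_agent)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(build_request(url, user_agent)).code
  end
end

# Example (run the same lines on the local PC and on the VPS and compare):
#   status_for(article_url, "request.js")
#   status_for(article_url, "Mozilla/5.0 (X11; Linux x86_64)")
```

If both user agents get a 403 on the VPS but not locally, that points at the IP address rather than the headers.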

By the way, I can’t curl the article from either location (it returns a CAPTCHA challenge).

Whilst this is starting to show that it is not a ‘Discourse’ code issue, it is surely a general challenge for those of us hosting on ad hoc servers across the internet …

Any advice appreciated, I’d love to be able to Onebox these guys (because that’s what my users expect to be able to do).

It’s a simple problem, and we have had reports of this here on Meta: your droplet’s IP is banned from the website you want to onebox.

The most straightforward solution is to change IP, or even to move to a different hosting company.

Thanks. I have three Discourse forums at present. This fails on all of them. They are all on the same hosting provider.

I read even AWS is often blacklisted. Any recommendations?

Is there any point in approaching this specific site and asking them to whitelist me?

Sharing the name of the hosting provider here would be helpful as we can give pointers to our users when this happens again…

We host this site here in AWS and the onebox is working. You can swap IPs pretty easily if using their floating IPs too.

Well, you can try, but I wouldn’t count on it. Again, this comes down to which provider you use. If the provider is a cheap one that ignores spam and badly behaved crawlers hosted on its platform, the site may have blocked the provider’s entire IP range.

Rafael, it’s Scaleway.

I’ve raised a ticket with them, and I’ll see what they come back with. I may start spinning up a server on a competitor or two and run the same node script and see where I get to.

It’s a shame because otherwise I’ve not had any issues with them.

Problem is that it’s too much effort to differentiate between oneboxing and scraping. Many will sacrifice the former to prevent the latter.

Companies like WPEngine make a big deal out of only billing for real visits, so they have a financial interest in blocking automated non-search visits. Expect this to only get worse.

Oh, and many rich embeds only cache for finite periods, which means that the burden of re-caching content and thumbnails only ramps up.

And of course they make an exception for big powerful players like Facebook or Twitter.

The open web (if it still exists) is so fragile! :(.

This is a very good point. We bill for all visits that incur significant server time – and provide self-service tools for blocking rogue crawlers and bots via user-agent. So I think WPEngine has painted themselves into a corner here.

Maybe, but the obvious solution here is a rich embed that doesn’t involve pulling down the entire page just to query tags in the page.

That would let us block scrapers without hampering the potentially useful cases.

That would be more of a global web solution, though – has anyone proposed anything along these lines out there?

Just an update to this, I figured it would be interesting to see if I could find a workaround without moving server.

I’ve successfully monkeypatched Oneboxer and Onebox::Helpers in a local plugin to call a proprietary third-party API that reliably brings back the target page source if the direct response is nil (i.e. my request was rejected). There is a (very) small charge to my API account every time I call it, so it needs to be used sparingly, for those sites that are tricky.
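To illustrate the shape of that fallback (this is not the plugin code, just a minimal sketch): try the direct fetch first and only call the paid service when it comes back nil. `fetch_with_fallback`, `direct` and `paid_fallback` are hypothetical names, and both fetchers are assumed to be callables that take a URL and return the page source or nil.

```ruby
# Sketch only: both fetchers are hypothetical callables that take a URL and
# return the page source as a String, or nil when the request is rejected.
def fetch_with_fallback(url, direct:, paid_fallback:)
  body = begin
    direct.call(url)
  rescue StandardError
    nil # treat any direct failure the same way as a rejected request
  end
  return { body: body, source: :direct } if body

  # Only reached when the direct request failed, so the paid API is
  # charged for the tricky sites and not for every preview.
  { body: paid_fallback.call(url), source: :fallback }
end
```

Returning the `source` alongside the body makes it easy to log how often the paid path is actually used.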

The advantage is I can remain on a cheap-as-chips server whilst not having to worry about its ‘reputation’ (a bit like using a 3rd party mail service).

The code still works as normal for sites that don’t reject my first request so I’m not charged for the majority of previews.

I anticipate that the costs of using the third party API will be much lower than the increase in costs for moving to a VPS provider with a more trusted reputation.

But in any case, problem solved. Thanks to everyone who weighed in.

This is now a working plugin.

Stephen, do you think one solution might be for websites to mask their main content when responding to suspected crawlers, whilst leaving their tags in place so third parties can still generate previews? The data in the head tag, for example, is surely not that much of a concern.
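To show how little a preview actually needs, here is a rough sketch that pulls only the `og:` and `twitter:` meta tags out of the head and ignores everything else. `head_meta_tags` is a hypothetical helper of my own, and the regex matching is a simplification that assumes well-formed, quoted attributes; a real HTML parser would be more robust.

```ruby
# Sketch only: collect og:/twitter: meta tags from the <head> of a page.
# Regex-based and deliberately simple; assumes quoted attribute values.
def head_meta_tags(html)
  # Everything outside <head>...</head> (the article body) is ignored.
  head = html[/<head\b.*?<\/head>/mi] || ""
  tags = {}
  head.scan(/<meta\b[^>]*>/i) do |tag|
    key = tag[/\b(?:property|name)\s*=\s*["']((?:og|twitter):[^"']+)["']/i, 1]
    value = tag[/\bcontent\s*=\s*"([^"]*)"/i, 1] || tag[/\bcontent\s*=\s*'([^']*)'/i, 1]
    tags[key] = value if key && value
  end
  tags
end
```

A site that served just the head to suspected crawlers would still satisfy a function like this, while the body content stays protected.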