OK, I wanted to update this, as someone might appreciate the insight at some point or be able to help me clarify what’s going on.
For the sake of control, I decided to take Discourse out of the picture for a second, as I was suspicious about running these Open Graph scrapes from my VPS provider.
So I found an alternative open source Open Graph (and Twitter) tag extractor library on GitHub.
Installing and running this on my local PC versus the VPS produced an interesting result.
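For anyone unfamiliar with what such an extractor does: it fetches the page and reads the Open Graph and Twitter `<meta>` tags out of the HTML. Below is a minimal sketch of the parsing step only. It is not the library's implementation; `extractMetaTags` and the sample HTML are hypothetical names made up for illustration, and unlike the library output shown in this post it keeps the raw tag names (`og:title`) rather than camel-casing them (`ogTitle`).

```javascript
// Illustrative only: a dependency-free Open Graph / Twitter meta tag reader.
// A regex keeps the sketch short; a real scraper would use an HTML parser.
function extractMetaTags(html) {
  const tags = {};
  const metaRe = /<meta\b[^>]*>/gi;
  let metaMatch;
  while ((metaMatch = metaRe.exec(html)) !== null) {
    // Collect the attributes of this <meta> tag (double or single quoted).
    const attrs = {};
    const attrRe = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
    let attrMatch;
    while ((attrMatch = attrRe.exec(metaMatch[0])) !== null) {
      attrs[attrMatch[1].toLowerCase()] =
        attrMatch[2] !== undefined ? attrMatch[2] : attrMatch[3];
    }
    // Open Graph uses property="og:..."; Twitter cards usually use name="twitter:...".
    const key = attrs.property || attrs.name;
    if (key && /^(og|twitter):/.test(key) && attrs.content !== undefined) {
      tags[key] = attrs.content;
    }
  }
  return tags;
}

// Hypothetical sample input, loosely modelled on the BBC output in this post.
const sampleHtml = [
  '<head>',
  '<meta property="og:title" content="Home - BBC News">',
  '<meta name="twitter:site" content="@BBCNews">',
  '<meta name="viewport" content="width=device-width">',
  '</head>',
].join('\n');

console.log(extractMetaTags(sampleHtml));
```

The point is that the extraction itself depends only on the HTML that comes back, so any difference between two machines has to come from the HTTP response each one receives.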
First, my local PC
BBC News
$ node ogscraper.js
error: false
results: { data:
{ ogTitle: 'Home - BBC News',
ogType: 'website',
ogDescription:
'Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.',
ogSiteName: 'BBC News',
ogLocale: 'en_GB',
ogUrl: 'https://www.bbc.co.uk/news',
twitterCard: 'summary_large_image',
twitterSite: '@BBCNews',
twitterTitle: 'Home - BBC News',
twitterDescription:
'Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.',
twitterCreator: '@BBCNews',
ogImage:
{ url:
'//m.files.bbci.co.uk/modules/bbc-morph-news-waf-page-meta/2.3.0/bbc_news_logo.png',
width: null,
height: null,
type: null },
twitterImage:
{ url:
'//m.files.bbci.co.uk/modules/bbc-morph-news-waf-page-meta/2.3.0/bbc_news_logo.png',
width: null,
height: null,
alt: 'BBC News' } },
success: true,
requestUrl: 'http://news.bbc.co.uk/' }
Success, the script is working!
Next, the problem article!
SeekingAlpha (article which fails to Onebox on my Discourse forum)
$ node ogscraper.js
error: false
results: { data:
{ ogLocale: 'en_US',
ogSiteName: 'Seeking Alpha',
ogTitle:
'Blackstone: This Blue Chip Private Equity Name Continues To Reward Investors With +7.7% Yield',
ogUrl:
'https://seekingalpha.com/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
ogDescription:
'Blackstone posted inflows and asset under management growth across all of its major segments in Q3. BX is still sitting on significant dry powder, giving its ma',
ogType: 'article',
twitterCard: 'summary_large_image',
twitterSite: '@SeekingAlpha',
twitterCreator: '@',
twitterTitle:
'Blackstone: This Blue Chip Private Equity Name Continues To Reward Investors With +7.7% Yield',
twitterDescription:
'Blackstone posted inflows and asset under management growth across all of its major segments in Q3.BX is still sitting on significant dry powder, giving its management team flexibility as it navigates',
twitterAppNameiPhone: 'Seeking Alpha Portfolio',
twitterAppIdiPhone: '552799694',
twitterAppNameiPad: 'Seeking Alpha Portfolio',
twitterAppIdiPad: '552799694',
twitterAppNameGooglePlay: 'Seeking Alpha',
twitterAppIdGooglePlay: 'com.seekingalpha.webwrapper',
ogImage:
{ url:
'https://static3.seekingalpha.com/uploads/2018/11/13/16392-15421307232575014.png',
width: null,
height: null,
type: null },
twitterImage:
{ url:
'https://static3.seekingalpha.com/uploads/2018/11/13/16392-15421307232575014.png',
width: null,
height: null,
alt: null } },
success: true,
requestUrl:
'https://seekingalpha.com/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield' }
Awesome, so it works from my local PC!
Next, I logged into the development VPS that I’ve been using to develop TLP on. This has NodeJS installed. Let’s try it there:
My VPS server
BBC News
$ node ogscraper.js
error: false
results: { data:
{ ogTitle: 'Home - BBC News',
ogType: 'website',
ogDescription: 'Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.',
ogSiteName: 'BBC News',
ogLocale: 'en_GB',
ogUrl: 'https://www.bbc.co.uk/news',
twitterCard: 'summary_large_image',
twitterSite: '@BBCNews',
twitterTitle: 'Home - BBC News',
twitterDescription: 'Visit BBC News for up-to-the-minute news, breaking news, video, audio and feature stories. BBC News provides trusted World and UK news as well as local and regional perspectives. Also entertainment, business, science, technology and health news.',
twitterCreator: '@BBCNews',
ogImage:
{ url: '//m.files.bbci.co.uk/modules/bbc-morph-news-waf-page-meta/2.3.0/bbc_news_logo.png',
width: null,
height: null,
type: null },
twitterImage:
{ url: '//m.files.bbci.co.uk/modules/bbc-morph-news-waf-page-meta/2.3.0/bbc_news_logo.png',
width: null,
height: null,
alt: 'BBC News' } },
success: true,
requestUrl: 'https://news.bbc.co.uk/' }
Great! That works.
Next, the problem article …
SeekingAlpha (article which fails to Onebox on my Discourse forum)
$ node ogscraper.js
error: true
results: { error: 'Page Not Found',
success: false,
requestUrl: 'https://seekingalpha.com/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
errorDetails: 'Server Has Ran Into A Error',
response:
IncomingMessage {
_readableState:
ReadableState {
objectMode: false,
highWaterMark: 16384,
buffer: [Object],
length: 0,
pipes: null,
pipesCount: 0,
flowing: false,
ended: true,
endEmitted: true,
reading: false,
sync: true,
needReadable: false,
emittedReadable: false,
readableListening: false,
resumeScheduled: false,
destroyed: false,
defaultEncoding: 'utf8',
awaitDrain: 0,
readingMore: false,
decoder: null,
encoding: null },
readable: false,
domain: null,
_events: { end: [Array], close: [Function] },
_eventsCount: 2,
_maxListeners: undefined,
socket:
TLSSocket {
_tlsOptions: [Object],
_secureEstablished: true,
_securePending: false,
_newSessionPending: false,
_controlReleased: true,
_SNICallback: null,
servername: null,
npnProtocol: undefined,
alpnProtocol: false,
authorized: true,
authorizationError: null,
encrypted: true,
_events: [Object],
_eventsCount: 10,
connecting: false,
_hadError: false,
_handle: null,
_parent: null,
_host: 'seekingalpha.com',
_readableState: [Object],
readable: false,
domain: null,
_maxListeners: undefined,
_writableState: [Object],
writable: false,
allowHalfOpen: false,
_bytesDispatched: 223,
_sockname: null,
_pendingData: null,
_pendingEncoding: '',
server: undefined,
_server: null,
ssl: null,
_requestCert: true,
_rejectUnauthorized: true,
parser: null,
_httpMessage: [Object],
read: [Function],
_consuming: true,
_idleTimeout: -1,
_idleNext: null,
_idlePrev: null,
_idleStart: 1615,
_destroyed: false,
[Symbol(asyncId)]: 9,
[Symbol(bytesRead)]: 2329,
[Symbol(asyncId)]: 22,
[Symbol(triggerAsyncId)]: 18 },
connection:
TLSSocket {
_tlsOptions: [Object],
_secureEstablished: true,
_securePending: false,
_newSessionPending: false,
_controlReleased: true,
_SNICallback: null,
servername: null,
npnProtocol: undefined,
alpnProtocol: false,
authorized: true,
authorizationError: null,
encrypted: true,
_events: [Object],
_eventsCount: 10,
connecting: false,
_hadError: false,
_handle: null,
_parent: null,
_host: 'seekingalpha.com',
_readableState: [Object],
readable: false,
domain: null,
_maxListeners: undefined,
_writableState: [Object],
writable: false,
allowHalfOpen: false,
_bytesDispatched: 223,
_sockname: null,
_pendingData: null,
_pendingEncoding: '',
server: undefined,
_server: null,
ssl: null,
_requestCert: true,
_rejectUnauthorized: true,
parser: null,
_httpMessage: [Object],
read: [Function],
_consuming: true,
_idleTimeout: -1,
_idleNext: null,
_idlePrev: null,
_idleStart: 1615,
_destroyed: false,
[Symbol(asyncId)]: 9,
[Symbol(bytesRead)]: 2329,
[Symbol(asyncId)]: 22,
[Symbol(triggerAsyncId)]: 18 },
httpVersionMajor: 1,
httpVersionMinor: 1,
httpVersion: '1.1',
complete: true,
headers:
{ allow: 'GET, POST, HEAD, PUT, PATCH, DELETE, OPTIONS',
'content-encoding': 'gzip',
'content-type': 'text/html; charset=UTF-8',
'accept-ranges': 'bytes, bytes, bytes, bytes',
'content-length': '1763',
date: 'Sat, 05 Jan 2019 18:14:46 GMT',
connection: 'close',
'set-cookie': [Array],
'x-served-by': 'cache-sea1031-SEA, cache-ams21050-AMS',
'x-cache': 'MISS, MISS',
'x-cache-hits': '0, 0',
'x-timer': 'S1546712087.617278,VS0,VE182',
vary: 'User-Agent, Accept-Encoding' },
rawHeaders:
[ 'Allow',
'GET, POST, HEAD, PUT, PATCH, DELETE, OPTIONS',
'Content-Encoding',
'gzip',
'Content-Type',
'text/html; charset=UTF-8',
'Accept-Ranges',
'bytes',
'Accept-Ranges',
'bytes',
'Accept-Ranges',
'bytes',
'Content-Length',
'1763',
'Accept-Ranges',
'bytes',
'Date',
'Sat, 05 Jan 2019 18:14:46 GMT',
'Connection',
'close',
'set-cookie',
'machine_cookie=9628607654888; expires=Fri, 05 Jan 2024 18:14:46 GMT; path=/;',
'X-Served-By',
'cache-sea1031-SEA, cache-ams21050-AMS',
'X-Cache',
'MISS, MISS',
'X-Cache-Hits',
'0, 0',
'X-Timer',
'S1546712087.617278,VS0,VE182',
'Vary',
'User-Agent, Accept-Encoding' ],
trailers: {},
rawTrailers: [],
upgrade: false,
url: '',
method: null,
statusCode: 403,
statusMessage: 'Forbidden',
client:
TLSSocket {
_tlsOptions: [Object],
_secureEstablished: true,
_securePending: false,
_newSessionPending: false,
_controlReleased: true,
_SNICallback: null,
servername: null,
npnProtocol: undefined,
alpnProtocol: false,
authorized: true,
authorizationError: null,
encrypted: true,
_events: [Object],
_eventsCount: 10,
connecting: false,
_hadError: false,
_handle: null,
_parent: null,
_host: 'seekingalpha.com',
_readableState: [Object],
readable: false,
domain: null,
_maxListeners: undefined,
_writableState: [Object],
writable: false,
allowHalfOpen: false,
_bytesDispatched: 223,
_sockname: null,
_pendingData: null,
_pendingEncoding: '',
server: undefined,
_server: null,
ssl: null,
_requestCert: true,
_rejectUnauthorized: true,
parser: null,
_httpMessage: [Object],
read: [Function],
_consuming: true,
_idleTimeout: -1,
_idleNext: null,
_idlePrev: null,
_idleStart: 1615,
_destroyed: false,
[Symbol(asyncId)]: 9,
[Symbol(bytesRead)]: 2329,
[Symbol(asyncId)]: 22,
[Symbol(triggerAsyncId)]: 18 },
_consuming: true,
_dumped: false,
req:
ClientRequest {
domain: null,
_events: [Object],
_eventsCount: 6,
_maxListeners: undefined,
output: [],
outputEncodings: [],
outputCallbacks: [],
outputSize: 0,
writable: true,
_last: true,
upgrading: false,
chunkedEncoding: false,
shouldKeepAlive: false,
useChunkedEncodingByDefault: false,
sendDate: false,
_removedConnection: false,
_removedContLen: false,
_removedTE: false,
_contentLength: 0,
_hasBody: true,
_trailer: '',
finished: true,
_headerSent: true,
socket: [Object],
connection: [Object],
_header: 'GET /article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield HTTP/1.1\r\nuser-agent: request.js\r\nhost: seekingalpha.com\r\naccept-encoding: gzip, deflate\r\nConnection: close\r\n\r\n',
_onPendingData: [Function: noopPendingOutput],
agent: [Object],
socketPath: undefined,
timeout: undefined,
method: 'GET',
path: '/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
_ended: true,
res: [Circular],
aborted: undefined,
timeoutCb: [Function: emitTimeout],
upgradeOrConnect: false,
parser: null,
maxHeadersCount: null,
[Symbol(outHeadersKey)]: [Object] },
request:
Request {
domain: null,
_events: [Object],
_eventsCount: 5,
_maxListeners: undefined,
timeout: 2000,
headers: [Object],
gzip: true,
encoding: null,
followAllRedirects: true,
maxRedirects: 20,
callback: [Function],
readable: true,
writable: true,
_qs: [Object],
_auth: [Object],
_oauth: [Object],
_multipart: [Object],
_redirect: [Object],
_tunnel: [Object],
setHeader: [Function],
hasHeader: [Function],
getHeader: [Function],
removeHeader: [Function],
method: 'GET',
localAddress: undefined,
pool: {},
dests: [],
__isRequestRequest: true,
_callback: [Function],
uri: [Object],
proxy: null,
tunnel: true,
setHost: true,
originalCookieHeader: undefined,
_disableCookies: true,
_jar: false,
port: 443,
host: 'seekingalpha.com',
path: '/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
httpModule: [Object],
agentClass: [Object],
agent: [Object],
_started: true,
href: 'https://seekingalpha.com/article/4223492-blackstone-blue-chip-private-equity-name-continues-reward-investors-plus-7_7-percent-yield',
req: [Object],
ntick: true,
timeoutTimer: null,
response: [Circular],
originalHost: 'seekingalpha.com',
originalHostHeaderName: 'host',
responseContent: [Object],
_ended: true,
_destdata: true,
_callbackCalled: true },
toJSON: [Function: responseToJSON],
caseless: Caseless { dict: [Object] },
read: [Function],
body: <Buffer 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e 0a 3c 68 74 6d 6c 20 6c 61 6e 67 3d 22 65 6e 22 3e 0a 3c 68 65 61 64 3e 0a 20 20 3c 6d 65 74 61 20 63 68 ... > } }
Well, OK, so now it looks like we have a problem! If anyone is familiar with this kind of response and what it might mean, feel free to comment, but it looks like making the request from my VPS host might be the issue …
However:
statusCode: 403,
statusMessage: 'Forbidden',
looks pretty clear!
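Since the full dump is hard to read, here is a rough sketch of how it could be boiled down to the handful of fields that matter. `summarizeResponse` is a hypothetical helper of mine, not part of the library; the property names it reads (`statusCode`, `statusMessage`, `headers`, `req._header`) are the ones visible in the dump above, and the sample object is a trimmed copy of that dump with a shortened, made-up path.

```javascript
// Reduce a dumped response object to the fields useful for diagnosis.
function summarizeResponse(response) {
  const headers = response.headers || {};
  // req._header holds the raw request text that was sent, CRLF separated.
  const rawRequest = (response.req && response.req._header) || '';
  const uaMatch = /^user-agent:\s*(.*)$/im.exec(rawRequest);
  return {
    statusCode: response.statusCode,
    statusMessage: response.statusMessage,
    servedBy: headers['x-served-by'],
    vary: headers.vary,
    sentUserAgent: uaMatch ? uaMatch[1].trim() : null,
  };
}

// Trimmed sample based on the VPS dump (the path is shortened for illustration).
const sampleResponse = {
  statusCode: 403,
  statusMessage: 'Forbidden',
  headers: {
    'x-served-by': 'cache-sea1031-SEA, cache-ams21050-AMS',
    vary: 'User-Agent, Accept-Encoding',
  },
  req: {
    _header:
      'GET /article/example HTTP/1.1\r\nuser-agent: request.js\r\n' +
      'host: seekingalpha.com\r\naccept-encoding: gzip, deflate\r\nConnection: close\r\n\r\n',
  },
};

console.log(summarizeResponse(sampleResponse));
```

Seen this way, the library's 'Page Not Found' error text is misleading: the server answered with 403 Forbidden, and the request went out with the user agent `request.js`.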
By the way, I’m not able to curl the article from either location (it returns a CAPTCHA challenge).
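One way to narrow this down would be to send the same request from each machine with different User-Agent strings and compare the status codes. The dump shows `user-agent: request.js` going out and `vary: User-Agent` coming back, so the user agent may play a part, although the local PC succeeding with the same script suggests the source IP could matter more. This is only a sketch using Node's built-in `https` module; `buildOptions` and `probe` are hypothetical helper names, and the `Accept-Encoding: identity` header is my own choice to keep the response uncompressed.

```javascript
const https = require('https');
const { URL } = require('url');

// Build request options for a given URL and User-Agent string.
function buildOptions(targetUrl, userAgent) {
  const { hostname, pathname, search } = new URL(targetUrl);
  return {
    hostname,
    path: pathname + search,
    method: 'GET',
    headers: { 'User-Agent': userAgent, 'Accept-Encoding': 'identity' },
  };
}

// Resolve with the status code only; the body is read and discarded.
function probe(targetUrl, userAgent) {
  return new Promise((resolve, reject) => {
    const req = https.request(buildOptions(targetUrl, userAgent), (res) => {
      res.resume(); // drain the body so the socket can be released
      resolve({ userAgent, statusCode: res.statusCode });
    });
    req.on('error', reject);
    req.end();
  });
}

// Usage (makes real network requests, so it is left commented out):
// probe('https://seekingalpha.com/', 'request.js').then(console.log);
```

If every user agent gets a 403 from the VPS but not from the local PC, that would point towards the block being tied to the hosting provider's address range rather than to the request itself.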
While this suggests it is not a ‘Discourse’ code issue, it is surely a general challenge for those of us hosting on ad hoc servers across the internet …
Any advice appreciated. I’d love to be able to Onebox these guys (because that’s what my users expect to be able to do).