"Onebox Assistant", crawl for those previews reliably!

What it does

Turns this kind of result:

(where your server has failed to bring back the page source so cannot extract the required tags to build the onebox)

Into this:

It simply provides an alternative path for onebox to get its page source with which to look for meta-data when the target server refuses your connection.

It changes nothing about how onebox then processes the page source to find the meta-data and render the box.

It lets you enter the details and credentials of a third-party API, which then brings back the page instead of a normal HTTP call made directly to the target page.

Why

I found my servers were being forbidden access to a number of commercial sites, so oneboxes would fail to render. The plugin essentially leverages the trustworthiness of the third-party API, a bit like a mail service.

Why it’s cost effective

It is intended to use the API sparingly: under most circumstances it brings back the page source in the normal way and only uses the API when it’s refused a response. However, on deeper investigation and with experience I’ve noticed that using the API every time may now be a requirement, as the redirects stage can fail to bring back the correct page for the very same reasons as a total denial to respond. The plugin can therefore now use the API on every occasion.
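The two modes described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code (the plugin itself is a Discourse plugin, not Python): the function name `fetch_page_source`, the injected `direct_fetch` and `api_fetch` callables, and the set of status codes treated as a refusal are all assumptions made for the example.

```python
from typing import Callable, Tuple

# Assumption: which HTTP statuses count as "refused". The plugin's real
# criteria are not stated in this post and may differ.
REFUSED_STATUSES = {403, 429, 503}


def fetch_page_source(
    url: str,
    direct_fetch: Callable[[str], Tuple[int, str]],
    api_fetch: Callable[[str], str],
    always_use_api: bool = False,
) -> str:
    """Return page HTML, using the third-party API only when needed.

    direct_fetch(url) -> (status_code, body) is the normal crawl.
    api_fetch(url) -> body goes through the third-party API.
    """
    if always_use_api:
        # Skip the direct crawl entirely and go straight to the API.
        return api_fetch(url)

    status, body = direct_fetch(url)
    if status in REFUSED_STATUSES or not body:
        # The target refused us or returned nothing: fall back to the API.
        return api_fetch(url)
    return body


# Usage with stand-in fetchers (no network access needed):
def blocked_direct(url: str) -> Tuple[int, str]:
    return 403, ""


def stub_api(url: str) -> str:
    return "<html>fetched via API</html>"


html = fetch_page_source("https://example.com/", blocked_direct, stub_api)
```

Passing the fetchers in as arguments keeps the decision logic separate from any particular HTTP client or API provider.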

What this means is you can use a relatively cheap VPS but still get reliable one-boxing functionality, even if your IP or user agent is somehow ‘blacklisted’.

You don’t need it if

You are already oneboxing all your target content successfully with the vanilla install and all your users are happy.

Pre-requisites

You need an account with a suitable third-party API.

Settings

See example below

image

This setting allows you to ignore the prefetch (to check if the direct crawl returns a result) and use the API from the get-go.


default OFF

I recommend setting this to TRUE.

This is more expensive of course but often yields better results as there are some cases where the pre-fetch gets redirected to the wrong page because you are not trusted.

Known Limitations

  • It’s only been tested with one provider so far: https://embed.rocks (with whom I have no affiliation). I’m happy to consider supporting more services if the work is sponsored.

  • The monkey patching is done at method level. This overrides more code than it needs to, which leads to a greater risk of the plugin breaking after a core update. However, I don’t think there’s a way to minimise this further?

Repo here

All feedback welcome.

19 Likes

Can I know which third-party API you are using?

2 Likes

The plugin currently supports https://embed.rocks. I’ve updated the first post with that information too.

There’s a switch to return page source instead of a ‘card’ (shown in the screenshot).

1 Like

embed.rocks looks like it is down right now.

I’m trying to use proxycrawl.com but I cannot get it to work. Can you take a look?

1 Like

I’ve escalated the website issue. Have mailed the owner.

Thanks for the alternate link. Very useful suggestion. Unfortunately I don’t have time to look at this at the moment as I have a couple of live projects. I suggest using embed.rocks for now.

If you’d like me to take a look at supporting proxycrawl more urgently you can hire me or submit a PR :).

1 Like

Hi @merefield, I’ve been stuck for about an hour over here. Have tried every combination of what should work for settings. No matter what I do I can’t get this URL to work in Discourse… even though embed.rocks doesn’t have a problem getting the data under “Try It” on their site.

https://oilandgaslawdigest.com/caselaw/scotx-applies-discovery-rule-to-breach-of-pref-right-despite-disclosure-in-deed-records/

Please help.

1 Like

Hi James,

Yes this stuff is super frustrating at times.

It’s neither Discourse’s fault on this occasion, nor your server (though it would be nice if Discourse provided more info about why it’s failed).

If you check with Facebook’s opengraph debugger, you see this:

https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Foilandgaslawdigest.com%2Fcaselaw%2Fscotx-applies-discovery-rule-to-breach-of-pref-right-despite-disclosure-in-deed-records%2F

So it looks like the target website doesn’t have a very well formed metadata section. If you know them, you could raise it with them - they should be checking with the Facebook tool in any case.

It always amazes me how many sites still don’t get their metadata right. I face this weekly on one of my sites that focusses on financial markets.

The other systems, like embed.rocks and iframely are possibly using alternative tags and tricks to put their previews together.

Remember my plugin is not using embed.rocks previews, it’s merely using that service to scrape, so onebox is processing the original page source.

As I said, Discourse/Onebox could arguably help more by being more transparent about why it fails when it is unable to render these things instead of just rendering the original URL. Letting the poster know which tag(s) were missing or if there was a bad response from the scrape attempt that prevented the onebox from rendering correctly would be a real improvement.

It might be good to enhance the plugin or build another plugin to support one of the third party preview builders to provide an alternative to oneboxing. That’s currently beyond the scope and I’m busy with other projects at the moment. I might consider the work at some stage if it were suitably sponsored. Nonetheless, retaining onebox functionality for previews makes the plugin more resilient and less likely to fail due to changes in core.

3 Likes

Recent changes to core appear to have broken this plugin. This was expected at some stage due to the method-level overriding.

This is now resolved with this update:

1 Like

Fun fact: even Signal (the highly secure mobile messaging app) gets blocked from previews (probably rate limiting):

4 Likes

I signed up at embed.rocks

I’m on the current version of Discourse, and have installed the Onebox Assistant with the following settings:

My link http://denvergeeks.com does not work on my Discourse forum, but it does work on the “Try It” page at embed.rocks:

Can you offer any help to get this working?

2 Likes

http://denvergeeks.com/tiki-read_article.php?articleId=1

That’s not one-boxing, so it may be that some of the head tags are missing (though obviously meta is not using such a service, so it’s not a perfect comparison). One-boxing is quite fussy.

Have you checked on the Facebook preview debugger? Any warnings? See above.

2 Likes

Thank you so much Robert for the quick response! – YES that page actually IS One-boxing on embed.rocks – see this screenie:

Here is what the Facebook Debugger reveals:

Any other thoughts?

Thank you again!!

1 Like

That URL has nothing to do with Discourse. But the warnings are all valid; you should fix that page so it has the required tags.

4 Likes

Thank you Jeff – YOU totally ROCK for managing and supporting this forum!!

My hope was that the “Onebox Assistant” with the help of the embed.rocks service/API (the “alternative path” that developer Robert says is integrated with this plugin) would work since it is working on the URL tester on embed.rocks…

Maybe I misunderstand the purpose of the plugin?

1 Like

The plugin helps prevent you being blocked or rate limited when scraping for previews (which is a major problem for independent platforms based on VPSs). The plugin is extremely effective for sites that preview a lot of commercial websites.

What it won’t do is magically generate the missing metadata required to render a Onebox.

That’s something that needs to be addressed on this target site.

No, that is not a ‘Onebox’. That is their own preview card. Onebox is Discourse’s own preview component. It has specific data requirements of its own and uses the industry-standard Open Graph or ‘og’ tags to render a preview, and it expects these tags to be included on the target web pages. The inbuilt preview on embed.rocks may be able to use other data and may use techniques to process the raw HTML. This plugin does not leverage their card system, only their scraping system. Rendering the preview is still left to Discourse’s Onebox system because it is so tightly embedded in the software. That keeps the overrides lighter and less risky, so the plugin remains simpler and more robust.

If you have access to the site or know the site owner, discuss the missing data with them. Fix the warnings you see in the Facebook feedback and it should make things work. Judging by the rather dated styling, it is possible it is an old CMS and exposing Open Graph tags will not be possible, in which case you are just going to have to live with it. You could encourage the owner to upgrade their software, as og tags are pretty much a universal standard now.
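For a quick local check of which og tags a page exposes, something like the sketch below can help before raising it with the site owner. It uses only Python's standard `html.parser`. The names `OpenGraphCollector` and `missing_og_tags` are made up for this example, and the list of "required" tags is an assumption for illustration; it is not taken from Onebox's source, whose exact requirements may differ.

```python
from html.parser import HTMLParser

# Assumption: a plausible minimum set of tags for a useful preview.
# Onebox's actual requirements are not spelled out in this thread.
ASSUMED_REQUIRED = ("og:title", "og:description", "og:image")


class OpenGraphCollector(HTMLParser):
    """Collects <meta property="og:..."> tags from a page's HTML."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Some sites put og names in "name" instead of "property".
        prop = attr.get("property") or attr.get("name") or ""
        if prop.startswith("og:") and attr.get("content"):
            # Keep the first value seen for each tag.
            self.tags.setdefault(prop, attr["content"])


def missing_og_tags(html, required=ASSUMED_REQUIRED):
    """Return the assumed-required og tags that the HTML does not provide."""
    collector = OpenGraphCollector()
    collector.feed(html)
    return [name for name in required if name not in collector.tags]
```

Feed it the page source (for example, saved from your browser's "view source") and it lists the tags from the assumed set that are absent or empty.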

7 Likes

Wow. Thank you so very much Robert for that cogent explanation. So Incredibly helpful!!

3 Likes

Hi Merefield,

I think Amazon has blocked my onebox links, so I am trying to use your “Onebox Assistant” plugin.
I signed up on embed.rocks and got the API key, but oneboxing still does not work for Amazon links.

Screenshot of my plugin configuration:

The Amazon links on my website are not working.

I tested it with two links. One is working, and the other is not. See the screenshot below.

What am I missing with this plugin?

Updated after 20 minutes: I have now refreshed the post, and neither of the two Amazon links is working.

Digital Server got 503 error:
wget https://amzn.to/2QFd1kh
--2020-01-19 08:07:34-- https://amzn.to/2QFd1kh
Resolving amzn.to (amzn.to)... 67.199.248.12, 67.199.248.13
Connecting to amzn.to (amzn.to)|67.199.248.12|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.amazon.com/gp/product/B07V1QS7MY/ref=as_li_tl?ie=UTF8&tag=papasasa-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B07V1QS7MY&linkId=d2da317a25e9a6843d9bb5d9b7c5ff60 [following]
--2020-01-19 08:07:34-- https://www.amazon.com/gp/product/B07V1QS7MY/ref=as_li_tl?ie=UTF8&tag=papasasa-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=B07V1QS7MY&linkId=d2da317a25e9a6843d9bb5d9b7c5ff60
Resolving www.amazon.com (www.amazon.com)... 23.203.100.50
Connecting to www.amazon.com (www.amazon.com)|23.203.100.50|:443... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2020-01-19 08:07:34 ERROR 503: Service Unavailable.

1 Like

It’s not 100% foolproof; it’s just that their hit rate tends to be much better. The proxy service may also be getting blocked. Have you checked the link in embed.rocks’ online test tool? If it fails there, let their support know.

Also, you may be breaking Amazon’s affiliate agreement by oneboxing, as it may copy their trademark locally. I had my affiliate account closed because of this. I would advise you, if you are an affiliate, to blacklist their URL from oneboxing and use their SiteStripe links instead. I have had no problems since.

1 Like

Thank you, Robert. Good suggestions.

I’ll do it without oneboxing.

2 Likes

By the way, you may not need embed.rocks if you use the SiteStripe URLs. However, it may be useful for other sites.

One downside of the SiteStripe URLs is that you will need to manually edit each Amazon link to make it look nice. This is not a problem if users rarely post these.

See my guide here: Incorporating Amazon OneLink: making affiliated links on global forums much easier

1 Like