Onebox user agent

It appears that when Onebox fetches a URL to grab Open Graph data, it sends a user agent string of “Ruby”.

Can this be updated to provide a more useful user agent that identifies the crawler properly?

What problem are you trying to solve here? Is there a site that is not working because of Ruby, or is there another reason why you’d like to change it?

If you want a pointer to where in the code this happens, it’s in our onebox gem. You’d have to add a custom header there if one does not exist.

Hey @eviltrout,

Besides being good practice to identify the crawler properly, one specific example would be prerendering:

https://prerender.io/

For sites that use Angular, it’s important to be able to identify crawlers so that the page can be prerendered and served to the crawler. Without it, the crawler may end up extracting things like:

<meta property="og:description" content="{{ogDescription}}">.

That example is pretty odd – in what case would it ever make sense for the prerenderer to not include the og:description value in the initial HTML payload? It seems to me like any time the document is rendered on the server side it should fill in that value, regardless of user agent.

Do you have an example site that uses this pattern that Onebox doesn’t work with?

I’m not against adding a User Agent (although I’m not sure what it would be), but I am curious about a definitive example of where we are currently broken.

Pages are not prerendered by default. The default is to render the page in the browser (which your crawler cannot do). Therefore it’s necessary to identify the user agents of crawlers that require the page to be prerendered.

Check out this official gist by prerender.io to see the nginx configuration to identify crawlers and serve a prerendered page: Official prerender.io nginx.conf for nginx · GitHub
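
To make the idea concrete, here is a rough Ruby sketch of the kind of check such a config performs: match the User-Agent header against a list of known crawler tokens and prerender only for those. The token list and the `crawler?` helper are illustrative assumptions, not the actual prerender.io list.

```ruby
# Illustrative crawler tokens only; a real deployment would maintain its own list.
CRAWLER_PATTERN = /googlebot|bingbot|twitterbot|facebookexternalhit/i

# Returns true when the given User-Agent string contains a known crawler token.
def crawler?(user_agent)
  return false if user_agent.nil?

  CRAWLER_PATTERN.match?(user_agent)
end

crawler?("Googlebot/2.1 (+http://www.google.com/bot.html)") # matches "googlebot"
crawler?("Ruby")                                            # matches nothing
```

With a check like this, a request that identifies itself only as “Ruby” is not recognized as a crawler and gets the unrendered page.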

The standard format for crawling bots can be seen here: User agent - Wikipedia

Googlebot/2.1 (+http://www.google.com/bot.html)

It’s recommended to include a URL in the user agent string that provides information about what the bot does and how to block it if needed. Does Onebox obey robots.txt rules?

Anyways, I see no harm in at least updating the user agent to something like:

Onebox/1.6.7 (GitHub - discourse/onebox: (DEPRECATED) A gem for turning URLs into website previews)

Sorry we don’t want this. Some sites block opengraph retrieval for any unknown user agents. So we want a user agent that is a common / popular device, which won’t be blocked.

Your user agent is currently “Ruby”. How is that common / popular?

If so then that’s a bug. Amazon for example bars it. Can you confirm on this @techapj?

Another option could be to add “Prerender” to the user agent. I’m guessing you won’t want to do this either, but I’m really just looking for a solution so I can support your crawler.

It seems odd that you guys wouldn’t want to give developers with single page applications a way to support your crawler. I can understand why you would want to make sure your crawler doesn’t get blanket blocked, but there’s gotta be a compromising solution here…

Currently we are not specifying a “User-Agent” header when making the HTTP request to retrieve the response, so the Net::HTTP library sends “Ruby” as the default user agent.
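
For reference, a minimal sketch of setting the header explicitly with Net::HTTP. The `build_request` helper and the `ExampleBot/1.0` string are placeholders made up for illustration; this is not how the onebox gem is actually structured.

```ruby
require "net/http"
require "uri"

# Builds a GET request that carries an explicit User-Agent header.
# Hypothetical helper: the name and signature are assumptions for this sketch.
def build_request(url, user_agent)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  # Overrides whatever default Net::HTTP would otherwise send.
  request["User-Agent"] = user_agent
  request
end

request = build_request("https://example.com/", "ExampleBot/1.0")

# Sending is left out so the sketch runs without network access:
# uri = request.uri
# Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
```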

I don’t see much harm in adding an extra HTTP header we send with our requests. I also don’t see much harm in allowing site owners to override the user agent via a hidden site setting.

But I don’t think the team will work on prototyping or testing either of these.

I think the onebox library should offer this as a customizable string at minimum. Does not need to be exposed in Discourse per se…
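
Something along these lines is what I have in mind. This is only a sketch: `OneboxSketch` and its `user_agent` accessor are hypothetical names, not an existing option in the onebox gem.

```ruby
# Hypothetical module standing in for the library; not the real Onebox API.
module OneboxSketch
  # Mirrors the current behaviour, where Net::HTTP ends up sending "Ruby".
  DEFAULT_USER_AGENT = "Ruby".freeze

  class << self
    # Lets the host application override the string.
    attr_writer :user_agent

    # Falls back to the default when nothing has been configured.
    def user_agent
      @user_agent || DEFAULT_USER_AGENT
    end
  end
end

# The host app (Discourse or anything else) sets it once at boot.
OneboxSketch.user_agent = "ExampleForum Onebox"
```

The library would then read `OneboxSketch.user_agent` wherever it builds its requests, so nothing needs to be exposed in the Discourse UI.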

Just a small note: I used to work with a big Akka Scala app, and the HTTP library from Akka freaks out with user agents that contain http://. They say it’s a spec violation.

So maybe a simple default like Discourse Onebox 1.6.7 is enough.

That works for me!

I didn’t dig into the RFC, but many of Google’s user agents contain URLs. They are prefixed with +, if that makes any difference.

https://support.google.com/webmasters/answer/1061943?hl=en

Wait, that bug report is about @, and even that is allowed within ( parentheses ). It looks like in a ()-delimited comment, <any TEXT excluding "(" and ")"> is allowed :thumbsup:

AND ( and ) are allowed in the comment under special circumstances:

  1. When backslash escaped
  2. When nesting comments

So, for example, my PSP’s built-in browser issues:

User-Agent: Mozilla/4.0 (PSP (PlayStation Portable); 2.00)

Which is apparently RFC compliant even though there are parens within parens.
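
Here is a quick Ruby sketch of that rule, assuming all we care about is whether the parentheses in a comment balance once backslash escapes and nesting are taken into account. The `balanced_parens?` helper is made up for illustration and is not a full RFC grammar check.

```ruby
# Returns true when every "(" has a matching ")" in the given text,
# treating a backslash as escaping the character that follows it.
def balanced_parens?(text)
  depth = 0
  escaped = false

  text.each_char do |ch|
    if escaped
      escaped = false          # this character was escaped, skip it
    elsif ch == "\\"
      escaped = true           # next character is taken literally
    elsif ch == "("
      depth += 1               # entering a (possibly nested) comment
    elsif ch == ")"
      depth -= 1               # leaving a comment
      return false if depth < 0
    end
  end

  depth.zero? && !escaped
end

balanced_parens?("(PSP (PlayStation Portable); 2.00)") # nested comment, balanced
balanced_parens?("(stray ) paren)")                    # closes once too often
```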

@eviltrout will be adding this feature tomorrow for a few different reasons.

This commit adds a custom User-Agent to discourse oneboxes:

https://github.com/discourse/discourse/commit/8bc93c0b018045bf23769b8ecc4ec5b493368667

The user agent looks like this:

Discourse Forum Onebox v1.8.0.beta13