Onebox user agent

(Kevin) #1

It appears that when onebox fetches a url to grab opengraph data, it sends a user agent string of “Ruby”.

Can this be updated to provide a more useful user agent that identifies the crawler properly?

(Robin Ward) #2

What problem are you trying to solve here? Is there a site that is not working because of Ruby, or is there another reason why you’d like to change it?

If you want a pointer to where in the code this happens, it’s in our onebox gem. You’d have to add a custom header there if one does not exist.
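For context, a minimal sketch of what setting such a header could look like with Net::HTTP (the constant and helper names here are illustrative, not the gem’s actual code):

```ruby
require "net/http"
require "uri"

# Illustrative only: a user agent string and a fetch helper that sends it.
ONEBOX_USER_AGENT = "Onebox (example)".freeze

def build_request(url)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request["User-Agent"] = ONEBOX_USER_AGENT # overrides Net::HTTP's "Ruby" default
  request
end

def fetch_html(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(build_request(url)).body
  end
end
```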

(Kevin) #3

Hey @eviltrout,

Besides being good practice to identify the crawler properly, one specific example would be prerendering:

For sites that use Angular, it’s important to be able to identify crawlers so that the page can be prerendered and served to the crawler. Without it, the crawler may end up extracting things like:

<meta property="og:description" content="{{ogDescription}}">.

(Robin Ward) #4

That example is pretty odd – in what case would it ever make sense for the prerenderer to not include the og:description value in the initial HTML payload? It seems to me like any time the document is rendered on the server side it should fill in that value, regardless of user agent.

Do you have an example site that uses this pattern that Onebox doesn’t work with?

I’m not against adding a User Agent (although I’m not sure what it would be), but I am curious about a definitive example of where we are currently broken.

(Kevin) #5

Pages are not prerendered by default. The default is to render the page in the browser (which your crawler cannot do). Therefore it’s necessary to identify the user agents of crawlers that require the page to be prerendered.

Check out this official gist to see the nginx configuration to identify crawlers and serve a prerendered page: Official nginx.conf for nginx · GitHub
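The pattern in that gist boils down to matching crawler user agents in nginx and proxying those requests to the prerender service; a heavily trimmed sketch of that idea (the bot list is abbreviated and the upstream name `@app` is a placeholder for your normal application):

```nginx
location / {
    set $prerender 0;
    # Match a few well-known crawler user agents (list abbreviated).
    if ($http_user_agent ~* "googlebot|bingbot|twitterbot|facebookexternalhit") {
        set $prerender 1;
    }
    if ($prerender = 1) {
        # Proxy crawlers to the prerender service instead of the SPA.
        rewrite .* /$scheme://$host$request_uri? break;
        proxy_pass http://service.prerender.io;
    }
    if ($prerender = 0) {
        # @app is your normal application upstream.
        try_files $uri @app;
    }
}
```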

The standard format for crawling bots can be seen here: User agent - Wikipedia

Googlebot/2.1 (+

It’s recommended to include a URL in the user agent string that provides information about what the bot does and how to block it if needed. Does Onebox obey robots.txt rules?
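On the robots.txt question, a naive check is easy to sketch in Ruby. This is illustration only, not a compliant parser: it ignores Allow rules, wildcards, and agent-group precedence.

```ruby
# Naive robots.txt check (illustration only): true if `path` is disallowed
# for the given user agent token under the most recently matched group.
def disallowed?(robots_txt, agent, path)
  applies = false
  robots_txt.each_line do |line|
    field, _, value = line.split("#").first.to_s.partition(":")
    value = value.strip
    case field.strip.downcase
    when "user-agent"
      applies = (value == "*" || agent.downcase.include?(value.downcase))
    when "disallow"
      return true if applies && !value.empty? && path.start_with?(value)
    end
  end
  false
end
```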

Anyways, I see no harm in at least updating the user agent to something like:

Onebox/1.6.7 (GitHub - discourse/onebox: A gem for turning URLs into website previews)

(Jeff Atwood) #6

Sorry, we don’t want this. Some sites block Open Graph retrieval for any unknown user agents. So we want a user agent that looks like a common / popular device, which won’t be blocked.

(Kevin) #7

Your user agent is currently “Ruby”. How is that common / popular?

(Jeff Atwood) #8

If so, then that’s a bug. Amazon, for example, bars it. Can you confirm this, @techapj?

(Kevin) #9

Another option could be to add “Prerender” to the user agent. I’m guessing you won’t want to do this either, but I’m really just looking for a solution so I can support your crawler.

It seems odd that you guys wouldn’t want to give developers with single page applications a way to support your crawler. I can understand why you would want to make sure your crawler doesn’t get blanket blocked, but there’s gotta be a compromising solution here…

(Arpit Jalan) #10

Currently we are not specifying a “User-Agent” header while making the HTTP request to retrieve the response, so the Net::HTTP library sends “Ruby” as the default user agent.
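That default is set on the request object itself by net/http (on MRI, Net::HTTPGenericRequest fills in “User-Agent: Ruby” when no header is supplied), so it can be observed without any network traffic:

```ruby
require "net/http"
require "uri"

request = Net::HTTP::Get.new(URI("https://example.com/"))
request["User-Agent"]                      # net/http's default, "Ruby"
request["User-Agent"] = "Discourse Onebox" # an explicit header replaces it
```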

(Sam Saffron) #11

I don’t see much harm in adding an extra HTTP header we send with our requests. I also don’t see much harm in allowing site owners to override the user agent via a hidden site setting.

But I don’t think the team will work on prototyping or testing either of these.

(Jeff Atwood) #12

I think the onebox library should offer this as a customizable string at minimum. Does not need to be exposed in Discourse per se…
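One way a gem can expose that without touching Discourse is a module-level setting with a default. This sketch is hypothetical; `OneboxSketch` and its accessor are not the gem’s real API.

```ruby
# Hypothetical sketch: a library-level user agent that embedders can override.
module OneboxSketch
  DEFAULT_USER_AGENT = "Onebox".freeze

  class << self
    attr_writer :user_agent

    def user_agent
      @user_agent || DEFAULT_USER_AGENT
    end
  end
end

# An embedding application could then set its own string:
OneboxSketch.user_agent = "Discourse Forum Onebox"
```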

(Rafael dos Santos Silva) #13

Just a small info: I used to work with a big Akka Scala app, and the HTTP library from Akka freaks out with user agents that contain http://; they say it’s a spec violation.

So maybe a simple default like Discourse Onebox 1.6.7 is enough.

(Kevin) #14

That works for me!

I didn’t dig into the RFC, but many of Google’s user agents contain URLs. They are prefixed with +, if that makes any difference.

(Felix Freiberger) #15

Wait, that bug report is about @, and even that is allowed within ( parentheses ). It looks like in a ()-delimited comment, <any TEXT excluding "(" and ")"> is allowed :thumbsup:

(Eli the Bearded) #16

AND ( and ) are allowed in the comment under special circumstances:

  1. When backslash escaped
  2. When nesting comments

So, for example, my PSP’s built-in browser issues:

User-Agent: Mozilla/4.0 (PSP (PlayStation Portable); 2.00)

Which is apparently RFC compliant even though there are parens within parens.
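Those two rules come from the HTTP comment grammar (comment = "(" *( ctext / quoted-pair / comment ) ")" in RFC 7230). A small Ruby checker for just the paren-balancing and escaping part (the ctext character-range checks are omitted for brevity):

```ruby
# Checks whether a string is a balanced RFC 7230 `comment`:
# parens may nest, and "(" / ")" may also appear backslash-escaped.
def valid_comment?(str)
  return false unless str.start_with?("(") && str.end_with?(")")
  depth = 0
  escaped = false
  str.each_char.with_index do |ch, i|
    if escaped
      escaped = false
      next
    end
    case ch
    when "\\" then escaped = true
    when "("  then depth += 1
    when ")"
      depth -= 1
      return false if depth < 0
      # The comment must not close before the final character.
      return false if depth.zero? && i != str.length - 1
    end
  end
  depth.zero? && !escaped
end
```

With this, the PSP string above checks out because its inner parens are a nested comment.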

(Jeff Atwood) #17

@eviltrout will be adding this feature tomorrow for a few different reasons.

(Robin Ward) #18

This commit adds a custom User-Agent to Discourse oneboxes:

The user agent looks like this:

Discourse Forum Onebox v1.8.0.beta13
