Onebox user agent

It appears that when onebox fetches a url to grab opengraph data, it sends a user agent string of “Ruby”.

Can this be updated to provide a more useful user agent that identifies the crawler properly?

What problem are you trying to solve here? Is there a site that is not working because of Ruby, or is there another reason why you’d like to change it?

If you want a pointer to where in the code this happens, it’s in our onebox gem. You’d have to add a custom header there if one does not exist.

Hey @eviltrout,

Besides being good practice to identify the crawler properly, one specific example would be prerendering:

https://prerender.io/

For sites that use Angular, it’s important to be able to identify crawlers so that the page can be prerendered and served to the crawler. Without it, the crawler may end up extracting things like:

<meta property="og:description" content="{{ogDescription}}">.

That example is pretty odd – in what case would it ever make sense for the prerenderer to not include the og:description value in the initial HTML payload? It seems to me like any time the document is rendered on the server side it should fill in that value, regardless of user agent.

Do you have an example site that uses this pattern that Onebox doesn’t work with?

I’m not against adding a User Agent (although I’m not sure what it would be), but I am curious about a definitive example of where we are currently broken.

Pages are not prerendered by default. The default is to render the page in the browser (which your crawler cannot do). Therefore it’s necessary to identify the user agents of crawlers that require the page to be prerendered.

Check out this official gist by prerender.io to see the nginx configuration that identifies crawlers and serves a prerendered page: https://gist.github.com/thoop/8165802
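In outline, the approach in that gist looks something like this (a simplified sketch, not the gist verbatim — the real configuration has a much longer crawler list and additional checks):

```nginx
location / {
    set $prerender 0;

    # Flag requests whose User-Agent matches a known crawler
    if ($http_user_agent ~* "googlebot|twitterbot|facebookexternalhit") {
        set $prerender 1;
    }

    # Known crawlers get proxied to the prerender service
    if ($prerender = 1) {
        rewrite .* /$scheme://$host$request_uri? break;
        proxy_pass http://service.prerender.io;
    }

    # Everyone else gets the normal single-page app
    if ($prerender = 0) {
        rewrite .* /index.html break;
    }
}
```

The whole scheme only works if the crawler’s User-Agent is distinctive enough to match — which is why “Ruby” is a problem.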

The standard format for crawler bots can be seen here: User agent - Wikipedia

Googlebot/2.1 (+http://www.google.com/bot.html)

It’s recommended to include a URL in the user agent string that provides information about what the bot does and how to block it if needed. Does Onebox obey robots.txt rules?

Anyways, I see no harm in at least updating the user agent to something like:

Onebox/1.6.7 (+https://github.com/discourse/onebox)

Sorry, we don’t want this. Some sites block opengraph retrieval for any unknown user agent, so we want a user agent that matches a common / popular device, which won’t be blocked.

Your user agent is currently “Ruby”. How is that common / popular?

If so, then that’s a bug. Amazon, for example, bars it. Can you confirm this, @techapj?

Another option could be to add “Prerender” to the user agent. I’m guessing you won’t want to do this either, but I’m really just looking for a solution so I can support your crawler.

It seems odd that you guys wouldn’t want to give developers with single page applications a way to support your crawler. I can understand why you would want to make sure your crawler doesn’t get blanket blocked, but there’s gotta be a compromise solution here…

Currently we are not specifying a “User-Agent” header when making the HTTP request to retrieve the response, so the Net::HTTP library sends “Ruby” as the default user agent.
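As a minimal illustration of the fix being discussed (the header name is standard; the value shown is just a placeholder, not the string Discourse ultimately chose):

```ruby
require "net/http"
require "uri"

# Net::HTTP sends "Ruby" as the User-Agent when none is specified.
# Setting the header explicitly on the request overrides that default.
uri = URI("https://example.com/some-page")
request = Net::HTTP::Get.new(uri)
request["User-Agent"] = "Onebox/1.6.7" # placeholder value for illustration

puts request["User-Agent"]
```

The request would then be sent as usual via `Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }`.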

I don’t see much harm in adding an extra HTTP header we send with our requests, and I also don’t see much harm in allowing site owners to override the user agent via a hidden site setting.

But I don’t think the team will work on prototyping or testing either of these.

I think the onebox library should offer this as a customizable string at minimum. Does not need to be exposed in Discourse per se…

Just a small bit of info: I used to work on a big Akka Scala app, and Akka’s HTTP library freaks out over user agents that contain http://; they say it’s a spec violation.

So maybe a simple default like Discourse Onebox 1.6.7 is enough.

That works for me!

I didn’t dig into the RFC, but many of Google’s user agents contain URLs. They are prefixed with +, if that makes any difference.

https://support.google.com/webmasters/answer/1061943?hl=en

Wait, that bug report is about @ – and even that is allowed within ( parentheses ). It looks like in a ()-delimited comment, <any TEXT excluding "(" and ")"> is allowed :thumbsup:

AND ( and ) are allowed in the comment under special circumstances:

  1. When backslash escaped
  2. When nesting comments

So, for example, my PSP’s built-in browser issues:

User-Agent: Mozilla/4.0 (PSP (PlayStation Portable); 2.00)

Which is apparently RFC compliant even though there are parens within parens.

@eviltrout will be adding this feature tomorrow for a few different reasons.

This commit adds a custom User-Agent to Discourse oneboxes:

https://github.com/discourse/discourse/commit/8bc93c0b018045bf23769b8ecc4ec5b493368667

The user agent looks like this:

Discourse Forum Onebox v1.8.0.beta13