Link preview HTTP GET breaks spec

I’ve been struggling with an issue for a while that doesn’t seem to be reported yet. I apologize for the odd number of moving parts, but I’ll attempt to describe it succinctly.

The TL;DR is that when I paste a link into a message, the Ruby gem that ultimately makes the HTTP GET request to that URL to look for embed data sends a request that some HTTP proxies consider invalid per the spec. This prevents previews from working in some cases:

image

The slightly longer version is this. We use a nice little service called Gitbook.io for our docs. Gitbook is a hosted solution and they use Cloudflare workers for internal redirects on their site. Part of what those Cloudflare workers do is use the Node Fetch API to proxy HTTP requests. The Node Fetch devs are VERY pedantic about following the spec and they will reject any GET request that has an HTTP body or even a Content-Length header, even if that header is set to 0.
And that is exactly what is happening. The Ruby gem that makes the HTTP request sends a

Content-Length: 0

request header and this pisses off the Node Fetch proxy to no end, and the request ends up being rejected by the remote server. There has been much debate on different forums about whether a request body on a GET or even just a Content-Length header is valid per the HTTP spec. I don’t have an issue with it, but that hasn’t stopped the Node Fetch devs from closing every issue that’s ever been opened begging them to allow such a semantic.

I’m unfortunately stuck in the middle here.

  • The Node Fetch project refuses to consider these HTTP requests as valid.
  • Cloudflare support refuses to help me because I don’t have control over the Node-based workers in question
  • Gitbook’s support refuses to help me because they agree with the Fetch developers (and I’m not sure they really care)
  • And the HTTPrb library refuse to remove the header because they think it’s perfectly valid.

So that leaves me posting here asking if there is any way to control or change the HTTP GET requests made for link previews to include an acceptable set of HTTP headers such that proxies using incredibly pedantic libraries such as Node Fetch will not reject these requests?

If you want to try, here’s an example URL that’s hosted on Gitbook’s servers and uses their Node Fetch-powered Cloudflare worker.

6 Likes

@jamie.wilson / @techAPJ any idea why we are sending Content-Length of 0 with our requests? Can you confirm this behavior? I guess this makes sense for HEAD, but for GET?

2 Likes

Hi @sam The HTTP requests appear to be made by a Ruby library called httprb which has this behavior. If you look at the link in the “HTTPrb library refuse to remove the header because they think it’s perfectly valid.” bullet you can see the developer of that library making a case for why he’s bending the HTTP spec, but not breaking it.

As I’ve been poking at this all over the internet trying to get someone to come to agreement, I was able to get someone to send this pull to httprb which may solve the issue.

I’m not a Ruby dev so I wouldn’t even know how to test this. I assume eventually this gem will release a version with the fix and then, eventually, Discourse will update to start using that. It would be great if someone had a way to test whether it works. The repro case is very simple – just paste in the link above to my Gitbook URL and see if the preview is rejected.

1 Like

I’m seeing the following (which matches up with the image in your first post):

The text ‘Getting Started Guide’ shows us that the request has been successful - it’s pulling that string from the og:title meta tag:

<meta property="og:title" content="Getting Started Guide" data-react-helmet="true">

The error/warning that the description is missing is also correct. The content on the page is as follows:

<meta property="og:description" content="" data-react-helmet="true">

The image URL comes from the og:image tag, which is as follows:

<meta data-react-helmet="true" property="og:image" content="https://app.gitbook.com/share/space/thumbnail/-LA-UVvV3_TgzQyCXMWK.png">
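For anyone following along, the way those three values fall out of the page can be sketched in a few lines of Ruby. This is only an illustration using a regular expression over the raw HTML; it is not how Onebox itself parses pages, and `og_tags` is a made-up helper name.

```ruby
# Collect Open Graph properties from raw HTML.
# Rough sketch: assumes double-quoted attributes; a real HTML parser is more robust.
def og_tags(html)
  tags = {}
  html.scan(/<meta\b[^>]*>/i) do |tag|
    # Turn the tag's attributes into a Hash, whatever order they appear in.
    attrs = tag.scan(/([\w:-]+)\s*=\s*"([^"]*)"/).to_h
    prop = attrs["property"]
    tags[prop] = attrs["content"] if prop&.start_with?("og:")
  end
  tags
end

# The three tags quoted in this post.
html = <<~HTML
  <meta property="og:title" content="Getting Started Guide" data-react-helmet="true">
  <meta property="og:description" content="" data-react-helmet="true">
  <meta data-react-helmet="true" property="og:image" content="https://app.gitbook.com/share/space/thumbnail/-LA-UVvV3_TgzQyCXMWK.png">
HTML

og_tags(html).each { |name, value| puts "#{name}: #{value.inspect}" }
```

The empty string for og:description is what triggers the "description is missing" warning in the preview.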

If I cut and paste https://app.gitbook.com/share/space/thumbnail/-LA-UVvV3_TgzQyCXMWK.png into my browser (recent Safari on macOS) I receive an error saying:

Error: could not handle the request

Making the same request via curl gives the same response:

curl -v https://app.gitbook.com/share/space/thumbnail/-LA-UVvV3_TgzQyCXMWK.png
*   Trying 104.18.8.111...
* TCP_NODELAY set
* Connected to app.gitbook.com (104.18.8.111) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-CHACHA20-POLY1305
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Cloudflare, Inc.; CN=sni.cloudflaressl.com
*  start date: Jun 16 00:00:00 2021 GMT
*  expire date: Jun 15 23:59:59 2022 GMT
*  subjectAltName: host "app.gitbook.com" matched cert's "*.gitbook.com"
*  issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x142809200)
> GET /share/space/thumbnail/-LA-UVvV3_TgzQyCXMWK.png HTTP/2
> Host: app.gitbook.com
> User-Agent: curl/7.64.1
> Accept: */*
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 500
< date: Mon, 23 Aug 2021 17:40:04 GMT
< content-type: text/plain; charset=utf-8
< content-length: 36
< cf-ray: 68361fb8ea9b4009-YYZ
< age: 432
< vary: Accept-Encoding
< via: magic cache
< cf-cache-status: HIT
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< x-cache: HIT
< x-cloud-trace-context: 9d4cbd24a15138451c88b2ced35a32f1;o=1
< x-content-type-options: nosniff
< x-magic-hash: f46ac4bf6b6dc125a68e9ad566b48481631bb27eec2165532a7c0f538e93c4f6
< x-release: gitbook-28427-6.25.14
< server: cloudflare
<
Error: could not handle the request
* Connection #0 to host app.gitbook.com left intact
* Closing connection 0

Can you see an image if you cut and paste the og:image URL into your browser?

In summary: the Onebox preview seems to be performing as expected, based on the response from the original URL.

3 Likes

@jamie.wilson Thanks for your time looking into this. Can you clarify, however, whether your test above was with the newest version of the httprb gem that contains the aforementioned pull request, or with the older version of the library?

The original error I used to see in the Onebox preview was that the target URL returned a 500 status code. At some point before I posted this, the Onebox preview began instead showing the note about the missing OpenGraph metadata. As I’d been troubleshooting this with Gitbook support for months prior to posting, it’s possible something changed in the meantime.

If the Gitbook URL is actually loading and just missing some metadata or images, then that’s different from the request being rejected. However, I do know for sure that any requests I send myself containing a Content-Length: 0 HTTP request header are rejected by the Cloudflare workers on the remote server. Perhaps the HTTP client used to make the requests in Discourse has changed? I don’t know anything about the Discourse source code and I’m not even 100% sure that the httprb library is the actual source of the HTTP requests.

I don’t believe we use the httprb gem at all. The Oneboxing process (i.e., the thing that is generating the link previews) uses Net::HTTP from Ruby’s stdlib, and also the Excon gem as part of the flow.

Digging a little deeper I can see that we do sometimes generate requests with a Content-Length: 0 header. However, in the case of the URL provided at least, this isn’t interfering with the generation of the Onebox.
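As a point of reference, here is a small sketch of how to inspect what a bare Net::HTTP GET request object carries before anything is sent. It only covers the stdlib defaults; it says nothing about headers that Onebox's own code or Excon may add on top, and example.com is just a placeholder URL.

```ruby
require "net/http"
require "uri"

# Build a GET request object without sending it (placeholder URL).
req = Net::HTTP::Get.new(URI("http://example.com/"))

# List the headers the stdlib sets by default on this object.
req.each_header { |name, value| puts "#{name}: #{value}" }

# Net::HTTP's own view of whether this method takes a request body.
puts "body permitted: #{req.request_body_permitted?}"
puts "content-length: #{req['Content-Length'].inspect}"
```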

There may have been a minor version bump, but nothing major like rearchitecting how we make requests or which libraries we use to do so.

There have been some changes to make Oneboxing more robust in general, which may explain why URLs which were previously 500-ing are now Oneboxing successfully.

If you have URLs that you can share that are currently returning errors during Oneboxing (or otherwise not working as expected within some other part of Discourse), please feel free to send them my way!

3 Likes

Ahh, so this is very good info. I’ve been having to make wild guesses as to what libraries were even involved at this point, largely due to getting hardly any help at all from the Gitbook team who maintain the Cloudflare proxies.

Got it. I don’t think I shared this above, but the one bit of information I WAS able to get out of Gitbook was that the error in their Cloudflare error logs for the rejected preview requests from Discourse was this:

Request with a GET or HEAD method cannot have a body.

What’s not clear is if the GET request from Discourse actually contained a body or just the Content-Length: 0 header. Either way, that does violate the Fetch spec according to some people (including those at Cloudflare).
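To make the two cases concrete, here is a rough Ruby sketch that takes a raw HTTP/1.1 request as text and reports which situation it is in. `body_status` and its return symbols are names invented for this illustration; this is not how Cloudflare or any Fetch implementation performs its check.

```ruby
# Classify a raw HTTP/1.1 request (as a String) relative to the
# "GET or HEAD cannot have a body" complaint. Hypothetical helper.
def body_status(raw_request)
  head, body = raw_request.split("\r\n\r\n", 2)
  lines = head.split("\r\n")
  method = lines.first.split(" ").first

  # Look for a Content-Length header, ignoring case.
  length = lines.drop(1)
                .map { |line| line.split(":", 2) }
                .find { |name, _| name.strip.casecmp?("content-length") }
                &.last&.strip

  return :not_get_or_head unless %w[GET HEAD].include?(method)
  return :has_body if body && !body.empty?
  return :content_length_header_only if length
  :no_body_no_header
end

puts body_status("GET / HTTP/1.1\r\nHost: example.com\r\nContent-Length: 0\r\n\r\n")
puts body_status("GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
```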

Yes, at some point it does seem that the Onebox error changed from the generic 500 to now having some data. There’s no telling what libs may have been bumped (and I have updated Discourse in this time). I wish I had a way to capture exactly what headers are being sent from Discourse, but even if I hit a URL like http://httpbin.org/get I don’t have a way to “see” what is returned since the results are consumed entirely by Onebox and not logged anywhere that I know of.
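One possible way to see the headers a client really sends is to point it at a throwaway listener that prints the request head. The sketch below uses only Ruby's stdlib and runs entirely on localhost; the plain Net::HTTP client in it is a stand-in, not the actual Onebox code path, and to capture real Discourse traffic the listener would need to run on a host the forum can reach.

```ruby
require "socket"
require "net/http"

# Accept one connection, read the request head (up to the blank line),
# answer with a minimal 200, and return the raw header lines.
def capture_request_head(server)
  client = server.accept
  lines = []
  while (line = client.gets)
    line = line.chomp
    break if line.empty?
    lines << line
  end
  client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok")
  client.close
  lines
end

server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
port = server.addr[1]
capture = Thread.new { capture_request_head(server) }

# Stand-in client: a plain stdlib GET (nil skips any proxy set in the environment).
http = Net::HTTP.new("127.0.0.1", port, nil)
http.request(Net::HTTP::Get.new("/get"))

request_head = capture.value
server.close

puts request_head
sent = request_head.any? { |l| l.downcase.start_with?("content-length:") }
puts "Content-Length sent: #{sent}"
```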

If the empty content-length header is now gone, then I can at least work with Gitbook to fix their embed stuff (which won’t happen since they are currently rewriting their entire app from scratch and refusing to address any existing bugs :confused: but that’s not Discourse’s problem at least )

Let me first state a lot of what has been written above is waaaaay over my head, but I’m grasping at straws here. It’s perfectly okay to let me know I’m an idiot if I’m chiming in on the wrong topic.

We’re seeing this quite a bit, because we frequently post links to our Help Center (knowledge base) in our Community.

Some examples of links that fail to Onebox:

https://help.republicwireless.com/hc/en-us/articles/115014150828--How-to-Add-an-E911-Address

From the preview panel as I’m typing:
image

1 Like

Having dug a little deeper, it’s the Excon gem adding Content-Length: 0, but not on GET requests.

But that code has been there for 8 or 9 years, so probably wasn’t the problem.

The Gemfile.lock file will show you the gems used by core Discourse.
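If it helps, here is a quick sketch for pulling the locked versions of the HTTP-related gems out of that file. It is a plain text scan of the lockfile's spec lines rather than anything Discourse ships, `locked_versions` is a made-up helper, and the gem list (excon, plus http, which is the gem name httprb is published under) is just what this thread has mentioned.

```ruby
# Return { gem_name => version } for the named gems found in a Gemfile.lock.
# Rough sketch: matches the four-space-indented "name (version)" spec lines.
def locked_versions(lock_text, gem_names)
  lock_text.scan(/^ {4}(\S+) \(([^)]+)\)$/)
           .select { |name, _| gem_names.include?(name) }
           .to_h
end

# Assumes this is run from the root of a Discourse checkout.
lock_path = "Gemfile.lock"
if File.exist?(lock_path)
  locked_versions(File.read(lock_path), %w[excon http]).each do |name, version|
    puts "#{name} #{version}"
  end
end
```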

2 Likes

This site is behind Cloudflare captcha and it blocks Discourse from grabbing any information on it :slightly_frowning_face:

2 Likes

When viewed in a browser this page does contain the necessary meta tags to construct a Onebox. However, trying to fetch the URL it seems as though we’re getting an error!

oneboxer preview url: https://help.republicwireless.com/hc/en-us/articles/115014150828--How-to-Add-an-E911-Address
headers: {"User-Agent"=>"Discourse Forum Onebox v2.8.0.beta4"}
helpers response code: 403

Which means we requested that URL with a User Agent of “Discourse Forum Onebox v2.8.0.beta4”, but the remote web server returned a 403 status code.

Similarly, using the command line tool wget:

wget https://help.republicwireless.com/hc/en-us/articles/115014150828--How-to-Add-an-E911-Address
--2021-08-23 17:38:30--  https://help.republicwireless.com/hc/en-us/articles/115014150828--How-to-Add-an-E911-Address
Resolving help.republicwireless.com (help.republicwireless.com)... 104.16.53.111, 104.16.51.111
Connecting to help.republicwireless.com (help.republicwireless.com)|104.16.53.111|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden

Which is saying the same thing… We’re sending a valid request, but the remote web server is refusing to return a result. Is it possible for the people responsible for help.republicwireless.com to unblock these valid requests?
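For anyone who wants to repeat the check from Ruby, here is a minimal sketch that sends a GET with a chosen User-Agent and returns the status code. The helper names are invented for this example and this is not the Onebox code itself; it only mimics the one header shown in the log above.

```ruby
require "net/http"
require "uri"

# Build a GET request carrying the given User-Agent (nothing is sent yet).
def build_get(url, user_agent)
  uri = URI(url)
  req = Net::HTTP::Get.new(uri)
  req["User-Agent"] = user_agent
  [uri, req]
end

# Send the request and return the HTTP status code as a String, e.g. "403".
def fetch_status(url, user_agent)
  uri, req = build_get(url, user_agent)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(req).code
  end
end

# Example call (needs network access, so it is left commented out):
# puts fetch_status(
#   "https://help.republicwireless.com/hc/en-us/articles/115014150828--How-to-Add-an-E911-Address",
#   "Discourse Forum Onebox v2.8.0.beta4"
# )
```

Comparing the result for the Onebox User-Agent with the result for a browser-like one can hint at whether the block is based on the User-Agent or on something else.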

These two sites don’t have the OpenGraph title/description tags, however they do have other titles/descriptions that Onebox should fall back to. This is something we should look into fixing.
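Roughly, the fallback idea looks like the sketch below: prefer the OpenGraph value, and if it is missing or empty, use the page's regular title or description. This is an illustration with regular expressions and invented helper names, not the actual Onebox implementation or a proposed patch.

```ruby
# Return the content attribute of the first <meta> tag whose property or
# name equals key, or nil. Assumes double-quoted attributes.
def meta_content(html, key)
  html.scan(/<meta\b[^>]*>/i).each do |tag|
    attrs = tag.scan(/([\w:-]+)\s*=\s*"([^"]*)"/).to_h
    return attrs["content"] if [attrs["property"], attrs["name"]].include?(key)
  end
  nil
end

# Treat nil and blank strings as "not there".
def present(value)
  value unless value.nil? || value.strip.empty?
end

# og:title first, then the <title> element.
def preview_title(html)
  present(meta_content(html, "og:title")) ||
    present(html[/<title[^>]*>(.*?)<\/title>/mi, 1])&.strip
end

# og:description first, then <meta name="description">.
def preview_description(html)
  present(meta_content(html, "og:description")) ||
    present(meta_content(html, "description"))
end
```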

2 Likes

:confused: It has been working for years, though.

Here’s an example where a link from the same site presents a valid Onebox: https://forums.republicwireless.com/t/4-digit-pin-which-i-have-forgot/37655/2

Which links to https://help.republicwireless.com/hc/en-us/articles/115012101188-Can-t-Get-Past-the-Screen-Lock-on-the-Phone

image

1 Like

Cloudflare constantly changes their robot detection algorithm, so if you want Discourse to not be blocked, you may want to contact their support and ask why the request is getting blocked.

5 Likes