Amazon onebox fails on my site, but works here


(Dave McClure) #1

I’ve tried adding both amazon.com and www.amazon.com as whitelisted domains.

I tried pasting this link, and many others from Amazon, into a new topic on my site:

http://www.amazon.com/The-Advantage-Organizational-Everything-Business/dp/1491510803

When I do, I get the following in Chrome’s dev tools console:

https://www.myforumdomain.com/onebox?url=http%3A%2F%2Fwww.amazon.com%2FThe-Advantage-Organizational-Everything-Business%2Fdp%2F1491510803d&refresh=false
Failed to load resource: the server responded with a status of 404 ()

When I paste it here, it works… see?

I see this other topic referencing a bug fixed before, but that was a while ago:


Support slideshare.net one box?
(Jay Pfaffman) #2

What version of Discourse are you running? Do you have any plugins installed that might be conflicting?


(Dave McClure) #3

I’m pretty well up to date and only have standard plugins enabled.

The ‘static pages’ plugin is installed, but disabled… let me try getting rid of that completely:


(Dave McClure) #4

Same issue after rebuilding the container without that plugin:


(Jeff Atwood) #5

Is amazon blocked on the network that server is on?


(Dave McClure) #6

Hmm… It’s just a vanilla Digital Ocean install, but I should do some basic troubleshooting to check that.


(Dean Taylor) #7

I’ve just tested this on a live Discourse instance, one that is running a little behind latest (v1.5.0.beta12 +135). I will update and test again when I get a chance… but…

Although I’ve whitelisted the domain www.amazon.co.uk and restarted (not rebuilt) Discourse, I see the same results as @mcwumbly mentions in Chrome Dev Tools.

I’ve verified that I can connect to the URL:

root@forum:~# curl http://www.amazon.co.uk/Belkin-BSV103-SurgeCube-Protector-Charging/dp/B00P2GW7MG -o deleteme.html --verbose
* Hostname was NOT found in DNS cache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 176.32.108.186...
* Connected to www.amazon.co.uk (176.32.108.186) port 80 (#0)
> GET /Belkin-BSV103-SurgeCube-Protector-Charging/dp/B00P2GW7MG HTTP/1.1
> User-Agent: curl/7.35.0
> Host: www.amazon.co.uk
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sun, 27 Mar 2016 18:07:47 GMT
* Server Server is not blacklisted
< Server: Server
< pragma: no-cache
< x-amz-id-1: 1QKR29Z3J28QM1BPTS6V
< p3p: policyref="https://www.amazon.co.uk/w3c/p3p.xml",CP="CAO DSP LAW CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA HEA PRE LOC GOV OTC "
< cache-control: no-cache, no-transform
< x-frame-options: SAMEORIGIN
< expires: -1
< x-ua-compatible: IE=edge
< Vary: Accept-Encoding,User-Agent
< Content-Type: text/html; charset=UTF-8
< Transfer-Encoding: chunked
<
{ [data not shown]
100  439k    0  439k    0     0   496k      0 --:--:-- --:--:-- --:--:--  496k
* Connection #0 to host www.amazon.co.uk left intact
root@forum:~#

And again it embeds fine here on meta:

And only the standard plugins for me:

Doing a bit of debugging I tried executing the following:

[4] pry(main)> url = "http://www.amazon.com/gp/product/B005T3GRNW/ref=s9_simh_gw_p147_d0_i2"
=> "http://www.amazon.com/gp/product/B005T3GRNW/ref=s9_simh_gw_p147_d0_i2"
[5] pry(main)> preview = Onebox.preview(url)
=> #<Onebox::Preview:0x007f2f1a97f270
 @cache=
  #<Moneta::Expires:0x007f2f1674ca38
   @adapter=
    #<Moneta::Transformer::JsonPrefixKeyJsonValue:0x007f2f1674cab0
     @adapter=#<Moneta::Adapters::Memory:0x007f2f16795710 @backend={}>,
     @features=[:increment, :create],
     @prefix="">,
   @default_expires=nil>,
 @engine_class=Onebox::Engine::AmazonOnebox,
 @options=
  #<OpenStruct cache=#<Moneta::Expires:0x007f2f1674ca38 @adapter=#<Moneta::Transformer::JsonPrefixKeyJsonValue:0x007f2f1674cab0 @adapter=#<Moneta::Adapters::Memory:0x007f2f16795710 @backend={}>, @prefix="", @features=[:increment, :create]>, @default_expires=nil>, connect_timeout=5, timeout=10, load_paths=["/var/www/discourse/vendor/bundle/ruby/2.0.0/gems/onebox-1.5.35/templates"], twitter_client=TwitterApi>,
 @url="http://www.amazon.com/gp/product/B005T3GRNW/ref=s9_simh_gw_p147_d0_i2">
[6] pry(main)> "#{preview}" == preview.to_s
NoMethodError: undefined method `value' for nil:NilClass
from /var/www/discourse/vendor/bundle/ruby/2.0.0/gems/onebox-1.5.35/lib/onebox/engine/amazon_onebox.rb:38:in `image'
[7] pry(main)> 

After executing this, I can see that raw now contains a value when viewing the preview object (too large to paste here).

This seems to indicate that:

  1. the expected HTML response has changed
  2. the parsing of the HTML doesn’t correctly catch an error which occurs during the “parsing” of the response.
  3. an error is not being surfaced to the Discourse logs
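
Points 2 and 3 can be illustrated with a minimal sketch. It does not use the real onebox code: FakeEngine, Attribute and safe_preview are hypothetical names, and the engine only imitates the failure mode from the pry session (calling value on a missing attribute). The wrapper rescues the error, writes it to a logger, and returns nil instead of letting the exception escape:

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for an HTML attribute object that responds to #value.
Attribute = Struct.new(:value)

# Hypothetical stand-in for an onebox engine. It imitates the failure seen in
# the pry session: the attribute lookup returns nil and #value is called on it.
class FakeEngine
  def initialize(attributes)
    @attributes = attributes
  end

  def to_html
    # Raises NoMethodError when "src" is missing, because nil has no #value.
    %(<img src="#{@attributes['src'].value}">)
  end
end

# Render the preview, but log the failure and return nil so the caller can
# fall back to a plain link instead of answering with an error.
def safe_preview(engine, logger)
  engine.to_html
rescue StandardError => e
  logger.error("onebox failed: #{e.class}: #{e.message}")
  nil
end

LOG_OUTPUT = StringIO.new
LOGGER = Logger.new(LOG_OUTPUT)

GOOD_HTML = safe_preview(FakeEngine.new({ "src" => Attribute.new("http://example.com/a.jpg") }), LOGGER)
BAD_HTML  = safe_preview(FakeEngine.new({}), LOGGER)
```

Here BAD_HTML is nil and the NoMethodError ends up in the log, which is roughly the behaviour the list above asks for. Whether the real fix should rescue this broadly is a separate design question.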

(Jeff Atwood) #8

Hmm possibly @tgxworld can have a peek


(Dave McClure) #9

Just tried a couple things:

  1. From the DO box itself I can curl this URL fine:
    http://www.amazon.com/The-Advantage-Organizational-Everything-Business/dp/1491510803

  2. If I curl from within the docker container, though, I get a 503:

* Connection #0 to host www.google.com left intact
root@host:/# curl http://www.amazon.com/The-Advantage-Organizational-Everything-Business/dp/1491510803 -v -o deleteme
* Hostname was NOT found in DNS cache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 54.239.17.6...
* Connected to www.amazon.com (54.239.17.6) port 80 (#0)
> GET /The-Advantage-Organizational-Everything-Business/dp/1491510803 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: www.amazon.com
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Date: Sun, 27 Mar 2016 19:41:45 GMT
* Server Server is not blacklisted
< Server: Server
< Vary: Cookie,Accept-Encoding,User-Agent
< Last-Modified: Thu, 05 Nov 2015 22:42:39 GMT
< ETag: "562-523d2d81655c0"
< Accept-Ranges: bytes
< Content-Length: 1378
< Cneonction: close
< Content-Type: text/html

response body contains this at the top:

<!--
        To discuss automated access to Amazon data please contact api-services-support@amazon.com.
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
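
A small sketch of how a response like this could be told apart from a real product page. The helper name classify_response is hypothetical, and the marker text is simply copied from the HTML comment above; nothing here is taken from the onebox gem:

```ruby
# Marker text copied from the HTML comment at the top of the 503 body above.
BLOCK_MARKER = "To discuss automated access to Amazon data"

# Hypothetical helper that sorts a response into one of three buckets:
#   :blocked - body carries Amazon's automated-access notice
#   :error   - any other non-2xx status (could be a genuine outage)
#   :ok      - a 2xx response without the notice
def classify_response(status, body)
  return :blocked if body.to_s.include?(BLOCK_MARKER)
  return :error unless (200..299).cover?(status.to_i)

  :ok
end
```

Checking the body rather than only the status code is a deliberate choice here, since a bare 503 could also be a temporary outage.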

(Jeff Atwood) #10

So Amazon is restricting access from Digital Ocean.


(Dave McClure) #11

Hmm… seems a little more nuanced than that since the call succeeds when made directly from the DO box, but not when made from within the container.


(Dean Taylor) #12

Tracing the code, a more accurate request for testing is:

curl http://www.amazon.com/gp/aw/d/1491510803 -v -o deleteme -H "User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3"

Note the:

  • manipulated URL
      • from: http://www.amazon.com/The-Advantage-Organizational-Everything-Business/dp/1491510803
      • to: http://www.amazon.com/gp/aw/d/1491510803
  • specific mobile User Agent

I’ve now tested on latest:

  • whilst the test curl request above returns 200 OK
  • I still see a 404 (Not Found) error in Chrome Dev Tools for /onebox?url=...
  • I have not seen a 503 (Service Unavailable) error
  • This Discourse instance is on DigitalOcean (LON1)
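
The URL manipulation noted above can be sketched in a few lines of Ruby. This is an illustration only: mobile_amazon_url and PRODUCT_ID_PATTERN are hypothetical names, and the pattern (a 10-character ID after /dp/ or /gp/product/) is an assumption based on the URLs in this topic, not the regex onebox actually uses:

```ruby
require "uri"

# Mobile User-Agent string copied from the curl command above; a real request
# would send it as the User-Agent header.
MOBILE_USER_AGENT = "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) " \
                    "AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 " \
                    "Mobile/9A405 Safari/7534.48.3"

# Assumed pattern: a 10-character product ID after /dp/ or /gp/product/.
PRODUCT_ID_PATTERN = %r{/(?:dp|gp/product)/([A-Z0-9]{10})(?:/|\z)}

# Rewrite a desktop product URL to the /gp/aw/d/<id> form on the same host.
# Returns nil when no product ID can be found in the path.
def mobile_amazon_url(url)
  uri = URI.parse(url)
  match = PRODUCT_ID_PATTERN.match(uri.path)
  return nil unless match

  "#{uri.scheme}://#{uri.host}/gp/aw/d/#{match[1]}"
end

puts mobile_amazon_url("http://www.amazon.com/The-Advantage-Organizational-Everything-Business/dp/1491510803")
# prints http://www.amazon.com/gp/aw/d/1491510803
```

Keeping the original host means the same helper covers www.amazon.co.uk links as well.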

(Dave McClure) #13

Are you curling from inside the container (after doing ./launcher enter app)?


(Dean Taylor) #14

It’s the same for me both inside and outside of the container (successful 200 OK, no HTML comment message at top of page code).


(Jeff Atwood) #15

It looks like @tgxworld may have fixed something here?


(Alan Tan) #16

This is a super weird bug…

(byebug) Nokogiri::HTML(open(url, {read_timeout: timeout}.merge(http_params)).read).css("#main-image").first.attributes.keys
["alt", "src", "data-fling-asin", "data-fling-refmarker", "data-midres-replacement", "onload", "class", "id", "data-a-image-name", "data-a-hires"]
(byebug) Nokogiri::HTML(open(url, {read_timeout: timeout}.merge(http_params)).read).css("#main-image").first.attributes.keys
["alt", "src", "data-fling-asin", "data-fling-refmarker", "data-hires-replacement", "onload", "data-a-image-name", "class", "id", "data-a-dynamic-image", "style"]
(byebug) Nokogiri::HTML(open(url, {read_timeout: timeout}.merge(http_params)).read).css("#main-image").first.attributes.keys
["alt", "src", "data-fling-asin", "data-fling-refmarker", "data-hires-replacement", "onload", "data-a-image-name", "class", "id", "data-a-dynamic-image", "style"]
(byebug) Nokogiri::HTML(open(url, {read_timeout: timeout}.merge(http_params)).read).css("#main-image").first.attributes.keys
["alt", "src", "data-fling-asin", "data-fling-refmarker", "data-hires-replacement", "onload", "data-a-image-name", "class", "id", "data-a-dynamic-image", "style"]
(byebug) Nokogiri::HTML(open(url, {read_timeout: timeout}.merge(http_params)).read).css("#main-image").first.attributes.keys
["alt", "src", "data-fling-asin", "data-fling-refmarker", "data-hires-replacement", "onload", "data-a-image-name", "class", "id", "data-a-dynamic-image", "style"]
(byebug) Nokogiri::HTML(open(url, {read_timeout: timeout}.merge(http_params)).read).css("#main-image").first.attributes.keys
["alt", "src", "data-fling-asin", "data-fling-refmarker", "data-midres-replacement", "onload", "class", "id", "data-a-image-name", "data-a-hires"]
(byebug) 

For some reason, the attributes on the image keep changing. Note how the response alternates between [data-midres-replacement, data-a-hires] and [data-hires-replacement, data-a-dynamic-image, style].

Why? I’m not sure and I’ve spent too much time trying to figure out why. I did manage to reproduce the different image tag attributes in my browser by loading the same product URL normally and in incognito mode.

I’ll wait a day or two before merging my PR below to see if anyone might know why…
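
For reference, the difference between the two variants can be computed directly from the byebug output above. The constant names are made up for this sketch; the attribute lists are copied from the session:

```ruby
# Attribute names copied from the two distinct byebug results above.
VARIANT_A = %w[alt src data-fling-asin data-fling-refmarker data-midres-replacement
               onload class id data-a-image-name data-a-hires].freeze
VARIANT_B = %w[alt src data-fling-asin data-fling-refmarker data-hires-replacement
               onload data-a-image-name class id data-a-dynamic-image style].freeze

ONLY_IN_A = VARIANT_A - VARIANT_B # => ["data-midres-replacement", "data-a-hires"]
ONLY_IN_B = VARIANT_B - VARIANT_A # => ["data-hires-replacement", "data-a-dynamic-image", "style"]
SHARED    = VARIANT_A & VARIANT_B # attributes present in both variants
```

SHARED includes src, so at least in these samples there is one image attribute that is present in both variants.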


(Dave McClure) #17

:relieved: I was pretty sure I was just going crazy…


(Alan Tan) #18

Oh, anyway, there was another bug fix here… We were getting a 404 because our code blows up if it can’t retrieve an image. I fixed that for now, so even if the image is not found, we’ll still have a nice onebox without the image.
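
Roughly, the idea looks like the sketch below. This is not the actual change made to onebox: image_url and onebox_data are hypothetical helpers, the attribute names come from the byebug output earlier in this topic, and both the order of preference and the assumption that each value is a usable image URL are guesses:

```ruby
# Candidate attribute names taken from the byebug output earlier in this
# topic. The order of preference is an assumption.
IMAGE_ATTRIBUTES = %w[data-a-hires data-hires-replacement data-midres-replacement src].freeze

# `attributes` is a plain Hash of attribute name => string value, or nil when
# no image element was found at all. Returns the first non-empty candidate.
def image_url(attributes)
  return nil if attributes.nil?

  IMAGE_ATTRIBUTES.each do |name|
    value = attributes[name]
    return value unless value.nil? || value.empty?
  end
  nil
end

# Build the data for a onebox, leaving the image out when none is available
# instead of raising.
def onebox_data(title, attributes)
  data = { title: title }
  url = image_url(attributes)
  data[:image] = url if url
  data
end
```

With no image element at all, onebox_data("Some book", nil) returns only the title, so the onebox can still render.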


(Alan Tan) #19

Just updated onebox!


Amazon onebox still failing on my discourse instance on Digitalocean
(Alan Tan) #20

This topic was automatically closed after 2 days. New replies are no longer allowed.