Onebox fails horribly with tumblr


(Sam Saffron) #1

This

http://hardtruthsfromsoftcats.tumblr.com/

onebox broken for tumblr urls


(Nick Caldwell) #2

Suggested alternate title: Tumblr markup unfit for human or computer consumption.


(William Karavites) #3


(Evaryont 🎆) #4

The problem is at least only with this particular user’s Tumblr page. Other pages embed fine.

Looking at this some more, this particular tumblr does not include <html> tags anywhere. It also includes doctype twice, and has a <script> as the first element of the body. (After the first doctype.)

Hilarious! MySpace would be proud. :clap:


(Jeff Atwood) #5

(Sam Saffron) #6

(Jeff Atwood) #7

Why are we re opening this? If it is a horribly malformed Tumblr page HTML that is hardly our fault.


(Sam Saffron) #8

of course it is our fault, we should be truncating that link at the absolute bare minimum

also, looking at open graph on the page, it seems perfectly sane, or at least sane enough that we can cope with it.


(Vikhyat Korrapati) #9

It does fall back to open graph, problem seems to be with the opengraph_parser gem.

2.1.0 :047 > OpenGraph.new("http://hardtruthsfromsoftcats.tumblr.com/").title
 => "Hard Truths From Soft Cats"
2.1.0 :048 > body = open("http://hardtruthsfromsoftcats.tumblr.com/").read; OpenGraph.new(body).title
 => "<!DOCTYPE html>\n<script>var __pbpa = true;</script><script>var translated_warning_string = 'Warning: Never enter your Tumblr password unless \\u201chttps://www.tumblr.com/l'...

The problem is specifically caused by this line: opengraph_parser/open_graph.rb at master · huyha85/opengraph_parser · GitHub

They use the presence of </html> to determine whether what is passed is a URL or HTML, which this particular page does not include.

There are two ways to fix this:

  1. Send a PR upstream adding an explicit option to specify whether we are passing in a URL or HTML.
  2. Move opengraph parsing into the onebox gem and drop the opengraph_parser dependency, considering that it is only like 20 lines of code anyway.

Personally I think the second one would be better.


(Sam Saffron) #10

We should be using nokogiri to parse this stuff … not sure why even use this opengraph_parser gem … what is it buying us?


(Vikhyat Korrapati) #11

Created a PR to drop opengraph_parser:

https://github.com/discourse/onebox/pull/203

With this in we get:

2.1.0 :002 > Onebox.preview("http://hardtruthsfromsoftcats.tumblr.com/").to_s
 => "<aside class=\"onebox whitelistedgeneric\">\n  <header class=\"source\">\n    <a href=\"http://hardtruthsfromsoftcats.tumblr.com/\">\n      hardtruthsfromsoftcats.tumblr.com\n    </a>\n  </header>\n  <article class=\"onebox-body\">\n    <img src=\"http://25.media.tumblr.com/avatar_62067045f582_128.png\" class=\"thumbnail\"/>\n\n<h3><a href='http://hardtruthsfromsoftcats.tumblr.com/'>Hard Truths From Soft Cats</a></h3>\n\n<p>Updated Every Monday And Thursday</p>\n\n  </article>\n  <div style='clear: both'></div>\n</aside>\n"

(Sam Saffron) #12

Thank you sir!


(Sam Saffron) #13

This topic was automatically closed after 24 hours. New replies are no longer allowed.