Onebox embeds too much content when target page has malformed HTML

Hi,

I had an issue with the following link: http://cupfoundation.wordpress.com/2014/10/13/le-charlatanisme-de-galuel/

When I put it on a line by itself, it imports almost the whole page. I don't think this is normal behaviour, is it?

This is indeed really weird: the preview of the onebox is completely different from the cooked version…

We have seen this before. As far as I recall, it implies badly invalid markup on the target site, which somehow confuses the oneboxer.

So what should we do about it?

Check whether the target site passes HTML validation via the W3C validator. If it doesn't, how many errors does it report?

Error Line 1250, Column 113: Stray end tag a.

…s://fr.wordpress.com/?ref=lof">Build a website with WordPress.com</a></a></div>

Yep, that’ll mess up the document structure.
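For anyone who wants to check a page for this kind of problem without the W3C validator, here is a rough sketch using Python's standard `html.parser`. It is not what the validator or onebox does internally; `find_stray_end_tags` and the sample markup (with an `example.com` link) are made up for illustration, and it only catches end tags that have no matching open tag.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StrayEndTagFinder(HTMLParser):
    """Records end tags that do not close any currently open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of open element names
        self.stray = []       # (tag, line, column) for each stray end tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> opens and closes nothing.
        pass

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the most recent matching open tag.
            while self.open_tags.pop() != tag:
                pass
        else:
            line, column = self.getpos()
            self.stray.append((tag, line, column))


def find_stray_end_tags(html):
    finder = StrayEndTagFinder()
    finder.feed(html)
    finder.close()
    return finder.stray


# Hypothetical markup with the same shape as the reported error.
sample = '<div><a href="https://example.com/">Build a website</a></a></div>'
for tag, line, column in find_stray_end_tags(sample):
    print(f"Stray end tag {tag} at line {line}, column {column}")
```
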

Not sure how to fix this. Can we make onebox more tolerant of screwed up HTML?

We can. Nokogiri can iron out bad HTML, and we can truncate the result at some sort of sane size.
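To illustrate the general idea (parse leniently, then cap the output), here is a minimal sketch in Python using the standard `html.parser` rather than Nokogiri. It is not the onebox implementation; `summarize`, `TextExtractor` and the 300-character limit are assumptions made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text; unmatched or stray tags are simply ignored."""

    SKIP = {"script", "style"}  # content of these elements is not shown

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def summarize(html, max_chars=300):
    """Return the page text, whitespace-normalized and capped near max_chars."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join("".join(extractor.parts).split())
    if len(text) <= max_chars:
        return text
    # Cut at the last word boundary inside the limit and mark the truncation.
    return text[:max_chars].rsplit(" ", 1)[0] + "…"


# Hypothetical malformed input: the stray </a> does not affect the result.
print(summarize("<div><p>Hello <b>world</b></a></a></div>"))
```

The point of the cap is that even if a parser misjudges where the summary element ends, the embed can never grow to the size of the whole page.
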

@techapj can you add this to your list?

This is now fixed via:

https://github.com/discourse/onebox/commit/1b8bef8f96a4353cdfa6bc67da87d6750d5d31bf

Here’s how it looks now :+1:
