Onebox embeds too much content when target page has malformed HTML

Hi,

I had an issue with the following link: http://cupfoundation.wordpress.com/2014/10/13/le-charlatanisme-de-galuel/

When I put it on a line by itself, the onebox imports almost the whole page. I don't think this is normal behaviour, is it?


This is indeed really weird: the preview of the onebox is completely different from the cooked version…

We have seen this before. As I recall, it implies badly invalid markup on the target site, which somehow confuses the oneboxer.

So what should we do about it?

Check whether the target site passes HTML validation with the W3C validator. If it doesn't, how many errors does it report?

Error Line 1250, Column 113: Stray end tag a.

…s://fr.wordpress.com/?ref=lof">Build a website with WordPress.com</a></a></div>

Yep, that’ll mess up the document structure.


Not sure how to fix this. Can we make onebox more tolerant of screwed up HTML?

We can. Nokogiri can iron out bad HTML, and we can truncate the result at some sort of sane size.
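To illustrate the idea of repairing bad markup and capping its size, here is a minimal sketch. It is not the Onebox fix: it is written in Python with only the standard library rather than Ruby and Nokogiri, and the names `TolerantExcerpt`, `tidy_excerpt` and `max_chars` are made up for this example. It drops stray end tags such as the extra `</a>` reported above, closes anything left open, and stops after a fixed number of text characters.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TolerantExcerpt(HTMLParser):
    """Rebuild balanced markup from messy HTML, capped at max_chars of text."""

    def __init__(self, max_chars):
        super().__init__(convert_charrefs=True)
        self.remaining = max_chars   # text characters still allowed
        self.open_tags = []          # stack of currently open elements
        self.out = []                # output fragments

    def handle_starttag(self, tag, attrs):
        if self.remaining <= 0:
            return  # size limit reached: ignore the rest of the page
        rendered = "".join(
            f' {name}="{escape(value or "", quote=True)}"' for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.remaining <= 0 or tag not in self.open_tags:
            return  # stray end tag (or past the limit): drop it
        # Close every element up to and including the matching one.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.remaining <= 0:
            return
        chunk = data[: self.remaining]
        self.remaining -= len(chunk)
        self.out.append(escape(chunk, quote=False))

    def result(self):
        # Close whatever is still open so the excerpt stays balanced.
        closing = "".join(f"</{tag}>" for tag in reversed(self.open_tags))
        return "".join(self.out) + closing


def tidy_excerpt(html, max_chars=200):
    parser = TolerantExcerpt(max_chars)
    parser.feed(html)
    parser.close()
    return parser.result()


if __name__ == "__main__":
    # Hypothetical input modelled on the validator error quoted above.
    broken = '<div><a href="https://example.com">Build a website</a></a></div>'
    print(tidy_excerpt(broken))
```

The size limit counts text characters rather than bytes of markup, which is an arbitrary choice for this sketch. A real oneboxer would more likely lean on a full HTML5 parser to do the repair and apply the limit afterwards.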

@techapj can you add this to your list?


This is now fixed via:

https://github.com/discourse/onebox/commit/1b8bef8f96a4353cdfa6bc67da87d6750d5d31bf


Here’s how it looks now :+1:
