Amazon links are not crawled correctly for the topic details

(Camille Roux) #1

Most links are crawled correctly, except the Amazon ones. Here is an example:

From reading the source code, this is the responsibility of the CrawlTopicLink job. Can somebody have a look?

(Régis Hanol) #2

@eviltrout I remember you saying you had issues retrieving the title from Amazon pages and that you made a special case for them. Does that special case only take the .com version into account?

(Camille Roux) #3

Let’s try it:

EDIT: so, yes, the problem appears only with .fr links, not .com ones.
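To illustrate the .com vs .fr difference, here is a minimal sketch of a host check that is not tied to a single TLD. The `AMAZON_HOST` pattern and the `amazon_url?` helper are hypothetical names for illustration, not what the CrawlTopicLink job actually uses.

```ruby
require "uri"

# Hypothetical pattern (not taken from the Discourse source):
# "amazon." followed by a country suffix such as com, fr, de, co.uk or com.br.
AMAZON_HOST = /\A(?:www\.)?amazon\.[a-z]{2,3}(?:\.[a-z]{2})?\z/i

# Returns true when the URL's host looks like an Amazon storefront.
def amazon_url?(url)
  host = URI.parse(url).host
  !host.nil? && AMAZON_HOST.match?(host)
rescue URI::InvalidURIError
  false
end
```

A pattern hard-coded to `amazon.com` would accept the first kind of link and miss the .fr and .de ones; matching on the host rather than the full URL also avoids false positives on paths that merely contain "amazon".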

(Jens Maier) #4

How about

(Camille Roux) #5

Same problem with .de links…

(Camille Roux) #6

Found it!

(Régis Hanol) #7

I guess you now have enough for a pull request :wink:

(Camille Roux) #8

Done! How can we refresh all the links already posted?

(Régis Hanol) #9

That’s a good question. @eviltrout, will a rebake trigger the CrawlTopicLink job?

(Robin Ward) #10

I suspect not: the job is enqueued after the links are saved, and I believe there is an intelligent diff that skips saving links that have not changed.

Maybe we could have the rebake enqueue a Jobs::CrawlTopicLink for each link once it’s done?
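Roughly, the difference between the two paths could be sketched like this. Everything here is a stand-in written for illustration (`FakeQueue`, `Link`, `save_links`, `recrawl_all`, and the `:crawl_topic_link` job name with its `topic_link_id` argument are assumptions), not the real Discourse models or job API.

```ruby
# Stand-in for a background job queue; it only records what would be enqueued.
class FakeQueue
  attr_reader :enqueued

  def initialize
    @enqueued = []
  end

  def enqueue(job_name, args)
    @enqueued << [job_name, args]
  end
end

# Stand-in for a stored topic link.
Link = Struct.new(:id, :url)

# Normal save path: only URLs that are not stored yet are saved and crawled,
# so re-saving an unchanged post enqueues nothing.
def save_links(existing, urls, queue)
  known = existing.map(&:url)
  next_id = (existing.map(&:id).max || 0) + 1
  urls.uniq.reject { |url| known.include?(url) }.map do |url|
    link = Link.new(next_id, url)
    next_id += 1
    existing << link
    queue.enqueue(:crawl_topic_link, topic_link_id: link.id)
    link
  end
end

# Rebake path as proposed: enqueue a crawl for every stored link,
# whether or not it changed.
def recrawl_all(existing, queue)
  existing.each do |link|
    queue.enqueue(:crawl_topic_link, topic_link_id: link.id)
  end
end
```

With this shape, saving the same post twice only enqueues crawls the first time, while calling `recrawl_all` after a rebake would refresh links that were crawled before the fix.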

(Camille Roux) #11

For the diff, you can use the same regexp as the job.
Good idea for the rebake!

(Jeff Atwood) #12