Amazon links are not crawled correctly for the topic details


(Camille Roux) #1

Most links are crawled correctly, except the Amazon ones. Here is an example:

From reading the source code, this is the responsibility of the CrawlTopicLink job. Can somebody have a look?


(Régis Hanol) #2

@eviltrout I remember you saying you had issues retrieving the title from Amazon pages and that you made a special case for them. Does that special case only take the .com version into account?


(Camille Roux) #3

Let’s try it:

EDIT: so, yes, the problem appears only with .fr links, not with .com ones.


(Jens Maier) #4

How about amazon.de?


(Camille Roux) #5

Same problem with .de links…
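
A special case that only recognizes amazon.com would explain why .fr and .de links fail. As a minimal sketch, and not the actual Discourse code, a host check covering the country domains could look like this. `AMAZON_HOST` and `amazon_url?` are hypothetical names introduced only for illustration.

```ruby
require "uri"

# Hypothetical pattern (not taken from the Discourse source): "amazon."
# followed by a country domain such as com, fr, de, co.uk or com.au.
AMAZON_HOST = /\A(?:www\.)?amazon\.(?:[a-z]{2,3})(?:\.[a-z]{2})?\z/i

# Returns true when the URL's host looks like an Amazon storefront,
# false for any other host or for a string that is not a valid URL.
def amazon_url?(url)
  host = URI.parse(url).host
  !host.nil? && AMAZON_HOST.match?(host)
rescue URI::InvalidURIError
  false
end
```

Matching on the parsed host rather than on the whole URL avoids false positives such as `https://example.com/amazon.fr`.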


(Camille Roux) #6

Found it!


(Régis Hanol) #7

I guess you now have enough for a pull request :wink:


(Camille Roux) #8

Done! How can we refresh all the links already posted?


(Régis Hanol) #9

That’s a good question. @eviltrout, will a rebake trigger the CrawlTopicLink job?


(Robin Ward) #10

I suspect not, as the job is enqueued after the links are saved, and I believe there is an intelligent diff that skips saving links that have not changed.

We could have the rebake enqueue a Jobs::CrawlTopicLink for each link after it’s done maybe?
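
A rough sketch of that idea, under stated assumptions: `FakeQueue` stands in for the real job queue, and `Link` and `enqueue_crawls_after_rebake` are hypothetical names, not Discourse's API. The point is only that the rebake enqueues one crawl per link regardless of whether the diff saw a change.

```ruby
# Stand-in for the application's job queue (an assumption for this sketch);
# it simply records what would have been enqueued.
class FakeQueue
  attr_reader :jobs

  def initialize
    @jobs = []
  end

  def enqueue(job_name, args)
    @jobs << [job_name, args]
  end
end

# Hypothetical link record holding only the fields this sketch needs.
Link = Struct.new(:id, :url, :title)

# After a rebake, enqueue a crawl for every link of the post, bypassing the
# "unchanged links are not saved" diff. Returns the number of jobs enqueued.
def enqueue_crawls_after_rebake(links, queue)
  links.each do |link|
    queue.enqueue(:crawl_topic_link, { topic_link_id: link.id })
  end
  links.size
end
```

Enqueuing unconditionally is simple but recrawls links that already have a title; filtering on a missing title first would reduce the load.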


(Camille Roux) #11

For the diff, you can use the same regexp as the job.
Good idea for the rebake!


(Jeff Atwood) #12