Stop scraping scripts from running on my Discourse site


(Anna Naumova) #1

Hello.

I am going to set up a Discourse forum on my server.

I am trying to stop scraping scripts from running against my site. I was thinking of using a plugin to stop them, but I couldn’t find a suitable one.

I tried scraping the site with a PHP script, and it works.

Please help me to stop the scraping script. Any help will be appreciated. Thanks in advance.

Anna.


(Mittineague) #2

The easiest thing would be to make your site “login required”.


(Anna Naumova) #3

Thanks so much, @Mittineague, for your reply.

Is there any other way to stop scraping scripts from running against a Discourse site?


(Mittineague) #4

Yes, the aforementioned “login required” setting. When any HTTP request is made to the site, a “must login” page will be returned. Only registered members who are logged in will be able to get any further site content.


(Anna Naumova) #5

But I don’t want to enable that option on the live site.

Is there a better option?


(Michael Brown) #6

Stopping scrapers will be an arms race. If they’re dedicated to scraping your content, you won’t be able to stop them.

That said, great first steps are:

  • using login required
  • blocking the scraper’s User-Agent
  • detecting scraping activity and using a tarpit to slow them down or a honeypot to generate irrelevant content for them to pull down and taint their data
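
To make the User-Agent and tarpit ideas concrete, here is a minimal sketch in Python. It is not Discourse code: the blocklist entries, the function names, and the notion of a numeric "suspicion score" are all assumptions for illustration, and logic like this would more typically live in a reverse proxy or middleware in front of the site. Keep in mind that a User-Agent header is trivial to spoof, so this only stops the laziest scripts.

```python
import time

# Hypothetical blocklist: lowercase substrings of User-Agent headers to reject.
BLOCKED_AGENT_SUBSTRINGS = ("curl", "python-requests", "scrapy", "wget")


def is_blocked_agent(user_agent):
    """Return True if the User-Agent is missing or matches the blocklist."""
    if not user_agent:
        return True  # treat a missing User-Agent as suspicious
    ua = user_agent.lower()
    return any(fragment in ua for fragment in BLOCKED_AGENT_SUBSTRINGS)


def tarpit_delay(suspicion_score, base_seconds=0.5, max_seconds=30.0):
    """Delay that doubles with each suspicion point, capped at max_seconds."""
    if suspicion_score <= 0:
        return 0.0
    return min(base_seconds * (2 ** (suspicion_score - 1)), max_seconds)


def handle_request(user_agent, suspicion_score, sleep=time.sleep):
    """Return an HTTP-like status code after applying both checks."""
    if is_blocked_agent(user_agent):
        return 403
    delay = tarpit_delay(suspicion_score)
    if delay > 0:
        sleep(delay)  # slow the suspected scraper down before answering
    return 200
```

How the suspicion score gets computed is left open here, since that is exactly the hard part.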

The problem that makes this difficult to solve is: how do you differentiate scraping traffic from a normal user?


(Anna Naumova) #7

Thanks, @supermathie, for your reply.

I am tracking IP addresses, so I treat each IP as one user.

If a user exceeds the access limit, I want the site to block the content until they pass Google’s reCAPTCHA v2.
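
The idea could look roughly like the sketch below: a sliding-window counter per IP that switches an address into a "needs CAPTCHA" state once it goes over the limit. This is only an illustration in Python, not a Discourse plugin. The class name, the default limit of 60 requests per minute, and the `captcha_passed` hook are assumptions, and the actual reCAPTCHA token verification is deliberately left out.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Sliding-window request counter per IP with a CAPTCHA gate."""

    def __init__(self, limit=60, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self._challenged = set()         # ips that must solve a CAPTCHA first

    def check(self, ip):
        """Record one request and return "allow" or "captcha"."""
        if ip in self._challenged:
            return "captcha"
        now = self.clock()
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.limit:
            self._challenged.add(ip)
            return "captcha"
        return "allow"

    def captcha_passed(self, ip):
        """Call this only after the CAPTCHA answer was verified server-side."""
        self._challenged.discard(ip)
        self._hits.pop(ip, None)
```

A usage example with a tiny limit:

```python
limiter = IpRateLimiter(limit=3, window_seconds=10.0)
for _ in range(4):
    status = limiter.check("203.0.113.7")
# status is now "captcha"; content stays blocked until captcha_passed() is called
limiter.captcha_passed("203.0.113.7")
```

One caveat with one user per IP: several people behind the same NAT or proxy share an address, so they would also share one counter.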