Do hits to /srv/status count as crawlers?

I’ve got a site running 3 Kubernetes pods (it’s what they wanted!). The crawler stats show > 120K pageviews/day. There are . . . very few . . . users and fewer than 1000 posts, so there isn’t much crawling to be done. Staring at production.log on one of the pods for a few minutes shows only traffic to /srv/status, which k8s is using as a health monitor. Should that be getting counted as crawler traffic? Is there something else I might be missing?

4 Likes

No, that should not be counted.

2 Likes

120K is right on the mark for a constant 5k/hr, so I’d say they are being counted even though they shouldn’t.

Maybe something’s going wrong in the detection?

5 Likes

I can confirm that they are being counted (and they are logged from 127.0.0.1):

SELECT * FROM web_crawler_requests
WHERE date = '2020-03-05'
ORDER BY count DESC

EDIT: also, I’m bad with numbers, apparently, and it’s ~12K/day.

And that is confusing too because if I grep -c /srv/status in the 3 pods they each have ~15K lines matching /srv/status in the 20 hours that they have been up. So that would be close to 50K/day.
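For reference, the rates quoted so far can be checked with a quick back-of-the-envelope sketch. It only uses the rounded figures from the posts above (120K/day, ~15K lines per pod over ~20 hours, 3 pods); the variable names are illustrative.

```python
# Back-of-the-envelope check of the request rates quoted in this thread.
HOURS_PER_DAY = 24

# Original reading of the crawler stats: 120K pageviews/day.
per_hour_from_stats = 120_000 / HOURS_PER_DAY

# grep count: ~15K /srv/status lines per pod over ~20 hours of uptime, 3 pods.
pods = 3
lines_per_pod = 15_000
uptime_hours = 20
per_day_from_logs = pods * lines_per_pod / uptime_hours * HOURS_PER_DAY

print(per_hour_from_stats)  # hits per hour implied by the stats page
print(per_day_from_logs)    # hits per day implied by the pod logs
```

With these inputs the stats figure works out to 5,000/hr and the log figure to 54,000/day, so neither matches the corrected ~12K/day.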

1 Like

They are counted, and if you don’t want them counted, you have a clear option you can use:

https://github.com/discourse/discourse/blob/master/lib/middleware/request_tracker.rb#L102

Set the Discourse-Track-View HTTP header to 0 on the originating request.
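As a minimal sketch of what that looks like from the client side, here is a request built with Python’s standard library that carries the header. The URL is a placeholder assumption, and `build_health_check` is a hypothetical helper name; the request is only constructed here, not sent.

```python
import urllib.request

# Hypothetical URL: replace with wherever your health checker reaches the pod.
STATUS_URL = "http://localhost/srv/status"


def build_health_check(url: str = STATUS_URL) -> urllib.request.Request:
    """Build (but do not send) a GET request that opts out of view tracking."""
    req = urllib.request.Request(url, method="GET")
    # The header name and value come from the post above.
    req.add_header("Discourse-Track-View", "0")
    return req


req = build_health_check()
# urllib stores header names via str.capitalize(); HTTP header names are
# case-insensitive, so this does not change their meaning.
print(req.header_items())
# To actually send it: urllib.request.urlopen(req, timeout=5)
```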

8 Likes

I appreciate that you over-estimate what’s clear to me. :wink:

That’s almost clear.

OK, so I do need to set HTTP_DISCOURSE_TRACK_VIEW=false in the ENV passed to the container when it cranks up? And/Or I contrive to get the thing that’s hitting /srv/status to include a Discourse-Track-View: 0 header? (or maybe it’s an =, but presumably I can figure that out).

EDIT: I think that did the trick (thanks, @Falco!) but I’ll report back tomorrow when I can tell for sure that it worked.

EDIT2: And GKE has not only health checks from k8s, but also health checks from the load balancer, so both of those need to be configured with the Discourse-Track-View: 0 header.

4 Likes

The latter. Assuming the thing has the feature to allow you to use custom headers :sweat_smile:

6 Likes

Yep, custom headers:

pods/probe/http-liveness.yaml 

apiVersion: v1
kind: Pod
metadata:
  labels: {...}
  name: liveness-http
spec:
  containers:
  - name: ...
    image: ...
    readinessProbe:
      httpGet:
        path: /srv/status
        port: 80
        httpHeaders:
        - name: Discourse-Track-View
          value: "0"
      periodSeconds: 3
    livenessProbe:
      httpGet:
        # sent https://github.com/discourse/discourse/pull/9136
        path: /srv/status?shutdown_ok=1
        port: 80
        httpHeaders:
        - name: Discourse-Track-View
          value: "0"  # note the quotes, to avoid yaml making it an integer
      initialDelaySeconds: 3
      periodSeconds: 3

6 Likes

Thanks, Kane! I really appreciate that. That’s what I’m trying, and it appears that it’s working, but I was waiting until tomorrow to see if it’s really working. :wink:

1 Like

@sam, I think this route should not be counted by default. I see no value in ever counting “pageviews” to this route… do you?

6 Likes

Sure, we can add a bypass here; it is just a bit tricky to test, and we don’t have a clean pattern for bypassing.

6 Likes

The responses should already be excluded from view tracking because they are text/plain and not text/html.

What’s the exact path the GoogleHC was hitting? If it was hitting /, that’s your problem; and the view counting was correct.
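The two rules mentioned in this thread (the header opt-out and the text/html requirement) can be sketched as a toy model. This is not Discourse’s actual code: `would_track_view` is a hypothetical function, and it covers only the two conditions described here, ignoring anything else the real request tracker checks.

```python
from typing import Optional


def would_track_view(track_view_header: Optional[str], content_type: str) -> bool:
    """Toy model of the two rules described in this thread; NOT Discourse's code."""
    # Rule 1: an explicit Discourse-Track-View: 0 header opts the request out.
    if track_view_header == "0":
        return False
    # Rule 2: only text/html responses are candidates for view tracking.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "text/html"


# A text/plain response, as described for /srv/status:
print(would_track_view(None, "text/plain"))
# A health check that hits / and gets an HTML page back:
print(would_track_view(None, "text/html; charset=utf-8"))
# The same HTML page, but requested with the opt-out header:
print(would_track_view("0", "text/html; charset=utf-8"))
```

Under this model, a probe pointed at / is counted unless it sends the header, while a probe pointed at /srv/status is not counted either way.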

5 Likes

Hmm. I think that I added the above-described headers to both K8s and the load balancer, and I’m pretty sure they’re both hitting /srv/status. And I’m still getting exactly 12,960 hits a day most days. I’ll give it another look shortly.

Thanks for your help!
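A daily count that exact suggests a fixed-interval checker rather than organic traffic. A quick sketch converting the 12,960/day figure into a rate (the variable names are illustrative, and the interpretation is an assumption, not something the logs confirm):

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60

hits_per_day = 12_960  # figure reported in the post above

hits_per_minute = hits_per_day / MINUTES_PER_DAY
seconds_between_hits = SECONDS_PER_DAY / hits_per_day

print(hits_per_minute)                 # a steady whole number per minute
print(round(seconds_between_hits, 2))  # average gap between counted hits
```

That comes to a steady 9 counted hits per minute, or one roughly every 6.7 seconds on average.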

1 Like

That was the problem. I thought that I’d configured both the load balancer and the k8s health check to hit /srv/status, but one of them was hitting /. So when I finally figured out how to fix that, the problem went away.

The great news is that adding the Discourse-Track-View header is not necessary!

Now to get all of my uptime robot health checks using /srv/status.

4 Likes