I’ve got a site running 3 Kubernetes pods (it’s what they wanted!). The crawler stats show > 120K pageviews/day. There are . . . very few . . . users and fewer than 1000 posts, so there isn’t much crawling to be done. Staring at production.log of one of the pods for a few minutes shows only traffic to /srv/status, which k8s is using as a health monitor. Should that be getting counted as a crawler? Is there something else I might be missing?
No, that should not be counted.
120K is right on the mark for a constant 5k/hr, so I’d say they are being counted even though they shouldn’t.
Maybe something’s going wrong in the detection?
I can confirm that they are being counted (and they are logged from 127.0.0.1)
SELECT * from web_crawler_requests
where date = '2020-03-05'
order by count desc
EDIT: also, I’m bad with numbers, apparently, and it’s ~12K/day.
And that is confusing too, because if I grep -c /srv/status in the 3 pods, they each have ~15K lines matching /srv/status in the 20 hours that they have been up. So that would be close to 50K/day.
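The back-of-the-envelope numbers above can be checked in a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and the inputs are the rough figures quoted in this thread.

```python
def per_hour(daily_total):
    """Average hourly rate for a given daily total."""
    return daily_total / 24


def extrapolate_to_day(count, hours_observed, pods=1):
    """Scale a per-pod count seen over `hours_observed` hours to 24 hours across all pods."""
    return count / hours_observed * 24 * pods


print(per_hour(120_000))                       # 5000.0
print(extrapolate_to_day(15_000, 20, pods=3))  # 54000.0
```

So the logs suggest roughly 54K /srv/status requests a day across the three pods, which is well above the ~12K/day that is actually being counted.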
They are counted and if you don’t want them counted you have a clear option you can use
https://github.com/discourse/discourse/blob/master/lib/middleware/request_tracker.rb#L102
Set the Discourse-Track-View HTTP header to 0 on the originating request
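For anyone scripting their own check, here is a minimal sketch of a request that carries that header, using only Python's standard library. The host forum.example.com and the helper name build_health_check are placeholders for illustration, not anything from Discourse itself.

```python
import urllib.request


def build_health_check(base_url, path="/srv/status"):
    """Build (but do not send) a GET request that opts out of view tracking."""
    req = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    # Header names are case-insensitive on the wire; urllib normalizes the casing.
    req.add_header("Discourse-Track-View", "0")
    return req


# forum.example.com is a placeholder host, not a real deployment.
req = build_health_check("https://forum.example.com")
print(req.full_url)
print(req.header_items())
# To actually send it: urllib.request.urlopen(req, timeout=5)
```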
I appreciate that you over-estimate what’s clear to me.
That’s almost clear.
OK, so I do need to set HTTP_DISCOURSE_TRACK_VIEW=false in the ENV passed to the container when it cranks up? And/or I contrive to get the thing that’s hitting /srv/status to include a Discourse-Track-View: 0 header? (Or maybe it’s an =, but presumably I can figure that out.)
EDIT: I think that did the trick (thanks, @Falco!) but I’ll report back tomorrow when I can tell for sure that it worked.
EDIT2: And GKE has not only health checks from k8s, but also health checks from the load balancer, so both of those need to be configured with the Discourse-Track-View: 0 header.
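On the header-versus-ENV confusion: Rack-based apps see each incoming request header under a per-request key made of HTTP_ plus the upper-cased header name with dashes turned into underscores. That is presumably where the HTTP_DISCOURSE_TRACK_VIEW name in the linked source comes from, so it is a property of each request rather than a container environment variable. A small sketch of that naming convention (the function name is made up for illustration):

```python
def rack_env_key(header_name):
    """Return the Rack/CGI-style key under which a request header is exposed.

    Content-Type and Content-Length are the usual exceptions (no HTTP_ prefix);
    they are not handled here.
    """
    return "HTTP_" + header_name.upper().replace("-", "_")


print(rack_env_key("Discourse-Track-View"))  # HTTP_DISCOURSE_TRACK_VIEW
```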
The latter, assuming the thing has a feature that lets you use custom headers.
Yep, custom headers:
pods/probe/http-liveness.yaml
apiVersion: v1
kind: Pod
metadata:
  labels: {...}
  name: liveness-http
spec:
  containers:
  - name: ...
    image: ...
    readinessProbe:
      httpGet:
        path: /srv/status
        port: 80
        httpHeaders:
        - name: Discourse-Track-View
          value: "0"
      periodSeconds: 3
    livenessProbe:
      httpGet:
        # sent https://github.com/discourse/discourse/pull/9136
        path: /srv/status?shutdown_ok=1
        port: 80
        httpHeaders:
        - name: Discourse-Track-View
          value: "0" # note the quotes, to avoid yaml making it an integer
      initialDelaySeconds: 3
      periodSeconds: 3
Thanks, Kane! I really appreciate that. That’s what I’m trying, and it appears that it’s working, but I was waiting until tomorrow to see if it’s really working.
I think by default @sam this route should not be counted, I see no value in ever counting “pageviews” to this route… do you?
Sure we can add a bypass here it is just a bit tricky to test and we don’t have a clean pattern for bypassing.
The responses should already be excluded from view tracking because the responses are text/plain and not text/html.
What’s the exact path the GoogleHC was hitting? If it was hitting /, that’s your problem; and the view counting was correct.
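One way to answer the "exact path" question is to tally request paths and client IPs from production.log instead of grepping for a single path. The sketch below assumes the log contains stock Rails "Started ..." request lines; the sample lines are made up for illustration, and tally_requests is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumes stock Rails request lines such as:
#   Started GET "/srv/status" for 127.0.0.1 at 2020-03-05 12:00:00 +0000
STARTED = re.compile(r'^Started (\S+) "([^"]*)" for (\S+)')


def tally_requests(lines):
    """Count (path, client IP) pairs among Rails 'Started ...' log lines."""
    counts = Counter()
    for line in lines:
        match = STARTED.match(line)
        if match:
            _method, path, ip = match.groups()
            # Drop the query string so /srv/status?shutdown_ok=1 groups with /srv/status.
            counts[(path.split("?")[0], ip)] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    'Started GET "/srv/status" for 127.0.0.1 at 2020-03-05 12:00:00 +0000',
    'Started GET "/" for 127.0.0.1 at 2020-03-05 12:00:03 +0000',
    'Started GET "/srv/status?shutdown_ok=1" for 127.0.0.1 at 2020-03-05 12:00:06 +0000',
]
for (path, ip), n in tally_requests(sample).most_common():
    print(n, path, ip)
```

Against a real log, passing the open file object (for example `tally_requests(open("production.log"))`) should work the same way; a steady stream of hits on / from a checker's address would stand out immediately.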
Hmm. I think that I added the above-described headers to both K8s and the load balancer, and I’m pretty sure that they’re both hitting /srv/status, but I’m still getting exactly 12,960 hits a day most days. I’ll give it another look shortly.
Thanks for your help!
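A count that lands on exactly 12,960 most days looks more like a fixed-interval checker than real visitors. A quick sketch of the arithmetic, using only the figure quoted above:

```python
hits_per_day = 12_960

hits_per_minute = hits_per_day / (24 * 60)
seconds_between_hits = 24 * 60 * 60 / hits_per_day

print(hits_per_minute)
print(round(seconds_between_hits, 2))
```

That works out to a whole number of hits per minute, which is consistent with (though not proof of) an automated check running on a timer.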
That was the problem. I thought that I’d configured both the load balancer and the k8s health check to hit /srv/status, but one of them was hitting /. So when I finally figured out how to fix that, the problem went away.
The great news is that adding the Discourse-Track-View header is not necessary!
Now to get all of my uptime robot health checks using /srv/status.