Prometheus scrape job cannot reach the metrics

I have a running Discourse installation (two actually, one in staging and another in production, on different VMs and all). I'm testing on the staging environment. The installation was done via the official guide.

Currently there is a Grafana/Prometheus/Node Exporter stack deployed via docker compose on the same VM where Discourse is already running.

Here is the docker-compose.yaml

version: "3"

services:
    cadvisor:
        image: gcr.io/cadvisor/cadvisor:latest
        container_name: cadvisor
        restart: unless-stopped
        volumes:
            - /:/rootfs:ro
            - /var/run:/var/run:ro
            - /sys:/sys:ro
            - /var/lib/docker/:/var/lib/docker:ro
            - /dev/disk/:/dev/disk:ro
        networks:
            - prometheus-cadvisor

    node_exporter:
        image: quay.io/prometheus/node-exporter:latest
        container_name: node_exporter
        command:
            - '--path.rootfs=/host'
        pid: host
        restart: unless-stopped
        volumes:
            - '/:/host:ro,rslave'
        networks:
            - prometheus-node_exporter

    prometheus:
        image: prom/prometheus:latest
        restart: unless-stopped
        container_name: prometheus
        ports:
            - "9090:9090"
        volumes:
            - ./prometheus:/app.cfg
        networks:
            - world
            - prometheus-cadvisor
            - prometheus-node_exporter
            - discourse
            - grafana-prometheus
        command: >-
            --config.file=/app.cfg/prometheus.yaml
            --storage.tsdb.path=/prometheus
            --web.console.libraries=/usr/share/prometheus/console_libraries
            --web.console.templates=/usr/share/prometheus/consoles

    grafana:
        image: grafana/grafana:latest
        container_name: grafana
        restart: unless-stopped
        ports:
            - "3000:3000"
        environment:
            GF_SECURITY_ADMIN_USER: [OMITTED]
            GF_SECURITY_ADMIN_PASSWORD: [OMITTED]
            GF_PATHS_PROVISIONING: '/app.cfg/provisioning'
        volumes:
            - ./grafana:/app.cfg
            - ./grafana/provisioning:/etc/grafana/provisioning
        networks:
            - world
            - grafana-prometheus

networks:
    world:
    grafana-prometheus:
        internal: true
    prometheus-cadvisor:
        internal: true
    prometheus-node_exporter:
        internal: true
    discourse:
        external: true

I rebuilt Discourse specifying a network so that it doesn't deploy on the default bridge, and connected Prometheus to the same network.

docker network create -d bridge discourse
/var/discourse/launcher rebuild app --docker-args '--network discourse'

I tested by entering the Prometheus container and pinging the Discourse container using the internal network alias, and it could reach it.
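Roughly, something like this confirms both containers sit on the discourse network and can reach each other (vmuniqueID-app again standing in for the real app container name):

# list the containers attached to the shared network
docker network inspect discourse --format '{{range .Containers}}{{.Name}} {{end}}'
# ping the app container from inside the prometheus container
docker exec -it prometheus ping -c 2 vmuniqueID-app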


Now, when configuring the Prometheus job to scrape the metrics over the internal network, all I see is server returned HTTP status 404 Not Found.

This is the Prometheus configuration:

global:
  scrape_interval: 30s
  scrape_timeout: 10s

rule_files:

scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    static_configs:
      - targets:
        - 'prometheus:9090'
  - job_name: node_exporter
    static_configs:
      - targets:
        - 'node_exporter:9100'
  - job_name: discourse_exporter
    static_configs:
      - targets:
        - 'vmuniqueID-app:80'

vmuniqueID is a placeholder for the actual name of the VM.

As per the documentation here, access via internal IPs should be allowed:

Out of the box we allow the metrics route to admins and private ips.

Please help me see what I am missing :stuck_out_tongue:

Just to poke into it further, I tried generating an API key from Discourse and reaching it using the internal hostname, and the response is now a 301, which is right because every request is supposed to be redirected to HTTPS.

The problem, I think, is that incoming requests, even if from an internal IP, are being treated as not authorized and end up in a 404 for that reason.

You do have the prometheus plugin installed and enabled? It should allow requests from private addresses, but you could try setting the environment variable to allow access from the IP you're pulling from.

Yep, Prometheus is on the same VM and deployed as a docker container. Everything works (I have other exporters deployed as well), but for some reason the Discourse Prometheus plugin, even though it is clearly up and running, is not accepting requests.

When you say the ENV variable, you are talking about the env section in Discourse's app.yml file, right?

So, something like this:

env:
  DISCOURSE_PROMETHEUS_TRUSTED_IP_ALLOWLIST_REGEX: 172.20.0.3

172.20.0.3 being the current internal IP that Prometheus will have on the docker virtual network on which also Discourse is attached.
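Since the setting name says regex, I guess escaping the dots, or matching the whole docker subnet, would be more robust. Something like this, assuming compose handed out a 172.20.0.0/16 subnet (just a sketch on my part):

env:
  # matches any address in the 172.20.x.x docker network
  DISCOURSE_PROMETHEUS_TRUSTED_IP_ALLOWLIST_REGEX: '^172\.20\.'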

I already tried using the external IP that all the containers share anyway (the VM's static IP), but as they are on the same network, when one tries to access the other, it does so through the internal IP.

A ./launcher restart app should suffice for the envs to be picked up right?

In that case I get:

Get "http://vmi1187507-app:80/metrics": dial tcp: lookup vmi1187507-app on 127.0.0.11:53: server misbehaving

vmi1187507-app is the container's network name in its network. The name is correct; I can ping it from the running Prometheus container.
No idea where that 127.0.0.11:53 is coming from to be honest :thinking:

The message is the same if I comment out the env variable.

I would think so, but I’m not entirely sure. You can test from inside the container and see if you can curl it from there.

Running a wget from the prometheus container returns:

/prometheus # wget http://vmi1229594-app:80/metrics
Connecting to vmi1229594-app:80 (172.20.0.2:80)
Connecting to [public URL] (172.67.69.84:443)
wget: note: TLS certificate validation not implemented
wget: server returned error: HTTP/1.1 404 Not Found

I'm guessing here that it's the automatic redirect from the Discourse nginx container?
What happens is that it forwards to the HTTPS of the public domain name, which resolves to a Cloudflare IP, and that of course turns any such request away.

Now, that is beside the point, because this redirect shouldn't happen for the path http://yourwebsite.com/metrics if the request comes from an internal IP, and I was expecting the plugin to take care of that by adding an nginx conf with that rule, which apparently is not happening?

Can someone from the Discourse devs chime in? I don't want to ping people at random, and it feels weird that nobody has ever reported this issue before.

Edit: I rebuilt, this time also specifying a static hostname in the network configuration, because I noticed that on every rebuild a new random one was assigned to the container.
After that I also tried setting the Prometheus job to access the HTTPS version of the metrics, but the issue goes back to step one:

global:
  scrape_interval: 30s
  scrape_timeout: 10s

rule_files:

scrape_configs:
# other jobs
# [...]
  - job_name: discourse_exporter
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
        - 'discourse_app'
/prometheus # wget https://discourse_app/metrics
Connecting to discourse_app (172.20.0.2:443)
wget: note: TLS certificate validation not implemented
Connecting to [public URL] (104.26.4.193:443)
wget: server returned error: HTTP/1.1 404 Not Found

At this point this seems to be an issue with the plugin itself.

Sounds right. You need to access it using the host name, not the container name.

I am using the hostname. I wrote a lot, and late, so it might have been confusing, but I'm definitely using the internal network hostname.

This is not my area of expertise, but I’ve had a rummage in the posts that the topic timer ate to see if any could be relevant, and possibly found these? (My apologies if I’m way off :slight_smile: :pray: )

Getting Discourse to see the Prometheus server IP - #5 by ishan
Using Prometheous with Cloudflare

Thanks @JammyDodger but unfortunately those resources didn’t help.

They have similar issues, but slightly different ones, to the point that they don't apply in this case.
Just to be sure, I tried what one of those topics suggested (as well as @pfaffman) and played around with the DISCOURSE_PROMETHEUS_TRUSTED_IP_ALLOWLIST_REGEX env variable.

I tested:

  • Commenting it out
  • Added and with internal IP value
  • Added and with external IP value

Tried also changing the Prometheus scrape job to address the Discourse installation as:

  • direct internal IP
  • docker internal hostname
  • direct external IP
  • public domain name

In every case, I tried both http and https.

In all cases, I'm getting a 404.
What I would expect is the actual metrics response, as the request is coming from an internal IP.


What Jay meant here is that you need to use the configured hostname (DISCOURSE_HOSTNAME in your container .yml definition) as opposed to any hostname that happens to resolve to the correct IP.

This is deliberate, so that you can’t trivially reverse proxy a public instance from just anywhere, and so that only the configured hostname is accepted:

$ curl -I https://try.discourse.org/about.json
HTTP/2 200
server: nginx
date: Mon, 15 May 2023 16:25:05 GMT
content-type: application/json; charset=utf-8
[...]

# the following is equivalent to creating a DNS record at
# try.somebogusreverseproxy.com pointing to the same IP address as try.discourse.org,
# and then requesting https://try.somebogusreverseproxy.com/about.json
$ curl -H 'Host: try.somebogusreverseproxy.com' -I https://try.discourse.org/about.json
HTTP/2 404
cache-control: no-cache
content-length: 1427
content-type: text/html
cdck-proxy-id: app-router-tiehunter02.sea1
cdck-proxy-id: app-balancer-tieinterceptor1b.sea1

Conversely, if you try this:

curl -H 'Host: YOUR_CONFIGURED_HOSTNAME' -I https://discourse_app/metrics

it should work, but it’s a hack. The expectation is that you’ll set up DNS as required so that Discourse can be reached at its configured hostname transparently:

curl -I https://YOUR_CONFIGURED_HOSTNAME/metrics

How to do that depends greatly on your requirements, but the simplest option is to set up an alias in /etc/hosts from where your HTTP requests originate.
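If Prometheus runs from docker compose, one way to do that is extra_hosts, which writes the alias into the container's /etc/hosts, for example (the hostname and IP below are placeholders, and the app container's IP needs to stay stable across rebuilds for this to keep working):

services:
    prometheus:
        # ... rest of the service definition as above ...
        extra_hosts:
            # resolve the configured Discourse hostname to the app container's
            # IP on the shared docker network (placeholder values)
            - "YOUR_CONFIGURED_HOSTNAME:172.20.0.2"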


The prometheus exporter does not run on port 80 - it listens on its own port, 9405 by default.


Good find but if I try to target that specific port I get a “connection refused” message.

Get "http://discourse_app:9405/metrics": dial tcp 172.20.0.2:9405: connect: connection refused

Tested with a wget from inside the prometheus container as well just to be sure.

/prometheus # ping discourse_app
PING discourse_app (172.20.0.2): 56 data bytes
64 bytes from 172.20.0.2: seq=0 ttl=64 time=0.223 ms
64 bytes from 172.20.0.2: seq=1 ttl=64 time=0.270 ms
^C
--- discourse_app ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.223/0.246/0.270 ms
/prometheus # wget discourse_app:9405/metrics
Connecting to discourse_app:9405 (172.20.0.2:9405)
wget: can't connect to remote host (172.20.0.2): Connection refused

Yep, I tested with wget instead (the prometheus container is a barebones busybox), but got to the metrics nonetheless.

So what you are saying is that I should find a way to have the container running Prometheus have an entry in its /etc/hosts that resolves… I've lost you there, sorry :slight_smile:

What I did is add yet another container with simply an nginx in it, and provide a forward proxy configuration that adds the Host header to the requests it receives. It doesn't expose any port, so it can only be accessed from the internal virtual network anyway.

So how do things change?

Prometheus Job:

  - job_name: discourse_exporter_proxy
    scheme: http
    static_configs:
      - targets:
        - 'discourse_forward_proxy:8080'

docker-compose.yaml (just the part with the proxy)

version: "3"

services:
# [...]
    discourse_forward_proxy:
        image: nginx:latest
        container_name: discourse_forward_proxy
        restart: unless-stopped
        volumes:
            - ./discourse_forward_proxy/:/etc/nginx/conf.d
        networks:
            - prometheus-discourse_forward_proxy
            - discourse
# [...]

networks:
    prometheus-discourse_forward_proxy:
        internal: true
    discourse:
        external: true

In the directory in which your docker-compose.yaml is, have ./discourse_forward_proxy/discourse_forward_proxy.conf:

server {
    listen 8080;

    location /metrics {
      proxy_set_header Host "YOUR_DOMAIN_HERE.COM";
      proxy_pass https://discourse_app/metrics;
    }
}
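
Before pointing the scrape job at it, a quick busybox wget from the prometheus container (which shares the discourse network with the proxy) should already return the metrics text:

wget -O- http://discourse_forward_proxy:8080/metrics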

There you go:


Just for posterity, I have a repo in which I've set up all the necessary pieces.
There are some hardcoded values (like the FQDN of our website in the forward proxy conf file) that will need changing in case someone else wants to use it, but maybe it can be useful to someone out there.

It includes everything, from the docker compose to the nginx conf and the grafana provisioning for resources and dashboards.

That's due to these defaults (the webserver bind in particular):

GlobalSetting.add_default :prometheus_collector_port, 9405
GlobalSetting.add_default :prometheus_webserver_bind, "localhost"
GlobalSetting.add_default :prometheus_trusted_ip_allowlist_regex, ""

Binding to localhost means that it can only be connected to on the localhost IP, which is why connecting to 172.20.0.2 fails. This is a security measure to ensure it doesn’t accidentally get exposed to a much wider audience than intended.

If you set in the container definition file:

  DISCOURSE_PROMETHEUS_WEBSERVER_BIND: '*'

It’ll listen on all IP addresses and you’ll be able to connect to it from another container.
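
With that bind in place, a scrape job could also target the collector port directly rather than going through Discourse's nginx, along these lines (a sketch reusing the discourse_app alias and the default port 9405 mentioned earlier in the thread):

  - job_name: discourse_exporter
    metrics_path: /metrics
    static_configs:
      - targets:
        - 'discourse_app:9405'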

The reason why this made it work:

server {
    listen 8080;

    location /metrics {
      proxy_set_header Host "YOUR_DOMAIN_HERE.COM";
      proxy_pass https://discourse_app/metrics;
    }
}

is because the request now goes through Discourse's own nginx inside the app container, which talks to the prometheus collector over the localhost IP.

If you're not sure of the IPs or ports on which services are listening, you can use ss -ltp or netstat -ltp (inside the container! the necessary packages are iproute2 and net-tools respectively) to look at them. For instance, I just rebuilt a container with the prometheus plugin and see:

root@discourse-docker-app:/# ss -ltp
State      Recv-Q     Send-Q           Local Address:Port                 Peer Address:Port     Process                             
LISTEN     0          128                  127.0.0.1:3000                      0.0.0.0:*                                            
LISTEN     0          128                    0.0.0.0:postgresql                0.0.0.0:*                                            
LISTEN     0          128                    0.0.0.0:https                     0.0.0.0:*         users:(("nginx",pid=555,fd=7))     
LISTEN     0          128                  127.0.0.1:9405                      0.0.0.0:*                                            
LISTEN     0          128                    0.0.0.0:redis                     0.0.0.0:*                                            
LISTEN     0          128                    0.0.0.0:http                      0.0.0.0:*         users:(("nginx",pid=555,fd=6))     
LISTEN     0          128                       [::]:postgresql                   [::]:*                                            
LISTEN     0          128                       [::]:https                        [::]:*         users:(("nginx",pid=555,fd=8))     
LISTEN     0          128                       [::]:redis                        [::]:*

root@discourse-docker-app:/# curl http://172.17.0.2:9405/metrics
curl: (7) Failed to connect to 172.17.0.2 port 9405: Connection refused

root@discourse-docker-app:/# curl http://localhost:9405/metrics
# HELP discourse_collector_working Is the master process collector able to collect metrics
# TYPE discourse_collector_working gauge
discourse_collector_working 1


# HELP discourse_collector_rss total memory used by collector process
# TYPE discourse_collector_rss gauge
discourse_collector_rss 38178816
…

That's Docker's embedded DNS resolver, which listens on 127.0.0.11 inside containers attached to user-defined networks, rejecting the lookup for vmi1187507-app. Port 53 is DNS.
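
You can see where it comes from by looking inside the prometheus container:

# Docker's embedded DNS resolver (127.0.0.11) is what the container is configured to use
cat /etc/resolv.conf
# and the failing lookup can be retried by hand
nslookup vmi1187507-app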


This is great stuff Michael, thanks for taking the time to write it down.

I'll test it over the weekend, as I've already spent too much time on this during my work days this week :stuck_out_tongue:

During my attempts I tried adding the internal IP, from which the Prometheus container would appear to be requesting the metrics, to DISCOURSE_PROMETHEUS_TRUSTED_IP_ALLOWLIST_REGEX, but it didn't work.

You are suggesting DISCOURSE_PROMETHEUS_WEBSERVER_BIND. May I ask from where you got that? I am assuming that it’s another environment variable to add to the app.yml file, right?

How did it not work?

If it failed to connect, then the setting of the allowlist doesn’t matter since that operates after the L4 connection.

There is magic :magic_wand: in the Discourse code base where if you set DISCOURSE_SITE_OR_GLOBAL_SETTING_NAME in the ENV it’ll override it.

So setting that will override:

GlobalSetting.add_default :prometheus_webserver_bind, "localhost"
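
In app.yml terms, the convention is just the global setting name upcased with a DISCOURSE_ prefix, e.g. (the regex value here is only an example):

env:
  # overrides GlobalSetting prometheus_webserver_bind
  DISCOURSE_PROMETHEUS_WEBSERVER_BIND: '*'
  # overrides GlobalSetting prometheus_trusted_ip_allowlist_regex
  DISCOURSE_PROMETHEUS_TRUSTED_IP_ALLOWLIST_REGEX: '^172\.20\.'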
