Hello Community,
We are using the Discourse API on a self-hosted installation. Our use case is quite intensive in terms of concurrency, with around 100 API requests per second on average. We are also using PostgreSQL+Patroni+HAProxy to provide a high-availability database setup. Occasionally, our PostgreSQL locks and Patroni restarts the master node.
We have implemented a cron job to check for blocked transactions, and every time this issue occurs, we find the same kind of operations:
blocked_pid | blocked_user | blocking_pid | blocking_user | blocked_statement | current_statement_in_blocking_process
-------------+--------------+--------------+---------------+------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------
297904 | discourse | 293083 | discourse | UPDATE "api_keys" SET "last_used_at" = '2024-07-16 16:38:16.822352' WHERE "api_keys"."id" = 21 | UPDATE "api_keys" SET "last_used_at" = '2024-07-16 16:34:48.163449' WHERE "api_keys"."id" = 21
296718 | discourse | 293083 | discourse | UPDATE "api_keys" SET "last_used_at" = '2024-07-16 16:34:50.900480' WHERE "api_keys"."id" = 21 | UPDATE "api_keys" SET "last_used_at" = '2024-07-16 16:34:48.163449' WHERE "api_keys"."id" = 21
293101 | discourse | 293083 | discourse | UPDATE "api_keys" SET "last_used_at" = '2024-07-16 16:34:49.485074' WHERE "api_keys"."id" = 21 | UPDATE "api_keys" SET "last_used_at" = '2024-07-16 16:34:48.163449' WHERE "api_keys"."id" = 21
As you can see, many requests are trying to update the last usage time of the same API key.
For now, we have increased the number of API keys that our application is using to reduce the probability of collision. I noticed that there is code in place that checks if the API key was modified in the last minute to avoid an update. However, I assume that since we are using more than one pod to process the requests, this protection is not effective.
I am unsure if we should report this as a bug or if there are any parameters to avoid this kind of collision (either in Discourse or PostgreSQL). Note that for us, the information related to the last usage is relevant, but a resolution of 1 day would be sufficient.
I would also like to know what the preferred approach is to manage PostgreSQL HA with auto recovery.
Thanks