Memory usage gradually increases after restart

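jrivettcsa · September 5, 2024, 14:37

At some point in the past few weeks the site started to feel sluggish. I don't know exactly when it started, but it may have been after a Discourse update. We are running 3.4.0.beta2-dev.

I noticed that the server instance had almost no free memory, so I restarted it. After Discourse started, memory usage was fine at first (about 1.2 GB), but it began to grow slowly, and it looks like it will soon reach the point of sluggishness again.

The site is not particularly busy (20 to 30 visitors a day), and it ran fine for years until recently.

The server instance has 2 GB of memory, which should be enough according to the requirements I have seen (1 GB minimum; 2 GB recommended).

This looks a lot like a memory leak to me. Of course, if there is a leak, it might not be in Discourse but in Docker or something else. The instance is used only for Discourse.

Any ideas? Is there a way to verify that it is a leak and to find the leaking process?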

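Ed_S · September 5, 2024, 14:49

The amount of “free” memory is a very fuzzy concept. The only reliable sign of a memory shortage is paging activity.

free
or
free -h
will give you a snapshot.

vmstat 5 5
is very useful for seeing how things are running, including paging activity.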

jrivettcsa · September 5, 2024, 17:03

              total        used        free
Mem:          1.9Gi       1.5Gi        73Mi
Swap:         2.0Gi        54Mi       1.9Gi

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  55524 111624  20080 385060    1    3    68    52  965  349  4  2 93  1  0
 0  0  55524 114884  20088 385152    0    0    13     8 1047  352  2  1 96  0  0
 0  0  55524 112428  20088 385160    0    0     0     3  831  319  3  1 95  0  0
 0  0  55524 111616  20096 385164    0    0     0    51  688  278  2  0 97  0  0
 0  0  55524 109884  20104 385168    0    0     0     8 1117  281  2  1 96  0  1

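Does anything above look problematic? The memory usage figures I get from htop seem to match those from free.

My main concern is the way memory usage keeps growing. I would expect it to reach some level and then hover around it, rising and falling with site usage. The steady upward trend is worrying.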

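Ed_S · September 5, 2024, 17:11

Everything looks fine so far: there is no activity in si and so (paging), and very little disk traffic (bi and bo).

Linux uses free memory for disk caching, so seeing free memory drop is not a bad thing. The full output of the free command includes an available column, which the man page describes as:

available
Estimation of how much memory is available for starting new applications, without swapping.

For vmstat, the buff and cache columns are memory used for disk caching, which can grow to improve I/O performance but shrinks under memory pressure. So for both free and vmstat, the ‘free’ figure is a pessimistic measure.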
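
The check described above can be sketched in a few lines of Python, assuming the output of vmstat 5 5 has been captured as text. The names parse_vmstat and is_paging are made up for this illustration, and the sample is the output posted earlier in this thread. The first row vmstat prints reports averages since boot, so the sketch ignores it when later rows exist.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  55524 111624  20080 385060    1    3    68    52  965  349  4  2 93  1  0
 0  0  55524 114884  20088 385152    0    0    13     8 1047  352  2  1 96  0  0
 0  0  55524 112428  20088 385160    0    0     0     3  831  319  3  1 95  0  0
 0  0  55524 111616  20096 385164    0    0     0    51  688  278  2  0 97  0  0
 0  0  55524 109884  20104 385168    0    0     0     8 1117  281  2  1 96  0  1
"""


def parse_vmstat(text):
    """Turn captured vmstat output into a list of dicts, one per sample row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The column-name row is the one that begins with "r  b".
    header_idx = next(i for i, ln in enumerate(lines) if ln.split()[:2] == ["r", "b"])
    names = lines[header_idx].split()
    rows = []
    for ln in lines[header_idx + 1:]:
        fields = ln.split()
        # Keep only fully numeric rows with the expected number of columns.
        if len(fields) == len(names) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(names, map(int, fields))))
    return rows


def is_paging(rows, threshold=0):
    """Report whether any recent sample shows swap-in (si) or swap-out (so)."""
    # Skip the first row (averages since boot) when there are later samples.
    recent = rows[1:] if len(rows) > 1 else rows
    return any(r["si"] > threshold or r["so"] > threshold for r in recent)


rows = parse_vmstat(SAMPLE)
print(is_paging(rows))  # prints False for the sample above
```

The threshold of 0 is an arbitrary starting point for illustration; occasional small si/so values are not necessarily a problem.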

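jrivettcsa · September 5, 2024, 17:36

OK, thanks. Perhaps the sluggishness has nothing to do with the apparent memory shortage. I will keep watching.

Ed_S · September 5, 2024, 18:16

It is still possible that something is gradually getting bigger.

This is one of the tactics I use to see what is going on: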

# ps aux|sort -n +5|tail
systemd+    1659  0.0  1.3 904384 54588 ?        S    16:44   0:00 /usr/lib/postgresql/13/bin/postmaster -D /etc/postgresql/13/main
root         830  0.0  1.6 2253324 65208 ?       Ssl  16:44   0:01 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
systemd+    1682  0.0  1.9 904516 78092 ?        Ss   16:44   0:01 postgres: 13/main: checkpointer 
systemd+    18757  0.1  2.1 912368 85644 ?        Ss   18:06   0:00 postgres: 13/main: discourse discourse [local] idle
1000        1688  0.1  6.5 1006548 256428 ?      Sl   16:44   0:10 unicorn master -E production -c config/unicorn.conf.rb
1000        2189  0.1  8.5 5657760 333248 ?      Sl   16:45   0:06 unicorn worker[3] -E production -c config/unicorn.conf.rb
1000        2113  0.1  8.5 5656608 334352 ?      Sl   16:45   0:07 unicorn worker[2] -E production -c config/unicorn.conf.rb
1000        2044  0.4  8.7 6052196 342380 ?      Sl   16:44   0:23 unicorn worker[1] -E production -c config/unicorn.conf.rb
1000        2006  1.7  9.0 5628640 352492 ?      Sl   16:44   1:33 unicorn worker[0] -E production -c config/unicorn.conf.rb
1000        1971  3.1 11.1 6033652 435388 ?      SNl  16:44   2:54 sidekiq 6.5.12 discourse [0 of 5 busy]

(or ps auxc)
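
The pipeline above orders processes by the sixth column of ps aux, which is RSS (resident set size, in KiB). For anyone who wants to log the same ranking from a script, here is a rough Python equivalent that works on captured ps aux text. The function name top_by_rss is invented for this sketch, and the sample lines are copied from the listing above.

```python
SAMPLE = """\
root         830  0.0  1.6 2253324 65208 ?       Ssl  16:44   0:01 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
systemd+    18757  0.1  2.1 912368 85644 ?        Ss   18:06   0:00 postgres: 13/main: discourse discourse [local] idle
1000        1688  0.1  6.5 1006548 256428 ?      Sl   16:44   0:10 unicorn master -E production -c config/unicorn.conf.rb
1000        2006  1.7  9.0 5628640 352492 ?      Sl   16:44   1:33 unicorn worker[0] -E production -c config/unicorn.conf.rb
1000        1971  3.1 11.1 6033652 435388 ?      SNl  16:44   2:54 sidekiq 6.5.12 discourse [0 of 5 busy]
"""


def top_by_rss(ps_text, n=3):
    """Return the n largest processes as (rss_kib, command) tuples."""
    procs = []
    for line in ps_text.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        # Skip blank lines and the header row, whose RSS field is not a number.
        if len(fields) < 11 or not fields[5].isdigit():
            continue
        procs.append((int(fields[5]), fields[10].strip()))
    procs.sort(key=lambda p: p[0], reverse=True)
    return procs[:n]


for rss_kib, command in top_by_rss(SAMPLE, n=2):
    print(f"{rss_kib / 1024:.0f} MiB  {command}")
```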

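Ed_S · September 6, 2024, 11:28

If you can easily monitor CPU and (disk) I/O activity, I would suggest watching those rather than memory usage. I/O especially. If CPU usage is low, I/O usage is high, and the forum is slow, that may indicate a severe shortage of RAM.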
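
As a rough illustration of that rule of thumb, the sketch below flags a vmstat sample where the CPU is mostly not busy but spends a lot of time waiting on I/O. The function name looks_io_bound and both thresholds are assumptions picked for the example, not recommended values; the "quiet" sample uses figures from the vmstat output posted earlier in this thread, and the "starved" sample is hypothetical.

```python
def looks_io_bound(sample, max_busy_cpu=20, min_iowait=20):
    """Flag a vmstat sample with low CPU use but high I/O wait.

    `sample` holds vmstat's CPU percentages: us (user), sy (system), wa (I/O wait).
    The thresholds are placeholders to tune, not established limits.
    """
    busy = sample["us"] + sample["sy"]
    return busy <= max_busy_cpu and sample["wa"] >= min_iowait


# Second row of the vmstat output posted earlier in this thread.
quiet = {"us": 2, "sy": 1, "id": 96, "wa": 0, "bi": 13, "bo": 8}

# Hypothetical sample from a machine that is short of RAM and thrashing the disk.
starved = {"us": 3, "sy": 2, "id": 35, "wa": 60, "bi": 4200, "bo": 3100}

print(looks_io_bound(quiet), looks_io_bound(starved))
```

A flag from this check only means the machine is waiting on the disk; it still takes the si/so columns, or a slow forum, to tie that back to memory.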

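Ed_S · September 10, 2024, 14:51

A site can run slowly for several reasons. Besides bugs, these can include gradual growth in the number of users, user activity, and database size, as well as Discourse itself getting bigger as it evolves, adds features, and updates its software components.

But it is worth keeping a close eye on responsiveness and on whether the current machine's configuration is adequate.

(Incidentally, I notice that Hetzner's cheapest machine now has 4G of RAM, for the same price as the no-longer-available cheapest machine with only 2G. One of my sites is still running on the old 2G configuration.)

For the record, since I have been tracking usage on my main site, and it was recently migrated so the server is new and freshly rebooted, I will include some findings. There is quite a lot of data, so feel free not to study it!

The current state of the machine is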

# uptime
 13:55:23 up 4 days, 21:10,  1 user,  load average: 0.07, 0.08, 0.02
# free
               total        used        free      shared  buff/cache   available
Mem:         3905344     1638012       98492      481864     2168840     1595004
Swap:        4194288      252928     3941360

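I notice that at login the machine reports
Memory usage: 45%
which is closest to the “used” column, not the “free” column.

I have been taking periodic readings from the following commands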

   date
   uptime
   free
   ps aux|sort -n +5|tail
   vmstat 5 5

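I see “free” memory being replaced by “buffer” and “cache” memory, while the RAM footprint (RSS) of the processes does not grow. I think this illustrates why tracking “free” memory is not ideal, even though some hosting providers make it easy to do. I think it also shows that, in this case, there is no memory leak.

Shortly after rebooting, I see this: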
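
To put numbers on that, here is a small Python sketch that parses two free snapshots and prints how each column changed. The helper name parse_free is invented for this example. Both snapshots are taken from this post: the one from shortly after the reboot and the one at 4 days 21 hours of uptime.

```python
AFTER_REBOOT = """\
               total        used        free      shared  buff/cache   available
Mem:         3905344     1560508      996400      179712     1348436     1974692
Swap:        4194288           0     4194288
"""

FOUR_DAYS_LATER = """\
               total        used        free      shared  buff/cache   available
Mem:         3905344     1638012       98492      481864     2168840     1595004
Swap:        4194288      252928     3941360
"""


def parse_free(text):
    """Return the Mem: row of `free` output as a dict keyed by column name (KiB)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = lines[0].split()
    for ln in lines[1:]:
        label, *values = ln.split()
        if label == "Mem:":
            return dict(zip(names, map(int, values)))
    raise ValueError("no Mem: row found")


before = parse_free(AFTER_REBOOT)
after = parse_free(FOUR_DAYS_LATER)
delta = {name: after[name] - before[name] for name in before}

for name in ("used", "free", "buff/cache", "available"):
    print(f"{name:>10}: {delta[name] / 1024:+8.0f} MiB")
```

In these two snapshots most of the drop in “free” is matched by growth in “buff/cache”, while “used” moves comparatively little.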

# free
               total        used        free      shared  buff/cache   available
Mem:         3905344     1560508      996400      179712     1348436     1974692
Swap:        4194288           0     4194288

A little later

# ps aux|sort -n +5|tail
...
1000        1688  0.1  6.5 1006548 256428 ?      Sl   16:44   0:10 unicorn master -E production -c config/unicorn.conf.rb
1000        2189  0.1  8.5 5657760 333248 ?      Sl   16:45   0:06 unicorn worker[3] -E production -c config/unicorn.conf.rb
1000        2113  0.1  8.5 5656608 334352 ?      Sl   16:45   0:07 unicorn worker[2] -E production -c config/unicorn.conf.rb
1000        2044  0.4  8.7 6052196 342380 ?      Sl   16:44   0:23 unicorn worker[1] -E production -c config/unicorn.conf.rb
1000        2006  1.7  9.0 5628640 352492 ?      Sl   16:44   1:33 unicorn worker[0] -E production -c config/unicorn.conf.rb
1000        1971  3.1 11.1 6033652 435388 ?      SNl  16:44   2:54 sidekiq 6.5.12 discourse [0 of 5 busy]
# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 0  0      0 866112 314288 1083816    0    0    32    28  484  621  4  1 95  0  0

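You can see that sidekiq (435 MByte) and the unicorn workers (330-350 MByte each) are the largest processes.

Over time, free RAM and sidekiq's RAM (RSS) usage decrease, possibly because it is being paged out, but with no ill effect: the machine shows no paging activity. I think this is to make more room for buffer and cache.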

# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 0  0      0 679764 326988 1190840    0    0     0    11  285  396  1  1 98  0  0

About 14 hours later:

# uptime
 10:12:06 up 17:27,  1 user,  load average: 0.04, 0.02, 0.00
# ps aux|sort -n +5|tail
...
1000        2006  1.2  9.6 5647908 377424 ?      Sl   Sep05  12:42 unicorn worker[0] -E production -c config/unicorn.conf.rb
1000        1971  1.8 11.3 6431988 444184 ?      SNl  Sep05  18:51 sidekiq 6.5.12 discourse [0 of 5 busy]
# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 0  0   2048 199972 342480 1576156    0    0     0    17  361  511  2  2 96  0  0

Later…

# uptime
 19:52:00 up 1 day,  3:07,  1 user,  load average: 0.02, 0.06, 0.01
# ps aux|sort -n +5|tail
...
1000        2006  1.2  9.8 5654308 382944 ?      Sl   Sep05  20:44 unicorn worker[0] -E production -c config/unicorn.conf.rb
1000        1971  1.5 11.1 6431668 436340 ?      SNl  Sep05  25:04 sidekiq 6.5.12 discourse [0 of 5 busy]
# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 0  0   2304 103356 301632 1690136    0    0     0    10  360  511  1  1 98  0  0

Later…

# uptime
 12:13:09 up 1 day, 19:28,  2 users,  load average: 0.05, 0.06, 0.01
# ps aux|sort -n +5|tail
...
1000        2006  1.2  9.1 5654820 358612 ?      Sl   Sep05  31:47 unicorn worker[0] -E production -c config/unicorn.conf.rb
1000        1971  1.3 10.0 6431668 393584 ?      SNl  Sep05  35:08 sidekiq 6.5.12 discourse [0 of 5 busy]
# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 0  0 284416 281596  77904 1908528    0    0     0    38  315  450  1  1 98  0  0

Later

# uptime
 13:26:42 up 2 days, 20:42,  1 user,  load average: 0.20, 0.06, 0.02
# ps aux|sort -n +5|tail
...
1000        2006  1.2  9.3 5789072 365720 ?      Sl   Sep05  51:54 unicorn worker[0] -E production -c config/unicorn.conf.rb
1000        1971  1.2 10.0 6433332 393472 ?      SNl  Sep05  50:44 sidekiq 6.5.12 discourse [0 of 5 busy]
# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 0  0 242944  82016  95188 2082180    0    0     0   131  332  488  1  1 98  0  0

Later

# uptime
 09:21:33 up 3 days, 16:36,  1 user,  load average: 0.13, 0.10, 0.03
# free
               total        used        free      shared  buff/cache   available
Mem:         3905344     1618936      323032      476664     1963376     1619208
Swap:        4194288      250112     3944176
# ps aux|sort -n +5|tail
...
1000        2006  1.2  9.3 5789200 363572 ?      Sl   Sep05  67:02 unicorn worker[0] -E production -c config/unicorn.conf.rb
1000        1971  1.1  9.6 6433652 377472 ?      SNl  Sep05  63:14 sidekiq 6.5.12 discourse [0 of 5 busy]
# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 0  0 250112 321888  56052 1906672    0    0     2    13  293  420  1  0 99  0  0

Later

# uptime
 13:55:23 up 4 days, 21:10,  1 user,  load average: 0.07, 0.08, 0.02
# free
               total        used        free      shared  buff/cache   available
Mem:         3905344     1638012       98492      481864     2168840     1595004
Swap:        4194288      252928     3941360
# ps aux|sort -n +5|tail
...
1000        1971  1.1  9.5 6434676 371648 ?      SNl  Sep05  80:49 sidekiq 6.5.12 discourse [0 of 5 busy]
1000        2006  1.2  9.5 5658468 373404 ?      Sl   Sep05  88:44 unicorn worker[0] -E production -c config/unicorn.conf.rb
# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
...
 1  0 252928 101040  86736 2082372    0    0     0    10  333  482  1  0 99  0  0
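
One way to summarise the readings above is to fit a straight line to each process's RSS against uptime. The Python sketch below uses an ordinary least-squares slope; the helper name slope is invented here, and the data points are the six readings in this post that have a stated uptime. A process that leaks would be expected to show a sustained positive slope over several days, though a few readings like these are only a rough indication either way.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Uptime at each reading as (days, hours, minutes), from the uptime output above.
uptimes = [(0, 17, 27), (1, 3, 7), (1, 19, 28), (2, 20, 42), (3, 16, 36), (4, 21, 10)]
hours = [d * 24 + h + m / 60 for d, h, m in uptimes]

# RSS in KiB at the same readings, from the ps output above.
sidekiq_rss = [444184, 436340, 393584, 393472, 377472, 371648]
unicorn_rss = [377424, 382944, 358612, 365720, 363572, 373404]

for name, rss in (("sidekiq", sidekiq_rss), ("unicorn worker[0]", unicorn_rss)):
    print(f"{name}: {slope(hours, rss):+.0f} KiB/hour, "
          f"net change {rss[-1] - rss[0]:+d} KiB")
```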

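jrivettcsa · September 13, 2024, 12:53

Thank you for sharing your observations. What I am seeing is much the same as what you see, except that we are on a 2 GB instance, so there is less headroom. Also, thanks for pointing out that some measures of “free” and “used” memory are not necessarily useful.

I last restarted the instance a few days ago, when initial memory usage was 1.23 GB. Since then memory usage has gradually increased and is now at 1.8 GB. For now the site is still quite responsive.

The site does not actually have many users, and there has been no recent increase in sign-ups or activity. Over the past month there were about 20 new topics, about 100 posts, and about 4 daily active users.

I will keep monitoring, and if the instance hits its memory limit again, or the site becomes sluggish again, or both, I will post here.

jrivettcsa · July 6, 2025, 13:19

Because of the memory problems, upgrading Discourse was becoming more and more of a hassle, so we eventually upgraded the VM from 2 GB to 4 GB. Memory usage has been stable since then.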
