1.2 beta users – please upgrade to beta 9 immediately due to critical memory leak


(Jeff Atwood) #1

Starting around Feb 2nd, 2015, the time of transition between 1.2 beta 6 and beta 7, we’ve been struggling with a severe memory leak in Discourse.

It took us a while to nail it down – almost two weeks, sorry about that – but we finally did. It was EventMachine 1.0.5 and 1.0.6:

Observed with 1.0.6: as soon as the ruby process is started it’ll leak 16KB chunks of memory pretty much every second. This happens without serving any Requests (load balancer in front was disabled).

This is quite nasty; 16KB per second is almost a megabyte per minute, or roughly 1.3 gigabytes per day. The good news is that we’ve improved our internal monitoring significantly, so in the future we’ll be able to tell sooner when someone introduces a serious memory leak.
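
As a rough check of those figures, here is a minimal Ruby sketch of the arithmetic. It assumes the quoted "16KB" means 16 * 1024 bytes and that the leak is a steady one chunk per second; the constant and method names are made up for illustration.

```ruby
# Leak rate quoted above: one 16KB chunk per second
# (assumption: 16KB is treated as 16 * 1024 bytes).
LEAK_BYTES_PER_SECOND = 16 * 1024

# Total bytes leaked after the given number of seconds at a constant rate.
def leaked_bytes(seconds)
  LEAK_BYTES_PER_SECOND * seconds
end

per_minute_kb = leaked_bytes(60) / 1024.0
per_day_gb = leaked_bytes(24 * 60 * 60) / (1024.0**3)

puts format("per minute: %.0f KB", per_minute_kb) # prints "per minute: 960 KB"
puts format("per day: %.2f GB", per_day_gb)       # prints "per day: 1.32 GB"
```

So an idle process would be expected to grow by well over a gigabyte every day, which is why small servers run out of memory quickly.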

As of beta 9, we have upgraded to a newer version of EventMachine which fixed the memory leak bug.

This means anyone who was on 1.2 beta 6, 7 or 8 should upgrade to beta 9 immediately. If you do not, expect to run into out of memory errors regularly until you do.


(Jeff Atwood) #2

(Sam Saffron) #3

Another update on this issue.

I have been working pretty relentlessly on this issue. On Friday I completed step #5 of my 7-step plan, which gives me great visibility into memory usage in our enterprise.

https://github.com/discourse/discourse_docker/commit/03b50438d73dbe6076a5a4179e336afaef2b28c2

I noticed that despite all efforts, memory was still climbing. It was even climbing on containers that are pretty much inactive at the moment (brand new customers).

Having this kind of information is a godsend; it allows one to test various theories.

I spent a bit of time thinking about the trend in the graph. It is constantly going up and is totally unrelated to traffic. This ruled out pg and redis as prime candidates (though clearly anything is possible), which left me looking at other C extensions.

Previous profiling had already excluded a managed leak: the number of objects in the Ruby heap was simply not growing. The number of large objects was not growing either.
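
To illustrate the distinction between a managed leak and a native one, here is a hedged Ruby sketch. It is not the profiling that was actually run; it assumes a Linux host (it reads `/proc/self/status`) and a Ruby version where `GC.stat(:heap_live_slots)` is available. The method names `rss_kb`, `memory_sample` and `native_leak_suspected?`, and the 5% tolerance, are hypothetical.

```ruby
# Resident set size of the current process in KB, read from /proc.
# Assumption: Linux only; returns nil when /proc is unavailable.
def rss_kb
  status = File.read("/proc/self/status")
  match = status.match(/^VmRSS:\s+(\d+)\s+kB/)
  match && match[1].to_i
rescue SystemCallError
  nil
end

# One sample: live slots in the Ruby heap plus the process RSS.
# GC.start runs first so garbage awaiting collection is not counted.
def memory_sample
  GC.start
  { live_slots: GC.stat(:heap_live_slots), rss_kb: rss_kb }
end

# Hypothetical heuristic: if RSS grew while the live object count stayed
# roughly flat, the growth is probably outside the Ruby heap
# (for example in a C extension).
def native_leak_suspected?(first, last, slot_tolerance: 0.05)
  return false if first[:rss_kb].nil? || last[:rss_kb].nil?

  slot_change = (last[:live_slots] - first[:live_slots]).abs
  slots_flat = slot_change <= first[:live_slots] * slot_tolerance
  slots_flat && last[:rss_kb] > first[:rss_kb]
end
```

Taking one `memory_sample` at startup and another some minutes later, then passing both to `native_leak_suspected?`, gives a crude version of the reasoning above. A real investigation would want many samples over hours, since RSS can also grow for harmless reasons such as heap fragmentation.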

So I thought about message_bus and the way it relies on EventMachine for 30 second polling and other bits and pieces.

I remembered I upgraded EventMachine recently.

https://github.com/discourse/discourse/commit/d1dd0d888a950d6121afdb764aeeaaa35757ede7#diff-e79a60dc6b85309ae70a6ea8261eaf95

Funny thing is that commit was all about limiting memory growth.

Anyway, it appears there is a memory leak in the EventMachine gem, that was recently merged in by @tmm1.

So, I went back to that container set and upgraded one of the containers in the set to the latest version of EventMachine last night, just before I went to sleep.

In the morning I could see this picture:

So I am very cautiously optimistic here. I applied the fix to our code:

https://github.com/discourse/discourse/commit/43375c8b15f95ac3eb4a797b6a99d20f354cc1e6

We are now deploying this to all our customers, then I will be watching memory for the next 24 hours.

If all looks good we will push a new beta tomorrow. If not, well, I will continue hunting this down.

EDIT: a few hours later, this is looking like the real deal across our entire infrastructure.


(Jan Philip Gehrcke) #4

I just saw this on my machine and panicked:

Checked top, saw ruby being responsible and blamed Discourse, figured that this must be a bug you are aware of, entered https://meta.discourse.org, saw this topic, and stopped panicking. Thanks for fixing.


(Rick Cogley) #5

Hi - when I try to upgrade from 1.2.0.beta8 to .beta9, in the “upgrade docker_manager” I get “Sorry, there was an error upgrading Discourse. Please check the logs below.” and see this in the in-line Terminal window:

$ cd /var/www/discourse/plugins/docker_manager && git fetch && git reset --hard HEAD@{upstream}

Forlornly clicking the pretty blue “Start Upgrading” gets me nothing but:

$ cd /var/www/discourse/plugins/docker_manager && git fetch && git reset --hard HEAD@{upstream}
error: cannot open .git/FETCH_HEAD: Permission denied

Was it something I said?


(Sam Saffron) #6

ssh in and then:

cd /var/discourse
git pull
./launcher rebuild app

(Rick Cogley) #7

@sam thanks very much. That worked like a charm. I’m back up and running. Kudos to Team Discourse for putting together an amazing product. The fact that you can do an upgrade this simply, with so many moving parts, is fantastic. If you watch the process as it upgrades, you can tell how much is going on.

Really, really nice work.


(Patrick Klug) #8

Screenshot from just now:

Should I see a 1.2.0.beta9?


(Jeff Atwood) #9

Might take some time for the version ping to catch up. In the meantime visit /admin/upgrade and perform the upgrade.


#10

I got “The page you requested doesn’t exist or is private” when visiting /admin/upgrade
I have logged in as admin.


(Jeff Atwood) #11

See this post above.


#13
discourse@discourse:/var/www/discourse$ git pull
Already up-to-date.
discourse@discourse:/var/www/discourse$ ./launcher rebuild app
-su: ./launcher: No such file or directory

discourse@discourse:/var/www/discourse$ git branch
* master
discourse@discourse:/var/www/discourse$ cat .git/FETCH_HEAD | grep master
eddbfc555ffceb81007e5c8862b84c424994bf2d		branch 'master' of https://github.com/discourse/discourse

(Jeff Atwood) #14

Looks like you do not have a Docker based install. We do not support anything but the Docker based install here.


(Theron Boerner) #15

How can we view those pretty charts on our installs?


(Dylan) #16

Those charts aren’t built in to Discourse. They are for monitoring your server, and there are many different ways to do that. The simplest way to explain it is that it’s like Windows Task Manager. What you use will depend on the type of server you’re running.


(Theron Boerner) #17

I’m just using the standard DigitalOcean install. @sam what are you using for the charts?


(Rodrigo Farcas) #18

Hey @codinghorror

I’ve upgraded via Docker and did not see any message on the console. After a while I refreshed the page, and it now says that I am up to date (I did not get any success message).

Am I OK, or should I check the integrity of the files somehow?

Thanks!


(Molly) #19

I clicked upgrade but just get a blank white page.


(Dylan) #20

Like @sam said:

SSH in, then type the following commands:

cd /var/discourse
git pull
./launcher rebuild app


(Molly) #21

Right right, did that. Just reporting it in case it’s helpful :wink: