As of last night site is responding very poorly

Apologies in advance if this is the wrong category, location, etc.
I’ve had a Discourse site running for about 6 months on a DigitalOcean VPS without many issues. The admin page says I’m on version 2.5.0.beta4. As of last night, most of the site’s page content either refuses to load or takes a seemingly insane amount of time. For example, I can navigate to pages like the homepage or /admin, but the actual content for them (posts, the admin graphs, other tabs) won’t load. I’ve checked the system vitals: CPU usage idles around 2%, and there is minimal traffic or disk usage. There is a userbase of maybe 10 or so people, as I am just trying out and setting up the site. With that in mind, this behavior seems very odd.

The only plugins I have according to app.yml are docker_manager and discourse-signatures. I’m the only admin user so I can confirm changes haven’t been made to the site settings in quite a while as well.

My first thought was to restart the machine itself, and I’ve also tried to update manually using git pull and ./launcher rebuild app. I’m not sure what to look for during that process that would indicate errors, but the rebuild seems to complete and the site can be accessed again afterwards, yet it remains at 2.5.0.beta4. Similarly, trying to access the /admin/update page will eventually just time out. This all seems fairly strange because the site is arguably ‘functional’ - I simply don’t know enough about how it operates to really diagnose anything. I found and can run discourse-doctor, but I’m not sure what it accomplishes; it successfully emails me, etc.

The one thing that may indicate an issue: last night I got an email from the forum about a response to a post, but when I navigate to ‘latest posts’ (after it eventually loads) there is no indication that the post exists, because the thread overview in latest doesn’t list that user as the last poster. I can’t load the content of any posts, so there isn’t a way to check for sure. So there may be some error / mismatch in the database? I’m not sure how something like that would branch out into causing entire chunks of the site to fail to load, or if this is a rabbit hole worth chasing.

Any thoughts on where to start with troubleshooting for an issue like this? Thanks much if you took the time to read : )

1 Like

Hi tuckie! Welcome!

Looks like you are doing all the right things.

I highly recommend you update if you can - you’re pretty far behind the latest version. But be sure to download a backup first so you don’t lose anything.

Can you log in via ssh and see if you are running out of storage?

df -h 

Whatever the case, storage is a good first thing to check, and this command is a good one to run to remove any stale containers that are taking up space:

./launcher cleanup app 

Then I’d try rebuilding the app to the latest version. Let us know if it works this time and doesn’t display any errors in the console.

./launcher rebuild app
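
If it helps, the storage check can also be scripted so it is easy to repeat before each rebuild. This is only a minimal sketch: the `disk_use_percent` helper and the 90% `THRESHOLD` are made up for illustration and are not Discourse requirements.

```shell
#!/usr/bin/env bash
# Print the usage percentage (digits only) of the filesystem holding a path.
# `df -P` forces the POSIX one-line-per-filesystem format, so field 5 is
# the capacity percentage.
disk_use_percent() {
  df -P "${1:-/}" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# THRESHOLD is an arbitrary illustrative value (assumption).
THRESHOLD=90
used=$(disk_use_percent /)
if [ "$used" -ge "$THRESHOLD" ]; then
  echo "/ is ${used}% full: free up space before rebuilding"
else
  echo "/ is ${used}% full: disk space is probably not the problem"
fi
```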
1 Like

Thanks for the quick uptake.
It’s reading about 7.9 GB free on /dev/vda1, which is mounted on / - I’m not very familiar with how the other partitions are used on Ubuntu or how they might affect things (Discourse is in a container, no?); the rest look to be the boot partition, etc. There are only about 30-40 posts total on the forum as I test it, so it’s (seemingly) not in danger there. The cleanup was able to free up ~4GB extra.

As for the app rebuild, I’ve actually run this a few times. I don’t see any glaring warning messages during the process, but at the same time, when it’s done I don’t see anything saying ‘success’ either - I wouldn’t know what error / warning lines to look for. It removes the old container, then runs the docker container, and then it’s done. I’ve just run it one more time, and when I connect to the site it still tells me that updates are available, but it takes an incredibly long time to report the version I’m on (2.5.0.beta4 still) and the version to update to.

Part of the problem is that I can’t really use the admin tools either, because of slow response times or failures to load. For example, navigating to the backups tab just displays the loading animation indefinitely. Out of interest I opened the browser console on the backups tab, and the browser appears to try to fetch JavaScript files and fail on all of them, slowly, one at a time.

If there’s a way to work with backups through ssh that seems like it would be useful here.

1 Like

It sounds like a network problem. Are you using Cloudflare? (If so, turn off the orange cloud.)

You could have a noisy neighbor at DigitalOcean, so you might open a ticket with them.

It doesn’t make any sense that you say you’ve done a rebuild but the version hasn’t changed. I’d think that you’d need to do the postgres 12 upgrade. Did you not see anything about that when you did the rebuild?

2 Likes

I am on DigitalOcean, and I suppose something like that could be happening, though I’m not sure it would cause this problem as consistently or for as long as this. I think a better way to describe the issue is that the page is typically able to load the templating or ‘shell’ of the page, but beyond that, fetching any actual content for the pages seems to keep loading forever.

As for the rebuild/version change - it could be that an error like that is happening, but I don’t know a good way to go about parsing the output, nor do I really know what I’d be looking for. I did see a line along the lines of ‘postgres installed’ in the output scrolling by as I ran rebuild again just now. I’m not sure if this is because of the work going on inside of a container or not, but for example ./launcher rebuild app | grep 'postgres' doesn’t seem to filter anything out, nor does ./launcher rebuild app > output.txt && grep 'postgres' output.txt. The output.txt does contain information, but seemingly not everything? At the very least, it doesn’t end in the same way as the actual console output.
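
One possible explanation for the partial output.txt: `|` and `>` only capture standard output, so anything a program writes to standard error still goes straight to the terminal. Whether ./launcher actually logs that way is an assumption here, but it would match the behavior described. A minimal sketch of the difference, using a stand-in function `noisy_build` (hypothetical, not part of Discourse) in place of the real rebuild:

```shell
#!/usr/bin/env bash
# Stand-in for a command that logs to both streams (hypothetical; the real
# command would be ./launcher rebuild app).
noisy_build() {
  echo "stdout: container started"
  echo "stderr: postgres already installed" >&2
}

# A plain pipe carries only stdout, so grep never sees the stderr line
# (stderr is discarded here just to keep the demo output quiet).
stdout_only=$(noisy_build 2>/dev/null | grep -c 'postgres' || true)

# 2>&1 merges stderr into stdout before the pipe, so grep sees everything.
both_streams=$(noisy_build 2>&1 | grep -c 'postgres' || true)

echo "matches without 2>&1: $stdout_only"
echo "matches with 2>&1:    $both_streams"
```

Applied to the real rebuild, the same idea would look like `./launcher rebuild app 2>&1 | tee rebuild.log`, after which rebuild.log can be searched with grep.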

1 Like

Hello, hoping I’m not going against any rules about bumps, etc., but I’d still like some help with this. Sometime over the last week the situation seems to have gotten worse? I can’t say for sure when this happened, as I wasn’t working on this over the holiday last week, but I cannot connect to my site at all now. I can still ping the IP successfully, and the same IP directs to the right domain, so it seems like it isn’t a nameserver issue either.

Accessing the site from Firefox now produces:

The site at https://aregames.art/ has experienced a network protocol violation that cannot be repaired.

The page you are trying to view cannot be shown because an error in the data transmission was detected.

I’m not able to find much useful information from the browser inspector, because it doesn’t seem like there is a reply to the GET request.

Since discovering this new issue I’ve:

  • run the rebuild several times over
  • updated Ubuntu to 20.04
  • rebuilt again

The site itself was really only used for testing the platform for about a month, and I’m willing to accept it probably wasn’t a great idea to not have kept the software up to date. I’m willing to go about reinstalling discourse from scratch, too. It would be nice of course to find some way to fix this with preserving the site configuration, users, and posts, but the only thing I really need to hold onto is some of the custom CSS I wrote in the theme editor. If there’s somewhere that is stored that I can copy back to a new setup, that would be helpful. I (irresponsibly) don’t have an up-to-date version of it stored locally anywhere…

And again on the rebuild process, I still don’t know exactly how to parse the output for any issues. As far as I can tell, it runs and completes without prompting for any input, and the last lines after it’s done have to do with starting the docker container with the configuration from the yaml. I understand there is a difference between the rebuild completing and completing successfully, but I am not sure what I’d be looking for or how to diagnose whether something is going wrong during this.

1 Like

Is the server up? Can you ssh into it?

If you can, reboot the server and then rebuild discourse.

If it is not up after all that, paste the rebuild output here and we can help you.

1 Like

Yes, I can ssh in properly and that’s how I’ve run the rebuild each time. And no, the site is still unreachable after a rebuild. I do see (even after a rebuild) that ifconfig shows the docker container with an IP, different from the server IP, that I can’t reach from my system’s web browser. I’m not sure if that’s intended or not. ./launcher rebuild app > output.txt only seems to capture a portion of the actual console output, but I can include that too.

Ubuntu Pastebin (short output file)
Ubuntu Pastebin (full output pasted from term)
I see a few error messages from postgres saying that the ‘discourse’ database already exists. Is this worth looking into?

1 Like

Is your DNS right?

host aregames.art 
aregames.art has address 198.54.117.200
aregames.art has address 198.54.117.199
aregames.art has address 198.54.117.198
aregames.art has address 198.54.117.197

Why so many IPs? What is your server IP?
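
The comparison between the DNS answer and the server IP can be scripted roughly like this. It is a sketch under stated assumptions: `resolves_to` is a hypothetical helper, and 203.0.113.10 is a placeholder from the documentation address range, not the real droplet IP.

```shell
#!/usr/bin/env bash
# Succeed if the expected IP appears among the A records in `host` output.
# $1 = expected server IP, stdin = output of `host <domain>`.
resolves_to() {
  awk -v ip="$1" '/ has address / && $NF == ip { found = 1 } END { exit !found }'
}

# Two lines from the output quoted above; 203.0.113.10 is a placeholder
# server IP (assumption), so this sample is expected not to match.
sample='aregames.art has address 198.54.117.200
aregames.art has address 198.54.117.199'

if printf '%s\n' "$sample" | resolves_to 203.0.113.10; then
  echo "DNS points at the server"
else
  echo "DNS does not point at the server"
fi
```

Against live DNS, the same check would be `host aregames.art | resolves_to "$SERVER_IP"`, with `SERVER_IP` set to the address shown in the hosting control panel.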

5 Likes

Wow, this was actually pretty illuminating - I had actually let my domain name expire, and coincidentally that happened the day I started running into these issues… I was planning to switch providers, so I turned off automatic payments there and the date slipped by, I guess. So it looks like those IPs are related to some sort of parking service for the domain. I’ve just renewed it now, so maybe it will reapply the right records again - not sure how long that usually takes, host is still reporting those IPs. According to the docs I shouldn’t be able to connect via the IP directly so I won’t be able to test if this has worked for a bit I suppose. Thanks for pointing this out.

That being said, I’m still a little confused about the issues I was running into initially - would I have been accessing a cached version of the page, and because of the name server issues the requests for content weren’t going through? Some things, like the posts in a thread or the list of posts when you open ‘latest posts’, would eventually load in, just after a long time.

Update: host aregames.art, as you mention above, once again seems to resolve to the right IP and mail server. I was able to confirm with the discourse-setup script that it accepts the DNS as pointing to the IP. It looks like the setup script also ran the rebuild. However, navigating to the URL produces a ‘server not found’ error. Accessing the IP directly on port 443 does produce an nginx 400 Bad Request, which sort of seems like progress.

Edit again: I had to clear my browser cache - the site loaded completely fine from an incognito tab. Things are looking functional again! I guess… paying for my website was the solution to fixing the site here.

4 Likes

Yes, you were using the cached view.

We added a new feature in Discourse 2.6 that adds a specific CSS class to the document when you are on this view, but we don’t have a default UI element for it yet.

You can read more about it on Offline Indicator

4 Likes
