As I’d recently looked into the health of the system which runs this forum, and have from time to time contributed to support threads about disk use or memory use, I thought I’d run through some of my commonly-used commands.
I usually work from a console, and ssh into a remote machine, so I like everything to be textual and static.
When first arriving on a machine, I’ll use
uptime
to check the CPU load and time since reboot. CPU load is presented as three numbers, being the short-, medium- and long-term (1, 5 and 15 minute) averages of the number of processes ready to use CPU time. I’m happy if that’s no larger than the number of CPUs (which I find with egrep Hz /proc/cpuinfo
) or preferably rather less.
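As a rough sketch, that comparison can be scripted. This assumes a Linux machine with /proc/loadavg and the coreutils nproc command; load_ok is a name made up for illustration, not a standard tool.

```shell
# load_ok AVG CPUS: succeed if the load average is no larger than the CPU count.
# (load_ok is a hypothetical helper, not a standard command.)
load_ok() {
    awk -v avg="$1" -v cpus="$2" 'BEGIN { exit !(avg <= cpus) }'
}

# Possible use on a Linux machine, checking the 1-minute figure:
#   read -r one five fifteen rest < /proc/loadavg
#   load_ok "$one" "$(nproc)" && echo "load is fine" || echo "load is high"
```

The arithmetic is left to awk because the load figures are decimal and plain sh only compares integers.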
top
is commonly used to get an overview of activity. It’s an endlessly refreshing display so I tend to use
top -bin2
instead. I’d usually only use this to find out which processes are using CPU, or possibly which processes are using memory.
free
or
free -h
gives a snapshot of memory use. I tend to look only at how much swap space is in use: preferably rather less than 10%.
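For a single number rather than a table, here is a minimal sketch which works out the percentage of swap in use. It assumes Linux, reads the SwapTotal and SwapFree fields from /proc/meminfo rather than parsing free itself, and swap_used_pct is a made-up name.

```shell
# swap_used_pct: read /proc/meminfo-style text on stdin and print the
# swap in use as a whole percentage. (Hypothetical helper.)
swap_used_pct() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END {
            if (total > 0) printf "%d\n", (total - free) * 100 / total
            else print 0   # no swap configured at all
        }
    '
}

# Possible use:
#   swap_used_pct < /proc/meminfo
```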
vmstat 5 5
gives a series of snapshots of machine activity. I use it mostly for the columns si
and so
which indicate active paging in and out of swap. I like those to be zero. Also the bi
and bo
columns indicate disk activity. Ideally one would have a record of what’s normal on any given machine, in order to know what’s abnormal. I used to run cron jobs on a 10-minute schedule to collect these sorts of textual snapshots, to be looked at only when there’s a problem. I’d rotate those logs daily and keep a month’s worth (by the simple expedient of using the day number in the filename.)
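The snapshot idea might look something like this. It is a sketch rather than the script I actually ran: the function names, the log directory and the choice of commands are all assumptions for illustration. A file left over from the same day last month is emptied before it is reused, which is what keeps the collection to about a month.

```shell
# snapshot_path DIR DAY: name of the log file for a given day of the month.
# (Both functions here are hypothetical helpers.)
snapshot_path() {
    printf '%s/snapshot-%s.log\n' "$1" "$2"
}

# take_snapshot DIR: append one textual snapshot to today's file.
take_snapshot() {
    dir=$1
    day=$(date +%d)
    file=$(snapshot_path "$dir" "$day")
    mkdir -p "$dir"
    # A file not modified for more than a day must be last month's: start afresh.
    if [ -f "$file" ] && [ -n "$(find "$file" -mtime +1)" ]; then
        : > "$file"
    fi
    {
        date
        uptime
        free
        vmstat 5 5
        df -h /
    } >> "$file" 2>&1
}

# Possible crontab entry (both paths are made up):
#   */10 * * * * . /usr/local/lib/snapshot.sh && take_snapshot /var/log/snapshots
```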
df -h /
shows the amount of free disk space. Discourse, our forum software, insists on 5G free before embarking on an upgrade, which it prompts for approximately monthly.
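To check that threshold ahead of time, here is a small sketch. It uses the POSIX output format of df (df -Pk) so that the columns are predictable; free_gib is a made-up name, and treating Discourse’s 5G as 5 GiB is my assumption.

```shell
# free_gib: read `df -Pk` output on stdin and print the whole GiB available
# on the first filesystem listed. (Hypothetical helper.)
free_gib() {
    awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Possible use:
#   [ "$(df -Pk / | free_gib)" -ge 5 ] || echo "less than 5G free on /"
```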
netstat -l
is one view of the network sockets which are listening for connections. I prefer
lsof -n -i
if it’s installed. We should see only the processes we expect, so familiarity with what’s normal is key.
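To make that familiarity a little easier, here is a sketch which boils the lsof output down to the names of the commands holding a listening TCP socket. The function name listeners is made up, and I’ve added -P so that ports are shown as numbers. Note that lsof generally needs to be run as root to see other users’ processes.

```shell
# listeners: read `lsof -n -P -i` output on stdin and print the names of
# commands which have a listening TCP socket. (Hypothetical helper.)
listeners() {
    awk 'NR > 1 && $NF == "(LISTEN)" { print $1 }' | sort -u
}

# Possible use:
#   lsof -n -P -i | listeners
```

Saving that short list when the machine is known to be healthy gives something to compare against later.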
/var/discourse/launcher enter app
Discourse is shipped and installed as a complete OS image in docker. With this command we can open a shell within that image, for example to look at processes, memory use, perhaps disk use. It’s also possible to run database commands: either queries or sometimes tidying-up, or (hopefully rarely) repairs.
Within the docker image, then, we might ask which processes are running under the user ‘discourse’:
ps fu -u discourse
Or which processes are using most memory:
ps aux | sort -n -k4 | tail
We exit the docker container with
exit
We talked a bit about monitoring or checking connectivity to the outside world. Ideally, we connect meaningfully to a service we actually care about: perhaps connect to wherever our backups are. But among the lower level checks we can easily do:
ping -c2 g.co
tests both name resolution and connectivity, whereas
ping -c2 8.8.8.8
or
ping -c2 1.1.1.1
test just connectivity. But they say little about the quality of a connection: one might need to transfer some data for that.
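Those two kinds of ping can be combined into one rough diagnosis, sketched below. The function name check_net and its messages are made up. A failed ping to a name doesn’t prove that name resolution is at fault (the host might simply not answer), hence the hedged wording in the output.

```shell
# check_net NAME IP: ping an address first, then a name, and say which failed.
# (Hypothetical helper.)
check_net() {
    if ping -c2 "$2" > /dev/null 2>&1; then
        if ping -c2 "$1" > /dev/null 2>&1; then
            echo "ok"
        else
            echo "connectivity ok, but $1 failed (possibly name resolution)"
        fi
    else
        echo "no connectivity to $2"
    fi
}

# Possible use:
#   check_net g.co 8.8.8.8
```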
As noted,
df -h /
tells us how much disk space remains. To see how much is used, and where, I tend to pipe du into sort and then tail:
du -kx / | sort -n | tail -33
To answer a specific question about how much space is used by the base OS, how much by Docker, and how much by the forum data itself, I used:
du -csh /usr /swapfile /var/lib/apt /var/log /var/cache/ /boot/ /*bin /lib | sort -h
du -hsxc /var/discourse/ /var/lib/docker/
du -h /var/discourse/shared/standalone/backups/default
We can ask docker what it’s doing:
docker ps
and how it’s using space:
docker system df
and what volumes (if any) it’s looking after:
docker volume ls
and what images it knows about:
docker image ls
docker image ls -a
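Sometimes one is advised to take some dramatic clean-up action like this, which deletes whatever Docker considers unused (stopped containers, images, volumes) without asking for confirmation: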
docker system prune --all --volumes --force
As an aside, systemd journals a lot of data and when space is tight we don’t need it - we only need it after or during an incident. So we can be very aggressive if we just need to reclaim space right now:
journalctl --disk-usage
journalctl --rotate
journalctl --vacuum-time=1s