A key part of any modern infrastructure is good monitoring. Even on my local desktop system I like to have monitoring in place, so that if a problem gradually builds up I can trace where it started.
Unfortunately, in this case the monitoring system itself caused the problem: Prometheus was using up all my system CPU and RAM.
My guess is that this occurred after a series of upgrades, perhaps an accumulation of old data from containers and VMs that wasn’t being cleaned up.
But with monitoring down it’s hard to troubleshoot, so I decided the pragmatic solution was to delete old Prometheus data and keep an eye on things in future.
This got me up and running again:
sudo service prometheus stop
sudo rm -rf /var/lib/prometheus/metrics2/wal/*
sudo service prometheus start
I don’t have retrospective data now, but my system is running well, and if I see the problem recur I’ll try to investigate before the system becomes unusable.
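
If the problem does come back, a first check could be how much space each part of the Prometheus data directory is using, since a very large `wal` (write-ahead log) directory would point in the same direction as last time. The following is only a minimal sketch: the `report_usage` function and the `DATA_DIR` variable are names made up for illustration, and the default path is simply the one used in the commands above, which may differ on other installs.

```shell
#!/bin/sh
# Hypothetical helper: list the entries under a directory by disk usage,
# largest first. Not part of Prometheus; just du and sort.

# Assumption: the data directory is the one used earlier in this post.
# Override by setting DATA_DIR in the environment.
DATA_DIR="${DATA_DIR:-/var/lib/prometheus/metrics2}"

report_usage() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # du -sk prints one line per entry: size in KiB, then the path.
    # sort -rn puts the largest entry on the first line.
    du -sk "$dir"/* 2>/dev/null | sort -rn
}

# Report on the default directory; do not abort if it is missing.
report_usage "$DATA_DIR" || true
```

The data directory is often readable only by the prometheus user, so the script will probably need to be run with sudo. Running it occasionally and comparing the numbers over time should show whether something is growing without limit, before it gets as far as exhausting CPU and RAM.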