wiki'd

by JoKeru

The "high load vm / esxi nfs backup error" story

The beginning:
You have a WhiteBox ESXi host (6 Cores @ 3.2 GHz, 16G RAM, 2x500GB + 2x1TB DAS, free ESXi version - of course) and some VMs running on it.
One of the VMs is a Centreon Server (monitoring).
Everything is going well and you also decide to do some off-site backup for the critical VMs.

The problem:
1. You're starting to receive "high load" alarms for (and from) the Centreon. For a time you ignore them hoping they'll go away. They don't.
2. The off-site backup also starts to fail and the NFS datastore becomes unstable (all paths down / up).

The foreplay:
1. You check the load on the VM: it is indeed higher than normal, and you see a lot of I/O wait.
2. You blame the remote NFS server and configure another, more reliable one, but the backup script fails again.
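If you want a number for that I/O wait instead of eyeballing top, something like the sketch below can help. It assumes a Linux guest with the usual /proc/stat layout; the function names `iowait_pct` and `sample_iowait` are made up for this example.

```bash
#!/bin/sh
# Hypothetical helper: share of CPU time spent in iowait between two samples
# of the aggregate "cpu" line from /proc/stat.
# Counters after the "cpu" label: user nice system idle iowait irq softirq steal
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal only; guest time is already counted in user/nice.
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total
            w[NR] = $6          # iowait is the 5th counter, so field 6
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (w[2] - w[1]) / dt
        }'
}

# Take two samples a few seconds apart and print the iowait percentage.
sample_iowait() {
    interval="${1:-5}"
    first=$(grep '^cpu ' /proc/stat)
    sleep "$interval"
    second=$(grep '^cpu ' /proc/stat)
    iowait_pct "$first" "$second"
}
```

Running `sample_iowait 10` inside the guest prints the percentage of CPU time spent waiting on I/O over the last 10 seconds. A value that stays high while the application load is unchanged points at the storage layer rather than at the VM itself.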

The work:
1. You reboot the Centreon vm, nothing changes, load is still high.
2. During a manual run of the backup script (against a pfSense VM, because its disk is smaller), you terminate the script and wonder what state it left the VM in - a snapshot had been taken (and never deleted, because the script didn't finish all its steps)!

Checking all the VMs that were set up for backup, you notice that Centreon also has a snapshot, taken one month ago (most probably the daily backup crashed on this VM) - the same time all the issues appeared.
You delete the snapshot. The vSphere Client becomes unresponsive and the deletion is stuck at 25%, but luckily it completes before you do something bad like rebooting the host.
After the snapshot delete, the load drops to normal values:
[caption id="attachment_1013" align="aligncenter" width="624"]Load[/caption]
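A forgotten snapshot like this one can be spotted from the ESXi shell by looking for delta disks on the datastores. The sketch below assumes the usual naming, where a snapshot leaves a `<vm>-000001-delta.vmdk` (or `-sesparse.vmdk`) file next to the base disk; `list_snapshot_disks` is a made-up name for this example, not part of ESXi or ghettoVCB.

```bash
#!/bin/sh
# Hypothetical helper: list snapshot delta disks below a datastore root.
# Assumption: snapshot data files end in -delta.vmdk or -sesparse.vmdk.
list_snapshot_disks() {
    root="${1:-/vmfs/volumes}"
    find "$root" -type f \
        \( -name '*-delta.vmdk' -o -name '*-sesparse.vmdk' \) 2>/dev/null | sort
}
```

Called without arguments it scans /vmfs/volumes; any output outside a backup window means a VM is still running on a snapshot. To ask ESXi directly, `vim-cmd vmsvc/getallvms` lists the VM IDs and `vim-cmd vmsvc/snapshot.get <vmid>` shows the snapshot tree of one VM.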

Testing the backup script again, it runs smoothly with no NFS errors - it looks like the NFS connection was affected by the snapshot, combined with a very disk-busy VM writing heavily.

But "storage device performance deteriorated" errors started popping up in the logs, most probably due to overload - this is another story :)
[cc lang='bash']
# bbbbbbbbbbbbbbb aka Centreon is pushing a lot of writes even after snapshot deletion
GID  VMNAME          VDEVNAME NVDISK CMDS/s READS/s WRITES/s MBREAD/s MBWRTN/s LAT/rd LAT/wr
3181 aaaaaaaaaaaaaaa -             1   0.00    0.00     0.00     0.00     0.00   0.00   0.00
3207 bbbbbbbbbbbbbbb -             1 154.94    0.00   154.94     0.00     4.08   0.00   4.51
3215 ccccccccccccccc -             1   0.00    0.00     0.00     0.00     0.00   0.00   0.00
3223 ddddddddddddddd -             1   0.00    0.00     0.00     0.00     0.00   0.00   0.00
3231 eeeeeeeeeeeeeee -             1   1.41    0.00     1.41     0.00     0.01   0.00   0.27
3239 fffffffffffffff -             1   0.00    0.00     0.00     0.00     0.00   0.00   0.00
[/cc]
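To pick the noisy VM out of a table like the one above without reading every row, a small awk filter is enough. This is only a sketch: `busy_writers` is a made-up name, and it assumes the whitespace-separated layout shown above (WRITES/s in the 7th column, VM names without spaces).

```bash
#!/bin/sh
# Hypothetical helper: print "VMNAME WRITES/s" for every VM above a threshold.
# Reads the table on stdin; columns assumed to be:
# GID VMNAME VDEVNAME NVDISK CMDS/s READS/s WRITES/s ...
busy_writers() {
    threshold="${1:-100}"
    awk -v limit="$threshold" '
        # Data rows start with a numeric GID; header and comment lines do not.
        $1 ~ /^[0-9]+$/ && ($7 + 0) > (limit + 0) { print $2, $7 }
    '
}
```

Feeding it the table above with `busy_writers 100` prints only the bbbbbbbbbbbbbbb row with its 154.94 writes/s.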

Conclusions:
- email your ghettoVCB logs (this feature is built in, but needs some tweaks to make it work)
- don't snapshot on disk-busy vms
- monitor your vms & datastores for iops
- monitor your monitoring server
