The beginning:
You have a whitebox ESXi host (6 cores @ 3.2 GHz, 16 GB RAM, 2x500 GB + 2x1 TB DAS, free ESXi version - of course) and some VMs running on it. One of the VMs is a Centreon server (monitoring). Everything is going well, so you also decide to set up off-site backups for the critical VMs.
The problem:
1. You start receiving "high load" alarms for (and from) the Centreon server. For a while you ignore them, hoping they'll go away. They don't.
2. The off-site backup also starts to fail: the NFS mount becomes unstable (all paths down / up).
The foreplay:
1. You check the load on the VM; it is indeed higher than normal, and you observe a big I/O wait (see the snippet after this list).
2. You blame the remote NFS server and configure another, more reliable one, but the backup script fails again.
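For reference, this is roughly how the symptom looks from inside the Centreon VM, using standard Linux tooling (iostat needs the sysstat package installed):
[cc lang='bash']
# Load is high but the CPU is mostly idle - the "wa" / %iowait columns give it away
uptime
vmstat 5 3       # watch the "wa" column
iostat -x 5 3    # %iowait plus per-device utilization
[/cc]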
The work:
1. You reboot the Centreon VM; nothing changes, the load is still high.
2. During a manual run of the backup script (against a pfSense VM, chosen for its smaller disk), you terminate it mid-run and wonder about the current state of the VM that was being backed up - a snapshot had been taken (and never deleted, because the script didn't finish all its steps)!
By checking all the VMs that were set up to be backed up, you notice that Centreon has also been carrying a snapshot for a month (most probably the daily backup crashed at this VM) - exactly when all the issues appeared.
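A quick way to hunt for such stale snapshots from the ESXi shell; the VM id below is just an example taken from the getallvms output:
[cc lang='bash']
# List registered VMs with their ids, then inspect each one for snapshots
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/snapshot.get 3207    # example id from the list above

# Or hunt for leftover snapshot delta files across all datastores
find /vmfs/volumes/ -name "*-delta.vmdk"
[/cc]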
You delete the snapshot; the vSphere Client becomes unresponsive and the deletion looks stuck at 25%, but luckily it completes before you do something bad like a host reboot. In fairness, the task has a month of delta writes to consolidate back into the base disk, so it takes a while and hammers the datastore even harder.
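If the client gives up before the task does, the same deletion can be driven from the ESXi shell; a minimal sketch, with a made-up VM id and datastore path - substitute your own:
[cc lang='bash']
# Remove all snapshots of the VM - this consolidates the deltas into the base disk
vim-cmd vmsvc/snapshot.removeall 3207    # example VM id

# The GUI percentage barely moves; watching the delta files disappear is more
# reassuring (datastore path is made up)
while ls -lh /vmfs/volumes/datastore1/centreon/*-delta.vmdk 2>/dev/null; do
  sleep 60
done
echo "consolidation finished"
[/cc]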
After the snapshot deletion, the load drops back to normal values:
[Load graph: values return to normal after the snapshot deletion]
Testing the backup script again, it runs smoothly with no NFS errors - it looks like the NFS connection was affected by the snapshot and by the heavy writes of a very disk-busy VM.
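A quick way to sanity-check the NFS side while the backup runs (works on reasonably recent ESXi builds):
[cc lang='bash']
# "Accessible: true" should hold steady for the backup datastore; if it
# flaps to false you are back in all-paths-down territory
esxcli storage nfs list
[/cc]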
But "storage device performance
deteriorated"
errors started popping out in the logs, most probably due to overload -
this is another story :)
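The messages land in vmkernel.log and are easy to fish out; the exact wording differs between ESXi versions, so treat the pattern below as an approximation:
[cc lang='bash']
# I/O latency warnings from the storage stack; adjust the pattern to your version
grep -i "performance has deteriorated" /var/log/vmkernel.log
[/cc]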
[cc lang='bash']
# bbbbbbbbbbbbbbb aka Centreon is pushing a lot of writes even after snapshot deletion
GID   VMNAME           VDEVNAME  NVDISK  CMDS/s  READS/s  WRITES/s  MBREAD/s  MBWRTN/s  LAT/rd  LAT/wr
3181  aaaaaaaaaaaaaaa  -              1    0.00     0.00      0.00      0.00      0.00    0.00    0.00
3207  bbbbbbbbbbbbbbb  -              1  154.94     0.00    154.94      0.00      4.08    0.00    4.51
3215  ccccccccccccccc  -              1    0.00     0.00      0.00      0.00      0.00    0.00    0.00
3223  ddddddddddddddd  -              1    0.00     0.00      0.00      0.00      0.00    0.00    0.00
3231  eeeeeeeeeeeeeee  -              1    1.41     0.00      1.41      0.00      0.01    0.00    0.27
3239  fffffffffffffff  -              1    0.00     0.00      0.00      0.00      0.00    0.00    0.00
[/cc]
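For the record, the stats above come from esxtop's virtual disk view (press 'v' in an interactive session). Batch mode can capture the same counters for later analysis; the delay and iteration count below are just examples:
[cc lang='bash']
# 12 samples, 5 seconds apart, dumped as CSV for offline analysis
esxtop -b -d 5 -n 12 > /tmp/esxtop-stats.csv
[/cc]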
Conclusions:
- email your ghettoVCB logs (this feature is built in, but needs some tweaks to make it work - see the sketch after this list)
- don't take snapshots of disk-busy VMs
- monitor your VMs & datastores for IOPS
- monitor your monitoring server
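For the first point, a minimal sketch of the email settings; the variable names match the ghettoVCB releases I've seen, but verify them against your copy, and the values are obviously examples:
[cc lang='bash']
# ghettoVCB email settings (in ghettoVCB.conf or at the top of ghettoVCB.sh);
# names taken from the upstream script - double-check your version
EMAIL_LOG=1                        # email the backup log when a run finishes
EMAIL_SERVER=smtp.example.com      # example value
EMAIL_SERVER_PORT=25
EMAIL_TO=admin@example.com
EMAIL_FROM=ghettovcb@example.com

# One of the "tweaks": the ESXi firewall usually blocks outbound SMTP, so a
# custom ruleset (an XML file under /etc/vmware/firewall/) may be needed;
# list the active rulesets to see what you have:
esxcli network firewall ruleset list
[/cc]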