ESXi 5.1 catch-22 with local logcache and faulty local datastore

ESXi catch-22 with local logcache and explosive SSDs

Based upon:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2056181

This issue has happened to us twice in the past, both times resulting in a restart of the ESXi host, which meant out-of-hours downtime for the VMs. We knew it was once again a failed SSD that had been configured as a local logcache (we now know this is not a good idea and that an external syslog server is the right answer).

KB Style:
Symptoms

-ESXi 5.1 or ESXi 5.5 host is disconnected in vCenter Server
-Local datastore on the host is marked as inaccessible (impossible to confirm while the local vSphere connection itself is failing!)
-Unable to connect directly to the host using the vSphere Client
-The host’s management network IP responds to ping
-The hostd management agent is running
-Restarting the management agents fails with the error:
Connect to localhost failed: Connection failure

A few bits to add:
- “Restarting the management agents fails with the error” wasn’t true in this case: both the hostd and vpxa services restarted properly, and their status displayed as running.
Apart from the logs, the only way to see the “connection to localhost” error was to try to execute any esxcli command.
Running /sbin/services.sh restart also throws several “connection to localhost failed” errors.

More issues noticed along the way:
- The shell hangs each time you try to list devices, or any location containing a symlink that points at the faulty local disk. Directories such as /scratch/log or /var/log can still be listed with ls, but anything more will crash the shell and cause your server to spend a few minutes screaming and running naked around the fire…

After endless tries I found the KB http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2056181
I thought we would have to take down the ESXi host 🙁 yet again.

Where’s the catch-22?
To change the logcache location, which could potentially help with this faulty drive, you have to use esxcli (the same goes for removing the datastore). But esxcli was not working, and the recommendation for getting it working again was to restart the ESXi host. ?!?!?
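For reference, here is a rough sketch of what moving the logs off the local disk to a remote syslog server looks like on a host where esxcli still responds. The collector address syslog.example.com, the LOGHOST and DRY_RUN variables and the function name are placeholders made up for illustration and do not come from the KB. By default the script only prints the commands, since on a host in the state described here they would simply fail with the localhost connection error.

```shell
#!/bin/sh
# Sketch only: point ESXi syslog at a remote collector instead of a local SSD.
# LOGHOST is a placeholder address; replace it with your own syslog server.
LOGHOST="${LOGHOST:-udp://syslog.example.com:514}"
# DRY_RUN=1 (default) prints the commands; set DRY_RUN=0 on a real host to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

configure_remote_syslog() {
    # Show the current syslog settings (log directory, remote host).
    run esxcli system syslog config get
    # Send logs to the remote collector.
    run esxcli system syslog config set --loghost="$LOGHOST"
    # Allow outgoing syslog traffic through the ESXi firewall.
    run esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true
    run esxcli network firewall refresh
    # Make the syslog daemon pick up the new configuration.
    run esxcli system syslog reload
}

configure_remote_syslog
```

Doing this before an SSD dies is the whole point: once the local log device has failed, none of these commands can be issued without the restart.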

The last thing left to try was to go to the server and pull the faulty SSD, something I would never recommend doing on other systems.

A few minutes after that we could log in locally with the vSphere Client as well as reconnect the host to vCenter.
After that it was a pleasure to migrate the VMs off and enter maintenance mode!

All in all, it was well worth the effort: no downtime needed.
