Hi Guys,
I currently have 8 HP ProLiant DL580 G7 hosts (each with 4 x Intel Xeon X7560 @ 2.27 GHz (64 logical processors) and 64 GB RAM) running ESXi 4.1.0 Build 260247 in an HA and semi-automated DRS cluster.
In the past month, 3 of these hosts have randomly disconnected from Virtual Center (a VM running on one of the hosts), and when I try to reconnect them, it fails with the following error: "Cannot contact the specified host. The host may not be available on the network, a network configuration problem may exist, or the management services on this host may not be responding."
Steps I have taken to resolve this issue:
1. Tried simply right-clicking the disconnected host and clicking Connect, which is when I get the aforementioned error.
2. Logged into the iLO of the host and tried restarting the management agents, at which point the host froze and I had to power the machine off and back on via the iLO.
Of the 3 servers, only 1 ever had an actual hardware issue: a bad RAM cache module, which has since been replaced.
The first time I called VMware support, I had already hard booted the host and brought it and all the VMs up. When I told them this, they told me that since we are running ESXi, whenever the server is rebooted it clears out all the logs. First of all, is this actually true? Am I missing some way to export those logs to another location? Because that seems like a very bad model to me, but perhaps I just don't understand how I'm supposed to be getting these logs off.
The second time I called VMware support, for a different server than the first, the VMs were still up and running even though the host was disconnected, so I called support before I attempted to do anything. They went in and looked at the logs and saw some errors suggesting problems reading the local disk. So the VMs, which reside on a SAN, were still "running in memory," as he put it, but the host itself couldn't read its hard disks. I checked the IML on the iLO and had a guy down in our datacenter check the LEDs, and there were no warnings or hardware failures of any kind.
Last night, this happened on a 3rd host, but when I first logged into the iLO the server was locked up, and I couldn't RDP or SSH to any of the VMs on the host, so I'm not sure if it was a different issue.
This has happened on the first server 3 or 4 times, the second server 2 times, and now it happened again on this 3rd server.
One thing I read in a community post was that I need to specify the vCenter Server managed IP in Virtual Center. That had not been done, so I did that.
So, to sum it all up: has anybody ever experienced this before? Random disconnects without being able to reconnect? And if anybody can shed some light on the log situation for me, that would be great as well.
Thanks guys, sorry for the Russian novel post.
James