I have been working with a lot of customers lately who see a datastore showing as inactive after an unplanned PDL on an ESXi host. First, let us define what a PDL is and what can cause an unplanned one.
Permanent Device Loss
A Permanent Device Loss (PDL) situation occurs when a LUN presented to the ESXi host becomes unavailable. In the Storage Adapters view, the device reports as Lost Communication.
You may be wondering what causes a PDL. I have listed some common causes below:
- Array misconfiguration.
- Removing the ESXi host from the array's storage group.
- A LUN failure on the storage array.
- Incorrect zoning configuration that can cause the LUN to be unavailable.
- Power outage on the storage array.
When an unplanned PDL occurs, the host stops sending I/O requests to the storage array even though the paths are up and accessible.
It decides to stop sending I/O based on the SCSI sense codes that the storage array returns over those paths, which indicate that the device is unavailable.
Once the issue is resolved and the device is available to the ESXi host again, you may notice that the datastore still shows as inactive.
This happens when one or more virtual machines were running on the storage device when it went offline.
You would see logging similar to the following in the vmkernel.log of the ESXi host:
2017-06-14T18:39:48.741Z cpu1:10397)ScsiDevice: 5261: t10.F405E46494C45425342473C6B645D24325B454D24694E477 device :Open count > 0, cannot be brought online
2017-06-14T18:39:58.892Z cpu14:10397)ScsiDevice: 5261: t10.F405E46494C45425342473C6B645D24325B454D24694E477 device :Open count > 0, cannot be brought online
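If the log is busy, it can help to pull out just the device IDs from these messages. The following is a minimal sketch, assuming the message format matches the two lines above; the function name `blocked_devices` is made up for illustration.

```shell
# Print the unique device IDs that the log reports as
# "Open count > 0, cannot be brought online". Reads log text on stdin.
# Assumes the message layout shown in the log excerpt above.
blocked_devices() {
    grep 'Open count > 0, cannot be brought online' |
        sed -n 's/.*ScsiDevice: [0-9]*: \([^ ]*\) device :Open count.*/\1/p' |
        sort -u
}

# Example usage on the host (the log path is the usual ESXi location):
# blocked_devices < /var/log/vmkernel.log
```

The ID it prints is the value to pass to the `-d` option in the first step below.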
To remount the datastore successfully, follow these steps:
Run the command below to see which world still has the device open:
esxcli storage core device world list -d t10.F405E46494C45425342473C6B645D24325B454D24694E477
In this example, the output showed a VM called PSC65-A that was running on the device when it went offline. Its process still shows up, although the VM itself might be unresponsive.
You can also confirm this by running the command below, which lists the processes for all running VMs on the ESXi host:
esxcli vm process list
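On a host with many VMs, you may want to look up the World ID by VM name rather than scanning the list by eye. Here is a rough sketch, assuming the output consists of one block per VM with indented `World ID:` and `Display Name:` lines, and that `World ID:` comes first within each block. The helper name `world_id_for_vm` is hypothetical.

```shell
# Print the World ID of the VM whose Display Name matches $1.
# Reads the output of `esxcli vm process list` on stdin.
# Assumption: each VM block has indented "World ID:" and "Display Name:"
# lines, with "World ID:" appearing before "Display Name:".
world_id_for_vm() {
    vm_name="$1"
    awk -v name="$vm_name" '
        /^[^ \t]/ { wid = "" }                  # non-indented line: new VM block
        /^[ \t]+World ID:/ { wid = $3 }         # remember this block'"'"'s World ID
        /^[ \t]+Display Name:/ {
            sub(/^[ \t]+Display Name:[ \t]*/, "")
            if ($0 == name) print wid
        }
    '
}

# Example usage on the host:
# esxcli vm process list | world_id_for_vm PSC65-A
```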
The next step is to kill the VM process by its World ID using the command:
esxcli vm process kill --type=force --world-id=88817
Now perform a rescan of the storage devices, either from the UI or from the command line. The datastore should then come up as mounted.
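For the command-line route, the kill and the rescan can be wrapped together. This is only a sketch: the wrapper name `kill_world_and_rescan` and the numeric check are my own additions, and it assumes that rescanning all adapters with `esxcli storage core adapter rescan --all` is acceptable in your environment.

```shell
# Force-kill a VM world by World ID, then rescan all storage adapters.
kill_world_and_rescan() {
    world_id="$1"
    # Refuse anything that is not a plain number, so a typo never reaches esxcli.
    case "$world_id" in
        ''|*[!0-9]*)
            echo "usage: kill_world_and_rescan <numeric-world-id>" >&2
            return 1
            ;;
    esac
    # Stop here if the kill fails; a rescan would not help in that case.
    esxcli vm process kill --type=force --world-id="$world_id" || return 1
    esxcli storage core adapter rescan --all
}

# Example usage on the host, with the World ID found in the earlier steps:
# kill_world_and_rescan 88817
```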
You will have to power on the affected VM manually.
If there is no VM listed in the first step, you will have to reboot the ESXi host in order to remount the datastore successfully.
I hope this has been informative and thank you for reading!
2 Comments
The step “esxcli vm process list” is not necessary because you already got the World ID in the previous step.
I love you man, you saved me. Thx a lot.