vSphere 6.5 Update 1 is out, here’s why you want to upgrade

Hi

VMware has just released the first major update to vSphere 6.5. Normally I don’t blog about these, but this update is big and it fixes some really annoying bugs I saw while using the GA version of vSphere 6.5. Thankfully, we worked hard with their support to overcome some of the issues I highlighted in yellow; this was of course done for the greater good.

The release notes for ESXi 6.5 U1 can be seen here https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-esxi-651-release-notes.html and it can be downloaded from here https://my.vmware.com/web/vmware/details?downloadGroup=ESXI65U1&productId=614&rPId=17343

The release notes for vCenter 6.5 U1 can be seen here https://docs.vmware.com/en/VMware-vSphere/6.5/rn/vsphere-vcenter-server-651-release-notes.html and it can be downloaded from here https://my.vmware.com/web/vmware/details?downloadGroup=VC65U1&productId=614&rPId=17343

Below is a partial list of the fixes that were close to my heart.

Storage Issues

  • Modification of IOPS limit of virtual disks with enabled Changed Block Tracking (CBT) fails with errors in the log files

    To define the storage I/O scheduling policy for a virtual machine, you can configure the I/O throughput for each virtual machine disk by modifying the IOPS limit. When you edit the IOPS limit and CBT is enabled for the virtual machine, the operation fails with the error "The scheduling parameter change failed." As a result, the scheduling policies of the virtual machine cannot be altered. The error message appears in the vSphere Recent Tasks pane.

    You can see the following errors in the /var/log/vmkernel.log file:

    2016-11-30T21:01:56.788Z cpu0:136101)VSCSI: 273: handle 8194(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000
    2016-11-30T21:01:56.788Z cpu0:136101)ScsiSched: 2760: Invalid Bandwidth Cap Configuration
    2016-11-30T21:01:56.788Z cpu0:136101)WARNING: VSCSI: 337: handle 8194(vscsi0:0):Failed to invert policy

    This issue is resolved in this release.
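
    To check whether a host hit this problem, the three log lines above can be searched for in a copy of vmkernel.log. The following is a minimal sketch rather than a VMware tool: the function name scan_vmkernel_log and the result layout are made up for illustration, and the sample lines are the ones quoted above.

```python
import re

# Signatures copied from the vmkernel.log excerpt quoted in this post.
SIGNATURES = (
    "Invalid Bandwidth Cap Configuration",
    "Failed to invert policy",
)

# Matches the key=value pairs on the "Input values" line of the excerpt.
INPUT_VALUES = re.compile(
    r"Input values: res=(-?\d+) limit=(-?\d+) bw=(-?\d+) Shares=(-?\d+)"
)


def scan_vmkernel_log(lines):
    """Return (signature_hits, input_values) found in an iterable of log lines.

    Hypothetical helper for illustration only.
    """
    hits = []
    values = []
    for line in lines:
        match = INPUT_VALUES.search(line)
        if match:
            res, limit, bw, shares = (int(g) for g in match.groups())
            values.append({"res": res, "limit": limit, "bw": bw, "shares": shares})
        for signature in SIGNATURES:
            if signature in line:
                hits.append(signature)
    return hits, values


# The three lines from the release note, used here as sample input.
SAMPLE = [
    "2016-11-30T21:01:56.788Z cpu0:136101)VSCSI: 273: handle 8194(vscsi0:0):Input values: res=0 limit=-2 bw=-1 Shares=1000",
    "2016-11-30T21:01:56.788Z cpu0:136101)ScsiSched: 2760: Invalid Bandwidth Cap Configuration",
    "2016-11-30T21:01:56.788Z cpu0:136101)WARNING: VSCSI: 337: handle 8194(vscsi0:0):Failed to invert policy",
]

hits, values = scan_vmkernel_log(SAMPLE)
print(hits)
print(values)
```

    On a real host you would pass the lines of an open log file instead of SAMPLE.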

  • When you hot-add multiple VMware Paravirtual SCSI (PVSCSI) hard disks in a single operation, only one is visible for the guest OS

    When you hot-add two or more hard disks to a VMware PVSCSI controller in a single operation, the guest OS can see only one of them.

    This issue is resolved in this release.

  • An ESXi host might fail with a purple screen

    An ESXi host might fail with a purple screen because of a race condition when multiple multipathing plugins (MPPs) try to claim paths.

    This issue is resolved in this release.

  • Reverting from an error during a storage profile change operation results in a corrupted profile ID

    If a VVol VASA Provider returns an error during a storage profile change operation, vSphere tries to undo the operation, but the profile ID gets corrupted in the process.

    This issue is resolved in this release.

  • Incorrect Read or Write latency displayed in vSphere Web Client for VVol datastores

    Per host Read or Write latency displayed for VVol datastores in the vSphere Web Client is incorrect.

    This issue is resolved in this release.

  • An ESXi host might fail with a purple screen during NFSCacheGetFreeEntry

    The NFS v3 client does not properly handle a case where NFS server returns an invalid filetype as part of File attributes, which causes the ESXi host to fail with a purple screen.

    This issue is resolved in this release.

  • The lsi_mr3 driver and hostd process might stop responding due to a memory allocation failure in ESXi 6.5

    The lsi_mr3 driver allocates memory from address space below 4GB. The vSAN disk serviceability plugin lsu-lsi-lsi-mr3-plugin and the lsi_mr3 driver communicate with each other. The driver might stop responding during the memory allocation when handling the IOCTL event from storelib. As a result, lsu-lsi-lsi-mr3-plugin might stop responding and the hostd process might also fail even after restart of hostd.

    This issue is resolved in this release with a code change in the lsu-lsi-lsi-mr3-plugin plugin of lsi_mr3 driver, setting a timeout value to 3 seconds to get the device information to avoid plugin and hostd failures.

  • When you hot-add an existing or new virtual disk to a CBT (Changed Block Tracking) enabled virtual machine (VM) residing on a VVol datastore, the guest operating system might stop responding

    When you hot-add an existing or new virtual disk to a CBT-enabled VM residing on a VVol datastore, the guest operating system might stop responding until the hot-add process completes. How long the VM is unresponsive depends on the size of the virtual disk being added. The VM automatically recovers once the hot-add completes.

    This issue is resolved in this release.

  • When you use vSphere Storage vMotion, the UUID of a virtual disk might change

    When you use vSphere Storage vMotion on vSphere Virtual Volumes storage, the UUID of a virtual disk might change. The UUID identifies the virtual disk and a changed UUID makes the virtual disk appear as a new and different disk. The UUID is also visible to the guest OS and might cause drives to be misidentified.

    This issue is resolved in this release.

  • An ESXi host might stop responding if a LUN unmapping is made on the storage array side

    An ESXi host might stop responding if LUNs are unmapped on the storage array side while they are connected to the ESXi host through a Broadcom/Emulex Fibre Channel adapter (the lpfc driver) and have I/O running.

    This issue is resolved in this release.

  • An ESXi host might become unresponsive if the VMFS-6 volume has no space for the journal

    When a VMFS-6 volume is opened, a journal block is allocated for it. Upon successful allocation, a background thread is started. If there is no space on the volume for the journal, the volume is opened in read-only mode and no background thread is started. Any attempt to close the volume then tries to wake up a nonexistent thread, which causes the ESXi host to fail.

    This issue is resolved in this release.

  • An ESXi host might fail with a purple screen if the virtual machines running on it have large capacity vRDMs and use the SPC-4 feature

    When virtual machines use the SPC-4 feature with the Get LBA Status command to query thin-provisioned features of large attached vRDMs, the processing of this command might run for a long time in the ESXi kernel without relinquishing the CPU. The high CPU usage can cause the CPU heartbeat watchdog to treat the process as hung, and the ESXi host might stop responding.

    This issue is resolved in this release.

  • An ESXi host might fail with a purple screen if the VMFS6 datastore is mounted on multiple ESXi hosts, while the disk.vmdk has file blocks allocated from an increased portion on the same datastore

    A VMDK file might reside on a VMFS6 datastore that is mounted on multiple ESXi hosts (for example, two hosts: ESXi host1 and ESXi host2). Suppose the VMFS6 datastore capacity is increased from ESXi host1 while the datastore is also mounted on ESXi host2, and disk.vmdk has file blocks allocated from the increased portion of the datastore by ESXi host1. If disk.vmdk is then accessed from ESXi host2 and file blocks are allocated to it from ESXi host2, ESXi host2 might fail with a purple screen.

    This issue is resolved in this release.

  • After installation or upgrade, certain multipathed LUNs will not be visible

    If the paths to a multipathed LUN have different LUN IDs, the LUN will not be registered by PSA and end users will not see it.

    This issue is resolved in this release.

  • The recompose operation through Horizon View might fail for a virtual machine residing on NFS datastores

    The recompose operation in Horizon View might fail for desktop virtual machines residing on NFS datastores with stale NFS file handle errors, because of the way virtual disk descriptors are written to NFS datastores.

    This issue is resolved in this release.

  • An ESXi host might fail with a purple screen because of a CPU heartbeat failure

    An ESXi host might fail with a purple screen because of a CPU heartbeat failure, but only if SEsparse is used for creating snapshots and clones of virtual machines. The use of SEsparse might lead to CPU lockups, with the following warning message in the VMkernel logs, followed by a purple screen:

    PCPU <cpu-num>  didn’t have a heartbeat for <seconds>  seconds; *may* be locked up.

    This issue is resolved in this release.

  • Disabled frequent lookup to an internal vSAN metadata directory (.upit) on virtual volume datastores. This metadata folder is not applicable to virtual volumes

    The frequent lookup to a vSAN metadata directory (.upit) on virtual volume datastores can impact its performance. The .upit directory is not applicable to virtual volume datastores. The change disables the lookup to the .upit directory.

    This issue is resolved in this release.

  • Performance issues on Windows virtual machines (VMs) might occur after upgrading to VMware ESXi 6.5.0 P01 or 6.5 EP2

    Performance issues might occur when unaligned unmap requests are received from the guest OS under certain conditions. Depending on the size and number of the unaligned unmaps, this might occur when a large number of small files (less than 1 MB in size) are deleted from the guest OS.

    This issue is resolved in this release.
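
    To make "unaligned unmap" concrete: an unmap request covers a byte range, and it is aligned only when both its start and its length fall on the unmap granularity of the datastore. The sketch below is illustrative only. The 1 MB granularity is my assumption (the release note only mentions files smaller than 1 MB), and is_aligned and aligned_core are hypothetical helper names, not ESXi APIs.

```python
MB = 1024 * 1024
GRANULARITY = 1 * MB  # assumed unmap granularity; not stated in the release note


def is_aligned(offset, length, granularity=GRANULARITY):
    """True when both the start and the length sit on a granularity boundary."""
    return offset % granularity == 0 and length % granularity == 0


def aligned_core(offset, length, granularity=GRANULARITY):
    """Return (start, length) of the largest aligned sub-range, or None.

    The unaligned head and tail are what is left over once this core
    has been taken out of the request.
    """
    end = offset + length
    start = -(-offset // granularity) * granularity  # round the start up
    stop = (end // granularity) * granularity        # round the end down
    if stop <= start:
        return None
    return start, stop - start


# A 64 KB delete at a 4 KB offset never covers a whole 1 MB unit.
print(is_aligned(4096, 64 * 1024))
print(aligned_core(4096, 64 * 1024))

# A 2 MB range starting at 512 KB has a 1 MB aligned core in the middle.
print(aligned_core(512 * 1024, 2 * MB))
```

    Deleting many small files produces lots of requests of the first kind, which is the pattern the release note describes.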

  • ESXi 5.5 and 6.x hosts stop responding after running for 85 days

    ESXi 5.5 and 6.x hosts stop responding after running for 85 days. In the /var/log/vmkernel.log file you see entries similar to:

    YYYY-MM-DDTHH:MM:SS.833Z cpu58:34255)qlnativefc: vmhba2(5:0.0): Recieved a PUREX IOCB woh oo
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:34255)qlnativefc: vmhba2(5:0.0): Recieved the PUREX IOCB.
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)qlnativefc: vmhba2(5:0.0): sizeof(struct rdp_rsp_payload) = 0x88
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)qlnativefc: vmhba2(5:0.0): transceiver_codes[0] = 0x3
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)qlnativefc: vmhba2(5:0.0): transceiver_codes[0,1] = 0x3, 0x40
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)qlnativefc: vmhba2(5:0.0): Stats Mailbox successful.
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)qlnativefc: vmhba2(5:0.0): Sending the Response to the RDP packet
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 0 1 2 3 4 5 6 7 8 9 Ah Bh Ch Dh Eh Fh
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)--------------------------------------------------------------
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 53 01 00 00 00 00 00 00 00 00 04 00 01 00 00 10
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) c0 1d 13 00 00 00 18 00 01 fc ff 00 00 00 00 20
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 00 00 00 00 88 00 00 00 b0 d6 97 3c 01 00 00 00
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 0 1 2 3 4 5 6 7 8 9 Ah Bh Ch Dh Eh Fh
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674)--------------------------------------------------------------
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 02 00 00 00 00 00 00 80 00 00 00 01 00 00 00 04
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 18 00 00 00 00 01 00 00 00 00 00 0c 1e 94 86 08
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 0e 81 13 ec 0e 81 00 51 00 01 00 01 00 00 00 04
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 2c 00 04 00 00 01 00 02 00 00 00 1c 00 00 00 01
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 00 00 00 00 40 00 00 00 00 01 00 03 00 00 00 10
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 50 01 43 80 23 18 a8 89 50 01 43 80 23 18 a8 88
    YYYY-MM-DDTHH:MM:SS.833Z cpu58:33674) 00 01 00 03 00 00 00 10 10 00 50 eb 1a da a1 8f

    This is a firmware problem. It occurs when Read Diagnostic Parameters (RDP) between the Fibre Channel (FC) switch and the Host Bus Adapter (HBA) fails 2048 times. The HBA stops responding, and because of this the virtual machine and/or the ESXi host might fail. By default, the RDP routine is initiated by the FC switch and occurs once every hour, so the 2048 limit is reached in approximately 85 days.

    This issue is resolved in this release.
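
    The 85-day figure follows directly from the two numbers in the release note. A quick check of the arithmetic, assuming every hourly RDP exchange fails (the variable names are mine):

```python
RDP_FAILURE_LIMIT = 2048  # failures before the HBA stops responding (from the release note)
RDP_INTERVAL_HOURS = 1    # default: the FC switch initiates RDP once every hour

# Assumption: every RDP exchange fails, so one failure is counted per interval.
hours_to_limit = RDP_FAILURE_LIMIT * RDP_INTERVAL_HOURS
days_to_limit = hours_to_limit / 24

print(f"{hours_to_limit} hours = {days_to_limit:.1f} days")
```

    This prints "2048 hours = 85.3 days". If your switch polls at a different interval, the host would hit the limit proportionally sooner or later.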

  • Resolve the performance drop in Intel devices with stripe size limitation 

    Some Intel devices, for example the P3700 and P3600, have a vendor-specific limitation in their firmware or hardware. Due to this limitation, any I/O delivered to the NVMe device that crosses the stripe size (or boundary) can suffer a significant performance drop. The driver resolves this problem by checking every I/O and splitting the command if it crosses a stripe boundary on the device.

    This issue is resolved in this release.
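
    The splitting idea is easy to picture. Below is a minimal sketch of cutting one I/O into pieces that never cross a stripe boundary. It is not the driver's code: split_at_stripe is a hypothetical name, and the 128 KB stripe size in the example is an assumption for illustration, since the release note does not give a value.

```python
def split_at_stripe(offset, length, stripe_size):
    """Split an I/O (offset, length) so that no piece crosses a stripe boundary.

    Hypothetical helper; all values are in bytes.
    """
    if length <= 0 or stripe_size <= 0:
        raise ValueError("length and stripe_size must be positive")
    pieces = []
    end = offset + length
    while offset < end:
        # Bytes left before the next stripe boundary.
        room = stripe_size - (offset % stripe_size)
        chunk = min(room, end - offset)
        pieces.append((offset, chunk))
        offset += chunk
    return pieces


KB = 1024
STRIPE = 128 * KB  # assumed stripe size, for illustration only

# A 64 KB I/O starting at 96 KB crosses the 128 KB boundary, so it is split in two.
print(split_at_stripe(96 * KB, 64 * KB, STRIPE))

# A 4 KB I/O inside one stripe is left alone.
print(split_at_stripe(0, 4 * KB, STRIPE))
```

    The first call returns two 32 KB pieces, one on each side of the boundary; the second returns the I/O unchanged.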

  • Remove the redundant controller reset when starting controller

    The driver might reset the controller twice (disable, enable, disable and then finally enable it) when the controller starts. This was a workaround for an early version of the QEMU emulator, but it might delay the display of some controllers. According to the NVMe specifications, only one reset is needed, that is, disable and enable the controller. This update removes the redundant controller reset when starting the controller.

    This issue is resolved in this release.

  • An ESXi host might fail with purple screen if the virtual machine with large virtual disks uses the SPC-4 feature

    An ESXi host might stop responding and fail with purple screen with entries similar to the following as a result of a CPU lockup.

    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]@BlueScreen: PCPU x: no heartbeat (x/x IPIs received)
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Code start: 0xxxxx VMK uptime: x:xx:xx:xx.xxx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Saved backtrace from: pcpu x Heartbeat NMI
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]MCSLockWithFlagsWork@vmkernel#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PB3_Read@esx#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PB3_AccessPBVMFS5@esx#nover+00xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3FileOffsetToBlockAddrCommonVMFS5@esx#nover+0xx stack:0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3_ResolveFileOffsetAndGetBlockTypeVMFS5@esx#nover+0xx stack:0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3_GetExtentDescriptorVMFS5@esx#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3_ScanExtentsBounded@esx#nover+0xx stack:0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3GetFileMappingAndLabelInt@esx#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]Fil3_FileIoctl@esx#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]FSSVec_Ioctl@vmkernel#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]FSS_IoctlByFH@vmkernel#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VSCSIFsEmulateCommand@vmkernel#nover+0xx stack: 0x0
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VSCSI_FSCommand@vmkernel#nover+0xx stack: 0x1
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VSCSI_IssueCommandBE@vmkernel#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VSCSIExecuteCommandInt@vmkernel#nover+0xx stack: 0xb298e000
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PVSCSIVmkProcessCmd@vmkernel#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PVSCSIVmkProcessRequestRing@vmkernel#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]PVSCSI_ProcessRing@vmkernel#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VMMVMKCall_Call@vmkernel#nover+0xx stack: 0xx
    0xnnnnnnnnnnnn:[0xnnnnnnnnnnnn]VMKVMM_ArchEnterVMKernel@vmkernel#nover+0xe stack: 0x0

    This occurs if your virtual machine’s hardware version is 13 and it uses the SPC-4 feature for the large virtual disk.

    This issue is resolved in this release.

  • The Marvell Console device on the Marvell 9230 AHCI controller is not available

    According to the kernel log, the ATAPI device is exposed on one of the AHCI ports of the Marvell 9230 controller. This Marvell Console device is an interface for configuring RAID on the Marvell 9230 AHCI controller, and it is used by some Marvell CLI tools.

    As a result, the output of the esxcfg-scsidevs -l command on a host equipped with the Marvell 9230 controller does not show the SCSI device with the Local Marvell Processor display name.

    The information in the kernel log is:
    WARNING: vmw_ahci[XXXXXXXX]: scsiDiscover:the ATAPI device is not CD/DVD device 

    This issue is resolved in this release.

  • SSD congestion might cause multiple virtual machines to become unresponsive

    Depending on the workload and the number of virtual machines, diskgroups on the host might go into permanent device loss (PDL) state. This causes the diskgroups to not admit further IOs, rendering them unusable until manual intervention is performed.

    This issue is resolved in this release.

  • An ESXi host might fail with purple screen when running HBR + CBT on a datastore that supports unmap

    The ESXi functionality that allows unaligned unmap requests did not account for the fact that the unmap request may occur in a non-blocking context. If the unmap request is unaligned, and the requesting context is non-blocking, it could result in a purple screen. Common unaligned unmap requests in non-blocking context typically occur in HBR environments.

    This issue is resolved in this release.

  • An ESXi host might lose connectivity to VMFS datastore

    Due to a memory leak in the LVM module, you might see the LVM driver running out of memory under certain conditions, causing the ESXi host to lose access to the VMFS datastore.

    This issue is resolved in this release.
