NSX-v: Network performance issues within VXLAN with qfle3 NIC driver (workaround, SR closed)

Last week I faced an issue with network throughput within VXLAN on NSX-v with a specific NIC and NIC driver. I came across this problem because I am currently working on a ‘greenfield’ SDDC implementation project at one of our customers, and I was doing initial performance tests on the platform.

At this customer we have an HPE Synergy composable infrastructure running vSphere 6.5 U1 and NSX-v 6.3.3 (based on the VVD 4.1). The blades within the Synergy frame are 480 Gen10 blades with a QLogic 57840 10/20 Gigabit Ethernet Adapter. The firmware of the NIC is 7.15.xx, and the qfle3 driver versions that were used during the performance tests are the 1.0.60.0-1 and the 1.0.77.4-1. Refer to the VMware HCL for more supported driver versions.

Initially, we started testing (iPerf) the network throughput on normal VLANs, where we measured a throughput of 17.1 Gbits/sec between two VMs on two different physical ESXi hosts. When we ran the same test on a VXLAN (same layer 2 domain/same Logical Switch), the network throughput dropped to 7.1 Gbits/sec. Only when the VMs were running on the same physical host were we able to get 17+ Gbits/sec on the VXLAN. In other words, as soon as the VXLAN traffic hit the physical NICs, the performance dropped immediately.
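For reference, these numbers come from plain iPerf runs between two test VMs. A minimal sketch of such a test (the IP address, number of parallel streams and duration below are just example values, not the exact parameters we used; the flags are the same for iperf2 and iperf3):

# iperf3 -s
# iperf3 -c 192.168.10.20 -P 4 -t 60

The first command starts the server on the receiving VM; the second one runs a 60-second test with 4 parallel streams from the sending VM against it.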

When we ran the same tests with different driver versions, the results were the same, so we decided to create an SR with VMware. After we provided all the necessary information, the support team informed us that they had already seen the same issue with this NIC driver at other customers. They asked us to repeat the test with the VMs placed on the NSX Distributed Firewall (DFW) exclusion list. When we did that, the throughput increased to 16.7 Gbits/sec. This means that even with no custom firewall rules added to the DFW, you take an enormous performance penalty as soon as the traffic traverses the DFW on the ESXi host.

After we shared the results with the VMware support team, they informed us that there is no solution at this moment. The VMware NSX engineering team is working together with the third-party vendor(s) on a solution. The only workaround for now is adding the VMs to the DFW exclusion list to get full network performance. I will update this blog post when I receive a solution or other information from the engineering team.

*This issue also exists on the latest ESXi and NSX versions with this driver.

Current Status (22-3-2019): Not solved, only workaround available

Workaround: Place VMs on the DFW exclusion list or disable the DFW for the complete cluster
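For completeness: the exclusion list can be managed from the vSphere Web Client, and NSX-v also exposes it through the NSX Manager REST API. A rough sketch using curl (nsx-manager.local and vm-1234 are placeholders for your NSX Manager and the VM’s managed object ID; the endpoint below is the NSX-v 6.x exclusion list API as I know it, so double-check it against the API guide for your version):

# curl -k -u admin -X GET https://nsx-manager.local/api/2.1/app/excludelist
# curl -k -u admin -X PUT https://nsx-manager.local/api/2.1/app/excludelist/vm-1234
# curl -k -u admin -X DELETE https://nsx-manager.local/api/2.1/app/excludelist/vm-1234

The GET shows the current exclusion list, the PUT adds the VM to it, and the DELETE removes it again.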

SR Status: Closed (VMware can’t fix this at the moment, but will probably fix this in upcoming NSX releases and driver versions)

Statement from GSS:

Our developers have identified that, due to a throttling mechanism in the distributed firewall, there is a limit to the packets per second that VMs with a filter applied can handle. This limit will be slightly improved in one of the next releases; however, based on our internal findings it doesn’t look like we will be able to reach the same throughput as we see without filters applied, due to the technology used for filtering the traffic.

With that being said, if you do not use the distributed firewall you can disable it for the clusters. Otherwise, there will probably be a significant improvement in the release after the next one; however, I would not expect throughput close to 20 Gbps to be achieved even with this.

Current status:
It is a known issue to VMware. It is a result of the way the driver handles VXLAN traffic when DFW filters are applied. It isn’t specifically a DFW issue nor a qfle3 driver issue, but the combination of both. I am only the second customer to run into this limitation.

A slight improvement can be achieved by changing the qfle3 NIC driver module parameters at the host level:
# esxcli system module parameters set -p "enable_lro=0 enable_vxlan_filters=1 RSS=1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1" -m qfle3
# reboot

With this change implemented, the throughput increased to 11.5 Gbits/sec.
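To verify that the parameters actually stuck after the reboot, you can list them on the host (the grep is just there to keep the output readable):

# esxcli system module parameters list -m qfle3 | grep -E 'enable_lro|enable_vxlan_filters|RSS'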

Last update: 23-4-2019

If you have any questions or remarks, please reach out to me!

Sources: vmware.com, VMware GSS

 
