How to Improve HCX Performance

By: Dennis Boeynaems @DennisBoeynaems

This article is about how to troubleshoot suboptimal performing migrations levering HCX in the context of Azure VMware solution

Architecture

As a reminder here’s a quick recap on how HCX can be setup

HCX over IPSEC

HCX Over Express Route

HCX over Express Route

HCX’ control pane on-premise is often referred to as “the connector”. The AVS HCX is controlled by “the manager”. The actual migrations themselves as well as the L2VLAN stretch are supported across the so-called ‘service mesh’. This consists out of a maximum number of 5 appliances:

The most relevant being:

  • IX: The interconnect appliance
  • NE: Network extension, the appliance responsible for the high performance of VLAN extension
  • WANOP: Line conditioning and deduplication

Getting access and see how they perform is done by accessing the connector (the manager, on the cloud side is not accessible through SSH) directly on-prem. Remember the connector is on-prem, the manager is in the cloud. There is one connector on the source side it is paired to the manager on the cloud side. You cannot connect multiple connectors to each other.

Ssh (or putty if you are on windows) [email protected]<connector IP address/hostname>

On the prompt type ccli (to get into the HCX central CLI):

  • [admin]@ list - List all appliances
    1
    2
    
      0 - UAT-SM-IX-I1 - 172.27.103.21:9443 - Connected
      1 - UAT-SM-NE-I1 - 172.27.103.22:9443 - Connected
    
  • [admin]@ go 0 - Activate the correct appliance based on the ID column (as shown above)
    Here we are selecting the IX appliance (interconnect) as we want to investigate the performance of the interconnect

  • [admin]@ perftest reachability - Validating the service mesh connectivity end to end
    This will produce a lot of information if you hit Y on the below question. Results range from simple ping tests, packet transmissions etc.
    1
    2
    3
    4
    5
    6
    7
    
    ========== PERFTEST ALL STARTING  ==========
    == WanOpt is Present ==
    == TOTAL # of TESTs : 11 ==
    == ESTIMATED  TEST DURATION :  12 minutes ==
     -T option to change individual test duration [default 60 sec]
     -k option to skip 'perftest uplink' if tcp port 4500 or 22 not opened  
    == Are you ready to start ?? [y/n]: y
    

  • [admin]@ perftest uplink - This is to validate the uplink to for baseline WAN connection verification
    Purpose here is to, well validate WAN connectivity measuring bandwidth, RTT, retransmits etc
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    
    ========== Testing UPLINKs ==========
    Testing uplink reachability…
    Uplink-0 round trip time:
    rtt min/avg/max/mdev = 17.086/17.244/17.649/0.223 ms
    Uplink native  throughput test is initiated from LOCAL site.
    ++++++++++ StartTest ++++++++++
    ---------- Uplink-0 [te_0, 172.27.103.21 >>> 172.24.131.50]   ----------
    Duration           Transfer     Bandwidth         Retransmit
    server workload started
    [  4]   0.00-60.00  sec   931   MBytes   130 Mbits/sec    322               sender
    [  4]   0.00-60.00  sec   928   MBytes   130   Mbits/sec                    receiver
    Done
    ---------- Uplink-0 [te_0, 172.27.103.21 <<< 172.24.131.50]   ----------
    server workload started
    Duration           Transfer       Bandwidth       Retransmit
    [  4]   0.00-60.00  sec   995   MBytes   139 Mbits/sec  445               sender
    [  4]   0.00-60.00  sec   994   MBytes   139   Mbits/sec                    receiver
    Done
    
  • [admin]@ perftest all - Of interest to us in this output is the very latest last part of the output
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    
    ================ SUMMARY OF RESULTS ===================
    ** Total Test Duration = 9.5 minutes **
    (Each Test Duration  = 60 sec)
    (Each IPSEC Test Duration  = 30 sec)
    (WANOPT TCPPERF Test Duration= 60 sec)
    ** Throughput Report **
    |-------------------------------------------------------|
    | Test Name    | IF # | Fwd            | Rev            |
    |-------------------------------------------------------|
    | UPLINK       | 0    | 130 Mbits/sec  | 139 Mbits/sec  |
    |-------------------------------------------------------|
    | UPLINK       | 0    | 130 Mbits/sec  | 139 Mbits/sec  |
    |-------------------------------------------------------|
    | IPSEC Tunnel | 0    | 138 Mbits/sec  | 41.1 Mbits/sec |
    |-------------------------------------------------------|
    | WANOPT       | 0    | 37 Mbits/sec   | 21 Mbits/sec   |
    |-------------------------------------------------------|
    | SITE         | 0    | 3.31 Gbits/sec | 3.99 Gbits/sec |
    |-------------------------------------------------------|
    ** Latency Report **
    |--------------------------------------|
    | Uplink IF# | Min/Avg/Max             |
    |--------------------------------------|
    | 0          | 15.729/16.115/16.503 ms |
    |--------------------------------------|
    

In this instance the difference in performance between WANOP and SITE is too big.

This is a clear indicator the WANOPT appliance, it being a bit of a resource hog most likely is residing on a datastore designed for hosting low demanding workload and therefore most likely is on low performing (spinning) disk.

This appliance needs to be sitting on SSD with a lot less latency. Please move the appliance over to a higher performing datastore and then run the test again. The data and subsequent performance of HCX should be considerably higher as a result of this action.

I hope this article provided some more insights in how to get more visibility and extract relevant data from the various appliances together forming the HCX service mesh.