Saturday, April 15, 2017

Upgrading to vSphere 6.5 with NSX already installed

This has been a slow journey: I have so many different moving parts in my lab environment (all the better for testing myriad VMware products) that migrating to vSphere 6.5 was taking forever. First I had to wait for Veeam Backup & Replication to support it (can't live without backups!), then NSX, then I had to decide whether to discard vCloud Director (yes, I'm still using it; it's still a great multitenancy solution) or get my company to give me access to their Service Provider version...

I finally (finally! after over a year of waiting and waiting) got access to the SP version of vCD, so it was time to plan my upgrade...

My environment supports v6.5 from the hardware side; no ancient NICs or other hardware anymore. I was already running Horizon 7, so I had two major systems to upgrade prior to moving vSphere from 6.0U2 to 6.5a:

  • vCloud Director: 5.5.5-->8.0.2-->8.20.0 (two-step upgrade required)
  • NSX: 6.2.2-->6.3.1
There was one hiccup with those upgrades, and I'm sure it will be familiar to people with small labs: the NSX VIBs didn't install without "manual assistance." In short, I had to manually place each host into maintenance mode, kick off the "reinstall" to push the VIBs into the boot block, then restart the host. This wouldn't happen in a larger production cluster, but because mine is a 3-node VSAN cluster, hosts don't automatically/cleanly go into Maintenance Mode.

Moving on...

Some time ago, I switched from an embedded PSC to an external, so I upgraded that first. No problems.

Upgrading the stand-alone vCenter required a couple of tweaks: I uninstalled Update Manager from its server (instead of running the migration assistant: I didn't have anything worth saving), and I reset the console password for the appliance (yes, I'd missed turning off the expiration, and I guess it had expired). Other than those items? Smooth sailing.

With a new vCenter in place, I could use the embedded Update Manager to upgrade the hosts. I had to tweak some of the 3rd-party drivers to make them compatible, but then I was "off to the races."

After the first host was upgraded, I'd planned on migrating some low-priority VMs to it in order to "burn in" the new host and see if some additional steps would be needed (i.e., removing VIBs for unneeded drivers that have caused PSODs in other environments I've upgraded). But I couldn't.

Trying to vMotion running machines to the new host, I encountered network errors. "VM requires Network X which is not available". Uh oh.

I also discovered that one of the two DVS (Distributed Virtual Switch) for the host was "out of sync" with vCenter. And no "resync" option that would normally have been there...

Honestly, I flailed around a bit, trying my google fu and experimenting with moving VMs around, both powered-on and off, as well as migrating to different vswitch portgroups. All failing.

Finally, something inspired me to look at my VXLAN status; it came to me after realizing I couldn't ping the vmknic for the VTEPs because they sit on a completely independent IP stack, making it impossible to use vmkping with a VTEP as a source interface.

Bingo!

The command esxcli network vswitch dvs vmware vxlan list resulted in no data for that host, but valid config information for the other hosts.

A quick look at NSX Host Preparation confirmed it, and a quick look at the VIBs on the host nailed it down: esx-vsip and esx-vxlan were still running 6.0.0 versions.
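A check like this can be scripted. What follows is a minimal Python sketch, not the procedure I actually used: it parses text in the general shape of `esxcli software vib list` output and reports NSX VIBs whose version does not begin with the host's ESXi release. The sample output, its build numbers and the function name stale_nsx_vibs are hypothetical, and the sketch assumes the NSX VIB version string tracks the ESXi release (as the 6.0.0 versions above suggest).

```python
# Hypothetical excerpt shaped like `esxcli software vib list` output.
# The build numbers are illustrative only, not taken from a real host.
SAMPLE_VIB_LIST = """\
Name       Version             Vendor  Acceptance Level  Install Date
---------  ------------------  ------  ----------------  ------------
esx-base   6.5.0-0.14.5146846  VMware  VMwareCertified   2017-04-15
esx-vsip   6.0.0-0.0.5007049   VMware  VMwareCertified   2017-03-01
esx-vxlan  6.0.0-0.0.5007049   VMware  VMwareCertified   2017-03-01
"""

# The two NSX VIBs named in this post.
NSX_VIBS = {"esx-vsip", "esx-vxlan"}


def stale_nsx_vibs(vib_list_output, esxi_release):
    """Return {vib_name: version} for NSX VIBs not built for esxi_release.

    Assumption: an NSX VIB built for ESXi 6.5 carries a version that
    starts with "6.5.0-".
    """
    stale = {}
    for line in vib_list_output.splitlines():
        fields = line.split()
        # Skip the header, the separator row, and non-NSX VIBs.
        if len(fields) < 2 or fields[0] not in NSX_VIBS:
            continue
        name, version = fields[0], fields[1]
        if not version.startswith(esxi_release + "-"):
            stale[name] = version
    return stale


for name, version in sorted(stale_nsx_vibs(SAMPLE_VIB_LIST, "6.5.0").items()):
    print(f"{name} is still at {version}")
```

Feeding it the real command output from each host (captured however you prefer) would flag a host in the state described here before any VMs are moved to it.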

I went back through the process I'd used for upgrading NSX in the first place, and when the host came back up, DVS showed "in sync", NSX showed "green" install status and—most important of all—VMs could vMotion to the host and they'd stay connected!

UPDATE: The trick, it seems, is to allow the NSX Manager an opportunity to install the new VIBs for ESXi v6.5 before taking the host out of maintenance mode. By manually entering Maintenance Mode prior to upgrading, VUM will not take the host out of Maintenance, giving the Manager an opportunity to replace the VIBs. Once the Manager shows all hosts upgraded and green-checked, you can safely remove the host from Maintenance and all networking will work.

Monday, February 20, 2017

ADFS and SNI

SNI, or Server Name Indication, is an extension to TLS (Transport Layer Security, the evolutionary child of SSL/Secure Sockets Layer) that lets a client name the host it wants during the handshake, which permits multiple certificates (and therefore encrypted sessions) to be bound to the same IP address and TCP port.

Starting with ADFS v3.0 (aka ADFS for Windows Server 2012R2), Microsoft uses SNI by default. On the whole, this has little impact on most users of ADFS, except for one small, important subset: users that sit behind reverse proxies or hardware SSL-offload devices.

Readers of this blog know that I use the Citrix NetScaler VPX Express as a reverse proxy for my home lab; until I tried to stand up an ADFS server (running Server 2016) behind it—I'm going to start digging into Office 365 in a serious way and want the most seamless user experience—I'd never had a problem with it.

I just could NOT figure out why the ADFS system was immediately rejecting connections via NetScaler, while it was perfectly happy with local connections.

I knew things were problematic as soon as I did packet captures on the NetScaler: the [SYN]-[SYN, ACK]-[Client Hello] were immediately followed by [RST, ACK] and a dropped connection.


Once I "fired up" a copy of Wireshark and pulled some captures at the ADFS host, however, I was able to compare the difference between the NetScaler-proxied connections that were failing, and on-prem connections that were successful.



At that point, I could explicitly compare the two different [Client Hello] packets and see if I could tell the difference between the two...

Unfortunately, I started by comparing the protocols, ciphers and hash algorithms. It took a while to get the TLS1.2 setup just right to mimic the local connection, but no joy. Then I went after the extensions: only one extension was in the "misbehaving" [Client Hello].
There are a bunch of extensions in the "working" [Client Hello]:
holy crap

To make my task easier, I switched back to google-fu to see if I could narrow down the search; voila!

I found an article that talked about handling ADFS clients that don't support the SNI extension, and the lightbulb went on: my browsers do SNI, but when the NetScaler acts as a proxy, SNI support on its connection to the back-end server is disabled by default.
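To see the difference without a packet capture, here is a minimal Python sketch (standard library only) that generates a Client Hello in memory, once with a server name and once without, and checks whether the hostname shows up in the raw bytes. The hostname adfs.corp.example and the function name client_hello_bytes are placeholders of mine; this illustrates the SNI extension itself, not the NetScaler's behavior.

```python
import ssl


def client_hello_bytes(server_hostname=None):
    """Return the raw bytes of the first handshake flight (the Client Hello)."""
    ctx = ssl.create_default_context()
    # Hostname checking requires a server name, so disable it to allow
    # the no-SNI case. Nothing is verified here: the handshake never completes.
    ctx.check_hostname = False
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the Client Hello has been written and TLS is now
        # waiting for a server reply that will never arrive.
        pass
    return outgoing.read()


HOST = "adfs.corp.example"  # placeholder hostname
with_sni = client_hello_bytes(HOST)
without_sni = client_hello_bytes()

# The SNI extension carries the hostname in clear text, so it should be
# findable only in the Client Hello that was built with a server name.
print("hostname in Client Hello with SNI:   ", HOST.encode() in with_sni)
print("hostname in Client Hello without SNI:", HOST.encode() in without_sni)
```

A server that selects its certificate binding by name has nothing to go on in the second case, which is consistent with the immediate reset I saw on the proxied connections.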

Luckily there are two fixes:
  1. Update the ADFS server with a "blanket" or "fallback" binding for the ADFS service (see https://blogs.technet.microsoft.com/applicationproxyblog/2014/06/19/how-to-support-non-sni-capable-clients-with-web-application-proxy-and-ad-fs-2012-r2/)
  2. Update the NetScaler service entry (in the SSL Parameters section) to support SNI for the expected client hostname.
I went with the latter; that way I don't modify any more of the ADFS host than necessary, and because the NetScaler is essentially acting as a client while it's doing its proxy duties, that seemed to make the most sense.

Within a minute of adding the SNI extension, the ADFS system worked as expected.

Wednesday, February 15, 2017

SSL Reverse Proxy using Citrix NetScaler VPX Express

Part 6 in a series

In previous posts I covered the configuration of the NetScaler VPX Express for use as an intelligent reverse proxy, allowing the use of a single public IP address with multiple interior hosts.

In recent days, I've been working on adding Horizon View to my home lab; in addition to requisite Connection Servers, I'm using the EUC Access Point virtual appliance as a security gateway instead of Security Servers paired with dedicated Connection Servers.

The procedure I outline for the creation of a content-switching configuration works as you'd expect...to a point.

I found that I kept getting "Tunnel reconnection is not permitted" errors when trying to log in using the dedicated Horizon Client; this was extremely frustrating because HTML access (using nothing but an HTML5-compatible browser) was working flawlessly.

Upon reviewing the client logs, I noticed that the response from the tunnel connection (HTTP/1.1 404 Not Found) was from IIS, not a Linux or other non-Windows webserver. In my configuration, my content-switching plan uses a Windows IIS server as the fall-through (default/no-match).

Theory: for whatever reason, while the registration process for the Horizon Client was being properly switched to the Access Point, login via tunnel was not.

By capturing a trace (including SSL decoding) at the NetScaler and reviewing it in Wireshark, I was able to see that the client is using two different host strings, one during the initial login followed by a second one during tunnel creation.

What's the difference? The initial login doesn't include the port number in the host string; the tunnel request includes it...
Login: vdi.corp.com
Tunnel: vdi.corp.com:443
The fix is to add an additional match criterion to your content switching policy:
Before: HTTP.REQ.HOSTNAME.EQ("vdi.corp.com")
After: HTTP.REQ.HOSTNAME.EQ("vdi.corp.com")||HTTP.REQ.HOSTNAME.EQ("vdi.corp.com:443")
You can also create an additional policy with the "fqdn:443" match, but editing the policy was faster to implement.
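The effect of the two host strings is easy to reproduce outside the NetScaler. This is a minimal Python sketch of the matching logic only: hostname_eq stands in for the original exact-match expression and host_matches for the amended one. Both function names are mine, and nothing here is NetScaler code.

```python
def hostname_eq(host_header, expected):
    """Exact string comparison, like the original single EQ("...") policy."""
    return host_header == expected


def host_matches(host_header, fqdn, port=443):
    """Accept the bare FQDN or the FQDN with an explicit port,
    like the amended policy with two EQ("...") terms."""
    return host_header in (fqdn, f"{fqdn}:{port}")


FQDN = "vdi.corp.com"
for host in ("vdi.corp.com", "vdi.corp.com:443", "www.corp.com"):
    print(f"{host!r}: before={hostname_eq(host, FQDN)} after={host_matches(host, FQDN)}")
```

With the exact match, the tunnel request's "vdi.corp.com:443" falls through to the default target, which is why the 404 came from IIS.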

UPDATE: I've done some more digging, and there are additional arguments/functions that would also work—and would've worked transparently had I used them in the first place—instead of the EQ("") expression:
HTTP.REQ.HOSTNAME.CONTAINS("vdi.corp.com")
HTTP.REQ.HOSTNAME.SERVER=="vdi.corp.com"
HTTP.REQ.HOSTNAME.STARTSWITH("vdi.corp.com")
HTTP.REQ.HOSTNAME.PREFIX('.',0).EQ("vdi")
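These alternatives are not equally strict. The Python sketch below approximates three of them with plain string operations to show which host strings each would accept; it models the semantics as I understand them (substring, prefix, and host-without-port comparison) rather than running anything on the NetScaler, and the test hostnames are made up.

```python
FQDN = "vdi.corp.com"


def contains(host):
    """Approximates CONTAINS("vdi.corp.com"): substring match."""
    return FQDN in host


def starts_with(host):
    """Approximates STARTSWITH("vdi.corp.com"): prefix match."""
    return host.startswith(FQDN)


def server_equals(host):
    """Approximates comparing HOSTNAME.SERVER: drop any ":port", then exact match."""
    return host.partition(":")[0] == FQDN


# Made-up host strings: the two real cases plus two look-alikes.
CANDIDATES = [
    "vdi.corp.com",
    "vdi.corp.com:443",
    "vdi.corp.com.example.net",
    "notvdi.corp.com",
]

for host in CANDIDATES:
    print(f"{host:26} contains={contains(host)!s:5} "
          f"starts_with={starts_with(host)!s:5} server_equals={server_equals(host)}")
```

If that model is right, the SERVER comparison is the tightest of the three, since it ignores the port but still requires the whole name to match.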