Sunday, March 8, 2015

Fix vShield Manager after modifying vSphere VDS uplinks

If you've been following my posts about upgrading my home lab, you know that I removed the add-in 1Gbps NICs and consolidated the motherboard-based 1Gbps NICs on one DVS (distributed virtual switch) in order to add 10Gbps support to my hosts. In that process, I not only rearranged the physical NICs assigned to the uplinks, but also updated the uplink names in order to keep my environment self-documenting.

Things mostly went as planned, but I didn't expect vShield Manager (vSM) to choke on the changes: after updating the uplink names on the DVS that backs the VXLAN port group, I expected vSM to recognize the change and keep creating new VXLAN networks without issue. I was wrong.

The first symptom that I had an issue was the inability of vCloud Director (vCD) to create a new Organizational Network on a deployed Edge device:
So: something is off with the teaming policy. Time to look at vSM to determine whether vCD is sending a bad request to vSM, or whether vSM itself is the source of the issue. The easiest way to check is to manually create a new network in vSM; if it succeeds, vCD is sending a bad request; otherwise, I need to troubleshoot vSM, and possibly vCenter, too.
Boom: the problem is reproduced even for a test directly in vSM. Time to verify the teaming in the base portgroup in vCenter.
Oops. I hadn't updated the portgroup for VXLAN after moving the uplinks around, although I had done so for the other portgroups on the DVS.
Unfortunately, updating the portgroup to use all the available uplinks didn't help. However, in the process, I discovered an unexpected error in vCenter itself:
vSM was making an API call to vCenter that included one of the old uplink names, one which no longer existed on the DVS. To test the theory, I added a couple of additional uplink ports to the DVS and renamed one to match the missing port. It worked, but not as expected:
vSM was able to send a proper API call to vCenter, but the portgroup had sub-optimal uplink settings: of the two active uplinks, only one had an actual, physical uplink associated with it. This was not a redundant connection, even though it looked like it.

Time to restart vSM to get it to re-read the vCenter DVS config, right? Wrong. Even after a restart and re-entering the vCenter credentials, the stale state persisted.

At this point, my Google-fu failed me: no useful hits on a variety of search terms. Time to hit the VMware Community Forums with a question. Luckily, I received a promising answer in just a day or two.

I learned that one can use vSM's REST API to reconfigure it and bring it back in line with reality. But how do you work with arbitrary REST calls? It turns out there's a REST client plug-in for Firefox, written to troubleshoot and debug REST APIs. It works a treat:
  1. Set up the client for authenticated headers
  2. Retrieve the DVS configuration as an XML blob in the body of a GET call
  3. Modify the XML blob so that it has the correct properties
  4. PUT the revised XML blob back to vSM.
Voila! Everything works.

Specifics:
1) Use an Authenticated GET on the switches API
2) Using the objectId of the desired DVS, get the specific switch data
3) Update the XML blob with the correct uplink names
4) PUT the revised XML blob
As soon as this blob was accepted with a 200 OK response, I re-ran my test in vSM: success! vCD was then able to create the desired portgroup as well.
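
For anyone who would rather script those four steps than click through a browser plug-in, here's a rough Python equivalent using the requests library. Treat it as a sketch: the /api/2.0/vdn/switches paths, the uplinkPortName element name, and the hostname, credentials, and objectId are assumptions to verify against your vSM version's API guide before trusting it.

  import requests
  import xml.etree.ElementTree as ET

  VSM = 'https://vsm.lab.local'              # vShield Manager (placeholder)
  AUTH = ('admin', 'password')               # vSM credentials (placeholder)

  # 1) Authenticated GET on the switches API; note the objectId of your DVS.
  resp = requests.get(VSM + '/api/2.0/vdn/switches', auth=AUTH, verify=False)
  resp.raise_for_status()
  print(resp.text)                           # look for something like 'dvs-123'

  # 2) Using that objectId, get the specific switch data as an XML blob.
  switch_id = 'dvs-123'                      # placeholder objectId from step 1
  resp = requests.get(VSM + '/api/2.0/vdn/switches/' + switch_id,
                      auth=AUTH, verify=False)
  root = ET.fromstring(resp.text)

  # 3) Update the blob, swapping any stale uplink names for the current ones.
  for uplink in root.iter('uplinkPortName'):
      if uplink.text == 'Old-Uplink-Name':
          uplink.text = 'New-Uplink-Name'

  # 4) PUT the revised blob back; a 200 OK means vSM accepted the change.
  resp = requests.put(VSM + '/api/2.0/vdn/switches/' + switch_id,
                      data=ET.tostring(root), auth=AUTH, verify=False,
                      headers={'Content-Type': 'application/xml'})
  print(resp.status_code)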

Key takeaways:
  1. REST Client for Firefox is awesome for arbitrary interaction with a REST API
  2. Sometimes, the only way to accomplish a goal is through the API; a GUI or CLI command may not exist to fix your problem.
  3. This particular fix allows you to arbitrarily rename your uplinks without having to reset the vShield Manager database and completely reinstall it to get VXLAN working again.

Monday, February 23, 2015

Planning for vSphere 6: VMCA considerations

With the imminent release of vSphere 6, I've been doing prep work for upgrades and new installs. There's a lot of information out there (just check out the vSphere 6 Link-O-Rama at Eric Siebert's vSphere Land for an idea of the breadth & depth of what's already written), but not as much as I'd like for making good, future-proof decisions about the setup.

I'm sure I join lots of VMware admins in looking forward to the new features in vSphere 6—long-distance vMotion, cross-datacenter & cross-vCenter vMotion, multi-vCPU fault tolerance (FT), etc.—but along with these features come some foundational changes in the way vSphere management & security are architected.

Hopefully, you've already heard about the new PSC (Platform Services Controller), the functional descendant of the SSO service introduced in vSphere 5.1. SSO still exists as a component of the PSC, and the PSC can be co-installed ("embedded") on the same system as vCenter, or it can be independent. Like SSO on vSphere 5.5, the PSC has its own internal, dedicated database which it replicates with peer nodes, similar to what we've come to know and expect from Microsoft Active Directory.

This replication feature not only helps geographically-distributed enterprises—allowing a single security authority for multiple datacenters—but also enables high availability in a single datacenter through the use of 2 (or more) PSCs behind a load balancer. Note the emphasis on the load balancer: the PSC is abstracted behind a DNS name pointing at an IP address on your load-balancing solution, rather than at the name/IP of any individual PSC.
PSC in load-balanced HA configuration

This delegation means you must plan ahead of time for using load balancing; it's really not the sort of thing that you can "shim" into the environment after implementing a single PSC.
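
Since every service in the environment will register against that one load-balanced name, it's worth a quick sanity check that the name really does resolve to the VIP and not to one of the PSC nodes directly. A trivial sketch; the hostnames are hypothetical placeholders:

  import socket

  psc_alias = 'psc.lab.local'                          # name that services register against
  psc_nodes = ['psc01.lab.local', 'psc02.lab.local']   # the actual PSC appliances

  vip = socket.gethostbyname(psc_alias)
  node_ips = {socket.gethostbyname(n) for n in psc_nodes}
  print('%s resolves to %s' % (psc_alias, vip))
  if vip in node_ips:
      print('WARNING: the alias points at an individual PSC, not the load balancer VIP')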

Joining SSO in the PSC "black box" are several old and some brand new services: identity management, licensing...and a new Certificate Authority, aka VMCA (not to be confused with vCMA, the vCenter Mobile Access fling).

It's that last item—the Certificate Authority—that should make you very nervous in your planning for vSphere 6. The documentation so far indicates that you have four different modes for your CA implementation, independent of your PSC implementation choices:
  • Root (the default). The VMCA is the source of all certs for dependent services, and its public key/cert must be distributed and placed in your trusted CA store.
  • Intermediate. Source of all certs for dependent services, but the CA itself gets its trust from a parent CA, which in turn must have its public key/cert distributed. In the case of corporate/Enterprise CA/PKI infrastructure, this will already be in place and will be my go-to configuration.
  • None/External-only. All services receive their certs from a different CA altogether. This model is equivalent to the pre-6 practice of removing all the self-signed certificates and replacing them with signed certificates. With the proliferation of services, each using its own certificate, this model is becoming untenable.
  • Hybrid. In the hybrid model, the VMCA provides certificates to services that provide internal communication (either service-to-service or client-to-service) while public CA-signed certs are used in the specific places where 3rd-party clients will interact. In this model, the VMCA may act as either root or intermediate CA.
Confused? Just wait: it gets more complicated...

Migrating from one model to another will have risks & difficulties associated with it. The default installer will set you up with a root CA; you will have the option to make it an intermediate at time of install. As near as I can tell from the available documentation, you will need to reinstall the PSC if you start with it as a root CA and decide you want it instead to be an intermediate (or vice-versa). This is consistent with other CA types (eg, Microsoft Windows), so there's no surprise there; however, it's not clear what other replicated services will be impacted when trying to swap CA modes, as it will require previously-issued certificates to be revoked and new certificates to be issued.

You can replace some or all of the certificates the VMCA manages with 3rd-party (or Enterprise) signed certs, but once you do, you will have to deal with re-issue & expiration on your own. I can't find anything documenting whether this is handled gracefully & automatically for VMCA-signed certs & services, similar to the centralized/automated certificate management that Windows enjoys in an Enterprise CA environment.
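
If you do take that burden on, you'll at least want advance warning before something expires. Here's a minimal sketch of a check you could schedule, assuming Python 3 with the cryptography package and network access to each endpoint; the hostnames are placeholders:

  import ssl
  from datetime import datetime
  from cryptography import x509

  ENDPOINTS = [('vcenter.lab.local', 443), ('psc.lab.local', 443)]   # placeholders

  for host, port in ENDPOINTS:
      # Grab the presented certificate without validating it (that's the point:
      # we want the expiration date even if the chain isn't trusted here).
      pem = ssl.get_server_certificate((host, port))
      cert = x509.load_pem_x509_certificate(pem.encode())
      days_left = (cert.not_valid_after - datetime.utcnow()).days
      print('%-25s expires %s (%d days)' % (host, cert.not_valid_after.date(), days_left))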

There isn't any documentation on switching from 3rd-party (or Enterprise) back to a VMCA-signed certificate. Presumably, it'll be some CLI-based witchcraft...if it's allowed at all.

Finally, keep in mind that DNS names factor heavily into certificate trust. Successfully changing the name and/or IP address of an SSO server—depending on which was used for service registration—can be challenging enough. Doing the same with a system that is also a certificate authority will be doubly so.

So: what's the architect going to do?

For small, non-complex environments, I'm going to recommend what VMware and other bloggers recommend: stick with the single, combined PSC and vCenter server. Use the VCSA (vCenter Server Appliance) to save on the Windows license if you must, but I personally still prefer the Windows version: I'm still far more comfortable with managing the Windows OS environment & database than I am with Linux. Additionally, you're going to want Update Manager—still a Windows service—so I find it easier to just keep them all together.


This also suggests using the VMCA as a root CA, and I'll stick with that recommendation unless you already have an Enterprise CA. If you do, why not make the VMCA an intermediate? At a minimum, you'll eliminate the need for yet another root certificate to distribute. More importantly, it's vastly easier to replace an intermediate CA—even through the pain of re-issuing certificates—than a root CA.

What constitutes small, non-complex? For starters, any environment that exists with one—and only one—vCenter server. You can look up the maximums yourself, but we're talking about a single datacenter with one or two clusters of hosts, so fewer than 65 hosts for vSphere 5.5; in practice, we're really talking about environments with 20 or fewer hosts, although I have seen larger ones that would still fit this category because—other than basic guest management (eg, HA & DRS)—they aren't really using vCenter for anything. If it were to die a horrible death and be redeployed, the business might not even notice!

Even if you have a small environment by those standards, however, "complex" enters the equation as soon as you implement a feature that is significantly dependent on vCenter services: Distributed Virtual Switch, Horizon View non-persistent desktops, vRealize Automation, etc. At this point, you now need vCenter to be alive & well pretty much all the time.

In these environments, I was already counseling the use of a full SQL Server instance, not SQL Express with all of its limitations. Even when you're inside the "performance bubble" for that RDBMS, there are a host of other administrative features you must do without that can compromise uptime. With vSphere 6, I'm continuing the recommendation, but taking it a step further: use AlwaysOn Availability Groups for that database as soon as they're certified. It's far easier to resurrect a cratered vCenter server with a valid copy of the database than to rebuild everything from scratch; I know VMware wants us all to treat the VCSA as a tidy little "black box," but I've already been on troubleshooting calls where major rework was required because no maintenance of the internal PostgreSQL database was ever done, and the whole-VM backup was found wanting...

Once you've got your database with high availability, split out the PSC from vCenter and set up at least two of them, the same way you'd set up at least two Active Directory domain controllers. This is going to be the hub of your environment, as both vCenter and other services will rely on it. Using a pair will also require a load balancing solution; although there aren't any throughput data available, I'd guess that the traffic generated for the PSC will be lower than the 10Mbps limit of the free and excellent Citrix NetScaler VPX Express. I've written about it before, and will be using it in my own environment.

Add additional single and/or paired PSCs in geographically distant locations, but don't go crazy: I've seen blogs indicating that the replication domain for the PSC database is limited to 8 nodes. If you're a global enterprise with many geographically-diverse datacenters, consider a pair in your primary, most critical datacenter and single nodes in up to 6 additional datacenters. Have more than 7 datacenters? Consider the resiliency of your intranet connectivity and place the nodes where they will provide the needed coverage based on latency and reliability. If you're stumped, give your local Active Directory maven a call; he/she has probably dealt with this exact problem already—albeit on a different platform—and may have insight or quantitative data to help you make your decision.

Finally, I'm waiting with anticipation for an official announcement of FT support for vCenter Server, which will eliminate the need for more-complex clustering solutions in environments that can support it (from both a storage & network standpoint: FT in vSphere 6 is completely different from FT in previous versions!). Until then, the vCenter Server gets uptime & redundancy more through keeping its database reliable than anything else: HA for host failure; good, tested backups for VM corruption.

Tuesday, February 10, 2015

HP StoreVirtual VSA: The Gold Standard

HP has owned the LeftHand storage system since late 2008, and has made steady improvements since then. The product had already been officially supported on a VM; not only did the acquisition not destroy that option, but HP has embraced the product as a cornerstone of their "software-defined storage" marketing message.

Although other products existed back in 2008, a virtualized Left Hand node was one of the first virtual storage appliances (VSA) available with support for production workloads.

Fast-forward to August 2012: HP elects to rebrand the LeftHand product as StoreVirtual, renaming SAN/iQ to LeftHand OS in order to preserve its heritage. The 10.0 version update was tied to the rebranding, and the VSA arm of the portfolio—HP never stopped producing "bare-metal" arrays based on their 2U DL380 server chassis—promised additional enhancements like increased capacity (10TB instead of 2TB) and better performance (2 vCPUs instead of 1), along with price drops.

The 11.0 version was released with even more features (11.5 is the production/shipping version for both bare-metal and VSA), chief of which—in my opinion—is Adaptive Optimization (AO), the ability for node-attached storage to be characterized in one of two tiers.

Note that this isn't a Flash/SSD-specific feature! Yes, it works with solid state as one of the tiers—and that is the preferred architecture—but any two performance-vs-capacity tiers can be configured for a node: a pair of 15K RPM SAS drives as Tier 0 performance with 4-8 NL SAS drives as Tier 1 capacity is just as legitimate. HP does, however, caution the architect against mixing nodes with varying AO characteristics in one cluster, just as it cautions against mixing dissimilar single-tier nodes.

Personally, I've played with the StoreVirtual VSA off and on over the years. The original hold-back to getting deeply into it was the trial duration: 30 to 60 days is insufficient to "live with" a product and really get to know it. In early 2013, however, HP offered NFR licensing to qualified members of the VMware vExpert community, and those licenses had a year-long duration.

Unfortunately, the hosts I was running at home were pretty unsuited to supporting the VSA: limited RAM and 2-4 grossly inferior desktop-class SATA hard drives in each of 2 hosts. I'd still load up the VSA for test purposes; not for performance, but to understand the LeftHand OS better: how failures are handled, how configurations are managed, and how the product interacts with other software like Veeam Backup & Replication. But then I'd tear down the cluster when I finished with it in order to reclaim the consumed resources.

When PernixData FVP was still in pre-GA beta, I was able to make some system upgrades to add SSD to newer hosts—still with essentially zero local capacity, however—and was able to prove to myself that a) solid state works very effectively at increasing the performance of storage and b) there is a place for storage in the local host.

With the release of the first VMware Virtual SAN beta, I decided it was time to make some additional investments into my lab, and I was able to not only add a third host (the minimum for supported VSAN deployment) but also provision them all with a second SSD and enterprise SATA disks for the experiment. In that configuration, I was able to use one SSD for iSCSI-based performance acceleration (using the now-GA FVP product) and a second SSD for VSAN's solid state tier. My hosts remained limited in the number of "spinning disk" drives that could be installed (four), but in aggregate across three hosts, the total seemed not only reasonable, but seemed to work in practice.

Unfortunately, I was plagued by hardware issues in this configuration: rarely a week went by without either FVP or VSAN complaining about a drive going offline or entering "permanent failure," and in the weeks when that didn't occur, the Profile-Driven Storage service of vCenter—which is critical to making use of VSAN in other products like vCloud Director or Horizon View—would need to be restarted. Getting FVP or VSAN working correctly again would usually require rebooting the host reporting the issue; in some cases, VMs would need to be evacuated from VSAN to provide the free space necessary to retain "availability."

In short, the lab environment with my 1Gbps networking and consumer-grade disk & HBA made VSAN and FVP a little too much work.

But I still had that VSA license... If I could get a better HBA—one that would perform true hardware-based RAID and have a deeper queue, not to mention other enterprise SATA/SAS capabilities—I'd be able to leverage the existing disk investment with the VSA and have a better experience.

I was able to source a set of Dell PERC H700 adapters, cables and cache batteries from eBay; these were pulled from R610 systems, so dropping them into mine was trivial and the set cost considerably less than a single kit from Dell. Although I could have rebuilt the VSAN and FVP environments on the new HBA—each disk in the system would need to be set up as a single-spindle RAID0 'virtual volume'—I went with a RAID1 set for the pair of SSD and a RAID5 for the spindles. I would be able to continue leveraging PernixData for acceleration using the RAM-backed function, but I was done messing with VSAN for now.

Setting up the v11.5 VSA initially gave me pause: I was booting ESXi from SD card, so I could dedicate 100% of the SSD/HDD capacity to the VSA, but how best to present it? If the LeftHand OS had drivers for the PERC array—possible: the core silicon of the H700 is an LSI/Symbios product which might be supported in spite of being a Dell OEM—I could use DirectPath I/O, provided there was another datastore available on which to run the VSA. A second, similar alternative would be to manually create physical RDM mappings for the RAID volumes, but that still left the problem of a datastore for the VSA itself. Yes, I could run the VSA on another array, but if the host ever had issues with that array, then I'd also end up with issues on my LeftHand cluster—not a good idea!

My final solution is a hybrid: the HDD-based RAID group is formatted as a VMFS5 datastore, and the VSA is the only VM using it. A large 1.25TB 'traditional' VMDK is presented from that datastore (leaving ~100GB free for the VSA boot drive and files), while the SSD-based RAID group is presented as a physical RDM. This configuration let me enable AO on each node and get an SSD performance boost along with some deep storage from the collection of drives across all three nodes.
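
For completeness, here's roughly what attaching that SSD RAID group as a physical-mode RDM could look like if you script it with pyVmomi instead of clicking through the vSphere Client. This is a sketch under stated assumptions, not a recipe: the vCenter name, credentials, VM name, NAA device path, and free SCSI unit number are all placeholders, and some environments may also want lunUuid and capacity filled in on the backing.

  import ssl
  from pyVim.connect import SmartConnect, Disconnect
  from pyVmomi import vim

  si = SmartConnect(host='vcenter.lab.local', user='administrator@vsphere.local',
                    pwd='********', sslContext=ssl._create_unverified_context())
  content = si.RetrieveContent()

  # Find the VSA virtual machine by name (hypothetical name).
  view = content.viewManager.CreateContainerView(content.rootFolder,
                                                 [vim.VirtualMachine], True)
  vm = next(v for v in view.view if v.name == 'StoreVirtual-VSA-01')
  view.Destroy()

  # Reuse the first SCSI controller already present on the VM.
  ctrl = next(d for d in vm.config.hardware.device
              if isinstance(d, vim.vm.device.VirtualSCSIController))

  # Physical-mode RDM backing pointing at the raw SSD RAID volume (example NAA id).
  backing = vim.vm.device.VirtualDisk.RawDiskMappingVer1BackingInfo()
  backing.deviceName = '/vmfs/devices/disks/naa.600508e0000000000000000000000000'
  backing.compatibilityMode = 'physicalMode'
  backing.diskMode = 'independent_persistent'
  backing.fileName = ''                  # mapping file lands in the VM's folder

  disk = vim.vm.device.VirtualDisk()
  disk.backing = backing
  disk.controllerKey = ctrl.key
  disk.unitNumber = 1                    # assumes SCSI(x:1) is free
  disk.key = -1

  spec = vim.vm.device.VirtualDeviceSpec()
  spec.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
  spec.fileOperation = vim.vm.device.VirtualDeviceSpec.FileOperation.create
  spec.device = disk

  task = vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=[spec]))
  Disconnect(si)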

In practice, this array has been more trouble-free than my VSAN implementation on (essentially) identical hardware. A key difference, however, has been inter-node communication performance: with VSAN, up to four interfaces can be configured simultaneously for inter-node traffic, increasing bandwidth and lowering latency. Even with the lower performance characteristics of the disks and HBA in each host, VSAN could saturate two of the four gigabit interconnects I had configured (when performing sequential reads & writes, eg, backups & Storage vMotion), so the single gigabit connection available to the VSA was very noticeable.

I have since migrated my network environment to use 10Gbps Ethernet for my back-haul connectivity (iSCSI, NAS, vMotion) and have subjective evidence of improved performance of the LeftHand array. I'll update this post with objective test results when the opportunity presents itself.

Citrix NetScaler UI changes

Which is worse?

  • Searching for the solution to a problem and not being able to find it
— or —
  • Finding the exact solution to a problem in a blog, but discovering that it's an older post using outdated products and documenting an API or UI that no longer exists?

This question comes from some feedback I received on a series of posts I put together that documents my use of the Citrix NetScaler VPX Express virtual appliance as a reverse proxy.

Citrix is doing the right thing: they're rebuilding the GUI in the NetScaler to eliminate Java (as much as possible). It has been a slow-going process, starting with the 10.0 version (as of this writing, 10.5 is current, and there are still one or two places that use a Java module), and one of the drawbacks is that the new HTML-only UI elements can't duplicate the Java UI—so things are...different.
HA Setup, v10.0 & earlier
HA Setup, v10.5
In the screencaps above, you see the older Java-based dialog box and the newer HTML page. They share some of the same data, but they are neither identical nor reached from exactly the same place in the principal UI.

How does a blogger serve his/her audience? Does one ignore the past and soldier on, or does one revisit the old posts and update them for a new generation of software? If I had positioned myself as a NetScaler expert, the answer would be obvious: UI changes in and of themselves would be post-worthy, and revisiting old functions to make them clear under the new UI would make perfect sense.

In this case, however, I have only had a couple of requests for revised instructions using the equivalent UI; I'm not a NetScaler guru, and to be perfectly frank, I haven't the time needed to redo the series. If I get a lot more feedback that this series needs to be updated, I'll think about a second edition using the new UI, but as of now it's going to stay the way it is.

Homelab 2015: Hello 10Gbps!

New year, new post documenting the home lab. I've accomplished a number of upgrades/updates since my last full roundup of the lab, so rather than posting this as another delta, I'm doing this as a full re-documentation of the environment.

Compute: VMware vSphere 5.5

  • 3 x Dell R610, each spec'd as follows:
    • (2) Intel Xeon E5540 @ 2.53GHz
    • Hyperthreading enabled for 16 logical CPUs per host.
    • 96GiB RAM
    • Boot from 4 or 8GB SD card
    • Dell PERC H700 6Gbps SAS/SATA HBA
      • (4) 500GB Seagate Constellation.2 (ST9500620NS) SATA
        • RAID5
        • Formatted as local vmfs5 datastore
      • (2) 240GB Intel 530 SATA SSD
        • RAID1
        • RDM for StoreVirtual VSA (see below)
    • Quad port Broadcom BCM5709 Gigabit Copper (embedded)
    • Dual port Mellanox MT26448 10GigE (8-lane PCIe), 850nm SFP+ optics
    • iDRAC 6 Enterprise
    • Redundant power

Storage: IP-based

  • iomega StorCenter ix2-200 "Cloud Edition"
    • (2) 1TB Seagate Barracuda (ST1000DM003) 7200RPM SATA
    • RAID1
    • (1) 1000Base-T
    • LifeLine OS v3.2.10.30101
    • NFS export for VMs
  • 2 x Lenovo (iomega/EMC) px6-300d
    • (6) 2TB Hitachi Deskstar (HDS723020BLA642) 7200RPM SATA
    • RAID5
    • (2) 1000Base-T, bonded, multiple VLANs
    • LifeLine OS v4.1.104.31360
    • 2TB iSCSI Target for VMs
  • Synology DS2413+
    • (12) 2TB Seagate Barracuda (ST2000DM001) 7200RPM SATA
    • RAID1/0
    • (2) 1000Base-T, bonded, multiple VLANs (CLI-added)
    • DSM 5.1-5022 Update 2
    • NFS exports:
      • ISOs (readonly, managed by SMB)
      • Logs
      • VMs
  • Synology DS1813+
    • (8) 256GB Plextor PX-256M6S SSD
    • RAID5
    • (4) 1000Base-T
      • (1) Management network
      • (2) iSCSI network (multi-homed, not bonded)
    • ~1.6TB iSCSI Target (block mode)
  • HP StoreVirtual "LeftHand OS"
    • (3) ESXi Virtual Appliances, 1 on each host
      • 1280GB VMDK on local storage; tier 1
      • 223GB RDM on SSD volume; tier 0
    • 4486.49GB Raw, 2243GB RAID1
    • (2) 1TB volumes for VMs
      • Thin provisioning
      • Adaptive Optimization
    • iSCSI network: 10GbE
    • Management: 1GbE

Networking:

  • (2) Cisco SG500X-24
    • (4) 850nm SFP+ optics for 10GbE
    • (24) 1000Base-T MDI/MDI-X
    • Primary ISL: 10GbE
    • Backup ISL: (1) 2x1GbE LACP LAG
    • STP Priority: 16384
  • Cisco SG300-28
    • (28) 1000Base-T MDI/MDI-X
    • (2) 2x1GbE LACP LAG for link to SG500X-24
    • STP Priority: 32768
  • Google Fiber (mk.1)
    • "network box"
    • "storage box"
    • "fiber box"
  • Various "dumb" (non-managed) 1GbE switches
  • Apple Airport Extreme (mk.4)/Express (mk.2)

Miscellaneous:

  • (4) APC BackUPS XS1500
  • Internet HTTP/SSL redirection via Citrix NetScaler VPX (HA pair)
  • Remote access via:
    • TeamViewer 10
    • Microsoft RDS Gateway
    • VMware Horizon View
    • Citrix XenApp

Connectivity Diagrams:

Host Configuration

Environment