Sunday, March 8, 2015

Fix vShield Manager after modifying vSphere VDS uplinks

If you've been following my posts about upgrading my home lab, you know that I removed the add-in 1Gbps NICs and consolidated the motherboard-based 1Gbps NICs on one DVS (distributed virtual switch) in order to add 10Gbps support to my hosts. In that process, I not only rearranged the physical NICs for the uplinks, I also updated the uplink names in order to keep my environment self-documenting.

Things pretty much went as planned, but I didn't expect vShield Manager (vSM) to choke on the changes: when updating the uplink names for the DVS that provided the VXLAN port group, I expected vSM to recognize the changes and handle creation of new VXLAN networks without issue. I was wrong.

The first sign of trouble was that vCloud Director (vCD) could not create a new Organizational Network on a deployed Edge device, and the error pointed at the teaming policy.

Time to look at vSM to determine whether vCD is sending a bad request to vSM, or if vSM itself is the source of the issue. The easiest way to check is to manually create a new network in vSM; if it succeeds, vCD is sending a bad request; otherwise I need to troubleshoot vSM (and possibly vCenter, too).

Boom: the problem reproduces even when testing directly in vSM. Time to verify the teaming in the base portgroup in vCenter.

Oops. I hadn't updated the portgroup for VXLAN after moving the uplinks around, although I had done so for the other portgroups on the DVS.

Unfortunately, updating the portgroup to use all the available uplinks didn't help. However, in the process, I discovered an unexpected error in vCenter itself: vSM was making an API call to vCenter that included one of the old uplink names, one which no longer existed on the DVS. To test the theory, I added a couple of additional uplink ports to the DVS and renamed one to match the missing port.

It worked, but not as expected. vSM was able to send a proper API call to vCenter, but the resulting portgroup had sub-optimal uplink settings: of the two active uplinks, only one had an actual, physical NIC associated with it. This was not a redundant connection, even though it looked like one.
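
Both problems here (a name that no longer exists on the DVS, and an uplink that exists but has no physical NIC behind it) come down to comparing two lists. Here is a minimal Python sketch of that comparison, assuming you have already collected the uplink names and NIC assignments by whatever means you prefer. The function name and all the sample names are made up for illustration; nothing here talks to vCenter or vSM.

```python
def audit_uplinks(referenced, dvs_uplinks):
    """Compare uplink names a config refers to against what the DVS really has.

    referenced  -- uplink names used by vSM or the portgroup teaming policy
    dvs_uplinks -- dict of uplink name -> physical NIC (None if nothing attached)
    Returns a (missing, unbacked) pair of lists.
    """
    # Names the config uses that do not exist on the DVS at all.
    missing = [name for name in referenced if name not in dvs_uplinks]
    # Names that exist on the DVS but have no physical NIC attached.
    unbacked = [name for name in referenced
                if name in dvs_uplinks and dvs_uplinks[name] is None]
    return missing, unbacked


# Made-up names, roughly mirroring the situation described above.
dvs = {"uplink-10g-a": "vmnic4", "uplink-10g-b": "vmnic5", "dvUplink1": None}
print(audit_uplinks(["dvUplink1", "uplink-10g-a"], dvs))
print(audit_uplinks(["dvUplink2"], dvs))
```

The first call models the "looks redundant but isn't" portgroup; the second models the stale name that vCenter rejected.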

Time to restart vSM to get it to re-read the vCenter DVS config, right? Wrong. Even after a restart and re-entering the vCenter credentials, the stale state persisted.

At this point, my Google-fu failed me: no useful hits on a variety of search terms. Time to hit the VMware Community Forums with a question. Luckily, I received a promising answer in just a day or two.

I learned that one can use the REST API for vSM to reconfigure it, which can get it back in line with reality. But how do you work with arbitrary REST calls? It turns out, there's a REST client plug-in for Firefox, written to troubleshoot and debug REST APIs. It works a treat:
  1. Set up the client for authenticated headers
  2. Retrieve the DVS configuration as an XML blob in the body of the response to a GET call
  3. Modify the XML blob so that it has the correct properties
  4. PUT the revised XML blob back to vSM.
Voila! Everything works.
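
If you would rather script the round trip than click through a browser plug-in, the same idea can be sketched in Python with only the standard library. Treat this as a sketch under assumptions, not a tested recipe: I'm assuming vSM accepts HTTP Basic authentication, that the switch resource lives under /api/2.0/vdn/switches (check the API guide for your version), and that the certificate is self-signed, as it usually is in a home lab. The helper names and the host name are hypothetical.

```python
import base64
import ssl
import urllib.request

# Assumed path for the vSM VXLAN switch API; verify against your API guide.
SWITCHES_PATH = "/api/2.0/vdn/switches"


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_request(host, user, password, path, method="GET", body=None):
    """Prepare (but do not send) an authenticated request to vSM."""
    req = urllib.request.Request(f"https://{host}{path}", data=body, method=method)
    req.add_header("Authorization", basic_auth_header(user, password))
    if body is not None:
        # The body is the (possibly edited) XML blob, sent as bytes.
        req.add_header("Content-Type", "application/xml")
    return req


def send(req):
    """Send a prepared request and return (status code, response body bytes)."""
    # Lab shortcut: skip certificate checks for a self-signed certificate.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp:
        return resp.status, resp.read()
```

A GET would then look something like `send(build_request("vsm.lab.example", "admin", password, SWITCHES_PATH))`, and the PUT is the same call with `method="PUT"` and the edited XML passed as `body`.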

Specifics:
  1. Use an authenticated GET on the switches API
  2. Using the objectId of the desired DVS, GET the specific switch data
  3. Update the XML blob with the correct uplink names
  4. PUT the revised XML blob
As soon as this blob was accepted with a 200 OK response, I re-ran my test in vSM: success! vCD was able to create the desired portgroup, too.
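
I edited the XML by hand, but step 3 is also easy to script. The sketch below swaps old uplink names for new ones in an XML blob using Python's standard library. The element names (vdsContext, uplinkPortName and friends) and the sample values are my assumptions about the general shape of the blob, not a copy of what vSM returned, so compare them against your own GET output first. It also assumes the XML carries no namespaces.

```python
import xml.etree.ElementTree as ET


def rename_uplinks(xml_blob, renames, tag="uplinkPortName"):
    """Return xml_blob with the text of every <tag> element remapped via renames.

    renames is a dict of old uplink name -> new uplink name; elements whose
    text is not a key in renames are left untouched.
    """
    root = ET.fromstring(xml_blob)
    for elem in root.iter(tag):
        old = (elem.text or "").strip()
        if old in renames:
            elem.text = renames[old]
    return ET.tostring(root, encoding="unicode")


# Illustrative blob only; the real structure may differ.
sample = """<vdsContext>
  <switch><objectId>dvs-01</objectId></switch>
  <teaming>FAILOVER_ORDER</teaming>
  <uplinkPortName>old-uplink-1</uplinkPortName>
</vdsContext>"""

print(rename_uplinks(sample, {"old-uplink-1": "new-uplink-1"}))
```

Doing the rename through an XML parser rather than a blind search-and-replace means only the uplink elements change, which matters if an old uplink name happens to appear elsewhere in the blob.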

Key takeaways:
  1. REST Client for Firefox is awesome for arbitrary interaction with a REST API
  2. Sometimes, the only way to accomplish a goal is through the API; a GUI or CLI command may not exist to fix your problem.
  3. This particular fix allows you to arbitrarily rename your uplinks without having to reset the vShield Manager database and completely reinstall it to get VXLAN working again.