SDWAN - Repair a vBond Sync Error

Another short entry to help anyone who runs into the same situation I did this week. In the SD-WAN fabric, the controller certificates need to be renewed every once in a while so that the cEdges and vEdges, as well as the controllers themselves, can authenticate one another. This week, I reconfigured the vManage to use the Cisco Automated method for controller certificates, following this CVD from Cisco. The process is relatively straightforward, but it’s important to update the cEdges and vEdges with the new root certificate so that they can successfully authenticate against the new Cisco-issued certificates. The main benefit of the Cisco Automated method is that it no longer requires a Cisco TAC case, and it can be sorted within 10-20 minutes (depending on how fast Cisco signs the certs).

Two branch routers with TLOC extension

Please refer to the above guide for the configuration and how to update the certificates. This post will only cover the troubleshooting section as this doesn’t seem to be well covered elsewhere.

Problem

After reconfiguring how the controller certificates are provisioned, I decided to try it out first with just one vBond. The process took somewhere between 5 and 10 minutes, and the CSR and signing progress can be followed in the Plug and Play portal.

After the first vBond was successfully updated, the controllers synced up and all was well. I then decided to do a vSmart and the vManage because they also had a certificate about to expire. While they were in the CSR process, I thought, why not do the other vBond and vSmart as well so all controllers have the same renewal dates? (These were different due to a region migration.)

This was not the best idea. It caused the second vBond to miss the updated information about the vManage and the other controllers during and after the renewal process, resulting in a sync status of Error.

Moreover, the edge routers reset their control connections and have to re-authenticate with the controllers. Because the DNS query for the vBond address returns multiple IP addresses, the edge routers will connect to all vBonds when necessary. However, if a vBond is no longer in sync with the fabric, it will return faulty information and the control connections to that vBond will remain open. This is visible on the edge routers (the first command is for vEdges, the second for cEdges):

show control connections

or

show sdwan control connections

On the vBond this is visible with the following command:

show orchestrator connections

The result in my case was that half the fabric was offline from the vManage’s perspective and some routers only had a single working vSmart connection. IPSec tunnels were up for all routers, but their timers had reset, indicating that some traffic loss might have occurred during the re-authentication with the vSmarts.

Note: Unlike software updates (for devices with the same site code), certificate renewals on the controllers run in parallel. The edge routers then have to re-authenticate with the vSmarts, and the IPSec connections will be lost if they have to re-authenticate with all vSmarts in the fabric at once. Generate the CSRs sequentially!

Re-syncing the vBond

In this case, the vBond had outdated information about the other controllers. Check the vManage GUI and look under Configuration > Certificates > Controllers. The ID we’re looking for is the Certificate Serial.

Log into the vBond in error state and check the serials with the following command. In the error state, these will not match the vSmart and vManage certificate serials shown in the vManage GUI:

vbond-1# show orchestrator valid-vsmarts

SERIAL NUMBER                             ORG                      
-------------------------------------------------------------------
19c7bb28e2737b93f59746c784a9e0f0b1599490  Example-ORG - 548455
39754f55650e2898322e3688644af4790e75caa1  Example-ORG - 548455
c9ad779452e747f1c1745bc22151d195d574dab0  Example-ORG - 548455
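
With more than a handful of controllers, comparing the serials by eye gets error-prone. The comparison can be sketched in Python as below. This is only an illustration: the function names are made up for this post, the output format is assumed to match the table above, and the expected serials are assumed to have been copied by hand from the vManage GUI.

```python
import re

# Assumption: each data row starts with a hexadecimal serial, followed by
# whitespace and the organization name, as in the table above.
ROW_RE = re.compile(r"^([0-9a-fA-F]{8,})\s+(\S.*)$")


def parse_valid_vsmarts(cli_output):
    """Return {serial: org_name} from 'show orchestrator valid-vsmarts' output."""
    entries = {}
    for line in cli_output.splitlines():
        match = ROW_RE.match(line.strip())
        if match:  # header, separator and prompt lines do not match
            entries[match.group(1).lower()] = match.group(2)
    return entries


def diff_serials(on_vbond, expected):
    """Return (stale, missing): serials only on the vBond, and serials it lacks."""
    known = set(on_vbond)
    wanted = {serial.lower() for serial in expected}
    return sorted(known - wanted), sorted(wanted - known)


# Output pasted from the vBond in error state.
VBOND_OUTPUT = """
SERIAL NUMBER                             ORG
-------------------------------------------------------------------
19c7bb28e2737b93f59746c784a9e0f0b1599490  Example-ORG - 548455
39754f55650e2898322e3688644af4790e75caa1  Example-ORG - 548455
c9ad779452e747f1c1745bc22151d195d574dab0  Example-ORG - 548455
"""

# Certificate serials copied from Configuration > Certificates > Controllers.
VMANAGE_SERIALS = [
    "cb8f5e968ebd4b5ffb3ebb36b34237ac2fa3f179",
    "1af17dd0a0478b54d04764131bc41660c4c19aca",
    "344df065ecf4bdd7f738064ee49d26f64d8de5aa",
]

parsed = parse_valid_vsmarts(VBOND_OUTPUT)
stale, missing = diff_serials(parsed, VMANAGE_SERIALS)
print("Stale on vBond:", stale)
print("Missing on vBond:", missing)
```

If both lists come back empty, the vBond already knows the current controller certificates and the sync error has a different cause.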

Normally, adding the correct vManage certificate serial is sufficient for the vBond to authenticate with the vManage and allow it to resync:

vbond-1# request controller add
Value for 'serial-num' (<Serial number in vSmart/vManage public certificate>): cb8f5e968ebd4b5ffb3ebb36b34237ac2fa3f179

If you want to clean up everything and add the other vSmarts as well, use the example commands below, replacing the certificate serials with the correct ones from your environment. First, delete the stale entries:

vbond-1# request controller delete org-name "Example-ORG - 548455" serial-num 19c7bb28e2737b93f59746c784a9e0f0b1599490
vbond-1# request controller delete org-name "Example-ORG - 548455" serial-num 39754f55650e2898322e3688644af4790e75caa1
vbond-1# request controller delete org-name "Example-ORG - 548455" serial-num c9ad779452e747f1c1745bc22151d195d574dab0

Then add the correct ones:

vbond-1# request controller add org-name "Example-ORG - 548455" serial-num cb8f5e968ebd4b5ffb3ebb36b34237ac2fa3f179
vbond-1# request controller add org-name "Example-ORG - 548455" serial-num 1af17dd0a0478b54d04764131bc41660c4c19aca
vbond-1# request controller add org-name "Example-ORG - 548455" serial-num 344df065ecf4bdd7f738064ee49d26f64d8de5aa
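
The delete and add commands above follow a fixed pattern, so they can be generated from two lists of serials instead of being typed out by hand. A minimal sketch in Python, assuming the command syntax shown above; `build_resync_commands` is a hypothetical helper, and it only prints the commands for review without connecting to any device:

```python
def build_resync_commands(org_name, stale_serials, correct_serials):
    """Build the vBond CLI lines: delete stale entries first, then add correct ones."""
    commands = []
    for serial in stale_serials:
        commands.append(
            f'request controller delete org-name "{org_name}" serial-num {serial}'
        )
    for serial in correct_serials:
        commands.append(
            f'request controller add org-name "{org_name}" serial-num {serial}'
        )
    return commands


# Example values taken from this post; replace them with your own.
ORG_NAME = "Example-ORG - 548455"
STALE = [
    "19c7bb28e2737b93f59746c784a9e0f0b1599490",
    "39754f55650e2898322e3688644af4790e75caa1",
    "c9ad779452e747f1c1745bc22151d195d574dab0",
]
CORRECT = [
    "cb8f5e968ebd4b5ffb3ebb36b34237ac2fa3f179",
    "1af17dd0a0478b54d04764131bc41660c4c19aca",
    "344df065ecf4bdd7f738064ee49d26f64d8de5aa",
]

commands = build_resync_commands(ORG_NAME, STALE, CORRECT)
for command in commands:
    print(command)
```

Review the printed lines against the vManage GUI before pasting them into the vBond session, since a wrong serial in an add command leaves the vBond out of sync.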

Result

If all goes well, the vBond will re-authenticate with the rest of the controllers and sync back up. This should automatically fix any edge routers that have missing control connections and/or an open control connection with the vBond that was previously in error state.
