Here is a neat little situation I ran into recently that is worth sharing. If this saves a life, it was worth what I had to go through ...
- Cisco Expressway C/E 8.10.3 cluster over WAN (2 Control Peers, 2 Edge Peers)
- Customer-deployed and customer-managed SD-WAN solution sitting between the Edge cluster and the Internet (with two separate transport carriers). I think it was Palos, but we’ll call it a whiteboxed solution for our purposes
- Using MRA and B2B Cisco Expressway configs
- UAT for MRA and B2B is accepted and works great
A little while after go-live, the customer applied the zone/search rule config in Cisco Expressway for CMR (basically, SIP URI dialing into WebEx meetings from the video endpoint) and noticed that, seemingly at random during a presentation session in the CMR, the BFCP server (AKA the WebEx meeting) would close the BFCP presentation to the endpoint coming from the customer’s Expressway, while all other BFCP clients kept receiving it. That’s right: it appeared that WebEx kicked the BFCP participant coming from the customer’s Edge, and not because the BFCP server had closed the whole session (all other participants remained)! Although the exact time into the presentation varied, it would always happen to the endpoint at some point, generally around the 2-minute mark.
Although random, a consistent-ish interval would seem to suggest a timer or re-INVITE of some flavor, but that turned out to be wrong. Sparing you all the gory tales of escalation and bus underskirt sliding: the issue was, in fact, the SD-WAN solution itself.
The Explanation & The Fix
What happened is that every 120 seconds or so, the BFCP server (WebEx meeting) would send a UDP BFCP packet to all the BFCP presentation subscribers. According to the customer, their SD-WAN solution was identifying these packets (gotta love layer 7 capable firewalls 😊) and queuing them onto a physically different link than the one carrying the rest of the BFCP stream, which created path asymmetry and added delay. In a TCP stream this would likely be tolerated to a degree: it would show up as packet loss, delay, and/or jitter, and TCP would simply retransmit ... but we are dealing with UDP here, which does no retransmission or reordering of its own. No good!
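To make the reordering concrete, here is a minimal sketch of the effect, not a model of the actual SD-WAN box. The link names (`carrier_a`, `carrier_b`), the one-way delays, and the 20 ms send interval are all made-up assumptions for illustration; the only point is that a single packet steered onto a slower link shows up after packets that were sent later.

```python
def arrival_order(send_times_ms, link_for_packet, delay_ms):
    """Return packet indices in the order the receiver would see them."""
    arrivals = []
    for index, sent_at in enumerate(send_times_ms):
        link = link_for_packet(index)
        arrivals.append((sent_at + delay_ms[link], index))
    arrivals.sort()  # sort by arrival time
    return [index for _, index in arrivals]


# Hypothetical one-way delays for the two transport carriers.
DELAY_MS = {"carrier_a": 20, "carrier_b": 65}

# Eight packets, one every 20 ms (assumed pacing, for illustration only).
send_times = [i * 20 for i in range(8)]


def split_policy(index):
    """Packet 3 stands in for the one packet that gets classified differently."""
    return "carrier_b" if index == 3 else "carrier_a"


def pinned_policy(index):
    """Every packet of the stream stays on the same link."""
    return "carrier_a"


print("split :", arrival_order(send_times, split_policy, DELAY_MS))
print("pinned:", arrival_order(send_times, pinned_policy, DELAY_MS))
```

With the split policy, packet 3 lands after packets 4 and 5; with everything pinned to one link, the order is untouched.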
To resolve it, the customer had to classify the traffic and force active/failover transmission through their SD-WAN solution for that traffic, rather than “load balance” transmission behavior.
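The active/failover idea can be sketched in a few lines. This is not the vendor's configuration syntax or API (I never saw it); `select_link`, the link names, and the health dictionary are hypothetical, and the sketch only shows the behavior: the classified traffic always takes the first healthy link in a fixed preference order, and moves to the other link only when the preferred one is down.

```python
from typing import Dict, List


def select_link(preferred_order: List[str], link_up: Dict[str, bool]) -> str:
    """Pick the first healthy link in a fixed order (active/failover)."""
    for link in preferred_order:
        if link_up.get(link, False):
            return link
    raise RuntimeError("no transport link available")


# Hypothetical preference order for the classified BFCP/media traffic.
ORDER = ["carrier_a", "carrier_b"]

# Both links healthy: everything stays on the active link.
print(select_link(ORDER, {"carrier_a": True, "carrier_b": True}))

# Active link down: the whole class of traffic fails over together.
print(select_link(ORDER, {"carrier_a": False, "carrier_b": True}))
```

The design point is that the decision depends only on link health, never on the individual packet, so one stream cannot end up straddling two carriers.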
Sleuthing & The Closing
In hindsight, it seems simple and makes perfect sense, right? However, when your only visibility into the network is the Expressway servers themselves, it can be very challenging to discover, because at that point in the topology everything looks like it is coming from and going to the VIP on the firewall pair. So how do you catch something like this when you can’t see everything? PCAPs. Literally counting f**king packet sequence numbers for 6 hours and identifying a consistent pattern of packets arriving out of order and being “lost”.
Learn from this, Padawan.
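If you would rather not count by hand, the same check can be roughly automated. The sketch below assumes you have already exported the sequence numbers from the capture, in arrival order, into a plain list; the function name `find_reordering` and the sample numbers are made up for illustration, and it ignores sequence-number wraparound to stay short.

```python
from typing import Iterable, List, Tuple


def find_reordering(seqs: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Return (late, missing) for sequence numbers given in arrival order.

    late:    numbers that arrived after a higher number had been seen.
    missing: numbers inside the observed range that never arrived.
    Assumes no sequence-number wraparound within the sample.
    """
    seen = set()
    late = []
    highest = None
    for seq in seqs:
        if highest is not None and seq < highest and seq not in seen:
            late.append(seq)
        seen.add(seq)
        highest = seq if highest is None else max(highest, seq)
    if not seen:
        return [], []
    missing = [n for n in range(min(seen), max(seen) + 1) if n not in seen]
    return late, missing


# Made-up sample: 103 shows up late, 107 never shows up.
arrivals = [100, 101, 102, 104, 105, 103, 106, 108, 109]
late, missing = find_reordering(arrivals)
print("late:", late)
print("missing:", missing)
```

Run it per stream and look at when the late or missing numbers cluster; a pattern that repeats on a steady interval is the kind of thing that pointed at the path rather than the endpoints.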