It’s high time for another summer break (I get closer and closer to burnout every year - either I’m working too hard or I’m getting older ;).
Of course we’ll do our best to reply to support (and sales ;) requests, but it might take us a bit longer than usual. I will publish an occasional Worth Reading or Watch Out blog post, but don’t expect anything deeply technical for the next two months.
I spent a bit of time the other day reflecting on how much I’ve learned from the course in terms of technical skills, and the amount I’ve learned has been great. I literally had no idea about things like Git, Jinja2, CI testing, or reading YAML files, and had only briefly seen Ansible before.
I’m not an expert now, but I understand these things and have real practical experience on these subjects which has given me great confidence to push on and keep getting better.
Christoph Jaggi sent me this observation during one of our SD-WAN discussions:
The centralized controller is another shortcoming of SD-WAN that hasn’t been really addressed yet. In a global WAN it can and does happen that a region might be cut off due to a cut cable or an attack. Without connection to the central SD-WAN controller the part that is cut off cannot even communicate within itself as there is no control plane…
A controller (or management/provisioning) system is obviously the central point of failure in any network, but we have to go beyond that and ask a simple question: “What happens when the controller cluster fails and/or when nodes lose connectivity to the controller?”
Architectures designed with a bit more operational experience, like the Big Switch fabric, can deal with short-term failures. Big Switch claims ARP entries reside in edge switches, so the switches can keep ARP going even when the controller fails. It might also be possible to pre-provision backup paths in the network (see also: SONET/SDH) so a headless fabric could deal with link failures (but not link recoveries, because those require path recalculation). Dealing with external topology changes like VM migration is obviously mission impossible.
Some architectures deal with controller failure by falling back to traditional behavior. For example, ESXi hosts that lose connectivity with the NSX-V controller cluster enter controller disconnected mode in which they flood every BUM packet on every segment to every ESXi host in the domain. While this approach obviously works, try to figure out how much overhead (and wasted CPU cycles) it generates.
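To get a feeling for that overhead, here’s a trivial back-of-the-envelope sketch. The host count and BUM packet rate are made-up numbers; the only thing taken from the behavior described above is that every BUM packet gets replicated to every other ESXi host in the domain:

```python
# Back-of-the-envelope BUM flooding overhead in controller-disconnected mode.
# All numbers are hypothetical - adjust them to your environment.
esxi_hosts = 64        # ESXi hosts in the domain (assumption)
bum_pps = 1000         # aggregate BUM packets per second (assumption)

# Every BUM packet is replicated to every other host, needed or not:
flooded_copies = bum_pps * (esxi_hosts - 1)
print(flooded_copies)  # → 63000 unicast copies per second
```

A thousand BUM packets per second turn into 63,000 unicast copies per second that most receiving hosts will simply drop.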
On the complete other end of the spectrum are systems with traditional distributed control plane that use SDN controller purely for management tasks. Cisco ACI immediately comes to mind - as I usually joke during my “NSX or ACI” workshops, you could turn off APIC controller cluster when going home for the weekend and the ACI fabric would continue to work just fine.
Where are SD-WAN systems in this spectrum? We don’t know, because the vendors are not telling us how their secret sauce works. However, at least some vendors claim their magic SD-WAN controller replaces routing protocols, which means that controller failure might prevent edge topology changes from propagating across the network.
There’s also the nasty question of key distribution. In traditional systems like DMVPN, edge nodes exchange P2P keys with IKE and use shared secrets or pre-provisioned certificates to prevent man-in-the-middle attacks. In an SD-WAN system the controller might do key distribution, in which case I wish you luck when you face a nasty WAN partition (or an AWS region failure if the controller runs in the cloud).
Summary: Things are never as rosy as they appear in PowerPoint presentations and demos. Figure out everything that could potentially go wrong (like WAN partitioning), try to find what happens from product documentation, and ask some really hard questions (or change the vendor) if the documentation is not useful. Finally, verify every claim a $vendor makes in a lab.
SD-WAN is the best thing that could have happened to networking according to some industry “thought leaders” and $vendor marketers… but it seems there might be a tiny little gap between their rosy picture and reality.
This is what I got from someone blessed with hands-on SD-WAN experience:
One of the first things I realized when I started my Azure journey was that the Azure orchestration system is incredibly slow. For example, it takes almost 40 seconds to display six routes from a per-VNIC routing table. Imagine trying to troubleshoot a problem and having to cope with a 30-second delay on every single SHOW command. Cisco IGS/R was faster than that.
If you’re old enough you might remember working with VT100 terminals (or an equivalent) connected to 300 baud modems… where typing too fast risked getting the output out-of-sync resulting in painful screen repaints (here’s an exercise for the youngsters: how long does it take to redraw an 80x24 character screen over a 300 bps connection?). That’s exactly how I felt using Azure CLI - the slow responses I was getting were severely hampering my productivity.
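For the youngsters: here’s the back-of-the-envelope math, assuming the typical asynchronous framing of 10 bits per character (8 data bits plus start and stop bits):

```python
# Full-screen repaint time over a 300 bps asynchronous line.
bps = 300
bits_per_char = 10                      # 8 data bits + start + stop bit
chars_per_second = bps / bits_per_char  # 30 characters per second
screen_chars = 80 * 24                  # 1920 characters on a VT100 screen
seconds = screen_chars / chars_per_second
print(f"{seconds:.0f} seconds")         # → 64 seconds
```

A full screen repaint took over a minute, which is why typing too fast felt so risky.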
This webinar helped me a lot in understanding Ansible and the benefits we can gain. It is a big area to grasp for a non-coder, and this webinar was exactly what I needed to get started (in a lab), including a lot of tips and tricks and how to think. It was more fun than I expected, so I started with Python just to get a better grasp of programming and Jinja.
In early 2019 we made the webinar even better with a series of live sessions covering new features added to recent Ansible releases, from core features (loops) to networking plugins and new declarative intent modules.
One of my subscribers sent me an interesting puzzle:
> One of my colleagues configured a single-area OSPF process in a customer VRF, but instead of using area 0, he used area 123 nssa. Obviously it works, but I was thinking: “What the heck, a single OSPF area MUST be in Area 0”
Not really. OSPF behaves identically within an area (modulo stub/NSSA behavior) regardless of the area number…
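For illustration, a minimal single-NSSA-area OSPF process in a VRF might look like this on Cisco IOS (the process number, VRF name, interface, and addresses are hypothetical):

```
router ospf 123 vrf CUSTOMER
 area 123 nssa
!
interface GigabitEthernet0/1
 vrf forwarding CUSTOMER
 ip address 10.1.1.1 255.255.255.0
 ip ospf 123 area 123
```

As long as every router in the VRF agrees on the area number, intra-area SPF works exactly the same way it would in area 0.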
Let’s take a recent data center switch using the Trident II+ chipset with 16 MB of buffer space (source: the awesome packet buffers page by Jim Warner). Most switches using this chipset have 48 10GE ports and 4-6 uplinks (40GE or 100GE).
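Assuming (naively) that the shared packet buffer is split evenly across all ports - real chipsets use dynamic buffer sharing, so treat this as a first-order sketch with an assumed six uplinks:

```python
# First-order per-port buffer share on a hypothetical 48 x 10GE + 6-uplink switch.
buffer_bytes = 16 * 1024 * 1024         # 16 MB shared packet buffer
ports = 48 + 6                          # downlinks + uplinks (6 uplinks assumed)
per_port = buffer_bytes / ports         # bytes per port if split evenly
drain_us = per_port * 8 / 10e9 * 1e6    # time to drain that share at 10 Gbps
print(f"{per_port / 1024:.0f} KB per port, drained in ~{drain_us:.0f} us at 10GE")
```

Even this naive even split gives each port only around 300 KB of buffer, or roughly 250 microseconds’ worth of traffic at 10GE line rate.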