
This blog is about AWS, specifically multiple VPCs and routing, and some recent developments. My intent is to give you the flavor of some AWS design approaches, and then point you to some good material for the details.

Amazon AWS transit VPCs are an important cloud concept to consider as your organization’s AWS footprint increases. While the other cloud providers differ as to details, some of this may also apply to them.

Another approach is AWS Transit Gateways. We’ll get to them in a few paragraphs.

AWS Transit VPCs

Why might transit VPCs be important? The short answer is segmentation — divide and conquer (or at least control).

One of the hidden issues in the cloud is administrative and security control. It is unwise for any single person to have the ability to accidentally or deliberately wipe out your entire cloud infrastructure.

Another is standardization. Turn four server admins loose in AWS and you may get eight different schemes for subnets, server instance connectivity, and routing. That much flexibility may well be counter-productive.

Following that principle, the need for security and fiscal accountability suggests finding a deployment model that empowers staff to be agile and automated and to do what they need, while constraining which security controls, which external and internal connectivity, and which portions of the logical topology are under their control.

That is, you might want to segment your cloud presence, giving various teams carefully delineated control and connectivity, and segmenting server functions for security: achieving both security and administrative segmentation!  The segment size could be one or more logical subnets, but it might also be a VPC.

From the standards perspective, you might want to control things such as:

  • Where in your cloud footprint can teams spin up VM instances?
  • How can they logically network them? One or several subnets and interfaces?
  • What can those segments route to?
  • Where are your security points of enforcement for inside and Internet traffic? Who controls those?
  • How best to provide common services to various business units and their VPCs or subnets of servers — whatever your segmentation granularity is.

In short, create some standards so you don’t end up with Lift and Shift chaos, everyone doing their own thing, and a nightmare from the support perspective. Especially since experience says that it didn’t get documented with physical servers, so why would you expect it to be documented with nebulous (cloudy) servers?

(Virtual documentation? You can see my server instances, so it must be obvious what each one does, what more do you want? – That just doesn’t cut it, as far as I’m concerned!)

The one technical issue with using VPCs to segment your use of AWS is that the routing was (is) not transitive: you can peer A to B and B to C, but unless you also configure routes for A to C, A can’t reach C via B. Doing this manually doesn’t scale. Automating it at scale, maybe not such a great idea either. (Troubleshooting?)
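
To make the scaling problem concrete, here is a back-of-the-envelope sketch (illustrative Python, not AWS tooling; it assumes one peering connection per VPC pair and a route in each VPC’s table to every other VPC):

```python
# Full-mesh VPC peering: every pair of VPCs needs its own peering
# connection, and every VPC's route table needs a route to every other
# VPC, because peering routes are not transitive.

def full_mesh_cost(num_vpcs: int) -> tuple:
    """Return (peering_connections, route_table_entries) for a full mesh."""
    peerings = num_vpcs * (num_vpcs - 1) // 2   # one per VPC pair
    routes = num_vpcs * (num_vpcs - 1)          # each VPC routes to every other
    return peerings, routes

def hub_and_spoke_cost(num_vpcs: int) -> int:
    """Connection count with one transit hub: one attachment per spoke VPC."""
    return num_vpcs

for n in (4, 10, 50):
    print(n, "VPCs:", full_mesh_cost(n), "vs hub:", hub_and_spoke_cost(n))
```

At 50 VPCs, the full mesh needs 1,225 peerings and 2,450 routes, versus 50 hub attachments, which is the basic argument for hanging spokes off a transit VPC.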

The idea with a transit VPC is to have a VPC as “neutral” or common ground and hang the specialized VPCs off that.

There are two ways I know of around the transit routing situation:

  • Static routes to the transit VPC, NAT there (to each spoke VPC, so they think remote stuff is local). That can get a bit messy. This was documented in a Cisco Live presentation a few years back (Cisco and Under Armour). I can’t find this online anymore.
  • Dynamic routing, using CSRv virtual routers and VPN tunnels, either to other CSRvs or to AWS VPN endpoints. AWS has made it easy to automate the tunnel establishment, so you only need CSRvs for the hub function. AWS and Cisco have whitepapers and documentation for this approach.

References for Transit VPC:

See also the CiscoLive talks about AWS or Cloud and CSR 1000v in general. Recent ones also go into hybrid connectivity with CSR 1000v including Azure and Google, configuration snippets, etc. — Great resources!

AWS Transit Gateway

I’d come across transit gateways a couple of months ago and heard a rumor that lack of transit routing might be going to change. Apparently, it now has!

It turns out, Marwan Al-shawi blogged about this before I could get to it. Since he did a fantastic job of writing this topic up at length, I’ll refer you to his blog for details. My summary follows — trying to share the design approach, and so you know why you might want to read his blog.

An AWS transit gateway acts like a giant hub router and control point for multiple VPCs. The big deal is that you can replace per-VPC tunnels from your datacenters or CoLo locations with a single tunnel to the transit gateway (TGW). That can help with scaling and BGP peering fatigue at the corporate VPN termination hub site, assuming that simplification is what you are looking for. DirectConnect to the TGW is apparently a future feature.

The TGW also allows you very flexible control (think VRFs) on routing between the attached VPCs. It does come with some limitations.

Comments

Comments are welcome, both in agreement or constructive disagreement about the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!

—————-

Hashtags: #CiscoChampion #TechFieldDay #TheNetCraftsmenWay #TransitVPC #TransitGateway #AmazonAWS

Twitter: @pjwelcher

Disclosure Statement

NetCraftsmen Services

Did you know that NetCraftsmen does network / datacenter / security / collaboration design / design review? Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at info@netcraftsmen.com.

The post Amazon Transit VPCs and Transit Gateway appeared first on NetCraftsmen.


A little bit of training, some research, and a lot of time in the GUI was the recipe I found for getting comfortable with Aruba wireless controllers. Even though I spent more time configuring the controllers via the GUI, now that I know the syntax, I find it much simpler to work via the CLI. I do find the controller GUI much more useful for troubleshooting and verifying wireless client connectivity. In this blog, I will share some knowledge and insights that I have gained while deploying Aruba’s wireless controllers for one of our customers.

Controllers Overview

Aruba offers wireless controllers in the 7000 series and 7200 series models. The 7000 series controllers scale for small to large branch offices, from 16 to 64 maximum AP capacity, with an option of up to 24 switch ports for unified wired and wireless access. The 7200 series controllers are suitable for campus networks and support from 256 to 2,048 maximum AP capacity.

The controllers can be deployed as Master or Local. Aruba suggests deploying the 7000 series controllers as Locals, while the 7200 series are typically deployed as Master controllers. In a Master-Local deployment, the Master holds responsibility for all policy configuration. This includes services such as WIPS, initial AP configuration, user roles, and authentication-related configuration. The Local controller terminates AP tunnels; processes and forwards user traffic (including authentication); and manages ARM (Adaptive Radio Management), mobility features, and QoS.

Aruba also offers a Mobility Master appliance, which provides features not available in the other controller models. It provides controller clustering capability, allowing a better user experience via features like hitless failover, automatic user load balancing, automatic AP load balancing, and seamless roaming across the cluster. This type of deployment could be considered for sensitive environments where high wireless performance and reliability are required for critical services.

Note: ArubaOS 8.x is required with the Mobility Master appliance. With this code, APs cannot terminate on any Master or Mobility Master controllers; they can only terminate on controllers deployed in Local mode. ArubaOS 6.x allows AP termination on either Master or Local controllers.

Licensing

Master controller(s) can be configured with the Centralized Licensing feature. This allows the creation of a shared pool of AP Licenses that can be used by all controllers in the network. When an AP joins a Local controller, it consumes an AP License along with a PEF-NG (Aruba’s Policy Enforcement Firewall) and an RFProtect License from the pool.

While Centralized Licensing allows flexibility, it is important to note that the maximum AP capacity of some of the smaller controllers can be a gotcha if not designed around carefully. Suppose you have deployed a 7205 model controller as the Master and a 7030 controller as the Local in a Master-Local HA Active-Active deployment, with a total of 100 AP licenses and the Centralized Licensing feature enabled. If the APs are load balanced (50-50 on each controller) and the Master controller fails, only 14 APs would fail over and 36 of them would be down. This is because the AP capacity of the 7030 model allows only 64 APs to join it.
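
The arithmetic behind that gotcha is easy to script (illustrative Python; the 64-AP limit is the 7030’s published capacity, and the other numbers are the scenario’s assumptions):

```python
# Failover capacity check for the Master-Local scenario above:
# 100 licensed APs split 50/50; the surviving 7030 Local can only
# terminate 64 APs total, regardless of how many licenses remain.

def failover_result(total_aps: int, aps_on_survivor: int, survivor_capacity: int):
    """Return (aps_that_fail_over, aps_left_down) when one controller dies."""
    aps_to_move = total_aps - aps_on_survivor        # APs orphaned by the failure
    headroom = survivor_capacity - aps_on_survivor   # slots left on the survivor
    moved = min(aps_to_move, max(headroom, 0))
    return moved, aps_to_move - moved

moved, down = failover_result(total_aps=100, aps_on_survivor=50, survivor_capacity=64)
print(moved, down)  # 14 APs fail over, 36 stay down
```

Running the same check against each controller model’s capacity before you finalize a design catches this class of problem early.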

Redundancy

To enable redundancy, any combination of HA Deployment Models can be used…

  • Master / Standby Master with HA Active-Active Local Controllers — Full redundancy
  • Master with HA Active-Standby Locals — N+1 Redundancy with Over-subscription
  • Master-Local HA Active-Active or Standby-Active — Master Active or acting as backup LMS
  • Independent Masters HA Active-Active — No local controllers, each master acting as backup for the other

As long as the pool of AP licenses is not exceeded, APs can fail over to the backup LMS IP, which can be another Local or a Master, depending on the deployment. During failover to the backup LMS, APs would normally reboot, causing minutes of outage, unless Aruba’s AP Fast Failover feature is enabled. This feature allows APs to form standby tunnels to the backup LMS for near-instant failover, minimizing downtime.

Configurations

Aruba wireless controller configuration takes a hierarchical approach, where multiple configuration profiles are built separately and attached to higher-level profiles. Best practice is to configure the lowest-level settings and profiles first, then build up. Reviewing these controller configurations can be confusing without fully understanding the configuration hierarchy.

The following is the output of the CLI command “show profile-hierarchy” on a 7205 controller; it shows how the profiles relate to each other and provides some clarity.

ap-group
   wlan virtual-ap
       aaa profile
           aaa authentication mac
           aaa server-group
               aaa authentication-server radius
                   aaa radius modifier
           aaa authentication dot1x
           aaa xml-api server
           aaa rfc-3576-server
       wlan dot11k-profile
           wlan handover-trigger-profile
           wlan rrm-ie-profile
           wlan bcn-rpt-req-profile
           wlan tsm-req-profile
       wlan hotspot hs2-profile
           wlan hotspot advertisement-profile
               wlan hotspot anqp-venue-name-profile
               wlan hotspot anqp-nwk-auth-profile
               wlan hotspot anqp-roam-cons-profile
               wlan hotspot anqp-nai-realm-profile
               wlan hotspot anqp-3gpp-nwk-profile
               wlan hotspot anqp-ip-addr-avail-profile
               wlan hotspot h2qp-wan-metrics-profile
               wlan hotspot h2qp-operator-friendly-name-profile
               wlan hotspot h2qp-conn-capability-profile
               wlan hotspot h2qp-op-cl-profile
               wlan hotspot h2qp-osu-prov-list-profile
               wlan hotspot anqp-domain-name-profile
       wlan ssid-profile
           wlan edca-parameters-profile station
           wlan edca-parameters-profile ap
           wlan ht-ssid-profile
           wlan dot11r-profile
       wlan wmm-traffic-management-profile
       wlan anyspot-profile
   rf dot11a-radio-profile
       rf spectrum-profile
       rf arm-profile
       rf ht-radio-profile
       rf am-scan-profile
   rf dot11g-radio-profile
       rf spectrum-profile
       rf arm-profile
       rf ht-radio-profile
       rf am-scan-profile
   ap wired-port-profile
       ap wired-ap-profile
       ap enet-link-profile
       ap lldp profile
           ap lldp med-network-policy-profile
       aaa profile
           aaa authentication mac
           aaa server-group
               aaa authentication-server radius
                   aaa radius modifier
           aaa authentication dot1x
           aaa xml-api server
           aaa rfc-3576-server
   ap system-profile
   wlan voip-cac-profile
   wlan traffic-management-profile
       wlan virtual-ap
           aaa profile
               aaa authentication mac
               aaa server-group
                   aaa authentication-server radius
                       aaa radius modifier
               aaa authentication dot1x
               aaa xml-api server
               aaa rfc-3576-server
           wlan dot11k-profile
               wlan handover-trigger-profile
               wlan rrm-ie-profile
               wlan bcn-rpt-req-profile
               wlan tsm-req-profile
           wlan hotspot hs2-profile
               wlan hotspot advertisement-profile
                   wlan hotspot anqp-venue-name-profile
                   wlan hotspot anqp-nwk-auth-profile
                   wlan hotspot anqp-roam-cons-profile
                   wlan hotspot anqp-nai-realm-profile
                   wlan hotspot anqp-3gpp-nwk-profile
                   wlan hotspot anqp-ip-addr-avail-profile
                   wlan hotspot h2qp-wan-metrics-profile
                   wlan hotspot h2qp-operator-friendly-name-profile
                   wlan hotspot h2qp-conn-capability-profile
                   wlan hotspot h2qp-op-cl-profile
                   wlan hotspot h2qp-osu-prov-list-profile
                   wlan hotspot anqp-domain-name-profile
           wlan ssid-profile
               wlan edca-parameters-profile station
               wlan edca-parameters-profile ap
               wlan ht-ssid-profile
               wlan dot11r-profile
           wlan wmm-traffic-management-profile
           wlan anyspot-profile
   ap regulatory-domain-profile
   rf optimization-profile
   rf event-thresholds-profile
   ids profile
       ids general-profile
       ids signature-matching-profile
           ids signature-profile
       ids dos-profile
           ids rate-thresholds-profile
       ids impersonation-profile
       ids unauthorized-device-profile
   ap mesh-radio-profile
       ap mesh-ht-ssid-profile
   ap mesh-cluster-profile
   rf arm-rf-domain-profile
   ap provisioning-profile
   ap authorization-profile

Fortunately, you only need to configure a few of these for a successful deployment. Many of the profiles and parameters do not require tweaking in most environments. Here is a review of some of the important configurations and profiles that were recently deployed at a customer site.

The chart above walks through the configuration flow from low-level (top) to high-level (bottom) profiles.

Note: the example configurations shown are only partial configurations to provide a visual and are valid for ArubaOS 6.x code command line.

Starting from the bottom of the chart…

The AP-Group ties it all together. Each AP must be assigned to an AP-Group at deployment time. An AP-Group essentially defines the SSIDs the AP will know of and advertise, the authentication used, VLAN assignment, etc. You can use a single AP-Group for the entire network or break it down per site or region. Unless you are advertising different SSIDs for different sets of APs (per site or region), it is simpler to use a single AP-Group.

A Virtual-AP profile is created per SSID and is assigned to the AP-Group. Each Virtual-AP profile is assigned an SSID profile, which defines the SSID, and an AAA profile, which defines all authentication parameters corresponding to that SSID. A VLAN is also configured under the Virtual-AP profile and is assigned to all users by default, unless a specific VLAN is assigned to the User Role to which a user is assigned, which takes precedence. Attributes are then configured under the AAA profile: AAA authentication parameters are defined within the AAA server group, dot1x, captive portal, etc. User Roles are also configured under AAA profiles to define the pre- or post-authentication roles for users.

A User Role defines the access a user is permitted, based on the Firewall Policies configured for each role. Firewall Policies are essentially ACLs (standard, extended, service-based, etc.). As users attempt to connect to any SSID, they are assigned an initial role. These initial roles define what type of access the user has prior to authenticating (e.g. only HTTP / HTTPS access to the captive portal, so guests can authenticate). Once authenticated successfully, a user is assigned a post-authentication role that provides the network access defined by the administrator (e.g. only HTTP / HTTPS access to the Internet, blocking all communication to RFC 1918 address space). The User Role can also be derived from a RADIUS attribute returned by the AAA server on successful authentication. A VLAN can be assigned to the User Role, either to put users in an initial VLAN with restrictive access or to assign them a special-access VLAN (e.g. network admin access).
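
For a sense of how these pieces attach to each other, here is a rough ArubaOS 6.x CLI sketch of the chain described above. All profile names, the ESSID, and the VLAN number are made up for illustration; this is not the customer configuration:

```
wlan ssid-profile "corp-ssid"
   essid "CorpWiFi"
   opmode wpa2-aes
!
aaa profile "corp-aaa"
   authentication-dot1x "corp-dot1x"
   dot1x-server-group "corp-radius"
!
wlan virtual-ap "corp-vap"
   ssid-profile "corp-ssid"
   aaa-profile "corp-aaa"
   vlan 100
!
ap-group "HQ"
   virtual-ap "corp-vap"
```

The AP-Group references the Virtual-AP, which in turn references the SSID and AAA profiles, mirroring the hierarchy output shown earlier.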

Overall, Aruba wireless controllers are fairly simple to configure and provide great flexibility in deploying wireless solutions for your needs. Most of my experience this year was with the ArubaOS 6.x code, which differs greatly from the new 8.x train: new features, new look, etc. Next up for our customer is transitioning from 6.x to 8.x code with a new pair of Mobility Master controllers, perhaps also the topic for my next blog.

References

Mobility Boot Camp (Training):

https://inter.viewcentral.com/events/cust/search_results.aspx?eventMonthYear=&event_address_id=227&event_id=423&postingForm=default.aspx&cid=aruba&pid=1&lid=1&cart_currency_code=&payment_type=&orderby_location=&orderby_date=&newRegistration=&bundle_location_group=&errmsg=

ArubaOS 6.4.4.x User Guide:

https://www.arubanetworks.com/techdocs/ArubaOS_64x_WebHelp/Web_Help_Index.htm#ArubaFrameStyles/Preface/WhatsNew.htm%3FTocPath%3DAbout%2520this%2520Guide%7C_____1

The post Aruba Wireless Controllers: Architecture & Configurations appeared first on NetCraftsmen.


I’ve done a couple of application slowness (brownout) troubleshooting sessions recently.

This blog is my attempt to condense some observations from both engagements, to share lessons learned. “Condense” might not be the right word, seeing how long this blog got!

Troubleshooting with some process awareness can help! I have a troubleshooting process. Sometimes I do all the aspects, sometimes I do the short version. I think it helps!

As I was drafting this blog, ipspace.net posted a very comprehensive blog on the same topic. Recommended! Darn, Ivan beat me to print again!

Diagram It

I usually start with a diagram. I’ve learned that the troubleshooting team and I may make unwarranted assumptions, miss aspects of the topology, etc. I’ve learned to ask “and what’s not in the diagram” — I want to know about every device and link, not the simplified abstraction that usually gets diagrammed. I’ve also seen situations where the network diagram shows only network devices, the security diagram shows only security devices, and I need some help putting together the One Diagram to Rule Them All.

The diagram should cover where the affected users are, where the application in question is running, and where the key services are, all in relation to the network. If the problem is an app that has gone slow for everyone, then I want to know about services and internal flows in the application and include all of them in the diagram. The key thing here is to make the diagram completely cover all network traffic. So, for instance, “write to disk” might turn out to involve the network when the disk being written to is an NFS file share. Don’t assume “that’s just storage”!

Gathering application flow information, which is almost never documented, can take considerable time and multiple iterations, so I usually hold off doing it in detail unless the problem is clearly the application and not the network. That is, I try to check out and eliminate what I / we can do easily first (low hanging fruit), while setting the slower info gathering in motion.

Diagramming makes sure all those involved (especially me, the visiting consultant) have a clear idea of the relevant parts of the network. It saves locking into any assumptions prematurely. It also helps later when you’re stumped and need to revisit where you might have missed something.

List Possible Causes

After building the diagram (or first draft), I go through it and list the things that can be a cause in a broad sense: user PC, user network connection, access to distribution switch uplink, etc. They go into rows in a spreadsheet. This gives me a framework to summarize scoping, test results, and other information. It makes sure I / the team don’t focus on any one thing too soon. It helps the team divide up checking various items.

Document It as You Go

I’ve repeated work too many times, going back to look more closely at something, or going over it with someone else. That costs time. So, I document raw data as I go: good notes, and save screen or CLI output, etc. That takes time but is very useful if you later want to re-check what you did or what you saw.

I usually put the info and captures into a folder. When they’re not too verbose, or screen captures, I use a Word document with section headers. Doing so with the Word Navigator pane helps you pull up the data quickly when needed.

Note: save screen captures separately in a file folder; Word reduces image resolution. It’s painful to go back later, try to zoom in, and realize the resolution isn’t there anymore.

This also helps spot things I (or often, the local network staff) don’t know or have assumed about the application in question. Depending on how important the gaps seem, some can need immediate resolution, others can be postponed for resolution only if they start seeming more important.

Think Scope

Most of us do this implicitly anyway, but it helps to do it consciously. Scope: what’s affected, what’s not affected.

In a broader sense, it’s always useful to think about what you know, what you can eliminate as a possible cause, or where you should focus.

I like documenting this in the possible “causes” spreadsheet. I usually put a column in for “how do I know this” because sometimes you find that you don’t really know something — or communicated information might be vague and inconclusive. That’s why when someone presents me with a conclusion, I tend to ask “and how do you know that?” or “why do you think that?”

Sometimes I add a column to the spreadsheet for priority: 5 = top, 1 = low priority, 0 = clean / not a problem. Excel automated color coding (5 = red, 0 = green) can help, although inserting / deleting spreadsheet rows messes that up (hint to Microsoft: poorly coded feature!).

Case Study

Problem: doctors’ offices going via their business HQ, thence to hospital-based EPIC. Slowness.

Scoping: We knew that most sites were not complaining, but two had users that were experiencing slowness. That told us a couple of things up front, maybe not definitively, but well enough for first-cut elimination of some possible causes.

Verbal diagram: All sites were connected to a WAN, which connected back to HQ. Most sites’ Internet was also via HQ. So, application and Internet traffic were competing for WAN bandwidth.

HQ had a separate point-to-point connection back to the EPIC provider.

The user-based scoping information eliminated HQ, the link to the EPIC provider, and the EPIC provider’s network as likely causes. At least, pointed that way, I’d consider this to be about 70% proven, given the evidence was purely anecdotal.

Possible causes: user workstations, user site LAN connection, or user site WAN connection.

Further data from site staff: after review, the users in question had rather old computer hardware.

I’ll note that one problem with reports from users is that there is usually a good bit of delay before a report gets to the helpdesk and percolates to you. Reports can also be vague, e.g. as to when the problem started occurring and / or stopped.

Even in this particular case study, there’s the question whether anyone else was trying to use the app at the same time a couple of people were experiencing slowness. It is all subjective evidence. That can make it hard to correlate with link utilization spikes, etc.

Lesson Learned: Train users (gently) to note down time of onset and time when things improved (if they improved), and report those. That can help you see if their problems matched up with other data. (Think about journalism’s “5 W’s and How”: “Who, what, when, where, how, and why”).

Getting Hard Evidence

The EPIC provider staff had done something clever: EPIC provided centralized printing, meaning outbound traffic to printers at the “customer” site was allowed through the firewalls in the path. So, the staff set up smokeping to poll two printers at each customer site. In advance. Visibility for the win!

Where that trick isn’t feasible, tools like Appneta, Netbeez, or ThousandEyes can be useful. If you deploy them at each site, you can monitor things like ping response, DNS response time, or web application response time. Useful for site to Internet, SaaS, cloud, and internal apps. Having hard objective data with accurate timestamps about “user experience” lets you compare which sites were having problems at the same time, or whether the problems were independent.

I’ll also note internal DNS is key to many things today, so it is a good idea to monitor its responsiveness.

A DNS Gotcha

DNS is slow when it doesn’t get a reply, due to lost packets or slow recursive lookup.

This can affect app logins or even database authentication logging, masquerading as DB slowness.

I’ve seen slow reverse DNS lookup because central DNS was not authoritative for some of the private address blocks in use at a site, causing recursion to the Internet. That in turn caused slow logins to a key application (and copious logged complaints from the application).

Most of the above monitoring tools won’t catch that because you have to specify the address or name they resolve.

Hint to tool vendors: perhaps allow not only fixed DNS name resolution, but also reverse resolution of random IPs in a block.

Lesson Learned Previously: Make sure your site DNS is authoritative for IP lookup for all private or public address blocks in use. E.g. all of 10.0.0.0/8 rather than just the subnets MS AD knows about. Ditto for other private address blocks: make sure reverse lookups stay local (and fail quickly).
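
As a quick sanity check on zone coverage, Python’s standard `ipaddress` module can show which reverse (in-addr.arpa) zone a block maps to (an illustrative sketch):

```python
import ipaddress

# Per-address: the PTR name your internal DNS should answer for.
addr = ipaddress.ip_address("10.1.2.3")
print(addr.reverse_pointer)  # 3.2.1.10.in-addr.arpa

# Per-block: the reverse zone covering the whole network. For a /8 like
# 10.0.0.0/8, being authoritative for 10.in-addr.arpa keeps ALL 10.x
# reverse lookups local, not just the subnets AD happens to know about.
net = ipaddress.ip_network("10.0.0.0/8")
octets_kept = net.prefixlen // 8  # whole octets covered by the prefix
zone = ".".join(reversed(str(net.network_address).split(".")[:octets_kept])) + ".in-addr.arpa"
print(zone)  # 10.in-addr.arpa
```

The same computation on 172.16.0.0/16 yields 16.172.in-addr.arpa, so you can enumerate the zones your servers need to own for each RFC 1918 block in use.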

Responsibility to be authoritative about reverse lookups can fall through the cracks when the network or another team manages the site and datacenter DNS, and the Microsoft / server team manages user DNS.

SNMP Stats

I’ll briefly hit one of my favorite rants, ahem, themes. I hope you’re already using a network management tool that captures and graphs SNMP stats on all active interfaces, preferably with 5-minute or finer time granularity. You can then look along the traffic path for link problems. This is where having user port stats can help detect if the user’s network connection is the problem.  

User slowness can be caused by link congestion, errors, or discards. As I’ve noted before, anything over 0.001% errors or discards should be fixed, as it can slow things down.
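
That 0.001% threshold is simple to apply to raw counters (an illustrative sketch; the counter deltas follow the usual IF-MIB style of packets, errors, and discards between two polls):

```python
# Flag an interface whose error + discard rate exceeds 0.001% of the
# packets seen during a polling interval.

ERROR_THRESHOLD = 0.00001  # 0.001% expressed as a fraction

def interface_unhealthy(delta_packets: int, delta_errors: int, delta_discards: int) -> bool:
    """True if errors + discards exceed 0.001% of packets in the interval."""
    if delta_packets == 0:
        return False  # nothing transmitted, nothing to judge
    bad = delta_errors + delta_discards
    return bad / delta_packets > ERROR_THRESHOLD

print(interface_unhealthy(10_000_000, 50, 75))  # 125 / 10M = 0.00125% -> True
print(interface_unhealthy(10_000_000, 0, 1))    # 1 / 10M = 0.00001% -> False
```

The point is that the threshold is a fraction of traffic, not an absolute count: a handful of errors on a quiet link can matter more than thousands on a busy one.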

If your NPM platform won’t or can’t poll everything frequently, or won’t alert on thresholds below 1%, consider getting a better one. While money may be tight, your staff’s time may be an even scarcer resource.

Bonus points to Network Management products that take as input the endpoints, figure out the path(s) in each direction, taking ECMP into account, and then show you problems along the paths.

Back to Our Story

In the EPIC story, the WAN Metro Ethernet data Comcast provided was rather summarized, viewable only by last day, week, or month. Pretty useless.

The bars shown were (apparently) averages over hours or days. Either that, or the readings were steady for long periods of time. I’ve noticed over the years that most network management products will graph data, but don’t tell you things you need to know to properly interpret the data. Is the data graphing the actual polled data, or is it being lumped into bigger buckets and averaged?

When you average over hours or days, peaks of traffic get averaged with zeroes. In this case, I suspect the data was for business hours only, but with averaging.

All that vagueness is why I don’t like having to guess what the graph is actually plotting.

The key point here is that if you’re looking for congestion, and all you’re seeing is traffic at about 50% of max that might be an average over multiple 5-minute periods or even hours, then chances are that for small time intervals the link could be maxed out. Or not: there’s no way to tell unless you can zoom in.
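
A quick numeric illustration of how averaging hides saturation (illustrative Python; the twelve 5-minute samples and the pegged-link scenario are assumed numbers, not the case study’s actual data):

```python
# Twelve 5-minute utilization samples (percent of link capacity) for one
# hour. The link is pegged at 100% for 25 minutes, yet the hourly
# average looks like an unremarkable "about 50%".
samples = [100, 100, 100, 100, 100, 10, 10, 10, 10, 5, 5, 5]

hourly_average = sum(samples) / len(samples)
peak = max(samples)
print(f"hourly average: {hourly_average:.2f}%  peak 5-min sample: {peak}%")
```

This is why a graph that lumps polled data into hourly buckets can show a comfortable 46% while users at that site experience a saturated link for nearly half the hour.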

To wrap up, the problematic sites did seem a bit more heavily utilized, and smokeping did show more ping time variability, suggesting congestion.

Wrapping Up the Case Study

Our primary recommendation was to get the problem workstations upgraded, and as new slowness reports come in, capture the workstation model in use.

A second tentative recommendation (due to poor supporting data) was to consider adding more bandwidth to the problem sites. At the very least, monitor the links internally using 1-minute or 5-minute polling, so that if problems remained after upgrading old workstations, or as the user load increased as anticipated, there would be better data (cost justification) for upgrading the WAN links.

Another Approach

The ultimate answer is of course to monitor every workstation, or selected workstations, right from the workstation itself.

One product I’ve run across (but not used directly) is Aternity, now owned by Riverbed. It does actual user-experience monitoring. Word of mouth says it can be costly. There are likely other products in that space. Knowing which users have a problem amounts to automated scoping data, which could be pretty useful!

Conclusion

One point to this blog is that you can get pretty useful data without a large investment. The investment needed is in free or cost-effective products and as much of your time as is needed to ensure you’ll have the data you want when you need it.

Scoping and anecdotal user input can help troubleshooting, but they do not form a very strong objective basis for it. One problem is accurate timing: correlating bad user experience with other performance statistics.

Network troubleshooting is slow if you don’t already have the data in hand. You can end up with guess work, or with ongoing problems while you get set up to gather the data you need. Slow!

Set yourself up for success by getting a good SNMP performance tool, monitoring everything. And think about adding one or more tools that provide hard objective data about user experience, or user-like experience, at least by site. Tracking wired versus WLAN UX is possible with some of the above tools, e.g. Netbeez.  

As noted in a prior blog, cloud changes the game. That’s where good transaction logs and measurements of service / microservice response times become more essential (In effect, adding service experience to user experience!). Start talking to your APM / application folks and learning the tools.

Comments

Comments are welcome, both in agreement or constructive disagreement about the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!

—————-

Hashtags: #CiscoChampion #TheNetCraftsmenWay

Twitter: @pjwelcher

Disclosure Statement

NetCraftsmen Services

Did you know that NetCraftsmen does network / datacenter / security / collaboration design / design review? Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at info@netcraftsmen.com.

The post Dealing with Performance Brownouts appeared first on NetCraftsmen.


How do you feel about jumbo MTU? I seem to periodically get into debates about jumbos. I’m highly allergic to jumbos. Let’s examine the facts (as I see them), and then we’ll get to the cause of my allergic reaction.

TL;DR: jumbos can cause major operational pain for network administrators.

But first, let’s define jumbo MTU. This usually refers to jumbo frames on Ethernet media. I’m making that distinction because other transports can have different MTU sizes.

Ethernet MTU stands for Maximum Transmission Unit: the largest data payload in an Ethernet frame. The standard is 1500 bytes. Note that the MTU is not the frame size; an Ethernet frame adds an L2 header (SMAC, DMAC, Ethertype) and checksum, for an OSI Layer 2 frame size of 18 more bytes. At Layer 1, there is also the preamble and start-of-frame delimiter, adding another 8 bytes.

Using 802.1q VLANs adds 4 bytes (802.1q type code, VLAN ID, then Ethertype).
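A quick arithmetic check of the sizes above:

```python
MTU = 1500            # standard Ethernet payload size (the MTU)
L2_HEADER = 14        # DMAC (6) + SMAC (6) + Ethertype (2)
FCS = 4               # frame checksum
DOT1Q = 4             # 802.1Q tag, when VLAN tagging is used
PREAMBLE_SFD = 8      # Layer 1 preamble + start-of-frame delimiter

frame = MTU + L2_HEADER + FCS       # 1518: the classic max frame size
tagged = frame + DOT1Q              # 1522 with an 802.1Q tag
on_wire = tagged + PREAMBLE_SFD     # 1530 bytes serialized per frame
print(frame, tagged, on_wire)       # 1518 1522 1530
```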

I’m going to refer you to the Wikipedia page for the various formats for Ethernet frames.

Note that TCP MSS is something different.

Now that everyone is equally confused, let’s dig in deeper…

It Gets Confusing

Various vendors let you set the MTU. One has to be careful, as sloppy terminology sometimes leaves it unclear what exactly is being configured, i.e. which OSI layer headers are included.

On Cisco routers, setting the IP MTU reduces the IP packet size, usually to accommodate VPN tunneling or other overhead. I usually just use 1400 for such situations; it saves on math mistakes and off-by-one errors, and if you’re splitting the packet anyway, it doesn’t really matter much if you split it a bit earlier than absolutely necessary. You end up with two packets either way.
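To see why 1400 is a comfortable round number, here is a back-of-envelope calculation (illustrative overhead values; actual IPsec overhead varies with cipher, padding, and header options):

```python
# Typical per-packet tunnel overhead (illustrative values; IPsec ESP
# overhead varies with cipher, padding, and header options).
GRE = 20 + 4            # outer IPv4 header + basic GRE header = 24 bytes
ESP_TUNNEL_MAX = 73     # a common worst-case ESP tunnel-mode estimate

ip_mtu = 1500
print(ip_mtu - GRE)                   # 1476: the classic GRE IP MTU
print(ip_mtu - GRE - ESP_TUNNEL_MAX)  # 1403: GRE over IPsec, roughly
# Hence 1400: a round number safely below the worst case.
```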

Why Jumbos?

Reason #1: The ratio of header overhead to payload is better. So, you get marginally more data transmitted with a given amount of bandwidth. Or said differently, jumbos waste less bandwidth on headers.

Reality check: that used to matter. If you’re transmitting to Mars at 8 bits per second, you might care about overhead, greatly. Modems at 16 Kbps or whatever, yes. And yes, I’m old enough to remember modem connect squeal and all that. At 10 Gbps, you likely have bandwidth to spare, unless you’re doing something extreme, where you need every last bit of performance. 
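A back-of-envelope check of how marginal the gain is, counting the Ethernet header plus FCS, preamble/SFD, and the 12-byte inter-frame gap:

```python
def efficiency(payload):
    """Fraction of wire time carrying payload, counting Ethernet
    header + FCS (18), preamble/SFD (8), and the 12-byte inter-frame gap."""
    per_frame_overhead = 18 + 8 + 12
    return payload / (payload + per_frame_overhead)

print(round(100 * efficiency(1500), 2))  # 97.53: standard frames
print(round(100 * efficiency(9000), 2))  # 99.58: jumbo frames, ~2% better
```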

Reason #2: Many devices’ forwarding performance is (or was) measured in packets per second, since the limiting factor was how many packets the CPU had to handle.

This may still be true to some extent with cheap server / laptop NICs. TCP offload and other driver techniques may alleviate the CPU burden of adding the packet headers and computing the checksum. Efficient driver coding (e.g. not copying data around in memory!) has also improved forwarding performance. I don’t consider myself a server or NIC expert, so I’ll quickly change the subject …

Reason #3: My storage team (or vendor) told me they need jumbos for better performance.

Vendors have been claiming this for a while, and it may be true. Although one response might be, “and why didn’t you put a more powerful CPU in your storage front end, since the marginal cost would be tiny?” Yes, some organizations do need extreme performance.  

Googling, I see articles that are all over the place. Some might be summarized as “yes, 7% gain in performance”, others show bigger gains and sometimes losses in performance. The right answer is likely “it depends” (on your environment, your NICs, your CPU, your drivers, etc.).

The Downside of Jumbos

Jumbos have to be configured. One more thing that can go wrong / missing. Labor expended.

Jumbos have to be configured to a plan. You have to take re-routing (STP changes or routing changes) into account and set up jumbos consistently across every possible alternative path.

Design-wise, that means at the very least you should pick a region in your datacenter for jumbo deployment, define it well, and then perhaps automate periodic checks for interfaces / ports that didn’t get configured.

You probably don’t want jumbos on campus LAN or the WAN.

What Could Possibly Go Wrong?

If a large frame arrives on a port or interface configured for smaller MTU, it likely gets discarded (See however Cisco “baby giant frames”, which allow a little laxity with frame sizes).

This can lead to very puzzling “why can’t these devices talk” troubleshooting sessions. You then have to look at every possible path between devices, check the actual MTU (mark up a diagram), and look for inconsistencies. You may find the problem by doing traceroute or ping with a large packet size and the DF bit set, but that will only catch the first problem spot along the current path (you did want High Availability, didn’t you?).
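As a sketch of that probing approach, the following assumes the Linux ping flags (-M do sets the DF bit) and wraps a binary search around a probe function; the example at the end uses a simulated probe rather than a live network:

```python
import subprocess

def ping_probe(host, payload):
    """True if an unfragmentable ping of this payload size gets through.
    Uses Linux ping flags (-M do sets DF); adjust for other platforms."""
    r = subprocess.run(
        ["ping", "-c", "1", "-M", "do", "-s", str(payload), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return r.returncode == 0

def max_payload(probe, lo=0, hi=9000):
    """Binary-search the largest payload the current path accepts."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Simulated path whose limit is a 1372-byte ping payload
# (1372 + 8 ICMP + 20 IP = a 1400-byte IP MTU):
print(max_payload(lambda size: size <= 1372))  # 1372
```

Remember the caveat from above: this only finds the first bottleneck along the path traffic currently takes, not along alternate paths.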

MTU mismatch can be a problem for your routing stability. I’ve seen it now with both EIGRP and OSPF.

With OSPF, the problem arises in the adjacency formation process, if one neighbor has a larger MTU. I’ve seen it once in the field.

You will see the problem during the OSPF Exchange state: one side won’t be happy, because it thinks it is not seeing anything from the other router. This shows up as the routers sequencing through the OSPF state machine, pausing, then repeating. The problem occurs when the OSPF LSA database holds enough information that the packets sent become too large for the other router. So, this is something you won’t see until one day, when your network gets big enough (and not all that big), and OSPF starts breaking.

As often happens, I found an interesting blog at INE about this. It shows CLI output if you’re interested in examining this problem in detail.

However, from a Cisco Tech Note, it looks like there have been various changes in how Cisco handles this situation. Short version: “ip ospf mtu-ignore” may solve your OSPF problem, depending on release version, but you’ll still have jumbo drop issues.
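For reference, the workaround is applied per interface, along these lines (an IOS-style sketch; the interface name is illustrative, and exact behavior varies by release):

```
interface GigabitEthernet0/1
 ip ospf mtu-ignore
```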

With EIGRP, you can have something similar happen. I’ve seen it with two routers with a L2 switch in between, where the switch had a smaller MTU. As the routing table size grew, EIGRP went unstable between the two routers. This could also happen to OSPF, even if correct MTU checking is going on between the two routers.

Conclusion

Jumbo MTU provides:

  • Less header overhead, more data per packet: minor gain
  • Greater network complexity
  • Minor configuration hassle
  • Really annoying troubleshooting when an MTU mis-match happens somewhere

Conclusion: Just Say No to Jumbo Frames

That is, if local politics, requirements (e.g. extreme performance), and vehemence of argument allow.


Hashtags: #CiscoChampion #TheNetCraftsmenWay #JumboFrame


The post Just Say No to Jumbo Frames appeared first on NetCraftsmen.

NetCraftsmen Blog by Peter Welcher - 4M ago

More and more sites are deploying Cisco Nexus 9K-based fabrics. Basically, if the time has come for datacenter switch refresh, you have two choices: Nexus 7700 etc. with a “classic Nexus” design (core, distribution, Top of Rack, FEX), or Nexus 9K-based fabrics. Fabrics are cool and look like a future direction for Cisco and most folks.

To VXLAN or Not?

Design-wise, you can do classic VLANs, or Nexus VPC-based design, or a fabric. Most are opting for fabrics. Classic VLANs suffer the weaknesses and risk of Spanning Tree Protocol (STP). VPC mitigates that risk somewhat, allowing MLAG, but at the price of a little complexity, mostly behavioral rather than design.

Caution: do not accidentally duplicate a VPC domain ID within a L2 domain.

Using VXLAN increases the complexity, but allows any L2 VLAN anywhere, from the edge systems perspective. Ivan Pepelnjak recently pointed out that VXLAN gives you a more-robust control plane, but does not help much with the broadcast radiation problem of VLANs. So, I will emphasize that Control Plane Protection and storm control features are additional wise precautions in any such environment.

Note: I’ve previously blogged about storm control; the N9K is documented as having the same counter-intuitive configuration behavior as the other Nexus switches.

Ok, so with some safety measures, a datacenter running VXLAN may well be more robust than one on STP.

Concerning Datacenter Interconnect (DCI), I prefer VXLAN to Spanning Tree (STP). Routing fails closed whereas STP fails open, which is not good when dealing with WAN latency and lost packets. The same applies to clustering (VMware NSX, firewalls, or servers) over the WAN; stretching control backplanes over links that are inherently less reliable than in datacenters strikes me as risky.

How to Fabric

The next question is building and managing your fabric.

You can configure it manually. VXLAN configuration is a little verbose but can be templated. NetCraftsmen has done that in some small-ish fabrics. Managing it can be done with show commands.
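As a sketch of that templating idea (the exact NX-OS commands vary by platform and software release, so treat these lines as illustrative rather than authoritative):

```python
# Minimal sketch of templating per-VLAN VXLAN configuration, NX-OS-style.
# Command details vary by platform and release; this is illustrative only.
TEMPLATE = """\
vlan {vlan}
  vn-segment {vni}
interface nve1
  member vni {vni}
    mcast-group {group}
"""

def render(vlan, vni, group):
    """Render the VXLAN config stanza for one VLAN / VNI pair."""
    return TEMPLATE.format(vlan=vlan, vni=vni, group=group)

print(render(100, 10100, "239.1.1.100"))
```

Even this trivial approach removes the copy-paste errors that plague verbose, repetitive VXLAN configuration.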

At the scale of 2 spine and 4 to 6 or 8 leaf switches, all that is do-able. Still, automation ought to make it easier. And maybe, while the tools are at it, they could provide some operational support as well, including error checking?

Automation Alternatives

I know of several VXLAN automation alternatives:

  • Cisco Nexus Fabric Manager (now End of Sale: it never seemed to get enough visibility to sell, so it stopped getting new features, and it was eventually doomed by lack of sales)
  • Cisco DCNM (LAN Manager)
  • Cisco ACI
  • Apstra

Let’s discuss them. Well, the latter three, seeing as NFM is off the table now.

Cisco’s DCNM has been evolving from a click-to-configure GUI for the datacenter. A couple of years ago, its primary user base seemed to be SAN admins. The product used to be separate (more or less) LAN and SAN products; I see they are still separate “installations”. DCNM also seems to have shifted from click-to-configure to more of an automation, management, and operations focus.

Anyway, DCNM version 11 supports VXLAN in various ways. It has a fabric builder for fabric initialization and does consistency monitoring. One of the very technical Cisco experts in the VXLAN and DCI areas, Yves-Louis, has blogged about DCNM for VXLAN at length, complete with demo / how-to video clips. You might find it interesting to look at Yves’ prior blogs as well.

Cisco ACI is the obvious alternative. With the starter bundle, small to fairly large sites can buy three-controller ACI for almost the same cost as plain Nexus. That provides you with automation and fabric monitoring and management.

Cisco ACI is great in a lot of ways. It also changes some standard network concepts in subtle ways, i.e. has a learning and experience curve associated with it. Some of that revolves around backing up the fabric configuration, and how to deal with controller failure, in case that ever happens. The rest involves learning the various GUI knobs and influences on how ACI forwards traffic — and thinking a bit differently. ACI makes the fabric into effectively one big router, switch, and L4 firewall, which can have external L2 and L3 connections.   

There are two ways to use ACI. The first is full-blown ACI, using contracts for connectivity and security. As noted, this really changes how you do datacenter, which might be a good thing. The second way is what I call “legacy network mode”, where you do VLANs and routing, more or less as usual. L2OUT and L3OUT policies connect ACI to the external world. ACI documents tend to talk about this more as a migration technique.

The reason I bring this up is that one may want to limit the learning curve and complexity, if your primary goal is fabric management and not contracts and security. Finding staff with ACI skills might be a related concern. As I’ve blogged elsewhere, configuring ACI so it’s not write-once / read-never might also be a consideration.

ACI is certainly well-documented. Online you’ll find several thick books’ worth of information, with a lot more documents on fine details. Cisco has also documented various ways of interconnecting datacenters, using either CLI-based VXLAN or ACI. There’s also at least one Cisco Press book about VXLAN / EVPN, and there’s also Cisco Live presentations. For that matter, I attended the Cisco Live all-day VXLAN / DCI techtorial two years ago (All good, but not necessarily simple). Does the cumulative page count concerning ACI make my point about ACI not being simple?

If you have a heavily virtualized environment, that should probably factor into any ACI decision. That’s a long discussion (alternatives, pros, cons) that I’ll mostly save for another time. If 90% of your apps are in VMware, it may make the most sense to manage security there, especially if doing NSX.

The third choice is Apstra. Apstra is a startup with some significant customers, providing Intent-Based datacenter fabrics for switches from a variety of vendors (Cisco, Arista, Cumulus), including some open-source platforms. The Apstra product automates configuration, validates connectivity in an ongoing fashion, and (reportedly) can do scalable telemetry. Yes, it has risk (startup!!!).

Oh, and Apstra announced a couple of months back that they had added support for multi-tenant multi-vendor VXLAN / EVPN, hiding the vendor differences for you.

Some features:

  • Symmetric Routing (dedicated transit L3 VNI per security zone)
  • All hosts advertised with Type 2 Routes
  • All networks advertised with Type 5 routes
  • Server multihoming using MLAG / vPC
  • Route Target (RT) / Route Distinguisher (RD) auto-generated based on VNI ID

I’m not going to try to scoreboard Apstra versus ACI regarding all the possible VXLAN features, e.g. anycast gateway, IP multicast, IPv6, multi-site VXLAN. I googled several of those topics and am not coming up with Apstra links (Exercise for the reader!).

I asked Apstra about some of the technical details. Apparently:

  • Multicast propagation is head-end replicated in the EVPN reference design today.  Multicast underlay is on the roadmap (Local switch flooding, I believe — have not heard mention of IGMP snooping).
  • No proxy ARP / ARP suppression.
  • For unicast learning in EVPN, proxy ARP is used by hardware if it supports it, by software if it does not.  The local VTEP will advertise the MAC address into EVPN fabric as type-2 routes.  This ensures that the known MAC addresses on local leaves are propagated and learned by other leaves.
  • For silent hosts, flood-and-learn is still used (there is no other way!), so unknown unicast has to be flooded.

Short version: fair VXLAN / EVPN support, currently lacking some of the scaling features around BUM traffic. That may be a necessary consequence of being a multi-vendor platform: lowest-common-denominator behavior that may or may not leverage some of the advanced features of Cisco hardware.

Positioning-wise, Apstra does not currently, as far as I know, attempt to do security. Yet, if ever. That’s one reason I included it in my list. Quite possibly far fewer features and options than ACI, for those who want fabric automation and management, and not that much more (at least currently). That might be a Good Thing! Also, according to some off-camera #NFD19 discussion, Apstra appears to have some good features in the works.

I should probably note: beware, Apstra comes with a learning curve as well, just like ACI. Apstra is a startup, so might be less well-documented, and comes with some of the other aspects of buying from a startup / fairly new company.

Conclusion

There are several ways you can automate a Cisco (or other) fabric, should you wish to do so.


Hashtags: #CiscoChampion #TheNetCraftsmenWay #Cisco #Nexus #Datacenter #VXLAN #Fabric


The post Ways to Automate VXLAN appeared first on NetCraftsmen.


A few months ago, I wrote about the tradeoffs between using a L3 switch and a router. That blog noted that there are a lot more QoS capabilities on the Cisco routers. L3 switches provide a much more constrained set of QoS features, presumably those suitable for high speed processing in chips.

If you’re trying to do QoS on Cisco gear, how does this affect you? I’ve encountered a few ways in my recent QoS adventures. I’ve also seen some shifts in use of QoS. We’ll go into all that below.

About QoS

QoS is complicated. Figuring out what you ought to be doing, then designing your policy, then deploying it. Very carefully and consistently.

NetCraftsmen has traditionally worked with organizations to do the hardest part, figuring out what you’re trying to do, and coming up with a sustainable overall QoS strategy. We then work with the customer to build standard configurations for the relevant Cisco device types.

Part of what we do is try to minimize the complexity, especially the number of variations in QoS at different points in your network.

Deployment is always interesting. Deploying QoS requires precision. Our experience is that being precise enough and not missing interfaces or forgetting to deploy parts of the QoS template trips up many sites. Sites generally want to deploy QoS internally to cut costs. We partner to verify correct deployment. Alternatively, we can deploy it.

I’ve had high hopes for APIC-EM / EasyQoS, not the least of which is hoping to lower the costs of deploying QoS, while increasing the accuracy of deployment. That would empower us to offer the design and strategy services without deployment cost being a potential barrier.

I haven’t encountered any sites using EasyQoS (as far as I know). I’ve been talking up trying it where I can.

That may mean sites don’t need a consultant to do QoS, which is a win for them. I’d like to think NetCraftsmen might be able to help with QoS design and planning. I have heard concerns about EasyQoS support for Nexus, e.g. Nexus 9Ks. Googling just now suggests that is still an issue.

LiveAction has some interesting QoS and templating capabilities, and can do CBQoS-MIB based reporting, but it too suffers from Nexus configuration impairment (per recent conversation with a LiveAction staff member). I’m not aware of anything else that might be a contender for “QoS deployment tool”.

Cisco AVC: Doomed?

Ok, I’m using “doomed” to get your attention. Here’s what that’s all about.

Cisco router AVC (formerly NBAR) does deep-dive inspection of packets to classify them. AVC is also supported in some Cisco Wireless devices, and apparently in the 3650 / 3850 / Cat 9K switches, subject to some restrictions.

Cisco AVC can export flow information, which could be useful for security. Neat technology! I like the idea of AVC. There are some practical limitations, however…

Consider HTTPS. HTTPS is HTTPS, and the source / destination is about all that is not opaque. Can AVC identify different forms of HTTPS, encrypted web traffic? I highly doubt it. So as more and more web traffic, especially Internet-bound web traffic, shifts to HTTPS, how might that traffic be classified by AVC? Long lists of destination IPs? Updated how often? I tend to doubt any of that will work well.

Google search does not show anything about AVC working with HTTPS traffic. I did find a note that ETA (Encrypted Traffic Analytics) cannot be used on the same interface as AVC.

Tentative conclusion: Cisco AVC is handy, even needed, but only for unencrypted traffic. Please comment if you have reason to think otherwise!

Another challenge for AVC: the document listing AVC supported applications is interesting. There are a lot of entries, which is good. However, specifics about what the various items actually match are not there. E.g. ms-lync versus ms-lync-something. Is ms-lync a catch-all including the video and audio? How can I find out, other than by doing time-consuming testing?

The third item in the “doom” category is L3 switches. I’m seeing a lot of sites using Nexus L3 switches for 10 Gbps and faster WAN links, links to CoLo sites, links within and between CoLo sites, etc. That may represent a conscious decision to do without AVC, and perhaps QoS. I think of QoS as insurance for your fragile traffic, so switches can only provide limited “insurance”. Fair enough; that’s a decision factor.

I’ll have to note that despite my doing a fair amount of QoS work, I have not seen sites using AVC. Some sites don’t do any QoS at all.

I end up thinking AVC can be useful for web-based and other apps if HTTPS blindness isn’t a problem for you. It does add considerable complexity, and QoS is already fairly complex. Like many choices in life or networking, one has to temper what one would like to have with what one can afford, in this case, afford in terms of time, complexity and support.

QoS in General

Non-use of QoS or AVC may be symptomatic of something else; QoS is complex and time-consuming, and AVC adds to the complexity. I’d have written “Cisco QoS,” but some of what I’ve seen with QoS / WAN boxes with GUIs is almost worse, encouraging micro-management of applications.

QoS helps in the narrow range where you are a bit tight on bandwidth or need to protect interactive voice and video. QoS cannot help when you’re badly tight on bandwidth (think police car with flashing lights trying to get through a massive traffic jam).

There’s a case for using QoS even with a lot of bandwidth (aggregation points, transition from high to lower speed, microbursts, etc.). Is that a fringe case? Maybe as LAN speeds increase? When I look at user ports in many sites, I see average loads in the Kbps still. As LAN speeds increase, if usage stays low, yes, dropped packets likely become much less of a concern.  Other sites have users moving a lot more data, videoconferencing, etc. There, QoS is recommended!

I suspect many sites are opting to implement more bandwidth, in part hoping to avoid having to deal with QoS or holding off on QoS since there’s no clear need. This might also be the result of a conscious management decision: we’re stretched too thin, we just can’t afford to do QoS.

This also surfaces in a negative sense. I’ve run into sites with users experiencing intermittent slowness. The point I’ve had to make is that end systems and applications can easily consume a 1 Gbps link, at least in bursts. That aggregates onto e.g. 1 Gbps switch uplinks. One then needs good monitoring statistics to determine if that is in fact happening and causing the downstream user slowness.

Classifying and Marking

With various voice products, it has become handy to have the application itself do the DSCP marking. Microsoft has been doing the right markings for Skype / Lync for a while. I generally follow their TechNote advice to shrink the port ranges used for various purposes. In general, I’d like total control, but applications that do the right thing, and / or allow GPO tweaking — that works too. Part of doing QoS is figuring out a sustainable and least painful way to get what you need, and maybe moving some things from the “need” bucket to the “nice to have” bucket.

Lately, we have all sorts of Internet-based voice, video, and conferencing products. Some are well documented, some poorly documented. One has clearly documented they do not use IETF and Cisco standard DSCP markings (sigh!). Ideally, an admin would be able to go to some settings page and set the markings for their organization. This sort of thing is needed, so that e.g. internal station-to-station calls can be given QoS handling, at least on-net. Conference calls, ditto.
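For what application-side marking looks like in practice, here is a minimal sketch (assuming Python on Linux) of setting the DSCP on a socket. The DSCP rides in the upper six bits of the IP TOS byte:

```python
import socket

EF = 46  # Expedited Forwarding, the standard DSCP for interactive voice

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# The DSCP occupies the upper six bits of the TOS byte, so shift left by two.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF << 2)
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # 184 (46 << 2)
```

Of course, marking at the edge only helps if the network is configured to trust and act on those markings.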

Conclusions

QoS can be very helpful as insurance for fragile interactive voice and video traffic, and perhaps streaming video as well.

Cisco Nexus L3 switches can do a modest amount of QoS, which may suffice for many purposes.

Cisco AVC is available in a number of Cisco platforms, including some wireless devices, most recent routers, and some recent campus switches. AVC can provide sophisticated application awareness. It likely cannot do much with HTTPS traffic.


Hashtags: #CiscoChampion #TheNetCraftsmenWay #QoS #Cisco


The post The Changing Cisco QoS Environment appeared first on NetCraftsmen.


This blog is a small follow-up to the Neat Cisco Nexus Features You Might Have Missed blog.

Some Cisco folks pointed me to the “Catena” feature, implemented in some Cisco Nexus 9K and in the Nexus 7K models. Catena is Latin for “chain”.

From one point of view, Catena is locally-configured hybrid OpenFlow. That is, a locally configured specification of override forwarding (or dropping) behavior based on flow specifications.

What Is Catena

Catena lets you build service chains for services from devices that connect to a single datacenter switch. That is, you can force selected traffic through external devices 1, 2, 3, etc. For example; firewall, IPS, IPsec MITM box, whatever.

(The original post includes a diagram showing what you can do with Catena.)

This to a degree generalizes the ITD topic discussed in my prior blog (link above). It is done by modifying TCAM, so you get full performance. Catena uses the notion of device groups: it load balances across devices within configured device groups.

I’ll note that Catena has “transparent” (L2) and routed modes. For more, I’ll Reference the Fine Manual (RTFM) — Nexus 9K version (or see also the Nexus 7K link below).

My Reaction

If you need to selectively forward to or bypass certain devices, Catena could be very handy.

I do need to note that selective forwarding based on more than destination address does complicate things. Factoring in behavior based on source address or source / destination port means more forwarding equivalence classes of traffic, each behaving in their own way. More to understand! This is not necessarily good or bad, it is more to note that using a feature like this adds complexity, and you need to think about managing that complexity.

My Beer Principle likely applies: one beer good, many beers equals headache. Do too much with Catena, and you may develop a headache.


Hashtags: #CiscoChampion #TheNetCraftsmenWay #CiscoCatena #CiscoNexus #DataCenter


The post Service Chaining via Cisco Catena appeared first on NetCraftsmen.


During my professional development this year, my organization advised me that they would like me to work towards the Cisco Business Architecture Specialist certifications. My first thought was, “How is this going to help me?” My belief was that I needed to focus all of my attention on things such as ACI, Cloud, and SD-WAN. I consider myself to be all things technical and really couldn’t understand how the “Business” could truly help me. It’s one of those things where you don’t “assume”, you discover.

As I began my studies, Cisco laid out three tracks to becoming a Cisco Business Architecture specialist. The three tracks are:

  • Cisco Business Architecture Analyst
  • Cisco Business Architecture Specialist
  • Cisco Business Architecture Practitioner

I set my sights on completing all three tracks and becoming a Cisco Business Architecture Practitioner. When starting this journey, I knew there would be some growing pains, as I traditionally did not look at “Business” as a requirement for being a successful consultant. It’s amusing how quickly my mindset changed after truly working through the material and gaining a fundamental understanding of the Cisco approach.

Cisco Business Architecture Analyst

The Cisco Business Architecture Analyst certification is focused on building your knowledge of the Cisco Business Architecture approach and methodology. The methodology is focused on people, process, and technology, which was driven home numerous times through the study material. One key takeaway from this track was how much business outcomes depend on how well you establish business strategies. Understanding the difference between views and viewpoints was something else that caught my attention: it allows successful mapping to business capabilities that ultimately drive the customer’s business architecture.

Cisco Business Architecture Specialist

The Cisco Business Architecture Specialist certification is focused on process. It builds on the foundation of the previous track: you develop tools that allow business architects to drive change and emphasize the importance of a business-led engagement. A few takeaways from this track helped shape my thinking about a business-led approach. The Cisco Business Model Canvas is one of the key tools for understanding the business; its information is essential to the long-term strength and success of the business model. Understanding internal and external business influencers was another topic with some important concepts. Two tools focused on influence: the SWOT analysis (Strengths, Weaknesses, Opportunities, and Threats) and the Stakeholder Analysis Grid. The SWOT analysis covers both internal and external factors, with strengths and weaknesses addressing internal influencers, while opportunities and threats address external ones. The Stakeholder Analysis Grid lets you focus specifically on the characteristics of each stakeholder in the business-led approach, so a business architect can drive conversations that are valued and place emphasis on the specific needs of each business leader and relevant stakeholder.

Cisco Business Architecture Practitioner

The Cisco Business Architecture Practitioner certification is focused on execution. At this level you should be able to lead a business-led engagement with the customer. It strengthens the concepts of the two previous tracks but really drives home execution for both the customer and the organization; it also gives partners insight into leveraging the tools provided to develop their own business architecture practice. At this stage in the game, it is all about building credibility and rapport. One of the key takeaways from this track is the business proposal, which focuses on the value to the business and consists of five key elements:

  • Executive Summary
  • Business Roadmap
  • Business Impact
  • Financial Considerations
  • Appendix

There is also the business roadmap, which is essential to delivering business outcomes: it helps determine business solutions and leverage both current and new business capabilities.

Overall, completing these tracks drives home customer maturity levels and how they shape an engagement. There are four levels of customer maturity:

  • Technology Specific – Completely siloed and one-dimensional (e.g., SD-WAN only).
  • Technology Architecture – Multifaceted, covering several technology domains (SD-WAN, Wireless, Collaboration).
  • Partial Business Engagement – Customers want to know how the technology they invest in will impact the business.
  • Business Transformation – Technology does not drive the business; the business drives the technology.

I bring this up because of a recent customer meeting. It was interesting to watch the conversation shift immediately from technology to the business. Understanding specific business needs and outcomes changed the direction of how technology would be used to approach the engagement. The conversation became business-led, and Cisco's Business Architecture certifications allowed me to navigate that discussion with an understanding of exactly how to approach the customer engagement.

Comment

I look forward to any and all comments regarding this subject, and to sharing part two of this discussion, titled “How Business Drives Technology.”

The post Engineering for the Business appeared first on NetCraftsmen.


This is one of several blogs about the vendor presentations during Network Field Day 19, which took place November 7-9, 2018. This blog contains a summary of the vendor presentation and any related comments or opinions I might have (I’ll share at least some of them).

If this blog motivates you to greater interest in what the vendor had to say, you can find the cleaned up streaming video of their presentations at the Tech Field Day YouTube channel, specifically the NFD19 playlist, or by clicking on the vendor’s logo on the main #NFD19 web page (linked above).

Kentik at #NFD19

Kentik presented about their cloud-based big data approach to analyze many kinds of flow data, both from internal sources and cloud sources. Kentik goes beyond many NetFlow tools by ingesting cloud flow logs and other types of information, automating detection of supporting information (BGP, geo-location, etc.) and innovating to add business and technical value to reports. Kentik Detect also provides threshold and anomaly alerting. And of course, flow data now has a security role as well. 

Avi Freedman (Co-founder and CEO) presented on “Tagging and Enrichment”. I have my own personal spin on this. I think most of us that have worked with NetFlow tools feel like we exhausted their possibilities pretty quickly: packets, source IP, destination IP, protocol, port, etc. There’s also some pain associated with having to interpret the raw data.

Kentik goes beyond that limiting perspective. One of Avi's / Kentik's key insights is that adding contextual information to flow data enhances reporting, makes it more business-relevant, and makes the data easier to consume. That is, it makes the data useful to the business without working at the packet level.

For example, pulling up data by site name rather than IP address, or seeing ports as named applications. Tie in BGP data so you can identify network exit points (important for Service Providers and large-WAN cost management). Tie in SNMP data to get interface names.
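To make the idea concrete, here is a minimal sketch of that kind of enrichment: raw flow records tagged with a site name (from an IP prefix) and an application name (from a port). The mappings and field names are hypothetical illustrations, not Kentik's actual data model or API.

```python
# Sketch: enriching raw flow records with business context.
# The site/application mappings below are made-up examples.

SITE_BY_PREFIX = {"10.1.": "Baltimore-DC", "10.2.": "Reston-Branch"}
APP_BY_PORT = {443: "HTTPS", 3306: "MySQL"}

def enrich(flow):
    """Add site and application names to a raw flow record."""
    enriched = dict(flow)
    enriched["src_site"] = next(
        (site for prefix, site in SITE_BY_PREFIX.items()
         if flow["src_ip"].startswith(prefix)),
        "unknown")
    enriched["app"] = APP_BY_PORT.get(flow["dst_port"], str(flow["dst_port"]))
    return enriched

flow = {"src_ip": "10.1.4.7", "dst_ip": "10.2.9.3", "dst_port": 443, "bytes": 4200}
print(enrich(flow)["src_site"], enrich(flow)["app"])  # Baltimore-DC HTTPS
```

A report grouped on `src_site` and `app` instead of raw IPs and ports is the "business-relevant" view Avi described, and it requires no packet-level digging by the reader.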

Jim Meehan (Director of Product Marketing) covered some of this in his Service Provider presentation. Interesting capabilities include My Kentik white-label portals for customer-specific views of data.

Business relevance means that Kentik moves beyond IT, for instance it can help sales prospecting in Service Providers.

Kentik can use its k-probe agent to get better data from servers. Avi noted that they’re now having to show value in order to get the agent onto servers. Thinking as a server owner might: time / hassle, need to support agent going forward, potential performance impact or interaction with other agents, risk, etc. I hear that adding flow / packet agents or security agents or APM agents is tough to accomplish!

Avi also mentioned that Kentik is seeing more automation / workflow integration, including ServiceNow.

The most recent feature from Kentik is the ability to ingest AWS and Google flow records. Azure is apparently coming in 2019.
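For a sense of what "ingesting cloud flow records" involves, here is a rough sketch of parsing an AWS VPC Flow Log line in its default version-2 format. The sample record is invented, and this is only the first step; a real pipeline would then enrich and aggregate as described above.

```python
# Sketch: parsing an AWS VPC Flow Log record (default version-2 format).
# Field order follows the AWS default format; the sample line is made up.

FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]

def parse_flow_log(line):
    """Split a space-delimited flow-log line into a dict of named fields."""
    record = dict(zip(FIELDS, line.split()))
    for key in ("srcport", "dstport", "protocol", "packets", "bytes"):
        record[key] = int(record[key])
    return record

sample = ("2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 "
          "49152 443 6 10 8400 1544000000 1544000060 ACCEPT OK")
rec = parse_flow_log(sample)
print(rec["dstport"], rec["action"])  # 443 ACCEPT
```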

The final demo by Jim really caught my eye. You can find it after Crystal Li’s presentation on leveraging cloud information.

For that demo, the group was invited to connect to http://sockshop.gcp.kentik.io and drive the (simple) container-based website to generate traffic between components. Jim then showed the ensuing GCP flow information and some reports he’d set up. Jim had added the Istio service mesh to the setup to get latency data by URL path, without having to add agents.

This is a cool capability: network-type data for VM or container flows, without having to get app owners to install agents. Also, reporting not tied to a particular tool or cloud (admittedly limited by whatever the cloud provider chooses to export).

I’m sure the various forms of flow data will have their strengths, weaknesses, and just plain gaps.

I highly recommend you watch Avi’s tech talk and the Cloud demo. The closing “What’s Next” section might also be of interest. The #NFD19 delegates really liked the negative roadmap discussion.

Comments

Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!

—————-

Hashtags: #CiscoChampion #TheNetCraftsmenWay #Kentik #NFD19 #Cloud #BigData

Twitter: @pjwelcher

Disclosure Statement

NetCraftsmen Services

Did you know that NetCraftsmen does network / datacenter / security / collaboration design / design review? Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at info@netcraftsmen.com.

The post Kentik Adds Value, Gets Cloudy appeared first on NetCraftsmen.


This is one of several blogs about the vendor presentations during Network Field Day 19, which took place November 7-9, 2018. This blog contains a summary of the vendor presentations and any related comments or opinions I might have (I’ll share at least some of them).

If this blog motivates you to greater interest in what the vendor had to say, you can find the cleaned up streaming video of their presentations at the Tech Field Day YouTube channel, specifically the NFD19 playlist, or by clicking on the vendor’s logo on the main #NFD19 web page (linked above).

We’ve heard from Apstra before, and I’ve become somewhat of a fan: multi-vendor automation at large scale, with a current focus on broadening its data center fabric automation. What’s not to like?

Apstra

Apstra is a hot network automation startup that started talking about “Intent-Based Networking” before that became a Thing.

Apstra now has some marquee customers and has been adding AOS features. AOS is a vendor-neutral platform for expressing intent, automating deployment, and managing the network, with their current focus being datacenter spine-leaf fabrics.

AOS supports features like VLANs, VXLAN, BGP EVPN, anycast gateway with ARP suppression, IP multicast, and VRFs. AOS caught my eye a while back as a possible alternative to Cisco’s fabric management tools, especially if one wants just fabric without security. DCNM LAN is Cisco’s alternative for that niche — or ACI and just not using some of the features.

One key difference is that AOS supports selected equipment from other switch vendors, which Apstra considers a way to reduce risk.

Presentation by Mansour Karam, CEO and Founder, Apstra

I’ll refer you to the online #NFD19 videos for Carly Stoughton’s whiteboarding overview, and for how DJ Spry set sped-up screen-capture automation to the tune of Benny Hill to emphasize how fast Apstra can deploy equipment (once the physical racking and cabling has been done). He then toured the UI a bit.

Rags Rachamadugu demonstrated Intent-Based Analytics, including how context enriches the analytics. The telemetry data is exposed via the API, and there is a catalog of probes as well. Ones that I noted: mismatch of vSphere VLAN versus physical VLAN, MLAG or ECMP imbalance, and hot / cold interfaces (versus normal traffic levels). We were told they’re all up on GitHub for community access.
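The core idea behind such a probe is simple: compare declared intent against observed telemetry and flag the differences. Here is a minimal sketch of a VLAN-mismatch check; the data shapes and function name are purely illustrative, not the AOS probe API.

```python
# Sketch: an intent-based analytics probe, checking declared (intended)
# VLANs against observed telemetry per interface. Illustrative only.

def vlan_mismatch_probe(intended, observed):
    """Return interfaces whose observed VLAN set differs from intent."""
    anomalies = {}
    for interface, vlans in intended.items():
        seen = observed.get(interface, set())
        if seen != vlans:
            anomalies[interface] = {"intended": vlans, "observed": seen}
    return anomalies

# Intent says leaf1:eth2 should carry only VLAN 30; telemetry sees 30 and 40.
intended = {"leaf1:eth1": {10, 20}, "leaf1:eth2": {30}}
observed = {"leaf1:eth1": {10, 20}, "leaf1:eth2": {30, 40}}
anomalies = vlan_mismatch_probe(intended, observed)
print(anomalies)
```

The same pattern (intent vs. telemetry, alert on divergence) generalizes to the other probes mentioned, such as MLAG/ECMP imbalance or abnormal interface traffic levels.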

Some comments along the way indicated that Apstra is focused on extending “intent” to VMware and to devices that attach to the datacenter fabric. I imagine we’ll be hearing more about that as AOS evolves!

Ryan Booth and Jere Julian (of NetworkToCode) discussed and demonstrated AOS and ServiceNow Integration. NetworkToCode can provide AOS / Intent-Based Networking (IBN) training and other services.

David Cheriton (Founder) wrapped up with an off-camera sneak preview roadmap. There is some more neat functionality coming!

Conclusions

Apstra is expanding / broadening functionality while expanding the scope of Intent-Based Networking (and Telemetry, and Analytics, and more).

Comments

Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!

—————-

Hashtags: #CiscoChampion #TheNetCraftsmenWay #Apstra #NFD19 #Networking

Twitter: @pjwelcher

Disclosure Statement

NetCraftsmen Services

Did you know that NetCraftsmen does network / datacenter / security / collaboration design / design review? Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at info@netcraftsmen.com.

The post Apstra’s Intent-Based Networking appeared first on NetCraftsmen.
