Read the latest industry updates and hot topics in enterprise-scale network monitoring from the Kentik Blog. Kentik is a cloud-based network visibility and analytics solution that delivers unprecedented depth of insight into any network.
Kentik Blog by Crystal Li, Product Marketing, &.. - 9M ago
In a recent blog post, we announced the availability of the “My Kentik” self-service analytics portal. We highlighted the great value-add that Kentik’s service provider customers can now deliver to their end-customers with real-time network data insights. Read on to see why having this visibility is important not only to the service provider, but also to their end-customers, who consume those services — subscribers, digital enterprises, hosting customers, etc.
Where is my money spent?
Imagine you are looking for a place to rent. Eventually, you narrow it down to three properties that you like equally. Each of the landlords offers a different option for paying rent:
Option 1: You pay a fixed monthly amount to your landlord — rent, garbage, utilities, are all included, but no further breakdown details are provided.
Option 2: You pay only the rent to your landlord. You pay all other bills on your own directly to the providers.
Option 3: You can log in online to view and pay an itemized bill with all the details presented. Even better, you can make suggestions on choosing different utility vendors.
Hold that thought for a moment and think about managing the infrastructure OpEx in your business. How would you want to pay those bills?
Self-service portals are critical customer service tools
Surveys of service customers make the case starkly:
90% said they expect an online portal for self-service; and
74% have used a self-service support portal to find answers to service-related questions before calling an agent.
So which factors are driving the self-service revolution in the customer engagement model?
First of all, the self-service portal visualizes the service — it gives consumers a better way to utilize existing services and discover new services that can help solve the business problem.
Second, the self-service portal improves communication and productivity. When the service consumer and provider share the same view, conversations become more meaningful and issues can be pinpointed more efficiently.
Third, the self-service portal increases the perceived value of the services themselves by making them visible to consumers.
Last but not least, as a positive side effect, service consumers now have the opportunity to learn new skills and improve knowledge of the products and services provided by the service provider.
How “My Kentik” can help
The “My Kentik” portal is a built-in feature of the Kentik platform that enables curated, self-service network traffic visibility for downstream customers. Now service consumers can get a level of visibility that’s never before been available to them — powered by the same dataset their providers have, but highly customized.
Imagine you are subscribed to DDoS protection services from an MSSP (managed security service provider). When an incident occurs, using “My Kentik”, you can now immediately access the traffic and alert details yourself. That allows both high-level and drill-down analysis to speed up troubleshooting and issue resolution, even while waiting on the phone with customer service. To see for yourself, sign up for a demo or free trial.
If you are an enterprise customer who buys Internet/MPLS circuits, with “My Kentik”, you can now view your WAN traffic utilization and understand which departments, WAN sites, and data centers are driving network utilization. You can also use this data to make a more cost-optimized plan for network changes and expansion.
If you’re part of a team of application developers or architects, the “My Kentik” portal can provide cloud or data center traffic insights, so you can understand how the applications you’re deploying are affecting the infrastructure and vice versa. That includes the east-west traffic between microservices and other components within the app itself and informs app architecture decisions that help avoid impact from infrastructure bottlenecks.
If you are a customer receiving IP transit or peering services, “My Kentik” can provide traffic utilization and billing correlation analysis. Breakdowns by port utilization, geolocation, ASN, and traffic type uncover the drivers behind total traffic volume and explain unexpected increases and cost exposure.
If you’re a customer of hosting or IaaS services, “My Kentik” can provide an understanding of traffic details such as per-host and per-site utilization, application/service utilization, and even connection-level details of each host’s historical network activity for security and forensic analysis.
My Kentik Portal delivers transparency across all of these customer types and use-cases that can shorten incident response time, make troubleshooting far easier, enable more robust infrastructure, and accelerate business growth. Talk to your service provider now and ask about their integration with Kentik.
While we’re based in the US, we have always found both great interest and ready adoption by providers and digital businesses around the world, including those in Europe. Despite a relatively short 3+ years since our launch, Kentik’s SaaS solution has already been embraced by dozens of organizations across the European continent — particularly those who deliver or depend upon routed networks and the Internet as an essential part of their operations and business.
Throughout the past several years, Kentik has engaged in active discussions with many more European organizations who understand and value what the Kentik service can provide, in terms of network visibility, traffic analytics, network performance intelligence, and streaming anomaly detection. And because Europe comprises a large number of densely clustered countries, there is a pressing need for clarity and insight around peering and transit, for both technical operations and business optimization. Reaching all of these potential new clients requires, at some point, more dedicated resources that are aligned directly with the needs of the European marketplace. The Kentik team has long understood this, and recently committed to make such investments a reality.
And with that decision comes our big news:
The team is embarking upon a major initiative and rollout to expand services within Europe and the greater EMEA region:
First and foremost, this includes offering SaaS hosting from a European location, in Frankfurt, Germany. We chose Frankfurt because of its abundant network connectivity, myriad choices for data center facilities, and its excellent fit for helping with requirements for data residency shared by many European organizations.
Second, Kentik is expanding our channel program to add more local European partners, including our first in Germany, Diverse GmbH, with plans to continue growing our partner roster in the months and years ahead. Diverse GmbH joins our existing list of regional partners, which include Interdata, Baffin Bay Networks, and Acorus Networks.
Lastly, Kentik is investing in local personnel resources within Europe, adding field roles to our existing operations personnel in region.
“On a daily basis, we hear of new and existing network management challenges from enterprises and service providers alike. All are seeking effective technologies for visibility and insights into transit, capacity, and interconnection, and yet regulations create new obstacles for where organizations can store network and user data,” said Jean-Marc Odet, CEO of Interdata, a leading French company specializing in network integration, architecture and security and a value-added reseller (VAR) for Kentik. “We’re excited to see Kentik expand into Europe. Kentik’s powerful network analytics will help many more organizations in the region address their network challenges, and we look forward to working with them on their efforts.”
With these growth initiatives, Kentik will be in place to support more customers in more countries with both shared and specific needs. We will remove barriers to adoption and give European organizations a real choice when it comes to cutting-edge network traffic and performance monitoring solutions. Kentik is bringing the world’s most powerful network analytics to Europe! To learn more, read our press release or reach out to us.
Kentik Blog by Dan Rohan, Customer Success Enginee.. - 10M ago
Let’s face it — today’s networks are complex. The physical topology continues to expand with relentless traffic growth, and a constant stream of new technologies like SDN, Clos architectures, and cloud interconnects make it even harder to understand how services traverse the network between application infrastructure and users or customers. Even simple questions like “where will this traffic exit my network?” become difficult to answer.
It’s that “exit point” question in particular that led Kentik to build our BGP Ultimate Exit feature set (UE for short). In a nutshell, UE enriches all the flow data Kentik receives with tags that indicate the PoP, router, interface, etc. where that flow will exit the network, potentially many hops away. For outbound traffic, the exit might be to an adjacent network, a customer, or an upstream provider. For inbound traffic, the “exit” might be to a host in a data center. In either case, UE is insanely valuable across both enterprise and service provider networks, for diagnosing routing issues, understanding which traffic sources are driving traffic growth at remote points in the network, or even understanding the relative cost to serve each customer.
To be clear, UE is fundamentally different than simply looking at traffic based on the typical source or destination fields available in regular NetFlow. To illustrate, let’s consider a network that looks roughly like the diagram below. Each network device (switch, router) in this network is sending data to Kentik (flow, BGP, and SNMP data).
As you can see, we have a network where traffic is entering an edge/border device destined for a host on the same network.
Based upon the flow generated from the border router, we can see that this flow is originating from 22.214.171.124 and headed towards 126.96.36.199. Looking at that same flow, we see that the source interface is the transit interface and the destination interface is the backbone interface connected to device B. We can figure all of this out just by using the standard fields in flow records exported from the edge device alone.
But what if you wanted to figure out which piece of network gear is actually handing this traffic off to the destination host (188.8.131.52)? If you don’t have a detailed mental map of your network (and honestly, who really does at this point?) the traditional approach is to log into a router, run a show route command to figure out which adjacent device is announcing the IP internally, then jump to that device to look at ARP tables and figure out which interface the host is attached to. Rinse and repeat.
BGP Ultimate Exit dramatically simplifies this by determining the egress point (err, the “ultimate exit”) at the exact moment that the traffic ingresses the first router. Kentik stitches together flow data, BGP routes, and SNMP interface IP data to determine precisely where the traffic will be routed, without any “flow hopping” recursive lookups or logging into your routers. Here’s a quick sketch of how it works:
As packets ingress the router, it performs lookups to determine how to forward the traffic and also creates a flow record which is exported to Kentik.
Kentik’s ingest layer maintains a full BGP table from each router. By looking up the dest IP from the flow record in the BGP table for the router it was received from, we enrich each flow with additional BGP-related fields, including the BGP next-hop IP address for the matching route.
Kentik’s ingest layer also maintains an SNMP interface IP table for every device in the network. By looking up the next-hop IP in this table, we can tag each flow with the egress router and site that the next-hop IP is associated with.
Kentik also maintains an auto-generated in-memory table of the ASNs that are adjacent to each interface. Comparing each flow against this table allows us to additionally tag it with a specific egress interface.
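The lookup steps above can be sketched in a few lines of Python. This is an illustrative model only, not Kentik's actual data model: the table names, routers, and addresses are hypothetical, and real BGP RIBs and SNMP tables are far larger.

```python
import ipaddress

# Hypothetical per-router BGP RIB: prefix -> BGP next-hop IP (step 2 above)
bgp_tables = {
    "router-a": {"203.0.113.0/24": "198.51.100.2"},
}
# Hypothetical SNMP interface-IP table: interface IP -> (device, site, interface) (step 3)
snmp_interface_ips = {
    "198.51.100.2": ("router-b", "fra1", "xe-0/0/1"),
}

def ultimate_exit(ingress_router, dest_ip):
    """Tag a flow with its egress device/site by stitching BGP + SNMP data.

    1. Longest-prefix match of the flow's destination IP in the ingress
       router's BGP table yields the BGP next-hop.
    2. Looking that next-hop up in the SNMP interface-IP table yields the
       egress device, site, and interface.
    """
    dest = ipaddress.ip_address(dest_ip)
    rib = bgp_tables[ingress_router]
    # Longest-prefix match over the router's RIB
    best = max(
        (ipaddress.ip_network(p) for p in rib if dest in ipaddress.ip_network(p)),
        key=lambda n: n.prefixlen,
        default=None,
    )
    if best is None:
        return None  # no covering route: exit point unknown
    next_hop = rib[str(best)]
    return snmp_interface_ips.get(next_hop)

print(ultimate_exit("router-a", "203.0.113.42"))
# -> ('router-b', 'fra1', 'xe-0/0/1')
```

The key design point is that everything is resolved at ingest time from data the platform already holds, so no device is ever queried on the data path.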
As another example, imagine you wanted to see the destinations for content from a specific server or set of servers, and how that traffic was being delivered over your infrastructure. Perhaps you’re planning some network maintenance and want to detail the expected impact of your work. Kentik’s UE feature makes this easy. We’ll use the same diagram, but flip the arrows around:
Steps in the Kentik UI:
Select all devices to make sure the query considers the entire network
Add filters to uniquely identify traffic from the server in question. For example, if you wanted to see where traffic from source IP 184.108.40.206 left your network, you’d set up your filters like so:
Choose the dimensions to include in the query output. The example below is particularly useful, showing the network entry point(s) by source IP, source interface and ingress device, then the egress UE device and interface, together with the next-hop ASN and destination IP.
The output will be similar to the diagram below, showing the end-to-end flow of the traffic from that particular server:
For service providers, UE can help sales and product teams understand how each customer impacts network load, and the relative cost of delivering each customer’s traffic. By applying filters for a customer’s ASN, interfaces, or subnets, the SP can get a view of how much of the customer’s traffic is delivered locally out of the same PoP or region where it ingressed, and how much traverses the SP’s backbone to egress in remote regions (with a much higher associated cost). This type of visibility allows product teams to create differentiated pricing structures, calculate margin for each customer, and enforce contracts with “traffic distance” terms.
As you can see, Kentik’s BGP Ultimate Exit feature set provides key functionality for understanding and managing traffic flows in networks of all types.
Kentik Blog by Ken Osowski, Independent Industry C.. - 10M ago
Flow-based network monitoring relies on collecting information about packet flows (i.e. a sequence of related packets) as they traverse routers, switches, load balancers, ADCs, network visibility switches, and other devices. Network elements that support traffic monitoring protocols such as NetFlow and sFlow extract critical details from packet flows, like source, destination, byte and packet counts, and other attributes. This “metadata” is then streamed to “flow collectors” so that it can be stored and analyzed to produce network-wide views of bandwidth usage, identify abnormal traffic patterns that could represent possible security threats, and zero in on congestion.
Flow-based monitoring provides significant advantages over other network monitoring methods. It can capture substantially more detail about network traffic composition than SNMP metrics, which show only total traffic volume. It’s also significantly less expensive and easier to deploy than raw packet capture. And since flow generation capability is now a built-in feature of almost all modern network equipment, it provides a pervasive monitoring footprint across the entire network.
What is NetFlow?
NetFlow is the trade name for a stateful flow protocol invented by Cisco Systems that is widely used in the networking industry. In networking terms, a “flow” is a unidirectional set of packets sharing common attributes such as source and destination IP, source and destination ports, IP protocol, and type of service. NetFlow statefully tracks flows (or sessions), aggregating packets associated with each flow into flow records, which are then exported. NetFlow records can be generated based on every packet (unsampled or 1:1 mode) or based on packet sampling. Sampling is typically employed to reduce the volume of flow records exported from each network device.
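The stateful aggregation just described can be sketched as follows. This is a simplified model for illustration only: real exporters also track timestamps, TCP flags, and interface indexes, and export records on active/inactive timeouts.

```python
from collections import defaultdict

def flow_key(pkt):
    """The common attributes that define a flow: 5-tuple plus type of service."""
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"],
            pkt["dst_port"], pkt["proto"], pkt.get("tos", 0))

def build_flow_records(packets):
    """Aggregate packets into unidirectional flow records, NetFlow-style:
    packets sharing a key collapse into one record with summed counters."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        rec = flows[flow_key(pkt)]
        rec["packets"] += 1
        rec["bytes"] += pkt["length"]
    return flows

# Two packets of the same TCP session yield a single flow record
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000,
     "dst_port": 443, "proto": 6, "length": 1500},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 40000,
     "dst_port": 443, "proto": 6, "length": 400},
]
flows = build_flow_records(packets)
# One record: 2 packets, 1900 bytes
```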
The most commonly deployed NetFlow versions are v5 and v9, with the main difference being that v9 supports “templates” which allow flexibility over the fields contained in each flow record, while v5 flow records contain a fixed list of fields. IPFIX is an IETF standards-based protocol that is largely modeled on NetFlow v9. Other NetFlow variants supported by various network equipment vendors include J-Flow (NetFlow v5 variant) from Juniper, cflowd from Alcatel-Lucent and Juniper, NetStream from 3Com/Huawei and RFlow from Ericsson.
What is sFlow?
sFlow, short for “sampled flow,” is a packet sampling protocol created by InMon Corporation that has seen broad network industry adoption, including by vendors that already support NetFlow, such as Cisco, by router vendors like Brocade (now Extreme), and in many switching products, including those from Juniper and Arista. It is used to record statistical, infrastructure, routing, and other metadata about traffic traversing an sFlow-enabled network device. sFlow doesn’t statefully track flows, as is the case with NetFlow, but instead exports a statistical sampling of individual packet headers, along with the first 64 or 128 bytes of the packet payload. sFlow can also export metrics derived from time-based sampling of network interfaces or other system statistics.
sFlow does not support unsampled mode like NetFlow does, nor does it timestamp traffic flows. It relies on accurate and reliable statistical sampling methods for documenting flows, thereby reducing the amount of flow information that ultimately needs processing and analysis.
Key Differences Between NetFlow and sFlow
Here are some key differences between NetFlow and sFlow:
NetFlow does not forward packet samples directly to collectors but instead exports “flow records” to collectors that are created by tracking a collection of packets associated with a session. This session-specific, summary flow information is created as a single record in the network device’s RAM or TCAM. The device then exports a NetFlow datagram that contains multiple flow records. This stateful session tracking requires its share of network device CPU and memory resources, in some cases a significant amount when higher packet sampling rates are configured.
sFlow packet sampling consists of randomly sampling individual packets. Based on a defined sampling rate, an average of 1 out of N packets is randomly sampled. sFlow captures packet headers and partial packet payload data into sFlow datagrams that are then exported to collectors.
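That sampling step can be sketched as below. This is illustrative only: real sFlow agents sample in hardware or the switching ASIC, and the packet source and header size here are made up.

```python
import random

def sample_packets(packets, n, header_bytes=128, seed=None):
    """Randomly sample roughly 1-in-N packets, keeping only the leading
    bytes of each (sFlow exports the header plus the first 64 or 128
    bytes of payload, not the whole packet)."""
    rng = random.Random(seed)
    samples = []
    for pkt in packets:
        if rng.randrange(n) == 0:  # each packet kept with probability 1/N
            samples.append(pkt[:header_bytes])
    return samples

# With N=1000, about 0.1% of packets are exported as samples
traffic = [b"\x00" * 256] * 100_000
samples = sample_packets(traffic, 1000, seed=42)
```

Because each packet is an independent coin flip, the sample count fluctuates around the expected value (here, around 100) rather than hitting it exactly.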
Since sFlow captures the entire packet header, by default it’s able to provide full layer 2–7 visibility into all types of traffic flowing across the network, including MAC addresses, VLANs, and MPLS labels, in addition to the Layer 3 and 4 attributes typically reported by NetFlow. sFlow also has less resource impact on devices, since it only performs packet sampling and does not have to identify and keep track of sessions as is the case with NetFlow.
sFlow has the option to export interface and other system counters to collectors. Counter sampling performs periodic, time-based sampling or polling of counters associated with an interface enabled for sFlow. Interface statistics from the counter record are gathered and sent to collectors by sFlow. sFlow analysis applications can then display the traffic statistics in a report, which helps isolate network device issues. Three different categories of counters can be generated:
Generic interface counters: records basic information and traffic statistics on an interface
Ethernet interface counters: records traffic statistics on an Ethernet interface
Processor information: records CPU usage and memory usage of a device
Flow Protocol Capabilities
In short: NetFlow exports only the specific fields defined by its record format, including source and destination subnet/prefix, while sFlow’s full packet header capture lets a collector derive any field it needs.
Both NetFlow and sFlow support sampling techniques. With sFlow it’s required, and with NetFlow it’s optional. There is a long-running discussion in the industry about the accuracy of data and insight derived from sampling-based flow protocols. With sampling enabled, network devices generate flow records from a 1-in-N subset of traffic packets. As the variable N increases, flow records derived from the samples may become less representative of the actual traffic, especially for low-volume flows over short time windows. In the real world, how high can N be while still enabling us to see a given traffic subset that is a relatively small part of the overall volume? What is the impact on the accuracy of flow record analysis? Testing performed by Kentik indicates that even at sampling rates as low as 1:10000, lower bandwidth traffic patterns are discernible even in high throughput networks.
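A toy simulation makes the accuracy trade-off concrete. This is not Kentik's test methodology, just a quick illustration of how scaling a 1-in-N sample count back up by N estimates true volume, and how noisy that estimate is for a flow of a given size.

```python
import random

def estimate_volume_error(true_packets, n, trials=30, seed=7):
    """Simulate 1-in-N sampling of a flow of `true_packets` packets and
    return the mean relative error of the scaled-up estimate (samples * N)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sampled = sum(1 for _ in range(true_packets) if rng.randrange(n) == 0)
        estimate = sampled * n
        errors.append(abs(estimate - true_packets) / true_packets)
    return sum(errors) / trials

# A 100k-packet flow sampled at 1:1000 yields ~100 samples per trial,
# so the estimate typically lands within roughly 10% of the true size.
avg_err = estimate_volume_error(100_000, 1000)
```

Shrink the flow (or raise N) and the relative error grows, which is exactly the low-volume, short-time-window caveat described above.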
Choosing Between NetFlow and sFlow
So which is better? In many ways, sFlow provides a more comprehensive picture of network traffic, because it includes the full packet header, from which any field can be extracted, whereas NetFlow records typically contain only a subset of those fields. sFlow also typically places less load on network devices. In many cases, the choice is not up to the user though, because most networking gear supports only one or the other. Many networks contain gear from multiple vendors, and the key question for the network operator then becomes — does my network monitoring platform support all of the flow protocols that my network generates? This is an important consideration to ensure there are no visibility gaps across the infrastructure.
Contemporary big data network monitoring platforms, such as Kentik Detect®, are well suited to these challenges. Kentik’s adoption of a big data architecture is at the core of its network flow-based monitoring platform, which supports NetFlow, IPFIX, and sFlow protocols. This allows Kentik to correlate high volumes of flow data records for customers, eliminating network monitoring accuracy concerns. Big data is not only about handling large volumes of data, but also letting network operations staff navigate through and explore that data very quickly.
To get more info on how NetFlow and sFlow are used see these Kentik blogs. To see how Kentik Detect can help your organization instrument its network with multiple flow-based protocols, request a demo or sign up for a free trial today.
Kentik Blog by Ken Osowski, Independent Industry C.. - 1y ago
NetFlow is a protocol that was originally developed by Cisco to help network operators gain a better understanding of their network traffic conditions. Once NetFlow is enabled on a router or other network device, it tracks unidirectional statistics for each unique IP traffic flow, without storing any of the payload data carried in that session. By tracking only the metadata about the flows, NetFlow offers a way to preserve highly useful traffic analysis and troubleshooting details without needing to perform full packet capture — the latter of which can be very expensive and yield few incremental benefits.
NetFlow monitoring solutions quickly evolved as commercialized product offerings represented by three main components:
NetFlow exporter: A NetFlow-enabled router, switch, probe or host software agent that tracks key statistics and other information about IP packet flows and generates flow records that are encapsulated in UDP and sent to a flow collector.
NetFlow collector: An application responsible for receiving flow record packets, ingesting the data from the flow records, pre-processing and storing flow records from one or more flow exporters.
NetFlow analyzer: An analysis application that provides tabular, graphical and other tools and visualizations to enable network operators and engineers to analyze flow data for various use cases, including network performance monitoring, troubleshooting, identifying security threats and capacity planning.
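As a concrete illustration of the collector's first job, here is a minimal parse of the fixed 24-byte NetFlow v5 datagram header. The field layout follows the published v5 export format; the code itself is a sketch, not any vendor's implementation, and the sample datagram is synthetic.

```python
import struct

# NetFlow v5 header: version, count, sys_uptime, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval (big-endian)
V5_HEADER = struct.Struct("!HHIIIIBBH")  # 24 bytes

def parse_v5_header(datagram):
    """Parse the fixed NetFlow v5 header a collector receives over UDP,
    before iterating over the 48-byte flow records that follow it."""
    (version, count, sys_uptime, unix_secs, unix_nsecs,
     flow_sequence, engine_type, engine_id,
     sampling) = V5_HEADER.unpack_from(datagram)
    if version != 5:
        raise ValueError(f"not a NetFlow v5 datagram (version={version})")
    return {"version": version, "count": count,
            "unix_secs": unix_secs, "flow_sequence": flow_sequence}

# Synthetic header: version 5, 2 flow records, export sequence 1000
dgram = V5_HEADER.pack(5, 2, 0, 1_600_000_000, 0, 1000, 0, 0, 0)
hdr = parse_v5_header(dgram)
# -> {'version': 5, 'count': 2, 'unix_secs': 1600000000, 'flow_sequence': 1000}
```

The `count` field tells the collector how many flow records follow, and `flow_sequence` lets it detect dropped export datagrams.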
Cisco started by providing NetFlow exporter functions in their various network products running Cisco’s IOS software. Cisco has since developed a vertical ecosystem of NetFlow partners who have mainly focused on developing NetFlow collector and analysis applications to fill various network monitoring functions.
In addition to Cisco, other networking equipment vendors have developed NetFlow-like or compatible protocols, such as J-Flow from Juniper Networks or sFlow from InMon, to create exporter interoperability with third-party collector and analysis application vendors that also support NetFlow, creating a horizontal ecosystem across networking vendors.
The IETF also created a standard flow protocol format called IPFIX, which builds on Cisco’s NetFlow but serves as an open, industry-driven standard, allowing flow protocols to be enhanced consistently for the entire networking industry instead of Cisco evolving NetFlow unilaterally.
NetFlow collector and analysis applications represent two key capabilities of NetFlow network monitoring products that are typically implemented on the same server. This is appropriate when the volume of flow data being generated by exporters is relatively low and localized. In cases where flow data generation is high or where sources are geographically dispersed, the collector function can be run on separate and geographically distributed servers (such as rackmount server appliances). In these cases, collectors then synchronize their data to a centralized analyzer server.
Products that support NetFlow components can be classified as follows (with example vendor products listed in each category):
NetFlow exporter support in a device:
Cisco 10000 and 7200 routers
Cisco Catalyst switches
Juniper MX and PTX series routers (via IPFIX)
Stand-alone NetFlow collector:
SevOne NetFlow Collector
NetFlow Optimizer (NetFlow Logic)
Stand-alone NetFlow analyzer:
SolarWinds NetFlow Traffic Analyzer (NTA)
PRTG Network Monitor
ManageEngine NetFlow Analyzer
Bundled NetFlow collector and analyzer:
Arbor Networks PeakFlow
Open source NetFlow network monitoring:
Network monitoring products that focus on machine and probe data:
Network incident monitoring vendors like Splunk collect a lot of machine and probe data. Many vendors in this product category are seeing the value of integrating NetFlow. However, these platforms are designed primarily to deal with unstructured data like logs. Highly structured data like NetFlow often contains fields with formats that require translation or correlation with other data sources to provide value to the end user.
Pushing NetFlow Limits
With DDoS attacks on the rise, NetFlow has been increasingly used to identify these threats. NetFlow is most effective for DDoS troubleshooting when sufficient flow record detail is available and can be compared with other data points such as performance metrics, routing, and location. Unfortunately, until recently, state-of-the-art NetFlow analysis tools have struggled to achieve troubleshooting effectiveness due to data reduction. The volume of NetFlow data can be overwhelming: millions of flows per second per collector on large networks.
Since most NetFlow collectors and analysis tools are based on scale-up software architectures hosted on single servers or appliances, they have extremely limited storage, compute and memory capacity. As a result, it is common practice to roll up the details into a series of summary reports and to discard the raw flow record details. The problem with this approach is that most of the detail needed for operationally useful troubleshooting is lost. This is particularly true when attempting to perform dynamic baselining, which requires scanning massive amounts of NetFlow data to understand what is normal, then looking back days, weeks or months in order to assess whether current conditions are the result of a DDoS attack or an anomaly.
How Cloud and Big Data Improve NetFlow Analysis
Cloud-scale computing and big data techniques have opened up a great opportunity to improve both the cost and functionality of NetFlow analysis and troubleshooting use cases. These techniques include:
Big data storage allows for the storage of huge volumes of augmented raw flow records instead of needing to roll-up the data to predefined aggregates that severely restrict analytical options.
Cloud-based SaaS options save network managers from incurring CapEx and OpEx costs related to dedicated, on-premises appliances.
Scale-out NetFlow analysis can deliver faster response times to operational analysis queries on larger data sets than traditional appliances.
The key to solving the DDoS protection accuracy issue is big data. By using a scale-out system with far more compute and memory resources, a big data approach to DDoS protection can continuously scan network-wide data on a multi-dimensional basis without constraints.
Cloud-scale big data systems make it possible to implement a far more intelligent approach to the problem, since they are able to:
Track and baseline millions of IP addresses across network-wide traffic, rather than being restricted to device level traffic baselining.
Monitor for anomalous traffic using multiple data dimensions such as the source geography of the traffic, destination IPs, and common attack ports. This allows for greater flexibility and precision in setting detection policies.
Apply learning algorithms to automate the upkeep of detection policies to include all relevant destination IPs.
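A toy illustration of the per-key, multi-dimensional baselining idea above: track a rolling window of observations for each (destination IP, port, source geo) key and flag values that break out of that key's own history. The keying and thresholding here are illustrative, not Kentik's actual algorithm.

```python
from collections import deque
from statistics import mean, stdev

class Baseline:
    """Rolling per-key traffic baseline: flag a value that exceeds
    mean + k * stdev of that key's own recent history."""

    def __init__(self, window=60, k=3.0, min_history=10):
        self.window, self.k, self.min_history = window, k, min_history
        self.history = {}

    def is_anomalous(self, key, value):
        hist = self.history.setdefault(key, deque(maxlen=self.window))
        anomalous = (
            len(hist) >= self.min_history   # need enough history to judge
            and stdev(hist) > 0
            and value > mean(hist) + self.k * stdev(hist)
        )
        hist.append(value)
        return anomalous

b = Baseline()
key = ("203.0.113.10", 53, "US")  # destination IP, dest port, source geo
normal = [100, 110, 95, 105, 100, 98, 102, 107, 99, 101, 103]
flags = [b.is_anomalous(key, v) for v in normal]  # steady traffic: no flags
spike = b.is_anomalous(key, 5000)                 # ~50x baseline: flagged
```

Because baselines are kept per key, a spike toward one destination is judged against that destination's own normal traffic, not a device-wide average.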
Kentik Detect has the functional breadth for capturing all the necessary network telemetry in a big data repository to isolate even the most obscure DDoS attacks and network events — as they happen or as predicted for the future. Network visibility using NetFlow is key to managing your network and ensuring the best possible security posture. To understand more about NetFlow see this Kentipedia article and blog post. To see how Kentik Detect can help your organization monitor and adjust to network capacity patterns and stop DDoS threats, read this blog, request a demo or sign up for a free trial today.
Kentik Blog by Michelle Kincaid Director Of Commun.. - 1y ago
Today we’re excited to announce we published our first open source project on GitHub and npm. So what is the project? And what does it do?
If you’ve ever signed up for or purchased anything online, there’s a chance you’ve completed a series of tedious forms. The underlying technology behind those forms has not significantly changed in the last decade. However, with browsers becoming more interactive, libraries have been created to help developers implement robust, nontrivial forms.
While a form often looks simple enough to an end user who completes it, for developers the code for that form can actually encompass a lot more work behind the scenes. ReactJS and Mobx are popular libraries for building highly interactive applications, but offer only low-level building blocks to handle complex data entry needs. Without a form-specific library, the very basics of forms — setup, validation, and submission — require significant boilerplate code.
At Kentik, our network analytics platform requires dozens of complex forms (see image at right). Network and security teams enter everything from IP addresses to complex regular expressions, tackling everything from peering and capacity planning, to performance monitoring, and anomaly detection and alerting. When we started building our platform, Kentik Detect®, we went through an exercise of evaluating open source form libraries to see if they would fit our use cases. Our engineering team wanted a declarative solution to avoid boilerplate code and create consistent user experiences and code patterns, but knew they would need lots of imperative hooks to deal with special cases.
Yet, as we continued to evolve our product and scale our user base to include the top enterprises and service providers internationally, it became clear that we’d need our own solution for forms. At the end of the day (actually 24/7), a great deal of work from our developers goes into ensuring our customers avoid data-entry errors in our forms and, ultimately, that users gain fast insights into what is happening on their networks. That’s why we created Mobx Form — and we’re putting it up on GitHub and npm because we know other developers in our community might also need help tackling the complexity of these types of forms.
Leading our Mobx Form project, Aaron Smith, Kentik’s engineering manager and one of the brains behind our sleek UI, notes: “Our Mobx Form code was quickly incorporated into our product, and over the course of a year, we’ve been focused on adding to it and fixing bugs. It’s now meeting the needs of developers here at Kentik. And while it’s our first open source project, we’re looking forward to sharing more with the community as we continue to build upon our easy-to-use UI and fast network analytics.”
Kentik Blog by Stephen Collins, Principal Analyst,.. - 1y ago
This series of guest posts has concentrated on the numerous challenges facing enterprise IT managers as businesses embrace digital transformation and migrate IT applications from private data centers into the cloud. The recurring theme has been the critical need for new tools and technologies to gain visibility into cloud-scale applications, infrastructure, and networks. In this post, I’d like to expand on that theme.
The scope of cloud-scale visibility is daunting and technically demanding. Monitoring needs to span multiple domains: the private enterprise data center and WAN; fixed and mobile service provider networks; the public Internet; and hybrid multi-cloud infrastructure. Full-stack visibility is essential, spanning application software, computing infrastructure, virtual network layers, and the various physical underlay networks.
Network and computing infrastructure is increasingly software-driven, allowing for extensive, full stack software instrumentation that provides monitoring metrics for generating KPIs. Software probes and agents that can be easily installed and spun up on-demand are displacing costly hardware probes that need to be physically deployed by on-site technicians. Active monitoring techniques now play a key role in tracking the performance of cloud-based applications accessed via the Internet, including synthetic monitoring that simulates user application traffic flows for proactively detecting problems before they impact a large number of users.
Performance metrics and other types of monitoring data can be collected in real time using streaming telemetry protocols such as gRPC. At the network layer, streaming telemetry data is displacing SNMP polling and CLI screen scraping for gaining visibility into state information. Now that support for NetFlow, sFlow and IPFIX is commonplace in routers and switches, flow metadata is a readily available source of telemetry for real time visibility into network traffic flows across all monitoring domains.
Network data is big data. The collection of massive amounts of streaming telemetry requires a high-speed data pipeline for ingesting data in real time and distributing it to the appropriate monitoring and analytics tools. Highly scalable Kafka clusters that utilize a publish/subscribe model are a commonly deployed pipeline solution, supplying telemetry data to multiple consumer analytics engines and tools.
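The publish/subscribe model described above can be shown in miniature. This toy in-memory topic is purely illustrative (Kafka adds partitioning, persistence, and consumer groups on top of the same idea), but it captures how one telemetry stream fans out to multiple independent consumers:

```typescript
type Handler<T> = (msg: T) => void;

// A topic delivers every published message to every subscriber.
class Topic<T> {
  private subscribers: Handler<T>[] = [];

  subscribe(handler: Handler<T>): void {
    this.subscribers.push(handler);
  }

  publish(msg: T): void {
    for (const handler of this.subscribers) handler(msg);
  }
}

// A simplified flow record of the kind NetFlow/sFlow/IPFIX export.
interface FlowRecord {
  srcIp: string;
  dstIp: string;
  bytes: number;
}

const flows = new Topic<FlowRecord>();

// Consumer 1: a real-time analytics engine keeping a running total.
let totalBytes = 0;
flows.subscribe((f) => { totalBytes += f.bytes; });

// Consumer 2: a long-term store archiving the raw records.
const archive: FlowRecord[] = [];
flows.subscribe((f) => { archive.push(f); });

flows.publish({ srcIp: "10.0.0.1", dstIp: "10.0.0.2", bytes: 1500 });
flows.publish({ srcIp: "10.0.0.3", dstIp: "10.0.0.2", bytes: 500 });
```

The key property is decoupling: the producer does not know or care how many tools consume the stream, so new analytics engines can be attached without touching the collection layer.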
Streaming analytics engines consume and process data for generating operational insights in real time. Column-oriented databases ingest data to support near real-time multi-dimensional analytics for correlating a wide range of time series data types. Machine learning engines analyze huge data sets to discover correlations and trends that might be impossible for operators to discern using traditional monitoring techniques. Hadoop-based data lakes support offline batch processing on massive amounts of data for gaining business intelligence insights.
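The multi-dimensional analytics mentioned above ultimately reduce to grouping records by one or more dimensions and aggregating a metric, the kind of query a column-oriented database executes at scale over billions of rows. A minimal illustrative sketch (field names are hypothetical):

```typescript
// A simplified flow record with a few queryable dimensions.
interface Flow {
  srcIp: string;
  dstIp: string;
  proto: string;
  bytes: number;
}

// Group flows by the requested dimensions and sum bytes per group,
// analogous to: SELECT dims, SUM(bytes) FROM flows GROUP BY dims.
function sumBy(flows: Flow[], dims: (keyof Flow)[]): Map<string, number> {
  const out = new Map<string, number>();
  for (const f of flows) {
    const key = dims.map((d) => String(f[d])).join("|");
    out.set(key, (out.get(key) ?? 0) + f.bytes);
  }
  return out;
}
```

An operator can pivot the same data by source, destination, protocol, or any combination simply by changing the dimension list, which is what makes the multi-dimensional model so useful for ad hoc investigation.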
While Big Data open source software is freely available, many enterprise IT organizations can’t sustain the investment needed to develop Big Data monitoring and analytics tools in-house, and many IT managers would prefer to rely on the vendor community for fully supported, productized solutions based on open source.
Big Data was born in the cloud and Big Data analytics is well-suited for cloud-based deployments. SaaS-based Big Data analytics solutions are also an attractive option for organizations seeking a productized solution with low upfront costs, no on-site installation required and minimal ongoing maintenance.
I conclude by referencing a quote often attributed to astronaut John Glenn, someone who unquestionably had “the right stuff.” Nobody is asking IT managers to do something as outrageously risky as “sitting on top of 2 million parts — all built by the lowest bidder on a government contract.” But digital transformation is not for the faint of heart, so it’s critical that ITOps, NetOps, SecOps, and DevOps teams make sure they have the right stuff and are properly equipped for the challenges they face.