This is the first post in a series of performance resilience blogs that we will be producing over the coming months. Performance resilience is the ability to ensure the performance of your commercial or home-grown appliance in any data center environment. In other words, it is about ensuring that your performance monitoring, cybersecurity or forensics appliance is resilient to common data center issues such as badly configured networks, the inability to specify the desired connection type, and constraints on time synchronization, power and space.

In this first blog, we will look at deduplication and how support for deduplication in your SmartNIC ensures performance resilience when data center environments are not configured properly – router and switch SPAN ports specifically.

Assume the worst

When designing an appliance to analyze network data for monitoring performance, cybersecurity or forensics, it is natural to assume that the environments where your appliance will be deployed are configured correctly and adhere to best practices. It is also fair to assume that you can get the access and connectivity you need. Why would someone go to the trouble of paying for a commercial appliance or even fund the development of an appliance in-house, if they wouldn’t also ensure that the environment meets minimum requirements?

Unfortunately, it is not always like that, as many veterans of appliance installations will tell you. This is because the team responsible for deploying the appliance is not always the team responsible for running the data center, and appliances are not the data center team's first priority. What happens in practice is that the team deploying the appliance is told to install it in a specific location with specific connectivity, and that is that. You might prefer to use a tap, but one might not be available, so you need to use a Switched Port Analyzer (SPAN) port on a switch or router for access to network data.

While this might seem acceptable, it can lead to some unexpected and unwanted behavior that is responsible for those grey hairs on the heads of veterans! An example of this unwanted behavior is duplicate network packets.

How do duplicate packets occur?

Ideally, when performing network monitoring and analysis, you would like to use a tap to get direct access to the real data in real time. However, as we stated above, you can’t always dictate that and sometimes have to settle for connectivity to a SPAN port.

The difference between a tap and a SPAN port is that a tap is a physical device that is installed in the middle of the communication link so that all traffic passes through the tap and is copied to the appliance. Conversely, a SPAN port on a switch or router receives copies of all data passing through the switch, which can then be made available to the appliance through the SPAN port.

When configured properly, a SPAN port works just fine. Modern routers and switches have become better at ensuring that the data provided by SPAN ports is reliable. However, SPAN ports can be configured in a manner that leads to duplicate packets. In some cases, where SPAN ports are misconfigured, up to 50% of the packets provided by the SPAN port can be duplicates.

So, how does this occur? What you need to understand about SPAN ports is that when a packet enters the switch on an ingress port, a copy can be created – and when it leaves the switch on an egress port, another copy can be created. If the SPAN session mirrors both directions, duplicates are unavoidable for traffic that both enters and leaves the switch. But it is possible to configure the SPAN session to create copies only on ingress or only on egress, thus avoiding duplicates.

Nevertheless, it is not uncommon to arrive in a data center environment where SPAN ports are misconfigured and nobody has permission to change the configuration on the switch or router. In other words, there will be duplicates and you just have to live with it!

What is the impact of duplicates?

Duplicates can cause a lot of issues. The obvious one is that double the amount of data requires double the processing power, memory and storage. However, the main issue is false positives: errors that are not really errors, or threats that are not really threats. One common way duplicates skew analysis is by inflating TCP out-of-order and retransmission warnings. Debugging these phantom issues takes a lot of time – time that an overworked, understaffed network operations or security team does not have. In addition, any analysis performed on the basis of this information is probably unreliable, which only exacerbates the problem.

How to achieve resilience

With deduplication built into the SmartNIC in the appliance, it is possible to detect up to 99.99% of the duplicate packets produced by SPAN ports. Similar functionality is available on packet brokers, but for a sizeable extra license fee. On Napatech SmartNICs, this is just one of several powerful features delivered at no extra charge.
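To make the mechanism concrete, here is a minimal sketch of time-windowed deduplication in Python. It is illustrative only, not Napatech's actual implementation: the window size is an assumption, and the caller is expected to hash over fields that a switch does not rewrite (e.g. excluding TTL and checksums) so that SPAN-generated copies compare equal.

import hashlib
import time

WINDOW_NS = 100_000  # dedup window in nanoseconds (illustrative value)

_seen = {}  # packet digest -> timestamp of last sighting (ns)

def is_duplicate(invariant_bytes, ts_ns=None):
    """Return True if an identical packet was seen within the window."""
    ts_ns = time.monotonic_ns() if ts_ns is None else ts_ns
    digest = hashlib.sha1(invariant_bytes).digest()
    last = _seen.get(digest)
    _seen[digest] = ts_ns  # remember the most recent sighting
    return last is not None and ts_ns - last <= WINDOW_NS

In hardware, the same idea runs per packet at line rate, with the window and the set of hashed fields being configurable.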

The solution is ideal for situations where the appliance is connected directly to a SPAN port, dramatically reducing the damage that duplicates can cause. But it also means that the appliance is resilient to any SPAN misconfigurations or other network architecture issues that give rise to duplicates – without relying on other costly solutions, such as packet brokers, to provide the necessary functionality.

In other words, it is possible to ensure that the performance of your appliance is resilient to misconfigurations in these all too common situations. Stay tuned as we look at other data center issues and provide guidance on how to achieve the needed resilience.

The post Ensuring performance resilience with deduplication appeared first on Napatech.


In simple terms, a network flow is a series of communications between two endpoints. Beyond that, however, the definition of a flow may not be totally clear to everyone. When used in the context of NetFlow or IPFIX records, most people can agree that a flow is typically defined by its 5-tuple attributes (source and destination IP address, source and destination port, and the protocol field). But it is also common to use a 4-tuple, dropping the protocol field, or even a 2-tuple, using only the IP addresses. The latter has the advantage that it also works for IP fragments.

The term "flow", on the other hand, can also be used in a more abstract sense, covering only a subset of a 5-tuple flow or, alternatively, a group of 5-tuple flows. Is it then still a flow, or something else? That is a matter of definition. In other words, you can make your flow definition more specific by tightening your criteria, or more general by widening them. In either case, you get a collection of network packets with some common characteristics.

Above and beyond basic 5-tuples

At Napatech, we have been much more flexible in the way we have implemented the flow lookup function in our SmartNICs. Based on output from our packet decoder, we can extract several individual fields from the packet and combine them into the key we use for our internal flow lookup. This architecture broadens the scope of the feature far beyond 5-tuple flow matching: we can extract up to four elements from anywhere in the packet and combine them into the flow-record lookup key. This enables many different use cases and gives the user the flexibility and freedom to build very advanced solutions.
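As a rough illustration of the idea – with field names and parsing assumed for the example, not taken from the actual SmartNIC interface – the key construction boils down to something like this:

def flow_key(pkt, fields):
    """Build a flow-table lookup key from any combination of parsed header fields."""
    return tuple(pkt[f] for f in fields)

pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
       "src_port": 12345, "dst_port": 443, "proto": 17, "vlan": 100}

five_tuple = flow_key(pkt, ("src_ip", "dst_ip", "src_port", "dst_port", "proto"))
two_tuple  = flow_key(pkt, ("src_ip", "dst_ip"))          # also works for IP fragments
custom     = flow_key(pkt, ("vlan", "src_ip", "dst_ip"))  # VLAN tag plus IP addresses

The point is that the lookup machinery is the same regardless of which fields make up the key; only the key definition changes.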

In addition to the "standard" 5-tuple example, typical use cases include MAC addresses, or VLAN or VxLAN tags combined with IP addresses, or even multiple VLAN or VxLAN tags combined with both outer and inner IP addresses. The latter requires extracting elements from more than four locations in the packet – but here, the flexibility of the FPGA helps us, as the number of supported locations can be extended with a simple firmware upgrade to the FPGA.

At Napatech we refer to the feature as flow management, but this is really a bit too narrow, as the feature can perform flow-table lookups based on any key that can be built from almost any packet data.

So, what appears externally as merely a flow can end up being a 5-tuple flow, a 2-tuple flow, a single 5- or 2-tuple flow from within a GTP or GPRS tunnel, a layer-2 connection or even a specific VxLAN tunnel. In other words, the scope of Napatech's flow management feature goes way beyond the standard, basic flow definition to cover a multitude of other compositions and patterns, enabling a powerful network tool with massive potential.

The post What is a flow? appeared first on Napatech.

Flow records for advanced flow analytics

In my blog Packet Capture – the Complete Source to the Truth, I explained why network packets are the ultimate source for reliable network analysis. But as network speeds continue to grow and packet counts keep rocketing, doing "live" analysis and DPI on a packet-by-packet basis is no longer viable.

As a consequence, a lot of network monitoring and analysis tools focus on flow records, which they collect from NetFlow/IPFIX probes in the network. Based on these flow records, they are able to detect strange behavior and anomalies that need further analysis. Whenever an incident is detected, it is still possible to go even deeper and analyze the underlying network packets if needed.

Typical use

Many of the most advanced network security products available today are implemented this way. They involve both packet capture and packet inspection, including DPI, resulting in NetFlow records enriched with metadata. The flow records are generated and forwarded to some type of flow-aware intelligence that performs more advanced analytics, typically using artificial intelligence and machine learning to detect anomalies and other security issues. Whenever an issue is detected, the flow intelligence has the option to go back and retrieve the underlying packets from the packet capture device for further inspection and analysis.

The figure below shows an example of such a configuration; the same setup can be used with or without packet capture, depending on the need for detailed analysis and documentation of incidents.

Any of the blocks in such a setup can be accelerated by a SmartNIC. For packet capture, the SmartNIC can guarantee zero packet loss and exact timestamping of packets, plus metadata to improve and accelerate packet indexing; it can even forward the timestamp or other metadata to the packet inspector. For packet inspection, the SmartNIC can extract any metadata forwarded from the packet capture block and help with some level of packet decoding; it could also provide look-aside acceleration for things like decryption and/or RegEx searches. For the flow intelligence, the acceleration needed is more in the area of machine learning or artificial intelligence, but these tasks too can be accelerated using SmartNICs with look-aside engines.

The future: AI on a SmartNIC

In the future, it may even be necessary to accelerate the flow processing by letting the flow/intelligence/anomaly detection happen on the SmartNIC as well. This will require that the SmartNIC includes an AI inference engine capable of executing the Machine Learning model specified by the user and used to detect anomalies or strange behavior.

At Napatech, we have now added flow metrics and state collection to our flow management capabilities, enabling applications to generate NetFlow and IPFIX records with very limited CPU load, even at full 2×100 Gbps network speed. To demonstrate this, we have implemented a basic IPFIX probe that runs at full 2×100 Gbps, as shown in the video below.

This significantly accelerates high-speed NetFlow/IPFIX probes, as most of the packet processing is done in the SmartNIC and only the learning of the flow and the generation of the NetFlow/IPFIX record are done on the CPU. This solution will be available on our newest generation of SmartNICs, which support any network speed – e.g. 8×10 Gbps, 4×25 Gbps, 2×40 Gbps and 2×100 Gbps – on the same hardware platform.
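As a sketch of the CPU-side work that remains – the FlowMetrics shape below is an assumption for illustration, not the actual NTAPI structure – mapping NIC-delivered flow metrics onto standard IPFIX information elements could look like this:

from dataclasses import dataclass

@dataclass
class FlowMetrics:  # per-flow counters as delivered by the SmartNIC (assumed shape)
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int
    packets: int
    octets: int
    first_seen_ns: int
    last_seen_ns: int

def to_ipfix_record(m):
    """Map flow metrics onto standard IANA IPFIX information elements."""
    return {
        "sourceIPv4Address": m.src_ip,
        "destinationIPv4Address": m.dst_ip,
        "sourceTransportPort": m.src_port,
        "destinationTransportPort": m.dst_port,
        "protocolIdentifier": m.proto,
        "packetDeltaCount": m.packets,
        "octetDeltaCount": m.octets,
        "flowStartNanoseconds": m.first_seen_ns,
        "flowEndNanoseconds": m.last_seen_ns,
    }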

The SmartNIC gets even smarter

Going forward, we are investigating the possibility of adding an AI inference engine to our SmartNICs, which would accelerate the flow analysis even further and actually detect anomalies in the SmartNIC itself, based on a machine learning model trained elsewhere and executed directly on the SmartNIC.

The post Analytics in full flow appeared first on Napatech.


In a recent blog, I talked about the performance improvement that can be achieved running VXLAN using HW offload. In this blog I will be discussing another important feature that we have been working on in the Napatech OVS offload initiative: mirror offload.

Rumor has it that turning on OVS mirroring carries a heavy performance cost, but I had not previously experimented with it. Continuing the OVS HW offload work, I will therefore explore the performance gains that can be achieved by offloading traffic mirroring.

Like standard OVS+DPDK, Napatech supports forwarding of mirror traffic to virtual or physical ports on a bridge. In addition, Napatech can deliver the mirror to a high-speed NTAPI stream (the Napatech API, a proprietary API widely used in zero-copy, zero-packet-loss applications). The NTAPI stream can be used by a host application to monitor the mirrored traffic with very low CPU overhead.

OVS mirror test cases

In order to test OVS mirroring I need to settle on a test setup. I want to test the following items:

  1. Establish a baseline for what a VM can do
    This is needed to determine if activating the mirror affects the VM. The baselines of interest are:
    a. RX-only VM capability – useful to determine what the mirror recipient can expect to receive
    b. Forwarding VM capability – useful to determine what an external mirror recipient can expect to receive
  2. Standard OVS+DPDK with mirror to:
    a. Virtual port – recipient is a VM on the same server
    b. Physical port – recipient is the external traffic generator
  3. Napatech full offload additions to OVS+DPDK with mirror to:
    a. Virtual port – recipient is a VM on the same server
    b. Physical port – recipient is the external traffic generator
    c. Host TAP – recipient is an application (using the Napatech API) running on the host

Note that 3.c is specific to Napatech; I am not even sure this is possible with standard OVS+DPDK – maybe someone can enlighten me?

Test setup

I use a setup consisting of two servers, a Dell R730 and an ASUS Z9PE-D8 WS, each with an NT200A01 configured to run at 40Gbps instead of 100Gbps.

              Dell R730                                   ASUS Z9PE-D8 WS
CPU           Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz   Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
CPU sockets   2                                           2
NUMA nodes    12 per socket                               8 per socket
RAM           64GB                                        64GB
NIC           NT200A01                                    NT200A01

The Dell R730 will be running the OVS+DPDK environment and the ASUS the TRex traffic generator.

OVS mirror test setup

OVS+DPDK will be configured with the following topologies in the different tests:

Topology                       Tests
1 physical + 1 virtual port    1a, 1b
1 physical + 3 virtual ports   2a, 3a
2 physical + 2 virtual ports   2b, 3b, 3c

The VMs will be running the following programs:

  • ‘monitor’ – a very simple DPDK application based on the ‘skeleton’ example, extended to print throughput to stdout every second
  • ‘testpmd’ – a standard DPDK test application. This application can be configured to forward packets to specific MAC addresses, so it is very useful when creating the chain shown in the picture above.

OVS+DPDK will be running in standard MAC learning mode: it learns which port has which MAC address and automatically creates forwarding rules based on this behavior. The traffic generator and the VMs are configured like this:

      MAC address         Command to forward to next recipient
TRex  00:00:00:00:00:10   Set destination MAC to 00:00:00:00:00:01
VM1   00:00:00:00:00:01   ./testpmd -c 3 -n 4 -- -i --eth-peer=0,00:00:00:00:00:02 --nb-cores=1 --forward-mode=mac
VM2   00:00:00:00:00:02   ./testpmd -c 3 -n 4 -- -i --eth-peer=0,00:00:00:00:00:10 --nb-cores=1 --forward-mode=mac
Tests

Finding the baseline

The baseline tests were performed by configuring TRex to generate 64B packets at 40Gbps and monitoring the output in either the ‘monitor’ app or the TRex statistics. The baseline results were as follows:

                        RX only    Forwarding
OVS+DPDK                7.9Mpps    7.6Mpps
OVS+DPDK full offload   14.7Mpps   6.5Mpps
./monitor

…

Core 0 receiving packets. [Ctrl+C to quit]

Packets:      7931596 Throughput:     3807.166 Mbps

OVS+DPDK VM forwarding baseline

./monitor

…

Core 0 receiving packets. [Ctrl+C to quit]

Packets:     14735328 Throughput:     7072.957 Mbps

VM forwarding baseline – OVS full offload

A common baseline for the two test scenarios will be to forward 6Mpps (64B packets) from TRex. This allows each VM to forward with zero loss and stays within the capabilities of the physical mirror port running at 40Gbps, because the mirror port must transmit the forwarded traffic three times over – once for each leg of the path through the bridge – hence the 18Mpps expectation in the tables below.

Mirror to virtual port

The virtual port mirror is created using the following command for OVS+DPDK

ovs-vsctl add-port br0 dpdkvp4 -- set Interface dpdkvp4 type=dpdkvhostuserclient options:vhost-server-path="/usr/local/var/run/stdvio4" -- --id=@p get port dpdkvp4 -- --id=@m create mirror name=m0 select-all=true output-port=@p -- set bridge br0 mirrors=@m

and like this for OVS+DPDK full offload

ovs-vsctl add-port br0 dpdkvp4 -- set interface dpdkvp4 type=dpdk options:dpdk-devargs=eth_ntvp4 -- --id=@p get port dpdkvp4 -- --id=@m create mirror name=m0 select-all=true output-port=@p -- set bridge br0 mirrors=@m

The ‘monitor’ app is used in VM3 to determine the throughput of the mirror port, but the RX statistics of TRex (port 0) are also observed to see if activating the mirror affects the forwarding performance.

                        Monitoring VM (Expect 18Mpps)   TRex RX (Expect 6Mpps)
OVS+DPDK                7.8Mpps                         2.7Mpps
OVS+DPDK full offload   14.9Mpps                        6.0Mpps

VM -> VM forwarding with virtual port mirror – OVS+DPDK

VM -> VM forwarding baseline – OVS offload

The native OVS+DPDK really suffers in this test: both the mirror VM and the forwarding path see greatly reduced performance.

Mirror to physical port

The physical port mirror is created using the following command and it is the same for both OVS+DPDK and the fully offloaded version:

ovs-vsctl add-port br0 dpdk1 -- set interface dpdk1 type=dpdk options:dpdk-devargs=class=eth,mac=00:0D:E9:05:AA:64 -- --id=@p get port dpdk1 -- --id=@m create mirror name=m0 select-all=true output-port=@p -- set bridge br0 mirrors=@m

The TRex statistics on port 1 are used to determine the throughput of the mirror port. The TRex port 0 shows the forwarding performance.

                        HW mirror (Expect 18Mpps)   TRex RX (Expect 6Mpps)
OVS+DPDK                8.6Mpps                     2.8Mpps
OVS+DPDK full offload   18.0Mpps                    6.0Mpps

VM -> VM forwarding – Physical mirror port – OVS+DPDK

VM -> VM forwarding – Physical mirror port – OVS full offload

In this test, the native OVS+DPDK also suffers performance degradation on both the mirror and forwarding paths.

Mirror to ‘host tap’

The mirror to a host application uses the same setup as the physical mirror port.

                        HW mirror (Expect 18Mpps)   TRex RX (Expect 6Mpps)
OVS+DPDK full offload   18.0Mpps                    6.0Mpps
root@dell730_ml:~/andromeda# /opt/napatech3/bin/capture -s 130 -w

capture (v. 0.4.0.0-caffee)

…

Throughput: 17976315 pkts/s, 11312.309 Mbps
Evaluation

The test results clearly show the potential of offloading the mirror functionality. After all the results had been gathered, I performed one more test, as I had not included VM-to-VM forwarding performance in the baseline. The results were interesting: it turned out that forwarding between two VMs in native OVS+DPDK affects the overall forwarding performance. I was not able to get all 6Mpps forwarded with native OVS+DPDK, whereas the fully offloaded version showed no issues.

VM -> VM forwarding – OVS+DPDK


VM -> VM forwarding baseline – OVS offload

Future work

The full offload of OVS shows great potential, and we can take it even further. The Napatech mirror solution can also filter the mirrored traffic, reducing what actually leaves the mirror port so that the monitoring application only receives the traffic it is interested in. The filtering was not available at the time of writing, but it is a rather quick thing to add, so I might come back to it at a later point.

The post OVS mirror offload – monitoring without impacting performance appeared first on Napatech.

From partial to full OVS offload

Napatech has been working on Open vSwitch (OVS) hardware acceleration/offload via DPDK for a while now. It started with the Flow Hardware Offload, available in OVS since 2.10. This introduced a partial offload of megaflows, where packets are instrumented with metadata that lets OVS bypass its cache look-ups and forward the packets directly to the destination. See the Red Hat blog for details.

Lately, Napatech has been working on a full OVS offload implementation, in continuation of the partial offload already upstreamed. This is a data-plane offload, not to be confused with bare-metal offload; the control plane and slow path still reside in the host. The same approach of offloading megaflows is applied – what is new is that the VirtQueues are no longer handled by OVS but directly by the SmartNIC’s Poll Mode Driver (PMD). Packets that can be handled by the SmartNIC fast path never enter OVS on either ingress or egress; hence OVS is fully offloaded.

6x performance improvement

In order to fully offload the fast path, the OVS actions must also be performed in the SmartNIC – which is achievable as Napatech has recently introduced support for VXLAN and VLAN offload. The VXLAN and VLAN offload combined with the full offload of OVS show up to a ~6x performance improvement compared to a basic OVS+DPDK solution running on a standard NIC.

VXLAN performance

The ~6x performance improvement was tested using the following setup:

  • The test setup consists of two servers each running OVS.
  • Server 1 runs a program that receives packets, swaps the MAC/IP and retransmits the packet.
  • Server 2 runs a traffic generator.
  • A VXLAN tunnel is established between the two servers.

VM to VM via VXLAN

Performance is measured by creating an increasing number of megaflows (ovs-ofctl unique flows) and an increasing number of unique flows within each megaflow.

An example configuration would be:

#!/bin/bash
# Install one OpenFlow rule (megaflow) per UDP destination port, ports 1..$1.

ovs-ofctl del-flows br-int

for port in $(seq 1 "$1")
do
  ovs-ofctl add-flow br-int "check_overlap,in_port=1,ip,udp,tp_dst=$port,actions=output:2"
done

# Return path: everything arriving on port 2 goes back out port 1.
ovs-ofctl add-flow br-int in_port=2,actions=output:1

Traffic is generated by varying the UDP source and destination ports: for each destination port (megaflow) there are one or more source ports. The test setup thereby stresses both the exact-match cache and the megaflow cache of OVS.
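For readers who want to reproduce the flow pattern without a TRex setup, it can be approximated with Scapy (illustrative only, and far slower than the line-rate TRex traffic used in the actual test):

from scapy.all import Ether, IP, UDP, sendp

def blast(iface, n_megaflows, flows_per_megaflow):
    """Send one packet per (dst port, src port) pair.

    Each unique dst port hits one megaflow rule; varying src ports creates
    multiple exact-match flows within that megaflow.
    """
    for dport in range(1, n_megaflows + 1):
        for sport in range(1024, 1024 + flows_per_megaflow):
            pkt = Ether() / IP(src="10.0.0.1", dst="10.0.0.2") / UDP(sport=sport, dport=dport)
            sendp(pkt, iface=iface, verbose=False)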

VXLAN and flow offload in OVS

Please also see the Napatech OVS demo for more details on the performance improvements.

Napatech is working to get the VXLAN offload upstreamed, so stay tuned.

The post Light at the end of the tunnel – OVS offload shows 6x performance improvement appeared first on Napatech.


Most network applications and appliances are limited in their performance because packet processing is a very heavy, CPU-intensive task. Packet processing normally includes activities like decoding (at least up to layer 4, sometimes even up to layer 7), which enables a stateful evaluation of each packet based on the flow context it belongs to.

Flow shunting

Some applications only require this cumbersome inspection for the first few packets, and the remaining packets in the same flows can then be handled by a flow table lookup or a flow cache. This can significantly increase the performance of the application as a lookup into a flow table is much faster than the full inspection and evaluation of each packet. This flow caching is often referred to as flow shunting. But even with flow shunting, the performance of modern inline security applications remains limited as all packets still have to be received by the application, looked up in the flow cache and, based on the flow context information, be retransmitted or dropped.
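In Python-flavored pseudocode – names are illustrative – a software flow cache of this kind amounts to a single dictionary lookup on the fast path:

flow_cache = {}  # flow key -> verdict decided by full inspection

def handle_packet(pkt, key, deep_inspect):
    """Fast path: one cache lookup. Slow path: full inspection, then cache."""
    verdict = flow_cache.get(key)
    if verdict is None:
        verdict = deep_inspect(pkt)  # expensive; only for the first packets of a flow
        flow_cache[key] = verdict
    return verdict

Even this fast path still costs a packet delivery into the application and a lookup per packet, which is exactly what the next step removes.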

Application acceleration

To achieve the next level of acceleration, we therefore need to move the flow shunting and the entire flow table into a SmartNIC, thereby enabling packet management directly in the SmartNIC for flows that have already been classified by the application. That way, we remove the need to bring the packet data into the application space for the lookup and, potentially, the retransmission. This approach will significantly increase the performance of the application, which now only needs to handle packets for flows that have not yet been classified, i.e. unlearned flows. In other words, we enable the application to focus only on packet/flow classification, moving the responsibility of the packet-by-packet forwarding/dropping to the SmartNIC.

Take a network where only 10% of the packets belong to new or unclassified flows. When offloaded by a flow-shunting SmartNIC, the application only has to process a tenth of the packets, which in the best case increases performance by a factor of 10. This example demonstrates why flow shunting in a SmartNIC is a gamechanger. As more actions are added to the flow shunting mechanism – e.g. NAT, VLAN and VxLAN tagging/de-tagging, and IPsec en-/decryption – the performance gain will grow, making the SmartNIC ever more interesting for flow-based applications.

At Napatech, we have added a flow management feature to the data pipeline inside our SmartNICs. This flow table – or flow cache – has been implemented in the SmartNIC to offload and accelerate any flow-based applications. Based on the flow identified through the lookup, this new feature can perform a number of actions on the packet without involving the application. The first version, demoed in mid-October 2018, included the two basic actions: drop and forward. Based on these functionalities, we can provide flow shunting directly in the SmartNIC.
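Conceptually, the application-side logic then shrinks to the sketch below. The program_flow call stands in for a hypothetical SmartNIC flow-table API – it is not an actual Napatech function – and the classifier is whatever DPI or policy engine the application already runs:

FORWARD, DROP = "forward", "drop"

def on_unlearned_packet(nic, key, classify):
    """Called only for packets whose flow is not yet in the SmartNIC flow table.

    Once the flow is classified and programmed, subsequent packets are
    forwarded or dropped in hardware and never reach the application.
    """
    action = FORWARD if classify(key) else DROP
    nic.program_flow(key, action)  # hypothetical SmartNIC API call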

On the horizon

In the future, we will add more actions and thereby increase the number of use cases where we can do full SmartNIC offload, providing even greater acceleration. We are currently working on adding write-back into the flow table on a per-packet basis, enabling metrics as well as stateful updates per packet. Other potential actions to be added include:

  • VLAN and VxLAN tagging and untagging
  • NAT and NAPT
  • IPSec en-/decryption

All these ideas will be implemented inline on a per-flow basis and without application involvement, except for the configuration.

This type of SmartNIC-based flow shunting has existed for some time, but Napatech is taking it to the next level with support for up to 2x100G connectivity, 100 million bi-directional flows, a learning rate of 1-2 million new flows per second and metrics updates on a per-packet basis – with more innovative actions to come over time.

The post Going against the flow appeared first on Napatech.


With the acquisition of Altera, Intel® has introduced the Programmable Acceleration Card (PAC) with Intel® Arria® 10 GX FPGA. This new card allows FPGA developers to create Accelerator Functional Units (AFUs) that can be deployed on the card and bring FPGA-based value-add to the application.

The OPAE framework

Intel has developed a framework SDK – the Open Programmable Acceleration Engine (OPAE) – that operates with terms like blue and green bitstreams, using the colors to describe the internals of the Intel FPGA. The blue bitstream consists of FPGA functional blocks delivered by Intel to make the card work; it contains I/O logic for all the surrounding peripherals like PCIe, SDRAM and QSFP+. The green bitstream is where the user code (the AFU) is located. The blue bitstream abstracts the I/O via standardized APIs, enabling the green bitstream to be easily ported to new blue bitstreams on other hardware platforms.

The OPAE framework

The OPAE framework strives to ensure that the developer of an AFU only needs to be concerned with the AFU specifics, not the bring-up of the entire FPGA. The server application layer of OPAE contains a driver layer giving the user access to acquire the FPGA, reset it, read from and write to it, and so on, all via standardized functions. The OPAE framework also provides functions like memory allocation for buffers visible to both the FPGA and the CPU.

The Napatech firmware is a green bitstream that interfaces with the blue bitstream components and enables the Napatech driver suite to interact with the firmware as if it were running on Napatech hardware. Normally, Napatech supplies both the blue and green bitstreams on its own hardware, so working with only the green bitstream has been challenging, yet successful.

PAC Hardware

The PAC hardware consists of:

  • a QSFP+ enabling a single 40Gbps or 4x10Gbps (via breakout cables)
  • an Intel Arria 10 GX FPGA
  • 8 GB DDR-4 memory
  • x8 PCIe (~50Gbps) host interface
  • flash to store the FPGA image

Image courtesy of Intel

Napatech value-add

Napatech has a long history of creating its own SmartNICs, but with the collaboration between Intel and Napatech, it is now possible to buy an Intel FPGA-based NIC and run Napatech software on this non-Napatech hardware. The Napatech software on the Intel hardware enables full 40Gbps zero-packet-loss RX/TX, both at 4x10Gbps and 1x40Gbps. Beyond zero packet loss, key Napatech features like deduplication, correlation key generation, flow matching and pattern matching become available, as does the highly flexible tuple matcher used for load distribution of traffic. Napatech has enabled the Intel card to work as a plug-and-play solution for several monitoring/security applications like Suricata, Snort and Bro, as well as the TRex traffic generator, which means that it is also integrated into DPDK when running the Napatech firmware.

Performance

With the Intel PAC running Napatech software, Intel now has a hardware platform that can claim zero packet loss at 40Gbps, which to my knowledge has not previously been possible. We have been testing internally with both the PAC and the Intel X710 – and only the PAC with Napatech software can ensure zero packet loss. Running Suricata on the X710, for example, packet loss starts to occur at ~1Gbps, but with the Intel PAC it runs without loss at ~40Gbps.

Final thoughts

I see great potential for the PAC platform and OPAE going forward, and I very much look forward to seeing which use cases it will fulfill and how the platform will evolve. Hopefully the PAC will enable more FPGA business, both from a hardware and a firmware perspective. I don’t think the current PAC is a one-size-fits-all platform, but it can be the catalyst that spins off a general interest in FPGAs and boards in other form factors that all feature OPAE to enable portability from an application point of view. And in the end, that is what matters: that the application is accelerated.

Intel’s next PAC with a Stratix 10 looks interesting and much more similar to the Napatech hardware products, so let’s see what the future brings.

The post Intel FPGA hardware – Napatech Inside appeared first on Napatech.


Recently, I had the honor of hosting a webinar together with IHS Markit analyst Vlad Galabov on the topic of reconfigurable computing. If you have not had a chance to watch it, then it is still available here:

event.on24.com/wcc/r/1834117/79B72AC31F8BE0DB23D744D306EF747B?partnerref=napatech

The webinar raised a number of thought-provoking issues, but one in particular was the demise of Moore’s law. When you stop to think about it, the entire IT industry is driven by the premise that Moore’s law will continue to double the number of transistors per square inch every 18 months and thereby help us keep up with the relentless growth in data to be processed. What happens when this is no longer true?

Just think about telecom carriers. They are currently experiencing the “scissor effect” where costs continue to rise in line with the growth in data, but revenues stay the same or even decline. What happens when they have to invest even more in data processing power just to achieve the same outcome as today?

Just think about cloud service providers, who until now have seemed invincible. They have found a way to create a successful business model even in the face of an exponential data growth curve. Nevertheless, even cloud service providers will face challenges as “simply adding more servers” will no longer be enough to stay one-step ahead of the data growth curve. What then?

John Hennessy of Stanford and David Patterson of UC Berkeley have spent some time this year addressing this very issue. In their view, the death of Moore’s law (and Dennard scaling before it) is actually kick-starting a new “golden age” of innovation in computer and software architecture.

“Revolutionary new hardware architectures and new software languages, tailored to dealing with specific kinds of computing problems, are just waiting to be developed,” he said. “There are Turing Awards waiting to be picked up if people would just work on these things.” – David Patterson, interviewed in the IEEE Spectrum article “David Patterson Says It’s Time for New Computer Architectures and Software Languages”, Sep 2018.

Both Hennessy and Patterson point to Domain Specific Architectures (DSAs): purpose-built processors that accelerate a few application-specific tasks. The idea is that instead of having general-purpose processors like CPUs handle a multitude of tasks, different kinds of processors are tailored to the needs of specific tasks. One example they use is the Tensor Processing Unit (TPU) chip, built by Google specifically for deep neural network inference.

One of the advantages of FPGAs is that the hardware implementation can be tailored precisely to the needs of the software application, right down to the data path and register lengths. This allows opportunities to achieve performance improvements by tailoring the processing pipeline and parallelism exactly to the needs of the given application.

This is one of the big advantages of reconfigurable computing based on FPGA technology. The power and reconfigurability, and even the ability to tailor FPGA designs to specific needs, will play a key part in addressing the challenges of improving performance for various software applications as we move beyond Moore’s law.

If you would like to dive deeper into Hennessy and Patterson’s argument, I can recommend this article and video:

www.eejournal.com/article/fifty-or-sixty-years-of-processor-developmentfor-this

www.youtube.com/watch?v=bfPV4x-HrUI

The post Surviving the Death of Moore’s Law appeared first on Napatech.


In my previous blog article, I talked about the history of NFV, the vision behind the concept and the challenges in realizing it, particularly in the 5G network era. Following on, in this installment I will present the solutions required to overcome those challenges.

The Solution: OVS Offload

To meet these stringent requirements, operators have realized that scaling virtualized network functions (VNFs) to performance goals requires data plane acceleration based on FPGA-based SmartNICs. This technique offloads the x86 processors hosting the various VNFs, freeing them to support the breadth of services promised.

Certain tasks are well suited to running in software on CPUs, whereas others are poor candidates for general-purpose processors because the instruction set is not optimized for them. By embracing a workload-specific processing architecture that places the right workload at the right place in a system with the right technology, operators can dramatically decrease the total cost of ownership of an NFV solution by reducing the data center footprint required for a given number of users. Obvious workloads that benefit from acceleration and offload are networking and security tasks such as switching, routing, action handling, flow management, load balancing, cryptography (SSL, IPsec, ZUC, etc.), compression and deduplication.

SmartNIC acceleration of virtual switching proves to be the highest-performing and most secure method of deploying VNFs. Virtual machines (VMs) can use accelerated packet I/O and guaranteed traffic isolation via hardware while maintaining vSwitch functionality. FPGA-based SmartNICs specialize in the match/action processing required for vSwitches and can offload critical security processing, freeing up CPU resources for VNF applications. Functions like virtual switching, flow classification, filtering, intelligent load balancing and encryption/decryption can all be performed in the SmartNIC and offloaded from the x86 processor housing the VNFs while, through technologies like VirtIO, be transparent to the VNF, providing a common management and orchestration layer to the network fabric.

SmartNICs can transparently offload virtual switching data path processing for networking functions such as network overlays (tunnels), security, load balancing and telemetry, enabling COTS servers used for NFV workloads to deliver at their full potential. Additionally, FPGA-based SmartNICs are completely programmable, enabling fast new feature roll-outs without compromising hardware-based performance and efficiencies, and at the same time staying in lockstep with new features that may be required.

NFV Offload Architectures

Numerous offload designs are possible in this type of workload-specific processing architecture; which one to choose depends on the VNF application in question. The first and most fundamental design decision is whether the offload and acceleration are deployed in an “inline” or a “look-aside” model. When deployed inline, virtual switching and potentially other functions are tightly coupled with the network I/O. Data arrives at the SmartNIC, traverses the OVS data plane, and is demultiplexed via a flow-based match and action handling process.

The ways a flow can be defined, and the actions that can be applied to it, are numerous and include, but are not limited to, forwarding to physical or virtual ports, packet manipulation, metering, QoS, load balancing, drop, redirect, mirror, or forwarding to an additional processing block of logic. With software-programmable FPGAs, the virtual switch can forward flows to a subsequent processing stage that applies whatever additional inline processing the user desires. Packets can be encrypted/decrypted, compressed/decompressed or deduplicated, or any custom workload can be applied to increase application performance and decrease latency – actions that are not part of the standardized vSwitch capabilities.

Alternatively, when deployed in a look-aside model, after vSwitch processing, traffic is sent to the host processor where the VNF(s) are housed. The VNF application can then determine how to process the traffic. If additional offloads are required, traffic can be passed back to the FPGA-based SmartNIC or SmartCard where custom processing can occur on the traffic to offload the host processor.

NFV acceleration can be done in a transparent mode where the VNF is completely unaware that a SmartNIC is accelerating data delivery to the application via VirtIO. Alternatively, in VNF-aware mode, the application itself can be made aware of the programmable acceleration technology and, via APIs, influence the data plane processing of traffic to provide additional offloads and accelerations driven by the application itself.

Demonstrations of this approach show the performance gains that can be achieved. An example is the industry’s first cloud-RAN solution that supports heterogeneous acceleration hardware and full decoupling of software and hardware. Compared to a software-only implementation, an accelerated SmartNIC-based solution achieves 10 times higher ZUC encryption throughput, three times higher PDCP system throughput and 20 times lower latency.

Conclusion

The days of fixed-function, hardened, expensive, slow-to-maneuver and costly-to-operate networking and security solutions are gone. Overcoming the challenges facing NFV deployments requires reconfigurable computing platforms based on standard servers, capable of offloading and accelerating compute-intensive workloads in either an inline or a look-aside model, appropriately distributing workloads between x86 general-purpose processors and software-reconfigurable, FPGA-based SmartNICs optimized for virtualized environments.

By coupling general-purpose COTS server platforms with FPGA-based SmartNICs that are capable of supporting the most demanding requirements, network applications can operate at hundreds of gigabits of throughput with support for many millions of simultaneous flows. With this unique architecture leveraging the benefits of COTS hardware for networking applications, the vision of NFV is not over the horizon but is clearly attainable.

To live in the world of software-defined and virtualized computing without trading off performance, this reconfigurable computing platform architecture will allow companies to reimagine their networks and businesses, bringing hyper-scale computing benefits to their networks and letting them deploy new applications and services at the speed of software.

The post Solving the NFV Challenge: The Need for Virtualized Acceleration and Offloads appeared first on Napatech.
