If your company is not testing its software continuously, you're on the road to doom. Given the marketplace's increasing demand for more software at faster rates of release, there's no way a company can rely on manual testing for the majority of its quality assurance activities and remain competitive. Automation is essential to the CI/CD process, especially for usability testing.

Competitive companies automate their testing process. That's the good news. The bad news is that the scope of automation is typically limited to user behavior that's easy to emulate, for example, functional and performance testing. The more complicated tests – those centered around human factors – are typically left to efficiency-challenged manual testing.

The result: a company can have lightning-fast tests executing in some parts of the release process, only to be slowed to a snail's pace when human factors testing needs to be accommodated. There are ways human factors testing can be automated within the CI/CD process. The trick is to understand where human intervention in the testing process is necessary and to build automated processes that accommodate those boundaries.

The Four Aspects of Software Testing

To understand where human factors testing starts/stops, it’s useful to have a conceptual model by which to segment software testing overall. One model separates software testing into four aspects: functional, performance, security, and usability testing (Figure 1). Table 1 describes each.

Figure 1: The Four Parts of Software Testing

Functional
  • Description: Verifies that the software behaves as expected relative to operational logic and algorithmic consistency.
  • Example: Unit and component tests; integration and UI testing

Performance
  • Description: Determines a software system's responsiveness, accuracy, integrity, and stability under particular workloads and operating environments.
  • Example: Load, scaling, and deployment tests

Security
  • Description: Verifies that a software system will not act maliciously and is immune to malicious intrusion.
  • Example: Penetration, authentication, and malware injection tests

Usability
  • Description: Determines how easy it is for a given community of users to operate (use) a particular software system.
  • Example: User interface task efficiency, information retention, and input accuracy testing

Table 1: Definitions of the Four Parts of Software Testing

Data-Driven vs. Human-Driven Tests

Given the information above, it makes sense that functional, performance, and security testing get most of the attention when it comes to automation. These tests are machine-centered and quantitative (data in, data out), all of which can easily be machine-initiated and run under script. Things get more complicated with usability testing.

Usability testing requires random, gestural input that can only be provided by a human. As such, creating an automated process for this test type is difficult. It's not just a matter of generating data and applying it to a web page with a Selenium script. Human behavior is hard to emulate via script. Consider the process of usability testing a web page for optimal data-entry efficiency. The speed at which a human enters data will vary according to the layout and language of the page as well as the complexity of the data to be registered. We can write a script that makes assumptions about human data-entry behavior, but to get an accurate picture, it's better to have humans perform the task. After all, the goal of a usability test is to evaluate human behavior.

To be viable in a Continuous Integration/Continuous Delivery process, testing must be automated. The question then becomes, how can we automate usability and other types of human factors testing when, at first glance, they seem to be beyond the capabilities of automation? The answer: as best we can.

Automate as Much as Possible

No matter what type of testing you are performing, it’s going to be part of a process that has essentially four stages: setup, execution, teardown, and analysis (Figure 2).

Figure 2: The Four Stages of a Software Test

When it comes to creating efficiency in a CI/CD process, the trick is to automate as much of a given stage as possible, if not all of it (the operative words being "as possible"). Some types of tests lend themselves well to automation in all stages; others will not. It's important to understand where the limits of automation lie for a given test across the four stages.

Full automation is achievable during functional and performance tests

For functional and performance testing, automating all four stages is straightforward. You can write a script that (1) gathers data, (2) applies it to a test case, and (3) resets the testing environment to its initial state. Then, you can make the test results available to another script that (4) analyzes the resultant data, passing the analysis to interested parties.
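To make this concrete, here is a minimal sketch of those four stages wired together in a single script. The shell commands, file names, and the 500 ms threshold are hypothetical placeholders for whatever your environment and tooling actually provide.

```python
import json
import subprocess
from pathlib import Path

def setup():
    # (1) gather data and provision a clean test environment
    subprocess.run(["./provision_test_env.sh"], check=True)

def execute():
    # (2) apply the data to a test case and capture raw results
    subprocess.run(["./run_perf_test.sh", "--out", "results.json"], check=True)

def teardown():
    # (3) reset the testing environment to its initial state
    subprocess.run(["./reset_test_env.sh"], check=True)

def analyze():
    # (4) analyze the resultant data and report to interested parties
    results = json.loads(Path("results.json").read_text())
    p95_ms = results.get("p95_response_ms", 0)
    print(f"p95 response time: {p95_ms} ms")
    return p95_ms < 500  # example pass/fail threshold

if __name__ == "__main__":
    setup()
    try:
        execute()
    finally:
        teardown()
    raise SystemExit(0 if analyze() else 1)
```

A CI/CD job can run a script like this on every build and fail the pipeline when the analysis stage reports a regression.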

Security testing requires some manual accommodation

Security testing is a bit harder to automate because some of the test setup and teardown may involve specific hardware accommodations. Sometimes this consists of nothing more than adjusting configuration settings in a text file. Other instances might require a human to shuffle routers, security devices, and cables about in a data center.

Usability tests have a special set of challenges

Usability testing adds a degree of complexity that challenges automation. Test setup and execution require coordinating human activity. For example, if you're conducting a usability test on a new mobile app, you need to make sure that human test subjects are available, can be observed, and have the proper software installed on the appropriate hardware. Then each subject has to perform the test, usually under the guidance of a test administrator. All of this requires a considerable amount of coordination effort, which can slow down the testing process when done manually.

Although the actual execution of a usability test needs to be manual, most of the other activities in the setup, teardown, and analysis stages can be automated. You can use scheduling software to manage a significant portion of the setup (e.g., finding then coordinating the invitation of test subjects to a testing site). Also, automation can be incorporated into the configuration of the applications and hardware needed.  

In terms of observing test subjects, you can install software on the devices under test that will measure keyboard, mouse, and screen activity. Some usability labs will record test subject behavior on video. Video files can then be fed to AI algorithms for pattern recognition and other types of analysis. There’s no need to have a human review every second of recorded video to determine the outcome.
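As an illustration, here is a minimal sketch of instrumenting a test device to capture keystroke timing with the pynput library; the output file and the Esc-to-stop convention are assumptions, not part of any particular usability tool.

```python
import csv
import time
from pynput import keyboard

events = []

def on_press(key):
    # record a timestamp for every key press during the session
    events.append((time.time(), str(key)))

def on_release(key):
    # stop recording when the test administrator presses Esc
    if key == keyboard.Key.esc:
        return False

with keyboard.Listener(on_press=on_press, on_release=on_release) as listener:
    listener.join()

with open("keystroke_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "key"])
    writer.writerows(events)

print(f"Captured {len(events)} key events")
```

Inter-keystroke intervals computed from such a log give an automated, if rough, proxy for data-entry efficiency without a human reviewing the whole session.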

The key: home in on those test activities that must be performed by a human and automate the remaining tasks. Isolating manual testing activities into a well-bounded time box will go a long way toward making usability testing in a CI/CD process more predictable and more efficient.

Putting it All Together

Some components of software testing, such as functional and performance testing, are automation-friendly. Others, such as security and usability testing, require episodes of manual involvement, making test automation a challenge.

You can avoid having manual testing become a bottleneck in the CI/CD process by ensuring that the scope, occurrence and execution time of the manual testing activities, particularly those around usability tests, are well known. The danger comes about when manual testing becomes an unpredictable black box that eats away at time and money with no end in sight.

Learn More about Usability Testing

Discover more load testing and performance testing content on the Neotys Resources pages, or download the latest version of NeoLoad and start testing today.

Bob Reselman 
Bob Reselman is a nationally-known software developer, system architect, test engineer, technical writer/journalist, and industry analyst. He has held positions as Principal Consultant with the transnational consulting firm, Capgemini and Platform Architect (Consumer) for the computer manufacturer, Gateway. Also, he was CTO for the international trade finance exchange, ITFex.
Bob’s authored four computer programming books and has penned dozens of test engineering/software development industry articles. He lives in Los Angeles and can be found on LinkedIn here, or Twitter at @reselbob. Bob is always interested in talking about testing and software performance and happily responds to emails (tbob@xndev.com).

Welcome to the Land of Perftopolis – a land fraught with great turmoil, pitting brother against brother to restore the performance balance to the world.

In this world, many companies are using COTS (commercial off-the-shelf packaged solutions) applications such as SAP, King of Port Real, Capital of Perftopolis. SAP is asking its subjects from all over Perftopolis to adhere to new policies as they move from the legacy SAP R3 to the more modern S/4HANA before 2022. In turn, the Houses of Neotys and Worksoft have declared an alliance to help organizations secure their migration easily and productively.

SAP and Neotys (a new alliance in the North)

While making its way through Perftopolis, Neotys sought to surround itself and partner with the leaders and innovators of the industry. Meanwhile, the House of Worksoft had the advantage of three mighty dragons named: Certify, Capture, and Analyser. These superior beasts made the House of Worksoft the most potent house in the kingdom.

Working with its allies, the House of Neotys set forth with its plan to collaborate with the House of Worksoft ahead of the much-anticipated S/4HANA, with a singular focus – to prepare the world for it and make sure it would survive the transition.

Consequently, the houses of Neotys and Worksoft worked together. With their new bond, the partners quickly became the most powerful alliance in the North.

"We have always been focused on supporting the latest technology to help organizations transition to modern platforms. Many houses were asking us to support the primary COTS system: SAP R3," said Henrik Rexed from the House of Neotys. "To begin a smooth journey to SAP S/4HANA, we decided to provide full support for SAP GUI," Henrik said.

Today, NeoLoad is SAP-certified software.

The Rise of NeoLoad & Certify

The House of Neotys' dedication to collaborating with the best automated testing solutions in the SAP world (the House of Worksoft, in particular) made it a formidable power. After a few strategic discussions, the Houses of Neotys and Worksoft quickly understood the value of their combined offering to prospects and customers alike. "We had a solution that would not only support SAP system migrations but fulfill King Performa's decree," proclaimed Henrik Rexed from Neotys.

It was clear that a collaboration providing an integrated functional and performance testing plug-in with Certify would restore order in Perftopolis.

After the House of Neotys pushed its SAP support live, they met with the House of Worksoft’s R&D team to discuss how the integration could work between their offerings. As a result, the House of Neotys started to work with Certify’s APIs along with an SDK provided by the House of Worksoft.

Back in November 2018, the two entities started to collaborate on a plug-in integration with Certify that prioritized time savings.

As of March 2019, the integration was fully completed. The business processes and automation defined in the House of Worksoft’s Certify are now automatically converted into load testing assets in NeoLoad, the famed army of the House of Neotys.

What does this mean?

No human interaction is needed to translate a Certify business process into NeoLoad. Furthermore, the teamwork between the two powers easily converts functional tests into NeoLoad performance tests. Once the NeoLoad scenario is correlated, the performance script is automatically updated as changes are made to the Certify business process, minimizing scenario maintenance time for new versions. The plug-in turns functional tests into load tests while saving up to 97% of maintenance time.

As it came into power, the alliance of the Houses of Neotys and Worksoft made them the Kings of the North.

A House in Order

As of today, the combined household represents a strong partnership. Work remains to be done, especially as the families evangelize their new standard solution in talks and celebrations such as the House of Worksoft banquet entitled "The Transform Celebration" and the House of Neotys' Partner Event in Port Real (Portugal). Other gatherings between the two include House of Neotys events such as "Ramp Up" and select "DSAG" dinners.

Putting it All Together

The integration between NeoLoad and Certify is the best solution for the fast-approaching SAP migration requirement. NeoLoad is a rapid load testing platform, and Worksoft Certify is the most relevant solution for SAP. With the two working together, your SAP testing will never be the same.

Here’s a summary of what the plug-in will provide:

  • Testing of your current system to define the baseline
  • The ability to consolidate or change the environment
  • Testing of your changes to validate that no performance regression has been introduced
  • Initiation of cloud migration
  • Testing of your migrated application, comparison to the baseline to confirm the new platform

It will support simple web and mobile applications as well as all your critical ERP systems like SAP, Oracle, Siebel, Manhattan, and other COTS enterprise products.


[By Jonathan Wright]

After spending a year in Silicon Valley helping Apple and PayPal build next-generation AI platforms, I'm back in the UK assisting the government in preparing for Brexit and supporting start-ups in creating bleeding-edge AI as a Service (AIaaS).

How do you know when your AI is ready “to be unleashed into the wild”? How does testing AIaaS compare to traditional (non-AI) platforms?

Let me start by defining "black box" test methods. Their focus, based on what I found back in the 1990s when I started as an automation engineer, was:

  • Functional Testing – User interface and application programming interface
  • Non-functional Testing – Performance, load, and security testing

This is fine when you only care about the inputs/expected outputs. However, that doesn’t work when it comes to testing AI platforms.

The logical move is to combine "black" and "white" boxes into various shades of grey, depending on how many levels of the solution architecture you need to understand before you can begin testing.

Traditional (non-AI) platforms, like VoIP back in 1998, were built on a typical multi-tier architecture, with the presentation/UI layer based on a Java applet for which enterprise tools (e.g., WinRunner) had native support through Java hooks.

NOTE – Oracle has recently started to charge for its Java distribution, so OpenJDK distributions like Amazon Corretto have emerged.

At first glance, AIaaS can feel similar; it may even have a dynamic UI layer driven through a programmable implementation like GraphQL (above), hosted as a local Kubernetes cluster or in a multi-cloud environment. For example, a simple GraphQL contract query is based on a type system (schema registry) that covers tree-based read queries, mutations for updates, and subscriptions for live updates from the graph database, which can dynamically change the UI (Apollo). However, behind the AIaaS wrapper and the Amazon gateway is where the enterprise AI implementation hides, and that is where we start our GATED.AI testing journey.
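For orientation, here is a minimal sketch of issuing such a GraphQL query over HTTP from a test script. The endpoint URL and the schema fields (products, status) are hypothetical stand-ins for whatever the AIaaS gateway actually exposes.

```python
import requests

GRAPHQL_URL = "https://api.example.com/graphql"  # hypothetical endpoint

# Tree-based read query against an assumed product schema
query = """
query UnpublishedProducts {
  products(filter: { status: UNPUBLISHED }) {
    id
    title
    status
  }
}
"""

response = requests.post(GRAPHQL_URL, json={"query": query}, timeout=30)
response.raise_for_status()
products = response.json()["data"]["products"]
print(f"Unpublished products: {len(products)}")
```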

What does good look like?

In our example GATED.AI scenario (below), a client has asked us to list 1,000 products onto the Amazon marketplace and has provided us with images of each product. After running a simple GraphQL test, I discover that I have 95 unpublished products. Should I pass or fail the test? If we don't know what the expected outcomes are, how can we simply pass or fail the output? How do we know what good looks like, and when to stop? Utilizing the GATED.AI approach, it makes sense to switch to a more goal-oriented approach (e.g., the capability to identify X), with a given acceptance/accuracy such as a classification rate (e.g., 7 out of 10 times), within a given timeframe/training time (seconds/minutes/hours), in a given type of environment (compute/IOPS), with a given size of training data (hundreds or even millions of records).

GATED.AI Approach

This allows us to define GATED.AI scenarios ahead of time during the idealization phase (a minimal gate-check sketch follows the list):

  • GOAL: CAPABILITY to identify and correctly categorize images of products
  • ACCURACY: REQUIREMENT to successfully categorize women's fashion (1,000+ subcategories on the Amazon channel) with a CLASSIFICATION rate of over 70%
  • TIME: TIMEFRAME of one day to process over 10,000 product images
  • ENRICHMENT: SEMANTIC MODELS applying data engineering (Extract, Transform, and Load) heuristics for mining ecosystems to enable AI data lakes
  • DATA: CLUSTERING (percentage split) across the development training set (60%), testing set (30%), and proving set (10%) of the training set sizes (5,000/10,000)
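Here is the gate-check sketch referenced above: a small function that encodes the scenario's acceptance criteria. The ModelRun record and its field names are illustrative assumptions; the thresholds come straight from the list (classification rate over 70%, at least 10,000 images per day).

```python
from dataclasses import dataclass

@dataclass
class ModelRun:
    name: str
    classification_rate: float  # e.g., 0.72 means 72%
    images_per_day: int         # measured throughput on the training cluster

def passes_gate(run: ModelRun,
                min_classification_rate: float = 0.70,
                min_images_per_day: int = 10_000) -> bool:
    # A model passes only if it clears both the accuracy and the time gates
    return (run.classification_rate > min_classification_rate
            and run.images_per_day >= min_images_per_day)
```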

Now that we have an idea of what good looks like, we can start to prepare the GATED.AI data lake, i.e., the baseline training dataset dependencies for our GATED.AI tests.

GATED.AI Scenario

The main challenge around the GATED.AI approach is producing a realistic baseline training dataset that is representative of the target AI consumer needs.

  • AI CAPABILITY (GOAL): Identify and correctly categorize images of products

The temptation is for the test data engineer to use a generic dataset (for example stock images returned by a search engine) which would be more suitable for a more generalized AI (unsupervised).

In previous GATED.AI scenarios, we have seen much higher accuracy, around 80%+ classification rates, when the training dataset is mined by harvesting a specific dataset from the target AI consumer (for example, using a spider to transpose a data source such as a website containing existing product images with associated metadata).

  • AI REQUIREMENT (ACCURACY): To successfully categorize women’s fashion
  • AI CLASSIFICATION RATE (ACCURACY): Over 70%

NOTE – Data Visualisation platforms will help to identify a subset of the test training dataset and avoid cognitive bias/overfitting of the training data.

If I were going to utilize a crowdsourced AI platform with gamification, say listing a competition on Kaggle.com, I would be looking for accuracy as close to 90% as possible, based on the industry understanding that enterprise AI scores above 95% are nearly impossible with current cognitive technology capabilities. However, I would be equally interested in the time and data variance, along with the associated compute power consumed.

  • AI TIMEFRAME: Process 10,000 images/day
  • AI COMPUTE (TIME): Auto-scale training nodes/clusters (COMPUTE/GPU/IOPS) based on AI Benchmarks/Performance Index (PI)
  • AI VARIANCE (TIME): Future growth (customers utilizing the service/increase in product images)

Check out my previous blog, on Digital Performance Lifecycle – Cognitive Learning (AIOps) for further detail.

Optionally, the baseline training dataset can also be enriched by applying various semantic maps, for example, cross-validation from different ecosystems of ecosystems.

  • AI SEMANTIC MODEL (ENRICHMENT): Data engineering (Enhance, Transform & Load) heuristics

The above approach enables the creation of an AI ready data lake with an improved level of data hygiene; this can be process mined to identify business process flows for robotic process automation along with the associated datasets.

In Experiences in Test Automation, we explore the use of model-based testing (MBT) to support our business process testing efforts, modeling out something as simple as microservice testing based on a specification (Swagger/OpenAPI).

In the example above, we have modeled (MBT) a simple business process flow with acceptance criteria (e.g., login success) using cause-and-effect modeling, so that we can fully understand the relationship between the A-to-B mappings (inputs/outputs) based on the dataset used. It is therefore important not only to model the business process flows but also to model the associated test dataset (MBD).

As previously mentioned, this requires the test data engineer to fully understand not only the data quality of the unstructured/structured dataset but the data coverage and the context-sensitive nature of the domain dataset.

  • AI TRAINING DATA – mining for clustering, number of variables and associated test training data set size

Data engineering is as important as the data science activities: the ability to establish the data pipework that funnels unstructured data from heritage platforms into structured data, through a combination of business process automation and test data management (TDM) capabilities, is essential.

GATED.AI Example

In the above GATED.AI scenario, we define the high-level goal of identifying a product image (e.g., pair of trousers) and the image category mapping (e.g., product type).

  • GOAL – AI CAPABILITY: Image category mapping
  • ACCURACY – AI CLASSIFICATION RATE: >70%
  • TIME – AI TIMEFRAME: <1 day for 10,000
  • ENRICHMENT – AI SEMANTIC: Harvest the manufacturer's website and internal ERP platform
  • DATA – AI TRAINING DATA: Training images dataset size (5,000/10,000), semantic model parameters (500+) & training cluster nodes (4/8)

Now, the acceptance criteria for the GATED.AI scenario were that the classification rate was over 70% and that the platform could process 10,000 images per day.

In the above three cases, only one AI model passes the GATED.AI scenario: Model 1 fails on classification rate, and Model 3 fails on training time (even with 4 cluster nodes, it exceeds the one-day timeframe), leaving Model 2 as the only model that clears both gates.
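To make the selection explicit, here are hypothetical figures for the three models (the original results chart is not reproduced here), checked against the same gates sketched earlier: a classification rate over 70% and at least 10,000 images per day.

```python
# Hypothetical, illustrative numbers only
models = {
    "Model 1": {"classification_rate": 0.65, "images_per_day": 14_000},
    "Model 2": {"classification_rate": 0.74, "images_per_day": 12_000},
    "Model 3": {"classification_rate": 0.82, "images_per_day": 2_500},
}

for name, m in models.items():
    ok = m["classification_rate"] > 0.70 and m["images_per_day"] >= 10_000
    print(name, "PASS" if ok else "FAIL")  # only Model 2 passes both gates
```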

NOTE: If we had not used the GATED.AI approach, the temptation would have been to select Model 3, as it has the highest classification rate, even though it takes 5 times longer than Model 2.

GATED.AI Performance Engineering

The above GATED.AI scenario demonstrates the importance of effective performance engineering in AIaaS: we must assure not only that the service can handle the expected GATED.AI volumetric models, but also that the underlying enterprise AI platform can scale (e.g., auto-scale compute nodes to handle future growth variations) and be resilient (e.g., self-healing, chaos engineering). At the start of this blog, we mentioned that traditional testing was no longer going to cut it, so we needed to adopt a "grey box" testing approach to improve visibility of individual components and identify bottlenecks throughout the target AIaaS architecture.

Therefore, I am going to refer to a keynote that I gave to the British Computer Society (BCS) back in 2011. In this keynote, I proposed that user load, similar to the “automation pyramid” was only the tip of the iceberg and that interface (messaging/APIs) combined with an ambient background (traffic/noise) should be the real focus to achieve accurate performance engineering.

In the example above, for a sizeable E-SAP migration program, we identified over 500 interfaces in the enterprise architecture diagram, which either needed to be stubbed out with service virtualization or used to generate bulk transactions as messages or traffic (e.g., iDocs).

Like our GATED.AI scenario, the business process flow to trigger a simple report may only be a couple of steps through the SAP GUI (learn more about SAP testing), and the response for submitting the report may be a couple of seconds.

However, in the background there is a large volume of transactions going on across the ecosystem (e.g., internal and external systems, both upstream and downstream), monitored above by the application performance management (APM) platform.

Keeping this in mind, if we return to the GATED.AI scenario, our AIaaS platform is built on a Kubernetes cluster, which can be deployed locally or multi-cloud.

For this example, I will be deploying the following Kubernetes cluster locally (on an Alienware m51, due to the amount of memory required):

  • API Gateway (REST API)
  • Apache Kafka (Streaming Service)
  • Avro Schemas (Schema Registry)
  • Neo4J (Graph Database)
  • MongoDB (NoSQL)
  • Postgres (RDBMS)
  • Apache Zookeeper (Coordination Service)

In the above screenshot, I'm sending a simple JSON message to the ML microservice (where I can intercept/manipulate the request/response pairs), which triggers a number of Cypher queries and sets a flag in MongoDB indicating that the product is ready to be listed to the channel.

Now, if the product successfully matches a valid women's fashion category on the specified channel (e.g., eBay vs. Amazon), the status will change from "received" to "ready to list."

Once the "channel listing service" identifies a cube of 10,000 products that are "ready to list" to a channel, it publishes them to the appropriate marketplace every 15 minutes.
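A minimal sketch of driving that flow from a test harness is shown below: post a product message to the ML microservice, then poll until the status moves from "received" to "ready to list." The endpoint paths and payload fields are hypothetical placeholders, not the actual service contract.

```python
import time
import requests

BASE_URL = "http://localhost:8080"  # hypothetical local cluster gateway

product = {
    "sku": "TROUSERS-001",
    "image_url": "https://example.com/images/trousers-001.jpg",
    "channel": "amazon",
}

# Send the JSON message that kicks off classification
resp = requests.post(f"{BASE_URL}/ml/classify", json=product, timeout=30)
resp.raise_for_status()
product_id = resp.json()["id"]

# Poll until the product is ready to list (or give up after about a minute)
status = "received"
for _ in range(12):
    status = requests.get(f"{BASE_URL}/products/{product_id}", timeout=30).json()["status"]
    if status == "ready to list":
        break
    time.sleep(5)

print(f"Product {product_id} status: {status}")
```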

So as a performance engineer, where do I focus my efforts to prove the system is performant, resilient, and can scale?

  1. Observing the behavior of the front end or API?
  2. Interpreting the interactions between node/endpoints within the ecosystem (upstream & downstream), e.g., Kafka Producers & Consumers?
  3. Modeling the sentiment/context of the business processes (cause & effect modeling), e.g., How long does it take images to list on the channel?

Traditional performance testing focused on observing the behavior of the endpoint (UI/API), so if the response time for a REST call took longer than a few hundred milliseconds, then the microservice was worth investigating.

However, that is no longer the case, due to system..


One of the biggest successes in modern site reliability engineering has been to take something that used to be hard and make it easy. This is particularly true with machine learning, which by nature is not only data intensive but can also require vast amounts of processing power. In the past, a data scientist wrote a bunch of analytic code in a statistical language such as R. Then he or she deployed the code to high-powered servers, loaded the data, and waited for results. While workable, this was time-consuming, even if your computing targets were optimized virtual machines. Deploying and running a set of machine learning algorithms could take hours, if not days. Performance testing machine learning models is essential.

Enter Kubernetes and the next thing you know the code is wrapped up in containers that are designed to run in parallel, scaling up/down on demand. The result is tens or even hundreds of containers running the same code simultaneously. Machine learning processing that used to take hours now takes minutes. Also, the code can be updated faster.

Data scientists were in heaven. After all, what’s not to like? Not only could they have faster processing turnaround of the machine learning models, but they could also alter code quickly for improved analysis.

What followed – data scientists took advantage of the new opportunities and started to tweak their code to improve the models. Then the unanticipated happened. The actual cost of running the models increased while the benefits incurred were marginal. Why? The models were never subjected to adequate performance testing.

This might seem like a techno-fairytale, but sadly, unanticipated cost increases associated with machine learning processing happen all too often. Nobody sets out to waste money. Unforeseen costs are the result of two factors.

  1. Developers not having a clear understanding of the impact of altering machine learning models relative to data consumption.
  2. The absence of performance testing as an intrinsic part of the machine learning release process.

Understanding the Cost of Refactoring a Machine Learning Model

Every time a new dimension is added into the machine learning model, you’ll need to process more data. In doing so, it’s going to cost you time or money. It’s going to slow down processing time, or you’ll need to spend more money on additional computing resources to support the increased workload.

Imagine this scenario. You have a machine learning model designed to identify prospective retail investors for a posh mutual fund. The current model, in production, takes into account residential zip codes for identifying high net worth individuals. However, new information is uncovered, revealing that a significant number of current customers live near exclusive golf courses. Given this insight, the data scientist responsible for refining the machine learning code refactors it using a geolocation map of prospective users to golf course membership, assigning a “prospective customer ranking index” accordingly.

The programming required to perform the refactoring is vanilla. To support the revision, additional data about golf courses is required. The data scientist identifies a useful data source and refactors the code incorporating the new golf course logic into the existing code base. She runs a trial test of the new machine learning model on her desktop machine using a few thousand rows from the newly identified course data. The model works as expected, and the code is released.

Nearly a day later, the data scientist receives an email from SysOps informing her that, because of the recent update, the number of pods required in her cluster needs to increase to 300% of its previous size (which requires manager approval). The classification algorithm responsible for ranking a golfer as likely to invest in the mutual fund now takes 45 seconds instead of 15 (no longer within the threshold for real-time suggestions or asynchronous email batch queue processing). It turns out that not only does it take more time to get through the millions of rows of data the new machine learning model requires, but the actual time needed to process one row also increased significantly within the machine learning classification process. The sad fact is that an anticipated improvement to the underlying algorithm increased cost overall while providing marginal benefit.
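As a back-of-the-envelope illustration of why that per-row slowdown matters at scale, here is a small calculation. The 15-second and 45-second per-classification figures come from the story above; the daily row volume and per-pod cost are purely hypothetical assumptions.

```python
rows_per_day = 2_000_000          # hypothetical daily volume
old_seconds_per_row = 15
new_seconds_per_row = 45
pod_cost_per_hour = 0.50          # hypothetical cost of one worker pod
seconds_per_pod_day = 24 * 3600   # work one pod can do per day

def pods_needed(seconds_per_row: int) -> int:
    total_seconds = rows_per_day * seconds_per_row
    return -(-total_seconds // seconds_per_pod_day)  # ceiling division

old_pods = pods_needed(old_seconds_per_row)
new_pods = pods_needed(new_seconds_per_row)
print(f"pods before: {old_pods}, pods after: {new_pods}")
print(f"extra cost per day: ${(new_pods - old_pods) * 24 * pod_cost_per_hour:,.2f}")
```

Tripling the per-row time roughly triples the pod count, which is exactly the kind of jump a performance test would have surfaced before release.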

The moral of this story is that even the simplest code change can have unintended, significant consequences. This holds for machine learning models just as it does for modern commercial applications operating at web scale. The only difference is that the Facebooks, Googles, and Amazons of the world got the message years ago, so they made performance testing essential to the release process. Many companies that adopt ML internally, or outsource to "AI-enabled offerings" generously provided by these behemoths, are just now getting the wake-up call about how prone non-deterministic systems are to cost-incurring performance issues.

Avoid Performance Testing Machine Learning Models at Your Own Risk

Desktop performance rarely maps to performance at scale. To get an accurate picture of production-level performance, you need to be able to exert a significant load on the system. While a single desktop computer is well suited for programming, it's not equipped to create or support the degree of load you find in production. In many organizations, performance testing of machine learning models never goes beyond the desktop. When things go wrong, the all-too-common response is, "but it worked on my machine." The rest is left up to the Finance Department when paying the invoice.

On the other hand, some organizations realize the limitations of testing on the desktop and will put monitors in place on their production systems to detect anomalies. This is useful from a systems administration standpoint, but production monitoring does not curtail cost increases. To accurately anticipate the actual operational costs of a given machine learning model before release, you need to exert a significant load. This is best done within the boundaries of a well-designed performance test conducted in a testing environment that emulates production as closely as possible.

Such performance testing requires that particular attention be paid to load exertion, which is typically done using test tools that support parallel processing, such as those provided by Neotys. Performance testing should also become part of an overall continuous integration/continuous deployment (CI/CD) process. Automating performance testing of machine learning under CI/CD ensures that nothing falls through the cracks.
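For teams without a dedicated tool in the pipeline yet, a minimal sketch of exerting parallel load against a model-scoring endpoint from a CI job might look like the following. The endpoint URL, payload, and thresholds are hypothetical; a purpose-built load-testing tool would normally replace this.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

SCORING_URL = "http://test-env.example.com/score"  # hypothetical test endpoint
CONCURRENCY = 50
TOTAL_REQUESTS = 1_000

def one_request(_):
    payload = {"zip_code": "90210", "near_golf_course": True}  # illustrative features
    start = time.perf_counter()
    requests.post(SCORING_URL, json=payload, timeout=60)
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(TOTAL_REQUESTS)))

p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile
print(f"p95 scoring latency under load: {p95:.3f} s")
raise SystemExit(0 if p95 < 1.0 else 1)  # fail the CI job past an assumed 1 s budget
```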

Regardless of whether your organization is trying to move off desktop-based performance testing of machine learning or away from relying on system monitors to report trouble after the fact, you need to establish performance testing of machine learning models as an essential part of the overall release process.

Putting it All Together

Few organizations aim to create machine learning models that incur expenses which could’ve been avoided. However, many do so mostly because of a lack of awareness about the real cost of computing. Machine learning is an essential component of modern software development. Applying appropriate performance testing to all aspects of the machine learning release process makes it cost effective as well.

Learn More

Discover more load testing and performance testing content on the Neotys Resources pages, or download the latest version of NeoLoad and start testing today.

Paul Bruce
Sr. Performance Engineer

Paul’s expertise includes cloud management, API design and experience, continuous testing (at scale), and organizational learning frameworks. He writes, listens, and teaches about software delivery patterns in enterprises and critical industries around the world.


  [By Alexander Podelko]

Automation is an extremely vague term when applied to performance testing. Simply mapping functional testing notions and approaches onto it can be very misleading. There are multiple levels of automation in performance testing, and it is almost pointless to talk about "automation" without clarifying what exactly we are talking about. Unfortunately, that happens quite often, as "automation" has become a buzzword nowadays. The question is not the binary "automate or not automate" – the question is what and how to automate in your specific context, considering available tools and technologies.

It is not clear-cut even in functional testing – check Michael Bolton’s “Manual” and “Automated” Testing and The End of Manual Testing blog posts that challenge the notion of “automated” testing. They are somewhat going against the mainstream – but have a lot of good points and it indeed appears that the mainstream sees it in rather a simplified way.

It is even more complicated with performance testing. There is a danger of replacing holistic performance testing (as a part of performance engineering) with just running the same tests again and again automatically. While we may get closer to these bright "automation" promises in the future (as described below), we are pretty far from that today, and such a replacement leaves huge gaps in mitigating performance, reliability, and scalability risks. So let's consider the different meanings of "automation" and what they mean for performance testing.

Historic Meaning of automation testing

The first meaning of "automation" was simply using a testing tool in functional testing – as opposed to "manual" testing, where testers work directly with the system. According to that meaning, performance testing was always "automated": the things you can do without a load testing tool or harness are very limited and apply only in rather special cases.

Generic Meaning of automation testing

Another level is getting all the pieces together, including setting up and configuring the system under test and the testing environment. Here we have had a great breakthrough with the arrival of the cloud and Infrastructure as Code. While these are not specific to performance testing, there is no reason not to use them. Such automation, in addition to saving effort, eliminates human errors in configuring systems – a typical cause of hard-to-diagnose performance issues. Another item that may be added here is data generation. That part is a prerequisite for continuous testing, but it could (and should) be used in any kind of performance testing when available and feasible.
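As a small illustration of scripted data generation during environment setup, here is a minimal sketch that produces a CSV of synthetic users for a load test. The column names, value ranges, and file name are illustrative assumptions.

```python
import csv
import random
import uuid

random.seed(42)  # a fixed seed keeps test runs comparable

with open("test_users.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "region", "orders_per_month"])
    for _ in range(10_000):
        writer.writerow([
            uuid.uuid4().hex,
            random.choice(["eu-west", "us-east", "ap-south"]),
            random.randint(0, 50),
        ])
```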

Today’s Meaning of automation testing

Then we may talk about automatic test scheduling and execution with simple result-based alerting. This is also a prerequisite for continuous testing.
The distinction I see here is between performance testing as a part of Continuous Integration / Deployment / Delivery (some performance tests are run for each code commit) and continuous performance tests that are run periodically (larger tests that can, for example, run once a day). The difference becomes very significant if we have multiple commits per day – so we need to build a hierarchy of tests to run (on each commit, daily, weekly, etc.); the details, of course, depend heavily on the specific context.
We definitely need tool support for that – but most advanced tools already have that functionality, so you can easily start such continuous performance testing. But today you still should:

  • find an optimal set of tests for regression testing;
  • determine how and when to run them;
  • create and maintain scripts;
  • do troubleshooting and non-standard result analysis.

Considering iterative development, continuous performance regression testing does have significant value. However, except for a few trivial cases, it still leaves script creation and test maintenance in the hands of performance testers – so, as you add more tests (which usually means more scripts), you get more overhead to maintain and troubleshoot them. And running the same tests again and again solves just one task: making sure that no performance regression was introduced.

Looking into the Future

And then we get to much more sophisticated automation topics – ones we are just approaching and which nowadays are often grouped into the "Artificial Intelligence (AI)" category. They basically boil down to defining what we are testing, how we are testing it (scripting, etc.), and what we are getting out of it (analysis, alerting, etc., beyond simple if statements).
Some interesting developments in that direction can be found in Andreas Grabner’s Performance as Code blog post and presentation.

There are numerous startups trying to solve these problems with AI. My understanding is that they are rather in the early stages and current products can be used in rather special cases. It is rather a large separate topic – you may check, for example, a recent Mark Tomlinson webinar Solving Performance Problems with AI as an introduction to it.

When we get to the point where this works in more generic and sophisticated cases, it will expand what we can "automate" and probably shift further what a performance engineer does. At present, however, while automation (in today's and the generic meaning) is definitely here and should be embraced, it is just one piece of the whole performance puzzle, and the exact combination of pieces to use depends heavily on your specific context.

Learn More about the Performance Advisory Council

If you want to learn more about this event, see Alexander’s presentation here.

These days, migrating code to new, more powerful hardware platforms is no longer an occasional event. It's continuous and, for many businesses, unrelenting. Companies need to migrate code (software migration) to take advantage of the technological innovations that drive their competitive edge. Some feel that all that's required is to click the "deploy" button in a CI/CD tool and let the automation take over. Those who've earned their engineering stripes understand that a lot more is required.

Migrations are complex and risky. The best way to protect a system migration and ensure its success is to take the time to develop a well-informed plan that addresses the business and technical considerations, and to provide the right level of observability over system performance to all the teams involved. As the saying goes, few people plan to fail; instead, they fail to plan.

Let’s take a closer look.

There are No Second Chances

The first thing to understand about software migration is that the odds are you won't get a second chance. Migrations are significant events that, in many cases, involve taking a system down for a predetermined period. Once down, it must be brought back online on time and working according to expectation. Failure to meet this necessary condition can be career-ending and, more importantly, can mean lost revenue, perhaps the loss of the business. After the damage is done, there is rarely a do-over. So, it's essential that everybody involved in the migration understands that failure is not an option.

Create an Optimal Performance Baseline First

The last thing you want to do is move current performance problems to a new environment. This only transplants marginal code that will be poisonous to the new hardware. You only want to migrate code that's proven to be performant on the old hardware.

This requires creating an optimal performance baseline in the legacy hardware environment for the system to be migrated. This way, if you know the code runs successfully on the legacy system yet molasses-slow in the new environment, reason dictates where to look: the hardware. It's a simple apples-to-apples comparison. However, when you migrate problematic code to a new environment and things go haywire, it could be anything.

Performance baselines require metrics, evidence of system operational capacity, and changes under various levels of pressure. Load testing is often used to put this pressure on existing systems before migration, then again after migration but before release to users. Whether you get your metrics from system monitoring dashboards, in-product telemetry, or through testing tools, it’s necessary to simulate the right amount of pressure so that unknowns come out of the woodwork beforehand, giving you observability before and then reusable validation processes after migration. The last thing you want is to leave it to chance that anything could happen.
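When it comes time to compare runs, a minimal sketch of an automated baseline comparison might look like the following. The metric names, file names, and 20% tolerance are illustrative assumptions; in practice the numbers would come from your load-testing tool or APM exports rather than hand-written files.

```python
import json

TOLERANCE = 1.20  # allow 20% degradation before failing the migration gate

with open("baseline_legacy.json") as f:
    baseline = json.load(f)      # e.g. {"p95_ms": 240, "error_rate": 0.002}
with open("candidate_migrated.json") as f:
    candidate = json.load(f)     # same metrics captured on the new platform

failures = []
for metric, reference in baseline.items():
    measured = candidate[metric]
    if measured > reference * TOLERANCE:
        failures.append(f"{metric}: {measured} vs baseline {reference}")

if failures:
    raise SystemExit("Regression against baseline: " + "; ".join(failures))
print("Migrated environment is within tolerance of the legacy baseline")
```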

In situations where it "could be anything," you compromise your customers' confidence in the whole system. Just look back at the horrendous debut of HealthCare.gov in 2013. Unfortunately, the team never established an optimal performance baseline. In fact, they had no baseline whatsoever. Success was seemingly never in the cards.

Maintain Maximum Visibility and Fast Feedback

The rule of thumb for a new system migration is not to launch on modern hardware without establishing proper visibility into what that hardware is doing. You need to make sure that adequate system monitoring and reporting are in place from the start, and that you have clarity into the underlying infrastructure, especially when using cloud providers like AWS, Google Cloud, or Azure. These systems are complex and often transient. Virtualized instances are created and destroyed at a moment's notice: here now, gone in seconds. It's entirely possible to encounter a performance testing problem on a VM that no longer exists. Avoid this with a forensic approach. Having access to logs and other reports that comprehensively capture past events is a must. The ability to easily compare performance testing results before and after changes or component reconfigurations is a critical capability to have in your tool belt.

Once you migrate, your feedback capabilities need to be as immediate as possible. The faster the rate of change, the more quickly the information needs to be available, because the consequences of the moment will have a more significant impact. Think of it this way: when taking a stroll down the street, you can afford to look away at something on the other side. When running fast, looking elsewhere even for a second could mean colliding with a light post. The same holds for systems under migration. The faster you go, the more you need to see as quickly as possible. Metaphorically speaking, the light post will likely be your system going down, and it may never come back.

Keep Feature Releases in Sync with Migration Updates

Consider a typical problem. You’re performing a late-stage test on an intended migration target, and you uncover a way to significantly boost performance in the software. It’s not a bug fix and not a new feature; it’s an enhancement. Nonetheless, being an Agile shop, you do what you’re supposed to do: you put the change request into the backlog. A month or two goes by before your team finally gets the go-ahead to implement it. There’s only one problem. The target for your new code is nothing like the environment you designed the enhancement around. Everything is new.

Had you implemented the enhancement against the platform that was operational when the need was identified, the upgrade would have been far more manageable. Now, you have to start from the beginning and run tests on both the original environment and the new one (good testing practice recommends this for accurate comparison and verification).

The flexibility of your performance toolchain should match your need to re-test in different environments. A quick change to hostnames shouldn’t require re-scripting or rework. Activities such as automatically tagging dynamic resources in APM tools or changing load testing endpoint details should be absurdly simple, whether for a human running ad-hoc sprint tests or as part of configuration sets in continuous integration jobs that trigger performance tests. Failure here results in delays during patch and feature rollout, or worse, a lack of visibility into the negative impact on end users of a necessary post-migration adjustment.
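To make that concrete, here is a minimal sketch, assuming environment variables named TARGET_ENV and TARGET_BASE_URL (both illustrative), of how a test script can stay untouched while the target changes:

```python
# Hypothetical sketch: keep environment-specific details (hostnames, base URLs)
# out of the test scripts so re-targeting a migrated environment is a config
# change in the CI job, not a re-scripting exercise.
import os

# Illustrative per-environment targets; real values would live in CI config.
ENVIRONMENTS = {
    "legacy": "https://app.legacy.example.com",
    "migrated": "https://app.new.example.com",
}

def base_url():
    env = os.getenv("TARGET_ENV", "legacy")
    # An explicit TARGET_BASE_URL override wins over the named environment.
    return os.environ.get("TARGET_BASE_URL", ENVIRONMENTS[env])
```

A CI job would then export TARGET_ENV=migrated (or TARGET_BASE_URL=...) before triggering the same, unmodified performance test.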

The moral of this tale is that there needs to be tight coordination between feature release and migration updates. It’s not a matter of just sending change requests to a backlog and working on them when there’s time available in a future sprint. There’s a big difference between “anytime” and the “right time.” If a feature release depends on a particular machine, VM or container environment, it’s best to implement that feature against that environmental dependency. To do otherwise will cost money. Many times the cost can outweigh the benefit.

Make Trending Continuous

Trends are important. They help us determine when to add resources and when to take them away. They enable us to understand behavior at micro and macro levels. Patterns tell us about the overall capabilities of our systems.

Trends do not happen in the moment. They take time. No single snapshot of CPU utilization will tell us when a maximum threshold is near. The moment might tell us we’re maxed out; it won’t tell us we’re about to max out. That is a significant difference.

To enable such a determination, you need to collect data over time, all the time. This is particularly relevant when it comes to ensuring that an ops migration is going according to plan.

Data collection shouldn’t be episodic. It’s more than saying, “we’re migrating over to the XYZ provider next month; let’s set up the hardware environment now to make sure our expectations are in line with reality.”

Of course, collecting data to determine operational trends at the onset of a migration is useful, but data collection should not stop post-migration. A good rule of thumb is to start trending on the new hardware immediately and keep it going throughout the life of the system. You can do this by creating continuous integration (CI) jobs that trigger performance checks and small load tests on the new hardware at predefined intervals. These jobs should run not only before the migration happens but continue after it has occurred.
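A minimal sketch of such a scheduled check might look like the following; the history file name, the ten-run window, and the 15% drift tolerance are assumptions, and the p95 value would come from whatever tool runs the load test:

```python
# Hypothetical sketch of a scheduled CI step: append each run's p95 to a
# history file and flag drift against the rolling trend rather than judging
# a single snapshot in isolation.
import json
from pathlib import Path
from statistics import mean

HISTORY = Path("p95_history.json")  # illustrative location

def record_and_check(latest_p95_ms, window=10, tolerance=1.15):
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    trend = mean(history[-window:]) if history else latest_p95_ms
    history.append(latest_p95_ms)
    HISTORY.write_text(json.dumps(history))
    # Fail the job if the latest run drifts more than 15% above the rolling trend.
    return latest_p95_ms <= trend * tolerance
```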

Having a continuous stream of complete, useful performance data provides the depth of information necessary to accurately determine the trends that will affect the current migration and the next one. As those who have been in the engineering community for a while understand, when it comes to ops migrations it’s never over; it’s just one continuing story.

Putting it All Together

An ops migration is rarely smooth. Even under perfect conditions, adversity can present itself. The trick is to be able to handle mishaps effectively, to expect the unexpected, and to be able to meet the challenges at hand.

Preparation is paramount. You must understand that code migration to a new operating environment is a timeboxed event without the option of a mulligan. Teams also need the right platforms for testing and observability in place to identify and fix performance issues quickly, before and after migration. The code has to be performant before the migration occurs. You also need monitoring mechanisms and fast feedback loops that allow maximum visibility into the target environment’s state, at the time of migration and continuously afterward. You’ll need this information to determine the operational trends that will guide future migrations.

An ops migration can make or break the careers of those responsible. Keeping the suggestions described here in mind will help you protect the current migration and establish practices that reduce the risk of those in your future.

Learn More

Discover more load testing and performance testing content on the Neotys Resources pages, or download the latest version of NeoLoad and start testing today.

Paul Bruce
Sr. Performance Engineer

Paul’s expertise includes cloud management, API design and experience, continuous testing (at scale), and organizational learning frameworks. He writes, listens, and teaches about software delivery patterns in enterprises and critical industries around the world.


  [By Stijn Schepers]

At the beginning of February, I was honored to represent Accenture at the Performance Advisory Council (PAC) in Chamonix, France. Bringing Performance Engineering to the foot of the highest mountain in Europe (Mont Blanc) must have been challenging for Neotys to organize, but they did an awesome job! It was hard to stay focused with such a breathtaking view.

I was privileged to be selected as one of the 12 speakers and to kick-off the event. My topic was about Performance Test Automation beyond Frontiers.

DevOps is a blessing for a Performance Engineer

DevOps has been a blessing for the Performance Engineer. Finally, a Performance Engineer can add value. The days when load testing is done only at the end of the SDLC are, hopefully, finally over. With classical waterfall delivery models, performance testing was done far too late in the life cycle, with a primary focus on fully integrated, end-to-end testing. There was not much time to optimize systems and improve the non-functional aspects of a solution. With DevOps, this has changed completely! Performance testing starts when features are being designed, by assessing the risk those features pose of a performance degradation. Developers write unit tests to profile their code. Automated load tests are executed against web services. End-to-end load tests are a final validation of the performance. APM tooling (e.g., Dynatrace, AppDynamics, New Relic) provides valuable insight into the performance impact of these new features in production. With DevOps, performance testing shifts in two directions: left toward the design phase (risk assessments, unit testing) and right toward production (APM, synthetic monitoring).

During the first PAC in Scotland, I presented in great detail how a move to a DevOps delivery model has changed the way performance testing happens: https://www.neotys.com/blog/pac-performance-testing-devops/

Automate wisely:  analyse your results based on raw data

DevOps is also about continuous automation. A performance engineer should automate load testing as much as possible. But to what extent can we automate performance testing? Can we automate the analysis of performance measurements and use the outcome of that analysis to pass or fail a build? How can we automatically test the latest version of the application and measure the performance difference against the previous version? And, based on this difference, how can a framework make an automated decision to pass or fail a build?

I don’t know of any commercial tool that can automate load testing in such a way that we can fully trust the framework to make the decision to pass or fail a build. When you analyse your performance results, it is absolutely crucial to analyse the raw data (every measurement of request response time) and not just the averages. Averages hide system bottlenecks. Therefore, my team took up the challenge of building a new, innovative framework that not only automatically executes load tests but also analyses the results based on the raw data. This framework is called RAF, the Robotic Analytical Framework.
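A toy example makes the point: two runs can share the same average while only one of them has the long tail that real users actually feel.

```python
# Toy example of why averages hide bottlenecks: both runs average ~100 ms,
# but the second one has a long tail that an average alone would never show.
from statistics import mean

steady = [100] * 100              # every request takes ~100 ms
spiky = [60] * 95 + [860] * 5     # most requests are fast, 5% are very slow

def p99(samples):
    return sorted(samples)[int(0.99 * (len(samples) - 1))]

print(mean(steady), p99(steady))  # 100, 100
print(mean(spiky), p99(spiky))    # 100, 860
```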

RAF, the best friend of a Performance Engineer

RAF is unique in that it enables a Performance Engineer to automatically analyse the raw data of a load test and to drive continuous delivery. The picture below explains how the framework works.

  1. A NeoLoad NoGUI test is automatically launched from Jenkins
  2. NeoLoad exports the RAW data into a file share
  3. RAF automatically polls the file share to find the results file with the raw data. RAF transfers this data – together with runtime data (RunID, version number) – into a MySQL database. The MySQL database becomes a centralized repository of all the load test results.
  4. Based on the type of test (e.g., a regression test), RAF analyses the raw data using predefined Validation Rules and smart algorithms. The analysis uses the raw data, the error count, and the throughput. The output of the validation is a Test Execution Score (T.E.S); a rough sketch of this scoring step appears after the list.
  5. Based on the value of the Test Execution Score (T.E.S), a build is automatically failed or passed.
  6. RAF performs clean-up steps
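RAF itself is the team’s own framework, so the following is only a rough, hypothetical sketch of the kind of scoring step described above; the weighting and field names are invented, and only the raw-data inputs and the 70-point threshold mirror the article.

```python
# Rough illustration only -- not RAF itself. A score in the spirit of the
# Test Execution Score: compare raw response-time percentiles, error rate,
# and throughput of the current run against the previous run.

def percentile(samples, q):
    return sorted(samples)[int(q * (len(samples) - 1))]

def execution_score(current, previous):
    """current/previous: dicts with 'response_times_ms', 'errors',
    'requests', and 'duration_s' (all hypothetical field names)."""
    score = 100.0
    # Penalize p95 response-time degradation relative to the previous run.
    p95_now = percentile(current["response_times_ms"], 0.95)
    p95_prev = percentile(previous["response_times_ms"], 0.95)
    score -= max(0.0, (p95_now / p95_prev - 1.0) * 100)
    # Penalize a rising error rate.
    err_now = current["errors"] / current["requests"]
    err_prev = previous["errors"] / previous["requests"]
    score -= max(0.0, (err_now - err_prev) * 1000)
    # Penalize falling throughput.
    tput_now = current["requests"] / current["duration_s"]
    tput_prev = previous["requests"] / previous["duration_s"]
    score -= max(0.0, (1.0 - tput_now / tput_prev) * 100)
    return max(score, 0.0)

def build_passes(current, previous, threshold=70):
    return execution_score(current, previous) >= threshold
```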

Automation Frameworks are a set of tooling

BI Tooling

Frameworks like RAF need to solve complex issues in a simple and intuitive way. RAF is therefore built in Python and is easily configurable for the application you want to test. The prime goal of RAF is to drive load test automation and analytics. Tableau is used as the data visualisation tool; it is important to create dashboards that visualize the test results in a comprehensive way.

The dashboard below consists of several graphs. The primary piece of information is the Test Execution Score (T.E.S). A high score means that the performance, throughput, and error rate are in line with previous execution runs. When the score is greater than or equal to 70, the build is PASSED. When the score is lower than 70, the performance degradation is not acceptable and the build is FAILED. Secondary information includes the raw data, a trend line, the error count, the throughput, and a percentile graph.

For a performance engineer, response times are by far the most important measurement; they are directly linked to the end-user and digital experience.

APM Tooling

Don’t re-invent the wheel: extend your automation framework with software solutions that are already in place. Extending an automation framework with APM tooling is a logical next step. APM tooling will provide you with resource utilization (heap, CPU, memory, etc.) and additional performance insight (deep-dive capabilities). Additionally, health rules can be defined that alert the test engineer (via email, WhatsApp, or SMS) when APM metrics do not conform to the baseline. As an example, AppDynamics provides a framework called Dexter, which is ideal for extending the automation framework with APM features. For more information, check out Dexter’s documentation: https://github.com/Appdynamics/AppDynamics.DEXTER/wiki

Take-aways

Performance test automation is not a myth. It is a must-do to speed up performance testing and to lift the quality and consistency of testing. By automating performance tests, performance engineers can spend more time assessing designs and architectural solutions and coaching graduates to become performance engineers. I believe the biggest bottleneck in performance engineering is not technology but a lack of knowledgeable, experienced engineers who understand the profession. These senior engineers should focus on building clever automation frameworks so they have more time for coaching and mentoring.

The PAC at Chamonix was a magical event. The scenery of the Mont Blanc inspired the speakers to get the most out of the event!

Learn More about the Performance Advisory Council

Want to learn more about this event? See Stijn’s presentation here.


  [By Twan Koot]

We just completed a successful PAC event in Chamonix, which included two days of experience and knowledge sharing. If you’re reading this post, you’re likely interested in my presentation, but there is more to it than that! Let’s dive in, shall we?

A large part of the work we performance testers/engineers do after a load test involves examining and analyzing the performance monitoring data collected during the test. We each have our own methods for identifying issues. One of the most common I’ve come across is what I like to call the “three-step dance.”

The first step is to check the “Magic Three” metrics (CPU, RAM, and I/O) by looking at graphs on the APM dashboard. The second step is a check against a percent-usage threshold (for instance, 85% CPU consumed); if the metric stays below this threshold, the component is considered satisfactory. The third step is to examine response times against resource graphs to look for matching peaks. Should there be a spike in response times, it would stand to reason that a similar one exists in CPU usage, right?

The method described above for analyzing performance testing data is, in my opinion, far from optimal. It’s based on percent-usage metrics, which don’t accurately reflect server health. So, is there a better approach to analyzing performance monitoring data? Yes: USE (Utilization, Saturation, and Errors), a method described in Systems Performance: Enterprise and the Cloud by Brendan Gregg (brendangregg.com).

The first part is quite common among performance testers: utilization may be the most popular metric to consider. Nearly all monitoring and APM tools support it, and for most testers it’s the metric they watch. Utilization is just the starting point, though. The second, less popular type, saturation, is to this performance tester the most critical. Lastly, the hardest of the three to obtain (due to either cloud platform limitations or a lack of tooling) is errors. The following table provides an overview of USE plotted against the “Magic Three.”

Table 1.0 USE metrics plotted on CPU, Memory, and Storage device I/O.

  • Utilization – the average time a resource was busy servicing work.
  • Saturation – the degree of extra work that cannot be handled and is being queued.
  • Errors – the count of error events.

So, how do we apply the USE method? We can use the following flow:

Figure 1.0 USE process flow.

As you can see, it’s a pretty straightforward flow. It starts with identifying which resources are available for analysis; preferably, we do this before running the test so that the correct metrics are available. Then we select a resource to check. The next step is going through the USE metrics for that resource, since reviewing each type reveals any possible problems. Following this flow, you will systematically check resources in a detailed way that goes deeper than looking solely at the percent usage of a resource.
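As a minimal sketch of that flow on a Linux host (assuming the psutil package is installed and treating the load average as a stand-in for the run queue), a USE-style spot check could look like this:

```python
# A minimal USE-style spot check on Linux, assuming psutil is installed.
# Thresholds and the choice of proxies (load average for CPU saturation,
# swap-in activity for memory pressure) are illustrative assumptions.
import os
import psutil

def use_snapshot():
    load1, _, _ = os.getloadavg()
    vm = psutil.virtual_memory()
    swap = psutil.swap_memory()
    net = psutil.net_io_counters()
    return {
        "cpu": {
            "utilization_pct": psutil.cpu_percent(interval=1),
            "saturation_load_per_core": load1 / psutil.cpu_count(),
        },
        "memory": {
            "utilization_pct": vm.percent,
            "saturation_swap_in_bytes": swap.sin,  # swapping signals memory pressure
        },
        "network": {
            "errors": net.errin + net.errout + net.dropin + net.dropout,
        },
    }

if __name__ == "__main__":
    snapshot = use_snapshot()
    # Flag CPU saturation when runnable work exceeds the available cores.
    if snapshot["cpu"]["saturation_load_per_core"] > 1.0:
        print("CPU saturated:", snapshot["cpu"])
```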

To show the importance of looking beyond percent usage, we will focus on the CPU resource. We’ve all looked at the CPU usage graph to see whether usage is high. So, what does 80% CPU utilization mean? Is the CPU overloaded? To answer that, we should look into what the metric can actually mean. Looking only at a graph, we might interpret it as something like this:

Figure 2.0 CPU Usage when looking purely at a CPU utilization graph.

Watch out when examining the CPU, as something may be happening elsewhere. When the CPU appears to be servicing work, it may actually be waiting for other resources before a process can complete. In many tools, such a stalled process shows up as busy CPU even though it is waiting on additional resources; see Figure 3.0. Meanwhile, Figure 4.0 shows NMON on an overloaded machine: when checking the CPU tab, we see that 30% of the reported utilization is “waits on other resources.”

Figure 3.0 CPU Usage what it actually can mean.

Figure 4.0 CPU Usage Shown in NMON

We now know that CPU usage can be misleading when not monitored extensively. But we have only touched the “U” of USE, so let’s go deeper to learn more about the impact of saturation on CPU resources.

Saturation is interesting from a performance tester’s perspective: it is the degree of work that can’t be handled immediately and is being queued. Queueing typically results in delay upon delay (in other words, increased response time). Which metric measures CPU saturation? It’s the scheduler queue, shown in many tools as the “run queue”: the queue of runnable threads waiting for CPU time. This provides a quick checkpoint to determine potential CPU overload. But what does it mean if we have a run queue of 15? To answer that, let’s take an even deeper dive into the metrics by introducing BCC and eBPF.

eBPF stands for extended Berkeley Packet Filter. BPF can run sandboxed programs in the kernel, allowing low-overhead monitoring of many deep, low-level metrics and opening up access to metrics that were previously out of reach. Because BPF is a low-level programming facility, it can be hard to use directly. Luckily, there is BCC, described as “a toolkit for creating efficient kernel tracing and manipulation programs, including several useful tools and examples. It makes use of extended BPF (eBPF), a new feature first added to Linux 3.15. Much of what BCC uses requires Linux 4.1 and above.”

Figure 5.0 BCC tool collection plotted on system components

With a single BCC tool, we can check how much latency a particular run queue causes. For this, we can use runqlat, which determines how much scheduler (run queue) latency exists. The output, shown in Figure 6.0, displays the latency in ms and the number of measurements taken during the performance monitoring run. Now we can see whether a high run queue leads to extensive latency on the CPU resource.

Figure 6.0 Runqlat showing output overview.
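If you want to drive runqlat from a script rather than a shell, a small wrapper might look like the following; the tool path and flags are assumptions that depend on how BCC is packaged on your distribution, and the tool needs root privileges:

```python
# Hypothetical wrapper around the BCC runqlat tool. Assumes bcc-tools is
# installed at the path below and the script runs as root; verify the path
# and flags against your BCC version before relying on this.
import subprocess

def capture_runq_latency(interval_s=10, count=1):
    """Print one run-queue latency histogram in milliseconds (-m)."""
    subprocess.run(
        ["/usr/share/bcc/tools/runqlat", "-m", str(interval_s), str(count)],
        check=True,
    )

if __name__ == "__main__":
    capture_runq_latency()
```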

Putting it All Together

I hope I was thorough, yet brief enough, for you to follow along. By applying these tools and leveraging these new metrics in your analysis, you can detect and diagnose bottlenecks and performance issues efficiently. You also now know how CPU utilization can be misleading and why it should only be a starting point.

Okay, maybe it isn’t your “three-step dance.” Its more formal name, USE, works just as well.

Learn More about the Performance Advisory Council

Want to learn more about this event? See Twan’s presentation here.


It’s 2008, and Neotys is releasing NeoLoad 2.4.4, which introduces its first instance of Dynatrace support. Seems like yesterday, right? Now fast-forward eleven years (APM has since become a critical tool for every performance tester), and we’re at it again with our most enhanced Dynatrace integration yet.

So what’s happened during this time?

Last year during Perform EMEA in Barcelona, we had the honor of being presented the Solution Innovation Award from Dynatrace in recognition of the successful integration between our products. “Dynatrace includes the artificial intelligence needed to detect the performance issues, their origin. The natural synergy between NeoLoad & Dynatrace is a logical association as Dynatrace receives NeoLoad generated data during the test, enabling quick problem analysis,” explains Henrik Rexed, Partner Solution Evangelist at Neotys. With the integration, our goal is to combine the best of both offerings. Performance engineers are now able to automate their performance tests, and operations managers to troubleshoot more effectively. Users benefit from other existing Dynatrace integrations – Jira, for example. “When a defect is identified, the APM tool automatically generates a Jira ticket equipped with all pertinent performance details related to the issue. The result: time-savings, productivity, and efficiency,” adds Henrik.

How a secret meeting from the “Performance Avengers” led to a revamped integration

We wanted to go further, with an integration that consolidated the data available in Dynatrace and NeoLoad to provide real feedback on the quality of production releases. So, Henrik was invited by Dynatrace to be part of a secret league that would save the world from poor performance. Let’s call them the “Performance Avengers.” Go with us on this. Henrik and Andreas Grabner (DevOps Activist at Dynatrace) held a secret Avengers meeting in Linz, Austria, aimed at drawing up plans to thwart the evil attacks of performance defects.

After several days of whiteboard discussion, we had drawn up the perfect solution for performance engineers and SREs in a continuous deployment environment. To make our dreams come true, we selected the features that would be most helpful to our community and could be delivered as soon as possible.

Back in the South of France, Henrik started writing the code for this new integration, targeting Perform in Las Vegas (January 2019). There, he would showcase our new artillery to dear friends during the Dynatrace HOTDAYS. The story was simple: show the value of real continuous performance testing in a full cloud environment. We chose OpenShift, a microservices architecture, GitHub (source control), and Jenkins to deliver the solution.

This required automation. To make it possible, we had to remove any pre-existing configuration requirements with Dynatrace. If you remember, previous versions of the integration had several such requirements. One hit of Henrik’s Viking hammer on the configuration engine and, “BOOM,” the configuration requirement was gone. The result: NeoLoad configures Dynatrace for you.

We no longer have to worry about request attributes or applying tags to our environment. NeoLoad, the load testing platform, does this for us.

Wait a minute. What about tagging? Consider reading the Dynatrace article on this topic.

Another big issue addressed by the redesigned integration is automatic issue detection via the Dynatrace AI. From the start, the AI captured a few days of production traffic (with real users), helping to establish a real baseline. When Andreas explained to Henrik that a project could create rules related to its SLOs in the Dynatrace AI, another swing of the Viking hammer was all that was needed for testers to be able to feed the Dynatrace AI with their own SLAs and SLOs.

A third improvement in the integration brings increased understanding of how a modern architecture behaves under load. To do this, two crucial KPIs are included: the number of processes consumed by a service (containers, nodes, etc.) and the total amount of CPU, memory, and network used. Of note, if you plan to conduct continuous testing, these are great KPIs to trend: needing more containers between two builds can itself indicate a significant regression (at least in terms of cost).

The last thing the “Avengers” wanted to do was add a toolkit to enable continuous testing in the first place. (How many times have you automated a test, detected a regression, and analyzed it, only to discover that it’s not a regression but a simple deployment issue?)

The sanity check action scans the architecture and compares it between builds. If your deployment has fewer containers, or is consuming more CPU or memory, it fails your build.
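The following is an illustrative sketch of that idea rather than the actual NeoLoad/Dynatrace action; the field names and 10% tolerances are assumptions:

```python
# Illustrative sketch (not the actual NeoLoad/Dynatrace sanity check): fail a
# build when the deployed architecture drifts from the previous build's footprint.
def sanity_check(current, previous, cpu_tolerance=1.10, mem_tolerance=1.10):
    """current/previous: dicts like {'containers': 12, 'cpu_cores': 6.0, 'memory_gb': 24.0}."""
    if current["containers"] < previous["containers"]:
        return False, "fewer containers than the previous build"
    if current["cpu_cores"] > previous["cpu_cores"] * cpu_tolerance:
        return False, "CPU footprint grew beyond tolerance"
    if current["memory_gb"] > previous["memory_gb"] * mem_tolerance:
        return False, "memory footprint grew beyond tolerance"
    return True, "deployment matches the baseline"

ok, reason = sanity_check(
    {"containers": 12, "cpu_cores": 6.5, "memory_gb": 26.0},
    {"containers": 12, "cpu_cores": 6.0, "memory_gb": 24.0},
)
# In a CI pipeline, a failed check would mark the build as failed before any
# regression analysis runs against a misdeployed environment.
```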

What’s in store for the offering in 2019?

The Performance Avengers have teamed up again to develop a new integration for efficient continuous testing featuring Jenkins. Here’s a snapshot of some of the features:

  • Support for teams to utilize the integration without any additional configuration (request attributes auto-created)
  • Creation of the tagging rules in Dynatrace so that NeoLoad will collect the metrics of the architecture by looking at the dependencies of the services as well
  • Simple NeoLoad traffic naming within Dynatrace to enable PurePath service flow analysis (isolated to NeoLoad traffic)
  • Creation of specific application rules in the Dynatrace AI, so Dynatrace can detect problems based on the project’s thresholds
  • Monitoring action reporting the number of processes running and their CPU and memory usage for each service
  • Retrieval of custom metrics in NeoLoad
  • Sanity check functionality to validate accurate application deployment (against the baseline)

We recently hosted a webinar with Andreas Grabner focused on how to avoid extra effort by building more realistic component and end-to-end tests, and how to take advantage of the Dynatrace AI engine during each test.

Are you planning to attend Dynatrace Perform EMEA in Barcelona (May 21-23rd)? Neotys is a proud sponsor. We’d be delighted to discuss the integration and how the combined solution can improve your testing effort.
