Good news if you’re a developer! The 11th annual Kansas City Developer Conference is just around the corner, on July 18th and 19th at the Kansas City Convention Center in downtown Kansas City (with a pre-conference workshop day on July 17th). It’s the place to be for rich interactions with other developers, both new and experienced. Fun fact: This conference will also feature a Kids Conference following the KCDC. Here are 6 sessions we’re excited about this year at KCDC:
With the rise of CI/CD, teams are able to deliver new code and features to customers faster than ever before. The pressure to out-innovate the competition is high, and organizations need to move quickly without compromising the quality of the services their customers depend on. In this talk, OverOps’ own Oliver Nielsen will introduce the concept of Continuous Reliability (CR) and demonstrate how to implement it in your own workflows.
CI/CD is taking the world by storm. With the number of tools that have flooded the market over the last few years, implementing CI/CD workflows is easier than ever before. The work doesn’t stop there, though. As new tools are introduced to accelerate the integration and delivery of new code, it’s important to ensure that your team embraces the cultural shift necessary to support these new workflows. Join James Quick as he shares his experience leading this transition in his own organization and what he learned from it.
So what do you know about Modernizing Java? Would you like to know more? Dig in deeper with this intermediate-level session with Jayashree S Kumar, a Software Engineer at IBM’s India Software Labs. Learn by example how to enhance your Java applications to stay competitive as the Java platform continues to evolve.
What to expect:
An introduction to the fundamentals of TDD
An instructor-led demonstration of TDD in practice
A collaborative code kata to get a hands-on, practical introduction to TDD
Whether you realize it or not, engaging the right side of the brain is an important exercise for developers. Stretch the other side of your brain with a bit of creative expression in coding. Krista LaFentres, Developer at Andrews McMeel Universal, will walk through the options available to you for creative expression, such as making art with code. Plus, you’ll get hands-on experience and create your own piece from what you learn in this half-day workshop.
What can we learn from LEGO® building when it comes to software? When it comes to making feature changes in software and building castles from LEGO®, both the parallels and the complications become quite evident. Most notably: if we build our software like LEGO®, adding functionality means making changes to every single layer, or deconstructing large parts of what we’ve built. But there is a better way, as outlined in this session by Competence Coach Hannes Lowette.
See you there?
Come by our booth to say hi, see a quick demo of the product and pick up some cool swag. We hope you have a great time! Want to share your thoughts or insights on the upcoming sessions? Send us your comments or tweet at us @overopshq.
FYI – The maturity model presented in this post is based on the concept of Continuous Reliability, which you can read more about here.
Software reliability is a big deal, especially at the enterprise level, but too often companies are flying blind when it comes to the overall quality and reliability of their applications. It seems like every week, there’s a new report in the news calling out another massive software failure. Sometimes it’s just a glitch on social media causing usability issues, and other times it’s a serious issue in an aircraft system that leads to deadly crashes.
Clearly, not every software failure is fatal; engineers aren’t heart surgeons. However, a single error can impact more people than a doctor could ever treat in a lifetime. That’s why maintaining application reliability (basically, making sure nothing breaks) is a top priority for every IT organization. And if it isn’t, it should be.
In this post, we will discuss the concept of Continuous Reliability and use it to define the Continuous Reliability Maturity Model.
This model helps teams understand where they stand in terms of reliability and how they can improve. It can also help engineering leaders chart a course to reach their goals for reliable and efficient execution. But more on that later. Let’s dive in.
What is Continuous Reliability?
Continuous Reliability is the idea of balancing speed, complexity and quality by proactively and continuously working to ensure reliability throughout the software delivery lifecycle (SDLC). It is ultimately achieved by implementing data-driven quality gates and feedback loops that enable repeatable processes and reduce business risk.
Doing this requires strong capabilities in both data collection and data analysis: being able to access all relevant information about your application, and then being able to use that data to proactively surface patterns and prevent software failures.
Achieving Continuous Reliability means not only introducing more data and automation into your workflow, but also building a culture of accountability within your organization. This includes making reliability a priority beyond the confines of operations roles, and enforcing deeper collaboration and sharing of data across different teams in the SDLC.
The Continuous Reliability Maturity Model
The Continuous Reliability Maturity Model comprises four levels that align with common patterns of obstacles and pitfalls organizations encounter on their reliability journeys. Below we break down the characteristics and challenges that define each level and provide recommended next steps to help advance your progress.
As organizations progress in their reliability maturity, they increase their signal-to-noise ratio, automate more processes and improve team culture. With this, they are able to increase productivity and provide a better customer experience, improving the overall bottom line for the business.
Let’s take a closer look at each reliability level:
Level 1: Individual Heroics
Organizations at this level are just beginning their reliability journey. This stage is marked by the initial establishment of reliability practices – often leaning toward manual and reactive processes with loose structure. Teams at this stage generally rely on ad-hoc and inconsistent strategies to solve technical issues. Visibility is a major challenge, and most code quality problems are only addressed if a customer complains.
Characteristics: Ad-hoc processes for solving technical issues; early or experimental stages of prioritizing and formalizing reliability strategy; limited visibility into application errors and their root cause.
Primary Challenge: Manual and reactive processes and limited visibility into what’s happening within your applications and services, resulting in late identification of customer-impacting issues.
Recommended Next Steps:
Invest in best practices and a monitoring ecosystem that increase visibility into your system.
Begin establishing best practices for addressing technical incidents.
Clarify roles and responsibilities as they relate to ensuring application quality and reliable operations in production.
Level 2: Basic Structure
At this stage, teams have established a basic structure with some troubleshooting processes. Application visibility increases as huge amounts of data become accessible through expanded tooling, but the ability to separate the signal from the noise becomes a main challenge as teams seek to better understand which issues have the greatest impact on reliability.
Characteristics: Established processes for incident response and QA; some automation across the SDLC; marked reduction in the number of incidents reported by customers; increased visibility into your system through tooling and processes results in higher volumes of alerts.
Primary Challenge: Increased noise and inefficient prioritization results in alert fatigue.
Recommended Next Steps:
Introduce anomaly detection capabilities through machine learning.
Document and refine your organization’s alerting, escalation and issue resolution priorities.
Optimize on-call procedures and implement a culture of code accountability.
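Machine-learning anomaly detection can start from something far simpler than a full ML pipeline. As a minimal sketch (the error counts and threshold below are made up for illustration), a trailing-window z-score over per-minute error totals is enough to separate a genuine spike from normal variation:

```python
from statistics import mean, stdev

def detect_anomalies(error_counts, window=10, threshold=3.0):
    """Flag points that deviate sharply from the recent baseline.

    Returns the indices of values more than `threshold` standard
    deviations above the mean of the trailing `window` values.
    """
    anomalies = []
    for i in range(window, len(error_counts)):
        baseline = error_counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        sigma = sigma or 1e-9  # guard against a perfectly flat baseline
        if (error_counts[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# A steady error rate with one sudden spike at index 15:
counts = [5, 6, 5, 4, 6, 5, 5, 6, 4, 5, 5, 6, 5, 4, 6, 90, 5, 6]
print(detect_anomalies(counts))  # → [15]
```

A real system would feed this from a metrics store and tune the window and threshold per service, but the principle is the same: compare each new data point against a learned baseline instead of a static limit.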
Level 3: Advanced Structure
At this point, teams are better able to focus their efforts on issues that matter. They have anomaly detection capabilities that help to manage alert fatigue. But despite the seemingly endless amounts of data being collected, issues are still missing context and errors still make it to production. Technical debt remains a mystery.
Characteristics: Reduced alert fatigue due to applied intelligence and added context to existing data; established processes for routing issues to the right people at the right time; increased confidence in processes, tools and team structure; still experience critical production issues that catch you by surprise and are a struggle to resolve.
Primary Challenge: Broken feedback loop between production and pre-production due to data blind spots (unknown unknowns).
Recommended Next Steps:
Invest in new data sources and analysis capabilities to cover the unknown aspects of how your applications behave.
Incorporate learnings from production into the QA process for a more proactive approach to reliability.
Improve cross-team collaboration.
Level 4: Continuous Reliability
This is the most mature stage of reliability, but our work doesn’t end here. At this level, teams have access to nearly all of the relevant data they need to troubleshoot issues quickly and to monitor reliability based on collected metrics.
Quality gates are set up between the stages of development to automatically block the progression of unreliable code. Feedback loops are also streamlined to ensure that software quality is not only stable, but improving over time and easy to measure. The main challenge at this stage is consistent execution by team members based on the available data and analysis capabilities.
Characteristics: Established processes; ability to capture deep contextual data that fuels feedback loops between teams and stages of software development and delivery.
Primary Challenge: Maintaining consistent delivery of reliable software.
Recommended Next Steps:
Continue to optimize your reliability processes through detailed post-mortems and data-driven feedback loops between stages of your SDLC.
Apply learnings across the organization.
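To make the quality-gate idea concrete, here is a deliberately simplified sketch of a promotion check a pipeline might run between stages. The metric names and thresholds are hypothetical; a real gate would pull these numbers from your monitoring or error-tracking tooling:

```python
def passes_quality_gate(baseline, candidate,
                        max_new_errors=0, max_rate_increase=0.05):
    """Return True if the candidate build may be promoted to the next stage."""
    new_errors = candidate["unique_errors"] - baseline["unique_errors"]
    rate_increase = candidate["error_rate"] - baseline["error_rate"]
    return new_errors <= max_new_errors and rate_increase <= max_rate_increase

# Metrics for the current release vs. the build awaiting promotion:
baseline = {"unique_errors": 12, "error_rate": 0.010}
candidate = {"unique_errors": 14, "error_rate": 0.012}

print(passes_quality_gate(baseline, candidate))  # → False: two new unique errors
```

In CI, the gate’s boolean result would translate into a zero or non-zero exit code that allows or blocks the next stage of the pipeline.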
How OverOps Provides Diagnostic Data & Analysis to Help You Progress Towards Continuous Reliability
While APMs and log analysis tools take a top-down IT Ops approach to reliability, focusing on trace-level diagnostics (symptoms), OverOps captures bottom-up code-level diagnostics (causes) at a lower level than was previously thought possible.
By analyzing all code at runtime in any environment from test to production, OverOps enables teams to identify and prioritize any new errors, increasing errors, and slowdowns using unique code fingerprints.
Once an anomaly is detected, the exact state of the code and the environment – source code, variables, DEBUG-level logs, and full OS/container state – is delivered to the right developer, before customers are impacted.
Looking for a place to connect and swap ideas with more than a thousand other engineers across both dev and ops? DashCon 2019 has exactly what you need. The conference will be held at Chelsea Piers, NYC from July 16-17, 2019, located right on the water at Pier Sixty.
Get expert answers to your questions, explore the latest offerings from cutting-edge technology and discover how peers at other companies have handled challenges similar to yours. Here are 5 reasons you don’t want to miss DashCon this year:
Senior Software Engineer Zach McCormick learned the hard way that nothing works out of the box at scale. He’ll share the lessons he learned, discussing best practices Braze has developed for monitoring and measuring the effects of deploying changes to high-frequency, highly available distributed systems. Learn how changes of any size make an impact, from small code tweaks to entirely new features.
P.S. If you’re interested in how to approach software change in your organization, this post examines how to drive innovation in software development by choosing between incremental and fundamental change.
Ensuring service availability at the world’s largest broadcasting company is no small feat. And that’s what the product teams focus most of their attention on at the BBC. Thousands of services, on-demand content and a highly trafficked website are status quo at this broadcasting giant. And live events add that much more demand, requiring the ability to scale. Learn from Ross Wilson, Senior Software Engineer, how the BBC scales (both technically and within its operations teams), with a behind-the-scenes glimpse into their successful summer featuring both the World Cup and Wimbledon.
The Datadog monitoring service lets you see inside any stack, any app, at any scale, anywhere. Datadog Evangelist Matt Williams will speak on Datadog best practices, and workshop participants will learn the ins and outs of Datadog on a journey from zero to hero:
How to build insightful dashboards and visualizations
Tips for effective alerting
Container monitoring with Autodiscovery
Plus, this workshop will include hands-on experience that you can implement in your own projects.
Get faster at finding and resolving operational incidents and take care of them before users even notice service impact. How? It’s a matter of knowing your application behavior and really grasping the reality of the user experience through a calculated combination of metrics, tracing and logs.
This workshop will use hands-on labs to take you from no knowledge of Datadog to expert. Learn using scenarios that will hone your troubleshooting and monitoring techniques. Speaker Pierre Guceski of Datadog will dive into best practices for log collection and management with Datadog. Act fast! You must reserve a seat in advance to participate.
Join Tammy Butow of Gremlin and Jason Yee of Datadog to uncover what you need to begin implementing chaos engineering in your organization. Or, glean new methods for building onto the ways you’re already using chaos engineering at work. It’s all about running intentional and well-planned experiments to discover how systems behave in the face of failure.
Plus, learn how other companies are using chaos engineering to create reliable distributed systems. All the tools, practices and metrics you need for effective chaos engineering will be covered here. Note: You must reserve your seat in advance to participate in this workshop.
See you there?
Come by our booth to say hi, see a quick demo of the product and pick up some cool swag. We hope you have a great time! Want to share your thoughts or insights on the upcoming sessions? Send us your comments or tweet at us @overopshq.
Development and IT Ops teams commonly find themselves in a game of tug-of-war between two key objectives: driving innovation and maintaining reliable (i.e. stable) software.
To drive innovation, we’re seeing the emergence of CI/CD practices and the rise of automation. To maintain reliable software, DevOps practices and Site Reliability Engineering are being adopted to ensure the stability of software in fast-paced dev cycles. Still, most organizations are finding that accelerating the delivery of software is easier than ensuring that it’s reliable.
The average release frequency at the enterprise level is about once a month (with many releasing more frequently), but when it comes to maintaining reliable software, our capabilities haven’t been improving quickly enough. One SRE Director at a Fortune 500 financial services company even told us they can’t deploy a new release without introducing at least one major issue.
These issues go beyond just the immediate loss of service. They hurt productivity, derail future product features, and jeopardize the customer experience – all while increasing infrastructure expenses. All this raises the question: how can we measure, understand and improve our capabilities when it comes to software reliability?
The Agility-Reliability Paradox
Enterprise development and IT Ops teams have various goals that they work towards, most of which fall under two main categories: driving innovation and maintaining application reliability (e.g. software quality). Too often, the pressure to out-innovate competitors comes at the expense of quality and reliability, where reliability is the dependability of a system to function under the given conditions. As a result, most IT organizations experience significant code-level failures on a regular basis.
Derek D’Alessandro, a catalyst for DevOps adoption and technological change in the industry, has some unique insights regarding the effects of accelerated introduction of changes into our systems. In a post examining the different modes of change that teams can adopt, he says:
It is easy to see the benefit of individual changes. The trick to incremental change is measuring and planning out lots of small changes to ensure they all work together and are all headed in appropriate directions. This is an area in which we often create divergence in our systems, technologies or processes. The divergence sneaks up on us and we only discover it when things become unstable and start to topple, or cross integrations are too complex because it’s not built with a solid architecture.
Although this was taken from a post examining necessary steps for successfully adopting incremental and fundamental changes, the overall point that Derek makes is valid to this greater discussion of the Agility-Reliability Paradox. Namely, shipping new features and driving innovation is great, but not when it is detrimental to the reliability of our systems.
After many years of implementing new workflows to support faster releases, enterprise organizations are beginning to realize that new strategies are needed to ensure not only accelerated, but consistent and reliable delivery of code. The target is not just to ship code faster anymore, but also to close new and widening reliability gaps.
The Pillars of Continuous Reliability
After speaking with hundreds of engineering organizations in various stages of their reliability journeys, OverOps developed the concept of Continuous Reliability to help identify and further define the need for organizations to focus on reliability and stability in their systems.
Continuous Reliability (noun): The notion of balancing speed, complexity and quality by taking a continuous, proactive approach to reliability across the software delivery life cycle through data-driven quality gates and feedback loops.
Achieving Continuous Reliability means not only introducing more automation and data into your workflow, but also building a culture of accountability within your organization. This includes making reliability a priority beyond the confines of operations roles, and enforcing deeper collaboration and sharing of data across different teams in the software delivery lifecycle.
Continuous Reliability is dependent on two core capabilities: data collection and data analysis.
1. Data Collection
Reliability initiatives most often succeed or fail based on the quality of the data that engineering teams rely on. This quality is based on the methods used to capture the data as well as the depth of context they provide.
Many organizations struggle not only to identify every technical failure that occurs, but also to access enough data about known failures to investigate and resolve them. This includes insight into which deployment or microservice an error came from, historical context around when an issue was first or last seen, correlation with system metrics, insight into the source code and state of related variables and more. It is common for engineering teams to rely on manual and shallow data sources like log files when troubleshooting errors, which often demand an unrealistic amount of foresight to actually be useful.
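As a rough sketch of what deeper context looks like in practice (every field name here is invented for illustration), an error record becomes far more actionable when the service, deployment and local variable state travel with it:

```python
import traceback

def enrich_error(exc, context):
    """Bundle an exception with deployment context and variable state
    so it can be investigated long after the moment has passed."""
    return {
        "error_type": type(exc).__name__,
        "message": str(exc),
        "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
        **context,
    }

try:
    charge = {"amount": 1999, "currency": "USD"}
    raise TimeoutError("payment gateway did not respond")
except TimeoutError as e:
    record = enrich_error(e, {
        "service": "payments",           # which microservice failed
        "deployment": "v4.7.1",          # which release it came from
        "variables": {"charge": charge}, # local state at the time of failure
    })

print(record["error_type"], record["service"], record["deployment"])
```

Compare this to a bare log line: the same failure now answers "which service, which release, and what was in memory" without anyone having had the foresight to log those details up front.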
2. Data Analysis
Even if you can capture all the data in the world, it’s only as meaningful as your ability to understand and leverage it in a timely manner. This is fairly obvious when we look again at log files. Sorting through millions of log statements by hand is not efficient or effective, which is why we use log analyzers to sort and analyze our logs for us.
There are many types of analysis we can perform to ensure reliability, though, going far beyond the log files. Static and dynamic analysis can be run on our code, machine learning and artificial intelligence can be applied to our system metrics… you get the picture.
Organizations with strong analysis (and data collection) capabilities are able to surface patterns and automatically prioritize issues based on code-level data and real-time analysis. This analysis serves as a foundation for implementing feedback loops and defining custom quality gates that prevent poor-quality releases from ever making it to production.
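As a toy illustration of what log analyzers do under the hood (the log lines below are made up), the core trick is to collapse the variable parts of each message into a fingerprint, then count occurrences so the noisiest patterns surface first:

```python
import re
from collections import Counter

def fingerprint(line):
    """Collapse variable fragments (ids, counts, durations) so repeated
    occurrences of the same underlying error group together."""
    return re.sub(r"\d+", "<N>", line)

logs = [
    "ERROR order 1234 failed: timeout after 30s",
    "ERROR order 9876 failed: timeout after 30s",
    "ERROR user 42 not found",
    "ERROR order 555 failed: timeout after 30s",
]

counts = Counter(fingerprint(line) for line in logs)
for pattern, n in counts.most_common():
    print(n, pattern)
```

Production analyzers use far more sophisticated fingerprints (stack traces, code locations, release metadata), but the goal is the same: turn millions of raw lines into a short, prioritized list of distinct issues.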
First Steps to Continuous Reliability
When it comes to application reliability, it can be hard to know where to start. That’s why we’re putting together a Continuous Reliability Maturity Model.
Using this model, engineering leaders will be able to chart a course to reach their goals for reliable and efficient execution. As organizations progress in their reliability maturity, they increase their signal-to-noise ratio, automate more processes and improve team culture. Finally, they are able to reap the benefits of productivity gains and a better customer experience, improving the overall bottom line for the business.
Check back next week for our post covering the model itself, where you can further understand your capabilities when it comes to application reliability.
Meanwhile, you can ask yourself the following questions to get a sense for where your team is on the journey to Continuous Reliability:
Visibility: Does my system have visibility gaps? Am I able to capture 100% of events? What types of issues am I not able to detect?
Accountability: What role does each member of my team play in achieving Continuous Reliability? How well do we collaborate across the SDLC?
Efficiency: How much time do our developers spend debugging? How much unaddressed technical debt do we have?
Prioritization: Do we have a good alerting system – are we receiving too many alerts, too few, or the right amount? What criteria do we use to identify which issues need our attention most?
Measurement: What metrics are we using to assess reliability? Does my team have clear metrics that define reliability for our application and, if so, do those metrics take into account the customer experience?
Business Impact: Do we have SLAs? If no, are we ready to define them? If yes, are we meeting them? How frequently are we experiencing customer-impacting issues? How much insight do we have into the financial impact of errors?
If you have any questions or comments on Continuous Reliability, please comment below or reach out to us at email@example.com!
When it comes to managing containers and the cluster infrastructure they run on, what’s the right tool for you?
Containers have rapidly increased in popularity by making it easy to develop, promote and deploy code consistently across different environments. Containers are an abstraction at the application layer, wrapping up your code with necessary libraries, dependencies, and environment settings into a single executable package.
Containers are intended to simplify the deployment of code, but managing thousands of them is no simple task. When it comes to creating highly available deployments, scaling up and down according to load, checking container health and replacing unhealthy containers with new ones, exposing ports and load balancing – another tool is needed. This is where container orchestration comes in.

Containers and microservices go hand in hand, significantly increasing the volume of individual services running in a typical environment compared to the number of monoliths running in a traditional environment. With this added complexity, container orchestration is a must for any realistic deployment at scale.
Aside from orchestration issues, another issue remains to be solved – where and how can containers be run? Additional tools are needed to run a cluster and manage cluster infrastructure. Fortunately, we have a few choices to fill this need.
Docker has become the de facto standard for creating containers. For orchestration and cluster management, ECS, Docker Swarm and Kubernetes are three popular choices, each with their own pros and cons.
1. AWS Elastic Container Service (ECS)
One solution is to offload the work of cluster management to AWS through the use of Amazon’s Elastic Container Service (ECS). ECS is a good solution for organizations who are already familiar with Amazon Web Services. A cluster can be configured and deployed with just a few clicks, backed by EC2 instances you manage or by Fargate, a fully managed cluster service.
Pros: Terminology and underlying compute resources will be familiar to existing AWS users. Fast and easy to get started, easily scaled up and down to meet demand. Integrates well with other AWS services. One of the simplest ways to deploy highly available containers at scale for production workloads.
Cons: Proprietary solution. Vendor lock-in: containers are easily moved to other platforms, but configuration is specific to ECS. No access to cluster nodes in Fargate makes troubleshooting difficult. Not customizable and doesn’t work well for non-standard deployments.
Bottom Line: Fast and easy to use, especially for existing AWS users. Great option for small teams who don’t want to maintain their own cluster. But vendor lock-in and the inability to customize or extend the solution may be an issue for larger enterprises.
2. Docker Swarm
For those who are just getting started with Docker, Swarm mode is a quick, easy solution to many of the problems introduced by containers. Swarm extends the standard Docker command line utility with additional commands for managing clusters and nodes, for scaling services and for rolling updates. Service discovery, load balancing and more are all handled by the platform.
Pros: Great starting point for those who are new to Docker or for those who have used Docker Compose previously. No additional software to install – Swarm mode is built into Docker. Simple and straightforward, great for smaller organizations or those who are just getting started with containers.
Cons: A relative newcomer, Swarm lacks the advanced features and functionality of Kubernetes, such as built-in logging and monitoring tools. Likewise, overall adoption lags behind Kubernetes and proprietary offerings like ECS.
Bottom Line: Swarm is a good choice when starting out: it’s quick and easy to use and built into Docker, requiring no additional software. You may find yourself quickly outgrowing its capabilities, though.
3. Kubernetes

For advanced users, Kubernetes offers the most robust toolset for managing both clusters and the workloads run on them. One of the most popular open source projects on GitHub and backed by Google, Microsoft and others, Kubernetes is the most popular solution for deploying containers in production. The platform is well-documented and extensible, allowing organizations to customize it to fit their needs. Although it is fairly complex to set up, many managed solutions are available including EKS from AWS, GKE from GCP, AKS from Azure, PKS from Pivotal and now even Docker offers their own hosted Kubernetes Service.
Pros: Most popular and widely adopted tool in the space for large enterprise deployments. Backed by a large open source community and big tech companies. Flexible and extensible to work in any environment.
Cons: Complex to learn, difficult to set up, configure and maintain. Lacks compatibility with Docker Swarm and Compose CLI and manifests.
Bottom Line: For true enterprise-level cluster and container management, nothing beats Kubernetes. Although complex, ultimately that complexity translates into additional features that prove extremely valuable as your containerized workload begins to scale. As cloud vendors race to simplify things with managed k8s offerings, it will only get easier to deploy and maintain a cluster in Kubernetes.
Whether you use ECS, Docker Swarm or Kubernetes, orchestration and deployment is just the beginning in terms of challenges associated with containerized applications. With so many moving parts, it can be difficult to understand when something goes wrong, let alone where and why it went wrong.
Traditional monitoring tools are far from perfect even for traditional monolithic architectures, but when it comes to containerized applications their coverage gaps are much harder to overcome. The main challenge in monitoring containerized applications is understanding the flow of a transaction as it passes through multiple containers in order to get to the real root cause of an issue.
Logs have always been dependent on developer foresight and are notoriously shallow when it comes to troubleshooting application issues. With microservices, logs are written and stored across multiple services, making it even harder to follow the trail of breadcrumbs. APM tools, likewise, provide significant insight into resource consumption and transaction flow through the system, but can’t reveal the individual line of code where an error occurred or the state of variables at the time of the error.
OverOps is able to provide deep, code-level insights into your containerized applications, including the full variable state at the time of an error. Our highly scalable, microservices-friendly architecture is easily deployed in your ECS, Swarm or Kubernetes cluster.
For all the problems they solve, containers introduce new challenges that must be addressed in order for them to be used for real production deployments. As organizations continue to adopt containers, the need for tooling becomes more important than ever. Whether you’re offloading work to AWS, keeping it simple with Docker Swarm or going all-in with Kubernetes, code-level monitoring is critical to quickly identify and resolve issues.
QCon NYC is an international software development conference for senior software engineers and architects. It’s the conference professional developers have been attending for the past 8 years to find out what the world’s most innovative software shops are using.
This year’s conference will be held at the New York Marriott Marquis in Times Square, right in the energetic epicenter of NYC. Get in on the discussion and learn the latest trends and tools to use on your projects.
Choose to attend the conference, workshops or, even better – stay all week and attend both. The main conference is from June 24-26, and workshops will be held June 27-28. Here are 5 QCon NYC presentations and workshops to attend if you want to up your game (one for each day of the week):
In 2018, poor quality software cost U.S. companies an estimated $2.84 trillion, according to a recent report from the Consortium for IT Software Quality (CISQ). Despite developers spending roughly 60% of their time finding and fixing pesky application issues, errors still result in major software quality expenses. OverOps’ very own Eric Mizell, VP Solution Engineering, will break down the many costs – both obvious and hidden – that result from error-ridden applications, and provide you with a formula to help convince your manager to make technical debt a priority.
Monday 4:10pm – 5:00pm | Track: Human Systems: Hacking the Org
Do your meetings typically disintegrate into a stifled, disorganized gathering of disengaged team members? The folks who have the best ideas on your team might keep quiet for a variety of reasons. Learn to implement exciting new management techniques that include everyone on your team and cultivate shared ownership. Greg Myers will share Capital One’s adoption of Liberating Structures to help transform meetings into dynamic, creative experiences.
Could you be the problem? Your technical skills got you this far, but you might need to cultivate a few right-brain skills to repair missed connections. If you find yourself constantly butting heads and lost as to why your team members walk away upset, you might be lacking empathy. Learn to change this behavior, and you’ll unlock a new ability to work better with the people around you. Speaker Paul Tevis of Vigemus started his career as a software engineer and now coaches tech leaders who want to work as effectively with people as they do with technology.
Wednesday 4:10pm – 5:00pm | Track: Building High Performing Teams
If your company has been mulling over the risks and benefits of hiring remote workers, this talk is for you. Learn from companies that have successfully implemented remote projects, including eBay, Google, Stitch Fix and WeWork. Get the tools to start hiring remote workers from the ground up: learn to build teams, outline scope, manage their growth and communicate with empathy.
Have you ever frozen up in a presentation? You may be scratching your head as to why this still happens at your experience level. But the secret is, it’s not a matter of trying harder. Rather, it’s about cultivating awareness of what your body is doing and calming it down. Learn to break negative communication habits using kinesthetic learning tools in this workshop. Find out how to trick your body and voice into projecting authority, even when you don’t feel very confident.
Interested in getting started with Chaos Engineering? It’s what makes the greats great – Netflix, LinkedIn, Capital One and others have developed resilience through the practice of Chaos Engineering. Speaking at this workshop will be Casey Rosenthal, who managed the Chaos Team at Netflix, runs Chaos Community Day and – well, wrote the book on Chaos Engineering. Bring a laptop and get ready to dig into a series of hands-on exercises, active discussion and group collaboration. This workshop will cover:
What Chaos Engineering is and what it is not
What makes a good Chaos Engineering experiment
Tools to run your own experiments
How to continue research
See you there?
Come by our booth (it’s #26) to say hi, see a quick demo of the product and pick up some cool swag. We hope you have a great time at the conference, and I’m personally very excited for this one too. Want to share your thoughts or insights on the upcoming sessions? Send us your comments or tweet at us @overopshq.
Editor’s Note: This post was originally published on May 5, 2016. It has since been updated to reflect advancements in the industry.
“The microservices trend is becoming impossible to ignore,” I wrote in 2016. It’s still true, although it’s certainly grown to more than just a passing fad.
Back then, many would have argued this was just another unbearable buzzword, but today many organizations are reaping the very real benefits of breaking down old monolithic applications, as well as seeing the very real challenges microservices can introduce.
For teams dealing with loads of technical debt, microservices offer a path to the promised land. They promise to bring greater flexibility and easier scalability. Smaller code bases are easier to understand, and with clearly separated services the overall architecture is much “cleaner”.
Microservices bring with them new and exciting possibilities (the cake is NOT a lie), but they’re still not without challenges. Anyone who tells you otherwise is sorely mistaken (or, more likely, trying to sell you something). Higher frequency releases and increased collaboration between dev and ops are exciting, but it’s important to stay diligent.
Microservices may be considered a revolutionary way to build applications, but this new approach does not require us to completely start from scratch. Rather than asking what specialized framework you need to build a new microservices architecture, let’s ask how we can use current frameworks to support the same goal.
But first… A short recap of what microservices are and where they came from:
“The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery.” – Martin Fowler and James Lewis
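Stripped of all framework machinery, that definition can be sketched with nothing but the JDK’s built-in HTTP server. This is just an illustration, not any particular framework’s approach; the service name, route, payload and port are all made up:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// A toy "orders" service: one small service, running in its own process,
// exposing a lightweight HTTP resource API.
public class OrdersService {
    static String ordersJson() {
        return "[{\"id\":1,\"status\":\"shipped\"}]";
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/orders", exchange -> {
            byte[] body = ordersJson().getBytes();
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start(); // serves GET http://localhost:8080/orders
    }
}
```

A real service would also need discovery, configuration, health checks and deployment automation, which is exactly the gap the frameworks discussed in this post try to fill.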
Back then, microservices and the concept of containerized applications were so new that there was hardly any specialized tooling or framework available to support building, deploying and running those kinds of applications. Rather, the focus was on adapting existing tools for use with this new architectural style.
In the past half-decade, the industry has exploded with technology built especially to support microservices. That doesn’t mean it’s the best suited for each individual’s needs, though. In fact, unlike monoliths, which are generally developed with the tech stack in mind, each service in a microservices architecture can be built using a different framework based on its own functionality.
This post is not about the pros and cons of microservices; instead, it looks at the underlying technology best suited to support them. If you’re looking to dig into some of the common pitfalls of microservices (and how to overcome them, of course), check out this post that covers the main challenges associated with microservices. Here, we’ll go over some of the most popular frameworks for building microservices – both traditional and container-specialized.
1. Jakarta EE / Java EE
The classic Java EE, now Jakarta EE (JEE), approach for building applications is geared towards monoliths. Traditionally, an enterprise application built with Java EE would be packaged into a single EAR (Enterprise Archive) deployment unit which includes WAR (Web Archive) modules and JARs (Java Archive) files.
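For illustration, a traditional monolithic JEE deployment unit might be laid out like this (all module names here are hypothetical):

```
shop-app.ear
├── META-INF/application.xml   (deployment descriptor)
├── shop-web.war               (web module)
├── orders-ejb.jar             (EJB module)
└── lib/shop-common.jar        (shared library)
```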
Although there aren’t any technological restrictions ruling out the use of JEE for microservices architectures, there is significant overhead. Each service would need to be packaged as a standalone unit, meaning it would have to be deployed within its own individual JEE server. That could mean deploying dozens or even hundreds of application servers to support a typical enterprise application.
Luckily, the community noticed early on that standard JEE didn’t address the new build challenges that microservices introduced. Since 2016, many additional open source projects have been launched to support microservices built in JEE.
Eclipse MicroProfile is a continually growing set of APIs based on JEE technologies. It’s an open source community specification for building Enterprise Java microservices, backed by some of the biggest names in the industry, including Oracle, Red Hat and IBM.
Bottom line: There’s no reason you can’t use Java EE for microservices, but it doesn’t address the operational aspects of running multiple individualized services. For those of you who want to move an existing monolithic JEE app to microservices, there are plenty of “add-on” tools out there based on JEE technology to support your needs.
2. Spring (Spring Boot and Spring Cloud)
Spring is one of the most popular frameworks for building Java applications and, like with Java/Jakarta EE, it can be used to build microservices as well. As they put it, “[microservices do] at the process level what Spring has always done at the component level.”
Still, it’s not the most straightforward process to get an application with a microservices architecture up and running on the Spring framework… You’ll need to use Spring Cloud (which heavily leverages Spring Boot), several Netflix OSS projects and, in the end, some Spring “configuration magic”.
For a deep dive on how to build microservices with Spring, check out this post straight from the source.
Bottom line: Spring is well positioned for the development of microservices, backed by an ecosystem of external open source projects that address the operations angle. That doesn’t mean it will be easy, though.
3. Lagom (Lightbend)
Lightbend provides us with another option. Continuing with the same theme, Lagom wraps around the Lightbend stack with Play and Akka under the hood to provide an easier way to build microservices. Their focus is not only to provide an easy solution for those moving towards microservices, but to ensure that those microservices are easily scalable and reactive.
“Most microservices frameworks out there focus on making it easy to build individual microservices – which is the easy part. Lagom extends that to systems of microservices, large systems – which is the hard part, since here we are faced with the complexity of distributed systems.”
Bottom line: Lagom takes Lightbend’s capabilities and leverages them in one framework, specially designed for building reactive microservices that scale effectively across large deployments. Their focus is not only on the individual microservices, but on the system as a whole.
4. Dropwizard
Not unlike the other frameworks we’ve looked at in this post, Dropwizard is a Java framework for developing ops-friendly, high-performance, RESTful web services. It’s an opinionated collection of Java libraries that makes building production-ready Java applications much easier.
Dropwizard Modules allow hooking up additional projects that don’t come with Dropwizard’s core, and there are also modules developed by the community to hook up projects like Netflix Eureka, similar to Spring Cloud.
Bottom line: Since Dropwizard is a community project without a major company behind it (the way Spring has Pivotal, Java EE has Oracle and Lagom has Lightbend), its development might be slower. Still, there’s a strong community behind it, and it’s a go-to framework for large companies as well as smaller projects.
5. Vert.x, Spotify Apollo, Kubeless and other “Microservices-specific” Frameworks
Apart from the 4 big players we’ve mentioned here, there’s a plethora of other projects that are worth mentioning and can also be used for writing microservices:
Vert.x, also under the Eclipse Foundation, is a toolkit for building reactive applications on the JVM. Some might argue it deserves a spot among the big 4.
Spotify Apollo is a set of Java libraries that is used at Spotify when writing Java microservices. Apollo includes features such as an HTTP server and a URI routing system, making it trivial to implement RESTful services.
Kubeless is a Kubernetes-native serverless framework. It’s designed specifically to be deployed on a Kubernetes cluster so users are able to use native Kubernetes API servers and gateways.
Bottom line: The Java microservices playing field is quite large, and it’s worth checking out the smaller players just as much as the industry giants.
No matter which framework or platform you’re using, building microservices isn’t tightly coupled to any of them. It’s a mindset and an architectural approach, and the best course of action (as with most things, I suppose) is to choose whichever option is best suited to your application’s unique requirements.
With that said, successfully implementing a microservices architecture doesn’t stop at the application itself. Much of the cost comes from the so-called DevOps processes – monitoring, CI/CD, logging, server provisioning and more – needed to provide continued support to the application in production. So, go ahead and enjoy your cake, but don’t forget to stay diligent as you reap your rewards.
Did we miss out on your favorite framework? Have any interesting stories to share about implementing microservices through any of the options we’ve mentioned here? Please share your insights in the comments section below or get in touch at firstname.lastname@example.org or @overopshq!
Our favorite VP of Solution Engineering, Eric Mizell, gave a talk at DevNexus this year, and we think he rocked it, so naturally we have to share it with all of you. (Skip to the bottom to watch right away.)
In this session, Eric covers the concept of Continuous Reliability, from what it is and what makes it hard to achieve, to actionable steps you can take to introduce it to your own workflow.
You’re probably familiar with the following Venn diagram, or at least one that looks very similar.
It’s commonly used to present the idea that teams are generally forced to choose two and sacrifice the third. Eric defines Continuous Reliability as that small triangle in the middle, the seemingly impossible-to-achieve balance between speed, quality AND complexity.
There are many different challenges that we need to overcome to reach this level. That’s why most of us are compromising and going for a “best 2 out of 3” approach. Some of the challenges that Eric lays out include:
Dev vs. Ops Finger-Pointing
Lack of Visibility/Observability
Not to worry, though – he also lays out the path to Continuous Reliability. Without delving too deep into each individual step, here are the points that he brings up as the most important areas to address when Continuous Reliability is your goal:
Automation, automation, automation (reduce human intervention as much as possible)
Know your unknowns (identify what you don’t know)
Identify errors sooner (shift left, test early and often)
Capture more context (logs don’t give full context)
Create more/better code coverage (new or better tooling)
Better data usage (metrics hub to leverage all available data)
Culture of Accountability (accountability for everyone on the team)
Watch the video to get the full scoop on achieving the impossible and adding Continuous Reliability to your workflow!
Continuous Reliability - an Agile Process to Deliver Higher Quality Applications w/ Eric Mizell - YouTube
For more information on how OverOps can help you achieve Continuous Reliability – visit our website!
The following is a guest post from Herb Krasner, an Advisory Board Member for the Consortium for IT Software Quality (CISQ) and industry consultant for 5 decades.
Demands of the competitive global economy have placed a strong emphasis on quality across the IT industry, and that shows no signs of going away. Meeting the customer’s expectations with a high degree of conformance no longer comes at a premium – it is simply expected.
In a previous post, we looked at the magnitude and impact of the soaring cost of poor software quality in the US and where those hidden costs are typically found. We now turn our attention to what you, as a leader in your organization, can do about it. Calculating the cost of software quality is an important first step in identifying areas of opportunity to add value from IT while reducing costs, accelerating deliveries and remaining efficient/competitive.
Basically, the costs of software quality (COSQ) are those costs incurred through both meeting and not meeting the customer’s quality expectations. In other words, there are costs associated with defects, but producing a defect-free product or service has a cost as well. Calculating these costs serves the purpose of identifying just how much the organization spends to meet the customer’s expectations, and how much it spends (or loses) when it does not.
Knowing these values allows management and team members across the company to take action in ensuring high quality at a lower cost. While analyzing the COSQ at an organization may lead to the revelation of uncomfortable truths about the state of quality management at the company, the process is important for eliminating waste associated with poor quality. This often requires a mindset and culture shift from viewing software quality defects as individual failures to seeing them as opportunities to improve as a collective team.
In this post, we focus on the various costs of software quality and how they can be measured. In the future, we will examine more closely how achieving disciplined, mature software development affects a software asset’s total cost of ownership.
As the figure above highlights, investing in software engineering discipline and in the Cost of Good Software Quality (CGSQ) will dramatically reduce the Cost of Poor Software Quality (CPSQ).
The American Society for Quality (ASQ) uses the following formula to calculate the Cost of Quality (COQ):
Cost of Quality (COQ) = Cost of Poor Quality (COPQ) + Cost of Good Quality (COGQ)
We use that same formula for the Cost of Software Quality.
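In code terms, the formula amounts to simple addition over the cost categories described below. Here’s a minimal sketch; the figures are hypothetical annual costs in dollars, not benchmarks:

```java
// A sketch of the ASQ Cost of Quality formula applied to software:
// COSQ = CPSQ + CGSQ, each summed from its own cost categories.
public class CostOfSoftwareQuality {
    // CGSQ: prevention + appraisal + management control costs
    static double cgsq(double prevention, double appraisal, double mgmtControl) {
        return prevention + appraisal + mgmtControl;
    }

    // CPSQ: internal failure + external failure + technical debt
    //       + management failure costs
    static double cpsq(double internalFailure, double externalFailure,
                       double technicalDebt, double mgmtFailure) {
        return internalFailure + externalFailure + technicalDebt + mgmtFailure;
    }

    static double cosq(double cpsq, double cgsq) {
        return cpsq + cgsq;
    }

    public static void main(String[] args) {
        double good = cgsq(120_000, 80_000, 50_000);             // 250,000
        double poor = cpsq(300_000, 150_000, 100_000, 50_000);   // 600,000
        System.out.printf("COSQ = $%,.0f%n", cosq(poor, good));  // COSQ = $850,000
    }
}
```

The point of splitting the inputs this way is that each category can be baselined and trended independently, which is what makes the improvement strategies later in this post measurable.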
Below is a summary of how to properly identify and track both CPSQ and CGSQ.
Cost of Good Software Quality
A discussion of what is meant by good quality software can be found here. The cost of good software quality is as variable as the organizations I have encountered. Some groups invest a lot in proactive quality management and planning, while others make do with patchwork systems and reactive programs aimed at solving problems after they occur.
CGSQ is divided into different categories. These are the costs associated with providing good-quality work products, systems or services.
There are three categories: prevention costs (investments made to prevent or avoid quality problems), appraisal costs (costs incurred to determine the degree of conformance to requirements and quality standards) and management control costs (costs of management activities that prevent or reduce the likelihood of failures: contract reviews, planning, goal establishment, and progress review and control of each software project).
Below are some examples of typical costs within each category:
Error Proofing (Defect Prevention Programs)
Setting up effective test environments
Supplier and included component (e.g. OSS) assessments
Design and Code Reviews
Management Control Costs
Costs of carrying out contract reviews
Establishing quality goals, objectives, gating/release criteria and quality standards
Costs of preparing project plans, including quality management plans
Costs of periodic updating of project and quality plans
Costs of performing regular progress review and control
Costs of performing regular progress control of external participants’ contributions to projects
Cost of Poor Software Quality
CPSQ, like its counterpart CGSQ, is also divided into different categories. These are the costs associated with providing poor-quality work products, systems or services.
There are four categories: internal failure costs (e.g. costs associated with defects found before the customer receives the product or service), external failure costs (e.g. costs associated with defects found after the customer receives the product or service), technical debt (the cost of fixing the structural quality problems in an application that, if left unfixed, put the business at serious future risk) and management failure costs (costs incurred by executives and managers dealing with the ramifications of poor-quality software).
Below are some examples of typical costs within each category:
Internal Failure Costs
Crisis Management and Overtime
External Failure Costs
Patches, Repairs & Servicing
Company Reputation Damage
Technical Debt Costs
Increased Complexity Due to Shortcuts
Debt Service and Interest
Management Failure Costs
Unplanned costs for professional and other resources, resulting from underestimation of the resources in the planning stage.
Damages paid to customers as compensation for late project completion.
Damages to other projects planned to be performed by the same teams involved in the delayed projects. The domino effect may induce considerable hidden failure costs.
Excessive management crisis mode behaviors, like lots of meetings to solve urgent problems.
Some Strategies for COSQ Measurement and Improvements
Most measurements of software quality and its related costs are readily tracked with today’s tools, if you are willing to capture the additional data on the effort involved with COSQ. The major component is staff effort data, which can easily be converted to dollars when needed. The tools for tracking internal and external problems and defects already exist. What is needed is a record of the total team effort spent investigating and resolving those problems and defects. That will give you the major cost component of CPSQ.
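To make the conversion concrete, here’s a minimal sketch of turning tracked effort into the staff-cost component of CPSQ. The defect IDs, hours and loaded hourly rate are all made up for illustration:

```java
import java.util.List;

// Converting tracked defect-resolution effort into the staff-cost
// component of CPSQ: sum the hours, multiply by a loaded hourly rate.
public class DefectEffortCost {
    record DefectEffort(String id, double hoursToInvestigate, double hoursToResolve) {}

    static double staffCost(List<DefectEffort> defects, double loadedHourlyRate) {
        return defects.stream()
                .mapToDouble(d -> d.hoursToInvestigate() + d.hoursToResolve())
                .sum() * loadedHourlyRate;
    }

    public static void main(String[] args) {
        List<DefectEffort> tracked = List.of(
                new DefectEffort("BUG-101", 6, 10),
                new DefectEffort("BUG-102", 2, 4));
        // 22 hours at a $95/hour loaded rate = $2,090
        System.out.printf("CPSQ staff-effort component: $%,.0f%n",
                staffCost(tracked, 95));
    }
}
```

In practice the hours would come from your existing defect tracker; the only new discipline is recording them consistently.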
Once these basic mechanisms are in place, quality improvement programs can be meaningfully baselined and tracked over time.
Every company is at a different point in the evolution of its understanding of its key metrics/performance indicators and COSQ. Once the basics are in place, management can consider leveraging the following strategies to reduce their company’s CPSQ and positively impact quality and bottom line performance.
1. Create an action plan for software quality and process improvements in your IT shop
Establish baselines and benchmarks
Form a cross-functional action planning team
Select improvement targets/pilot projects
Apply software quality modeling and measurement standards
2. Improve supplier relationships for both product and process improvements
Collaborate during the development process
Engage suppliers in the corrective action process (from incoming or customer-reported problems)
Develop supplier scorecards
Audit suppliers based on their product/process risk levels
3. Focus product development on Prevention
Define Critical to Quality attributes
Pull in lessons learned from defect information in similar products’ risk files and quality system information
4. Make quality and achievement information and metrics visible across the organization
Collect real-time quality data – defects/dispositions, inspection/QA/testing results and rework, to name a few – to trend problems and spot systemic issues
Evaluate quality and compliance risk from audit results (internally and externally), complaints, reportable events
Use statistical analysis to monitor real-time quality data
5. Leverage technology
Evaluate tools that help analyze and track software quality attributes
Deploy a quality management platform to support operational efficiency, which can increase accountability, productivity and reliability
Each of the above initiatives has costs to implement, but also yields savings when achieved. The costs may increase CGSQ (through either the appraisal or prevention categories), and the savings can impact both the CPSQ and the CGSQ.
An Example of What Can Be Accomplished
As seen in the below graph for one of the companies that I have worked with, significant improvement in COSQ was achieved over a 5-year period. This led to a 200% improvement in organizational software development productivity over that period.
Understanding the Cost of Poor Software Quality in your organization is the first step toward gaining executive buy-in for quality-led operations. This is fundamental to achieving the potential benefits of agile, DevOps and Proactive/Predictive Quality Management. With a CPSQ number in hand, you have the basis for a business case to invest smartly in quality. Determining CPSQ may sound daunting, but in fact it’s very achievable and simply requires some tried-and-true methods along with a cross-functional team to get the brainstorming on paper.
Enhancing your organization’s approach to calculating COSQ is as much a culture change challenge as it is a process improvement program; planning and cross-functional team buy-in are required to ensure long-term COSQ calculation repeatability and success.
We hope this post aids you and your team in outlining the many factors that impact COSQ, uncovering the specific COSQ issues at a deeper level, and realizing the many lasting benefits of pursuing superior software quality in your IT systems.