Why Service Teams Are the Answer to Your DevOps Woes
Operations is the lifeblood of any software-as-a-service team, yet it is frequently treated as a second-class concern. Software developers focus on building features, then punt the work of actually supporting those features to a “DevOps” team. This is an antiquated model from the era when software was distributed on floppy disks. Once the disks were produced, the developers were done; anything after that was a customer service or ops problem.
I put “DevOps” in quotes for a reason. Consultants have warped and twisted what was once a simple concept in order to sell naive leaders tools and products. There are as many definitions of “DevOps” as there are consultants wanting to sell you something, and I won’t be using the term except to refer to that vague nonsense.
In the modern world, for the vast majority of cases, software is a service that requires ongoing development, maintenance, and support. Operating software is no longer a separate concern; it’s part and parcel of the job of developing it.
Throwing Things Over the Fence
Early in my career, I worked on an operations team where developers would throw their code over the fence. Predictably, these systems broke often, and we built hours-long manual testing processes to run after every release to try to catch problems. These ran well into the morning hours, with the operations and product teams getting to bed around 2 or 3 AM. Even with all that testing, bad code changes still made it to production, and the developers would complain whenever we had to pull them off of coding work to handle Incidents and fix their shoddy work.
This attitude was largely rooted in a sense of developer supremacy: that developers are “too good” to be on-call or be paged. That’s for someone else, paid less, usually offshore, with a more limited skillset. Operations staff were warm bodies to throw at the soul-sucking task of operations.
This was draining in a way I cannot describe in words. We were on the hook for someone else’s decision making with no real way of influencing those decisions. I’ve never been so stressed in my life as during those years. Turnover was high, morale was low.
This model simply does not work. It encourages a lack of ownership over your work, sets the wrong incentive structures, breaks key feedback loops, and grinds your organization to a halt.
Service Team
There is a better way, which I call the Service Team model. In this model, developers are responsible for designing and developing features, managing infrastructure, maintaining the long-term health of the system, handling customer requests, responding to Incidents, and maintaining a continuous improvement process. The buck stops with the service team. They own the software top-to-bottom, inside-and-out. I’ve been working under this model for years now with great success. We own everything related to our services. We are responsible for the victories and defeats. These are service teams, not development teams or operations teams.
This model is possible largely because modern technologies, such as public clouds and containerization, have lowered the technical breadth required to run a high-performance software service. No longer do you need intricate knowledge of servers, load balancers, hypervisors, storage arrays, and networking to run a service; these are now themselves software services provided by public clouds. Even if you are on-prem, a simple Kubernetes cluster abstracts almost all of this away from your service teams.
Feeling The Pain
Your development process directly dictates your operations pain. Missed edge cases, poorly performing code, memory leaks, and insufficient logging and metrics all contribute to that pain. If you’re throwing your code over the fence to an operations team, they are on the receiving end of it, but they have no leverage to drive improvement.
Some organizations try to build processes to close this gap. I’ve been in an organization that tried allocating dev time every sprint for operations work prioritized by the ops team; that work was always the first to be deprioritized once someone demanded a new feature. I’ve been in an organization where the operations team was cutting tickets into a black hole, never seeing the improvements they needed actually get built. In another organization, operations teams were permitted to develop and submit PRs for their own needs, but those PRs often went ignored because they didn’t contribute to any development team metrics, and they took forever because the operations team wasn’t familiar with the codebase.
The Service Team model embeds this feedback loop into the team itself. Developers are far more conscientious about their changes when they’re the ones woken up at 2 AM with poor logging and no metrics to diagnose and fix the issue. It’s in the service team’s self-interest to address operational concerns as they ship code. This effect is so strong that I often see service teams pushing back on management and product teams that try to push them to launch without sufficient operational controls in place.
A Systems Frame
Developers being responsible for code, compute, and other infrastructure forces them to look at their software from a systems frame. They cannot afford to ignore the impact specific load balancer configurations may have on their service. They cannot live in a world of “worked on my machine.” They must face the reality that their software is a complex system that they must understand.
Working in high-scale cloud software, almost every scaling issue I’ve encountered has come down to some odd interplay between pieces of the system. It’s very rarely just code or just infrastructure. Building a deep understanding of the whole system, not just the code or the infrastructure, is what has enabled us to find and fix these issues.
Implementation
Now that we’ve established why this model works better, let’s look at implementation. There are a few key considerations that need to be addressed:
On-Call Rotation
Ticketing Process
Time Allocation for Ops Work
Initial Investments
On-Call Rotation
All members of the service team should be in the on-call rotation, from juniors to seniors. Juniors or new hires who are too inexperienced to be on their own should shadow more experienced engineers. A rotation length of one week works very well, and you should have a minimum of five engineers in the rotation to prevent burnout. While on rotation, the on-call is expected to respond to production Incidents in a timely manner.
During this time, the on-call engineer has one priority: resolve Incidents. Everything else is secondary to that task. If there are no Incidents, they should be working the operations backlog, not the feature backlog. This strict separation prevents operational issues from impacting feature capacity, leading to a more predictable feature development pace.
Ticketing Process
I’ll write a longer post on operations concepts and tickets, but at its core, the team needs a ticketing process that supports three types of tickets:
Requests are customer tickets or tickets from other teams asking for support or some other help.
Incidents track instances of service degradation: outages, performance issues, and so on. An Incident is a single occurrence of such a degradation. Incidents have different severities, with some being minor and some needing immediate attention.
Problems track ongoing issues with the service. For instance, a memory leak in the service code that occasionally causes Incidents where hosts run out of memory. Incidents and Problems are linked together to enable tracking the impact of a Problem over time, which aids in prioritization.
You need some ticketing system that can handle these three types of operational tickets. Ideally, this system should integrate with a paging tool that can contact the on-call automatically if an Incident of sufficient severity is opened.
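To make the distinction concrete, here is a minimal sketch of the three ticket types and the Incident-to-Problem link as Python dataclasses. The field names, the two-level severity scale, and the should_page helper are illustrative assumptions, not a reference to any particular ticketing tool.

```python
# A minimal sketch of the three operational ticket types and their relationships.
# Field names and severity levels are illustrative assumptions.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    LOW = "low"    # track and fix during business hours
    HIGH = "high"  # page the on-call immediately


@dataclass
class Request:
    requester: str      # a customer or another team
    description: str


@dataclass
class Incident:
    description: str
    severity: Severity
    problem_id: Optional[str] = None  # link to the underlying Problem, once known


@dataclass
class Problem:
    description: str                                       # e.g. "memory leak in worker process"
    incident_ids: List[str] = field(default_factory=list)  # every occurrence, which aids prioritization


def should_page(incident: Incident) -> bool:
    # A paging integration would use a rule like this to decide
    # whether to contact the on-call automatically.
    return incident.severity is Severity.HIGH
```

Whatever system you use, the property worth preserving is the link from each Incident back to its Problem, so the cumulative impact of a Problem is visible when the team prioritizes the backlog.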
Time Allocation for Ops Work
A portion of the regular sprint (or similar) capacity should be dedicated to working a ticket backlog fully owned by the service team. Product managers, team managers, and so on have no power here: the stakeholder is the service team, and they decide what work gets done and in what order. This is where tasks such as logging and metrics improvements, code refactors and cleanup, automated testing, and other automation work go.
Some points of caution:
Do not put feature bug fixes in this backlog. Otherwise, product teams are incentivized to ship features with bugs, the ops backlog eventually becomes inundated with the consequences of that behavior, and it turns into a bug backlog that chokes out actual operational improvements.
Do not save up for operations-dedicated sprints. Progress is best made slowly and steadily, and deferring operations capacity to ship a feature will eventually result in the operations work never being done.
This backlog closes the loop between operational pain and improvement. It dedicates time for the team experiencing the pain to actually address it and generates a flywheel effect driving towards operational stability, both for the team and for the customer.
I recommend starting with 33% of capacity dedicated to this work. That sounds like a lot, but you probably have more work here than you realize. As you continue this process, your needs may die down to the point that you no longer require this much capacity. That is a good opportunity to let your service team experiment. They may even decide to pick up a feature with this time, at their discretion; it’s the service team’s time to do with as they see fit.
Initial Investments
Most organizations without a strong feedback loop are drowning, whether they realize it or not. Your initial goal is to get your head above water, and then to start swimming.
Phase I: Stop Drowning
Phase I focuses on two key types of work: improving observability and reducing Incidents.
By “observability” I’m not talking about the consultant-speak “o11y” where you have to buy a bunch of expensive tools. I’m specifically asking: do you even know what your application is doing? The vendor tools have some great features and can definitely be used to great effect, but oftentimes they are haphazardly implemented as an afterthought and end up as enormous piles of noise. If you’re so early in your operations journey that you don’t understand how your system works, you can’t possibly hope to use these tools effectively, and they are not worth the financial investment, which tends to be substantial.
Logging and basic metrics can get you most of the way there. You should be able to trace a specific piece of work (a request or message) through your system with some kind of unique identifier. Include that identifier in every log message and aggregate your logs in a single place. Congratulations, you now have distributed tracing: you can see what your application is doing. As your team goes through the service to implement logging and metrics, they will need to analyze the application code and infrastructure to understand what’s actually happening, what logging is useful, and what isn’t. This exercise is just as important as the logging itself and is the main reason the vendor tools fail. They promise that you don’t have to do the hard work of understanding, but the understanding is what makes the tooling functional in the first place.
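As a rough illustration, here is a minimal sketch of the identifier-in-every-log-line idea using Python’s standard logging module. The X-Request-Id header name, the logger configuration, and the handler shape are assumptions made for the example.

```python
# A minimal sketch: thread a single request ID through every log line.
# The header name and handler shape are illustrative assumptions.
import logging
import uuid

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s",
)
logger = logging.getLogger("service")


def handle_request(headers: dict) -> None:
    # Reuse the caller's ID if one was propagated; otherwise mint a new one.
    request_id = headers.get("X-Request-Id", str(uuid.uuid4()))
    extra = {"request_id": request_id}

    logger.info("request received", extra=extra)
    # ... do the actual work, passing request_id to any downstream calls ...
    logger.info("request completed", extra=extra)


handle_request({"X-Request-Id": "abc-123"})
```

Ship those lines to one aggregation point, and filtering on request_id gives you an end-to-end view of any single piece of work.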
At the same time, you should be reducing Incidents by driving root cause resolution. If you have a memory leak, you need to ruthlessly pursue it and eliminate it, not just write a script to bounce hosts every so often. Dive into root causes and fix them. The increased understanding granted by the logging and metrics work will help you drive these fixes, reducing the on-call burden and freeing up time for your service team to focus on the work in Phase II, which keeps you from drowning again.
Phase II: Start Swimming
Now that you’re not drowning, you need to invest in operational work that improves efficiency and reduces regressions. This is done by investing in Continuous Integration / Continuous Deployment (CI-CD) processes.
I have a whole post planned on CI-CD, but the short version here is to invest in automated deployment and testing. You should be able to merge a commit and have it in production within a few hours.
Automated deployment is trivial, especially in a cloud environment. Use Terraform or CDK to manage your infrastructure and code deployment processes and simply trigger them from a CI-CD pipeline tool such as Jenkins or GitLab.
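For a flavor of what infrastructure-as-code triggered from a pipeline can look like, here is a minimal sketch using the AWS CDK’s Python bindings to define a load-balanced container service. The stack name, the image, and the choice of Fargate are assumptions for illustration; a Terraform configuration would play the same role.

```python
# A minimal sketch of defining deployable infrastructure with the AWS CDK (v2, Python).
# The construct choices and names here are illustrative assumptions.
from aws_cdk import App, Stack
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_ecs_patterns as ecs_patterns
from constructs import Construct


class ServiceStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A load-balanced containerized service; the image tag would normally
        # come from the CI-CD pipeline that built it.
        ecs_patterns.ApplicationLoadBalancedFargateService(
            self, "Service",
            task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
                image=ecs.ContainerImage.from_registry("example/service:latest"),
            ),
        )


app = App()
ServiceStack(app, "ServiceStack")
app.synth()

# The pipeline (Jenkins, GitLab, etc.) would then run something like
# `cdk deploy ServiceStack` on every merge to the main branch.
```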
Almost all testing can be automated nowadays, so there’s not really a reason to do major releases or to gate your deployments on a QA team. Instead, invest in automated tests using a testing framework of your choice and run them against a copy of your production environment. If the tests pass there, you’ll most likely be good in production.
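Here is a minimal sketch of what such a gate might look like with pytest and the requests library, run against a staging copy of the environment. The STAGING_URL variable and the /health and /widgets endpoints are hypothetical; the point is that the gate is an automated suite rather than a manual sign-off.

```python
# A minimal sketch of automated tests run against a copy of production before promotion.
# STAGING_URL and the endpoints below are hypothetical examples.
import os

import requests

BASE_URL = os.environ.get("STAGING_URL", "https://staging.example.com")


def test_health_endpoint_returns_ok():
    # A failing health check should block promotion to production.
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    assert response.status_code == 200


def test_create_and_fetch_widget():
    # Exercise a representative end-to-end path, not just a status page.
    created = requests.post(f"{BASE_URL}/widgets", json={"name": "smoke-test"}, timeout=5)
    assert created.status_code == 201

    widget_id = created.json()["id"]
    fetched = requests.get(f"{BASE_URL}/widgets/{widget_id}", timeout=5)
    assert fetched.status_code == 200
```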
Over time, you’ll find gaps in your CI-CD process that will enter the backlog and you’ll iteratively improve to the point you have almost no deployment failures or production regressions.
Making the Change
If you currently have split development and operations teams and want to make this change, there will be friction. Your developers are probably not used to being on-call, and your operations team may feel threatened. You’ll need to address both of these concerns.
I recommend spreading the operations team across the development teams and investing in them as developers. This was my career path, and it worked out great for me and my team. Having someone who specialized in operations processes embedded in the team helps mentor the developers on good operations practices and spreads that hard-won experience. Your operations personnel have valuable knowledge and shouldn’t be cast aside. Conversely, dedicated operations staff will be a dying breed in the coming years, so giving them the opportunity to learn software development and engineering provides them with a career path.
Developers may be more problematic depending on the culture within the team.
If the developers are already accustomed to being involved in operations issues, you can leverage the rotation as a selling point. On-call rotations give your service team a break and a clear understanding of when they need to be engaged with work. They’re allowed to properly disconnect when they are not on-call, which goes a long way towards improving mental health and morale.
If the developers are not accustomed to being involved, you’re effectively asking them to do more work for the same pay. You can solve this immediately by offering an on-call stipend or some additional pay to account for the increase in responsibility. It’ll cost more up front, but it will pay off in the long run as the system becomes more stable and the team automates its manual work, improving efficiency. Ultimately, though, you’ll need to hire people with the expectation that they own their service.