Do you respect your mechanic?
This is an odd question from a software developer who doesn’t drive, yet it is one that I recently found myself asking. I was thinking in the context of groups that share a common plight with System Teams in SAFe, on which the literature is very sparse. The question was simple: “Does our work as a System Team really contribute value to the end product?”
To think through this, I drew a comparison to a mechanic in a shipping company. If you don’t have a competent mechanic to order the right trucks, you can’t deliver your products to the customer. If you don’t keep your mechanic around to keep the trucks running, they break down and you can’t deliver your products to the customer. So a mechanic is obviously an integral part of delivering value to your customer, and by the same logic a System Team must also contribute to the delivery of value to customers in a software company.
Satisfied with this initial comparison, I started to take it further, wondering whether people really expect the same of their mechanic as they seem to expect of their System Teams.

Would you expect to walk into your mechanic’s garage and have them fix your problem then and there?
Most dealings with mechanics play out the same way: you bring your car in and they tell you it will be back some time in the next week, if you are lucky. This is generally inconvenient, as we need our cars to go about our day and the bus just doesn’t do it for us. However, we all understand there is a backlog of work the mechanic has to do, work from people who came in before us. We can’t see this backlog, and there is rarely a “You are 20th in line” ticket system; we just accept that it is the nature of mechanics to have a backlog of things that need doing. This makes sense, as we always see them working and know that if they were sitting idle they wouldn’t be making money. It’s in their best interest to be fully utilised.
Why then is the expectation of a System Team different? In many ways we are much more transparent than a mechanic: we have a backlog of work that is in the same tracking tool the teams use. Yet it is unacceptable that people are made to wait for their work to be done. Everything should be instantaneous because it’s all just in the Cloud, and the concept of lead time is non-existent. We are regularly greeted with visits to our desks to ask, “Could you just do this for me? It’s important!” Asking someone to create a ticket is met with scorn: “It would take longer for me to submit it than it would for you to fix the issue!” Trying to put order on the backlog with prioritisation techniques like WSJF? Well, that’s just a system to game. Make every backlog item critical, with minimal size. However, these tactics just invalidate the system: if everything is important, nothing is.

Would you blame your mechanic for your car constantly breaking down if your parents bought you a crock of a car?
We can’t all afford to go out and buy the top-of-the-line car; sometimes we just have to live with what is given to us. If you are gifted a fantastic pile of rust that still somehow runs, you smile and accept it. When it inevitably breaks down, you bring it in to the mechanic, they fix it, and you hope not to see them again. You can complain to your heart’s content to your parents about the burden they have bestowed on you, but you never criticise your mechanic for their inability to turn a rust-bucket into a Ferrari.
When working in a large multinational software company that has been around for a while, there are plenty of rust-buckets given to you. Servers that have been sitting in the basement for the past 20 years, storage disks that need to be dusted down to run faster, networks held together with more tape than wires. There is no nostalgia strong enough to make working with these systems enjoyable, but they are often all that is available. System Teams need to work tirelessly just to keep these up and running, applying workarounds and patches to cover the rust. Yet rather than anyone appealing to management for more investment, the mechanic is expected to perform the miracle of transforming them into supercomputers.

When your car runs like crap because you haven’t maintained it, is it the mechanic’s fault for not calling you into the garage?
We’ve all seen “that” car. The one with all the warning lights on, duct tape holding things together, no room in the back from all the receipts and junk collected over time. We all know how those people respond when questioned:
“Should you do something about that flashing red light?”
“Nah, it’s fine, it always does that when the car is on.”
When the day finally comes that the car dies, it is brought to the garage and looked over. The warning signs are all there, and the mechanic looks on disapprovingly: “Could have salvaged this if you’d come in 2 months ago!” You just drop your head in agreement, no arguments.
Yet when a server fills up with logs because a team decided every trivial action their app performs is worthy of remembering, the complaint is that the System Team did not provide enough disk space, despite everyone getting a standard size. When a system finally fails after the health-check has been flashing red for weeks, the blame falls on the System Team for failing to notify the development team of the issue. When hack upon hack starts surfacing performance issues in front of the customer, it is a faulty environment that is supposedly causing the problem.

When you leave the keys in your car, is the mechanic at fault for not keeping your car safe?
No one would ever expect a call from their mechanic telling them they left their car door open. That much responsibility cannot be put on so few. Every car owner is responsible for keeping their own car safe. If you leave your keys lying around, you take the risk that your car will be stolen.
Yet when a system is poorly designed, it should have been caught by the System Team. They manage the build systems and secret stores, they should know what everyone is doing with the keys given to them. They should ensure that no one is leaving them lying around for people to pick up. They should ensure that every team is applying good practices in their build and deploy scripts.

Much of this is just venting frustrations. I would like to hope that I am alone in experiencing this. However, I doubt it. Major headlines about faults in banks often point the blame at the IT guy who screwed up and missed something. The weakest link in the chain is always the one that doesn’t get enough attention. This seems obvious once it has broken. So is there a way to ensure our mechanics get enough attention?
DevOps, as always, seems the obvious answer: shared responsibility between the mechanic and the driver. But is it a panacea? Most talks and articles I’ve seen look at the shiny side of it: the cool alerting and monitoring tools, serverless setups and the auto-magic of Kubernetes or Swarm. But what about those stuck in the paleolithic era of IT? Is there any love for old-school mechanics who have to get their hands dirty (maybe dusty)?
Would applying the same mindset and respect that we do with our trusted mechanics put us in a better place with our System Teams? Could we trust that they have a long backlog of work they are trying to get done? Respect that they are not responsible for the horrid state of the infrastructure given to them? Acknowledge that when things go wrong, some of the fault may lie with the teams themselves? Realise that security cannot be owned by a small group and needs to be shared collectively?
These seem like redundant questions when typing them out, yet they continually need to be asked.