From IT Leader to SRE Champion: Navigating Complexity with People Skills
- John Adams

- Aug 23
- 7 min read
The Intersection of IT Leadership and SRE Principles

Remember your first job? Maybe you were managing a small team or even just yourself. Over ten years ago, I led teams deploying complex systems – think mainframes, maybe some early network gear, the kind that required careful planning and deep expertise to keep running. Back then, the focus was on deployment success: did we get it right? Did it function for users?
Fast forward through countless projects, migrations, and scaling efforts. The landscape shifted dramatically with cloud computing and DevOps principles. Suddenly, 'getting it right' wasn't enough; maintaining that reliability at scale became paramount. This is where SRE truly entered the conversation.
My transition involved more than just learning new tools or methodologies. It required fundamentally understanding and reinterpreting established IT leadership practices through an SRE lens:
Responsibility Shift: While IT leaders manage infrastructure after deployment, often reacting to issues, SRE champions reliability proactively throughout the development lifecycle.
Metrics Matter: Availability, latency, error budgets – these aren't just technical metrics; they are business ones. IT success was measured by project completion dates or system uptime during specific support windows. SRE ties this directly to continuous availability and performance in dynamic cloud environments.
Ownership Culture: Old models often saw infrastructure teams as distinct from development teams ('Dev' vs 'Ops'). SRE breaks down these silos, embedding reliability thinking within every team responsible for services.
It wasn't about abandoning project management or my technical roots. It was about evolving them – adding layers of proactive measurement and a relentless focus on the user experience that systems provide, 24/7.
Beyond Code Deployment: Common Hurdles for New SRE Teams

Stepping into an SRE role can feel like entering uncharted territory, even if you've been managing infrastructure your whole career. The core challenge isn't technical deployment; it's shifting culture and mindset across the entire organization. When I made this transition, my initial hurdles were often about people:
The Blame Game: For decades, an incident meant blame to be distributed among teams – development for coding bugs, operations for configuration errors. SRE demands blameless postmortems, focusing on understanding systemic issues rather than finding scapegoats.
Hurdle: Resistance to transparency and accountability.
Solution: Start small with retrospectives focused purely on improvement, not finger-pointing. Frame it as a learning opportunity for everyone involved – developers need to understand Ops pain points, and vice-versa.
The Reactive Mentality: IT teams often live in reactive mode: fix what's broken, deploy updates when ready. SRE culture thrives on proactivity (anticipation) and automation.
(Personal Anecdote Alert) I remember the sheer volume of basic troubleshooting tasks – resetting routers, checking physical connections – that drained valuable time and prevented focus on deeper issues or proactive improvements.
Hurdle: Lack of automated checks for routine failures.
Solution: Invest heavily in automation. Define what needs to be automated (e.g., common failure scenarios) and start building scripts, playbooks, or tools accordingly.
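To give a feel for what "start building scripts" looked like in practice, here is a minimal Python sketch of a check-and-remediate loop for routine failures. The service names, addresses, and restart commands are hypothetical placeholders, not a recommendation to restart things blindly from cron:

```python
#!/usr/bin/env python3
"""Minimal sketch of automating a routine check-and-remediate loop.

Hypothetical service names and commands; adapt to your own playbooks.
"""
import socket
import subprocess

# Map each routine check to the remediation we used to perform by hand.
CHECKS = {
    "edge-proxy": {"host": "10.0.0.12", "port": 443,
                   "remediate": ["systemctl", "restart", "edge-proxy"]},
    "metrics-agent": {"host": "10.0.0.20", "port": 9100,
                      "remediate": ["systemctl", "restart", "node_exporter"]},
}

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_checks() -> None:
    for name, cfg in CHECKS.items():
        if port_open(cfg["host"], cfg["port"]):
            print(f"{name}: healthy")
            continue
        print(f"{name}: unreachable, running remediation")
        # In practice this would go through your change/automation tooling,
        # not a raw subprocess call fired from a cron job.
        subprocess.run(cfg["remediate"], check=False)

if __name__ == "__main__":
    run_checks()
```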
The Monitoring Black Hole: Many organizations have monitoring systems that generate alerts but rarely provide actionable insights. Data is collected, but teams don't know how to use it effectively for improving system health or preventing incidents before they happen.
Hurdle: Inadequate alerting and lack of clear metrics.
Solution: Define your service level objectives (SLOs) first. This sets the baseline for acceptable performance and availability, guiding what needs monitoring and how critical certain thresholds are. Then, focus on correlating events, reducing noise, and ensuring dashboards tell a coherent story about system health.
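To make "SLOs first" concrete, here is a minimal Python sketch that turns a request-based availability SLO into an error budget and reports how much of it has been consumed. The numbers are purely illustrative; the inputs would come from your own monitoring:

```python
"""Sketch: turn an availability SLO into an error budget report.

Illustrative numbers only; plug in counts from your own monitoring.
"""
def error_budget_report(slo: float, total_requests: int, failed_requests: int,
                        window_days: int = 30) -> dict:
    """Compute remaining error budget for a request-based availability SLO."""
    allowed_failures = total_requests * (1.0 - slo)  # the budget, in requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "slo": slo,
        "observed_availability": 1.0 - failed_requests / total_requests,
        "budget_consumed_pct": round(100 * consumed, 1),
        "window_days": window_days,
    }

# Example: 99.9% SLO, 12.4M requests this window, 9,300 of them failed.
print(error_budget_report(slo=0.999, total_requests=12_400_000, failed_requests=9_300))
# Roughly 75% of the 30-day budget consumed: alert thresholds and dashboard
# panels can key off this one number instead of raw error counts.
```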
Networking at Scale in Cloud-Native Environments

Ah, networking – always a beast! In traditional IT, we dealt with physical constraints, cabling, dedicated routers. Now, cloud-native environments introduce complexity like never before: distributed systems spread across availability zones, micro-segmentation, software-defined and container networking (VPCs, CNI plugins), and the sheer scale of the automation challenge.
My networking teams faced a significant shift:
Visibility Nightmare: Tracking flows became harder as services spun up/down dynamically. IP addresses changed constantly.
Automation Potential & Peril: We could automate configuration management for network policies in Kubernetes (using tools like Calico or Cilium), but the potential for misconfiguration errors was immense and required robust testing and validation processes.
This is where people skills become absolutely critical:
SREs must understand how application traffic patterns impact underlying network design.
Network engineers need to grasp developer perspectives on availability, latency, and ease of deployment – especially regarding new IP-based architectures or service discovery mechanisms.
We built bridges by speaking their language. For networking teams:
Use the Right Tools: Implement robust monitoring for network infrastructure (e.g., using Prometheus and Grafana for metrics). But don't stop there.
Educate Peers: Explain how SRE practices like automation can directly benefit application deployment speed without sacrificing reliability or security. Show concrete examples of configuration drift being caught automatically (a simple drift check is sketched after this list).
Collaborative Incident Resolution: When a network issue hits, involve developers early to understand the traffic and service dependencies.
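The drift check mentioned above does not have to be elaborate. The sketch below compares a declared policy (what lives in Git) against the live state and flags differences; the policy names and rule strings are hypothetical stand-ins for whatever your cluster or device API actually returns:

```python
"""Sketch: flag configuration drift between declared and live network policies.

Policy names and rules below are hypothetical; in practice the "declared"
side comes from Git and the "live" side from the cluster or device API.
"""
from typing import Dict, Set

def diff_policies(declared: Dict[str, Set[str]],
                  live: Dict[str, Set[str]]) -> Dict[str, dict]:
    """Return per-policy drift: rules missing from live state or added out-of-band."""
    drift = {}
    for name in declared.keys() | live.keys():
        want, have = declared.get(name, set()), live.get(name, set())
        if want != have:
            drift[name] = {"missing": sorted(want - have),
                           "unexpected": sorted(have - want)}
    return drift

declared = {"payments-allow": {"ingress tcp/443 from web", "egress tcp/5432 to db"}}
live = {"payments-allow": {"ingress tcp/443 from web", "ingress tcp/22 from anywhere"}}

for policy, delta in diff_policies(declared, live).items():
    print(f"DRIFT in {policy}: {delta}")
```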
AI Integration: Opportunities, Risks, and Managing Team Adoption
Artificial Intelligence is transforming SRE from a reactive discipline to a predictive one. We're leveraging ML for anomaly detection in logs and metrics, using AIOps tools like Dynatrace or Datadog to correlate complex events, and even exploring machine learning models for root cause analysis.
My team's experience with AI was cautious but pragmatic:
The Opportunity: Think of automated log analysis catching subtle patterns indicative of future failure points – a game-changer.
Example: An ML model trained on historical metrics data might predict an impending performance degradation before users are affected, allowing proactive scaling or optimization (a minimal sketch of the idea follows the risks below).
The Risks (and the Human Element):
False Positives: AI models aren't perfect. Too many false alarms can lead to alert fatigue and disregard for legitimate issues.
Model Opacity ("Black Box"): Understanding why an AI tool flagged something requires technical expertise in ML interpretation, which might not be universally available or understood within the team.
Data Drift & Concept Shift: Environments change constantly (especially with CI/CD and infrastructure changes). The AI model needs ongoing maintenance.
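To make the opportunity above concrete without overselling it, here is the minimal sketch referenced earlier: a rolling z-score check over synthetic latency samples. It is deliberately far simpler than a trained AIOps model, and the window size and threshold are illustrative assumptions that would need tuning against real metrics:

```python
"""Sketch: flag anomalous latency samples with a rolling z-score.

Deliberately simpler than a production AIOps model; window and threshold
are illustrative and would need tuning against real metrics.
"""
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=30, threshold=3.0):
    """Yield (index, value) for points far outside the recent baseline."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) >= window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value
        history.append(value)

# Synthetic p95 latency series (ms) with a degradation at the end.
latencies = [120 + (i % 5) for i in range(60)] + [180, 195, 210]
for idx, val in detect_anomalies(latencies):
    print(f"sample {idx}: {val} ms looks anomalous vs. recent baseline")
```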
Beyond the technical challenges lies managing adoption:
The Skepticism Curve: Teams accustomed to traditional metrics might ask, "How can we trust an algorithm?"
(My Approach) Start small, measure effectiveness rigorously against defined SLOs (e.g., reduction in P1/P2 incidents), and be transparent about limitations. Explain the model training data sources clearly.
The Skills Gap: AI tools are sophisticated but require specific skills to operate effectively.
(Solution) Invest in training or hire talent with both SRE fundamentals and ML/Cloud/AI expertise. Foster a learning culture where teams feel comfortable experimenting safely.
Building a High-Performance Monitoring Culture Starts with Leadership Alignment
This is perhaps the biggest leap for any IT leader transitioning into SRE. Technical tools are essential, but without alignment from leadership (including senior management), monitoring initiatives flounder. I learned this the hard way early on – people look to leaders for direction and validation that something really matters.
How did we achieve buy-in?
Communicate the Value: Translate technical metrics into business impact language.
Example: Instead of saying "Our API latency SLO is being violated," say "This affects customer conversion rates by X% in production."
Visible Commitment: Leaders need to be genuinely visible champions – maybe even sitting on incident response teams or championing tool upgrades personally.
Foster a Learning Environment: Encourage experimentation with new monitoring techniques and tools, backed by robust testing (preferably non-disruptive). Reward transparency during postmortems.
This alignment trickles down: When developers see that reliability is prioritized at leadership level, they start designing for it from day one. QA teams focus on automated tests reflecting SLOs more closely. Operations staff feel empowered to push back on designs lacking sufficient monitoring or observability.
The Long View: Embedding Reliability into Your Digital Transformation Strategy
SRE isn't just a team; it's an approach, woven into the fabric of how technology is built and delivered in modern organizations. For me, embedding SRE principles meant ensuring reliability was baked into every strategic initiative:
The Product Owner Perspective: We held joint design reviews with product owners to discuss potential failure modes during new feature development.
"A great user story," I'd ask, "but if it causes our core API response time to blow up for 5% of users even briefly, that violates our SLO. How do we prevent that?"
The Cloud Migration Journey: Reliability wasn't an afterthought during migration but the primary driver.
We didn't just say "move this workload." We said "migrate this service ensuring its error budget is maintained or improved, and proving it via chaos engineering."
Measuring Success Beyond Uptime: While uptime is crucial, true SRE success lies in meeting business objectives reliably.
This requires looking at your organization's roadmap through an SLO-colored lens. Ask: What are the critical services that must meet their reliability targets for these initiatives to succeed? Ensure teams responsible for building and operating those services understand why reliability matters so much.
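Looping back to the product-owner conversation above, a latency SLO check does not have to be elaborate to be useful in a design review. This sketch evaluates a 95th-percentile target against observed samples; the 300 ms threshold and the synthetic data are assumptions for illustration only:

```python
"""Sketch: check a 95th-percentile latency target against observed samples.

The 300 ms target and the synthetic data are illustrative assumptions.
"""
import random

def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[rank]

def meets_latency_slo(samples, target_ms=300.0) -> bool:
    return p95(samples) <= target_ms

random.seed(7)
observed = [random.gauss(180, 60) for _ in range(10_000)]  # synthetic latencies (ms)
print("p95:", round(p95(observed), 1), "ms, SLO met:", meets_latency_slo(observed))
```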
Framework Spotlight: How to Use PagerDuty Chats for Proactive Incident Management
PagerDuty, alongside tools like Opsgenie or VictorOps, is crucial in the SRE toolkit today. But beyond just managing alerts, its chat capabilities offer a powerful way to coordinate response and knowledge sharing:
Contextual Clarity: When an incident hits, PagerDuty Chats can instantly gather all relevant engineers (on-call or otherwise) into one conversation thread.
`We opened a #incident channel for P1/P2 events immediately. The chat history automatically collected logs from integrations we configured – no manual pasting needed on my part, just setting up the triggers properly.`
Shared Understanding: Use the chat to walk through the investigation step-by-step.
`As I was troubleshooting a network anomaly causing high latency in our containerized service, I shared relevant 'show run' outputs and traceroute results directly into the chat. Everyone got visibility without needing separate Slack DMs or email threads bouncing off each other.`
Postmortem Foundation: The chat history often becomes the raw material for postmortems.
`We found that a poorly documented network policy in our CI/CD pipeline caused the misconfiguration. Having it flagged and discussed live made writing concise, factual postmortems much easier – everyone agreed on the facts first.`
Documentation by Default: Don't forget to document key learnings or action items directly into the chat after resolution.
`I added a bullet point in the closing message: "Need to investigate why policy validation fails intermittently. Error budget impact was zero." This serves as both closure and future reference.`
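For teams wiring their own checks into this workflow, here is a minimal sketch of opening an incident programmatically via the PagerDuty Events API v2. The routing key and monitor names are placeholders, and the chat channel itself comes from the chat integration you configure in PagerDuty, not from this call:

```python
"""Sketch: open a PagerDuty incident via the Events API v2.

The routing key and summary text are placeholders; the chat channel is
created by the PagerDuty chat integration, not by this request.
"""
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"

def trigger_incident(routing_key: str, summary: str, source: str,
                     severity: str = "critical", dedup_key: str = "") -> str:
    """Send a trigger event; return the dedup key PagerDuty echoes back."""
    body = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        body["dedup_key"] = dedup_key
    resp = requests.post(EVENTS_API, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json().get("dedup_key", "")

# Example: an automated check escalates an SLO-threatening network anomaly.
# trigger_incident("YOUR_INTEGRATION_ROUTING_KEY",
#                  summary="p95 latency above SLO threshold on payments API",
#                  source="slo-burn-monitor")
```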
Key Takeaways
Embrace Interdisciplinary Fluency: You must understand not just code, but infrastructure, networking, and security at scale. Speak their language.
Prioritize People Above Processes (Initially): The biggest challenge is cultural. Invest time in building trust across teams and fostering a shared understanding of reliability goals first; technical debt can be addressed later with clearer tools or processes.
SLOs Are Non-Negotiable Business Drivers: Frame everything around service level objectives. Make sure they are understood, measurable, and linked directly to user satisfaction and business outcomes.
Balance Proactive Measures with Reactive Maturity: Automate as much as possible for routine tasks (monitoring alerts), but be ready to engage technically deep experts when automated responses fail or require human intervention – define clear escalation paths based on SLO impact.



