Senior Site Reliability Engineer - Freelance

Our client is looking for a highly skilled Senior Site Reliability Engineer (SRE) to serve as a hands-on reliability expert across three SaaS product lines. You’ll lead Tier-3 incident response, drive root-cause analysis, automate resilient infrastructure, and coach product teams in observability and SLO best practices. This is a high-impact, cross-functional role for someone passionate about performance, reliability, and DevOps culture.

Responsibilities:

Tier-3 incident response & root-cause analysis for all customer-facing products (GT Motiv, Contra Expert, Innovation Group).
Deep application debugging across Java, .NET, Go, and Python stacks; correlate logs, traces, and metrics (Datadog APM/Logs/RUM).
Network-level troubleshooting (TCP/IP, TLS, DNS, load-balancers, service mesh) to eliminate latency and availability bottlenecks.
Reliability engineering & automation: define/track SLOs & error budgets, build self-healing, fail-over, autoscaling and chaos-testing routines.
Observability platform ownership: create dashboards, alerting rules, and runbook automation; continuously close visibility gaps.
Post-incident improvement: facilitate blameless post-mortems, document findings, and drive architectural and process fixes.
Cross-functional coaching: embed with product squads to uplift logging, testing, and resilient design practices.

Requirements:

8+ years in SRE / DevOps / production-engineering roles for high-availability SaaS.
Expert networking skills: packet-level analysis, transport protocols (TCP, TLS), HTTP & gRPC.
Cloud proficiency in AWS, Azure or GCP, with experience in hybrid or multi-cloud topologies.
Coding ability: strong in Java and/or C# (.NET) plus one scripting or automation language (Go, Python, Bash); able to debug unfamiliar codebases.
Observability & incident tooling: Datadog (preferred) or equivalent APM + log stack, plus PagerDuty/ServiceNow.
IaC & GitOps: Terraform, CI/CD, ArgoCD.
24×7 on-call readiness and proven ownership of SLO/SLA compliance.
Excellent written & spoken English (international stakeholder base).

Nice-to-have:

Experience running event-driven / streaming platforms (Kafka, RabbitMQ) and micro-services architectures.
Prior work in SRE consulting / “reliability guild” supporting multiple product lines.

If interested, please apply here: https://app.upper.co/job/c029c993-5d9b-44fb-a968-b8c85c14c752?sourcerId=9a3ffee2-3acb-4f85-a458-9a7469aa02bf
