Our client is looking for a highly skilled Senior Site Reliability Engineer (SRE) to serve as a hands-on reliability expert across three SaaS product lines. You’ll lead Tier-3 incident response, drive root-cause analysis, automate resilient infrastructure, and coach product teams in observability and SLO best practices. This is a high-impact, cross-functional role for someone passionate about performance, reliability, and DevOps culture.
Responsibilities:
- Tier-3 incident response & root-cause analysis for all customer-facing products (GT Motiv, Contra Expert, Innovation Group).
- Deep application debugging across Java, .NET, Go, and Python stacks; correlate logs, traces, and metrics (Datadog APM/Logs/RUM).
- Network-level troubleshooting (TCP/IP, TLS, DNS, load-balancers, service mesh) to eliminate latency and availability bottlenecks.
- Reliability engineering & automation: define/track SLOs & error budgets, build self-healing, fail-over, autoscaling and chaos-testing routines.
- Observability platform ownership: create dashboards, alerting rules, and runbook automation; continuously close visibility gaps.
- Post-incident improvement: facilitate blameless post-mortems, document findings, and drive architectural and process fixes.
- Cross-functional coaching: embed with product squads to uplift logging, testing, and resilient design practices.
Requirements:
- 8+ years in SRE / DevOps / production-engineering roles for high-availability SaaS.
- Expert networking skills: packet-level analysis, transport protocols (TCP, TLS), HTTP & gRPC.
- Cloud proficiency in AWS, Azure or GCP, with experience in hybrid or multi-cloud topologies.
- Coding ability: strong in Java and/or C# (.NET) plus at least one of Go, Python, or Bash; able to debug unfamiliar codebases.
- Observability & incident tooling: Datadog (preferred) or equivalent APM + log stack, plus PagerDuty/ServiceNow.
- IaC & GitOps: Terraform, CI/CD, Argo CD.
- 24×7 on-call readiness and proven ownership of SLO/SLA compliance.
- Excellent written & spoken English (international stakeholder base).
Nice-to-have:
- Experience running event-driven / streaming platforms (Kafka, RabbitMQ) and micro-services architectures.
- Prior work in SRE consulting / “reliability guild” supporting multiple product lines.
If interested, please apply here: https://app.upper.co/job/c029c993-5d9b-44fb-a968-b8c85c14c752?sourcerId=9a3ffee2-3acb-4f85-a458-9a7469aa02bf
- Posted: Jun 06, 2025
- Type: Full-time
- Level: Mid-Senior
- Location: Romania
- Company: Upper