Join our SRE L2 squad supporting ~1000 AWS-hosted services. You’ll own operational reliability, rapid triage, and proactive maintenance across production and non-prod, partnering closely with Cloud Engineering, SOC, and application teams.
Key Responsibilities
- Deliver 24×7 monitoring, incident response, and problem management; drive MTTA/MTTR reduction and SLO/SLI adherence.
- Perform preventive health checks; analyze ticket trends to implement continual service improvements and automation to reduce toil.
- Run blameless postmortems and deliver high-quality root cause analysis (RCA); maintain SOPs/runbooks and reliability dashboards.
- Configure/tune observability (Dynatrace, CloudWatch, ELK); enable self-healing workflows and workload optimizations.
- Support change/service requests within agreed SLAs; collaborate during transitions and onboard new AWS services.
Core Skills & Tools
- AWS: Lambda, ECS/Fargate/EC2, API Gateway, SNS/SQS, Kinesis, RDS; IAM/KMS foundations.
- Observability & ITSM: Dynatrace, CloudWatch, ELK; ServiceNow for incidents/changes; SLI/SLO dashboards.
- Toil Reduction
- Reliability Practices: Error budgets, capacity/performance benchmarking, automation/runbook execution, FinOps awareness.
Qualifications
- 5+ years SRE/DevOps or L2 operations for cloud-native stacks; strong AWS production experience.
- Proven incident/change/problem management in 24×7 environments; adept at RCA and postmortems.
- Hands-on with observability tooling and operational automation; excellent collaboration and documentation skills.
Shift Coverage & Locations
Follow-the-sun model with overlapping handoffs between Canada and India to ensure continuous support. Success is measured by uptime, MTTR/MTTD, change failure rate, error-budget consumption, SLO adherence, RCA quality, and continual service improvement (CSI) throughput.
- Posted: Apr 10, 2026
- Type: Full-time
- Level: Mid-Senior
- Location: Greater Toronto Area
- Company: HCLTech