Job Summary
The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of enterprise systems in a managed services (Day 2) environment.
The role operates as a centralised reliability function across application, infrastructure, and vendor support layers, governing operational activities so that incident response, changes, and patching are carried out without impacting service stability or SLA commitments.
The SRE works closely with Application Engineers (L1.5) and Application Vendors (L2), providing oversight, risk control, and engineering-driven improvements to maintain a stable and resilient production environment.
Key Responsibilities
1. Reliability & Service Assurance
- Own end-to-end service reliability, including availability, performance, and system stability
- Define and track reliability metrics (e.g., uptime, latency, error rates)
- Ensure SLA compliance through proactive monitoring and operational governance
- Establish service health indicators and early warning mechanisms
2. Monitoring & Observability
- Design and implement monitoring, logging, and alerting frameworks across application and infrastructure layers
- Define alert thresholds and reduce alert noise to improve signal quality
- Develop dashboards and reporting for real-time visibility of system health
- Continuously enhance observability coverage across services
3. Incident Management & RCA
- Lead major incident management (P1/P2) as incident commander
- Perform end-to-end root cause analysis (RCA) across application, infrastructure, and vendor domains
- Coordinate with Application Engineers and Vendors for issue resolution
- Drive preventive and corrective actions to reduce incident recurrence
4. Change & Patch Governance
- Assess operational risks associated with changes, releases, and patching activities
- Work with Application Engineers (L1.5) and Vendors (L2) to ensure safe execution of application patches
- Perform pre- and post-change validation to ensure system stability
- Govern Go/No-Go decisions and support rollback planning in case of service degradation
5. Performance & Capacity Management
- Monitor and optimise system performance across application and infrastructure layers
- Conduct capacity planning and forecasting to ensure scalability and resilience
- Identify and address performance bottlenecks proactively
6. Automation & Continuous Improvement
- Drive automation of operational processes, including monitoring, recovery, and validation
- Implement self-healing and resilience mechanisms where applicable
- Develop and maintain operational runbooks and automation scripts
- Continuously improve system reliability through engineering practices
7. Collaboration & Governance
- Work closely with Application Engineers (L1.5) for execution of operational activities
- Collaborate with Vendors (L2) for defect resolution and product-level fixes
- Ensure compliance with governance, security, and audit requirements
- Support service reviews, reporting, and continuous improvement initiatives
Requirements
Core Requirements
- Experience in Site Reliability Engineering, DevOps, or production operations in enterprise environments
- Strong understanding of cloud platforms (preferably AWS)
- Experience with monitoring and observability tools
- Strong troubleshooting capability across application and infrastructure layers
- Experience in incident management and root cause analysis
- Familiarity with ITIL processes (incident, problem, change management)
Preferred
- Experience in a system integrator or managed services (Day 2 operations) environment
- Exposure to enterprise applications (e.g., IWMS platforms such as Archibus or similar)
- Experience with automation and scripting (Python, PowerShell, etc.)
- Knowledge of performance tuning and capacity planning
Key Competencies
- Strong analytical and problem-solving skills
- Ability to lead during high-pressure incidents
- Structured and governance-driven mindset
- Proactive approach to reliability and continuous improvement
- Strong stakeholder coordination across internal teams and vendors
- Posted: Apr 01, 2026
- Type: Full-time
- Level: Mid-Senior
- Location: Singapore
- Company: LPS