LPS

[LPS] Site Reliability Engineer

Singapore · Full-time · Mid-Senior

Job Summary

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of enterprise systems in a managed services (Day 2) environment.

The role operates as a centralised reliability function across the application, infrastructure, and vendor support layers. It governs operational activities so that incident response, changes, and patching are carried out without compromising service stability or SLA commitments.

The SRE works closely with Application Engineers (L1.5) and Application Vendors (L2), providing oversight, risk control, and engineering-driven improvements to maintain a stable and resilient production environment.


Key Responsibilities

1. Reliability & Service Assurance

  • Own end-to-end service reliability, including availability, performance, and system stability
  • Define and track reliability metrics (e.g., uptime, latency, error rates)
  • Ensure SLA compliance through proactive monitoring and operational governance
  • Establish service health indicators and early warning mechanisms


2. Monitoring & Observability

  • Design and implement monitoring, logging, and alerting frameworks across application and infrastructure layers
  • Define alert thresholds and reduce alert noise to improve signal quality
  • Develop dashboards and reporting for real-time visibility of system health
  • Continuously enhance observability coverage across services


3. Incident Management & RCA

  • Lead major incident management (P1/P2) as incident commander
  • Perform end-to-end root cause analysis (RCA) across application, infrastructure, and vendor domains
  • Coordinate with Application Engineers and Vendors for issue resolution
  • Drive preventive and corrective actions to reduce incident recurrence


4. Change & Patch Governance

  • Assess operational risks associated with changes, releases, and patching activities
  • Work with Application Engineers (L1.5) and Vendors (L2) to ensure safe execution of application patches
  • Perform pre- and post-change validation to ensure system stability
  • Govern Go/No-Go decisions and support rollback planning in case of service degradation


5. Performance & Capacity Management

  • Monitor and optimise system performance across application and infrastructure layers
  • Conduct capacity planning and forecasting to ensure scalability and resilience
  • Identify and address performance bottlenecks proactively


6. Automation & Continuous Improvement

  • Drive automation of operational processes, including monitoring, recovery, and validation
  • Implement self-healing and resilience mechanisms where applicable
  • Develop and maintain operational runbooks and automation scripts
  • Continuously improve system reliability through engineering practices



7. Collaboration & Governance

  • Work closely with Application Engineers (L1.5) on the execution of operational activities
  • Collaborate with Vendors (L2) on defect resolution and product-level fixes
  • Ensure compliance with governance, security, and audit requirements
  • Support service reviews, reporting, and continuous improvement initiatives


Requirements

Core Requirements

  • Experience in Site Reliability Engineering, DevOps, or production operations in enterprise environments
  • Strong understanding of cloud platforms (preferably AWS)
  • Experience with monitoring and observability tools
  • Strong troubleshooting capability across application and infrastructure layers
  • Experience in incident management and root cause analysis
  • Familiarity with ITIL processes (incident, problem, change management)


Preferred

  • Experience in a system integrator or managed services (Day 2 operations) environment
  • Exposure to enterprise applications (e.g., IWMS platforms such as Archibus or similar)
  • Experience with automation and scripting (Python, PowerShell, etc.)
  • Knowledge of performance tuning and capacity planning



Key Competencies

  • Strong analytical and problem-solving skills
  • Ability to lead during high-pressure incidents
  • Structured and governance-driven mindset
  • Proactive approach to reliability and continuous improvement
  • Strong stakeholder coordination across internal teams and vendors

Key Skills

SLA, PowerShell, Python, DevOps, Cloud, ITIL
Posted: Apr 01, 2026
Type: Full-time
Level: Mid-Senior
Location: Singapore
Company: LPS
Industries: IT Services, IT Consulting
Categories: Information Technology
