[LPS] Site Reliability Engineer

LPS

Singapore · Full-time · Mid-Senior

Job Summary

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of enterprise systems in a managed services (Day 2) environment.

The role operates as a centralised reliability function across application, infrastructure, and vendor support layers, governing operational activities to ensure that incidents, changes, and patching activities are executed without impacting service stability or SLA commitments.

The SRE works closely with Application Engineers (L1.5) and Application Vendors (L2), providing oversight, risk control, and engineering-driven improvements to maintain a stable and resilient production environment.

Key Responsibilities

1. Reliability & Service Assurance

Own end-to-end service reliability, including availability, performance, and system stability
Define and track reliability metrics (e.g., uptime, latency, error rates)
Ensure SLA compliance through proactive monitoring and operational governance
Establish service health indicators and early warning mechanisms

2. Monitoring & Observability

Design and implement monitoring, logging, and alerting frameworks across application and infrastructure layers
Define alert thresholds and reduce alert noise to improve signal quality
Develop dashboards and reporting for real-time visibility of system health
Continuously enhance observability coverage across services

3. Incident Management & RCA

Lead major incident management (P1/P2) as incident commander
Perform end-to-end root cause analysis (RCA) across application, infrastructure, and vendor domains
Coordinate with Application Engineers and Vendors for issue resolution
Drive preventive and corrective actions to reduce incident recurrence

4. Change & Patch Governance

Assess operational risks associated with changes, releases, and patching activities
Work with Application Engineers (L1.5) and Vendors (L2) to ensure safe execution of application patches
Perform pre- and post-change validation to ensure system stability
Govern Go/No-Go decisions and support rollback planning in case of service degradation

5. Performance & Capacity Management

Monitor and optimise system performance across application and infrastructure layers
Conduct capacity planning and forecasting to ensure scalability and resilience
Identify and address performance bottlenecks proactively

6. Automation & Continuous Improvement

Drive automation of operational processes, including monitoring, recovery, and validation
Implement self-healing and resilience mechanisms where applicable
Develop and maintain operational runbooks and automation scripts
Continuously improve system reliability through engineering practices

7. Collaboration & Governance

Work closely with Application Engineers (L1.5) on the execution of operational activities
Collaborate with Vendors (L2) for defect resolution and product-level fixes
Ensure compliance with governance, security, and audit requirements
Support service reviews, reporting, and continuous improvement initiatives

Requirements

Core Requirements

Experience in Site Reliability Engineering, DevOps, or production operations in enterprise environments
Strong understanding of cloud platforms (preferably AWS)
Experience with monitoring and observability tools
Strong troubleshooting capability across application and infrastructure layers
Experience in incident management and root cause analysis
Familiarity with ITIL processes (incident, problem, change management)

Preferred

Experience in a system integrator or managed services (Day 2 operations) environment
Exposure to enterprise applications (e.g., IWMS platforms such as Archibus or similar)
Experience with automation and scripting (Python, PowerShell, etc.)
Knowledge of performance tuning and capacity planning

Key Competencies

Strong analytical and problem-solving skills
Ability to lead during high-pressure incidents
Structured and governance-driven mindset
Proactive approach to reliability and continuous improvement
Strong stakeholder coordination across internal teams and vendors

Key Skills

Ranked by relevance

SLA, PowerShell, Python, DevOps, Cloud, ITIL

Posted: Apr 01, 2026
Type: Full-time
Level: Mid-Senior
Location: Singapore
Company: LPS

Industries

IT Services, IT Consulting
