My client is seeking a highly motivated and skilled Site Reliability Engineer in Singapore.
The primary focus is on applying software engineering principles to build the tools and automation necessary to ensure system reliability. The ideal candidate will have a strong background in either Site Reliability Engineering or Software Engineering, with a passion for driving operational excellence that directly impacts key business metrics.
Key Responsibilities
- Design and implement robust, real-time monitoring and alerting systems to ensure continuous service availability and rapid detection of issues.
- Develop and manage a centralized dashboard that aggregates disaster metadata, historical trends, and communication links so that upper management can quickly assess infrastructure disruptions.
- Drive the implementation and testing of comprehensive Disaster Recovery strategies to minimize downtime and ensure business continuity.
- Collaborate with development teams to optimize the performance and resilience of our Microservices architecture.
- Establish and maintain robust monitoring systems to significantly enhance performance visibility and debugging capabilities.
- Apply software engineering practices to automate operational tasks that reduce disaster recovery time and minimize operational costs.
Qualifications
- Bachelor’s degree in Computer Science or a related technical field (preferred).
- 1+ years of experience in systems operations or site reliability engineering.
- Proven expertise in establishing and maintaining monitoring systems with Prometheus and Grafana.
- Demonstrated experience in real-time monitoring and implementing effective Disaster Recovery solutions.
- Experience working with and optimizing systems built on a Microservices architecture.
- Strong analytical and problem-solving skills, with a focus on enhancing performance visibility and debugging.
- Ability to translate complex operational data into clear, actionable insights for a centralized management dashboard.
- Proficiency in a programming language (e.g., Python, Go) for automation and tooling development.
Regrettably, only shortlisted candidates will be notified.
Please note that data provided is for recruitment purposes only.
Business Registration No.: 202004228R | License No.: 20S0118 | EA Registration No.: R1986587

