About the Company
The company is a rapidly growing technology firm operating at the forefront of artificial intelligence and advanced software solutions. It fosters a fast-paced, collaborative, and innovation-driven culture, uniting talent across engineering, research, and product teams to create impactful solutions. This role offers the opportunity to work on exciting projects, leverage cutting-edge technologies, and make a real difference in the AI and mobile development space.
Key Responsibilities
Cluster Operations & Management
- Manage and maintain container clusters (e.g., Kubernetes, Docker) and open-source component clusters (e.g., Kafka, Redis, Elasticsearch) across multiple environments and business units.
- Monitor and optimize distributed systems to ensure high performance, scalability, and reliability.
Infrastructure Platform Development
- Design, build, and improve infrastructure operations platforms.
- Develop and maintain solutions for infrastructure management, CI/CD pipelines, monitoring and alerting systems, and centralized logging.
- Lead platform standardization efforts and drive automation to streamline operations.
High Availability & Reliability
- Ensure maximum uptime for production services through proactive monitoring, rapid incident response, and root cause analysis.
- Continuously refine service architecture, deployment strategies, and operational processes for improved resilience.
- Implement and maintain SLA/SLO frameworks, applying reliability engineering best practices.
Automation & Process Improvement
- Develop automated systems for operations and maintenance to minimize manual intervention.
- Create self-service tools and workflows to boost team productivity.
- Define and enforce best practices for infrastructure-as-code, configuration management, and change control.
Required Qualifications
Experience & Education
- Minimum 2 years of hands-on experience in Systems Operations, DevOps, or Site Reliability Engineering (SRE).
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline preferred.
Cloud & Infrastructure
- Familiarity with public cloud platforms (AWS, Azure, or GCP) is highly valued.
- Strong understanding of large-scale internet architectures and distributed systems.
- Proven experience with infrastructure monitoring, logging, and observability tools.
Technical Skills
- Proficiency in scripting and automation (e.g., Shell, Python).
- Strong knowledge of containerization technologies (Kubernetes, Docker).
- Hands-on experience managing production-grade container clusters and maintaining CI/CD pipelines.
- Familiarity with infrastructure components such as Nginx, MySQL, Redis, Kafka, and Elasticsearch.
Advanced Networking (Preferred)
- Experience with Service Mesh architectures, Cilium CNI, and eBPF technologies.
- Understanding of network security, load balancing, and traffic management.
- Knowledge of cloud-native networking patterns and best practices.
If you’re ready to make an impact in a role that combines software development with cutting-edge AI, we encourage you to apply. Please note that only shortlisted candidates will be contacted.
CEI: 23S1921
- Posted: Aug 12, 2025
- Type: Full-time
- Level: Associate
- Location: Singapore
- Company: Tardis Group