Responsibilities
- Ensure platform reliability and availability across production and pre-production environments through proactive monitoring, alerting, and automation
- Act as first responder for incidents and contribute to problem management and root cause analysis
- Support the development team's reliability efforts and build a solid reliability culture within the development lifecycle
- Develop troubleshooting documentation for production support resources
- Collaborate with Engineering teams to develop optimised and productive runbooks, operational documentation and automation of operational tasks
- Collaborate with development and cloud engineering teams to embed reliability and performance into the software delivery lifecycle
- Design, implement, and evolve observability solutions (metrics, logs, traces, dashboards) using tools such as Prometheus, Grafana, and ELK
- Participate in on-call rotations and continuously improve alert quality and response processes
- Champion a culture of reliability, performance, and continuous improvement across teams
Requirements
- Bachelor's or Master's degree in Engineering, or equivalent
- Experience operating at least one container orchestration cluster (Kubernetes, Docker Swarm)
- Experience developing or maintaining software for production services at scale
- Experience with ELK
- Experience with AWS
- Experience with Grafana/Prometheus stack
- Strong scripting skills (Bash, Python or Go)
- Excellent communication skills
- Thinking outside the box and anticipating challenges. It is imperative that we are not simply reactive; we must expect challenges and question the technologies, procedures and thinking already in place. You will be expected to constantly review and challenge at all levels
- Versatility. We work with agile/lean methods. We'd much rather iterate and learn than assume we know all the answers
- Being a team player. You don't (always) work in isolation and are excited by the thought of working with your team while involving product, experience design, engineering, and more in the process
- Telephony knowledge (SIP, VoIP)
- Experience in Linux administration (RedHat, CentOS, AL)
- Working knowledge of configuration management tools (Terraform, Ansible)
- Experience with TCP/IP and general networking concepts
- RDBMS knowledge (MySQL, Postgres)
- NoSQL knowledge (Redis)
Benefits
- Fixed compensation
- Long-term employment with vacation counted in working days
- Professional growth and development (courses, training, etc.)
- Being part of successful cutting-edge technology products that are making a global impact in the service industry
- Proficient and fun-to-work-with colleagues
- Apple gear
Join Omilia and take your career to the next level!