Our client is revolutionizing data management and analytics with an innovative cloud data platform purpose-built for petabyte-scale datasets. Its mission is to help organizations drastically reduce data costs while increasing their data retention.
The client is looking for a Site Reliability Engineer (SRE) to join its Services team. In this role, you will contribute to the reliability and scalability of the platform and help deliver solutions tailored to customers’ unique needs. This is a highly technical, hands-on role that requires deep expertise in system reliability and automation.
Key Responsibilities
- Infrastructure Reliability: Deploy, maintain, and ensure a highly reliable fleet of Kubernetes clusters and deployments across multiple cloud platforms.
- Service Optimization: Design, implement, and maintain systems and processes to enhance the reliability, availability, and performance of our services.
- CI/CD Management: Build and optimize CI/CD tools and processes to ensure efficient and reliable deployments.
- Monitoring and Incident Response: Develop and manage monitoring, alerting, and incident response strategies to minimize downtime and enable rapid recovery.
- Root Cause Analysis: Conduct comprehensive root cause analyses for system failures, implementing long-term preventive measures.
- Automation and Efficiency: Automate repetitive tasks and optimize system performance to improve operational efficiency.
- On-Call Support: Participate in the on-call rotation, covering weekday business hours and one weekend shift per month.
- Cross-Functional Teamwork: Work closely with software engineering, infrastructure, and product teams to integrate reliability practices into every stage of the development lifecycle.
- Reliability Advocacy: Champion SRE best practices and foster a culture of operational excellence across the organization.
- Global Team Collaboration: Collaborate with a distributed team of engineers worldwide to provide round-the-clock support.
- Customer Support: Interface with customers to address and resolve reported incidents, ensuring a seamless user experience.
Qualifications
- SRE Expertise: Proven experience as a Site Reliability Engineer or in a similar role, with at least five years supporting complex distributed systems.
- Observability Tools: Experience with monitoring and debugging tools like Prometheus, Vector, Grafana, Superset, or Kibana.
- Cloud Platforms: Proficiency in at least one major cloud platform (AWS, GCP, Azure, or Linode).
- Database Knowledge: Experience with SQL databases; familiarity with PostgreSQL is a plus but not required.
- Programming Skills: Proficiency in programming languages such as Python, Go, or Rust.
- Linux Expertise: Strong experience with Linux systems, including performance tuning and system-level troubleshooting.
- Communication Skills: Excellent written and verbal communication skills, with the ability to convey technical concepts clearly to diverse audiences, including customers and cross-functional teams.
Join unosquare and take your career to the next level!

