unosquare
8282 - Site Reliability Engineer Cloud, Infrastructure and ITOps
unosquare · Argentina · 11 hours ago
Full-time · Engineering, Information Technology
Job Description

Our client is revolutionizing data management and analytics with an innovative cloud data platform, purpose-built for petabyte-scale datasets. Its mission is to help organizations drastically reduce data costs while increasing their data retention.

We are looking for a Site Reliability Engineer (SRE) to join our dynamic Services team. In this role, you will contribute to the reliability and scalability of our cutting-edge platform, ensuring exceptional solutions tailored to our customers’ unique needs. This is a highly technical, hands-on role that requires deep expertise in system reliability and automation.

Key Responsibilities

  • Infrastructure Reliability: Deploy, maintain, and ensure a highly reliable fleet of Kubernetes clusters and deployments across multiple cloud platforms.
  • Service Optimization: Design, implement, and maintain systems and processes to enhance the reliability, availability, and performance of our services.
  • CI/CD Management: Build and optimize CI/CD tools and processes to ensure efficient and reliable deployments.
  • Monitoring and Incident Response: Develop and manage monitoring, alerting, and incident response strategies to minimize downtime and enable rapid recovery.
  • Root Cause Analysis: Conduct comprehensive root cause analyses for system failures, implementing long-term preventive measures.
  • Automation and Efficiency: Automate repetitive tasks and optimize system performance to improve operational efficiency.
  • On-Call Support: Participate in on-call coverage during weekday business hours and take a weekend shift once a month.

Collaboration and Customer Engagement

  • Cross-Functional Teamwork: Work closely with software engineering, infrastructure, and product teams to integrate reliability practices into every stage of the development lifecycle.
  • Reliability Advocacy: Champion SRE best practices and foster a culture of operational excellence across the organization.
  • Global Team Collaboration: Collaborate with a distributed team of engineers worldwide to provide round-the-clock support.
  • Customer Support: Interface with customers to address and resolve reported incidents, ensuring a seamless user experience.

Qualifications and Skills

  • SRE Expertise: Proven experience as a Site Reliability Engineer or in a similar role, with at least five years supporting complex distributed systems.
  • Observability Tools: Experience with monitoring and debugging tools like Prometheus, Vector, Grafana, Superset, or Kibana.
  • Cloud Platforms: Proficiency in at least one major cloud platform (AWS, GCP, Azure, or Linode).
  • Database Knowledge: Experience with SQL databases; familiarity with PostgreSQL is a plus but not required.
  • Programming Skills: Proficiency in programming languages such as Python, Go, or Rust.
  • Linux Expertise: Strong experience with Linux systems, including performance tuning and system-level troubleshooting.
  • Communication Skills: Excellent written and verbal communication skills, with the ability to convey technical concepts clearly to diverse audiences, including customers and cross-functional teams.
