N2S.Global
Site Reliability Engineer
Australia, posted 5 hours ago
Full-time, Information Technology

A Site Reliability Engineer (SRE) bridges the gap between software development and IT operations to ensure systems are reliable, scalable, and efficient. They apply software engineering principles to operational challenges, focusing on automation, monitoring, and performance optimization.

Key Objectives

  • Maintain high availability and performance of production systems.
  • Automate manual processes to improve efficiency and reduce human error.
  • Monitor system health and proactively prevent incidents.
  • Balance feature development speed with system reliability using SLIs, SLOs, and error budgets.

Core Responsibilities

  • Run and monitor production environments, ensuring uptime and reliability.
  • Build software and systems to manage infrastructure and applications.
  • Partner with development teams for testing, release procedures, and capacity planning.
  • Create sustainable systems through automation and continuous improvement.
  • Respond to on-call incidents, troubleshoot issues, and implement fixes.
  • Develop disaster recovery plans and ensure compliance with SLAs.

Required Skills & Qualifications

  • Bachelor’s degree in Computer Science or a related field.
  • Strong programming skills in Python, Java, Go, or similar languages.
  • Experience with cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes).
  • Familiarity with CI/CD tools, monitoring systems (Prometheus, Grafana), and configuration management and infrastructure-as-code tools (Ansible, Terraform).
  • Knowledge of distributed systems, networking, and storage technologies.

Preferred Attributes

  • Problem-solving mindset with a focus on automation and scalability.
  • Ability to work in cross-functional teams and communicate effectively.
  • Experience with incident management and performance tuning.
