Solas IT Recruitment
Senior Site Reliability Engineer - Kubernetes
Solas IT Recruitment, Ireland, 10 days ago
Full-time, Remote Friendly, Information Technology

Senior Site Reliability Engineer

  • Remote role with a cutting-edge, expanding organisation
  • 60-70K plus 5% Bonus, 6% Pension, Healthcare for you and family, Life Cover



Key Responsibilities:

  • Collaborate with engineering teams to enhance service quality through robust testing, performance tuning, and fault identification.
  • Develop automated solutions to maintain systems and services, ensuring smooth project execution by working closely with internal engineering teams.
  • Oversee system performance by implementing continuous monitoring and balancing feature development with system reliability, adhering to established service level objectives.
  • Contribute to the formulation of practices, technologies, and procedures to maintain Security, Compliance, and Availability requirements across system landscapes.
  • Manage, plan, and execute system upgrades to ensure minimal downtime and optimal system availability.



Required Skills & Qualifications:

  • Kubernetes: Extensive expertise in managing, deploying, and troubleshooting production Kubernetes clusters, with experience in container orchestration. Familiarity with Amazon EKS is an advantage.
  • Automation & Configuration Management: Proficiency with Ansible, Helm, and Kustomize for automating infrastructure provisioning and deployment. Skilled at managing Kubernetes manifests and ensuring streamlined application releases across different environments.
  • Monitoring Tools: Hands-on experience with systems like Prometheus and Grafana to monitor system health, identify issues, and optimize performance.
  • Cloud Infrastructure (AWS): Strong knowledge of AWS services such as EC2, S3, IAM, VPC, and associated tools for managing scalable cloud infrastructure.
  • Infrastructure as Code (IaC): Experience with Terraform for provisioning and maintaining cloud resources, ensuring repeatability and version control in cloud deployments.
  • Messaging & Queuing Systems: Familiarity with message brokers such as RabbitMQ or Kafka, or managed services like Amazon MQ, with experience in ensuring reliable communication between distributed systems.
  • Database Expertise: Strong background in managing cloud-based MySQL databases, particularly with Amazon RDS, focusing on high availability, security, and performance.
  • Networking & Security: Solid understanding of network security and design to ensure system protection, compliance, and industry-standard audit readiness.
  • High Availability Systems: Demonstrated experience in maintaining critical system uptime through fault tolerance, disaster recovery, and proactive monitoring to minimize downtime.
  • Collaboration & Cross-functional Teamwork: Proven ability to work effectively across multiple teams, departments, and stakeholders to execute project plans efficiently.
  • Programming: Competency in high-level programming languages such as Python, Go, or JavaScript, with a strong grasp of modern development tools and CI/CD pipelines for automating testing, deployment, and monitoring.
  • Problem Solving & Optimization: Strong problem-solving skills with a proactive approach to identifying bottlenecks, system issues, and opportunities for performance improvements.
