Role Description
We are seeking a Senior Site Reliability Engineer for a full-time, on-site position in Germany. The successful candidate will be responsible for maintaining and enhancing the reliability, availability, and performance of critical systems and infrastructure. Key responsibilities include developing and optimizing system architecture, troubleshooting and resolving issues, automating workflows, and collaborating with other teams to ensure smooth operations and continuous system improvement. The role demands hands-on technical expertise and a proactive approach to identifying and mitigating potential system vulnerabilities.
Key Responsibilities:
- Platform Engineering & DevOps: Manage Kubernetes and container orchestration, including Helm chart configurations and CI/CD pipelines (Jenkins, ArgoCD). Develop automation scripts (Python, Bash, Go) and deploy Infrastructure-as-Code (IaC) solutions.
- Observability, Monitoring & Visualisation: Maintain Prometheus deployments (scrape configurations, alert rules, PromQL queries) and administer Thanos and Grafana.
- Elastic Stack Operations & Log Management: Configure and optimise Elasticsearch clusters, Logstash pipelines, and Kibana dashboards for secure, scalable log processing.
- Incident Response, Troubleshooting & Collaboration: Participate in 24x7 on-call rotations for rapid incident response, troubleshoot platform, data and performance issues, and engage in Major Incident Management (MIM).
- Secure Operations & Compliance: Ensure system operations meet security and data protection requirements, maintain secure documentation, and manage access control policies.
Requirements & Skills
- Strong grasp of Linux concepts, preferably in Kubernetes environments.
- Solid understanding of networking fundamentals and REST APIs.
- Proficiency in Python, Go, or Bash.
- Proficiency in Git-based configuration management workflows.
- Familiarity with Helm and with CI/CD tools such as Jenkins or ArgoCD.
- Experience with Elasticsearch and/or OpenSearch.
- Fluent communication skills in English and German.
- Willingness to provide shift-based 24x7 on-call support, including weekends and holidays.
- Must hold Ü2 security clearance.
- Citizenship of an EU and NATO member state is required; dual citizenship with a country outside these is not accepted.
- Must reside in Germany and hold a German employment contract.
- Two shifts: day and night (flexible).
- Monday to Friday: 3 days on-site and 2 days working from home (flexible).
Join AA Consultant Group and take your career to the next level!

