Site Reliability Engineer

As a Site Reliability Engineer (SRE), you will build the scalable infrastructure on which we deliver our software and help ensure the reliability, availability, and performance of our production and development infrastructure.

You will collaborate with cross-functional teams to drive reliability automation, optimize deployment strategies, and enhance infrastructure monitoring.

PRIMARY RESPONSIBILITIES

Develop and maintain automation and processes to improve system reliability and enable teams to build and deploy secure and scalable applications in AWS using technologies such as Kubernetes and Terraform.
Establish and maintain infrastructure and application monitoring systems.
Define and monitor SLIs, SLOs, and SLAs to ensure operational excellence.
Analyze usage trends to forecast infrastructure needs and ensure scalability.
Conduct load testing to validate system capacity and optimize performance.
Participate in the incident lifecycle: preparation, detection, response, analysis, and post-incident learning. Join the on-call rotation and be ready to respond promptly to team-critical or business-critical incidents.
Work closely with development teams in all phases of the SDLC to identify bottlenecks and areas for improvement.
Guide and encourage teams to follow SRE best practices.
Participate in operations efforts and be the point person for infrastructure activities.
Participate in architectural decisions to help improve the quality of our software and infrastructure.

QUALIFICATIONS

BS in Computer Science, Engineering, or a related field, or equivalent practical experience.
3 years of experience as an SRE or in a similar role.

KNOWLEDGE, SKILLS, AND ABILITIES

Strong problem-solving and analytical skills, including the ability to troubleshoot complex issues ranging from system resourcing and network problems to application stack traces.
Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog).
Strong proficiency in programming or scripting languages (e.g., Python or Bash).
Hands-on experience with Kubernetes, Docker, and infrastructure-as-code tools (e.g., Terraform).
Proven expertise in managing AWS Cloud Infrastructure.
Experience in Linux/Unix administration.
Ability to read and understand Java and Python code.
Excellent communication and collaboration abilities, including the ability to justify and advocate for the right solution.
Ability to work effectively in a cross-functional, fast-paced environment.
Nice to have:
Knowledge of database operations and performance optimization
Experience with GitLab
Experience with Atlassian services
Experience programming in Java or other OOP languages

PHYSICAL DEMANDS AND WORK ENVIRONMENT

Duties may require working outside normal working hours (evenings and weekends) at times.
Employment in a stable, well-recognized international company
Competitive salary and benefits
Friendly team of professionals
Flexible working hours
Medical insurance
Training programs and excellent travel conditions
Variety of recognition opportunities
Diversity-friendly environment
Modern office
