Senior Site Reliability Engineer (Azure)

We are looking for a skilled and adaptable Senior Site Reliability Engineer (SRE) to join our team and provide advanced third-line support for essential enterprise systems hosted on Azure. You will play a critical role in maintaining the reliability, performance, and availability of our cloud infrastructure, drawing on your expertise in Azure, DevOps, observability, and automation.

If you thrive in fast-paced environments and enjoy solving complex technical challenges, we’d love to hear from you.

Feel free to work remotely from anywhere across Lithuania or connect with colleagues at our Vilnius and Kaunas offices.

Responsibilities

Lead advanced troubleshooting and incident management for cloud-based systems, ensuring rapid resolution and root-cause analysis
Maintain and enhance system reliability, performance, and uptime across Azure environments
Implement and optimize observability, monitoring, and logging solutions using Azure Monitor, Application Insights, Log Analytics, and Prometheus
Automate infrastructure provisioning and management using Infrastructure-as-Code (IaC) tools like Terraform and scripting languages (Bash, PowerShell, Python)
Optimize deployment pipelines and ensure secure, scalable workflows in Azure DevOps
Collaborate with cross-functional teams to drive service improvements and share best practices
Proactively set up alerts and monitoring to prevent SLA degradation and ensure high availability
Conduct post-incident reviews and implement long-term reliability solutions
Support performance tuning and resource optimization for cloud workloads
Communicate effectively with both technical and non-technical stakeholders

Requirements

3+ years of experience in DevOps or Site Reliability Engineering
Proven expertise with Azure services, including AKS (Kubernetes), Azure Monitor, Application Insights, Log Analytics, Cosmos DB, PostgreSQL, and Azure DevOps
Strong hands-on experience with observability and monitoring tools (Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana)
Proficiency in Infrastructure-as-Code (Terraform) and scripting (Bash, PowerShell, Python)
Demonstrated incident management skills, including root-cause analysis and postmortem processes
Experience automating deployment pipelines and routine operational tasks
Excellent problem-solving and debugging skills in complex, real-time environments
Strong verbal and written communication skills for cross-team collaboration
Ability to prioritize and manage multiple tasks in a fast-paced, Agile environment
Minimum English language proficiency at the B2+ level

Nice to have

Experience with AWS services (EKS, RDS, CloudWatch, X-Ray) and AWS monitoring tools
Familiarity with distributed logging pipelines and resource optimization in AKS/EKS
Knowledge of advanced Kubernetes use cases (service scaling, network configurations)
Experience with incident automation tools and observability enhancements (Grafana, OpenSearch)
Relevant certifications in Azure, AWS, or Kubernetes

We offer

Engineering Heritage: Best-in-class experts sharing a culture of engineering excellence and tackling complex engineering challenges for over 30 years
Advanced Tech Stack: Innovative projects where you can apply or enhance your expertise in Cloud, Data, AI, and other emerging technologies
World-Class Clients: Work closely with 295+ of the Forbes Global 2000 on creating disruptive solutions that make a global impact
Professional Growth: Exceptional support for career development with comprehensive resources for upskilling or reskilling in pioneering practices
GenAI Community: Strong AI competencies with 600+ experts across 55+ locations driving GenAI-enabled transformation journeys
Entrepreneurial Culture: If you're passionate and dedicated to improving business transformation, we provide the support you need to bring your ideas to life
Hybrid Setup: The flexibility to work from any location in Lithuania, whether it's your home or our dynamic offices in Vilnius and Kaunas
Other Benefits: Additional vacation and trust days, private health insurance, Employee Stock Purchase Plan and more

About EPAM

EPAM is a leading global provider of digital platform engineering and development services. For over 30 years, our team has helped leading brands navigate the waves of digital transformation, building solutions that help them stay competitive through constant market disruption.

With offices in 55+ countries, EPAM has grown in Lithuania to more than 1,200 talented innovators in just 4 years. We foster creativity and unconventional ways of doing things, welcoming like-minded professionals to join us.

Salary range €3.8K-€5.5K gross, based on your experience and interview results.

Join our team in our cozy offices in Vilnius or Kaunas.

