Tech Mahindra

Site Reliability Engineer

Australia · Full-time · Associate

About the job

Tech Mahindra represents the connected world, offering innovative and customer-centric information technology experiences, enabling Enterprises, Associates and Society to Rise™. We are a USD 5.1 billion company with 126,200+ professionals across 90 countries, helping 1058 global customers including Fortune 500 companies. We are focused on leveraging next-generation technologies including 5G, Blockchain, Cybersecurity, Artificial Intelligence, and more, to enable end-to-end digital transformation for global customers. Tech Mahindra is one of the fastest-growing brands and amongst the top 15 IT service providers globally.


Why Tech Mahindra?

You will work within a successful, established and trusted organisation, where the opportunities are limited only by your aspirations. Do you want to travel the world? Do you want to keep learning new skills and developing professionally and personally? We offer flexible work, an excellent salary, healthcare, a dedicated training and certification platform, and unrivalled recognition throughout the business.


Role:

We are seeking a Senior Observability Engineer with expertise in configuring and optimizing monitoring tools such as Dynatrace, Elasticsearch, and Nagios XI. In this role, you will play a crucial part in ensuring system reliability and aligning observability practices with Site Reliability Engineering (SRE) standards.


Key Responsibilities:

Observability and Monitoring Strategy: Design and implement end-to-end observability solutions that align with SRE standards. Provide comprehensive monitoring and alerting to track system health, performance, and reliability.

Tool Configuration and Management: Configure, deploy, and manage Dynatrace, Elasticsearch, and Nagios XI to monitor critical applications, infrastructure, and network components, supporting real-time visibility into service performance.

SRE Standard Implementation: Collaborate with engineering and operations teams to develop and implement observability practices that meet SRE standards, such as setting SLAs, SLOs, and error budgets.

Performance Optimization: Work with development and infrastructure teams to identify performance bottlenecks and optimize applications, with a focus on meeting SRE metrics for system reliability and availability.

Incident Management and Root Cause Analysis: Develop alerting and escalation processes based on SRE best practices. Lead or support incident response and perform root cause analysis to continuously improve reliability.

Data Analysis and Dashboarding: Set up and maintain dashboards, log management, and metric visualizations in Dynatrace, Elasticsearch, and Nagios XI. Provide insights into performance trends and system health in alignment with SRE goals.

Documentation and Mentorship: Create clear documentation of observability practices, configuration details, and troubleshooting guidelines. Mentor junior team members and promote an SRE-driven observability mindset.


Requirements:

Experience: 5+ years in observability, monitoring, or related engineering roles, with a focus on SRE standards and at least 3 years working with Dynatrace, Elasticsearch, and Nagios XI.

Technical Skills:

• Expert-level experience with Dynatrace for application performance monitoring and troubleshooting.

• Proficiency with Elasticsearch for log analysis, data indexing, and search optimization.

• Strong experience configuring and managing Nagios XI for infrastructure monitoring.

• Scripting skills (e.g., Python, Bash) to automate monitoring, data collection, and reporting.

SRE Knowledge:

• Strong understanding of SRE principles and best practices, including SLAs, SLOs, error budgets, and incident response.

• Familiarity with tools and practices for observability in distributed systems, microservices, and cloud-based infrastructure.

Preferred Additional Skills:

• Experience with cloud platforms (AWS, Azure, GCP) and monitoring of cloud-native environments.

• Familiarity with additional observability tools (e.g., Prometheus, Grafana) and infrastructure as code (e.g., Terraform, Ansible).

• Knowledge of ITSM and incident management tools (e.g., ServiceNow, PagerDuty).

Soft Skills:

• Strong analytical and troubleshooting skills.

• Effective communicator who can convey technical information to stakeholders.

• Collaborative mindset with a proactive approach to system reliability.


For further information please contact Gayathri Ganapathy at [email protected]

Posted: Nov 18, 2024

Type: Full-time

Level: Associate

Location: Sydney

Industries: IT Services, IT Consulting

Categories: Information Technology
