About the job
Tech Mahindra represents the connected world, offering innovative and customer-centric information technology experiences, enabling Enterprises, Associates and Society to Rise™. We are a USD 5.1 billion company with 126,200+ professionals across 90 countries, helping 1058 global customers including Fortune 500 companies. We are focused on leveraging next-generation technologies including 5G, Blockchain, Cybersecurity, Artificial Intelligence, and more, to enable end-to-end digital transformation for global customers. Tech Mahindra is one of the fastest-growing brands and amongst the top 15 IT service providers globally.
Why Tech Mahindra?
You will work within a successful, established and trusted organisation, where the opportunities are limited only by your aspirations. Do you want to travel the world? Do you want to keep learning new skills and developing professionally and personally? We offer flexible work, an excellent salary, healthcare, a dedicated training/certification platform and unrivalled recognition throughout the business.
Role:
We are seeking a Senior Observability Engineer with expertise in configuring and optimizing monitoring tools such as Dynatrace, Elasticsearch, and Nagios XI. In this role, you will play a crucial part in ensuring system reliability and aligning observability practices with Site Reliability Engineering (SRE) standards.
Key Responsibilities:
• Observability and Monitoring Strategy: Design and implement end-to-end observability solutions that align with SRE standards. Provide comprehensive monitoring and alerting to track system health, performance, and reliability.
• Tool Configuration and Management: Configure, deploy, and manage Dynatrace, Elasticsearch, and Nagios XI to monitor critical applications, infrastructure, and network components, supporting real-time visibility into service performance.
• SRE Standard Implementation: Collaborate with engineering and operations teams to develop and implement observability practices that meet SRE standards, such as setting SLAs, SLOs, and error budgets.
• Performance Optimization: Work with development and infrastructure teams to identify performance bottlenecks and optimize applications, with a focus on meeting SRE metrics for system reliability and availability.
• Incident Management and Root Cause Analysis: Develop alerting and escalation processes based on SRE best practices. Lead or support incident response and perform root cause analysis to continuously improve reliability.
• Data Analysis and Dashboarding: Set up and maintain dashboards, log management, and metric visualizations in Dynatrace, Elasticsearch, and Nagios XI. Provide insights into performance trends and system health in alignment with SRE goals.
• Documentation and Mentorship: Create clear documentation of observability practices, configuration details, and troubleshooting guidelines. Mentor junior team members and promote an SRE-driven observability mindset.
Requirements:
• Experience: 5+ years in observability, monitoring, or related engineering roles, with a focus on SRE standards and at least 3 years working with Dynatrace, Elasticsearch, and Nagios XI.
• Technical Skills:
  • Expert experience with Dynatrace for application performance monitoring and troubleshooting.
  • Proficiency with Elasticsearch for log analysis, data indexing, and search optimization.
  • Strong experience configuring and managing Nagios XI for infrastructure monitoring.
  • Scripting skills (e.g., Python, Bash) to automate monitoring, data collection, and reporting.
• SRE Knowledge:
  • Strong understanding of SRE principles and best practices, including SLAs, SLOs, error budgets, and incident response.
  • Familiarity with tools and practices for observability in distributed systems, microservices, and cloud-based infrastructure.
• Preferred Additional Skills:
  • Experience with cloud platforms (AWS, Azure, GCP) and monitoring of cloud-native environments.
  • Familiarity with additional observability tools (e.g., Prometheus, Grafana) and infrastructure as code (e.g., Terraform, Ansible).
  • Knowledge of ITSM/incident management tools (e.g., ServiceNow, PagerDuty).
• Soft Skills:
  • Strong analytical and troubleshooting skills.
  • Effective communication skills, including conveying technical information to stakeholders.
  • Collaborative mindset with a proactive approach to system reliability.
For further information please contact Gayathri Ganapathy at [email protected]
Posted: Nov 18, 2024
Type: Full-time
Level: Associate
Location: Sydney
Company: Tech Mahindra