At Tata Technologies we make product development dreams a reality by designing, engineering and validating the products of tomorrow for the world’s leading manufacturers. Due to our continued growth, we are now recruiting for a Senior DevOps / SRE Engineer to strengthen our team in Gothenburg.
Scope of role
We are seeking a Senior DevOps / SRE Technical Engineer to serve as a key technical owner for cloud infrastructure, observability, reliability engineering, and cloud cost optimization across AWS and GCP.
This role carries clear accountability and measurable outcomes in the following areas:
1. End-to-end observability (design → implementation → continuous improvement)
2. Systematic cloud cost optimization across AWS & GCP (FinOps)
3. Production reliability governance and risk reduction
4. Root cause analysis (RCA) and systemic improvement of major incidents
You will be expected not only to design but also to deliver, operate, and be assessed against concrete results.
Responsibilities
1) End-to-End Observability
What you will own:
Independently design and implement a comprehensive end-to-end observability system covering:
• Infrastructure (AWS/GCP, Kubernetes, network, storage)
• Platform (message queues, databases, caches, API gateways)
• Application layer (microservices, critical business flows)
• Business layer (key business metrics)
You will be expected to produce:
1. Unified Observability Architecture Document
• Overall architecture diagram (Metrics + Logs + Traces)
• Data flow diagram (collection → processing → storage → visualization)
• Tooling selection and justification (e.g., Prometheus, Datadog, OpenTelemetry)
2. Standardized Observability Data Model
• Unified metrics naming conventions
• Standardized tracing model (Trace ID, Span, sampling strategy)
• Structured logging standard (JSON schema)
3. Operational Dashboards
• Infrastructure health dashboard
• Platform services health dashboard
• Business API and KPI dashboard
4. Alerting System
• Defined P0/P1/P2 alert levels
• Alert noise reduction strategy
• Automated alert routing by team/service
5. SLI / SLO / SLA Framework
• At least 5 critical business SLOs defined and tracked
• Clear error budget policy
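To illustrate what an error budget policy typically quantifies, the arithmetic behind an availability SLO can be sketched as follows. This is a minimal sketch only: the function names, the 30-day window, and the 99.9% target are assumptions chosen for illustration, not values defined for this role.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    A negative result means the budget has been exceeded.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        raise ValueError("no requests observed, budget is undefined")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over a 30-day window.
    print(round(error_budget_minutes(0.999, 30), 1))
    # Assumed example: 250 failures out of 1,000,000 requests.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A policy would then attach actions to thresholds on the remaining budget, for example pausing feature releases once it drops below an agreed level.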
2) Cloud Cost Optimization – FinOps (Core Requirement)
What you will own:
Lead systematic cost optimization across AWS and GCP without compromising performance, reliability, or user experience.
You will implement:
1. Unified Cost Visibility System
• Combined AWS + GCP cost dashboards
• Cost breakdown by: Team/Product/Service/Environment (Dev/Test/Stage/Prod)
2. Actionable Cost Optimization Plan
• Compute (EKS/GKE, EC2/Compute Engine, Serverless)
• Storage (S3/GCS tiering, lifecycle policies)
• Databases (RDS/Cloud SQL sizing, connection pooling, caching)
• Network costs (egress, cross-region traffic)
3. Cost Shift-Left Mechanisms
• Cost checks integrated into CI/CD
• Mandatory resource ownership and budget limits
• Quarterly cost reviews
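As a rough illustration of the cost breakdown by team, product, service, and environment described above, the sketch below groups billing line items by tag. It assumes AWS and GCP billing exports have already been normalized into plain dictionaries; the record fields (`cloud`, `team`, `environment`, `cost_usd`) and the sample figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical, already-normalized line items from AWS and GCP billing exports.
SAMPLE_RECORDS = [
    {"cloud": "aws", "team": "payments", "environment": "prod", "cost_usd": 120.0},
    {"cloud": "gcp", "team": "payments", "environment": "prod", "cost_usd": 80.0},
    {"cloud": "aws", "team": "search", "environment": "dev", "cost_usd": 45.5},
    {"cloud": "gcp", "environment": "dev", "cost_usd": 10.0},  # missing owner tag
]


def cost_by(records, *keys):
    """Sum cost_usd per combination of the given tag keys.

    Records missing a tag are grouped under "untagged" so that
    unowned spend stays visible instead of being dropped.
    """
    totals = defaultdict(float)
    for record in records:
        group = tuple(record.get(key, "untagged") for key in keys)
        totals[group] += record["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    for group, total in sorted(cost_by(SAMPLE_RECORDS, "team", "environment").items()):
        print(group, total)
```

Keeping an explicit "untagged" bucket is one way to support the mandatory resource ownership requirement, since untagged spend shows up as its own line in every report.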
3) Production Reliability & Incident Governance
What you will own:
Move from reactive “firefighting” to systematic reliability engineering.
Required Deliverables:
1. Incident Management Framework
• Standard P0/P1 incident response process
• RCA template and follow-up tracking mechanism
2. Reliability Governance Framework
• Error budget policy
• Standardized canary/gradual rollout process
• Automated rollback mechanisms
3. Risk Register
• Identified systemic risks and technical debt
• Prioritized remediation roadmap
4) Kubernetes & Multi-Cloud Platform Optimization
What you will deliver:
• Optimize EKS/GKE cluster architecture
• Improve stability (reduce OOMs, node instability, network issues)
• Improve resource utilization
Knowledge/Experience
Experience
• 5+ years of DevOps / SRE / Cloud Platform experience
• At least 3 years in a Staff/Principal or Tech Lead role
• Experience operating large-scale distributed systems in production
Cloud Expertise
• Deep expertise in both AWS and GCP
• Ability to design cross-cloud architectures
• Strong experience with Terraform / Pulumi / CDK
Observability Expertise
• Proven experience designing and implementing observability from scratch
• Deep hands-on experience with Prometheus/Grafana/Loki/Elastic/Kibana
Kubernetes
• Deep understanding of Kubernetes internals (Scheduler, Controllers, etcd, CNI, CRI)
• Experience managing large-scale production clusters
Programming
• Proficiency in Java, Python, or Go
Strong Plus
• Google SRE background or deep SRE practice
• Experience with Chaos Engineering
• Proven FinOps success cases
• Knowledge of eBPF and performance profiling
• Open-source contributions
• Experience designing multi-cloud disaster recovery (Active-Active or Active-Passive)
If you are passionate about bringing innovation to the projects you work on, then we would love to hear from you.
Tata Technologies: Engineering a better world.
Tata Technologies would like to thank all applicants for their interest; each application will be reviewed against the set criteria for the role. Please note that only candidates under consideration will be contacted. If you do not hear from us within 10 working days following the closing date, it will mean that, unfortunately, your application has not been successful. We will, however, retain your details for any suitable future opportunities.
Posted: Feb 17, 2026
Type: Full-time
Level: Mid-Senior
Location: Gothenburg
Company: Tata Technologies