Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.
Role Summary
We are seeking a DevOps Engineer to design, operate, and continuously improve our Kubernetes-based AI infrastructure. This role focuses on cloud-native platform engineering, GPU-accelerated workloads, reliability, automation, and customer enablement.
You will play a key role in delivering a production-grade AI platform that enables ML engineers, data scientists, and enterprise customers to build and run AI workloads at scale.
You will be responsible for the reliability, scalability, and performance of our Kubernetes-based GPU platforms. You will ensure our AI platform operates securely and efficiently while delivering an exceptional customer experience. This is a hands-on platform engineering position focused on systems reliability, automation, and continuous improvement.
Key Responsibilities
Kubernetes Platform Operations:
- Operate and evolve a production Kubernetes environment supporting GPU-accelerated AI workloads.
- Manage cluster lifecycle (deployment, upgrades, scaling, resilience, multi-node operations).
- Implement high availability, failover, and maintenance strategies to minimize disruption.
- Enable as-a-service (aaS) capabilities and segmentation for multi-tenant workloads.
- Own infrastructure-as-code tooling and its lifecycle.
- Operate network overlays and storage (block, file, and object).
- Experience with Ansible, YAML, Terraform, Python, Jenkins and GitOps.
- Manage NVIDIA GPU infrastructure within Kubernetes (device plugins, drivers, CUDA compatibility).
- Implement GPU partitioning and workload isolation strategies (e.g., MIG, quotas, namespaces).
- Monitor and optimize GPU utilization, workload efficiency, and cluster capacity.
- Support AI/ML training and inference workloads with performance tuning and best practices.
- Design and maintain observability frameworks (metrics, logs, tracing).
- Implement proactive monitoring, alerting, and capacity planning.
- Lead incident response for platform-level events and drive root cause analysis.
- Automate operational workflows and infrastructure provisioning (IaC, configuration management).
- Contribute to platform reliability engineering practices (SLOs, SLAs, error budgets).
- Implement RBAC, network policies, and security hardening.
- Ensure secure multi-tenant workload isolation.
- Maintain compliance, data protection, and access governance standards.
- Support the customer lifecycle: onboarding, provisioning, and operations.
- Provide guidance on workload configuration, scaling strategies, and best practices.
- Collaborate with engineering and vendor teams to resolve complex platform issues.
- Produce high-quality technical documentation and operational playbooks.
Skills and Experience
- Strong hands-on experience operating production Kubernetes clusters.
- Experience with GPU-enabled Kubernetes environments.
- Solid Linux system administration, networking, storage and security skills.
- Experience with Infrastructure as Code and automation.
- Strong understanding of distributed systems, APIs, and cloud-native architectures.
- Experience implementing monitoring and observability solutions (e.g., Prometheus, Grafana).
- Proven incident management and root cause analysis experience.
- Strong communication skills and ability to work cross-functionally.
- Experience operating AI/HPC infrastructure.
- Deep understanding of Kubernetes scheduling, networking, and storage.
- Experience with high-performance datacentre networking and tuning.
- Background in DevOps or Site Reliability Engineering (SRE).
You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.
Diversity & Inclusion
Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
Key Skills
Ranked by relevance
kubernetes
ai
storage
devops
cloud
infrastructure as code
system administration
incident response
high availability
prometheus
terraform
jenkins
ansible
python
linux
- Posted: Apr 08, 2026
- Type: Full-time
- Level: Not Applicable
- Location: England
- Company: Era4
Industries
Technology
Information
Internet
Categories
Engineering
Information Technology