Site Reliability Engineer - L2 Support

Open Innovation AI

United Arab Emirates · Full-time · Mid-Senior

Company Overview

Open Innovation AI is a global technology company that specializes in developing advanced solutions for managing AI workloads. Its flagship product, the Open Innovation Cluster Manager (OICM), orchestrates complex AI tasks efficiently across diverse infrastructures. The platform is hardware-agnostic, optimized for a range of GPUs and accelerator hardware, and facilitates seamless integration and scalability for enterprise AI applications. Open Innovation AI focuses on optimizing and simplifying AI workload management and making AI technologies accessible to organizations of all sizes. With its innovative solutions, companies can reduce operational costs, accelerate time to value, and maximize their return on investment, ensuring that their AI strategies contribute directly to enhanced business outcomes.

Role Overview:

The Site Reliability Engineer – L2 is responsible for supporting and maintaining Open Innovation AI products and deployments across customer environments, including secure and isolated on-premises infrastructures. This role requires strong troubleshooting skills across hardware, Linux OS, Kubernetes, middleware, and application layers.

The engineer is expected to diagnose and resolve technical incidents, applying deep product knowledge and strong analytical skills to restore service availability. The role requires a solid understanding of operational processes such as Incident, Change, and Problem Management, along with a thorough grasp of the product architecture and how customers use it in production environments.

Role Responsibilities:

Provide L2 technical support for OICM deployments running in secure and isolated customer environments.
Diagnose and resolve incidents across hardware, Linux OS, Kubernetes clusters, containerized services, middleware, and platform components.
Perform detailed analysis of logs, system behavior, and application output to identify root causes and restore service functionality.
Review, validate, and execute approved changes following Change Management procedures, including system updates, configuration adjustments, and component upgrades.
Maintain a strong understanding of the architecture of OICM and other OI products, including their services, dependencies, and typical customer usage patterns.
Collaborate with L1 and Service Desk teams by providing technical guidance, clarifying issue details, and ensuring accurate ticket triage.
Escalate complex, code-level, or product-defect issues to L3 with complete diagnostic information and structured analysis.
Conduct on-site platform health assessments, validating Kubernetes cluster status, service integrity, system resources, and overall environment readiness.
Work closely with the Systems Engineering team to analyze and resolve performance issues across compute, storage, networking, and Kubernetes layers, and ensure that identified optimizations are reflected in the product and operational practices.
Update and maintain technical documentation including SOPs, runbooks, troubleshooting steps, and known-issue guides.
Participate in post-incident reviews, contributing technical insights and recommending improvements to prevent recurrence.
Ensure all activities adhere to established Incident, Change, and Problem Management processes.
Support on-call rotations and provide timely assistance during high-priority or critical incidents.

Required Experience & Qualifications:

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
4–7 years of experience in L2 technical support, SRE, DevOps, Infrastructure Operations, or Platform Engineering roles within on-prem or secure environments.
Strong proficiency in Linux system administration, including troubleshooting, log analysis, service management, and performance tuning.
Hands-on experience with Kubernetes, container runtimes, and distributed systems deployed in on-prem environments.
Solid understanding of compute, storage, networking, and virtualization layers relevant to enterprise installations.
Practical experience with middleware and data-layer components such as Kafka, Redis, PostgreSQL, or similar technologies used in distributed on-prem environments.
Strong understanding of ITIL-aligned processes and experience operating within structured operational frameworks.
Ability to diagnose complex issues across multiple layers of the stack.
Experience working in secure, restricted, or isolated environments is an advantage.
Excellent analytical skills, communication abilities, and a methodical approach to troubleshooting.
Ability to produce clear technical documentation, including SOPs, runbooks, and investigation reports.
Certifications such as RHCSA/RHCE, CKA/CKAD/CKS.

Key Skills

Ranked by relevance

AI, Kubernetes, Linux, storage, system administration, virtualization, PostgreSQL, DevOps, Redis, Kafka, ITIL
