The Operations Architect defines and governs the operational model for enterprise platform capabilities delivered by multiple vendors, ensuring solutions are production-ready, observable, secure, and supportable at scale. The role designs end-to-end service management practices (SLOs/SLAs, monitoring, incident/change/problem management, DR, and capacity/cost controls) and ensures operational requirements are embedded from design through delivery.
Working with platform/cloud, security, and solution architects, as well as vendor teams and operations teams, the architect drives operations readiness reviews, creates runbooks and support processes, and enables a consistent, efficient operating model across cloud-agnostic deployments.
Duties & Responsibilities
- Define operational architecture and service management model across capabilities (ITIL-aligned where applicable).
- Establish observability standards: metrics/logs/traces/audits, OpenTelemetry instrumentation, dashboarding, alerting, and anomaly detection.
- Define SLOs/SLAs/OLAs, error budgets, and operational KPIs; ensure vendors deliver evidence and meet acceptance gates.
- Design incident management workflows (triage, escalation, RCA), integrate with ITSM, and standardize runbooks/playbooks.
- Define change and release management practices (CAB inputs, deployment rings, canary/rollback, feature-flag coordination).
- Establish resiliency and DR requirements: backup/restore patterns, RPO/RTO targets, DR testing cadence, and failover runbooks.
- Define capacity, performance, and availability engineering processes (load testing, scaling policies, GPU/TPU capacity planning).
- Implement security operations integration: SIEM/SOAR alignment, alert routing, vulnerability/patch management SLAs.
- Define FinOps operational controls: tagging standards, showback/chargeback, budgets, anomaly detection, cost optimization playbooks.
- Lead operational readiness and handover: L1/L2/L3 training, reverse-shadowing, SOPs, and post-go-live stabilization plans.
Skills & Abilities
- Strong expertise in operating cloud-native platforms: SRE/ITIL practices, reliability engineering, and service management.
- Ability to turn NFRs into measurable SLOs, monitoring, and operational acceptance criteria.
- Solid understanding of observability stacks and telemetry design (OTel, APM, SIEM integration).
- Experience designing DR/BCP, backup strategies, and operational test plans in regulated environments.
- Proven capability to drive operational standardization across multiple vendors and teams.
Education & Background
- Bachelor’s degree in Computer Science, Information Technology, Cybersecurity, or related field; Master’s degree highly preferred.
- 8+ years in operations architecture, SRE, DevOps leadership, or service management for enterprise platforms.
- Experience running production systems on Azure plus exposure to at least one other cloud (GCP/AWS) and hybrid setups.
- Experience with ITSM tooling and processes (incident/change/problem, CMDB), including KPI/SLA reporting.
- Proven experience with monitoring/APM and security operations integration (SIEM, vulnerability management).
- Certifications desirable: ITIL, SRE-related training, Azure/AWS/GCP operations certifications, Kubernetes CKA/CKS (optional).
Preferred Tools / Soft Skills
Preferred Tools
- Observability/APM: OpenTelemetry, Dynatrace/Datadog, Prometheus/Grafana/Loki/Tempo (as applicable)
- ITSM & operations: ServiceNow (or equivalent), CMDB, PagerDuty/Opsgenie-style on-call tooling
- Security & cloud ops: Microsoft Sentinel, Defender for Cloud, Azure Monitor/Log Analytics, Kubernetes tooling
Soft Skills
- Calm, structured leadership during incidents and high-pressure escalations
- Strong facilitation skills for readiness reviews, RCAs, and cross-vendor alignment
- Clear documentation and operational discipline (runbooks, SOPs, checklists)
- Continuous improvement mindset and ability to drive measurable reliability gains
- Strong collaboration and influencing skills across engineering, security, and vendor teams
Join Starlink Qatar and take your career to the next level!

