DevOps Manager

Atlas Search | United States | 3 days ago

Full-time | Information Technology

About the Opportunity

Our client, a rapidly growing fintech innovator specializing in global B2B payment solutions, is seeking a Lead DevOps Engineer to drive architectural, operational, and team leadership across a high-impact DevOps function. The role combines hands-on technical engineering with strategic ownership of AWS cloud infrastructure, secure networking, observability, and cost-efficient operations.

Job Responsibilities

Guide and mentor the DevOps team, including direct management of two engineers, fostering a culture of reliability, security, and continuous improvement
Partner with executive leadership on DevOps strategy, roadmaps, and organizational alignment across engineering initiatives
Architect, implement, and operate AWS infrastructure leveraging Terraform, CloudFormation, or CDK to ensure scalability, resilience, and cost efficiency
Design and manage secure networking patterns, including VPCs, Transit Gateways, load balancers, DNS, API gateways, and hybrid connectivity
Build and maintain end-to-end CI/CD pipelines utilizing GitHub and GitHub Actions for automated deployments across multiple teams
Operate and optimize Kubernetes (EKS and upstream), including lifecycle management, networking, RBAC, autoscaling, and workload security
Develop and manage containerized environments with Docker, creating standardized images, build pipelines, and enforcing runtime best practices
Implement and oversee cloud security controls including IAM architecture, secrets management, encryption, and compliance guardrails
Lead observability strategy using Datadog for comprehensive metrics, logs, traces, dashboards, and alerts across distributed systems
Drive incident response, root-cause analysis, and long-term reliability improvements
Develop internal tools and automation to streamline operational processes and enhance developer experience
Lead efforts on cloud cost management, monitoring AWS spend, optimizing resource usage, and collaborating with finance for cost forecasting
Maintain high-quality documentation for architecture, operational procedures, and infrastructure standards
Create and manage runbooks for incident handling, on-call workflows, and operational tasks to ensure reliable production support
Oversee disaster recovery planning and execution, including RTO/RPO targets and cross-team exercises
Manage backup and restore processes for critical systems, ensuring integrity, compliance, and regular validation
Promote and integrate AI-assisted engineering practices to enhance automation, code quality, and team efficiency
Establish DevOps and SRE best practices at scale, influencing architectural decisions across the organization

Job Requirements

Extensive professional experience in senior or lead DevOps/SRE roles, including direct mentorship or management of engineering staff
Advanced expertise designing and operating AWS production environments (EC2, EKS, ECS, Lambda, RDS, DynamoDB, S3, IAM, VPC, CloudFront)
Strong foundation in networking fundamentals: TCP/IP, DNS, routing, load balancing, VPNs, firewalls, and distributed systems communication
Deep knowledge of cloud security: IAM, least-privilege access, encryption, secrets management, vulnerability response, and incident handling
Hands-on experience with Datadog for monitoring, metrics, logs, tracing, and alerting
Strong proficiency with the Docker container lifecycle, including build standards, runtime, and orchestration
Demonstrated production experience integrating and managing Kubernetes (EKS or upstream clusters)
Expertise in CI/CD automation using GitHub and GitHub Actions
Strong scripting or programming skills in Python, Go, or Bash
Experience implementing infrastructure as code, preferably with Terraform
Familiarity with AI-assisted development tools and workflows to improve engineering productivity
Proven ability to lead complex technical initiatives, influence architectural decisions, and drive cross-team collaboration

Preferred:

Experience with service mesh technologies (e.g., Istio, Linkerd) or zero-trust networking models
SRE practice knowledge: SLOs, error budgets, chaos engineering, capacity planning
Prior exposure to distributed systems, high-throughput pipelines, or data platform engineering
AWS professional or security certifications
Track record of organizational adoption of AI-powered engineering tools
