We are looking for a motivated MLOps Engineer to join our team, working remotely from Canada (Western time zones only: Pacific or Mountain). As an MLOps Engineer, you will bridge the gap between data science and operations, ensuring seamless integration, deployment, and management of machine learning models in production environments. Your mission will be to automate, scale, and monitor the entire ML lifecycle, leveraging your expertise in cloud infrastructure, DevOps practices, and scripting to deliver efficient, reliable, and secure data-driven solutions that support business innovation.
Key Responsibilities
- Architect, provision, and automate infrastructure on both hyperscaler CSPs and NCPs for AI/ML workloads.
- Build, optimize, and maintain end-to-end machine learning pipelines (CI/CD/CT) for continuous integration, delivery, and training in high-throughput, GPU-driven environments.
- Advance Infrastructure as Code (IaC) methods with tools such as Terraform, Ansible, and proprietary SDKs/APIs.
- Manage the deployment and orchestration of large-scale clusters, GPU scheduling, VM automation, and data/storage/network for multi-cloud landscapes.
- Containerize, serve, and monitor ML models using Slurm, Docker, Kubernetes (including Helm and advanced GPU scheduling).
- Implement comprehensive monitoring, model/data drift detection, and operational analytics tailored to high-performance compute platforms (e.g., OpenTelemetry, DCGM).
- Ensure robust security, compliance, identity management, and audit readiness (e.g., SOC 2) in mixed cloud environments.
- Collaborate across engineering, AI research, and operations, producing clear technical documentation and operational runbooks.
Main Requirements
- 6+ years of infrastructure, cloud, or MLOps experience, with at least 1 year on NCP platforms (e.g., CoreWeave, Nebius, Lambda Labs, Yotta).
- Expertise in CSPs (AWS, Azure, GCP) and NCPs (specialized GPU/AI clouds).
- Strong proficiency in IaC (Terraform, Ansible, Pulumi) and DevOps principles.
- Deep hands-on experience orchestrating and monitoring GPU-accelerated workloads and large-scale Slurm- or Kubernetes-based environments.
- Strong Go/Python (or comparable scripting language) and solid Linux/Unix administration.
- Proven experience in ML pipeline and model deployment in heterogeneous or multi-cloud AI setups.
- Excellent teamwork, stakeholder management, and communication for cross-disciplinary project delivery.
Preferred Skills
- Familiarity with GPU-as-a-Service, job orchestration, MLflow/W&B, and advanced monitoring (OTEL, ELK, LGTM, DCGM).
- Industry certifications in major clouds (AWS/GCP/Azure).
- Experience supporting enterprise-grade business continuity, disaster recovery, and compliance in mixed cloud environments.
Ready to apply?
Join Amaris Consulting and take your career to the next level!
Application takes less than 5 minutes