Amaris Consulting
DevOps Engineer
Canada · 16 days ago
Full-time · Remote Friendly · Consulting

We are looking for a motivated MLOps Engineer to join our team, working remotely from Canada (Pacific or Mountain time zones only). As an MLOps Engineer, you will bridge the gap between data science and operations, ensuring seamless integration, deployment, and management of machine learning models in production environments. Your mission will be to automate, scale, and monitor the entire ML lifecycle, leveraging your expertise in cloud infrastructure, DevOps practices, and scripting to deliver efficient, reliable, and secure data-driven solutions that support business innovation.


Key Responsibilities

- Architect, provision, and automate infrastructure for AI/ML workloads on both hyperscaler CSPs and NCPs (specialized GPU/AI clouds).

- Build, optimize, and maintain end-to-end machine learning pipelines (CI/CD/CT) for continuous integration, delivery, and training in high-throughput, GPU-driven environments.

- Advance Infrastructure as Code (IaC) methods with tools such as Terraform, Ansible, and proprietary SDKs/APIs.

- Manage the deployment and orchestration of large-scale clusters, GPU scheduling, VM automation, and data, storage, and network resources across multi-cloud landscapes.

- Containerize, serve, and monitor ML models using Slurm, Docker, and Kubernetes (including Helm and advanced GPU scheduling); an illustrative sketch follows this list.

- Implement comprehensive monitoring, model/data drift detection, and operational analytics tailored to high-performance compute platforms (e.g., OpenTelemetry, NVIDIA DCGM).

- Ensure robust security, compliance (e.g., SOC 2), identity management, and audit readiness in mixed cloud environments.

- Collaborate across engineering, AI research, and operations, producing clear technical documentation and operational runbooks.
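
Below is a brief, purely illustrative Python sketch of the kind of Kubernetes GPU scheduling work described above: building a Pod manifest that requests NVIDIA GPUs through the standard nvidia.com/gpu extended resource. The pod name, container image, and GPU count are hypothetical placeholders, not a description of a specific production stack.

```python
import json

def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Return a minimal Pod manifest that asks the scheduler for `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested via the extended resource name under
                    # "limits"; the scheduler treats this as the request as well.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }

if __name__ == "__main__":
    # Hypothetical model-serving pod asking for two GPUs; the printed manifest
    # could be applied with kubectl or the Kubernetes Python client.
    print(json.dumps(gpu_pod_manifest("model-server", "example.registry/serving:latest", gpus=2), indent=2))
```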


Main Requirements

- 6+ years of infrastructure, cloud, or MLOps experience, with at least 1 year in NCP platforms (e.g., CoreWeave, Nebius, Lambda Labs, Yotta).

- Expertise in CSPs (AWS, Azure, GCP) and NCPs (specialized GPU/AI clouds).

- Strong proficiency in IaC (Terraform, Ansible, Pulumi) and DevOps principles.

- Deep hands-on experience orchestrating and monitoring GPU-accelerated workloads and large-scale Slurm- or Kubernetes-based environments (see the short sketch after this list).

- Strong Go/Python skills (or a comparable scripting language) and solid Linux/Unix administration.

- Proven experience with ML pipelines and model deployment in heterogeneous or multi-cloud AI setups.

- Excellent teamwork, stakeholder management, and communication for cross-disciplinary project delivery.
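
As a small illustration of the Slurm-based GPU orchestration mentioned above, the Python sketch below wraps sbatch to submit a GPU training job. The partition name, time limit, and batch script path are hypothetical placeholders and would differ per cluster.

```python
import subprocess
from pathlib import Path

def submit_gpu_job(script: Path, gpus: int = 4, partition: str = "gpu",
                   time_limit: str = "04:00:00") -> str:
    """Submit a batch script via sbatch and return sbatch's confirmation line."""
    cmd = [
        "sbatch",
        f"--partition={partition}",   # target partition (cluster-specific)
        f"--gres=gpu:{gpus}",         # request GPUs as a generic resource
        f"--time={time_limit}",       # wall-clock limit, HH:MM:SS
        str(script),
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()      # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    print(submit_gpu_job(Path("train_job.sbatch"), gpus=8))
```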


Preferred Skills

- Familiarity with GPU-as-a-Service, job orchestration, MLflow/W&B, and advanced monitoring (OTEL, ELK, LGTM, DCGM); a brief MLflow sketch follows this list.

- Industry certifications in major clouds (AWS/GCP/Azure).

- Experience supporting enterprise-grade business continuity, disaster recovery, and compliance in mixed cloud environments.
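
For the experiment-tracking tools listed above, the short sketch below shows minimal MLflow usage (logging parameters and a per-epoch metric). The experiment name, run name, and values are hypothetical, and mlflow plus a tracking backend (or a local ./mlruns directory) are assumed to be available.

```python
import mlflow

mlflow.set_experiment("demo-gpu-training")   # hypothetical experiment name

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 3e-4)
    mlflow.log_param("gpus", 8)
    # In a real pipeline these values would come from the training loop.
    for epoch, loss in enumerate([0.92, 0.61, 0.47]):
        mlflow.log_metric("train_loss", loss, step=epoch)
```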
