AI Inference Optimization Engineer

We are looking for a highly capable AI Inference Optimization Engineer to drive the performance and efficiency of AI models during inference, especially large language models (LLMs). This role involves close collaboration with data scientists, ML researchers, and infrastructure engineers to scale cutting-edge AI systems across real-world business scenarios.


Key Responsibilities

  • Optimize AI models for speed, resource efficiency, and accuracy
  • Improve and maintain APIs for AI inference, ensuring system reliability, observability, and scalability
  • Benchmark inference systems to identify and resolve performance bottlenecks
  • Design and tune inference systems for high concurrency, low latency, and high reliability
  • Work with GPU hardware and frameworks like CUDA and Triton for performance optimization
  • Implement and monitor GPU utilization, memory usage, throughput, and system health
  • Manage resource allocation for real-time and batch inference workloads
  • Design and maintain ML infrastructure for training, inference, model and dataset management, and orchestration
  • Collaborate with business teams to optimize the integration and use of AI applications across various scenarios
  • Research and implement state-of-the-art techniques in LLMs, AIGC, NLP, CV, and ML system engineering
  • Stay current with emerging trends in model inference, distributed systems, and AI infrastructure


Qualifications

  • Strong understanding of machine learning, deep learning, and inference principles
  • Proficiency in Python and C++
  • Experience with deep learning frameworks such as PyTorch or TensorFlow
  • Familiarity with CUDA, Triton, or similar GPU frameworks
  • Hands-on experience with inference optimization techniques (e.g., quantization, speculative decoding, continuous batching)
  • Experience with distributed systems, large-scale data processing, and parallel computing
  • Familiarity with ML infrastructure for large-scale training and deployment
  • Ability to analyze and optimize GPU kernel performance
  • Advanced experience with frameworks like DeepSpeed, Megatron, FSDP, or GSPMD
  • In-depth CUDA programming and tuning experience, using tools such as CUTLASS or Triton
  • Experience in LLM application and agent development
  • Research or industry experience in generative AI, NLP, CV, or multimodal models
  • Competitive programming accolades (e.g., ACM/ICPC, NOI/IOI, Kaggle, Topcoder)
  • Familiarity with the latest LLM research trends, including long-context handling, active learning, alignment, and agent ecosystems



About Us

Hyperfusion provides high-performance computing and AI solutions for businesses of all sizes. We focus on innovation, security, and exceptional performance, supported by reliable hardware and certified data centers.

Post Date
2025-05-22
Job Type
-
Employment type
Full-time
Category
Engineering, Information Technology
Level
Entry
Country
United Arab Emirates
Industry
IT System Data Services