Role Purpose
TrekAI is at the forefront of reinventing education through AI. We are a high-growth, mission-driven startup where your work directly impacts teachers, students, and entire school systems.
TrekAI is building the next generation of AI-driven education technology, and we need a DevOps Systems Engineer to ensure our cloud platform is fast, reliable, and resilient. This is a hands-on role focused on operational excellence, developer experience, and customer responsiveness. You will automate deployments, harden infrastructure, and make sure TrekAI’s multi-agent learning platform scales securely and smoothly as adoption grows.
You’ll work closely with the Systems Architect to design scalable topologies, with the Engineering Leader to streamline CI/CD pipelines and developer workflows, and with the AI/Data Science Leader to deploy and monitor model-serving infrastructure. Your work will directly impact how quickly TrekAI can respond to schools, ship improvements, and recover from incidents — making you a critical enabler of customer trust and satisfaction.
Key Responsibilities
Platform Automation & CI/CD
- Build and maintain CI/CD pipelines for microservices and AI models.
- Automate infrastructure with Terraform, Helm, and Argo CD for reproducibility and speed.
Operations & Monitoring
- Deploy and manage observability stacks: Prometheus, Grafana, Loki, Sentry, PostHog, Alloy.
- Instrument systems for metrics, logging, tracing, and error detection to improve uptime and recovery.
- Manage and maintain service-level dashboards and alerting for production systems.
Resilience & BC/DR
- Implement backup, failover, and disaster recovery strategies to ensure ≥99.9% uptime.
- Run DR tests and incident simulations to validate recovery plans.
Developer Experience
- Shorten lead time for changes and improve local-to-production consistency.
- Provide self-service environments for developers and QA.
Customer Responsiveness
- Support school pilots, rollouts, and live trials by ensuring platform readiness.
- Rapidly address production issues to minimize impact on teachers and students.
Required Education & Experience
- BS in Computer Science, Engineering, or a related discipline, and/or 5+ years of equivalent hands-on DevOps or systems engineering experience in SaaS or platform environments.
- Strong cloud experience (AWS, GCP, or Azure), including virtualization technologies, virtual machine environments, and VM and container orchestration (Kubernetes/OpenShift).
- Solid knowledge of and administrative experience with Linux distributions (e.g., Ubuntu, Debian, RHEL, NixOS), cloud networking administration, and client-side Windows and macOS.
- Solid programming and database skills: Python, React, Node.js, Java, JSON, SQL, NoSQL.
- Expertise with CI/CD tools (Azure DevOps, GitHub Actions, Jenkins, etc.).
- Familiarity with observability stacks: Prometheus, Grafana, Loki, Sentry, PostHog, and log aggregation pipelines.
- Experience implementing BC/DR procedures and failover strategies.
- Knowledge of networking (routing, TLS certs), secrets management, and secure RBAC.
- Educational technology (EdTech) or SaaS platform experience is a plus.
- Startup experience is a plus.

