
Senior AI Operations (AI Ops) Engineer

3+ months ago Palo Alto, CA

At Navan, we aren't building a single, generic chatbot. We are building a Composable AI Microservice Architecture, a swarm of hundreds of hyper-specialized AI services, each meticulously "programmed" to solve small, focused tasks with high precision. This fleet powers Ava, our AI support engine, and a suite of cutting-edge generative tools for travel and expense management.

As a Senior AI Operations (AI Ops) Engineer, you are the architect of the platform that makes this scale possible. You will move beyond traditional MLOps to manage a "factory" of Language Models. Your challenge is one of orchestration and standardization, ensuring that every service in the swarm meets a rigorous bar for quality, reliability, and cost-efficiency.

What You’ll Do
  • Orchestrate the AI Fleet: Build and own the runtime environment for 100+ specialized AI services. Manage model routing, context versioning, and standardized memory/history stores.
  • High-Density Inference Optimization: Design and implement SageMaker Multi-Model Endpoints (MME) and Inference Components to serve multiple tuned SLMs per GPU, maximizing hardware utilization while minimizing latency.
  • Deterministic Service Excellence: Treat reliability as a layered engineering problem. Build deterministic "shells" around probabilistic LM outputs, prioritizing data-layer validation and strict serialization.
  • Automated Evaluation & Observability: Implement "LLM-as-a-judge" patterns and automated benchmarking to detect semantic drift and hallucinations across the fleet before they impact the user.
  • Standardize the Workflow: Obsess over building reusable patterns and Terraform-based infrastructure that eliminate "snowflake" configurations, allowing us to deploy new specialized AI tasks in minutes.
  • Agency Strategy: Partner with AI Researchers to find the "Goldilocks zone" for agentic autonomy—balancing the flexibility of LLM tool-use with the precision required for production stability.
What We’re Looking For
  • Experience: 5+ years in SRE, Platform Engineering, or MLOps, with at least 2 years focused on deploying LLMs/SLMs in production environments.
  • SageMaker Mastery: Deep hands-on expertise with AWS SageMaker, specifically configuring Multi-Model Endpoints (MME), Inference Components, and GPU-backed instances (G5/P4).
  • SLM Expertise: Proven experience with Small Language Models (e.g., Mistral, Llama 3, Phi) and parameter-efficient fine-tuning (PEFT) deployment strategies like LoRA/QLoRA.
  • Technical Stack:
    • Languages: Strong proficiency in Python and Terraform.
    • Orchestration: Experience with Docker, Kubernetes (EKS), or AWS ECS/Fargate.
    • Data: Familiarity with Snowflake and Vector Databases.
  • The "AI Ops" Mindset: You understand that AI at scale is a statistical challenge. You are comfortable debugging issues at the data/serialization layer rather than defaulting to prompt tweaks.
  • CI/CD & Automation: Experience building robust pipelines (Jenkins, GitHub Actions) for non-deterministic software, including automated "eval" stages.
  • Education: BS or MS in Computer Science, Engineering, Mathematics, or a related technical field.

The posted pay range represents the anticipated low and high end of the compensation for this position and is subject to change based on business need. To determine a successful candidate’s starting pay, we carefully consider a variety of factors, including primary work location, an evaluation of the candidate’s skills and experience, market demands, and internal parity.

For roles with on-target-earnings (OTE), the pay range includes both base salary and target incentive compensation. Target incentive compensation for some roles may include a ramping draw period. Compensation is higher for those who exceed targets. Candidates may receive more information from the recruiter.

Pay Range: $116,100 - $258,000 USD
Client-provided location(s): Palo Alto, CA
Job ID: 7588592
Employment Type: OTHER
Posted: 2026-02-11T18:36:09

Perks and Benefits
  • Health and Wellness
  • Parental Benefits
  • Work Flexibility
  • Office Life and Perks
  • Vacation and Time Off
  • Financial and Retirement
  • Professional Development
  • Diversity and Inclusion