Dear applicants, please note that applications without salary expectations and an active LinkedIn profile will not be considered. Thank you for your understanding.
We are hiring a Senior Applied AI Engineer to own the reliability, evaluation, and production stability of advanced multi-agent AI systems operating at real production scale. This role is focused on transforming LLM-powered workflows from “demo-ready” prototypes into resilient, observable, production-grade systems capable of handling non-deterministic model behavior, complex routing logic, and human-in-the-loop escalation flows. You will work closely with technical leadership and product stakeholders to design, evaluate, optimize, and maintain agentic AI systems across multiple communication channels and workflows. This is a highly hands-on engineering role for someone who thrives in production environments and understands the realities of deploying AI systems under live traffic conditions.
Details
Location: LATAM
Work Model: Fully Remote
Employment Type: Full-time
Seniority Level: Senior
Industry: AI / Agentic Systems / SaaS
Start Date: ASAP
English Level: Fluent English Required
Time Zone: LATAM-friendly collaboration preferred
About the Role
This position is dedicated to AI agent reliability, evaluation pipelines, observability, and continuous optimization of production LLM systems. The ideal candidate combines strong backend engineering expertise with deep practical experience operating AI products in real-world environments. You will take ownership of evaluation frameworks, scoring systems, tracing infrastructure, production debugging, and the iterative optimization loop between prompts, architecture decisions, and system behavior. The role requires both technical depth and product intuition, especially around how evaluation systems directly impact product quality and user experience.
Key Responsibilities
Design, build, and maintain evaluation pipelines for production AI agent systems
Instrument multi-agent workflows with tracing and observability tooling
Build evaluation datasets using real production traffic and interaction logs
Develop quality scoring and robustness scoring systems for LLM outputs
Improve reliability of AI systems handling non-deterministic model behavior
Implement and optimize HITL (Human-in-the-Loop) escalation workflows
Analyze production failures and drive architectural improvements
Own the full feedback loop between evaluations, prompt optimization, architecture updates, and re-testing
Contribute to prompt engineering and model optimization strategies
Collaborate on multi-agent orchestration and workflow reliability decisions
Work across backend systems, deployment pipelines, monitoring, and operational sustainment
Participate in production support and on-call responsibilities
Maintain high engineering standards around scalability, observability, and maintainability
Operate independently across development, testing, deployment, and production ownership
Requirements
5+ years of backend or AI engineering experience in production environments
Strong hands-on experience with production LLM or agentic AI systems
Proven experience debugging and maintaining non-deterministic AI workflows under live traffic
Experience building or operating evaluation (evals) pipelines for AI systems
Strong understanding of scorer design, feedback loops, and AI system evaluation methodologies
Excellent Python backend engineering skills
Production experience with FastAPI, Django, Celery, and LangGraph or similar orchestration frameworks
Experience with observability and tracing tools such as Langfuse, Grafana, Loki, and OpenTelemetry or equivalent
Experience deploying and operating distributed backend systems
Strong understanding of AI reliability, prompt behavior, and model failure handling
Ability to independently own projects end-to-end
Experience working in asynchronous remote teams
Strong written communication skills in English
Nice to Have
Experience with DSPy, DPO, and RLHF-related optimization workflows
Experience with multi-agent orchestration systems
Production experience with GPT-4.x, Claude, Whisper, and multi-model AI stacks
Experience building AI tooling for communication or workflow automation
Background in high-growth startups or product-focused engineering teams
Experience with distributed systems and event-driven architectures
Familiarity with AI observability and experiment tracking frameworks
Exposure to vector databases, retrieval systems, or memory architectures
Experience scaling AI products with real customer usage
Tech Stack
Python
FastAPI
Django
Celery
LangGraph
Langfuse
Grafana
Loki
LLM APIs (OpenAI / Anthropic / multi-model stacks)
What Success Looks Like
AI agents reliably handle real production traffic with measurable quality improvements
Evaluation pipelines provide actionable scoring and monitoring insights
Observability systems surface failures before they impact users
Human escalation triggers operate accurately and consistently
Prompt and architecture iterations measurably improve production outcomes
AI systems become resilient, scalable, and maintainable over time
Interview Process
HR / Introductory Call
Technical Deep Dive
Take-Home Technical Assessment
Final Team & Culture Interview
Offer Stage