About Eightgen
Eightgen is an AI services company that partners with founders, CIOs, and CXOs to transform ideas into working products. We help startups and enterprises ship AI automation at scale—from intelligent workflows and custom AI agents to enterprise-grade applications.
Our Values
Integrity & Ethics: We conduct business with honesty and transparency. We do what's right for our clients, our team, and our partners, even when it's harder. We handle data responsibly and respect the trust placed in us.
Quality & Accountability: We take ownership of our work and deliver on our commitments. We build software we're proud of, with attention to detail and a commitment to excellence. When we make mistakes, we acknowledge them and fix them.
Trust & Autonomy: We hire talented people and give them the freedom to do their best work. We communicate openly, share context generously, and trust each other to make good decisions.
Inclusion & Belonging: We are committed to building a diverse team where everyone feels welcome, respected, and heard. Different backgrounds, perspectives, and experiences make us stronger.
We are a fully remote team that values outcomes over hours and collaboration over hierarchy.
The Role
As an AI Engineer Lead, you will be a hands-on technical leader for our AI engineering work — spending roughly 70% of your time designing, building, and shipping AI systems, and the remaining 30% providing technical direction, reviewing AI/ML work, and mentoring engineers on the team.
You will own the end-to-end design of the LLM-powered features, agents, and data pipelines your team builds — from prompt and retrieval strategy to evaluation, guardrails, and production deployment. This is not a pure research role, not a data-science/notebook role, and not a people-management role: we want a strong software engineer who can take an AI problem from a vague business goal to a reliable, evaluated, production-grade system — owning the services, APIs, and data flow around the model, not just the model — and lead a small team through it.
We are an AI-native engineering team. You will build with LLMs (as the product) and use AI coding assistants (Cursor, Claude Code, GitHub Copilot, or similar) as integral tools in your workflow, and you'll set that standard for the team. Much of our work involves multi-agent systems that orchestrate teams of LLM agents through long-running, human-in-the-loop workflows, so comfort building and reasoning about agentic systems is central to the role.
Our AI Engineering Philosophy
We believe the most effective AI engineers are those who:
- Measure before they trust: every agent, RAG pipeline, or fine-tune ships with an evaluation harness, a labeled dataset, and a clear definition of "good enough"; quality is gated on metrics, not vibes, and regressions are caught before they ship
- Treat AI systems as software: versioned prompts, reproducible pipelines, tests, and observability, not one-off notebook experiments
- Engineer around model limits: design for hallucination, latency, cost, and non-determinism from day one, with retries, fallbacks, and guardrails
- Stay pragmatic about the stack: reach for the simplest thing that works (a good prompt over a fine-tune, retrieval over a bigger model) and only add complexity when the metrics demand it
- Keep humans in control: AI accelerates the work, but quality, safety, and correctness remain the engineer's responsibility
Key Responsibilities
- Lead AI delivery end-to-end: own the design and delivery of the LLM features, agents, and pipelines your team is building, define standards within that scope, and ship reliable, maintainable AI systems on time
- Design agentic AI systems: produce technical designs for RAG pipelines, multi-step and multi-agent (lead + sub-agent) systems, tool-use/function-calling flows, and long-running orchestrations with human-in-the-loop gates, with a clear eye on accuracy, latency, cost, and failure modes
- Build evaluation and observability: define metrics, build eval datasets and harnesses, and instrument LLM calls so quality and regressions are visible, not guessed at
- Govern model cost and routing: route work across model tiers, set budget guards, and apply context/token-management strategies so systems stay within cost and latency targets without sacrificing quality
- Stay hands-on: contribute directly across prompt engineering, retrieval, agent orchestration, model integration, the supporting backend services and APIs, and data pipelines, leading by example, not just by review
- Engineer for production: bake in cost controls, rate-limit handling, caching, guardrails, prompt-injection defenses, secure credential handling, and PII/data handling as first-class concerns
- Raise the bar: conduct thorough reviews of prompts, pipelines, and code; provide actionable feedback; and grow the AI engineering capability of those around you
- Make pragmatic trade-off calls: weigh prompt-vs-fine-tune, build-vs-buy, model-vs-cost, and speed-vs-accuracy decisions within your area and clearly articulate the reasoning
- Collaborate cross-functionally: partner with product, design, and business stakeholders to turn ambiguous goals into well-scoped, well-evaluated AI work
Required Qualifications
Technical Skills
- 6+ years of professional software engineering experience overall, including 2+ of those years building production LLM / AI-powered systems (not just prototypes)
- Strong applied LLM experience: production work with the OpenAI, Anthropic, or open-weight model APIs, including prompt engineering, structured output, and function/tool calling
- Multi-agent orchestration experience: building multi-step and multi-agent systems (lead + sub-agent teams, tool-using agents) with agent frameworks (Claude Agent SDK, LangChain, LlamaIndex) or equivalent, or directly against model SDKs, including parsing streamed structured output and managing long-running agent sessions
- Long-running, human-in-the-loop pipeline orchestration: has built stateful, resumable workflows (state machines or equivalent) with approval/milestone gates, recovery, and clear stage hand-offs
- RAG and retrieval expertise: chunking and embedding strategies, vector stores (pgvector, Pinecone, Weaviate, or similar), and retrieval evaluation/tuning
- Evaluation discipline (core to this role): has built eval datasets and offline/online eval harnesses for non-deterministic systems, defined precision/quality metrics, and used them as a regression gate on prompt and pipeline changes, not as a one-time benchmark
- Deep Python expertise: production experience with FastAPI (our primary backend framework), async patterns, type hints, Pydantic v2, and modern Python best practices
- Solid backend and data fundamentals: API design, SQL and data modelling (PostgreSQL or similar), and building the services and pipelines that AI features depend on
- Cloud platform experience: production experience on Google Cloud Platform (Cloud Run, Cloud SQL, GCS) or equivalent AWS/Azure services, with a practical grasp of IAM, secrets, and cost trade-offs
- Demonstrated technical leadership: has led engineering work through code/design reviews, operational ownership, or mentoring
AI-Assisted Development Skills
- Hands-on experience with AI coding assistants such as Cursor, Claude Code, GitHub Copilot, or similar tools in day-to-day workflows
- Strong review instincts for AI-generated output: able to spot subtle bugs, security issues, or architectural missteps in AI-assisted code
- Ability to guide teams on AI tool adoption: helping teammates use AI tools effectively and critically, not blindly
Preferred Qualifications
- Experience with multi-tier model routing & cost governance: routing work across model tiers (e.g., fast/cheap vs. frontier models) per task, enforcing budget limits, and applying context/token-compaction strategies to control cost and latency
- Experience with real-time streaming of LLM output to clients (Server-Sent Events or WebSockets), including replay/late-join handling
- Experience with secure credential handling: encrypting third-party/provider tokens at rest (e.g., Fernet), JWT-based auth, and rate limiting
- Experience with sandboxed / subprocess code execution and Docker / Docker Compose orchestration of ephemeral environments
- Experience with fine-tuning, LoRA/PEFT, or model distillation, and a clear sense of when not to fine-tune
- Familiarity with inference optimization: quantization, batching, streaming, and serving open-weight models (vLLM, Ollama, TGI)
- Experience with prompt-injection / LLM security and safe handling of untrusted input and PII
- Background in data-intensive applications: pipelines, analytics, or enterprise integrations
- Experience with LLM observability/eval tooling (LangSmith, Langfuse, Arize, Ragas, or similar)
- Prior work in early-stage or consulting environments where scope evolves quickly and engineers wear multiple hats
Technical Environment
Our primary stack is listed first in each row, but we equally value experience with comparable tools — the underlying skills transfer.
| Layer | Technologies |
|---|---|
| AI / LLM Stack | Anthropic (primary), OpenAI, open-weight models; Claude Agent SDK / Claude Code CLI; LangChain, LlamaIndex; function calling & structured (stream-JSON) output |
| Agent Orchestration | Custom state-machine orchestrators; lead + sub-agent teams; human-in-the-loop milestone gates; multi-tier model routing with budget guards (comparable experience with Temporal / Prefect / Dagster transfers) |
| Retrieval & Vector Stores | pgvector (primary), Pinecone, Weaviate, Qdrant; embedding & reranking models |
| Evaluation & Observability | Langfuse, LangSmith, Ragas, custom eval harnesses; OpenTelemetry for LLM tracing |
| Backend | Python 3.11+ (FastAPI primary), Pydantic v2, Typer (CLI); Node.js / TypeScript or Django a plus |
| Real-time | Server-Sent Events (SSE) streaming with ring-buffer replay; polling fallback |
| Frontend (nice-to-have, not required) | React 19, React Router 7, TanStack Query v5, Zod, Radix UI + Tailwind, Vite |
| Relational Databases | PostgreSQL (primary), SQLite, MySQL — schema design, indexing, query tuning |
| Analytical & NoSQL Stores | ClickHouse, BigQuery, Redis, MongoDB |
| Auth & Security | JWT (HTTP-only cookie) auth, bcrypt, Fernet credential encryption, rate limiting (slowapi or similar) |
| Cloud Platforms | Google Cloud Platform primary (Cloud Run, Cloud SQL, GCS); equivalent comfort on AWS or Azure |
| Inference & Serving | vLLM, Ollama, TGI; quantization & batching where it matters |
| DevOps & Observability | Docker & Docker Compose, GitHub Actions, Terraform; Grafana, Datadog |
| Quality Tooling | Ruff, Pyright (Python); ESLint, Vitest + Testing Library + MSW (frontend) |
| AI Development Tools | Cursor, Claude Code, GitHub Copilot (your choice) |
Engagement Details
- Contract Duration: Initial 3-month engagement
- Extension: Strong opportunity to extend for an additional 6+ months based on performance and project needs
- Work Arrangement: Fully remote
- Start Date: Immediately