AI Engineer Lead Contractor | Careers at Eightgen AI

About Eightgen

Eightgen is an AI services company that partners with founders, CIOs, and CXOs to transform ideas into working products. We help startups and enterprises ship AI automation at scale—from intelligent workflows and custom AI agents to enterprise-grade applications.

Our Values

Integrity & Ethics: We conduct business with honesty and transparency. We do what's right for our clients, our team, and our partners, even when it's harder. We handle data responsibly and respect the trust placed in us.

Quality & Accountability: We take ownership of our work and deliver on our commitments. We build software we're proud of, with attention to detail and a commitment to excellence. When we make mistakes, we acknowledge them and fix them.

Trust & Autonomy: We hire talented people and give them the freedom to do their best work. We communicate openly, share context generously, and trust each other to make good decisions.

Inclusion & Belonging: We are committed to building a diverse team where everyone feels welcome, respected, and heard. Different backgrounds, perspectives, and experiences make us stronger.

We are a fully remote team that values outcomes over hours and collaboration over hierarchy.

The Role

As an AI Engineer Lead, you will be a hands-on technical leader for our AI engineering work — spending roughly 70% of your time designing, building, and shipping AI systems, and the remaining 30% providing technical direction, reviewing AI/ML work, and mentoring engineers on the team.

You will own the end-to-end design of the LLM-powered features, agents, and data pipelines your team builds, from prompt and retrieval strategy to evaluation, guardrails, and production deployment. This is not a pure research role, a data-science/notebook role, or a people-management role. We want a strong software engineer who can take an AI problem from a vague business goal to a reliable, evaluated, production-grade system, who owns the services, APIs, and data flow around the model (not just the model), and who can lead a small team through it.

We are an AI-native engineering team. You will build with LLMs as the product and use AI coding assistants (Cursor, Claude Code, GitHub Copilot, or similar) as integral tools in your workflow, and you'll set that standard for the team. Much of our work involves multi-agent systems that orchestrate teams of LLM agents through long-running, human-in-the-loop workflows, so comfort building and reasoning about agentic systems is central to the role.

Our AI Engineering Philosophy

We believe the most effective AI engineers are those who:

Measure before they trust — every agent, RAG pipeline, or fine-tune ships with an evaluation harness, a labeled dataset, and a clear definition of "good enough"; quality is gated on metrics, not vibes, and regressions are caught before they ship
Treat AI systems as software — versioned prompts, reproducible pipelines, tests, and observability — not one-off notebook experiments
Engineer around model limits — design for hallucination, latency, cost, and non-determinism from day one, with retries, fallbacks, and guardrails
Stay pragmatic about the stack — reach for the simplest thing that works (a good prompt over a fine-tune, retrieval over a bigger model) and only add complexity when the metrics demand it
Keep humans in control — AI accelerates the work, but quality, safety, and correctness remain the engineer's responsibility

Key Responsibilities

Lead AI delivery end-to-end — own the design and delivery of the LLM features, agents, and pipelines your team is building, define standards within that scope, and ship reliable, maintainable AI systems on time
Design agentic AI systems — produce technical designs for RAG pipelines, multi-step and multi-agent (lead + sub-agent) systems, tool-use/function-calling flows, and long-running orchestrations with human-in-the-loop gates, with a clear eye on accuracy, latency, cost, and failure modes
Build evaluation and observability — define metrics, build eval datasets and harnesses, and instrument LLM calls so quality and regressions are visible, not guessed at
Govern model cost and routing — route work across model tiers, set budget guards, and apply context/token-management strategies so systems stay within cost and latency targets without sacrificing quality
Stay hands-on — contribute directly across prompt engineering, retrieval, agent orchestration, model integration, the supporting backend services and APIs, and data pipelines — leading by example, not just by review
Engineer for production — bake in cost controls, rate-limit handling, caching, guardrails, prompt-injection defenses, secure credential handling, and PII/data handling as first-class concerns
Raise the bar — conduct thorough reviews of prompts, pipelines, and code; provide actionable feedback; and grow the AI engineering capability of those around you
Make pragmatic trade-off calls — weigh prompt-vs-fine-tune, build-vs-buy, model-vs-cost, and speed-vs-accuracy decisions within your area and clearly articulate the reasoning
Collaborate cross-functionally — partner with product, design, and business stakeholders to turn ambiguous goals into well-scoped, well-evaluated AI work

Required Qualifications

Technical Skills

6+ years of professional software engineering experience, including 2+ years building production LLM / AI-powered systems (not just prototypes)
Strong applied LLM experience — production work with the OpenAI, Anthropic, or open-weight model APIs, including prompt engineering, structured output, and function/tool calling
Multi-agent orchestration experience — building multi-step and multi-agent systems (lead + sub-agent teams, tool-using agents) with agent frameworks (Claude Agent SDK, LangChain, LlamaIndex) or equivalent, or directly against model SDKs, including parsing streamed structured output and managing long-running agent sessions
Long-running, human-in-the-loop pipeline orchestration — has built stateful, resumable workflows (state machines or equivalent) with approval/milestone gates, recovery, and clear stage hand-offs
RAG and retrieval expertise — chunking and embedding strategies, vector stores (pgvector, Pinecone, Weaviate, or similar), and retrieval evaluation/tuning
Evaluation discipline (core to this role) — has built eval datasets and offline/online eval harnesses for non-deterministic systems, defined precision/quality metrics, and used them as a regression gate on prompt and pipeline changes — not as a one-time benchmark
Deep Python expertise — production experience with FastAPI (our primary backend framework), async patterns, type hints, Pydantic v2, and modern Python best practices
Solid backend and data fundamentals — API design, SQL and data modelling (PostgreSQL or similar), and building the services and pipelines that AI features depend on
Cloud platform experience — production experience on Google Cloud Platform (Cloud Run, Cloud SQL, GCS) or equivalent AWS/Azure services, with a practical grasp of IAM, secrets, and cost trade-offs
Demonstrated technical leadership — has led engineering work through code/design reviews, operational ownership, or mentoring

AI-Assisted Development Skills

Hands-on experience with AI coding assistants such as Cursor, Claude Code, GitHub Copilot, or similar tools in day-to-day workflows
Strong review instincts for AI-generated output — able to spot subtle bugs, security issues, or architectural missteps in AI-assisted code
Ability to guide teams on AI tool adoption — helping teammates use AI tools effectively and critically, not blindly

Preferred Qualifications

Experience with multi-tier model routing & cost governance — routing work across model tiers (e.g., fast/cheap vs. frontier models) per task, enforcing budget limits, and applying context/token-compaction strategies to control cost and latency
Experience with real-time streaming of LLM output to clients (Server-Sent Events or WebSockets), including replay/late-join handling
Experience with secure credential handling — encrypting third-party/provider tokens at rest (e.g., Fernet), JWT-based auth, and rate limiting
Experience with sandboxed / subprocess code execution and Docker / Docker Compose orchestration of ephemeral environments
Experience with fine-tuning, LoRA/PEFT, or model distillation, and a clear sense of when not to fine-tune
Familiarity with inference optimization — quantization, batching, streaming, and serving open-weight models (vLLM, Ollama, TGI)
Experience with prompt-injection / LLM security and safe handling of untrusted input and PII
Background in data-intensive applications — pipelines, analytics, or enterprise integrations
Experience with LLM observability/eval tooling (LangSmith, Langfuse, Arize, Ragas, or similar)
Prior work in early-stage or consulting environments where scope evolves quickly and engineers wear multiple hats

Technical Environment

Our primary stack is listed first in each row, but we equally value experience with comparable tools — the underlying skills transfer.

Layer	Technologies
AI / LLM Stack	Anthropic (primary), OpenAI, open-weight models; Claude Agent SDK / Claude Code CLI; LangChain, LlamaIndex; function calling & structured (stream-JSON) output
Agent Orchestration	Custom state-machine orchestrators; lead + sub-agent teams; human-in-the-loop milestone gates; multi-tier model routing with budget guards (comparable experience with Temporal / Prefect / Dagster transfers)
Retrieval & Vector Stores	pgvector (primary), Pinecone, Weaviate, Qdrant; embedding & reranking models
Evaluation & Observability	Langfuse, LangSmith, Ragas, custom eval harnesses; OpenTelemetry for LLM tracing
Backend	Python 3.11+ (FastAPI primary), Pydantic v2, Typer (CLI); Node.js / TypeScript or Django a plus
Real-time	Server-Sent Events (SSE) streaming with ring-buffer replay; polling fallback
Frontend (nice-to-have, not required)	React 19, React Router 7, TanStack Query v5, Zod, Radix UI + Tailwind, Vite
Relational Databases	PostgreSQL (primary), SQLite, MySQL — schema design, indexing, query tuning
Analytical & NoSQL Stores	ClickHouse, BigQuery, Redis, MongoDB
Auth & Security	JWT (HTTP-only cookie) auth, bcrypt, Fernet credential encryption, rate limiting (slowapi or similar)
Cloud Platforms	Google Cloud Platform primary (Cloud Run, Cloud SQL, GCS); equivalent comfort on AWS or Azure
Inference & Serving	vLLM, Ollama, TGI; quantization & batching where it matters
DevOps & Observability	Docker & Docker Compose, GitHub Actions, Terraform; Grafana, Datadog
Quality Tooling	Ruff, Pyright (Python); ESLint, Vitest + Testing Library + MSW (frontend)
AI Development Tools	Cursor, Claude Code, GitHub Copilot (your choice)

Engagement Details

Contract Duration: Initial 3-month engagement
Extension: Strong opportunity to extend for an additional 6+ months based on performance and project needs
Work Arrangement: Fully remote
Start Date: Immediately
