
Autonomous AI Engineer (Agentic + RAG + Eval Harness)
Upwork
Remote
About
We're commissioning a fixed-price pilot that can evolve into a retainer if successful. The budget for the 2–3 week pilot is $4,500. The goal is an MVP agent that reads a task brief, plans subtasks, calls tools and APIs, retrieves from our knowledge base, and produces a validated deliverable with its own self-checks.

Scope:
- A planner/executor that decomposes tasks and runs them with tool use such as web search, code execution, and database reads and writes.
- A retrieval-augmented generation (RAG) pipeline spanning roughly 20,000 internal documents, with guardrails to reduce hallucinations.
- An evaluation harness that measures quality, latency, and cost, with automated regression checks.
- Basic observability: traces, token usage, costs, and pass/fail metrics in a simple dashboard.
(Minimal sketches of the planner/executor, the retrieval step, and the regression gate follow at the end of this posting.)

Stack: we're flexible, but expect Python or TypeScript; LangGraph, LangChain, or CrewAI; OpenAI, Anthropic, or Groq models; pgvector or Weaviate for vector storage; LiteLLM for routing; sandboxed code runners such as E2B; OpenTelemetry for traces; and Docker for packaging. We will provide API keys, a small redacted dataset, sample tasks, and acceptance rubrics.

Deliverables:
- A repo with a clear README and one-click Docker run.
- Configurable agent graphs with a tool registry.
- A RAG service with evaluation scripts wired into CI.
- A minimal UI in Next.js or Streamlit to submit tasks and view traces.
- A short deployment guide covering development through staging.

Success criteria:
- At least a 25% reduction in manual task time versus our baseline.
- Average cost per standard task of no more than $0.30.
- A factuality score of at least 0.85 on our rubric.

To apply, please include two or three concrete agent or RAG examples with links or repos, and a short paragraph describing how you design evaluations to catch silent regressions.
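For the agent side, here is a minimal, framework-agnostic sketch of the planner/executor loop and tool registry we have in mind. Every name in it is illustrative, and the hard-coded plan() stands in for an LLM planner; in the pilot we would expect a LangGraph or CrewAI graph rather than a hand-rolled loop:

```python
from dataclasses import dataclass
from typing import Callable

# Tool registry: tools self-register by name so agent graphs stay configurable.
TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("web_search")
def web_search(query: str) -> str:
    return f"stub results for {query!r}"  # real version would call a search API

@tool("db_read")
def db_read(sql: str) -> str:
    return "stub rows"  # real version would run a read-only query

@dataclass
class Subtask:
    tool: str
    args: dict

def plan(brief: str) -> list[Subtask]:
    """Planner: an LLM call in practice; a fixed decomposition in this sketch."""
    return [
        Subtask("web_search", {"query": brief}),
        Subtask("db_read", {"sql": "SELECT 1"}),
    ]

def execute(brief: str) -> list[str]:
    """Executor: dispatch each subtask through the registry, collect outputs."""
    return [TOOLS[s.tool](**s.args) for s in plan(brief)]

if __name__ == "__main__":
    print(execute("summarize Q3 churn drivers"))
```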
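For retrieval, a sketch of the pgvector path, assuming a hypothetical docs table with an embedding column and psycopg2 for access (connection string, table, and column names are all assumptions); the prompt wrapper shows the kind of grounding guardrail we mean:

```python
import psycopg2  # assumes the pgvector extension is installed in Postgres

# Assumed schema (illustrative only):
#   CREATE TABLE docs (id serial PRIMARY KEY, content text, embedding vector(1536));

def retrieve(query_embedding: list[float], k: int = 5) -> list[str]:
    """Return the k nearest chunks by cosine distance (pgvector's <=> operator)."""
    conn = psycopg2.connect("dbname=kb")  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(query_embedding), k),
        )
        return [row[0] for row in cur.fetchall()]

def grounded_prompt(question: str, chunks: list[str]) -> str:
    """Guardrail: instruct the model to answer only from retrieved context."""
    context = "\n---\n".join(chunks)
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\nQ: {question}"
    )
```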
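For the eval harness, a pytest-style regression gate wired to the thresholds above; run_task and factuality are stand-ins for the real agent invocation and rubric scorer, and the golden task is made up for illustration:

```python
# Regression gate: replay golden tasks on every commit and fail CI on drift.

GOLDEN_TASKS = [
    {"brief": "summarize doc 12", "expected_facts": ["launched 2021", "EU only"]},
]

FACTUALITY_FLOOR = 0.85   # acceptance threshold from this brief
COST_CEILING_USD = 0.30   # average cost per standard task

def run_task(brief: str) -> dict:
    """Stand-in for the agent; returns a canned answer and its cost."""
    return {"answer": "Launched 2021; EU only.", "cost_usd": 0.12}

def factuality(answer: str, expected_facts: list[str]) -> float:
    """Naive substring scorer; the real rubric scorer would replace this."""
    hits = sum(f.lower() in answer.lower() for f in expected_facts)
    return hits / len(expected_facts)

def test_no_silent_regressions():
    scores, costs = [], []
    for task in GOLDEN_TASKS:
        result = run_task(task["brief"])
        scores.append(factuality(result["answer"], task["expected_facts"]))
        costs.append(result["cost_usd"])
    assert sum(scores) / len(scores) >= FACTUALITY_FLOOR, "factuality regressed"
    assert sum(costs) / len(costs) <= COST_CEILING_USD, "cost per task regressed"
```

Run on every commit, a gate like this turns the acceptance rubric into a CI failure rather than a silent drift, which is exactly what we mean by catching silent regressions.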