Freelance Agent Evaluation Engineer
Please submit your CV in English and indicate your level of English proficiency.
Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. Participation is project-based, not permanent employment.
What this opportunity involves
We're building a dataset to evaluate AI coding agents — how well a model handles real-world developer tasks. You'll create challenging tasks and evaluation criteria within realistic simulated environments:
- Build virtual companies following a high-level plan - codebase, infrastructure, and context (conversations, documentation, tickets) that form a realistic environment with development history
- Assemble and calibrate tasks from intermediate states of the virtual company: craft the prompt, define evaluation criteria, and ensure the task is solvable and the evaluation is fair
- Design tasks set in isolated environments - emulations of a developer's workstation: a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation, etc.), and a real web application codebase
- Write tests that accept all correct solutions and reject incorrect ones - neither too strict (breaking on valid approaches) nor too lenient (passing bad ones)
- Iterate with an AI agent on tests - verifying they catch real problems, don't miss bad solutions, and don't break on good ones
- Review code written by agents, analyze why an agent failed or succeeded, and design edge cases and adversarial scenarios
- Iterate based on feedback from expert QA reviewers who score your work on quality criteria
What this is NOT
- Not data labeling
- Not prompt engineering
- Not writing code from scratch - the agent writes most of the code; you guide and evaluate
A significant part of the work is done together with AI - it's very hard to create tasks that challenge frontier models without using frontier models.
What we look for
This opportunity is a good fit for experienced developers, software engineers, and/or test automation specialists open to part-time, non-permanent projects. Ideally, contributors will have:
- Degree in Computer Science, Software Engineering, or related fields
- 5+ years in software development, primarily Python (FastAPI, pytest, async/await, subprocess, file operations)
- Background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and robust back-end systems
- Experience writing functional and integration tests, not just running them
- Experience with Docker containers and familiarity with infrastructure tools (Postgres, Kafka, Redis)
- CI/CD understanding (GitHub Actions as a user: triggers, labels, reading results)
- English proficiency - B2
You don't need to be an expert in every item, but you should be comfortable reading and reasoning about code across the stack.
Why this is hard
- Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution.
- Tasks have many valid solutions. Writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.
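To make the last point concrete, here is a minimal sketch in Python. It is not taken from the project: the task (remove case-insensitive duplicate emails, with order and kept casing left unspecified) and every function name are hypothetical. It contrasts a check that is too strict, one that is too lenient, and one that judges behavior only.

```python
# Hypothetical task: remove case-insensitive duplicates from a list of emails.
# Assumption for this sketch: the task does not fix the output order or which
# casing of a duplicate is kept, so several different outputs are correct.
EMAILS = ["A@x.com", "a@x.com", "b@x.com", "B@X.COM", "c@x.com"]


def too_strict(dedupe):
    # Pins one exact output, so it rejects valid solutions that differ in order.
    return dedupe(EMAILS) == ["A@x.com", "b@x.com", "c@x.com"]


def too_lenient(dedupe):
    # Only checks the size bound, so it passes solutions that remove nothing.
    return len(dedupe(EMAILS)) <= len(EMAILS)


def fair(dedupe):
    # Checks the required behavior and nothing else.
    result = dedupe(EMAILS)
    lowered = [e.lower() for e in result]
    no_duplicates = len(set(lowered)) == len(lowered)
    nothing_lost = set(lowered) == {e.lower() for e in EMAILS}
    nothing_invented = all(e in EMAILS for e in result)
    return no_duplicates and nothing_lost and nothing_invented


def keep_first(emails):
    # Valid solution: keeps the first spelling seen, in input order.
    seen, out = set(), []
    for e in emails:
        if e.lower() not in seen:
            seen.add(e.lower())
            out.append(e)
    return out


def keep_last_sorted(emails):
    # Valid solution: keeps the last spelling seen, returns sorted output.
    by_key = {}
    for e in emails:
        by_key[e.lower()] = e
    return sorted(by_key.values())


def case_sensitive(emails):
    # Incorrect solution: only removes exact string duplicates.
    return list(dict.fromkeys(emails))
```

In this sketch, `fair` accepts both valid solutions and rejects the incorrect one, `too_strict` rejects `keep_last_sorted`, and `too_lenient` accepts `case_sensitive`.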
How it works
Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid
Effort estimate
Tasks for this project are estimated to take about 20 hours to complete, though the actual time depends on complexity. This is an estimate and not a schedule requirement; you choose when and how to work. Tasks must be submitted by the deadline and meet the listed acceptance criteria to be accepted.
Compensation
On this project, contributors can earn up to the equivalent of $50 per hour, depending on their level and pace of contribution.
Compensation varies across projects depending on scope, complexity, and required expertise. Please note that other projects on the platform may offer different earning levels based on their requirements.