Senior Software Engineer - Agent Evaluation - Freelance/Remote 100+ openings
Role Overview
Open to candidates in North America, South America, Asia, and Europe. Please submit your CV in English and indicate your level of English proficiency.
Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. Participation is project-based, not permanent employment.
What this opportunity involves
We're building a dataset to evaluate AI coding agents - how well a model handles real-world developer tasks.
You'll create challenging tasks and evaluation criteria within realistic simulated environments:
- Build realistic developer environments - a virtual company with codebase, infrastructure, and context (tickets, docs, conversations) that forms a believable development history
- Design tasks from intermediate states of these environments - craft the prompt, define what "solved" means, and ensure the task is solvable by an AI agent
- Write tests that verify agent solutions - accept all valid approaches and reject incorrect ones, neither too strict nor too lenient
- Iterate on tasks and tests based on QA feedback - review agent solutions, analyze failures, and refine until the evaluation is fair and robust
What this is NOT
- Not data labeling
- Not prompt engineering
- Not writing code from scratch - the agent writes most of the code; you guide and evaluate
What we look for
- 5+ years in software development
- Core stack: Python (FastAPI), JavaScript/TypeScript (React), Docker, Postgres, Kafka, Redis
- Experience writing tests (functional, integration)
- English proficiency - B2+
Why this is hard
Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution. Tasks have many valid solutions - writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.
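To illustrate the last point, here is a minimal sketch of a behavior-based check, assuming a hypothetical task ("remove duplicates from a list while keeping first-occurrence order"). The task, the `accepts` helper, and the three candidate functions are invented for illustration and are not part of the actual project. The check looks only at observable output, so it accepts different valid implementations and rejects one that breaks the contract.

```python
from typing import Callable, Iterable, List

# Hypothetical contract: remove duplicates, keep first-occurrence order.
Dedupe = Callable[[Iterable[str]], List[str]]


def accepts(candidate: Dedupe) -> bool:
    """Return True if the candidate satisfies the behavioral contract.

    Only inputs and outputs are inspected, never how the result is computed.
    """
    cases = [
        ([], []),
        (["a"], ["a"]),
        (["b", "a", "b", "c", "a"], ["b", "a", "c"]),
    ]
    for given, expected in cases:
        try:
            result = candidate(list(given))
        except Exception:
            # A crash on valid input counts as a failed solution.
            return False
        if list(result) != expected:
            return False
    return True


def dedupe_with_set(items):
    """Valid approach 1: explicit loop with a 'seen' set."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_with_dict(items):
    """Valid approach 2: dicts preserve insertion order (Python 3.7+)."""
    return list(dict.fromkeys(items))


def dedupe_sorted(items):
    """Incorrect: removes duplicates but loses the original order."""
    return sorted(set(items))


if __name__ == "__main__":
    for fn in (dedupe_with_set, dedupe_with_dict, dedupe_sorted):
        print(fn.__name__, "accepted" if accepts(fn) else "rejected")
```

A check that instead asserted on implementation details (for example, requiring a specific helper name or data structure) would be too strict, while one that only compared the set of returned items would be too lenient, since it would let the order-losing version through.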
How it works
Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid
Effort estimate
Tasks for this project are estimated to take approximately 20 hours to complete, depending on complexity. This is an estimate and not a schedule requirement; you choose when and how to work. Tasks must be submitted by the deadline and meet the listed acceptance criteria to be accepted.
Hiring & Onboarding Process
The process is designed to move quickly and typically includes the following steps:
- Application review and invitation to a virtual project introduction session (approximately 30 minutes)
- Platform registration and identity verification
- Technical assessment (approximately 35 minutes)
- Background check (completed at no cost to candidates)
- Onboarding and project-specific training tasks
- Begin production work!
Additional Requirements
- Willingness to complete identity verification as part of the onboarding process.
- Ability to complete a technical assessment.
- Willingness to join and participate in Discord, which will be used for project communication and updates.
- Successful completion of a background check is required prior to onboarding.
- Reliable internet connection and ability to communicate effectively in a remote environment.