
Founding Harness Engineer - AI Infrastructure (Remote, USA)

Posted 5 hours ago | USA | $100-$120/hour | Freelance | 10+ years experience | SPEQD
DevOps | Python | CI/CD


Role Overview

What This Is

AI writes code. That's maybe 30% of shipping software. Testing, security, integration, deployment, monitoring: that other 70% is where teams stall out.

Stripe merges 1,000+ agent PRs per week. Spotify's background agent has 1,500+ merged PRs in production, with feedback loops that reject 25% of output before a human sees it. Ramp ships more than half its PRs from agents. Each took years and a full devex team to build. Most companies can't do that. SPEQD builds it for them.

We configure the environment around the model: sandboxes, verification loops, context harnesses, security, deployment pipelines. The industry calls this harness engineering. Princeton/Stanford showed a 64% improvement from environment design alone. Same model, same task. Only the harness changed. You're employee #1.

What You'll Build

1. Agent infrastructure and verification. Sandboxed compute where coding agents run with real toolchains (git, test runners, linters, browsers). Containers, serverless runtimes (Modal, Fly.io, cloud VMs), session management, snapshot-based fast startup. On top of that: the verification loop, sketched below. Agent writes code, runs tests, checks against a spec, iterates on failure, opens a PR only when it passes. Deterministic checks plus an LLM-as-judge layer for what linters can't catch. Stop hooks that kill bad PRs before they waste anyone's time. Meta's research team argues static test suites break at AI code velocity; you'll need to think about just-in-time testing too.
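
A minimal sketch of such a verification loop, in Python. Everything here is illustrative: the helpers (run_agent, run_checks, llm_judge, open_pr) are hypothetical stand-ins injected as callables, not any real framework's API; the shape of the loop is the point.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    passed: bool
    log: str = ""

@dataclass
class Verdict:
    passed: bool
    kill: bool = False       # stop hook fired: abandon the task, open no PR
    feedback: str = ""

def verify_and_ship(
    task_spec: str,
    run_agent: Callable[[str], str],           # hypothetical: agent produces a patch
    run_checks: Callable[[str], CheckResult],  # hypothetical: tests/linters in the sandbox
    llm_judge: Callable[[str, str], Verdict],  # hypothetical: judge for spec conformance
    open_pr: Callable[[str], None],            # hypothetical: pushes a branch, opens the PR
    max_attempts: int = 5,
) -> bool:
    """Iterate an agent against deterministic checks plus an LLM judge;
    open a PR only when both layers pass, stop early on a kill verdict."""
    for _ in range(max_attempts):
        patch = run_agent(task_spec)

        # Layer 1: deterministic checks (tests, linters, type checks).
        result = run_checks(patch)
        if not result.passed:
            task_spec += f"\n\nChecks failed:\n{result.log}"  # feed failure back, iterate
            continue

        # Layer 2: LLM-as-judge for what linters can't catch.
        verdict = llm_judge(patch, task_spec)
        if verdict.kill:
            return False      # kill bad work before it reaches a human reviewer
        if verdict.passed:
            open_pr(patch)
            return True
        task_spec += f"\n\nJudge feedback:\n{verdict.feedback}"

    return False              # attempt budget exhausted; no PR opened
```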

2. Context and tool harnesses. What the agent sees matters more than which model you pick. You control context through progressive disclosure, limit available tools (Spotify deliberately restricts agent tools for predictability), and wire up self-validation. You write harness config, connect tool servers, build reusable skills, and maintain the spec files agents reason over.
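
For illustration, a harness config might look something like the sketch below. The schema and every field name are assumptions made for this example, not an existing framework's format; the point is the three levers: a restricted tool surface, progressively disclosed context, and mandatory self-validation.

```python
# Hypothetical harness config -- illustrative structure, not a real schema.
HARNESS_CONFIG = {
    # Restrict the tool surface: the agent sees only what it needs
    # (the Spotify-style predictability tradeoff mentioned above).
    "tools": ["git", "pytest", "ruff", "file_read", "file_write"],

    # Progressive disclosure: start from a summary, let the agent pull
    # deeper context on demand instead of dumping the whole repo.
    "context": {
        "initial": ["README.md", "docs/architecture.md"],
        "on_demand": ["src/**/*.py", "tests/**/*.py"],
        "never": [".env", "secrets/**"],       # hard exclusions
    },

    # Spec files the agent reasons over; versioned in the repo.
    "specs": ["specs/payments.md", "specs/error-handling.md"],

    # Self-validation: commands the agent must pass before claiming done.
    "validate": ["pytest -q", "ruff check ."],
}
```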

3. Security, DevOps, and SRE. AI-generated code has 1.7x more bugs and up to 2x more security vulnerabilities than human-written code (Checkmarx, Stack Overflow 2026). You implement boundary design: sandbox isolation, credential scoping, execution hooks, audit trails. You own CI/CD automation, IaC validation, deployment risk scoring, monitoring. Mix of open-source and agentic SaaS (CodeRabbit, Harness, Datadog). Pipeline from merged PR to production is yours.
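
One way boundary design shows up in code is as an execution hook: every agent command passes through a gate that blocks disallowed binaries, strips inherited credentials, and writes an audit trail. The sketch below is a simplified Python illustration; the allowlist, environment, and log format are assumptions, and a production version would sit inside real sandbox isolation rather than a bare subprocess call.

```python
import json
import shlex
import subprocess
import time

ALLOWED_BINARIES = {"git", "python", "pytest", "ruff"}
SCOPED_ENV = {"PATH": "/usr/bin:/bin", "CI": "1"}   # no inherited secrets

def execution_hook(command: str, audit_log: str = "agent_audit.jsonl"):
    """Gate a single agent command: allowlist check, scoped credentials,
    append-only audit trail. Illustrative, not a real framework hook."""
    argv = shlex.split(command)
    allowed = bool(argv) and argv[0] in ALLOWED_BINARIES

    # Audit trail: record every attempt, allowed or not, before running it.
    with open(audit_log, "a") as f:
        f.write(json.dumps({"ts": time.time(), "cmd": command,
                            "allowed": allowed}) + "\n")

    if not allowed:
        raise PermissionError(f"blocked by execution hook: {command}")

    # Credential scoping: run with a minimal environment instead of
    # inheriting the parent process's secrets and tokens.
    return subprocess.run(argv, env=SCOPED_ENV, capture_output=True,
                          text=True, timeout=300)
```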

4. Technical direction. New agent frameworks ship monthly. You tell us what to use and what to skip. You simplify our plans where they're overbuilt. You're the one who says "drop half of this, here's what matters for the first three clients."

What You Bring

  • Software delivery systems. You've seen enough production stacks to know where they break and why. You map the full dependency graph, not just the code.
  • AI coding agents in real workflows. Claude Code, Codex, Aider, OpenCode. At least one, used for real work, not a weekend experiment.
  • Infrastructure. Containers, serverless, cloud VMs, CI/CD, IaC. Python-first. Comfortable in a terminal.
  • Systems thinking. Sandbox to control plane to verification to deployment to monitoring. You hold the whole chain and design for it.
  • You default to simple. Most technical plans are twice as complex as they need to be. You cut scope, not corners.
  • You don't need to be managed. Point you at a problem with constraints, you figure out the path. You flag blockers early. You ship.

Strong Plus

  • Agent orchestration: LangGraph, LangChain, Open SWE Agent
  • DevOps / SRE: CI/CD, monitoring, incident response
  • Azure: Azure DevOps, Container Instances, Durable Functions (initial clients are Azure-native)
  • LLM observability: LangSmith, Braintrust, W&B
  • Multi-agent orchestration: coordinating specialized agents, not single-agent loops
  • Agent security: Meta's Rule of Two, sandbox isolation, credential management

Plus

  • Austin location
  • Client presentations
  • Management experience
  • Certifications or a degree

Timeline

Weeks 1-2: Internal build. Review our technical plan, cut what's unnecessary. Stand up sandboxes, get a coding agent completing tasks end-to-end, wire the verification pipeline. By week 2, the system works internally.

Week 3: Client environment. Real client stacks. Different repos, CI/CD, constraints. Adapt the harness, get agents running against their codebase. Same 2-3 week ramp, now on someone else's infrastructure.

Weeks 4-6: Stabilize and optimize. Fix what broke. Tune feedback loops. Harden security. Customize for the client's architecture. By week 6, client engineers use it without you in the room.

Comp and Why

$200K-$250K base. Uncapped variable comp tied to delivery outcomes: client milestones, platform launches, engagement revenue. Pay for outcomes, not hours. If you help deliver a $400K engagement, your comp reflects that. No ceiling.

An exceptional performer, with SPEQD hitting its pipeline targets, could realistically double their base. The math works if the work works.

Equity participation available after first successful client engagement. Four active engagements in pipeline.

82% of companies adopted agentic AI in the first five months of 2025 (Jellyfish). Almost none have a system for making the output production-ready. Stefan Weitz, CEO of HumanX Conference: "I haven't seen anyone take on the mantle of 'we're not just gonna be your intern engineer, we're gonna be your architect.'"

The CEO handles sales, clients, and strategy. You run the tech. Nothing between you and production. Remote. US-based.
