Role Overview

Contract → Ongoing

Company: Evercred (HPEC, Inc.)

Location: Remote. Must overlap 4-5 hours within 8 AM - 5 PM Pacific, Mon-Fri.

Engagement: 2-week paid trial ($2,500 - $5,000 depending on availability & experience) → ongoing contract with competitive pay if trial succeeds. Equity evaluation at 3 months.

Start: Immediate - within 5 business days of offer.

About Evercred

Evercred is a healthcare credentialing platform that verifies physicians, nurses, and allied health professionals against authoritative state and federal sources (FSMB, state DCAs, boards of nursing) and delivers verified credential packets to hospitals and enterprise clients. We are a small, high-velocity team shipping daily into production.

Why This Role Exists

We need one thing from this hire: our engineering and QA teams must never lose a day of productivity to an infrastructure failure again.

In the last six weeks we have lost multiple team-days to:

A staging outage caused by missing AWS Secrets Manager values.

A preview (dev) environment down for over a week.

A new production target ("Azul," App Runner) that was not ready for QA on the date it had been committed for.

The fixes are not exotic - they are secret validation, realistic timelines, proactive communication, and test-first discipline. We need someone who treats reliability as the primary product.

What You'll Own

AWS infrastructure across production, Beta, and nonprod accounts - App Runner, ECS Fargate, RDS Postgres, Secrets Manager, Route53, VPC, ECR.

CI/CD pipelines (GitHub Actions) - pre-deploy gates, post-deploy health checks, E2E smoke tests (Playwright, Gauge), required status checks on main.

Infrastructure as Code - completing a Terraform-only migration and sunsetting CDK on the remaining paths.

Monitoring and alerting - CloudWatch, BetterStack, PostHog, Rollbar, SigNoz. Every customer-facing flow alarmed and paged.

Third-party integrations - Stripe, Twilio, Anthropic, SendGrid, California DCA BreEZe, FSMB, state nursing boards.

Incident response - first responder for P1 infra incidents with <30 min response and a written postmortem within 48 hours.

Cost, secrets, and backups - 10-15% AWS cost reduction, tested key rotation, tested RDS disaster recovery.

The First Two Weeks (Paid Trial)

Concrete, groomed, documented work is already waiting for you:

Week 1 - Unblock the team:

1. Take the Azul App Runner cutover from "not ready" to "QA can test it on an isolated cloned database" (GitHub epic #3137, children #3138-#3140).

2. Ship a CI pre-deploy secret-ARN validation gate so a missing or blank secret can never again take down an environment (#3071, #3075).

3. Audit all Secrets Manager paths across environments and resolve the California DCA outage class (#2659).
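To make item 2 concrete, here is a minimal sketch of what a pre-deploy secret check could look like. It is an illustration under stated assumptions, not a description of the gate tracked in the tickets: the function names, the secret IDs, and the injected `fetch` callable are all hypothetical. In a real pipeline `fetch` would likely wrap a Secrets Manager lookup (for example boto3's `get_secret_value`), and the step would run before any deploy job.

```python
from typing import Callable, Iterable


def find_invalid_secrets(
    secret_ids: Iterable[str],
    fetch: Callable[[str], str],
) -> dict[str, str]:
    """Return {secret_id: reason} for every secret that is missing or blank.

    `fetch` is a hypothetical lookup function: it returns the secret value
    or raises if the secret cannot be resolved.
    """
    problems: dict[str, str] = {}
    for secret_id in secret_ids:
        try:
            value = fetch(secret_id)
        except Exception as exc:  # any lookup failure is treated as "missing"
            problems[secret_id] = f"missing ({type(exc).__name__})"
            continue
        if value is None or not value.strip():
            problems[secret_id] = "blank"
    return problems


def gate(secret_ids: Iterable[str], fetch: Callable[[str], str]) -> int:
    """Return a CI exit code: 0 if every secret resolves, 1 otherwise."""
    problems = find_invalid_secrets(secret_ids, fetch)
    for secret_id, reason in sorted(problems.items()):
        print(f"SECRET CHECK FAILED: {secret_id}: {reason}")
    return 1 if problems else 0
```

A CI step could pass the result of `gate(...)` to `sys.exit`, so that a missing or blank secret fails the pipeline before deploy instead of surfacing at runtime.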

Week 2 - Make it stable and handed off:

1. Deliver the Azul cutover go/no-go checklist and complete remaining Azul phases (#3136, #3141).

2. Produce a written triage of our beta branch work (Hatchet workers, SigNoz tracing, Gauge E2E, pnpm 10): keep / port / discard with reasoning.

3. Stabilize the preview environment and add alerting so nobody finds out from QA that it's down.

Trial success = QA can test Azul, pre-deploy secret validation is live, zero new outages introduced, and you hit what you committed to.

The Ongoing Role (If Trial Succeeds)

40 hrs/week: 70% DevOps / 20% hands-on full-stack (TypeScript/Next.js and Python/FastAPI) / 10% reporting. Monthly ops reports, weekly 30-minute meeting with the CPO, daily async standup.

Measured on: uptime, incident count and response time, CI/CD reliability, cost delta, zero-outage-from-missing-secrets, proactive risk escalation.

Must-Have Experience

5+ years production DevOps/SRE, including real incident response ownership.

AWS production depth: App Runner, ECS Fargate, RDS Postgres, Secrets Manager, IAM, Route53, VPC, CloudWatch. You've debugged a VPC connector / security group / secret ARN problem at 2 AM.

Terraform at production scale, including migrating off CDK or another IaC tool.

GitHub Actions CI/CD - authoring pipelines with pre-deploy gates, post-deploy verification, and blue/green or similar cutover patterns.

Secrets management discipline - you treat "missing env var" as a CI failure, not a runtime error.

Cloudflare + Route53 DNS cutovers executed under traffic.

Hands-on code - you can write and review Next.js/TypeScript and Python/FastAPI changes, not just config.

Postgres operational skills - migrations, snapshots, schema changes under load.

Nice-to-Have

Hatchet or similar durable job runner (we're migrating from Inngest).

SigNoz, BetterStack, Rollbar, PostHog - any or all.

Playwright / Gauge E2E in CI.

Experience with healthcare, regulated data, or HIPAA-adjacent environments (we are not a HIPAA environment, but you will need a PII-conscious mindset: SOC 2 / HITRUST / NCQA CVO).

Experience serving as the sole DevOps at a growth-stage startup.

How We Work

PRs target dev, never main. Author ≠ reviewer.

Pre-commit and CTO-review hooks (an Anthropic skill) gate every commit. Don't bypass them.

Fail loud. No silent fallbacks that mask configuration errors.

We use Claude-powered internal tools to largely automate our SDLC (/groom, /cto-review, /qa, /start, /standup) and expect engineers to use them.

Daily async standup in Slack #standup by 10 AM Pacific. Blockers escalated before they become outages.

What We Don't Want

"Assessment phase" before execution. The tickets are already groomed - pick them up and ship.

Estimates that don't match reality. If a date slips, we need to know 48 hours before, not the day of.

Heroes who fix things quietly and leave no runbook. Every fix ships with a check.

Anyone whose first instinct on a failed hook is --no-verify.

How to Apply

Send to mark@evercred.com:

1. Resume or LinkedIn.

2. Short note (≤200 words) on the single worst production outage you've personally owned: what broke, what you shipped to prevent the class of failure, and how long it took.

3. Your availability to start and your hourly/weekly rate for the ongoing engagement.

We will respond within 2 business days. Trial offers are extended within 5 business days of a qualifying application.

Senior DevOps Engineer — Healthcare Credentialing Platform
