Independent evaluation for AI in public-sector systems.

We help agencies, companies, and organizations identify promising AI use cases, evaluate vendors on real workflows, and measure whether tools improve outcomes before they are scaled.

Built by some of the world's leading researchers in AI evaluation and A/B testing from MIT, University of Chicago, UT Austin, Princeton, and other top research universities.

Demos and benchmarks aren't enough.

43-pt
Gap between expected and measured productivity when experienced developers used AI coding tools in a randomized trial.
2,133
AI use cases already in production across 41 federal agencies.
77.6%
Of agent-benchmark papers report neither fairness nor cost, two of the metrics public-sector buyers need most.
93%
False-positive rate of a state fraud-detection algorithm deployed without independent testing.

Sources: METR randomized study of experienced developers (2025); federal AI use case inventory; agent-benchmark survey of 1,300+ papers; Michigan MiDAS unemployment fraud-detection system.

Why it matters

AI adoption is moving faster than evidence.

AI adoption is outpacing the evidence needed to guide it. Federal AI contracting jumped 1,200% to $4.5 billion in 2024, but only 10% of federal agencies have a comprehensive AI governance framework in place.

Demos and self-reported benchmarks show potential, but they rarely predict real-world impact. TrialAI helps you measure what AI tools actually do in your workflows so you can scale what works, fix what doesn't, and walk away from what shouldn't be deployed at all.

What we do

Three services that meet you where the decision is.

Not every decision needs a randomized trial. We match the strength of evidence to the size of the decision: from a procurement roadmap to a full embedded evaluation. The three services work independently or together as a single pipeline to full-scale deployment.

Service 01

AI Decision Framework & Procurement Roadmap

We equip your leadership team with a practical framework for evaluating AI claims: distinguishing benchmark performance from real-world value, matching evidence standards to deployment risk, and identifying where AI is a plausible fit for real operational problems.

You receive a decision framework for AI evaluation, a prioritized use-case portfolio, and a procurement-ready roadmap for leadership.

Service 02

Independent Vendor Evaluation

We test AI tools on the workflows you actually run, such as benefits determinations, document verification, eligibility support, and citizen-facing chat, and measure fairness, reliability, accessibility, cost, and speed against the standards your organization has to meet.

You receive a vendor evaluation memo, benchmark report, and procurement-ready criteria.
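To make that concrete, here is a minimal, illustrative sketch of the kind of per-group metrics an evaluation like this produces: accuracy and false-positive rate by group, plus average cost and latency. The records, group labels, and costs below are hypothetical examples, not results or tooling from a real engagement.

    # Illustrative only: tallies accuracy, false-positive rate by group,
    # and average cost/latency for a vendor's outputs on a sample workflow.
    # Field names, group labels, and values are hypothetical.
    from collections import defaultdict

    records = [
        # (group, true_label, vendor_label, cost_usd, latency_s)
        ("A", 0, 0, 0.004, 1.2),
        ("A", 1, 1, 0.004, 1.4),
        ("A", 0, 1, 0.005, 1.1),   # false positive
        ("B", 0, 0, 0.004, 1.3),
        ("B", 1, 0, 0.004, 1.6),   # false negative
        ("B", 0, 1, 0.005, 1.5),   # false positive
    ]

    by_group = defaultdict(lambda: {"n": 0, "correct": 0, "neg": 0, "fp": 0})
    total_cost = total_latency = 0.0

    for group, truth, pred, cost, latency in records:
        g = by_group[group]
        g["n"] += 1
        g["correct"] += int(pred == truth)
        if truth == 0:
            g["neg"] += 1
            g["fp"] += int(pred == 1)
        total_cost += cost
        total_latency += latency

    for group, g in sorted(by_group.items()):
        print(f"group {group}: accuracy={g['correct']/g['n']:.2f}, "
              f"false-positive rate={g['fp']/g['neg']:.2f}")
    print(f"avg cost per case: ${total_cost/len(records):.4f}, "
          f"avg latency: {total_latency/len(records):.2f}s")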

Service 03

Impact & Measurement

We design and run randomized trials that measure how AI actually affects the outcomes you care about: service access, completion, placement, learning, and fairness. We also build the data pipelines and monitoring infrastructure to keep evaluating as models, prompts, and features change.

You receive a trial design, measurement system, and ongoing monitoring dashboard.
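As an illustration of what trial-level evidence means in practice, the sketch below computes a simple difference-in-means effect estimate from a randomized pilot. The outcome values and group assignments are made up, and a real analysis would add pre-registration, covariate adjustment, and more careful inference.

    # Illustrative difference-in-means estimate from a randomized pilot.
    # Outcomes (e.g., application completion, 0/1) and assignments are hypothetical.
    from statistics import mean, variance
    from math import sqrt

    control   = [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]   # tool withheld
    treatment = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # AI tool offered

    effect = mean(treatment) - mean(control)
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))

    print(f"estimated effect: {effect:+.2f}")
    print(f"95% CI (approx.): [{effect - 1.96*se:.2f}, {effect + 1.96*se:.2f}]")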

Where we come in

The decisions we help you make.

"We're about to buy an AI tool for eligibility determinations. We need independent evidence it works before we sign."

"We're deploying an application screening model. We need to know if it's fair and accurate."

"We funded three AI pilots. Now we need evidence about which ones improve outcomes before deciding which to scale."

"We shipped an AI feature. Usage is up, but we don't know if it's actually helping users."

How we work

Rigor that scales with the stakes.

For low-stakes pilots, a workflow-specific evaluation may be enough. For tools that will shape benefits eligibility, hiring, or learning at scale, you need trial-level evidence. We help you decide which level of rigor fits the decision in front of you. Then we deliver it.

Once a tool is in production, we build the infrastructure to keep measuring it, and we stay involved as long as you need us. New model versions, prompt changes, and product updates can all be evaluated against a stable baseline, so the evidence keeps pace with the technology.

Durable infrastructure

Setting up rigorous evaluation infrastructure is mostly a one-time cost. Once the data pipelines, randomization, and monitoring are in place, every follow-on test runs at a fraction of the original effort.

On past engagements, we've evaluated subsequent product changes for a small fraction of the original setup cost, turning a one-time investment into durable capability.
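For example, once an evaluation harness and a frozen evaluation set exist, checking a new model version against the baseline can be as simple as re-running the same scorer on the same held-out cases. In the sketch below, score_case(), the version labels, and the case list are placeholders, not our production harness.

    # Illustrative re-evaluation of a new model version against a frozen baseline.
    # score_case() and the case list are placeholders for a real evaluation harness.
    from statistics import mean

    def score_case(model_version: str, case: dict) -> float:
        """Placeholder scorer: 1.0 if the (hypothetical) model's answer
        matches the expected answer for this case, else 0.0."""
        answer = "approved" if model_version == "v2" or case["id"] % 2 == 0 else "denied"
        return 1.0 if answer == case["expected"] else 0.0

    cases = [{"id": i, "expected": "approved"} for i in range(20)]   # frozen eval set

    baseline = mean(score_case("v1", c) for c in cases)
    candidate = mean(score_case("v2", c) for c in cases)

    print(f"baseline (v1): {baseline:.2f}   candidate (v2): {candidate:.2f}")
    print("ship candidate" if candidate >= baseline else "hold: regression vs. baseline")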

Why TrialAI

We combine research rigor with live-platform implementation.

Most teams can do one side of this work well: rigorous evaluation, or production engineering. Doing both together, inside the systems where decisions actually get made, is rare. That combination is what we bring.

Our team designs and runs randomized controlled trials in live production systems. We apply causal inference and econometrics to AI decision-making. We build, fine-tune, and red-team models. We bring public-sector expertise in benefits access, workforce, housing, education, criminal justice, and financial inclusion. And we ground this work in evaluation, fairness, and bias-auditing methods from peer-reviewed publications we have written.

That range of expertise, in one team, is what makes rigorous AI evaluation possible, from scoping a use case through measuring outcomes in the field.

TrialAI is a program of Learning Collider at Renaissance Philanthropy, drawing on a network of researchers from Chicago, Oxford, MIT, Princeton, Brown, UC San Diego, and UT Austin, and policy relationships across more than 20 states.

Selected projects

  • AI housing navigator — building and evaluating an AI navigator on AffordableHousing.com, serving thousands of Public Housing Authorities across the U.S., to help individuals apply for benefits and secure housing.
  • AI hiring algorithms — building an LLM-based résumé screening tool, evaluating it for efficacy and fairness across demographic groups, and debiasing it against human reviewers and traditional ML models.
  • AI lending algorithms — building and evaluating underwriting models using a randomized controlled trial and machine learning.

Planning an AI procurement, pilot, or renewal?

Let's scope the use case, the evidence standard, and the evaluation approach before you commit.

hello@trialai.co

TrialAI is a program of Learning Collider at Renaissance Philanthropy.