Making clinical-trial search feel like booking a flight

2025-07-08

A clinical-trial matching tool: describe your case or upload a report, and TrialGPT finds relevant trials near you, checks the inclusion and exclusion criteria against your situation, and drafts a message to the trial coordinator.

25 · jul 08 (day zero) a personal portfolio note, not an official Foundation 29 communication; I do not speak for my organization here. I built this at Foundation 29, a nonprofit for people with rare and hard-to-diagnose conditions, together with Julián Isla and Javier Logroño, two genuinely wonderful people to work with. The project, its mission and its history are theirs, shaped over years, above all by Julián's long search for a diagnosis for his son. My aim here is only to explain what I worked on, respectfully and without any private or internal detail. Foundation 29 works at the frontier and is a little podium-or-ambulance: all in, cross the line first or fold trying. This one, happily, was a podium, and in about a week.

If you have ever tried to find a clinical trial for yourself or someone you love, you already know the search is quietly broken. The data is not the problem. There is a public registry, ClinicalTrials.gov, with hundreds of thousands of studies, comprehensive and free. The problem is that it was built for researchers, not for the people the research is for. Every trial hides behind a wall of inclusion and exclusion criteria written in clinician shorthand, and the only honest way to know whether you qualify is to read all of it, for every candidate, and cross it against your own medical history. It is a weekend of reading, and most people give up long before the weekend is over. The rest ask a doctor who has forty other things to do that afternoon.

The pitch we kept coming back to was blunt: finding a clinical trial should feel like booking a flight. You say where you are and what you need, something else does the searching and the filtering and the fine print, and you walk away with a short list and a way to act on it. That is the whole product. It is called TrialGPT, and it lives at trialgpt.app.

I should say up front that I am not a clinician. I am a systems and standards person who wandered into healthcare through the back door, which is at once the wrong and the right qualification for a tool like this: wrong because I cannot read an oncology report the way an oncologist can, right because it forces me to treat every clinical judgment as something the machine must never pretend to make on its own.

1. The hard part was never the search

Everyone assumes the hard part is finding trials, but that was never it. Full-text search over a registry is a solved problem, and a decent keyword query already returns a pile of candidates. The hard part is the pile. Each trial comes with a list of inclusion criteria (you must be like this) and exclusion criteria (you must not be like that), and eligibility is the intersection of dozens of those clauses against one messy human history. Clauses like "ECOG performance status 0 to 1", "no prior treatment with a PD-1 inhibitor", "eGFR above 60". A patient does not know most of those words, and the ones they do know are scattered across three PDFs from two hospitals.

So TrialGPT is not really a search engine. It is an eligibility reasoner with a search engine bolted to the front, and its flow is deliberately short, because every extra step is a place where a scared or exhausted person quits. You tell it about your case, in your own words or, for much better results, by uploading a medical report. It reads that, retrieves the relevant trials, checks each one's criteria against you clause by clause, and, for anything that fits, drafts a note to the coordinator you can send from the page. Four moves, and the whole design fights to keep it at four.

Uploading a report is where the quality jumps, and the reason is not glamorous. Free text is what a person remembers to mention; a report is what a clinician wrote down, denser and more precise and full of exactly the values eligibility hinges on, the ones a patient would never volunteer because they do not know they matter. So a large part of the work is unglamorous extraction, in two stages: a document-layout model (Azure's Form Recognizer) turns the uploaded PDF into clean text, and then GPT-4o reads that text, pulls out the clinical events that matter, and normalizes them into English, because the registry is in English and a lot of the patients are not. What comes out is not prose. It is a short, structured list of facts a criterion can be checked against, which is a very different object from a paragraph a person typed.

From there, those events become a keyword query to the ClinicalTrials.gov v2 API, which hands back candidate studies and, crucially, the free-text eligibility criteria for each. It is retrieval-augmented in the honest sense of the word: retrieve the relevant documents, then let the model reason over them. But the retrieval is a plain keyword call to the registry, not a vector store, and there is no embedding search or re-ranking on our side, which surprises people who expect a wall of vector infrastructure. The interesting work is not the retrieval. It is what the model does with each trial's criteria afterward, in two passes: first it turns that wall of free text into a structured list of inclusion and exclusion clauses, and then it checks your case against every clause, one at a time.
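To make the retrieval step concrete, here is a minimal sketch of a plain keyword call to the registry. It is not TrialGPT's actual code. The helper names (`buildStudiesUrl`, `pickEligibility`, `fetchCandidates`), the `{ term }` shape of an extracted event, and the choice to join terms with spaces are all assumptions made for illustration. The endpoint, the `query.term` and `pageSize` parameters, and the `protocolSection` field paths reflect my understanding of the public v2 API and should be checked against its documentation.

```javascript
// Base URL of the public ClinicalTrials.gov v2 studies endpoint.
const CTGOV_STUDIES_URL = "https://clinicaltrials.gov/api/v2/studies";

// Hypothetical helper: turn extracted clinical events into a keyword query URL.
// `events` is an assumed shape: [{ term: "non-small cell lung cancer" }, ...].
function buildStudiesUrl(events, { pageSize = 20 } = {}) {
  const terms = events
    .map((e) => (e.term || "").trim())
    .filter((t) => t.length > 0);
  const params = new URLSearchParams({
    "query.term": terms.join(" "), // simplest possible keyword query (assumption)
    pageSize: String(pageSize),
  });
  return `${CTGOV_STUDIES_URL}?${params.toString()}`;
}

// Pull the identifier, title, and free-text criteria out of one study record.
// Missing sections degrade to null or an empty string instead of throwing.
function pickEligibility(study) {
  const p = study.protocolSection ?? {};
  return {
    nctId: p.identificationModule?.nctId ?? null,
    title: p.identificationModule?.briefTitle ?? null,
    criteriaText: p.eligibilityModule?.eligibilityCriteria ?? "",
  };
}

// One keyword call, no vector store: fetch candidates and keep only what
// the later parse and match passes need.
async function fetchCandidates(events) {
  const res = await fetch(buildStudiesUrl(events));
  if (!res.ok) throw new Error(`registry returned HTTP ${res.status}`);
  const body = await res.json();
  return (body.studies ?? []).map(pickEligibility);
}
```

The `criteriaText` string is the wall of free text that the two model passes then structure and check.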

The output of that check is the detail I would defend in an interview. It is not a yes or a no. Each clause comes back as one of three states: met, not met, or uncertain.

```javascript
// matchCriteriaLLM asks GPT-4o for one verdict per criterion, never a boolean.
//  +1  meets an inclusion clause (or does not trip an exclusion)
//   0  uncertain: the report simply does not say
//  -1  fails an inclusion clause (or trips an exclusion)
{
  inclusionMatches: [1, 0, -1],
  exclusionMatches: [0, 1, 1]
}
```

That middle state, the zero, is the whole point. An eligibility clause can hinge on a value the report never mentioned, and a system that flattened "the report does not say" into a confident yes or no would simply be lying. The tempting thing to build is a green "eligible" badge, because it feels helpful and demos beautifully, and it is also the most dangerous thing you can ship: eligibility can turn on a lab value from last month or a drug you took two years ago, and the model only knows what the report told it. So we kept walking the language back toward "worth checking" and away from "you qualify", and every time we did the product got less impressive in a screenshot and more trustworthy in real life. It produces a first pass, not a verdict, and never the thing that decides you are in. The honest gap, and the question I would expect to be pressed on, is evaluation: there is no gold-standard eligibility set to score against, so that three-state verdict is the only real confidence signal, and full tracing of every prompt and response is what stands in for a proper eval harness for now.
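One way to keep the three states from collapsing into a badge is to summarize them without ever emitting the word "eligible". The sketch below is an illustration, not the shipped logic: `summarizeVerdicts` and every label except "worth checking" are hypothetical, and the input shape simply mirrors the `inclusionMatches` / `exclusionMatches` arrays shown earlier.

```javascript
// Verdict codes as described above: +1 passes, 0 uncertain, -1 fails.
// The strongest label this function can return is "worth checking".
function summarizeVerdicts({ inclusionMatches = [], exclusionMatches = [] }) {
  const all = [...inclusionMatches, ...exclusionMatches];
  const failed = all.filter((v) => v === -1).length;
  const uncertain = all.filter((v) => v === 0).length;
  const passed = all.filter((v) => v === 1).length;

  let label;
  if (failed > 0) {
    label = "likely not a fit"; // at least one clause is clearly against
  } else if (uncertain > 0) {
    label = "worth checking, with open questions"; // the report did not say
  } else if (passed > 0) {
    label = "worth checking"; // still a first pass, never a verdict
  } else {
    label = "no criteria to check";
  }
  return { label, passed, uncertain, failed };
}
```

Keeping the uncertain count in the result lets the page show which clauses a coordinator would still need to confirm.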

2. The unglamorous half

The trials are written in English. Most patients are not. So the whole pipeline normalizes to English to do the matching and translates everything back at the end, which quietly makes a translation service one of the most load-bearing components in the product, and the least glamorous. I learned exactly how load-bearing the day an ECONNREFUSED to the Azure translator, thrown while translating a single trial title into French, took down the ability to show results in the user's language at all. No timeout, no retry, no fallback: one flaky network call to a boring dependency, and the important half of the product went dark. The fix was not clever, just overdue, a request timeout and exponential backoff and failover to a second Azure region. The lesson generalizes well past this project: find the unglamorous service everything silently routes through, and give it retries and a fallback before you polish anything the user can see.
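The shape of that fix can be sketched in a few lines. This is a generic illustration under stated assumptions, not the production code: `withRetryAndFailover`, `translateOrFallback`, the `call(endpoint, signal)` signature, and the default numbers are all hypothetical. The timeout only takes effect if `call` passes the signal on to its request, for example as the `signal` option of `fetch`.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Try each endpoint in order; within an endpoint, retry with exponential backoff.
// `call(endpoint, signal)` is a hypothetical function that performs one request.
async function withRetryAndFailover(call, endpoints, opts = {}) {
  const { retries = 3, baseDelayMs = 200, timeoutMs = 5000 } = opts;
  let lastError;
  for (const endpoint of endpoints) {
    for (let attempt = 0; attempt < retries; attempt++) {
      try {
        // The signal aborts after timeoutMs, provided `call` forwards it.
        return await call(endpoint, AbortSignal.timeout(timeoutMs));
      } catch (err) {
        lastError = err;
        // Wait 1x, 2x, 4x... the base delay before retrying this endpoint.
        if (attempt < retries - 1) await sleep(baseDelayMs * 2 ** attempt);
      }
    }
    // All retries failed here: fall through to the next endpoint (failover).
  }
  throw lastError ?? new Error("no endpoints configured");
}

// Last line of defense: if translation still fails, show the untranslated text.
async function translateOrFallback(text, translate) {
  try {
    return await translate(text);
  } catch {
    return text;
  }
}
```

With two regions in `endpoints`, a refused connection to the first costs a few retries and then moves on, and an untranslated title is a worse experience but not an outage.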

The other unglamorous truth is that finding the trial is only half the problem. Most people, having found a promising trial, still do nothing, because the next step is to cold-contact a research coordinator at a hospital and they have no idea what to say. So TrialGPT drafts the message for them: who they are, why they might fit this specific trial, and a clear ask, ready to send from the page. It is a small feature, and it is probably the difference between a tool people admire and a tool people use.

One thing I would build differently. The whole matching sequence is orchestrated from the client, the Angular app calling each backend step in order, extract, retrieve, parse, match, explain, one request at a time. It worked, and it made early iteration fast, but too much of the product's logic ended up living in a single enormous frontend component. If I rebuilt it, the orchestration would move to the server behind one endpoint, and the client would go back to being a client.
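As a rough picture of what "behind one endpoint" could mean, the sketch below runs the five steps in order inside a single server-side function. Everything here is hypothetical: `runMatch`, the `steps` object, and the shapes passed between steps are illustrative names, and the steps are injected so the sketch does not pretend to know the real model or registry clients.

```javascript
// Run extract, retrieve, parse, match, explain in order on the server.
// `steps` is a hypothetical object of async functions, one per pipeline stage.
async function runMatch(report, steps) {
  const events = await steps.extract(report); // report text -> clinical events
  const trials = await steps.retrieve(events); // events -> candidate trials
  const results = [];
  for (const trial of trials) {
    const criteria = await steps.parse(trial); // free text -> clause lists
    const verdicts = await steps.match(events, criteria); // one verdict per clause
    const explanation = await steps.explain(trial, criteria, verdicts);
    results.push({ trial, criteria, verdicts, explanation });
  }
  return { events, results };
}
```

A single HTTP handler could then call `runMatch` once per request, and the client would only render what comes back.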

Two more things an interviewer always reaches for. The first is cost and latency: a single match run fans out to a lot of GPT-4o calls, one to extract, one to rank, then a criteria-parse and a match and an explanation per candidate trial, all at temperature zero with retries, so it is neither cheap nor instant, and the obvious next move is to cache and batch those per-trial calls. The second is that the model ignores "return JSON only" about as often as it obeys it, wrapping its answer in markdown fences, so a tolerant parser strips them and every step degrades to a safe empty result instead of throwing. Small, unglamorous, and the difference between a demo and something that stays up.
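A tolerant parser of that kind can be very small. The version below is a sketch of the idea rather than the code in the product: `parseModelJson` is a hypothetical name, and the empty-object default for `fallback` is an assumption about what a "safe empty result" looks like for a given step.

```javascript
// Strip an optional markdown code fence from a model reply, then parse it.
// On any failure, return `fallback` instead of throwing.
function parseModelJson(raw, fallback = {}) {
  if (typeof raw !== "string") return fallback;
  let text = raw.trim();
  // Match the first fenced block anywhere in the reply, with or without a
  // language tag, and keep only what sits between the fences.
  const fenced = text.match(/`{3}[a-zA-Z]*\s*([\s\S]*?)`{3}/);
  if (fenced) text = fenced[1].trim();
  try {
    return JSON.parse(text);
  } catch {
    return fallback; // degrade to a safe empty result
  }
}
```

A caller that expects verdict arrays might pass `{ inclusionMatches: [], exclusionMatches: [] }` as the fallback, so a malformed reply reads as "nothing checked" and not as a crash.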

3. A medical AI, under the EU AI Act

I spend a good part of my working life on AI standards and the EU AI Act, so building a health-adjacent tool at Foundation 29 was less a side project than a very concrete exam. A system that helps route patients toward medical trials sits close to the high-risk end of that framework, and for good reason. It made a few decisions non-negotiable. The human stays in the loop, always: TrialGPT drafts, it never sends, and it never books. It does not diagnose, and it is careful never to sound like it does. It explains its reasoning per criterion instead of hiding behind a score, because "here is the clause, and here is why I think it applies to you" is something a person can check, while a bare confidence number is not. The honest hard part is the one a checkbox never solves: a medical report only becomes useful by passing through several services, and every hop is a place where sensitive data has to be handled with care rather than enthusiasm. None of this makes the tool compliant by fiat, and I am wary of anyone who claims a checkbox does, but writing the thing while holding the regulation in the other hand changed what we built, and mostly for the better. It is a search-and-outreach assistant, not a medical device, and nothing it produces is clinical advice.

It started in early testing, deliberately rough so people would throw weird edge cases at it and try to break it. These days it is doing rather more than testing, which is the part that keeps me careful: a tool carrying this much responsibility earns trust slowly, in front of real cases, not in a demo. Booking a flight became easy because someone decided the messy part, the fares and the filtering and the rules, was the software's job and not yours. A family looking for a trial deserves the same decision. That is all TrialGPT is trying to be.

26 · jun 20 (11m 12d later) a year on, TrialGPT has kept getting better without this page keeping up: the team has moved to newer models, tightened the matching, and widened what it can read. What is live today is well past the version described here.