A technical look at why template-based CCDA parsing loses the medication that matters most, and what we built instead.
A medication buried in a non-standard section, never extracted, never flagged, is not an edge case. It's an architectural outcome. Most clinical data pipelines were designed to confirm receipt, not to actually read what arrived.
Here's what that looks like under the hood.
The template problem
Most CCDA processing tools, including widely used open-source implementations, work by matching document sections to known templates via an OID identifier. Each matched template renders a corresponding FHIR® resource. If the OID matches, the data comes through; if it doesn't, the data is silently dropped.
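A minimal sketch of that dispatch pattern makes the failure mode concrete. The handler names and OIDs below are illustrative, not any specific library's API; the point is the `continue` on an unrecognized template.

```python
import xml.etree.ElementTree as ET

# Hypothetical template registry: OID -> FHIR resource type to emit.
TEMPLATE_HANDLERS = {
    "2.16.840.1.113883.10.20.22.2.1.1": "MedicationStatement",  # standard meds section
    "2.16.840.1.113883.10.20.22.2.3.1": "Observation",          # standard results section
}

def convert(ccda_xml: str) -> list[str]:
    root = ET.fromstring(ccda_xml)
    resources = []
    for section in root.iter("section"):
        tid = section.find("templateId")
        oid = tid.get("root") if tid is not None else None
        handler = TEMPLATE_HANDLERS.get(oid)
        if handler is None:
            continue  # unmatched section: content silently dropped, no warning
        resources.append(handler)
    return resources

doc = """
<ClinicalDocument>
  <section><templateId root="2.16.840.1.113883.10.20.22.2.1.1"/></section>
  <section><templateId root="9.9.9.9"/></section>
</ClinicalDocument>
"""
print(convert(doc))  # the non-standard section simply vanishes
```

Real CCDAs carry HL7 namespaces and far richer structure; this strips all of that to isolate the dispatch logic.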
This isn't a bug. It's the consequence of treating CCDA parsing as a conversion problem rather than a comprehension problem. For documents produced by major EHRs following standard templates, it works well enough. But Epic, Allscripts, and most real-world EHR implementations routinely produce non-standard structures, vendor-specific extensions, and sections that fall outside the template set.
There's a second failure mode that gets even less attention: embedded HTML tables. CCDAs frequently encode clinically meaningful data (medication lists, lab panels, care instructions) in HTML tables nested inside the XML narrative. These tables often contain information that doesn't exist anywhere in the structured entries. Template-based parsers ignore them entirely.
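Extracting those narrative tables is mechanically simple once you decide to look at them. Here's a sketch against a simplified stand-in for a CCDA `<text>` block; real narratives add namespaces and nested markup, but the shape is the same.

```python
import xml.etree.ElementTree as ET

narrative = """
<text>
  <table>
    <thead><tr><th>Medication</th><th>Dose</th></tr></thead>
    <tbody>
      <tr><td>Warfarin</td><td>5 mg daily</td></tr>
      <tr><td>Metformin</td><td>500 mg BID</td></tr>
    </tbody>
  </table>
</text>
"""

def table_rows(text_xml: str) -> list[dict]:
    """Turn a narrative HTML table into keyed records."""
    root = ET.fromstring(text_xml)
    headers = [th.text for th in root.findall(".//thead/tr/th")]
    rows = []
    for tr in root.findall(".//tbody/tr"):
        cells = [td.text for td in tr.findall("td")]
        rows.append(dict(zip(headers, cells)))
    return rows

print(table_rows(narrative))
```

If the warfarin above appears only in this table and not in a structured entry, a template-first parser never sees it at all.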
What schema-driven parsing actually means
Particle's parser doesn't match against templates. It uses auto-generated models derived directly from the full HL7 CCDA schema and a recursive visitor that traverses the entire document tree. It captures every field the schema defines, not just the fields a template happens to reference. Non-standard sections, vendor-specific extensions, embedded HTML tables: all captured by default.
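The core idea of a recursive visitor can be shown in a few lines. This is a toy version, not Particle's schema-generated models: every element is visited and emitted with its path, so nothing is dropped by default.

```python
import xml.etree.ElementTree as ET

def visit(elem, path=""):
    """Yield (path, attributes, text) for every node in the tree."""
    here = f"{path}/{elem.tag}"
    yield here, dict(elem.attrib), (elem.text or "").strip()
    for child in elem:
        yield from visit(child, here)

doc = ET.fromstring(
    "<ClinicalDocument><customSection><entry>captured</entry>"
    "</customSection></ClinicalDocument>"
)
for path, attrs, text in visit(doc):
    print(path, text)
# the vendor-specific customSection is visited like any other node
```

The inversion matters: a template parser must opt in to each section it understands, while a visitor must opt out of data it wants to discard.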
The numbers bear this out: 106% more clinically relevant elements per document, 76% more procedures, and 73% more unique encounter records, compared to a template-first baseline.
Why deduplication order matters
There's a more subtle problem that compounds the template issue. Template-based tools convert to FHIR® first, then apply deduplication as a post-processing step. By the time deduplication runs, the original document context (which section a record came from, its XPath position, what wrapped it) is gone. Deduplication is operating on a lossy representation.
Particle's pipeline inverts this. Before any output format is generated, all extracted data lives in an intermediate graph model: normalized tables representing the full relational structure of the clinical document, with complete provenance retained. Deduplication happens here, on the raw graph, with full structural context available. Content-identical records are identified by hash. Cross-file encounter records are merged using precise CCDA identifiers, not fuzzy datetime matching.
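A simplified sketch of graph-layer dedup, under the assumption that records still carry provenance (section, XPath) when deduplication runs. Field names here are invented for illustration: content-identical records collapse to one node by hashing their clinical fields only, and every source copy's provenance is retained on the surviving node.

```python
import hashlib
import json

PROVENANCE_KEYS = ("section", "xpath")

def content_hash(record: dict) -> str:
    # Hash only the clinical content, not where it came from.
    clinical = {k: v for k, v in record.items() if k not in PROVENANCE_KEYS}
    return hashlib.sha256(json.dumps(clinical, sort_keys=True).encode()).hexdigest()

def dedupe(records: list[dict]) -> list[dict]:
    merged: dict[str, dict] = {}
    for rec in records:
        node = merged.setdefault(content_hash(rec), {**rec, "provenance": []})
        # Structural context is still available here, so it can be kept.
        node["provenance"].append({k: rec[k] for k in PROVENANCE_KEYS})
    return list(merged.values())

records = [
    {"rxnorm": "855332", "drug": "Warfarin 5 MG", "section": "Medications", "xpath": "/doc/sec[1]"},
    {"rxnorm": "855332", "drug": "Warfarin 5 MG", "section": "Discharge Meds", "xpath": "/doc/sec[4]"},
]
out = dedupe(records)
print(len(out), len(out[0]["provenance"]))  # one node, two provenance entries
```

Run after conversion to FHIR® instead, the two section names and XPaths would already be gone, and the merge would have to fall back to fuzzy matching.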
The result: duplicate artifacts dropped from 64% to 8%, and mapped-code completeness improved by 22 points over template-based outputs.
Enrichment at the graph layer
Because enrichment happens before output is generated, every downstream format benefits automatically. Surescripts prescription fill data is joined directly to the clinical graph, extending the medication record with real-world fill data that CCDAs alone can't provide. Facility enrichment, demographic enrichment, date imputation, cross-file encounter merging: all of it runs at the graph layer, on clean data, before anything gets converted to FHIR® or FLAT.
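The mechanics of a graph-layer join are straightforward; the leverage comes from where it sits in the pipeline. The field names below are illustrative, not the Surescripts schema: external fill data is keyed onto the medication node once, before any output format exists.

```python
# Medication nodes already deduplicated at the graph layer (sketch data).
meds = [{"rxnorm": "855332", "drug": "Warfarin 5 MG", "source": "ccda"}]

# External prescription fill data (hypothetical fields).
fills = [{"rxnorm": "855332", "last_fill": "2024-05-01", "days_supply": 30}]

# Join fills onto the graph node by code; every downstream format
# (FHIR, flat files) inherits the enriched record automatically.
fill_index = {f["rxnorm"]: f for f in fills}
enriched = [{**m, **fill_index.get(m["rxnorm"], {})} for m in meds]
print(enriched[0]["last_fill"])
```

Doing this join per output format instead would mean re-implementing (and re-debugging) the same enrichment for FHIR® and FLAT separately.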
This is the foundation Signal is built on. Medication reconciliation is only as good as the medication data underneath it. An AI-generated discharge summary is only as good as the clinical elements that were actually extracted.
Conformance is not comprehension
A conformant CCDA can pass every syntactic check. The pipeline can run without errors. And the medication that was in the document the entire time is still dropped.
Schema-driven parsing, graph-layer deduplication, pre-output enrichment: none of these are optimizations bolted onto a working system. They're corrections to the assumptions the original design got wrong. The template-first model was built to confirm receipt. If you're building anything that needs to act on clinical data, that's not enough.