ADR 004 – Ground Truth Schema for HTR Evaluation
- Date: 2026-04-11
- Status: Accepted
- PRD: None
- Drivers: PellelNitram (GitHub)
- Deciders: PellelNitram (GitHub)
Context
To evaluate HTR model predictions, a ground truth dataset is needed. Before building annotation tooling and evaluation code, a schema must be defined so that all components agree on what ground truth data looks like.
The project currently uses an image-based ML model (via htr_pipeline) but the long-term
goal is to operate natively on strokes (as stored in Xournal++ .xopp/.xoj files). The
schema should serve both present and future needs without requiring re-annotation.
Decision
Ground truth is stored as one .gt.json file per source document, named
<document-stem>.gt.json (e.g. 2024-07-26_minimal.xopp →
2024-07-26_minimal.gt.json). The format is defined by
docs/schemas/ground_truth.schema.json.
Key design choices:
-
Stroke references, not coordinates. Each annotation identifies a group of strokes by
page_index,layer_index, andstroke_indiceswithin the source document. No pixel coordinates or bounding boxes are stored, making the schema independent of rendering resolution and naturally aligned with future stroke-based models. -
Reference to source document, not embedded data. The file stores a
filenameandsha256hash of the source document rather than duplicating its contents. The hash detects silent drift if the source file is modified after annotation. -
Closed annotation class vocabulary. Annotations are assigned one of a fixed set of classes:
word,digit,mathematical_expression,arrow,diagram,table,drawing,separator,correction,other. Extending the vocabulary requires a schema version bump, which makes class changes explicit and traceable. -
textconditionally required. Thetexttranscription field is required forword,digit, andmathematical_expression, and forbidden for all other classes. This is enforced via JSON Schemaif/then/else. -
Annotator and timestamp metadata.
annotator_idandcreated_atare required top-level fields to support inter-annotator agreement analysis and dataset versioning. -
Annotation tool enforces completeness. The tool must classify every stroke before saving. Therefore the schema has no partial-annotation marker — every saved file is complete ground truth.
Rationale
Storing stroke references rather than pixel coordinates future-proofs the annotations: the current image-based model is a temporary detour, and re-annotating an entire dataset to change coordinate systems would be expensive. Stroke indices are cheap to resolve at evaluation time.
A closed class vocabulary prevents annotator inconsistency (e.g. "line" vs "Line" vs
"horizontal_line"). JSON Schema versioning (schema_version: "1.0.0") makes class
additions explicit.
Stroke references also serve pixel-based models. To evaluate a bounding-box model against
this ground truth, a conversion utility loads the referenced strokes and computes
(xmin, ymin, xmax, ymax) at the target DPI. Stroke references are strictly more
information than a stored bounding box — the conversion is one-directional (strokes →
boxes, never boxes → strokes), so storing strokes future-proofs the ground truth for both
model types.
JSON Schema was chosen as the schema language because it is language-agnostic, has
validators in both Python (jsonschema) and JavaScript (ajv), and the data is already
JSON. Pydantic can generate a JSON Schema from a BaseModel if a typed Python
representation is needed later.
Consequences
Pros
- Annotations are independent of rendering DPI — usable for both current image-based models (after coordinate conversion) and future stroke-based models.
- Source document hash makes dataset integrity verifiable.
- Closed class vocabulary keeps annotations consistent across annotators.
- Single file per document keeps the dataset easy to manage.
Cons
- Stroke indices are positional and fragile: if the source
.xoppfile is edited after annotation (strokes added or removed), indices may silently shift. The SHA-256 hash is the only guard — tooling must refuse to load annotations when the hash does not match. - Duplicate stroke references across annotations (same stroke assigned to two words) cannot be caught by JSON Schema and require a separate code-level validator.
Tooling Requirements
The schema alone is not sufficient to guarantee a valid dataset. Annotation tools must additionally enforce the following:
-
Schema validation. Every saved
.gt.jsonfile must be validated againstground_truth.schema.jsonbefore writing. -
No duplicate stroke references. JSON Schema cannot enforce that the same stroke is not assigned to two annotations (same
page_index,layer_index,stroke_index). The annotation tool must reject any attempt to assign an already-annotated stroke to a second annotation. -
Completeness enforcement. The tool must ensure every stroke in the source document is covered by an annotation before saving. Completeness can be verified by cross- referencing the total stroke count from the
.xopp/.xojfile against the union of allstroke_indicesin the.gt.jsonfile. -
Hash mismatch guard. The tool must refuse to load a
.gt.jsonfile if the SHA-256 hash of the currently loaded source document does not matchsource_document.sha256.
Annotation Granularity
Ground truth is annotated at word level, not character level.
Word-level is sufficient for the current evaluation goals: word detection (did the model
find the right strokes?) and transcription accuracy (CER/WER computed over words). It is
also the right starting point because character-level annotation is significantly more
expensive to produce, and stroke-to-character mapping is inherently ambiguous in
handwriting — a single stroke can span multiple characters (e.g. a crossing stroke in t
or f), and some characters require multiple strokes.
Character-level annotation would additionally enable training and evaluating character segmentation models and fine-grained per-character error analysis. If that becomes a requirement, it can be added as an optional field in a future schema version — but the annotation cost should only be paid when a character-level model exists to benefit from it.
Annotation Conventions
The following conventions must be followed consistently across annotators:
digitgranularity. A sequence of digits written as a single connected unit (e.g.123with no visible gap between digits) is onedigitannotation withtext: "123". Digits that are spatially separated (e.g.1 2 3) are each a separatedigitannotation. The determining factor is whether the digits form one visually grouped unit or multiple distinct ones.
Alternatives
-
Store pixel bounding boxes (as in the existing
v1_2024-10-13annotation schema): simpler for image-based evaluation but ties ground truth to a specific rendering DPI and is useless for stroke-based models. -
Embed stroke data (x/y arrays) instead of referencing by index: makes files self-contained but duplicates data already present in the source document and bloats the dataset.
-
Open class vocabulary: maximum annotator freedom but leads to inconsistent labels and requires post-hoc normalisation before evaluation.
-
Protocol Buffers: stronger typing and cross-language code generation, but requires a build step and is heavier than necessary for small annotation files.