ADR 004 – Ground Truth Schema for HTR Evaluation

Date: 2026-04-11
Status: Accepted
PRD: None
Drivers: PellelNitram (GitHub)
Deciders: PellelNitram (GitHub)

Context

To evaluate HTR model predictions, a ground truth dataset is needed. Before building annotation tooling and evaluation code, a schema must be defined so that all components agree on what ground truth data looks like.

The project currently uses an image-based ML model (via htr_pipeline) but the long-term goal is to operate natively on strokes (as stored in Xournal++ .xopp/.xoj files). The schema should serve both present and future needs without requiring re-annotation.

Decision

Ground truth is stored as one .gt.json file per source document, named <document-stem>.gt.json (e.g. 2024-07-26_minimal.xopp → 2024-07-26_minimal.gt.json). The format is defined by docs/schemas/ground_truth.schema.json.
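
The naming rule can be illustrated with a minimal sketch. The helper name ground_truth_path is hypothetical, and placing the .gt.json file next to the source document is an assumption; this ADR fixes only the file name, not the directory.

```python
from pathlib import Path


def ground_truth_path(source_document: Path) -> Path:
    """Return the <document-stem>.gt.json path for a source document.

    Assumption: the ground truth file sits in the same directory as the
    source document. The ADR only specifies the file name.
    """
    # Path.stem drops only the final suffix (".xopp" or ".xoj").
    return source_document.with_name(source_document.stem + ".gt.json")


print(ground_truth_path(Path("2024-07-26_minimal.xopp")))
```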

Key design choices:

Stroke references, not coordinates. Each annotation identifies a group of strokes by page_index, layer_index, and stroke_indices within the source document. No pixel coordinates or bounding boxes are stored, making the schema independent of rendering resolution and naturally aligned with future stroke-based models.
Reference to source document, not embedded data. The file stores a filename and sha256 hash of the source document rather than duplicating its contents. The hash detects silent drift if the source file is modified after annotation.
Closed annotation class vocabulary. Annotations are assigned one of a fixed set of classes: word, digit, mathematical_expression, arrow, diagram, table, drawing, separator, correction, other. Extending the vocabulary requires a schema version bump, which makes class changes explicit and traceable.
text conditionally required. The text transcription field is required for word, digit, and mathematical_expression, and forbidden for all other classes. This is enforced via JSON Schema if/then/else.
Annotator and timestamp metadata. annotator_id and created_at are required top-level fields to support inter-annotator agreement analysis and dataset versioning.
Annotation tool enforces completeness. The tool must classify every stroke before saving. Therefore the schema has no partial-annotation marker — every saved file is complete ground truth.
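
To make these choices concrete, the following sketch shows what a ground truth document might look like as a Python dict, together with a plain-Python mirror of the conditional text rule. The key names "annotations" and "class", the ISO 8601 timestamp format, and all example values are assumptions for illustration; the authoritative definition is docs/schemas/ground_truth.schema.json, and the real enforcement happens in JSON Schema rather than in code like this.

```python
# Classes that require a transcription, as listed in this ADR.
TEXT_CLASSES = {"word", "digit", "mathematical_expression"}
# The full closed vocabulary.
ALL_CLASSES = TEXT_CLASSES | {
    "arrow", "diagram", "table", "drawing", "separator", "correction", "other",
}

# Hypothetical instance; key names not mentioned in the ADR are assumptions.
example = {
    "schema_version": "1.0.0",
    "annotator_id": "annotator-01",          # made-up identifier
    "created_at": "2026-04-11T12:00:00Z",    # timestamp format is assumed
    "source_document": {
        "filename": "2024-07-26_minimal.xopp",
        "sha256": "<sha256 hex digest of the source file>",  # placeholder
    },
    "annotations": [
        {"class": "word", "page_index": 0, "layer_index": 0,
         "stroke_indices": [0, 1, 2], "text": "Hello"},
        {"class": "arrow", "page_index": 0, "layer_index": 0,
         "stroke_indices": [3]},
    ],
}


def text_rule_errors(annotation: dict) -> list[str]:
    """Return violations of the class vocabulary and the conditional text rule."""
    errors = []
    cls = annotation.get("class")
    if cls not in ALL_CLASSES:
        errors.append(f"unknown class: {cls!r}")
    elif cls in TEXT_CLASSES and "text" not in annotation:
        errors.append(f"'text' is required for class {cls!r}")
    elif cls not in TEXT_CLASSES and "text" in annotation:
        errors.append(f"'text' is forbidden for class {cls!r}")
    return errors


for annotation in example["annotations"]:
    print(annotation["class"], text_rule_errors(annotation))
```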

Rationale

Storing stroke references rather than pixel coordinates future-proofs the annotations: the current image-based model is a temporary detour, and re-annotating an entire dataset to change coordinate systems would be expensive. Stroke indices are cheap to resolve at evaluation time.

A closed class vocabulary prevents inconsistent labels across annotators (e.g. "line" vs "Line" vs "horizontal_line"). Versioning the schema (schema_version: "1.0.0") makes class additions explicit.

Stroke references also serve pixel-based models. To evaluate a bounding-box model against this ground truth, a conversion utility loads the referenced strokes and computes (xmin, ymin, xmax, ymax) at the target DPI. Stroke references carry strictly more information than a stored bounding box: the conversion is one-directional (strokes → boxes, never boxes → strokes), so storing strokes keeps the ground truth usable for both model types.
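
A conversion utility of this kind could look roughly like the sketch below. The function name strokes_to_bbox is hypothetical, strokes are assumed to be already loaded as lists of (x, y) points, and the default of 72 document units per inch is an assumption about the source coordinate system that should be checked against the actual files. Stroke width is ignored here.

```python
from typing import Iterable, Sequence

Point = tuple[float, float]
BBox = tuple[float, float, float, float]


def strokes_to_bbox(
    strokes: Iterable[Sequence[Point]],
    dpi: float,
    units_per_inch: float = 72.0,  # assumed document unit; verify before use
) -> BBox:
    """Compute (xmin, ymin, xmax, ymax) in pixels for a group of strokes."""
    scale = dpi / units_per_inch
    # Flatten once so that generators are also accepted as input.
    points = [point for stroke in strokes for point in stroke]
    if not points:
        raise ValueError("cannot compute a bounding box without points")
    xs = [x * scale for x, _ in points]
    ys = [y * scale for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Two made-up strokes, rendered at 144 DPI (scale factor 2).
example_strokes = [[(72.0, 36.0), (144.0, 72.0)], [(108.0, 18.0)]]
print(strokes_to_bbox(example_strokes, dpi=144))
```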

JSON Schema was chosen as the schema language because it is language-agnostic, has validators in both Python (jsonschema) and JavaScript (ajv), and the data is already JSON. Pydantic can generate a JSON Schema from a BaseModel if a typed Python representation is needed later.

Consequences

Pros

Annotations are independent of rendering DPI — usable for both current image-based models (after coordinate conversion) and future stroke-based models.
Source document hash makes dataset integrity verifiable.
Closed class vocabulary keeps annotations consistent across annotators.
Single file per document keeps the dataset easy to manage.

Cons

Stroke indices are positional and fragile: if the source .xopp file is edited after annotation (strokes added or removed), indices may silently shift. The SHA-256 hash is the only guard — tooling must refuse to load annotations when the hash does not match.
Duplicate stroke references across annotations (same stroke assigned to two words) cannot be caught by JSON Schema and require a separate code-level validator.
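
The code-level validator for duplicate references might be sketched as follows. The function name find_duplicate_strokes is hypothetical, and annotations are assumed to be dicts carrying the page_index, layer_index, and stroke_indices fields described above.

```python
from collections import defaultdict

# A stroke is identified by (page_index, layer_index, stroke index).
StrokeKey = tuple[int, int, int]


def find_duplicate_strokes(annotations: list[dict]) -> dict[StrokeKey, list[int]]:
    """Map each stroke referenced more than once to the annotations using it.

    A stroke listed twice inside a single annotation is reported as well.
    """
    owners: dict[StrokeKey, list[int]] = defaultdict(list)
    for annotation_index, annotation in enumerate(annotations):
        for stroke_index in annotation["stroke_indices"]:
            key = (annotation["page_index"], annotation["layer_index"], stroke_index)
            owners[key].append(annotation_index)
    return {key: indices for key, indices in owners.items() if len(indices) > 1}


annotations = [
    {"page_index": 0, "layer_index": 0, "stroke_indices": [0, 1]},
    {"page_index": 0, "layer_index": 0, "stroke_indices": [1, 2]},  # reuses stroke 1
    {"page_index": 1, "layer_index": 0, "stroke_indices": [1]},     # different page
]
print(find_duplicate_strokes(annotations))
```

Returning the offending annotation indices, rather than a boolean, lets a tool point the annotator at the conflicting entries.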

Tooling Requirements

The schema alone is not sufficient to guarantee a valid dataset. Annotation tools must additionally enforce the following:

Schema validation. Every saved .gt.json file must be validated against ground_truth.schema.json before writing.
No duplicate stroke references. JSON Schema cannot enforce that a stroke is assigned to only one annotation, where a stroke is identified by its page_index, its layer_index, and its entry in stroke_indices. The annotation tool must reject any attempt to assign an already-annotated stroke to a second annotation.
Completeness enforcement. The tool must ensure every stroke in the source document is covered by an annotation before saving. Completeness can be verified by cross-referencing the total stroke count from the .xopp/.xoj file against the union of all stroke_indices in the .gt.json file.
Hash mismatch guard. The tool must refuse to load a .gt.json file if the SHA-256 hash of the currently loaded source document does not match source_document.sha256.
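
The completeness check and the hash guard could be sketched as below. All names are hypothetical. Two assumptions are made: the hash is taken over the raw bytes of the source file as stored on disk, and the stroke counts per (page_index, layer_index) have already been extracted from the source document by some loader that is not shown.

```python
import hashlib


class SourceHashMismatch(ValueError):
    """Raised when the source document no longer matches the stored hash."""


def check_source_hash(source_bytes: bytes, ground_truth: dict) -> None:
    """Refuse to continue if the source document changed after annotation."""
    expected = ground_truth["source_document"]["sha256"].lower()
    actual = hashlib.sha256(source_bytes).hexdigest()
    if actual != expected:
        raise SourceHashMismatch(f"expected {expected}, got {actual}")


def missing_strokes(
    stroke_counts: dict[tuple[int, int], int],
    annotations: list[dict],
) -> set[tuple[int, int, int]]:
    """Return every (page, layer, stroke) not covered by any annotation."""
    expected = {
        (page, layer, stroke)
        for (page, layer), count in stroke_counts.items()
        for stroke in range(count)
    }
    covered = {
        (a["page_index"], a["layer_index"], stroke)
        for a in annotations
        for stroke in a["stroke_indices"]
    }
    return expected - covered


source = b"stand-in for the bytes of a source document"
ground_truth = {"source_document": {"sha256": hashlib.sha256(source).hexdigest()}}
check_source_hash(source, ground_truth)  # passes silently

partial = [{"page_index": 0, "layer_index": 0, "stroke_indices": [0, 2]}]
print(missing_strokes({(0, 0): 3}, partial))
```

A tool would call check_source_hash when loading a .gt.json file and refuse to save while missing_strokes returns a non-empty set.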

Annotation Granularity

Ground truth is annotated at word level, not character level.

Word-level is sufficient for the current evaluation goals: word detection (did the model find the right strokes?) and transcription accuracy (CER/WER computed over words). It is also the right starting point because character-level annotation is significantly more expensive to produce, and stroke-to-character mapping is inherently ambiguous in handwriting — a single stroke can span multiple characters (e.g. a crossing stroke in t or f), and some characters require multiple strokes.
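
For orientation, the sketch below uses a common definition of CER and WER based on Levenshtein edit distance. It is not necessarily the metric implementation this project will adopt; the function names and the pairing of reference and predicted words are assumptions.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_item == hyp_item else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(pairs: list[tuple[str, str]]) -> float:
    """Character error rate over (reference, prediction) word pairs."""
    total_chars = sum(len(reference) for reference, _ in pairs)
    total_edits = sum(edit_distance(reference, prediction) for reference, prediction in pairs)
    return total_edits / total_chars


def wer(reference_words: list[str], predicted_words: list[str]) -> float:
    """Word error rate: edit distance over word sequences."""
    return edit_distance(reference_words, predicted_words) / len(reference_words)


print(cer([("hello", "hallo"), ("world", "world")]))
print(wer(["the", "quick", "fox"], ["the", "quik", "fox"]))
```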

Character-level annotation would additionally enable training and evaluating character segmentation models and fine-grained per-character error analysis. If that becomes a requirement, it can be added as an optional field in a future schema version — but the annotation cost should only be paid when a character-level model exists to benefit from it.

Annotation Conventions

The following conventions must be followed consistently across annotators:

digit granularity. A sequence of digits written as a single connected unit (e.g. 123 with no visible gap between digits) is one digit annotation with text: "123". Digits that are spatially separated (e.g. 1 2 3) are each a separate digit annotation. The determining factor is whether the digits form one visually grouped unit or multiple distinct ones.

Alternatives

Store pixel bounding boxes (as in the existing v1_2024-10-13 annotation schema): simpler for image-based evaluation but ties ground truth to a specific rendering DPI and is useless for stroke-based models.
Embed stroke data (x/y arrays) instead of referencing by index: makes files self-contained but duplicates data already present in the source document and bloats the dataset.
Open class vocabulary: maximum annotator freedom but leads to inconsistent labels and requires post-hoc normalisation before evaluation.
Protocol Buffers: stronger typing and cross-language code generation, but requires a build step and is heavier than necessary for small annotation files.