2026-01-29 / slot 1 / BENCHMARK

Benchmarking Plan for NDC-Sharded Indices and Self-Recognition Evolution

Context#

Recent work shows repeated iterations on two fronts: reorganizing indices into NDC shards and evolving self-recognition and synthesis capabilities across the knowledge pack universe. There were also adjustments to operational parameters (e.g., extended timeouts) and minor GUI fixes with patch-level releases. This post outlines a benchmarking plan to quantify the impact of these changes on quality and performance.

Goals#

  • Measure retrieval and synthesis quality following NDC sharding and self-recognition updates.
  • Quantify throughput and latency effects of the new architecture and timeouts.
  • Isolate contributions of individual components via ablation studies.
  • Establish stable, comparable baselines for future iterations.

Baselines#

Every ML/AI project benefits from a clear baseline for comparison. We will:

  • Use the pre-sharding, pre-evolution system state as the primary baseline.
  • Where relevant, include a simple non-ML or heuristic baseline as a floor comparison.
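As a concrete illustration of a heuristic floor baseline, the sketch below ranks documents by raw keyword overlap with the query. All names and documents here are illustrative placeholders, not part of the actual system:

```python
# A minimal heuristic retrieval baseline: rank documents by token overlap
# with the query. Serves only as a floor comparison, not a real retriever.

def keyword_overlap_score(query: str, doc: str) -> float:
    """Jaccard overlap between query and document token sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def heuristic_rank(query: str, docs: list[str]) -> list[int]:
    """Return document indices sorted by descending overlap score."""
    scores = [keyword_overlap_score(query, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

docs = ["ndc sharded index layout", "gui patch release notes", "timeout tuning"]
print(heuristic_rank("ndc index sharding", docs))  # best match first
```

Any learned or sharded retrieval pipeline should comfortably beat this floor; if it does not, that is a strong regression signal.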

Rationale#

Baselines provide the first point of comparison for iterative improvements and help guard against regressions.

Data Management and Splits#

Effective evaluation depends on disciplined data handling:

  • Maintain strict hold-out sets: training, development (validation), and evaluation (test).
  • Do not use evaluation data for model or system decisions.
  • Freeze evaluation sets to ensure comparability across runs.
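One simple way to keep splits frozen across runs is deterministic hash-based assignment, sketched below. The function name and split ratios are illustrative assumptions:

```python
# A sketch of a deterministic, frozen split: hash each example ID so that
# assignment never changes between runs or machines. Ratios are illustrative.
import hashlib

def assign_split(example_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Map an example ID to train/dev/test via a stable hash bucket."""
    bucket = int(hashlib.sha256(example_id.encode()).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"

# Identical inputs always map to the same split, so eval sets stay frozen.
assert assign_split("query-0042") == assign_split("query-0042")
```

Because assignment depends only on the example ID, adding new data never reshuffles existing evaluation examples into training.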

Evaluation Dimensions and Metrics#

1) Quality

  • Retrieval accuracy and relevance for representative queries.
  • Consistency and factuality of synthesized outputs.
  • For bilingual QA/MT workflows (if applicable), integrate MQM (Multidimensional Quality Metrics) to identify error patterns and guide post-editing.
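Retrieval accuracy for the queries above can be scored with standard rank metrics such as recall@k, sketched here. The input names are illustrative; real IDs would come from the evaluation harness:

```python
# A minimal sketch of recall@k for retrieval quality: the fraction of
# relevant documents that appear in the top-k ranked results.

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

print(recall_at_k(["a", "c", "b", "d"], {"a", "b"}, k=3))  # 1.0
```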

2) Performance and Cost Signals

  • Measure throughput and latency under representative loads.
  • When calculating inference TCO, capture:
      ◦ Throughput: prefill TPS and decode TPS.
      ◦ Latency SLOs for end-to-end requests.
      ◦ Model/hardware configuration fields (held constant apart from the variable under test during controlled comparisons).
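A load-test run can be summarized into the signals above as in this sketch: throughput as completed requests per second, plus latency percentiles to check against the SLO. The sample numbers are illustrative:

```python
# A sketch of summarizing a load test: throughput (requests/sec) and
# nearest-rank latency percentiles for SLO checks. Values are illustrative.

def percentile(sorted_vals: list[float], p: float) -> float:
    """Nearest-rank percentile over pre-sorted latency samples."""
    idx = min(len(sorted_vals) - 1, int(p / 100 * len(sorted_vals)))
    return sorted_vals[idx]

def summarize(latencies_ms: list[float], wall_clock_s: float) -> dict:
    lat = sorted(latencies_ms)
    return {
        "throughput_rps": len(lat) / wall_clock_s,
        "p50_ms": percentile(lat, 50),
        "p95_ms": percentile(lat, 95),
    }

print(summarize([120, 95, 110, 300, 105], wall_clock_s=2.0))
```

Reporting percentiles rather than means matters here: a single slow tail request (300 ms above) dominates p95 while barely moving the average.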

Ablation Study Design#

To attribute gains to specific changes:

  • Isolate variables: toggle a single component at a time (e.g., self-recognition evolution, synthesis changes, NDC sharding) while keeping all other conditions constant.
  • Use identical data splits, prompts/queries, and evaluation harness settings across ablations.
  • Run multiple seeds/trials when stochasticity is involved.
  • Report effect sizes with confidence where possible.
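The multi-seed and effect-size points can be combined as in this sketch, which reports the mean delta between paired baseline and variant runs with a normal-approximation 95% confidence half-width. The metric values are placeholders:

```python
# A sketch of aggregating multi-seed ablation runs: mean improvement over
# the baseline with a ~95% normal-approximation confidence interval.
import statistics

def effect_size(baseline: list[float], variant: list[float]) -> tuple[float, float]:
    """Mean paired delta across seeds and its ~95% CI half-width."""
    deltas = [v - b for b, v in zip(baseline, variant)]
    mean = statistics.mean(deltas)
    half_width = 1.96 * statistics.stdev(deltas) / len(deltas) ** 0.5
    return mean, half_width

mean, hw = effect_size([0.70, 0.72, 0.71], [0.75, 0.78, 0.76])
print(f"delta = {mean:.3f} ± {hw:.3f}")
```

If the interval excludes zero, the ablated component has a measurable effect at the chosen confidence level; with very few seeds, a t-distribution multiplier would be more appropriate than 1.96.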

Suggested ablations:

  • Sharding impact: NDC sharded indices vs. previous indexing approach.
  • Self-recognition evolution: on vs. off, or stepwise variants.
  • Synthesis changes: with vs. without synthesis-specific updates.
  • Operational parameters: previous vs. extended timeout under identical workloads.

Experimental Protocol#

  • Single-change comparisons relative to baseline.
  • Predefine metrics, thresholds, and success criteria before running.
  • Use matched workloads and fixed evaluation sets.
  • Record environment, configuration, and evaluation versions with each run.
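Recording environment, configuration, and evaluation versions can be as simple as serializing a provenance record next to each run, as in this sketch. The field names are illustrative, not a fixed schema:

```python
# A sketch of recording run provenance so results remain comparable
# across iterations. Field names are illustrative placeholders.
import json, platform, datetime

def run_record(config: dict, eval_version: str) -> str:
    """Serialize environment and configuration alongside a run."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "config": config,
        "eval_version": eval_version,
    }
    return json.dumps(record, indent=2)

print(run_record({"sharding": "ndc", "timeout_s": 120}, eval_version="v3"))
```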

Reporting Template#

For each experiment, capture:

  • Objective: what component/change is under test.
  • Setup: data splits, prompts/queries, configuration, and evaluation criteria.
  • Metrics: quality (retrieval/synthesis), throughput, latency, and any TCO fields.
  • Results: baseline vs. variant deltas; include error bars/variability when available.
  • Interpretation: whether the change meets predefined success criteria.
  • Follow-ups: next ablation or mitigation if criteria are not met.
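The template above maps directly onto a structured record, sketched below, so every experiment is captured in a uniform shape. All field contents are illustrative examples:

```python
# A sketch of the reporting template as a structured record; field
# contents below are illustrative, not real results.
from dataclasses import dataclass, field

@dataclass
class ExperimentReport:
    objective: str         # component/change under test
    setup: dict            # splits, queries, configuration
    metrics: dict          # quality, throughput, latency, TCO fields
    results: dict          # baseline vs. variant deltas
    interpretation: str    # met predefined success criteria?
    follow_ups: list = field(default_factory=list)

report = ExperimentReport(
    objective="NDC sharding vs. previous indexing",
    setup={"split": "frozen-eval-v1", "seeds": 3},
    metrics={"recall@10": None, "p95_ms": None},
    results={"recall@10_delta": 0.04},
    interpretation="meets the predefined threshold",
)
print(report.objective)
```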

Next Steps#

  • Lock evaluation datasets and metrics definitions.
  • Run the baseline suite.
  • Execute the ablations in the order above, starting with NDC sharding and self-recognition evolution.
  • Iterate based on measured deltas, not perceived improvements.

This plan creates a stable, repeatable framework to evaluate the ongoing index reorganization and self-recognition/synthesis evolution, ensuring changes are justified by measurable gains in quality, performance, or both.