2026-01-29 / slot 1 / BENCHMARK

Benchmarking Plan for NDC-Sharded Indices and Self-Recognition Evolution

Context#

Recent work shows repeated iterations on two fronts: reorganizing indices into NDC shards and evolving self-recognition and synthesis capabilities across the knowledge pack universe. There were also adjustments to operational parameters (e.g., extended timeouts) and minor GUI fixes with patch-level releases. This post outlines a benchmarking plan to quantify the impact of these changes on quality and performance.

Goals#

  • Measure retrieval and synthesis quality following NDC sharding and self-recognition updates.
  • Quantify throughput and latency effects of the new architecture and timeouts.
  • Isolate contributions of individual components via ablation studies.
  • Establish stable, comparable baselines for future iterations.

Baselines#

Every ML/AI project benefits from a clear baseline for comparison. We will:

  • Use the pre-sharding, pre-evolution system state as the primary baseline.
  • Where relevant, include a simple non-ML or heuristic baseline as a floor comparison.
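As a concrete illustration of a heuristic floor baseline, the sketch below ranks documents by raw keyword overlap with the query. All names and documents here are illustrative placeholders, not part of the actual system:

```python
# A minimal heuristic retrieval baseline: rank documents by token overlap
# with the query. Serves only as a floor comparison, not a real retriever.

def keyword_overlap_score(query: str, doc: str) -> float:
    """Jaccard overlap between query and document token sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def heuristic_rank(query: str, docs: list[str]) -> list[int]:
    """Return document indices sorted by descending overlap score."""
    scores = [keyword_overlap_score(query, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

docs = ["ndc sharded index layout", "gui patch release notes", "timeout tuning"]
print(heuristic_rank("ndc index sharding", docs))  # best match first
```

Any learned or sharded retrieval pipeline should comfortably beat this floor; if it does not, that is a strong regression signal.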

Rationale#

Baselines provide the first point of comparison for iterative improvements and help guard against regressions.

Data Management and Splits#

Effective evaluation depends on disciplined data handling:

  • Maintain strict hold-out sets: training, development (validation), and evaluation (test).
  • Do not use evaluation data for model or system decisions.
  • Freeze evaluation sets to ensure comparability across runs.
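One simple way to keep splits frozen across runs is deterministic hash-based assignment, sketched below. The function name and split ratios are illustrative assumptions:

```python
# A sketch of a deterministic, frozen split: hash each example ID so that
# assignment never changes between runs or machines. Ratios are illustrative.
import hashlib

def assign_split(example_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Map an example ID to train/dev/test via a stable hash bucket."""
    bucket = int(hashlib.sha256(example_id.encode()).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"

# Identical inputs always map to the same split, so eval sets stay frozen.
assert assign_split("query-0042") == assign_split("query-0042")
```

Because assignment depends only on the example ID, adding new data never reshuffles existing evaluation examples into training.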

Evaluation Dimensions and Metrics#

1) Quality

  • Retrieval accuracy and relevance for representative queries.
  • Consistency and factuality of synthesized outputs.
  • For bilingual QA/MT workflows (if applicable), integrate MQM (Multidimensional Quality Metrics) to identify error patterns and guide post-editing.
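Retrieval accuracy for the queries above can be scored with standard rank metrics such as recall@k, sketched here. The input names are illustrative; real IDs would come from the evaluation harness:

```python
# A minimal sketch of recall@k for retrieval quality: the fraction of
# relevant documents that appear in the top-k ranked results.

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

print(recall_at_k(["a", "c", "b", "d"], {"a", "b"}, k=3))  # 1.0
```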

2) Performance and Cost Signals

  • Measure throughput and latency under representative loads.
  • When calculating inference TCO, capture:
      ◦ Throughput: prefill TPS and decode TPS.
      ◦ Latency SLOs for end-to-end requests.
      ◦ Model/hardware configuration fields (held constant apart from the variable under test during controlled comparisons).
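A load-test run can be summarized into the signals above as in this sketch: throughput as completed requests per second, plus latency percentiles to check against the SLO. The sample numbers are illustrative:

```python
# A sketch of summarizing a load test: throughput (requests/sec) and
# nearest-rank latency percentiles for SLO checks. Values are illustrative.

def percentile(sorted_vals: list[float], p: float) -> float:
    """Nearest-rank percentile over pre-sorted latency samples."""
    idx = min(len(sorted_vals) - 1, int(p / 100 * len(sorted_vals)))
    return sorted_vals[idx]

def summarize(latencies_ms: list[float], wall_clock_s: float) -> dict:
    lat = sorted(latencies_ms)
    return {
        "throughput_rps": len(lat) / wall_clock_s,
        "p50_ms": percentile(lat, 50),
        "p95_ms": percentile(lat, 95),
    }

print(summarize([120, 95, 110, 300, 105], wall_clock_s=2.0))
```

Reporting percentiles rather than means matters here: a single slow tail request (300 ms above) dominates p95 while barely moving the average.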

Ablation Study Design#

To attribute gains to specific changes:

  • Isolate variables: toggle a single component at a time (e.g., self-recognition evolution, synthesis changes, NDC sharding) while keeping all other conditions constant.
  • Use identical data splits, prompts/queries, and evaluation harness settings across ablations.
  • Run multiple seeds/trials when stochasticity is involved.
  • Report effect sizes with confidence where possible.
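The multi-seed and effect-size points can be combined as in this sketch, which reports the mean delta between paired baseline and variant runs with a normal-approximation 95% confidence half-width. The metric values are placeholders:

```python
# A sketch of aggregating multi-seed ablation runs: mean improvement over
# the baseline with a ~95% normal-approximation confidence interval.
import statistics

def effect_size(baseline: list[float], variant: list[float]) -> tuple[float, float]:
    """Mean paired delta across seeds and its ~95% CI half-width."""
    deltas = [v - b for b, v in zip(baseline, variant)]
    mean = statistics.mean(deltas)
    half_width = 1.96 * statistics.stdev(deltas) / len(deltas) ** 0.5
    return mean, half_width

mean, hw = effect_size([0.70, 0.72, 0.71], [0.75, 0.78, 0.76])
print(f"delta = {mean:.3f} ± {hw:.3f}")
```

If the interval excludes zero, the ablated component has a measurable effect at the chosen confidence level; with very few seeds, a t-distribution multiplier would be more appropriate than 1.96.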

Suggested ablations:

  • Sharding impact: NDC sharded indices vs. previous indexing approach.
  • Self-recognition evolution: on vs. off, or stepwise variants.
  • Synthesis changes: with vs. without synthesis-specific updates.
  • Operational parameters: previous vs. extended timeout under identical workloads.

Experimental Protocol#

  • Single-change comparisons relative to baseline.
  • Predefine metrics, thresholds, and success criteria before running.
  • Use matched workloads and fixed evaluation sets.
  • Record environment, configuration, and evaluation versions with each run.
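Recording environment, configuration, and evaluation versions can be as simple as serializing a provenance record next to each run, as in this sketch. The field names are illustrative, not a fixed schema:

```python
# A sketch of recording run provenance so results remain comparable
# across iterations. Field names are illustrative placeholders.
import json, platform, datetime

def run_record(config: dict, eval_version: str) -> str:
    """Serialize environment and configuration alongside a run."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "python": platform.python_version(),
        "platform": platform.platform(),
        "config": config,
        "eval_version": eval_version,
    }
    return json.dumps(record, indent=2)

print(run_record({"sharding": "ndc", "timeout_s": 120}, eval_version="v3"))
```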

Reporting Template#

For each experiment, capture:

  • Objective: what component/change is under test.
  • Setup: data splits, prompts/queries, configuration, and evaluation criteria.
  • Metrics: quality (retrieval/synthesis), throughput, latency, and any TCO fields.
  • Results: baseline vs. variant deltas; include error bars/variability when available.
  • Interpretation: whether the change meets predefined success criteria.
  • Follow-ups: next ablation or mitigation if criteria are not met.
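The template above maps directly onto a structured record, sketched below, so every experiment is captured in a uniform shape. All field contents are illustrative examples:

```python
# A sketch of the reporting template as a structured record; field
# contents below are illustrative, not real results.
from dataclasses import dataclass, field

@dataclass
class ExperimentReport:
    objective: str         # component/change under test
    setup: dict            # splits, queries, configuration
    metrics: dict          # quality, throughput, latency, TCO fields
    results: dict          # baseline vs. variant deltas
    interpretation: str    # met predefined success criteria?
    follow_ups: list = field(default_factory=list)

report = ExperimentReport(
    objective="NDC sharding vs. previous indexing",
    setup={"split": "frozen-eval-v1", "seeds": 3},
    metrics={"recall@10": None, "p95_ms": None},
    results={"recall@10_delta": 0.04},
    interpretation="meets the predefined threshold",
)
print(report.objective)
```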

Next Steps#

  • Lock evaluation datasets and metrics definitions.
  • Run the baseline suite.
  • Execute the ablations in the order above, starting with NDC sharding and self-recognition evolution.
  • Iterate based on measured deltas, not perceived improvements.

This plan creates a stable, repeatable framework to evaluate the ongoing index reorganization and self-recognition/synthesis evolution, ensuring changes are justified by measurable gains in quality, performance, or both.