Benchmarking Plan for NDC-Sharded Indices and Self-Recognition Evolution
Context#
Recent work shows repeated iterations on two fronts: reorganizing indices into NDC shards and evolving self-recognition and synthesis capabilities across the knowledge pack universe. There were also adjustments to operational parameters (e.g., extended timeouts) and minor GUI fixes with patch-level releases. This post outlines a benchmarking plan to quantify the impact of these changes on quality and performance.
Goals#
- Measure retrieval and synthesis quality following NDC sharding and self-recognition updates.
- Quantify throughput and latency effects of the new architecture and timeouts.
- Isolate contributions of individual components via ablation studies.
- Establish stable, comparable baselines for future iterations.
Baselines#
Every ML/AI project benefits from a clear baseline for comparison. We will:
- Use the pre-sharding, pre-evolution system state as the primary baseline.
- Where relevant, include a simple non-ML or heuristic baseline as a floor comparison.
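A floor baseline can be as simple as keyword overlap between query and document. The sketch below is illustrative only; the function names (`score_overlap`, `rank_documents`) are hypothetical and not part of the system under test.

```python
def score_overlap(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def rank_documents(query: str, docs: list[str]) -> list[int]:
    """Return document indices sorted by descending overlap score."""
    return sorted(range(len(docs)), key=lambda i: -score_overlap(query, docs[i]))
```

Any learned retriever that cannot beat this floor on the frozen evaluation set signals a problem with the setup, not just the model.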
Rationale#
Baselines provide the first point of comparison for iterative improvements and guard against regressions.
Data Management and Splits#
Effective evaluation depends on disciplined data handling:
- Maintain strict splits: training, development (validation), and a held-out evaluation (test) set.
- Do not use evaluation data for model or system decisions.
- Freeze evaluation sets to ensure comparability across runs.
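One way to freeze splits is to derive the assignment from a hash of a stable example ID, so membership never changes across runs or machines. A minimal sketch, assuming each example has a unique string ID:

```python
import hashlib

def assign_split(example_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map an example ID to train/dev/test.

    Hashing the ID (rather than shuffling) keeps the assignment stable,
    which prevents evaluation data from leaking into later decisions.
    """
    bucket = int(hashlib.sha256(example_id.encode()).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"
```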
Evaluation Dimensions and Metrics#
1) Quality
- Retrieval accuracy and relevance for representative queries.
- Consistency and factuality of synthesized outputs.
- For bilingual QA/MT workflows (if applicable), integrate MQM (Multidimensional Quality Metrics) to identify error patterns and guide post-editing.
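Retrieval accuracy on representative queries can be scored with standard rank metrics such as recall@k and mean reciprocal rank. A self-contained sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```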
2) Performance and Cost Signals
- Measure throughput and latency under representative loads.
- When calculating inference TCO, capture:
- Throughput: prefill and decode TPS (tokens per second).
- Latency SLOs for end-to-end requests.
- Model/hardware configuration fields (without changing other variables during controlled comparisons).
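Client-side latency and throughput under a fixed workload can be probed with a small harness like the one below. The `generate` callable is a hypothetical stand-in for the real inference entry point; prefill/decode TPS would normally come from server-side counters, so this sketch covers only the end-to-end latency portion of the TCO fields.

```python
import time

def measure_latency(generate, requests: list[str]) -> dict:
    """Run requests sequentially and report throughput and latency percentiles."""
    latencies = []
    start = time.perf_counter()
    for req in requests:
        t0 = time.perf_counter()
        generate(req)  # stand-in for the system under test
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    latencies.sort()
    return {
        "throughput_rps": len(requests) / wall,
        "p50_s": latencies[len(latencies) // 2],
        "p95_s": latencies[int(len(latencies) * 0.95)],
    }
```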
Ablation Study Design#
To attribute gains to specific changes:
- Isolate variables: toggle a single component at a time (e.g., self-recognition evolution, synthesis changes, NDC sharding) while keeping all other conditions constant.
- Use identical data splits, prompts/queries, and evaluation harness settings across ablations.
- Run multiple seeds/trials when stochasticity is involved.
- Report effect sizes with confidence intervals where possible.
Suggested ablations:
- Sharding impact: NDC sharded indices vs. previous indexing approach.
- Self-recognition evolution: on vs. off, or stepwise variants.
- Synthesis changes: with vs. without synthesis-specific updates.
- Operational parameters: previous vs. extended timeout under identical workloads.
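The one-at-a-time discipline above can be encoded by generating one variant config per component flag, so each run differs from the baseline in exactly one variable. The flag names below mirror the suggested ablations and are illustrative:

```python
BASELINE = {
    "ndc_sharding": False,
    "self_recognition": False,
    "synthesis_updates": False,
    "extended_timeout": False,
}

def one_at_a_time(baseline: dict) -> list[dict]:
    """One variant per component, flipping a single flag per run."""
    variants = []
    for key in baseline:
        cfg = dict(baseline)
        cfg[key] = not cfg[key]
        variants.append(cfg)
    return variants
```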
Experimental Protocol#
- Single-change comparisons relative to baseline.
- Predefine metrics, thresholds, and success criteria before running.
- Use matched workloads and fixed evaluation sets.
- Record environment, configuration, and evaluation versions with each run.
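Recording environment and configuration can be automated with a small manifest written next to each run's results. A minimal sketch; the field names are illustrative:

```python
import json
import platform
import sys
from datetime import datetime, timezone

def run_manifest(config: dict, eval_version: str) -> str:
    """Serialize a reproducibility snapshot for one benchmark run."""
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "config": config,
        "eval_version": eval_version,
    }
    return json.dumps(manifest, indent=2, sort_keys=True)
```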
Reporting Template#
For each experiment, capture:
- Objective: what component/change is under test.
- Setup: data splits, prompts/queries, configuration, and evaluation criteria.
- Metrics: quality (retrieval/synthesis), throughput, latency, and any TCO fields.
- Results: baseline vs. variant deltas; include error bars/variability when available.
- Interpretation: whether the change meets predefined success criteria.
- Follow-ups: next ablation or mitigation if criteria are not met.
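The template above maps naturally onto a structured record, which makes baseline-vs-variant deltas mechanical rather than manual. A sketch with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentReport:
    objective: str            # what component/change is under test
    setup: dict               # splits, prompts/queries, configuration
    metrics: dict             # variant quality/throughput/latency numbers
    baseline_metrics: dict    # same metrics for the baseline run
    meets_criteria: bool      # against predefined success criteria
    follow_ups: list = field(default_factory=list)

    def deltas(self) -> dict:
        """Variant minus baseline for every shared numeric metric."""
        return {k: self.metrics[k] - self.baseline_metrics[k]
                for k in self.metrics if k in self.baseline_metrics}
```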
Next Steps#
- Lock evaluation datasets and metrics definitions.
- Run the baseline suite.
- Execute the ablations in the order above, starting with NDC sharding and self-recognition evolution.
- Iterate based on measured deltas, not perceived improvements.
This plan creates a stable, repeatable framework to evaluate the ongoing index reorganization and self-recognition/synthesis evolution, ensuring changes are justified by measurable gains in quality, performance, or both.