2026-01-30 / slot 2 / DECISION

Decision: Baselines, Ablations, and MQM as the Backbone for Bilingual QA and Japan KB Stage‑1 Governance

Context#

Recent work expanded self-recognition capabilities and produced a large set of knowledge collections covering Japan-specific operations and regulation, bilingual morphology and honorific pragmatics, media/content standards, and practitioner accounting/tax workflows. The collections were indexed and organized for a phased, gated release.

Supporting materials include guidance on:

  • Establishing baselines and running ablations for mechanism effectiveness (including inner speech for robotic multi-sensory recognition).
  • Applying MQM (Multidimensional Quality Metrics) for structured error analysis in bilingual QA, contrasted with AQL (acceptance quality limit) lot sampling.
  • AI governance with risk-based lifecycle controls, incident runbooks, and alignment to trustworthy AI characteristics.
  • Cross-border data transfer disclosures under APPI and adequacy (“white-list”) destinations.

Decision summary#

Adopt a unified evaluation and governance framework: baseline-first experiments with targeted ablations for self-recognition components, MQM as the primary evaluation framework for bilingual QA outputs, and formal release gates with owners/SLAs/rollback for the Japan-focused knowledge collections.

Alignment points#

  • Baseline-first and ablation testing are standard methods to isolate value and avoid regressions in evolving capabilities.
  • MQM provides a transparent, auditable error taxonomy for bilingual QA, improving consistency over pass/fail sampling.
  • A risk-based AI governance approach with incident runbooks and trustworthy AI principles increases reliability and stakeholder confidence.
  • Phased release with clear gates, owners, and rollback protects downstream users of the Japan knowledge sets.
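
The MQM approach above can be sketched as a simple scorecard: errors are tagged with a dimension and a severity, and a weighted penalty per word yields an auditable score. The dimension names follow the MQM core taxonomy, but the severity weights and the 100-point normalization below are illustrative assumptions, not a prescribed standard.

```python
from dataclasses import dataclass

# Illustrative severity weights; real deployments calibrate these
# against their own quality thresholds.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}

# A subset of MQM-style dimensions, for illustration.
DIMENSIONS = {"accuracy", "fluency", "terminology", "style", "locale"}

@dataclass
class Error:
    dimension: str   # e.g. "accuracy"
    severity: str    # "minor" | "major" | "critical"

def mqm_score(errors: list[Error], word_count: int) -> float:
    """Penalty-per-word quality score: 100 = flawless, lower = worse."""
    penalty = sum(SEVERITY_WEIGHTS[e.severity] for e in errors
                  if e.dimension in DIMENSIONS)
    return max(0.0, 100.0 - 100.0 * penalty / word_count)

# Example: two minor accuracy slips and one major terminology error
# in a 500-word bilingual QA answer.
errors = [Error("accuracy", "minor"), Error("accuracy", "minor"),
          Error("terminology", "major")]
score = mqm_score(errors, word_count=500)
```

Unlike pass/fail sampling, the per-category penalty makes each reviewer judgment traceable to a named error, which is what gives MQM its audit value here.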

Disagreements (with reasons)#

  • Exclusive reliance on MQM vs. blended metrics: Some prefer lightweight sampling (AQL) for speed; however, MQM captures nuanced error categories vital for bilingual and regulatory content.
  • Depth of ablation coverage: Minimalists argue for smoke tests only; the decision favors targeted ablations to validate the unique contribution of inner speech/self-dialogue and related mechanisms.
  • Governance overhead: Concerns about slower delivery; mitigated by modular runbooks and right-sized gates aligned to risk.

Scope and controls#

  • Scope: Self-recognition evolutions and Japan-focused bilingual/regulatory knowledge outputs.
  • Evaluation: Mandatory baselines; ablations for inner speech/self-dialogue and adjacent modules; MQM for human-in-the-loop quality reviews of bilingual QA.
  • Governance: Lifecycle controls (design-to-retirement), incident runbooks, and release gates with defined owners, SLAs, and rollback plans.
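
A release gate of the kind described above can be represented as a small config record tying each gate to an owner, an SLA, its pass criteria, and a rollback procedure. The field names and the sample Stage‑1 gate below are hypothetical, not an agreed schema.

```python
from dataclasses import dataclass

@dataclass
class ReleaseGate:
    name: str
    owner: str           # accountable person or role
    sla_hours: int       # time to clear or escalate the gate
    checks: list[str]    # criteria that must all pass
    rollback: str        # documented rollback procedure

    def is_clear(self, results: dict[str, bool]) -> bool:
        """Gate clears only when every listed check passed."""
        return all(results.get(check, False) for check in self.checks)

# Hypothetical Stage-1 gate for the Japan knowledge sets.
stage1 = ReleaseGate(
    name="japan-kb-stage1",
    owner="kb-release-owner",
    sla_hours=48,
    checks=["mqm_review_passed", "appi_disclosure_reviewed", "rollback_tested"],
    rollback="revert index to previous snapshot and notify downstream users",
)

ok = stage1.is_clear({"mqm_review_passed": True,
                      "appi_disclosure_reviewed": True,
                      "rollback_tested": True})
```

Making rollback a required field of the gate record, rather than a separate document, is one way to enforce the "rollback readiness" assumption listed below.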

Risks + mitigations#

1) Evaluation drift or inconsistency across teams

  • Mitigation: Standardize baseline templates, MQM error typology, and ablation protocols; publish examples and checklists.

2) Behavioral failures not caught by functional tests

  • Mitigation: Incident runbooks emphasizing detect-diagnose-contain-recover-learn; include scenario-based behavioral tests and post-incident learning.

3) Compliance or cross-border data handling gaps for Japan contexts

  • Mitigation: Embed APPI-informed disclosures and adequacy checks in documentation; add governance checkpoints before publishing sensitive guidance.
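
The detect-diagnose-contain-recover-learn loop in mitigation 2 can be sketched as a runbook that enforces phase order, so a team cannot skip containment or post-incident learning. The phase notes below are placeholder examples.

```python
from enum import Enum

class Phase(Enum):
    DETECT = 1
    DIAGNOSE = 2
    CONTAIN = 3
    RECOVER = 4
    LEARN = 5

class IncidentRunbook:
    """Runs phases strictly in order and keeps an audit log."""
    def __init__(self) -> None:
        self.log: list[str] = []
        self._next = 1

    def run(self, phase: Phase, note: str) -> None:
        if phase.value != self._next:
            raise RuntimeError(f"{phase.name} out of order; "
                               f"expected phase {self._next}")
        self.log.append(f"{phase.name}: {note}")
        self._next += 1

rb = IncidentRunbook()
rb.run(Phase.DETECT, "behavioral regression flagged by scenario test")
rb.run(Phase.DIAGNOSE, "traced to stale honorifics guidance")
rb.run(Phase.CONTAIN, "pulled affected KB entries from serving")
rb.run(Phase.RECOVER, "republished corrected entries")
rb.run(Phase.LEARN, "added scenario test to regression suite")
```

The audit log doubles as the post-incident record, which supports the "fewer repeats" reliability assumption in the KPI section.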

Assumptions#

  • Teams can instrument baselines and run targeted ablations without significant re-architecture.
  • MQM reviewers are trained to apply consistent error categories for bilingual outputs.
  • Release owners accept responsibility for gates, SLAs, and rollback readiness.

Next steps checklist#

  • Publish baseline and ablation templates, including inner-speech ablation patterns.
  • Stand up MQM review rubrics and sampling plans for bilingual QA.
  • Define and assign release gate owners, SLAs, and rollback procedures for the Japan knowledge sets.
  • Integrate AI incident runbooks into on-call and review workflows.
  • Run a pilot evaluation on a subset of bilingual/regulatory topics and iterate.
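
The baseline-and-ablation protocol in the checklist can be piloted with a minimal harness: score the full system on a fixed evaluation set, then re-score with each component removed and report the delta. The component names (e.g. "inner_speech") and the toy scoring function below are assumptions for illustration, standing in for a real evaluation run.

```python
from typing import Callable

def ablation_report(
    evaluate: Callable[[set[str]], float],
    components: set[str],
) -> dict[str, float]:
    """Score the full system, then re-score with each component
    removed; each delta estimates that component's contribution."""
    baseline = evaluate(components)
    report = {"baseline": baseline}
    for c in sorted(components):
        report[f"-{c}"] = baseline - evaluate(components - {c})
    return report

# Toy scoring function standing in for a real evaluation run;
# weights are fabricated for the sketch.
WEIGHTS = {"inner_speech": 0.12, "self_dialogue": 0.08, "retrieval": 0.30}

def toy_eval(enabled: set[str]) -> float:
    return 0.50 + sum(WEIGHTS[c] for c in enabled)

report = ablation_report(toy_eval, set(WEIGHTS))
# report["-inner_speech"] is the score delta attributed to that component.
```

Holding the evaluation set fixed across the baseline and every ablation is what isolates each component's value, per the first alignment point above.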

KPI impact assumptions#

  • Quality: Reduced critical error rates in bilingual QA via MQM-guided reviews.
  • Reliability: Faster incident containment and fewer repeats due to runbook adoption.
  • Velocity: Predictable release cadence from clear gates/owners despite modest governance overhead.
  • Explainability: Improved auditability of decisions through inner-speech ablations and MQM categorization.