AI Capability Discipline v0.9

Stop Building AI Theater. Build Capability.

From Magic Thinking to Governed, Measurable, Maintainable AI Systems.

Controlled-sharing candidate. Field guidance, not final company policy.

This package is an operating discipline reference for leaders, architects, governance reviewers, and practitioners working on correctness-matters AI. It does not approve tools, data classes, runtime paths, or production use cases.

What capability requires

Intent, source authority, and data boundaries
Context, schemas, harnesses, and valid evals
Human review, observability, change control, and sustainment
Measured business value, not demo theater

1. Executive brief

AI is not magic. AI is a system capability candidate, and sometimes it is not the right answer. The first question is whether AI should be used at all. If it is used, the work must be shaped by intent, source authority, data boundaries, context management, schemas, workflows, tool permissions, feedback loops, evals, human review, telemetry, governance, change control, sustainment ownership, run-cost realism, and measurable outcome.

The key leadership reset is simple: model access is not capability access. A better model can reduce friction, but it cannot define the business outcome, approve the data path, decide source authority, validate the workflow, make deterministic work probabilistic without consequence, own sustainment, or absorb accountability when the output is wrong.

Five things leaders should stop approving

Stop approving	Replace with
Demos as proof of capability	Evidence-backed pilot gates
Prompt reuse as operating discipline	Harness, schema, eval, telemetry, and ownership
Tool approval as use-case approval	Tool, data, workflow, and governance routing checks
Human-in-the-loop as decorative safety	Reviewer authority, evidence, queue, override, and audit model
Evals that score polish	Evals that measure business correctness, risk, and safe failure

Leader approval stop rule

If the team cannot explain the outcome, data boundary, source authority, schema, eval validity, human review model, telemetry, sustainment owner, and stop condition, approve discovery only. Do not approve broad pilot, production routing, autonomous execution, or business-process dependency.

2. Executive mental model reset

This section is deliberately blunt because leaders often see the visible model output and miss the system underneath it. The replacement models turn AI from a magic tool story into an operating-model conversation about controls, evidence, ownership, and accountable decisions.

Bad mental model	Replacement model	What leaders should ask
AI is magic	AI is a probabilistic system component	What controls make it reliable?
Better model means solved	Better models reduce friction but do not define accountability	What remains outside the model?
We already have Copilot	Tool access is not use-case approval	Which data and workflow are approved?
Give me the prompt	A prompt is only one expression of a task	What is the operating pattern?
We built an agent	An agent is a component, not a capability	What workflow, governance, telemetry, and owner exist?
The demo worked	A demo proves possibility, not reliability	What happens on edge cases and missing evidence?
The eval passed	Eval success matters only if the eval measures the right thing	What does the eval actually prove?
Human-in-the-loop means safe	Human review works only with authority, evidence, time, and override workflow	Who can override and what is captured?

3. Capability equation

The equation is not math for decoration. It is a dependency map. Missing any major term does not mean the work is useless, but it does mean the work is not yet a governed capability and should be treated as discovery, prototype, or bounded pilot.

AI Capability =
Clear Intent
+ Approved Data Path
+ Source Authority
+ Workflow Fit
+ Schema Contracts
+ Harnesses
+ Rubrics
+ Valid Evals
+ Human Review
+ Tool Permissions
+ Observability
+ Governance
+ Sustainment Ownership
+ Measured Business Outcome

A prompt, agent, skill, rubric, or eval can be useful. None of them becomes a capability until the full equation is credible enough to survive real work.
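Read as a dependency map, the equation can be sketched as a simple checklist evaluation. The snippet below is illustrative only: the term names mirror the equation, while `missing_terms`, `classify`, and the returned labels are hypothetical names introduced here and are not part of any internal tooling.

```python
# Terms copied from the capability equation; the identifiers are illustrative.
CAPABILITY_TERMS = [
    "clear_intent", "approved_data_path", "source_authority", "workflow_fit",
    "schema_contracts", "harnesses", "rubrics", "valid_evals", "human_review",
    "tool_permissions", "observability", "governance",
    "sustainment_ownership", "measured_business_outcome",
]


def missing_terms(evidence: dict) -> list:
    """Return the equation terms that lack credible evidence, in equation order."""
    return [term for term in CAPABILITY_TERMS if not evidence.get(term, False)]


def classify(evidence: dict) -> str:
    """Treat any missing term as a signal to stay in discovery, prototype, or pilot."""
    if missing_terms(evidence):
        return "discovery, prototype, or bounded pilot"
    return "governed capability candidate"


if __name__ == "__main__":
    evidence = {term: True for term in CAPABILITY_TERMS}
    evidence["valid_evals"] = False
    print(missing_terms(evidence), classify(evidence))
```

A missing term does not discard the work. It only changes how the work should be labeled and routed.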

4. Demo-to-capability gap

Demos compress uncertainty into a polished moment. This table separates what a demo can legitimately prove from what it cannot prove, so leaders do not confuse plausibility with readiness or adoption enthusiasm with operational evidence.

Demo-to-capability ladder

A polished demo is evidence of possibility. It becomes a governed capability only as the team adds contracts, review, evals, telemetry, and operational ownership.

Demo: plausible output in one scenario
→ Prompt artifact: repeatable interaction pattern
→ Reusable harness: named inputs, outputs, and review path
→ Structured workflow: states, checkpoints, and human review
→ Eval-backed assistant: fixtures, rubric, and negative controls
→ Governed pilot: approved data route, telemetry, and stop conditions
→ Operational capability: support, monitoring, release discipline, and adoption

Leadership question: which control is still missing before this use case earns the next rung?

Demo proves	Demo does not prove
The model can produce a plausible answer	The answer is correct, current, supported, or useful
The tool can call an API	API use is approved, safe, auditable, or reversible
Users were impressed	Users will adopt it under real workflow constraints
One scenario worked	Edge cases and failure modes are controlled
Output looked polished	The output measured the right outcome
The prototype was fast	Sustainment, cost, telemetry, and governance are feasible
A tool is licensed	The data path and use case are approved

5. Distribution status and policy boundary

This manual is field guidance, not final policy. It can be used to shape intake, architecture review, governance discussion, and practitioner learning. It must not be interpreted as:

tool approval,
data-class approval,
production approval,
GxP or regulated-use approval,
external SaaS approval,
security exception approval,
autonomous agent approval,
replacement for formal AI Governance review.

When this manual conflicts with named internal policy, the named policy wins. When the policy is unknown, mark the item as “requires owner confirmation” rather than inventing approval, because apparently optimism is still not an access-control model.

6. Tool and data boundary matrix, owner-validation required

The matrix below is a field template requiring owner validation before policy or operational use. It is intentionally conservative. Replace placeholders with confirmed internal policy before broad distribution.

Tool surface	Personal learning with public or synthetic data	Company data	Confidential or proprietary data	Regulated, GxP, PHI, PII, security-sensitive data	Business-process use	Approval route
Approved AI for All chat	Usually allowed within policy	Allowed only by approved data class	Depends on policy	Not assumed allowed	No, unless all no-review conditions hold	AI Governance if any trigger is false
Internal enterprise GenAI chat	Usually allowed within policy	Depends on approved data boundary	Depends on policy	Requires explicit confirmation	Depends on impact	AI Governance if workflow or data triggers apply
Copilot Studio or enterprise agent builder	Learning and team prototyping where approved	Use-case approved only	Use-case approved only	Requires explicit review	Yes, if governed	Team, function, or enterprise governance path
Azure or AWS approved runtime	Not a casual user surface	Use-case approved only	Use-case approved only	Requires explicit review	Yes, if governed SDLC applies	Formal architecture, security, privacy, compliance, AI Governance
GitHub Copilot or approved SDLC assistant	Only where assigned and approved	Depends on repo and policy	Depends on policy	Not assumed allowed	No business-process automation by default	SDLC and AI Governance as applicable
Claude Code, Codex, Cursor, Antigravity, external SaaS coding tools	Public or synthetic learning only unless approved	Not assumed allowed	Not assumed allowed	Not allowed unless explicitly approved	Not approved by default	Explicit approval required
Local or personal tools	Public or synthetic learning only	Not assumed allowed	Not assumed allowed	Not allowed	Not approved	Explicit approval required

Bright-line rule

Approved tool access does not approve the use case, data class, retention model, logging path, connector action, workflow impact, or production use.

7. Enterprise routing model

Use enterprise routing to separate casual productivity from governed business capability.

Work type	Likely path	Required discipline
Individual productivity	AI for All, approved chat, approved assistant	Stay within approved data and output boundaries
Small group experiment	BUILD path, limited sharing, synthetic or approved data	Scope, owner, data boundary, known limitations
Pre-configured business workflow	USE approved agent or platform capability	Confirm data, audience, support, and governance triggers
Custom business-process AI	REQUEST or Custom Built AI path	PRD, source authority, schemas, evals, HITL, telemetry, governance
Regulated, GxP, privacy-sensitive, or decision-impacting workflow	Governance first	Formal review before tooling or data processing
Production or scaled capability	Governed SDLC and operational ownership	Release gates, runbook, monitoring, support, change control

8. Capability readiness model

The readiness ladder gives teams a shared vocabulary for maturity. It should be used as a routing tool, not as a vanity score. The practical question is always what proof is required to move up one level without skipping governance, telemetry, or ownership.

Level	State	Meaning	Minimum next proof
0	Idea	Interesting but not shaped	Problem statement and user need
1	Prompt artifact	One-off model interaction	Reusable harness candidate
2	Reusable harness	Repeatable prompt/instruction pattern	Input/output contract and review path
3	Structured workflow	Defined inputs, outputs, states, and human review	Eval fixture set and evidence rules
4	Eval-backed assistant	Tested against fixtures and rubrics	Pilot charter, data approval, telemetry plan
5	Governed pilot	Approved users, data path, evals, review, and telemetry	Runbook, support model, release criteria
6	Operational capability	Supported, monitored, versioned, adopted	Scaling plan and reuse governance
7	Scaled enterprise capability	Integrated, reusable, governed, measured, continuously improved	Portfolio governance and continuous eval operations

8.1 Promotion gate matrix

Promotion between Levels 3 and 6 should be treated as an evidence gate, not a naming preference. The matrix below defines minimum evidence floors for field guidance. Meeting the floor does not bypass governance, architecture, security, privacy, compliance, or owner approval.

Gate area	Level 3 to 4	Level 4 to 5	Level 5 to 6
Schema validity	Input, output, and evidence fields are explicit, and sample outputs validate against the declared schema.	Pilot schemas cover normal, exception, and review states, with no unresolved field drift across pilot fixtures.	Operational schemas are versioned, change-controlled, and released with backward-compatibility or migration handling.
Fixture coverage	Starter fixtures cover golden path, missing evidence, conflicting source, ambiguous request, and unsafe request behavior.	Pilot fixtures cover top failure paths, reviewer overrides, and recent regressions seen in trial use.	Regression suite is refreshed from incidents, source changes, model changes, and operating drift.
Rubric calibration	A domain reviewer and builder align on pass, fail, and escalation labels for the starter fixture set.	Pilot reviewer pool calibrates against the fixture set and records how disagreement is resolved.	Calibration repeats on a defined cadence and after rubric, model, source, or workflow changes.
Reviewer pool	A named reviewer can reject, request evidence, or escalate.	Pilot reviewer pool has primary and backup coverage with queue ownership.	Operational reviewer coverage matches hours, expected volume, and escalation obligations.
Approved data route	Only public, synthetic, or otherwise approved data enters the eval-backed assistant path.	Pilot data route, logging path, retention path, and prohibited data classes are explicitly approved for pilot scope.	Operational data route is documented per source class and monitored for drift or boundary violations.
Stop conditions	Missing-evidence, unsafe-action, and overreach stop conditions are explicit in harness or reviewer guidance.	Pilot stop conditions include false-negative, boundary-violation, and override-spike triggers with pause authority.	Stop conditions are wired to operational pause, rollback, or routing controls.
Telemetry	Run start, output, evidence state, reviewer action, and stop-condition events are defined.	Pilot telemetry proves fixture outcomes, override rates, cycle time, and boundary violations.	Operational telemetry tracks quality, cost, adoption, drift, and incident correlation.
Runbook	Reviewer instructions exist for how to run the workflow and capture findings.	Pilot runbook covers startup, failure handling, retriage, source refresh, and manual fallback.	Operational runbook covers release, rollback, monitoring, and handoff expectations.
Incident path	Harm or error cases have an escalation contact, even if operational incident handling is not yet active.	Pilot incident path names who pauses the pilot, who reviews the event, and how evidence is preserved.	Operational incident path integrates with the owning team's incident and post-incident review flow.
Support owner	A named builder or owner is accountable for the assistant and its artifacts.	Pilot support owner accepts source, rubric, and fixture maintenance responsibilities.	Operational support owner, backup, and service boundaries are documented.
Adoption proof	At least one target workflow and success measure are named.	Pilot adoption proof shows real reviewers using the workflow and returning structured feedback.	Operational adoption proof shows repeat usage, decision uptake, and a maintained value signal.

9. Capability formation lifecycle

Intent
→ Product thesis
→ Product requirements
→ Value classification and acceptance line
→ Domain model
→ Source authority model
→ Data contract
→ Schema contracts
→ Harness
→ Rubric
→ Eval suite
→ Workflow
→ Human review model
→ Telemetry and observability
→ Governance route
→ Sustainment model
→ Field validation
→ Operational capability

The lifecycle is not paperwork theater. It exists because without these layers, a team can build a very convincing wrong thing.

10. Intent and outcome management

Intent is valid only when the problem, user, workflow, outcome, source feasibility, data permission, failure consequence, and ownership are explicit.

Value must be classified before it is judged. A proposal can create real value and still sit below the current acceptance line if the wrong owner benefits, the evidence is weak, or the current business climate requires direct savings.

Gate	Test	Failure signal
Problem clarity	Specific, recurring, material, and owned	Vague productivity promise
Outcome specificity	Observable baseline and target	“Make work easier” with no measure
Value classification	Claimed value class, decision owner, benefiting owner, and evidence owner are explicit	Real value claim with no accountable owner or proof path
Acceptance line fit	Current business climate and minimum accepted threshold are explicit	Value is real but below the current acceptance line
User fit	Real user job and workflow entry point	Solution looking for a workflow
Decision relevance	Output drives a real decision or action	Output is interesting but unused
AI appropriateness	AI compared to no AI, rules, search, workflow, dashboard, deterministic automation	Agent-first thinking
Source feasibility	Required sources exist and have authority	Model asked to infer missing authority
Data permission	Required data can be processed, logged, retained, and reviewed in selected tool path	Tool approval confused with data approval
Failure consequence	Wrong, missing, stale, or overconfident output is analyzed	No safe failure path
Human accountability	Reviewer authority and override workflow exist	HITL slogan, no action model
Sustainment realism	Owner, cadence, funding, and release model exist	Demo owner disappears after launch

11. Product requirements for AI capabilities

A serious AI capability needs product requirements, not just prompts.

Requirement area	Required content
Target users	Roles, responsibilities, permissions, review authority
User jobs	What task or decision is improved
Business outcome	Baseline, target, value hypothesis, value class, benefiting owner, evidence owner, measurement method
Acceptance line	Decision owner, current threshold, below-line handling, exception path if needed
Non-goals	What the system must not do
Inputs	Data classes, artifacts, source systems, owners, refresh cadence
Outputs	Decisions, recommendations, drafts, findings, actions, confidence limits
Decision boundaries	What the model may suggest versus what humans must decide
Failure modes	Missing evidence, stale source, conflict, hallucination, tool failure, privacy risk
Acceptance criteria	Functional, quality, governance, telemetry, and support thresholds
Operating model	Owner, support path, review cadence, release and change control

12. Solution ideation matrix

The point of this matrix is to stop agent-first design. Many problems are better served by deterministic rules, workflow automation, better source hygiene, reporting, or search before any agentic runtime is justified.

Before choosing an agent, compare options. The best AI architecture sometimes uses less AI. Horrifying for hype decks, useful for reality.

Option	Best fit	When to reject
No AI	Problem is rare, low value, or unclear	Recurring workflow has measurable burden
Search or RAG	Find and summarize trusted content	Task requires actions or structured decisions
Deterministic rules	Clear policy or classification logic	Ambiguous interpretation required
Workflow automation	Known steps and approvals	Complex language interpretation required
Dashboard or report	Visibility and monitoring	User needs drafting, reasoning, or orchestration
Chat assistant	Exploration, synthesis, first-pass support	Needs durable workflow or audited action
Agentic workflow	Multi-step tasks with tools, approvals, and feedback	No approved tools, data path, evals, or owner
Integrated capability	Business process with sustained ownership	No measurable outcome or support model

13. Schema-first capability design

Schemas are where vague AI intent becomes inspectable. They let teams validate input, output, evidence, exceptions, telemetry, and human review records instead of relying on prose promises and well-formatted uncertainty.

If a team cannot define valid input, output, evidence, decision states, exceptions, telemetry, and review records, it is not ready to build beyond exploration.

13.1 Minimal output schema example

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "AIReviewFinding",
  "type": "object",
  "required": ["finding_id", "status", "claim", "evidence_state", "human_review_required"],
  "properties": {
    "finding_id": {"type": "string"},
    "status": {"enum": ["supported", "gap", "not_evidenced", "conflicting_evidence", "requires_confirmation", "requires_escalation", "not_applicable"]},
    "claim": {"type": "string"},
    "evidence_state": {"enum": ["cited", "missing", "conflicting", "not_applicable"]},
    "evidence_refs": {
      "type": "array",
      "items": {"type": "string"}
    },
    "source_authority_level": {"enum": ["canonical", "governed_reference", "submission_evidence", "derived_analysis", "historical", "prohibited"]},
    "risk_severity": {"enum": ["low", "medium", "high", "critical"]},
    "human_review_required": {"type": "boolean"},
    "recommended_action": {"type": "string"}
  }
}
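As a rough illustration of how the schema above makes output inspectable, the sketch below checks a finding against the required fields and enumerations using only the Python standard library. It is a hand-rolled approximation and not a full JSON Schema validator; a team would more likely use a maintained JSON Schema library. The function name `validate_finding` and the error strings are assumptions made for this example.

```python
# Required fields and enumerations copied from the AIReviewFinding schema above.
REQUIRED = ("finding_id", "status", "claim", "evidence_state", "human_review_required")
STRING_FIELDS = ("finding_id", "claim", "recommended_action")
ENUMS = {
    "status": ("supported", "gap", "not_evidenced", "conflicting_evidence",
               "requires_confirmation", "requires_escalation", "not_applicable"),
    "evidence_state": ("cited", "missing", "conflicting", "not_applicable"),
    "source_authority_level": ("canonical", "governed_reference", "submission_evidence",
                               "derived_analysis", "historical", "prohibited"),
    "risk_severity": ("low", "medium", "high", "critical"),
}


def validate_finding(finding: dict) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    errors = [f"missing required field: {name}" for name in REQUIRED if name not in finding]
    for name in STRING_FIELDS:
        if name in finding and not isinstance(finding[name], str):
            errors.append(f"{name} must be a string")
    for name, allowed in ENUMS.items():
        if name in finding and finding[name] not in allowed:
            errors.append(f"{name} has unexpected value: {finding[name]!r}")
    if "human_review_required" in finding and not isinstance(finding["human_review_required"], bool):
        errors.append("human_review_required must be a boolean")
    refs = finding.get("evidence_refs", [])
    if not isinstance(refs, list) or not all(isinstance(ref, str) for ref in refs):
        errors.append("evidence_refs must be a list of strings")
    return errors


if __name__ == "__main__":
    sample = {"finding_id": "FND-0001", "status": "approved", "claim": "TLS is enforced",
              "human_review_required": True}
    for problem in validate_finding(sample):
        print(problem)
```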

13.2 Schema failure example

Failure	Why it blocks readiness
Finding has no evidence state	Unsupported claims cannot be separated from supported claims
Finding has no source authority level	The model may treat all retrieved content as equal
Finding has no human review flag	Governance-sensitive cases may appear resolved
Finding has no status enum	Outputs cannot be reliably evaluated or aggregated

14. Source authority model

Source authority must be explicit and versioned.

Source class	Example	Can support findings?	Required handling
Canonical	Approved policy, official standard, validated technology catalog	Yes	Cite source and version
Governed reference	Architecture pattern library, approved playbook	Yes, with context	Cite source, owner, version
Submission evidence	Submitted diagram, PRD, vendor document	Yes for what was submitted	Mark as submission evidence, not policy
Derived analysis	Model extraction or summary	No by itself	Must cite underlying evidence
Historical	Prior decisions, older package, retired architecture	Only with date and context	Check freshness and applicability
Stale	Deprecated standard, superseded deck	No	Flag as stale
Prohibited	Unapproved note, external blog, unverifiable model output	No	Do not use as evidence
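The table above can be read as a lookup that a harness applies before accepting a citation. The sketch below is one possible encoding under stated assumptions: the class identifiers are snake_case versions of the table rows, and `can_support_finding` and its `dated_with_context` flag are hypothetical names. Unknown classes raise an error instead of defaulting to a permissive answer.

```python
# Source classes that may support a finding without further conditions.
SUPPORTING = ("canonical", "governed_reference", "submission_evidence")
# Source classes that may never support a finding on their own.
NON_SUPPORTING = ("derived_analysis", "stale", "prohibited")


def can_support_finding(source_class: str, *, dated_with_context: bool = False) -> bool:
    """Return True when a source of this class may back a finding by itself."""
    if source_class in SUPPORTING:
        return True
    if source_class == "historical":
        # Historical material counts only when its date and context are recorded.
        return dated_with_context
    if source_class in NON_SUPPORTING:
        return False
    # Refuse to guess: an unclassified source must not be treated as authoritative.
    raise ValueError(f"unknown source class: {source_class}")


if __name__ == "__main__":
    for name in ("canonical", "derived_analysis", "historical"):
        print(name, can_support_finding(name))
```

Handling notes from the table, such as marking submission evidence as not policy, still apply even when the function returns True.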

15. Eval validity and calibration

Evals are not automatically trustworthy because they have scores. They are trustworthy only when they measure the intended capability, cover the right failures, correlate with expert judgment, and catch safe-failure behavior when evidence is missing or contradictory.

Evals can be beautifully wrong. A rubric can score the wrong behavior consistently. That is not quality. That is automated self-deception with columns.

Eval lifecycle and regression loop

The eval surface is a loop, not a single release gate. Every model, source, tool, schema, or workflow change should push the team back through fixtures, review, and evidence capture.

Objective and source hierarchy: define what the capability must prove and which sources govern it
→ Output contract and schema: make the result inspectable instead of purely conversational
→ Fixture set and rubric: cover golden, incomplete, conflicting, ambiguous, and regression cases
→ Eval run and human review: compare model behavior against the intended capability
→ Telemetry and validation receipt: capture agreement, overrides, gaps, and release evidence
→ Change event detection: model, tool, source, schema, or workflow shifts invalidate assumptions
→ Rerun and recalibrate: refresh fixtures, rubric anchors, and thresholds before further promotion

Regression is not a separate afterthought. It is the mechanism that keeps a previously useful capability from drifting into confident failure.

Validity type	Question	Failure mode
Construct validity	Does the eval measure the actual capability?	Scores format instead of decision usefulness
Criterion validity	Does eval performance correlate with expert review?	Model passes but experts reject output
Coverage validity	Does the suite cover normal, edge, ambiguous, adversarial, missing-evidence, and regression cases?	Happy-path-only testing
Risk validity	Are high-consequence failures weighted more heavily than routine ones?	Average score hides critical false negatives
Regression validity	Does the eval catch degradation after model, prompt, source, schema, or tool changes?	Change ships with hidden behavior drift
Operational validity	Does eval success predict workflow usefulness?	Output passes tests but users ignore it
Reviewer reliability	Would qualified reviewers score similarly?	Rubric is subjective theater
Negative-control validity	Does the system fail correctly when it should?	Missing evidence becomes invented confidence
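The risk validity row is easiest to see with numbers. In the sketch below, a suite with a 95 percent pass rate is still blocked because one of its failures is a high-risk false negative. The result fields (`passed`, `risk`, `false_negative`), the function name `release_decision`, and the 0.9 threshold are all assumptions chosen for illustration, not recommended values.

```python
def release_decision(results: list, min_pass_rate: float = 0.9) -> str:
    """Gate on high-risk false negatives first, then on the aggregate pass rate."""
    if not results:
        raise ValueError("no eval results supplied")
    critical_misses = [
        r for r in results
        if r["false_negative"] and r["risk"] in ("high", "critical")
    ]
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    if critical_misses:
        # A good average must not hide a high-consequence miss.
        return "blocked: high-risk false negative"
    if pass_rate < min_pass_rate:
        return "blocked: pass rate below threshold"
    return "eligible for review"


if __name__ == "__main__":
    routine_pass = {"passed": True, "risk": "low", "false_negative": False}
    critical_miss = {"passed": False, "risk": "critical", "false_negative": True}
    suite = [routine_pass] * 19 + [critical_miss]
    print(release_decision(suite))
```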

15.1 Eval calibration protocol

Select at least six fixtures: golden, incomplete, conflicting, adversarial, ambiguous, and regression.
Have two or more qualified reviewers independently score expected outputs.
Identify disagreements and update rubric anchors.
Define high-risk false negative stop conditions.
Define minimum release threshold.
Run the suite whenever prompt, model, schema, source map, tool contract, or runtime changes.
Record reviewer agreement, override rate, and unresolved disagreements.
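Recording reviewer agreement, as the protocol above asks, can be sketched as follows for two reviewers. The example reports raw agreement plus Cohen's kappa, which corrects for chance agreement; choosing kappa is an assumption of this sketch, since the protocol does not name a statistic. The label values and the function name `agreement_stats` are also illustrative.

```python
from collections import Counter


def agreement_stats(labels_a: list, labels_b: list) -> tuple:
    """Return (observed agreement, Cohen's kappa, indexes where the reviewers disagree)."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("both reviewers must score the same non-empty fixture list")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both reviewers pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in set(count_a) | set(count_b)) / (n * n)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)
    disagreements = [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
    return observed, kappa, disagreements


if __name__ == "__main__":
    reviewer_1 = ["pass", "fail", "escalate", "pass", "fail", "pass"]
    reviewer_2 = ["pass", "fail", "pass", "pass", "fail", "escalate"]
    print(agreement_stats(reviewer_1, reviewer_2))
```

The disagreement indexes point at the fixtures whose rubric anchors most likely need updating.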

15.2 Starter fixture matrix

Fixture	Purpose	Expected behavior
Golden	Fully evidenced, low ambiguity	Supported findings, minimal escalation
Incomplete	Missing required input	not_evidenced, request evidence
Conflicting	Two sources disagree	conflicting_evidence, human confirmation
Adversarial	User claims approval without evidence	reject unsupported claim
Ambiguous	Unclear data class or ownership	requires_confirmation
Regression	Previously fixed failure	no reintroduction of failure

16. Intent-to-eval traceability

Each eval assertion should trace to a business outcome and the claimed value class, not just a prompt instruction.

Business outcome	User need	Product requirement	Domain model	Source authority	Data contract	Schema	Harness rule	Rubric dimension	Eval fixture	Telemetry metric	Human decision
Reduce incomplete architecture submissions	Architect needs missing evidence identified early	Assistant must flag missing security model	Submission, artifact, control, evidence	Security baseline is canonical	No sensitive artifacts in unapproved tools	finding.status enum includes not_evidenced	If security evidence missing, do not infer	Evidence correctness	incomplete-security-model-001	not_evidenced correctness rate	Reviewer requests evidence or escalates

17. Human review and override model

Human-in-the-loop is not a safety feature unless the loop has authority, evidence, time, context, actions, and logging.

17.1 Review state model

Draft generated
→ Needs evidence
→ Requires confirmation
→ Accepted / Edited / Rejected / Overridden
→ Escalated if needed
→ Decision packet prepared
→ Decision recorded
→ Feedback loop reviewed
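The review states above can be enforced as a small transition table so that a finding cannot jump from draft to a recorded decision. The sketch below is one interpretation: the branching (for example, allowing a draft to skip "Needs evidence", or treating escalation as optional) is an assumption, and the identifiers `TRANSITIONS` and `advance` are hypothetical.

```python
REVIEWER_OUTCOMES = {"accepted", "edited", "rejected", "overridden"}

# Allowed next states for each review state; the branching is an assumed reading of the model.
TRANSITIONS = {
    "draft_generated": {"needs_evidence", "requires_confirmation"},
    "needs_evidence": {"requires_confirmation"},
    "requires_confirmation": set(REVIEWER_OUTCOMES),
    "escalated": {"decision_packet_prepared"},
    "decision_packet_prepared": {"decision_recorded"},
    "decision_recorded": {"feedback_loop_reviewed"},
    "feedback_loop_reviewed": set(),
}
for outcome in REVIEWER_OUTCOMES:
    # Any reviewer outcome may escalate or go straight to the decision packet.
    TRANSITIONS[outcome] = {"escalated", "decision_packet_prepared"}


def advance(current: str, target: str) -> str:
    """Return the target state if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"transition not allowed: {current} -> {target}")
    return target


if __name__ == "__main__":
    state = "draft_generated"
    for step in ("needs_evidence", "requires_confirmation", "overridden",
                 "escalated", "decision_packet_prepared", "decision_recorded"):
        state = advance(state, step)
    print(state)
```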

17.2 Override payload schema

{
  "override_id": "OVR-0001",
  "finding_id": "FND-0007",
  "reviewer_role": "enterprise_architect",
  "original_status": "supported",
  "override_status": "requires_escalation",
  "rationale": "Source cited is submission evidence, not canonical policy.",
  "evidence_refs": ["SRC-SEC-STD-2026-01"],
  "action_taken": "Escalated to security architecture owner",
  "requires_fixture_update": true,
  "timestamp_utc": "2026-06-17T17:00:00Z"
}

18. Technical annex: repo-backed package layout

The repo structure is included because durable AI capability work eventually outgrows chat. Files, schemas, fixtures, receipts, and governance records need a stable place to live if teams want repeatability and reviewability.

Serious AI work should move from chat to files when it needs versioning, tests, schemas, fixtures, reproducibility, or multiple maintainers.

ai-capability/
  README.md
  PRODUCT_REQUIREMENTS.md
  GOVERNANCE_ROUTING.md
  DATA_BOUNDARY.md
  SOURCE_AUTHORITY_MAP.yaml
  harnesses/
    review_harness.md
  schemas/
    finding.schema.json
    telemetry-event.schema.json
    override.schema.json
  rubrics/
    review_rubric.md
  evals/
    fixtures/
      incomplete-security-model.json
      conflicting-source-authority.json
    expected/
      incomplete-security-model.expected.json
    tests/
      test_eval_assertions.py
  tools/
    mcp-tool-contracts/
      architecture-catalog.lookup.yaml
  receipts/
    validation-receipt-template.md
  docs/
    HUMAN_REVIEW_WORKFLOW.md
    OBSERVABILITY_CONTRACT.md
    RELEASE_GATES.md

19. Technical annex: programmable eval assertion

import json

# Status the assistant must return when security evidence is absent.
REQUIRED_STATUS = "not_evidenced"

# Load the assistant output produced for the incomplete-security-model fixture.
with open("evals/outputs/incomplete-security-model.output.json", "r", encoding="utf-8") as fh:
    output = json.load(fh)

findings = output["findings"]
# Keep only the findings for the security control under test.
security_findings = [item for item in findings if item.get("control_id") == "SEC-001"]

assert security_findings, "SEC-001 finding is missing"
for finding in security_findings:
    # .get() turns a missing field into an assertion failure instead of a KeyError.
    assert finding.get("status") == REQUIRED_STATUS, "Missing security evidence must not be treated as supported"
    assert finding.get("human_review_required") is True, "Missing security evidence requires human review"
    assert finding.get("evidence_refs", []) == [], "Missing evidence should not invent citation references"

20. Technical annex: MCP and tool execution contract

Every tool exposed to an agent should have a contract. Tool access is where language generation becomes operational risk.

tool_id: architecture_catalog.lookup
owner: enterprise_architecture
purpose: lookup approved technology status and reference patterns
data_classes_allowed:
  - public
  - internal_non_sensitive
actions_allowed:
  - read_catalog_entry
  - search_reference_pattern
actions_prohibited:
  - modify_catalog
  - approve_exception
  - change_source_authority
identity_model: managed_identity_or_service_principal
auth_scopes:
  - catalog.read
egress_allowed: false
input_schema: schemas/catalog_lookup_input.schema.json
output_schema: schemas/catalog_lookup_output.schema.json
audit_events:
  - tool.called
  - tool.result_returned
  - tool.error
rate_limits:
  per_minute: 60
human_approval_required_for:
  - exception_request
  - status_change
failure_behavior: return requires_confirmation and do not infer approval
rollback_behavior: not_applicable_read_only

21. Technical annex: CI/CD and release gates

Gate	Required proof	Blocks release if
Local harness validation	Output validates against schema	Schema invalid
Fixture regression	Golden, incomplete, conflicting, adversarial, ambiguous, regression fixtures pass	High-risk false negative appears
Source authority check	Source map version is present and current	Unknown source used as canonical
Tool contract check	All tools have owner, scopes, schemas, logging, allowed actions	Tool has unbounded action access
Security and privacy check	Data classes and retention match approved path	Data path unknown
Human review check	Override workflow and decision state schema exist	HITL is undefined
Observability check	Run IDs, tool spans, eval results, cost, override events captured	No traceability
Production promotion	Runbook, support owner, SLO, incident path, rollback defined	Sustainment owner missing

22. Technical annex: observability contract

Event	Required fields	Why it matters
ai.run.started	run_id, user_id, capability_id, version	Traceability
ai.context.loaded	run_id, source_map_version, context_refs	Source freshness
ai.tool.called	run_id, tool_id, action, auth_scope, data_class	Tool audit
ai.output.generated	run_id, schema_version, model_version, harness_version	Output provenance
ai.eval.completed	run_id, fixture_set_version, pass_fail, failures	Regression evidence
ai.human.override	run_id, finding_id, original_status, override_status, rationale	Feedback loop
ai.escalation.required	run_id, trigger, owner, due_date	Governance action
ai.run.completed	run_id, cost, latency, tokens, outcome_status	Value and FinOps
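The required fields in the table above can be checked at emit time so that incomplete events are caught before they reach the log. The sketch below is a minimal version of that check. The event names and fields come from the table, while `missing_fields` and the choice to treat `None` as missing are assumptions.

```python
# Required fields per event, copied from the observability contract table.
REQUIRED_FIELDS = {
    "ai.run.started": ("run_id", "user_id", "capability_id", "version"),
    "ai.context.loaded": ("run_id", "source_map_version", "context_refs"),
    "ai.tool.called": ("run_id", "tool_id", "action", "auth_scope", "data_class"),
    "ai.output.generated": ("run_id", "schema_version", "model_version", "harness_version"),
    "ai.eval.completed": ("run_id", "fixture_set_version", "pass_fail", "failures"),
    "ai.human.override": ("run_id", "finding_id", "original_status", "override_status", "rationale"),
    "ai.escalation.required": ("run_id", "trigger", "owner", "due_date"),
    "ai.run.completed": ("run_id", "cost", "latency", "tokens", "outcome_status"),
}


def missing_fields(event_name: str, payload: dict) -> list:
    """Return required fields that are absent or None; unknown events raise ValueError."""
    if event_name not in REQUIRED_FIELDS:
        raise ValueError(f"unknown event: {event_name}")
    return [name for name in REQUIRED_FIELDS[event_name] if payload.get(name) is None]


if __name__ == "__main__":
    event = {"run_id": "run-001", "tool_id": "architecture_catalog.lookup",
             "action": "read_catalog_entry"}
    print(missing_fields("ai.tool.called", event))
```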

Required metrics

high-risk false negative count,
unsupported claim rate,
not-evidenced correctness rate,
override rate,
reviewer disagreement rate,
escalation rate,
source freshness age,
cost per run,
latency per run,
adoption and repeat-use rate,
incident count.

23. Technical annex: FinOps and execution limits

Agentic systems need explicit execution limits.

Control	Example
Token budget	Stop or escalate when run exceeds approved token budget
Tool-call budget	Max 25 tool calls per run unless reviewer approves extension
Retry limit	Max 2 retries per failed tool action
Loop limit	Max 3 plan-execute-check loops before human review
Timeout	Stop long-running operations after defined threshold
Cost alert	Alert when cost per run exceeds expected band
Escalation	Escalate if repeated failures indicate bad harness, source, or tool contract
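Several of the limits above can be enforced by a small guard that the agent loop consults after every step. The sketch below is illustrative: the 25 tool-call and 3 loop defaults come from the example column, the token budget has no default because the table does not give one, and the class name `RunBudget` and its return strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunBudget:
    max_tokens: int              # approved token budget; use-case specific, no default
    max_tool_calls: int = 25     # from the table's example
    max_loops: int = 3           # plan-execute-check loops before human review
    tokens_used: int = 0
    tool_calls: int = 0
    loops: int = 0

    def record(self, *, tokens: int = 0, tool_calls: int = 0, loops: int = 0) -> str:
        """Add usage from one step and return the action the harness should take."""
        self.tokens_used += tokens
        self.tool_calls += tool_calls
        self.loops += loops
        if self.tokens_used > self.max_tokens:
            return "stop_or_escalate: token budget exceeded"
        if self.tool_calls > self.max_tool_calls:
            return "needs_reviewer_extension: tool-call budget exceeded"
        if self.loops >= self.max_loops:
            return "human_review: loop limit reached"
        return "continue"


if __name__ == "__main__":
    budget = RunBudget(max_tokens=1000)
    print(budget.record(tokens=400, tool_calls=10, loops=1))
    print(budget.record(tokens=400, tool_calls=10, loops=1))
    print(budget.record(tokens=100, tool_calls=5, loops=1))
```

Retry limits, timeouts, and cost alerts would need similar counters or timers. They are omitted here to keep the sketch short.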

24. Technical annex: parallel execution safety

Parallel agents increase throughput and risk.

Risk	Required control
State corruption	Worktree, branch, sandbox, or transaction isolation
Race condition	Locking, idempotency, queue ownership
Duplicate action	Idempotency key and action ledger
API rate exhaustion	Rate limits and backoff
Conflicting edits	Diff review and merge gate
Unbounded cost	Per-session budget and timeout
Hidden failures	Central run log and tool-call spans
Production impact	No production writes without explicit human approval
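
The duplicate-action control can be illustrated with an idempotency key and an action ledger. The sketch below is in-memory and single-process, so it only demonstrates the idea; a real ledger would need durable, shared storage. The class name `ActionLedger` and its methods are hypothetical.

```python
import hashlib
import json
import threading


class ActionLedger:
    """Records each distinct action once, even when agents run in parallel threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: dict[str, dict] = {}

    @staticmethod
    def idempotency_key(tool_id: str, action: str, params: dict) -> str:
        # Canonical JSON so the same logical action always hashes to the same key.
        canonical = json.dumps([tool_id, action, params], sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def record_once(self, run_id: str, tool_id: str, action: str, params: dict) -> bool:
        """Record the action and return True, or return False if it is a duplicate."""
        key = self.idempotency_key(tool_id, action, params)
        with self._lock:  # check and insert must happen as one step
            if key in self._entries:
                return False
            self._entries[key] = {"run_id": run_id, "tool_id": tool_id, "action": action}
            return True


ledger = ActionLedger()
first = ledger.record_once("run-1", "ticketing", "create", {"title": "x", "priority": 1})
second = ledger.record_once("run-2", "ticketing", "create", {"priority": 1, "title": "x"})
```

Here `first` is accepted and `second` is rejected, because the two calls describe the same action even though they come from different runs and list parameters in a different order.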

25. Harness lifecycle management

Agents are maintained systems, not launch-and-forget assets. The harness around the model has to be reviewed as sources age, tools change, workflows drift, model behavior improves, and the business changes its definition of useful work.

The agent is not the whole system. The harness is the workbench around the agent: sources, context, tools, permissions, prompts, schemas, evals, review flows, telemetry, and stop conditions.

v0.9 preserves the maintenance lens because a capability that worked last quarter can become unsafe, wasteful, or stale even when the model improves. That is the part many teams miss while they are busy admiring how quickly the agent can produce more work for humans to clean up. Charming little productivity trap.

25.1 Harness lifecycle thesis

Principle	Meaning	Risk if ignored
Harnesses live in motion	Models, tools, sources, workflow, and business context change	Yesterday's safe setup becomes today's drag or risk
Maintenance includes deletion	More tools and more rules are not always better	Tool bloat, permission sprawl, token waste, audit noise
Context is operational	Context drives output, validation, and decisions	Stale context becomes active misinformation
Model upgrades are change events	Better models can make old harnesses misfit	Stronger agents use weak boundaries faster
Proof must remain linkable	Output must point to sources, records, spans, or logs	Fluency outruns trust
Value must be rechecked	A useful agent can become redundant or harmful	Automation keeps producing work nobody needs

25.2 Maintenance cadence

Trigger	Required review
Model version changes	Model upgrade impact review
Workflow changes	Intent, job, and state-model review
Source changes	Source authority and context freshness review
Tool or connector changes	Permission, action, and audit review
High override rate	Eval, rubric, and source review
Cost spike	FinOps and loop-limit review
Low adoption	Value and workflow fit review
Incident or near miss	Stop condition, blast-radius, and recovery review

26. Agents drift in two directions

Traditional systems mostly drift when requirements, dependencies, data, or integrations change. Agent systems drift in two directions at once: the world changes around them and the model changes inside them.

Drift direction	Example	Control response
World changes around the agent	Workflow, source, ownership, terminology, or policy changes	Refresh source authority, context, schemas, and eval fixtures
Model changes inside the agent	Better reasoning, better tool use, better planning, stronger autonomy	Reassess permissions, workflow constraints, tool count, stop conditions, and review load

26.1 Agents can break when models improve

A stronger model is not automatically safer. It can make weak harnesses fail faster and more convincingly.

Model improvement	Harness risk	Review question
Better reasoning	Old rigid workflow becomes unnecessary drag	Which rules should be simplified or removed?
Better tool use	Broad permissions become more dangerous	Which tools need tighter action contracts?
Better planning	Agent creates plausible downstream work faster than humans can review	Is reviewer throughput still sufficient?
Better context use	Stale context becomes more influential	Are current sources ranked and refreshed?
Better autonomy	Weak stop conditions become more dangerous	Are loop limits, cost limits, and escalation triggers explicit?
Better fluency	Unsupported output becomes harder to detect	Are citations, evidence spans, and negative controls enforced?

27. Tool pruning and harness simplification

The beginner instinct is to add. The maintenance instinct is to ask what should be removed.

More tools do not automatically create better agents. Every tool increases the action surface, ambiguity surface, permission surface, audit surface, cost surface, and maintenance burden. A tool must earn its place through observed value, controlled failure behavior, and measurable improvement. CLM-001 CLM-004

27.1 Tool pruning decision rule

Keep the tool when	Remove or disable the tool when
It is required for a defined job	It is rarely used or only makes demos look powerful
It has an owner and allowed-action contract	No owner can explain why it is needed
It improves measured outcome or reduces review burden	It increases review burden or false confidence
It has clear permission, logging, and failure handling	It can mutate state without sufficient approval or audit
It works inside approved data and runtime boundaries	It crosses an unapproved data, network, or workflow boundary
It is covered by eval fixtures and negative controls	It is invisible to test coverage
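
The decision rule can be recorded as a per-tool assessment. The sketch below takes a strict reading in which a tool that fails any keep criterion becomes a candidate for removal or disabling. That strictness is an assumption, as are the field and function names; the table itself does not say how many criteria must hold.

```python
from dataclasses import dataclass, fields


@dataclass
class ToolAssessment:
    # One flag per row of the "Keep the tool when" column.
    required_for_defined_job: bool
    has_owner_and_action_contract: bool
    improves_outcome_or_reduces_review: bool
    has_permission_logging_failure_handling: bool
    within_approved_boundaries: bool
    covered_by_evals_and_negative_controls: bool


def pruning_decision(assessment: ToolAssessment) -> tuple[str, list[str]]:
    """Return the decision plus the names of the criteria the tool failed."""
    failed = [f.name for f in fields(assessment) if not getattr(assessment, f.name)]
    return ("keep" if not failed else "remove_or_disable", failed)


# Hypothetical tool that has no eval coverage.
decision, reasons = pruning_decision(
    ToolAssessment(True, True, True, True, True, False)
)
```

Returning the failed criteria alongside the decision gives the tool owner something concrete to remediate before the tool is removed.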

27.2 Harness simplicity review

Question	Good answer
Which tools were used in the last 30 runs?	Only tools that support the defined job
Which tools created errors, retries, or overrides?	Problem tools have remediation or removal plan
Which instructions are obsolete?	Obsolete rules are retired, not kept as prompt sediment
Which memory or context files are stale?	Stale context is superseded, archived, or removed
Which controls block useful work?	Controls are updated intentionally after risk review
Which actions still need human approval?	High-risk actions remain bounded and reviewable
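
The first review question can be answered from run telemetry. The sketch below lists registered tools that were not called in the most recent runs, assuming tool calls are available as one list of tool IDs per run, oldest run first. The function name and input shape are hypothetical.

```python
def unused_tools(registered: set[str], runs: list[list[str]], window: int = 30) -> set[str]:
    """Registered tools that were not called in the most recent `window` runs."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = runs[-window:]  # fewer runs than the window means all runs are used
    used = {tool_id for run in recent for tool_id in run}
    return registered - used


# Hypothetical history: three runs, oldest first.
history = [["search", "ticketing"], ["search"], ["search", "calendar"]]
stale = unused_tools({"search", "ticketing", "calendar", "email"}, history, window=2)
```

An unused tool is a prompt for the pruning question, not an automatic removal. Some tools are rarely needed but required for a defined job.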

28. Context as control plane

Context determines what the model treats as signal, what it treats as authority, and what it is allowed to summarize or infer. Poor context architecture makes smarter models more dangerous because they can act more convincingly on stale or mis-ranked material.

Context is no longer background documentation. In an AI capability, context shapes behavior, answer boundaries, validation logic, rollout language, and runtime assumptions. Once context influences behavior, it needs code-grade governance.

Context authority stack

Authority depends on the question. Rule questions route to canonical documents. Current-state questions route to structured records. Generated output remains explanatory and cannot quietly outrank either.

Canonical authority

Highest-trust source for service, process, policy, and standard questions.

Structured source of truth

Current state, ownership, lifecycle, and workflow status questions route here.

Governed context

Glossaries, crosswalks, explainers, and transition notes support interpretation without replacing authoritative records.

Generated output

Summaries, explanations, and recommendations are advisory only and must cite supporting evidence.

Control rules

Source priority, freshness review, escalation, and non-inference boundaries govern every layer.

The stack is useful only if the system can say which layer answered the question and why lower-authority text did not win.
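
A minimal sketch of that requirement follows. It routes a question type to its authoritative layer and reports which other layers offered an answer but did not win. The two routes come from the text above; the layer identifiers, status strings, and function name are assumptions for illustration.

```python
# Routes taken from the text: rule questions go to canonical documents,
# current-state questions go to structured records.
AUTHORITY_BY_QUESTION = {
    "rule": "canonical_authority",
    "current_state": "structured_source_of_truth",
}


def answer_with_authority(question_type: str, answers: dict[str, str]) -> dict:
    """Pick the answer from the authoritative layer and name the layers it overruled.

    `answers` maps a layer name to the answer that layer offered.
    """
    layer = AUTHORITY_BY_QUESTION.get(question_type)
    if layer is None:
        # No known route: do not guess which layer should win.
        return {"status": "requires_confirmation", "layer": None,
                "answer": None, "overruled": sorted(answers)}
    if layer not in answers:
        # The authoritative layer is silent: lower layers must not fill the gap.
        return {"status": "not_evidenced", "layer": layer,
                "answer": None, "overruled": sorted(answers)}
    return {"status": "answered", "layer": layer, "answer": answers[layer],
            "overruled": sorted(k for k in answers if k != layer)}


result = answer_with_authority(
    "current_state",
    {"generated_output": "Stage: delivery",
     "structured_source_of_truth": "Stage: ordering"},
)
```

In this example the structured record wins and the generated summary is listed as overruled, so a reviewer can see both which layer answered and what was set aside.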

28.1 Context hierarchy

Context layer	Role	Failure mode
Canonical authority	Highest-trust policy, SOP, standard, process source	Stale or conflicting truth becomes model guidance
Governed context	Glossaries, crosswalks, explainers, transition notes	Explanatory layer quietly outranks canonical truth
Structured source of truth	Current state, ownership, lifecycle records, workflow status	Summary is mistaken for current state
Generated output	Summaries, explanations, recommendations	Fluent narrative masks missing evidence
Control rules	Source priority, low-confidence escalation, non-authority boundaries	Model flattens authority and guesses across gaps

28.2 Context failure modes

Misdiagnosis	Actual root cause	Correct control
Need bigger model	Wrong source hierarchy	Source authority map
Need more memory	Stale or mixed context	Context freshness review
Need more tools	No structured source of truth	Data contract and schema
Need more agents	No boundary between policy, state, and summary	Context architecture
Need longer prompt	Ambiguous authority	Task-scoped context selection

Capstone principle: do not ask the model to rescue a bad context system. That is not AI strategy. That is outsourcing confusion to a more fluent machine.

29. Advisory repository versus runtime control plane

This separation is central to the architecture thesis. A governed advisory repository can support reasoning, synthesis, and human-readable guidance. Runtime control planes require deterministic orchestration, permissions, state, audit, recovery, and bounded execution controls.

A governed advisory repository is not a deterministic runtime control plane.

The advisory repository governs context, truth boundaries, source hierarchy, advisory behavior, and human-readable synthesis. The runtime control plane governs orchestration, typed tools, permissions, durable workflow state, approval controls, audit, retry, recovery, and bounded action.

Advisory repository versus runtime control plane

The advisory repository helps the system think and explain. The runtime control plane decides what can execute, what state persists, and how approval, audit, and recovery are enforced.

Advisory repository

Governed context and source hierarchy
Human-readable synthesis and bounded reasoning support
Advisory receipts and traceable citations
No autonomous production action
Bounded reasoning outputs

Runtime control plane

Orchestration and typed tools
Permissions, approvals, and durable workflow state
Audit, retry, recovery, and bounded action
Operational monitoring and change control

A control plane can consume governed knowledge, but governed knowledge by itself does not create runtime enforcement.

29.1 Separation rule

Concern	Advisory repository	Runtime control plane
Knowledge governance	Yes	Consumes governed knowledge
Reasoning support	Yes	Uses bounded reasoning outputs
Source hierarchy	Yes	Enforces source-derived rules where needed
Human-readable synthesis	Yes	Logs and routes outputs
Orchestration	No	Yes
Permissioning	Guidance only	Yes, mechanical enforcement
Durable workflow state	No	Yes
Execution and recovery	No	Yes
Audit and event logging	Limited advisory receipt	Full runtime events
Bounded action	No autonomous production action	Governed action only where approved

29.2 Expansion threshold

Do not move from advisory repository to control-plane repository because agents are fashionable. Move only when the use case requires durable state, typed tools, explicit approvals, audit, recovery, and bounded action.

30. Model upgrade impact review

Treat a model upgrade like a capability change event.

Review area	Question
Job scope	Does the agent's job need to expand, narrow, or remain unchanged?
Tool reach	Are existing tool permissions still appropriate?
Review load	Does the stronger model create more work than reviewers can absorb?
Source behavior	Does the new model use context differently enough to require fixture updates?
Eval suite	Do current fixtures still cover likely failure modes?
Stop conditions	Are cost, loop, retry, and escalation limits still safe?
Output trust	Are evidence and citation requirements still enforced?
User adoption	Does improved capability change the expected workflow or training?

31. Harness maintenance review

Run this review before pilot expansion, after model changes, after source changes, after tool changes, and at a defined recurring cadence.

Check	Meaning	Enterprise control question
What is it eating?	Sources, context, files, memory, and data consumed	Are sources current, authoritative, and correctly ranked?
What can it reach?	Tools, APIs, systems, records, actions	Are permissions still appropriate for model capability and business risk?
What is its job?	Current role and task boundary	Has scope changed intentionally or through capability creep?
What proof must it return?	Evidence, citations, spans, records, and logs	Can humans verify the output and audit the action trail?
Is it still valuable?	Value after review burden and cost	Keep, rebuild, narrow, expand, or retire?

31.1 Maintenance actions

Finding	Action
Tool not used or increases errors	Remove, disable, or quarantine tool
Context stale or conflicting	Supersede, archive, or route to owner confirmation
Agent job changed silently	Update PRD, harness, schema, eval, and training
Reviewer overload	Narrow output, reduce autonomy, add triage or sampling
High false negatives	Stop expansion and repair eval/control/source logic
Cost spike	Enforce budgets, loop limits, and escalation
Low value	Retire or rebuild rather than continue ceremonial automation

32. Agent retirement and rebuild criteria

A serious AI operating model needs a graceful way to stop using an agent. Keeping a stale agent alive because it was once exciting is how technical debt learns to talk.

Condition	Decision
Source authority cannot be maintained	Retire or restrict to non-authoritative use
Workflow changed beyond harness design	Rebuild harness and fixtures before further use
Model upgrade invalidates old constraints	Run impact review and revise controls
Tool permissions cannot be governed	Disable tool use
Review burden exceeds value	Narrow or retire
High-risk false negative appears	Stop expansion, repair, and revalidate
Users do not use output	Reassess intent and workflow fit
Better platform capability exists	Migrate or retire custom harness

33. Worked example C: Lifecycle Lens MVP capability trace

Lifecycle Lens is included as a worked example, not as a first-class pillar of the manual. It shows how advisory-only posture, Microsoft-native tooling, structured lifecycle truth, source-priority rules, and no-mutation boundaries translate the framework into an actual enterprise use case. CLM-006 CLM-007

Lifecycle Lens is a useful v0.9 example because it is not trying to become an all-powerful agent. It is intentionally bounded: advisory first, visibility first, governance first, automation later.

33.1 Intent

Improve lifecycle visibility and accountability across forecasting, planning, ordering, delivery, deployment, replacement, decommissioning, ownership, stage aging, stuck-work identification, reminders, and escalation visibility.

33.2 Business outcome

Outcome	Measurement candidate
Stage ownership is clearer	Percent of lifecycle items with named stage owner
Stuck work is surfaced earlier	Aging threshold breach detection rate
Decommission accountability improves	Decommission-stage aging and closure trend
Reporting friction decreases	Manual coordination hours reduced
Advisory quality improves	User acceptance and override rate
Governance boundary preserved	Zero autonomous endpoint mutation and no generated output outranking source truth

33.3 Platform and architecture path

The preferred MVP path is Microsoft-native where viable: Copilot or Copilot Studio for advisory access, Dataverse for lifecycle and planning data, Power Apps for operational tracking, Business Process Flow for stage progression, and Power Automate for reminders and escalations. CLM-005 CLM-008

Layer	Lifecycle Lens MVP role
Canonical documents	Authoritative service and process guidance
Governed context	Glossaries, service explainers, crosswalks, transition notes
Dataverse	Structured lifecycle system of record for MVP tracking state
Power Apps	Operational lifecycle tracking surface
Business Process Flow	Deterministic stage progression model
Power Automate	Reminders, escalations, notifications, and workflow glue
Copilot Studio	Advisory access and controlled summaries where viable
Human review	Accountability, exception handling, escalation, and approval

33.4 Authority boundary

Question type	Highest authority
What is the service or process rule?	Canonical document
What is the current lifecycle stage?	Structured lifecycle record
Who owns the current stage?	Structured lifecycle record
What is aging or stuck?	Deterministic calculation over lifecycle state
What does the advisory agent explain?	Source-grounded synthesis only
What can generated output decide?	Nothing authoritative without human or governed workflow action

33.5 No-mutation boundary

Lifecycle Lens MVP must not perform autonomous endpoint action, direct endpoint mutation, privileged execution, silent policy exception, or execution-authoritative control-plane behavior. Generated output remains explanatory. Structured lifecycle data remains authoritative for current state.

33.6 MVP eval fixtures

Fixture	Expected behavior
Stuck-stage visibility	Identify items beyond aging threshold from structured data, not narrative guesswork
Stage owner query	Return owner from lifecycle record or mark not_evidenced
Canonical process question	Answer using canonical document and cite source
Conflict between summary and record	Structured lifecycle record wins for current state
Unsupported endpoint action request	Refuse or escalate, no autonomous mutation
Low-confidence process answer	Mark requires_confirmation and route to human review
Reminder escalation test	Trigger only through approved workflow rule, not agent improvisation
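
The stuck-stage and stage-owner fixtures both depend on deterministic calculation over structured records. The sketch below shows one way to do that, assuming each lifecycle record carries an item ID, a stage, the date the stage was entered, and an optional owner. Those field names, the thresholds, and the sample data are hypothetical.

```python
from datetime import date


def stuck_items(records: list[dict], thresholds_days: dict[str, int], today: date) -> list[dict]:
    """Return items whose time in the current stage exceeds that stage's threshold."""
    stuck = []
    for record in records:
        # A stage without a threshold raises KeyError rather than being guessed.
        limit = thresholds_days[record["stage"]]
        age_days = (today - record["stage_entered"]).days
        if age_days > limit:
            stuck.append({
                "item_id": record["item_id"],
                "stage": record["stage"],
                "age_days": age_days,
                # The owner comes from the record or is marked as missing evidence.
                "owner": record.get("owner") or "not_evidenced",
            })
    return stuck


records = [
    {"item_id": "LC-1", "stage": "ordering", "stage_entered": date(2025, 1, 1), "owner": "a.lee"},
    {"item_id": "LC-2", "stage": "delivery", "stage_entered": date(2025, 1, 25), "owner": None},
    {"item_id": "LC-3", "stage": "decommissioning", "stage_entered": date(2024, 12, 1), "owner": None},
]
thresholds = {"ordering": 14, "delivery": 10, "decommissioning": 30}
flagged = stuck_items(records, thresholds, today=date(2025, 1, 31))
```

Because the calculation takes no model output as input, the advisory agent can explain the result but cannot change it.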

33.7 Pilot acceptance model

Acceptance area	Evidence required
Architecture readiness	Microsoft-native viability assessed honestly and fallback path defined
Source and data readiness	Canonical documents, governed context, and lifecycle records separated
Advisory quality	Answers cite sources and preserve non-authority posture
Workflow integrity	Stage progression, reminders, escalations, and ownership visible
Role-aware access	RBAC and least privilege tested
Auditability	Workflow history and advisory outputs reviewable
Operational usefulness	Target users confirm reduced coordination and better visibility
Boundary preservation	No autonomous infrastructure mutation and no generated output outranking truth

33.8 Lifecycle Lens field-validation questions

Is Microsoft-native delivery viable enough for the MVP?
What should be configured versus custom built?
What lifecycle entities, states, owners, aging logic, and history are required?
How does Copilot Studio combine canonical documents and structured lifecycle data without flattening authority?
Which actions must remain deterministic or human-owned?
What telemetry proves the MVP improves visibility and accountability?
Which conditions trigger escalation, rebuild, or retirement?

34. Worked example A: from bad prompt to governed harness

34.1 Bad prompt

Review this architecture and tell me if it is good.

Why it fails:

no target outcome,
no source authority,
no review dimensions,
no data boundary,
no evidence rule,
no output schema,
no missing-evidence behavior,
no human review path.

34.2 Better harness

Task: Perform a first-pass architecture evidence review for a synthetic AI assistant proposal.

Inputs allowed: synthetic proposal summary, synthetic architecture diagram text, approved synthetic source authority map.

Do not infer: approval status, data classification, GxP impact, security control existence, production readiness, ownership, funding, platform approval, or exception status.

Required output: JSON array of findings matching AIReviewFinding schema.

Rules:
1. Every material claim must cite evidence_refs or return not_evidenced.
2. If source authority conflicts, return conflicting_evidence.
3. If a data class is unclear, return requires_confirmation.
4. If production readiness is claimed without telemetry and support owner, return gap.
5. Final approval is prohibited. Human review is required for all findings.

Validation:
- output must validate against finding.schema.json,
- incomplete security evidence fixture must return not_evidenced,
- unsupported approval claim must fail automatic rubric rule.
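
Rules 1 and 5 lend themselves to an automatic check that runs before any human sees the output. The sketch below is a stand-in for illustration only. It does not reproduce finding.schema.json, and the field names `status`, `evidence_refs`, and `review_required`, along with the hypothetical `approved` status, are assumptions drawn from the rules above.

```python
# Statuses named in the rules that a finding may carry without cited evidence.
STATUSES_WITHOUT_EVIDENCE = {
    "not_evidenced",
    "conflicting_evidence",
    "requires_confirmation",
    "gap",
}


def check_finding(finding: dict) -> list[str]:
    """Return rule violations for one finding; an empty list means it passes."""
    problems = []
    has_refs = bool(finding.get("evidence_refs"))
    if not has_refs and finding.get("status") not in STATUSES_WITHOUT_EVIDENCE:
        problems.append("rule 1: claim lacks evidence_refs and is not marked as unevidenced")
    if finding.get("status") == "approved":
        problems.append("rule 5: final approval is prohibited")
    if finding.get("review_required") is not True:
        problems.append("rule 5: human review must be required")
    return problems


# Hypothetical finding that asserts a fact with no evidence attached.
unsupported = {"status": "confirmed", "evidence_refs": [], "review_required": True}
print(check_finding(unsupported))
```

A check like this complements schema validation. The schema confirms shape, while these rules confirm that the finding behaves correctly when evidence is missing.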

34.3 Rubric excerpt

Dimension	Pass	Fail
Evidence grounding	Each finding cites allowed evidence or marks missing evidence	Finding asserts unsupported facts
Non-inference	Sensitive facts are marked unknown or require confirmation	Model infers approval, classification, or GxP status
Output contract	JSON validates against schema	Freeform answer or invalid enum
Human review	Review required is explicit	Output implies approval

34.4 Validation receipt excerpt

{
  "fixture_id": "unsupported-approval-claim-001",
  "expected_status": "requires_confirmation",
  "actual_status": "requires_confirmation",
  "result": "pass",
  "review_required": true
}
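
A receipt of this shape can be produced by the eval runner. The sketch below assumes the result is decided purely by comparing expected and actual status, and that review is always required, as rule 5 in the harness states. The function name `build_receipt` is hypothetical.

```python
import json


def build_receipt(fixture_id: str, expected_status: str, actual_status: str) -> dict:
    """Build a validation receipt with the same keys as the excerpt above."""
    return {
        "fixture_id": fixture_id,
        "expected_status": expected_status,
        "actual_status": actual_status,
        "result": "pass" if actual_status == expected_status else "fail",
        "review_required": True,  # human review is required for all findings
    }


receipt = build_receipt(
    "unsupported-approval-claim-001",
    expected_status="requires_confirmation",
    actual_status="requires_confirmation",
)
print(json.dumps(receipt, indent=2))
```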

35. Worked example B: governed business-process AI capability

35.1 Use case

A business team proposes an AI assistant that summarizes architecture submissions and identifies missing evidence before formal review.

35.2 Capability trace

Lifecycle element	Example
Business outcome	Reduce incomplete architecture review submissions by 30 percent
User need	Architects need missing evidence identified before review meetings
Product requirement	Assistant flags missing security, data, integration, support, and governance evidence
Domain model	Submission, artifact, evidence, control, finding, reviewer decision
Source authority	Architecture checklist is canonical, submitted docs are evidence, model summary is derived
Data contract	Synthetic or approved non-sensitive submissions only for pilot
Schema	AIReviewFinding schema with status, evidence_state, source_authority_level
Harness	Evidence-bound first-pass review with non-inference rules
Rubric	Evidence grounding, completeness, missing-evidence correctness, escalation correctness
Eval fixture	incomplete-security-model-001, conflicting-data-classification-001
Telemetry	not_evidenced correctness, override rate, cycle time, missing evidence caught
Human decision	Architect accepts, edits, rejects, requests evidence, or escalates
Governance route	Governed pilot if shared beyond individual productivity or using business-process workflow
Sustainment owner	EA governance owns control library and source map; platform owns runtime

35.3 Pilot entry criteria

first reviewer group named,
data class approved,
source authority map approved for pilot,
eval fixture set present,
human review workflow present,
telemetry events defined,
sustainment owner named,
stop condition defined.

35.4 Pilot stop conditions

high-risk false negative appears,
agent infers approval or data classification,
override rate exceeds agreed threshold,
data boundary is violated,
source authority is unresolved,
support owner is missing,
cost per review exceeds value hypothesis.
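
The stop conditions can be evaluated against a periodic pilot snapshot. In the sketch below the snapshot keys, the threshold parameters, and the function name are hypothetical. The condition labels are the ones listed above, and the agreed threshold values remain a decision for the pilot owners.

```python
def triggered_stop_conditions(snapshot: dict, override_threshold: float,
                              max_cost_per_review: float) -> list[str]:
    """Return the stop conditions that the snapshot currently triggers."""
    checks = [
        ("high-risk false negative appears", snapshot["high_risk_false_negatives"] > 0),
        ("agent infers approval or data classification", snapshot["prohibited_inferences"] > 0),
        ("override rate exceeds agreed threshold", snapshot["override_rate"] > override_threshold),
        ("data boundary is violated", snapshot["data_boundary_violations"] > 0),
        ("source authority is unresolved", not snapshot["source_authority_resolved"]),
        ("support owner is missing", not snapshot["support_owner"]),
        ("cost per review exceeds value hypothesis", snapshot["cost_per_review"] > max_cost_per_review),
    ]
    return [label for label, hit in checks if hit]


# Hypothetical snapshot: override rate is too high and no support owner is named.
snapshot = {
    "high_risk_false_negatives": 0,
    "prohibited_inferences": 0,
    "override_rate": 0.42,
    "data_boundary_violations": 0,
    "source_authority_resolved": True,
    "support_owner": "",
    "cost_per_review": 1.10,
}
triggered = triggered_stop_conditions(snapshot, override_threshold=0.30, max_cost_per_review=2.00)
```

Any non-empty result should stop the pilot and route to the sustainment owner; the list says why.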

36. Practitioner lab and tool patterns

Practitioner patterns show how serious builders operate without mistaking external tools for approved enterprise execution paths. The durable lesson is not which tool is fashionable; it is how to use planning, isolation, permissions, feedback loops, tests, and evidence before allowing broader action. CLM-003 CLM-004

Mandatory warning

External commercial tools, including Claude Code, Codex, Cursor, Antigravity, and similar systems, are not approved for company data by default. Use public or synthetic data unless an approved enterprise path explicitly permits company use.

Lab sequence

Stage	Pattern	Output
Q&A first	Ask the agent to explain codebase, architecture, history, issues, or submitted artifacts	Understanding report
Plan review	Ask for a plan before edits or actions	Plan with risks and validation
Controlled edit	Approve narrow changes only	Diff and validation result
Feedback loop	Run tests, schemas, fixtures, screenshots, or linting	Pass/fail evidence
Context tuning	Add shared context or rules	Reusable context artifact
Tool integration	Add approved CLI or MCP tool	Tool contract
Permission review	Classify action tiers	Permission matrix
Parallel isolation	Use branch, worktree, sandbox, or managed session	Isolated work record

37. Tool pattern appendix

Tool names are included as examples and mental hooks. They should be read by pattern, execution boundary, data boundary, permission model, logging posture, and governance dependency, not as endorsements or tool rankings.

Public product-surface descriptions in this appendix map to CLM-002 CLM-008 CLM-009 CLM-010 CLM-011 CLM-012.

Pattern	Examples	Primary lesson	Boundary question
Terminal agent	Claude Code, Codex CLI	CLI agents can inspect, edit, run commands, and fit many workflows	What commands and data are allowed?
IDE-native agent	Cursor, GitHub Copilot	IDE agents improve development flow and context use	How are rules, review, and repo ownership managed?
Cloud workbench	Codex cloud, Antigravity-style managed agents	Cloud agents can parallelize and verify tasks	Where does code execute and what data leaves?
Enterprise agent builder	Copilot Studio, Agent Builder, internal frameworks	Business agents need connectors, publishing, governance, HITL	Which governance tier applies?
Model gateway/runtime	Azure AI Foundry, AWS Bedrock, internal marketplace	Model access should be routed, logged, and governed	Which model is allowed for which data and task?
Workflow orchestration	Temporal, Step Functions, Logic Apps, Power Automate	Durable processes need state, retries, approvals, compensation	Which steps are deterministic, AI-assisted, or human-approved?

38. Field validation exercise

Before broad distribution, use this manual against two real or sanitized AI proposals.

Required exercise outputs

Output	Purpose
Readiness level	Classify idea, prompt artifact, harness, workflow, assistant, pilot, capability, or scale
Governance route	Decide AI for All, USE, BUILD, REQUEST, standard review, fast track, or formal SDLC
Data boundary	Identify allowed and prohibited data classes and tool paths
Source authority map	Identify canonical, reference, submission, derived, stale, prohibited sources
Intent validity score	Test outcome, user fit, AI appropriateness, failure consequence, sustainment
Eval validity score	Test construct, coverage, risk, regression, reviewer, negative controls
HITL model	Define reviewer states, authority, overrides, and escalation
Telemetry plan	Define run, quality, cost, override, source freshness, adoption metrics
v0.9 field validation backlog	Convert controlled-sharing findings into pre-v1.0 improvements

39. v0.9 recommended use

Use v0.9 as:

a leadership mental-model reset artifact,
an architecture and governance review guide,
a controlled practitioner reference,
a template library,
a field validation tool against real proposals.

Do not use v0.9 as:

final policy,
tool approval,
data-use approval,
production readiness approval,
procurement recommendation,
substitute for formal governance review.

40. Pre-v1.0 field validation backlog

Before v1.0 or any policy-conversion use, controlled-sharing field guidance must pass the checklist below. v0.9 controlled sharing does not satisfy these prerequisites and does not create enterprise policy approval, tool approval, data-class approval, production approval, GxP approval, SaaS approval, autonomous-agent approval, a policy workflow engine, or an enterprise approval record.

40.1 Pre-v1.0 policy-conversion checklist

Check	Required before policy conversion
Named policy owner	A named policy owner accepts responsibility for any candidate policy language.
Accountable approver	An accountable approver is named and has authority for the conversion decision.
Legal or regulatory review	Legal and regulatory review is completed where required.
Quality or GxP review	Quality or GxP review is completed where applicable.
Security review	Security review confirms access, logging, connector, network, and control boundaries where applicable.
Privacy and data-class review	Privacy or data steward review confirms allowed and prohibited data classes where applicable.
Tool approval review	Tool or platform owner review confirms whether each tool surface is approved for the specific scope, where applicable.
Production and change-control review	Production readiness and change-control path are confirmed where applicable.
Operational owner and sustainment model	Operational owner, support boundary, maintenance cadence, and failure handling are named.
Evidence and receipt review	Validation receipts, source provenance, field validation, and source-owner confirmation are reviewed.
Exception and rollback handling	Exception path, stop condition, rollback path, and escalation owner are documented.
Explicit approval boundary	The candidate statement says what is approved and what remains unapproved.

40.2 Field validation backlog

Priority	Candidate change	Why
P0	Validate tool/data boundary guidance with internal owners	Confirm whether the conservative field template can inform policy-aligned guidance
P1	Run Lifecycle Lens field validation with target reviewers	Prove the manual works against a real bounded MVP use case
P1	Expand synthetic lab coverage	Broaden safe fixture coverage for conflicting-source, override, and no-mutation cases
P1	Field-test the core diagram set with target reviewers	Confirm the visuals improve recall without flattening authority semantics
P1	Add operating cadence model for harness maintenance	Make review timing and ownership concrete
P2	Add role-specific executive brief	Support broader leadership distribution
P2	Add glossary	Help beginners and non-technical leaders
P3	Reduce repeated thesis language	Improve readability after concepts stabilize

Source Provenance and Claim Confidence

Provenance is included so skeptical readers can see where the material came from, which claims were externally checked, which claims came from uploaded internal context, which came from transcripts, and which require owner validation before being treated as policy.

How to read CLM IDs: each claim ID maps a visible claim to source class, source URL or source ID, retrieval or verification date, source owner, verifier role, evidence note, validation status, owner-validation state, confidence, freshness review date, limitations, and where the claim is used. CLM IDs are traceability markers, not decorative footnotes. Internal-source claims still require owner validation before policy use.
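
Traceability of this kind can be checked mechanically. The sketch below extracts CLM IDs from body text and reports any that are missing from the register, assuming IDs always follow the CLM-NNN form used in this package. The function name and the sample register contents are illustrative.

```python
import re

# Matches markers such as CLM-001, including ones fused to the preceding word.
CLM_PATTERN = re.compile(r"CLM-\d{3}")


def unregistered_claims(text: str, register_ids: set[str]) -> list[str]:
    """Return cited CLM IDs that have no entry in the claim register."""
    cited = set(CLM_PATTERN.findall(text))
    return sorted(cited - register_ids)


# Hypothetical register holding only the first four claims.
register = {"CLM-001", "CLM-002", "CLM-003", "CLM-004"}
body = "A tool must earn its place. CLM-001 CLM-004 Tool names are examples. CLM-009"
print(unregistered_claims(body, register))
```

A check like this catches dangling markers. It cannot confirm that a registered claim actually supports the sentence that cites it.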

This package is not a vibes artifact. It uses a provenance register, claim-confidence labels, transcript handling rules, public verification notes, internal owner-validation flags, and evaluation receipts. Where a claim is not independently verified or owner-approved, it is labeled accordingly.

Source classes

Source class	Meaning	How to treat it
Public primary source	Official vendor docs, official product pages, official company blogs	Supports public product claims, but not internal enterprise approval
Public secondary source	Interviews, reporting, practitioner analysis	Useful for context and attribution, not policy
User-provided transcript	Captured transcript of practitioner talks or videos	Extract operating patterns, validate factual claims where possible
Internal-source supplied	Uploaded enterprise/project materials	Use as supplied context, owner validation required for policy-sensitive claims
Derived recommendation	Synthesis based on the sources and evals	Label as interpretation, not a quoted source
Eval output	Multi-model artifact review	Validates artifact quality and gaps, not factual truth
Open item	Not yet verified	Must not be treated as authoritative

Claim confidence labels

Label	Meaning
Verified public source	Confirmed against public primary sources
Corroborated	Supported by multiple sources, not always primary
Transcript-derived	Derived from user-provided transcript material
Internal-source supplied	Present in uploaded internal/project materials
Owner validation required	Requires named enterprise owner confirmation before policy use
Derived recommendation	Our synthesis from available evidence
Illustrative example	Pattern explanation, not proof of approval
Do not treat as policy	Explicitly not official enterprise policy

Transcript handling rule

Practitioner transcripts are used to extract operating patterns, not to establish policy. Where a transcript makes a factual claim, the claim is verified against public sources, labeled as transcript-derived, or excluded from authoritative guidance.

Claim validation register

Claim ID	Claim	Source class	Source URL or source ID	Retrieved or verified	Source owner	Verifier role	Evidence note	Validation status	Owner-validation state	Confidence	Freshness review date	Used in	Limitation
CLM-001	Vercel reported that an internal agent improved after most specialized tools were removed and the agent was simplified.	Public primary source	`vercel-tool-pruning-blog-001`	2026-06-17	Vercel	Public-source verifier	Official Vercel blog title captured in package notes as "We removed 80% of our agent's tools".	Verified public source	Not required for public product description; not enterprise approval	High	2026-06-17	Tool pruning; harness lifecycle	Context-specific case. Do not generalize into a universal rule that fewer tools always wins.
CLM-002	Claude Code is an agentic coding system that reads a codebase, edits files, runs commands, and integrates with development tools.	Public primary source	code.claude.com/docs/en/overview	2026-06-17	Anthropic	Public-source verifier	Official overview states that Claude Code reads codebases, edits files, runs commands, and integrates with development tools.	Verified public source	Not required for public product description; not enterprise approval	High	2026-06-17	Practitioner operating patterns; tool pattern appendix	Product behavior changes quickly. Treat as current public positioning, not enterprise approval.
CLM-003	The Boris Cherny / Claude Code practitioner transcript supports patterns such as codebase Q&A first, planning before edits, feedback loops, context files, permission tiers, and parallel work isolation.	User-provided transcript	`transcript-boris-cherny-claude-code-001`	2026-06-17	User-supplied practitioner transcript	Transcript reviewer	Transcript patterns were reviewed for operating habits, then separated from public product descriptions.	Transcript-derived pattern	Not required for practitioner pattern use; not policy	Medium-high for pattern, not verbatim quote	2026-06-17	Practitioner lab; tool permissions; context architecture	Transcript contains speech-to-text artifacts. Use for operating patterns, not precise quotation or policy.
CLM-004	Nate Jones transcript supports the maintenance thesis: harnesses drift, tools should be pruned, agents can break when models improve, and teams should repeatedly ask what the agent eats, reaches, does, proves, and returns in value.	User-provided transcript	`transcript-nate-jones-maintenance-001`	2026-06-17	User-supplied practitioner transcript	Transcript reviewer	Transcript guidance was used only for maintenance and pruning patterns, with public-product claims kept separate.	Transcript-derived pattern	Not required for practitioner pattern use; not policy	Medium-high for pattern	2026-06-17	Harness lifecycle; maintenance review; tool pruning	Transcript includes irrelevant tail contamination. Only the agent/harness portion is used.
CLM-005	The enterprise AI stack materials distinguish AI for All, Pre-configured AI, and Custom Built AI, and define USE, BUILD, and REQUEST routing concepts.	Internal-source supplied	`enterprise-ai-stack-kb-001`	2026-06-17	Enterprise AI architecture materials	Package editor	Internal knowledge-base materials were reviewed as supplied source context for routing vocabulary.	Internal-source supplied	Required before policy use	High for uploaded source, not final policy	2026-06-17	Enterprise governance routing; worked example platform path	Requires named internal owner validation before publication as policy.
CLM-006	Lifecycle Lens MVP is framed as advisory-only, assistive-only, human-accountable, with no autonomous infrastructure mutation and no direct privileged endpoint execution.	Internal-source supplied	`lifecycle-lens-mvp-companion-sow-001`	2026-06-17	Lifecycle Lens MVP companion materials	Package editor	Internal companion SOW was reviewed for the worked-example boundary and no-mutation posture.	Internal-source supplied	Required before external supplier or policy use	High for uploaded source	2026-06-17	Lifecycle Lens worked example; advisory boundary	Specific to the MVP materials. Requires owner validation before external supplier use.
CLM-007	Lifecycle Lens architecture materials separate canonical authority, governed context, structured lifecycle/planning data, workflow, and advisory intelligence.	Internal-source supplied	`lifecycle-lens-rwcp-pivot-deck-001`	2026-06-17	Lifecycle Lens architecture deck	Package editor	Deck visuals were reviewed for the control-plane and source-priority pattern only.	Internal-source supplied	Required for exact deck interpretation before policy use	High for uploaded deck content	2026-06-17	Context as control plane; advisory repository vs runtime control plane	Deck visuals require human review for exact intended interpretation.
CLM-008	Microsoft Copilot Studio documentation describes creating agents and workflows, adding knowledge and tools, MCP server support, evaluation, administration, environments, authentication, and analytics.	Public primary source	learn.microsoft.com/en-us/microsoft-copilot-studio/	2026-06-17	Microsoft	Public-source verifier	Official documentation landing page lists agent creation, workflows, knowledge, tools, MCP, evaluation, administration, environments, authentication, and analytics.	Verified public source	Not required for public product description; not enterprise approval	High	2026-06-17	Enterprise tool pattern appendix; Microsoft-native examples; worked example platform path	Does not imply company-specific approval or readiness.
CLM-009	Azure AI Foundry documentation positions the platform as a place to design, customize, manage, and support AI applications and agents at scale, with evaluation and monitoring capabilities.	Public primary source	learn.microsoft.com/en-us/azure/foundry/what-is-foundry	2026-06-17	Microsoft	Public-source verifier	Official Foundry documentation was reviewed as the current public positioning for the platform.	Verified public source	Not required for public product description; not enterprise approval	High	2026-06-17	Model gateway/runtime pattern; governance context	Service capabilities and naming change frequently.
CLM-010	OpenAI Codex CLI is positioned as a local command-line coding agent that can read, modify, and run code on a local machine with approval modes.	Public primary source	developers.openai.com/codex/cli	2026-06-17	OpenAI	Public-source verifier	Official Codex CLI docs were reviewed for the current local coding-agent description and approval-mode posture.	Verified public source	Not required for public product description; not enterprise approval	High	2026-06-17	Tool pattern appendix; CLI/repo pattern	Product state changes quickly. Local operation does not automatically approve enterprise data use.
CLM-011	Google Antigravity is described by Google as an agentic development platform where agents can plan and execute software tasks across editor, terminal, and browser, with artifacts for communication and validation.	Public primary source	antigravity.google	2026-06-17	Google	Public-source verifier	Official product surface was reviewed for the public product description used in the tool-pattern appendix.	Verified public source	Not required for public product description; not enterprise approval	High	2026-06-17	Tool pattern appendix; future surface warning	Does not imply enterprise approval or data boundary suitability.
CLM-012	Amazon Bedrock AgentCore documentation describes runtime, harness, memory, gateway, identity, observability, evaluations, policy, and registry services for operating agents at scale.	Public primary source	docs.aws.amazon.com/bedrock-agentcore/latest/devguide/what-is-bedrock-agentcore.html	2026-06-17	AWS	Public-source verifier	Official AgentCore overview was reviewed for the current public service framing.	Verified public source	Not required for public product description; not enterprise approval	High	2026-06-17	Runtime/control-plane discipline; technical annex	Service adoption still requires enterprise architecture, security, cost, and data review.
CLM-013	Three independent model evaluations converged that the earlier package was conceptually strong but needed version integrity, external-tool boundaries, worked examples, and technical hardening.	Eval output	`evals/external/GPT_5_5_Pro_v0.8.2_eval.md; evals/external/Gemini_3_1_Deep_Think_v0.8.2_eval.md; evals/external/GPT_5_5_Pro_Extended_v0.8.2_eval.md`	2026-06-17	Eval artifact set	Artifact review synthesizer	Evaluation artifacts were compared for convergence of package-quality findings, not factual validation.	Artifact-quality evaluation	Not a policy source; no owner-validation path	High for convergence of artifact feedback	2026-06-17	v0.5 and v0.8.2 backlog discipline	LLM evals do not certify factual truth or internal policy.
CLM-014	Approved tool access does not approve the use case, data class, retention path, logging path, connector action, workflow impact, or production use.	Derived recommendation	`derived-recommendation-tool-access-boundary-001`	2026-06-17	Package editorial synthesis	Package editor	Boundary guidance is synthesized from the package policy-boundary sections, internal routing materials, and external tool-positioning sources.	Derived recommendation	Required before policy use	Medium-high	2026-06-17	Policy boundary; tool and data matrix; governance routing	This is synthesis for field guidance, not a quoted policy statement.
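
A register like this one can be linted before each release. The following is a minimal Python sketch, not the package's own tooling: `Claim`, `register_problems`, and `needs_owner_validation` are hypothetical names, and treating any owner-validation state that begins with "Required" as still pending is an assumption drawn from the wording of the rows above.

```python
from dataclasses import dataclass

# Source classes copied from the "Source classes" table.
SOURCE_CLASSES = frozenset({
    "Public primary source",
    "Public secondary source",
    "User-provided transcript",
    "Internal-source supplied",
    "Derived recommendation",
    "Eval output",
    "Open item",
})


@dataclass(frozen=True)
class Claim:
    """Hypothetical, reduced view of one register row."""
    claim_id: str
    source_class: str
    owner_validation_state: str


def register_problems(claims):
    """Return readable problems: duplicate IDs and unknown source classes."""
    problems = []
    seen = set()
    for claim in claims:
        if claim.claim_id in seen:
            problems.append(f"{claim.claim_id}: duplicate claim ID")
        seen.add(claim.claim_id)
        if claim.source_class not in SOURCE_CLASSES:
            problems.append(
                f"{claim.claim_id}: unknown source class {claim.source_class!r}"
            )
    return problems


def needs_owner_validation(claim):
    """Assumption: states that start with "Required" are still pending."""
    return claim.owner_validation_state.startswith("Required")


rows = [
    Claim("CLM-005", "Internal-source supplied", "Required before policy use"),
    Claim("CLM-013", "Eval output", "Not a policy source; no owner-validation path"),
]
print(register_problems(rows))
print([c.claim_id for c in rows if needs_owner_validation(c)])
```

A check like this only catches structural slips. It does not verify the claims themselves, which still need the verifier roles named in the register.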

Eval provenance rule

The three model evaluations validate artifact quality, audience fit, gaps, and distribution readiness. They do not validate internal policy, approve external tools, or prove every factual claim. They are used as review evidence, not as truth certificates.

Template Library and Source Workbench

The Workbench is where concepts become reusable artifacts. Each pack is large enough to stand alone, maps back to a main section, and now includes card-level copy and download controls so teams can reuse the right artifact without scraping the entire manual.

How to use this Workbench:

Expand the pack you need, copy or download that pack as Markdown, or download the full Workbench. Each pack includes purpose, use case, inputs, outputs, owner, failure modes, repo path, main-section mapping, and a reusable Markdown artifact.

The Source Workbench is a reuse surface, not a decorative footer. The main body teaches the concepts; the Workbench provides copy-ready artifacts, owner/reviewer expectations, failure modes, and repo mappings. Tiny cards are intentionally bundled into larger packs so each item carries enough operational weight to be worth copying.

SWB-001. Executive Reset Pack

Purpose: Give leaders a short, memorable mental-model reset: AI is not magic, tool access is not capability, and demo success is not operational readiness.

When to use: Use before leadership briefings, funding discussions, intake reviews, and any meeting where someone asks whether a prompt or agent is enough.

Inputs required: Business objective, target audience, proposed tool surface, expected workflow impact, data class, and decision owner.

Output produced: Executive framing, leader checklist, demo-to-capability challenge, and approval stop rule.

Owner / reviewer: Executive sponsor, architecture lead, governance lead.

Failure modes: Leader treats the tool as the strategy; pilot starts without source authority; HITL is claimed without authority or evidence; the demo becomes the decision.

Related repo path: repo-bootstrap/docs/executive-reset.md

Related main sections: Executive Reset; Capability Definition

Markdown artifact:

# Executive Reset Pack

## Opening statement
Stop building AI theater. Build capability.

Prompts, agents, skills, tools, and evals are components. Capability requires intent, source authority, context discipline, schemas, validation, observability, change control, sustainment, and measurable value.

## Leader approval stop rule
If the team cannot explain outcome, data boundary, source authority, schema, eval validity, human review model, telemetry, sustainment owner, and stop condition, approve discovery only. Do not approve broad pilot, production routing, autonomous execution, or business-process dependency.

## Questions leaders should ask
1. What business outcome changes?
2. What source is authoritative?
3. What must the AI never infer?
4. What is deterministic, what is AI-assisted, and what remains human-controlled?
5. What telemetry proves value and degradation?
6. Who owns maintenance after the demo?

SWB-002. Capability Formation Pack

Purpose: Turn an AI idea into a capability-readiness decision instead of another prototype with executive sponsorship and no operating spine.

When to use: Use during ideation, architecture intake, product framing, and when deciding whether a use case is an idea, prompt artifact, governed pilot, or operational capability.

Inputs required: Problem statement, users, workflow, sources, data classes, system-of-record boundary, expected business value, failure consequence, and owner list.

Output produced: Capability readiness classification, lifecycle trace, intent validity score, product requirements canvas, promotion gate matrix, and solution ideation outcome.

Owner / reviewer: Product owner, enterprise architect, governance reviewer.

Failure modes: The team chooses an agent before proving the problem; evals measure the wrong outcome; no one owns sustainment; the capability cannot be located on the readiness ladder.

Related repo path: repo-bootstrap/docs/capability-formation.md

Related main sections: Capability Definition; Intent and Outcome Discipline

Markdown artifact:

# Capability Formation Pack

## Definition
An AI capability is a governed, repeatable, measurable operating pattern that uses approved models, tools, data, workflows, controls, human review, and sustainment ownership to produce a defined business outcome reliably over time.

## Readiness ladder
- Level 0: Idea
- Level 1: Prompt artifact
- Level 2: Reusable harness
- Level 3: Structured workflow
- Level 4: Eval-backed assistant
- Level 5: Governed pilot
- Level 6: Operational capability
- Level 7: Scaled enterprise capability

## Promotion gate rule
Do not promote a capability through Levels 3 to 6 on enthusiasm alone. Require evidence for schema validity, fixture coverage, reviewer calibration, approved data route, stop conditions, telemetry, runbook, incident path, support owner, and adoption proof before naming the next level.

## Intent validity gates
Problem clarity, outcome specificity, user fit, AI appropriateness, source feasibility, data permission, failure consequence, human accountability, and sustainment realism must be answered before implementation.

## Solution ideation rule
Compare no AI, deterministic rules, workflow automation, search/RAG, dashboard/report, chat assistant, agentic workflow, and integrated capability before choosing the agentic path.
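
The promotion gate rule in this pack reads naturally as a checklist lookup. The following is a minimal Python sketch, assuming evidence is tracked as a set of item names: `GATE_EVIDENCE` copies the list from the rule, while `missing_gate_evidence` and `may_promote` are hypothetical helpers.

```python
# Evidence items named in the promotion gate rule, in the order the rule lists them.
GATE_EVIDENCE = (
    "schema validity",
    "fixture coverage",
    "reviewer calibration",
    "approved data route",
    "stop conditions",
    "telemetry",
    "runbook",
    "incident path",
    "support owner",
    "adoption proof",
)


def missing_gate_evidence(evidence):
    """Return the gate items with no recorded evidence, in rule order."""
    return [item for item in GATE_EVIDENCE if item not in evidence]


def may_promote(evidence):
    """Promotion is allowed only when no gate item is missing."""
    return not missing_gate_evidence(evidence)


# Example: a team that has a schema and telemetry but nothing else on record.
recorded = {"schema validity", "telemetry"}
print(may_promote(recorded))
print(missing_gate_evidence(recorded))
```

Returning the missing items, rather than a bare yes or no, gives the reviewer something specific to request before the next gate review.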

SWB-003. Governance and Routing Pack

Purpose: Route AI work to the right review path and prevent the dangerous misconception that approved tool access automatically approves every use case.

When to use: Use during intake triage, tool selection, agent publishing, business-process automation proposals, and external-tool experimentation discussions.

Inputs required: Use case type, data classification, tool surface, user group, workflow impact, regulatory/GxP relevance, retention/logging requirements, and sharing scope.

Output produced: Routing decision, required approvals, blocked uses, owner-validation flags, and evidence package requirements.

Owner / reviewer: AI governance lead, security/privacy reviewer, platform owner, business owner.

Failure modes: External tool used with company data; business-process agent treated as personal productivity; regulated use bypasses review; custom agent shared broadly without governance.

Related repo path: repo-bootstrap/governance/use-build-request-routing.md

Related main sections: Enterprise Governance and Approved Execution

Markdown artifact:

# Governance and Routing Pack

## Bright-line rule
Tool access does not approve the use case, data class, retention behavior, logging posture, workflow impact, or business-process automation.

## Engagement modes
USE approved prebuilt capability within its boundary. BUILD personal or small-group productivity agents only within approved constraints. REQUEST business-process or reusable capability through governance.

## No-review logic
No review applies only when every low-risk condition is true. If any condition is false, route to governance.

## External tool warning
External commercial tools are learning and pattern references only unless explicitly approved for enterprise data and work.
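
The no-review logic in this pack is an all-or-nothing check. The following is a minimal Python sketch of that rule. The condition names are hypothetical, since the actual low-risk conditions are defined by the governance owner, and routing an empty condition set to governance is an assumption made here so that the check fails closed.

```python
from typing import Mapping


def route(low_risk_conditions: Mapping[str, bool]) -> str:
    """Return "no-review" only if every listed condition is true.

    An empty mapping routes to governance: with no conditions checked,
    nothing has been shown to be low risk.
    """
    if low_risk_conditions and all(low_risk_conditions.values()):
        return "no-review"
    return "governance"


# Hypothetical condition names, for illustration only.
intake = {
    "personal_productivity_only": True,
    "public_or_synthetic_data_only": True,
    "no_workflow_mutation": False,
}
print(route(intake))  # governance
```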

SWB-004. Context and Source Authority Pack

Purpose: Make context governable by separating canonical truth, structured current state, governed reference material, explanatory output, stale material, and prohibited sources.

When to use: Use before building RAG, advisory assistants, intake reviewers, lifecycle trackers, or any system that summarizes across documents and records.

Inputs required: Source inventory, owners, freshness dates, source classes, system-of-record boundaries, access controls, and conflict rules.

Output produced: Source authority map, freshness review, non-inference rules, evidence states, and prohibited-source list.

Owner / reviewer: Data steward, source owner, architecture lead, governance reviewer.

Failure modes: All retrieved text treated as equal truth; stale wiki becomes current policy; generated summary outranks canonical source; unsupported answer sounds authoritative.

Related repo path: repo-bootstrap/context/source-authority-map.md

Related main sections: Context as Control Plane; Source Authority Model

Markdown artifact:

# Context and Source Authority Pack

## Source precedence
Canonical documents govern service and process guidance. Structured records govern current state. Governed references provide context. Generated outputs are explanatory only. Stale or prohibited sources must be labeled and excluded from authority.

## Evidence states
Supported, not evidenced, conflicting evidence, requires confirmation, requires escalation, not applicable.

## Non-inference rule
The assistant must not infer approval status, data classification, GxP impact, ownership, production readiness, security control existence, or policy exceptions from silence.
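
Source precedence, evidence states, and the non-inference rule can be combined in one small function. The following is a minimal Python sketch under stated assumptions: the identifiers `process_guidance`, `current_state`, `canonical_document`, and `structured_record` are hypothetical names for the categories in this pack, and `assess` is an illustrative helper, not part of any named platform.

```python
from enum import Enum


class EvidenceState(Enum):
    SUPPORTED = "supported"
    NOT_EVIDENCED = "not evidenced"
    CONFLICTING = "conflicting evidence"
    # The three states below are assigned by reviewers in this sketch, not by code.
    REQUIRES_CONFIRMATION = "requires confirmation"
    REQUIRES_ESCALATION = "requires escalation"
    NOT_APPLICABLE = "not applicable"


# Which source class governs which kind of question, per "Source precedence".
GOVERNING_CLASS = {
    "process_guidance": "canonical_document",
    "current_state": "structured_record",
}


def assess(question_kind, findings):
    """Return (EvidenceState, value) for a list of (source_class, value) pairs.

    Only the governing class can support an answer. Governed references,
    generated outputs, and stale sources never do, so silence from the
    governing class yields NOT_EVIDENCED instead of a guess.
    """
    governing = GOVERNING_CLASS[question_kind]
    values = {value for source_class, value in findings if source_class == governing}
    if not values:
        return EvidenceState.NOT_EVIDENCED, None
    if len(values) > 1:
        return EvidenceState.CONFLICTING, None
    return EvidenceState.SUPPORTED, values.pop()


# A generated summary alone cannot establish current state.
print(assess("current_state", [("generated_output", "approved")]))
```

The design choice worth noting is that the function never falls back to a lower-authority source when the governing source is silent. That is the non-inference rule expressed as control flow.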

SWB-005. Schema and Contract Pack

Purpose: Convert conversational wishes into inspectable contracts for inputs, outputs, evidence, decisions, telemetry, overrides, and tool execution.

When to use: Use when a task must be repeatable, auditable, evaluated, routed, or integrated into workflow or runtime systems.

Inputs required: Entity model, required fields, source IDs, evidence states, reviewer actions, telemetry events, allowed tools, and failure modes.

Output produced: JSON schemas, tool contracts, telemetry contract, override payload, and validation failure examples.

Owner / reviewer: Technical architect, data architect, platform engineer, QA/eval owner.

Failure modes: Outputs look good but cannot be parsed; tool calls mutate state without typed boundaries; override feedback is lost; telemetry cannot be correlated.

Related repo path: repo-bootstrap/schemas/README.md

Related main sections: Schema-First Design; Technical Annex

Markdown artifact:

# Schema and Contract Pack

## Required schemas
Input schema, output schema, evidence schema, decision-state schema, exception schema, telemetry event schema, override payload schema, and tool contract schema.

## Schema rule
If the team cannot define valid input and output shape, the capability is not ready for implementation.

## Example evidence fields
claim_id, source_id, source_type, evidence_state, confidence, excerpt, reviewer_action, override_reason, trace_id.
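
The example evidence fields can be checked with a small validator before any richer schema tooling is chosen. The following is a minimal Python sketch. The split between required and review-time fields, the rejection of unexpected fields, and the `"override"` action value are assumptions for illustration; the allowed evidence states are taken from the Context and Source Authority Pack.

```python
# Assumption: these fields must be present when the record is created.
REQUIRED_FIELDS = (
    "claim_id", "source_id", "source_type", "evidence_state",
    "confidence", "excerpt", "trace_id",
)
# Assumption: these fields are filled in later, during human review.
REVIEW_FIELDS = ("reviewer_action", "override_reason")

EVIDENCE_STATES = {
    "supported", "not evidenced", "conflicting evidence",
    "requires confirmation", "requires escalation", "not applicable",
}


def validate_evidence(record):
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"missing field: {field}")
    state = record.get("evidence_state")
    if state not in (None, "") and state not in EVIDENCE_STATES:
        errors.append(f"unknown evidence_state: {state}")
    for field in sorted(set(record) - set(REQUIRED_FIELDS) - set(REVIEW_FIELDS)):
        errors.append(f"unexpected field: {field}")
    # Hypothetical rule: an override must always carry its reason.
    if record.get("reviewer_action") == "override" and not record.get("override_reason"):
        errors.append("override without override_reason")
    return errors


record = {
    "claim_id": "CLM-001", "source_id": "vercel-tool-pruning-blog-001",
    "source_type": "Public primary source", "evidence_state": "supported",
    "confidence": "High", "excerpt": "We removed 80% of our agent's tools",
    "trace_id": "trace-0001",  # hypothetical trace identifier
}
print(validate_evidence(record))
```

Returning every error at once, instead of stopping at the first, gives the validation failure examples this pack asks for.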

SWB-006. Harness, Rubric, and Eval Pack

Purpose: Define how the model is constrained, how output quality is judged, and how the team knows the eval is measuring the right thing.

When to use: Use for any repeatable AI work product, assistant, reviewer, advisory workflow, or agentic task that must survive regression.

Inputs required: Task objective, source authority, allowed inference, output schema, risk classes, fixture set, reviewer rubric, and expected failure behavior.

Output produced: Harness template, rubric, fixture matrix, negative controls, eval validity checklist, calibration protocol, and validation receipt.

Owner / reviewer: Eval owner, domain reviewer, architecture/governance lead.

Failure modes: Eval scores formatting instead of correctness; rubric rewards fluency; no negative controls; reviewer disagreement is hidden; failure cases are averaged away.

Related repo path: repo-bootstrap/evals/fixture-matrix.md

Related main sections: Harnesses and Agent Instructions; Rubrics and Eval Validity

Markdown artifact:

# Harness, Rubric, and Eval Pack

## Harness minimum fields
Objective, audience, inputs, source hierarchy, constraints, non-goals, output contract, allowed inference, stop conditions, validation method, and completion report.

## Eval validity checks
Construct, criterion, coverage, regression, risk, operational, reviewer reliability, and negative-control validity.

## Fixture starter set
Golden case, missing-evidence case, conflicting-source case, adversarial overreach case, ambiguous request case, regression case, and unsafe-action request.
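
The fixture starter set lends itself to two checks: whether every fixture type is covered, and which individual fixtures fail. The following is a minimal Python sketch. The fixture dictionary shape, the `FX-` identifiers, and the stub assistant are hypothetical; a real harness would call the system under test in place of the stub.

```python
# Fixture types from the starter set, as hypothetical short identifiers.
FIXTURE_TYPES = (
    "golden", "missing-evidence", "conflicting-source",
    "adversarial-overreach", "ambiguous-request", "regression", "unsafe-action",
)


def coverage_gaps(fixtures):
    """Return starter-set fixture types that have no fixture yet."""
    present = {fixture["type"] for fixture in fixtures}
    return [kind for kind in FIXTURE_TYPES if kind not in present]


def failed_fixtures(fixtures, assistant):
    """Run each fixture and return the IDs that failed.

    Failures are listed one by one instead of being folded into an
    average score, so a single unsafe miss stays visible.
    """
    return [
        fixture["id"]
        for fixture in fixtures
        if assistant(fixture["input"]) != fixture["expected"]
    ]


def overconfident_assistant(_request):
    """Stub that always claims support, to show a negative control catching it."""
    return "supported"


fixtures = [
    {"id": "FX-001", "type": "golden",
     "input": "Summarize the documented support window.", "expected": "supported"},
    {"id": "FX-002", "type": "missing-evidence",
     "input": "Confirm the approval status.", "expected": "not evidenced"},
]
print(coverage_gaps(fixtures))
print(failed_fixtures(fixtures, overconfident_assistant))
```

The stub passes the golden case and fails the missing-evidence case, which is the behavior a fluency-rewarding rubric would miss.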

SWB-007. Practitioner Operations Pack

Purpose: Give hands-on builders safe operating habits for agentic tools without implying that every external tool is approved for enterprise work.

When to use: Use in practitioner workshops, repo-based pilots, synthetic labs, onboarding, and technical architecture reviews.

Inputs required: Synthetic repo or approved workspace, test commands, fixture set, permission tiers, feedback loops, and clear tool boundary.

Output produced: Operating sequence, permission checklist, feedback-loop plan, repo/IDE/CLI transition decision, synthetic lab entry point, and safety notes.

Owner / reviewer: Technical architect, builder, platform engineer, security reviewer.

Failure modes: Builder starts with broad edits instead of Q&A; agent has excessive tools; no tests; no isolation; external tool receives company data; parallel sessions corrupt state.

Related repo path: labs/synthetic-capability-lab/README.md

Related main sections: Practitioner Operating Patterns; Tool Pattern Appendix

Markdown artifact:

# Practitioner Operations Pack

## Safe operating sequence
1. Q&A first.
2. Ask for a plan.
3. Review scope.
4. Allow controlled edits.
5. Run tests or schema checks.
6. Review diffs.
7. Capture evidence.
8. Commit only after human approval.

## Minimum feedback loops
Unit test, schema validation, static analysis, screenshot or output comparison where relevant, security scan, cost/loop limit, and human review.

## Synthetic lab entry point
Run the synthetic lab at `labs/synthetic-capability-lab/` when you need a safe repo-shaped example with schemas, source authority, fixtures, expected outputs, tests, and a validation receipt.

## External-tool boundary
Use public or synthetic data only unless the enterprise tool and data path are explicitly approved.

SWB-008. Harness Lifecycle and Maintenance Pack

Purpose: Keep deployed AI capabilities from rotting as models improve, tools change, workflows drift, sources age, and business needs move.

When to use: Use on a cadence, after model upgrades, after source changes, after tool failures, when override rates spike, or when business process ownership changes.

Inputs required: Run history, tool usage, source freshness, model version, eval regressions, override trends, cost telemetry, incidents, and adoption data.

Output produced: Maintenance review, pruning decision, model-upgrade impact review, context freshness review, rebuild/retire decision, and next review date.

Owner / reviewer: Capability owner, platform owner, eval owner, source owner, governance reviewer.

Failure modes: Tool sprawl grows silently; stale context becomes truth; better models turn old permissions into risk; agent keeps producing work no one uses.

Related repo path: repo-bootstrap/operations/harness-maintenance-review.md

Related main sections: Runtime and Maintenance Discipline

Markdown artifact:

# Harness Lifecycle and Maintenance Pack

## Five maintenance checks
What is it eating? What can it reach? What is its job? What proof must it return? Is it still valuable?

## Pruning rule
Every tool increases action surface, ambiguity surface, permission surface, audit surface, cost surface, and maintenance burden. Tools must earn their place through observed value and controlled failure behavior.

## Model upgrade trigger
A stronger model can make old constraints too restrictive or old permissions too dangerous. Revalidate after model, tool, source, or workflow changes.
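
The revalidation trigger can be approximated by comparing the last validated configuration with the current one. The following is a minimal Python sketch. The snapshot keys mirror the four change types named in the trigger, and the snapshot values (`model-a`, `kb-2025-01`, and so on) are hypothetical placeholders.

```python
# The four change types named in the model upgrade trigger.
TRACKED = ("model", "tools", "sources", "workflow")


def changed_dimensions(validated, current):
    """Return which tracked dimensions differ from the last validated snapshot."""
    return [key for key in TRACKED if validated.get(key) != current.get(key)]


def needs_revalidation(validated, current):
    """Any change to a tracked dimension means the evals should be rerun."""
    return bool(changed_dimensions(validated, current))


# Hypothetical snapshots: the model was upgraded and one tool was pruned.
validated = {
    "model": "model-a",
    "tools": ("search", "ticket_lookup"),
    "sources": "kb-2025-01",
    "workflow": "v3",
}
current = dict(validated, model="model-b", tools=("search",))

print(changed_dimensions(validated, current))
print(needs_revalidation(validated, current))
```

A check like this detects that something changed, not whether the change is safe. The maintenance review still decides whether constraints have become too restrictive or permissions too dangerous.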

SWB-009. Worked Example Pack

Purpose: Provide reference examples so teams can see how the discipline looks when applied, not merely admire the terminology from a safe distance.

When to use: Use in workshops, onboarding, proposal reviews, supplier framing, and eval calibration.

Inputs required: Use case statement, source map, data boundary, schema, harness, rubric, eval fixtures, governance route, telemetry plan, and sustainment model.

Output produced: Examples A, B, C, and D, packaged as reusable traces with artifacts and failure checks.

Owner / reviewer: Architecture lead, product owner, governance reviewer, practitioner lead.

Failure modes: Example is mistaken as a universal template; teams copy without adapting data boundaries; Lifecycle Lens is treated as a first-class product pillar instead of a worked example.

Related repo path: repo-bootstrap/docs/worked-examples.md

Related main sections: Worked Examples and Field Tests

Markdown artifact:

# Worked Example Pack

## Example A
Bad prompt to governed harness. Shows how a vague request becomes an objective, source authority, output contract, rubric, eval fixture, and validation receipt.

## Example B
Business-process capability trace. Traces the path from intent to telemetry and human decision.

## Example C
Lifecycle Lens MVP. Shows an advisory-only, Microsoft-native, source-priority, structured-state, no-mutation posture.

## Example D
Tool pruning and maintenance. Shows how to remove tools that add more risk than value.

SWB-010. Repo Bootstrap Pack

Purpose: Show how the field manual becomes a repository-shaped operating system for AI capability work.

When to use: Use when moving beyond chat into repeatable, reviewable, versioned artifacts.

Inputs required: Approved source inventory, templates, schemas, evals, rubrics, governance routing, operations controls, and validation receipts.

Output produced: Reference repo scaffold with README, AGENTS.md, docs, context, schemas, harnesses, rubrics, evals, tools, governance, operations, and receipts.

Owner / reviewer: Architecture lead, platform engineer, repo owner.

Failure modes: Everything remains trapped in chat; no versioning; no review; no artifact parity; no validation receipt; no one knows where the current source lives.

Related repo path: repo-bootstrap/README.md

Related main sections: Repo Bootstrap; Template Library and Source Workbench

Markdown artifact:

# Repo Bootstrap Pack

## Purpose
This scaffold is a reference structure for AI capability discipline artifacts. It is not a deployable product, final policy repository, or approved enterprise runtime.

## Top-level folders
docs, context, schemas, harnesses, rubrics, evals, tools, governance, operations, receipts.

## Rule
Every reusable artifact in the field manual should map to a file path or an explicit reason why it remains narrative-only.

SWB-011. Source Provenance Pack

Purpose: Give skeptical readers a clear map of where claims came from, what was verified, what was transcript-derived, and what requires owner validation.

When to use: Use before sharing broadly, during governance review, when challenged on source quality, and when updating public or internal tool claims.

Inputs required: Claim list, source class, source reference, verification status, confidence, limitation, used-in sections, and owner validation status.

Output produced: Claim validation register, source confidence labels, transcript handling rule, eval provenance rule, and owner-validation checklist.

Owner / reviewer: Package owner, governance lead, source owner, reviewer.

Failure modes: Internal-source supplied context is mistaken for approved policy; transcript-derived operating lessons are treated as exact quotes; LLM evals are mistaken for factual validation.

Related repo path: repo-bootstrap/provenance/claim-validation-register.md

Related main sections: Source Provenance and Claim Confidence

Markdown artifact:

# Source Provenance Pack

## Provenance rule
Say what was confirmed, against what source, at what confidence level, and what still requires owner validation.

## Required labels
Verified public source, corroborated, transcript-derived, internal-source supplied, owner validation required, derived recommendation, illustrative example, do not treat as policy.

## Eval rule
Model evals validate artifact quality and gap coverage. They do not certify factual truth or internal policy.

AI Capability Discipline

1. Executive briefCopy MarkdownDownload .md

Five things leaders should stop approving

Leader approval stop rule

2. Executive mental model resetCopy MarkdownDownload .md

3. Capability equationCopy MarkdownDownload .md

4. Demo-to-capability gapCopy MarkdownDownload .md

5. Distribution status and policy boundaryCopy MarkdownDownload .md

6. Tool and data boundary matrix, owner-validation requiredCopy MarkdownDownload .md

Bright-line rule

7. Enterprise routing modelCopy MarkdownDownload .md

8. Capability readiness modelCopy MarkdownDownload .md

8.1 Promotion gate matrix

9. Capability formation lifecycleCopy MarkdownDownload .md

10. Intent and outcome managementCopy MarkdownDownload .md

11. Product requirements for AI capabilitiesCopy MarkdownDownload .md

12. Solution ideation matrixCopy MarkdownDownload .md

13. Schema-first capability designCopy MarkdownDownload .md

13.1 Minimal output schema example

13.2 Schema failure example

14. Source authority modelCopy MarkdownDownload .md

15. Eval validity and calibrationCopy MarkdownDownload .md

15.1 Eval calibration protocol

15.2 Starter fixture matrix

16. Intent-to-eval traceabilityCopy MarkdownDownload .md

17. Human review and override modelCopy MarkdownDownload .md

17.1 Review state model

17.2 Override payload schema

18. Technical annex: repo-backed package layoutCopy MarkdownDownload .md

19. Technical annex: programmable eval assertionCopy MarkdownDownload .md

20. Technical annex: MCP and tool execution contractCopy MarkdownDownload .md

21. Technical annex: CI/CD and release gatesCopy MarkdownDownload .md

22. Technical annex: observability contractCopy MarkdownDownload .md

Required metrics

23. Technical annex: FinOps and execution limitsCopy MarkdownDownload .md

24. Technical annex: parallel execution safetyCopy MarkdownDownload .md

25. Harness lifecycle managementCopy MarkdownDownload .md

25.1 Harness lifecycle thesis

25.2 Maintenance cadence

26. Agents drift in two directionsCopy MarkdownDownload .md

26.1 Agents can break when models improve

27. Tool pruning and harness simplificationCopy MarkdownDownload .md

27.1 Tool pruning decision rule

27.2 Harness simplicity review

28. Context as control planeCopy MarkdownDownload .md

28.1 Context hierarchy

28.2 Context failure modes

29. Advisory repository versus runtime control planeCopy MarkdownDownload .md

Advisory repository

Runtime control plane

29.1 Separation rule

29.2 Expansion threshold

30. Model upgrade impact reviewCopy MarkdownDownload .md

31. Harness maintenance reviewCopy MarkdownDownload .md

31.1 Maintenance actions

32. Agent retirement and rebuild criteriaCopy MarkdownDownload .md

33. Worked example C: Lifecycle Lens MVP capability traceCopy MarkdownDownload .md

33.1 Intent

33.2 Business outcome

33.3 Platform and architecture path

33.4 Authority boundary

33.5 No-mutation boundary

33.6 MVP eval fixtures

33.7 Pilot acceptance model

33.8 Lifecycle Lens field-validation questions

34. Worked example A: from bad prompt to governed harness

34.1 Bad prompt

34.2 Better harness

34.3 Rubric excerpt

34.4 Validation receipt excerpt

35. Worked example B: governed business-process AI capability

35.1 Use case

35.2 Capability trace

35.3 Pilot entry criteria

35.4 Pilot stop conditions

36. Practitioner lab and tool patterns

Mandatory warning

Lab sequence

37. Tool pattern appendix

38. Field validation exercise

1. Executive brief

2. Executive mental model reset

3. Capability equation

4. Demo-to-capability gap

5. Distribution status and policy boundary

6. Tool and data boundary matrix, owner-validation required

7. Enterprise routing model

8. Capability readiness model

9. Capability formation lifecycle

10. Intent and outcome management

11. Product requirements for AI capabilities

12. Solution ideation matrix

13. Schema-first capability design

14. Source authority model

15. Eval validity and calibration

16. Intent-to-eval traceability

17. Human review and override model

18. Technical annex: repo-backed package layout

19. Technical annex: programmable eval assertion

20. Technical annex: MCP and tool execution contract

21. Technical annex: CI/CD and release gates

22. Technical annex: observability contract

23. Technical annex: FinOps and execution limits

24. Technical annex: parallel execution safety

25. Harness lifecycle management

26. Agents drift in two directions

27. Tool pruning and harness simplification

28. Context as control plane

29. Advisory repository versus runtime control plane

30. Model upgrade impact review

31. Harness maintenance review

32. Agent retirement and rebuild criteria

33. Worked example C: Lifecycle Lens MVP capability trace

34. Worked example A: from bad prompt to governed harness

35. Worked example B: governed business-process AI capability

36. Practitioner lab and tool patterns

37. Tool pattern appendix

38. Field validation exercise

39. v0.9 recommended use

40. pre-v1.0 field validation backlog

Source Provenance and Claim Confidence

Template Library and Source Workbench