Markdown

Units: Testing Strategy

Unit Decomposition

This intent is decomposed into five implementation units covering six testing layers.

**See**: `memory-bank/research/test_strategy/testing-strategy.md` for detailed pyramid, workflows, and CI/CD integration.

---

6-Layer Testing Pyramid

graph TB
    subgraph pyramid [" "]
        direction TB
        L6["πŸ” Release Validation<br/><i>Human Review + Grok 4.1 Free</i>"]
        L5["πŸ€– Agent Behavior<br/><i>Promptfoo + OpenRouter Free</i>"]
        L4["πŸ”„ E2E Workflow<br/><i>Custom Test Harness</i>"]
        L3["βš™οΈ Integration<br/><i>BATS + Node.js</i>"]
        L2["πŸ“‹ Schema/Contract<br/><i>JSON Schema + markdownlint</i>"]
        L1["πŸ§ͺ Unit Tests<br/><i>Vitest</i>"]
    end
    L6 --> L5 --> L4 --> L3 --> L2 --> L1

| Layer | Speed | Cost | Determinism | Technology | |-------|-------|------|-------------|------------| | Unit Tests | ~ms | Free | 100% | Vitest | | Schema/Contract | ~ms | Free | 100% | JSON Schema, markdownlint | | Integration | ~sec | Free | 100% | BATS | | E2E Workflow | ~min | Free | ~95% | Custom harness | | Agent Behavior | ~min | Free | ~70-85% | Promptfoo + OpenRouter | | Release Validation | ~hr | Free | ~70-85% + human | Grok 4.1 Fast |

**Total Cost**: $0 using OpenRouter free tier models.

---

Unit 1: Specification Contract Testing

Description

Schema validation and contract enforcement for all markdown-based specifications.

Responsibilities

  • Validate `memory-bank.yaml` against JSON Schema
  • Enforce required sections in intent artifacts
  • Validate unit brief template compliance
  • Enforce bolt template structure
  • Run markdown linting for structure consistency

Components

| Component | File | Description | |-----------|------|-------------| | Schema Validator | `__tests__/unit/schema-validation/` | JSON Schema validation | | Intent Validator | `__tests__/unit/schema-validation/intent-schema.test.ts` | Intent structure checks | | Unit Validator | `__tests__/unit/schema-validation/unit-schema.test.ts` | Unit brief validation | | Markdown Linter | `.markdownlint.yaml` | Markdown structure rules |

Dependencies

  • JSON Schema library (Ajv)
  • markdownlint
  • remark-lint

---

Unit 2: CLI & Installer Testing

Description

Traditional testing for deterministic CLI commands and installer operations.

Responsibilities

  • Unit test installer components
  • Integration test slash commands
  • Snapshot test generated files
  • Property-based test Memory Bank operations

Components

| Component | File | Description | |-----------|------|-------------| | Installer Tests | `__tests__/unit/installers/` | Unit tests for installers | | CLI Integration | `__tests__/integration/cli/` | BATS command tests | | Snapshot Tests | `__tests__/unit/template-generation/` | Generated file snapshots | | Memory Bank Tests | `__tests__/integration/memory-bank/` | CRUD operation tests |

Dependencies

  • Jest/Vitest
  • BATS (Bash Automated Testing System)
  • Mock filesystem library

---

Unit 3: Agent Behavior Evaluation

Description

Non-deterministic testing for LLM-driven agent outputs using multiple evaluation strategies.

**See**: `memory-bank/research/test_strategy/promptfoo-tutorial.md` for Promptfoo configuration and assertions.

Responsibilities

  • Mock LLM responses for deterministic CI runs
  • Evaluate outputs via LLM-as-judge
  • Compute semantic similarity against golden examples
  • Assert behavioral properties (not exact matches)
  • Handle test flakiness from non-determinism

Components

| Component | File | Description | |-----------|------|-------------| | **Promptfoo Config** | `promptfoo.yaml` | Declarative test configuration | | Mock Responses | `__tests__/fixtures/mock-responses/` | Recorded LLM outputs | | LLM Judge Rubrics | `__tests__/evaluation/rubrics/` | Evaluation criteria | | Semantic Similarity | `__tests__/evaluation/` | Embedding comparison | | Behavioral Assertions | `__tests__/e2e/` | Property-based checks |

Dependencies

  • **Promptfoo** (primary evaluation framework)
  • **OpenRouter** (LLM-as-judge provider, free tier)
  • `x-ai/grok-4.1-fast:free` - Coding agents (2M context)
  • `qwen/qwen3-coder:free` - Code-specialized
  • vcrpy/responses for mocking

Evaluation Rubrics

| Rubric | File | Criteria | |--------|------|----------| | Intent Quality | `rubrics/intent-quality.yaml` | Completeness, clarity, testability | | Unit Completeness | `rubrics/unit-completeness.yaml` | Required sections, dependencies | | Bolt Validity | `rubrics/bolt-validity.yaml` | Stage definitions, acceptance criteria |

---

Unit 4: Golden Dataset Management

Description

Curated test cases with known-good outputs for regression testing.

**See**: `memory-bank/research/test_strategy/promptfoo-specsmd-tutorial.md` for detailed golden dataset and fixture guide.

Responsibilities

  • Maintain input/output examples per agent
  • Track baseline quality scores
  • Detect regressions from baseline
  • Version golden datasets with code
  • Manage pre-configured memory bank fixtures

Components

| Component | File | Description | |-----------|------|-------------| | Inception Dataset | `__tests__/evaluation/golden-datasets/inception/` | Inception agent examples | | Construction Dataset | `__tests__/evaluation/golden-datasets/construction/` | Construction agent examples | | Operations Dataset | `__tests__/evaluation/golden-datasets/operations/` | Operations agent examples | | Master Dataset | `__tests__/evaluation/golden-datasets/master/` | Routing examples | | Baselines | `__tests__/evaluation/regression/baselines/` | Historical scores | | **Test Fixtures** | `__tests__/fixtures/memory-bank-states/` | Pre-configured memory bank states |

Dataset Structure

golden-datasets/
β”œβ”€β”€ inception/
β”‚   β”œβ”€β”€ inputs/
β”‚   β”‚   β”œβ”€β”€ 001-simple-feature.txt
β”‚   β”‚   β”œβ”€β”€ 002-complex-system.txt
β”‚   β”‚   └── ...
β”‚   └── outputs/
β”‚       β”œβ”€β”€ 001-simple-feature-intent.md
β”‚       β”œβ”€β”€ 002-complex-system-intent.md
β”‚       └── ...
β”œβ”€β”€ construction/
β”‚   β”œβ”€β”€ inputs/
β”‚   └── outputs/
└── ...

Test Fixtures (Memory Bank States)

Pre-configured memory bank states for reproducible testing:

| Fixture | Description | Tests | |---------|-------------|-------| | `01-empty-project/` | Fresh project, no specsmd init | Project initialization | | `02-initialized-project/` | After `specsmd install` | Agent routing, context loading | | `03-inception-started/` | Intent created, no units | Unit decomposition, requirements | | `04-inception-complete/` | Full intent with units and stories | Transition to construction | | `05-construction-in-progress/` | Bolts planned, stage 2 active | Bolt execution, stage advancement | | `06-construction-complete/` | All bolts completed | Transition to operations |

**Structure**:

__tests__/fixtures/memory-bank-states/
β”œβ”€β”€ 01-empty-project/
β”‚   └── (empty)
β”œβ”€β”€ 02-initialized-project/
β”‚   └── .specsmd/
β”‚       β”œβ”€β”€ memory-bank.yaml
β”‚       └── context-config.yaml
β”œβ”€β”€ 03-inception-started/
β”‚   └── .specsmd/
β”‚       β”œβ”€β”€ memory-bank.yaml
β”‚       └── memory-bank/
β”‚           └── intents/
β”‚               └── user-auth/
β”‚                   └── requirements.md
...

Minimum Coverage

| Agent | Minimum Examples | Target Coverage | |-------|-----------------|-----------------| | Master Agent | 20 | 100+ | | Inception Agent | 10 | 50+ | | Construction Agent | 10 | 50+ | | Operations Agent | 5 | 25+ |

---

Unit 5: CI/CD Integration

Description

Pipeline automation for test execution and quality gate enforcement.

Responsibilities

  • Trigger tests on PR and merge events
  • Enforce quality thresholds
  • Report results to PRs
  • Block releases on failures
  • Support human approval gates

Components

| Component | File | Description | |-----------|------|-------------| | PR Workflow | `.github/workflows/pr-tests.yml` | Fast deterministic tests | | Main Workflow | `.github/workflows/main-tests.yml` | Agent evaluation on merge | | Nightly Workflow | `.github/workflows/nightly-eval.yml` | Full golden dataset | | Quality Gates | `.github/workflows/quality-gates.yml` | Threshold enforcement |

Pipeline Tiers

| Tier | Trigger | Tests | Duration | |------|---------|-------|----------| | Tier 1 | Every PR | Schema, CLI, Snapshots | ~2 min | | Tier 2 | Merge to main | + Quick agent eval | ~5 min | | Tier 3 | Nightly | Full golden dataset | ~30 min | | Tier 4 | Release | Full suite + human review | ~1 hour |

Quality Thresholds

| Metric | Target | Alert | |--------|--------|-------| | Schema Validation | 100% | < 100% | | CLI Test Pass Rate | 100% | < 100% | | Agent Quality Score | > 0.85 | < 0.80 | | Semantic Similarity | > 0.90 | < 0.85 | | Regression from Baseline | 0% | > 5% |

---

Unit Dependency Graph

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    CI/CD Integration                        β”‚
β”‚                       (Unit 5)                              β”‚
β”‚                    Orchestrates all                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                            β”‚ triggers
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                   β”‚                   β”‚
        β–Ό                   β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Specification β”‚  β”‚ CLI/Installer β”‚  β”‚ Agent Behavior        β”‚
β”‚ Contract      β”‚  β”‚ Testing       β”‚  β”‚ Evaluation            β”‚
β”‚ Testing       β”‚  β”‚ (Unit 2)      β”‚  β”‚ (Unit 3)              β”‚
β”‚ (Unit 1)      β”‚  β”‚               β”‚  β”‚                       β”‚
β”‚               β”‚  β”‚ - Unit tests  β”‚  β”‚ - Mock responses      β”‚
β”‚ - Schema      β”‚  β”‚ - Integration β”‚  β”‚ - LLM-as-judge        β”‚
β”‚ - Linting     β”‚  β”‚ - Snapshots   β”‚  β”‚ - Semantic similarity β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                  β”‚ uses
                                                  β–Ό
                                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                      β”‚ Golden Dataset        β”‚
                                      β”‚ Management            β”‚
                                      β”‚ (Unit 4)              β”‚
                                      β”‚                       β”‚
                                      β”‚ - Test cases          β”‚
                                      β”‚ - Baselines           β”‚
                                      β”‚ - Regression tracking β”‚
                                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

---

Implementation Order & Status

| # | Unit | Status | Notes | |---|------|--------|-------| | 1 | Specification Contract Testing | βœ… Complete | 7 tests, YAML + MD validators | | 2 | CLI/Installer Testing | ⏳ Pending | Next priority | | 3 | Golden Dataset Management | ⏳ Pending | Required for Unit 4 | | 4 | Agent Behavior Evaluation | ⏳ Pending | Depends on golden datasets | | 5 | CI/CD Integration | 🚧 In Progress | GitHub Action created |

Completed Work (Unit 1)

  • βœ… `src/lib/yaml-validator.ts` - YAML config validation
  • βœ… `src/lib/markdown-validator.ts` - Markdown template validation
  • βœ… `src/__tests__/schemas/` - 5 YAML schema definitions
  • βœ… `src/__tests__/unit/schema-validation/` - 7 passing tests
  • βœ… `.markdownlint.yaml` - Markdown linting rules
  • βœ… `.github/workflows/test.yml` - CI workflow (pending commit)