Units: Testing Strategy
This intent is decomposed into five implementation units covering six testing layers.
**See**: `memory-bank/research/test_strategy/testing-strategy.md` for detailed pyramid, workflows, and CI/CD integration.
---

```mermaid
graph TB
    subgraph pyramid [" "]
        direction TB
        L6["Release Validation<br/><i>Human Review + Grok 4.1 Fast</i>"]
        L5["Agent Behavior<br/><i>Promptfoo + OpenRouter Free</i>"]
        L4["E2E Workflow<br/><i>Custom Test Harness</i>"]
        L3["Integration<br/><i>BATS + Node.js</i>"]
        L2["Schema/Contract<br/><i>JSON Schema + markdownlint</i>"]
        L1["Unit Tests<br/><i>Vitest</i>"]
    end
    L6 --> L5 --> L4 --> L3 --> L2 --> L1
```

| Layer | Speed | Cost | Determinism | Technology |
|-------|-------|------|-------------|------------|
| Unit Tests | ~ms | Free | 100% | Vitest |
| Schema/Contract | ~ms | Free | 100% | JSON Schema, markdownlint |
| Integration | ~sec | Free | 100% | BATS |
| E2E Workflow | ~min | Free | ~95% | Custom harness |
| Agent Behavior | ~min | Free | ~70-85% | Promptfoo + OpenRouter |
| Release Validation | ~hr | Free | ~70-85% + human | Grok 4.1 Fast |

**Total Cost**: $0 using OpenRouter free tier models.
---
Schema validation and contract enforcement for all markdown-based specifications.

| Component | File | Description |
|-----------|------|-------------|
| Schema Validator | `__tests__/unit/schema-validation/` | JSON Schema validation |
| Intent Validator | `__tests__/unit/schema-validation/intent-schema.test.ts` | Intent structure checks |
| Unit Validator | `__tests__/unit/schema-validation/unit-schema.test.ts` | Unit brief validation |
| Markdown Linter | `.markdownlint.yaml` | Markdown structure rules |

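
To make the "intent structure checks" concrete, here is a minimal sketch of a required-section check over a markdown intent file. The section names in `REQUIRED_SECTIONS` and the function names are hypothetical placeholders, not the project's actual schema, and the heading scan is deliberately naive (it does not skip headings inside fenced code blocks).

```typescript
// Hypothetical required headings; the real intent schema may differ.
const REQUIRED_SECTIONS: string[] = ["Overview", "Requirements", "Units"];

interface StructureReport {
  valid: boolean;
  missing: string[];
}

// Collect the text of every ATX heading (lines starting with 1-6 '#').
function extractHeadings(markdown: string): string[] {
  const headings: string[] = [];
  for (const line of markdown.split(/\r?\n/)) {
    const text = /^#{1,6}\s+(.+?)\s*$/.exec(line)?.[1];
    if (text) headings.push(text);
  }
  return headings;
}

// Report which required sections are absent (case-insensitive match).
function checkIntentStructure(
  markdown: string,
  required: string[] = REQUIRED_SECTIONS,
): StructureReport {
  const present = new Set(extractHeadings(markdown).map((h) => h.toLowerCase()));
  const missing = required.filter((name) => !present.has(name.toLowerCase()));
  return { valid: missing.length === 0, missing };
}
```

In a Vitest suite, a check like this would typically sit inside an `it(...)` block and assert that `missing` is empty for every intent file under test.
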
---
Traditional testing for deterministic CLI commands and installer operations.

| Component | File | Description |
|-----------|------|-------------|
| Installer Tests | `__tests__/unit/installers/` | Unit tests for installers |
| CLI Integration | `__tests__/integration/cli/` | BATS command tests |
| Snapshot Tests | `__tests__/unit/template-generation/` | Generated file snapshots |
| Memory Bank Tests | `__tests__/integration/memory-bank/` | CRUD operation tests |

---
Non-deterministic testing for LLM-driven agent outputs using multiple evaluation strategies.
**See**: `memory-bank/research/test_strategy/promptfoo-tutorial.md` for Promptfoo configuration and assertions.

| Component | File | Description |
|-----------|------|-------------|
| **Promptfoo Config** | `promptfoo.yaml` | Declarative test configuration |
| Mock Responses | `__tests__/fixtures/mock-responses/` | Recorded LLM outputs |
| LLM Judge Rubrics | `__tests__/evaluation/rubrics/` | Evaluation criteria |
| Semantic Similarity | `__tests__/evaluation/` | Embedding comparison |
| Behavioral Assertions | `__tests__/e2e/` | Property-based checks |


| Rubric | File | Criteria |
|--------|------|----------|
| Intent Quality | `rubrics/intent-quality.yaml` | Completeness, clarity, testability |
| Unit Completeness | `rubrics/unit-completeness.yaml` | Required sections, dependencies |
| Bolt Validity | `rubrics/bolt-validity.yaml` | Stage definitions, acceptance criteria |

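
The "embedding comparison" behind the Semantic Similarity component usually reduces to cosine similarity between two vectors. The sketch below assumes the embeddings have already been obtained elsewhere (how they are produced is outside this document); `cosineSimilarity` is an illustrative name, not an existing helper in the repository.

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns a value in [-1, 1]; 0 is returned when either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] as number;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A test would compare the score between an agent output's embedding and the known-good output's embedding against a chosen threshold.
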
---
Curated test cases with known-good outputs for regression testing.
**See**: `memory-bank/research/test_strategy/promptfoo-specsmd-tutorial.md` for detailed golden dataset and fixture guide.

| Component | File | Description |
|-----------|------|-------------|
| Inception Dataset | `__tests__/evaluation/golden-datasets/inception/` | Inception agent examples |
| Construction Dataset | `__tests__/evaluation/golden-datasets/construction/` | Construction agent examples |
| Operations Dataset | `__tests__/evaluation/golden-datasets/operations/` | Operations agent examples |
| Master Dataset | `__tests__/evaluation/golden-datasets/master/` | Routing examples |
| Baselines | `__tests__/evaluation/regression/baselines/` | Historical scores |
| **Test Fixtures** | `__tests__/fixtures/memory-bank-states/` | Pre-configured memory bank states |


```
golden-datasets/
├── inception/
│   ├── inputs/
│   │   ├── 001-simple-feature.txt
│   │   ├── 002-complex-system.txt
│   │   └── ...
│   └── outputs/
│       ├── 001-simple-feature-intent.md
│       ├── 002-complex-system-intent.md
│       └── ...
├── construction/
│   ├── inputs/
│   └── outputs/
└── ...
```

Pre-configured memory bank states for reproducible testing:

| Fixture | Description | Tests |
|---------|-------------|-------|
| `01-empty-project/` | Fresh project, no specsmd init | Project initialization |
| `02-initialized-project/` | After `specsmd install` | Agent routing, context loading |
| `03-inception-started/` | Intent created, no units | Unit decomposition, requirements |
| `04-inception-complete/` | Full intent with units and stories | Transition to construction |
| `05-construction-in-progress/` | Bolts planned, stage 2 active | Bolt execution, stage advancement |
| `06-construction-complete/` | All bolts completed | Transition to operations |

**Structure**:

```
__tests__/fixtures/memory-bank-states/
├── 01-empty-project/
│   └── (empty)
├── 02-initialized-project/
│   └── .specsmd/
│       ├── memory-bank.yaml
│       └── context-config.yaml
├── 03-inception-started/
│   └── .specsmd/
│       ├── memory-bank.yaml
│       └── memory-bank/
│           └── intents/
│               └── user-auth/
│                   └── requirements.md
...
```

| Agent | Minimum Examples | Target Coverage |
|-------|-----------------|-----------------|
| Master Agent | 20 | 100+ |
| Inception Agent | 10 | 50+ |
| Construction Agent | 10 | 50+ |
| Operations Agent | 5 | 25+ |

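
The minimum example counts per agent lend themselves to an automated coverage check. The following is a sketch only: the minimums are copied from the table, while the type and function names are hypothetical, and counting the dataset files on disk is left to the caller.

```typescript
type AgentName = "master" | "inception" | "construction" | "operations";

// Minimum golden-dataset sizes, copied from the coverage table.
const MINIMUM_EXAMPLES: Record<AgentName, number> = {
  master: 20,
  inception: 10,
  construction: 10,
  operations: 5,
};

// Return the agents whose example count is below the minimum.
// Agents absent from `counts` are treated as having zero examples.
function findUnderCoveredAgents(
  counts: Partial<Record<AgentName, number>>,
): AgentName[] {
  return (Object.keys(MINIMUM_EXAMPLES) as AgentName[]).filter(
    (agent) => (counts[agent] ?? 0) < MINIMUM_EXAMPLES[agent],
  );
}
```

For example, `findUnderCoveredAgents({ master: 25, inception: 3, construction: 10 })` flags `inception` (3 of 10) and `operations` (none supplied).
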
---
Pipeline automation for test execution and quality gate enforcement.

| Component | File | Description |
|-----------|------|-------------|
| PR Workflow | `.github/workflows/pr-tests.yml` | Fast deterministic tests |
| Main Workflow | `.github/workflows/main-tests.yml` | Agent evaluation on merge |
| Nightly Workflow | `.github/workflows/nightly-eval.yml` | Full golden dataset |
| Quality Gates | `.github/workflows/quality-gates.yml` | Threshold enforcement |


| Tier | Trigger | Tests | Duration |
|------|---------|-------|----------|
| Tier 1 | Every PR | Schema, CLI, Snapshots | ~2 min |
| Tier 2 | Merge to main | + Quick agent eval | ~5 min |
| Tier 3 | Nightly | Full golden dataset | ~30 min |
| Tier 4 | Release | Full suite + human review | ~1 hour |

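
One way a shared test runner could pick its tier is from the GitHub Actions event that started the job. The sketch below assumes the triggers in the table map onto the `pull_request`, `push`, `schedule`, and `release` events and that the default branch is named `main`; the actual workflows may wire this differently, and `selectTier` is a hypothetical helper.

```typescript
type Tier = 1 | 2 | 3 | 4;

// Map a GitHub Actions event name (and git ref, for pushes) to a test tier.
// Returns undefined for events that should not run any tier.
function selectTier(eventName: string, ref = ""): Tier | undefined {
  switch (eventName) {
    case "pull_request":
      return 1; // every PR: schema, CLI, snapshots
    case "push":
      // Assumption: only pushes (merges) to main trigger Tier 2.
      return ref === "refs/heads/main" ? 2 : undefined;
    case "schedule":
      return 3; // nightly: full golden dataset
    case "release":
      return 4; // release: full suite plus human review
    default:
      return undefined;
  }
}
```

In a workflow step, the inputs would come from the `GITHUB_EVENT_NAME` and `GITHUB_REF` environment variables that GitHub Actions sets for each run.
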

| Metric | Target | Alert |
|--------|--------|-------|
| Schema Validation | 100% | < 100% |
| CLI Test Pass Rate | 100% | < 100% |
| Agent Quality Score | > 0.85 | < 0.80 |
| Semantic Similarity | > 0.90 | < 0.85 |
| Regression from Baseline | 0% | > 5% |

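
The target and alert thresholds can be enforced with a small classifier. This is a sketch under two assumptions not stated in the table: values that fall between the target and the alert threshold are reported as `warn`, and percentages are expressed as fractions (100% = 1.0, 5% = 0.05). All names are illustrative.

```typescript
type GateStatus = "pass" | "warn" | "alert";
type Metric =
  | "schemaValidation"
  | "cliPassRate"
  | "agentQuality"
  | "semanticSimilarity"
  | "regression";

interface Gate {
  target: number;
  alert: number;
  higherIsBetter: boolean;
  strictTarget: boolean; // true when the table says "> target"
}

// Thresholds copied from the quality-gate table, as fractions.
const GATES: Record<Metric, Gate> = {
  schemaValidation: { target: 1.0, alert: 1.0, higherIsBetter: true, strictTarget: false },
  cliPassRate: { target: 1.0, alert: 1.0, higherIsBetter: true, strictTarget: false },
  agentQuality: { target: 0.85, alert: 0.8, higherIsBetter: true, strictTarget: true },
  semanticSimilarity: { target: 0.9, alert: 0.85, higherIsBetter: true, strictTarget: true },
  regression: { target: 0.0, alert: 0.05, higherIsBetter: false, strictTarget: false },
};

function classify(value: number, gate: Gate): GateStatus {
  // Flip signs for lower-is-better metrics so one comparison path covers both.
  const sign = gate.higherIsBetter ? 1 : -1;
  const v = sign * value;
  const target = sign * gate.target;
  const alert = sign * gate.alert;
  if (v < alert) return "alert";
  const meetsTarget = gate.strictTarget ? v > target : v >= target;
  return meetsTarget ? "pass" : "warn";
}
```

Under these assumptions, an agent quality score of 0.82 yields `warn` (below the 0.85 target but not below the 0.80 alert line), while a 6% regression yields `alert`.
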
---

```
┌───────────────────────────────────────────────────────────────┐
│                       CI/CD Integration                       │
│                           (Unit 5)                            │
│                       Orchestrates all                        │
└───────────────────────────┬───────────────────────────────────┘
                            │ triggers
        ┌───────────────────┼───────────────────────┐
        │                   │                       │
        ▼                   ▼                       ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────────────┐
│ Specification │   │ CLI/Installer │   │ Agent Behavior        │
│ Contract      │   │ Testing       │   │ Evaluation            │
│ Testing       │   │ (Unit 2)      │   │ (Unit 3)              │
│ (Unit 1)      │   │               │   │                       │
│               │   │ - Unit tests  │   │ - Mock responses      │
│ - Schema      │   │ - Integration │   │ - LLM-as-judge        │
│ - Linting     │   │ - Snapshots   │   │ - Semantic similarity │
└───────────────┘   └───────────────┘   └───────────┬───────────┘
                                                    │ uses
                                                    ▼
                                        ┌───────────────────────┐
                                        │ Golden Dataset        │
                                        │ Management            │
                                        │ (Unit 4)              │
                                        │                       │
                                        │ - Test cases          │
                                        │ - Baselines           │
                                        │ - Regression tracking │
                                        └───────────────────────┘
```

---

| # | Unit | Status | Notes |
|---|------|--------|-------|
| 1 | Specification Contract Testing | ✅ Complete | 7 tests, YAML + MD validators |
| 2 | CLI/Installer Testing | ⏳ Pending | Next priority |
| 3 | Golden Dataset Management | ⏳ Pending | Required for Unit 4 |
| 4 | Agent Behavior Evaluation | ⏳ Pending | Depends on golden datasets |
| 5 | CI/CD Integration | 🚧 In Progress | GitHub Action created |