Step 5: Test

Part of: Business-First AI Framework

You’ve just finished Build (Step 4). You should have:

  • Platform artifacts — prompts, skills, agents, and configs generated for your platform
  • Context artifacts — style guides, reference materials, and examples
  • AI Building Block Spec ([name]-building-block-spec.md) — which includes the evaluation criteria and test scenarios defined during Design

Your first run is a test, not a deployment. The goal is to verify that the workflow produces good output before you share it with your team or use it on real work.

Start with a single test — pick one realistic scenario and run the workflow end to end. This is your smoke test. You are checking the basics:

  • Does it run at all? — Can you execute every step without errors?
  • Does it produce output? — Is there a result, or does it stall?
  • Is the output in the right format? — Does it look like what you expected?
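The three smoke-test checks can be sketched as a tiny harness. Everything here is a hypothetical stand-in — `run_workflow` represents however your platform invokes the workflow, and the Markdown-report check is an example format rule you would replace with your own:

```python
# Minimal smoke-test sketch. `run_workflow` and the format check are
# hypothetical placeholders; substitute your own invocation and format rule.
def smoke_test(run_workflow, scenario):
    try:
        output = run_workflow(scenario)        # does it run at all?
    except Exception as exc:
        return f"FAIL: errored with {exc!r}"
    if not output or not output.strip():       # does it produce output?
        return "FAIL: empty or missing output"
    if not output.lstrip().startswith("#"):    # right format? (assumes a Markdown report)
        return "FAIL: not the expected Markdown report"
    return "PASS"
```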

If any of these fail, go back to Build and fix the obvious issue before continuing. Common first-run problems:

  • Missing context files the model references but cannot find
  • MCP connections that are not configured or are not responding
  • Skills that are installed but not correctly linked to the workflow

Once the smoke test passes, move to a full evaluation using the criteria defined during Design. Your AI Building Block Spec includes evaluation dimensions (the qualities you care about — accuracy, tone, completeness, specificity) and test scenarios (realistic inputs that exercise different parts of the workflow).

For each test scenario:

  1. Run the workflow with the test input

  2. Score the output on each evaluation dimension using a 1-5 scale:

    | Score | Meaning |
    |-------|---------|
    | 5 | Excellent — ready to use as-is |
    | 4 | Good — minor edits only |
    | 3 | Acceptable — needs some rework but the structure is right |
    | 2 | Weak — significant gaps, wrong direction on one or more dimensions |
    | 1 | Failure — output is unusable or fundamentally off-target |
  3. Note specific issues — What exactly was wrong? Which dimension scored low and why?

Record your scores. These become your baseline for measuring improvement (during this test cycle) and regression (during Improve).
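One lightweight way to record scores is a nested mapping of scenario to dimension to score. The scenario names, dimensions, and numbers below are hypothetical examples, not values prescribed by the framework:

```python
# Hypothetical eval-run log: scenario -> dimension -> 1-5 score.
scores = {
    "quarterly summary, clean input": {"accuracy": 4, "tone": 3, "completeness": 4},
    "edge case, partial data":        {"accuracy": 2, "tone": 4, "completeness": 3},
}

# Surface the low scores (2 or below) so you know where to diagnose first.
low = [
    (scenario, dim, score)
    for scenario, dims in scores.items()
    for dim, score in dims.items()
    if score <= 2
]
# low == [('edge case, partial data', 'accuracy', 2)]
```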

If the overall workflow scores poorly, test individual building blocks separately to isolate the problem:

  • Test a skill by running it with sample inputs outside the full workflow. Does it produce the expected output on its own?
  • Test context by asking the model a question that should be answerable from your reference materials. Does it find and use the right information?
  • Test an agent by giving it a single task from the workflow. Does it use its tools correctly? Does it make reasonable decisions?

Isolating building blocks helps you find the weak link without guessing.
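As a sketch, an isolated check for one building block can fix a sample input and assert that required content appears in the output. The `skill` parameter is any callable wrapper around your skill invocation; all names and terms here are illustrative:

```python
# Run one building block alone with a fixed sample input and check that
# required terms appear in its output. All names here are illustrative.
def check_in_isolation(skill, sample_input, must_contain):
    output = skill(sample_input)
    missing = [term for term in must_contain if term not in output]
    return "PASS" if not missing else f"FAIL: output missing {missing}"
```

The same shape works for context checks (ask a question, assert the answer contains facts from your reference materials) and agent checks (give one task, assert the expected tool result shows up).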

After running the full eval suite, calculate an average score across all scenarios and dimensions. This is your baseline — the starting quality level of your workflow.

Record the baseline alongside your individual scores. You will use this number in two ways:

  1. During this test cycle — to measure whether your fixes are improving things
  2. During Improve (Step 7) — to detect quality regression over time
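The baseline arithmetic is a plain average, and the Improve-time regression check is a comparison against it. The scores and the 0.25 tolerance below are hypothetical:

```python
from statistics import mean

def baseline(scores):
    """Average across every scenario and every dimension."""
    return mean(s for dims in scores.values() for s in dims.values())

def regressed(new_scores, recorded_baseline, tolerance=0.25):
    """True if a later run has drifted below the recorded baseline."""
    return baseline(new_scores) < recorded_baseline - tolerance

# Hypothetical run: (4 + 3 + 2 + 4) / 4 = 3.25
run1 = {"scenario A": {"accuracy": 4, "tone": 3},
        "scenario B": {"accuracy": 2, "tone": 4}}
# baseline(run1) == 3.25
```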

When something is off, the fix depends on what went wrong. Use this table to map problems to building blocks:

| Problem | What to fix |
|---------|-------------|
| Output is generic or off-brand | Add more context — examples, style guides, reference materials |
| Steps are skipped or misunderstood | Refine the prompt — make the instructions more explicit |
| A step needs domain expertise the AI does not have | Build a skill for that step — codify the expertise into a reusable routine |
| The AI needs to make unpredictable decisions | Convert from prompt to agent — let the AI plan its approach |
| Output format is wrong | Check the prompt’s output format instructions — add explicit formatting examples |
| The model ignores your reference materials | Check that context files are correctly linked and formatted — the model may not be finding them |
| Tool connections fail during execution | Verify MCP connections — test each tool integration independently |

After each fix, re-run the affected test scenarios and compare scores to your previous run. You are looking for improvement on the dimensions that scored low.
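Comparing runs can be as simple as a per-dimension delta for each re-run scenario; positive deltas on the previously low dimensions mean the fix worked. The scores shown are hypothetical:

```python
def score_delta(before, after):
    """Per-dimension change between two runs of the same scenario."""
    return {dim: after[dim] - before[dim] for dim in before}

# Hypothetical: accuracy was the weak dimension and the fix raised it.
delta = score_delta({"accuracy": 2, "tone": 4}, {"accuracy": 4, "tone": 4})
# delta == {'accuracy': 2, 'tone': 0}
```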

If you chose the code-first architecture approach during Design, you may encounter additional issues:

| Problem | What to check |
|---------|---------------|
| API calls return errors | Verify API keys, rate limits, and request format match the provider’s current spec |
| Agent does not use tools | Check that tools are correctly registered in the agent configuration and that permissions are granted |
| Multi-agent handoffs fail | Verify the output format of each agent matches the expected input format of the next agent in the pipeline |
| Scheduled runs produce different results | Check for time-dependent context (dates, market data) that may have changed between runs |
| SDK version mismatch | Ensure your SDK version matches the documentation the model used during Build — update if needed |

After testing and iterating, you reach one of two outcomes:

Ready to deploy. You can run the workflow on a new scenario and trust the output without heavy editing. Your eval scores are at or above the threshold you set. Move to Step 6: Run.

Not ready. One or more dimensions consistently score below your threshold. Go back to Step 4: Build, fix the identified building blocks, and return to Test. Re-run the full eval suite — do not skip scenarios that passed previously, because a fix in one area can affect others.

This step is facilitated by the `test` skill from the Business-First AI Framework. See Get the Skills for installation instructions across all supported platforms.

Start with this prompt:

Test my workflow against the evaluation criteria in the Building Block Spec.

The skill guides you through the smoke test, eval suite, building block evals, baseline establishment, and diagnosis process.

  • Design Your AI Workflow — where evaluation criteria and test scenarios are defined
  • Build — where you fix issues identified during testing
  • Run — the next step once your workflow passes testing
  • Improve — where you re-run evals to detect regression on deployed workflows