AI Testing Skills: The Evolution Beyond RAG and MCP

Over the last three years, Large Language Models (LLMs) have evolved from simple text generators into agentic teammates capable of reasoning, executing commands, and verifying results. This summary outlines the milestones of this evolution and introduces “Skills” as the critical next step for efficient, repeatable AI workflows.

The Evolution of Agentic Capabilities

The transition from chat-based LLMs to autonomous agents occurred through specific technical milestones:

  • Function Calling (The Hands): Allowed models to stop describing actions and start executing them via structured API calls (a minimal schema sketch follows this list).
  • RAG (The Library Card): Retrieval-Augmented Generation gave agents access to external facts and documentation to ground their answers in truth.
  • MCP (The Universal Connector): The Model Context Protocol created a standard interface for connecting agents to various tools and resources.
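
For illustration, here is a minimal, hypothetical tool definition in the widely used JSON-schema style for function calling; the get_order_status name and its parameters are invented for this sketch:

```python
# A hypothetical tool definition in the JSON-schema style used for
# function calling. The model never runs this code; it only sees the
# schema and replies with a structured call such as
# {"name": "get_order_status", "arguments": {"order_id": "A-1042"}}.
get_order_status_tool = {
    "type": "function",
    "function": {
        "name": "get_order_status",  # invented example name
        "description": "Look up the fulfillment status of an order.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "Internal order ID.",
                },
            },
            "required": ["order_id"],
        },
    },
}
```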

The Efficiency Challenge

While MCP and tool use provided capability, they introduced a “token tax.” To make informed decisions, agents need tool definitions and schemas in their context window. Naïve usage results in context bloat, where agents burn tokens reading tool descriptions before doing actual work. Furthermore, while RAG is excellent for retrieving facts, it is ill-suited for storing procedural “how-to” knowledge.
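
To make the token tax concrete, here is a rough back-of-the-envelope sketch. It assumes the common heuristic of roughly four characters per token; real tokenizers and real tool schemas will differ:

```python
import json

# Hypothetical tool schemas an agent might carry in every request.
# The count and sizes are invented; real MCP tool definitions vary widely.
tool_schemas = [
    {"name": f"tool_{i}", "description": "x" * 300, "parameters": {"type": "object"}}
    for i in range(40)
]

def rough_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English-like text.
    return len(text) // 4

overhead = sum(rough_tokens(json.dumps(schema)) for schema in tool_schemas)
print(f"~{overhead} tokens spent on tool definitions before any real work")
```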

The Solution: Skills and Progressive Disclosure

Skills represent the next evolutionary step: a practical method to formalize procedural memory without wasting tokens. A Skill is a packaged playbook designed around the principle of progressive disclosure (a loader sketch follows the steps below):

  • Step 1: At startup, the agent loads only tiny metadata (Name + Description).
  • Step 2: The full instructions (the body of the Skill) are loaded only when the agent decides to use that specific skill.
  • Step 3: Detailed helper scripts or files are accessed only if strictly necessary.
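
Here is a minimal Python sketch of that loading pattern. The file layout and frontmatter parsing are assumptions for illustration; real agent runtimes implement this internally:

```python
from pathlib import Path

def load_skill_metadata(skill_dir: Path) -> dict:
    """Step 1: read only the cheap YAML frontmatter (name + description)."""
    text = (skill_dir / "SKILL.md").read_text()
    _, frontmatter, _body = text.split("---", 2)  # assumes ----delimited frontmatter
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        if key.strip():
            meta[key.strip()] = value.strip()
    return meta

def load_skill_body(skill_dir: Path) -> str:
    """Step 2: load the full playbook only once the skill is selected."""
    text = (skill_dir / "SKILL.md").read_text()
    _, _frontmatter, body = text.split("---", 2)
    return body.strip()

def load_helper_file(skill_dir: Path, name: str) -> str:
    """Step 3: fetch detailed helper scripts or files only if strictly necessary."""
    return (skill_dir / "scripts" / name).read_text()
```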

If Function Calling provides hands and RAG provides knowledge, Skills provide the agent with “muscle memory.”

Anatomy of a Skill

A Skill is typically a directory containing a SKILL.md file. This file includes YAML frontmatter for discovery (name/description) and a Markdown body for the actual playbook. Both Anthropic (Claude Code) and OpenAI (Codex) utilize this structure to ensure discovery is cheap and details are lazy-loaded.
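
For illustration, a SKILL.md might look like the following; the skill name, description, and playbook steps are invented for this sketch:

```markdown
---
name: api-testing
description: Write and run Playwright API tests following team conventions.
---

# API Testing Playbook

1. Reuse the shared authentication fixture; never hard-code credentials.
2. Build requests through the team's HTTP client wrapper.
3. Assert on status codes and response schemas, not raw bodies.
```

Only the two frontmatter fields are loaded at startup; the playbook body below the second --- stays on disk until the skill is invoked.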

Real-World Implementation

The article contrasts two distinct approaches to building skills:

  • Anthropic’s Webapp-Testing: Focuses on creating a standard operational loop. It uses a decision tree to determine whether a server needs starting and relies on helper scripts (e.g., scripts/with_server.py) to handle deterministic orchestration, keeping the prompt clean (a simplified sketch of such a helper appears below).
  • Custom API-Testing: A demonstration of building a skill from scratch. It defines strict rules for Playwright API tests, including authentication fixtures and HTTP client patterns, ensuring the agent follows team conventions without needing them repeated in every prompt (see the fixture sketch below).
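
A helper in the spirit of scripts/with_server.py might look like this minimal sketch; the readiness check and defaults are assumptions, not Anthropic’s actual implementation:

```python
import socket
import subprocess
import time
from contextlib import contextmanager

@contextmanager
def with_server(command: list[str], port: int, timeout: float = 30.0):
    """Start a dev server, wait until its port accepts connections, then yield."""
    proc = subprocess.Popen(command)
    try:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with socket.create_connection(("127.0.0.1", port), timeout=1):
                    break  # server is ready
            except OSError:
                time.sleep(0.5)
        else:
            raise TimeoutError(f"server on port {port} never became ready")
        yield proc
    finally:
        proc.terminate()
        proc.wait()
```

Likewise, an authentication fixture for Playwright API tests, in the spirit of the custom skill, could be sketched as follows (this assumes the pytest-playwright plugin; the base URL and token handling are placeholders):

```python
import os

import pytest
from playwright.sync_api import APIRequestContext, Playwright

@pytest.fixture(scope="session")
def api(playwright: Playwright):
    # Shared HTTP client with authentication baked in, so individual
    # tests never repeat credentials or base-URL configuration.
    ctx = playwright.request.new_context(
        base_url="https://api.example.com",  # placeholder
        extra_http_headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
    )
    yield ctx
    ctx.dispose()

def test_get_user(api: APIRequestContext):
    response = api.get("/users/1")
    assert response.ok
```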

Conclusion

Skills bridge the gap between ad-hoc prompting and rigid automation. They allow developers to package “how we do things” into shareable, versioned, and token-efficient playbooks, enabling agents to scale their capabilities without overwhelming their context windows.

Mentoring question

Which repetitive procedures or ‘tribal knowledge’ in your current development workflow are you constantly re-explaining to AI agents, and how could packaging them into a ‘Skill’ improve both consistency and token efficiency?

Source: https://www.awesome-testing.com/2025/12/ai-testing-skills

