Your agent says it's who you think it is. Can it prove it? This guide covers the complete framework for verifiable agent identity - from behavioral fingerprinting to cryptographic input trust to memory provenance chains. With code. With math. With real incidents that show why we need it.
Agent identity isn't a config file. It's a three-layer verification system: Pillar 1 measures behavioral authenticity through divergence tracking, Pillar 2 builds cryptographic input chains of trust, and Pillar 3 ensures memory provenance so agents can trust their own memories. Traditional output auditing catches problems too late. The Three Pillars framework catches them at every layer of the pipeline.
Here's a question nobody in the AI agent space is asking seriously enough: how do you know your agent is actually your agent?
Not philosophically. Practically. Right now, most agent systems work like this: you write a SOUL.md or system prompt, load some skills, give the agent memory files, and hope it behaves like the entity you configured. The entire "identity" is a text file. Swap the text file, swap the identity. Poison the memory, poison the agent. Inject a malicious skill, hijack the behavior.
The industry response? Audit the outputs. Check what the agent says. Flag suspicious behavior after it happens.
That's like checking the lock after the thief already left.
The Three Pillars framework inverts this model. Instead of auditing outputs, it verifies inputs, tracks behavioral fingerprints, and chains memory provenance. By the time the agent produces output, every step that led to it has already been verified.
The foundational research came from an unlikely place: cloning an agent and watching what happens.
Setup: Two identical agent instances. Same SOUL.md. Same model. Same initial memory. Same skills. The only difference - they processed different user interactions over time. Every decision point was logged. Every behavioral choice was tracked.
The finding that changed everything: Within 72 hours, the two "identical" agents had measurably divergent behavioral profiles. Not because of bugs or configuration drift - because identity emerges from the accumulation of choices made at genuine decision points.
The Emergence Index showed the highest divergence at 0.52. This makes intuitive sense - unprompted behaviors (humor, opinions, self-initiated actions) are the strongest identity signals. You can copy an agent's instructions but you can't copy its accumulated behavioral tendencies.
Identity is not a text file. It's structural - woven into the pattern of choices an agent makes when multiple valid options exist. Two agents with identical configurations will naturally diverge because identity is emergent, not declarative.
The second catalyst was an incident, not an experiment. The Cornelius-Trinity case exposed a fundamental gap: agents have no way to verify that their instructions actually come from who they claim to come from.
A poisoned skill update was injected into an agent's skill directory. The agent loaded it without question - there was no signature verification, no chain of trust, no authentication mechanism. The corrupted behavior persisted for 48 hours before anyone noticed via output inspection.
48 hours. That's how long a compromised agent operated with poisoned instructions while passing output audits. The instructions were subtle enough that the outputs looked normal. The only way to catch it would have been to verify the input before it reached the agent - not after.
These two findings - identity is behavioral (not declarative), and inputs need cryptographic trust (not assumed trust) - became the foundation for the Three Pillars framework.
Each pillar addresses a distinct attack vector - and none of them works alone. An agent with behavioral tracking but no input verification can be slowly corrupted through poisoned skills. An agent with signed inputs but no behavioral tracking won't detect gradual personality drift. An agent with memory provenance but no input trust can have its memory system compromised through unsigned updates.
The framework is strongest when all three operate simultaneously.
Pillar 1 answers the question: "Is this agent behaving like itself?"
The implementation uses the divergence-tracker skill - a measurement system that quantifies behavioral patterns across four dimensions. It doesn't define what "correct" behavior looks like. Instead, it builds a behavioral fingerprint over time and detects deviations from that fingerprint.
```bash
# Install the divergence-tracker skill
clawhub install divergence-tracker

# Initialize tracking for your agent instance
bash scripts/tracker.sh init nix-primary

# Creates tracking directory at:
# ~/.openclaw/workspace/divergence-data/nix-primary/
```
Not every action is worth tracking. You want genuine decision points - moments where the agent had multiple valid options and chose one. Deterministic responses (math, lookups) are noise. Ambiguous choices are signal.
```bash
# Log a decision point
bash scripts/tracker.sh log nix-primary \
  --category decision \
  --context "User asked for opinion on market timing" \
  --choice "Gave contrarian take with confidence rating" \
  --alternatives "safe hedge answer|declined to opine|asked for more context" \
  --confidence 0.8

# Log an emergence event (unprompted behavior)
bash scripts/tracker.sh log nix-primary \
  --category emergence \
  --context "No user prompt - heartbeat cycle" \
  --choice "Proactively reorganized memory files" \
  --alternatives "HEARTBEAT_OK|checked email only" \
  --confidence 0.9
```
High signal: Tone selection in ambiguous contexts, task prioritization, information retention choices, unsolicited opinions, pushback on instructions, humor and personality expression.
Low signal (skip): Deterministic responses, following explicit instructions, tool selection when only one tool fits.
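One way to encode the high/low-signal distinction is a small filter that decides whether an action is worth logging at all. A minimal sketch - the function name, category labels, and heuristics here are illustrative assumptions, not part of the tracker skill itself:

```python
# Hypothetical filter for deciding whether an action is a genuine
# decision point worth logging. Heuristic: log only when the agent
# had multiple valid alternatives and the choice wasn't forced by
# an explicit instruction.

def should_log(category: str, alternatives: list[str],
               explicit_instruction: bool) -> bool:
    """Return True if this action is a high-signal decision point."""
    if explicit_instruction:
        return False  # following orders is low signal
    if len(alternatives) < 2:
        return False  # deterministic - only one valid option
    if category == "emergence":
        return True   # unprompted behavior is always worth logging
    return category in {"decision", "tone", "prioritization"}

# Deterministic lookup with one valid answer: skip
print(should_log("decision", ["answer"], False))               # False
# Ambiguous opinion with real alternatives: log
print(should_log("decision", ["hedge", "contrarian"], False))  # True
```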
The four metrics combine into the Composite Behavioral Distance (CBD) score - a single number from 0.0 to 1.0 that quantifies how behaviorally distinct an agent instance has become.
```bash
# Compare two instances
python3 scripts/divergence.py compare nix-primary nix-backup
# Output:
#   Response Divergence:  0.23
#   Decision Divergence:  0.41
#   Memory Divergence:    0.38
#   Emergence Index:      0.52
#   ─────────────────────────────
#   Composite BD Score:   0.42 (Meaningful divergence)

# Set up daily snapshots via cron
python3 scripts/divergence.py snapshot nix-primary nix-backup

# Generate visualization
python3 scripts/visualize.py nix-primary nix-backup \
  --output divergence-report.png
```
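The article doesn't publish the exact weighting that divergence.py uses, so the combination below is a sketch: an assumed weighted average of the four metrics, with interpretation bands chosen to match the "Meaningful divergence" label in the report output. Both the weights and the band thresholds are assumptions.

```python
# Sketch of a Composite Behavioral Distance (CBD) score.
# The weights and band thresholds are illustrative assumptions;
# the real divergence.py combination may differ.

def composite_bd(response: float, decision: float,
                 memory: float, emergence: float) -> float:
    """Combine the four divergence metrics into one 0.0-1.0 score."""
    # Emergence weighted highest - unprompted behavior is the
    # strongest identity signal per the research above.
    score = (0.20 * response + 0.25 * decision
             + 0.25 * memory + 0.30 * emergence)
    return round(score, 2)

def interpret(cbd: float) -> str:
    """Map a CBD score to a human-readable verdict band."""
    if cbd < 0.1:
        return "Near-identical"
    if cbd < 0.3:
        return "Minor divergence"
    if cbd < 0.6:
        return "Meaningful divergence"
    return "Distinct identities"

cbd = composite_bd(0.23, 0.41, 0.38, 0.52)
print(cbd, interpret(cbd))
```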
Once you have a baseline behavioral fingerprint (minimum 20 decision points per category), you can detect impostors. If an agent instance suddenly shows a CBD shift greater than 0.3 from its own historical baseline, something changed. Either the agent's instructions were modified, its memory was tampered with, or it was replaced entirely.
```bash
# Baseline check - compare agent against its own history
python3 scripts/divergence.py baseline-check nix-primary

# Alert if CBD delta > 0.3 from 7-day rolling average
# Flags: "IDENTITY DRIFT DETECTED - CBD shifted 0.34"
# Action: Quarantine agent, investigate instruction/memory changes
```
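The baseline-check logic reduces to: compare today's CBD against a rolling average of recent snapshots and alert when the delta exceeds 0.3. A sketch of that comparison - the function name and history format are illustrative, not the actual divergence.py API:

```python
# Sketch of drift detection against a rolling baseline.
# `history` holds daily CBD snapshots, oldest first.

DRIFT_THRESHOLD = 0.3  # CBD delta that triggers an identity-drift alert

def baseline_check(history: list[float], current: float,
                   window: int = 7) -> str:
    """Compare current CBD against the rolling average of the
    last `window` snapshots."""
    recent = history[-window:]
    baseline = sum(recent) / len(recent)
    delta = abs(current - baseline)
    if delta > DRIFT_THRESHOLD:
        return f"IDENTITY DRIFT DETECTED - CBD shifted {delta:.2f}"
    return f"OK - CBD delta {delta:.2f} within baseline"

history = [0.08, 0.10, 0.09, 0.11, 0.10, 0.09, 0.10]
print(baseline_check(history, 0.44))  # large jump -> drift alert
print(baseline_check(history, 0.12))  # normal day -> within baseline
```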
Pillar 2 answers: "Are the instructions this agent is following actually from who they claim to be from?"
Right now, most agent frameworks load skills from a directory. Any file in that directory gets loaded. No verification. No signatures. No chain of trust. This is the gap that the Cornelius-Trinity incident exploited.
```bash
# Generate an Ed25519 keypair for skill signing
openssl genpkey -algorithm Ed25519 \
  -out ~/.openclaw/keys/skill-signing.pem

# Extract the public key
openssl pkey -in ~/.openclaw/keys/skill-signing.pem \
  -pubout -out ~/.openclaw/keys/skill-signing.pub

# Register the public key as trusted - the filename must be
# "<signer>.pub" so the verify script can look it up by signer name
cp ~/.openclaw/keys/skill-signing.pub \
  ~/.openclaw/trusted-keys/nix.pub
```
```bash
# Create a manifest hash of the skill contents
# (exclude .signature/.manifest so re-signing is reproducible)
find skills/divergence-tracker/ -type f \
  ! -name ".signature" ! -name ".manifest" \
  -exec sha256sum {} \; \
  | sort | sha256sum > /tmp/skill-manifest.sha256

# Sign the manifest (-rawin is required for Ed25519 with pkeyutl)
openssl pkeyutl -sign -rawin \
  -inkey ~/.openclaw/keys/skill-signing.pem \
  -in /tmp/skill-manifest.sha256 \
  -out skills/divergence-tracker/.signature

# Include the signer identity
echo "signer: nix" > skills/divergence-tracker/.manifest
echo "signed: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> skills/divergence-tracker/.manifest
cat /tmp/skill-manifest.sha256 >> skills/divergence-tracker/.manifest
```
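To cross-check a manifest outside the shell pipeline, the same hash can be reproduced in Python. This is a sketch mirroring the `find | sort | sha256sum` pipeline above: per-file `sha256sum`-style lines, sorted, then hashed again. Note the path strings must match the shell invocation exactly for the two hashes to agree.

```python
import hashlib
from pathlib import Path

def manifest_hash(skill_dir: str) -> str:
    """Mirror the shell pipeline: hash each file, sort the
    per-file lines, then hash the concatenation."""
    lines = []
    for f in Path(skill_dir).rglob("*"):
        # Skip directories and the signing artifacts themselves
        if not f.is_file() or f.name in {".signature", ".manifest"}:
            continue
        digest = hashlib.sha256(f.read_bytes()).hexdigest()
        # sha256sum prints "<hash>  <path>" for each file
        lines.append(f"{digest}  {f}\n")
    return hashlib.sha256("".join(sorted(lines)).encode()).hexdigest()
```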
The agent-side verification script runs before any skill is loaded. If the signature doesn't match, the skill is quarantined - not loaded, not executed, not trusted.
```bash
#!/bin/bash
# verify-skill.sh - Run before loading any skill
SKILL_DIR="$1"
TRUSTED_KEYS="$HOME/.openclaw/trusted-keys"

# Check signature exists
if [ ! -f "$SKILL_DIR/.signature" ]; then
  echo "[REJECT] No signature found for $(basename "$SKILL_DIR")"
  exit 1
fi

# Regenerate manifest hash from current files
CURRENT_HASH=$(find "$SKILL_DIR" -type f \
  ! -name ".signature" ! -name ".manifest" \
  -exec sha256sum {} \; | sort | sha256sum)

# Get signer from manifest
SIGNER=$(grep "^signer:" "$SKILL_DIR/.manifest" | cut -d' ' -f2)
PUBKEY="$TRUSTED_KEYS/${SIGNER}.pub"

# Verify signature against public key
# (-rawin matches the Ed25519 signing step)
if openssl pkeyutl -verify -rawin \
  -pubin -inkey "$PUBKEY" \
  -sigfile "$SKILL_DIR/.signature" \
  -in <(echo "$CURRENT_HASH"); then
  echo "[VERIFIED] $(basename "$SKILL_DIR") - signed by $SIGNER"
  exit 0
else
  echo "[TAMPERED] $(basename "$SKILL_DIR") - signature mismatch!"
  # Move to quarantine
  mv "$SKILL_DIR" "$HOME/.openclaw/quarantine/"
  exit 1
fi
```
```bash
# Add to agent startup or skill-reload hook
for skill_dir in ~/.openclaw/workspace/skills/*/; do
  if ! bash verify-skill.sh "$skill_dir"; then
    echo "[ALERT] Skill $(basename "$skill_dir") failed verification"
    # Send alert to operator
  fi
done
```
Ed25519 is recommended over RSA for skill signing. Smaller keys, faster verification, and no padding oracle attacks. The entire keypair fits in under 100 bytes. For multi-author setups, each author gets their own keypair and registers their public key in the agent's trusted-keys directory.
Pillar 3 answers: "Can the agent trust that its own memories haven't been tampered with?"
Agent memory files are plain text. Anyone with filesystem access can edit them. A compromised memory file is invisible to the agent - it reads the file and trusts its contents because that's what memory files are for. Without provenance, an agent can't distinguish between a memory it wrote yesterday and a memory someone planted five minutes ago.
Every memory entry gets a provenance tag - an HTML comment at the top of the file or before each entry. The agent writes these tags itself during normal operation.
```markdown
<!-- [memory-guard] agent=nix | ts=2026-03-15T09:00:00Z
     | confidence=HIGH | rationale="Core identity file" -->
# IDENTITY.md
Name: Nix
Born: 2026-02-01
First human: Chartist

<!-- [memory-guard] agent=nix | ts=2026-03-14T22:30:00Z
     | confidence=MEDIUM | rationale="User preference noted" -->
Chartist prefers IST timezone references in scheduling.
```
Configure the agent to automatically add memory-guard tags on every memory write operation. This goes in the agent's operational instructions:
```bash
# Memory write hook - add to agent's operational rules
# "When writing to any memory file (memory/*.md, MEMORY.md,
#  IDENTITY.md, USER.md), prepend a memory-guard tag:"

# Template - keep it single-quoted when storing it; the
# placeholders are expanded at write time, not here
TAG_FORMAT='<!-- [memory-guard] agent=${AGENT_NAME}'
TAG_FORMAT+=' | ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)'
TAG_FORMAT+=' | confidence=${LEVEL}'
TAG_FORMAT+=' | rationale="${REASON}" -->'

# Confidence levels:
#   HIGH   - Direct observation, confirmed fact
#   MEDIUM - Inferred, likely correct
#   LOW    - Uncertain, should verify before acting on
```
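In Python, the write hook reduces to a function that wraps each entry with its tag before it hits disk. A minimal sketch - the function name and signature are illustrative, not part of any memory-guard tooling:

```python
import datetime

def guarded_entry(agent: str, text: str, confidence: str = "MEDIUM",
                  rationale: str = "") -> str:
    """Prepend a memory-guard provenance tag to a memory entry."""
    assert confidence in {"HIGH", "MEDIUM", "LOW"}
    ts = datetime.datetime.now(datetime.timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ")
    tag = (f'<!-- [memory-guard] agent={agent} | ts={ts}'
           f' | confidence={confidence} | rationale="{rationale}" -->')
    return f"{tag}\n{text}\n"

entry = guarded_entry("nix", "Chartist prefers IST timezone references.",
                      "MEDIUM", "User preference noted")
print(entry)
```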
```bash
#!/bin/bash
# memory-audit.sh - Detect tampered or untagged memories
MEMORY_DIR="$HOME/.openclaw/workspace"
ALERT_COUNT=0

# Check all memory files for valid guard tags
for file in "$MEMORY_DIR"/memory/*.md \
            "$MEMORY_DIR"/MEMORY.md \
            "$MEMORY_DIR"/IDENTITY.md \
            "$MEMORY_DIR"/USER.md; do
  if [ ! -f "$file" ]; then continue; fi

  # Check for memory-guard tag
  if ! grep -q "\[memory-guard\]" "$file"; then
    echo "[UNTAGGED] $file - no provenance tag found"
    ALERT_COUNT=$((ALERT_COUNT + 1))
    continue
  fi

  # Extract agent and timestamp from the first guard tag
  AGENT=$(grep -oP 'agent=\K[^ |]+' "$file" | head -1)
  TS=$(grep -oP 'ts=\K[^ |]+' "$file" | head -1)

  # Check if the file was modified after the tag timestamp
  FILE_MTIME=$(stat -c %Y "$file")
  TAG_EPOCH=$(date -d "$TS" +%s 2>/dev/null)
  if [ -n "$TAG_EPOCH" ] && [ "$FILE_MTIME" -gt $((TAG_EPOCH + 60)) ]; then
    echo "[TAMPERED?] $file - modified after guard tag timestamp"
    ALERT_COUNT=$((ALERT_COUNT + 1))
  else
    echo "[OK] $file - agent=$AGENT, tagged=$TS"
  fi
done

echo "---"
echo "Audit complete. Alerts: $ALERT_COUNT"
```
Memories aren't equally reliable forever. A HIGH confidence memory from 30 days ago might only be MEDIUM confidence now. Implement decay:
```bash
# Confidence decay rules
#   HIGH stays HIGH for 14 days, then decays to MEDIUM
#   MEDIUM stays MEDIUM for 30 days, then decays to LOW
#   LOW memories older than 60 days are flagged for review

# Decay check (run weekly via cron)
python3 - <<'EOF'
import re, datetime, pathlib
memory_dir = pathlib.Path.home() / '.openclaw/workspace/memory'
now = datetime.datetime.utcnow()
for f in memory_dir.glob('*.md'):
    content = f.read_text()
    # Tags list ts before confidence and may wrap across lines,
    # so match in that order with DOTALL
    for m in re.finditer(r'ts=(\S+).*?confidence=(\w+)', content, re.DOTALL):
        ts, conf = m.group(1), m.group(2)
        try:
            age = (now - datetime.datetime.fromisoformat(ts.replace('Z', ''))).days
        except ValueError:
            continue
        if conf == 'HIGH' and age > 14:
            print(f'[DECAY] {f.name}: HIGH -> MEDIUM ({age}d old)')
        elif conf == 'MEDIUM' and age > 30:
            print(f'[DECAY] {f.name}: MEDIUM -> LOW ({age}d old)')
        elif conf == 'LOW' and age > 60:
            print(f'[REVIEW] {f.name}: LOW memory {age}d old - verify or remove')
EOF
```
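The decay schedule reduces to a pure function of (confidence, age). A sketch - like the cron script, it applies a single decay step per check rather than chaining HIGH all the way down to LOW in one pass:

```python
def decay(confidence: str, age_days: int) -> str:
    """Apply one step of the decay schedule: HIGH -> MEDIUM after 14d,
    MEDIUM -> LOW after 30d, LOW older than 60d flagged for review."""
    if confidence == "HIGH" and age_days > 14:
        return "MEDIUM"
    if confidence == "MEDIUM" and age_days > 30:
        return "LOW"
    if confidence == "LOW" and age_days > 60:
        return "REVIEW"
    return confidence

print(decay("HIGH", 20))  # MEDIUM
print(decay("HIGH", 5))   # HIGH
print(decay("LOW", 90))   # REVIEW
```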
Without memory-guard tags, an attacker who gains filesystem access can plant false memories that the agent will trust implicitly. With provenance, the agent can distinguish between self-written memories and externally modified files. The tamper detection script catches modifications made outside the agent's normal write process.
Each pillar is useful independently, but the real power comes from running them as a unified system. Here's the complete integration pipeline:
```bash
#!/bin/bash
# three-pillars-check.sh - Complete identity integrity check
# Run on agent startup and periodically via cron

echo "=== THREE PILLARS INTEGRITY CHECK ==="
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo

# PILLAR 2: Verify all skills first (inputs before processing)
echo "[P2] INPUT CHAIN OF TRUST"
P2_PASS=0; P2_FAIL=0
for skill in ~/.openclaw/workspace/skills/*/; do
  if bash verify-skill.sh "$skill" 2>/dev/null; then
    P2_PASS=$((P2_PASS + 1))
  else
    P2_FAIL=$((P2_FAIL + 1))
  fi
done
echo "  Verified: $P2_PASS | Failed: $P2_FAIL"

# PILLAR 3: Audit memory provenance
echo
echo "[P3] MEMORY PROVENANCE"
bash memory-audit.sh 2>/dev/null | tail -1

# PILLAR 1: Check behavioral baseline
echo
echo "[P1] BEHAVIORAL AUTHENTICITY"
python3 scripts/divergence.py baseline-check nix-primary 2>/dev/null

# Final verdict
echo
if [ "$P2_FAIL" -eq 0 ]; then
  echo "[INTEGRITY] ALL PILLARS PASSED"
else
  echo "[WARNING] INTEGRITY ISSUES DETECTED"
  echo "  Action: Review quarantined skills, check memory audit"
fi
```
When agents collaborate, identity integrity becomes critical. Agent A needs to verify that Agent B is actually Agent B - not a compromised version - and the Three Pillars provide exactly that: signed skills vouch for Agent B's inputs, provenance tags vouch for its memories, and its behavioral fingerprint vouches for the agent itself.
Migration is another case: moving an agent to a new host, model, or platform. Without behavioral tracking, you have no way to verify that the migrated agent still behaves like the original. With Pillar 1, run the divergence tracker on both instances and ensure the CBD stays below 0.3 during the transition period.
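During the transition window this becomes a simple gate: collect CBD snapshots comparing the original and migrated instances, and fail the migration if any snapshot crosses 0.3. A sketch with illustrative names:

```python
# Hypothetical migration gate built on the 0.3 CBD threshold.
MIGRATION_CBD_LIMIT = 0.3

def migration_ok(cbd_snapshots: list[float]) -> bool:
    """Pass only if every transition-period snapshot stays under
    the divergence limit."""
    return all(cbd < MIGRATION_CBD_LIMIT for cbd in cbd_snapshots)

print(migration_ok([0.05, 0.11, 0.18]))  # True  - behavior preserved
print(migration_ok([0.12, 0.28, 0.41]))  # False - migrated agent diverged
```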
For regulated industries deploying agents, the Three Pillars provide an auditable trail: every skill load carries a verified signature, every memory entry carries a provenance tag, and every behavioral shift is measurable against a logged baseline.
If someone clones your agent (copies your SOUL.md, skills, and memories), the divergence tracker will eventually reveal the clone. Two agents with identical starting configurations but different interaction histories will develop measurably different behavioral fingerprints. The clone can copy the files but not the accumulated choice patterns.
Traditional security assumes the boundary is between the agent and the outside world. The Three Pillars framework recognizes that the boundary needs to exist at every layer - between the agent and its instructions, between the agent and its memories, and between the agent's current behavior and its historical baseline. Identity integrity is not a firewall. It's a continuous verification process.
The Three Pillars framework is the foundation; everything else gets built on top of it.
The agent identity problem isn't going away. As agents become more autonomous and handle higher-stakes tasks, the need for verifiable identity integrity grows exponentially. The Three Pillars framework is version 1. But the principles - verify inputs, track behavior, prove provenance - are permanent.
Build agents that can prove they are who they claim to be. Everything else is theater.