How we built an AI agent that autonomously collects customer evidence

TL;DR: We built an autonomous AI agent that runs 24/7, monitors customer signals (Slack, email, CRM), identifies happy customers, generates case study requests, manages advocate health, and surfaces proof at deal time, all without human approval loops. This post covers the architecture, decision tree logic, and lessons from shipping.

The problem we were solving

Customer proof is one of the highest-leverage signals in B2B sales. But collecting it is absurdly manual.

Someone has to:

  1. Notice when a customer is happy (trawling Slack, email, customer surveys)
  2. Decide they're a good candidate for a case study
  3. Draft a request (usually generic)
  4. Wait for a response
  5. Follow up multiple times
  6. Generate or gather the content
  7. Get legal review
  8. Finally, use it in a deal

Each step is a human decision point. Each decision point is a bottleneck.

What if the AI could own the entire loop?

Not just generate text. But reason about business context, prioritize which customers to ask, time requests to maximize response rates, learn which advocates are burning out, and surface proof when salespeople need it.

We decided to build an agent. Not a chatbot. Not a single-purpose script. An autonomous system that would work 24/7 and improve over time.


Architecture overview

┌─────────────────────────────────────────────────────────────┐
│                   Airtight Agent (24/7)                     │
├─────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────┐  ┌──────────┐  ┌─────────────────┐         │
│  │   Signal    │  │Decision  │  │  Action Engine  │         │
│  │ Aggregator  │──│   Tree   │──│  (via MCP tools)│         │
│  │             │  │          │  │                 │         │
│  │ • Slack     │  │ • Score  │  │ • Post to Slack │         │
│  │ • Email     │  │ • Rank   │  │ • Send email    │         │
│  │ • CRM       │  │ • Time   │  │ • Log to CRM    │         │
│  │ • Surveys   │  │ • Filter │  │ • Generate text │         │
│  └─────────────┘  └──────────┘  └─────────────────┘         │
│        ▲                                      │              │
│        │                                      ▼              │
│        │          ┌──────────────────────┐                   │
│        └──────────│   Memory / Context   │                   │
│                   │  (PostgreSQL + Cache)│                   │
│                   │                      │                   │
│                   │ • Customer profiles  │                   │
│                   │ • Advocate health    │                   │
│                   │ • Deal context       │                   │
│                   │ • Success patterns   │                   │
│                   └──────────────────────┘                   │
└─────────────────────────────────────────────────────────────┘

Core loop: Signal Aggregator → Decision Tree → Action Engine
Frequency: Every 30 minutes (user-configurable)
State: Persistent (PostgreSQL for customer/advocate data, Redis for recent signals)
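The core loop can be sketched as a simple scheduler. This is a minimal sketch, not the production implementation: the three classes are stand-ins for the layers above, and the 30-minute interval is the configurable default.

```python
import time

# Hypothetical stubs standing in for the three layers described above.
class SignalAggregator:
    def collect_signals(self, customer_id):
        return [("crm_health", 0.8, {"recency": 1})]

class DecisionTree:
    def evaluate_customer(self, customer_id, signals):
        score = sum(s[1] for s in signals) / len(signals)
        return {"action": "request" if score > 0.6 else "skip"}

class ActionEngine:
    def execute(self, decision):
        print(f"executing: {decision['action']}")

def run_agent_loop(customers, interval_seconds=30 * 60, cycles=None):
    """Signal Aggregator -> Decision Tree -> Action Engine, every cycle."""
    aggregator, tree, engine = SignalAggregator(), DecisionTree(), ActionEngine()
    cycle = 0
    while cycles is None or cycle < cycles:
        for customer_id in customers:
            signals = aggregator.collect_signals(customer_id)
            decision = tree.evaluate_customer(customer_id, signals)
            engine.execute(decision)
        cycle += 1
        if cycles is None or cycle < cycles:
            time.sleep(interval_seconds)

run_agent_loop(["acme-corp"], cycles=1)
```

State lives outside the loop (PostgreSQL/Redis in our case), so a crashed cycle resumes cleanly on the next tick.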


Signal aggregation layer

The agent needs to know when a customer is happy. We integrated with:

  • Slack: Listen for keywords ("love your product," "huge fan," "this is great") plus sentiment analysis. Weight by channel (customer-praise channel = 10x confidence vs. internal channel).
  • Email: Parse inbound customer emails for NPS responses, feature requests, and bug reports. Feature requests = satisfaction. Bug reports = still engaged.
  • CRM: Pull the latest customer health scores (from Gainsight, Vitally, etc.), product usage metrics, and engagement history.
  • Surveys: Query NPS/CSAT results from SurveySparrow, Typeform, or native integrations. Recent 5-star responses = strong signal.
class SignalAggregator:
    def collect_signals(self, customer_id):
        signals = []

        # Slack sentiment (last 7 days, most recent 50 messages)
        slack_msgs = fetch_slack_channel_messages(
            limit=50, user=customer_id, days=7
        )
        slack_score = sentiment_analysis(slack_msgs)
        signals.append(("slack", slack_score, {"recency": 7}))

        # CRM health
        crm_health = fetch_crm_health_score(customer_id)
        signals.append(("crm_health", crm_health, {"recency": 1}))

        # NPS: a recent promoter score (>8) is a strong signal
        nps_recent = fetch_nps(customer_id, days=14)
        if nps_recent and nps_recent > 8:
            signals.append(("nps", 0.9, {"recency": 14}))

        return signals

    def compute_confidence(self, signals):
        # Weighted average by recency + signal strength
        total_weight = sum(s[2]["recency"] for s in signals)
        weighted_score = sum(s[1] * s[2]["recency"] for s in signals) / total_weight
        return weighted_score


Decision tree layer

Once we have signals, the agent runs a decision tree:

1. Is this customer happy? (Threshold: confidence score > 0.6)
  • If no: skip, check again next cycle
  • If yes: continue
2. Have we asked them before? (Check advocate_request history)
  • If yes and <3 months: skip (burnout protection)
  • If yes and >6 months: re-evaluate
  • If no: continue
3. Are they an ideal reference? (Rank by: ARR, tenure, industry, product usage depth)
  • Top 30% of customer base: score 0.9
  • Next 40%: score 0.7
  • Lower tier: score 0.4 (lower priority)
4. What proof do we need? (Match against open deals + upcoming forecast)
  • "Do we have a case study from this industry?" (query database)
  • "Do we have a testimonial on this feature?" (query database)
  • "Do we have proof for this persona?" (query database)
5. Is this the right time? (Context-aware scheduling)
  • Check if customer is in peak usage period (more likely to say yes)
  • Check if their company is in busy season (less likely to respond)
  • Avoid requesting during their product's launch window
6. Output: Request type & timing
  • Generate request (email, Slack message, video request)
  • Schedule send time (Tuesday–Thursday, 9–11 AM in their timezone preferred)
  • Log decision for follow-up
class DecisionTree:
    def evaluate_customer(self, customer_id):
        signals = self.aggregator.collect_signals(customer_id)
        confidence = self.aggregator.compute_confidence(signals)

        if confidence < 0.6:
            return {"action": "skip", "reason": "low_confidence"}

        # Check advocate health: burnout protection
        recent_requests = db.query(
            "SELECT COUNT(*) FROM advocate_requests "
            "WHERE customer_id = ? AND created_at > NOW() - INTERVAL 3 MONTH",
            [customer_id],
        )
        if recent_requests > 2:
            return {"action": "skip", "reason": "burnout_protection"}

        # Rank customer (ARR, tenure, industry, usage depth)
        rank = self.rank_customer(customer_id)
        if rank < 0.4:
            return {"action": "skip", "reason": "low_priority"}

        # Find proof gap against open deals and upcoming forecast
        proof_gap = self.find_missing_proof(customer_id)
        if not proof_gap:
            return {"action": "skip", "reason": "sufficient_proof"}

        # Check timing
        timing = self.optimal_send_time(customer_id)

        return {
            "action": "request",
            "customer_id": customer_id,
            "request_type": proof_gap,
            "scheduled_send": timing,
            "confidence": confidence,
        }


Action engine layer

Once the decision tree outputs a decision, the action engine executes:

If action = "request":
  • Generate personalized message (referencing their specific product usage, feature adoption, industry context)
  • Send via preferred channel (some customers respond better to Slack, others to email)
  • Log the request with timestamp, type, and context
  • Schedule follow-up (auto-message in 7 days if no response)
If action = "skip":
  • Log reason (prevents repeated evaluation)
  • Re-check next cycle (typically 30 days later)
If action = "follow_up":
  • Send personalized follow-up (different angle, lower lift ask)
  • Offer alternatives ("Video testimonial too much? Can I pull a quote from your recent review instead?")
The action engine integrates with MCP tools:
  • Slack MCP: Post to customer channel or DM
  • Email MCP: Send personalized email
  • Salesforce MCP: Log request in CRM
  • GitHub MCP: If customer shared a repo issue or comment, reference it for credibility
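The dispatch logic above can be sketched as follows. This is a minimal sketch: `send_slack_dm` and `send_email` are hypothetical stand-ins for the Slack and Email MCP tool calls, and the log structure is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-ins for the MCP tool integrations.
def send_slack_dm(customer_id, message):
    print(f"slack -> {customer_id}: {message}")

def send_email(customer_id, message):
    print(f"email -> {customer_id}: {message}")

def execute_action(decision, preferred_channel="email", request_log=None):
    """Dispatch a decision tree output to the right channel and log it."""
    log = request_log if request_log is not None else []
    now = datetime.now(timezone.utc)
    if decision["action"] == "request":
        message = f"Personalized {decision['request_type']} request"
        # Send via the customer's preferred channel
        if preferred_channel == "slack":
            send_slack_dm(decision["customer_id"], message)
        else:
            send_email(decision["customer_id"], message)
        log.append({
            "customer_id": decision["customer_id"],
            "type": decision["request_type"],
            "sent_at": now,
            # Auto follow-up in 7 days if no response
            "follow_up_at": now + timedelta(days=7),
        })
    elif decision["action"] == "skip":
        # Log the reason so the customer isn't re-evaluated this cycle
        log.append({"customer_id": decision["customer_id"],
                    "skipped": decision["reason"]})
    return log
```

Logging the skip reason is what prevents the agent from re-litigating the same customer every 30 minutes.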

The critical innovation: No approval loops

Traditional systems (UserEvidence, Influitive) route requests through human approval: marketing manager reviews the message, approves timing, then it goes out.

We eliminated that.

The agent generates the request, logs the decision reasoning, and sends immediately. If the request is bad, it learns (low response rate = signal to refine decision tree). If it's good, it does more of it.

Human review happens asynchronously: "Review yesterday's 10 requests. Are they on-brand? Do they align with your strategy?" But the requests don't wait for approval.

This 10x'd our velocity.

Advocate health tracking (preventing burnout)

Customer burnout is the biggest risk in proof collection. We built a tracking system:

For each advocate:
  • Track requests per month (target: 1–2)
  • Track response rate (if dropping below 30%, flag)
  • Track sentiment change (are they getting annoyed?)
  • Track time-to-response (are they slow to engage now?)
If an advocate hits burnout signals, the agent:
  1. Stops requesting for 2–3 months
  2. Reaches out with a "thank you" message (gratitude, not a request)
  3. Resets the counter
Early result: Advocate burnout down 40%. Response rates stable. Repeat ask rates up 50% (customers say yes again after a break).
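The burnout check reduces to a few threshold rules. A minimal sketch using the thresholds from the list above; the advocate field names are illustrative assumptions, not our schema.

```python
def burnout_signals(advocate):
    """Return the burnout flags tripped by an advocate's recent history.

    `advocate` is a dict with illustrative fields:
      requests_last_month, responses, requests_sent, sentiment_trend
    """
    flags = []
    if advocate["requests_last_month"] > 2:      # target: 1-2 per month
        flags.append("too_many_requests")
    if advocate["requests_sent"] > 0:
        rate = advocate["responses"] / advocate["requests_sent"]
        if rate < 0.30:                          # response rate dropping
            flags.append("low_response_rate")
    if advocate["sentiment_trend"] < 0:          # getting annoyed
        flags.append("negative_sentiment")
    return flags

def next_action(advocate):
    # Any burnout flag: pause requests, send gratitude instead of an ask
    if burnout_signals(advocate):
        return {"pause_months": 3, "send": "thank_you"}
    return {"pause_months": 0, "send": "request_ok"}
```

The key design choice is that a tripped flag doesn't just suppress requests; it swaps in a positive touch, which is what keeps the repeat-ask rate up after the cooldown.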

Learning loop (continuous improvement)

Each cycle, the agent records:

  • Request sent: yes/no
  • Response received: yes/no
  • Conversion to proof: yes/no
  • Sentiment of response: positive/neutral/negative
Over time:
  • Decision tree gets smarter ("Requests sent to DevOps leads on Tuesdays → 60% response rate. Requests sent to CMOs on Thursdays → 45% response rate.")
  • Timing improves ("This customer's company has high email volume 8–10 AM. Moving requests to 2 PM +15% response rate.")
  • Personalization strengthens ("Customers in financial services + 3+ years tenure + >5 daily active users = 78% say yes to video case study.")
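The learning loop is, at its core, a grouped aggregation: bucket each recorded request by the dimensions you want to learn over (persona, weekday, channel) and compute response rates per bucket. A minimal sketch with illustrative segment keys:

```python
from collections import defaultdict

def response_rate_by_segment(outcomes):
    """Aggregate recorded outcomes into per-segment response rates.

    Each outcome is a dict like:
      {"segment": ("devops_lead", "tuesday"), "responded": True}
    The segment key is whatever combination of dimensions the
    decision tree should learn over.
    """
    sent = defaultdict(int)
    responded = defaultdict(int)
    for outcome in outcomes:
        sent[outcome["segment"]] += 1
        responded[outcome["segment"]] += outcome["responded"]
    return {seg: responded[seg] / sent[seg] for seg in sent}
```

Feeding these rates back as priors is what turns "requests sent to DevOps leads on Tuesdays" from an anecdote into a scheduling rule.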

What we learned shipping this

1. Token cost is huge when running agents frequently. We run decisions every 30 minutes. That's roughly 1,440 evaluations per customer per month. At scale, GPT-4o costs blow up. Solution: we built a lightweight decision tree (temperature=0, no reasoning tokens, just classification) and fall back to faster models for routine checks. Full reasoning only triggers when confidence is borderline.

2. Latency matters for credibility. If a customer posts "We want a case study" in Slack and the AI doesn't reach out for 6 hours, the moment is lost. We optimized to <5 min between signal detection and action. Redis cache + fast decision tree = critical.

3. Context windows get real fast. Fetching full customer history + deal context + recent signals for every decision = context bloat. Solution: we summarized ("3 recent NPS responses, health score 8/10, usage peak last week") instead of fetching raw data.

4. Customers want to know it's personalized. Generic requests get a 20% response rate. Personalized requests (mentioning their specific use case, feature adoption, or recent achievement) get 50%+. Worth the token cost.

5. Transparency on "why" beats automation magic. When we log "We're not asking you because you mentioned burnout risk," advocates respect the system. When it just silently stops, they forget about it. Async notification > mysterious silence.
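The cost-control routing from lesson 1 can be sketched as a confidence gate. The model tier names and the 0.55–0.65 "borderline" band are illustrative assumptions, not our exact configuration.

```python
def choose_model(confidence, borderline=(0.55, 0.65)):
    """Route an evaluation to a model tier based on signal confidence.

    Clear skips and clear asks use the cheap classifier; only
    borderline confidence pays for full reasoning.
    """
    low, high = borderline
    if low <= confidence <= high:
        return "full-reasoning-model"   # illustrative tier name
    return "fast-classifier"            # temperature=0, classification only

def evaluate(confidence):
    model = choose_model(confidence)
    # The fast path never spends reasoning tokens
    return {"model": model, "reasoning": model == "full-reasoning-model"}
```

Because most customers sit clearly above or below the ask threshold on any given cycle, the expensive path fires only on the small borderline slice, which is where the cost savings come from.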

The business impact

  • Case study generation: Down from 4–6 weeks (human bottleneck) to 1–2 weeks (AI + human review)
  • Advocate response rate: 55% (vs. 30% industry baseline for cold asks)
  • ROI per customer: $160K annual value (measured as time saved + accelerated deals) for $3,588 software cost = 44x return
  • Burnout rate: Down 40% (advocates being asked less frequently, more strategically)

What's next

We're exploring:

  1. Multi-turn case study generation (AI conducts interview via email, uses responses to refine questions)
  2. Proof matching via vector embeddings (find most relevant case study to a specific deal in milliseconds)
  3. Custom proof generation (AI generates a 1-pager specific to a prospect's use case based on existing customer data)
The agent is just getting started.


Try Airtight's autonomous proof agent → https://airtight-oqs3.polsia.app