ML Model Evaluation

Evaluate machine learning models rigorously — with train/test splits, cross-validation, business metric alignment, bias detection, and production readiness assessment.

machine-learning, evaluation, metrics, bias, model-selection

Works well with agents

Data Scientist Agent, ML Engineer Agent, Product Analyst Agent

Works well with skills

Experiment Design, Metrics Framework
$ npx skills add The-AI-Directory-Company/(…) --skill ml-model-evaluation
ml-model-evaluation/
  • examples/
    • churn-prediction.md (3.4 KB)
  • SKILL.md (6.0 KB)
ml-model-evaluation/examples/churn-prediction.md
# Model Evaluation: Customer Churn Predictor, StreamVault

## Problem Definition

**Business objective:** Identify subscribers likely to cancel within 30 days so the retention team can intervene with targeted offers. Current rule-based system (usage drop >50%) achieves 61% precision at 44% recall.

**Cost of errors:** False negative (missed churn) costs ~$480 in lost annual revenue. False positive (unnecessary outreach) costs ~$8 per call.

## Success Metrics

| | Metric | Target | Rationale |
|---|---|---|---|
| Primary | Recall | > 70% | Missed churn costs 60x more than false outreach |
| Secondary | Precision at 70% recall | > 55% | Keep volume manageable for 4-person retention team |
| Business | Net saves per month | > 120 | Currently ~55 saves/month with rule-based approach |

## Data & Splitting

**Dataset:** 214,000 subscriber-months (Jan 2024 - Jun 2025), 34 features (usage, account, engagement), 8.3% positive class. A temporal split is required because churn is seasonal:

| Split | Period | Rows | Churn Rate |
|-------|--------|------|-----------|
| Train | Jan - Dec 2024 | 142,800 | 8.1% |
| Validation | Jan - Mar 2025 | 35,600 | 8.4% |
| Test | Apr - Jun 2025 | 35,600 | 8.6% |

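A temporal split like the one in the table can be sketched as a function that assigns each subscriber-month to a split by calendar date alone. This is an illustrative sketch: the boundary dates come from the table, while `assign_split` and the sample rows are hypothetical.

```python
from datetime import date

# Split boundaries taken from the table above.
VALIDATION_START = date(2025, 1, 1)
TEST_START = date(2025, 4, 1)


def assign_split(month: date) -> str:
    """Assign a subscriber-month to a split purely by calendar time."""
    if month < VALIDATION_START:
        return "train"
    if month < TEST_START:
        return "validation"
    return "test"


# Hypothetical rows: (subscriber_id, first day of the observed month).
rows = [
    ("u1", date(2024, 3, 1)),
    ("u1", date(2025, 2, 1)),
    ("u2", date(2025, 6, 1)),
]
splits = [assign_split(month) for _, month in rows]
print(splits)  # ['train', 'validation', 'test']
```

Because rows are assigned by time rather than at random, every validation and test row is later than every training row, which mirrors how the model would be scored in production.
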
## Model Comparison (Validation Set)

| Model | AUC-ROC | Precision @70% Recall | Latency | Training Time |
|-------|---------|----------------------|---------|----------|
| Logistic Regression | 0.79 | 0.48 | 0.3ms | 12s |
| Random Forest | 0.83 | 0.56 | 1.2ms | 45s |
| XGBoost | 0.87 | 0.63 | 0.8ms | 3 min |
| LightGBM | 0.86 | 0.61 | 0.6ms | 2 min |
| Rule-based (current) | 0.68 | 0.61* | n/a | n/a |

\* Precision at the rule-based system's 44% recall (see Problem Definition), not at 70% recall.

**Selected: XGBoost.** Highest AUC and precision at target recall. Test set results: AUC 0.85, precision 0.59 at 72% recall. The slight validation-to-test AUC drop (0.87 to 0.85) is within the expected range for temporal shift.

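The "precision at 70% recall" column can be computed by walking down the ranked predictions until the recall target is met. The sketch below is one plausible way to do it in plain Python, not necessarily how the figures above were produced; `precision_at_recall` and the toy labels and scores are hypothetical.

```python
def precision_at_recall(y_true, scores, target_recall):
    """Precision at the highest score threshold whose recall reaches target_recall.

    y_true: 0/1 labels; scores: predicted churn scores (higher = riskier).
    Tied scores are not grouped, a simplification for this sketch.
    """
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("need at least one positive label")
    # Rank subscribers from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    # Flag one more subscriber at each step until recall is high enough.
    for flagged, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / total_positives >= target_recall:
            return true_positives / flagged
    raise ValueError("target_recall must be at most 1.0")


# Toy data for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(precision_at_recall(y_true, scores, 0.70))  # 0.75
```

Fixing recall and comparing precision puts all models at the same operating point, which a threshold-free metric like AUC-ROC does not do on its own.
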
## Error Analysis

**False negatives:** Long-tenure users (>24mo) who churn abruptly after price increases, with no gradual usage decline to detect (23% of FNs). Annual plan users who disengage mid-cycle but cancel at renewal (18% of FNs).

**False positives:** Seasonal users whose low-activity periods mimic churn signals (31% of FPs). Users who contacted support about billing but resolved the issue (14% of FPs).

## Segment Performance

| Segment | Recall | Precision @70% | Gap |
|---------|--------|----------------|-----|
| Monthly plan | 74% | 0.62 | None |
| Annual plan | 58% | 0.41 | Under-served |
| Tenure < 6mo | 78% | 0.67 | None |
| Tenure > 24mo | 61% | 0.44 | Under-served |
| International users | 68% | 0.55 | 5-pt TPR gap vs. US |

For the annual-plan and long-tenure segments, the model relies on usage-decline features that manifest differently in these groups. The international TPR gap was traced to timezone and connectivity differences in usage features.

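Per-segment recall (true positive rate) and the gap between two segments can be computed from labeled predictions. This is a minimal sketch; `recall_by_segment` and the toy records are hypothetical and do not reproduce the StreamVault numbers.

```python
from collections import defaultdict


def recall_by_segment(records):
    """Recall (true positive rate) per segment.

    records: iterable of (segment, actual_churn, predicted_churn), 0/1 values.
    Segments with no actual churners are omitted from the result.
    """
    hits = defaultdict(int)      # churners the model flagged
    churners = defaultdict(int)  # all actual churners
    for segment, actual, predicted in records:
        if actual == 1:
            churners[segment] += 1
            hits[segment] += predicted
    return {segment: hits[segment] / churners[segment] for segment in churners}


# Toy records for illustration only.
records = [
    ("monthly", 1, 1), ("monthly", 1, 1), ("monthly", 1, 0), ("monthly", 0, 1),
    ("annual", 1, 1), ("annual", 1, 0), ("annual", 0, 0),
]
recalls = recall_by_segment(records)
gap = recalls["monthly"] - recalls["annual"]
print(recalls["annual"])  # 0.5
```

Comparing recall across segments at a shared threshold is what surfaces gaps such as the ones flagged in the table.
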
## Production Readiness

| Criterion | Status | Notes |
|-----------|--------|-------|
| Primary metric | PASS | 72% recall exceeds 70% target |
| Precision | PASS | 59% at target recall (target >55%) |
| Latency | PASS | Full 214K-row batch in <3 min |
| Bias assessment | CONDITIONAL | International TPR gap flagged; monitor, fix in v2 |
| Monitoring | DEFINED | Weekly AUC on rolling window; alert if <0.80 |
| Fallback | DEFINED | Revert to rules if AUC <0.78 for 2 consecutive weeks |
| A/B test | DEFINED | 50/50 split vs. rule-based, measure save rate over 8 weeks |

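The monitoring and fallback rows in the table amount to a small decision rule over the weekly AUC series. The sketch below uses the thresholds from the table; the function name `monitoring_action` and the sample AUC values are hypothetical.

```python
ALERT_THRESHOLD = 0.80     # alert if the latest weekly AUC is below this
FALLBACK_THRESHOLD = 0.78  # revert to rules if AUC stays below this ...
FALLBACK_WEEKS = 2         # ... for this many consecutive weeks


def monitoring_action(weekly_auc):
    """Return 'fallback', 'alert', or 'ok' for weekly AUC values, oldest first."""
    recent = weekly_auc[-FALLBACK_WEEKS:]
    if len(recent) == FALLBACK_WEEKS and all(
        auc < FALLBACK_THRESHOLD for auc in recent
    ):
        return "fallback"
    if weekly_auc and weekly_auc[-1] < ALERT_THRESHOLD:
        return "alert"
    return "ok"


# Hypothetical AUC histories.
print(monitoring_action([0.85, 0.84]))        # ok
print(monitoring_action([0.85, 0.79]))        # alert
print(monitoring_action([0.84, 0.77, 0.76]))  # fallback
print(monitoring_action([0.77, 0.81]))        # ok
```

The fallback check runs first so that two consecutive weeks below 0.78 trigger a revert rather than just another alert.
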
**Recommendation:** Approve for A/B test. Schedule v2 feature work for annual-plan and long-tenure blind spots.

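One common way to read out an A/B test on save rate is a two-proportion z-test. The sketch below is an assumption about how the comparison could be analyzed, not the team's stated method. It takes save rate to mean saves divided by subscribers in the arm, which the source does not define, and the counts in the example are hypothetical.

```python
import math


def two_proportion_z_test(saves_a, n_a, saves_b, n_b):
    """Two-sided z-test for a difference in save rates between two arms.

    Returns (z, p_value). Uses the pooled normal approximation, which is
    reasonable only when each arm has many saves and many non-saves.
    """
    rate_a, rate_b = saves_a / n_a, saves_b / n_b
    pooled = (saves_a + saves_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: arm A = rule-based, arm B = model.
z, p_value = two_proportion_z_test(saves_a=110, n_a=1000, saves_b=150, n_b=1000)
print(f"z={z:.2f}, p={p_value:.4f}")
```

A positive `z` with a small `p_value` would suggest the model arm saves a larger share of subscribers than the rule-based arm. The significance threshold should be fixed before the 8-week test starts.
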
©2026 ai-directory.company