Cloud Cost Analysis
Analyze and optimize cloud infrastructure costs — identifying waste, right-sizing resources, evaluating reserved vs on-demand pricing, and producing savings roadmaps with ROI projections.
Tags: cloud, cost-optimization, FinOps, AWS, infrastructure
# Cloud Cost Analysis

## Before you start

Gather the following from the user:

1. **Which cloud provider(s)?** (AWS, GCP, Azure, or multi-cloud)
2. **Current monthly spend** (total and by service if available)
3. **Cost breakdown access** (billing console exports, Cost Explorer data, or CSV dumps)
4. **Growth trajectory** (expected traffic or workload changes over 6-12 months)
5. **Commitment constraints** (existing reserved instances, savings plans, or enterprise agreements)

If the user says "our cloud bill is too high," push back: "What's your current monthly spend, which services make up the top 80%, and do you have any existing reservations or savings plans?"

## Cost analysis template

### 1. Spend Summary

Break down current spend into the top cost categories. Cover at least 80% of total spend.

```
| Service          | Monthly Cost | % of Total | Trend (3mo) |
|------------------|--------------|------------|-------------|
| EC2 / Compute    | $42,300      | 38%        | +12%        |
| RDS / Databases  | $28,100      | 25%        | +5%         |
| S3 / Storage     | $15,200      | 14%        | +8%         |
| Data Transfer    | $9,800       | 9%         | +22%        |
| Other            | $15,600      | 14%        | flat        |
| **Total**        | **$111,000** | **100%**   | **+9%**     |
```

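As a rough sketch of the 80% coverage rule, the Python below ranks services by cost and stops once the cumulative share reaches the target. The function name `spend_summary` and the dictionary input are assumptions for illustration; real input would come from a billing export.

```python
from typing import Dict, List, Tuple


def spend_summary(
    costs: Dict[str, float], coverage: float = 0.80
) -> List[Tuple[str, float, float]]:
    """Return (service, cost, share_of_total) rows, largest first,
    stopping once the rows cover at least `coverage` of total spend."""
    total = sum(costs.values())
    if total <= 0:
        raise ValueError("total spend must be positive")
    rows: List[Tuple[str, float, float]] = []
    covered = 0.0
    for service, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        rows.append((service, cost, cost / total))
        covered += cost
        if covered / total >= coverage:
            break
    return rows


# Monthly costs from the example table (hypothetical account).
example = {
    "EC2 / Compute": 42_300,
    "RDS / Databases": 28_100,
    "S3 / Storage": 15_200,
    "Data Transfer": 9_800,
    "Other": 15_600,
}

for service, cost, share in spend_summary(example):
    print(f"{service:<16} ${cost:>8,.0f} {share:6.1%}")
```

If a catch-all bucket such as "Other" lands inside the top 80%, that is a signal to break it down further before continuing.
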
### 2. Waste Identification

Audit each category for idle, oversized, or orphaned resources. Use this checklist:

- **Idle resources**: Instances, load balancers, or databases with <5% average utilization over 14 days
- **Orphaned storage**: Unattached EBS volumes, old snapshots, unused S3 buckets
- **Oversized instances**: CPU/memory utilization consistently below 30%, making them candidates for right-sizing
- **Zombie environments**: Dev/staging environments running 24/7 that could use scheduling
- **Unused reservations**: Reserved capacity for instance types no longer in use

For each finding, document the resource, its current cost, and the estimated savings.

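The idle and oversized thresholds in the checklist can be applied mechanically once daily utilization samples are available. The sketch below is one possible reading, not a definitive rule: `ResourceMetrics` and `classify` are hypothetical names, "idle" is judged on average CPU only, and "consistently below 30%" is interpreted as every daily sample staying under the threshold.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

IDLE_THRESHOLD_PCT = 5.0        # "<5% average utilization" from the checklist
OVERSIZED_THRESHOLD_PCT = 30.0  # "consistently below 30%" from the checklist
MIN_DAYS = 14                   # observation window from the checklist


@dataclass
class ResourceMetrics:
    name: str
    monthly_cost: float
    daily_cpu_pct: List[float]  # one average per day
    daily_mem_pct: List[float]  # one average per day


def classify(r: ResourceMetrics) -> str:
    """Label a resource as idle, oversized, ok, or insufficient-data."""
    if len(r.daily_cpu_pct) < MIN_DAYS or len(r.daily_mem_pct) < MIN_DAYS:
        return "insufficient-data"
    # Assumption: idle is decided on average CPU alone.
    if mean(r.daily_cpu_pct) < IDLE_THRESHOLD_PCT:
        return "idle"
    # Assumption: "consistently" means no daily sample reaches the threshold.
    if (max(r.daily_cpu_pct) < OVERSIZED_THRESHOLD_PCT
            and max(r.daily_mem_pct) < OVERSIZED_THRESHOLD_PCT):
        return "oversized"
    return "ok"


sample = ResourceMetrics("worker-batch", 496.0, [8.0] * 14, [15.0] * 14)
print(sample.name, classify(sample))
```

Returning "insufficient-data" instead of guessing keeps a short observation window from producing a downsize recommendation.
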
### 3. Right-Sizing Recommendations

For every oversized resource, propose a specific target that still fits current usage, on memory as well as CPU:

```
| Resource         | Current Type | Avg CPU | Avg Memory | Recommended | Monthly Savings |
|------------------|--------------|---------|------------|-------------|-----------------|
| api-prod-1       | m5.2xlarge   | 12%     | 28%        | m5.xlarge   | $120            |
| worker-batch     | c5.4xlarge   | 8%      | 15%        | c5.xlarge   | $310            |
| analytics-db     | r5.4xlarge   | 22%     | 45%        | r5.2xlarge  | $520            |
```

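Before proposing a target, it helps to project current utilization onto the smaller instance. The sketch below assumes usage scales linearly with vCPU count and memory size, which is only an approximation; `projected_utilization`, `fits`, and the 70% ceiling are hypothetical choices, and the spec table lists only a few instance types.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InstanceSpec:
    vcpus: int
    memory_gib: float


# vCPU and memory for a few EC2 instance types; extend as needed.
SPECS = {
    "m5.large": InstanceSpec(2, 8),
    "m5.xlarge": InstanceSpec(4, 16),
    "m5.2xlarge": InstanceSpec(8, 32),
    "c5.xlarge": InstanceSpec(4, 8),
    "c5.4xlarge": InstanceSpec(16, 32),
}


def projected_utilization(
    current: str, target: str, avg_cpu_pct: float, avg_mem_pct: float
) -> Tuple[float, float]:
    """Scale observed utilization to the target size (linear assumption)."""
    cur, tgt = SPECS[current], SPECS[target]
    cpu = avg_cpu_pct * cur.vcpus / tgt.vcpus
    mem = avg_mem_pct * cur.memory_gib / tgt.memory_gib
    return cpu, mem


def fits(
    current: str, target: str, avg_cpu_pct: float, avg_mem_pct: float,
    ceiling_pct: float = 70.0,  # hypothetical headroom policy
) -> bool:
    """True if both projected CPU and memory stay at or under the ceiling."""
    cpu, mem = projected_utilization(current, target, avg_cpu_pct, avg_mem_pct)
    return cpu <= ceiling_pct and mem <= ceiling_pct


cpu, mem = projected_utilization("c5.4xlarge", "c5.xlarge", 8, 15)
print(f"worker-batch on c5.xlarge: CPU {cpu:.0f}%, memory {mem:.0f}%")
```

For the `worker-batch` row, this projects 32% CPU and 60% memory on `c5.xlarge`. Averages hide peaks, so the same check is worth repeating with peak or p95 utilization where that data exists.
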
### 4. Pricing Model Optimization

Evaluate the mix of on-demand, reserved, savings plans, and spot:

- **Stable baseline workloads**: Recommend 1-year or 3-year reservations. Calculate the break-even point (typically 7-9 months for an all-upfront 1-year RI).
- **Variable workloads**: Recommend savings plans with a commitment level matching the floor of historical usage.
- **Fault-tolerant batch jobs**: Recommend spot instances with interruption handling. Document the spot vs on-demand discount (typically 60-80%).
- **Dev/test environments**: Recommend scheduling (stop nights/weekends) or spot-based environments.

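The break-even and commitment-floor calculations can be sketched as follows. The prices are placeholders rather than quotes from any provider, and both function names are hypothetical; the break-even here is the month at which cumulative on-demand cost would have matched the upfront payment plus any recurring reservation charge.

```python
from typing import Sequence


def ri_break_even_months(
    on_demand_monthly: float, upfront: float, ri_monthly: float = 0.0
) -> float:
    """Months until the reservation has paid for itself versus on-demand."""
    monthly_saving = on_demand_monthly - ri_monthly
    if monthly_saving <= 0:
        raise ValueError("reservation must be cheaper per month than on-demand")
    return upfront / monthly_saving


def savings_plan_commitment(hourly_spend_history: Sequence[float]) -> float:
    """Commit at the floor of observed hourly spend so the plan is always used."""
    if not hourly_spend_history:
        raise ValueError("need at least one observation")
    return min(hourly_spend_history)


# Placeholder figures: $100/month on-demand, $720 all-upfront for one year.
months = ri_break_even_months(on_demand_monthly=100.0, upfront=720.0)
print(f"break-even after {months:.1f} months")

# Placeholder hourly compute spend samples in dollars.
print(f"commit ${savings_plan_commitment([4.0, 6.5, 5.2]):.2f}/hour")
```

If the workload is expected to shrink or move before the break-even month, the reservation loses money, which is why the growth trajectory gathered up front matters here.
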
### 5. Architecture-Level Optimizations

Identify structural changes that reduce cost:

- **Data transfer**: Move cross-AZ traffic to same-AZ where possible. Use VPC endpoints instead of NAT gateways for AWS service calls.
- **Storage tiering**: Move infrequently accessed data to cheaper tiers (S3 Infrequent Access, Glacier, or equivalent).
- **Compute model**: Evaluate containers (ECS/EKS) vs VMs for better bin-packing and utilization.
- **Caching**: Add caching layers to reduce database and API call volume.
- **Serverless migration**: Identify low-traffic services where serverless would eliminate idle compute costs.

### 6. Savings Roadmap

Prioritize recommendations by effort and impact. Use this format:

```
| Priority | Action                           | Monthly Savings | Effort    | Timeline  |
|----------|----------------------------------|-----------------|-----------|-----------|
| P0       | Delete orphaned EBS volumes      | $1,200          | 1 day     | This week |
| P0       | Schedule dev environments        | $3,800          | 2 days    | This week |
| P1       | Right-size top 10 instances      | $4,500          | 1 week    | 2 weeks   |
| P1       | Purchase 1-year RIs for baseline | $8,200          | 1 day     | 30 days   |
| P2       | Migrate logs to S3 IA tier       | $2,100          | 1 sprint  | 60 days   |
| P2       | Move batch jobs to spot          | $5,600          | 2 sprints | 90 days   |
```

### 7. ROI Projection

Summarize the total opportunity:

- **Quick wins (0-2 weeks)**: Total monthly savings from P0 items
- **Medium-term (1-3 months)**: Cumulative savings including P1 items
- **Full realization (3-6 months)**: Total annual savings with all recommendations implemented
- **Implementation cost**: Engineering hours required, expressed in estimated cost

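The tiers above can be computed directly from the roadmap rows. This is a minimal sketch using the example roadmap figures; `Action`, `cumulative_monthly_savings`, and `payback_months` are hypothetical names, and the $50,800 implementation cost is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Action:
    priority: str  # "P0", "P1", "P2"; sorted as strings, so single digits only
    name: str
    monthly_savings: float


def cumulative_monthly_savings(actions: List[Action]) -> Dict[str, float]:
    """Monthly savings once each tier and all earlier tiers are implemented."""
    per_tier: Dict[str, float] = {}
    for a in actions:
        per_tier[a.priority] = per_tier.get(a.priority, 0.0) + a.monthly_savings
    running = 0.0
    result: Dict[str, float] = {}
    for priority in sorted(per_tier):
        running += per_tier[priority]
        result[priority] = running
    return result


def payback_months(implementation_cost: float, monthly_savings: float) -> float:
    """Months of savings needed to recover the engineering cost."""
    if monthly_savings <= 0:
        raise ValueError("monthly savings must be positive")
    return implementation_cost / monthly_savings


ROADMAP = [
    Action("P0", "Delete orphaned EBS volumes", 1_200),
    Action("P0", "Schedule dev environments", 3_800),
    Action("P1", "Right-size top 10 instances", 4_500),
    Action("P1", "Purchase 1-year RIs for baseline", 8_200),
    Action("P2", "Migrate logs to S3 IA tier", 2_100),
    Action("P2", "Move batch jobs to spot", 5_600),
]

tiers = cumulative_monthly_savings(ROADMAP)
for priority, monthly in tiers.items():
    print(f"through {priority}: ${monthly:,.0f}/month")
print(f"annual at full realization: ${tiers['P2'] * 12:,.0f}")
print(f"payback: {payback_months(50_800, tiers['P2']):.1f} months")  # placeholder cost
```

For the example roadmap this gives $5,000 per month from P0 items, $17,700 including P1, and $25,400 with everything implemented, or $304,800 per year.
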
## Quality checklist

Before delivering the analysis, verify:

- [ ] Top 80% of spend is broken down by service with 3-month trends
- [ ] Every waste finding references a specific resource or resource group
- [ ] Right-sizing recommendations include current utilization data
- [ ] Pricing model recommendations include break-even calculations
- [ ] Savings roadmap has priorities, effort estimates, and timelines
- [ ] ROI projection includes implementation cost, not just savings
- [ ] Recommendations account for existing commitments and growth trajectory

## Common mistakes to avoid

- **Optimizing without utilization data.** Right-sizing based on instance type alone is guessing. Always require at least 14 days of CPU/memory metrics before recommending a downsize.
- **Ignoring data transfer costs.** These are often the fastest-growing line item and the hardest to spot. Always check cross-AZ, cross-region, and internet egress charges.
- **Recommending 3-year reservations without growth context.** A 3-year RI saves more per month but locks you in. If the workload might migrate to containers or serverless, prefer 1-year or convertible RIs.
- **Listing savings without effort estimates.** "$50K/year savings" means nothing if it requires 6 months of engineering work. Always pair savings with implementation cost.
- **Forgetting about non-production environments.** Dev, staging, and QA environments often run 24/7 but are only used during business hours. Scheduling alone can cut their compute cost by roughly 65%: running 12 hours a day on weekdays is 60 of 168 weekly hours.

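The scheduling figure for non-production environments follows from simple arithmetic on weekly hours. The sketch below assumes cost is proportional to running hours, which holds for compute but not for attached storage that keeps billing while instances are stopped; `scheduling_savings_fraction` is a hypothetical helper.

```python
HOURS_PER_WEEK = 24 * 7


def scheduling_savings_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of always-on compute cost avoided by running only on a schedule."""
    if not 0 <= hours_per_day <= 24 or not 0 <= days_per_week <= 7:
        raise ValueError("schedule is outside a 24x7 week")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


for hours, days in [(12, 5), (10, 5), (8, 5)]:
    saved = scheduling_savings_fraction(hours, days)
    print(f"{hours}h x {days}d: {saved:.0%} of compute cost avoided")
```

A tighter schedule saves more, so the figure quoted to the user should come from the hours the team actually works.
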