Cloud Cost Analysis

Analyze and optimize cloud infrastructure costs — identifying waste, right-sizing resources, evaluating reserved vs on-demand pricing, and producing savings roadmaps with ROI projections.

# Cloud Cost Analysis

## Before you start

Gather the following from the user:

1. Which cloud provider(s)? (AWS, GCP, Azure, or multi-cloud)
2. Current monthly spend (total and by service if available)
3. Cost breakdown access (billing console exports, Cost Explorer data, or CSV dumps)
4. Growth trajectory (expected traffic or workload changes over 6-12 months)
5. Commitment constraints (existing reserved instances, savings plans, or enterprise agreements)

If the user says "our cloud bill is too high," push back: "What's your current monthly spend, which services make up the top 80%, and do you have any existing reservations or savings plans?"

## Cost analysis template

### 1. Spend Summary

Break down current spend into the top cost categories. Cover at least 80% of total spend.

```
| Service          | Monthly Cost | % of Total | Trend (3mo) |
|------------------|--------------|------------|-------------|
| EC2 / Compute    | $42,300      | 38%        | +12%        |
| RDS / Databases  | $28,100      | 25%        | +5%         |
| S3 / Storage     | $15,200      | 14%        | +8%         |
| Data Transfer    | $9,800       | 9%         | +22%        |
| Other            | $15,600      | 14%        | flat        |
| Total            | $111,000     | 100%       | +9%         |
```

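The share and coverage figures can be checked with a few lines of Python. This is a minimal sketch using the example figures from the table above; the function name `spend_shares` is hypothetical, and the "Other" bucket is deliberately excluded so that coverage reflects only named services.

```python
from typing import Dict


def spend_shares(named_costs: Dict[str, float], total_spend: float) -> Dict[str, float]:
    """Return each named service's fraction of total monthly spend."""
    return {service: cost / total_spend for service, cost in named_costs.items()}


# Example figures from the spend summary table; "Other" is left out on purpose.
named_costs = {
    "EC2 / Compute": 42_300,
    "RDS / Databases": 28_100,
    "S3 / Storage": 15_200,
    "Data Transfer": 9_800,
}
total_spend = 111_000

shares = spend_shares(named_costs, total_spend)
coverage = sum(shares.values())

for service, share in shares.items():
    print(f"{service:<16} {share:6.1%}")
print(f"Named services cover {coverage:.1%} of spend (target: at least 80%)")
```

If coverage falls below 80%, split the "Other" bucket into further named services before continuing.
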
### 2. Waste Identification

Audit each category for idle, oversized, or orphaned resources. Use this checklist:

- Idle resources: Instances, load balancers, or databases with <5% average utilization over 14 days
- Orphaned storage: Unattached EBS volumes, old snapshots, unused S3 buckets
- Oversized instances: CPU/memory utilization consistently below 30% — candidates for right-sizing
- Zombie environments: Dev/staging environments running 24/7 that could use scheduling
- Unused reservations: Reserved capacity for instance types no longer in use

For each finding, document the resource, its current cost, and the estimated savings.

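The two utilization thresholds in the checklist can be applied mechanically once 14-day averages are exported. The sketch below is illustrative only: the `ResourceMetrics` record, the resource names, and the costs are hypothetical, and it treats a resource as oversized only when both CPU and memory sit below the 30% line.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_BELOW_PCT = 5.0        # checklist: <5% average utilization
OVERSIZED_BELOW_PCT = 30.0  # checklist: consistently below 30%


@dataclass
class ResourceMetrics:
    """Hypothetical record of 14-day averages for one resource."""
    name: str
    avg_cpu_pct: float
    avg_mem_pct: float
    monthly_cost: float


def classify(resource: ResourceMetrics) -> Optional[str]:
    """Return 'idle', 'oversized', or None when no finding applies."""
    busiest = max(resource.avg_cpu_pct, resource.avg_mem_pct)
    if busiest < IDLE_BELOW_PCT:
        return "idle"
    if busiest < OVERSIZED_BELOW_PCT:
        return "oversized"
    return None


# Made-up fleet data for illustration.
fleet = [
    ResourceMetrics("report-runner-old", 2.0, 4.0, 310.0),
    ResourceMetrics("queue-worker-2", 12.0, 28.0, 280.0),
    ResourceMetrics("search-prod", 55.0, 62.0, 900.0),
]

findings = [(r.name, classify(r), r.monthly_cost) for r in fleet if classify(r)]
for name, category, cost in findings:
    print(f"{name}: {category}, currently ${cost:,.0f}/month")
```

Orphaned storage, zombie environments, and unused reservations need inventory and billing data rather than utilization metrics, so they are not covered by this check.
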
### 3. Right-Sizing Recommendations

For every oversized resource, propose a specific target:

```
| Resource     | Current Type | Avg CPU | Avg Memory | Recommended | Monthly Savings |
|--------------|--------------|---------|------------|-------------|-----------------|
| api-prod-1   | m5.2xlarge   | 12%     | 28%        | m5.xlarge   | $120            |
| worker-batch | c5.4xlarge   | 8%      | 15%        | c5.xlarge   | $310            |
| analytics-db | r5.4xlarge   | 22%     | 45%        | r5.2xlarge  | $520            |
```

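Before committing to a target size, it helps to project what utilization would look like on the smaller instance. The sketch below assumes load stays constant and scales linearly with vCPU count and memory, which is a simplification. The 70% ceiling is a hypothetical safety margin, not a figure from this template, and the instance specs should be verified against current provider documentation.

```python
from typing import Dict, Tuple

# (vCPUs, memory in GiB). Verify against current EC2 documentation before use.
INSTANCE_SPECS: Dict[str, Tuple[int, int]] = {
    "c5.xlarge": (4, 8),
    "c5.4xlarge": (16, 32),
}


def projected_utilization(
    current_type: str, target_type: str, avg_cpu_pct: float, avg_mem_pct: float
) -> Tuple[float, float]:
    """Scale observed averages by the ratio of current to target capacity."""
    cur_cpu, cur_mem = INSTANCE_SPECS[current_type]
    tgt_cpu, tgt_mem = INSTANCE_SPECS[target_type]
    return (avg_cpu_pct * cur_cpu / tgt_cpu, avg_mem_pct * cur_mem / tgt_mem)


def fits(cpu_pct: float, mem_pct: float, ceiling_pct: float = 70.0) -> bool:
    """True when both projected figures stay under the (assumed) ceiling."""
    return cpu_pct <= ceiling_pct and mem_pct <= ceiling_pct


# worker-batch row from the table: c5.4xlarge at 8% CPU and 15% memory.
cpu, mem = projected_utilization("c5.4xlarge", "c5.xlarge", 8.0, 15.0)
print(f"Projected on c5.xlarge: CPU {cpu:.0f}%, memory {mem:.0f}%, fits={fits(cpu, mem)}")
```

Averages hide spikes, so a target that passes this check should still be compared against peak utilization before the change is made.
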
### 4. Pricing Model Optimization

Evaluate the mix of on-demand, reserved, savings plans, and spot:

- Stable baseline workloads: Recommend 1-year or 3-year reservations. Calculate the break-even point (typically 7-9 months for a 1-year RI).
- Variable workloads: Recommend savings plans with a commitment level matching the floor of historical usage.
- Fault-tolerant batch jobs: Recommend spot instances with interruption handling. Document the spot vs on-demand discount (typically 60-80%).
- Dev/test environments: Recommend scheduling (stop nights/weekends) or spot-based environments.

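The break-even and commitment-floor calculations can be sketched as follows. Break-even is taken here to mean the number of months of on-demand usage whose cost equals the total committed spend over the term. The dollar figures and the hourly spend series are hypothetical, and both function names are illustrative.

```python
from typing import Sequence


def break_even_months(
    on_demand_monthly: float, committed_monthly: float, term_months: int = 12
) -> float:
    """Months of on-demand usage that would cost as much as the full commitment."""
    total_commitment = committed_monthly * term_months
    return total_commitment / on_demand_monthly


def commitment_floor(hourly_spend: Sequence[float]) -> float:
    """Lowest observed hourly spend, used as a conservative savings plan commitment."""
    return min(hourly_spend)


# Hypothetical workload: $1,000/month on-demand, $650/month effective RI rate (35% off).
months = break_even_months(on_demand_monthly=1_000, committed_monthly=650)
print(f"Break-even after {months:.1f} months of a 12-month term")

# Hypothetical hourly compute spend samples in dollars.
floor = commitment_floor([4.2, 3.9, 5.1, 3.6, 4.8])
print(f"Commit to about ${floor:.2f}/hour")
```

If the workload is likely to be retired or re-architected before the break-even month, staying on-demand is the cheaper choice.
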
### 5. Architecture-Level Optimizations

Identify structural changes that reduce cost:

- Data transfer: Move cross-AZ traffic to same-AZ where possible. Use VPC endpoints instead of NAT gateways for AWS service calls.
- Storage tiering: Move infrequently accessed data to cheaper tiers (S3 Infrequent Access, Glacier, or equivalent).
- Compute model: Evaluate containers (ECS/EKS) vs VMs for better bin-packing and utilization.
- Caching: Add caching layers to reduce database and API call volume.
- Serverless migration: Identify low-traffic services where serverless would eliminate idle compute costs.

### 6. Savings Roadmap

Prioritize recommendations by effort and impact. Use this format:

```
| Priority | Action                           | Monthly Savings | Effort    | Timeline  |
|----------|----------------------------------|-----------------|-----------|-----------|
| P0       | Delete orphaned EBS volumes      | $1,200          | 1 day     | This week |
| P0       | Schedule dev environments        | $3,800          | 2 days    | This week |
| P1       | Right-size top 10 instances      | $4,500          | 1 week    | 2 weeks   |
| P1       | Purchase 1-year RIs for baseline | $8,200          | 1 day     | 30 days   |
| P2       | Migrate logs to S3 IA tier       | $2,100          | 1 sprint  | 60 days   |
| P2       | Move batch jobs to spot          | $5,600          | 2 sprints | 90 days   |
```

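Totals per priority tier feed directly into the ROI projection, so it is worth computing them rather than adding by hand. A small sketch using the example rows from the table above:

```python
from collections import defaultdict

# (priority, action, monthly savings in dollars), copied from the example roadmap.
roadmap = [
    ("P0", "Delete orphaned EBS volumes", 1_200),
    ("P0", "Schedule dev environments", 3_800),
    ("P1", "Right-size top 10 instances", 4_500),
    ("P1", "Purchase 1-year RIs for baseline", 8_200),
    ("P2", "Migrate logs to S3 IA tier", 2_100),
    ("P2", "Move batch jobs to spot", 5_600),
]

by_priority = defaultdict(int)
for priority, _action, savings in roadmap:
    by_priority[priority] += savings

# Running total as each tier is completed, in priority order.
running = {}
cumulative = 0
for priority in sorted(by_priority):
    cumulative += by_priority[priority]
    running[priority] = cumulative
    print(f"{priority}: ${by_priority[priority]:,}/month (cumulative ${cumulative:,}/month)")
```
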
### 7. ROI Projection

Summarize the total opportunity:

- Quick wins (0-2 weeks): Total monthly savings from P0 items
- Medium-term (1-3 months): Cumulative savings including P1 items
- Full realization (3-6 months): Total annual savings with all recommendations implemented
- Implementation cost: Engineering hours required, expressed in estimated cost

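A rough way to put savings and implementation cost side by side is sketched below. The $25,400 monthly figure is the sum of the example roadmap rows. The 312 engineering hours and the $100 hourly rate are placeholder assumptions (8-hour days, 5-day weeks, 2-week sprints, one engineer), and the calculation assumes the full savings rate applies from the first month, which overstates the first-year figure while items are still being rolled out.

```python
from typing import Dict


def roi_summary(monthly_savings: float, engineering_hours: float, hourly_rate: float) -> Dict[str, float]:
    """Compare the annual savings run rate against a one-off implementation cost."""
    implementation_cost = engineering_hours * hourly_rate
    annual_savings = monthly_savings * 12
    return {
        "annual_savings": annual_savings,
        "implementation_cost": implementation_cost,
        "payback_months": implementation_cost / monthly_savings,
        "first_year_net": annual_savings - implementation_cost,
    }


# Hours and rate are placeholders; replace them with the team's own estimates.
summary = roi_summary(monthly_savings=25_400, engineering_hours=312, hourly_rate=100)
for key, value in summary.items():
    print(f"{key}: {value:,.1f}")
```
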
## Quality checklist

Before delivering the analysis, verify:

- [ ] Top 80% of spend is broken down by service with 3-month trends
- [ ] Every waste finding references a specific resource or resource group
- [ ] Right-sizing recommendations include current utilization data
- [ ] Pricing model recommendations include break-even calculations
- [ ] Savings roadmap has priorities, effort estimates, and timelines
- [ ] ROI projection includes implementation cost, not just savings
- [ ] Recommendations account for existing commitments and growth trajectory

## Common mistakes to avoid

- Optimizing without utilization data. Right-sizing based on instance type alone is guessing. Always require at least 14 days of CPU/memory metrics before recommending a downsize.
- Ignoring data transfer costs. These are often the fastest-growing line item and the hardest to spot. Always check cross-AZ, cross-region, and internet egress charges.
- Recommending 3-year reservations without growth context. A 3-year RI saves more per month but locks you in. If the workload might migrate to containers or serverless, prefer 1-year or convertible RIs.
- Listing savings without effort estimates. "$50K/year savings" means nothing if it requires 6 months of engineering work. Always pair savings with implementation cost.
- Forgetting about non-production environments. Dev, staging, and QA environments often run 24/7 but are only used during business hours. Scheduling alone can cut their compute cost by roughly 65%; storage attached to stopped instances is still billed.

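The scheduling figure follows from simple arithmetic on weekly hours. The sketch below assumes a 12-hour weekday schedule, which is an illustrative choice rather than a recommendation from this template, and it covers compute charges only.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def schedule_savings(hours_on_per_day: float, days_on_per_week: int) -> float:
    """Fraction of always-on compute hours avoided by running on a schedule."""
    hours_on = hours_on_per_day * days_on_per_week
    return 1 - hours_on / HOURS_PER_WEEK


# Assumed schedule: 12 hours a day, Monday to Friday.
saving = schedule_savings(hours_on_per_day=12, days_on_per_week=5)
print(f"Compute hours avoided: {saving:.1%}")
```
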