
Architecture Decision Record

Write Architecture Decision Records (ADRs) that capture context, options considered, decision rationale, and consequences — creating a searchable decision log for future engineers.

Tags: ADR, architecture, decisions, documentation, technical-decisions

Works well with agents: CTO Advisor Agent, Enterprise Architect Agent

Works well with skills: System Design Document, Technical Spec Writing
$ npx skills add The-AI-Directory-Company/(…) --skill architecture-decision-record
architecture-decision-record/
  • examples/message-broker-selection.md (5.4 KB)
  • SKILL.md (5.4 KB)
architecture-decision-record/examples/message-broker-selection.md
# ADR-0023: Use RabbitMQ for Order Processing Instead of Kafka

- **Status**: Accepted
- **Date**: 2025-11-08
- **Decision makers**: @priya (tech lead), @marcus (architect), @jen (platform eng)
- **Consulted**: Order fulfillment team, Data engineering team, SRE team

## 1. Context

The order processing service currently uses a PostgreSQL-backed job queue (pg-boss) to coordinate order state transitions: payment captured, inventory reserved, shipment created, confirmation sent. At current volume (~800 orders/hour), the queue works. However, projected holiday traffic will reach 5,000 orders/hour, and pg-boss is already the primary source of database connection pressure during peak loads.

We need a dedicated message broker to decouple order processing steps, handle burst traffic, and allow independent scaling of consumers. The solution must support at-least-once delivery, dead-letter handling, and per-consumer retry policies. The team has no prior experience operating a dedicated message broker in production.

Constraints: 90-day deadline before holiday traffic ramp, 3-person platform team, existing infrastructure runs on AWS ECS with Terraform, and total budget for new infrastructure is $2,000/month.
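As a quick back-of-envelope check (assuming each order emits one message for each of the four processing steps listed above), projected peak message volume is modest:

```python
# Rough peak message rate for the broker.
# Assumption: one message per processing step
# (payment, inventory, shipment, confirmation).
PEAK_ORDERS_PER_HOUR = 5_000
EVENTS_PER_ORDER = 4

messages_per_hour = PEAK_ORDERS_PER_HOUR * EVENTS_PER_ORDER
messages_per_second = messages_per_hour / 3600

print(f"{messages_per_hour} msg/hour, ~{messages_per_second:.1f} msg/sec average")
# 20000 msg/hour, ~5.6 msg/sec average; short bursts will run higher.
```

Even with a generous burst multiplier, this average sits far below the capacity of any of the candidate brokers, which frames the options below as an operability and routing decision rather than a raw throughput one.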

## 2. Options Considered

### Option A: Do Nothing (keep pg-boss)

- **Pros**: No migration effort; team already knows it; no new infrastructure cost
- **Cons**: Database connection exhaustion at 3,000+ orders/hour based on load tests; no independent consumer scaling; couples queue health to database health
- **Estimated effort**: 0 weeks
- **Estimated cost**: $0 additional, but risk of outage during peak

### Option B: Apache Kafka (self-managed on ECS)

- **Pros**: High throughput (100K+ msg/sec); durable log allows replay; strong ecosystem for event sourcing
- **Cons**: Operational complexity (ZooKeeper/KRaft, partition management, offset tracking); 6-8 week ramp-up for team with no Kafka experience; minimum 3-broker cluster costs ~$1,800/month; over-engineered for 5,000 orders/hour
- **Estimated effort**: 8-10 weeks including learning curve
- **Estimated cost**: ~$1,800/month infrastructure

### Option C: Amazon SQS

- **Pros**: Zero operational overhead; pay-per-message pricing (~$40/month at our volume); native AWS integration
- **Cons**: No native routing/exchange patterns — requires one queue per consumer type; no built-in priority queues; 256KB message size limit; vendor lock-in
- **Estimated effort**: 3-4 weeks
- **Estimated cost**: ~$40/month

### Option D: RabbitMQ (Amazon MQ)

- **Pros**: Flexible routing via exchanges and bindings; built-in dead-letter queues and per-queue TTL; team can learn core concepts in days; Amazon MQ handles patching and failover; supports priority queues natively; our projected peak of 5,000 orders/hour is well within a single node's capacity
- **Cons**: Not designed for event log replay (messages are consumed and gone); Amazon MQ costs more than self-hosted (~$350/month for mq.m5.large); lower ceiling than Kafka at extreme scale
- **Estimated effort**: 4-5 weeks
- **Estimated cost**: ~$350/month (Amazon MQ)

## 3. Decision

**We will use RabbitMQ via Amazon MQ for order processing message brokering.**

RabbitMQ's exchange/binding model maps directly to our order processing topology: a single topic exchange routes order events to dedicated queues per processing step (payment, inventory, shipping, notification). Dead-letter exchanges handle failures without custom retry logic. Amazon MQ eliminates the operational burden that disqualified self-managed Kafka given our 3-person team and 90-day deadline.

We rejected Kafka because the operational complexity and learning curve exceed what the team can absorb in 90 days, and our throughput requirements (5,000 orders/hour peak) do not justify it. We rejected SQS because the lack of routing primitives would force us to build exchange-like logic in application code. We rejected "do nothing" because load testing confirmed pg-boss failures above 3,000 orders/hour.
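To make the routing model concrete, here is a small plain-Python simulation of AMQP topic matching against hypothetical bindings (queue names and routing keys are illustrative, not a prescribed production topology). In a topic exchange, `*` matches exactly one dot-separated word and `#` matches zero or more:

```python
from typing import Dict, List

def topic_matches(pattern: str, key: str) -> bool:
    """AMQP-style topic matching: words separated by '.',
    '*' matches exactly one word, '#' matches zero or more words."""
    p, k = pattern.split("."), key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may consume zero or more of the remaining words
            return any(match(i + 1, j2) for j2 in range(j, len(k) + 1))
        if j == len(k):
            return False
        return p[i] in ("*", k[j]) and match(i + 1, j + 1)

    return match(0, 0)

# Hypothetical bindings: one queue per processing step, plus an
# audit queue that receives every order event.
BINDINGS: Dict[str, str] = {
    "payment-queue": "order.payment.*",
    "inventory-queue": "order.inventory.*",
    "shipping-queue": "order.shipping.*",
    "notification-queue": "order.notification.*",
    "audit-queue": "order.#",
}

def route(routing_key: str) -> List[str]:
    """Queues a topic exchange would deliver this routing key to."""
    return sorted(q for q, pat in BINDINGS.items() if topic_matches(pat, routing_key))
```

For example, `route("order.payment.captured")` delivers to both `payment-queue` and `audit-queue`, which is the routing behavior SQS would have forced us to reimplement in application code.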

## 4. Consequences

**Positive consequences**
- Order processing steps are decoupled — each consumer scales independently
- Dead-letter queues provide automatic failure isolation with visibility into poisoned messages
- Database connection pressure drops by ~40% once pg-boss polling is removed

**Negative consequences**
- New infrastructure dependency; Amazon MQ uptime becomes critical path for order processing
- No event replay capability — if we need event sourcing later, we will need a separate system
- Team must learn AMQP concepts (exchanges, bindings, acknowledgments, prefetch)

**Risks**
- Amazon MQ single-node failover takes 1-2 minutes; during that window, order events queue in producers. **Mitigation**: implement a local retry buffer in the publisher with 5-minute capacity.
- If throughput grows beyond 50,000 msg/sec, RabbitMQ will need replacement. **Mitigation**: abstract the broker behind an interface; revisit at 20,000 msg/sec sustained.
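One way the publisher-side retry buffer from the first mitigation could look, as a minimal sketch: the real broker call is stubbed as a callable that raises `ConnectionError` during failover, and names and capacity are illustrative (in production, capacity would be sized to roughly five minutes of peak traffic):

```python
from collections import deque
from typing import Callable

class BufferedPublisher:
    """Publish with fallback: if the broker call fails, park the message
    in a bounded local buffer and retry it on later publishes."""

    def __init__(self, publish: Callable[[str, bytes], None], capacity: int = 2000):
        self._publish = publish  # wraps the real broker client call
        # Bounded buffer: when full, deque(maxlen=...) drops the oldest entry.
        self._buffer: deque = deque(maxlen=capacity)

    def send(self, routing_key: str, body: bytes) -> None:
        self._drain()  # retry anything parked during an earlier failure
        try:
            self._publish(routing_key, body)
        except ConnectionError:
            self._buffer.append((routing_key, body))  # park for later

    def _drain(self) -> None:
        while self._buffer:
            routing_key, body = self._buffer[0]
            try:
                self._publish(routing_key, body)
            except ConnectionError:
                return  # broker still down; keep buffering
            self._buffer.popleft()  # only discard after a successful publish

    @property
    def pending(self) -> int:
        return len(self._buffer)
```

Because the bounded deque silently drops its oldest entry when full, a production version would also emit a metric on buffer saturation so dropped messages are visible rather than silent.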

## 5. Follow-Up Actions

| Action | Owner | Deadline |
|--------|-------|----------|
| Provision Amazon MQ instance in staging via Terraform | @jen | 2025-11-15 |
| Implement order event publisher with local retry buffer | @priya | 2025-11-29 |
| Build payment, inventory, and shipping consumers | @marcus | 2025-12-13 |
| Load test at 7,500 orders/hour (1.5x projected peak) | @jen | 2025-12-20 |
| Migrate production traffic with pg-boss fallback | @priya | 2026-01-03 |
| Decommission pg-boss after 2-week parallel run | @marcus | 2026-01-17 |


©2026 ai-directory.company
