Architecture Decision Record
Write Architecture Decision Records (ADRs) that capture context, options considered, decision rationale, and consequences — creating a searchable decision log for future engineers.
Tags: ADR, architecture, decisions, documentation, technical-decisions
$ npx skills add The-AI-Directory-Company/(…) --skill architecture-decision-record

architecture-decision-record/message-broker-selection.md
# ADR-0023: Use RabbitMQ for Order Processing Instead of Kafka

- **Status**: Accepted
- **Date**: 2025-11-08
- **Decision makers**: @priya (tech lead), @marcus (architect), @jen (platform eng)
- **Consulted**: Order fulfillment team, Data engineering team, SRE team

## 1. Context

The order processing service currently uses a PostgreSQL-backed job queue (pg-boss) to coordinate order state transitions: payment captured, inventory reserved, shipment created, confirmation sent. At current volume (~800 orders/hour), the queue works. However, projected holiday traffic will reach 5,000 orders/hour, and pg-boss is already the primary source of database connection pressure during peak loads.

We need a dedicated message broker to decouple order processing steps, handle burst traffic, and allow independent scaling of consumers. The solution must support at-least-once delivery, dead-letter handling, and per-consumer retry policies. The team has no prior experience operating a dedicated message broker in production.

Constraints: 90-day deadline before holiday traffic ramp, 3-person platform team, existing infrastructure runs on AWS ECS with Terraform, and total budget for new infrastructure is $2,000/month.

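For scale, a back-of-envelope estimate of the peak message rate, assuming one message per order per processing step (the four state transitions listed above):

```python
# Peak message-rate estimate, assuming one message per order per
# processing step (payment, inventory, shipping, notification).
orders_per_hour = 5_000      # projected holiday peak
steps_per_order = 4
peak_msgs_per_sec = orders_per_hour * steps_per_order / 3600
print(f"{peak_msgs_per_sec:.1f} msg/sec")  # 5.6 msg/sec
```

Even at 1.5x this rate, the broker handles single-digit messages per second, which frames the capacity claims in the options below.
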
## 2. Options Considered

### Option A: Do Nothing (keep pg-boss)

- **Pros**: No migration effort; team already knows it; no new infrastructure cost
- **Cons**: Database connection exhaustion at 3,000+ orders/hour based on load tests; no independent consumer scaling; couples queue health to database health
- **Estimated effort**: 0 weeks
- **Estimated cost**: $0 additional, but risk of outage during peak

### Option B: Apache Kafka (self-managed on ECS)

- **Pros**: High throughput (100K+ msg/sec); durable log allows replay; strong ecosystem for event sourcing
- **Cons**: Operational complexity (ZooKeeper/KRaft, partition management, offset tracking); 6-8 week ramp-up for team with no Kafka experience; minimum 3-broker cluster costs ~$1,800/month; over-engineered for 5,000 orders/hour
- **Estimated effort**: 8-10 weeks including learning curve
- **Estimated cost**: ~$1,800/month infrastructure

### Option C: Amazon SQS

- **Pros**: Zero operational overhead; pay-per-message pricing (~$40/month at our volume); native AWS integration
- **Cons**: No native routing/exchange patterns — requires one queue per consumer type; no built-in priority queues; 256KB message size limit; vendor lock-in
- **Estimated effort**: 3-4 weeks
- **Estimated cost**: ~$40/month

### Option D: RabbitMQ (Amazon MQ)

- **Pros**: Flexible routing via exchanges and bindings; built-in dead-letter queues and per-queue TTL; team can learn core concepts in days; Amazon MQ handles patching and failover; supports priority queues natively; the projected peak of 5,000 orders/hour is comfortably within single-node capacity
- **Cons**: Not designed for event log replay (messages are consumed and gone); Amazon MQ costs more than self-hosted (~$350/month for mq.m5.large); lower ceiling than Kafka at extreme scale
- **Estimated effort**: 4-5 weeks
- **Estimated cost**: ~$350/month (Amazon MQ)

## 3. Decision

**We will use RabbitMQ via Amazon MQ for order processing message brokering.**

RabbitMQ's exchange/binding model maps directly to our order processing topology: a single topic exchange routes order events to dedicated queues per processing step (payment, inventory, shipping, notification). Dead-letter exchanges handle failures without custom retry logic. Amazon MQ eliminates the operational burden that disqualified self-managed Kafka given our 3-person team and 90-day deadline.

We rejected Kafka because the operational complexity and learning curve exceed what the team can absorb in 90 days, and our throughput requirements (5,000 orders/hour at peak) do not justify it. We rejected SQS because the lack of routing primitives would force us to build exchange-like logic in application code. We rejected "do nothing" because load testing confirmed pg-boss failures above 3,000 orders/hour.

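To illustrate the topology, here is a minimal plain-Python sketch of AMQP topic-exchange routing semantics, where `*` matches exactly one dot-separated word and `#` matches zero or more. The binding keys shown (`order.payment.#` and so on) are hypothetical illustrations, not part of this decision:

```python
def topic_match(pattern: str, key: str) -> bool:
    """AMQP-style topic match: '*' = exactly one word, '#' = zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' may absorb zero or more words of the routing key
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if p[0] == "*" or p[0] == k[0]:
            return match(p[1:], k[1:])
        return False
    return match(pattern.split("."), key.split("."))

# Hypothetical bindings for the per-step queues named in the decision
bindings = {
    "payment":      "order.payment.#",
    "inventory":    "order.inventory.#",
    "shipping":     "order.shipping.#",
    "notification": "order.#",   # notification listens to all order events
}

event = "order.payment.captured"
routed_to = [q for q, pat in bindings.items() if topic_match(pat, event)]
print(routed_to)  # ['payment', 'notification']
```

With a real broker, the same effect comes from binding each queue to the topic exchange with its pattern; the broker then performs this matching on every published routing key.
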
## 4. Consequences

**Positive consequences**
- Order processing steps are decoupled — each consumer scales independently
- Dead-letter queues provide automatic failure isolation with visibility into poisoned messages
- Database connection pressure drops by ~40% once pg-boss polling is removed
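
The failure-isolation behavior can be sketched as a plain-Python simulation. The retry-then-dead-letter policy shown is an assumption for illustration; RabbitMQ itself dead-letters a message when it is rejected without requeue, its TTL expires, or its queue overflows:

```python
# Simulated consumer: retry a handler up to max_retries, then isolate
# the poisoned message in a dead-letter list instead of blocking the queue.
def consume(messages, handler, max_retries=3):
    dead_letter = []
    for msg in messages:
        for attempt in range(1, max_retries + 1):
            try:
                handler(msg)
                break
            except Exception:
                if attempt == max_retries:
                    dead_letter.append(msg)  # poisoned message isolated
    return dead_letter

poison = {"order_id": 42, "payload": None}
ok = {"order_id": 43, "payload": "{}"}

def handle(msg):
    if msg["payload"] is None:
        raise ValueError("unparseable payload")

dlq = consume([poison, ok], handle)
print([m["order_id"] for m in dlq])  # [42]
```
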

**Negative consequences**
- New infrastructure dependency; Amazon MQ uptime becomes the critical path for order processing
- No event replay capability — if we need event sourcing later, we will need a separate system
- Team must learn AMQP concepts (exchanges, bindings, acknowledgments, prefetch)

**Risks**
- Amazon MQ single-node failover takes 1-2 minutes; during that window, order events queue up in producers. **Mitigation**: implement a local retry buffer in the publisher with 5 minutes of capacity.
- If throughput grows beyond 50,000 msg/sec, RabbitMQ will need replacement. **Mitigation**: abstract the broker behind an interface; revisit at 20,000 msg/sec sustained.

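Both mitigations can be combined in one sketch: a minimal broker interface plus a publisher that buffers locally while the broker is unreachable. All names here (`Broker`, `BufferedPublisher`, `FlakyBroker`) are hypothetical illustrations, not the planned implementation:

```python
from collections import deque
from typing import Protocol

class Broker(Protocol):
    """Minimal broker abstraction so the transport can be swapped later."""
    def publish(self, routing_key: str, body: bytes) -> None: ...

class BufferedPublisher:
    """Buffers events locally when the broker is unavailable (e.g. during
    an Amazon MQ failover) and drains the buffer on the next publish."""
    def __init__(self, broker: Broker, max_buffered: int = 10_000):
        self._broker = broker
        self._buffer: deque = deque(maxlen=max_buffered)  # oldest dropped if full

    def publish(self, routing_key: str, body: bytes) -> None:
        self._buffer.append((routing_key, body))
        self.flush()

    def flush(self) -> None:
        while self._buffer:
            routing_key, body = self._buffer[0]
            try:
                self._broker.publish(routing_key, body)
            except ConnectionError:
                return          # broker still down; keep events buffered
            self._buffer.popleft()

# In-memory stand-in for the broker, failing the first two attempts
class FlakyBroker:
    def __init__(self, failures: int):
        self.failures, self.delivered = failures, []
    def publish(self, routing_key, body):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("broker unavailable")
        self.delivered.append((routing_key, body))

broker = FlakyBroker(failures=2)
pub = BufferedPublisher(broker)
pub.publish("order.payment.captured", b"{}")    # fails, stays buffered
pub.publish("order.inventory.reserved", b"{}")  # fails, both buffered
pub.publish("order.shipping.created", b"{}")    # succeeds; flush drains all 3
print(len(broker.delivered))  # 3
```

Because consumers and the scheduled flush only touch the `Broker` interface, replacing RabbitMQ later means writing one new adapter rather than rewriting publishers.
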
## 5. Follow-Up Actions

| Action | Owner | Deadline |
|--------|-------|----------|
| Provision Amazon MQ instance in staging via Terraform | @jen | 2025-11-15 |
| Implement order event publisher with local retry buffer | @priya | 2025-11-29 |
| Build payment, inventory, and shipping consumers | @marcus | 2025-12-13 |
| Load test at 7,500 orders/hour (1.5x projected peak) | @jen | 2025-12-20 |
| Migrate production traffic with pg-boss fallback | @priya | 2026-01-03 |
| Decommission pg-boss after 2-week parallel run | @marcus | 2026-01-17 |