Architecture Decision Record
Write Architecture Decision Records (ADRs) that capture context, options considered, decision rationale, and consequences — creating a searchable decision log for future engineers.
Tags: ADR, architecture, decisions, documentation, technical-decisions
$ npx skills add The-AI-Directory-Company/(…) --skill architecture-decision-record

architecture-decision-record/message-broker-selection.md
# ADR-0023: Use RabbitMQ for Order Processing Instead of Kafka

- **Status**: Accepted
- **Date**: 2025-11-08
- **Decision makers**: @priya (tech lead), @marcus (architect), @jen (platform eng)
- **Consulted**: Order fulfillment team, Data engineering team, SRE team

## 1. Context

The order processing service currently uses a PostgreSQL-backed job queue (pg-boss) to coordinate order state transitions: payment captured, inventory reserved, shipment created, confirmation sent. At current volume (~800 orders/hour), the queue works. However, projected holiday traffic will reach 5,000 orders/hour, and pg-boss is already the primary source of database connection pressure during peak loads.

We need a dedicated message broker to decouple order processing steps, handle burst traffic, and allow independent scaling of consumers. The solution must support at-least-once delivery, dead-letter handling, and per-consumer retry policies. The team has no prior experience operating a dedicated message broker in production.

Constraints: 90-day deadline before holiday traffic ramp, 3-person platform team, existing infrastructure runs on AWS ECS with Terraform, and total budget for new infrastructure is $2,000/month.

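For scale, a back-of-envelope estimate of the peak message rate, assuming one message per order per processing step (the four state transitions listed above):

```python
# Peak message-rate estimate, assuming one message per order per
# processing step (payment, inventory, shipping, notification).
orders_per_hour = 5_000      # projected holiday peak
steps_per_order = 4
peak_msgs_per_sec = orders_per_hour * steps_per_order / 3600
print(f"{peak_msgs_per_sec:.1f} msg/sec")  # 5.6 msg/sec
```

Even at 1.5x this rate, the broker handles single-digit messages per second, which frames the capacity claims in the options below.
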
## 2. Options Considered

### Option A: Do Nothing (keep pg-boss)

- **Pros**: No migration effort; team already knows it; no new infrastructure cost
- **Cons**: Database connection exhaustion at 3,000+ orders/hour based on load tests; no independent consumer scaling; couples queue health to database health
- **Estimated effort**: 0 weeks
- **Estimated cost**: $0 additional, but risk of outage during peak

### Option B: Apache Kafka (self-managed on ECS)

- **Pros**: High throughput (100K+ msg/sec); durable log allows replay; strong ecosystem for event sourcing
- **Cons**: Operational complexity (ZooKeeper/KRaft, partition management, offset tracking); 6-8 week ramp-up for team with no Kafka experience; minimum 3-broker cluster costs ~$1,800/month; over-engineered for 5,000 orders/hour
- **Estimated effort**: 8-10 weeks including learning curve
- **Estimated cost**: ~$1,800/month infrastructure

### Option C: Amazon SQS

- **Pros**: Zero operational overhead; pay-per-message pricing (~$40/month at our volume); native AWS integration
- **Cons**: No native routing/exchange patterns — requires one queue per consumer type; no built-in priority queues; 256KB message size limit; vendor lock-in
- **Estimated effort**: 3-4 weeks
- **Estimated cost**: ~$40/month

### Option D: RabbitMQ (Amazon MQ)

- **Pros**: Flexible routing via exchanges and bindings; built-in dead-letter queues and per-queue TTL; team can learn core concepts in days; Amazon MQ handles patching and failover; supports priority queues natively; the projected peak of 5,000 orders/hour is comfortably within single-node capacity
- **Cons**: Not designed for event log replay (messages are consumed and gone); Amazon MQ costs more than self-hosted (~$350/month for mq.m5.large); lower ceiling than Kafka at extreme scale
- **Estimated effort**: 4-5 weeks
- **Estimated cost**: ~$350/month (Amazon MQ)

## 3. Decision

**We will use RabbitMQ via Amazon MQ for order processing message brokering.**

RabbitMQ's exchange/binding model maps directly to our order processing topology: a single topic exchange routes order events to dedicated queues per processing step (payment, inventory, shipping, notification). Dead-letter exchanges handle failures without custom retry logic. Amazon MQ eliminates the operational burden that disqualified self-managed Kafka given our 3-person team and 90-day deadline.

We rejected Kafka because the operational complexity and learning curve exceed what the team can absorb in 90 days, and our throughput requirements (5,000 orders/hour at peak) do not justify it. We rejected SQS because the lack of routing primitives would force us to build exchange-like logic in application code. We rejected "do nothing" because load testing confirmed pg-boss failures above 3,000 orders/hour.

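To illustrate the topology, here is a minimal plain-Python sketch of AMQP topic-exchange routing semantics, where `*` matches exactly one dot-separated word and `#` matches zero or more. The binding keys shown (`order.payment.#` and so on) are hypothetical illustrations, not part of this decision:

```python
def topic_match(pattern: str, key: str) -> bool:
    """AMQP-style topic match: '*' = exactly one word, '#' = zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' may absorb zero or more words of the routing key
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if p[0] == "*" or p[0] == k[0]:
            return match(p[1:], k[1:])
        return False
    return match(pattern.split("."), key.split("."))

# Hypothetical bindings for the per-step queues named in the decision
bindings = {
    "payment":      "order.payment.#",
    "inventory":    "order.inventory.#",
    "shipping":     "order.shipping.#",
    "notification": "order.#",   # notification listens to all order events
}

event = "order.payment.captured"
routed_to = [q for q, pat in bindings.items() if topic_match(pat, event)]
print(routed_to)  # ['payment', 'notification']
```

With a real broker, the same effect comes from binding each queue to the topic exchange with its pattern; the broker then performs this matching on every published routing key.
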
## 4. Consequences

**Positive consequences**
- Order processing steps are decoupled — each consumer scales independently
- Dead-letter queues provide automatic failure isolation with visibility into poisoned messages
- Database connection pressure drops by ~40% once pg-boss polling is removed
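
The failure-isolation behavior can be sketched as a plain-Python simulation. The retry-then-dead-letter policy shown is an assumption for illustration; RabbitMQ itself dead-letters a message when it is rejected without requeue, its TTL expires, or its queue overflows:

```python
# Simulated consumer: retry a handler up to max_retries, then isolate
# the poisoned message in a dead-letter list instead of blocking the queue.
def consume(messages, handler, max_retries=3):
    dead_letter = []
    for msg in messages:
        for attempt in range(1, max_retries + 1):
            try:
                handler(msg)
                break
            except Exception:
                if attempt == max_retries:
                    dead_letter.append(msg)  # poisoned message isolated
    return dead_letter

poison = {"order_id": 42, "payload": None}
ok = {"order_id": 43, "payload": "{}"}

def handle(msg):
    if msg["payload"] is None:
        raise ValueError("unparseable payload")

dlq = consume([poison, ok], handle)
print([m["order_id"] for m in dlq])  # [42]
```
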

**Negative consequences**
- New infrastructure dependency; Amazon MQ uptime becomes the critical path for order processing
- No event replay capability — if we need event sourcing later, we will need a separate system
- Team must learn AMQP concepts (exchanges, bindings, acknowledgments, prefetch)

**Risks**
- Amazon MQ single-node failover takes 1-2 minutes; during that window, order events queue up in producers. **Mitigation**: implement a local retry buffer in the publisher with 5 minutes of capacity.
- If throughput grows beyond 50,000 msg/sec, RabbitMQ will need replacement. **Mitigation**: abstract the broker behind an interface; revisit at 20,000 msg/sec sustained.

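Both mitigations can be combined in one sketch: a minimal broker interface plus a publisher that buffers locally while the broker is unreachable. All names here (`Broker`, `BufferedPublisher`, `FlakyBroker`) are hypothetical illustrations, not the planned implementation:

```python
from collections import deque
from typing import Protocol

class Broker(Protocol):
    """Minimal broker abstraction so the transport can be swapped later."""
    def publish(self, routing_key: str, body: bytes) -> None: ...

class BufferedPublisher:
    """Buffers events locally when the broker is unavailable (e.g. during
    an Amazon MQ failover) and drains the buffer on the next publish."""
    def __init__(self, broker: Broker, max_buffered: int = 10_000):
        self._broker = broker
        self._buffer: deque = deque(maxlen=max_buffered)  # oldest dropped if full

    def publish(self, routing_key: str, body: bytes) -> None:
        self._buffer.append((routing_key, body))
        self.flush()

    def flush(self) -> None:
        while self._buffer:
            routing_key, body = self._buffer[0]
            try:
                self._broker.publish(routing_key, body)
            except ConnectionError:
                return          # broker still down; keep events buffered
            self._buffer.popleft()

# In-memory stand-in for the broker, failing the first two attempts
class FlakyBroker:
    def __init__(self, failures: int):
        self.failures, self.delivered = failures, []
    def publish(self, routing_key, body):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("broker unavailable")
        self.delivered.append((routing_key, body))

broker = FlakyBroker(failures=2)
pub = BufferedPublisher(broker)
pub.publish("order.payment.captured", b"{}")    # fails, stays buffered
pub.publish("order.inventory.reserved", b"{}")  # fails, both buffered
pub.publish("order.shipping.created", b"{}")    # succeeds; flush drains all 3
print(len(broker.delivered))  # 3
```

Because consumers and the scheduled flush only touch the `Broker` interface, replacing RabbitMQ later means writing one new adapter rather than rewriting publishers.
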
## 5. Follow-Up Actions

| Action | Owner | Deadline |
|--------|-------|----------|
| Provision Amazon MQ instance in staging via Terraform | @jen | 2025-11-15 |
| Implement order event publisher with local retry buffer | @priya | 2025-11-29 |
| Build payment, inventory, and shipping consumers | @marcus | 2025-12-13 |
| Load test at 7,500 orders/hour (1.5x projected peak) | @jen | 2025-12-20 |
| Migrate production traffic with pg-boss fallback | @priya | 2026-01-03 |
| Decommission pg-boss after 2-week parallel run | @marcus | 2026-01-17 |