
Lessons From Processing a Million Transactions a Day

engineering · fintech · architecture

When our platform crossed the million-transactions-per-day threshold, it wasn't a celebration. It was a relief. We'd spent weeks preparing — profiling, load testing, optimizing — and the system held.

But getting there taught me lessons that no amount of theoretical scaling knowledge could have provided. Here's what I learned from actually doing it.

Lesson 1: Profile Before You Optimize

Our first instinct when performance started degrading was to optimize the database queries. They were the obvious bottleneck — the dashboard showed increasing query times, and the engineers were confident they knew which queries were slow.

They were wrong. When we actually profiled the system end-to-end, the primary bottleneck was JSON serialization in our API layer. We were converting large objects to JSON on every response, and at scale, those milliseconds per request added up to massive cumulative overhead.

The fix was simple — response caching and more efficient serialization — but we would have spent weeks optimizing queries that weren't the real problem without profiling first.

Rule: always profile, never assume.
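As a concrete illustration, here is a minimal end-to-end profiling sketch using Python's built-in cProfile. The handler and payload are hypothetical stand-ins, not our actual code — the point is that the profile surfaces the serialization cost you might otherwise never suspect.

```python
import cProfile
import io
import json
import pstats

def build_response():
    # Stand-in for an API handler serializing a large response object.
    payload = {"transactions": [{"id": i, "amount": i * 0.01} for i in range(10_000)]}
    return json.dumps(payload)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(50):
    build_response()
profiler.disable()

# Print the top functions by cumulative time; json.dumps shows up
# prominently even though no database is involved at all.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Running something like this against a real request path, rather than guessing from dashboards, is what pointed us at serialization instead of queries.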

Lesson 2: Batching Changes Everything

Individual database operations that are fast at low volume become untenable at high volume. Writing one record takes 5ms. Writing a million records at 5ms each takes over an hour.

Batching was the single biggest performance improvement we made. Instead of processing transactions individually, we accumulated them in memory and wrote them in batches of 1,000. Database operations that took hours completed in minutes.

The trade-off is complexity. Batching introduces questions about error handling (what if one record in the batch fails?), ordering (does processing order matter?), and latency (how long do you buffer before flushing?).

We solved these with dead letter queues for failed records, sequence numbers for ordering, and configurable batch windows with maximum size limits.
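The shape of that solution can be sketched in a few lines. This is a simplified in-memory model — the batch size, window, and the "database" and dead letter queue are illustrative stand-ins, not our production values or infrastructure.

```python
import time

BATCH_SIZE = 1_000            # illustrative maximum batch size
BATCH_WINDOW_SECONDS = 0.5    # illustrative flush window

written, dead_letter = [], []
buffer, last_flush = [], time.monotonic()

def write_batch(batch):
    for record in batch:
        try:
            # Stand-in for a real bulk insert; reject malformed records.
            if record.get("amount") is None:
                raise ValueError("missing amount")
            written.append(record)
        except ValueError:
            # One bad record must not fail the whole batch:
            # route it to the dead letter queue for inspection/retry.
            dead_letter.append(record)

def submit(record):
    global last_flush
    buffer.append(record)
    # Flush on size OR elapsed time, whichever comes first.
    if len(buffer) >= BATCH_SIZE or time.monotonic() - last_flush >= BATCH_WINDOW_SECONDS:
        write_batch(buffer)
        buffer.clear()
        last_flush = time.monotonic()

# Sequence numbers preserve ordering across batches.
for i in range(2_500):
    submit({"seq": i, "amount": None if i == 7 else i * 0.01})

# Flush whatever is still buffered at shutdown.
write_batch(buffer)
buffer.clear()
```

The dual flush trigger (size or time) is what keeps worst-case latency bounded: a trickle of traffic still gets written within the window instead of sitting in the buffer indefinitely.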

Lesson 3: Connection Pools Are Not Infinite

At low volume, every request gets its own database connection and life is good. At high volume, you run out of connections. And when you run out, requests start queuing, timeouts start cascading, and the whole system degrades.

We learned to treat connection pools as a precious shared resource:

  • Right-size the pool based on actual concurrency, not theoretical maximums
  • Set connection timeouts so requests fail fast rather than waiting indefinitely
  • Monitor pool utilization and alert when it exceeds 70%
  • Use read replicas to distribute query load away from the primary
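A toy model of the first two points — a bounded pool with fail-fast checkout — looks roughly like this. The pool size, timeout, and string "connections" are illustrative; a real pool hands out actual database connections.

```python
import queue

class ConnectionPool:
    def __init__(self, size, checkout_timeout):
        self._conns = queue.Queue(maxsize=size)
        self._timeout = checkout_timeout
        for i in range(size):
            self._conns.put(f"conn-{i}")  # stand-in for real connections

    def acquire(self):
        try:
            # Fail fast instead of letting requests queue indefinitely.
            return self._conns.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("connection pool exhausted")

    def release(self, conn):
        self._conns.put(conn)

    def utilization(self):
        # The number we monitor and alert on past 70%.
        return 1 - self._conns.qsize() / self._conns.maxsize

pool = ConnectionPool(size=10, checkout_timeout=0.05)
held = [pool.acquire() for _ in range(8)]
print(f"utilization: {pool.utilization():.0%}")  # prints "utilization: 80%"
```

The key behavior is that an exhausted pool raises immediately rather than stalling: a fast failure is a signal you can act on, while a slow timeout cascades into every caller upstream.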

Lesson 4: Asynchronous Processing Is Your Friend

Not everything needs to happen in real time. The user who submits a transaction doesn't need to wait for the audit log entry, the notification email, and the analytics event. They need confirmation that their transaction was accepted.

We moved everything non-critical to asynchronous processing. The request path became: validate, persist, respond. Everything else — notifications, audit logging, analytics, reconciliation — happened asynchronously via message queues.

This reduced our p99 response time by 60% and decoupled our user-facing performance from the performance of downstream systems.
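The validate → persist → respond path can be sketched as follows, with a thread and an in-memory queue standing in for a real message broker. All names here are illustrative.

```python
import queue
import threading

events = queue.Queue()
db, audit_log = [], []

def worker():
    # Stand-in consumer for audit logging / notifications / analytics.
    while True:
        event = events.get()
        if event is None:
            break
        audit_log.append(event)
        events.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_transaction(txn):
    if txn["amount"] <= 0:                               # 1. validate
        return {"status": "rejected"}
    db.append(txn)                                       # 2. persist
    events.put({"type": "audited", "id": txn["id"]})     # enqueue side effects
    return {"status": "accepted"}                        # 3. respond immediately

print(handle_transaction({"id": 1, "amount": 42.0}))
events.join()  # only for the sketch: wait so we can inspect the audit log
```

Notice that `handle_transaction` returns before the audit entry is written. The user's latency depends only on validation and the primary write; everything downstream can be slow, retried, or briefly down without the user noticing.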

Lesson 5: Monitoring Must Scale Too

Our monitoring worked great at 10,000 transactions per day. At a million, we were drowning in metrics data, our dashboards were slow to load, and our alert rules were firing constantly because our thresholds were calibrated for a smaller scale.

Scaling the monitoring meant:

  • Aggregating metrics instead of tracking individual transactions
  • Percentile-based alerting (p99 latency) instead of average-based
  • Sampling for distributed traces — you don't need to trace every request
  • Tiered dashboards — high-level system health for daily monitoring, detailed drill-downs for investigation
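Trace sampling, for example, can be as simple as a per-request coin flip. This sketch seeds the decision from the request ID (an assumption, not our exact scheme) so that every service handling the same request makes the same keep-or-drop choice.

```python
import random

SAMPLE_RATE = 0.01  # illustrative: trace roughly 1% of requests

def should_trace(request_id, rate=SAMPLE_RATE):
    # Deterministic per-request sampling: seeding from the request ID
    # means all spans of one request agree on whether it is traced.
    rng = random.Random(request_id)
    return rng.random() < rate

sampled = sum(should_trace(i) for i in range(100_000))
print(f"traced {sampled} of 100,000 requests")
```

At a million transactions a day, even a 1% sample gives you ten thousand full traces daily — far more than anyone will ever read, at one hundredth of the storage cost.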

Lesson 6: Data Growth Is the Silent Killer

A million transactions per day means roughly 30 million per month. After a year, you have 365 million records. Queries that scanned the full table in development became unusable in production.

We implemented a data management strategy early:

  • Table partitioning by date, allowing old partitions to be archived or dropped
  • Archival pipelines that moved data older than 90 days to cold storage
  • Index maintenance that prevented bloat from degrading query performance
  • Read replicas for analytical queries that would otherwise impact transactional performance
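The routing and archival logic behind the first two points is simple enough to sketch. Table names, the monthly partition scheme, and the 90-day window here are illustrative stand-ins for whatever your retention policy dictates.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # illustrative retention window

def partition_for(txn_date):
    # One partition per month, e.g. transactions_2024_06.
    return f"transactions_{txn_date.year}_{txn_date.month:02d}"

def partitions_to_archive(today, partition_dates):
    # Partitions entirely older than the cutoff can be moved to cold
    # storage or dropped, without touching hot data at all.
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [partition_for(d) for d in partition_dates if d < cutoff]

today = date(2024, 6, 1)
print(partition_for(today))  # transactions_2024_06
print(partitions_to_archive(today, [date(2024, 1, 15), date(2024, 5, 20)]))
```

The win of partitioning over row-level deletes is operational: archiving a month of data becomes a metadata operation on one partition instead of millions of individual deletes churning the table and its indexes.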

The Meta-Lesson

The overarching lesson from scaling to a million transactions per day: the problems you'll face at scale are different from the problems you imagine. Your intuitions about bottlenecks, failure modes, and performance characteristics are probably wrong until you validate them with real data at real volume.

Profile first. Measure everything. And respect the fact that scale changes the rules of the game.