
How a Shape-First Approach Cut MongoDB Costs by 79% and Boosted Performance

A software-as-a-service (SaaS) company recently saw its cloud bill jump 20% after an unexpected auto-scaling event upgraded its MongoDB cluster from an M20 to an M60 instance. The incident triggered a 48-hour emergency effort to cut costs drastically while maintaining performance and avoiding downtime. The effort, led by Hayk Ghukasyan, Chief of Engineering at Hexact, reduced monthly expenses by 79%, from $15,284 to $3,210, and cut p95 latency from 1.9 seconds to 140 milliseconds.

Step 1: The Day the Invoice Went Supernova

At 2:17 a.m., the company's on-call team was alerted to a steep rise in the cloud bill caused by the unexpected MongoDB cluster upgrade. The COO demanded a 70% cost reduction within 48 hours. Initial diagnostics with the MongoDB profiler revealed three primary issues:

- N + 1 Query Tsunami: The API issued a separate query for each order's line items.
- Infinite Retention Window: Events were retained indefinitely, driving excessive storage costs.
- Jumbo Document Money Pit: Some documents exceeded 256 KB, straining cache lines and SSD resources.

Step 2: Three Shape Crimes & How to Fix Them

2.1 N + 1 Query Tsunami

Symptom: For each order, an additional query was made to fetch its line items.

Solution: A $lookup aggregation stage collapsed the reads into a single pass, eliminating the extra round-trips and their hidden fees.

```javascript
db.orders.aggregate([
  { $match: { userId } },
  { $lookup: {
      from: 'orderLines',
      localField: '_id',
      foreignField: 'orderId',
      as: 'lines'
  }},
  { $project: { lines: 1, total: 1, ts: 1 } }
]);
```

This change resulted in 90% fewer reads.

2.2 Infinite Retention Window

Symptom: Events were stored without a retention policy, bloating the database.

Solution: Capping the retention window and projecting only the necessary fields made the query more efficient and less resource-intensive.
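Before looking at the actual query, the effect of a 30-day cap plus field projection can be sketched with a plain-JavaScript stand-in (the in-memory `events` array, the field names, and the `recentEvents` helper below are illustrative, not the production schema):

```javascript
// Illustrative in-memory stand-in for the events collection.
const DAY_MS = 24 * 3600 * 1000;
const now = Date.now();
const events = [
  { userId: 'u1', ts: new Date(now - 5 * DAY_MS),  page: '/a', ref: 'x', payload: '...' },
  { userId: 'u1', ts: new Date(now - 45 * DAY_MS), page: '/b', ref: 'y', payload: '...' },
  { userId: 'u2', ts: new Date(now - 1 * DAY_MS),  page: '/c', ref: 'z', payload: '...' },
];

// Cap the window to 30 days and keep only the projected fields,
// mirroring the filter + projection shape of the production query.
function recentEvents(userId, days = 30, limit = 1000) {
  const cutoff = now - days * DAY_MS;
  return events
    .filter(e => e.userId === userId && e.ts.getTime() >= cutoff)
    .sort((a, b) => b.ts - a.ts)                      // newest first
    .slice(0, limit)                                  // hard cap on result size
    .map(({ ts, page, ref }) => ({ ts, page, ref })); // drop heavy fields
}
```

Old events and heavy fields never leave the data layer, which is exactly what the real find() below enforces server-side.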
```javascript
const events = db.events.find(
  { userId, ts: { $gte: new Date(Date.now() - 30 * 24 * 3600 * 1000) } }, // last 30 days
  { _id: 0, ts: 1, page: 1, ref: 1 }
).sort({ ts: -1 }).limit(1_000);

// TTL index: let MongoDB expire events older than 90 days automatically
db.events.createIndex({ ts: 1 }, { expireAfterSeconds: 90 * 24 * 3600 });
```

This approach helped a fintech client reduce storage costs by 72% overnight.

2.3 Jumbo Document Money Pit

Symptom: Large documents, including multi-MB invoices with embedded PDFs and long histories, were hurting both performance and cost.

Solution: Documents were split by access pattern: frequently accessed metadata stayed in MongoDB, while rarely touched blobs moved to S3 or GridFS.

Step 3: Four Shape Sins Hiding in Plain Sight

- Low-Cardinality Leading Index Key: A leading key with few distinct values gave poor selectivity and cache performance; reordering the index keys fixed it.
- Blind $regex Scan: Non-indexed $regex queries forced full collection scans, inflating CPU usage and latency; anchored patterns on indexed slug fields removed the scans.
- findOneAndUpdate as a Message Queue: Polling with findOneAndUpdate caused document-level lock contention and throttled throughput; a purpose-built queue or change streams restored performance.
- Offset Pagination Trap: skip-and-limit pagination costs grow linearly with page depth; range cursors backed by compound indexes made page fetches efficient.

Step 4: Cost Anatomy 101

The optimizations cut every major cost dimension:

- Reads: down from 7.8 billion to 2.3 billion, saving $1,260.
- Writes: volume unchanged, but the overall workload was lighter.
- Data Transfer: down from 1.5 TB to 300 GB, saving $300.
- Storage: shrunk from 2 TB to 800 GB, saving $96.

Step 5: 48-Hour Rescue Timeline

- 0-2 hours: Enable the profiler and identify the top 10 slow operations.
- 2-6 hours: Replace N + 1 queries with $lookup.
- 6-10 hours: Add projections and limits to queries to reduce RAM pressure.
- 10-16 hours: Split jumbo documents so working sets fit in RAM.
- 16-22 hours: Drop and reorder weak indexes to cut disk usage and improve cache hits.
- 22-30 hours: Create TTL indexes and online archives to manage data retention.
- 30-36 hours: Set up Grafana panels for live monitoring and early warnings.
- 36-48 hours: Load-test the system with k6 to confirm performance stability.

Step 6: Self-Audit Checklist

- Monitor the cache miss ratio.
- Track documents scanned versus documents returned.
- Regularly review and optimize document shapes and queries.

Step 7: Why Shape > Indexes (Most Days)

Fixing document shape before adding indexes yields better long-term results. A reshaped document shrinks every future fetch, cache line, and replication packet, so the performance and cost savings compound.

Step 8: Live Metrics to Alert On

- Cache Miss Ratio: Alert if the ratio exceeds 10% for 5 minutes.
- Documents Scanned vs Returned: Alert if the scanned-to-returned ratio surpasses 100.

Step 9: Thin-Slice Migration Script

To migrate a 1 TB events collection without downtime, a script double-writes new events while backfilling old ones:

```javascript
// Double-write: mirror every new event into its per-type collection
const cs = db.events.watch([], { fullDocument: 'updateLookup' });
cs.on('change', ev => {
  db[`${ev.fullDocument.type}s`].insertOne(ev.fullDocument);
});

// Backfill historical events in 10,000-document slices
let lastId = ObjectId("000000000000000000000000");
while (true) {
  const batch = db.events
    .find({ _id: { $gt: lastId } })
    .sort({ _id: 1 })
    .limit(10_000)
    .toArray();
  if (!batch.length) break;
  db[batch[0].type + 's'].insertMany(batch); // assumes each slice shares one type
  lastId = batch[batch.length - 1]._id;
}
```

Step 10: When Sharding Is Actually Required

Sharding is a complex solution that should be considered only when specific limits persist despite the optimizations above:

- Working Set Above 80% of RAM: A sign that distributed storage is needed to keep cache hit ratios healthy.
- Primary Write Throughput Exceeding 15,000 ops/s: High write velocity, though often still manageable through batching or bulk upserts.
- Multi-Region Read Latency: Zone sharding can deliver sub-70 ms p95 read latency across regions.

Conclusion

The unexpected cluster upgrade highlighted how much data shape and query efficiency drive cost. The SaaS company's reshaping effort, guided by Ghukasyan, not only cut the bill by 79% but also sharply improved performance. The lesson: shaping documents and queries correctly from the start is crucial for both technical and financial sustainability, because technical debt on shape is a hidden cost driver that compounds until it surfaces as a billing surprise.

Industry experts commend the approach, emphasizing that a "shape-first" mindset is essential for scalable, cost-effective database management. Hexact, founded by Ghukasyan, builds automation platforms and has deep experience in large-scale systems architecture and real-time databases, positioning it well to help organizations facing similar challenges.
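As a closing illustration of the offset-pagination trap from Step 3 (and of the scanned-versus-returned metric from Step 8), a tiny in-memory model shows why skip/limit degrades with depth while a range cursor stays flat (the `docs` array and both helper functions are illustrative, not from the case study):

```javascript
// 100 documents in _id order, standing in for an indexed collection.
const docs = Array.from({ length: 100 }, (_, i) => ({ _id: i + 1 }));

// Offset pagination: the server still walks past every skipped document,
// so "scanned" grows linearly with page depth.
function offsetPage(skip, limit) {
  const page = docs.slice(skip, skip + limit);
  return { page, scanned: skip + page.length };
}

// Keyset (range-cursor) pagination: seek directly past the last seen _id,
// so "scanned" equals the page size no matter how deep the page is.
function keysetPage(lastId, limit) {
  const page = docs.filter(d => d._id > lastId).slice(0, limit);
  return { page, scanned: page.length };
}
```

In MongoDB terms, the keyset version corresponds to `find({ _id: { $gt: lastId } }).sort({ _id: 1 }).limit(n)` in place of `skip()`, with a compound index matching the sort when paginating on other fields.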
