
Microsoft Dev: Azure Cosmos DB Conf 2026 Recap: Lessons from Production
Key Takeaways
Azure Cosmos DB performance is a design problem, not a capacity issue. Scaling RUs won’t fix hot partitions caused by poor partition key selection. To succeed, architects must master hierarchical partitioning, respect the 20 GB logical partition limit, and use strategic denormalization to ensure even load distribution and low-latency access patterns.
- Partition key selection is the most critical design decision; poor alignment with access patterns creates hot partitions and throttling regardless of total provisioned Request Units (RUs).
- Leverage hierarchical partition keys to scale beyond the 20 GB logical partition limit and provide granular control over data distribution for high-traffic or massive tenants.
- Prioritize denormalization and the Change Feed to create specialized containers for specific query patterns, drastically reducing the latency and cost of cross-partition queries.
- Acknowledge the non-negotiable hard limits of 20 GB and 10,000 RU/s per logical partition; failure to design around these constraints will eventually lead to performance bottlenecks that no amount of extra throughput can fix.
You provisioned Azure Cosmos DB with ample Request Units (RUs), your application’s P99 latency is creeping up, and throttling errors are becoming more frequent. Sound familiar? This isn’t a capacity problem; it’s a design problem. The Azure Cosmos DB Conference 2026 made one thing brutally clear: the platform exposes your data modeling and partition key choices like a harsh spotlight.
The Unseen Bottleneck: Partition Keys and Skewed Distribution
The single most impactful decision you make for Cosmos DB is the partition key. Forget throwing more RUs at the problem; if your partition key leads to skewed distribution, you’re battling hot partitions. This results in 100% RU utilization on some physical partitions while others languish, leading to relentless throttling and unacceptable latency spikes, even if your aggregate RU usage appears low.
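To make the "aggregate looks low, one partition is pegged" effect concrete, here is a minimal sketch of the arithmetic. The partition count, the provisioned throughput, and the consumed RU/s figures are all made-up assumptions; the only fact relied on is that provisioned throughput is split evenly across physical partitions.

```python
# Hypothetical setup: 40,000 RU/s provisioned, split evenly across 4 physical partitions.
PROVISIONED_RUS = 40_000
PHYSICAL_PARTITIONS = 4
per_partition_budget = PROVISIONED_RUS / PHYSICAL_PARTITIONS  # 10,000 RU/s each

# Assumed RU/s consumed per physical partition under a skewed partition key.
consumed = {"0": 10_000, "1": 1_200, "2": 900, "3": 700}


def normalized_utilization(consumed_rus, budget):
    """Percent of each partition's own budget in use, capped at 100."""
    return {
        pk_range: min(100.0, 100.0 * used / budget)
        for pk_range, used in consumed_rus.items()
    }


per_partition = normalized_utilization(consumed, per_partition_budget)
aggregate = 100.0 * sum(consumed.values()) / PROVISIONED_RUS

# Partitions at or above 90% of their budget are treated as hot (threshold is arbitrary).
hot = [pk_range for pk_range, pct in per_partition.items() if pct >= 90.0]

print(f"aggregate utilization: {aggregate:.1f}%")
print(f"hot partitions: {hot}")
```

With these numbers the account as a whole sits at 32% utilization while partition "0" is at 100% and throttling, which is why adding RUs across the board is an expensive way to buy very little headroom.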
Diagnosing the Dreaded Hot Partition
Azure Monitor is your primary battlefield. Look for:
- Normalized RU Consumption (%) By PartitionKeyRangeID: This metric is gold. High values on specific PartitionKeyRangeIDs scream “hot partition.”
- PhysicalPartitionThroughput: This shows the actual RU/s being consumed by individual physical partitions.
The fundamental truth from 2026 discussions is that your partition key MUST align with your most frequent access patterns and distribute load evenly. Common pitfalls include using a userId when one user generates a disproportionate amount of traffic, or not considering the 20 GB size limit on each logical partition.
Strategic Partitioning: Beyond the Single Key
When a single partition key can’t guarantee even distribution, or when dealing with massive tenants, Hierarchical Partition Keys are your savior. These allow up to three levels of subpartitioning. Because the 20 GB limit then applies to each full key path rather than to the first-level value alone, a single tenant’s data can grow past 20 GB, and you get finer-grained control over data distribution.
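The effect of adding a second key level can be illustrated with a small simulation. This is only a sketch: the `bucket` function uses SHA-256 as a stand-in and is not Cosmos DB's actual hashing or placement logic, and the tenant name, user count, and partition count are hypothetical.

```python
import hashlib
from collections import Counter

PHYSICAL_PARTITIONS = 8  # hypothetical partition count


def bucket(key_path):
    """Map a key path to a partition index (stand-in hash, NOT the service's algorithm)."""
    digest = hashlib.sha256("|".join(key_path).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PHYSICAL_PARTITIONS


# One massive tenant: 1,000 users with 10 documents each.
docs = [
    {"tenantId": "contoso", "userId": f"user-{u}", "seq": i}
    for u in range(1000)
    for i in range(10)
]

# Single-level key: every document shares the same key value, so one bucket takes it all.
single_level = Counter(bucket((d["tenantId"],)) for d in docs)

# Two-level key: the full path (tenantId, userId) determines placement.
two_level = Counter(bucket((d["tenantId"], d["userId"])) for d in docs)

print("single-level buckets used:", len(single_level))
print("two-level buckets used:", len(two_level))
```

The single-level key puts all 10,000 documents in one bucket, while the two-level path spreads the same tenant across many. A query that supplies only the tenantId prefix can still be routed to the subset of partitions holding that tenant rather than fanning out everywhere.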
Data Modeling: Embracing Denormalization and Avoiding Pitfalls
Cosmos DB thrives on flexibility, but that doesn’t mean a free-for-all.
- Denormalize and Embed: For one-to-few or contained relationships, embed related data directly within the parent document. This minimizes costly cross-partition queries.
- Avoid Unbounded Arrays: Large, unbounded arrays can lead to large documents and potential RU spikes during updates.
- Multiple Containers for Specific Access Patterns: Don’t force one container to serve all purposes. Use the Change Feed to propagate data to specialized containers optimized for different query needs, drastically reducing cross-partition query overhead.
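The Change Feed pattern in the last bullet can be sketched without any SDK calls. Everything here is an assumption for illustration: the order document shape, the `to_orders_by_customer` projection, and the use of a plain list and dict as stand-ins for a change feed batch and a target container. A real implementation would read changes through the SDK or an Azure Functions trigger.

```python
def to_orders_by_customer(order):
    """Project a source order into a slim view document partitioned by customerId."""
    return {
        "id": order["id"],
        "customerId": order["customerId"],  # partition key of the hypothetical target
        "total": order["total"],
        "status": order["status"],
    }


def apply_changes(changes, target):
    """Upsert each changed document into the target, keyed by (partition key, id).

    Upserting keeps the handler idempotent, so a change that is delivered
    more than once leaves the target in the same state.
    """
    for doc in changes:
        view = to_orders_by_customer(doc)
        target[(view["customerId"], view["id"])] = view


# Stand-in for one change feed batch; "o1" appears twice because it was updated.
batch = [
    {"id": "o1", "customerId": "c1", "total": 40.0, "status": "placed", "items": ["a", "b"]},
    {"id": "o2", "customerId": "c2", "total": 15.5, "status": "placed", "items": ["c"]},
    {"id": "o1", "customerId": "c1", "total": 40.0, "status": "shipped", "items": ["a", "b"]},
]

orders_by_customer = {}
apply_changes(batch, orders_by_customer)
print(orders_by_customer[("c1", "o1")]["status"])
```

Because the target is keyed by customerId, "all orders for customer c1" becomes a single-partition read instead of a cross-partition query against the source container.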
Throughput Management Beyond Provisioning
Beyond basic RU provisioning, proactive management is key:
- Monitor utilizationOf20GBLogicalPartition and set alerts. When a logical partition approaches its 20 GB limit, it’s a precursor to issues.
- While not ideal, skewed workloads can sometimes be mitigated by manually redistributing throughput across physical partitions using PowerShell or the Azure CLI. This is a workaround, not a fix for bad design.
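As a rough illustration of the alerting idea in the first bullet, the check below flags logical partitions that are close to the size limit. The input dictionary, the 80% threshold, and the reading of "20 GB" as 20 × 1024³ bytes are assumptions; in practice the sizes would come from your monitoring data rather than a hard-coded dict.

```python
# Assumption: the 20 GB limit is treated as 20 * 1024**3 bytes for this sketch.
LOGICAL_PARTITION_LIMIT_BYTES = 20 * 1024**3


def partitions_near_limit(sizes_bytes, threshold_pct=80.0):
    """Return (partition key, percent of limit used) pairs at or above the threshold,
    largest first."""
    flagged = []
    for key, size in sizes_bytes.items():
        pct = round(100.0 * size / LOGICAL_PARTITION_LIMIT_BYTES, 1)
        if pct >= threshold_pct:
            flagged.append((key, pct))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-logical-partition storage sizes.
sizes = {
    "tenant-a": 17 * 1024**3,
    "tenant-b": 2 * 1024**3,
    "tenant-c": int(19.5 * 1024**3),
}

for key, pct in partitions_near_limit(sizes):
    print(f"{key}: {pct}% of the logical partition limit")
```

An alert at 80% leaves time to remodel the key before writes to that partition start failing; the right threshold depends on how fast the partition grows.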
The Ecosystem and The Hard Truths
Discussions on platforms like Hacker News (even years later) reveal a mixed sentiment. Early criticisms often cited misleading “multi-model” marketing (Cosmos DB excels primarily as a document DB), sparse documentation, and cost concerns. Alternatives like Amazon DynamoDB, MongoDB Atlas, and Google Cloud Firestore are frequently mentioned.
The critical takeaway from production deployments is that the limits of 20 GB and 10,000 RU/s per logical partition are non-negotiable. They necessitate careful design from day one.
When to Reconsider Cosmos DB
If your workload is inherently relational with complex, multi-table joins as a primary requirement, or if you’re seeking a bare-bones, cost-optimized NoSQL solution for extremely low-throughput scenarios, Cosmos DB might be overkill or a poor fit. It demands a deep understanding of your data and access patterns.
The honest verdict? Azure Cosmos DB is a powerhouse for globally distributed, high-scale, real-time, and AI-driven applications with flexible data. But its success is entirely dependent on disciplined data modeling and partition key design that perfectly matches your workload. Increasing RUs will only delay the inevitable confrontation with a flawed design. Cosmos DB doesn’t hide your design problems; it amplifies them. Choose your partition key wisely.
Frequently Asked Questions
- What is the most common mistake made when setting up Azure Cosmos DB in production?
- The most common mistake is an inadequate or poorly chosen partition key. A bad partition key leads to skewed data distribution and hot partitions, which will eventually throttle your throughput regardless of how many Request Units (RUs) you provision. Always design your partition key strategy around your most frequent query patterns.
- How can I prevent throttling errors in Azure Cosmos DB production?
- Prevent throttling by carefully selecting a partition key that distributes requests evenly across all physical partitions. Monitor your RU consumption closely, especially at the partition level, and optimize your queries to be more efficient. Consider autoscale throughput if your workload is variable to dynamically adjust capacity.
- When should I use a composite partition key in Azure Cosmos DB?
- A composite partition key can be beneficial when you frequently query based on multiple properties. By concatenating them into a single partition key value, you can improve query performance for those specific scenarios. However, ensure that the combined key still offers good distribution and avoids creating hot partitions.
- What are the alternatives when a high-cardinality partition key still produces skew in Azure Cosmos DB?
- If a single high-cardinality property leads to skew, consider a synthetic partition key (a generated value) that is designed for even distribution. Alternatively, you can create a logical grouping of related data into different containers with more appropriate partition keys based on their access patterns. This often involves rethinking your data model.
- What are the best practices for managing Azure Cosmos DB cost in production?
- Optimize costs by choosing the right provisioned throughput model (manual vs. autoscale) based on your workload predictability. Right-size your RUs to avoid over-provisioning. Regularly review query performance and ensure efficient data modeling to minimize RU consumption. Consider data lifecycle management to archive or delete older data.
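The synthetic partition key mentioned in the FAQ above can be sketched as a base value plus a deterministic suffix. The function names, the bucket count of 10, and the choice of SHA-256 are illustrative assumptions, not a prescribed scheme.

```python
import hashlib

SUFFIX_BUCKETS = 10  # hypothetical fan-out per hot key


def synthetic_key(user_id, doc_id):
    """Build a partition key value like 'user-42-7' from a hot key and a document id.

    The suffix is derived from the document id, so the same document
    always maps to the same partition key value.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    suffix = int.from_bytes(digest[:4], "big") % SUFFIX_BUCKETS
    return f"{user_id}-{suffix}"


def all_keys_for(user_id):
    """List every partition key value a reader must query to see all of a user's data."""
    return [f"{user_id}-{i}" for i in range(SUFFIX_BUCKETS)]


key = synthetic_key("user-42", "doc-1001")
print(key)
print(all_keys_for("user-42"))
```

The trade-off is explicit in `all_keys_for`: writes for one busy user now spread over ten logical partitions, but reading everything for that user means querying all ten key values, so the suffix count should stay as small as the write load allows.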




