
Why Your SQL SELECT * Is Costing You More Than You Think
Key Takeaways
SELECT * is not free; it adds query planning time, CPU overhead, disk/network I/O, and can hinder index utilization, leading to higher costs and slower performance at scale.
SELECT *forces the query planner to fetch all column metadata, increasing initial planning time.- Databases often perform column reordering or data type conversions on the fly for
SELECT *, adding CPU overhead. - Increased data volume for
SELECT *impacts disk I/O and network traffic between nodes in distributed systems. - Specific query patterns involving
SELECT *can lead to suboptimal index usage. - The perceived ‘simplicity’ of
SELECT *masks significant hidden costs at scale.
The Implicit Tax: Why SELECT * Is a Performance and Cost Minefield
During the initial sprint, the team opted for SELECT * FROM users; to fetch user data. It was less typing, and the table only had five columns. Fast forward eighteen months, and that same query, now running against a petabyte-scale data warehouse and invoked millions of times daily, is silently inflating our cloud bill, saturating network links, and pushing query latency into the unacceptable quartile. The temptation of SELECT * is a siren song for developers, promising brevity but ultimately leading to significant, often unquantified, operational costs. This isn’t just about bandwidth; it’s about internal database mechanics, cache efficiency, and the very architecture of how data is processed at scale.
The Hidden I/O and CPU Tax
When a database executes SELECT *, it’s instructed to retrieve every single column defined for a given table. This sounds innocuous for a table with a handful of small datatypes. However, consider a table with twenty columns, including large TEXT fields, JSONB blobs, and BYTEA arrays. Even if the consuming application only needs the user_id and email from that row, the database engine must still:
- Locate and read all column data from disk or memory. For variable-length types (like
TEXTorJSONBin PostgreSQL) stored out-of-line, this can involve additional disk seeks to fetch those separate data pages. - Serialize the entire row’s data into a buffer for network transmission. This involves packing each field, converting types as necessary, and assembling the final packet.
- Transmit this inflated payload across the network to the client.
This triple whammy of I/O, CPU, and network overhead directly impacts the database’s ability to serve other concurrent requests. EXPLAIN ANALYZE can sometimes hint at this internal cost. For instance, in PostgreSQL, if a query plan shows significant time spent in “fetching” or scanning large numbers of “blocks,” and the number of rows returned is proportional to the table size, it’s a strong indicator that more data than necessary is being touched. While EXPLAIN ANALYZE doesn’t directly measure network latency, the database-side work required to prepare that data is often considerably higher for SELECT * than for a targeted projection.
A stark example comes from the common practice in ORMs. A simple session.query(User).all() in SQLAlchemy or new HashSet<User>(dbContext.Users); in Entity Framework, without explicit field projection, will often default to generating SELECT * FROM users;. This pattern, replicated across hundreds of microservices fetching basic user profiles, creates a distributed denial-of-service attack against your own network and database infrastructure, all under the guise of developer convenience.
Cache Contention and Reduced Effectiveness
Database systems rely heavily on caching mechanisms to minimize disk I/O. PostgreSQL, for example, uses shared_buffers to cache frequently accessed data pages. When SELECT * is the query pattern, the pages being loaded into and retained by this cache are likely to contain many columns that the application will immediately discard.
Imagine a 1MB data page containing 100 rows. If each row has 10 columns, and your query only needs 2 of those columns, you’ve still loaded the data for all 200 columns into cache. If another query actually needs those other 8 columns from a different row on the same page, it might find that page has been evicted to make room for more “recently” used (though not necessarily useful) data. This leads to a cascade of disk reads for data that was already in memory but deemed less important because it was part of an unneeded row set. The effective cache hit ratio for relevant data plummets, forcing more frequent and expensive disk accesses. This phenomenon is particularly pronounced in OLAP systems where tables can be orders of magnitude larger than in OLTP, and SELECT * is antithetical to their column-oriented design.
Under-the-Hood: The Index-Only Scan Conspiracy
Modern relational databases offer powerful indexing capabilities. One of the most significant performance optimizations is the index-only scan. If a query can be satisfied entirely by data present in an index (i.e., all columns referenced in the SELECT list and WHERE clause are included in the index), the database can skip the expensive operation of fetching the actual row data from the table’s main storage (the heap). This is a massive performance win.
However, SELECT * fundamentally undermines this. When you request all columns, the database must go to the heap to retrieve columns that are not part of the index. Even if you have a perfect index covering your WHERE clause, if the SELECT list is *, the index-only scan is impossible. The query optimizer is forced to perform a heap fetch for every qualifying row, negating the benefits of an otherwise excellent index. While an index can theoretically include all columns of a table, creating such an index is often impractical, bloats index size dramatically, and incurs significant write overhead. Therefore, SELECT * effectively prevents the database from using one of its most efficient query execution strategies.
Bonus Perspective: The Schema Evolution Double-Edged Sword
A common justification for SELECT * is its perceived resilience to schema evolution. The argument goes: “If the table gains a new column, my SELECT * query automatically picks it up without code changes.” This is a dangerous oversimplification and a false economy.
Firstly, while it might avoid a runtime error today, it can mask fundamental issues. If an application suddenly starts receiving and processing columns it wasn’t designed for, this can lead to subtle bugs, increased memory consumption, or unexpected behavior down the line. Data contracts between services are critical; SELECT * actively erodes these contracts by making them implicit and undocumented.
Secondly, removing a column from a table where SELECT * is prevalent is far more likely to cause immediate, catastrophic failures. The application, no longer receiving the data it implicitly expected, will break. Proactive, explicit column selection forces developers to confront schema changes head-on. Adding a column should prompt a review: “Does my application need this new data?” Removing a column should prompt: “Which services are using this, and how do we gracefully deprecate it?” Relying on SELECT * to paper over these essential architectural discussions is a recipe for instability and technical debt.
Opinionated Verdict
The convenience of SELECT * is a short-term illusion that imposes a long-term, quantifiable tax on your infrastructure. Its cost manifests not just in network traffic but in degraded database performance, reduced cache efficiency, and the inability to leverage fundamental optimizations like index-only scans. For any production system that scales beyond a handful of users or tables, abandoning SELECT * in favor of explicit column projection is not merely a best practice; it is an operational imperative. The minuscule effort saved in typing id, name, email is dwarfed by the operational expenditure and performance degradation incurred by fetching every column, every time. Treat every instance of SELECT * in production code as a ticking time bomb for performance and budget.




