Central Theme: The Critical 3%
Written by legendary engineers Jeff Dean and Sanjay Ghemawat, this article argues that the common advice to avoid “premature optimization” is often misinterpreted as license to ignore performance entirely during development. They contend that while most small efficiencies can safely be ignored, software engineers must seize the “critical 3%” of opportunities where optimization yields significant benefits. Ignoring performance produces systems with flat profiles (no single bottleneck exists, so optimization becomes difficult), frustrated users, and expensive hardware over-provisioning.
Estimation and Measurement
Before coding, engineers should perform back-of-the-envelope calculations to estimate resource usage (latency, memory, bandwidth). This requires knowing the rough costs of common operations (e.g., L1 cache reference vs. SSD read vs. datacenter round trip). Once code exists, profiling is essential. When profiles are flat (no obvious hotspots), developers should look for structural changes, such as reducing allocations, removing overly general code, or restructuring loops.
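To make this concrete, here is a minimal sketch of such an estimate. The constants are order-of-magnitude figures from widely circulated latency tables (actual values vary by hardware), and the request shape is invented for illustration, not taken from the article.

```cpp
#include <cstdio>

// Approximate, order-of-magnitude operation costs (hardware-dependent).
constexpr double kMainMemRefNs  = 100.0;      // main memory reference
constexpr double kSsdRandReadNs = 100'000.0;  // random SSD read
constexpr double kDcRoundTripNs = 500'000.0;  // same-datacenter round trip

int main() {
  // Hypothetical request: touches 10,000 cold cache lines, performs 2 SSD
  // reads, and makes 3 sequential RPCs within the datacenter.
  const double total_ns = 10'000 * kMainMemRefNs   // 1.0 ms
                        + 2 * kSsdRandReadNs       // 0.2 ms
                        + 3 * kDcRoundTripNs;      // 1.5 ms
  std::printf("estimated latency: %.1f ms\n", total_ns / 1e6);  // ~2.7 ms
}
```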
Key Technical Strategies
- API Design: Prefer bulk APIs to reduce per-call overhead. Use view types (e.g., std::string_view) to avoid copying, and design types to be thread-compatible rather than internally thread-safe (unless internal synchronization is truly needed), so single-threaded users don't pay locking overhead. Sketches of several of these strategies follow the list.
- Algorithmic Improvements: This is the highest-impact area. Examples include replacing O(N²) algorithms with O(N) or O(N log N) ones, upgrading hash functions, and replacing sorted-list intersections with hash table lookups.
- Memory Representation: Use compact data structures to improve cache locality. Prefer indices over pointers (pointers are large on 64-bit systems), use arenas for allocation, and prefer arrays/vectors over map structures for small datasets.
- Reducing Allocations: Allocations are expensive due to allocator lock contention and cache misses. Strategies include reusing temporary objects, pre-sizing vectors with reserve(), and hoisting variable definitions out of loops.
- Concurrency: Amortize lock acquisition by batching operations, shard data structures to reduce contention, and keep critical sections short.
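First, a minimal sketch of the API-design points, using a hypothetical Index class (the names are mine, not the article's):

```cpp
#include <set>
#include <string>
#include <string_view>
#include <vector>

// Thread-compatible by design: no internal mutex, so single-threaded users
// pay no locking cost; concurrent users synchronize externally.
class Index {
 public:
  // Bulk API: one call covers many keys, amortizing per-call overhead.
  void AddAll(const std::vector<std::string_view>& keys) {
    for (std::string_view key : keys) keys_.emplace(key);
  }

  // View parameter: accepts std::string, literals, and substrings without
  // copying; std::less<> enables lookup without constructing a std::string.
  bool Contains(std::string_view key) const { return keys_.count(key) > 0; }

 private:
  std::set<std::string, std::less<>> keys_;
};
```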
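For the sorted-list-intersection example, a sketch under the assumption that one input is much smaller than the other (the function name is illustrative):

```cpp
#include <unordered_set>
#include <vector>

// Hash-based intersection: O(A + B) expected time, and neither input needs
// to be pre-sorted (a merge-style intersection would require that).
std::vector<int> Intersect(const std::vector<int>& a, const std::vector<int>& b) {
  const std::vector<int>& small = a.size() <= b.size() ? a : b;
  const std::vector<int>& large = a.size() <= b.size() ? b : a;
  std::unordered_set<int> lookup(small.begin(), small.end());
  std::vector<int> out;
  for (int x : large) {
    if (lookup.erase(x) > 0) out.push_back(x);  // erase() also dedupes output
  }
  return out;
}
```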
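One way to apply the indices-over-pointers advice (layout and names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// 32-bit indices instead of 64-bit pointers: each Node shrinks from 16 to
// 8 bytes, twice as many fit per cache line, and indices stay valid when
// the backing vector reallocates (raw pointers would not).
struct Node {
  static constexpr std::uint32_t kNone = 0xFFFFFFFFu;
  std::uint32_t first_child = kNone;   // index into Tree::nodes
  std::uint32_t next_sibling = kNone;  // index into Tree::nodes
};

struct Tree {
  std::vector<Node> nodes;  // contiguous, arena-like storage for all nodes
};
```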
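A sketch combining two of the allocation-reduction strategies (the task itself is invented for illustration):

```cpp
#include <numeric>
#include <vector>

// Pre-size the output with reserve() and hoist the scratch buffer out of
// the loop; clear() keeps its capacity, so steady state allocates nothing.
std::vector<long> SumPositives(const std::vector<std::vector<int>>& batches) {
  std::vector<long> sums;
  sums.reserve(batches.size());  // one allocation instead of repeated growth
  std::vector<int> scratch;      // hoisted: heap buffer reused every pass
  for (const std::vector<int>& batch : batches) {
    scratch.clear();             // keeps capacity; no free/realloc
    for (int v : batch) {
      if (v > 0) scratch.push_back(v);
    }
    sums.push_back(std::accumulate(scratch.begin(), scratch.end(), 0L));
  }
  return sums;
}
```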
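Finally, a minimal sharding sketch (the shard count and hashing scheme are arbitrary choices for illustration):

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>

// Sharded counter: threads usually hash to different shards, so they rarely
// contend on the same mutex, and each critical section is a single add.
class ShardedCounter {
 public:
  void Add(long delta) {
    Shard& s = shards_[ShardIndex()];
    std::lock_guard<std::mutex> lock(s.mu);  // short critical section
    s.value += delta;
  }

  long Total() const {  // reads pay the cross-shard cost, not writes
    long total = 0;
    for (const Shard& s : shards_) {
      std::lock_guard<std::mutex> lock(s.mu);
      total += s.value;
    }
    return total;
  }

 private:
  static constexpr std::size_t kShards = 16;
  struct alignas(64) Shard {  // padded to avoid false sharing
    mutable std::mutex mu;
    long value = 0;
  };
  std::size_t ShardIndex() const {
    return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
  }
  std::array<Shard, kShards> shards_;
};
```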
Avoiding Unnecessary Work
The most effective optimization is skipping work entirely. This includes the following (sketches of the first two appear after the list):
- Fast Paths: Create special handling for common, simple cases (e.g., ASCII strings vs. UTF-8).
- Deferral: Don’t compute expensive statistics or values until they are actually requested.
- Specialization: Write specialized code instead of using generic libraries for hot paths (e.g., custom string formatting).
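A sketch of the ASCII fast path; Utf8ToLower below is a placeholder standing in for real Unicode case mapping, not an actual library call:

```cpp
#include <string>
#include <string_view>

bool IsAscii(std::string_view s) {
  for (unsigned char c : s) {
    if (c >= 0x80) return false;  // any high bit set means non-ASCII
  }
  return true;
}

// Placeholder for a full Unicode-aware implementation (not shown here).
std::string Utf8ToLower(std::string_view s) { return std::string(s); }

std::string ToLower(std::string_view s) {
  if (IsAscii(s)) {  // fast path: the common case is a cheap byte loop
    std::string out(s);
    for (char& c : out) {
      if (c >= 'A' && c <= 'Z') c += 'a' - 'A';
    }
    return out;
  }
  return Utf8ToLower(s);  // slow path: rare inputs pay the full cost
}
```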
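And deferral, sketched as a stats object that records cheaply and computes only on demand (the class and method names are mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hot path just appends; the expensive sort happens only if a percentile
// is actually requested, and never for objects nobody queries.
class LatencyStats {
 public:
  void Record(double sample_ms) { samples_.push_back(sample_ms); }

  double Percentile(double p) {  // p in [0, 1]
    if (samples_.empty()) return 0.0;
    std::sort(samples_.begin(), samples_.end());
    const auto idx = static_cast<std::size_t>(p * (samples_.size() - 1));
    return samples_[idx];
  }

 private:
  std::vector<double> samples_;
};
```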
Protocol Buffers and C++ Specifics
While useful, Protocol Buffers can become performance bottlenecks. The authors advise avoiding deep message hierarchies, unnecessary copying (use bytes fields or arenas), and overuse on hot paths. For C++, they recommend modern Abseil containers such as absl::flat_hash_map and absl::InlinedVector over their standard library equivalents for better cache locality and fewer allocations.
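A small sketch of those container choices (the inline capacity of 4 and the posting-list framing are arbitrary illustrations; building this requires linking against Abseil):

```cpp
#include <cstdint>
#include <string>

#include "absl/container/flat_hash_map.h"
#include "absl/container/inlined_vector.h"

// flat_hash_map uses open addressing in one contiguous array, so probes
// stay cache-friendly; std::unordered_map chases a pointer per node.
// InlinedVector stores small contents inside the object itself, so a
// posting list with <= 4 entries never touches the heap.
struct Posting {
  absl::InlinedVector<std::uint32_t, 4> doc_ids;
};

absl::flat_hash_map<std::string, Posting> index;

void AddDoc(const std::string& term, std::uint32_t doc_id) {
  index[term].doc_ids.push_back(doc_id);  // no heap alloc for small lists
}
```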
Mentoring question
Looking at a service or module you own, can you perform a back-of-the-envelope calculation to estimate its theoretical maximum performance, and if there is a discrepancy with reality, would a ‘fast path’ or data structure change (like switching from a map to an array) be the most effective way to close that gap?