The fastest API is not the one with the most exotic infrastructure behind it. It is the one that knows exactly what work belongs in the request path, and has the discipline to keep everything else out of the way.
A lot of API performance work starts in the wrong place. The endpoint is slow, so the first instinct is to reach for more workers, more dynos, a larger database instance, a more aggressive cache, or a queue bolted onto the side. Those may all have their place, but they do not fix the underlying shape of the system. They only give a poor shape more room to breathe.
The better starting point is the hot path: the exact sequence of work that happens every time a request comes in. That path is where latency compounds, where database pressure collects, where object allocation quietly grows, and where a harmless-looking branch turns into an expensive multiplier once traffic arrives.
In physics, the principle of least action describes systems as following the path that minimises a quantity called the action. In software, the metaphor is useful even if the mathematics is different. A request path should perform the smallest amount of synchronous work required to produce a correct and useful response. Not the smallest amount of work overall. The smallest amount of work that must happen right now.
The API hot path is not the place to discover business meaning from scratch. It is the place to serve meaning that has already been prepared.
Most overloaded endpoints become overloaded gradually. They begin as a clean route: validate the request, check permissions, fetch data, return a response. Then a reporting requirement appears. Then a score needs to be calculated. Then the response needs enrichment from another table. Then a third-party lookup is added. Then extra analytics events are written. Each addition is reasonable in isolation. Together, they convert the endpoint from a serving layer into a live reconstruction engine.
That distinction matters. Serving is cheap when the representation is already close to the shape the caller needs. Reconstruction is expensive because the system has to rediscover relationships, aggregate rows, interpret rules, calculate derived values and format the result while the user waits.
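The contrast is easy to see in miniature. Below is a sketch, not taken from any particular codebase: the table names, the `risk` formula and the in-memory SQLite database are all invented. The first handler reconstructs the answer on every call; the second reads a row that a background job, not the request path, keeps close to the response shape.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (account_id INT, amount REAL);
    INSERT INTO orders VALUES (1, 40.0), (1, 60.0);
    -- A background job, not the request path, keeps this table current.
    CREATE TABLE account_summary (account_id INT PRIMARY KEY, total REAL, risk REAL);
    INSERT INTO account_summary VALUES (1, 100.0, 0.2);
""")

def overview_reconstructing(account_id):
    # Live reconstruction: aggregate and derive while the caller waits.
    rows = db.execute(
        "SELECT amount FROM orders WHERE account_id = ?", (account_id,)
    ).fetchall()
    total = sum(amount for (amount,) in rows)
    risk = min(total / 500.0, 1.0)  # stand-in for a real derived score
    return {"account": account_id, "total": total, "risk": risk}

def overview_serving(account_id):
    # Serving: one indexed read against a shape prepared in advance.
    total, risk = db.execute(
        "SELECT total, risk FROM account_summary WHERE account_id = ?", (account_id,)
    ).fetchone()
    return {"account": account_id, "total": total, "risk": risk}

print(overview_reconstructing(1))
print(overview_serving(1))
```

The two functions return the same answer. The difference is where the work happens, and who is waiting while it does.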
A useful review of an API usually starts with five blunt questions:
- What does the caller actually need before the response can be returned?
- Which calculations can be performed before the request arrives?
- Which database reads repeat across requests and should be cached, materialised or indexed differently?
- Which side effects can be moved into a queue, log stream or batch process (sketched just after this list)?
- Which objects, joins or serialisation steps exist only because the internal model is leaking into the public response?
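The fourth question is often the cheapest win. Here is a minimal sketch, with an in-process queue standing in for whatever broker the system actually uses and all names invented: the handler records the side effect and returns, and a worker absorbs it off the request path.

```python
import queue
import threading

events: queue.Queue = queue.Queue()

def handle_request(payload):
    response = {"ok": True, "echo": payload}               # the synchronous work
    events.put({"type": "analytics", "payload": payload})  # deferred side effect
    return response                                        # caller never waits on analytics

def drain_events():
    while True:
        event = events.get()
        if event is None:  # shutdown sentinel
            break
        # Write to the analytics store, retry on failure, etc.
        print("processed", event)

worker = threading.Thread(target=drain_events, daemon=True)
worker.start()
print(handle_request({"user": 7}))
events.put(None)
worker.join()
```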
The answer is often not a single optimisation but a separation of responsibilities. Ingestion should collect raw truth. Computation should turn that truth into derived structures. Storage should preserve the structures needed by the product. Serving should read from those structures with as little live interpretation as possible.
This is why denormalised read models, materialised views, compact lookup tables, spatial indexes, search indexes and precomputed scores are not premature optimisation when the workload is known. They are ways of respecting the difference between writing truth and serving answers.
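In miniature, that separation of responsibilities looks like the sketch below, with invented names and in-memory stand-ins for real storage: ingestion appends raw truth, a scheduled job computes the read model, and serving is reduced to a lookup.

```python
from collections import defaultdict

raw_events = []            # ingestion: append-only truth

def ingest(event):
    raw_events.append(event)

read_model = {}            # storage: the structure the product needs

def recompute():           # computation: runs on a schedule, not per request
    totals = defaultdict(float)
    for event in raw_events:
        totals[event["user"]] += event["amount"]
    read_model.clear()
    read_model.update(totals)

def serve(user):           # serving: a lookup, no live interpretation
    return read_model.get(user, 0.0)

ingest({"user": "a", "amount": 3.0})
ingest({"user": "a", "amount": 4.5})
recompute()
print(serve("a"))          # 7.5
```

Replace the list with a log, the dict with a materialised view or lookup table, and the function call with a cron job, and the shape survives at production scale.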
The same principle applies below the database layer. Single-thread performance still matters because every concurrent system is built out of serial sections. If the inner loop is wasteful, concurrency multiplies the waste. If every request allocates too many objects, traverses data in a cache-hostile layout, repeats parsing, or walks through abstraction layers that add no value to the hot path, more workers merely make the machine burn harder.
Before parallelising, it is worth finding the shape of the serial work. Profile the endpoint. Count queries. Inspect serialisation. Measure allocation. Look at the loops that run for every request. Check whether data is being copied, converted or re-sorted because two parts of the system disagree about representation. The boring measurements often reveal the real fault line.
Concurrency becomes useful once the core path is clear. It is not just about starting more workers. It is about splitting work without losing determinism, correctness or observability. A parallel process that produces different answers depending on timing is not an optimisation; it is a liability with better CPU utilisation.
For batch-heavy systems, deterministic concurrency usually means a parent process owns ordering, identity allocation and final mutation, while workers handle isolated units of work. Results can then be applied in a stable sequence. That design is less glamorous than a free-for-all worker pool, but it is much easier to test, replay and trust.
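One way to get that shape, sketched with Python's `concurrent.futures` and an invented work function: `map` yields results in input order regardless of completion order, so the parent applies them in a stable, replayable sequence while the workers stay free of shared state.

```python
from concurrent.futures import ProcessPoolExecutor

def work(unit):
    # An isolated unit: no shared state, no identity allocation, no mutation.
    return unit["id"], unit["value"] * 2

def run_batch(units):
    with ProcessPoolExecutor() as pool:
        # map() preserves input order even when workers finish out of order,
        # so the parent applies results deterministically.
        for unit_id, result in pool.map(work, units):
            print(f"apply {result} to unit {unit_id}")  # the only place state changes

if __name__ == "__main__":  # required for process pools on spawn-based platforms
    run_batch([{"id": i, "value": i} for i in range(5)])
```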
Load testing should then be used to verify the shape of the system, not to produce a heroic number for a slide deck. A useful load test answers practical questions: which endpoint bends first, whether the bend is caused by the application, database, network, queue or infrastructure, and whether the failure mode is graceful enough for production.
This also means separating background noise from real failure. Production systems receive bad requests, bot traffic, malformed payloads, repeated retries and upstream nonsense. Those events matter, but they should not be confused with the behaviour of valid business traffic. A clean analysis separates request classes, payload types, response codes, latency bands and time windows before making an infrastructure decision.
The practical pattern is simple: make the hot path lean, make the data shape honest, make concurrency deterministic, and make load tests evidence-led. Do that before adding complexity. Most businesses do not need a more theatrical backend. They need fewer surprises in the path that runs every single time.