Original: https://a16zcrypto.com/why-blockchain-performance-is-hard-to-measure/
By Joseph Bonneau
Performance and scalability are much-discussed challenges in the crypto space, relevant to both layer 1 projects (standalone blockchains) and layer 2 solutions such as rollups and off-chain channels. However, we have no standardized metrics or benchmarks. Numbers are often reported inconsistently and incompletely, making accurate comparisons of projects difficult and often obscuring what is most important in practice.
We need a more granular and thorough approach to measuring and comparing performance—one that breaks down performance into components and compares them with trade-offs along multiple axes. In this post, I define basic terms, outline challenges, and provide guidelines and key principles to keep in mind when evaluating blockchain performance.
Scalability and Performance
First, let's define two terms, scalability and performance, which have standard computer science meanings and are often misused in the blockchain context. Performance measures what the system is currently capable of achieving. As we'll discuss below, performance metrics might include transactions per second or median transaction confirmation times. Scalability, on the other hand, measures a system's ability to increase performance by adding resources.
This distinction matters: many ways to improve performance do not improve scalability at all. A simple example is using a more efficient digital signature scheme, such as BLS signatures, which are roughly half the size of Schnorr or ECDSA signatures. If Bitcoin switched from ECDSA to BLS, the number of transactions per block might increase by 20-30%, improving performance overnight. But we can only do this once: there is no even more space-efficient signature scheme to switch to next (BLS signatures can also be aggregated to save more space, but that is another one-time trick).
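The arithmetic behind such a one-shot gain can be sketched in a few lines. All the constants below are illustrative (the block size, the per-transaction overhead, and the assumption of one signature per transaction are my own simplifications, not Bitcoin's actual parameters); the point is only that the gain is a fixed percentage, realized once.

```python
BLOCK_BYTES = 1_000_000   # hypothetical block size limit
TX_OVERHEAD = 60          # hypothetical non-signature bytes per transaction

def txs_per_block(sig_bytes):
    """How many transactions fit in one block for a given signature size."""
    return BLOCK_BYTES // (TX_OVERHEAD + sig_bytes)

ecdsa = txs_per_block(72)  # ECDSA signatures are ~70-72 bytes DER-encoded
bls = txs_per_block(48)    # BLS signatures are 48 bytes
print(f"ECDSA: {ecdsa} tx/block, BLS: {bls} tx/block, "
      f"one-time gain: {100 * (bls - ecdsa) / ecdsa:.0f}%")
```

With these made-up numbers the gain is in the 20-30% range the text mentions; the exact figure depends entirely on what fraction of a transaction is signature data.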
Many other one-shot tricks (such as Segregated Witness) are possible in blockchains, but you need a scalable architecture to achieve continuous performance improvement, where adding more resources improves performance over time. This is also conventional wisdom in many other computer systems, such as building web servers. With a few common tricks, you can build a very fast server; but ultimately, you need a multi-server architecture that keeps adding additional servers to meet growing demand.
Understanding this distinction also helps avoid a common category error found in statements like "Blockchain X is highly scalable, it can process Y transactions per second!" The second claim may be impressive, but it is a performance metric, not a scalability metric; it says nothing about the ability to increase performance by adding resources.
Scalability inherently requires exploiting parallelism. In the blockchain space, layer 1 scaling seems to require sharding or what looks like sharding. The basic concept of sharding — breaking state into chunks so that different validators can process it independently — fits nicely with the definition of scalability. Layer 2 has more options that allow for the addition of parallel processing — including off-chain channels, rollup servers, and sidechains.
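The performance-versus-scalability distinction can be made concrete with a toy model (entirely my own invented numbers): a fixed-capacity chain where every validator processes everything, versus an idealized sharded design that gains a shard's worth of throughput for each additional group of validators.

```python
def single_chain_tps(n_validators, cap=1000):
    """Every validator processes every transaction: capacity is fixed,
    so adding validators improves nothing. Good performance, no scalability."""
    return cap

def sharded_tps(n_validators, per_shard_tps=1000, validators_per_shard=100):
    """State is split into shards processed independently in parallel,
    so throughput grows with the validator count."""
    return (n_validators // validators_per_shard) * per_shard_tps

for n in (100, 1_000, 10_000):
    print(f"{n:>6} validators: single chain {single_chain_tps(n)} tps, "
          f"sharded {sharded_tps(n)} tps")
```

Real sharded systems scale far less cleanly than this (cross-shard transactions, committee overheads), but the shape of the comparison is the point: one curve is flat, the other grows with added resources.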
Latency vs. Throughput
Traditionally, blockchain system performance is evaluated along two dimensions, latency and throughput: latency measures how quickly an individual transaction can be confirmed, while throughput measures the aggregate rate of transactions over time. These axes apply to layer 1 and layer 2 systems, as well as to many other types of computer systems, such as database query engines and web servers.
Unfortunately, both latency and throughput are difficult to measure and compare. Also, individual users don't really care about throughput (it's a system-wide measure). What they really care about is latency and transaction fees — more specifically, their transactions getting confirmed as quickly as possible and as cheaply as possible. While many other computer systems are also evaluated on a cost/performance basis, transaction fees represent a new performance axis for blockchain systems that does not exist in traditional computer systems.
Challenges of Measuring Latency
Latency seems simple at first: how long does it take for a transaction to be confirmed? But there are always several different ways to answer this question.
First, we can measure the delay between different time points and get different results. For example, do we start measuring latency when the user hits the local "submit" button, or when the transaction hits the mempool? Do we stop the clock when a transaction is in a proposed block, or when a block is confirmed by one or six subsequent blocks?
The most common approach measures from a validator's perspective: from the time a client first broadcasts a transaction to the time it is reasonably "confirmed" (in the sense that a real-world merchant would consider the payment received and release merchandise). Of course, different merchants may apply different acceptance criteria, and even a single merchant may vary its criteria with the transaction amount.
The validator-centric approach ignores a few things that are important in practice. First, it ignores latency on the peer-to-peer network (how long does it take for a client to broadcast a transaction until a majority of nodes hear it?) and client-side latency (how long does it take for the transaction to be prepared on the client's local machine?). For simple transactions like signing an Ethereum payment, client-side latency can be very small and predictable, but for more complex cases like proving that shielded Zcash transactions are correct, it can be significant.
Even if we standardize which window of time we are measuring, latency is almost never a single number. No cryptocurrency system has ever offered fixed transaction latency. The basic rule of thumb to remember is:
Latency is a distribution, not a number.
The network research community has long understood this. Special emphasis is placed on the "long tail" of the distribution, since high latency in even 0.1% of transactions (or web server queries) can severely impact end users.
In a blockchain, confirmation delays can vary for a number of reasons:
Batching: Most systems batch transactions in some fashion, such as into blocks on most layer 1 systems. This leads to variable latency, since some transactions must wait until the batch fills up; others get lucky by joining the batch last and face no additional delay.
Variable congestion: Most systems suffer from congestion, meaning that (at least some of the time) more transactions are broadcast than the system can immediately handle. Congestion levels can vary because transactions are broadcast at unpredictable times (often abstracted as a Poisson process), because demand fluctuates over the course of a day or week, or in response to external events such as a popular NFT launch.
Consensus-layer variance: Confirming transactions at layer 1 typically requires a distributed set of nodes to reach consensus on a block, which can add variable latency independent of congestion. Proof-of-work systems find blocks at unpredictable times (also naturally abstracted as a Poisson process). Proof-of-stake systems can also add various delays (for example, if not enough nodes are online to form a committee in a round, or if a view change is needed in response to a crashed leader).
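Even the first two effects alone produce a skewed latency distribution. Here is a toy simulation (my own sketch, with invented parameters): transactions arrive as a Poisson process, blocks are found at exponentially distributed intervals as in proof of work, and every pending transaction is included in the next block (i.e., no congestion at all). Confirmation latency is then just the wait for the next block, yet it still has a long tail.

```python
import random
import statistics

random.seed(1)

BLOCK_INTERVAL = 12.0   # mean seconds between blocks (exponential, PoW-style)
ARRIVAL_RATE = 5.0      # mean transactions broadcast per second (Poisson)

def simulate(n_blocks=5000):
    """Return per-transaction confirmation latencies, assuming every
    pending transaction fits into the very next block."""
    latencies = []
    now = 0.0
    for _ in range(n_blocks):
        confirm = now + random.expovariate(1.0 / BLOCK_INTERVAL)
        t = now + random.expovariate(ARRIVAL_RATE)
        while t < confirm:               # transactions arriving before the block
            latencies.append(confirm - t)
            t += random.expovariate(ARRIVAL_RATE)
        now = confirm
    return latencies

lat = simulate()
qs = statistics.quantiles(lat, n=100)
print(f"median {qs[49]:.1f}s  p95 {qs[94]:.1f}s  p99 {qs[98]:.1f}s  max {max(lat):.1f}s")
```

Running this shows the p99 latency several times larger than the median, before adding congestion or consensus delays at all.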
For these reasons, a good guideline is:
Statements about latency should present a distribution (or histogram) of confirmation times, not a single number like mean or median.
While summary statistics such as means, medians, or percentiles give part of the picture, accurately assessing a system requires considering the entire distribution. In some applications, mean latency provides good insight if the latency distribution is relatively simple (e.g., Gaussian). But in cryptocurrencies it almost never is: typically, there is a long tail of slow confirmation times.
Payment channel networks such as the Lightning Network are a good example. As classic L2 scaling solutions, these networks provide very fast payment confirmations in most cases, but sometimes they require channel resets, which can add orders of magnitude to latency.
Even if we have good statistics on the exact latency distribution, it will likely vary over time as the system and demand on the system change. It is also not always clear how to compare latency distributions between competing systems. For example, consider one system that confirms transactions with latency uniformly distributed between 1 and 2 minutes (mean and median of 90 seconds), and a competing system that confirms 95% of transactions in exactly 1 minute and the other 5% in exactly 11 minutes (mean of 90 seconds, median of 60 seconds). Which system is better? The answer is probably that some applications prefer the former and some the latter.
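The two hypothetical systems in this example can be checked numerically; the sampling below is my own construction of the distributions described above.

```python
import random
import statistics

random.seed(0)

# System A: latency uniform between 60s and 120s
a = [random.uniform(60, 120) for _ in range(100_000)]
# System B: 95% of transactions at 60s, 5% at 660s (11 minutes)
b = [60.0 if random.random() < 0.95 else 660.0 for _ in range(100_000)]

print(f"A: mean {statistics.mean(a):.0f}s, median {statistics.median(a):.0f}s, "
      f"max {max(a):.0f}s")
print(f"B: mean {statistics.mean(b):.0f}s, median {statistics.median(b):.0f}s, "
      f"max {max(b):.0f}s")
```

Both means come out near 90 seconds, yet B's worst case is more than five times A's: identical summary statistics, very different user experiences.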
Finally, it is important to note that in most systems not all transactions have the same priority. Users can pay more to get higher inclusion priority, so in addition to all of the above, the latency depends on the transaction fee paid. In short:
Latency is complicated. The more data reported, the better. Ideally, complete latency distributions should be measured under varying congestion conditions. Breaking latency down into components (local latency, network latency, batching latency, consensus latency) is also helpful.
Challenges of Measuring Throughput
Throughput also seems simple at first glance: how many transactions per second can a system process? Two main difficulties arise: what exactly is a "transaction," and are we measuring what a system does today or what it might be capable of doing?
While "transactions per second" (tps) is the de facto measure of blockchain performance, transactions are problematic as a unit of measure. For any system offering general programmability ("smart contracts"), or even limited features like Bitcoin's multi-input/multi-output transactions or multisig verification options, the fundamental issue is:
Not all transactions are created equal.
This is clearly true in Ethereum, where transactions can include arbitrary code and arbitrarily modify state. The concept of gas in Ethereum is used to quantify (and charge fees for) the total work a transaction does, but gas is highly specific to the EVM execution environment. There is no straightforward way to compare the total work done by a set of EVM transactions to a set of Solana transactions executed in a BPF environment, and comparing either to a set of Bitcoin transactions is equally fraught.
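A small numeric sketch (the chains and all figures here are hypothetical, invented for illustration) shows why raw tps can invert the comparison you actually care about: weighting transactions by an execution-cost unit like gas can rank two systems in the opposite order from counting transactions.

```python
# Hypothetical chain X: many lightweight transfers; 21,000 gas is the
# familiar cost of a simple Ethereum payment, reused here for flavor.
chain_x = {"tps": 4000, "avg_gas_per_tx": 21_000}
# Hypothetical chain Y: fewer, heavier contract calls.
chain_y = {"tps": 300, "avg_gas_per_tx": 400_000}

for name, c in (("X", chain_x), ("Y", chain_y)):
    gas_per_sec = c["tps"] * c["avg_gas_per_tx"]
    print(f"chain {name}: {c['tps']:>4} tps = {gas_per_sec:>11,} gas/s")
```

With these numbers X "wins" on tps while Y executes more gas-weighted work per second; and since gas only makes sense inside the EVM, even the second comparison breaks down once a non-EVM chain enters the picture.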
Blockchains that separate the transaction layer into a consensus layer and an execution layer can make this cleaner. At the (pure) consensus layer, throughput can be measured in bytes added to the chain per unit of time. The execution layer is always more complex.
A simpler execution layer, such as a rollup server that supports only payment transactions, avoids the difficulty of quantifying computation. Even in this case, though, payments can vary in their number of inputs and outputs. Payment-channel transactions can vary in the number of "hops" required, which affects throughput. And rollup-server throughput can depend on the extent to which a batch of transactions can be "netted" down to a smaller set of summarized changes.
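The netting idea can be illustrated with a minimal sketch (my own construction, not any specific rollup's algorithm): collapse a batch of payments into net balance changes, so the size of the summarized update depends on how much the payments cancel out.

```python
from collections import defaultdict

def net_batch(payments):
    """Collapse (sender, receiver, amount) payments into net balance deltas.
    Accounts whose payments fully cancel need no on-chain update at all."""
    deltas = defaultdict(int)
    for sender, receiver, amount in payments:
        deltas[sender] -= amount
        deltas[receiver] += amount
    return {acct: d for acct, d in deltas.items() if d != 0}

batch = [("alice", "bob", 5), ("bob", "carol", 5), ("carol", "alice", 3)]
print(net_batch(batch))  # -> {'alice': -2, 'carol': 2}
```

Three payments net down to two balance updates here (bob's flows cancel entirely); a batch of circular payments could net to zero updates, while a batch with no overlap nets to nothing at all, which is exactly why measured "throughput" depends on the workload.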
Another challenge with throughput is moving beyond empirically measuring today's performance to assessing theoretical capacity, which introduces all sorts of modeling problems. First, we must settle on a realistic transaction workload for the execution layer. Second, real systems almost never achieve theoretical capacity, blockchain systems especially. For robustness, we want node implementations to be heterogeneous and diverse in practice (rather than all clients running a single software implementation), which makes accurate simulation of blockchain throughput even harder.
Overall:
Throughput claims require careful explanation of the transaction workload and the validator population (its size, implementations, and network connectivity). In the absence of any clear standard, historical workloads from a popular network like Ethereum can suffice.
The Latency-Throughput Tradeoff
Latency and throughput typically trade off against each other. As Lefteris Kokoris-Kogias has noted, this tradeoff is often not smooth, with latency increasing dramatically as system load approaches maximum throughput.
Zero-knowledge rollup systems provide a natural example of the throughput/latency tradeoff. Larger batches of transactions increase proving times and thus latency. But the on-chain footprint, in both proof size and verification cost, is amortized over more transactions with larger batches, increasing throughput.
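A toy cost model makes the tradeoff concrete (every constant below is invented; real proving times are not linear in batch size and real calldata costs vary, so this is a shape-of-the-curve sketch only):

```python
FIXED_VERIFY_GAS = 500_000     # on-chain proof verification cost, per batch
CALLDATA_GAS_PER_TX = 2_000    # per-transaction on-chain data cost
PROVE_SECONDS_PER_TX = 0.05    # assumed prover time per transaction

for batch_size in (10, 100, 1000):
    # the fixed verification cost is amortized across the batch...
    gas_per_tx = FIXED_VERIFY_GAS / batch_size + CALLDATA_GAS_PER_TX
    # ...but every transaction waits for the whole batch to be proven
    latency = batch_size * PROVE_SECONDS_PER_TX
    print(f"batch {batch_size:>4}: {gas_per_tx:>8.0f} gas/tx, "
          f"~{latency:.1f}s added proving latency")
```

Per-transaction cost falls toward the calldata floor as batches grow, while latency grows without bound: the operator must pick a point on this curve.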
Transaction Fees
End users, understandably, care more about the tradeoff between latency and fees than between latency and throughput. Users have no direct reason to care about throughput at all, only that their transactions are confirmed quickly with the lowest fees possible (some users weight fees more heavily, others latency). At a high level, fees are determined by several factors:
- How much market demand is there to make transactions?
- What is the total throughput achieved by the system?
- How much total revenue does the system pay to validators or miners?
- How much of this revenue is based on transaction fees versus inflation rewards?
The first two factors are essentially supply and demand curves that lead to a market-clearing price (though it has been claimed that miners can act as a cartel to raise fees above this point). All else being equal, more throughput should lead to lower fees, but there is much more going on.
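The supply-and-demand point can be sketched with a stylized auction (entirely my own toy model; real fee mechanisms like EIP-1559 work differently): with throughput as fixed supply and a list of fee bids as demand, the clearing fee is set by the marginal bid that still fits.

```python
def clearing_fee(bids, capacity):
    """bids: fee each pending transaction offers; capacity: txs per block.
    Returns the lowest bid that still gets included."""
    ranked = sorted(bids, reverse=True)
    if len(ranked) <= capacity:
        return 0                   # no congestion: everything gets in
    return ranked[capacity - 1]    # marginal included bid sets the price

bids = [50, 40, 30, 20, 10, 5]
print(clearing_fee(bids, capacity=3))  # -> 30
print(clearing_fee(bids, capacity=5))  # -> 10
```

Holding demand fixed, raising capacity from 3 to 5 drops the clearing fee from 30 to 10, which is the "more throughput should mean lower fees" intuition, all else being equal.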
The third and fourth points above are fundamental questions of blockchain system design, yet we lack good principles for either. We have some understanding of the pros and cons of giving miners income from inflation rewards versus transaction fees. However, despite many economic analyses of blockchain consensus protocols, we still have no widely accepted model for how much revenue needs to flow to validators. Today most systems are built on an educated guess about how much revenue is enough to keep validators behaving honestly without choking off practical use of the system. In simplified models, it can be shown that the cost of mounting a 51% attack scales with validator rewards.
Raising the cost of attacks is a good thing, but we also don't know how much security is "enough." Imagine you are deciding between two amusement parks, and one of them claims to spend 50% less on ride maintenance than the other. Is it a good idea to go to that park? Maybe it is more efficient and gets equivalent safety for less money. Perhaps the other park is spending more than needed to keep the rides safe, for no benefit. But it could also be that the first park is simply dangerous. Blockchain systems are analogous. Controlling for throughput, a blockchain might have lower fees because it rewards (and therefore attracts or incentivizes) less validation. We don't have good tools today to assess whether that is fine or whether it leaves the system vulnerable to attack. Overall:
Comparing fees between different systems can be misleading. Although transaction fees are important to users, they are affected by many factors besides the system design itself. Throughput is a better metric for analyzing the overall system.
Conclusion
Assessing performance fairly and accurately is difficult. The same goes for measuring a car's performance: just as with blockchains, different people care about different things. With cars, some users care about top speed or acceleration, others about fuel economy, and still others about towing capacity. None of these is trivial to evaluate. In the United States, for example, the Environmental Protection Agency maintains detailed guidelines both on how fuel economy is measured and on how it must be presented to users at dealerships.
The blockchain space is still far from this level of standardization. In some areas we may get there eventually, with standardized workloads for measuring a system's throughput or standardized charts for presenting latency distributions. For now, the best approach for evaluators and builders is to collect and publish as much data as possible, with the evaluation methodology described in enough detail that it can be reproduced and compared with other systems.