Kairos: Scaling to Millions of Transactions Per Second

  • Our consensus mechanism, Kairos, is able to sequence transactions at record-breaking speeds of millions of transactions per second.

  • We are able to do this because:

    1. Kairos only focuses on sequencing transactions and not executing them.

    2. Participants are able to produce blocks independently of each other.

  • See how we compare:

    Platform   Transactions/Second   Finality
    Bitcoin    7                     Probabilistic
    Ethereum   119                   > 13 minutes
    Solana     65,000                > 12 seconds
    Espra      > 1,000,000           < 2 seconds
  • Kairos is also byzantine fault-tolerant and can tolerate up to a third of the network being faulty or malicious.

Why Consensus?

Establishing the order of transactions is important for certain apps. For example, consider a ticketing app that only has 5 tickets left for a concert:

  • Alice wants to buy 4 tickets.
  • Zeno wants to buy 2 tickets.

There are only enough tickets for one of them. If the app knew which order came in first, it could sell tickets to that buyer and reject the second request.

And this is exactly what a consensus algorithm does. It puts things into a sequence that’s agreed upon by the whole network.
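The ticketing example can be sketched in a few lines. This is a toy illustration, not part of Kairos itself — the names and numbers are from the example above:

```python
# Toy sketch: once the network agrees on an order, applying the
# requests in that sequence resolves the conflict deterministically.

def apply_orders(tickets_left, ordered_requests):
    """Process ticket requests in the agreed-upon sequence."""
    results = {}
    for buyer, amount in ordered_requests:
        if amount <= tickets_left:
            tickets_left -= amount
            results[buyer] = "sold"
        else:
            results[buyer] = "rejected"
    return results

# If consensus puts Alice first, she gets her 4 tickets, and
# Zeno's request for 2 is rejected as only 1 ticket is left.
print(apply_orders(5, [("Alice", 4), ("Zeno", 2)]))
# → {'Alice': 'sold', 'Zeno': 'rejected'}
```

Whichever order the network agrees on, every participant that replays the requests in that order reaches the same outcome.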

Existing Consensus Models

There are lots of consensus algorithms. Many of them predate blockchains. Viewstamped Replication [OL88], for example, came out in 1988.

In fact, algorithms like Paxos [Lam98] and Raft [OO13] sit at the heart of a lot of today’s cloud infrastructure, and are a key part of enabling those services to scale to billions of users.

Most of these algorithms are not byzantine fault-tolerant (BFT) [LRM82] — which is just an academic way of saying that they can’t handle faulty or malicious participants.

This doesn’t matter at places like Google, which control all of their own infrastructure, but BFT is essential in permissionless, i.e. open, systems where anyone can participate.

There have been many BFT algorithms like PBFT [CL99], but these tend to be quite complex. In fact, it took Google a lot of effort [CGR07] to tame the much simpler Paxos.

So, when Bitcoin came out, it was a breath of fresh air. It introduced Proof-of-Work (PoW) [Nak08], the simplest BFT consensus algorithm ever:

  • Pending transactions are aggregated into units called “blocks”.

  • For a participant to successfully create a block, they have to do “work” by solving a computationally intense puzzle.

  • Every block always follows on from a previous block. This creates a chain of blocks and thus a sequence for all of the transactions that have taken place.

  • If multiple participants create new blocks on top of an existing one, this results in a conflict.

  • Proof-of-Work resolves this by selecting the “longest” chain as the official one, i.e. the one with the most blocks in it — as this would be backed by the most computational effort.
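The fork-choice rule above reduces to a one-liner. This is a deliberately minimal sketch: real implementations compare cumulative proof-of-work, which the block count merely approximates here:

```python
# Minimal sketch of the longest-chain rule: given competing chains
# (lists of blocks), pick the one with the most blocks. Real nodes
# compare cumulative work rather than raw block counts.

def select_chain(chains):
    return max(chains, key=len)

chain_a = ["genesis", "b1", "b2"]
chain_b = ["genesis", "b1", "b2'", "b3'"]

# The longer fork wins, and chain_a's tip block gets orphaned.
assert select_chain([chain_a, chain_b]) is chain_b
```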

That last bit was brilliant thinking, and added a key missing piece to the vision of a secure, distributed computer system that David Chaum had articulated way back in 1982 [Cha82].

But Proof-of-Work has lots of downsides:

  1. Transactions are never final.

    Since it’s possible for multiple participants to create conflicting blocks, there’s always the risk of your transactions being voided by a longer chain.

    As time goes on, it gets more and more expensive for others to come out with a longer chain, as it’d need more computational work. But the risk is always there.

    This is why lots of exchanges will wait for a number of “confirmations” before processing your deposits. This lets them manage their risk in case your deposit gets voided.

    But it’s not reasonable to expect someone to wait around for confirmations for 20-60 minutes after they’ve paid for their coffee. It should feel instantaneous.

  2. It uses a lot of energy.

    Solving those computational puzzles takes a lot of work. Each transaction on Bitcoin uses as much energy as the average U.S. household does in an entire month [BECI].

  3. It doesn’t scale.

    Despite all of that effort, Bitcoin only manages to process 7 transactions a second. An iPhone from 10 years ago could process more transactions than that.

    Bitcoin fans will claim that “layer 2” solutions like the Lightning Network [PD16] will solve these issues. However, users still need to make transactions on Bitcoin for that.

    And it would take 3 years for just the U.S. population alone to open and close a single channel on something like the Lightning Network. This is clearly not scalable.

Other BFT consensus algorithms are not much better. While most don’t waste energy as Proof-of-Work does, none of them scale to billions of users.

That limits the kind of apps we can build. WhatsApp processes almost 2 million messages a second. The infrastructure of the future should be able to scale beyond that.

Sequence, Don’t Execute

Most consensus systems depend on participants executing transactions before coming to an agreement on the resulting state. This adds a lot of latency.

For example, in Solana’s Proof of History [Yak17]:

  • The current “leader” node has to first sequence the transactions, execute each of them, and then publish the transactions and the resulting state to “verifier” nodes.

  • The “verifier” nodes then have to execute all the same transactions again, before publishing a confirmation that they arrived at the same resulting state.

  • The consensus mechanism then uses the published confirmations as votes.

It takes time to spin up and tear down the execution machinery for each transaction. Not to mention the time it takes for the actual execution of those transactions. This all adds up.

In Kairos, we don’t do any of that. The consensus algorithm only focuses on sequencing the transactions, i.e. arranging them into an order that is agreed upon by all participants.

This is similar to Calvin’s [TDW+12] approach for traditional databases. Going back to our opening example, if the agreed upon sequence is that Zeno’s transaction is before Alice’s, then:

  • When the affected parties eventually execute those transactions, Zeno will find that he managed to buy some tickets, and Alice will find that her attempt failed.

Since the resulting state will be the exact same — barring any implementation bugs in the execution machinery — it doesn’t matter if the transactions are executed during consensus or later.

In other words, as long as you’ve agreed on the order of the inputs, and your execution is deterministic, i.e. repeatable, the outputs will be the exact same — regardless of when execution takes place.
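This property is easy to demonstrate. The sketch below uses a made-up transfer format purely for illustration:

```python
# Sketch: with an agreed input order and deterministic execution,
# replaying the same transactions always yields the same state, no
# matter when or where execution happens.

def execute(state, transactions):
    state = dict(state)  # don't mutate the caller's copy
    for account, delta in transactions:
        state[account] = state.get(account, 0) + delta
    return state

txs = [("zeno", -20), ("alice", +20), ("zeno", +5)]

# Executing during consensus or long afterwards gives identical
# results, so execution can safely be deferred.
now = execute({"zeno": 100, "alice": 0}, txs)
later = execute({"zeno": 100, "alice": 0}, txs)
assert now == later == {"zeno": 85, "alice": 20}
```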

Concurrent Block Production

Early consensus mechanisms tried to get the network to agree on blocks one at a time. But it takes time for information to travel. The speed of light imposes a hard limit.

As a result, blocks could only be produced relatively slowly. To save on coordination, newer protocols try to predetermine who should produce a block at a particular time.

For example, Ethereum’s consensus protocol, Gasper [BHK+20], elects a single proposer to produce a block during each 12-second slot. This results in a lot less communication overhead.

In an ideal world, participants would produce blocks independently of each other, and a consensus algorithm would magically stitch them together to create a sequence that everyone agrees on.

These are called DAG-based consensus algorithms, where DAG is a mathematical term that’s short for “Directed Acyclic Graph”. Breaking that down:

  • graph means that it’s about a network of relationships between things.

  • directed means that each relationship only goes one way, e.g. A can refer to B without B referring back to A.

  • acyclic means that those relationships never form cycles, so if A refers to B, which refers to C, then C can’t refer back to A.
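The acyclic property can be checked mechanically with a depth-first search. This is a generic textbook check, not Kairos-specific code:

```python
# Sketch of the "acyclic" property: a depth-first search that
# detects whether a set of directed references contains a cycle.

def has_cycle(graph):
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back-edge found: a cycle
        visiting.add(node)
        if any(visit(ref) for ref in graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(node) for node in graph)

# A -> B -> C is a valid DAG; adding C -> A would create a cycle.
assert has_cycle({"A": ["B"], "B": ["C"], "C": []}) is False
assert has_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}) is True
```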

There are many DAG-based algorithms where the graph defines what’s being sequenced. Some of the more impressive ones include Hedera’s Hashgraph [Bai16] and Chainspace’s Blockmania [DH18].

Kairos belongs to this family of algorithms where participants don’t need to coordinate with others to produce their own blocks. This gives us incredible scale.

In addition, we address areas where DAG-based systems usually fall short, e.g. as long as a supermajority of the network is online, we guarantee that all transactions will be final within a few seconds.

Safety vs. Liveness

A consensus algorithm would ideally provide all of the following:

  1. Safety/Finality

    All participants running the consensus algorithm must come to the same conclusion as everyone else. And, once agreed, they can’t change their minds about it.

  2. Liveness

    Participants should be able to arrive at consensus and continue doing so.

  3. Fault Tolerance

    The system should be able to handle failures, e.g. participants disappearing, messages being delayed, participants being intentionally malicious, etc.

We know from the FLP theorem [FLP85] that it’s impossible for a consensus algorithm to guarantee all three of these in an asynchronous distributed system.

Since fault tolerance is often desired in distributed systems, most consensus algorithms end up sacrificing one of the other two aspects under certain conditions, e.g.

  • Bitcoin’s Proof-of-Work sacrifices safety [Nak08]. The chain may “fork” into different versions at any point, and the forks can be replaced by longer chains at any point.

  • Raft sacrifices liveness [OO13]. Since Raft’s leader election could go on forever due to split votes, it uses randomized timeouts to reduce the chances of being blocked forever.

  • Ethereum’s Gasper sacrifices safety in one of its two components and liveness in the other [BHK+20]:

    • LMD GHOST sacrifices safety while adding blocks. It can always change its mind as to where the head of the chain currently is.

    • Casper FFG sacrifices liveness while finalizing which set of blocks will be a permanent part of the chain.

Similarly, from the PACELC theorem [Aba12] that extends the CAP theorem [FB99], we know that in a distributed system:

  • You need to deal with the possibility of network partitions, i.e. network failures where parts of the network aren’t able to communicate with each other.

  • If there is a network partition (P), we have to choose between availability (A) and consistency (C).

  • Else (E), in the absence of network partitions, we have to choose between latency (L) and consistency (C).

Kairos chooses safety over liveness and is a PC/EC system, i.e. we choose consistency over availability in partitions, and choose consistency over latency when there aren’t any partitions.

But since networks are reasonably reliable nowadays, in practical terms, one could still build Kairos-based systems with 99.999% availability like Google did with Spanner [Bre17].

Kairos Assumptions

For Kairos to work, we depend on the following assumptions:

  • Since the FLP impossibility result is for “asynchronous” participants where messages might be delayed forever, we weaken this constraint by using clock time.

    This allows us to timeout non-responsive nodes and detect whether they’ve failed. For this, we assume that the clocks used by participants are reasonably in sync with each other.

  • Kairos can tolerate a certain level of “faulty” participants, i.e. those that stop working and never come back (crash failure) or those that violate the protocol (byzantine failure).

    If there are f faulty participants, then Kairos expects there to be at least 2f+1 “honest” participants, i.e. there must be at least 3f+1 participants in total.

    And this is because:

    • When making decisions, we need to see votes from at least 2f+1 participants to be sure that the honest ones are in the majority, i.e. f + 1 > f.

    • But since f participants might not respond at all, you can’t get 2f+1 votes from a total of 2f+1 participants. So, you need at least 3f+1 in total.

  • If honest participants are temporarily unable to communicate with each other, e.g. due to network partitions, we assume that they will always be able to do so eventually.

  • If there are 3f+1 participants in total, then Kairos needs at least 2f+2 participants to be active in order to continue making progress on consensus.

  • We assume that the cryptographic algorithms used for digital signatures and hashing are secure.

    • If the digital signature algorithm is broken, malicious parties could potentially forge messages from other participants.

    • If the hashing algorithm is broken, malicious parties could potentially make you think you’re agreeing to something, but later claim that you agreed to something else.

  • Any compromise of the cryptographic keys used by a participant could potentially result in them becoming faulty, e.g. if used by a malicious party to forge signatures.

    Similarly, any malware on a participant’s systems could also potentially result in them becoming faulty, e.g. by making them produce invalid data.

    We assume that any such increase to the f number of faulty participants will still manage to keep the total number of participants above 3f.
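The quorum arithmetic behind the 3f+1 requirement can be written down directly. A small sketch of the reasoning above:

```python
# Sketch of the fault-tolerance arithmetic: with n = 3f + 1
# participants, a quorum of 2f + 1 votes always contains an honest
# majority, even if all f faulty participants are among the voters.

def quorum_sizes(f):
    n = 3 * f + 1                  # minimum total participants
    quorum = 2 * f + 1             # votes needed per decision
    honest_in_quorum = quorum - f  # worst case: f faulty voters
    return n, quorum, honest_in_quorum

n, quorum, honest = quorum_sizes(f=3)
assert (n, quorum) == (10, 7)

# Even if all 3 faulty nodes vote, the 4 honest voters in the
# quorum still outnumber them: f + 1 > f.
assert honest > 3
```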

Kairos Consensus Algorithm

Kairos is the simplest byzantine fault-tolerant consensus algorithm with deterministic finality.

You initiate it with some starting parameters:

  • validators — the IDs, IP addresses, operator details, public keys, and transaction size limits of the initial set of validator nodes that participate in consensus.

  • tick_interval — the interval at which validator nodes produce new blocks, e.g. every 100 milliseconds.

  • genesis_time — the date and time of the 0th tick, e.g. midnight on November 5th 2024 TAI.

  • commit_window — the number of ticks within which a validator block can be committed, e.g. within 5 ticks.

  • average_block_size_limit — the average value of the block size limits for all validators, e.g. 20MB.

  • system_change_limits — the rate at which validator nodes can make system changes, e.g.

    // Change limits per validator node within a rolling 24
    // hour window.
    system_change_limits {      
      init_new_validator = 5
      propose_limit_change = 2
      update_public_key = 3
    }
    

    Having such limits will make it harder for malicious nodes to flood the system with a lot of noise, e.g. by constantly updating their public keys.

The cryptographic algorithms being used must also be specified:

  • hash_algorithm — the hash algorithm used to produce things like block hashes, e.g. BLAKE3 [OAN+19].

  • delay_function_params — the parameters used for the delay function that is used to aggregate validator blocks into a single “network block”, e.g.

    delay_function_params {
      algorithm = "argon2id"
      iterations = 20
      memory = 4.GB
      parallelism = 8
    }
    
  • default_signature_algorithm — the algorithm used by the validator nodes to produce digital signatures in day-to-day operations, e.g. Ed25519 [BDL+11].

  • high_security_signature_algorithm — the algorithm used by validator nodes when producing a second signature for high security contexts, e.g. when updating their public keys.

    Having two distinct algorithms gives us some future-proofing. For example, Ed25519 is expected to be breakable by quantum computers in the future.

    But it doesn’t make sense to use post-quantum algorithms like SPHINCS+ [BHK+19] all the time since they tend to be a lot slower, and quantum computers aren’t a threat yet.

    By using a post-quantum algorithm for just higher security contexts, we give ourselves slightly better protection from the future while still being pragmatic in the now.

All of these parameters, except genesis_time, can be updated via consensus using system transactions. This allows the system to evolve over time.

At each tick_interval, every validator node has the option to produce a block:

type ValidatorBlock {
    block_number
    node_id
    parent_hash
    references
    signature
    tick
    tick_signature
    transactions
    transactions_index
}

The block_number must start at 0 for the very first block produced by a validator node, and be incremented by exactly 1 for every subsequent block.

The transactions in the block may include user transactions submitted by the validator's users, as well as system transactions submitted by the validator itself, e.g. to update its public key.

The transactions_index enables optimized lookups, e.g.

  • When using a verified streaming hash like BLAKE3 [OAN+19], this can let users verify that a transaction is present in a block without having to download the whole block.

The parent_hash specifies the hash of the block’s parent, i.e. the previous block produced by the validator. For the 0th block, which has no parent, this value is the hash of the starting parameters.

The tick_signature is a signature of the current tick value by the validator node. This gives us a source of “randomness” that other nodes can’t predict — similar to Ethereum’s RANDAO [Edg22].

The validator node’s signature attests to the block hash derived from hashing together all the other fields of the ValidatorBlock using the hash_algorithm.
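The hash-then-sign step can be sketched as follows. Note that this is illustrative: the real hash_algorithm is BLAKE3, for which hashlib's BLAKE2b stands in here, and the field serialization is a placeholder, not Kairos' actual wire format:

```python
# Simplified sketch of deriving a block hash from the non-signature
# fields of a ValidatorBlock. BLAKE2b stands in for BLAKE3, and the
# field serialization here is illustrative only.
import hashlib

def block_hash(block: dict) -> bytes:
    h = hashlib.blake2b(digest_size=32)
    for field in sorted(block):
        if field == "signature":
            continue  # the signature attests to the hash, not vice versa
        h.update(field.encode())
        h.update(repr(block[field]).encode())
    return h.digest()

block = {"block_number": 0, "node_id": 42, "tick": 7,
         "parent_hash": b"\x00" * 32, "transactions": [],
         "signature": b""}

# The hash is stable across runs, and changes if any field changes.
assert block_hash(block) == block_hash(dict(block))
assert block_hash(block) != block_hash({**block, "tick": 8})
```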

A block’s references lists the blocks that it has “seen” from other validators:

type Reference {
    block_hash
    block_number
    node_id
    tick
}

For a block to be “seen”, the receiving validator must:

  • Have fully validated the contents of the block, e.g. that it isn’t malformed, has valid signatures, doesn’t conflict with a previous block from the same validator, etc.

  • Have “seen” all blocks referenced by the block, as well as its parent block.

A block can only be referenced if it’s from a tick in the past. Blocks from the future can only be referenced once the validator node’s tick moves beyond it.
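The two rules above can be sketched as simple predicates. This omits the full validation step (malformed data, signatures, conflicts), and the structures are illustrative:

```python
# Sketch of the "seen" rule: a block only counts as seen once its
# parent and every block it references have themselves been seen.
# A block can only be referenced from a strictly later tick.

def can_mark_seen(block, seen_hashes):
    parent = block["parent_hash"]
    if parent is not None and parent not in seen_hashes:
        return False
    return all(ref in seen_hashes for ref in block["references"])

def can_reference(block_tick, current_tick):
    return block_tick < current_tick

seen = {"hash-parent", "hash-x"}
block = {"parent_hash": "hash-parent", "references": ["hash-x"]}
assert can_mark_seen(block, seen) is True

# A block from the current (or a future) tick can't be referenced
# until the validator's own tick moves beyond it.
assert can_reference(block_tick=10, current_tick=10) is False
assert can_reference(block_tick=10, current_tick=11) is True
```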

This ends up creating a DAG of all the blocks from the perspective of each validator node.

Once produced, blocks are propagated across the network. For Espra, we will use a push-pull gossip protocol that takes advantage of the network topology for disseminating blocks.

In order to get sequenced, a validator block must be marked as “committed”. For a validator node to “commit” a block that has been produced, either by itself or others, at tick t:

  • It must first have “seen” blocks after tick t from at least 2f+1 validators besides itself.

  • The specific block must have been referenced by at least f+1 validators within the commit_window.

    For example, if the commit_window was 6 ticks, then for a block produced at tick 10, it must have been referenced by f+1 validators by tick 16 at the latest.
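The commit rule can be sketched as a single check. The data structures here are illustrative, not Kairos' actual internal representation:

```python
# Sketch of the commit rule for a block produced at tick t.
# referencing_validators maps validator -> tick of their reference;
# seen_after_tick is the set of other validators from whom we've
# seen blocks after tick t.

def is_committed(block_tick, referencing_validators, seen_after_tick,
                 f, commit_window):
    # Must have seen post-t blocks from at least 2f+1 other validators.
    if len(seen_after_tick) < 2 * f + 1:
        return False
    # The block must be referenced by at least f+1 validators within
    # the commit window.
    in_window = {v for v, tick in referencing_validators.items()
                 if tick <= block_tick + commit_window}
    return len(in_window) >= f + 1

# With f = 2 and a commit_window of 6, a block from tick 10 needs
# references from 3 validators by tick 16 at the latest.
refs = {"v1": 11, "v2": 12, "v3": 16}
seen = {"v1", "v2", "v3", "v4", "v5"}
assert is_committed(10, refs, seen, f=2, commit_window=6) is True

# A lone reference arriving at tick 17 is too late.
assert is_committed(10, {"v1": 17}, seen, f=2, commit_window=6) is False
```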

The committed blocks for a particular tick are then sequenced into a NetworkBlock for that tick:

type NetworkBlock {
    tick
    committed_blocks
    parent_hash
    spaces_index
}

Note that transactions from validator blocks which didn’t get committed are ignored. If desired, the uncommitted transactions could be included again in a future block by its validator.

Sequencing the Committed Blocks

Once we know which validator blocks have been committed for a particular tick, we can establish a total order of all transactions by ordering the committed blocks, e.g.

  • If validators A, B, and D produced blocks that got committed.

  • If the ordering of the committed blocks were to be: (D, A, B).

  • Then the finalized sequence of all transactions would be all of the transactions in D, followed by the ones in A, followed by the ones in B.

But how do we establish a fair ordering of the committed blocks? A naive approach would be to order the committed blocks by their block hashes, e.g.

sorted_blocks = sorted(
    committed_blocks,
    key=lambda block: block.hash()
)

But this could be exploited by malicious validators who use it to get their transactions ahead of others:

  • A validator could keep adding a “garbage” transaction to the end of their block until it produces a lower block hash than blocks they have seen from other validators.

  • Since there is time until the end of the commit_window to get their block seen by others, validators can hold back their block for a particular tick t while they find this lower hash.

To make things fairer, Kairos makes use of a delay_function together with some unguessable “randomness” from each block’s tick_signature:

# First, get a sorted list of tick signatures from all the
# committed blocks.
tick_signatures = sorted(
    block.tick_signature
    for block in committed_blocks
)

# Run the delay function to derive the delay_hash.
delay_hash = delay_function(
    tick.to_bytes(8, 'little') +
    b''.join(tick_signatures)
)

# Finally, sort the committed blocks by comparing the hash of
# the delay_hash combined with the block's tick signature.
sorted_blocks = sorted(
    committed_blocks,
    key=lambda block: hash(delay_hash + block.tick_signature)
)

This approach makes it harder for validator nodes to bias the order of the committed blocks:

  1. Because the derivation of the delay_hash is deterministic, the ways in which a validator node can impact it are limited to whether or not it produces a block for a particular tick.

  2. The use of a delay_function adds time delays — limiting the window of opportunity for a validator node to figure out a beneficial way in which to bias the order, e.g. by collusion.

  3. Even if a validator manages to bias the order, it would incentivize others to do so too. But that would make it difficult for them to keep doing so, as they would keep affecting each other.

We believe this provides an approach to ordering the committed blocks that is both fair and reasonably hard to bias — especially as the number of validator nodes grows.

We could also use an alternative approach to ordering that would be harder to bias, but would significantly delay execution:

  1. Instead of directly sharing its tick_signature for a block, each validator will first share the hash of it as a commitment.

  2. Each validator will reveal the actual tick_signature in a future block it produces once the commit_window has passed.

  3. All validators which don’t produce such a block within some finality_window, e.g. 2 * commit_window, will have their blocks removed from the committed_blocks for the original tick.

  4. The remaining committed_blocks are then ordered by using the hash of all the revealed tick_signature values, and using that as the delay_hash.
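The commit-reveal steps above can be sketched as follows. SHA-256 stands in for the configured hash_algorithm, and the structures are illustrative:

```python
# Sketch of the commit-reveal variant: validators first share a
# hash of their tick_signature as a commitment, then reveal the
# actual value in a later block. Any validator whose reveal doesn't
# match its commitment (or never arrives within the finality
# window) has its block dropped from the tick.
import hashlib

def commit(tick_signature: bytes) -> bytes:
    return hashlib.sha256(tick_signature).digest()

def finalize(commitments, reveals):
    """Keep only validators whose reveal matches their commitment."""
    return {v: reveals[v] for v, c in commitments.items()
            if v in reveals and commit(reveals[v]) == c}

commitments = {"A": commit(b"sig-a"), "B": commit(b"sig-b")}
reveals = {"A": b"sig-a"}  # B never reveals within the window

# Only A's block survives; B's is removed from committed_blocks.
assert finalize(commitments, reveals) == {"A": b"sig-a"}
```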

Once the committed blocks are ordered, the resulting NetworkBlock is finalized. This results in a total order/sequence of all transactions with strict serializability [AF19].

Control Protocol

Kairos is highly flexible. It can be paired with a purpose-specific “control protocol”. This makes Kairos suitable for building all kinds of distributed systems.

The control protocol defines:

  • How to handle membership changes to the set of validator nodes, e.g. who can join, how do they join, how do they leave, how is it decided, etc.

  • How to interpret validation “power”, e.g. do all validator nodes have the exact same weight, or is it using something like Proof-of-Stake, etc.

  • How to validate the transactions in blocks produced by other validator nodes, e.g. trust the transaction data blindly, validate just the signatures using some data source, etc.

  • How to handle invalid blocks from other validator nodes, e.g. does it result in evidence of a fault being shared with the rest of the network, does the validator get removed, etc.

  • How to penalize “misbehaving” validator nodes, e.g. are they permanently removed from the validator set, are they only removed temporarily, etc.

  • How to get the state of the system when a validator node first comes online, or reconnects after being offline for a while.

  • How to interpret system transactions, and whether those transactions impact system settings, validator membership, or some other internal state.

In Espra, we use our Diffusion Protocol as the control protocol. This keeps the network decentralized, while keeping it open and 100% permissionless, i.e. where anyone can join and participate.

With a different control protocol, you could also use Kairos to build distributed systems operated by single entities, e.g. a more scalable next-gen version of something like Google Spanner [Cor+13].

Implementation Considerations

There are lots of different factors to consider when creating a Kairos implementation:

  • To avoid clogging up the network with noise, validators should only produce blocks if they have transactions to add, or if it would help get a referenced block committed.

  • Kairos is highly parallelizable, e.g. blocks can be decoded and validated in parallel. So, where possible, the use of concurrent processing is strongly encouraged, e.g. threads.

  • Timekeeping is a complicated affair. To keep things simple, implementations should use TAI instead of UTC. This avoids the need to work with leap seconds [Pas11].

  • Multi-node setups would let single entities operate multiple validator nodes. As the storage and indexing of blocks could be shared, it would make it cheaper to run nodes.

    The shared storage system could also run on Kairos. This should use a shorter commit_window so that blocks are persisted faster than they’re produced at the higher layer.

  • When a validator node is restarted, it should first check the network to see if other nodes had seen newer blocks from itself than what it has locally stored.

    This would minimize the chances of the node getting penalized unfairly, e.g. when it had produced a block but failed to persist it, and ends up creating an alternative block.

  • Implementations should pay attention to queuing theory [Sin18]. This means trying to keep variance to a minimum, e.g. by adding support for backpressure [Phe19], load shedding [Yan19], etc.

    This will result in a system with more consistent performance, reduced tail latency, and a significantly better experience for downstream clients and users.

  • To get the most performance, implementations should take advantage of newer async APIs like io_uring on Linux [Axb19] and I/O rings on Windows [Sha21].

  • Ideally, implementations would embed Kairos within an abstraction like VirtualLog [BF19]. This would make it much easier to evolve the system and build support tooling.

  • When a new validator joins the network, it should wait to catch up to recent ticks before producing blocks. This will avoid wasting time referencing historic blocks.

  • Kairos is highly suited to optimization engineering, e.g. validator nodes can pre-allocate most of the memory needed for their data structures and use freelists otherwise.

  • Besides using modern compression like Zstandard [RFC8878], intelligent encoding of the block data will yield massive savings when storing, processing, and transferring blocks.

    For example, if a concrete implementation of a Reference looks something like:

    type Reference struct {
        block_hash    []byte  // 32 bytes
        block_number  uint64  // 8 bytes
        node_id       uint64  // 8 bytes
        tick          uint64  // 8 bytes
    }
    

    Each Reference would then take up 56 bytes. If there were 100k validator nodes, just the references alone would take up 5.6MB in each block for every tick.

    Since most tick values are likely to be for recent ticks, they could be encoded as deltas from the current tick. If encoded as a varint [LB12], this could shave off 7 bytes for most ticks.

    Similarly, since block_number must always increment by 1, we can omit it except for the very first time a reference is made to a block from a particular validator — saving 8 bytes.

    Most networks are not likely to have anywhere near 2^64 nodes. By encoding the node_id as a varint, it should be possible to shave off at least 4 bytes.

    So, without much effort, we can bring down the space taken up by references from 56 to 37 bytes. On an active network, that could save hundreds of gigabytes a day.

    But we can do even better! Since it takes time for information to travel, a decent percentage of blocks will already have been referenced by previously referenced blocks.

    So by optionally encoding references as the “ith reference within the jth most recent reference”, you could encode most references in less than 6 bytes — shaving off 50 bytes.

    One could potentially go even further and leave out the block hashes of references from the block data, but still include them when computing the containing block’s hash.

    As this would make it harder to detect if a block is malformed, it should be supplemented with a secondary API that allows clients to fetch the corresponding block hashes for references.
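The delta + varint idea for reference ticks can be sketched concretely. This uses the standard LEB128-style varint encoding; the exact wire format Kairos would use is not specified here:

```python
# Sketch of delta + varint encoding for reference ticks: encode
# each reference's tick as (current_tick - tick), so recent ticks
# fit in a single byte instead of a fixed 8.

def encode_varint(n: int) -> bytes:
    """LEB128-style encoding: 7 bits per byte, high bit = continue."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_tick(current_tick: int, ref_tick: int) -> bytes:
    return encode_varint(current_tick - ref_tick)

# A reference to the previous tick costs 1 byte instead of 8,
# shaving off 7 bytes for the common case.
assert len(encode_tick(1_000_000, 999_999)) == 1

# Even a reference a million ticks back only needs 3 bytes.
assert len(encode_tick(1_000_000, 0)) == 3
```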

  • In order to take advantage of such short reference encodings, the referenced blocks would first need to be referenced by other blocks, i.e. 2 ticks after the block was produced:

    • A block gets produced by its validator at tick t.

    • At tick t+1 it can be referenced by some of the other validators.

    • Then tick t+2 is the earliest point that short encodings can be used.

    To facilitate this, implementations should be able to defer references for use-cases like Espra where it’s important to keep the size of references as low as possible.

    If a validator can reference a block at tick t+1, it’ll defer referencing it until tick t+2 for a defer_percentage share of blocks, e.g. 90% of them.

    This will give time for the block to get referenced by some of the other validators — allowing the validator to make use of a short reference encoding for the block in the next tick.

  • Extreme care should be taken when storing block data. Contrary to general expectations, it is quite a challenge [Luu15] to safely write files without corruption or data loss [PCA+14].

    A fault in a single disk sector in a single replica, if propagated through the network via consensus, could result in data loss across entire clusters [AGL+18].

    Implementations should therefore be storage fault-aware and factor in protections against bitrot, disk failures, filesystem failures, torn writes, misdirected reads/writes, etc.

Blazing Fast Speeds

We know from our related work on Chainspace’s Blockmania [DH18] that Kairos, which is a much simpler algorithm, scales to beyond a million transactions per second.

We will validate this, along with Kairos’ safety properties, with our very own “Chaos Test”:

  • To run the test, we will spin up a large number of validator and client nodes that are distributed around various parts of the world.

  • The system will be seeded with a large number of accounts, e.g. 100 million, with random balances of a “test token” in each account.

  • The aggregate balance across all accounts will be calculated and saved as total_balance.

  • Each account will be assigned to a random client and validator node.

  • The client nodes will then start doing transfers of random amounts from a randomly selected account of theirs to another random account in the system.

  • Metrics will be continuously collected by a dedicated app, e.g. error rates, number of transactions per second, latency, how many blocks fail to get committed, etc.

  • A special chaos process on each validator node will then start injecting faults on that node based on instructions from a dedicated “control” app.

  • These faults will be similar to Jepsen [Kin13], e.g. adding network latency, pausing/crashing processes, interfering with the system clock, rebooting, introducing disk errors, etc.

  • We will also trigger some validator nodes into nemesis mode, e.g. producing conflicting blocks for the same tick, censoring blocks by others by not referencing them, etc.

  • At the end of the test we will tally up balances of all accounts, and if this equals the starting total_balance, we’ll have shown that Kairos has maintained consistency throughout.
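The conservation invariant at the heart of the Chaos Test can be sketched in miniature. The account counts and amounts here are illustrative stand-ins for the real test's 100 million accounts:

```python
# Sketch of the Chaos Test's conservation invariant: random
# transfers between accounts must never change the aggregate
# balance, no matter how they are ordered or interleaved.
import random

def transfer(balances, src, dst, amount):
    if 0 < amount <= balances[src]:
        balances[src] -= amount
        balances[dst] += amount

rng = random.Random(42)  # seeded for reproducibility
balances = {i: rng.randrange(1_000) for i in range(100)}
total_balance = sum(balances.values())

for _ in range(10_000):
    src, dst = rng.randrange(100), rng.randrange(100)
    transfer(balances, src, dst, rng.randrange(500))

# Consistency held: no tokens were created or destroyed.
assert sum(balances.values()) == total_balance
```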

This section will be updated with the results of the Chaos Test once we’ve done it. Our expectations are that:

  • Kairos will be resilient to a wide range of failures such as nodes disappearing, messages getting delayed, nodes being malicious, etc.

  • Kairos will scale relatively linearly until the I/O capabilities of a supermajority of the validator nodes are saturated.

  • When using Espra’s tick_interval and commit_window, Kairos will successfully include at least 99.9% of all submitted transactions under “normal” network conditions.

  • Even with degraded network conditions, Kairos will keep arriving at consensus until the network no longer has 2f+2 nodes online.

«chaos-test-results-once-done»

Kairos Parameter Selection

The parameters for Kairos have to balance:

  • Finalizing blocks within a reasonable enough time frame for its intended use case. The bounds for this are given by:

    # The minimum time that a block can take to get finalized
    # is 3 ticks. One tick for a block to get produced,
    # another for it to be seen by other validators, and a
    # third for those blocks to be seen by everyone else.
    time_to_finalize_lower_bound = 3 * tick_interval
    
    # If a block is to be finalized, it must happen before
    # the commit window expires.
    time_to_finalize_upper_bound = commit_window * tick_interval
    
  • Providing enough time for blocks to propagate through the network so that they can be “seen” in time by validator nodes and get “committed”.

    This is primarily constrained by the network connectivity, levels of congestion, and the speed of light.

  • Having a large enough average_block_size_limit so that there’s enough space for the references in each tick:

    min_space_needed_for_references = (
      number_of_validators * space_needed_per_reference
    )
    
  • Having a large enough average_block_size_limit so that the space taken up by references in a block is relatively low in comparison to the space taken up by transactions:

    effectiveness_ratio = (
      space_used_by_transactions / space_used_by_references
    )
    
  • Having a large enough average_block_size_limit so that the system can support “bursts” in activity, i.e. spikes in the amount of transactions data at busy validators.

  • Being able to support a large enough number of validators in decentralized systems. After all, this is what gives the network censorship resistance and resilience.

  • Having a long enough tick_interval and small enough average_block_size_limit for the total throughput to be handled by reasonably affordable machines:

    # Assuming that the average block size is in gigabits, and
    # the tick interval is in seconds, this calculation will
    # give us the average throughput in gigabits/second.
    
    ticks_per_second = 1 / tick_interval
    
    throughput = (
      number_of_validators * average_block_size * ticks_per_second
    )
    

    Whether a machine can handle that level of throughput is limited by the minimum of its available network, disk, and memory bandwidths:

    max_throughput = min(
      network_bandwidth,
      disk_bandwidth,
      memory_bandwidth
    )
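
Pulling the formulas above together, here is a small runnable sketch in Python rather than the pseudocode used so far; the parameter values at the bottom are purely illustrative:

```python
def finality_bounds(tick_interval: float, commit_window: int) -> tuple:
    """Minimum and maximum time, in seconds, for a block to be finalized."""
    return 3 * tick_interval, commit_window * tick_interval

def min_reference_space(number_of_validators: int, bytes_per_reference: int) -> int:
    """Space each block must reserve for references to other blocks."""
    return number_of_validators * bytes_per_reference

def network_throughput(number_of_validators: int, average_block_size: int,
                       tick_interval: float) -> float:
    """Average bytes/second that every validator must be able to handle."""
    ticks_per_second = 1 / tick_interval
    return number_of_validators * average_block_size * ticks_per_second

# Illustrative parameters: 100 validators, 1MB blocks, 100ms ticks, window of 10.
lo, hi = finality_bounds(0.1, 10)                 # roughly 0.3s to 1.0s
refs = min_reference_space(100, 10)               # 1,000 bytes per block
rate = network_throughput(100, 1_000_000, 0.1)    # roughly 1 GB/s
```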
    

The parameter choice depends on how Kairos is being used. For example, for a high-scale distributed database with just 20 validators equipped with 100 Gbps network cards and multiple SSDs:

  • You could choose something like average_block_size_limit of 30MB, tick_interval of 20 milliseconds, and a commit_window of 25.

  • Blocks would then get finalized between 60 milliseconds and 500 milliseconds.

  • And, assuming an average transaction size of 1kB, the system would have an average upper bound of around 12 million transactions per second!
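
A quick sanity check of the finality numbers, plus one way of arriving at a figure of roughly 12 million transactions per second. This sketch assumes the 100 Gbps network cards are the binding bottleneck; that assumption is this sketch's, not a statement from the text above:

```python
tick_interval = 0.020             # 20 milliseconds
commit_window = 25
network_bandwidth = 100e9 / 8     # 100 Gbps as bytes/second = 12.5 GB/s
average_transaction_size = 1_000  # 1kB

# Finality bounds: 3 ticks minimum, commit_window ticks maximum.
time_to_finalize_lower_bound = 3 * tick_interval              # ~60ms
time_to_finalize_upper_bound = commit_window * tick_interval  # ~500ms

# If each validator's network card caps total throughput, the upper bound
# on transactions/second is bandwidth divided by transaction size:
# 12.5 GB/s / 1kB = 12.5 million, i.e. "around 12 million".
transactions_per_second = network_bandwidth / average_transaction_size
```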

For Espra:

  • We want to initially target at least 10,000 core validators to provide a reasonable level of decentralization, censorship resistance, and resilience.

  • Our choice of parameters should reflect realistic assumptions about the hardware that validators might use. Especially since we’re constrained by the available throughput.

    Given current tech, we can assume that network and disk bandwidth will often be a lot lower than memory bandwidth. For example, this author’s current laptop supports:

    • Memory bandwidth of 400GB/s. Of this, the CPU cores are able to access the memory at a limit of 243GB/s [Fru21].

    • Disk reads and writes at over 5100MB/s, i.e. 5.1GB/s.

    But on the networking front, while 400 Gbps network cards exist, looking at servers currently available at the major cloud providers, 25 Gbps tends to be more common.

    • Converting from bits to bytes, that’s just 3.125GB/s.

    For Espra, we assume that people will set up their own servers instead of paying the extortionate bandwidth costs at cloud providers [PR21].

    You can currently buy a 25 Gbps network card for less than $500, and a 100 Gbps card for less than $2k. Assuming a 5-year server life, the difference works out to less than $25/month.

    So, for Espra, we’ll require validators to have 100 Gbps network cards and multiple SSD drives. Converting from bits to bytes, 100 Gbps gives us 12.5GB/s.

  • If we round that down and assume a max throughput of 12GB/s, that would mean that the average bandwidth limit for block production at each validator would be 1.2MB/s.

    bandwidth_limit = 12.GBps / 10_000
    
  • Assuming that each reference takes up 10 bytes on average, then the minimum space needed for references in each block would be 100kB.

    space_needed_for_references = 10_000 * 10.bytes
    

    Note that this assumes the use of deferred references, which will increase the lower bound of the time it takes to finalize by one tick_interval, i.e. 4 ticks.

  • If we want an “effectiveness” ratio of 2:1, i.e. the ratio of space taken up by transactions vs. references, then the average block size should be 300kB.

    average_block_size = (2 + 1) * 100.kB
    
  • This, in turn, would give us a tick_interval of 250ms.

    ticks_per_second = 1.second * (1.2.MBps / 300.kB)
    tick_interval = 1.second / ticks_per_second
    
  • And, finally, if we want to target an upper bound of 2 seconds for finality, so that transactions feel fast enough for most people, then our commit_window needs to be 8.

    commit_window = 2.second / 250.millisecond
    

So, in summary, we’re currently aiming to use the following initial parameters for Kairos in Espra:

  • tick_interval of 250 milliseconds

  • commit_window of 8 ticks

  • average_block_size_limit of 300 kilobytes

This will let blocks get finalized between 1-2 seconds. We can calculate the average upper bound for the rate of transactions the network can handle:

space_available_for_transactions = (
  average_block_size - space_needed_for_references
)

transactions_per_block = (
  space_available_for_transactions / average_transaction_size
)

blocks_per_second = 1.second / tick_interval

average_transactions_per_second = (
  number_of_validators * blocks_per_second * transactions_per_block
)

Plugging in our parameters, and assuming the average size of transactions at 1kB, we end up getting an average upper bound of 8 million transactions per second!

space_available_for_transactions = 300.kB - 100.kB
transactions_per_block = 200.kB / 1.kB
blocks_per_second = 1.second / 250.ms
average_transactions_per_second = 10_000 * 4 * 200  # 8,000,000
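
As a sanity check, the whole derivation can be replayed as runnable code; every input below comes from the parameters and assumptions stated above:

```python
number_of_validators = 10_000
max_throughput = 12e9             # bytes/second, rounded down from 12.5 GB/s
bytes_per_reference = 10
effectiveness_ratio = 2           # transactions take 2x the space of references
finality_target = 2.0             # seconds
average_transaction_size = 1_000  # 1kB

bandwidth_limit = max_throughput / number_of_validators        # 1.2 MB/s
space_needed_for_references = (
    number_of_validators * bytes_per_reference                 # 100kB
)
average_block_size = (
    (effectiveness_ratio + 1) * space_needed_for_references    # 300kB
)
ticks_per_second = bandwidth_limit / average_block_size        # 4
tick_interval = 1 / ticks_per_second                           # 250ms
commit_window = finality_target / tick_interval                # 8

transactions_per_block = (
    (average_block_size - space_needed_for_references)
    / average_transaction_size                                 # 200
)
average_transactions_per_second = (
    number_of_validators * ticks_per_second * transactions_per_block
)
assert average_transactions_per_second == 8_000_000
```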

Of course, as computer hardware improves, we can update our parameters over time — enabling even greater throughput and network resilience.

TLA+ Specification

This section is a placeholder for a formal specification of Kairos that can be validated with a TLA+ model checker such as TLC [Lam99].

«formal-spec-once-done»

Kairos Drawbacks

Like all consensus algorithms, Kairos has a few drawbacks:

  1. Clocks need to be synced.

    We break with tradition in distributed systems design by depending on “real” time for our algorithm. Distributed systems usually avoid this because computer clocks tend to be unreliable and drift over time.

    So, in Kairos, if a validator’s clock is behind by more than the commit window, e.g. 2 seconds, then none of its blocks will ever get committed.

    While this is true, we don’t believe it is a big issue. Most devices nowadays, like smartphones and servers, tend to keep their clocks synced by fetching time over the internet [Mil03].

    In addition, we bake in time syncing into our communications layer with the Espra Time Protocol.

  2. Impact of bit flips on validator reputation.

    Computer hardware is not perfect. A single bit may flip at any time in a computer’s memory. What a validator thinks it signs, and what it actually signs can be two different things.

    Bit flips can happen for lots of reasons. Faulty hardware, attacks like RowHammer [KDK+14], or even radiation from cosmic rays and solar flares [WA08].

    Unless you’re like NASA and have 4 identical computers all computing the same thing so as to detect discrepancies [Jen01], it’s hard to know when a bit has flipped.

    So, when faults are detected in a validator, it won’t be easy to know if it was because of malice or accidentally triggered by cosmic rays, faulty hardware, etc.

    Many consensus protocols will remove validators for such faults — even if it wasn’t intentional. We believe this to be unfair, as it takes effort for validators to build up their reputation.

    To protect themselves, validators should start by using more reliable hardware. For example, ECC memory modules will detect and fix single bit errors.

    For more robust protection, validators could use a set of “trusted peers” to sanity check their output before sharing it with the rest of the network.

    If the trusted peers all agree on there being a fault, the validator should document the issue and shut down. It can be brought back up after manual review.
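
    A hedged sketch of that trusted-peer cross-check; the digest scheme and function names here are hypothetical, not part of Kairos:

    ```python
    import hashlib

    def digest(block_bytes: bytes) -> str:
        """Digest of the block a validator is about to sign and share."""
        return hashlib.sha256(block_bytes).hexdigest()

    def safe_to_publish(local_block: bytes, peer_digests: list) -> bool:
        """False only when every trusted peer disagrees with our local
        digest, which suggests a local fault (e.g. a bit flip)."""
        local = digest(local_block)
        return any(d == local for d in peer_digests)

    block = b"block contents for tick 42"
    peer_digests = [digest(block) for _ in range(3)]  # computed independently
    assert safe_to_publish(block, peer_digests)

    # A bit flip corrupts local memory before signing: all peers disagree,
    # so the validator should document the issue and shut down for review.
    corrupted = b"block contents for tick 43"
    assert not safe_to_publish(corrupted, peer_digests)
    ```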

  3. Impact of more than f validator nodes going offline.

    Like all BFT consensus algorithms with deterministic finality, Kairos needs more than 2f+1 validators to be online in order to keep making progress on consensus.

    A large number of validators could disappear for a number of reasons:

    • The “World War 3” scenario where regions with large number of validators get nuked. In such a scenario, those validators may never come back online.

    • A malicious party could hold the network to ransom by keeping parts of the network offline, or by conducting denial-of-service attacks.

    • A natural disaster could disconnect a place like New Zealand from the rest of the world, and it might take weeks for it to get reconnected.

    The good news is that validators going offline can’t affect the safety of the system, e.g. by creating alternative versions of what’s agreed upon.

    Since a network that can’t make progress is no good, chains like Ethereum try to work around this by using an “inactivity leak” that reduces the voting power of inactive validators [BG17].

    But this abandons safety as different parts of the network that can’t connect with each other would then end up creating different forks of the chain.

    So we don’t recommend such an approach for decentralized systems. Especially since it can be exploited by attackers to purposefully cause the network to fork and lose safety.

    Instead we suggest:

    • Adding support for high trust, “offline” interactions in the application layer.

      This would allow participants to carry on using the system in case their region is disconnected from the rest of the network, e.g. due to a natural disaster.

      The exchange of signed messages can take place directly between peers — allowing them to still keep using essential functionality, albeit in a slightly limited manner.

      Popular applications, especially emergency services, should support this mode of operation and conduct rigorous ongoing tests of their “offline” support.

      Preparation drills should be carried out in regions every few years so that people are familiar with how to use the system in case of disasters.

    • Predefining a social process that allows problematic validators to be removed via a hard fork.

      This process should have enough redundancies so that it can still take place even if certain people are not available or are being coerced.

      It should also be decentralized enough so that one person can’t hard fork the network via dictate, e.g. because they’ve been compromised in some way.

      And, finally, validators who are participating in the new fork must explicitly disavow other forks through a cryptographic proof so they can’t change their minds later.

  4. Impact of bugs in the execution layer.

    It’s hard to avoid bugs in the execution layer, i.e. the mechanism used to execute transactions. We’ve seen this in Bitcoin, Ethereum, and pretty much every chain with more than one implementation.

    If a bug exists in just certain versions or implementations, then it could result in participants seeing divergent views of what took place.

    The impact is exacerbated in Kairos-based systems, as we don’t execute transactions during consensus, and only sequence them. So execution bugs are only evident at a later point.

    The first line of defense against this is to make the “surface area” for bugs as small as possible, i.e. to keep the execution layer super simple.

    This should be paired with exhaustive conformance testing, i.e. making sure implementations adhere to a testable specification and pass all tests.

    For example, in Espra, our initial execution layer is derived from WASM, a relatively minimal format for executable programs, which comes with a relatively decent test suite.

    The next step is to catch execution bugs as they happen, and halt processing in that particular space until the issue is resolved. This can be done through a number of approaches:

    • Past versions of implementations can be preserved, enabling concurrent execution against multiple versions. This can help sanity check that all versions generate the same results.

    • Implementations can be registered with the system. Execution against the N most popular versions can then ensure that they all generate the same results.

    Over time this will result in most, if not all, bugs being eliminated from the execution layer.
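
    A minimal sketch of that cross-checking approach, with a toy state model standing in for a real execution engine (all names here are hypothetical):

    ```python
    State = dict  # toy state: account name -> balance

    def impl_v1(state: State, tx: str) -> State:
        """Toy execution engine: "account:amount" credits the account."""
        account, amount = tx.split(":")
        result = dict(state)
        result[account] = result.get(account, 0) + int(amount)
        return result

    def impl_v2(state: State, tx: str) -> State:
        """Independently written implementation, same intended semantics."""
        account, amount = tx.split(":")
        return {**state, account: state.get(account, 0) + int(amount)}

    def execute_checked(state: State, tx: str, impls: list) -> State:
        """Execute against every registered implementation and halt
        processing in this space if any of them diverge."""
        results = [impl(state, tx) for impl in impls]
        if any(r != results[0] for r in results[1:]):
            raise RuntimeError("execution divergence detected: halting")
        return results[0]

    state = execute_checked({}, "alice:100", [impl_v1, impl_v2])
    assert state == {"alice": 100}
    ```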

  5. Vulnerability to “long-range” attacks.

    Like the consensus protocol in Ethereum and most Proof-of-Stake chains, Kairos is vulnerable to what is known as “long-range” attacks [Dei18].

    This can happen if the private keys of 2f+2 validators who used to be active at some point in the past are compromised, e.g. due to neglect of private keys on old machines, collusion, etc.

    Unlike a compromise of the private keys of currently active validators, which the control protocol protects against, this will let attackers rewrite history from that earlier point in time.

    To protect against this, we encourage all decentralized deployments of Kairos to follow the protection mechanisms that we’ll be using in Espra:

    • All client software interacting with the network should hardcode a recent network block hash. Clients can then simply ignore any changes before that.

      In case someone gets marooned on a deserted island, and rejoins society after a long time, clients should also sanity check that they’re using a recent version.

    • In transactions, the data that gets signed should include the public keys of the operator of the node they’re using, as well as the network block hash for a recent tick.

      So, unlike in chains like Ethereum, the attackers won’t be able to cherry-pick specific transactions from other blocks and include them as part of their revised history.

    Both of these protections would make any attempt at “long-range” attacks pointless.
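
    The second protection can be sketched as follows. The payload fields and serialization are hypothetical; the point is that the signed bytes bind the transaction to one specific history:

    ```python
    import json

    def signing_payload(tx: dict, recent_block_hash: str,
                        operator_pubkey: str) -> bytes:
        """Bytes a client signs: the transaction plus a recent network
        block hash and the node operator's public key."""
        payload = dict(tx, recent_block_hash=recent_block_hash,
                       operator_pubkey=operator_pubkey)
        # Canonical serialization so signers and verifiers agree on bytes.
        return json.dumps(payload, sort_keys=True).encode()

    tx = {"from": "alice", "to": "zeno", "amount": 4}
    original = signing_payload(tx, "block-hash-at-tick-1000", "operator-key")
    forged = signing_payload(tx, "attackers-forged-hash", "operator-key")

    # Different histories produce different signed bytes, so a signature
    # over `original` is worthless inside a rewritten chain.
    assert original != forged
    ```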

References

AF19 Daniel Abadi and Matt Freels. Serializability vs Strict Serializability: The Dirty Secret of Database Isolation Levels. Fauna Blog, 2019. External link
AGL+18 Ramnatthan Alagappan, Aishwarya Ganesan, Eric Lee, Aws Albarghouthi, Vijay Chidambaram, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Protocol-Aware Recovery for Consensus-Based Storage. FAST '18: Proceedings of the 16th USENIX Conference on File and Storage Technologies, 2018. External link
Aba12 Daniel Abadi. Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story. Computer, Volume 45, Issue 2, pp. 37-42, 2012. External link
Axb19 Jens Axboe. Efficient IO with io_uring. 2019. External link
BDL+11 Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang. High-speed high-security signatures. Journal of Cryptographic Engineering, Volume 2, pp. 77–89, 2011. External link
BECI Bitcoin Energy Consumption Index. External link
BF19 Mahesh Balakrishnan and Jason Flinn. Delos: Storage for the Facebook Control Plane. Systems @Scale, 2019. External link
BG17 Vitalik Buterin and Virgil Griffith. Casper the Friendly Finality Gadget. 2017. External link
BHK+19 Daniel J. Bernstein, Andreas Hülsing, Stefan Kölbl, Ruben Niederhagen, Joost Rijneveld, and Peter Schwabe. The SPHINCS+ Signature Framework. CCS '19: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 2129-2146, 2019. External link
BHK+20 Vitalik Buterin, Diego Hernandez, Thor Kamphefner, Khiem Pham, Zhi Qiao, Danny Ryan, Juhyeok Sin, Ying Wang, and Yan X. Zhang. Combining GHOST and Casper. 2020. External link
Bai16 Leemon Baird. Hashgraph consensus: fair, fast, byzantine fault tolerance. Swirlds Tech Report, 2016. External link
Bre17 Eric A. Brewer. Spanner, TrueTime & The CAP Theorem. Google Research, 2017. External link
CGR07 Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. Paxos made live: an engineering perspective. PODC '07: Proceedings of the 26th annual ACM Symposium on Principles of Distributed Computing, pp. 398–407, 2007. External link
CL99 Miguel Castro and Barbara Liskov. Practical byzantine fault tolerance. OSDI '99: Proceedings of the Third Symposium on Operating Systems Design and Implementation, pp. 173–186, 1999. External link
Cha82 David L. Chaum. Computer Systems Established, Maintained and Trusted by Mutually Suspicious Groups. Dissertation, Computer Science, UC Berkeley, 1982. External link
Cor+13 J. C. Corbett et al. Spanner: Google's globally distributed database. ACM Transactions on Computer Systems, Volume 31, Issue 3, Article 8, pp. 1–22, 2013. External link
DH18 George Danezis and David Hrycyszyn. Blockmania: from block DAGs to consensus. 2018. External link
Dei18 Evangelos Deirmentzoglou. Rewriting History: A Brief Introduction to Long Range Attacks. 2018. External link
Edg22 Ben Edgington. Upgrading Ethereum: Randomness. 2022. External link
FB99 Armando Fox and Eric A. Brewer. Harvest, yield, and scalable tolerant systems. Proceedings of the 7th Workshop on Hot Topics in Operating Systems, pp. 174-178, 1999. External link
FLP85 Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, Volume 32, Issue 2, pp. 374–382, 1985. External link
Fru21 Andrei Frumusanu. Apple's M1 Pro, M1 Max SoCs Investigated: New Performance and Efficiency Heights. AnandTech, 2021. External link
Jen01 Dennis Jenkins. Advanced Vehicle Automation and Computers Aboard the Shuttle. NASA, 2001. External link
KDK+14 Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. ISCA '14: Proceedings of the 41st annual ACM/IEEE International Symposium on Computer Architecture, pp. 361-372, 2014. External link
Kin13 Kyle Kingsbury. Jepsen: A framework for distributed systems verification, with fault injection. 2013. External link
LB12 Daniel Lemire and Leonid Boytsov. Decoding billions of integers per second through vectorization. 2012. External link
LRM82 Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems, Volume 4, Issue 3, pp. 382–401, 1982. External link
Lam98 Leslie Lamport. The Part-Time Parliament. ACM Transactions on Computer Systems, Volume 16, Issue 2, pp. 133–169, 1998. External link
Lam99 Leslie Lamport. Specifying Concurrent Systems with TLA+. Calculational System Design, IOS Press, pp. 183–247, 1999. External link
Luu15 Dan Luu. Files are hard. 2015. External link
Mil03 David L. Mills. A Brief History of NTP Time: Confessions of an Internet Timekeeper. 2003. External link
Nak08 Satoshi Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System. 2008. External link
OAN+19 Jack O’Connor, Jean-Philippe Aumasson, Samuel Neves, and Zooko Wilcox-O’Hearn. BLAKE3: one function, fast everywhere. 2019. External link
OL88 Brian M. Oki and Barbara H. Liskov. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. PODC '88: Proceedings of the 7th annual ACM Symposium on Principles of Distributed Computing, pp. 8–17, 1988. External link
OO13 Diego Ongaro and John Ousterhout. In Search of an Understandable Consensus Algorithm. ATC '14: Proceedings of USENIX Annual Technical Conference, 2014. External link
PCA+14 Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. OSDI '14: Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, 2014. External link
PD16 Joseph Poon and Thaddeus Dryja. The Bitcoin Lightning Network: Scalable Off-Chain Instant Payments. 2015. External link
PR21 Matthew Prince and Nitin Rao. AWS’s Egregious Egress. Cloudflare Blog, 2021. External link
Pas11 Christopher Pascoe. Time, technology and leaping seconds. Google Blog, 2011. External link
Phe19 Jay Phelps. Backpressure explained — the resisted flow of data through software. 2019. External link
RFC8878 Yann Collet and Murray Kucherawy. Zstandard Compression and the 'application/zstd' Media Type. RFC 8878, 2021. External link
Sha21 Yarden Shafir. I/O Rings – When One I/O Operation is Not Enough. 2021. External link
Sin18 Diptanshu Singh. Average Network Delay and Queuing Theory basics. 2018. External link
TDW+12 Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. Calvin: Fast Distributed Transactions for Partitioned Database Systems. SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 1–12, 2012. External link
WA08 Fan Wang and Vishwani D. Agrawal. Single Event Upset: An Embedded Tutorial. VLSID '08: Proceedings of the 21st International Conference on VLSI Design, pp. 429-434, 2008. External link
Yak17 Anatoly Yakovenko. Solana: A new architecture for a high performance blockchain. 2017. External link
Yan19 David Yanacek. Using load shedding to avoid overload. The Amazon Builders' Library. 2019. External link

To cite:

@misc{tav2025kairos,
  title={Kairos: Scaling to Millions of Transactions Per Second}, 
  author={Tav},
  year={2025},
  primaryClass={cs.CR}
}