# Authenticated Data Structures

**Merkle trees**

A Merkle tree is a binary tree of SHA-256 hashes built over an ordered list of data items. The data items can be arbitrarily large. The tree for *N* data items consists of *2N − 1* nodes in total. Every node of the tree is assigned a 32-byte label. The leaves of the Merkle tree are labeled with the SHA-256 hashes of the *N* data items. The label of each internal node is the hash of the concatenation of the labels on its two child nodes. The label on the root of the tree is called the Merkle root.

*Membership witnesses:* Users who do not store any of the data items in the Merkle tree can still verify the inclusion of a data item. Verifying inclusion of a data item requires checking the correctness of a path of hashes from the data item to the root of the tree. The proof required for verification contains the labels on the sibling nodes of all nodes along the path from the data item's leaf to the root of the tree. In total this is a 32 log₂ *N* byte proof.
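As a concrete illustration, the construction and proof procedure described above can be sketched in Python. This is a minimal toy implementation (the helper names `build_tree`, `inclusion_proof`, and `verify` are illustrative, not from any particular library); odd-sized levels are handled by duplicating the last label, which is one common convention.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(items):
    """Build a Merkle tree over an ordered list of data items.
    Returns a list of levels: level[0] = leaf hashes, level[-1] = [root]."""
    level = [sha256(item) for item in items]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last label on odd levels
            level = level + [level[-1]]
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def inclusion_proof(levels, index):
    """Collect the sibling labels along the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        sib = index ^ 1                    # sibling of a left child is index+1, etc.
        proof.append(level[sib] if sib < len(level) else level[index])
        index //= 2
    return proof

def verify(root, item, index, proof):
    """Recompute the path of hashes from the item up to the root."""
    h = sha256(item)
    for sibling in proof:
        h = sha256(h + sibling) if index % 2 == 0 else sha256(sibling + h)
        index //= 2
    return h == root
```

For *N* leaves the proof holds one 32-byte label per tree level, matching the 32 log₂ *N* byte bound above.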

*Updates:* When a new item is appended to the end of the list, the new Merkle root can be computed by recomputing only a logarithmic number of hashes. The Merkle root can be updated in this way while streaming the new items of the list and storing only log₂ *N* existing labels: the labels on the nodes that will be required for the inclusion proof of the new item. Similarly, changing the value of an existing data item at a leaf of the tree requires updating only the labels along the path from that leaf to the root, and these can be recomputed given the log₂ *N* inclusion-proof labels for the previous value of the data item.

*Deletes:* Deletion is a special case of an update. An item is deleted from the tree by updating the value of the data item at its position to NULL. The labels on NULL nodes no longer need to be stored, and the same applies to labels on nodes that are the root of a subtree containing only NULL nodes.

*Rebalancing:* Deleting many data items will eventually leave many NULL nodes in the tree, resulting in an imbalanced tree whose depth is logarithmic in the total number of items ever inserted (i.e., including NULL positions) rather than in the current number of items. An optimization is therefore to use a self-balancing tree data structure, such as a 2-3 tree, rather than a plain binary tree. Self-balancing trees retain depth logarithmic in the number of current (non-NULL) items. A Merkle tree can be constructed over any tree in the same way as over a standard binary tree. Deleting an item in a 2-3 tree reorganizes only a logarithmic number of sibling relationships in the tree and therefore requires updating only a logarithmic number of hash labels.

In the above diagram, “*Tx0*” represents a transaction and “*Hash0*” is its hash. After *Tx0* and *Tx1* are hashed, “*Hash0*” and “*Hash1*” are combined to produce “*Hash01*”. This process is repeated until the last hash, the root hash, is produced.

**Sparse Merkle trees**

A sparse Merkle tree (SMT) may have a practically unbounded number of data item positions. However, the storage required is only proportional to the number of data items inserted into the tree. The label on each leaf node of the SMT is initially NULL and therefore does not need to be stored. Similarly, the labels on the internal nodes of an empty SMT all have fixed, predictable values that can be recomputed as hashes of NULL values. When an item is added to a leaf of the tree, all the labels along the path from this leaf to the root are updated and stored. For a tree with 2^50 leaves, each insertion increases the current storage by about 50 × 32 bytes = 1.6 KB. The inclusion proof of an item is similarly 1.6 KB.
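The storage behavior described above can be sketched in Python. This toy implementation (class and method names are illustrative) stores only non-default labels in a dictionary and precomputes the label of an all-NULL subtree at every height:

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class SparseMerkleTree:
    """Toy SMT of depth `depth` (2**depth leaf positions). Only non-default
    labels are stored; defaults[d] is the label of any all-NULL subtree of
    height d, so empty regions of the tree cost no storage."""
    NULL = b"\x00" * 32

    def __init__(self, depth: int):
        self.depth = depth
        self.nodes = {}                     # (level, index) -> 32-byte label
        self.defaults = [self.NULL]
        for _ in range(depth):              # hash of two NULL children, etc.
            self.defaults.append(H(self.defaults[-1] + self.defaults[-1]))

    def get(self, level, index):
        return self.nodes.get((level, index), self.defaults[level])

    def insert(self, index: int, value: bytes):
        """Set leaf `index` to H(value) and refresh the path to the root."""
        label = H(value)
        for level in range(self.depth):
            self.nodes[(level, index)] = label
            sib = self.get(level, index ^ 1)
            left, right = (label, sib) if index % 2 == 0 else (sib, label)
            label = H(left + right)
            index //= 2
        self.nodes[(self.depth, 0)] = label  # new root

    def root(self):
        return self.get(self.depth, 0)
```

Each insertion stores one label per level, so for depth 50 an insertion adds about 51 × 32 bytes ≈ 1.6 KB, matching the figure above.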

There are two ways to use an SMT as a commitment to a key/value store:

*Short keys:* Every leaf is assigned a log₂ *N*-bit key. This is the best solution for a key/value store where keys are relatively short. An SMT with 2^40 leaves can encode 40-bit keys. A (*key*, *value*) data item is inserted into the SMT by setting the label on the leaf node at index *key* to the 32-byte SHA-256 hash of *value*.

*Hash table:* When the keys are longer than log₂ *N* bits, the SMT can be used as a hash table of length *N*. A data item is a (*key*, *value*) pair, and a standard hash-table function is used to map the key to an index in {1,…,*N*} of the hash table, where again the 32-byte SHA-256 hash of *value* is stored. Collisions in the hash table are possible but happen infrequently. For example, if *N* = 2^40, then once a billion keys have been inserted, a new key causes a collision with probability roughly 1/1000. Collisions do not break the integrity of the SMT commitment, but they require the application to modify the key of a new insert into the SMT until it no longer causes a collision. If the application is not able to modify keys, then the number of leaves in the SMT needs to be much larger (roughly *N* = *M*^2 to support *M* total key inserts).
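The collision figure quoted above can be checked with a quick back-of-the-envelope computation, assuming a uniform hash-table function:

```python
# With N slots and M keys already inserted, the next key lands on an
# occupied slot (a collision) with probability roughly M / N.
N = 2 ** 40          # hash-table slots (SMT leaves)
M = 10 ** 9          # keys already inserted (~one billion)
p_collision = M / N
print(f"P(next key collides) ~ {p_collision:.2e}")   # about 9.1e-04, i.e. ~1/1000
```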

In the above diagram, the sparse Merkle tree has empty positions wherever no item has been inserted.

**Bloom filter**

A Bloom filter uses *k* hash functions *H*₁, …, *H*ₖ. The Bloom filter is stored as a bit vector of length *M* bits. All bits are initialized to 0. When an item is inserted into the Bloom filter, the item is hashed with each of the *k* hash functions to derive *k* hash values, and each hash value points to an index in {1,…,*M*} of the bit vector whose bit is then set to 1. The Bloom filter indicates that an item is included if the bit at each index determined by the *k* hash functions is set to 1. Otherwise, the item is definitely not included. Checking whether an item is in the Bloom filter just requires recomputing the *k* hash values and checking the bit at each index.

False positives are possible but occur with small probability. To maintain a false-positive probability of *p* for *N* items, set *k* = −log₂ *p* and *M* = 1.44 *kN*. For example, with *N* inserted items and *M* ≈ 14*N* (i.e., *k* = 10), the false-positive probability is approximately 1/1000.
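A minimal Python sketch of a Bloom filter sized with these formulas follows. The class and method names are illustrative; deriving the *k* indices by hashing the item with a counter prefix is one common way to simulate *k* independent hash functions.

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter sized for n items at false-positive rate p,
    using k = -log2(p) hash functions and M = 1.44 * k * n bits."""

    def __init__(self, n: int, p: float):
        self.k = max(1, round(-math.log2(p)))
        self.m = max(1, round(1.44 * self.k * n))
        self.bits = bytearray((self.m + 7) // 8)

    def _indices(self, item: bytes):
        # k indices derived by hashing a counter together with the item.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item: bytes):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item: bytes):
        # All k bits set -> "probably included"; any bit 0 -> definitely not.
        return all(self.bits[idx // 8] >> (idx % 8) & 1
                   for idx in self._indices(item))
```

For example, `BloomFilter(n=1000, p=1/1000)` picks *k* = 10 and about 14,400 bits, matching the *M* ≈ 14*N* figure above.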

Bloom filters are a secure commitment to the data items that are *not* inserted into the filter because they have no false negatives. They can be used in practice to commit to a subset *S\** of items from a relatively small superset *S* (e.g., the UTXOs as a subset of all the TXOs) by inserting all items from *S* that are *not* in *S\** (e.g., the subset of spent TXOs). The size of this commitment is proportional to the size of *S* and is therefore only practical when the superset of possible items is not too large.

Bloom filters are not a secure commitment to the data items that *are* inserted into the filter unless the false-positive probability *p* is negligibly small, e.g., log₂ *p* = −128. In this case, the filter contains approximately 185 bits per inserted item, which is 28% smaller than storing the 32-byte (256-bit) SHA-256 hash of every element in the set.
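The size comparison can be checked directly from the sizing formula above (the 1.44 factor gives roughly 184 bits per item, close to the ~185 quoted):

```python
# Bits per inserted item for a Bloom filter with p = 2**-128, versus a
# plain list of 32-byte (256-bit) SHA-256 hashes.
k = 128                             # k = -log2(p) with p = 2**-128
bits_per_item = 1.44 * k            # M / N = 1.44 * k bits per item
savings = 1 - bits_per_item / 256   # relative to 256 bits per hash
print(f"{bits_per_item:.1f} bits per item, {savings:.0%} smaller")
```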

Above is a diagram for a basic Bloom filter with *k* = 3, where A, B, and C are inserted with hash functions h1, h2, and h3.

**Counting Bloom filter**

A counting Bloom filter modifies the Bloom filter by replacing each bit of the bit vector with a 4-bit counter in order to enable deletions. Every time an item is inserted, the counters at the *k* indices determined by the hash functions are incremented. An item is deleted from the filter by decrementing the counter at each of the *k* indices. The filter indicates that an item is included if the counters at the *k* indices for the item are all non-zero, and that the item is not included if any of these counters is zero.
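The insert/delete mechanics can be sketched as follows; this is a toy implementation with illustrative names, where each counter saturates at 15 since it holds only 4 bits:

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: each of the m positions holds a 4-bit
    counter (saturating at 15), so items can be deleted as well as inserted."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indices(self, item: bytes):
        # k indices derived by hashing a counter together with the item.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(h, "big") % self.m

    def insert(self, item: bytes):
        for idx in self._indices(item):
            self.counters[idx] = min(self.counters[idx] + 1, 15)

    def delete(self, item: bytes):
        for idx in self._indices(item):
            self.counters[idx] = max(self.counters[idx] - 1, 0)

    def __contains__(self, item: bytes):
        # All k counters non-zero -> "probably included"; any zero -> not included.
        return all(self.counters[idx] > 0 for idx in self._indices(item))
```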

Counting Bloom filters do not offer any advantage over a list of SHA-256 hashes when used as secure commitments to sets. They require 4× as many bits as a plain Bloom filter, and therefore to achieve log₂ *p* = −128 they require more than 32 bytes per inserted item.

Above is a diagram for a counting Bloom filter with *k* = 3, where A, B, and C are inserted with hash functions h1, h2, and h3. As opposed to the basic Bloom filter, a 4 bit counter is used instead of a single bit at each position.

**Bloom filter with SMT**

A Bloom filter can be combined with an SMT to form a compact commitment to a set. The Bloom filter bit vector is divided into blocks of 32 bytes, which become the leaves of the SMT. In order to support *N* item insertions with a negligible false-positive probability of log₂ *p* = −128, the total number of leaves in the SMT is approximately 0.72 *N*.
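The 0.72 *N* leaf count follows directly from the Bloom filter sizing formula:

```python
# Number of 32-byte SMT leaves needed per inserted item when the Bloom
# filter bit vector is carved into 256-bit blocks.
k = 128                           # for false-positive probability p = 2**-128
bloom_bits_per_item = 1.44 * k    # M = 1.44 * k * N bits in total
leaves_per_item = bloom_bits_per_item / 256   # one SMT leaf per 256-bit block
print(leaves_per_item)            # 0.72
```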

**RSA accumulators**

The RSA accumulator must be set up by a trusted operator that randomly chooses two strong 1024-bit RSA primes *p*, *q* and publishes the integer *N* = *pq*. The values of *p* and *q* are kept secret; the operator discards them after publishing *N*.

*Secure MPC:* The integer *N* can be generated through a *secure multi-party computation* among several independent parties. Unless all the participating parties collude to corrupt the system, nobody will learn the secret values *p* and *q*; only a single honest participant is required for the setup to be secure. For Findora deployments, the integer *N* will be generated through a secure MPC among several parties including the initial validator set.

*Initialization:* Set *A* = 3. This value of *A* is the initial accumulator state head.

*Insertion:* Insertion of a new value *val* into the accumulator uses a function *HashPrime* that hashes arbitrary data to prime integers in the range [0, 2^256). The accumulator state head is updated to *A* := *A*^HashPrime(*val*) mod *N*. The first step of *HashPrime* computes *hashval* = SHA256(*val*). Then, starting with a counter *nonce* = 0, it computes SHA256(*nonce*, *hashval*) and checks whether the result is prime. If it is not prime, it increments *nonce* and repeats the process until the first value of *nonce* that yields a prime. A parallel implementation of *HashPrime* tests a range of consecutive *nonce* values in parallel; if more than one output is prime it selects the lowest, otherwise it moves to the next consecutive range.
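A sequential sketch of *HashPrime* in Python follows. The helper names and the 8-byte big-endian encoding of *nonce* are assumptions (the text does not fix an encoding), and the primality check here is a probabilistic Miller-Rabin test; a production implementation would use a hardened test.

```python
import hashlib

def is_prime(n: int) -> bool:
    """Miller-Rabin primality test with fixed small bases (probabilistic
    for 256-bit inputs, but ample for this sketch)."""
    if n < 2:
        return False
    small = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in small:
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in small:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def hash_prime(val: bytes) -> int:
    """Hash arbitrary data to a prime in [0, 2**256): hash the value once,
    then search nonces until SHA256(nonce || hashval) is prime."""
    hashval = hashlib.sha256(val).digest()
    nonce = 0
    while True:
        candidate = int.from_bytes(
            hashlib.sha256(nonce.to_bytes(8, "big") + hashval).digest(), "big")
        if is_prime(candidate):
            return candidate
        nonce += 1
```

By the prime number theorem, roughly one in ln(2^256) ≈ 177 candidates is prime, so the expected number of nonce attempts is small.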

*Membership witness:* The membership witness for a value *val* inside an accumulator with current state head *A* is the value *A*^(1/*x*) mod *N* with *x* = HashPrime(*val*), i.e., an integer *w* such that *w*^*x* = *A* mod *N*. The state head of the accumulator just before *val* is inserted is the initial membership witness for *val*. Membership witnesses must be updated as the state head changes every time a new value is inserted into the accumulator: when a new value *val\** is inserted, the current membership witness *w* is updated to *w* := *w*^*y* mod *N* with *y* = HashPrime(*val\**).
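The insertion and witness-update rules can be traced end to end with toy numbers. This is only a sketch: the modulus below is tiny and insecure (a real deployment uses *N* = *pq* with secret 1024-bit primes), and `x1`, `x2` stand in for *HashPrime* outputs.

```python
# Toy RSA accumulator: insert two values and maintain a membership witness.
N = 61 * 53                      # insecure toy modulus (3233)
A = 3                            # initial state head

x1, x2 = 7, 11                   # primes standing in for HashPrime outputs

w1 = A                           # witness for x1: the head just before its insertion
A = pow(A, x1, N)                # insert x1:  A := A**x1 mod N
w1 = pow(w1, x2, N)              # update w1 when x2 is inserted: w := w**x2 mod N
A = pow(A, x2, N)                # insert x2:  A := A**x2 mod N

# Membership check for x1: w1**x1 == A (mod N).
assert pow(w1, x1, N) == A
```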

*Non-membership witness:* If the value *val* is not in the accumulator with current state head *A* = 3^*y* mod *N*, then *x* = HashPrime(*val*) is co-prime with *y*. The Euclidean algorithm is used to find a large integer *a* (proportional in size to *y*) and a smaller integer *b* (proportional in size to *x*) such that *ax* + *by* = 1. The non-membership witness for *val* is (3^*a* mod *N*, *b*). The non-membership witness is verified by checking that (3^*a*)^*x* · *A*^*b* = 3 mod *N*.
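The Bezout-based non-membership witness can likewise be checked with toy numbers (again an insecure modulus; Python 3.8+ is needed for negative exponents in `pow`, which compute modular inverses):

```python
# Toy non-membership witness: primes 11 and 13 are accumulated (y = 11*13),
# while x = 7 is not in the accumulator.
N = 61 * 53                      # insecure toy modulus
x, y = 7, 11 * 13
A = pow(3, y, N)                 # current state head A = 3**y mod N

def ext_gcd(a, b):
    """Extended Euclid: returns (g, u, v) with u*a + v*b == g."""
    if b == 0:
        return a, 1, 0
    g, u, v = ext_gcd(b, a % b)
    return g, v, u - (a // b) * v

g, a_coef, b_coef = ext_gcd(x, y)
assert g == 1 and a_coef * x + b_coef * y == 1

# The witness is (d, b_coef) with d = 3**a_coef mod N.
d = pow(3, a_coef, N)
# Verification: d**x * A**b = 3**(a*x + b*y) = 3**1 = 3 (mod N).
assert (pow(d, x, N) * pow(A, b_coef, N)) % N == 3
```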

*Deletion:* Deletion of a value *val* with *x* = HashPrime(*val*) from the accumulator requires computing the value *A*^(1/*x*) mod *N* (the same as a membership witness for *val* in *A*) and updating the state head to *A* := *A*^(1/*x*) mod *N*. Without the secret trapdoor this can be computed by an expensive brute-force method if all the values that have been inserted into the accumulator are known: re-accumulate every inserted value except *val*. If a membership witness *w* for *val* in *A* is already known, then deleting *val* is simple: set *A* := *w*.

Membership witnesses for all other values also need to be updated when a value is deleted. If the current state head is *A* and a value *val* with *x* = HashPrime(*val*) is removed, resulting in the new state head *A*₁ = *A*^(1/*x*), then the membership witness *w\** = *A*^(1/*y*) for a value *val\** with *y* = HashPrime(*val\**) is updated to *w\** := (*w\**)^(1/*x*). Note that (*w\**)^(1/*x*) = *A*^(1/(*xy*)) = *A*₁^(1/*y*).

The efficient algorithm to compute the new membership witness *A*^(1/(*xy*)) from the old membership witness *w\** = *A*^(1/*y*) and the new state head *A*₁ = *A*^(1/*x*) first uses the Euclidean algorithm to find integers *a*, *b* satisfying *ax* + *by* = 1, and then outputs the integer (*A*₁)^*b* · (*w\**)^*a* mod *N*. This output equals *A*^(1/(*xy*)) because [(*A*₁)^*b* (*w\**)^*a*]^(*xy*) = [*A*^(*b*/*x*) *A*^(*a*/*y*)]^(*xy*) = *A*^(*by* + *ax*) = *A*.
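This witness-update computation can also be traced with toy numbers (an insecure small modulus, and Python 3.8+ for negative modular exponents in `pow`):

```python
# Two values with primes x and y are accumulated: A = 3**(x*y) mod N.
# Deleting x yields the new head A1 = A**(1/x) = 3**y. The witness for y,
# w_star = A**(1/y) = 3**x, is updated using only public values.
N = 61 * 53                      # insecure toy modulus
x, y = 7, 11
A1 = pow(3, y, N)                # new state head after deleting x
w_star = pow(3, x, N)            # old membership witness for y

def ext_gcd(a, b):
    """Extended Euclid: returns (g, u, v) with u*a + v*b == g."""
    if b == 0:
        return a, 1, 0
    g, u, v = ext_gcd(b, a % b)
    return g, v, u - (a // b) * v

g, a_coef, b_coef = ext_gcd(x, y)
assert g == 1                    # a_coef*x + b_coef*y == 1

# New witness: A1**b * w_star**a = 3**(y*b + x*a) = 3**1 = A1**(1/y) mod N.
new_w = (pow(A1, b_coef, N) * pow(w_star, a_coef, N)) % N
assert pow(new_w, y, N) == A1    # membership of y verified under the new head
```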

Recent research provides new techniques for batch insertion, batch deletion, and constant size aggregate membership witness (or non-membership witness) for an arbitrarily large group of values. A prototype implementation has already been developed and an industry-grade implementation will be added to the Findora cryptographic library in Phase II of development.