Editor’s Note: This article was written largely with input from numerous Findora engineers and does not represent the effort of one person. This is a somewhat synthesized accounting of initial efforts to optimize the EVM layer on Findora, with more optimizations soon to follow.
The Findora blockchain combines a UTXO ledger and an EVM ledger, united through an interatomic bridge called Prism Transfer. Findora’s goal is to create a new blockchain-based financial internet with built-in privacy by scaling Ethereum privacy. Advanced zero-knowledge cryptography and SNARKs let Findora protect transaction data on public blockchains.
Thus, while privacy is the primary concern of the protocol, scalability is another necessary component — without sufficient bandwidth, the network can’t provide privacy protections for multiple ecosystems.
However, while some projects seek transactions-per-second (TPS) rates of 10,000 or more, the reality is many projects don’t need a TPS anywhere near that high. The vast majority of TVL exists on Ethereum and Bitcoin, which have TPS rates of around 15 and 6, respectively.
A high TPS rate, then, is important but not essential, especially when starting out. For example, Avalanche has a theoretical TPS rate of 4,500 but rarely hits more than 9 TPS.
Over the last two months, the Findora development team has successfully raised the TPS on its EVM layer by almost 4X to around 150 TPS — more than enough to handle whatever burden it will have in the near future.
Most of the gains came from:
- Parallelizing the Tendermint ABCI (previously single-threaded)
- Using a read-write lock to replace the exclusive lock (mutex)
- Enhancing the transaction check logic
- Reducing redundant serialization (keeping the transaction list for a block in RAM instead of the database cache)
This document will go through the scientific process used to achieve these results and highlight some of the biggest areas of optimization. We want to describe the technical details of the optimization analysis, implementation, and testing that we conducted in the past 3 months so other teams working with EVM blockchains can leverage our work and replicate and extend our results.
Overview of Methodology
The best way to test and improve our performance, we believe, is with the scientific method: test the environment, analyze results, deploy a fix, and repeat. Consequently, we divided the optimization process into 4 steps:
- Starting the Test
- Test Result Collection
- Result Analysis
- Updating the code and deploying it again
By simplifying and taking out redundancies, we shortened the transaction time and made the blockchain more efficient. Privacy transactions require a lot more computing power than transparent transactions. Though the network must be able to handle a sufficient load to be usable in real-world applications and scalability is a key goal, the Findora Network believes it is secondary to privacy.
Most of the TPS gains came from small improvements that quickly added up. For instance, the team gained around 10 TPS by optimizing the serialization and deserialization process, reducing database reading, and combining functions.
Other improvements, like improving the check_tx function and removing redundant memory allocation, gave another boost in TPS.
Potential Areas of Optimization
There were 10 points total that we thought could be improved. Most gains, however, came from these 6 points:
- Improved deliver_tx and check_tx processing in the main interface
- Reduced the number of persistent operations in the data storage structure
- Optimized functions that call persistent data, reducing the number of calls
- Optimized the performance of the vault for interfaces that are frequently called by deliver_tx and check_tx
- Optimized the logging process
- Optimized read-write lock functions
The four that haven’t yielded much in the way of results are listed below. The first three are a function of Tendermint and could not be directly affected by our team. The final point was not as fruitful as hoped:
- The process for checking new transactions
  - Web3 RPC server’s processing capability
  - ABCI check_tx’s processing capability
- The transmission speed for transferring transactions to the validators and to the full nodes on the blockchain
- ABCI’s capability for function calls
- Optimizing the performance of dependency libraries (this remains a long-term goal)
  - Using high-performance alternative libraries or interfaces
  - Adjusting compile options for libraries
Tools Used to Optimize Transaction Speeds
We used two main tools to measure the network’s performance before identifying places where performance could be improved. Most improvements came from getting rid of redundancies. The two tools were:
- A CLI Tool used to simulate transaction environments
- Pprof-rs to profile the CPU usage of the ABCI process and Findora nodes
We’ll detail how to deploy each and how they are used.
The CLI Tool
We wrote the CLI tool to simulate a client and facilitate testing. It allows us to make transactions, create wallets, write scripts and more, and essentially lets us stress the network for testing without breaking it. A test run works as follows:
- Convert native FRA to smart FRA through Prism and send it to the root account
- The root account sends FRA to multiple addresses (sources) and saves them to a file
- In parallel, perform the following actions:
  - Randomly generate a batch of addresses (targets) and assign an endpoint
  - Obtain the nonce of the source
  - Generate an EVM transfer transaction and send it to the endpoint with send_raw_transaction. Save the hash
  - The client increases the nonce, then generates and submits transactions until every target has been sent one transaction
  - If a transaction fails for any reason other than “mempool full”, the client obtains a new nonce and resubmits
- To reduce the effect of an unresponsive server, resilience logic is added to the nonce interface
- To reduce “mempool full” errors and pressure on the server, a parallelized wait for synchronization is added
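The submit-and-retry part of this workflow can be sketched as follows. This is a minimal illustration, not the feth source: `MockEndpoint`, `SubmitError`, `produce_block`, and `send_to_all` are hypothetical names, and the endpoint is an in-memory stand-in for the Web3 RPC calls a real client would make over HTTP.

```rust
use std::collections::HashMap;

/// Failure reasons the simulated endpoint can return (hypothetical names).
#[derive(Debug, PartialEq)]
enum SubmitError {
    MempoolFull,
    BadNonce,
}

/// In-memory stand-in for a fullnode endpoint (an assumption for this sketch).
struct MockEndpoint {
    nonces: HashMap<String, u64>, // next expected nonce per source address
    capacity: usize,              // free mempool slots
}

impl MockEndpoint {
    /// Returns the next nonce the endpoint expects from `source`.
    fn nonce_of(&self, source: &str) -> u64 {
        *self.nonces.get(source).unwrap_or(&0)
    }

    /// Accepts a transfer if the mempool has room and the nonce matches.
    fn send_raw_transaction(
        &mut self,
        source: &str,
        nonce: u64,
        target: &str,
    ) -> Result<String, SubmitError> {
        if self.capacity == 0 {
            return Err(SubmitError::MempoolFull);
        }
        if nonce != self.nonce_of(source) {
            return Err(SubmitError::BadNonce);
        }
        self.capacity -= 1;
        self.nonces.insert(source.to_string(), nonce + 1);
        // A fake "hash" that simply encodes the transfer.
        Ok(format!("{}:{}->{}", source, nonce, target))
    }

    /// Simulates a new block freeing mempool slots.
    fn produce_block(&mut self, freed: usize) {
        self.capacity += freed;
    }
}

/// Sends one transaction to every target and returns the saved hashes.
fn send_to_all(ep: &mut MockEndpoint, source: &str, targets: &[&str]) -> Vec<String> {
    let mut hashes = Vec::new();
    let mut nonce = ep.nonce_of(source);
    for &target in targets {
        loop {
            match ep.send_raw_transaction(source, nonce, target) {
                Ok(hash) => {
                    hashes.push(hash);
                    nonce += 1; // the client increases the nonce itself
                    break;
                }
                // "mempool full": keep the nonce and wait for the node.
                // The wait is simulated here by producing a block.
                Err(SubmitError::MempoolFull) => ep.produce_block(2),
                // Any other failure: obtain a new nonce and resubmit.
                Err(_) => nonce = ep.nonce_of(source),
            }
        }
    }
    hashes
}
```

In a real client the "mempool full" branch would sleep or wait for synchronization rather than call into the node, and each source would run this loop on its own worker.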
Pprof-rs and CPU Profiling
By “CPU Profiling,” we mean the tools we used to measure and tune the performance of the Findora Network. We came up with an iterative process that allowed us to stress-test the network without breaking it to see which functions could be optimized.
The main tool used to research and locate the bottlenecks of our platform was pprof-rs, a popular CPU profiler for Rust programs.
How to Use Pprof-rs
We imported pprof-rs in the abciapp crate and enabled the flamegraph feature, so pprof-rs is compiled into abcid. With this, the profiler can sample the execution context of abcid at a specified frequency.
In the abcid process, we mainly used the following interfaces for profiling.
1. Use ProfilerGuardBuilder to enable a profiler, and set the frequency to 100.
2. Save the result of profiling into a file as a flamegraph.
3. Because abcid is a persistent process, the profiler must be stopped explicitly, which is done by transferring ownership of the running ProfilerGuard so that it is released.
How Pprof-rs works
The profiler pauses the program at the given frequency and samples its stack trace. The sampling data is stored in a hashmap: on each sample, the profiler scans every stack frame and increments the count stored in the hashmap.
Then, the sampling data can be used to generate a flamegraph or other media to represent network performance.
When ProfilerGuardBuilder starts a profiler, it registers a signal handler for SIGPROF and a timer for pausing the main program. When SIGPROF is triggered, the handler is called to sample the stack traces. This process is performed by the backtrace crate.
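To make the sampling model concrete, here is a simplified sketch of counting sampled stacks in a hashmap and turning the counts into flamegraph input. It is not pprof-rs code: `SampleStore` and its methods are hypothetical, and the stacks are passed in by hand instead of being captured by a signal handler.

```rust
use std::collections::HashMap;

/// Accumulates one count per distinct sampled stack (outermost frame first).
#[derive(Default)]
struct SampleStore {
    counts: HashMap<Vec<&'static str>, u64>,
}

impl SampleStore {
    /// Called once per timer tick with the stack seen at that moment.
    fn record(&mut self, stack: &[&'static str]) {
        *self.counts.entry(stack.to_vec()).or_insert(0) += 1;
    }

    /// Renders the "folded stacks" text form (`frame;frame count`),
    /// a common input format for flamegraph renderers. Sorted for stable output.
    fn folded(&self) -> Vec<String> {
        let mut lines: Vec<String> = self
            .counts
            .iter()
            .map(|(stack, n)| format!("{} {}", stack.join(";"), n))
            .collect();
        lines.sort();
        lines
    }

    /// Fraction of all samples in which `frame` appears anywhere on the stack.
    /// In a flamegraph this corresponds to the relative width of the frame's bar.
    fn share_of(&self, frame: &str) -> f64 {
        let total: u64 = self.counts.values().sum();
        if total == 0 {
            return 0.0;
        }
        let hits: u64 = self
            .counts
            .iter()
            .filter(|(stack, _)| stack.iter().any(|f| *f == frame))
            .map(|(_, n)| *n)
            .sum();
        hits as f64 / total as f64
    }
}
```

With four samples of which three pass through deliver_tx, `share_of("deliver_tx")` is 0.75, which is why wide bars in the flamegraph identify the functions that dominate CPU time.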
Using Pprof-rs to Profile a Findora Fullnode
It’s important to understand first the structure of our blockchain and the Tendermint consensus engine.
In the ETH full node, the Tendermint process communicates with the ABCI process by socket. The ABCI application registers these interfaces by default:
It also provides a Web3 RPC server with the following interface, which a Web3 client can use for testing:
The figure shows the basic workflow of an ETH full node. The full node is the node that collects new transactions and replays the new transactions and is the entry and checkpoint for all transactions. By profiling the ABCI in the full node, we can gain complete data on the performance of the chain.
Before profiling the platform’s abcid, we parallelized the Tendermint ABCI with two threads instead of a single thread. It allows the fullnode to simultaneously call the check_tx function for new transactions and replay the newly generated block (mainly the deliver_tx function). After this upgrade, the Fullnode’s CPU capability is equally distributed. Thus, the number of transactions is around 3000 in every block, and the block time is reduced.
We also measured the time cost for blocks with only one transaction. The functions measured were mainly:
Except for the deliver_tx function, these functions are called only once per block. When a block contains few transactions, the call counts of the four functions are small and close to one another.
For blocks containing many transactions, the time cost of the three functions other than deliver_tx is small.
Based on the modifications and measurements, we concluded that the check_tx and deliver_tx functions account for most of the fullnode’s CPU usage.
To arrive at this conclusion, we start the profiler at the beginning of every block (begin_block). Then we save the profiling flamegraph and stop the profiler before the next block in order not to affect the fullnode’s performance. For this, we use two global variables:
- An atomic boolean variable is used to determine whether to start or stop the profiler
- A variable is used to store the running ProfilerGuard
pprof-rs does not provide an interface for stopping the profiler. We stop the profiler and release the data by transferring the ownership of the ProfilerGuard.
Before stopping the profiler, the sampling data can be stored in the ledger directory as a flamegraph file.
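The two-global-variable pattern can be sketched with the standard library alone. This is an illustration under assumptions, not the abcid code: `FakeGuard` is a hypothetical stand-in for the profiler guard, and `on_begin_block` and `stop_and_report` are invented names for the hooks described above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for the profiler guard: holding it means
/// "profiling is running", releasing it means "profiling has stopped".
struct FakeGuard {
    samples: Vec<String>,
}

impl FakeGuard {
    /// Renders the collected samples (a real guard would build a flamegraph).
    fn report(&self) -> String {
        self.samples.join("\n")
    }
}

/// Global 1: whether the profiler should be running. Set through an RPC.
static PROFILER_ENABLED: AtomicBool = AtomicBool::new(false);
/// Global 2: the running guard, if any.
static RUNNING_GUARD: Mutex<Option<FakeGuard>> = Mutex::new(None);

/// Called at the start of every block: start profiling if requested
/// and no profiler is running yet.
fn on_begin_block() {
    if PROFILER_ENABLED.load(Ordering::SeqCst) {
        let mut slot = RUNNING_GUARD.lock().unwrap();
        if slot.is_none() {
            *slot = Some(FakeGuard {
                samples: vec!["deliver_tx 1".to_string()],
            });
        }
    }
}

/// Stops profiling by moving the guard out of the global, saving its
/// report, and letting the guard go out of scope.
fn stop_and_report() -> Option<String> {
    PROFILER_ENABLED.store(false, Ordering::SeqCst);
    // `take()` transfers ownership of the guard to this function.
    let guard = RUNNING_GUARD.lock().unwrap().take()?;
    let report = guard.report();
    drop(guard); // releasing the guard ends profiling in this model
    Some(report)
}
```

Keeping the guard in an `Option` behind a lock means the stop path can `take()` it exactly once, so a second stop request is a harmless no-op.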
For testing purposes, we added two RPCs. One is for enabling/disabling the profiler.
The other is for retrieving the generated profiling data stored in the flamegraph file.
Step 1: Starting the Test
Our tests were performed by a script that calls feth, our CLI tool. After selecting a fullnode for testing, we transferred FRA to 2000 test accounts with the fund subcommand:
feth fund --network http://dev-qa01-us-west-2-full-001-open.dev.findora.org:8545 --amount 2000 --redeposit --load --count 2000
Then, we began to send transactions to the fullnode. The parallelism, timeout, and transaction count from each account are configurable. For example:
feth --network http://dev-qa01-us-west-2-full-001-open.dev.findora.org:8545 --max-parallelism 300 --timeout 100 --count 10
Step 2: Collecting Test Results
There are four ways to collect the test data and analyze them.
- We can manually monitor test results and blocks with Blockscout. This process lets us visualize blocks to evaluate performance.
- We fetch performance numbers from the Web3 RPC. First, we use the eth_getBlockByNumber interface to get the target block. The length of its transaction array gives the actual number of transactions; only valid transactions can be retrieved from this interface. The block time is the difference between the timestamps of adjacent blocks.
- Tendermint RPC (query block info from the Tendermint RPC): Similar to the Web3 RPC, we used tools like curl, jq and others to retrieve the number of transactions and the block time. This RPC gives us the number of all transactions (valid and invalid) packed in the blocks. Note: both the Web3 RPC and Tendermint RPC methods sometimes have “no response” issues.
- Tendermint log (extract block timestamp, valid/invalid txns): To retrieve, keep, and analyze the test data more conveniently, we use the etl subcommand of feth. It parses the fullnode log and saves the results to a Redis database. As the figure below shows, the block time, the total number of transactions, and the number of valid transactions can be parsed from the Tendermint log, which is generated while the fullnode replays blocks.
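Whichever source the numbers come from, the arithmetic is the same: block time is the timestamp difference between adjacent blocks, and TPS is transactions divided by elapsed time. A minimal sketch, assuming a hypothetical `BlockInfo` record already extracted from the RPC or log output and timestamps in whole seconds:

```rust
/// Minimal view of a block: timestamp in seconds and transaction count.
/// (Hypothetical type; the fields mirror what the RPCs and logs report.)
struct BlockInfo {
    timestamp: u64,
    tx_count: usize,
}

/// Block time of each block after the first: its timestamp minus the
/// timestamp of the previous block.
fn block_times(blocks: &[BlockInfo]) -> Vec<u64> {
    blocks
        .windows(2)
        .map(|w| w[1].timestamp.saturating_sub(w[0].timestamp))
        .collect()
}

/// Average TPS over the range: transactions in every block after the
/// first, divided by the time between the first and last block.
/// Returns None when there are no blocks or no time has elapsed.
fn average_tps(blocks: &[BlockInfo]) -> Option<f64> {
    let first = blocks.first()?;
    let last = blocks.last()?;
    let elapsed = last.timestamp.checked_sub(first.timestamp)?;
    if elapsed == 0 {
        return None;
    }
    let txs: usize = blocks[1..].iter().map(|b| b.tx_count).sum();
    Some(txs as f64 / elapsed as f64)
}
```

The first block's transactions are excluded because its own block time is unknown within the sampled range. For example, with made-up numbers, two consecutive 16-second blocks of 2,400 transactions each work out to 150 TPS.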
Step 3: Analyzing Test Results
Once we’ve gotten test results, we profile or visualize the results so we can iterate through the code. Profiling is the step where we see what’s taking up the most CPU power and time, and we look for ways to optimize.
We added a subcommand in feth which enables the profiler. For example:
- To enable profiler in the next block
feth profiler --network http://dev-qa01-us-west-2-full-001-open.dev.findora.org:8669 --enable
- To generate the flamegraph and stop the profiler
feth profiler --network http://dev-qa01-us-west-2-full-001-open.dev.findora.org:8669
The flamegraph shows how much CPU time each function takes. The wider the bar, the more time was spent in that function, so the widest bars point to the code that most needs optimizing.
Step 4: Redeploying Code
After the code updates, we use Jenkins to deploy it to the test environment.
Changes we Made to Optimize Findora
Based on our testing, here are some changes we have implemented or will implement to increase Findora’s EVM TPS.
Setting the Mempool to 8k
Through testing, we found the optimal size of the mempool in order to improve the stability of fullnodes and make sure it doesn’t take too long to generate blocks is 8,000. We are looking to update the mempool on mainnet soon.
Parallelizing the Tendermint ABCI
Parallelizing the Tendermint ABCI so that check_tx and deliver_tx can be performed simultaneously was another improvement we discovered. This also helps keep the block time from growing too long. However, it does not improve the TPS significantly because the transactions are uniformly distributed.
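The idea of running check_tx and deliver_tx on separate threads over shared state, with a read-write lock instead of a mutex, can be sketched as below. This is a toy model under assumptions: the `Balances` map, the transfer logic, and `run_two_threads` are hypothetical and far simpler than the real ledger.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Toy ledger state: balance per account (hypothetical).
type Balances = HashMap<String, u64>;

/// check_tx only reads state, so it takes the shared read lock and
/// does not block other readers.
fn check_tx(state: &RwLock<Balances>, from: &str, amount: u64) -> bool {
    let balances = state.read().unwrap();
    balances.get(from).copied().unwrap_or(0) >= amount
}

/// deliver_tx mutates state, so it takes the exclusive write lock.
fn deliver_tx(state: &RwLock<Balances>, from: &str, to: &str, amount: u64) -> bool {
    let mut balances = state.write().unwrap();
    let available = balances.get(from).copied().unwrap_or(0);
    if available < amount {
        return false;
    }
    balances.insert(from.to_string(), available - amount);
    *balances.entry(to.to_string()).or_insert(0) += amount;
    true
}

/// Runs one checking thread and one block-replay thread concurrently and
/// returns how many checks passed and how many transfers were delivered.
fn run_two_threads(state: Arc<RwLock<Balances>>) -> (usize, usize) {
    let checker_state = Arc::clone(&state);
    let checker = thread::spawn(move || {
        (0..100)
            .filter(|_| check_tx(&checker_state, "alice", 1))
            .count()
    });
    let replayer = thread::spawn(move || {
        (0..50)
            .filter(|_| deliver_tx(&state, "alice", "bob", 1))
            .count()
    });
    (checker.join().unwrap(), replayer.join().unwrap())
}
```

With a plain mutex every check_tx call would serialize behind every other access; with a read-write lock, readers only wait while a write is in progress.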
We reduced database reads and deserialization by combining some functions in the account package of our SDK.
With this optimization, the TPS is improved by around 10 txn/s.
Removed unnecessary checking
The functions check_tx and deliver_tx use the same logic for handling transactions; the only difference is the context. However, check_tx does not need the logic for PendingTransactions, emitting events, and so on. Thus, we separated the two functions by context, and the TPS improved to 79.2 txn/s.
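A minimal sketch of separating the two paths by context follows. The names `RunMode`, `TxOutcome`, and `handle_tx` are hypothetical, and a nonce comparison stands in for the real validation; the point is only that the check path returns before any pending-transaction or event work is done.

```rust
/// Which ABCI call is handling the transaction (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum RunMode {
    Check,
    Deliver,
}

/// Result of handling one transaction (hypothetical type).
#[derive(Debug, Default)]
struct TxOutcome {
    valid: bool,
    pending: Vec<String>, // pending-transaction records, Deliver only
    events: Vec<String>,  // emitted events, Deliver only
}

/// Shared validation, with the bookkeeping that check_tx does not need
/// skipped when running in Check mode.
fn handle_tx(mode: RunMode, from: &str, nonce: u64, expected_nonce: u64) -> TxOutcome {
    let mut outcome = TxOutcome {
        valid: nonce == expected_nonce,
        ..TxOutcome::default()
    };
    // check_tx stops here: no pending records, no events.
    if !outcome.valid || mode == RunMode::Check {
        return outcome;
    }
    outcome.pending.push(format!("{}:{}", from, nonce));
    outcome.events.push(format!("Transfer({}, nonce={})", from, nonce));
    outcome
}
```

Because check_tx runs for every incoming transaction, work skipped on that path is saved once per transaction before it ever reaches a block.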
In the previous implementation, performing one transaction involved a large number of memory copies, allocations, and deallocations between cur and base.
By reducing the number of these operations, the TPS reached 149 txn/s.
Avoid printing redundant logs
We removed hundreds of unnecessary logs (what we call verbose logs). These logs can be reprinted for debugging but won’t automatically be printed on mainnet.
Future Optimization Points
Based on the flamegraph, there are two more things we can do in the future.
The secp256k1_ecdsa_recover function takes a significant portion of the time spent in the recover_signer function. The core of secp256k1_ecdsa_recover is the recover interface of the libsecp256k1 crate, which accounts for most of that time.
We can consider optimizing this library, replacing it with another high-performance library, or reducing the number of calls.
Findora backend for EVM
It takes up a large portion of the whole transaction process, but we have not yet found further optimization points for it. We may need more tests and profiling for this part in the future.
Improving TPS is an iterative process, and the team is always looking for new ways to increase the network capacity. While we obviously want to remain competitive, we think the more collaboratively Web3 behaves, the stronger the industry will be as a whole. Thus we are happy to share our processes both to help other teams and also to get constructive feedback. We hope this breakdown is helpful for other teams working in an EVM environment so they can implement some of these changes to improve their own TPS.