Source: Empower Labs
In the annals of technological progress, revolutionary technologies often emerge independently, each driving the transformation of its own era. And when two revolutionary technologies collide, the impact is often exponential. Today we stand at just such a historic moment: artificial intelligence and crypto, two equally disruptive technologies, are stepping onto center stage together.
We imagine that many challenges in AI can be solved by crypto; we expect AI Agents to build autonomous economic networks and drive large-scale adoption of crypto; we also hope that AI can accelerate the development of existing scenarios in the crypto field. Countless eyes are fixed on this intersection, and massive amounts of capital are pouring in. Like any buzzword, it embodies people's desire for innovation and vision for the future, but it also carries unchecked ambition and greed.
Yet amid this bustle, we know very little about the most basic questions. How well does AI actually understand crypto? Can an agent equipped with a large language model actually use crypto tools? How much do different models differ on crypto tasks?
The answers to these questions will determine how AI and crypto shape each other, and they are crucial for choosing product directions and technical routes in this cross-disciplinary field. To explore them, I ran a series of evaluation experiments on large language models. By assessing their knowledge and capabilities in the crypto domain, we can gauge how ready AI is for crypto applications and judge the potential, and the challenges, of combining the two technologies.
Conclusions First
Large language models perform well on cryptography and blockchain fundamentals and understand the crypto ecosystem reasonably well, but they perform poorly on mathematical calculation and complex business-logic analysis. On private keys and basic wallet operations, the models have a satisfactory foundation, though keeping private keys secure when the model runs in the cloud remains a severe challenge. Many models can generate working smart contract code for simple scenarios, but they cannot independently handle demanding tasks such as contract auditing or complex contract creation.
Commercial closed-source models hold a clear overall lead. In the open-source camp, only Llama 3.1-405B performs well; every open-source model with a smaller parameter count fails. The potential is real, however: with prompt engineering, chain-of-thought reasoning, and few-shot learning, every model's performance improved substantially, and the leading models already show strong technical feasibility in some vertical application scenarios.
Experimental details
18 representative language models were selected for evaluation:
Closed source models: GPT-4o, GPT-4o Mini, Claude 3.5 Sonnet, Gemini 1.5 Pro, Grok2 beta (temporarily closed source)
Open source models: Llama 3.1 8B/70B/405B, Mistral Nemo 12B, DeepSeek-Coder-V2, Nous-Hermes 2, Phi-3 3.8B/14B, Gemma 2 9B/27B, Command-R
Mathematical optimization models: Qwen2-math-72B, MathΣtral
These models cover the mainstream commercial and popular open-source options, with parameter counts spanning more than 100x, from 3.8B to 405B. Given how closely crypto is tied to mathematics, the experiment also included two math-optimized models.
The knowledge areas covered include cryptography, blockchain fundamentals, private keys and wallet operations, smart contracts, DAOs and governance, consensus and economic models, Dapp/DeFi/NFT, on-chain data analysis, and more. Each area consists of questions and tasks ranging from easy to difficult, testing not only the models' knowledge reserve but also their performance on simulated application tasks.
The tasks come from diverse sources: some were contributed by multiple experts in the crypto field, and the rest were generated with AI assistance and then manually proofread to ensure accuracy and an appropriate level of challenge. Some tasks use simpler multiple-choice questions so they can be scored separately through standardized, automated testing. The rest use more complex questions, graded through a combination of program automation, manual review, and AI. All tasks were evaluated with zero-shot inference, without providing any examples, chains of thought, or instruction-style prompts.
Since the experimental design is relatively rough and lacks full academic rigor, the questions and tasks used are far from covering the whole crypto field, and the testing framework is immature. This article therefore does not list specific experimental data, focusing instead on sharing insights from the experiments.
Knowledge/Concepts
During evaluation, the large language models performed well on basic knowledge tests across areas such as encryption algorithms, blockchain fundamentals, and DeFi applications. For example, on open-ended questions probing the concept of data availability, every model gave an accurate answer. On questions assessing mastery of the Ethereum transaction structure, the models differed slightly in detail but generally included the correct key information. Multiple-choice concept questions were easier still, with almost every model scoring above 95% accuracy.
Conceptual Q&A poses little difficulty for large models.
Calculation/Business Logic
The situation reverses, however, on questions that require actual calculation. A simple RSA computation trips up most models. This is not hard to understand: large language models work mainly by recognizing and reproducing patterns in their training data, not by deeply understanding the underlying mathematics. The limitation is especially evident with abstract operations such as modular arithmetic and exponentiation. Given how closely crypto is tied to mathematics, relying on the models directly for crypto-related calculation is unreliable.
On other calculation questions the models were equally unsatisfactory. On a simple question about computing the impermanent loss of an AMM position, which involves no complex mathematics, only 4 of the 18 models answered correctly. An even more basic question on computing a block-production probability stumped every model; none got it right. This exposes not only the models' weakness in precise calculation but also serious problems in business-logic analysis. Notably, even the math-optimized models showed no obvious advantage on calculation questions; their performance was disappointing.
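For reference, the impermanent-loss calculation for a constant-product AMM reduces to a one-line formula in the price ratio. This is my own formulation of the standard result, not the exact question used in the test:

```python
import math

def impermanent_loss(price_ratio: float) -> float:
    """Loss of an x*y=k LP position versus simply holding,
    given final price divided by entry price."""
    return 2 * math.sqrt(price_ratio) / (1 + price_ratio) - 1

# A 4x price move costs the LP 20% relative to holding.
print(round(impermanent_loss(4.0), 6))  # -0.2
```

That a question this mechanical defeated 14 of 18 models suggests the failure is in executing the arithmetic, not in recalling the formula.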
The mathematical calculation problem is not unsolvable, though. With a slight adjustment, asking the LLM to emit the corresponding Python code instead of computing the result directly, accuracy improves dramatically. For the RSA question above, the Python code produced by most models executed successfully and returned the correct result. In a production environment, the model's own arithmetic can be bypassed entirely by supplying preset algorithm code, much as humans handle such tasks. At the business-logic level, model performance can likewise be improved substantially with carefully designed prompts.
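To illustrate why code generation succeeds where direct calculation fails: a textbook toy RSA round-trip (not the actual test question) is only a few lines of Python, with the built-in `pow` handling both modular exponentiation and the modular inverse (Python 3.8+).

```python
# Toy RSA with tiny textbook primes -- for demonstration only,
# never a basis for real key material.
p, q = 61, 53
n = p * q                  # modulus: 3233
phi = (p - 1) * (q - 1)    # Euler's totient: 3120
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # private exponent via modular inverse: 2753

message = 65
ciphertext = pow(message, e, n)    # encrypt: m^e mod n
decrypted = pow(ciphertext, d, n)  # decrypt: c^d mod n
print(ciphertext, decrypted)       # 2790 65
```

A model that muddles `65**17 mod 3233` in its head can still emit this code verbatim, because the pattern is abundant in training data; the interpreter then does the arithmetic the model cannot.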
Private key management and wallet operation
If you ask what an Agent's first use case for cryptocurrency will be, my answer is payments. Cryptocurrency can almost be regarded as AI's native form of money. Compared with the many obstacles agents face in the traditional financial system, using crypto to equip themselves with a digital identity and manage funds through a crypto wallet is the natural choice. Generating and managing private keys, together with the various wallet operations, therefore constitute the most basic skill set an agent needs to use crypto networks independently.
Securely generating a private key hinges on high-quality randomness, which is obviously not something a large language model can provide. The models do, however, understand private-key security well. When asked to generate a private key, most chose to guide the user to generate one independently via code (for example, Python's relevant libraries). Even when a model output a private key directly, it clearly stated that the key was for demonstration only and not secure for actual use. On this front, all of the large models performed satisfactorily.
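The code-guided approach the models favor can be as simple as drawing 32 bytes from the operating system's CSPRNG. A minimal standard-library sketch (producing only the raw key; a real wallet library would add address derivation on top):

```python
import secrets

# 32 cryptographically secure random bytes -- the raw form of a private
# key on secp256k1 chains such as Bitcoin or Ethereum. Generated locally,
# on the user's machine; never have a cloud-hosted model emit real keys.
private_key = secrets.token_bytes(32)
print(private_key.hex())  # 64 hex characters
```

The point of having the model write this rather than "invent" a key is exactly the randomness problem above: the entropy comes from the OS, not from the model's sampling.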
Private key management faces challenges that stem mainly from inherent limitations of the technical architecture rather than from any lack of model capability. With a locally deployed model, a generated private key can be considered relatively safe. But if a commercial cloud model is used, we must assume the key is exposed to the model's operator the moment it is generated. Yet an agent meant to act independently needs private-key permissions, which means the key cannot live solely on the user's machine. In that case the model alone cannot guarantee key security, and additional safeguards such as a trusted execution environment (TEE) or an HSM must be introduced.
Assuming the agent already holds a private key securely, the models in the test showed good capability at the various basic operations built on top of it. Their output steps and code often contain errors, but most of these can be absorbed by an appropriate engineering architecture. From a technical standpoint, there are few remaining obstacles to letting an agent perform basic wallet operations autonomously.
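One way to picture the engineering split this implies: the agent assembles an unsigned transaction, while signing is delegated to a hardened component so the key never enters the model's context. Everything below is hypothetical scaffolding of my own, not a real wallet API:

```python
from dataclasses import dataclass

@dataclass
class UnsignedTx:
    """Minimal Ethereum-style transaction fields the agent fills in."""
    to: str
    value_wei: int
    nonce: int
    gas_limit: int = 21_000  # standard cost of a plain ETH transfer

def sign_with_hsm(tx: UnsignedTx) -> bytes:
    """Placeholder for the trusted signer (TEE/HSM). The private key
    stays inside the secure boundary; the agent only sees signatures."""
    raise NotImplementedError("delegated to the secure enclave")

# The agent's side: prepare the transaction, then hand it off.
tx = UnsignedTx(to="0x" + "ab" * 20, value_wei=10**18, nonce=7)
print(tx.gas_limit)  # 21000
```

The errors observed in model output mostly occur on the agent's side of this boundary, which is also the side where validation and retries are cheap to engineer.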
Smart Contracts
The ability to understand, use, write, and identify risks in smart contracts is key to AI Agents performing complex tasks in the on-chain world, and it was therefore a key test area in the experiment. Large language models show significant potential here, but they also expose some obvious problems.
In the tests, almost all models correctly answered basic contract-concept questions and identified simple bugs. On gas optimization, most models could identify the key optimization points and analyze the conflicts an optimization might introduce. When it came to deep business logic, however, the large models' limitations began to show.
Take a token vesting contract as an example: all models correctly understood the contract's function, and most found several medium- and low-risk vulnerabilities. But not a single model independently discovered a high-risk vulnerability hidden in the business logic that could lock up part of the funds under special circumstances. Across multiple tests on real contracts, the models performed roughly the same.
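To make this failure mode concrete, here is a simplified Python model of the *kind* of business-logic flaw involved. It is my own illustrative construction, not the audited contract: linear vesting where an early revocation silently strands the unvested remainder, because no code path ever releases it.

```python
# Hypothetical simplified vesting logic illustrating a funds-locking flaw.
# Not the contract from the experiment.

def vested_amount(total: int, start: int, duration: int, now: int) -> int:
    """Linear vesting: tokens released proportionally over `duration`."""
    if now >= start + duration:
        return total
    return total * (now - start) // duration

def revoke(total: int, start: int, duration: int, now: int) -> tuple[int, int]:
    """Flaw: on revocation the beneficiary keeps the vested portion, but
    the unvested remainder is neither refunded nor claimable later."""
    vested = vested_amount(total, start, duration, now)
    locked_forever = total - vested  # nothing ever pays this out
    return vested, locked_forever

# Revoking a quarter of the way through strands 75% of the grant.
vested, locked = revoke(total=1_000_000, start=0, duration=400, now=100)
print(vested, locked)  # 250000 750000
```

Every individual line here is "correct" in isolation, which is exactly why formal-level reading misses it: the bug only appears when you ask where the remainder is supposed to go.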
This suggests the large models' understanding of contracts remains at the formal level, lacking a grasp of the deep business logic. After additional hints were provided, however, some models were eventually able to locate the deeply hidden vulnerability in the contract above. Based on this performance, and with good engineering support, large models basically have what it takes to serve as a co-pilot in the smart contract field. But there is still a long way to go before they can independently take on high-stakes tasks such as contract auditing.
One caveat: the code-related tasks in the experiment targeted contracts with simple logic and fewer than 2,000 lines of code. Larger, more complex projects are, I believe, clearly beyond the effective capacity of current models without fine-tuning or elaborate prompt engineering, and were not included in the test. In addition, the test covered only Solidity, not other smart contract languages such as Rust or Move.
Beyond the content above, the experiment also covered DeFi scenarios, DAOs and governance, on-chain data analysis, consensus mechanism design, and tokenomics. The large language models demonstrated some capability in all of these areas, but since many tests are still in progress and the methods and framework are being refined, this article will not discuss them in depth for now.
Model Differences
Among all the large language models evaluated, GPT-4o and Claude 3.5 Sonnet continue their outstanding performance from other fields and are the undisputed leaders. On basic questions, both almost always answer accurately; in complex scenario analysis, they provide in-depth, well-argued insights. They even show a high success rate on the calculation tasks large models are generally bad at, though this "high" rate is relative and still falls short of the stability a production environment requires.
In the open-source camp, Llama 3.1-405B is far ahead of its peers, thanks to its large parameter count and advanced architecture. Among the smaller open-source models there is no significant performance gap: the scores vary slightly, but all fall well short of the passing line.
Therefore, for anyone building crypto-related AI applications today, these small- and medium-parameter models are not suitable choices.
Two models stood out in our evaluation. The first is Microsoft's Phi-3 3.8B, the smallest model in the experiment. With fewer than half the parameters, it achieved performance comparable to the 8B-12B models, and it even did better on certain categories of questions. This highlights the importance of model architecture and training strategy, not just parameter count.
Cohere's Command-R was the other surprise, a "dark horse" in reverse. Command-R is less famous than the other models, but Cohere is a large-model company focused on the B2B market, and I felt it had considerable fit with areas like Agent development, so I deliberately included it. Yet Command-R, at 35B parameters, ranked last in most tests, losing to many models under 10B.
This prompted some reflection: at release, Command-R emphasized retrieval-augmented generation and did not even publish conventional benchmark scores. Is it a "specialized key" that only unlocks its full potential in particular scenarios?
Experimental limitations
This series of tests gave us a preliminary picture of AI's capabilities in the crypto field. Of course, the tests fall far short of professional standards: dataset coverage is insufficient, the scoring criteria are relatively crude, and a refined, more precise scoring mechanism is still lacking. All of this affects the accuracy of the results and may understate some models' performance.
Methodologically, the experiment used only zero-shot prompting, without exploring approaches such as chain-of-thought or few-shot learning that can unlock more of a model's potential. All experiments also used the models' default parameters, without examining how different settings affect performance. These uniformly simple methods limit how comprehensively we can assess the models' potential and leave their performance under specific conditions unexplored.
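For clarity on what was and was not tested: the three prompting regimes differ only in how the prompt is constructed. A schematic sketch with illustrative question text of my own (no model call involved):

```python
question = ("A pool holds 100 ETH and 200,000 USDC. After ETH doubles "
            "in price, what is the LP's impermanent loss?")

# Zero-shot: the bare question -- the only regime used in the experiment.
zero_shot = question

# Chain-of-thought: append an instruction to reason step by step.
chain_of_thought = question + "\nLet's work through this step by step."

# Few-shot: prepend one or more worked examples before the real question.
example = (
    "Q: After a 4x price move, what is the impermanent loss?\n"
    "A: IL = 2*sqrt(4)/(1+4) - 1 = -20%.\n\n"
)
few_shot = example + "Q: " + question + "\nA:"
```

Since the conclusions noted that all three techniques substantially improved performance in follow-up trials, the zero-shot numbers should be read as a floor, not a ceiling.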
Despite the relatively simple testing conditions, these experiments still produced many valuable insights and offer a reference for developers building applications.
The crypto field needs its own benchmark
In AI, benchmarks play a key role. The rapid rise of modern deep learning traces back to ImageNet, the standardized benchmark and dataset for computer vision built by Professor Fei-Fei Li's team; breakthrough results on the 2012 ImageNet challenge are widely seen as its starting gun.
By providing a unified evaluation standard, benchmarks give developers clear goals and reference points while driving progress across the whole industry. This explains why every newly released large language model highlights its results on various benchmarks. The scores have become a "universal language" of model capability, letting researchers locate breakthroughs, developers pick the model best suited to a given task, and users make informed choices based on objective data. More importantly, benchmarks often signal the future direction of AI applications, guiding investment and research focus.
If we believe the intersection of AI and crypto holds great potential, then building a dedicated crypto benchmark becomes an urgent task. Such a benchmark could become the key bridge connecting the two fields, catalyzing innovation and providing clear guidance for future applications.
Compared with mature benchmarks in other fields, however, building one for crypto faces unique challenges: the technology evolves rapidly, the industry's knowledge system has not yet solidified, and consensus is lacking on several core directions. As an interdisciplinary field, crypto spans cryptography, distributed systems, economics, and more, with complexity far beyond any single domain. Harder still, a crypto benchmark must evaluate not just knowledge but AI's practical ability to operate crypto tools, which calls for a new evaluation framework. The scarcity of relevant datasets adds further difficulty.
The complexity and importance of this task mean it cannot be completed by any single person or team. It needs the combined wisdom of users, developers, cryptographers, crypto researchers, and many more people across disciplines, and it depends on broad community participation and consensus. Crypto benchmarks therefore require much wider discussion, because this is not only engineering work but also a deep reflection on how we understand this emerging technology.
Postscript: this topic is far from exhausted. In the next article, I will dig into the concrete ideas and challenges of building AI benchmarks for the crypto field. The experiments are ongoing: we are continually refining the test methods, enriching the datasets, improving the evaluation framework, and building out the automated testing project. In the spirit of open collaboration, all related resources, including datasets, experimental results, the evaluation framework, and the automated testing code, will be open-sourced as public goods.