Author: Geng Kai, DFG
The Importance of Data in Blockchain
Data is key to blockchain technology and is fundamental to developing decentralized applications (dApps). While much of the current discussion revolves around data availability (DA)—ensuring that every network participant has access to recent transaction data for verification—there is an equally important aspect that is often overlooked: data accessibility.
In the era of modular blockchains, DA solutions have become indispensable. These solutions ensure that transaction data is available to all participants, enabling real-time verification and maintaining the integrity of the network. However, the DA layer functions more like a billboard than a database. This means that data is not stored indefinitely; it is deleted over time, just as a poster on a billboard is eventually replaced by a new one.
Data accessibility, on the other hand, focuses on the ability to retrieve historical data, which is essential for developing dApps and conducting blockchain analytics. This aspect is critical for tasks that require access to past data to ensure accurate representation and execution. Data accessibility receives far less attention than data availability, yet it is just as important. The two play different but complementary roles in the blockchain ecosystem, and a comprehensive data management approach must address both to support powerful and efficient blockchain applications.
How Blockchain Data Was Retrieved Before
Since its inception, blockchain has revolutionized infrastructure and enabled the creation of decentralized applications (dApps) in various fields such as gaming, finance, and social networking. However, building these dApps requires access to large amounts of blockchain data, which is difficult and expensive.
For dApp developers, one option is to host and run their own archive RPC nodes. These nodes store all historical blockchain data from the beginning, allowing full access to the data. However, archive nodes are expensive to maintain and their query capabilities are limited, so developers often cannot query the data in the format they need. Running cheaper, non-archive nodes is an option, but these nodes have limited data retrieval capabilities, which may hinder the operation of dApps.
Another approach is to use a commercial RPC (Remote Procedure Call) node provider. These providers are responsible for the cost and management of the nodes and provide data through RPC endpoints. Public RPC endpoints are free but are rate-limited and may negatively impact the user experience of dApps. Private RPC endpoints provide better performance by reducing congestion, but even simple data retrieval requires a lot of back-and-forth communication. This makes them request-heavy and inefficient for complex data queries. In addition, private RPC endpoints are often difficult to scale and lack compatibility across different networks.
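To illustrate why RPC access is request-heavy, consider scanning a range of blocks for transactions. The standard Ethereum JSON-RPC interface has no single method that returns every transaction involving an address, so a client typically fetches blocks one at a time and filters locally. The sketch below only builds the request payloads (it sends nothing over the network); the block range is an arbitrary assumption chosen for illustration.

```python
import json

def block_requests(start_block: int, end_block: int) -> list[dict]:
    """Build one eth_getBlockByNumber JSON-RPC request per block in the range."""
    return [
        {
            "jsonrpc": "2.0",
            "id": number,
            "method": "eth_getBlockByNumber",
            # Second parameter True asks for full transaction objects.
            "params": [hex(number), True],
        }
        for number in range(start_block, end_block + 1)
    ]

# Hypothetical 1,000-block window chosen only for illustration.
requests_needed = block_requests(19_000_000, 19_000_999)
print(len(requests_needed), "requests for a 1,000-block scan")
print(json.dumps(requests_needed[0]))
```

Even this modest window needs a thousand round trips before any filtering happens, which is the overhead an indexer is meant to remove.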
A Better Alternative: Blockchain Indexers
Blockchain indexers play a vital role in organizing on-chain data and sending it to a database for easy querying, which is why they are often referred to as the "Google of blockchain". They work by indexing blockchain data and making it readily available through a structured query interface (for example, SQL-like query languages or APIs such as GraphQL). By providing a unified interface for querying data, indexers allow developers to quickly and accurately retrieve the information they need using a standardized query language, greatly simplifying the process.
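The core idea can be sketched in a few lines: decode events once, write them to a database with suitable indexes, and answer questions with a single query. The example below is a toy model using an in-memory SQLite database; the event fields, addresses, and table layout are assumptions made for illustration and do not reflect any particular indexer's schema.

```python
import sqlite3

# Hypothetical decoded transfer events, as an indexer might extract them from blocks.
RAW_EVENTS = [
    {"block": 100, "token": "USDC", "sender": "0xaaa", "recipient": "0xbbb", "amount": 50},
    {"block": 101, "token": "USDC", "sender": "0xbbb", "recipient": "0xccc", "amount": 20},
    {"block": 102, "token": "DAI", "sender": "0xaaa", "recipient": "0xccc", "amount": 75},
]

def build_index(events):
    """Load decoded events into an in-memory table with an index on recipient."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transfers "
        "(block INTEGER, token TEXT, sender TEXT, recipient TEXT, amount INTEGER)"
    )
    db.execute("CREATE INDEX idx_recipient ON transfers (recipient)")
    db.executemany(
        "INSERT INTO transfers VALUES (:block, :token, :sender, :recipient, :amount)",
        events,
    )
    db.commit()
    return db

db = build_index(RAW_EVENTS)

# One query answers "how much of each token did 0xccc receive?"
received = db.execute(
    "SELECT token, SUM(amount) FROM transfers "
    "WHERE recipient = ? GROUP BY token ORDER BY token",
    ("0xccc",),
).fetchall()
print(received)
```

The same question asked over raw RPC would require fetching and decoding every relevant block on each request.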
Different types of indexers optimize data retrieval in various ways:
Full node indexers: These indexers run a full blockchain node and extract data directly from it, ensuring that the data is complete and accurate, but requiring a lot of storage and processing power.
Lightweight indexers: These indexers rely on full nodes to fetch specific data on demand, reducing storage requirements but potentially increasing query times.
Specialized indexers: These indexers specialize in certain types of data or specific blockchains, optimizing retrieval for specific use cases, such as NFT data or DeFi transactions.
Aggregated indexers: These indexers pull data from multiple blockchains and sources, including off-chain information, providing a unified query interface, which is particularly useful for multi-chain dApps.
An Erigon archive node for Ethereum alone requires about 3TB of storage space, and that requirement will keep increasing as blockchains continue to grow. Indexer protocols deploy multiple indexers that can efficiently index and query large amounts of data at high speed, which is not practical over RPC alone.
Indexers also allow for complex queries, easy filtering of data based on different criteria, and analysis of data after extraction. Some indexers also allow for aggregation of data from multiple sources, avoiding the need to deploy multiple APIs in multi-chain dApps. By being distributed across multiple nodes, indexers provide enhanced security and performance, while RPC providers may experience outages and downtime due to their centralized nature.
Overall, indexers improve the efficiency and reliability of data retrieval compared to RPC node providers, while also costing less than deploying and running a dedicated node. This makes blockchain indexer protocols a top choice for dApp developers.
Indexer Use Cases
As mentioned before, building dApps requires retrieving and reading blockchain data in order to run their services. This includes any type of dApp, including DeFi, NFT platforms, games, and even social networks, as these platforms need to read the data before they can perform other transactions.
DeFi
DeFi protocols require different information in order to quote specific prices, rates, fees, etc. to users. Automated Market Makers (AMMs) require price and liquidity information about certain pools to calculate swap rates, while lending protocols require utilization rates to determine borrowing rates and debt ratios for liquidations. This information must be fed into the dApp before it can calculate the rates it quotes to users.
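As a rough illustration of how indexed pool and market data feeds these calculations, the sketch below quotes a swap and a borrow rate. It assumes a constant-product AMM with a 0.3% fee and a simple linear interest rate model; both are common designs but are assumptions here, and the parameter values are hypothetical.

```python
def swap_output(amount_in: float, reserve_in: float, reserve_out: float,
                fee: float = 0.003) -> float:
    """Constant-product quote (x * y = k): tokens received for amount_in, after the fee."""
    amount_in_after_fee = amount_in * (1 - fee)
    return reserve_out * amount_in_after_fee / (reserve_in + amount_in_after_fee)

def utilization_rate(total_borrowed: float, available_liquidity: float) -> float:
    """Share of supplied funds that is currently borrowed."""
    total = total_borrowed + available_liquidity
    return 0.0 if total == 0 else total_borrowed / total

def borrow_rate(utilization: float, base_rate: float = 0.02, slope: float = 0.10) -> float:
    """Hypothetical linear model: the rate rises with utilization."""
    return base_rate + slope * utilization

# Reserves and balances would come from the indexer; these values are made up.
quote = swap_output(amount_in=1.0, reserve_in=1_000.0, reserve_out=2_000_000.0)
utilization = utilization_rate(total_borrowed=750.0, available_liquidity=250.0)
print(f"swap quote: {quote:.2f}")
print(f"utilization: {utilization:.2%}, borrow rate: {borrow_rate(utilization):.2%}")
```

Stale or missing reserve data directly produces wrong quotes, which is why these protocols depend on fast, accurate reads.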
Games
GameFi needs to quickly index and access data to ensure users can play games smoothly. Only with lightning-fast data retrieval and execution can Web3 games match Web2 games in performance and attract more users. These games require data such as land ownership, in-game token balances, in-game actions, etc. With indexers, they can better ensure a steady stream of data and stable uptime to ensure a perfect gaming experience.
NFTs
NFT markets and lending platforms need indexed data to access a variety of information, such as NFT metadata, ownership and transfer data, royalty information, etc. Quickly indexing such data avoids browsing each NFT one by one to find ownership or NFT property data.
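A minimal sketch of such an ownership index: replay transfer events in chain order so that the latest transfer of each token determines its current owner. The event fields and addresses are hypothetical and stand in for whatever a real indexer would decode from transfer logs.

```python
def build_ownership_index(transfers):
    """Replay transfers in chain order; the last transfer of a token sets its owner."""
    owner_of = {}
    for event in sorted(transfers, key=lambda e: (e["block"], e["log_index"])):
        owner_of[(event["collection"], event["token_id"])] = event["to"]
    return owner_of

def tokens_owned_by(owner_of, owner):
    """Look up every token held by an owner without scanning the chain."""
    return sorted(key for key, current in owner_of.items() if current == owner)

# Hypothetical events, deliberately out of order to show the sort matters.
TRANSFERS = [
    {"block": 12, "log_index": 0, "collection": "0xnft", "token_id": 1, "to": "0xbob"},
    {"block": 10, "log_index": 0, "collection": "0xnft", "token_id": 1, "to": "0xalice"},
    {"block": 11, "log_index": 2, "collection": "0xnft", "token_id": 2, "to": "0xbob"},
]

owner_of = build_ownership_index(TRANSFERS)
print(tokens_owned_by(owner_of, "0xbob"))
```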
Whether it is a DeFi AMM that needs price and liquidity information, or a SocialFi app that needs to display new user posts, being able to retrieve data quickly is critical for dApps to function properly. With indexers, they can retrieve data efficiently and correctly, providing a smooth user experience.
Analytics
Indexers provide a way to extract specific data from raw blockchain data, including smart contract events in each block. This opens up opportunities for more specific data analysis, providing comprehensive insights.
For example, a perpetual trading protocol can find out which tokens have high trading volume and generate the most fees, and use this to decide whether to list these tokens as perpetual contracts on its platform. DEX developers can create dashboards for their products to gain insight into which pools have the highest returns or the most liquidity. Public dashboards can also be created, giving developers the freedom and flexibility to query any type of data to be displayed on a chart.
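The kind of analysis described above can be sketched as a simple aggregation over indexed swap events. The records below are invented for illustration; in practice they would be the rows returned by an indexer query.

```python
from collections import defaultdict

# Hypothetical indexed swap events (token, USD volume, USD fees).
SWAPS = [
    {"token": "ETH", "volume_usd": 1200.0, "fee_usd": 3.6},
    {"token": "SOL", "volume_usd": 500.0, "fee_usd": 1.5},
    {"token": "ETH", "volume_usd": 800.0, "fee_usd": 2.4},
    {"token": "ARB", "volume_usd": 300.0, "fee_usd": 0.9},
]

def summarize(swaps):
    """Total volume and fees per token, ranked by volume (highest first)."""
    totals = defaultdict(lambda: {"volume_usd": 0.0, "fee_usd": 0.0})
    for swap in swaps:
        totals[swap["token"]]["volume_usd"] += swap["volume_usd"]
        totals[swap["token"]]["fee_usd"] += swap["fee_usd"]
    return sorted(totals.items(), key=lambda item: item[1]["volume_usd"], reverse=True)

ranking = summarize(SWAPS)
for token, stats in ranking:
    print(f"{token}: volume ${stats['volume_usd']:.0f}, fees ${stats['fee_usd']:.2f}")
```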
Since there are multiple blockchain indexers available, identifying the differences between indexing protocols is critical to ensuring that developers choose the indexer that best suits their needs.
Blockchain Indexer Overview
The Graph
The Graph is the first indexer protocol launched on Ethereum, which makes it easy to query transaction data that was previously not easily accessible. It uses subgraphs to define and filter subsets of data collected from the blockchain, such as all transactions related to the Uniswap v3 USDC/ETH pool.
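To make the subgraph idea concrete, the sketch below builds the body of a GraphQL request for recent swaps in a single pool. The entity and field names (swaps, pool, amount0, amount1, timestamp) and the placeholder pool address are assumptions for illustration; the actual names depend on the schema of the subgraph being queried. The code only constructs the payload and does not contact any endpoint.

```python
import json

def build_swaps_query(pool_address: str, first: int = 5) -> dict:
    """Build a GraphQL request body for the most recent swaps in one pool."""
    query = """
    query PoolSwaps($pool: String!, $first: Int!) {
      swaps(first: $first, orderBy: timestamp, orderDirection: desc,
            where: { pool: $pool }) {
        id
        timestamp
        amount0
        amount1
      }
    }
    """
    return {"query": query, "variables": {"pool": pool_address.lower(), "first": first}}

# Placeholder address, not a real pool.
PLACEHOLDER_POOL = "0x" + "00" * 20
payload = build_swaps_query(PLACEHOLDER_POOL)
print(json.dumps(payload["variables"]))
```

A single request of this shape replaces the many per-block RPC calls a client would otherwise need to collect the same swaps.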
Using Proof of Indexing, Indexers stake the native token GRT to provide indexing and query services, and Delegators can choose to delegate their tokens to Indexers. Curators signal on high-quality subgraphs to help Indexers determine which subgraphs to index in order to earn the best query fees. In its transition to greater decentralization, The Graph will eventually discontinue its hosted service and require subgraphs to upgrade to its network, while providing upgraded indexers.
Its infrastructure enables an average cost of $40 per million queries, which is significantly lower than the cost of self-hosting a node. Using file data sources, it also supports indexing on-chain and off-chain data in parallel for efficient data retrieval.
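As a quick worked example of the quoted figure, the sketch below estimates a monthly bill at $40 per million queries. The daily query volume is a hypothetical input, and real pricing may vary by subgraph and indexer.

```python
COST_PER_MILLION_QUERIES_USD = 40.0  # average figure quoted in the text

def monthly_query_cost(queries_per_day: int, days: int = 30) -> float:
    """Estimated cost in USD for a month of queries at the average rate."""
    total_queries = queries_per_day * days
    return total_queries / 1_000_000 * COST_PER_MILLION_QUERIES_USD

# A hypothetical dApp serving 500,000 queries per day.
print(monthly_query_cost(500_000))  # 15 million queries -> 600.0
```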
The Graph's indexer rewards have been growing steadily over the past few quarters. This is partly due to the increase in query volume, and partly due to the rise in the token price as The Graph plans to integrate AI-assisted queries in the future.
Subsquid
Subsquid is a peer-to-peer, horizontally scalable, decentralized data lake that efficiently aggregates large amounts of on-chain and off-chain data and is protected by zero-knowledge proofs. It is built as a decentralized network of workers in which each node is responsible for storing data from a specific subset of blocks. This speeds up data retrieval, because the nodes that hold the required data can be identified quickly.
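The routing idea can be sketched as a lookup from block number to the worker that holds it. This is a generic illustration of range partitioning, not Subsquid's actual assignment scheme; the worker names and range boundaries are hypothetical.

```python
import bisect

# Hypothetical assignment: each worker holds blocks from its start
# up to (but not including) the next worker's start.
ASSIGNMENTS = [(0, "worker-a"), (1_000_000, "worker-b"), (2_000_000, "worker-c")]
STARTS = [start for start, _ in ASSIGNMENTS]

def worker_for_block(block: int) -> str:
    """Find the worker holding a block with a binary search over range starts."""
    if block < STARTS[0]:
        raise ValueError("block number is below the first assigned range")
    index = bisect.bisect_right(STARTS, block) - 1
    return ASSIGNMENTS[index][1]

def workers_for_range(first_block: int, last_block: int) -> list[str]:
    """List only the workers that must be contacted for a block range."""
    if first_block < STARTS[0] or first_block > last_block:
        raise ValueError("invalid block range")
    low = bisect.bisect_right(STARTS, first_block) - 1
    high = bisect.bisect_right(STARTS, last_block) - 1
    return [name for _, name in ASSIGNMENTS[low:high + 1]]

print(worker_for_block(1_500_000))
print(workers_for_range(900_000, 1_100_000))
```

A query touching a narrow block range contacts only the workers that hold it, so adding workers increases capacity without slowing lookups.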
Subsquid also supports real-time indexing, allowing blocks to be indexed before they are finalized. It also supports storing data in a format of the developer's choice, making analysis easier with tools and formats such as BigQuery, Parquet, or CSV. In addition, subgraphs can be deployed on the Subsquid network without migrating to the Squid SDK, enabling no-code deployment.
During its testnet phase, Subsquid recorded over 80,000 testnet users, over 60,000 Squid indexers deployed, and over 20,000 verified developers on the network. On June 3, Subsquid launched the mainnet of its data lake.
In addition to indexing, the Subsquid Network data lake can also replace RPC in use cases such as analytics, ZK/TEE coprocessors, AI agents, and Oracles.
SubQuery
SubQuery is a decentralized middleware infrastructure network that provides RPC and indexing data services. It initially supported Polkadot and Substrate networks and has now expanded to include more than 200 chains. It works similarly to The Graph, using Proof of Indexing: indexers index data and serve query requests, and delegators delegate their stake to indexers. However, instead of curators, it introduces consumers, who submit purchase orders to signal that an indexer's income is guaranteed.
It will introduce shard-enabled SubQuery data nodes so that each node does not have to constantly sync all new data, thereby optimizing query efficiency while moving toward greater decentralization. Users can choose to pay a computational fee of approximately 1 SQT token per 1,000 requests, or set a custom fee for indexers through the protocol.
Although SubQuery only launched its token earlier this year, the issuance rewards for nodes and delegators have increased month-on-month in USD value, reflecting the growing number of queries served on its platform. Since the TGE, the total amount of staked SQT has increased from 6 million to 125 million, highlighting the growth of its network participation.
Covalent
Covalent is a decentralized indexer network where Block Specimen Producer (BSP) network nodes create copies of blockchain data through batch exports and publish proofs on the Covalent L1 blockchain. This data is then refined by Block Result Producer (BRP) nodes according to set rules to extract the data that meets the requirements.
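The idea of publishing a proof for an exported batch can be sketched with a plain hash commitment: anyone holding the same batch can recompute the digest and compare it with the published value. This is a generic illustration under that assumption, not Covalent's actual proof format; the block fields and function names are hypothetical.

```python
import hashlib
import json

def specimen_proof(blocks: list[dict]) -> str:
    """Hash a canonical JSON encoding of an exported batch of blocks."""
    canonical = json.dumps(blocks, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_specimen(blocks: list[dict], published_proof: str) -> bool:
    """Check a received batch against the digest that was published for it."""
    return specimen_proof(blocks) == published_proof

# Hypothetical exported batch.
batch = [
    {"number": 100, "hash": "0xabc", "tx_count": 12},
    {"number": 101, "hash": "0xdef", "tx_count": 7},
]
proof = specimen_proof(batch)

tampered = [dict(block) for block in batch]
tampered[1]["tx_count"] = 8

print(verify_specimen(batch, proof))     # True: batch matches the proof
print(verify_specimen(tampered, proof))  # False: any change alters the digest
```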
Through a unified API, developers can easily extract relevant blockchain data in a consistent request and response format, without having to write complex custom queries to access data. Developers pay network operators for these pre-configured datasets using CQT tokens, with payments settled on Moonbeam.
Covalent's rewards seem to have an overall upward trend from Q1'23 to Q1'24, partly due to the increase in the price of Covalent's token CQT.
Considerations for Choosing an Indexer
Customizability of Data
Some indexers, such as Covalent, are general-purpose indexers that only provide standard, pre-configured datasets through an API. While they may be fast, they do not provide flexibility for developers who need custom datasets. Indexer frameworks, by contrast, allow more customized data processing to meet application-specific needs.
Security
The indexed data must be secure, otherwise the dApps built on these indexers are also vulnerable to attack. For example, if transactions and wallet balances can be manipulated, the dApp has the potential to lose liquidity, which in turn affects its users. While all indexer protocols rely on indexers staking tokens as a baseline form of security, some solutions also use proofs to further increase security.
Subsquid offers the option to use optimistic and zero-knowledge proofs, while Covalent also publishes proofs that include block hashes. The Graph provides a dispute period for indexer queries in the form of an optimistic challenge window, while SubQuery generates a Merkle Mountain Range proof for each block, computing a hash over all of that block's data stored in its database.
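For readers unfamiliar with Merkle Mountain Ranges, the sketch below shows the general structure: leaves are appended one at a time, equal-height subtrees are merged into peaks, and the peaks are hashed together into a single root. This is a generic, minimal construction for illustration only; it does not reproduce SubQuery's implementation, and the domain-separation prefixes and peak-bagging order are choices made here.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class MerkleMountainRange:
    """Append-only accumulator that keeps only the current peaks."""

    def __init__(self) -> None:
        self.peaks: list[tuple[int, bytes]] = []  # (height, hash), left to right
        self.leaf_count = 0

    def append(self, leaf: bytes) -> None:
        node = (0, _h(b"\x00" + leaf))  # prefix separates leaves from inner nodes
        # Merge while the rightmost peak has the same height as the new node.
        while self.peaks and self.peaks[-1][0] == node[0]:
            height, left = self.peaks.pop()
            node = (height + 1, _h(b"\x01" + left + node[1]))
        self.peaks.append(node)
        self.leaf_count += 1

    def root(self) -> bytes:
        """Bag the peaks from right to left into one commitment."""
        if not self.peaks:
            return _h(b"")
        accumulator = self.peaks[-1][1]
        for _, peak in reversed(self.peaks[:-1]):
            accumulator = _h(b"\x01" + peak + accumulator)
        return accumulator

mmr = MerkleMountainRange()
for record in [b"row-1", b"row-2", b"row-3"]:
    mmr.append(record)
print(len(mmr.peaks), mmr.root().hex())
```

Because appending never rewrites earlier nodes, the structure suits data that only grows, such as per-block records in an indexer's database.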
Speed and Scalability
As blockchains continue to grow, so do transaction volumes, which makes indexing large amounts of data more demanding in both processing power and storage space. Staying efficient becomes harder as networks grow, but indexer protocols are introducing solutions to meet these demands.
For example, Subsquid scales horizontally by adding more nodes to store data, and can scale further as hardware improves. The Graph provides parallel streaming of data to sync faster, while SubQuery introduces node sharding to speed up the sync process.
Supported Networks
While most blockchain activity still takes place within Ethereum, other blockchains are becoming more popular over time. For example, Layer 2s, Solana, Move-based blockchains, and Bitcoin ecosystem chains all have their own growing sets of developers and activity, and all of them require indexing services.
Supporting chains that other indexer protocols do not cover can help an indexer capture more market share and fees. Indexing data-intensive networks such as Solana is not an easy task, and so far, only Subsquid has successfully provided indexing support for them.
Conclusion
Despite the widespread adoption of indexers in dApp development, the potential of indexers is still huge, especially with the integration of AI. As AI continues to gain popularity in Web2 and Web3, its ability to improve depends on access to relevant data to train models and develop AI agents. Ensuring data integrity is critical for AI applications because it prevents models from being fed biased or inaccurate information.
In the area of indexer solutions, Subsquid has made significant progress in performance and user metrics. Users have begun experimenting with building AI agents with Subsquid, demonstrating the platform’s versatility and potential in the evolving data indexing space. Additionally, tools like AutoAgora help indexers use AI to provide dynamic pricing for query services on The Graph, while SubQuery supports multiple AI networks like OriginTrail and Oraichain for transparent data indexing.
The integration of AI with indexers is expected to enhance data accessibility and usability in the blockchain ecosystem. By leveraging AI technology, indexers can provide more efficient and accurate data retrieval, enabling developers to build more sophisticated dApps and analytical tools. As AI and indexers continue to advance together, we remain optimistic about the future of data indexing and its role in shaping the decentralized digital landscape.