10.13.2022|Brendan Malone
Crypto policy work rarely involves real, granular data. It seems like there are (at least) three broad reasons for that:
This is a missed opportunity, given how many tools used by the modern economics and finance community should directly translate to crypto data analysis. Crypto offers granular data, available to anyone, by design. So why do a substantial number of policy papers still rely on pre-aggregated time-series data from external sources like CoinMarketCap instead of going directly to the source?
For example, with relatively little effort, it is possible to look at stablecoin issuance across the entire Ethereum ecosystem. This is roughly analogous to being able to query the balance sheets of every major bank in the U.S. and observe changes in consumer deposits on a second-by-second basis — but most policy papers that analyze stablecoins instead choose an analytical approach that talks about hypothesized events, like flights to safety, in the abstract.
In this blog post, I’ll demonstrate a few things that I hope will be helpful to policy researchers looking to work with on-chain data:
In subsequent posts, I’ll explore how to use the data collected here to draw inferences on crypto markets. I plan on eventually posting the data and code for free use, as well. By shedding light on how to “check the chain,” I hope to show how the crypto’s transparent nature enables a novel approach to data-informed policymaking.
If you work at a regulatory agency or research institution and are having difficulty accessing crypto data to work with, please don’t hesitate to reach out and share your thoughts about what Paradigm can do to help.
Short answer: it depends.
For the sake of bounding this particular project, I chose to focus my data gathering efforts on one blockchain (Ethereum) and a subset of specific projects: major USD-denominated, fiat-backed stablecoins. Specifically, USDC, Tether, Binance USD, Pax Dollar, and Gemini Dollar. The general approach I describe here should be broadly applicable to on-chain data, even if you’re looking to create a different set of data.
Block explorers like Etherscan are invaluable for viewing transaction snapshots and collecting information on specific smart contracts, but they aren’t particularly useful for generating large datasets in my experience. For collecting and working with raw data, you have essentially two options: (1) running a full node locally, or (2) querying a database that already has raw data that was written directly from the chain. Option (1) requires a considerable amount of technical skill and computational resources. You can get pretty far with Option (2) with very basic SQL and Python skills, so that’s the approach taken here.
Databases that already have on-chain data
Dune and Google Cloud Platform’s (GCP) BigQuery have up-to-date on-chain data structured in database tables that are queryable using SQL commands. Dune has a free option that is slower and somewhat limited, but it’s perfect for A/B testing queries and familiarizing yourself with the database schema, especially if you aren’t very experienced with using SQL to query relational databases. BigQuery is more flexible and faster, but Google charges for compute resources, so it can get expensive quickly. When I was first tinkering around the data, I would test queries in Dune before running in “production” in GCP to save money. For the most part, it worked well? (The one additional thing worth flagging is that Dune has at least 100x the number of crypto tables as GCP, including some pre-cleaned, user generated ones that are very valuable. By comparison, the data in GCP are mostly raw blocks/transactions. Dune also has some very handy built-in data viz tools that are probably worth the price of admission alone.)
To answer this question, you first need to know what you’re trying to accomplish with the data you’re seeking. For this test case, I decided I wanted to build a large data set of time-series data for major fiat-backed stablecoins looking at a few specific actions: mints (i.e., issuing stablecoins), burns (i.e., taking stablecoins out of circulation), and transfers. I chose to scope the exercise this way because it seems like policymakers and academics are most interested in fiat-backed stablecoins at the moment, so that data could be quite useful in the short term.
The major USD-denominated stablecoins are implemented using the ERC-20 Token Standard. As the name suggests, ERC-20 is a standardized way for creating tokens on Ethereum using a smart contract. If you conceptualize a blockchain as a giant decentralized Excel spreadsheet, smart contracts are similar to the Excel functions that let you manipulate the data in the spreadsheet’s cells. Feed the function an input, known as an argument, and it will produce a specific output using its built-in logic (e.g., MAX(number 1, [number 2], …) finds the largest number among the numbers included in the function’s input arguments).
The relevant contracts can be located using their Ethereum addresses, which are a unique identifier in the blockchain’s data structure:
Smart contracts are programs that can be used repeatedly, similar to an API. Each time a smart contract is interacted with (e.g., asked to do something programmatically), a record of that interaction is produced and logged in the blockchain by the Ethereum protocol. These logs form a powerful source of information describing a smart contract’s activities. (For a longer explanation of Ethereum event logs, look here.)
When a smart contract performs a specific function, such as burning ERC-20 stablecoin tokens to remove them from circulation, that function — and the arguments it is fed — are recorded on the blockchain as a transaction log receipt.
Here is a transaction where Circle, the issuer of the USDC stablecoin, burned (i.e., removed from circulation) $1,056.92 worth of USDC.
If you toggle over to the “Logs” tab, you can see the transaction event logs. The relevant fields are:
The Etherscan output also shows Topics and Data fields. These contain the bulk of the relevant information we need to parse to analyze transactions.1
Now that we understand the basics of how on-chain data is structured for our use case, we can start pulling the data from Dune and/or GCP.
As noted previously, for this example I’ve chosen to pull on-chain data from existing databases as opposed to accessing an active node on the Ethereum network. To keep things simple, I’m extracting (mostly) raw tables from GCP using SQL, and then cleaning the data in Python using the pandas library.
Extracting tables from GCP using BigQuery2
BigQuery has a lot of Ethereum tables, as seen in the left-hand pane of the image below. Clicking on a table brings up the database schema, which we can see here for the ethereum.logs
table. Address, data, and topics map to the log data I described above from Etherscan.
The following query will extract all of the records in the logs table that involve an interaction with the USDC, Tether USD, Binance USD, Pax Dollar, or Gemini Dollar contracts. Some additional information is useful beyond what’s available in ethereum.logs
, so I’ve also merged in data from the ethereum.blocks
table to include things like gas (a measure of a network fee paid to Ethereum miners for validating transactions).3
SELECT
eth_logs.\*,
eth_blocks.number, eth_blocks.miner, eth_blocks.size, eth_blocks.gas_limit, eth_blocks.gas_used, eth_blocks.base_fee_per_gas
FROM
`bigquery-public-data.crypto_`ethereum.logs`` AS eth_logs
LEFT JOIN
`bigquery-public-data.crypto_ethereum.blocks` AS eth_blocks
ON
eth_logs.block_number = eth_blocks.number
WHERE(
eth_logs.address='0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48'-- USDC
OR eth_logs.address ='0xdac17f958d2ee523a2206206994597c13d831ec7' -- USDT
OR eth_logs.address ='0x4fabb145d64652a948d72533023f6e7a623c7c53' -- Binance USD
OR eth_logs.address = '0x8e870d67f660d95d5be530380d0ec0bd388289e1' -- Pax Dollar
OR eth_logs.address = '0x056fd409e1d7a124bd7017459dfea2f387b6d5cd' -- Gemini Dollar
)
The resulting table can be read directly into Python as a pandas dataframe with the following fields:
Most of these fields are ready-to-use in raw form. However, the topics field discussed in Section III requires some additional cleaning with Python to separate the field into multiple columns.4
This post leveraged Ethereum logs data, but the same approaches can be used to access a variety of data available on the chain. Python and SQL, tools that most economists and policymakers are familiar with, can go a long way. Crypto’s transparency, relative to traditional finance, offers a unique opportunity for researchers to use real-time data to shed light on how the financial system functions, including where risks may arise unchecked.
In the next post I’ll prep the data set for a targeted analysis looking at minting and burning of fiat-backed stablecoins. In post number three, I’ll display a few charts and tables as an example of the kind of research questions that people can examine using granular on-chain data.
A few additional notes/disclaimers:
Copyright © 2025 Paradigm Operations LP All rights reserved. “Paradigm” is a trademark, and the triangular mobius symbol is a registered trademark of Paradigm Operations LP