β˜€οΈ solarchive beta

solarchive is a project to archive Solana's public transaction data and make it freely accessible in ergonomic formats for developers, researchers, and the entire Solana community.

Today we are publishing datasets in Apache Parquet format covering all user transaction history (votes excluded), along with snapshots of token and account states, all licensed under CC-BY-4.0.

Our top priorities right now are publishing all historical data to date (December 2025) and reducing the delay from on-chain activity to usable datasets to under a week.

Historical Data Readiness

2020-01 through 2020-09: no data
2020-10 through 2020-11: βœ“ Transactions, βœ“ Accounts, βœ“ Tokens
2020-12 through 2025-10: βœ“ Accounts, βœ“ Tokens (Transactions pending backfill)
2025-11 through 2025-12: βœ“ Transactions, βœ“ Accounts, βœ“ Tokens

But processing and hosting hundreds of terabytes of data isn't free. Storing the 2025 data alone will cost us nearly $10,000 per year in storage fees! So we need your help to make this a reality. Your donation will help us cover infrastructure, data engineering, and more!

πŸ’œ Support solarchive

If you're an enterprise needing expert support, you can get in touch.

Explore the Data

Here's a sample of what's in the archive, queried live with DuckDB-WASM. The three queries below load data straight from the latest available date for each dataset, right here in your browser.

Latest Transactions (2025-12-04)

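-- Five most recent transactions with at least two balance changes;
-- long hashes are truncated for display.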
SELECT 
  block_slot,
  CONCAT(SUBSTRING(block_hash, 1, 16), '...') as block_hash,
  block_timestamp,
  CONCAT(SUBSTRING(recent_block_hash, 1, 16), '...') as recent_block_hash,
  CONCAT(SUBSTRING(signature, 1, 16), '...') as signature,
  index,
  fee / 1e9 as fee_sol,
  status,
  err,
  compute_units_consumed,
  accounts,
  log_messages,
  balance_changes,
  pre_token_balances,
  post_token_balances
FROM read_parquet('https://data.solarchive.org/txs/2025-12-04/000000000038.parquet')
WHERE len(balance_changes) >= 2
ORDER BY block_timestamp DESC
LIMIT 5

Tokens (2025-12)

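-- Keep the newest metadata row per token name (DISTINCT ON with
-- ORDER BY block_slot DESC), then show the five most recently updated tokens.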
SELECT 
  block_slot,
  CONCAT(SUBSTRING(block_hash, 1, 16), '...') as block_hash,
  block_timestamp,
  CONCAT(SUBSTRING(tx_signature, 1, 16), '...') as tx_signature,
  retrieval_timestamp,
  is_nft,
  CONCAT(SUBSTRING(mint, 1, 24), '...') as mint,
  CONCAT(SUBSTRING(update_authority, 1, 16), '...') as update_authority,
  name,
  symbol,
  uri,
  seller_fee_basis_points,
  creators,
  primary_sale_happened,
  is_mutable
FROM (
  SELECT DISTINCT ON (name) *
  FROM read_parquet('https://data.solarchive.org/tokens/2025-12/000000000000.parquet')
  WHERE name IS NOT NULL AND name != ''
  ORDER BY name, block_slot DESC
)
ORDER BY block_slot DESC
LIMIT 5

Accounts (2025-12)

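-- Keep the largest account (by lamports) per owner program,
-- then show the five largest overall.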
SELECT 
  block_slot,
  CONCAT(SUBSTRING(block_hash, 1, 16), '...') as block_hash,
  block_timestamp,
  CONCAT(SUBSTRING(pubkey, 1, 24), '...') as pubkey,
  CONCAT(SUBSTRING(tx_signature, 1, 16), '...') as tx_signature,
  retrieval_timestamp,
  executable,
  lamports / 1e9 as balance_sol,
  CONCAT(SUBSTRING(owner, 1, 24), '...') as owner,
  rent_epoch,
  program,
  space,
  account_type,
  is_native,
  CONCAT(SUBSTRING(mint, 1, 24), '...') as mint,
  state,
  token_amount,
  token_amount_decimals,
  program_data,
  authorized_voters,
  CONCAT(SUBSTRING(authorized_withdrawer, 1, 16), '...') as authorized_withdrawer,
  prior_voters,
  CONCAT(SUBSTRING(node_pubkey, 1, 16), '...') as node_pubkey,
  commission,
  epoch_credits,
  votes,
  root_slot,
  last_timestamp,
  data
FROM (
  SELECT DISTINCT ON (owner) *
  FROM read_parquet('https://data.solarchive.org/accounts/2025-12/000000000000.parquet')
  ORDER BY owner, lamports DESC
)
ORDER BY lamports DESC
LIMIT 5

Download Data

Datasets are archived as Parquet files, partitioned by day (transactions) or by month (token and account snapshots). Alongside each dataset and partition there is an index.json file that tells you which datasets are available, how many partitions a dataset contains, and the full list of published files in a partition, each with a checksum for verifying the integrity of your downloads.

You can download all this data for free with any HTTP client:

Query                              URL
Index of txs for Nov 1, 2025       https://data.solarchive.org/txs/2025-11-01/index.json
All txs for Nov 1, 2025            https://data.solarchive.org/txs/2025-11-01/*.parquet
Account snapshots for Feb 2023     https://data.solarchive.org/accounts/2023-02/*.parquet
Token snapshots for Sep 2024       https://data.solarchive.org/tokens/2024-09/*.parquet
Specific file                      https://data.solarchive.org/txs/2025-11-01/000000000014.parquet

Note: the *.parquet rows are glob patterns describing a partition's full contents. The server does not expand wildcards, so fetch the partition's index.json first to enumerate the concrete file URLs (see the FAQ below).

Each transactions file contains vote-filtered transactions in Parquet format. You can process the raw data directly: import it into DuckDB, pandas, Spark, or any other analytics tool.
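
For instance, here is a minimal sketch using the duckdb Python package, which can read these files straight over HTTPS (recent DuckDB builds load the httpfs extension automatically for https URLs). The status and fee columns are taken from the transactions schema shown in the queries above.

import duckdb  # pip install duckdb

# Aggregate one published transactions file directly over HTTPS.
# Swap the URL for any file listed in a partition's index.json.
url = "https://data.solarchive.org/txs/2025-11-01/000000000014.parquet"

print(duckdb.sql(f"""
    SELECT status,
           count(*)       AS tx_count,
           sum(fee) / 1e9 AS total_fees_sol
    FROM read_parquet('{url}')
    GROUP BY status
    ORDER BY tx_count DESC
"""))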

For programmatic access, use the index files to discover available data:
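
For example, here is a minimal Python sketch of that flow. The index.json layout isn't fully documented on this page, so the "files" and "url" field names below are assumptions; print the index you fetch and adjust accordingly.

import json
import os
import urllib.request

BASE = "https://data.solarchive.org"

def download_partition(dataset: str, partition: str, dest: str = ".") -> None:
    """Fetch a partition's index.json and download every Parquet file it lists."""
    with urllib.request.urlopen(f"{BASE}/{dataset}/{partition}/index.json") as resp:
        index = json.load(resp)
    # "files" and "url" are assumed field names -- inspect a real index.json
    # to confirm them before relying on this.
    for entry in index.get("files", []):
        url = entry["url"]
        path = os.path.join(dest, url.rsplit("/", 1)[-1])
        print(f"downloading {url} -> {path}")
        urllib.request.urlretrieve(url, path)

# All non-vote transactions for Nov 1, 2025:
download_partition("txs", "2025-11-01")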

For schema documentation, see the schema files:

https://data.solarchive.org/schemas/solana/transactions.json
https://data.solarchive.org/schemas/solana/accounts.json
https://data.solarchive.org/schemas/solana/tokens.json

πŸ—ΊοΈ Roadmap for 2026

Your support helps us ship planned features faster! πŸš€

Frequently Asked Questions

How does this compare to other Solana data providers?
solarchive.org is unique in providing free, direct access to raw Parquet files with no API keys, rate limits, or SaaS subscriptions. You download the data once and analyze it however you want, using your own tools. This is ideal for researchers, data scientists, and developers who need full control and offline access to historical data.
What data is available in the archive?
The archive contains three main datasets: (1) Transactions - all non-vote transactions with full details including signatures, fees, account changes, and token balances, (2) Accounts - periodic snapshots of account states including balances, owners, and program data, and (3) Tokens - metadata snapshots for fungible and non-fungible tokens including creators, URIs, and attributes. All data is vote-filtered to focus on user-facing activity.
How do I download the data?
First, fetch the partition's index.json file to see what files are available. For example, download https://data.solarchive.org/txs/2025-11-01/index.json - this lists all parquet files for that day with their URLs. Then download each file URL from the index. You cannot use wildcards like *.parquet directly - you must read the index.json first to get the actual file URLs.
What format is the data in?
All data is stored in Apache Parquet format, a columnar storage format that's highly compressed and efficient for analytics. Parquet files can be read by virtually any data tool including DuckDB, pandas, Spark, Snowflake, BigQuery, and more. No special software is required - just standard Parquet readers.
How often is the data updated?
We update datasets daily, but the data currently lags behind on-chain activity by about 2 weeks. We're actively working to reduce this lag to under 1 week. In addition to daily updates, we're also backfilling historical data from 2020-2025.
Is the data free to use?
Yes, all data is freely available for download and use. There are no usage limits, API keys, or registration requirements. However, hosting and processing costs are significant (nearly $10,000/year for 2025 alone), so donations are greatly appreciated to keep this service running and growing.
What's the total size of the dataset?
The complete raw archive is over 700TB and growing. November 2025 alone is over 4TB. Sizes vary significantly by date and dataset type. Always check the index.json files for exact sizes before downloading.
Can I query the data in my browser?
Yes! The "Explore the Data" section above demonstrates DuckDB-WASM running SQL queries directly in your browser. You can use this same approach in your own applications. However, for large-scale analysis, we recommend downloading the data and processing it locally with DuckDB, Spark, or your preferred analytics tool.
How do I analyze the data?
The data is in Parquet format, so you can use any tool that reads Parquet files - DuckDB, pandas, Spark, Polars, etc. Most tools can read directly from HTTPS URLs or from downloaded files. For large-scale analysis, we recommend downloading files locally first for better performance. See the "Explore the Data" section above for live examples using DuckDB-WASM in the browser.
What's the data schema?
Each dataset has a published JSON schema documenting all fields and types. See: https://data.solarchive.org/schemas/solana/transactions.json, https://data.solarchive.org/schemas/solana/accounts.json, and https://data.solarchive.org/schemas/solana/tokens.json. The schemas include field descriptions, data types, and examples.
How accurate is the data?
Currently, the data is sourced from the Solana Foundation's BigQuery exports, which are a fantastic authoritative resource. However, BigQuery can be prohibitively expensive, so we query it carefully to keep costs down. We are also working on sourcing data directly from archival RPC nodes. All data is preserved as-is from the source - we filter out vote transactions to reduce noise, but everything else is verbatim from the blockchain.
How can I verify data integrity?
Each partition's index.json file includes file metadata like size and last modified timestamp. You can verify file integrity by comparing downloaded file sizes against the index. For critical applications, you can cross-reference specific transactions against Solana RPC nodes using the signature field.
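A minimal Python sketch of that size check follows, with the same caveat as in the Download Data section: "files", "url", and "size" are assumed index.json field names, so confirm them against a real index first.

import json
import os
import urllib.request

partition = "https://data.solarchive.org/txs/2025-11-01"

with urllib.request.urlopen(f"{partition}/index.json") as resp:
    index = json.load(resp)

# Compare each already-downloaded file's size against its index entry.
for entry in index.get("files", []):
    name = entry["url"].rsplit("/", 1)[-1]
    if not os.path.exists(name):
        continue
    expected, actual = entry["size"], os.path.getsize(name)
    verdict = "OK" if actual == expected else f"MISMATCH ({actual} != {expected})"
    print(f"{name}: {verdict}")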
Are vote transactions included?
No, vote transactions are filtered out to reduce dataset size and focus on user-facing activity. Vote transactions constitute the majority of Solana's transaction volume but are typically not relevant for application developers or researchers analyzing user behavior and token activity.
Can I redistribute or resell this data?
Yes! This data distribution is released under CC-BY-4.0. You can use it for any purpose including commercial, research, or educational use. You can redistribute, modify, and build services on top of it. The only requirement is attribution - you must credit solarchive.org. We also appreciate (but don't require) financial support if you benefit commercially.
What license is the data under?
This data distribution is licensed under CC-BY-4.0 (Creative Commons Attribution 4.0 International). Note that we're licensing this particular curated distribution (the processed Parquet files, schemas, and organization) - the underlying Solana blockchain data itself is public information. CC-BY-4.0 means you're free to use, share, and adapt the data for any purpose, including commercially, as long as you provide attribution to solarchive.org.
How can I support this project?
You can send SOL to solarchive.sol or use the Solana Pay buttons throughout the site. Any amount helps cover storage, bandwidth, and processing costs. For enterprises needing dedicated support, custom data processing, or higher bandwidth access, contact leandro@abstractmachines.dev to discuss premium support options.
Will there be a HuggingFace dataset?
Yes! Publishing a HuggingFace dataset is on the 2026 roadmap. This will allow you to use the standard datasets library to load Solana data directly: from datasets import load_dataset; ds = load_dataset('solarchive/solana-txs'). This will make it much easier to use the data in ML/AI workflows.
How do I know when new data is published?
Subscribe to our RSS feed to get notified of new dataset partitions. You can also check the index.json files periodically to see new partitions. The RSS feed lists the latest 50 partitions across all datasets (transactions, accounts, tokens) and is updated at build time.