This guide teaches you how to load Solana blockchain data directly from HuggingFace using the
datasets library. You'll learn how to query with DuckDB, convert to pandas,
and analyze blockchain data without managing downloads yourself.
SolArchive is now available on HuggingFace at solarchive/solarchive. This makes it incredibly easy to load Solana blockchain data directly into Python without managing downloads or storage yourself.
The dataset contains three types of data, each organized into date partitions:

- txs/2025-11-01 (transactions, one partition per day)
- tokens/2025-11 (token metadata, one partition per month)
- accounts/2025-11 (account snapshots, one partition per month)
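Transaction partitions are keyed by day, while token and account partitions are keyed by month. As a quick reference, here is a small sketch that maps a date to the data_dir values used throughout this guide (partition_dirs is a hypothetical helper, not part of the dataset or any library):

from datetime import date

# Hypothetical helper: build the data_dir strings for each data type.
# Transactions are partitioned by day; tokens and accounts by month.
def partition_dirs(day: date) -> dict:
    return {
        "txs": f"txs/{day:%Y-%m-%d}",
        "tokens": f"tokens/{day:%Y-%m}",
        "accounts": f"accounts/{day:%Y-%m}",
    }

print(partition_dirs(date(2025, 11, 1)))
# {'txs': 'txs/2025-11-01', 'tokens': 'tokens/2025-11', 'accounts': 'accounts/2025-11'}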
You'll need Python 3.8+ and the following packages:

pip install datasets duckdb pandas

The simplest way to load transaction data is using the HuggingFace datasets library:
from datasets import load_dataset
# Load a specific transaction partition (one day)
ds = load_dataset(
"solarchive/solarchive",
data_dir="txs/2025-11-01",
split="train"
)
print(f"Loaded {len(ds)} transactions")
print(f"Columns: {ds.column_names}")
# View first transaction
print(ds[0])

Loading token metadata or account snapshots works the same way:
# Load token metadata for a month
tokens = load_dataset(
"solarchive/solarchive",
data_dir="tokens/2021-04",
split="train"
)
# Load account snapshots for a month
accounts = load_dataset(
"solarchive/solarchive",
data_dir="accounts/2021-04",
split="train"
)
print(f"Loaded {len(tokens)} token records")
print(f"Loaded {len(accounts)} account snapshots")

Once loaded, you can query the data efficiently with DuckDB:
import duckdb
# Load transactions
ds = load_dataset(
"solarchive/solarchive",
data_dir="txs/2025-11-01",
split="train"
)
# Convert to DuckDB relation for SQL queries
con = duckdb.connect()
rel = con.from_arrow(ds.data.table)
# Find all failed transactions
failed = rel.filter("status = 'Failed'").to_df()
print(f"Found {len(failed)} failed transactions")
# Calculate average fee
avg_fee = rel.aggregate("avg(fee / 1e9) as avg_fee_sol").fetchone()
print(f"Average fee: {avg_fee[0]:.6f} SOL")
# Find transactions with high compute usage
high_compute = rel.filter("compute_units_consumed > 1000000").to_df()
print(f"High compute transactions: {len(high_compute)}")

You can easily convert to pandas for further analysis:
import pandas as pd
# Load and convert to pandas DataFrame
ds = load_dataset(
"solarchive/solarchive",
data_dir="txs/2025-11-01",
split="train"
)
df = ds.to_pandas()
# Now use familiar pandas operations
print(df.head())
print(df.describe())
# Filter and analyze
successful = df[df['status'] == 'Success']
print(f"Success rate: {len(successful) / len(df) * 100:.2f}%")

To analyze data across multiple days or months, you can load and concatenate partitions:
from datasets import load_dataset, concatenate_datasets
# Load multiple days of transactions
partitions = ['2025-11-01', '2025-11-02', '2025-11-03']
daily_datasets = [
    load_dataset(
        "solarchive/solarchive",
        data_dir=f"txs/{partition}",
        split="train"
    )
    for partition in partitions
]
# Combine into single dataset
combined = concatenate_datasets(daily_datasets)
print(f"Total transactions: {len(combined)}")
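If you only need a few columns, you can also drop the rest before concatenating to keep memory usage down. This sketch assumes your installed datasets version provides Dataset.select_columns, and uses the signature, fee, and status columns that appear in the examples above:

# Keep only the columns needed for the analysis before combining partitions
# (assumes Dataset.select_columns is available in your datasets version)
slim = [
    load_dataset(
        "solarchive/solarchive",
        data_dir=f"txs/{partition}",
        split="train"
    ).select_columns(["signature", "fee", "status"])
    for partition in partitions
]
combined_slim = concatenate_datasets(slim)
print(f"Combined (3 columns only): {len(combined_slim)} transactions")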
You can discover available partitions programmatically:

from huggingface_hub import HfFileSystem
# List all available transaction partitions
fs = HfFileSystem()
files = fs.ls("datasets/solarchive/solarchive/txs", detail=False)
# Extract the partition names (the last path component, e.g. "2025-11-01")
tx_partitions = sorted(f.rstrip("/").split("/")[-1] for f in files)
print(f"Available transaction partitions: {len(tx_partitions)}")
print(f"First 5: {tx_partitions[:5]}")
print(f"Last 5: {tx_partitions[-5:]}")
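Discovery pairs naturally with loading. For example, building on the tx_partitions list above, you can always load the most recent day (a small sketch):

# Load the most recent transaction partition discovered above
latest = tx_partitions[-1]
latest_ds = load_dataset(
    "solarchive/solarchive",
    data_dir=f"txs/{latest}",
    split="train"
)
print(f"Latest partition {latest}: {len(latest_ds)} transactions")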
For very large partitions, use streaming to avoid loading everything into memory:

# Stream instead of downloading all at once
ds = load_dataset(
"solarchive/solarchive",
data_dir="txs/2025-11-01",
split="train",
streaming=True
)
# Process in batches
for i, batch in enumerate(ds.iter(batch_size=1000)):
    print(f"Processing batch {i}: {len(batch['signature'])} transactions")
    # Your processing logic here
    if i >= 10:  # Process only first 10 batches
        break
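Streaming also works for simple aggregations that never need the full partition in memory. As a sketch, here's a running fee total computed batch by batch, using the fee column (in lamports) shown in the DuckDB example above:

# A fresh streaming pass over the same partition
ds = load_dataset(
    "solarchive/solarchive",
    data_dir="txs/2025-11-01",
    split="train",
    streaming=True
)
total_fee_lamports = 0
tx_count = 0
for batch in ds.iter(batch_size=1000):
    total_fee_lamports += sum(batch["fee"])
    tx_count += len(batch["fee"])
print(f"Transactions: {tx_count}")
print(f"Average fee: {total_fee_lamports / tx_count / 1e9:.6f} SOL")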
Here's a complete example analyzing new token launches in a specific month:

import duckdb
from datasets import load_dataset
# Load token data
tokens = load_dataset(
"solarchive/solarchive",
data_dir="tokens/2021-04",
split="train"
)
# Query with DuckDB
con = duckdb.connect()
rel = con.from_arrow(tokens.data.table)
# Find unique tokens with names
result = rel.query(
    "tokens_rel",
    """
    SELECT DISTINCT ON (mint)
        mint,
        name,
        symbol,
        block_timestamp,
        is_nft
    FROM tokens_rel
    WHERE name IS NOT NULL AND name != ''
    ORDER BY mint, block_timestamp ASC
    """
).to_df()
print(f"Unique tokens in April 2021: {len(result)}")
print(f"NFTs: {result['is_nft'].sum()}")
print(f"Fungible tokens: {(~result['is_nft']).sum()}")
# Most common symbols
print("\nTop 10 symbols:")
print(result['symbol'].value_counts().head(10))
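If you want to keep the results around, the summary DataFrame can be written out with standard pandas I/O, for example to a Parquet file (the filename here is just an example; pyarrow is already installed as a dependency of datasets):

# Save the token summary for later use (example filename)
result.to_parquet("tokens_2021_04.parquet", index=False)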