β˜€οΈ solarchive beta

Home Learn Community Support RSS

How to Download SolArchive Data

● Beginner ⏱ 15 min πŸ“¦ Example: 270 GB

This guide teaches you how SolArchive organizes its blockchain data and how to build a download script. You'll learn about data partitioning, concurrent downloads, and best practices for working with large datasets.

Table of Contents

  1. Understanding Data Organization
  2. Setting Up Your Environment
  3. Building the Download Script
  4. Running the Downloader
  5. What's Next?
  6. Key Takeaways

Understanding Data Organization

Before we write code, let's understand how SolArchive organizes its data. This knowledge will help you build an efficient downloader and query data effectively.

Three Datasets Available

SolArchive provides three main datasets:

Dataset     Contains             Partition Scheme      Typical Size
txs         Transaction data     Daily (YYYY-MM-DD)    ~100-150 GB/day
tokens      Token metadata       Monthly (YYYY-MM)     ~100-200 MB/month
accounts    Account snapshots    Monthly (YYYY-MM)     ~50-100 GB/month
Important: Notice that transactions are partitioned daily, while tokens and accounts are partitioned monthly. Your download script needs to handle both partitioning schemes!
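
To make the difference concrete, here's how both partition keys can be derived from the same calendar date (a minimal sketch using only the standard library; the key formats match the URL examples below):

from datetime import date

d = date(2025, 11, 30)
print(d.isoformat())        # "2025-11-30" -> daily key, used by txs
print(d.strftime("%Y-%m"))  # "2025-11"    -> monthly key, used by tokens and accounts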

How Partitions Are Organized

Each partition contains:

  - index.json - a manifest listing every file in the partition, along with its name, download URL, size in bytes, and last-modified timestamp
  - One or more sequentially numbered parquet files (000000000.parquet, 000000001.parquet, ...)

Example URL structure:

https://data.solarchive.org/txs/2025-11-30/index.json
https://data.solarchive.org/txs/2025-11-30/000000000.parquet
https://data.solarchive.org/txs/2025-11-30/000000001.parquet
...

https://data.solarchive.org/tokens/2025-11/index.json
https://data.solarchive.org/tokens/2025-11/000000000.parquet
...
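
The exact index.json schema is an assumption here, inferred from the fields our downloader reads later in this guide. A quick way to inspect a real manifest yourself (using httpx, which we install below):

import httpx

# Fetch one partition's manifest. The commented shape below is an assumption
# based on the fields the downloader reads (name, url, size_bytes,
# last_modified); check a real index.json to confirm.
index = httpx.get("https://data.solarchive.org/txs/2025-11-30/index.json").json()
# Assumed shape:
# {"files": [{"name": "000000000.parquet",
#             "url": "https://data.solarchive.org/txs/2025-11-30/000000000.parquet",
#             "size_bytes": 450200000,
#             "last_modified": "..."}, ...]}
print(f"{len(index['files'])} files in this partition")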

Download Strategy

To download data for a date range, we'll:

  1. Generate a list of dates we want (e.g., Nov 30 - Dec 1), plus the unique months they fall in
  2. For each partition: fetch its index.json, then download every listed parquet file concurrently
  3. Handle caching - skip files that already exist with the correct size

Setting Up Your Environment

We'll build this downloader using modern Python with asyncio for efficient concurrent downloads and rich for beautiful progress bars.

Prerequisites

  - A recent Python 3 interpreter
  - The uv package manager, which we'll use to set up the project and run the script

Project Setup

Create a new project from scratch:

# Create a new project directory
mkdir solarchive-downloader
cd solarchive-downloader

# Initialize a new Python project with uv
uv init

# Add dependencies
uv add httpx rich

This adds two key dependencies:

  - httpx - a modern HTTP client with first-class async support, used for the concurrent downloads
  - rich - a terminal rendering library that powers the live progress bars

Note: We'll also use asyncio, dataclasses, and pathlib - but these are part of Python's standard library, so they don't need to be installed separately!

Building the Download Script

Let's build download_data.py step by step. We'll use modern Python features like dataclasses and async/await.

Step 1: Define Data Structures

First, create dataclasses to represent the API responses:

from dataclasses import dataclass
from pathlib import Path

@dataclass
class FileInfo:
    """Metadata for a single parquet file."""
    name: str
    url: str
    size_bytes: int
    last_modified: str

@dataclass
class DownloadResult:
    """Result of downloading a single file."""
    name: str
    size_mb: float
    was_cached: bool

Using dataclasses gives us type safety and named fields instead of tuples or dicts.
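
For example, each entry in a partition's index can be unpacked straight into a FileInfo (the values here are hypothetical placeholders):

# Hypothetical index entry; the keys match the FileInfo fields above.
entry = {
    "name": "000000000.parquet",
    "url": "https://data.solarchive.org/txs/2025-11-30/000000000.parquet",
    "size_bytes": 450_200_000,
    "last_modified": "2025-12-01T00:00:00Z",
}
info = FileInfo(**entry)
print(info.name, info.size_bytes)  # attribute access instead of dict lookups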

Step 2: Download Single File with Progress Bar

This function downloads a file and shows a live progress bar. We use rich to create a beautiful progress display that shows the filename, progress bar, download speed, and estimated time remaining:

import httpx
import asyncio
from rich.progress import Progress

async def download_file(
    client: httpx.AsyncClient, 
    file: FileInfo, 
    output_dir: Path, 
    semaphore: asyncio.Semaphore,
    progress: Progress
) -> DownloadResult:
    """Download a single parquet file with progress bar."""
    output_path = output_dir / file.name
    
    # Check if file exists with correct size
    if output_path.exists():
        if output_path.stat().st_size == file.size_bytes:
            return DownloadResult(
                name=file.name,
                size_mb=file.size_bytes / 1e6,
                was_cached=True
            )
    
    # Download with semaphore to limit concurrency
    async with semaphore:
        # Create individual progress bar for this file
        task_id = progress.add_task(f"[cyan]{file.name}", total=file.size_bytes)
        
        try:
            # Stream download to show real-time progress
            async with client.stream("GET", file.url) as response:
                response.raise_for_status()
                chunks = []
                async for chunk in response.aiter_bytes():
                    chunks.append(chunk)
                    progress.update(task_id, advance=len(chunk))
                
                content = b"".join(chunks)
            
            output_path.write_bytes(content)
            progress.update(task_id, description=f"[green]βœ“ {file.name}")
            
            return DownloadResult(
                name=file.name,
                size_mb=len(content) / 1e6,
                was_cached=False
            )
        finally:
            progress.remove_task(task_id)

Key features: streams the download to update progress in real-time, uses a semaphore to limit concurrent downloads, and automatically removes the progress bar when done.
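
If you'd like to smoke-test download_file on its own before wiring up the rest, a minimal driver appended to the file from Steps 1-2 might look like this (the FileInfo values are hypothetical placeholders; in the next step, download_partition fills them in from the real index):

# Minimal standalone driver for download_file; the metadata values are made up.
# Assumes the imports and definitions from Steps 1 and 2 are in scope.
async def try_one() -> None:
    file = FileInfo(
        name="000000000.parquet",
        url="https://data.solarchive.org/txs/2025-11-30/000000000.parquet",
        size_bytes=450_200_000,
        last_modified="2025-12-01T00:00:00Z",
    )
    out = Path("data/txs/2025-11-30")
    out.mkdir(parents=True, exist_ok=True)
    with Progress() as progress:
        async with httpx.AsyncClient(timeout=60.0) as client:
            result = await download_file(
                client, file, out, asyncio.Semaphore(1), progress
            )
    print(f"{result.name}: {result.size_mb:.1f} MB (cached: {result.was_cached})")

asyncio.run(try_one())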

Step 3: Download All Files for a Partition

Download all parquet files for a single partition. The progress display will show up to 10 concurrent downloads at once, each with its own progress bar:

import asyncio
from rich.progress import (
    Progress, TextColumn, BarColumn, 
    DownloadColumn, TransferSpeedColumn, TimeRemainingColumn
)

async def download_partition(
    partition_date: str, dataset: str, output_dir: Path, client: httpx.AsyncClient
):
    """Download all files for a given date and dataset (txs, tokens, or accounts)."""
    date_dir = output_dir / dataset / partition_date
    date_dir.mkdir(parents=True, exist_ok=True)
    
    # Fetch partition index
    index_url = f"https://data.solarchive.org/{dataset}/{partition_date}/index.json"
    print(f"  πŸ” Fetching index: {index_url}")
    response = await client.get(index_url)
    response.raise_for_status()
    index = response.json()
    
    print(f"  πŸ“… {dataset}/{partition_date}: {len(index['files'])} files")
    
    # Create progress display showing multiple files at once
    with Progress(
        TextColumn("[progress.description]{task.description}"),
        BarColumn(),
        DownloadColumn(),
        TransferSpeedColumn(),
        TimeRemainingColumn(),
    ) as progress:
        # Download with limited concurrency (10 at a time)
        semaphore = asyncio.Semaphore(10)
        files = [FileInfo(**f) for f in index["files"]]
        tasks = [download_file(client, f, date_dir, semaphore, progress) for f in files]
        results = await asyncio.gather(*tasks)
    
    # Report results
    cached = sum(1 for r in results if r.was_cached)
    new = len(results) - cached
    print(f"  βœ“ {new} downloaded, {cached} cached")
    
    return results

You'll see multiple progress bars stacked vertically, each showing a different file being downloaded. As files complete, their bars are removed and new ones appear for the next files in the queue.

Step 4: Download Date Range (Full Script)

Now we bring it all together - download data for a date range. Remember: transactions are daily partitions, but tokens and accounts are monthly:

from datetime import date, timedelta

async def download_date_range(start_date: str, end_date: str):
    """Download transactions and tokens for a date range.
    
    Note: txs are partitioned by day, tokens by month, accounts by month.
    """
    # Generate list of dates
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    dates = []
    current = start
    while current <= end:
        dates.append(current.isoformat())
        current += timedelta(days=1)
    
    # Generate unique months for tokens and accounts (partitioned monthly)
    months = set()
    current = start
    while current <= end:
        months.add(current.strftime("%Y-%m"))
        current += timedelta(days=1)
    months = sorted(months)
    
    print(f"Downloading data from {start_date} to {end_date} ({len(dates)} days)")
    print("Output directory: data/\n")
    
    # Download transactions (daily), tokens (monthly), and accounts (monthly)
    # Set reasonable limits for concurrent downloads
    limits = httpx.Limits(max_keepalive_connections=20, max_connections=50)
    timeout = httpx.Timeout(60.0, connect=10.0, pool=10.0)
    
    async with httpx.AsyncClient(timeout=timeout, limits=limits) as client:
        all_results = []
        
        print("TXS Dataset:")
        for partition_date in dates:
            all_results += await download_partition(partition_date, "txs", Path("data"), client)
        
        print("\nTOKENS Dataset:")
        for month in months:
            all_results += await download_partition(month, "tokens", Path("data"), client)
        
        # Uncomment to download accounts too:
        # print("\nACCOUNTS Dataset:")
        # for month in months:
        #     all_results += await download_partition(month, "accounts", Path("data"), client)
    
    # Final summary (size_mb is in megabytes, so divide by 1000 for GB)
    total_gb = sum(r.size_mb for r in all_results) / 1000
    print(f"\nβœ… Downloaded {len(all_results)} files across {len(dates)} days ({total_gb:.2f} GB total)")

if __name__ == "__main__":
    START_DATE = "2025-11-30"
    END_DATE = "2025-12-01"
    asyncio.run(download_date_range(START_DATE, END_DATE))

This handles the different partitioning schemes automatically - downloading transactions for each day and tokens/accounts for each unique month in your date range.

Concurrency Control: The script uses asyncio.Semaphore(10) to limit downloads to 10 files at a time. This prevents overwhelming your network while still being fast. Each partition (typically 200-300 files) takes 5-15 minutes depending on your connection speed. You can adjust the semaphore value if you have a faster connection.
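
If the semaphore pattern is new to you, this self-contained snippet (independent of the downloader) shows exactly what it does - nine tasks are scheduled at once, but only three ever run inside the guarded block at the same time:

import asyncio
import time

async def worker(i: int, sem: asyncio.Semaphore) -> None:
    async with sem:
        # At most 3 workers are inside this block at any moment
        print(f"{time.strftime('%X')} worker {i} started")
        await asyncio.sleep(1)

async def main() -> None:
    sem = asyncio.Semaphore(3)
    await asyncio.gather(*(worker(i, sem) for i in range(9)))

asyncio.run(main())

You'll see the workers start in waves of three, one second apart - the same throttling the downloader applies to its file transfers.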

Running the Downloader

Execute the Script

uv run download_data.py

You'll see beautiful progress bars for each file being downloaded:

Downloading data from 2025-11-30 to 2025-12-01 (2 days)
Output directory: data/

TXS Dataset:
  πŸ” Fetching index: https://data.solarchive.org/txs/2025-11-30/index.json
  πŸ“… txs/2025-11-30: 284 files
  
000000000.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 450.2/450.2 MB 125.3 MB/s 0:00:00
000000001.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 320.5/512.1 MB  98.2 MB/s 0:00:02
000000002.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 112.3/487.9 MB 110.5 MB/s 0:00:03
... (showing up to 10 concurrent downloads)

  βœ“ 284 downloaded, 0 cached

TOKENS Dataset:
  πŸ” Fetching index: https://data.solarchive.org/tokens/2025-11/index.json
  πŸ“… tokens/2025-11: 8 files
  
000000000.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.5/18.5 MB 45.2 MB/s 0:00:00
000000001.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.3/20.1 MB 38.7 MB/s 0:00:01
... (showing concurrent downloads)

  βœ“ 8 downloaded, 0 cached

βœ… Downloaded 605 files across 2 days (271.07 GB total)

Understanding the Output

Notice a few things:

  - Up to 10 files download simultaneously, each with its own progress bar; finished bars disappear so new ones can take their place.
  - Each partition ends with a summary line showing how many files were freshly downloaded versus served from the local cache.
  - Token partitions are tiny (tens of megabytes) compared to transaction partitions (hundreds of gigabytes), so they finish almost instantly.

Resumable Downloads: If you interrupt and re-run, the script will skip files already downloaded. This makes it safe to pause and resume large downloads!

What's Next?

Now that you have the data downloaded, check out our other guides.

Key Takeaways

You now have a working downloader that handles SolArchive's two partitioning schemes, caches completed files, and limits concurrency. The same script can be adapted to any date range or dataset combination.