This guide teaches you how SolArchive organizes its blockchain data and how to build a download script. You'll learn about data partitioning, concurrent downloads, and best practices for working with large datasets.
Before we write code, let's understand how SolArchive organizes its data. This knowledge will help you build an efficient downloader and query data effectively.
SolArchive provides three main datasets:
| Dataset | Contains | Partition Scheme | Typical Size |
|---|---|---|---|
| txs | Transaction data | Daily (YYYY-MM-DD) | ~100-150 GB/day |
| tokens | Token metadata | Monthly (YYYY-MM) | ~100-200 MB/month |
| accounts | Account snapshots | Monthly (YYYY-MM) | ~50-100 GB/month |
Each partition contains:

- An `index.json` manifest listing every file in the partition
- Numbered parquet files: `000000000.parquet`, `000000001.parquet`, etc.

Example URL structure:
```
https://data.solarchive.org/txs/2025-11-30/index.json
https://data.solarchive.org/txs/2025-11-30/000000000.parquet
https://data.solarchive.org/txs/2025-11-30/000000001.parquet
...
https://data.solarchive.org/tokens/2025-11/index.json
https://data.solarchive.org/tokens/2025-11/000000000.parquet
...
```

To download data for a date range, we'll:

1. Fetch the `index.json` file for that partition
2. Download every parquet file the index lists, skipping any file already on disk with the correct size
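The guide never shows `index.json` itself; based on the fields the downloader reads later (`name`, `url`, `size_bytes`, `last_modified`), here's a minimal synchronous sketch for peeking at a partition manifest:

```python
import httpx

# Peek at one partition's manifest. The fields printed below are the ones
# the downloader relies on; treat anything else in the index as undocumented.
index_url = "https://data.solarchive.org/txs/2025-11-30/index.json"
response = httpx.get(index_url, timeout=30.0)
response.raise_for_status()
index = response.json()

for entry in index["files"][:3]:
    print(entry["name"], entry["size_bytes"], entry["last_modified"])
```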
We'll build this downloader using modern Python with asyncio for efficient
concurrent downloads and rich for beautiful progress bars.
Create a new project from scratch:
```bash
# Create a new project directory
mkdir solarchive-downloader
cd solarchive-downloader

# Initialize a new Python project with uv
uv init

# Add dependencies
uv add httpx rich
```

This adds two key dependencies:

- `httpx` - Modern async HTTP client for concurrent downloads
- `rich` - Beautiful progress bars to track download status

We also use `asyncio`, `dataclasses`, and `pathlib` - but these are part of Python's standard library, so they don't need to be installed separately!
Let's build download_data.py step by step. We'll use modern Python features
like dataclasses and async/await.
First, create dataclasses to represent the API responses:
```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileInfo:
    """Metadata for a single parquet file."""
    name: str
    url: str
    size_bytes: int
    last_modified: str


@dataclass
class DownloadResult:
    """Result of downloading a single file."""
    name: str
    size_mb: float
    was_cached: bool
```

Using dataclasses gives us type safety and named fields instead of tuples or dicts.
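For instance, an index entry (the values below are made up for illustration) unpacks straight into `FileInfo` with `**`, since the manifest's keys match the dataclass fields:

```python
# A hypothetical index entry; real values come from index.json
entry = {
    "name": "000000000.parquet",
    "url": "https://data.solarchive.org/txs/2025-11-30/000000000.parquet",
    "size_bytes": 450_200_000,
    "last_modified": "2025-12-01T00:15:00Z",
}

# ** expands the dict into keyword arguments, giving typed, named access
file = FileInfo(**entry)
print(file.name, file.size_bytes)  # 000000000.parquet 450200000
```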
This function downloads a file and shows a live progress bar. We use rich to create
a beautiful progress display that shows the filename, progress bar, download speed, and estimated
time remaining:
```python
import asyncio

import httpx
from rich.progress import Progress


async def download_file(
    client: httpx.AsyncClient,
    file: FileInfo,
    output_dir: Path,
    semaphore: asyncio.Semaphore,
    progress: Progress,
) -> DownloadResult:
    """Download a single parquet file with progress bar."""
    output_path = output_dir / file.name

    # Check if file exists with correct size
    if output_path.exists():
        if output_path.stat().st_size == file.size_bytes:
            return DownloadResult(
                name=file.name,
                size_mb=file.size_bytes / 1e6,
                was_cached=True,
            )

    # Download with semaphore to limit concurrency
    async with semaphore:
        # Create individual progress bar for this file
        task_id = progress.add_task(f"[cyan]{file.name}", total=file.size_bytes)
        try:
            # Stream download to show real-time progress
            async with client.stream("GET", file.url) as response:
                response.raise_for_status()
                chunks = []
                async for chunk in response.aiter_bytes():
                    chunks.append(chunk)
                    progress.update(task_id, advance=len(chunk))
                content = b"".join(chunks)

            output_path.write_bytes(content)
            progress.update(task_id, description=f"[green]✓ {file.name}")
            return DownloadResult(
                name=file.name,
                size_mb=len(content) / 1e6,
                was_cached=False,
            )
        finally:
            progress.remove_task(task_id)
```

Key features: streams the download to update progress in real time, uses a semaphore to limit concurrent downloads, and automatically removes the progress bar when done.
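As written, `download_file` surfaces any network error straight to the caller. If your connection is flaky, a thin retry wrapper is a common addition; here's one possible sketch (the attempt count and backoff schedule are arbitrary choices, not part of the guide's script):

```python
async def download_with_retries(
    client: httpx.AsyncClient,
    file: FileInfo,
    output_dir: Path,
    semaphore: asyncio.Semaphore,
    progress: Progress,
    max_attempts: int = 3,
) -> DownloadResult:
    """Retry download_file on transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await download_file(client, file, output_dir, semaphore, progress)
        except (httpx.TransportError, httpx.HTTPStatusError):
            if attempt == max_attempts:
                raise
            # Back off 2s, 4s, 8s, ... before the next attempt
            await asyncio.sleep(2 ** attempt)
```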
Download all parquet files for a single partition. The progress display will show up to 10 concurrent downloads at once, each with its own progress bar:
```python
import asyncio

from rich.progress import (
    Progress, TextColumn, BarColumn,
    DownloadColumn, TransferSpeedColumn, TimeRemainingColumn,
)


async def download_partition(
    partition_date: str, dataset: str, output_dir: Path, client: httpx.AsyncClient
):
    """Download all files for a given date and dataset (txs, tokens, or accounts)."""
    date_dir = output_dir / dataset / partition_date
    date_dir.mkdir(parents=True, exist_ok=True)

    # Fetch partition index
    index_url = f"https://data.solarchive.org/{dataset}/{partition_date}/index.json"
    print(f"🔍 Fetching index: {index_url}")
    response = await client.get(index_url)
    response.raise_for_status()
    index = response.json()
    print(f"📅 {dataset}/{partition_date}: {len(index['files'])} files")

    # Create progress display showing multiple files at once
    with Progress(
        TextColumn("[progress.description]{task.description}"),
        BarColumn(),
        DownloadColumn(),
        TransferSpeedColumn(),
        TimeRemainingColumn(),
    ) as progress:
        # Download with limited concurrency (10 at a time)
        semaphore = asyncio.Semaphore(10)
        files = [FileInfo(**f) for f in index["files"]]
        tasks = [download_file(client, f, date_dir, semaphore, progress) for f in files]
        results = await asyncio.gather(*tasks)

    # Report results
    cached = sum(1 for r in results if r.was_cached)
    new = len(results) - cached
    print(f"✓ {new} downloaded, {cached} cached")
    return results
```

You'll see multiple progress bars stacked vertically, each showing a different file being downloaded. As files complete, their bars are removed and new ones appear for the next files in the queue.
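If you want to test a single partition before wiring up the full date-range logic, a minimal driver (assuming the dataclasses and functions above are in the same module) could look like:

```python
async def main():
    timeout = httpx.Timeout(60.0, connect=10.0)
    async with httpx.AsyncClient(timeout=timeout) as client:
        # Grab one daily txs partition into data/txs/2025-11-30/
        await download_partition("2025-11-30", "txs", Path("data"), client)

asyncio.run(main())
```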
Now we bring it all together - download data for a date range. Remember: transactions are daily partitions, but tokens and accounts are monthly:
```python
from datetime import date, timedelta


async def download_date_range(start_date: str, end_date: str):
    """Download transactions and tokens for a date range.

    Note: txs are partitioned by day, tokens by month, accounts by month.
    """
    # Generate list of dates
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    dates = []
    current = start
    while current <= end:
        dates.append(current.isoformat())
        current += timedelta(days=1)

    # Generate unique months for tokens and accounts (partitioned monthly)
    months = set()
    current = start
    while current <= end:
        months.add(current.strftime("%Y-%m"))
        current += timedelta(days=1)
    months = sorted(months)

    print(f"Downloading data from {start_date} to {end_date} ({len(dates)} days)")
    print("Output directory: data/")

    # Download transactions (daily), tokens (monthly), and accounts (monthly)
    # Set reasonable limits for concurrent downloads
    limits = httpx.Limits(max_keepalive_connections=20, max_connections=50)
    timeout = httpx.Timeout(60.0, connect=10.0, pool=10.0)
    async with httpx.AsyncClient(timeout=timeout, limits=limits) as client:
        print("TXS Dataset:")
        for partition_date in dates:
            await download_partition(partition_date, "txs", Path("data"), client)

        print("\nTOKENS Dataset:")
        for month in months:
            await download_partition(month, "tokens", Path("data"), client)

        # Uncomment to download accounts too:
        # print("\nACCOUNTS Dataset:")
        # for month in months:
        #     await download_partition(month, "accounts", Path("data"), client)


if __name__ == "__main__":
    START_DATE = "2025-11-30"
    END_DATE = "2025-12-01"
    asyncio.run(download_date_range(START_DATE, END_DATE))
```

This handles the different partitioning schemes automatically - downloading transactions for each day and tokens/accounts for each unique month in your date range.
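To see why both loops are needed, trace the example range: it spans only two days, but it crosses a month boundary, so it touches two monthly partitions:

```python
from datetime import date, timedelta

start, end = date(2025, 11, 30), date(2025, 12, 1)
dates, months = [], set()
current = start
while current <= end:
    dates.append(current.isoformat())
    months.add(current.strftime("%Y-%m"))
    current += timedelta(days=1)

print(dates)           # ['2025-11-30', '2025-12-01']
print(sorted(months))  # ['2025-11', '2025-12']
```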
We use `asyncio.Semaphore(10)` to limit downloads to 10 files at a time. This prevents overwhelming your network while still being fast. Each partition (typically 200-300 files) takes 5-15 minutes depending on your connection speed. You can adjust the semaphore value if you have a faster connection.
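One way to make that tunable without editing the script is an environment variable; the `SOLARCHIVE_CONCURRENCY` name below is invented for this sketch, not something SolArchive or the script defines:

```python
import os

# Hypothetical override, e.g.: SOLARCHIVE_CONCURRENCY=20 uv run download_data.py
max_concurrency = int(os.environ.get("SOLARCHIVE_CONCURRENCY", "10"))
semaphore = asyncio.Semaphore(max_concurrency)
```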
Run the script:

```bash
uv run download_data.py
```

You'll see beautiful progress bars for each file being downloaded:
```
Downloading data from 2025-11-30 to 2025-12-01 (2 days)
Output directory: data/

TXS Dataset:
🔍 Fetching index: https://data.solarchive.org/txs/2025-11-30/index.json
📅 txs/2025-11-30: 284 files
000000000.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 450.2/450.2 MB 125.3 MB/s 0:00:00
000000001.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 320.5/512.1 MB  98.2 MB/s 0:00:02
000000002.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 112.3/487.9 MB 110.5 MB/s 0:00:03
... (showing up to 10 concurrent downloads)
✓ 284 downloaded, 0 cached

TOKENS Dataset:
🔍 Fetching index: https://data.solarchive.org/tokens/2025-11/index.json
📅 tokens/2025-11: 8 files
000000000.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.5/18.5 MB 45.2 MB/s 0:00:00
000000001.parquet ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.3/20.1 MB 38.7 MB/s 0:00:01
... (showing concurrent downloads)
✓ 8 downloaded, 0 cached

✅ Downloaded 605 files across 2 days (271.07 GB total)
```

Notice a few things:
- Transactions were fetched per day: `txs/2025-11-30`, `txs/2025-12-01`
- Tokens were fetched per month: `tokens/2025-11`, `tokens/2025-12`
- Files land in `data/txs/` and `data/tokens/` subdirectories

Now that you have the data downloaded, check out our other guides.
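Before you do, it's cheap to sanity-check that the files on disk match what the script reported; this stdlib-only snippet tallies parquet files and sizes per dataset:

```python
from pathlib import Path

# Walk data/ and total up the parquet files for each dataset
for dataset_dir in sorted(Path("data").iterdir()):
    files = list(dataset_dir.rglob("*.parquet"))
    total_gb = sum(f.stat().st_size for f in files) / 1e9
    print(f"{dataset_dir.name}: {len(files)} files, {total_gb:.2f} GB")
```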
You now have a working downloader that handles SolArchive's data organization! This same script can be adapted for any date range or dataset combination.