How to Load SolArchive to ClickHouse

● Intermediate ⏱ 25 min

ClickHouse is a fast columnar database perfect for analytical queries on large datasets. This guide shows you how to load SolArchive parquet files into ClickHouse, giving you the power to run blazing-fast SQL queries on Solana blockchain data.

Prerequisites: This guide assumes you've already downloaded SolArchive data. If you haven't, start with How to Download SolArchive Data first.

Table of Contents

Setting Up ClickHouse
Creating Tables
Loading Data
Querying Data
Why ClickHouse is Great for Blockchain Data
Conclusion
Additional Resources

Setting Up ClickHouse

First, you need a running ClickHouse instance. You can install it locally or use Docker.

Option 1: Local Installation

# macOS
brew install clickhouse

# Start ClickHouse server
clickhouse server

# In another terminal, connect to ClickHouse
clickhouse client

Option 2: Docker

If you prefer Docker, this is the quickest way to get started:

# Pull and run ClickHouse with Docker
docker run -d \
  --name clickhouse-server \
  -p 8123:8123 -p 9000:9000 \
  --ulimit nofile=262144:262144 \
  clickhouse/clickhouse-server

# Connect to ClickHouse
docker exec -it clickhouse-server clickhouse-client

Note: ClickHouse will be available at http://localhost:8123 (HTTP interface) and localhost:9000 (native protocol). The clickhouse-client command above connects via the native protocol; the Python client we'll use later connects over HTTP.
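
Before moving on, it's worth a quick sanity check from the client. A minimal check (this assumes the default user with an empty password, as in the Docker image):

-- Run inside clickhouse-client; prints the server version and uptime
SELECT version(), uptime();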

Creating Tables

Now let's create tables that match SolArchive's parquet schema. We'll use ClickHouse's powerful features like nested arrays and partitioning by month.

Transactions Table

This table stores transaction data with nested arrays for token balances and instructions:

CREATE TABLE IF NOT EXISTS transactions
(
    signature String,
    block_slot UInt64,
    block_timestamp DateTime64(3, 'UTC'),
    fee UInt64,
    status String,
    err Nullable(String),
    
    -- Account keys
    account_keys Array(String),
    
    -- Token balances (nested arrays)
    pre_token_balances Array(Tuple(
        account_index UInt8,
        mint String,
        owner String,
        amount String,
        decimals UInt8
    )),
    post_token_balances Array(Tuple(
        account_index UInt8,
        mint String,
        owner String,
        amount String,
        decimals UInt8
    )),
    
    -- Instructions
    instructions Array(Tuple(
        program_id_index UInt8,
        accounts Array(UInt8),
        data String
    )),
    
    -- Log messages
    log_messages Array(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(block_timestamp)
ORDER BY (block_timestamp, signature)
SETTINGS index_granularity = 8192;

Key features:

MergeTree engine, built for fast scans over large append-only datasets
Monthly partitions via toYYYYMM(block_timestamp), so time-range queries only read the months they need
ORDER BY (block_timestamp, signature) serves as the primary index for time-ordered lookups
Array(Tuple(...)) columns keep token balances and instructions nested with each transaction instead of in separate tables

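Partitioning and the sort key only pay off when queries filter on block_timestamp. Once you've loaded data (next section), a quick way to confirm that a query prunes partitions is EXPLAIN with index details (a sketch; the date range is just an example):

-- Shows how many partitions and granules the query actually reads
EXPLAIN indexes = 1
SELECT count()
FROM transactions
WHERE block_timestamp >= '2025-11-01' AND block_timestamp < '2025-12-01';
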
Tokens Table

This table stores token metadata:

CREATE TABLE IF NOT EXISTS tokens
(
    mint String,
    symbol Nullable(String),
    name Nullable(String),
    decimals UInt8,
    supply Nullable(String),
    mint_authority Nullable(String),
    freeze_authority Nullable(String),
    updated_at DateTime64(3, 'UTC')
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY mint
SETTINGS index_granularity = 8192;

Key features:

ReplacingMergeTree(updated_at) keeps only the newest row per mint when parts merge, so re-loading token files won't leave duplicates
ORDER BY mint doubles as the deduplication key and makes lookups by mint address fast

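Because ReplacingMergeTree deduplicates only when parts merge in the background, a query can briefly see more than one row per mint. Adding FINAL forces deduplication at query time (slower, but exact):

-- Latest metadata per mint, deduplicated at query time
SELECT mint, symbol, name, decimals
FROM tokens FINAL
WHERE symbol = 'USDC';
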
Schema Simplification: This schema includes the most commonly used fields. SolArchive's full schema has additional fields you can add as needed. Check the transaction schema and token schema for complete details.
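
If you need one of those extra fields later, you can add it without rebuilding the table. A sketch using a hypothetical recent_blockhash column; check the schema docs for the real field names and types:

-- 'recent_blockhash' is only an illustrative name here, not necessarily in SolArchive's schema
ALTER TABLE transactions ADD COLUMN IF NOT EXISTS recent_blockhash Nullable(String);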

Loading Data

Now let's write a Python script to load parquet files into ClickHouse. We'll use the ClickHouse Python client and PyArrow to read parquet files.

Install Dependencies

uv add clickhouse-connect pyarrow pandas

Load Script

Create load_clickhouse.py:

"""
Load SolArchive parquet files into ClickHouse.

This script reads parquet files and inserts them into ClickHouse tables
using the ClickHouse Python client.
"""

from pathlib import Path

import clickhouse_connect
import pyarrow.parquet as pq


def load_transactions_partition(client, partition_dir: Path):
    """Load all transaction parquet files from a partition into ClickHouse."""
    parquet_files = sorted(partition_dir.glob("*.parquet"))
    
    print(f"Loading {len(parquet_files)} files from {partition_dir.name}...")
    
    total_rows = 0
    for parquet_file in parquet_files:
        # Read parquet file
        table = pq.read_table(parquet_file)
        
        # Convert to pandas for easier insertion
        df = table.to_pandas()
        
        # Insert into ClickHouse
        client.insert_df('transactions', df)
        
        total_rows += len(df)
        print(f"  ✓ {parquet_file.name}: {len(df):,} rows")
    
    print(f"  Total: {total_rows:,} rows loaded\n")
    return total_rows


def load_tokens_partition(client, partition_dir: Path):
    """Load all token parquet files from a partition into ClickHouse."""
    parquet_files = sorted(partition_dir.glob("*.parquet"))
    
    print(f"Loading {len(parquet_files)} files from {partition_dir.name}...")
    
    total_rows = 0
    for parquet_file in parquet_files:
        # Read parquet file
        table = pq.read_table(parquet_file)
        
        # Convert to pandas for easier insertion
        df = table.to_pandas()
        
        # Insert into ClickHouse
        client.insert_df('tokens', df)
        
        total_rows += len(df)
        print(f"  ✓ {parquet_file.name}: {len(df):,} rows")
    
    print(f"  Total: {total_rows:,} rows loaded\n")
    return total_rows


def main():
    """Load SolArchive data into ClickHouse."""
    # Connect to ClickHouse
    print("Connecting to ClickHouse...")
    client = clickhouse_connect.get_client(
        host='localhost',
        port=8123,
        username='default',
        password=''
    )
    
    # Test connection
    result = client.query('SELECT version()')
    print(f"Connected to ClickHouse {result.result_rows[0][0]}\n")
    
    # Load transactions
    data_dir = Path("data")
    txs_dir = data_dir / "txs"
    
    if txs_dir.exists():
        print("Loading TRANSACTIONS dataset:")
        partitions = sorted([d for d in txs_dir.iterdir() if d.is_dir()])
        
        total_txs = 0
        for partition_dir in partitions:
            rows = load_transactions_partition(client, partition_dir)
            total_txs += rows
        
        print(f"✅ Loaded {total_txs:,} total transactions\n")
    
    # Load tokens
    tokens_dir = data_dir / "tokens"
    
    if tokens_dir.exists():
        print("Loading TOKENS dataset:")
        partitions = sorted([d for d in tokens_dir.iterdir() if d.is_dir()])
        
        total_tokens = 0
        for partition_dir in partitions:
            rows = load_tokens_partition(client, partition_dir)
            total_tokens += rows
        
        print(f"✅ Loaded {total_tokens:,} total token records\n")
    
    # Show table stats
    print("\nTable Statistics:")
    
    txs_count = client.query('SELECT count() FROM transactions')
    print(f"  Transactions: {txs_count.result_rows[0][0]:,} rows")
    
    tokens_count = client.query('SELECT count() FROM tokens')
    print(f"  Tokens: {tokens_count.result_rows[0][0]:,} rows")


if __name__ == "__main__":
    main()

Run the Load Script

uv run load_clickhouse.py

You'll see output like:

Connecting to ClickHouse...
Connected to ClickHouse 24.3.1

Loading TRANSACTIONS dataset:
Loading 284 files from 2025-11-30...
  ✓ 000000000.parquet: 125,430 rows
  ✓ 000000001.parquet: 98,765 rows
  ...
  Total: 35,642,190 rows loaded

✅ Loaded 35,642,190 total transactions

Loading TOKENS dataset:
Loading 8 files from 2025-11...
  ✓ 000000000.parquet: 2,534,890 rows
  ...
  Total: 20,279,120 rows loaded

✅ Loaded 20,279,120 total token records

Table Statistics:
  Transactions: 35,642,190 rows
  Tokens: 20,279,120 rows

Performance Tip: Loading large amounts of data can take time. For 270 GB of transactions, expect 30-60 minutes depending on your hardware. ClickHouse's columnar format compresses the data significantly, so it will use less disk space than the parquet files.
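
You can see how much disk space the loaded tables actually use, and how well they compress, from the system.parts table (a quick sketch):

-- Compressed vs. uncompressed size per table
SELECT
    table,
    sum(rows) as total_rows,
    formatReadableSize(sum(data_compressed_bytes)) as compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) as uncompressed
FROM system.parts
WHERE active AND table IN ('transactions', 'tokens')
GROUP BY table;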

Querying Data

Now for the fun part - querying your data! ClickHouse excels at analytical queries over large datasets.

Example Queries

Here are some example queries to get you started. One thing to keep in mind: putting pre_token_balances and post_token_balances in the same ARRAY JOIN pairs them element by element, so it assumes both arrays have the same length.

-- Get top 10 most transferred tokens
SELECT 
    t.symbol,
    t.name,
    count() as transfer_count,
    count(DISTINCT tx.signature) as tx_count
FROM transactions tx
ARRAY JOIN tx.pre_token_balances AS pre, tx.post_token_balances AS post
LEFT JOIN tokens t ON pre.mint = t.mint
WHERE pre.mint = post.mint
  AND pre.account_index = post.account_index
  AND pre.amount != post.amount
  AND tx.status = 'Success'
  AND t.symbol IS NOT NULL
GROUP BY t.symbol, t.name
ORDER BY transfer_count DESC
LIMIT 10;

-- Average transaction fee per day
SELECT 
    toDate(block_timestamp) as day,
    avg(fee) / 1e9 as avg_fee_sol,
    count() as tx_count
FROM transactions
WHERE status = 'Success'
GROUP BY day
ORDER BY day DESC
LIMIT 30;

Why ClickHouse is Great for Blockchain Data

Blockchain data plays to ClickHouse's strengths: columnar storage compresses highly repetitive fields like mints, owners, and program IDs very well, aggregations over tens of millions of transactions come back in seconds, and ARRAY JOIN lets you query nested structures such as token balances and instructions without first flattening them into separate tables.

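As a concrete example of that last point, here's a sketch that counts instruction invocations per program by resolving each instruction's program_id_index back into account_keys (this assumes the index is 0-based as in Solana's transaction format, while ClickHouse arrays are 1-based, hence the +1):

-- Top programs by instruction count
SELECT
    account_keys[inst.program_id_index + 1] as program_id,
    count() as invocations
FROM transactions
ARRAY JOIN instructions AS inst
WHERE status = 'Success'
GROUP BY program_id
ORDER BY invocations DESC
LIMIT 10;
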
More Query Examples

Find the busiest hours of the day:

SELECT 
    toHour(block_timestamp) as hour,
    count() as tx_count
FROM transactions
WHERE toDate(block_timestamp) = today()
GROUP BY hour
ORDER BY hour;

Track a specific wallet's token transfers:

SELECT 
    block_timestamp,
    t.symbol,
    pre.amount as pre_amount,
    post.amount as post_amount
FROM transactions
ARRAY JOIN pre_token_balances AS pre, post_token_balances AS post
LEFT JOIN tokens t ON pre.mint = t.mint
WHERE pre.owner = 'YOUR_WALLET_ADDRESS_HERE'
  AND pre.mint = post.mint
  AND pre.account_index = post.account_index
ORDER BY block_timestamp DESC
LIMIT 100;
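
The amount fields are raw integer strings, so converting them to human-readable values means dividing by 10^decimals. A tiny sketch of the conversion (1,500,000 base units of a 6-decimal token is 1.5):

-- Convert a raw token amount string to UI units
SELECT toFloat64('1500000') / pow(10, 6) as ui_amount;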

Conclusion

You now have SolArchive data loaded into ClickHouse! This gives you the power to run fast SQL queries on Solana blockchain data without relying on centralized APIs or rate limits.

From here, you can:

Add more fields from the full transaction and token schemas as you need them
Keep writing your own analytical queries, or point a dashboarding tool at ClickHouse
Re-run the load script as you download new SolArchive partitions

Additional Resources