ClickHouse is a fast columnar database perfect for analytical queries on large datasets. This guide shows you how to load SolArchive parquet files into ClickHouse, giving you the power to run blazing-fast SQL queries on Solana blockchain data.
First, you need a running ClickHouse instance. You can install it locally or use Docker.
# macOS
brew install clickhouse
# Start ClickHouse server
clickhouse server
# In another terminal, connect to ClickHouse
clickhouse client
If you prefer Docker, this is the quickest way to get started:
# Pull and run ClickHouse with Docker
docker run -d \
--name clickhouse-server \
-p 8123:8123 -p 9000:9000 \
--ulimit nofile=262144:262144 \
clickhouse/clickhouse-server
# Connect to ClickHouse
docker exec -it clickhouse-server clickhouse-client
The container exposes http://localhost:8123 (HTTP interface) and localhost:9000 (native protocol). The client connects via the native protocol.
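Either way, you can sanity-check that the server is up from another terminal (assuming the default ports above):
# HTTP interface health check - should print "Ok."
curl http://localhost:8123/ping
# Run a query over the HTTP interface
curl 'http://localhost:8123/?query=SELECT%20version()'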
Now let's create tables that match SolArchive's parquet schema. We'll use ClickHouse's powerful features like nested arrays and partitioning by month.
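Before writing the DDL, it can help to peek at the parquet schema so the column types line up. A quick sketch (the path is an example matching the layout used later in this guide):
import pyarrow.parquet as pq
# Print the Arrow schema of one downloaded transactions file
print(pq.read_schema("data/txs/2025-11-30/000000000.parquet"))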
This table stores transaction data with nested arrays for token balances and instructions:
CREATE TABLE IF NOT EXISTS transactions
(
signature String,
block_slot UInt64,
block_timestamp DateTime64(3, 'UTC'),
fee UInt64,
status String,
err Nullable(String),
-- Account keys
account_keys Array(String),
-- Token balances (nested arrays)
pre_token_balances Array(Tuple(
account_index UInt8,
mint String,
owner String,
amount String,
decimals UInt8
)),
post_token_balances Array(Tuple(
account_index UInt8,
mint String,
owner String,
amount String,
decimals UInt8
)),
-- Instructions
instructions Array(Tuple(
program_id_index UInt8,
accounts Array(UInt8),
data String
)),
-- Log messages
log_messages Array(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(block_timestamp)
ORDER BY (block_timestamp, signature)
SETTINGS index_granularity = 8192;
Key features:
MergeTree engine - ClickHouse's main table engine for large datasets
PARTITION BY toYYYYMM(block_timestamp) - Data partitioned by month for efficient queries (see the EXPLAIN sketch below)
ORDER BY (block_timestamp, signature) - Primary key for fast lookups
pre_token_balances and post_token_balances stored as arrays of tuples
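Because the table is partitioned by month, a query that filters on block_timestamp only has to read the matching parts. Once some data is loaded you can confirm this with EXPLAIN (a rough sketch; the date range is an example):
-- Show which partitions and granules the query would read
EXPLAIN indexes = 1
SELECT count()
FROM transactions
WHERE block_timestamp >= toDateTime64('2025-11-01 00:00:00', 3, 'UTC')
  AND block_timestamp <  toDateTime64('2025-12-01 00:00:00', 3, 'UTC');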
This table stores token metadata:
CREATE TABLE IF NOT EXISTS tokens
(
mint String,
symbol Nullable(String),
name Nullable(String),
decimals UInt8,
supply Nullable(String),
mint_authority Nullable(String),
freeze_authority Nullable(String),
updated_at DateTime64(3, 'UTC')
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY mint
SETTINGS index_granularity = 8192;
Key features:
ReplacingMergeTree - Automatically deduplicates rows with the same mint (see the note below)
updated_at - Version column that keeps the latest record
ORDER BY mint - Primary key for fast token lookups
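One thing to keep in mind: ReplacingMergeTree deduplicates in the background when parts merge, so duplicate mints can still show up right after loading. Two standard ways to get a deduplicated view (plain ClickHouse syntax, nothing SolArchive-specific):
-- Deduplicate at query time
SELECT count() FROM tokens FINAL;
-- Or force a merge so the stored data itself is deduplicated (can be heavy on large tables)
OPTIMIZE TABLE tokens FINAL;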
Now let's write a Python script to load parquet files into ClickHouse. We'll use the ClickHouse Python client and PyArrow to read parquet files. Install the dependencies:
uv add clickhouse-connect pyarrow pandas
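As an aside, clickhouse-client can also ingest parquet files directly, with no Python involved. A minimal sketch for a single file (the path is just an example; adjust it to wherever your SolArchive files live):
clickhouse client --query "INSERT INTO transactions FROM INFILE 'data/txs/2025-11-30/000000000.parquet' FORMAT Parquet"
The Python loader below is handier when you want per-file progress reporting and to walk whole partition directories.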
Create load_clickhouse.py:
"""
Load SolArchive parquet files into ClickHouse.
This script reads parquet files and inserts them into ClickHouse tables
using the ClickHouse Python client.
"""
from pathlib import Path
import clickhouse_connect
import pyarrow.parquet as pq
def load_transactions_partition(client, partition_dir: Path):
"""Load all transaction parquet files from a partition into ClickHouse."""
parquet_files = sorted(partition_dir.glob("*.parquet"))
print(f"Loading {len(parquet_files)} files from {partition_dir.name}...")
total_rows = 0
for parquet_file in parquet_files:
# Read parquet file
table = pq.read_table(parquet_file)
# Convert to pandas for easier insertion
df = table.to_pandas()
# Insert into ClickHouse
client.insert_df('transactions', df)
total_rows += len(df)
print(f" ✓ {parquet_file.name}: {len(df):,} rows")
print(f" Total: {total_rows:,} rows loaded\n")
return total_rows
def load_tokens_partition(client, partition_dir: Path):
"""Load all token parquet files from a partition into ClickHouse."""
parquet_files = sorted(partition_dir.glob("*.parquet"))
print(f"Loading {len(parquet_files)} files from {partition_dir.name}...")
total_rows = 0
for parquet_file in parquet_files:
# Read parquet file
table = pq.read_table(parquet_file)
# Convert to pandas for easier insertion
df = table.to_pandas()
# Insert into ClickHouse
client.insert_df('tokens', df)
total_rows += len(df)
print(f" ✓ {parquet_file.name}: {len(df):,} rows")
print(f" Total: {total_rows:,} rows loaded\n")
return total_rows
def main():
"""Load SolArchive data into ClickHouse."""
# Connect to ClickHouse
print("Connecting to ClickHouse...")
client = clickhouse_connect.get_client(
host='localhost',
port=8123,
username='default',
password=''
)
# Test connection
result = client.query('SELECT version()')
print(f"Connected to ClickHouse {result.result_rows[0][0]}\n")
# Load transactions
data_dir = Path("data")
txs_dir = data_dir / "txs"
if txs_dir.exists():
print("Loading TRANSACTIONS dataset:")
partitions = sorted([d for d in txs_dir.iterdir() if d.is_dir()])
total_txs = 0
for partition_dir in partitions:
rows = load_transactions_partition(client, partition_dir)
total_txs += rows
print(f"✅ Loaded {total_txs:,} total transactions\n")
# Load tokens
tokens_dir = data_dir / "tokens"
if tokens_dir.exists():
print("Loading TOKENS dataset:")
partitions = sorted([d for d in tokens_dir.iterdir() if d.is_dir()])
total_tokens = 0
for partition_dir in partitions:
rows = load_tokens_partition(client, partition_dir)
total_tokens += rows
print(f"✅ Loaded {total_tokens:,} total token records\n")
# Show table stats
print("\nTable Statistics:")
txs_count = client.query('SELECT count() FROM transactions')
print(f" Transactions: {txs_count.result_rows[0][0]:,} rows")
tokens_count = client.query('SELECT count() FROM tokens')
print(f" Tokens: {tokens_count.result_rows[0][0]:,} rows")
if __name__ == "__main__":
main()
Run the script:
uv run load_clickhouse.py
You'll see output like:
Connecting to ClickHouse...
Connected to ClickHouse 24.3.1
Loading TRANSACTIONS dataset:
Loading 284 files from 2025-11-30...
✓ 000000000.parquet: 125,430 rows
✓ 000000001.parquet: 98,765 rows
...
Total: 35,642,190 rows loaded
✅ Loaded 35,642,190 total transactions
Loading TOKENS dataset:
Loading 8 files from 2025-11...
✓ 000000000.parquet: 2,534,890 rows
...
Total: 20,279,120 rows loaded
✅ Loaded 20,279,120 total token records
Table Statistics:
Transactions: 35,642,190 rows
Tokens: 20,279,120 rows
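You can also ask ClickHouse how the transactions table was split into monthly partitions; this is a standard system-table query, nothing project-specific:
SELECT
    partition,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = currentDatabase() AND table = 'transactions' AND active
GROUP BY partition
ORDER BY partition;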
Now for the fun part - querying your data! ClickHouse excels at analytical queries over large datasets. Here are some example queries to get you started:
-- Get top 10 most transferred tokens
SELECT
t.symbol,
t.name,
count() as transfer_count,
count(DISTINCT tx.signature) as tx_count
FROM transactions tx
ARRAY JOIN tx.pre_token_balances AS pre, tx.post_token_balances AS post
LEFT JOIN tokens t ON pre.mint = t.mint
WHERE pre.mint = post.mint
AND pre.account_index = post.account_index
AND pre.amount != post.amount
AND tx.status = 'Success'
AND t.symbol IS NOT NULL
GROUP BY t.symbol, t.name
ORDER BY transfer_count DESC
LIMIT 10;
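A note on the query above: when several arrays appear in a single ARRAY JOIN, ClickHouse zips them by position, and they must all have the same length. The query therefore assumes pre_token_balances and post_token_balances line up one-to-one; if they can differ in length in your data, compare balances by account_index instead. A tiny self-contained illustration of the zipping behaviour:
SELECT x, y
FROM (SELECT [1, 2, 3] AS xs, ['a', 'b', 'c'] AS ys)
ARRAY JOIN xs AS x, ys AS y;
-- Returns three rows: (1, 'a'), (2, 'b'), (3, 'c')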
-- Average transaction fee per day
SELECT
toDate(block_timestamp) as day,
avg(fee) / 1e9 as avg_fee_sol,
count() as tx_count
FROM transactions
WHERE status = 'Success'
GROUP BY day
ORDER BY day DESC
LIMIT 30;
Find the busiest hours of the day:
SELECT
toHour(block_timestamp) as hour,
count() as tx_count
FROM transactions
WHERE toDate(block_timestamp) = today()
GROUP BY hour
ORDER BY hour;
Track a specific wallet's token transfers:
SELECT
block_timestamp,
t.symbol,
pre.amount AS pre_amount,
post.amount AS post_amount
FROM transactions
ARRAY JOIN pre_token_balances AS pre, post_token_balances AS post
LEFT JOIN tokens t ON pre.mint = t.mint
WHERE pre.owner = 'YOUR_WALLET_ADDRESS_HERE'
AND pre.mint = post.mint
AND pre.account_index = post.account_index
ORDER BY block_timestamp DESC
LIMIT 100;
You now have SolArchive data loaded into ClickHouse! This gives you the power to run fast SQL queries on Solana blockchain data without relying on centralized APIs or rate limits.
From here, you can: