If you work with data in Python, you've probably spent way too much time waiting for Pandas to grind through a CSV file. You know the drill — make coffee while read_csv() does its thing, then more coffee while groupby().agg() churns away.
Here's the thing: it doesn't have to be this way. In fact, by 2026, it really shouldn't be.
DuckDB is an embedded analytical database: install it with a single pip install and you can immediately start querying CSV, Parquet, and JSON files using plain SQL. No servers. No configuration files. No Docker containers. Just pure, fast analytical queries that consistently outperform Pandas by 4–10x on real workloads.
In this tutorial, I'll walk you through everything you need to start using DuckDB effectively — from basic queries to hybrid workflows that combine the best of DuckDB, Pandas, and Polars.
DuckDB is often called "SQLite for analytics." That comparison works because both are embedded databases — they live inside your application process rather than running as a separate server. But while SQLite is built for transactional workloads (think: your phone's contact list), DuckDB is purpose-built for analytical queries — the kind you'd run in a data warehouse.
Under the hood, DuckDB uses a columnar storage engine (data is stored column-by-column rather than row-by-row), which makes aggregations and scans dramatically faster. It's written in C++ with a Python client that's dead simple to use. And because it's vectorized and multi-threaded by default, it saturates your CPU cores without you lifting a finger.
The real headline feature, though, is this: you can point DuckDB at a CSV or Parquet file sitting on your disk and run SQL against it without importing anything first. No read_csv(), no schema inference headaches, no memory ballooning. Just write a query and get results.
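Here's the whole idea in two lines (a quick sketch, assuming you have any CSV file on disk; the file name is a placeholder, and installation is covered next):

import duckdb

# Query a file directly -- no loading step, no schema wrangling
duckdb.sql("SELECT * FROM 'my_data.csv' LIMIT 5").show()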
Nothing fancy here — just the usual pip install:
pip install duckdb
If you want to follow along with the hybrid workflow examples later, grab Pandas and Polars while you're at it:
pip install duckdb pandas polars pyarrow
That's it. You're ready. No Docker, no Postgres connection string, no cloud account. Let's write some queries.
Fire up a Python shell or Jupyter notebook and create a connection. The default duckdb.connect() gives you an in-memory database — perfect for quick analysis:
import duckdb
conn = duckdb.connect() # in-memory database
Let's create a small table from scratch to get a feel for the API:
conn.execute("""
CREATE TABLE employees (
id INTEGER,
name VARCHAR,
department VARCHAR,
salary DECIMAL(10, 2)
)
""")
conn.execute("""
INSERT INTO employees VALUES
(1, 'Alice', 'Engineering', 95000),
(2, 'Bob', 'Engineering', 87000),
(3, 'Charlie', 'Marketing', 72000),
(4, 'Diana', 'Marketing', 78000),
(5, 'Eve', 'Engineering', 105000),
(6, 'Frank', 'Sales', 65000),
(7, 'Grace', 'Sales', 71000)
""")
The fetchdf() method returns results as a Pandas DataFrame — DuckDB plays nicely with the Python data stack by default:
result = conn.execute("""
SELECT * FROM employees ORDER BY salary DESC
""").fetchdf()
print(result)
Grouping and aggregation are just as straightforward:
conn.execute("""
SELECT
department,
ROUND(AVG(salary), 2) AS avg_salary,
COUNT(*) AS headcount
FROM employees
GROUP BY department
ORDER BY avg_salary DESC
""").fetchdf()
So far, this looks like any SQL database. But the real magic starts when you stop importing data and start querying files directly.
This is where DuckDB genuinely changes how you work. Let me show you what I mean.
I generated a 500,000-row sales dataset — the kind of CSV that makes Pandas sweat a little. It's about 35 MB on disk with columns for order ID, region, product, category, quantity, unit price, and order date.
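If you want to follow along, here's a rough sketch that generates a similar file. The column names match the queries below, but the generator is my own stand-in rather than the exact script behind the numbers in this article, so your figures will differ slightly:

import numpy as np
import pandas as pd

# Synthetic stand-in for the tutorial's dataset: 500K rows with the
# column names the later queries expect (order_id, region, product,
# category, quantity, unit_price, order_date, total_amount).
rng = np.random.default_rng(42)
n = 500_000
df = pd.DataFrame({
    "order_id": np.arange(1, n + 1),
    "region": rng.choice(["North", "South", "East", "West"], n),
    "product": rng.choice([f"Product {c}" for c in "ABCDEFGHIJ"], n),
    "category": rng.choice(["Electronics", "Office", "Furniture"], n),
    "quantity": rng.integers(1, 10, n),
    "unit_price": rng.uniform(5, 1000, n).round(2),
    "order_date": pd.to_datetime("2024-01-01")
                  + pd.to_timedelta(rng.integers(0, 540, n), unit="D"),
})
df["total_amount"] = (df["quantity"] * df["unit_price"]).round(2)
df.to_csv("sales_data.csv", index=False)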
Here's the DuckDB way to analyze it:
result = conn.execute("""
SELECT
region,
category,
COUNT(*) AS num_orders,
ROUND(SUM(total_amount), 2) AS revenue,
ROUND(AVG(total_amount), 2) AS avg_order_value
FROM read_csv('sales_data.csv', AUTO_DETECT=TRUE)
GROUP BY region, category
ORDER BY revenue DESC
LIMIT 10
""").fetchdf()
Notice what didn't happen here. I didn't call pd.read_csv(). I didn't check dtypes. I didn't worry about memory. I just wrote a SQL query that references a file path, and DuckDB handled schema detection, type inference, and query execution in one shot.
The result came back in 0.075 seconds:
region category num_orders revenue avg_order_value
0 North Office 50199 1.274445e+08 2538.79
1 South Electronics 50228 1.271582e+08 2531.62
2 East Electronics 49780 1.258078e+08 2527.28
3 West Office 49771 1.257954e+08 2527.48
4 East Office 49901 1.257002e+08 2518.99
This one capability — running SQL directly against files — eliminates the entire "load then query" workflow that dominates most Python data scripts. And it's not limited to CSV. DuckDB reads Parquet, JSON, and even queries files over HTTP/S3 natively.
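For instance, reading a JSON file or a Parquet file sitting behind a URL looks like this (a sketch; the file names and URL are placeholders, and remote access needs the httpfs extension):

# Local JSON file -- same pattern as CSV and Parquet
conn.execute("""
    SELECT * FROM read_json_auto('events.json') LIMIT 10
""").fetchdf()

# Parquet over HTTP -- install and load the httpfs extension first
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")
conn.execute("""
    SELECT COUNT(*) AS n
    FROM read_parquet('https://example.com/data.parquet')
""").fetchdf()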
I ran the exact same aggregation on the same 500K-row CSV file using both DuckDB and Pandas. Here's what happened:
# Pandas approach
import pandas as pd

df = pd.read_csv('sales_data.csv')
pandas_result = (df.groupby(["region", "category"])
                   .agg(num_orders=("order_id", "count"),
                        revenue=("total_amount", "sum"),
                        avg_order_value=("total_amount", "mean"))
                   .sort_values("revenue", ascending=False)
                   .head(10))
| Method | Time | Speedup |
|---|---|---|
| Pandas (read_csv + groupby) | 0.343s | — |
| DuckDB (direct query) | 0.075s | 4.6× faster |
That's a nearly 5x speedup on a modest dataset. On larger files — the kind where Pandas starts pushing against your RAM ceiling — the gap widens significantly. DuckDB's columnar engine only needs to scan the columns referenced in your query, while Pandas loads the entire DataFrame into memory before doing anything.
Now, to be fair: if you're working with tiny datasets (a few thousand rows), the difference won't matter much. Pandas is perfectly fine for that. But the moment you're dealing with anything north of 100K rows, DuckDB starts pulling away fast.
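If you want to reproduce the comparison on your own machine, a minimal timing harness is enough (a sketch using time.perf_counter; your absolute numbers will differ, but the gap should look similar):

import time

import duckdb
import pandas as pd

query = """
    SELECT region, category,
           COUNT(*) AS num_orders,
           ROUND(SUM(total_amount), 2) AS revenue,
           ROUND(AVG(total_amount), 2) AS avg_order_value
    FROM read_csv('sales_data.csv', AUTO_DETECT=TRUE)
    GROUP BY region, category
    ORDER BY revenue DESC
    LIMIT 10
"""

start = time.perf_counter()
duckdb_result = duckdb.sql(query).df()
print(f"DuckDB: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
df = pd.read_csv('sales_data.csv')
pandas_result = (df.groupby(["region", "category"])
                   .agg(num_orders=("order_id", "count"),
                        revenue=("total_amount", "sum"),
                        avg_order_value=("total_amount", "mean"))
                   .sort_values("revenue", ascending=False)
                   .head(10))
print(f"Pandas: {time.perf_counter() - start:.3f}s")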
If you're still storing analytical data as CSV in 2026, let me gently suggest you switch to Parquet. It's a columnar file format that compresses better and reads faster — and DuckDB absolutely flies on it.
That 35 MB CSV I mentioned earlier? As Parquet, it shrinks to 11 MB. And queries get even faster because DuckDB can skip entire columns it doesn't need:
result = conn.execute("""
SELECT
product,
ROUND(SUM(total_amount), 2) AS total_revenue,
COUNT(*) AS units_sold,
ROUND(AVG(quantity), 1) AS avg_qty_per_order
FROM read_parquet('sales_data.parquet')
GROUP BY product
ORDER BY total_revenue DESC
""").fetchdf()
This query ran in 0.015 seconds — 5× faster than the already-fast CSV query. DuckDB can do this because Parquet files contain embedded statistics (min/max values per column chunk), allowing the engine to skip entire sections of the file that can't possibly match your query.
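If you're wondering how to get the Parquet file in the first place, DuckDB can do the conversion itself in a single COPY statement (a sketch, assuming sales_data.csv sits in the working directory):

# One-shot CSV-to-Parquet conversion, no Pandas round-trip needed
conn.execute("""
    COPY (SELECT * FROM read_csv('sales_data.csv', AUTO_DETECT=TRUE))
    TO 'sales_data.parquet' (FORMAT PARQUET)
""")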
One thing SQL has always done better than Pandas is window functions. Here's how to find the top 3 products by revenue in each region — a query that would take several chained Pandas operations:
result = conn.execute("""
WITH ranked AS (
SELECT
region,
product,
ROUND(SUM(total_amount), 2) AS revenue,
ROW_NUMBER() OVER (
PARTITION BY region
ORDER BY SUM(total_amount) DESC
) AS rank
FROM read_parquet('sales_data.parquet')
GROUP BY region, product
)
SELECT * FROM ranked WHERE rank <= 3
ORDER BY region, rank
""").fetchdf()
The CTE (Common Table Expression) with ROW_NUMBER() OVER (PARTITION BY ...) is clean and readable. No multi-step index manipulation, no transform chains — just a single declarative query that reads like English.
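For contrast, here's roughly what the same top-3-per-region logic looks like in Pandas (a sketch, assuming the raw data is already loaded into a DataFrame called df; not the only way to write it, but representative of the multi-step shape):

# Aggregate revenue per region/product, rank within each region, filter
revenue = (df.groupby(["region", "product"], as_index=False)["total_amount"]
             .sum()
             .rename(columns={"total_amount": "revenue"}))
revenue["rank"] = (revenue.groupby("region")["revenue"]
                          .rank(method="first", ascending=False))
top3 = revenue[revenue["rank"] <= 3].sort_values(["region", "rank"])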
Here's where things get interesting. DuckDB doesn't replace Pandas or Polars — it complements them. The sweet spot is using each tool for what it does best: DuckDB for heavy scans, joins, and aggregations over files; Pandas for pivots and its mature ecosystem on small results; Polars for fast in-memory transformations.
Here's a concrete example. First, DuckDB crunches through 500K rows and spits out a monthly summary:
# Step 1: DuckDB does the heavy aggregation (500K → 216 rows)
summary = conn.execute("""
SELECT
region,
category,
DATE_TRUNC('month', order_date) AS month,
COUNT(*) AS order_count,
ROUND(SUM(total_amount), 2) AS monthly_revenue
FROM read_parquet('sales_data.parquet')
GROUP BY region, category, DATE_TRUNC('month', order_date)
""").fetchdf()
# DuckDB aggregation: 0.041s → 216 rows
Then Pandas takes over for a pivot table — something it handles beautifully on small data:
# Step 2: Pandas pivots the summary
pivot = summary.pivot_table(
    index="month",
    columns="region",
    values="monthly_revenue",
    aggfunc="sum"
).round(2)
# Pandas pivot: 0.005s
And if you need to squeeze out every last millisecond, Polars can take the baton:
# Step 3: Polars for final formatting
import polars as pl
pl_df = pl.from_pandas(summary)
top_month = (pl_df
             .group_by("region")
             .agg(pl.col("monthly_revenue").max().alias("best_month_revenue"))
             .sort("best_month_revenue", descending=True))
# Polars processing: 0.018s
The whole pipeline — 500K rows aggregated, pivoted, and polished — completes in under 0.07 seconds. Each tool handled exactly the part it excels at, and you didn't have to commit to any single framework for the entire workflow.
So far we've used in-memory connections. But DuckDB also supports file-based persistent databases — just pass a path to connect():
persistent_conn = duckdb.connect('analytics.duckdb')
# Create a table directly from a Parquet file
persistent_conn.execute("""
CREATE OR REPLACE TABLE sales AS
SELECT * FROM read_parquet('sales_data.parquet')
""")
# Verify it's there
row_count = persistent_conn.execute(
"SELECT COUNT(*) FROM sales"
).fetchone()[0]
print(f"Sales table: {row_count:,} rows")
# Sales table: 500,000 rows
persistent_conn.close()
The resulting analytics.duckdb file is only about 5 MB — smaller than both the CSV (35 MB) and the Parquet (11 MB) because DuckDB applies its own compression on top. You can share this file, back it up, or query it again later without re-ingesting the original data.
This is incredibly useful for iterative analysis. Load once, query many times — and close the connection when you're done without losing anything.
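In a later session, reopening the file is all it takes to pick up where you left off (a sketch; the read_only flag is optional, but useful when several processes share the database):

import duckdb

# Reopen the saved database -- the sales table is still there
conn = duckdb.connect('analytics.duckdb', read_only=True)
top_regions = conn.execute("""
    SELECT region, ROUND(SUM(total_amount), 2) AS revenue
    FROM sales
    GROUP BY region
    ORDER BY revenue DESC
""").fetchdf()
conn.close()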
Once you've done your analysis, DuckDB makes it trivial to export results in whatever format you need:
# Export to CSV
conn.execute("""
COPY (
SELECT region, product, ROUND(SUM(total_amount), 2) AS revenue
FROM read_parquet('sales_data.parquet')
GROUP BY region, product
ORDER BY revenue DESC
) TO 'revenue_summary.csv' (HEADER, DELIMITER ',')
""")
# Export to Parquet
conn.execute("""
COPY (
SELECT region, product, ROUND(SUM(total_amount), 2) AS revenue
FROM read_parquet('sales_data.parquet')
GROUP BY region, product
ORDER BY revenue DESC
) TO 'revenue_summary.parquet' (FORMAT PARQUET)
""")
The COPY command handles the serialization efficiently — no loop writing rows one at a time.
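And if you need the output split by a column, COPY can also write Hive-style partitioned Parquet (a sketch; the output directory name is a placeholder):

# Write one Parquet folder per region (Hive-style partitioning)
conn.execute("""
    COPY (SELECT * FROM read_parquet('sales_data.parquet'))
    TO 'sales_by_region' (FORMAT PARQUET, PARTITION_BY (region))
""")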
I'm not going to tell you to throw away Pandas. That would be bad advice. Here's my honest take on when each tool makes sense:
| Scenario | Best Tool |
|---|---|
| Quick exploration of a small CSV (< 10K rows) | Pandas |
| Aggregating a large CSV/Parquet file | DuckDB |
| Joining multiple large files | DuckDB |
| Complex statistical modeling | Pandas + scikit-learn |
| ETL pipelines on structured data | DuckDB → Polars |
| BI-style reporting from files | DuckDB |
| Machine learning preprocessing | DuckDB → Pandas |
The pattern to internalize is this: use DuckDB to shrink your data, then hand off a manageable DataFrame to Pandas or Polars for the parts they excel at.
DuckDB solves a problem that's been nagging Python data folks for years: the awkward gap between "this file is too big for Pandas to handle comfortably" and "I don't want to set up a full database server."
It's not a replacement for your production Postgres instance, and it won't help you train neural networks. But for the vast middle ground of data analysis — where you're wrangling CSV files, joining datasets, and building reports — it's genuinely transformative.
The fact that you can pip install this kind of performance and just start writing SQL against files sitting on your laptop is kind of remarkable. In 2026, there's very little reason to keep suffering through slow Pandas aggregations when DuckDB is a single import away.