DuckDB: The Rising Star in the Big Data Landscape

Mihai Bojin
5 min readJan 15, 2024

--

DuckDB — in-process SQL OLAP database management system

The Current State of Big Data

The big data landscape has undergone a remarkable transformation over the past decade. In this rapidly evolving realm, the sheer number of tools available today is staggering — roughly nine times more diverse than what we had ten years ago.

This diversity isn’t only limited to data warehouses, data lakes, and lakehouses. It has extended to processing methodologies, transitioning from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform), EL (Extract, Load), and beyond. The explosion of options is both a blessing and a curse, as it complicates the task of becoming an expert in any single technology — time is the same for all of us, after all!

SQL has established itself as the tool of choice for data transformations in this vast and varied ecosystem, evidenced by being the de-facto language of choice for many data processing platforms.

Among the plethora of tools, one newcomer is making significant waves — you’ve guessed it… DuckDB!

What the quack is DuckDB, and why should you care!?

DuckDB is an in-process SQL analytics engine that is rapidly gaining popularity, evidenced by its impressive stats:

  • 1.7 million downloads per month on PyPI
  • 13,800 stars on GitHub, reaching the same interest as Postgres in half the time
DuckDB vs. Postgres — GitHub Star Rating
  • In a comparable two-year timeframe, DuckDB experienced the same growth in popularity as Snowflake, as reported by DB-Engines trends; it seems well-positioned to become mainstream in the following few years and replace at least a few of the payloads currently being processed in traditional data warehouses.
DuckDB vs. Snowflake — DB Engines ranking

The open-source nature of DuckDB, guaranteed by its MIT license in perpetuity, adds to its appeal.

Key Benefits of DuckDB

  1. Ease of Installation: running DuckDB is as simple as brew install duckdb (which installs the DuckDB CLI)
  2. Low complexity: The absence of a server (DuckDB is just a binary) means there’s no need to deal with credentials, access control lists, firewall configurations, etc.
  3. Universal Compatibility: With very few dependencies, DuckDB epitomizes portability — it can even run in your browser!
  4. DataFrame Integration: DuckDB’s Python library has the ability to query Pandas DataFrames — this is a game-changer! It acts as a unifying layer or ‘glue’ between it and any systems it can’t directly query, facilitating the transformation step in data processing.
  5. Extensions: DuckDB has a flexible extension mechanism, which allows for greater flexibility, especially when it comes to reading data directly from JSON and Parquet or directly from S3. This capability greatly improves Developers’ experience.
  6. Stability and Efficiency: DuckDB is designed to handle workloads beyond memory limits (albeit with some limitations). This is particularly relevant in scenarios where the analyzed datasets are significantly larger than the available RAM but small enough to fit on disk — empowering analyses to be completed using ‘cheap’ and readily available hardware (like your laptop).

DuckDB in Real-world Data Pipelines

Compared to cloud-based systems, DuckDB stands out for its minimal requirements and cost-effectiveness. It eliminates the need for cloud accounts, quotas, or additional expenses. DuckDB’s consistency across environments, from a developer’s laptop to production setups, starkly contrasts cloud-based solutions that often suffer from drift as time progresses and data goes stale or invalid.

The simplicity of running DuckDB just about anywhere circumvents common challenges seen in distributed systems, such as moving data to compute nodes, VM/job orchestration, and failure handling. The capacity of modern machines, cloud-based or like those powered by Apple’s M1 SoCs, further enhances DuckDB’s utility by enabling single-machine processing scenarios for significantly large datasets. The reality is that very few customers will need to process TBs of data daily, genuinely needing more CPU power than readily available in all the public clouds.

SQL syntactic sugar

DuckDB’s relative novelty affords it the flexibility to introduce new SQL syntax enhancements, such as GROUP BY ALL, SELECT * EXCLUDE, ASOF JOINS, etc. These additions make SQL queries more intuitive and readable; consider the snippets below:

-- Group by several fields in ANSI SQL
SELECT country, city, region, postal_code, AVG(price) AS avg_price
FROM customers
-- Non-aggregated fields need to be repeated here
GROUP BY country, city, region, postal_code;

-- Group by all in DuckDB
SELECT country, city, region, postal_code, AVG(price) AS avg_price
-- Fields are only listed once; maintaining the code becomes easier
GROUP BY ALL;
-- Query all but the 'email' field in ANSI SQL
SELECT country, city, region, postal_code, address, phone_number
/*, email*/
FROM customers;

-- Query all but the 'email' field in DuckDB
SELECT * EXCLUDE (email) FROM customers;
-- Consider joining timestamps that are 'approximately' equal.
-- In ANSI SQL, you would generally have to bucket them
-- In DuckDB, you can use an ASOF JOIN to achieve the same results,
-- more simply and efficiently.
SELECT events.id, events.ts, events.val, metadata.details
FROM events
ASOF JOIN metadata USING(id, ts);

Integration with Pandas Dataframes

A significant advantage of DuckDB, especially in the Python ecosystem, is its seamless integration with Pandas Dataframes. This feature simplifies the process of merging diverse datasets from various sources, streamlining data analysis and transformation tasks.

For example, in a Jupyter Notebook, you can do the following (based on the Movie Recommendation System dataset):

# Install dependencies
%pip install --quiet duckdb
%pip install --quiet jupysql
%pip install --quiet duckdb-engine
%pip install --quiet pandas
%pip install --quiet matplotlib
%pip install --quiet psycopg2-binary
%pip install --quiet dash
%pip install --quiet plotly

import duckdb
import pandas as pd

# Load and configure jupysql
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
%config SqlMagic.named_parameters=True

# Connect to a local DuckDB instance
%sql duckdb:///

# Enable DuckDB to query remote files (i.e., S3)
%%sql
INSTALL httpfs;
LOAD httpfs;

# Configure S3 access keys
SET s3_region = '...';
SET s3_access_key_id = '...';
SET s3_secret_access_key = '...';

# Connect to a remote Postgres database
ATTACH 'dbname=DATABASE user=USER host=HOST password=PASSWORD connect_timeout=10' AS postgres (TYPE postgres, READ_ONLY);

# Execute a query and store it in a dataframe
%%sql
df << SELECT
t1.movieId,
t1.title,
t1.genres,
t2.userId,
t2.rating,
t3.tag
# query a table in Postgres
FROM postgres.public.movies AS t1
# join with a table from DuckDB
INNER JOIN ratings AS t2 USING (movieId)
# join with a JSON dataset from S3
INNER JOIN 's3://S3-BUCKET/tags.json' AS t3 USING (userId, movieId)

# And finally, reference the dataframe from another query
%%sql
by_genres << SELECT genres, COUNT(*) AS cnt
FROM df
GROUP BY ALL
ORDER BY 2 DESC
LIMIT 5;

# Or plot the transformed dataset
import plotly.express as px
fig = px.pie(by_genres,
values='cnt',
names='genres',
title='Top 5 movie genres')
fig.show()

Conclusion

This overview of DuckDB highlights its potential as a versatile, efficient, and user-friendly tool in the big data domain. As a relatively new entrant, it’s uniquely positioned to bridge gaps and enable solutions that align with the evolving needs of data engineers and software developers alike.

I am confident of its utility and versatility in dealing with the most common data processing use cases. I look forward to hearing about your experiences and insights with DuckDB!

Until next time — keep querying! 🦆

--

--

Mihai Bojin

Software Engineer at heart, Manager by day, Indie Hacker at night. Writing about DevOps, Software engineering, and Cloud computing. Opinions my own.