DuckDB Internals: Why Is DuckDB Fast? (Part 1)

376 points by marklit 3 days ago

smithclay 14 hours ago

If you're reading this and curious: consider writing a duckdb community extension* or contributing to an existing one*

duckdb is becoming a kind of data superglue between a lot of data ecosystems (GIS, observability, analytics, lakehouses, object storage, etc) that don't talk to each other typically, and it's worth checking out in 2026.

* https://github.com/duckdb/extension-template * https://duckdb.org/community_extensions/

aleda145 12 hours ago

I just started doing this last week!
I'm not very good at C++, but coupled with the extension template and codex I got a basic version of my extension working within an hour. Go for it!
pknerd 13 hours ago

Just curious whether one can earn money making these exts?
- faangguyindia 13 hours ago
  
  You can definately offer consultation or custom integration.
  - fg137 8 hours ago
    
    That's going to be a difficult business in this age unless you have some uniquely strong ideas and products.
    
    fartfeatures 7 hours ago
    
    I'm not convinced that is true. The JCB digger didn't put groundworkers out of business. A consultant that can get more done in a day is worth a lot more than one who can't. There is still skill required in wielding the tool of the day and that skill is marketable.
    
    fg137 3 hours ago
    
    Sorry, I don't think your non-software analogy that existed before 2025 helps at all in this specific topic about software consultancy.
  - pknerd 8 hours ago
    
    Thanks for your kind response. Could you guide further? Like businesses don't care about the tool/tech itself, how do I find and approach them, and for which niche.
    Thanks in Advance
    
    gortok 7 hours ago
    
    > Like businesses don't care about the tool/tech itself, how do I find and approach them, and for which niche.
    You probably don't realize this, but you're asking one of the hardest questions when starting a business, and one of the questions others are least likely to be able to answer for you.
    "finding" a niche, and connecting to the business folks inside that 'niche' is hard, and is inherently a personal journey.
    There's an old writing adage, "Write about what you know", and the same adage works in business: Do business with what you know.
    Your question goes into another issue that you have to resolve when building a business: going into a platform specialization necessarily means folks know about that platform or they know they need you to solve a problem they have with that platform.
    In general, there are two ways out of each problem:
    1. Build an ecosystem with DuckDB at its center that solves a business problem that a particular niche cares about. 2. Build a reputation solving problems with DuckDB that would attract those that know they have a problem with DuckDB.
    Honestly, best of luck here, becoming successful at business is hard if you're not already in tune with why folks buy and ensuring you're selling something they want to buy from you.
    
    lazarus01 5 hours ago
    
    I use to be in sales before I became an MLE.
    There is a theory called diffusion of innovation. The simple explanation is that there are 5 different cohorts of buyers. Early adopters, visionaries, pragmatists, conservatives and laggards. Early adopters and visionaries are risk takers, who will make bold moves to achieve order of magnitude results. This is called the early market, which represents 13% of the market. The pragmatists and conservatives make up the mainstream market which is about 70%.
    In order to get into the mainstream market, you need solid adoption from the early market.
    To choose a niche, you need to develop a solution that fits nicely into the buyers expectations for different types of market participants. There is the market alternative and product alternative. The market alternative is the solution that owns the highest proportion of market share. The product alternative is innovative tech that challenges superiority to the market alternative.
    You need to introduce a solution that fits in between those participants to stand out.
    To choose a solution, go to industry trade events and talk to people about high value problems that aren’t solved by current participants. That is the purpose of industry associations, to solve difficult problems.
    Visionaries and early adopters love new vendors. They will champion you through their organization if your solution will help them meet their goals.
    Good luck
    
    aabdi 3 hours ago
    
    Probably should tell guy to read the crossing the chasm book? Seems useful in this context.
    Define the smallest market possible or something like that. I’m not sales though.
vanyaland 6 hours ago

[flagged]

axegon_ 6 hours ago

I use duckdb HEAVILY at work and it's been a game changer. I'm sifting through terabytes of data multiple times a day, mixing, matching, updating, filtering, DuckDB is second to none. For anyone that hasn't used it: you are missing out.

xtracto 2 hours ago

This may be useful for somebody: We are also using DuckDB heavily at my workplace (we do Tax analytics of very large companies with huge amounts of data). We have certain DuckDB processes that happened in AWS infrastructure, where the data is saved in GP3 disks.
We didn't know that for GP3 disks, you can increase not only IOPS but also Read/Write Throughput [1] which by default is 125 MB/s. So by default we were not seeing the performance we expected.
Once we increased the throughput of the EBS, it was amazing. So if you are not seeing the performance you read about online when using DuckDB, it may be something like that.
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-p...
- pants2 2 hours ago
  
  This seems crazy low to me. AWS has default 3K IOPS and 125 MB/s throughput, meanwhile my Macbook Pro has 700K IOPS and 14.5GB/s throughput.
  Is Amazon running on super outdated legacy networking?
  - kristjansson 2 hours ago
    
    SAN vs local. Local NVME (“instance storage“) on AWS is wicked fast too, but live and dies with the instance

0xferruccio 14 hours ago

DuckDB is amazing for any sort of fast data analysis when the data is small enough that it can fit on your laptop

Recently at work I've been using it to analyse the Claude code sessions of every engineer at our company (that we upload to S3) and it's been extremely helpful to help us find gaps in devex and have clear metrics to back up the impact of fixing them

Another thing it's been really useful for has been getting metrics on Claude skills usage and then dive into use-cases by looking at the transcripts

Other engineers that had never touched DuckDB were so impressed with how easy it is for AI agents to write queries on our dataset

skeeter2020 3 hours ago

>> DuckDB is amazing for any sort of fast data analysis when the data is small enough that it can fit on your laptop
I agree, and the dirty (not so) secret big data providers like Snowflake try to hide: the majority of your work is not big data and WILL fit on your local machine. My last company was spending $2M/yr on contract with Snowflake, and another million between Fivetran and Matillion. Of the 1200 clients using analytics maybe 2 had enough data to warrant "infinite scalability" and a dozen wanted Snowflake because they already had corporate warehouses in Snowflake (they probably didn't need it either). Turns out the Extract and Load could be handled by bog-standard C# code and a bunch of SQL, while almost everyone was better off with a DuckDB database running locally, often in the browser. You've probably heard YAGNI before (You Ain't Gonna Need It) but it's even more likely with "Big Data". #SmallDataConvert
- tomjakubowski 3 hours ago
  
  Folks have been beating this drum for as long as I've worked in software, dating to the Hadoop era, and it remains true today. So much of "big data" only appears big because it's poorly stored, or is represented wastefully (in persistent storage or in memory).
zurfer 12 hours ago

Like sqlite, duckdb is underappreciated as a production database. You can totally run it on servers or even "serverless" and do some heavy data transformations or with the right server size work with large scale datasets (up to a TB compressed seems fine).
- ndr 10 hours ago
  
  This. I've recently used both duckdb and sqlite to power a dashboard for a small restaurant of a family member. It converts all their sales to a very tiny parquet files, daily.
  The file fits in memory and can do all sort of computation in the browser itself. The backend is extremely simple, it just loads the JS and serves the parquet files.
  It was also trivial to let the owner do their own queries, just give the schema to an LLM and let it use the charting library, no data hallucinations. If they need it in the dashboard they can either use that one or ask me to review that query.
  To be honest, given how simple some things became, it's been really fun to work on.
  - skeeter2020 3 hours ago
    
    Similar experience here. The best thing I've built in a long time is replacing a complex (and scary) permissions system built on top of Snowflake with single role duckdb databases that - aside from no longer worrying about bugs leaking data across roles - are more performant, timely and flexible. Combined with the use of AI this is the way forward IMO.
    At the other end of the spectrum, working with random data on "what if?" and exploration tasks with DuckDB is fun again. it's so straightforward and fast, with tools and functions for pretty much everything.
  - kristjansson 2 hours ago
    
    > no data hallucinations
    Dangerous thing to assert. It’ll happily run SQL that works, but doesn’t necessarily correspond to intentions or unstated assumptions about the data.
    
    ndr 2 hours ago
    
    Of course I meant that it won't make data up.
    It can only emit SQL and the json spec of the chart.
    Since shipping I've reviewed dozens of queries and charts it produces answering the user. I'm yet to catch sonnet off guard.
  - noworriesnate 3 hours ago
    
    I have a a theory that LLMs are going to be the death knell of big SaaS. It's so much harder to build and maintain an massive SaaS that does 80% of what 80% of your customers want, than it is to build something small and simple that does 100% of what one customer wants.
    
    kristjansson 2 hours ago
    
    Maybe once the model can administer and operate the service too.
    For now building the 10% of the SaaS that you need still leaves you operating 100% of a new service/process
  - wills_forward 3 hours ago
    
    neat
- tomnipotent 12 hours ago
  
  Not to mention it can query across heterogeneous sources, so the same query can use a duckdb table, sqlite, csv, and parquet (including predicate pushdown).
tosh 6 hours ago

Agree, in addition to that DuckDB also works quite well for data that is too big to fit in memory or on the machine DuckDB is on (predicate push down, out of core processing, …).
cyanide911 12 hours ago

>Recently at work I've been using it to analyse the Claude code sessions of every engineer at our company (that we upload to S3) and it's been extremely helpful to help us find gaps in devex and have clear metrics to back up the impact of fixing them
Nice! How do you set things up so that your engineers's claude code sessions upload to S3? Thanks for the help in advance
- _boffin_ 6 hours ago
  
  Probably on a business / Enterprise plan, which has managed settings and also telemetry export. Give it a collector endpoint to export to and then have collector send to s3.
- pimeys 8 hours ago
  
  If you use OpenCode, the sessions are all in a local sqlite database. After lunch I'm pushing one of my agents to crunch some data from that using duckdb...
ashu1461 8 hours ago

Can you please expand more on the claude analysis part. What exactly you analysed and what outcome it helped with ?
- ryanchants 5 hours ago
  
  Not who you responded to, but I've been working on cctx. It's an open source tool for analyzing claude code sessions to see where things went wrong(tool failure loops, bloated context, and the like).
  https://github.com/jacquardlabs/cctx
creatorpilot 10 hours ago

[flagged]

roywiggins 5 minutes ago

ai; dr

nikolay 29 minutes ago

DuckDB is a major RDBMS, yet there's almost no support for it from DB migration frameworks, etc.

anitil 15 hours ago

DuckDb makes so much of my life easier, though I've never used it for large problems. The ability to run `select * from 'data.json'` is just lovely. The fact that it's also a powerhouse is so impressive, I'd usually expect a project to be good at small problems (like mine) xor large problems, but not both

medvezhenok 14 hours ago

Yup. And an extra benefit that you can treat any file like a table, so you can also do something like

  UPDATE my_table
  SET x = file1.x,
      y = file2.y
  FROM 'first_file.csv' file1
  LEFT JOIN 's3://my_bucket/second_file.parquet' file2
    ON file1.id = file2.id
  WHERE mytable.id = file1.id;

steve_adams_86 15 hours ago

> DuckDB has received widespread adoption because it's just so damn easy to use.

This was a major factor in my initial adoption. Since then it has stuck because it’s also absurdly capable, versatile, and fast.

If it wasn’t so easy to use I suspect I wouldn’t have adopted it when I did. The ergonomics are crazy. It still impresses me regularly.

jkubicek 15 hours ago

What do you use it for? I’m perpetually interested in using DuckDB, but it doesn’t seem to do anything I need.
- medvezhenok 14 hours ago
  
  Basically like a locally hosted Snowflake - it only shines if you have enough data to analyze (100 MB - 100 GB is probably the sweet-spot range - less than that and the benefits are small, more than that and you risk flying too close to the sun with memory usage).
  It has connectors for Postgres & other stores, so I find it faster to connect to a Postgres instance, pull all of the data from a table (even if the table is like 50GB - if you have 30 cores on the machine it will pull from Postgres using 30 cores in parallel, so it will only take a minute or two) - and then any analytical queries on the data are 10+ times faster in DuckDB over native Postgres (GROUP BY, regexp_replace, count(distinct...) etc).
  - sceadu 4 hours ago
    
    In my experience it works OK with spilling to disk so I haven't had too much of a concern with memory usage... previously I had issues with it OOM'ing and failing (or maybe this was a skill issue?), but haven't had that happen recently.
- orthoxerox 14 hours ago
  
  All kinds of data processing. For example, you download a million rows of metrics and load them in Excel to build pivot tables. It works, but now it's a billion rows. If you know SQL, it's a snap to point DuckDB at the source CSV or JSON and get the results in a second.
- skeeter2020 3 hours ago
  
  the taste that hooked me: the next time you have a bunch of json data, csvs or other data - local or remote - and someone wants some charts (for me it was "productivity" metrics from Jira combined with a bunch of other stuff). First it is very easy/fast to load this data; DuckDB has a very liberal parsing engine and good connectors. Second, I used to worry a lot about my table definitions and cleaning data before structuring it. Not anymore! With DuckDB I find myself iteratively transforming data and creating new tables, combining sources, converting columns, slicing/dicing/rotating. It's very easy to "remix" data and there are functions or extensions for everything you might want to do. There's so little friction to get started that I've found it just naturally becomes the multitool in my toolbox.
  THis will give you some experience and you'll start to see applicable problem spaces for DuckDB in product areas, especially anything with BI or DW.
- steve_adams_86 14 hours ago
  
  The most interesting use case lately has been using it as the transformation and validation engine for a CLI that handles scientific data. Some datasets are small and could have been handled at the application layer, but some are quite massive (especially genomic data). DuckDB bundles with the CLI and travels around any platform, is super lightweight, allows for easily running in CI, on a user’s machine, against datasets of all sizes, and so on.
  There are other embeddable options out there but I found DuckDb fit better for the potentially massive datasets, and also because of how naturally it ingests the types of data we work with, some of its unique features, and how trivial it was to learn and integrate with the project.
  Otherwise I use it almost daily for doing guardrailed data exploration with LLMs. I prefer SQL over random DSLs in AWS or Sentry or what have you. I’ll ingest the data I need and just run SQL against it. I mentioned in another comment that I’ll tend to store more useful data (especially data I export routinely, like infra cost reports) on S3 and use a Rill instance to do basic exploration in a GUI (it will query remote parquet files).
- raihansaputra 9 hours ago
  
  throwing in my 2 cents: It just replaced pandas for me. It's just so much easier to write sql against csv/json/whatever format data in jupyter/marimo notebooks through duckdb rather than reasoning through pandas. SQL is far more natural for me, and agents also work through it easily.
  - skeeter2020 2 hours ago
    
    really learning SQL (syntax, boolean logic, how queries are broken down, etc) way back in uni has been the single biggest pay-off of my entire career.
- wiredfool 8 hours ago
  
  Few different use cases, other than just a general swiss army knife for vaguely tabular data.
  * fastapi + duckdb + parquet for the backend for a relatively high profile website
  * wasm duckdb + react for a few visualization websites
  * yaml driven ETL from lots of sources, principally ugly spreadsheets, into usable data. More T than E or L really
- edweis 14 hours ago
  
  I personally find it useful to search logs with AI
  - steve_adams_86 14 hours ago
    
    Yes, it’s amazing for giving rails and structure to data so you can be sure an LLM is making more sense than it might with grep and jq. It also allows a little more sanity at scale with jobs like this. You can get pretty crazy with parquet in S3 with an engine like duckdb. And it’s dirt cheap to keep that stuff hanging around for future reference and sanity checking your understanding of things.
    For data I reference frequently, and especially which I know will grow over time, I’ve started using Rill because it makes ad-hoc exploration very smooth and low-friction.
    My process tends to be something like:
    1. Explore logs or some other at least somewhat structured dataset
    2. Use Claude to find useful patterns and determine how I might benefit from this data in ways I wasn’t yet aware
    3. See how often it’s useful for decision making
    4. If it’s frequently useful, formalize it as a view in my Rill instance and refine the models to maximize their utility
- hilariously 8 hours ago
  
  Honestly as someone whose super SQL focused and spends less time focusing on python I just can write generic SQL to transform things in memory to do whatever I want, its very helpful for that.

M4R5H4LL an hour ago

I’ve enjoyed reading that article a lot. I have used DuckDB as a tool in a code assistant and I am very happy with what I see so far.

romaniv 5 hours ago

It's an interesting project, but the discussion on HN looks weird. It gets brought up every few weeks[1] and everyone just spams comments with messages about how "fast" it is.

DuckDB is fast for some specific workloads. If you use it for most other things, it is at least an order of magnitude slower than SQLite. It also has some limitations in terms of what SQL it will currently run (e.g. I immediately ran into an issue with recursive queries). That will probably get better with time.

[1] If you search HN for "sqlite" and "duckdb" you get 4,310 hits and 2,398 hits respectively. That's a very heavy skew, considering SQLite is everywhere and had been around for a quarter century, while DuckDB effectively appeared on the scene two years ago.

skeeter2020 3 hours ago

I'm going to sound like a broken record but... different use cases. They're analogous in the comparison "sqlite for analytics" but completely different architectures and implementations. Part of this is the fault of the developers, but I feel they were trying to highlight the similar focus on in-process, zero dependencies, simplicity and test coverage - not a direct "vs" comparison. IME recursive queries in analytical workflows are not very common; they typically work against the fundamental data layout on disk.
SQLite is awesome and I would love to see more posts about it, but the reality is one of the major reasons it's awesome is the no-drama/stability/it just works. DuckDB is seeing a lot of development on many fronts so there's a lot more to learn and talk about right now.
xearl an hour ago

> while DuckDB effectively appeared on the scene two years ago.
duckdb is ~7 years old by now. it was quite popular long before it became 1.0. heck, even motherduck has been founded 4 years ago.
skotobaza 3 hours ago

> DuckDB is fast for some specific workloads
Yes, it's specifically promoted as DBMS for OLAP workload. And it's usually compared to ClickHouse, another analytical DBMS. So people who use it know why it's good.

willtemperley 11 hours ago

The one huge caveat for anyone that cannot use dynamic linking e.g. in an AppStore context, DuckDB isn’t a great choice. It’s very hard to statically link extensions.

This is where Arrow wins I think. Arrow CPP for example has very portable builds and the C interface is very usable for building bindings.

DuckDB is excellent, but it’s more a black box than a library.

Edit: after a conversation with a robot, it would seem that the DuckDB and ArrowCPP C APIs are complimentary, so it's very possible to have Arrow CPP and DuckDB to coexist in an app, each with its own strength. Arrow CPP doen't have a simple SQL story for example.

tobilg 10 hours ago

I can't confirm this, I have several instances which have statically linked extensions...
- willtemperley 8 hours ago
  
  I didn't say it was impossible but there are specific situations in which it's very difficult. What I was unable to resolve was getting DuckDB to statically link with httpfs and I'm not the only one [1].
  So being more specific, I don't know how I could get a static build of DuckDB to work with Parquet and httpfs (i.e. query S3) working in an app store environment. It was a day's work to get Arrow CPP to call back into Swift for the transport layer.
  However I do now see that DuckDB recently provided an extension point for providing your own transport layer, so my point might well be moot for that reason [2].
  [1] https://github.com/duckdb/duckdb/issues/16190 [2] https://github.com/duckdb/duckdb/pull/17464

ilia-a 6 hours ago

DuckDB is really neat, recently added PDO interface for it for PHP https://github.com/iliaal/pdo_duckdb

Still a bit raw, but getting there

jdw64 14 hours ago

The data scientists I work with use this. Why do they use it? I don't really know much about it, but I've noticed they use it quite often. I mainly use MySQL or PostgreSQL. What are the advantages of DuckDB? It seems like they usually use it as an alternative to Pandas.

medvezhenok 14 hours ago

DuckDB has been probably my most used tool in 2026 - if you're comfortable with SQL it's incredible at quickly prototyping and slicing / dicing data.
I do a lot of experiments with regexes, and if you get used to the RE2 syntax that DuckDB uses, you can see up to 10-100x uplift in terms of speed compared to Postgres on things like regexp_matches(), regexp_extract(), etc (depending on query/table/machine specifics). It has quite powerful scripting with custom Macros, fixes a lot of annoyances of SQL for me compared to Postgres.
I think if you have access to a machine with a lot of RAM / cores and a beefy data set, then it's basically like a RAMdisk version of Snowflake running locally on your machine.
(and of course the fact that it makes it convenient to read CSV/parquet, read/write from S3, etc) - it's a very ergonomic tool.
- jdw64 14 hours ago
  
  Thank you for your kind reply. I should look into it too. In my case, knowing various libraries is directly related to my livelihood. Have a good day.
Demiurge 14 hours ago

Here is the thing, it’s a write only single file format. If you need to run analytical queries it’s optimized for reading, you just open a file and query for the parts you want. If you have multiple clients that read and write data to the database, you should use postgresql.
It’s not really a database in the traditional sense, there is no ACID complexity, it’s a library that lets use write SQL to query a tabular data file.
bdcravens 14 hours ago

Primarily the ability to work directly with data in its native format (CSV for example) without needing ETL.
- throwaway7783 14 hours ago
  
  How does this work in a production setup? Can this be set up like a server, or is it mostly for individual users to play around with data?
  - orthoxerox 14 hours ago
    
    The idea is that you treat data storage and data processing as two distinct tasks. You have your data in S3 or HDFS or a local directory and you run DuckDB on whatever single-node compute you have: a local machine or a container in a cluster.
    There are companies that write cluster computing engines with duckdb as the byte-cruncher at their heart, but usually it's more like NumPy, Pandas or Polars on steroids. Or SQLite, but for running OLAP queries.
  - DanielHB 11 hours ago
    
    In my previous job (working with electric vehicles) we had a AWS batch job that pulled all data from S3[1] into containers (one container per vehicle) and then push that data into duckdb then run some basic queries and data analysis.
    The key thing is that this scaled horizontally pretty much forever, since each vehicle had a fixed amount of data per year we could tightly control the performance characteristics of the analysis. Adding more vehicles didn't make things slower, just linearly more expensive.
    I vaguely remember the data from those containers also being used to process some aggregate analysis (like the each vehicle-container would output some data that would be consumed by another job that did aggregates). But I don't remember the specifics.
    [1]: I believe we used JSONL or parquet format, but I didn't work in that part of the stack directly
  - blackoil 14 hours ago
    
    It is an OLAP db. So you can have a pipeline storing data in parquet files in S3. And then use DuckDB to directly query on it.
- jdw64 14 hours ago
  
  Then it definitely makes sense. Scientists usually handle a lot of CSV files. Thank you

pedromlsreis 4 hours ago

DuckDB is a great example of how far you can get by removing unnecessary layers... Columnar layout and vectorized execution is a powerful combination for OLAP workloads.

skeeter2020 2 hours ago

the CSV parser is really good too. Anyone who's struggled with consuming CSV/TSV knows this is not trivial. DuckDB uses hueristics to be very liberal in what it accepts without crapping out like many parsers.

snissn 13 hours ago

I'm just curious - is duckdb too slow for people? This benchmark from clickhouse shows it being fairly slow compared to some options: https://jsonbench.com/

conradkay 11 hours ago

That's for their `JSON` data types. In DuckDB it's just a string meaning lots of queries will have to do JSON parsing on every row, but the inserts are very fast. Definitely a bit of a footgun and when you actually just need STRUCT or MAP.
There's a talk about ClickHouse's approach from its creator: https://www.youtube.com/watch?v=xHj9mysh0GI , but the gist is that it maintains (sub)columns to store different paths in the JSON
In other ways DuckDB has very good JSON support, like you can do `CREATE TABLE name AS `SELECT * FROM 'data.json';` and it'll infer the schema when possible.
- hilariously 8 hours ago
  
  Not just that, you can SELECT * FROM read_json('folder/*') and read an entire schematically consistent folder of json files, or parquet files, or basically anything it supports with its various functions. Duckdb is insanely useful.

mcv 10 hours ago

Is everything becoming columnar? Parquet stores data per column instead of per row because it improves compression. I get that. Arrow apparently is columnar, and now DuckDB also gets its efficiency by treating data as columns instead of rows?

I still need to wrap my head around how that works, but it's a fascinating development.

levanten 10 hours ago

It depends on your task. In analytics where you need to scan lots of data points within few columns, then columnar storage is very much the best. But for transactional workloads where you have to deal with specific entities, row based would be more advantageous. There are hybrid systems that try to be both at the same time but in my experience they end not doing either very well.
- brandmeyer 6 hours ago
  
  Some day we'll get CREATE TABLE ... ( ... STORAGE ORDER COLUMN MAJOR) to have our transactional cake on the tables that need it and eat our analytics cake on the tables that need that.
  But until then, separate tools for separate purposes isn't a bad place to be when those tools are both fantastic.
- hilariously 8 hours ago
  
  Often used to be referred to as HTAP, and yeah in most data engineering its moving things from OLTP to OLAP forms, and OLAP pretty much always benefit from columnar compression for aggregations and rollups.
charlieflowers an hour ago

BTW, columnar is very similar to struct of arrays (SOA) and some of the reasons it works well overlap with SOA.
skeeter2020 2 hours ago

compression is a side effect but not really the goal. To simplify, analytical queries often filter on a specific column value, and if these are laid out contiguously it makes disk-level reads much faster than rows that would involve read-skip-read-etc. In transactional systems data is typically written as rows though, so that's likely slower in a columnar system. As a general rule, heavy read workflows with known access patterns is going to benefit from a columnar layout.
squirrellous 5 hours ago

Those three things you mentioned kind of live in the same niche - offline data storage and querying. In that world yes everything has become columnar since it’s just better. Row-oriented is still the solution for online streaming use cases.

Panzerschrek 14 hours ago

If DuckDB is so fast and has no data transfer overheads, does it need all this typical SQL machinery with filtering and joining via SELECT queries? Wouldn't it be simpler and faster to return all data to the caller code (all table rows, but only requested columns) and let it perform all other necessary data processing logic?

jauco 13 hours ago

You’d end up implementing your own home grown version of hash join and query pushdown (skipping parquet row groups entirely) etc and your own home grown heuristics in selecting the right one (planning)
Which can outperform a generic solution like this of course, but it’s not less work to make faster for most cases.
Also duckdb can give you access to an in memory representation (e.g. `fetch_arrow_table()`) so you have less “language data structure wrapping” overhead. And you can do filtering yourself on that. In most cases the “where” statements will win though.
efromvt 8 hours ago

The SELECT machinery is the product with databases! SQL often the shortest description of the processing logic, and the database has an efficient local execution engine that can prune/reduce data read based on the plan. Very hard to match in app, especially when joins get involved.

bunsenhoneydew 12 hours ago

DuckDB is a fantastic piece of tech. One of the best, if not the best, I’ve found in several years.

tdhz77 3 hours ago

Is duckdb multi region active active?

pknerd 14 hours ago

FTA:

> ..In-process means there's no server. You don't connect to DuckDB; you load it as a library inside your program, the same way you'd load NumPy or Polars

Does it mean it can perform all statistical computations as well if I want to use for algo trading?

thefourthchime 15 hours ago

I’m a huge fan, I’ve been wanting to know into the internals. Look forward to digging in.

codingbear 14 hours ago

duckdb is so nice coupled with claude code. It extensive file support and some very interesting decisions on local caching data (like from S3 or snowflake) makes it easy to slice and dice almost any kind of tabular data.

blackoil 14 hours ago

> duckdb is so nice coupled with claude code
Can you expand upon it? You mean claude code use it to store its memory/state or it can do business queries using DuckDB.
- medvezhenok 14 hours ago
  
  Claude code can write exploratory queries for you to give you a quick rundown on the shape of the data set, frequencies, missing values, etc etc (without having to load it into a more persistent data store or writing custom python scripts). I also find SQL snippets inherently more re-usable than custom python code.
  You can also write a skill that CC can re-use if you're analyzing a lot of similar data sets with minor variance.

holografix 14 hours ago

Why is DuckDB so popular when one can use Python + Pandas?

Better perf + SQL is that mostly it?

refactor_master 13 hours ago
The better question is, why is DuckDB so popular when one can use Polars which has a sane, lintable, typesafe API compared to the mess that is SQL:
```
  WITH lagged AS (
      SELECT 
          *, 
          LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
      FROM events
  ),
  sessions AS (
      SELECT 
          *, 
          SUM(COALESCE((date_diff('minute', prev_time, event_time) > 30)::INT, 1)) 
              OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
      FROM lagged
  )
  SELECT
      user_id,
      session_id,
      MIN(event_time) AS session_start,
      MAX(event_time) AS session_end,
      COUNT(*) AS event_count
  FROM sessions
  GROUP BY ALL
  ORDER BY user_id, session_start;
```
vs
```
  result = (
      df.sort(["user_id", "event_time"])
      .with_columns(
          session_id=(
              pl.when(pl.col("event_time").diff().is_null())
              .then(1)
              .when(pl.col("event_time").diff().dt.total_minutes() > 30)
              .then(1)
              .otherwise(0)
              .cum_sum()
              .over("user_id")
          )
      )
      .group_by(["user_id", "session_id"])
      .agg(
          session_start=pl.col("event_time").min(),
          session_end=pl.col("event_time").max(),
          event_count=pl.col("event_time").count(),
      )
      .sort(["user_id", "session_start"])
  )
```
- coldtea 9 hours ago
  
  Precisely to avoid the custom NIH Polars API, and use SQL which works everywhere (yes, inconsistencies aside).
- brikym 12 hours ago
  
  Polars typesafe? It doesn't show you any errors until runtime right? Kusto query language is the best I've seen at type safety and I wish open source DBs would steal some ideas from it.
- SmirkingRevenge 16 minutes ago
  
  Polars and duckdb interoperate so it's not really one or the other.
  Other dataframe libraries work directly on duckdb also, like Ibis.
  FWIW, I think SQLAlchemy does also.
  In any case though, I find both the SQL and the code there equally readable.
- willj 3 hours ago
  
  How can you tell if someone is a polars fan? Don’t worry, they’ll tell you. :)
- sceadu 4 hours ago
  
  I really hate SQL as a language for doing data-related tasks, but the core tech for it is often much better in terms of reliability (and is close to having performance parity IME)... specifically for working with larger-than-memory datasets. I think the team at Polars is working on improving larger-than-memory operations though.
- porridgeraisin 12 hours ago
  
  I understand the linting aspect but not gonna lie I understood the first one immediately way more than the 2nd one due to knowing SQL well.
- homebessguy 12 hours ago
  
  "Languages come and go, but SQL is forever"
- IshKebab 11 hours ago
  
  That does look nicer if you have a Parquet file and want to analyze it. But DuckDB is also a database - if you want a persistent, reliable and mutable data store I don't think Polars would be suitable would it? (Genuine question - you sound like an expert and I'm not.)
  - tomjakubowski 4 hours ago
    
    The other thing DuckDB does quite well, and which is out of scope for polars, is its keeping of an absolute zoo of external data sources you can query with SQL from the same database client. It's an excellent data warehousing tool.
estetlinus 13 hours ago

Why would you prefer Python and Pandas over good old SQL? Pandas is so verbose and hard to debug, most of the times struggle to be performant on small datasets.
SQL has been around since the dawn of databases. I am happy to see a trend away from pandas.
paytonjjones 13 hours ago

Pandas has lots and lots of problems.
Performance is definitely one of them, but it also has inconsistent and duplicated methods, inconsistent defaults (e.g. some methods are inplace by default), copy by reference issues, I could go on.
It was an early winner in an extremely popular language. That's really the main thing going for it, but alternatives have been a long time coming.
RobinL 13 hours ago

I wrote a blog post a while back to address this question here: https://www.robinlinacre.com/recommend_duckdb/

f311a 13 hours ago

I wish this article was not LLM written

sigbottle 6 hours ago

What the fuck.

I've never been that strong of an engineer. TIL that at one of my internships I was building DuckDB but for the company's private use cases. Well, trying to anyways. I didn't really get the whole picture, the pieces did not fit into place.

Didn't get the return offer obviously, probably because I didn't make the connection (or really a coherent narrative of what I was building). RIP. You live and you learn, I guess.

bunbun69 4 hours ago

Holy LLM slop article…

explodes 3 hours ago

> You open a connection, send SQL over TCP (a protocol to send data over a network),
Thanks for the clarification, LLM!!
- greenavocado an hour ago
  
  To be fair, most managers don't know what TCP is

pknerd 13 hours ago

umm can we say it can replace SQLite?

pierregillesl 7 hours ago

I wouldn’t see it as a replacement.
SQLite and DuckDB solve different problems and actually complement each other quite well. SQLite is excellent for transactional workloads (OLTP), while DuckDB shines for analytical workloads (OLAP), especially time-series data and aggregations.
We’ve been using both side-by-side in an open-source project for about two years: SQLite for configuration and transactional data, DuckDB for historical sensor data and analytics. So far, it’s been a very good combination.
- skeeter2020 2 hours ago
  
  DuckDB kind of created this false comparison by their own early positioning, but I've tried to charitably interpret it as modeling the spirit and motivations of SQLite, not literally being "the SQLite for Analytics". Aside from both being in-process databases they are very different.
- tomjakubowski 2 hours ago
  
  You can even use DuckDB to query SQLite :^)
3eb7988a1663 13 hours ago

OLAP vs OLTP. Sure you could use one for the other, but they have ideal use cases. Updating a single record in SQLite is going to be more efficient than doing the same in DuckDB.
steve_adams_86 13 hours ago

They seem similar at a glance but they’re quite different. You can think of SQLite as a transactional database while DuckDB is better used as an analytical database.
I can see applications having valid reasons to use both. You can use SQLite as the catalog in duck lake systems, for example. SQLite is your metadata record, DuckDB is your ingestion/scanning/aggregating/joining engine.

pierregillesl 7 hours ago

[dead]

kunal183 10 hours ago

[flagged]

Omniloop 10 hours ago

[flagged]

charanmilan 10 hours ago

[flagged]

gordonwu8383 4 hours ago

[flagged]

gordonwu8383 4 hours ago

[dead]

gordonwu8383 7 hours ago

[dead]

tobyhinloopen 11 hours ago

The only reason I know and use DuckDB is because my (internal, private-use-only, experimental) vibe coded projects use it a ton. I didn't pick it - LLMs did. Until this article, I wasn't aware of what it actually is capable of.

Most of these projects use JSON(L) files for storage, and duckdb to process them.

mootothemax 11 hours ago

If you haven’t investigated storing in parquet format - and it doesn’t break other consumers that need your jsonl formatted files - it could be worth trialling for your use case. You’ll see vastly smaller file sizes (even more so if you use zstd compression), and querying time will shoot up.
Usual caveats apply, but as a general rule it’s held up well for me. Only downside is that inspecting the results moves from vi on the output file to duckdb and a select * from.
- medvezhenok 7 hours ago
  
  Yup, especially data backups (although I wouldn't store critical backups like this, only nice-to-have ones). One minor note is that parquet file sizes / compressed sizes can be sensitive to ordering, so you can try different sort orders to get optimal compression.
  I found with using various tricks I can get the zstd parquet to be up to 10x (or more) smaller than an equivalent Postgres table - but obviously the exact compression ratios will depend on the kind of data you have and how well your Postgres table is normalized.
- tobyhinloopen 9 hours ago
  
  I'll 100% try DuckDB in more serious projects where I would normally use Sqlite.
ai_fry_ur_brain 11 hours ago

What an incredible way to build software