This is a case where geometric mean deceives you - because the two really are asymmetric and “twice as fast” is worth less than “twice as slow”.
There is definitely no single magical number that can perfectly represent an entire set of numbers; there will always be cases where it isn't representative enough. In the request example you are mostly interested in total processing time, so it makes sense to use a metric based on addition. But you could also frame a similar scenario where halving the processing time lets you handle twice as many items in the same duration. In that case a ratio-based, multiplicative view might be more appropriate.
Sure — but the arithmetic mean also captures that case: if you only halve the time, it will also report that change accurately.
What we’re handling is the case where you have split outcomes — and there the arithmetic and geometric mean disagree, so we can ask which better reflects reality.
I’m not saying the geometric mean is always wrong — but it is in this case.
A case where it does make sense: what happens when your stock halves in value and then doubles in value?
In general, geometric mean is appropriate where effects are compounding (eg, two price changes to the same stock) but not when we’re combining (requests are handled differently). Two benchmarks is more combining (do task A then task B), rather than compounding.
The geometric mean of n numbers is the n-th root of the product of all numbers. The mean square error is the sum of the squares of all numbers, divided by n. (I.e. the arithmetic mean of the squares.) They're not the same.
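A tiny illustrative check in plain Python, using the 5 and 20 from the example above (my numbers, just to make the difference concrete):

    # Geometric mean vs. root-mean-square for the same two numbers:
    # they are clearly different quantities.
    import math

    xs = [5, 20]
    geo = math.prod(xs) ** (1 / len(xs))               # n-th root of the product -> 10.0
    rms = math.sqrt(sum(x * x for x in xs) / len(xs))  # sqrt of the mean of squares -> ~14.6
    print(geo, rms)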
I'm not gonna edit what I wrote, but you are interpreting it way too literally. I was not describing the implementation of anything, I was just giving a link that explains why thinking about things in terms of area (geometry) is popular in stats. It's a bit like the epiphany that histograms don't need to be bars of equal width.
It's not the point of the blog post, but I love the fact that the author's 2012 MacBook Pro is still useable. I can't imagine there are too many Dell laptops from that era still alive and kicking.
The machine from the article - a 2012 MBP Retina with 16 GB memory and 2.6 GHz i7 - had cost $2999 in the US (and significantly more in most of the rest of the world) at release. That's around $4200 today adjusting for inflation. You won't see many Dell laptops with that sort of price tag.
I have worked for half a dozen companies, all swearing up and down they had big data. Meaningfully, one customer had 100TB of logs and another had 10TB of stuff; everyone else, once you actually thought about it properly and removed the utter trash, was really under 10TB.
Also - SQLite would have been totally fine for these queries a decade ago or more (just slower) - I messed with 10GB+ datasets with it more than 10 years ago.
This has the same energy as the article "Command-line Tools can be 235x Faster than your Hadoop Cluster" [1]
[1] - https://adamdrake.com/command-line-tools-can-be-235x-faster-...
Ugh, I have joined a big data team. 99% of the feeds are less than a few GB, yet we have to use Scala and Spark. It's so slow to develop and slow to run.
I can’t believe anyone would write Scala at this point
I worked at a Scala shop about 15 years ago. It was terrible. Everyone had their own "dialect" of features they used (kinda like C++). The tooling was the worst (Eclipse's Scala plugin was especially awful, IntelliJ's was okay). The compiler was slow.
I'm assuming it's better now?
a) Scala, being a JVM language, is one of the fastest around. Much faster than, say, Python.
b) How large are the other 1% of feeds, and how big are the total joined datasets? Because ultimately that is what you build platforms for, not the simple use cases.
1) Yes, Scala and the JVM are fast. If we could just use that to clean up a feed on a single box, that would be great. The problem is that calling the Spark API creates a lot of complexity for developers and a runtime platform which is super slow. 2) Yes, for the few feeds that are a TB we need Spark. The platform really just loads from Hadoop, transforms, then saves back again.
a) You can easily run Spark jobs on a single box. Just set executors = 1.
b) The reason centralised clusters exist is because you can't have dozens/hundreds of data engineers/scientists all copying company data onto their laptop, causing support headaches because they can't install X library and making productionising impossible. There are bigger concerns than your personal productivity.
> a) You can easily run Spark jobs on a single box. Just set executors = 1.
Sure, but why would you do this? Just using pandas or DuckDB or even bash scripts makes your life much easier than having to deal with Spark.
For when you need more executors without rewriting your logic.
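For what it's worth, a minimal sketch of that idea (assuming PySpark; the paths and column names are made up, and only the master URL/deploy config would change between a laptop and a cluster):

    # Minimal sketch: the same Spark job runs locally or on a cluster,
    # only the master URL / deployment config changes. Paths are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .master("local[*]")          # swap for the cluster master when you need real executors
             .appName("single-box-feed-cleanup")
             .getOrCreate())

    df = spark.read.parquet("/data/feed.parquet")      # illustrative path
    cleaned = (df.dropDuplicates(["id"])
                 .filter(F.col("amount") > 0))
    cleaned.write.mode("overwrite").parquet("/data/feed_clean.parquet")
    spark.stop()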
Using a Python solution like Dask might actually be better, because you can work with all of the Python data frameworks and tools, but you can also easily scale it if you need it without having to step into the Spark world.
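For example, a rough sketch with Dask (paths and column names invented; the same code scales out by attaching a distributed client):

    # Sketch: pandas-like API that runs on one core, all cores, or a cluster
    # by swapping the scheduler -- no Spark involved.
    import dask.dataframe as dd

    df = dd.read_parquet("/data/feeds/*.parquet")     # lazy, out-of-core read
    daily = df.groupby("feed_id")["bytes"].sum()      # still lazy
    print(daily.compute(scheduler="threads"))         # runs locally; a distributed
                                                      # Client() would scale it out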
Re: b. This is a place where remote standard dev environments are a boon. I'm not going to give each dev a terabyte of RAM, but a terabyte to share with a reservation mechanism understanding that contention for the full resource is low? Yes, please.
But can you justify Scala existing at all in 2025? I think it pushed boundaries but ultimately failed as a language worth adopting anymore.
Absolutely.
a) It is one of the only languages you can write your entire app in, i.e. it supports compiling to JavaScript, the JVM and LLVM.
b) It has the only formally proven type system of any language.
c) It is the innovation language. Many of the concepts that are now standard in other languages had their implementations borrowed from Scala. And it is continuing to innovate with libraries like Gears (https://github.com/lampepfl/gears), which does async without function colouring, and compiler additions like resource capabilities.
I’m sorry, but these are extremely weak arguments, and I would contend Scala caused more harm than good on all of them.
PySpark is a wrapper, so Scala is unnecessary and boggy.
PySpark is great, except for UDF performance. This gap means that Scala is helpful for some Spark edge cases, like column-level encryption/decryption with UDFs.
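For context, the pain point is per-row Python UDFs, roughly like this sketch (the "encryption" here is a placeholder, not real crypto); every row crosses the JVM/Python boundary, which is exactly what a Scala UDF avoids:

    # Sketch of a column-level "encryption" UDF in PySpark. Each row round-trips
    # between the JVM and a Python worker, which is the performance gap being
    # described; a Scala UDF stays inside the JVM. The transform is a placeholder.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    @F.udf(returnType=StringType())
    def pseudo_encrypt(value):
        # placeholder transform, NOT real crypto
        return value[::-1] if value is not None else None

    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
    df.withColumn("name_enc", pseudo_encrypt("name")).show()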
The R community has been hard at work on small data. I still highly prefer working on in-memory data in R; dplyr and data.table are elegant and fast.
The CRAN packages are all high quality: if the maintainer stops responding to emails for 2 months, your package is automatically removed. Most packages come from university professors who have been doing this their whole career.
A really big part of an in-memory, dataframe-centric workflow is how easy it is to do one step at a time and inspect the result.
With a database it is difficult to run a query, look at the result and then run a query on the result. To me, that is what is missing in replacing pandas/dplyr/polars with DuckDB.
I'm not sure I really follow. You can create new tables for any step if you want to do it entirely within the DB, but you can also just run DuckDB against your dataframes in memory.
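For instance, something like this works directly on a local dataframe (assuming the Python client; the dataframe and column names are invented):

    # Sketch: DuckDB querying a pandas dataframe that already lives in memory.
    import duckdb
    import pandas as pd

    df = pd.DataFrame({"city": ["NYC", "NYC", "LA"], "sales": [10, 20, 5]})

    # DuckDB's replacement scan picks up the local variable `df` by name.
    totals = duckdb.sql("SELECT city, SUM(sales) AS sales FROM df GROUP BY city").df()
    print(totals)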
You can, but then every step starts with a drop table if exists; insert into …
Or you nest your queries:
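(Roughly like this; a sketch with invented names, using DuckDB's Python client, which also lets you query a previous result by its variable name:)

    # Step-wise analysis: each result is itself something you can query again.
    import duckdb

    step1 = duckdb.sql("SELECT region, SUM(amount) AS total "
                       "FROM read_parquet('orders.parquet') GROUP BY region")  # illustrative file
    step2 = duckdb.sql("SELECT * FROM step1 WHERE total > 1000")  # queries the previous result
    print(step2)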
Intermediate steps won't be stored, but until your queries start taking a while to execute it's a nice way to do step-wise extension of an analysis. Edit: It's a rather neat and underestimated property of query results that you can query them in the next scope.
Or better yet, use CTEs: https://duckdb.org/docs/stable/sql/query_syntax/with.html
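A sketch of the same steps expressed as CTEs inside one statement (invented names again):

    # Same step-wise idea, kept as named subresults in a single query.
    import duckdb

    duckdb.sql("""
        WITH per_region AS (                       -- step 1
            SELECT region, SUM(amount) AS total
            FROM read_parquet('orders.parquet')    -- illustrative source
            GROUP BY region
        ),
        big_regions AS (                           -- step 2, built on step 1
            SELECT * FROM per_region WHERE total > 1000
        )
        SELECT * FROM big_regions ORDER BY total DESC
    """).show()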
In R, data sources, intermediate results, and final results are all dataframes (slight simplification). With DuckDB, to have the same consistency you need every layer and step to be a database table, not a data frame, which is awkward for the standard R user and use case.
You can also use duckplyr as a drop-in replacement for dplyr. It automatically falls back to dplyr for unsupported behavior, and for most operations is notably faster.
Data.Table is competitive with DuckDb in many cases, though as a DuckDB enthusiast I hate to admit this. :)
> History is full of “what if”s, what if something like DuckDB had existed in 2012? The main ingredients were there, vectorized query processing had already been invented in 2005. Would the now somewhat-silly-looking move to distributed systems for data analysis have ever happened?
I like the gist of the article, but the conclusion sounds like 20/20 hindsight.
All the elements were there, and the author nails it, but maybe the right incentive structure wasn't there to create the conditions for it to happen.
Between 2010 and 2015, there was a genuine feeling across almost all of industry that we would converge on massive amounts of data, because until then the industry had never faced such abundance in terms of data capture and the ease of placing sensors everywhere.
The natural step in that scenario is, most of the time, not "let's find efficient ways to do it with the same capacity" but rather "let's invest to be able to process this in a distributed manner, independent of the volume we might have."
It's the same thing between OpenAI/ChatGPT and DeepSeek, where one can say that the math was always there, but the first runner was OpenAI with something less efficient but with a different set of incentive structures.
It would not have happened. The problem is that people believe their app will be web-scale pretty soon, so they need to solve the problem ASAP.
It is only after being burned many, many times that the need for simplicity arises.
It is the same with NoSQL: only after suffering through it do you appreciate going back.
i.e.: tools like this come back around only after the pain of a bubble. It can't be done inside one.
> The problem is that people believe their app will be web-scale pretty soon, so they need to solve the problem ASAP.
Investors really wanted to hear about your scaling capabilities, even when it didn't make sense. But the burn rate at places that didn't let a spreadsheet determine scale was insane.
Years working on microservices, and now I start planning/discovery with "why isn't this running on a box in the closet" and only accept numerical explanations. Putting a dollar value on excess capacity and labeling it "ad spend" changes perspectives.
A database is not only about disk size and query performance. A database reflects the company's culture, processes, workflows, collaboration, etc. It has an entire ecosystem around it - master data, business processes, transactions, distributed applications, regulatory requirements, resiliency, Ops, reports, tooling, etc.
The role of a database is not just to deliver query performance. It needs to fit into the ecosystem, serve the overall role on multiple facets, deliver on a wide range of expectations - tech and non-tech.
While the useful dataset itself may not outpace the hardware advancements, the ecosystem complexity will definitely outpace any hardware or AI advancements. Overall adaptation to the ecosystem will dictate the database choice, not query performance. Technologies will not operate in isolation.
And it's very much the tech culture at large that influences a company's tech choices. Those techies chasing shiny things and trying to shoehorn them into their job - perhaps cynically to pad their CVs, or, perhaps generously, thinking it will actually be the right thing to do - have an outsized say in how tech teams think about tech and what they imagine their job is.
Back in 2012 we were just recovering from the everything-is-xml craze and in the middle of the no-sql craze and everything was web-scale and distribute-first micro-services etc.
And now, after all that mess, we have learned to love what came before: namely, please please please just give me sql! :D
Why don't you just quietly use SQL instead of condescendingly lecturing others about how compromised their tech choices are?
NoSQL (e.g. Cassandra, MongoDB) and microservices were invented to solve real-world problems, which is why they are still so heavily used today. And the criticism of them is exactly the same as was levelled at SQL back in the day.
It's all just tools at the end of the day and there isn't one that works for all use cases.
Around 20 years ago I was working for a database company. During that time, I attended SIGMOD, which is the top conference for databases.
The keynote speaker for the conference was Stonebraker, who started Postgres, among other things. He talked about the history of relational databases.
At that time, XML databases were all the rage -- now nobody remembers them. Stonebraker explained that there is nothing new in hierarchical databases. There was a significant battle in SIGMOD, I think somewhere in the 1980s (I forget the exact time frame), between network databases and relational databases.
The relational databases won that battle, as they have won against each competing hierarchical database technology since.
The reason is that relational databases are based on relational algebra. This has very practical consequences, for example you can query the data more flexibly.
When you use JSON storage such as MongoDB, once you decide on your root entities you are stuck with that decision. I see very often in practice that new requirements will always come along that you did not foresee and then have to work around.
I don't care what other people use, however.
MongoDB is a $2b/year revenue company growing at 20% y/y. JSON stores are not going anywhere, and they're an essential tool for dealing with data where you have no control over the schema or want to handle it in the application layer.
And the only "battle" is one you've invented in your head. People who deal in data for a living just pick the right data store for the right data schema.
I find using Postgres and JSONB often gets me the best of both worlds.
And sql server alone is like 5 billion/yr.
Almost like there is room in the market for more than just SQL databases.
Sensitive much?
Ah yes MongoDB, it's web-scale!
Every person I know who has ever used Cassandra in prod has cursed its name. Mongo lost data for close to a decade, and microservices are mostly NOT used to solve real world problems but instead used either as an organizational or technical hammer for which everything is a nail. Hell, there are entire books written on how you should cut people off from each other so they can "naturally" write microservices and hyperscale your company!!
So all of this is just meaningless anecdotes.
Whereas the fact is that DataStax and MongoDB are highly successful companies, indicating that those databases are in fact solving real-world problems.
No, a database reflects what you make of it. Reports are just queries, after all. I don't know what all the other stuff you named has to do with the database directly. The only purpose of databases is to store and read data, that's what it comes down to. So query performance IS one of the most important metrics.
You can always make your data bigger without increasing disk space or decreasing performance by making the font size larger!
This feels like a companion to the classic 2015 paper "Scalability! But at what COST?":
https://www.usenix.org/system/files/conference/hotos15/hotos...
> As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9-percentile reads less than 300 GB. So the singularity might be closer than we think.
This isn't really saying much. It is a bit like saying the 1-in-1000-year storm levee is overbuilt for 99.9% of storms. They aren't the storms the levee was built for, y'know. It wasn't set up with them close to the top of mind. The database might do 1,000 queries in a day.
The focus for design purposes is really the queries that live out on the tail - can they be done on a smaller database? How much value do they add? What capabilities does the database need to handle them? Etc. That is what should justify a Redshift database. Or you can provision one to hold your 1TB of data because red things go fast and we all know it :/
If you only have 1TB of data then you can have it in RAM on a modern server.
AND even if you have 10TB of data, NVMe storage is ridiculously fast compared to what disk used to look like (or s3...)
In the last few years, sure, but certainly not in 2012.
1TB-memory servers weren't THAT exotic even in, say, the 2014-2018 era either; I know as I had a few at work.
Not cheap, but these were at companies with 100s of SWEs / billions in revenue / would eventually have multi-million dollar cloud bills for what little they migrated there.
You can take a different approach to the 1-in-1000 jobs. Like don't do them, or approximate them. I remember the time I wrote a program that would have taken a century to finish and then developed an approximation that got it done in about 20 minutes.
> This isn't really saying much.
On the contrary, it's saying a lot about sheer data size, that's all. The things you mention may be crucial to why Redshift and co. have been chosen (or not - in my org Redshift was used as the standard, so even small datasets were put into it because management wanted to standardize, for better or worse), but the fact remains that if you deal with smaller datasets all of the time, you may want to reconsider the solutions you use.
A tangential story. I remember, back in 2010, contemplating the idea of completely distributed DBs inspired by the then-popular torrent technology. In this one, a client would not be different from a server, except by the amount of data it holds. And it would probably receive the data in a torrent-like manner.
What puzzled me was that a client would want others to execute its queries, but not want to load all the data and make queries for the others. And how to prevent conflicting update queries sent to different seeds.
I also thought that Crockford's distributed web idea (where every page is hosted like on torrents) was a good one, even though I didn't think deeply about this one.
That was until I saw the discussion on web3, where someone pointed out that uploading any data to one server would make a lot of hosts do the job of hosting a part of it, and every small change would cause tremendous amounts of work for the entire web.
I have a large analytics dataset in BigQuery and I wrote an interactive exploratory UI on top of it and any query I did generally finished in 2s or less. This led to a very simple app with infinite analytics refinement that was also fast.
I would definitely not trade that for a pre-computed analytics approach. The freedom to explore in real time is enlightening and freeing.
I think you have restricted yourself to precomputed, fixed analytics, but real-time interactive analytics is also an interesting area.
I only retired my 2014 MBP ... last week! It started transiently not booting and then, after just a few weeks, it switched to only transiently booting. Figured it was time. My new laptop is actually a very budget buy, and not a Mac, and in many things a bit slower than the old MBP.
Anyway, the old laptop is about par with the 'big' VMs that I use for work to analyse really big BQ datasets. My current flow is to do the kind of 0.001% queries that don't fit on a box on BigQuery and massage things with just enough prepping to make the intermediate result fit on a box. Then I extract that to parquet stored on the VM and do the analysis on the VM using DuckDB from python notebooks.
DuckDB has revolutionised not what I can do but how I can do it. All the ingredients were around before, but DuckDB brings it together and makes the ergonomics completely different. Life is so much easier with joins and things than trying to do the same in, say, pandas.
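For a flavour of what that looks like in the notebook (a sketch; the parquet files and column names here are made up):

    # Sketch of the notebook side of that workflow: intermediate results exported
    # from BigQuery as parquet, then joined/analysed locally with DuckDB.
    import duckdb

    con = duckdb.connect()  # an in-memory database is enough here
    summary = con.sql("""
        SELECT user_id, u.country, COUNT(*) AS events
        FROM read_parquet('events/*.parquet') AS e      -- illustrative extracts
        JOIN read_parquet('users.parquet')    AS u USING (user_id)
        GROUP BY user_id, u.country
    """).df()    # hand back to pandas for plotting etc.
    print(summary.head())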
I still have mine, but it's languishing, I don't know what to do with it / how to get rid of it, it doesn't feel like trash. The Apple stores do returns but for this one you get nothing, they're just like "yeah we'll take care of it".
The screen started to delaminate on the edges, and its follow-up (a MBP with the touch bar)'s screen is completely broken (probably just the connector cable).
I don't have a use for it, but it feels wasteful just to throw it away.
I have the same machine and installed Fedora 41 on it. Everything works out of the box, including WiFi and sound.
eBay is pretty active for that kind of thing. Spares/repair.
I'm working on a big research project that uses duckdb, I need a lot of compute resources to develop my idea but I don't have a lot of money.
I'm throwing a bottle into the ocean: if anyone has spare compute with good specs they could lend me for a non-commercial project it would help me a lot.
My email is in my profile. Thank you.
> If we look at the time a bit closer, we see the queries take anywhere between a minute and half an hour. Those are not unreasonable waiting times for analytical queries on that sort of data in any way.
I'm really skeptical of arguments that say it's OK to be slow. Even in the modern laptop example, queries still take up to 47 seconds.
Granted, I'm not looking at the queries but the fact is that there are a lot of applications where users need results back in less than a second.[0] If the results are feeding automated processes like page rendering they need it back in 10s of millisecond at most. That takes hardware to accomplish consistently. Especially if the datasets are large.
The small data argument becomes even weaker when you consider that analytic databases don't just do queries on static datasets. Large datasets got that way by absorbing a lot of data very quickly. They therefore do ingest, compaction, and transformations. These require resources, especially if they run in parallel with query on the same data. Scaling them independently requires distributed systems. There isn't another solution.
[0] SIEM, log management, trace management, monitoring dashboards, ... All potentially large datasets where people sift through data very quickly and repeatedly. Nobody wants to wait more than a couple seconds for results to come back.
Related in the big-data-benchmarks-on-old-laptop department: https://www.frankmcsherry.org/graph/scalability/cost/2015/01...
DuckDB works well if
* you have small datasets (total, not just what a single user is scanning)
* no real-time updates, just a static dataset that you can analyze at leisure
* only few users and only one doing any writes
* several seconds is an OK response time; it gets worse if you have to load your scanned segment into the DuckDB node.
* generally read-only workloads
So yeah, not convinced we lost a decade.
> As recently shown, the median scan in Amazon Redshift and Snowflake reads a doable 100 MB of data, and the 99.9-percentile reads less than 300 GB. So the singularity might be closer than we think.
There is some circular reasoning embedded here. I've seen many, many cases of people finding ways to cut up their workloads into small chunks because the performance and efficiency of these platforms is far from optimal if you actually tried to run your workload at its native scale. To some extent, these "small reads" reflect the inadequacy of the platform, not the desire of a user to run a particular workload.
A better interpretation may be that the existing distributed architectures for data analytics don't scale well except for relatively trivial workloads. There has been an awareness of this for over a decade but a dearth of platform architectures that address it.
Maybe it was all VC funded solutions looking for problems?
It's a lot easier to monetize data analytics solutions if users code & data are captive in your hosted infra/cloud environment than it is to sell people a binary they can run on their own kit...
All the better if it's an entire ecosystem of .. stuff.. living in "the cloud", leaving end users writing checks to 6 different portfolio companies.
> Maybe it was all VC funded solutions looking for problems?
Remember, from 2020-2023 we had an entire movement to push a thing called "Modern data stack (MDS)" with big actors like a16z lecturing the market about it [1].
I am originally from data. I've never worked on anything outside of data: DS, MLE, DE, MLOps and so on. One thing that I envy in other developer careers is having bosses/leaders with battle-tested knowledge of delivering things using pragmatic technologies.
Most of the "AI/Data Leaders" have at most 15-17 years of career dealing with those tools (and I am talking about some dinosaurs, in a good sense, that saw the DWH or Data Mining era).
After 2018 we had an explosion of people working in PoCs or small projects at best, trying to mimic what the latest blog post from some big tech company pushed.
A lot of those guys are the bosses/leaders today, and worse, they were formed during a 0% interest environment, tons of hype around the technology, little to no scrutiny or business necessity for impact, upper management that did not understand really what those guys were doing, and in a space that wasn't easy for guys from other parts of tech to join easily and call it out (e.g., SRE, Backend, Design, Front-end, Systems Engineering, etc.).
In other words, it's quite simple to sell complexity or obscure technology for most of these people, and the current moment in tech is great because we have more people from other disciplines chiming in and sharing their knowledge on how to assess and implement technology.
[1] - https://a16z.com/emerging-architectures-for-modern-data-infr...
Right.. shove your data in our data platform.
OK, now you need PortCo1's company analytics platform, PortCo2's orchestration platform, PortCo3's SRE platform, PortCo4's Auth platform, PortCo5's IaC platform, PortCo6's Secrets Mgmt platform, PortCo7's infosec platform, etc.
I am sure I forgot another 10 things. Even if some of these things were open source or "open source", there was the upsell to the managed/supported/business license/etc version for many of these tools.
This is the primary failure of data platforms from my perspective. You need too many 3rd parties/partners to actually get anything done with your data and costs become unbearable.
Cloud and SaaS were good for a while because they took away the old sales-CTO pipeline that often saw a whole org suffering from one person's signature. But they also took away the benefits of a more formal evaluation process, and nowadays nobody knows how to do one.
I'm not sure how cloud/saas made the CTO behavior and its consequences any better. At least on-prem if they picked the "wrong" DB / message bus / etc, you could quietly replicate to another stack internally as needed for your analytics needs.
If your data is lodged in some SaaS product in AWS, good luck replicating that to GCP, Azure, or heaven forbid on-prem, without extortion level costs.
> and in a space that wasn't easy for guys from other parts of tech to join easily and call it out (e.g., SRE, Backend, Design, Front-end, Systems Engineering, etc.).
As an SRE/SysEng/Devops/SysAdmin (depending on the company that hires me): most people in the same job as me could easily call it out.
You don't have to be that big of a nerd to know that you can fit 6TB of memory in a single (physical) server. That's been true for a few years. Heck, AWS has had 1TB+ memory instances for a few years now.
The thing is... Upper management wanted "big data" and the marketing people wanted to put the fancy buzzword on the company website and on linkedin. The data people wanted to be able to put the fancy buzzword on their CV (and on their Linkedin profile -- and command higher salaries due to that - can you blame them?).
> In other words, it's quite simple to sell complexity or obscure technology for most of these people
The unspoken secret is that this kind of BS wasn't/isn't only going on in the data fields (in my opinion).
> The unspoken secret is that this kind of BS wasn't/isn't only going on in the data fields (in my opinion).
Yes, once you see it in one area you notice it everywhere.
A lot of IT spend is CEOs chasing something they half heard or misunderstood a competitor doing, or a CTO taking Gartner a little too seriously, or engineering leads doing resume-driven architecture. My last shop did a lot of this kind of stuff: "we need a head of [observability|AI|$buzzword]".
The ONE thing that gives me the most pause about DuckDB is that some people in my industry who are guilty of the above are VERY interested in DuckDB. I like to wait for the serial tech evangelists to calm down a bit and see where the dust settles.
Krazam did a brilliant video on Small Data: https://youtu.be/eDr6_cMtfdA?si=izuCAgk_YeWBqfqN
Did my PhD around that time and did a project “scaling” my work on a Spark cluster. Huge PITA and no better than my local setup, which was an MBP15 with pandas and Postgres (actually I wrote+contributed a big chunk of pandas read_sql at that time to make it Postgres compatible using SQLAlchemy).
Thank you for read_sql with SQLalchemy/postgres! We use it all the time at our company:)
For those of you from the AI world, this is the equivalent of the bitter lesson and DeWitt's argument about database machines from the early 80s. That is, if you wait a bit, with the exponential pace of Moore's law (or modern equivalents), improvements in “general purpose” hardware will obviate DB-specific improvements. The problem is that back in 2012, we had customers that wanted to query terabytes of logs for observability, or analyze adtech streams, etc. So I feel like this is a pointless argument. If your data fit on an old MacBook Pro, sure, you should’ve built for that.
AWS started offering local SSD storage up to 2 TB in 2012 (HI1 instance type) and in late 2013 this went up to 6.4 TB (I2 instance type). While these amounts don't cover all customers, plenty of data fits on these machines. But the software stack to analyze it efficiently was lacking, especially in the open-source space.
AWS also had customers that had petabytes of data in Redshift for analysis. The conversation is missing a key point: DuckDB is optimizing for a different class of use cases. They're optimizing for data science, not traditional data warehousing use cases, and that difference is masquerading as a question of size. Even at small sizes, there are other considerations: access control, concurrency control, reliability, availability, and so on. The requirements are different for those different use cases. Data science tends to be single user, local, and to have lower availability requirements than warehouses that serve production pipelines, data sharing, and so on. I also think that DuckDB can be used for those, but it is not optimized for them.
Data size is a red herring in the conversation.
>Data size is a red herring in the conversation.
Not really. A Redshift paper just shared data on this:
>There is a small number of tables in Redshift with trillions of rows, while the majority is much more reasonably sized with only millions of rows. In fact, most tables have less than a million rows and the vast majority (98%) has less than a billion rows.
The argument can be made that 98% of people using Redshift could potentially get by with DuckDB.
https://assets.amazon.science/24/3b/04b31ef64c83acf98fe3fdca...
This makes a completely valid point when you constrain the meaning of Big Data to “the largest dataset one can fit on a single computer”.
At companies I've worked at "Big Data" was often used to mean "too big to open in Excel" or in the extreme case "too big to fit in RAM on my laptop"
Annoyingly, "medium data" is my term for this.
Around 0.5 to 50 GB is such an annoying area, because Excel starts falling over on the lower end and even nicer computers will start seriously struggling on the larger end if you're not being extremely efficient.
I mean, not everyone spent their decade on distributed computing. Some devs with a retrogrouch inclination kept writing single-threaded code in native languages on a single node. Single-core clock speed stagnated, but it was still worth buying new CPUs with more cores because they also had more cache, and all the extra cores are useful for running ~other people's bloated code.
I find that good multithreading can speed up parallelizable workloads by 5-10 times depending on CPU core count, if you don't have tight latency constraints (and even games with millisecond-level latency deadlines are multithreaded these days, though real-time code may look different than general code).
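As a rough illustration of that kind of speedup, here's a toy CPU-bound comparison (the workload and numbers are made up; real gains depend entirely on core count and the task):

    # Toy sketch: serial vs multi-process wall time for a CPU-bound function.
    # multiprocessing is used here because it sidesteps the GIL for CPU work.
    import time
    from multiprocessing import Pool

    def busy(n):
        total = 0
        for i in range(n):          # deliberately dumb CPU work
            total += i * i
        return total

    if __name__ == "__main__":
        jobs = [5_000_000] * 16

        t0 = time.perf_counter()
        serial = [busy(n) for n in jobs]
        t1 = time.perf_counter()

        with Pool() as pool:        # defaults to one worker per CPU core
            parallel = pool.map(busy, jobs)
        t2 = time.perf_counter()

        print(f"serial:   {t1 - t0:.2f}s")
        print(f"parallel: {t2 - t1:.2f}s")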
High-frequency trading, gaming, audio/DSP, embedded, etc. There's a lot of room for that kind of developer.
This is really a question of economics. The biggest organizations with the most ability to hire engineers have need for technologies that can solve their existing problems in incremental ways, and thus we end up with horrible technologies like Hadoop and Iceberg. They end up hiring talented engineers to work on niche problems, and a lot of the technical discourse ends up revolving around technologies that don't apply to the majority of organizations, but still cause FOMO amongst them. I, for one, am extremely happy to see technologies like DuckDB come along to serve the long tail.
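For what it's worth, the long-tail workflow really is as small as it sounds. A minimal sketch (the Parquet file name and columns are placeholders):

    # Sketch: DuckDB querying a local Parquet file in-process, no cluster involved.
    # "events.parquet" and the column names are made up for illustration.
    import duckdb

    con = duckdb.connect()          # in-memory database
    top = con.execute("""
        SELECT user_id, count(*) AS n
        FROM 'events.parquet'
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """).fetchdf()                  # returns a pandas DataFrame
    print(top)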
Are there any open source analytics projects that build on top of DuckDB yet?
I mostly see ClickHouse, Postgres, etc.
> The geometric mean of the timings improved from 218 to 12, a ca. 20× improvement.
Why do they use the geometric mean to average execution times?
It's a way of saying that twice as fast and twice as slow have equal effect, on opposite sides. If your baseline is 10 seconds, one benchmark takes 5 seconds, and another takes 20 seconds, then the geometric mean gives you 10 seconds as the result, because they cancel each other out. The arithmetic mean would treat them differently, because in absolute terms a 10-second slowdown is bigger than a 5-second speedup. But that's not fair to speedups, because the absolute speedup can be at most 10 seconds, while the slowdown has no limit.
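A quick sketch of that cancellation with the same numbers:

    # Sketch: arithmetic vs geometric mean for a 10-second baseline,
    # one run 2x faster (5 s) and one run 2x slower (20 s).
    import math

    times = [5.0, 20.0]

    arith = sum(times) / len(times)              # 12.5 seconds
    geo = math.prod(times) ** (1 / len(times))   # 10.0 seconds, equal to the baseline

    print(arith, geo)
    # Under the geometric mean the 2x speedup and 2x slowdown cancel exactly;
    # under the arithmetic mean the slowdown dominates.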
This is the best explain-like-im-5 I've heard for geo mean and helped it click in my head, thank you :)
But reality doesn’t care:
If half your requests are 2x as long and half are 2x as fast, you don’t take the same wall time to run — you take longer.
Let’s say you have 20 requests, 10 of type A and 10 of type B. They originally both take 10 seconds, for 200 seconds total. You halve A and double B. Now it takes 50 + 200 = 250 seconds, or 12.5 on average.
This is a case where geometric mean deceives you - because the two really are asymmetric and “twice as fast” is worth less than “twice as slow”.
There is definitely no single magical number that can perfectly represent an entire set of numbers. There will always be cases where it is not representative enough. In the request example you are mostly interested in the total processing time, so it does make sense to use a metric based on addition. But you could also frame a similar scenario where halving the processing time lets you handle twice as many items in the same duration. In that case a ratio-based or multiplicative view might be more appropriate.
Sure — but the arithmetic mean also captures that case: if you only halve the time, it also will report that change accurately.
What we’re handling is the case where you have split outcomes — and there the arithmetic and geometric mean disagree, so we can ask which better reflects reality.
I’m not saying the geometric mean is always wrong — but it is in this case.
A case where it does make sense: what happens when your stock halves in value and then doubles in value?
In general, the geometric mean is appropriate where effects are compounding (e.g., two price changes to the same stock) but not when we're combining (requests are handled separately). Two benchmarks are more like combining (do task A, then task B) than compounding.
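Concretely, for the stock case (factors of 0.5 and 2.0 applied in sequence to the same price):

    # Sketch: why the geometric mean fits compounding changes.
    # A stock halves, then doubles: it ends exactly where it started.
    import math

    factors = [0.5, 2.0]

    geo = math.prod(factors) ** (1 / len(factors))   # 1.0 -> "no change per step", matches reality
    arith = sum(factors) / len(factors)              # 1.25 -> suggests +25% per step, which is wrong

    print(geo, arith)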
Squaring is a really good way to make the common-but-small numbers have bigger representation than the outlying-but-large numbers.
I just did a quick Google and the first real result was this blog post with a good explanation and some good illustrations: https://jlmc.medium.com/understanding-three-simple-statistic...
It's the very first illustration at the top of that blog post that 'clicks' for me. Hope it helps!
The inverse is also good: mean-square-error is a good way to compare how similar two datasets (e.g. two images) are.
The geometric mean of n numbers is the n-th root of the product of all numbers. The mean square error is the sum of the squares of all numbers, divided by n. (I.e. the arithmetic mean of the squares.) They're not the same.
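For example, on the same three numbers the two come out quite differently:

    # Sketch: geometric mean vs mean of squares, just to show they differ.
    import math

    xs = [1.0, 2.0, 4.0]

    geometric_mean = math.prod(xs) ** (1 / len(xs))     # (1*2*4)**(1/3) = 2.0
    mean_of_squares = sum(x * x for x in xs) / len(xs)  # (1+4+16)/3 = 7.0

    print(geometric_mean, mean_of_squares)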
I'm not gonna edit what I wrote, but you are interpreting it way too literally. I was not describing the implementation of anything; I was just giving a link that explains why thinking about things in terms of area (geometry) is popular in stats. It's a bit like the epiphany that histograms don't need to be bars of equal width.
I am on the late 2015 version and I have an eBay body stashed for when the time comes to refurbish that small data machine.
Any good keywords to search?
It's not the point of the blog post, but I love the fact that the author's 2012 MacBook Pro is still useable. I can't imagine there are too many Dell laptops from that era still alive and kicking.
The machine from the article - a 2012 MBP Retina with 16 GB memory and 2.6 GHz i7 - had cost $2999 in the US (and significantly more in most of the rest of the world) at release. That's around $4200 today adjusting for inflation. You won't see many Dell laptops with that sort of price tag.
I have worked for a half dozen companies, all swearing up and down they had big data. Meaningfully, one customer had 100TB of logs and another had 10TB of stuff; everyone else, once you actually thought about it properly and removed the utter trash, was really under 10TB.
Also - SQLite would have been totally fine for these queries a decade ago or more (just slower) - I messed with 10GB+ datasets in it more than 10 years ago.
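A minimal sketch of that kind of SQLite use from Python (the database file and schema are made up; the point is that it's just one local file and the standard library):

    # Sketch: plain SQLite for an analytics-style query, no server needed.
    # "logs.db" and the "requests" table are placeholders for illustration.
    import sqlite3

    con = sqlite3.connect("logs.db")
    cur = con.execute("""
        SELECT status, count(*) AS hits
        FROM requests
        GROUP BY status
        ORDER BY hits DESC
    """)
    for status, hits in cur:
        print(status, hits)
    con.close()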