Sampling with SQL

blog.moertel.com

178 points by thunderbong 9 months ago

tmoertel 8 months ago

Hey! I'm tickled to see this on HN. I'm the author. If you have any questions, just ask. I'll do my best to answer them here.

indoordin0saur 8 months ago

Thank you! The thing I find tricky here is choosing the weight. I think maybe one obvious way you would want to weight samples would be for recency. E.g. if I have a table of user login events then I care about seeing more of the ones that happened recently but still want to see some of the older ones. Would the algorithm still work if I converted a `created_at` timestamp to epoch time and used that? Or would I want to normalize it in some way?
- tmoertel 8 months ago
  
  All of your `created_at` epoch times are going to be of similar magnitude, since they’re all about the same relative distance from the start of the epoch (usually 1970-01-01T00:00:00). So if you use them as weights directly, you’re in effect taking a uniform sample – every row has about the same chance of being chosen.
  Based on what I glean about your scenario, I suspect you’d be better served by some kind of exponentially decaying weight based on age (i.e., age = current_timestamp ‒ created_at). For example, if you wanted to make your rows half as likely to be selected every time they grew another hour older, you could use `POW(2.0, -age / 3600)` as your weight, where age is given in seconds.
  duckdb> WITH Ages AS ( SELECT age FROM UNNEST([0, 3600, 7200]) AS t(age) ) SELECT age, POW(2.0, -age / 3600) AS w FROM Ages; ┌──────┬──────┐ │ age ┆ w │ ╞══════╪══════╡ │ 0 ┆ 1.0 │ │ 3600 ┆ 0.5 │ │ 7200 ┆ 0.25 │ └──────┴──────┘
- swasheck 8 months ago
  
  depending on what you’re analyzing, EMA could give you a good weighting method
swasheck 8 months ago

this is a really fascinating and interesting to me. i’ve been using sql to analyze large data sets recently since sql is my primary skillset and having new methods and algorithms is quite handy and interesting.
i do have a question of clarification. this is built on weighted sampling with weights in the dataset. does this indicate some sort of preprocessing to arrive at the weights?
- tmoertel 8 months ago
  
  The weights just indicate how important each row is to whatever you’re trying to estimate. Sometimes, your datasets will have numeric columns that naturally suggest themselves as weights. Other times, you may have to create weights.
  For example, say you run a website like Wikipedia and want to estimate the percentage of page views that go to pages that are dangerously misleading. You have a dataset that contains all of your site’s pages, and you have logs that indicate which pages users have viewed. What you don’t have is a flag for each page that indicates whether it’s “dangerously misleading”; you’ll have to create it. And that’s likely to require a panel of expert human reviewers. Since it’s not practical to have your experts review every single page, you’ll want to review only a sample of pages. And since each page spreads misleading information only to the degree it’s viewed by users, you’ll want to weight each page by its view count. To get those counts, you’ll have to process the site logs, and count how many times each page was viewed. Then you can take a sample of pages weighted by those counts.
  That’s a pretty typical scenario. You think about the question you’re trying to answer. You think about the data you have and the data you’ll need. Then you figure out how to get the data you need, hopefully without too much trouble.
  
  swasheck 8 months ago
  
  so that’s a “yes, in many cases it requires preprocessing which would occur prior to the scope of this article.”
  thank you!

rhymer 8 months ago

Be careful, the weight of Algorithm A by Efraimidis and Spirakis cannot be interpreted as the inclusion probability, and thus cannot be used in survey sampling to construct the Horvitz–Thompson estimator.

See "Remarks on some misconceptions about unequal probability sampling without replacement" by Yves Tillé.

Quoted from Tillé's conclusion: "There is a multitude of correct and fast method of sampling... there is no reason to use an incorrect method like weighted random sampling where we do not control the inclusion probabilities"

It's not clear to me how easy it is to implement the "multitude of corect and fast methods" in SQL, though. Would love to see some reference implementation.

apwheele 8 months ago

Care to elaborate on this? So this post does not save the resulting weight, so you don't use that in any subsequent calculations. You would just treat the result as a simple random sample. So it is unclear why this critique matters.
- tmoertel 8 months ago
  
  Yes, rhymer has hit on the primary tradeoff behind Algorithm A. While Algorithm A is fast, easy, and makes it possible to sample any dataset you can access with SQL, including massive distributed datasets, its draws from the population are not independent and identically distributed (because with each draw you make the population one item smaller), nor does it let you compute the inclusion probabilties that would let you use the most common reweighting methods, such as a Horvitz–Thompson estimator, to produce unbiased estimates from potentially biased samples.
  In practice, however, when your samples are small, your populations are large, and your populations' weights are not concentrated in a small number of members, you can use Algorithm A and just pretend that you have a sample of i.i.d. draws. In these circumstances, the potential for bias is generally not worth worrying about.
  But when you cannot play so fast and loose, you can produce unbiased estimates from an Algorithm A sample by using an ordered estimator, such as Des Raj's ordered estimator [1].
  You could alternatively use a different sampling method, one that does produce inclusion probabilities. But these tend to be implemented only in more sophisticated statistical systems (e.g., R's "survey" package [2]) and thus not useful unless you can fit your dataset into those systems.
  For very large datasets, then, I end up using Algorithm A to take samples and (when it matters) Des Raj's ordered estimator to make estimates.
  (I was planning a follow up post to my blog on this subject.)
  [1] The original papers are hard to come by online, so I suggest starting with an introduction like https://home.iitk.ac.in/~shalab/sampling/chapter7-sampling-v..., starting at page 11.
  [2] https://cran.r-project.org/web/packages/survey/survey.pdf

natmaka 8 months ago

If you need to "extract a small, sample dataset from a larger PostgreSQL database while maintaining referential integrity" check https://github.com/mla/pg_sample

Horffupolde 8 months ago

ORDERing by RANDOM on a large table requires a full scan, even with LIMIT. For tables that don’t fit in memory this is impractical and I/O shoots to 100%.

tmoertel 8 months ago
As I mentioned in the "Tricks for faster samples" [1] section of the article, you can take a sample in two parts: (1) identify the rows in the sample and (2) read the data you want for the rows in the sample.
Only Part 1 requires running Algorithm A and its ORDER/LIMIT logic. And, when you run that logic, you only need to read two tiny columns from the table you want to sample: the weight and the primary key (any candidate key will work). So it looks like this:
```
    SELECT pk
    FROM Population
    WHERE weight > 0
    ORDER BY -LN(1.0 - RANDOM()) / weight
    LIMIT 1000  -- Sample size.
```
This operation will be fast whenever your data is stored in a column-oriented format or when you have indexed the weight.
I trust that you will accept as self evident that column stores are very fast when reading two small columns since they don't make you pay to read the columns you don't need.
So I'll focus on the row-store case. If you have indexed your weight, you basically have a tiny table of (weight, primary key) pairs sorted by weight. If you then run a query that needs to read only the weight and primary key, the query planner can fulfill your query by scanning the index, not the full population table. (The fact that the index is sorted doesn't matter. The query planner isn't taking advantage of the index ordering, just the fact that the index has everything it needs and is way smaller than the full population table. Most modern row-oriented database stores support index-only scans, including PostgreSQL [2] and SQLite [3].)
[1] https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....
[2] https://www.postgresql.org/docs/current/indexes-index-only-s...
[3] https://www.sqlite.org/optoverview.html#covering_indexes
tmoertel 8 months ago

Actually, on most modern SQL systems, especially those that deal with large datasets, the `ORDER/LIMIT` combination is implemented as a `TOP N` operation that consumes only O(N) memory.
- crazygringo 8 months ago
  
  But only when operating on fields that are indexed.
  If you pull the highest 5 product ID's sorting by product ID, you're right it only reads those 5 entries from disk.
  But when you ORDER BY RANDOM(), it's forced to do a full-table scan, because RANDOM() can't be indexed.
  
  tmoertel 8 months ago
  
  > But only when operating on fields that are indexed.
  No, even without indexes, you only need O(N) memory to find the top N records under any ordering, including on random sort keys. You do have to examine every row, but if you are using a column store or have indexed the weight column, the scan will require only a small fraction of the work that a read-everything table scan would require.
  
  crazygringo 8 months ago
  
  Oh, sorry I read your comment quickly and hadn't realized you were talking about memory solely.
  But that's because memory usage isn't the relevant bottleneck here -- it's disk IO. It's reading the entire table from disk, or only the entire column(s) if applicable.
  That's not something you ever want to do in production. Not even if it's only a single column. That will destroy performance on any reasonably large table.
  If a full-table scan of all columuns takes 180 seconds, then a "small fraction" to scan a single column might take 10 seconds, but queries running on a production database need to be measured in small numbers of milliseconds.
- swasheck 8 months ago
  
  i don’t think the workspace memory is op point. if the initial set (not the intermediate or final projection) is too large, reads will come from disk which also creates cpu churn. at least in mssql, if the data page is not in memory it is read into the buffer pool and then read from the buffer bool into the workspace for processing. if there isn’t enough buffer space to fit all of the data set pages in the pool, you’re going to be reading and flushing pages. i think pg operates similarly.
  
  tmoertel 8 months ago
  
  Yes, to sample a population, you must consider every row in the population. There's no way around it. But it doesn't have to be as slow and expensive as reading the entire population table.
  To determine which rows are in the sample, you need to consider only two columns: the weight and (a) primary key.
  If your population lives in a traditional row-oriented store, then you're going to have to read every row. But if you index your weights, you only need to scan the index, not the underlying population table to identify the sample.
  If your population lives in a column-oriented store, identifying the sample is fast, again because you only need to read two small columns to do it. Column stores are optimized for this case.
mbb70 8 months ago

How would you randomly sample a table without a full table scan? If each row must be equally likely to be sampled you must touch each row.
Unless you know the id range and select N ids ahead of time, then use an index to pull those. But I think the assumption of the article is you can't do that.
- tmoertel 8 months ago
  
  As I wrote in an earlier comment [1], you can take a sample in two parts, and the first part is the only one that needs to do a scan and, even then, if you index your weights (or are using a column store like Parquet), this scan can be very lightweight and ignore everything except the weights and primary keys.
  [1] https://news.ycombinator.com/item?id=41906816
- crazygringo 8 months ago
  
  In a production database where queries must be fast, indeed you'll have to rely on constraints such as knowing the ID range and that ID's are contiguous, and do exactly as you say -- randomly generate ID's in that range and pull them with an index.
  And indeed, if your data doesn't naturally have contiguous ID's, you might create an AUTOINCREMENT primary key precisely to create those contiguous ID's. And then of course if you have a reasonable but not overwhelming number of deletions, you can get away with retrying a new random number every time you don't get a hit, up to a max number of retries where you just return an error and ask the user to attempt the operation again.
crazygringo 8 months ago

Yup. For any kind of production app with a table that isn't tiny, you should absolutely never do an ORDER BY RANDOM().
You just need to make sure you're running these queries against a database used solely for occasional analytics, where it's no problem to be saturating the disk for two minutes or whatever, because it won't bother anybody else.

nzach 8 months ago

Is the idea of doing statistics directly in the database a relatively new development?

This seems like such a great idea, but this is just the first time I read about this.

This is a great article, thanks for sharing.

oli5679 8 months ago

One thing that can be useful with sampling is sampling a consistent but growing sub population. This can help maintain a consistent holdout for machine learning models, help you sample analytical data and test joins without null issues etc.

If you use a deterministic hash, like farm_fingerprint on your id column (e. g user_id) and keep if modulus N = 0, you will keep the same growing list of users across runs and queries to different tables.

apwheele 8 months ago

Very nice, another pro-tip for folks is that you can set the weights to get approximate stratified sampling. So say group A had 100,000 rows, and group B had 10,000 rows, and you wanted each in the resulting to have approximately the same proportion. You would set the weight for each A row to be 1/100,000 and for B to be 1/10,000.

If you want exact counts I think you would need to do RANK and PARTITION BY.

emmelaich 8 months ago

Is there something in the SQL standard that says functions are guaranteed to executed more than once?

I swear that once I used something like random() and it was only executed once, making it useless for the task at hand. I had to use some trick to ensure it was executed for each row.

I may have used it in the `select` part. Dialect was Oracle's, from memory.

related: https://xkcd.com/221/

yen223 8 months ago

Postgres makes a distinction between IMMUTABLE, STABLE, and VOLATILE functions, with volatile functions being functions that - with the same arguments - can produce different results even within the same statement. Therefore VOLATILE functions will always be executed once per call.
I'm not sure if this is part of the ANSI SQL standard.
hobs 8 months ago

It depends on the function and the SQL implementation, you can see in this simulator that where rand() > rand() evaluates row by row in MySQL but once in SQL Server, so its easy to get this stuff messed up even if the code is "equivalent" its really not.
https://onecompiler.com/mysql/42vq8s23b https://onecompiler.com/sqlserver/42vq8tz24
- emmelaich 8 months ago
  
  Thanks, that's a bit upsetting :-)
  
  tmoertel 8 months ago
  
  Indeed.
  On systems with unfortunate evaluation semantics for `RAND`, you can generate fresh random values for each row by creating a function for that purpose and calling it on the primary key of each row. I provide one example in the article at:
  https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....
  I'll include a copy here because it's short. It's for DuckDB and was created to let us generate a controllable number of fresh random values for each row:
  -- Returns a pseudorandom fp64 number in the range [0, 1). The number -- is determined by the given `key`, `seed` string, and integer `index`. CREATE MACRO pseudorandom_uniform(key, seed, index) AS ( (HASH(key || seed || index) >> 11) * POW(2.0, -53) );
  
  o11c 8 months ago
  
  `HASH` looks like a slow function ... does something like `rand() + rowid & 0` or `((rand() * p53 + rowid) % p53) / p53` work?
  
  tmoertel 8 months ago
  
  Generally, table scans dominate the cost of sampling, so evaluating a "slow" function once per row doesn't matter. What does matter is whether you can push filtering expressions down into the scans to eliminate I/O and decoding work early. Some systems have trouble pushing down RAND efficiency, which can make alternatives like the deterministic function I shared advantageous.
  
  emmelaich 8 months ago
  
  Thanks. I think my trick was to generate (externally) a table of random numbers, then `join` by rowid that randtable to the table of interest. It was only a few million numbers so didn't take long to generate. And it was pretty much a one-off job.
  
  hobs 8 months ago
  
  Had a good laugh, this is the normal response to difference in SQL implementations in my experience.
magicalhippo 8 months ago

Conversely, today() and friends might be evaluated per row and not per row, to the mild surprise of some of us.
Just reinforced my belief of not assuming you know what the DB server is doing, verify!
adornKey 8 months ago

About a decade ago I tried to use random() and I found its behaviour to be very random... The changelogs of all databases I used later contained bug fixes on this subject. (Sybase, Microsoft)
I'm not sure if standards today are consistent and trustworthy on this subject. Back in the day it was very obvious that most likely nobody ever seriously had tried to use random().
paulddraper 8 months ago

No.

indoordin0saur 8 months ago

For the newbies: quick way I'll get a sample of rows (without weights) in Redshift:

`SELECT * FROM sales ORDER BY RANDOM() LIMIT 10;`

peterm4 8 months ago

Been working with sqlite and Typescript, needing random rows sampled reproducibly with a seed. Found this handy little trick [0]. Not the cleanest, but it works!
`SELECT * FROM sales ORDER BY substr(${seedValue} * sales.id, length(sales.id) + 2) LIMIT 10;`
[0] https://stackoverflow.com/questions/24256258/order-by-random...

bastawhiz 8 months ago

> ORDER BY -LN(1.0 - RANDOM()) / weight

Isn't this the same as just

> ORDER BY -LN(RANDOM()) / weight

tmoertel 8 months ago

Yes, except that it avoids the possibility of a floating-point error. In most SQL implementations, `RANDOM()` can return zero, but the log of 0 is undefined.
From the article:
> One last numerical subtlety. Why do we generate random numbers with the expression `1.0 - RANDOM()` instead of just `RANDOM()`? Since most implementations of RANDOM(), such as the PCG implementation used by DuckDB, return a floating-point value in the semi-closed range [0, 1), they can theoretically return zero. And we don’t want to take the logarithm of zero. So we instead use `1.0 - RANDOM()` to generate a random number in the semi-closed range (0, 1], which excludes zero.
- bastawhiz 8 months ago
  
  Got it, I'd missed that. Thank you