The Problem

Recently, a project I manage started throwing a few “stack depth limit exceeded” database errors from PostgreSQL. Naturally I went straight to the query causing them, assuming I’d find a bad trigger on a table causing an infinite loop. When I got to the query I quickly realized it was one of the most core queries in the application, so if there were a bad trigger we would have known about it already. I dug a bit more and confirmed there weren’t any triggers to blame. I was at a loss.

This query was just doing a basic deletion with an IN clause using multi-column tuples (e.g. delete from some_table where (thing_id, related_id, other_id) in ((1,2,3),(4,5,6))). The difference with this particular run of the query was the sheer number of rows being deleted: there were over 8,000 tuples in the IN comparison. After much searching, and asking a colleague for help, we found that large numbers of values in an IN comparison could lead to the stack depth limit exceeded errors I was seeing, and we discovered a very interesting workaround.
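
To make the shape of the problem concrete, here is a minimal sketch of what such a statement looks like, reusing the example table and columns from above (the real query had over 8,000 tuples in the list). The stack depth limit itself is governed by PostgreSQL’s max_stack_depth setting, which you can inspect with SHOW.

-- sketch of the failing statement; the real IN list had 8,000+ tuples
delete from some_table
where (thing_id, related_id, other_id) in (
  (1, 2, 3),
  (4, 5, 6),
  (7, 8, 9)
  -- ...thousands more tuples...
);

-- check the current stack depth limit
show max_stack_depth;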

What’s wrong with IN

First, let’s explore what the IN comparison does. Per the docs, the IN comparison is just shorthand for boolean operators. For values that are tuples, PostgreSQL does a bit more work to simplify the resulting expression as much as possible. If all the values at a particular position of the tuple are the same, it pulls that out as its own comparison; any remaining positions end up in a nested series of ORs joined with ANDs. For example:

(id1, id2, id3, id4) in ((1, 2, 3, 4), (1, 2, 5, 4), (1, 4, 3, 2))

is internally converted to this structure:

(
  (id1 = 1) AND
    (
      (
        (id2, id3, id4) = (2, 3, 4)
        OR
        (id2, id3, id4) = (2, 5, 4)
      )
      OR
      (id2, id3, id4) = (4, 3, 2)
    )
)

(See also our working example.) For tiny sets this seems acceptable, but the nesting quickly gets out of hand with larger sets, eventually exceeding the parser’s stack depth limit (the one controlled by PostgreSQL’s max_stack_depth setting).
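
If you want to see this rewrite for yourself, a plain EXPLAIN is usually enough: depending on the plan, the expanded OR/AND expression tends to appear in the Filter or Index Cond lines in place of the original IN syntax. A quick sketch against the example table from earlier:

-- the plan output shows the expanded boolean expression, not the IN list
explain
select *
from some_table
where (thing_id, related_id, other_id) in ((1, 2, 3), (1, 2, 6), (1, 5, 3));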

The Solution

The workaround is to build a “virtual table” of sorts from the values and join against it instead of using the IN comparison (see below; a full DELETE form is sketched after the fragment). If you aren’t certain that your values are unique, it takes a tiny bit more effort (a subquery where you select distinct * from the values), but it’s still viable and faster than an IN comparison. The code that generated the tuples stayed the same; they just moved to a different part of the query, so this workaround was very easy to test out.

inner join (
  values (1, 2, 3, 4), (1, 2, 5, 4), (8, 4, 3, 2)
) v(id1, id2, id3, id4) using (id1, id2, id3, id4)

(See working example)
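
The fragment above is in SELECT-join form. To apply the same idea to the original deletion, one option is PostgreSQL’s DELETE ... USING clause, which accepts the VALUES list as a joined relation. Here is a sketch (not necessarily the exact form used in the project) against the example table, including the select distinct variant for tuple lists that might contain duplicates:

-- join the VALUES list via DELETE ... USING instead of an IN comparison
delete from some_table t
using (values (1, 2, 3), (4, 5, 6), (7, 8, 9)) v(thing_id, related_id, other_id)
where t.thing_id = v.thing_id
  and t.related_id = v.related_id
  and t.other_id = v.other_id;

-- if the tuples might not be unique, de-duplicate them in a subquery first
delete from some_table t
using (
  select distinct *
  from (values (1, 2, 3), (1, 2, 3), (7, 8, 9)) x(thing_id, related_id, other_id)
) v
where t.thing_id = v.thing_id
  and t.related_id = v.related_id
  and t.other_id = v.other_id;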

The Difference

I ran some rudimentary EXPLAIN ANALYZE queries to get run times for the two methods, and I was quite impressed. The table the query was deleting from has around 40 million rows, and the query itself used 47 tuples. The IN comparison style consistently came in around 350-380ms of execution time, whereas the “virtual table” join style stayed in the 1-5ms range. In either case, rerunning the exact same query multiple times in a row would get the time down to 0.7-1ms, but that use case (especially for this specific query) was extremely rare, so that initial time is what would matter most. Using the new method, I reran the large deletion (the tuple count ended up at 8,113). It wasn’t quite as snappy as the others, landing at 1,700ms for the first run, but for that many tuples it’s a number I’m happy to see.
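
If you want to reproduce this kind of comparison, both forms can be timed with EXPLAIN ANALYZE; wrapping each run in a transaction that gets rolled back keeps the test deletions from actually sticking. A rough sketch against the example table (actual timings will depend on table size, indexes, and caching):

begin;

-- IN comparison form
explain analyze
delete from some_table
where (thing_id, related_id, other_id) in ((1, 2, 3), (4, 5, 6));

rollback;

begin;

-- VALUES join form
explain analyze
delete from some_table t
using (values (1, 2, 3), (4, 5, 6)) v(thing_id, related_id, other_id)
where t.thing_id = v.thing_id
  and t.related_id = v.related_id
  and t.other_id = v.other_id;

rollback;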

Other RDBMSs

This technique also appears to work in MS SQL Server, but not in MariaDB/MySQL.
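
For reference, here is a rough sketch of what the SQL Server version might look like, using its table value constructor in the FROM clause with the same example table and columns (treat this as untested T-SQL rather than a drop-in):

-- T-SQL: delete via a join against a table value constructor
DELETE t
FROM some_table AS t
INNER JOIN (VALUES (1, 2, 3), (4, 5, 6)) AS v(thing_id, related_id, other_id)
  ON t.thing_id = v.thing_id
 AND t.related_id = v.related_id
 AND t.other_id = v.other_id;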

Need help with PostgreSQL performance?

Singlebrook is available to help troubleshoot the performance of your PostgreSQL database! Feel free to contact us for a free 30-minute consultation.