The Most Annoying Optimizer Fail in Postgres ✅ Best index solved it.

Franck Pachot
8 min readDec 8, 2023

--

This blog post is based on a question asked on the PostgreSQL Slack channel (you can join via this invite link: https://pgtreats.info/slack-invite). The question was related to a bad index chosen by the query planner. Some suggestions were made to prevent index usage by concatenating a dummy string to a column to hinder the query planner. Some even mentioned that there is no solution, referencing the following “ Most Annoying Optimizer Fail in Postgres”.

Based on my experience with SQL tuning, I have observed that in various databases, when the query planner needs to choose between two indexes and selects the inferior one, it is generally because none of those indexes is optimal. For the optimizer to function efficiently choosing the right index, on of them must be the unequivocal and obvious choice for accessing the data in your query.

Let’s consider an example similar to this scenario. The problem is that we are filtering one column using seq=0 and sorting on another using order by id limit 1. Additionally, to mimic the question raised in Slack, I have added a multi-value filter on status in (1,2,3).

drop table if exists demo;

create table demo (
primary key(id)
, id uuid, seq bigint, status int, value text
);

create index demo_seq on demo(seq, status);

--- rows inserted with `seq` higher than zero
insert into demo
select gen_random_uuid(), generate_series(1,1000000), 10*random(), 'x'
;

--- rows inserted with `seq` equal to zero and higher UUID
insert into demo (id, seq, status, value)
select ('f'||substr(gen_random_uuid()::text,2))::uuid , 0, 10*random(), 'x' from demo;

vacuum analyze demo;

Query with filter, order and fetch first row

I want to retrieve the first row with seq=0 while ordering the results by id. I have an index on id which returns the rows in the desired order, however, the first row with seq=0 is not at the start of the index, causing a lot of scanning. I also have an index on seq that can quickly find all the rows with seq=0, however, sorting is required before selecting the first row.

The following is the plan that has been picked up by the query planner:

yugabyte=# explain (analyze, buffers) 
select * from demo where seq=0 and status in (1,2,3)
order by id limit 1;

QUERY QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..0.83 rows=1 width=30) (actual time=631.777..631.777 rows=1 loops=1)
Buffers: shared hit=940846 read=651
-> Index Scan using demo_pkey on demo (cost=0.43..118912.16 rows=296982 width=30) (actual time=631.775..631.776 rows=1 loops=1)
Filter: ((seq = 0) AND (status = ANY ('{1,2,3}'::integer[])))
Rows Removed by Filter: 937241
Buffers: shared hit=940846 read=651
Planning Time: 0.093 ms
Execution Time: 631.794 ms
(8 rows)

The query planner selected the primary key on id since it is already ordered. However, it could not access seq=0 with an Index Condition and had to apply it as a filter after scanning the entire index. It had to read half of the rows before finding the first row and stopped there because of the limit 1.

We had to read 940846 blocks and discard 937241 rows before returning the one row for the result. The Query planner underestimated the cost of this row’s position in the table and estimated a low cost for it ( cost=0.43..0.83).

Let’s attempt to use the other index. Since PostgreSQL doesn’t provide optimizer hints, we need to find a way to test it. One way to do this is to add a string after the id column. This will preserve the order of the query, but the index won't be utilized for it:

yugabyte=# set max_parallel_workers_per_gather=0;
SET

yugabyte=# explain (analyze, buffers)
select * from demo where seq=0 and status in (1,2,3)
order by id||'dummy' limit 1;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=28869.58..28869.58 rows=1 width=62) (actual time=204.803..204.804 rows=1 loops=1)
Buffers: shared hit=7597
-> Sort (cost=28869.58..29612.04 rows=296982 width=62) (actual time=204.801..204.802 rows=1 loops=1)
Sort Key: (((id)::text || 'dummy'::text))
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=7597
-> Bitmap Heap Scan on demo (cost=5625.35..27384.67 rows=296982 width=62) (actual time=7.831..142.204 rows=299966 loops=1)
Recheck Cond: ((seq = 0) AND (status = ANY ('{1,2,3}'::integer[])))
Heap Blocks: exact=7354
Buffers: shared hit=7597
-> Bitmap Index Scan on demo_seq (cost=0.00..5551.10 rows=296982 width=0) (actual time=6.958..6.958 rows=299966 loops=1)
Index Cond: ((seq = 0) AND (status = ANY ('{1,2,3}'::integer[])))
Buffers: shared hit=243
Planning Time: 0.118 ms
Execution Time: 204.869 ms
(15 rows)

This plan is faster and requires reading fewer pages. We can read a range on this index for seq=0, but the query planner estimates a higher cost ( cost=28869.58..28869.58). This is the case of "Optimizer Fail". For instance, when querying for one row ( limit 1) within a large table, it's impossible for the optimizer to know if this row is at the beginning or the end. The statistics are gathered on all rows, so using them to access a single row is not accurate.

I’ll use another trick in the absence of hints. By adding zero to the sequence, the condition remains unchanged but the index is not used:

yugabyte=# explain (analyze, buffers) 
select * from demo where seq+0=0 and status in (1,2,3)
order by id||'dummy' limit 1;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Limit (cost=52243.48..52243.48 rows=1 width=62) (actual time=345.642..345.643 rows=1 loops=1)
Buffers: shared hit=14706
-> Sort (cost=52243.48..52250.97 rows=2998 width=62) (actual time=345.641..345.642 rows=1 loops=1)
Sort Key: (((id)::text || 'dummy'::text))
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=14706
-> Seq Scan on demo (cost=0.00..52228.49 rows=2998 width=62) (actual time=93.005..282.769 rows=299966 loops=1)
Filter: ((status = ANY ('{1,2,3}'::integer[])) AND ((seq + 0) = 0))
Rows Removed by Filter: 1700034
Buffers: shared hit=14706
Planning Time: 0.109 ms
Execution Time: 345.664 ms
(12 rows)

This plan performs a sequential scan. However, it has the advantage of finding other columns like ‘status’ without additional buffer reads.

The best index

I have three execution plans and unfortunately, none of them are good. It’s unlikely that the query planner will make the best selection in this scenario. It’s similar to asking for the best route to travel from one city to another without any direct roads between them.

In this situation, it’s best to combine the advantages of all the available plans:

  • Scan the required rows in the desired order to avoid sorting, just like with the id index.
  • Direct access to the filtered range to prevent a full scan, just like with the seq index.
  • Read all necessary columns, including status, similar to the Seq Scan that reads rows.

This combination approach will provide the best possible results.

When creating an index that starts with seq, all the seq=0 are grouped in one range. Then, id is next in the key to order them within each range. Finally, status can be used to filter further without having to access the table:



create index demo_seq_id_status on demo(seq, id ASC, status);

Please keep in mind that I have the option to use ASC or DESC for the id to match the ORDER BY, and I can move status to an INCLUDE part if it's frequently updated. I don't require more columns because once I have filtered the table to obtain the only required row, an additional read to the table isn't an issue. I just need to cover the columns that are involved in filtering.

After resetting all parameters to their default and using the original SQL query, we can use this index which is directly picked up by the query planner which estimates the low cost accurately, and the query runs much faster than all other plans we have seen:

postgres=# \c

yugabyte=# explain (analyze, buffers)
select * from demo where seq=0 and status in (1,2,3)
order by id limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..0.75 rows=1 width=30) (actual time=0.024..0.025 rows=1 loops=1)
Buffers: shared hit=4
-> Index Scan using demo_seq_id_status on demo (cost=0.43..96627.13 rows=296982 width=30) (actual time=0.024..0.024 rows=1 loops=1)
Index Cond: (seq = 0)
Filter: (status = ANY ('{1,2,3}'::integer[]))
Buffers: shared hit=4
Planning Time: 0.103 ms
Execution Time: 0.037 ms
(8 rows)

Time: 0.739 ms

The same with YugabyteDB

If you are using YugabyteDB, a distributed PostgreSQL database, the example mentioned will work with a slight change. If you want to use the index on the primary key, You need to specify range sharding with primary key (id ASC) because the default is HASH, which is unsorted. To test different plans, you don't need to use any tricks as pg_hint_plan is already installed by default in YugabyteDB. Unlike monolithic databases, YugabyteDB is distributed and doesn't use shared buffers on the query layer. Therefore, the right metrics to monitor are the number of read requests to the storage which can be displayed using explain (analyze, dist).

I force the sequential scan with /*+ SeqScan(demo) */ :

yugabyte=# explain (analyze, dist)  /*+ SeqScan(demo) */
select * from demo where seq=0 and status in (1,2,3)
order by id limit 1;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=212600.00..212600.00 rows=1 width=60) (actual time=2008.564..2008.565 rows=1 loops=1)
-> Sort (cost=212600.00..212650.00 rows=20000 width=60) (actual time=2008.562..2008.562 rows=1 loops=1)
Sort Key: id
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on demo (cost=0.00..212500.00 rows=20000 width=60) (actual time=579.579..1918.203 rows=299556 loops=1)
Remote Filter: (seq = 0)
Filter: (status = ANY ('{1,2,3}'::integer[]))
Rows Removed by Filter: 700444
Storage Table Read Requests: 977
Storage Table Read Execution Time: 1651.938 ms
Planning Time: 0.146 ms
Execution Time: 2008.615 ms

The query had to go through all the rows in the table, with a Seq Scan. It was only able to push down one filter ( Remote Filter: (seq = 0)) and had to remove many others ( Rows Removed by Filter: 700444) after transferring them from the storage layer to the SQL layer ( Read Requests: 977). Finally, the query had to sort the result ( rows=299556) to get the first row.

To use the original index, I force it with /*+ IndexScan(demo demo_seq) */:

yugabyte=# explain (analyze, dist) /*+ IndexScan(demo demo_seq) */
select * from demo where seq=0 and status in (1,2,3)
order by id limit 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2629.00..2629.00 rows=1 width=60) (actual time=1509.784..1509.785 rows=1 loops=1)
-> Sort (cost=2629.00..2679.00 rows=20000 width=60) (actual time=1509.782..1509.783 rows=1 loops=1)
Sort Key: id
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using demo_seq on demo (cost=0.00..2529.00 rows=20000 width=60) (actual time=6.084..1444.365 rows=299556 loops=1)
Index Cond: ((seq = 0) AND (status = ANY ('{1,2,3}'::integer[])))
Storage Table Read Requests: 293
Storage Table Read Execution Time: 1244.286 ms
Storage Index Read Requests: 293
Storage Index Read Execution Time: 1.454 ms
Planning Time: 0.188 ms
Execution Time: 1510.086 ms

This index was utilized to apply the filter and read a smaller range, but rows still had to be transferred and sorted before sending the first row result.

Finally, with my index on (seq, id ASC, status) there is no need to force anything to obtain the optimal execution plan with the best index

yugabyte=# explain (analyze, dist)
select * from demo where seq=0 and status in (1,2,3)
order by id limit 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.13 rows=1 width=60) (actual time=10.297..10.298 rows=1 loops=1)
-> Index Scan using demo_seq_id_status on demo (cost=0.00..26004.00 rows=200000 width=60) (actual time=10.294..10.294 rows=1 loops=1)
Index Cond: (seq = 0)
Filter: (status = ANY ('{1,2,3}'::integer[]))
Rows Removed by Filter: 1
Storage Table Read Requests: 1
Storage Table Read Execution Time: 8.913 ms
Storage Index Read Requests: 1
Storage Index Read Execution Time: 0.937 ms
Planning Time: 0.129 ms
Execution Time: 10.517 m

The execution plan demonstrates how scalable the database is. It starts with a single read request to the index, which seeks the beginning of the range where seq=0. Rows are then read in order, and the process can immediately stop after one index entry. After that, there's only one additional read request to the table to retrieve the remaining columns. If we include all columns in the index ( create index demo_seq_id_status on demo(seq, id ASC, status) include (value)), we can even skip this step with an Index Only Scan. The time complexity of this is O(1) as it does not depend on the size of the table. This query needs to read only one row and, with the right index, this is one read request in YugabyteDB (the remote read from the LSM-Tree) or few buffer reads in PostgreSQL (the B-Tree height).

Originally published at https://dev.to on December 8, 2023.

--

--

Franck Pachot

Developer Advocate for YugabyteDB (Open-Source, PostgreSQL-compatible Distributed SQL Database. Oracle Certified Master and AWS Data Hero.