How a Distributed SQL Database Boosts Secondary Index Queries with Index Only Scan

Index Scan: a real-life example

Here is the classic CUSTOMERS and ORDERS schema for a company selling coffee capsules, online and in local shops:

create table customers (
  customer_id   bigint constraint cust_pk primary key,
  customer_name text   constraint cust_uk unique
);
create table orders (
  order_id     bigint primary key,
  customer_id  bigint references customers,
  order_date   date,
  order_amount decimal
);
The query we want to optimize computes the total amount ordered by one customer over the last ten years:

select sum(ord.order_amount)
from customers cus join orders ord on cus.customer_id=ord.customer_id
where cus.customer_name='George Clooney'
and order_date > now() - interval '10 years';
The obvious first step is to index the join column on the ORDERS side:

create index ord_cust_id on orders(customer_id);
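
The plan output below has the shape of verbose output without costs; the exact EXPLAIN options used to produce it are my assumption, something like:

explain (verbose, costs off)
select sum(ord.order_amount)
from customers cus join orders ord on cus.customer_id=ord.customer_id
where cus.customer_name='George Clooney'
and order_date > now() - interval '10 years';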
Aggregate
  Output: sum(ord.order_amount)
  ->  Nested Loop
        Output: ord.order_amount
        ->  Index Scan using cust_uk on public.customers cus
              Output: cus.customer_id, cus.customer_name
              Index Cond: (cus.customer_name = 'George Clooney'::text)
        ->  Index Scan using ord_cust_id on public.orders ord
              Output: ord.order_id, ord.customer_id, ord.order_date, ord.order_amount
              Index Cond: (ord.customer_id = cus.customer_id)
              Filter: (ord.order_date > (now() - '10 years'::interval))

Moving to Index Only Scan

Now, let’s think about how the data is stored. This company has millions of customers. They order capsules regularly, but not frequently. Let’s say this loyal customer places an order every month: over ten years, that is 120 rows to fetch. That doesn’t seem like a lot, but think about where those rows are stored. They are scattered across the whole table, because they arrived over those ten years, interleaved with millions of other customers' orders.
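
To picture this, here is a hypothetical data set of mine (the names and volumes are an assumption, just to reproduce the situation described above): one loyal customer with a monthly order, drowned in millions of interleaved orders from other customers.

-- one loyal customer, one order per month for 10 years
insert into customers values (1, 'George Clooney');
insert into orders (order_id, customer_id, order_date, order_amount)
 select n, 1, now() - n * interval '1 month', 50
 from generate_series(1, 120) n;
-- millions of other customers and their orders, spread over the same 10 years
insert into customers
 select n, 'customer #'||n from generate_series(2, 1000000) n;
insert into orders (order_id, customer_id, order_date, order_amount)
 select 1000 + n, 2 + (random()*999998)::int,
        now() - random() * interval '10 years', (random()*100)::int
 from generate_series(1, 10000000) n;

Each entry of ord_cust_id points to one of those scattered rows, so the 120 index lookups become 120 reads into different places of the table. The fix is to let the index itself provide every column the query reads from ORDERS: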

create index ord_cust_new on orders(customer_id, order_date desc)
include (order_amount);

The same query now reads everything it needs from ORDERS with an Index Only Scan:

Aggregate
  Output: sum(ord.order_amount)
  ->  Nested Loop
        Output: ord.order_amount
        ->  Index Scan using cust_uk on public.customers cus
              Output: cus.customer_id, cus.customer_name
              Index Cond: (cus.customer_name = 'George Clooney'::text)
        ->  Index Only Scan using ord_cust_new on public.orders ord
              Output: ord.customer_id, ord.order_date, ord.order_amount
              Index Cond: ((ord.customer_id = cus.customer_id) AND (ord.order_date > (now() - '10 years'::interval)))
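
A side note of mine on the design choice (not from the original text): the include clause stores order_amount in the index entry without making it part of the sorted key. Putting it in the key, as sketched below with a hypothetical name, would also make the scan index-only, but order_amount is never used as an access condition here, so include is the natural place and keeps the key columns meaningful.

-- alternative (not recommended here): order_amount as a trailing key column
-- provides the same coverage but adds nothing useful to the sort order
create index ord_cust_alt on orders(customer_id, order_date desc, order_amount);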

A larger index, but not a new one

It is important to note that we created a second index here to show this tuning method, and the query planner will choose it. But this new index should replace the previous one rather than coexist with it. As long as the previous columns stay first in the key column list, adding columns creates more access paths while still serving the previous ones. In our example, this means I can:

drop index ord_cust_id;
explain select count(*) from orders where customer_id=1;

                            QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=15.50..15.51 rows=1 width=8)
   ->  Index Only Scan using ord_cust_new on orders  (cost=0.00..15.25 rows=100 width=0)
         Index Cond: (customer_id = 1)
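
As an illustration of my own (not from the original) of an access path the old single-column index could not serve: because order_date desc is now part of the key, the latest order of a customer should be read from the first matching index entry, with no sort step.

-- expected to use ord_cust_new and stop after the first index entry
select order_date, order_amount
from orders
where customer_id = 1
order by order_date desc
limit 1;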

When to use Index Only Scan

Now, you may have heard some myths about fat indexes. One is about the overhead of maintaining them when rows are inserted, deleted or updated. But you need to put that in context. For INSERT and DELETE, the index maintenance is the same because you need the index anyway; this one just has more columns. For UPDATE, index maintenance is needed only when you change a value that is stored in the index. In our example, the order amount will not change once inserted, so this concern does not apply. We use this optimization because the value rarely changes but is queried for many rows.
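
To make this concrete, here is a sketch of mine (the order_note column is hypothetical, it is not part of the schema above) showing which updates have to touch ord_cust_new in YugabyteDB:

-- order_amount is stored in the index entry (include column):
-- the entry for this order must be updated
update orders set order_amount = 42 where order_id = 1;

-- a column that is not in any secondary index:
-- no secondary index maintenance is needed for this update
alter table orders add column order_note text;
update orders set order_note = 'gift wrap' where order_id = 1;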

Covering functions in the index

So far, the index examples we have used are based on columns, but YugabyteDB, an open source distributed SQL database for transactional applications, also supports functional (expression-based) indexes. Let’s say we have a query like this:

select sum(ord.order_amount)
from customers cus join orders ord on cus.customer_id=ord.customer_id
where cus.customer_name='George Clooney'
and extract(dow from order_date)=0 /* Sunday */;

We can index the expression itself, while still covering the other columns the query reads:

drop index ord_cust_new; -- redefine the covering index for this query
create index ord_cust_new
  on orders(customer_id, (extract(dow from order_date)))
  include (order_amount, order_date);

The relevant part of the plan shows the expression used as an index condition:
->  Index Only Scan using ord_cust_new on orders ord
      Index Cond: ((customer_id = cus.customer_id) AND ((date_part('dow'::text, (order_date)::timestamp without time zone)) = '0'::double precision))
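
One practical caveat from me (not stated above): the planner matches an expression index on the expression itself, so the query has to filter on the very same expression that was indexed.

-- can use ord_cust_new: same expression as in the index definition
select count(*) from orders
 where customer_id = 1 and extract(dow from order_date) = 0;

-- a semantically similar predicate written differently cannot use the
-- indexed expression as an access condition (at best it becomes a filter)
select count(*) from orders
 where customer_id = 1 and to_char(order_date, 'D') = '1';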

But what about primary indexes?

I explicitly mentioned secondary indexes in the title, but what about the primary key? The good news is that, in YugabyteDB, you don’t have to care. An “Index Scan” on the primary key finds all columns without any additional work, because the whole table is stored in the primary key’s LSM-tree structure. Do not expect an “Index Only Scan” from the primary index: it is already optimized by the DocDB storage. This is different from PostgreSQL, or Oracle heap tables, where all indexes are secondary indexes and always need an extra hop to the table when they do not cover all the query columns. YugabyteDB is primarily optimized for OLTP with primary key access. The secondary index optimization discussed here allows the additional analytic queries that are part of many OLTP applications to run efficiently within the same OLTP database.
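
As a quick check of this (my own example; the exact operation name may vary by version), a primary-key lookup can be explained like this:

explain (costs off) select * from orders where order_id = 1;
-- expect an Index Scan using orders_pkey on orders, with no separate
-- table access: in YugabyteDB the row is stored in the primary key itself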
