CREATE INDEX in YugabyteDB: online or fast?

  • online, which means that it doesn’t block, or even slow down, concurrent operation. This requires additional synchronization steps and is slower. However, slower is not a problem when nobody waits on it.
{
echo "drop table if exists demo;"
echo "create table demo (id int primary key " ,c{1..100}" int" "); "
echo "\\timing on"
for i in c{1..100} ; do echo "create index demo$i on demo($i);" ; done
} | psql

BACKFILL

On another run, looking at pg_stat_statements shows the INDEX BACKFILL which is short, around 50 milliseconds, but the CREATE INDEX is around 40 seconds. Because there are a lot of wait operations for this online index creation:

vd -f json http://localhost:13000/statements
query | calls#| total_time%|↑min_time%| max_time%| mean_time%| stddev_time%| rows#║
select pg_stat_statements_reset() | 1 | 0.18 | 0.18 | 0.18 | 0.18 | 0.00 | 1 ║
BACKFILL INDEX 66055 WITH x'0880011a00' READ TIME 6690842023934689280 PARTITION x'aaaa' | 1 | 38.29 | 38.29 | 38.29 | 38.29 | 0.00 | 0 ║
BACKFILL INDEX 66053 WITH x'0880011a00' READ TIME 6690841697617080320 PARTITION x'aaaa' | 1 | 45.90 | 45.90 | 45.90 | 45.90 | 0.00 | 0 ║
BACKFILL INDEX 66054 WITH x'0880011a00' READ TIME 6690841855848988672 PARTITION x'aaaa' | 1 | 46.97 | 46.97 | 46.97 | 46.97 | 0.00 | 0 ║
BACKFILL INDEX 66057 WITH x'0880011a00' READ TIME 6690842368179253248 PARTITION x'aaaa' | 1 | 54.83 | 54.83 | 54.83 | 54.83 | 0.00 | 0 ║
BACKFILL INDEX 66056 WITH x'0880011a00' READ TIME 6690842190767943680 PARTITION x'aaaa' | 1 | 55.29 | 55.29 | 55.29 | 55.29 | 0.00 | 0 ║
BACKFILL INDEX 66052 WITH x'0880011a00' READ TIME 6690841540903989248 PARTITION x'aaaa' | 1 | 70.18 | 70.18 | 70.18 | 70.18 | 0.00 | 0 ║
create table demo (id int primary key ,c1 int ,c2 int ,c3 int ,c4 int ,c5 int ,c6 int ,c7 int ,c8…| 1 | 5454.30 | 5454.30 | 5454.30 | 5454.30 | 0.00 | 0 ║
create index democ2 on demo(c2) | 1 | 37898.21 | 37898.21 | 37898.21 | 37898.21 | 0.00 | 0 ║
create index democ3 on demo(c3) | 1 | 38600.98 | 38600.98 | 38600.98 | 38600.98 | 0.00 | 0 ║
create index democ5 on demo(c5) | 1 | 40794.47 | 40794.47 | 40794.47 | 40794.47 | 0.00 | 0 ║
create index democ4 on demo(c4) | 1 | 40883.21 | 40883.21 | 40883.21 | 40883.21 | 0.00 | 0 ║
create index democ1 on demo(c1) | 1 | 41132.44 | 41132.44 | 41132.44 | 41132.44 | 0.00 | 0 ║
create index democ6 on demo(c6) | 1 | 45125.51 | 45125.51 | 45125.51 | 45125.51 | 0.00 | 0 ║

NONCONCURRENTLY

The solution when you are alone on the database is to disable backfill. The index will still be created on a snapshot view of the table (which is empty in my case) but without trying to backfill concurrent transactions. This is fine here as I know I have no concurrent DML.

{
echo "drop table if exists demo;"
echo "create table demo (id int primary key " ,c{1..100}" int" "); "
echo "\\timing on"
for i in c{1..100} ; do echo "create index demo$i on demo($i);" ; done
} | psql

This is for no-DML only!

NONCONCURRENTLY must be used only when you can guarantee that no concurrent DML will happen on the table. This will not be verified by the database engine, as this is the way to run it faster. It is different from an offline index creation, like in PostgreSQL without the CONCURRENTLY keyword, because the table is not locked in YugabyteDB. This means that CREATE INDEX NONCONCURRENTLY will leave the index in an incorrect state if there are modifications to the underlying table. And this is not visible from pg_index.

poc=# create table demo (a int ,b int);
CREATE TABLE
poc=# \! for i in {1000..1005} ; do psql -c "insert into demo select n,n from generate_series(1,1000) n" ; done &
poc=#
INSERT 0 1000
create index nonconcurrently demoa on demo(a);INSERT 0 1000CREATE INDEX
poc=# select /*+ SeqScan(demo) */ count(*),sum(a),sum(b) from demo where a>0; count |   sum   |   sum
-------+---------+---------
2000 | 1001000 | 1001000
(1 row)
poc=# select /*+ IndexScan(demo demoa) */ count(*),sum(a),sum(b) from demo where a>0;
count | sum | sum
-------+--------+--------
1000 | 500500 | 500500
(1 row)
poc=# INSERT 0 1000poc=#
poc=# select /*+ SeqScan(demo) */ count(*),sum(a),sum(b) from demo where a>0; count |   sum   |   sum
------------+---------+---------
3000 | 1501500 | 1501500
(1 row)
poc=# select /*+ IndexScan(demo demoa) */ count(*),sum(a),sum(b) from demo where a>0; count | sum | sum
------------+---------+---------
2000 | 1001000 | 1001000
(1 row)
poc=# \x
Expanded display is on.
poc=# select relname,indisvalid,indisready,indislive from (select oid indexrelid, relname from pg_class where relname like 'demoa') r natural join pg_index i;
-[ RECORD 1 ]--+------
relname | demoa
indisvalid | t
indisready | t
indislive | t

In summary:

There’s a reason why online index creation is the default: it guarantees the consistency of data whatever happens on your database.

--

--

Developer Advocate at Yugabyte, Open Source distributed SQL database 🚀 Also Oracle ACE Director, Oracle Certified Master, AWS Data Hero, OakTable member

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Franck Pachot

Franck Pachot

502 Followers

Developer Advocate at Yugabyte, Open Source distributed SQL database 🚀 Also Oracle ACE Director, Oracle Certified Master, AWS Data Hero, OakTable member