CREATE INDEX in YugabyteDB: online or fast?

Originally published at https://dev.to on October 7, 2021.

When you create an index, you want the creation to be either:

  • fast, blocking concurrent writes to get a stable version of the rows until the creation is completed, usually with a share lock on the table
  • online, which means that it doesn’t block, or even slow down, concurrent operation. This requires additional synchronization steps and is slower. However, slower is not a problem when nobody waits on it.

On traditional databases, we are used to pessimistic locking, and blocking-DDL statements is the default. Online index creation is an option, and sometimes with the high cost of Enterprise License. The reason is also legacy: online operations were implemented later, when applications became 24/7. But YugabyteDB is new, is free, and, even if supporting all SQL operations, has its roots in hyper-scale databases where you do not wait on locks. The default is to allow concurrent DML when creating an index. According to the above classification, this means that the CREATE INDEX may appear slow. This not a problem when creating an index online, as the application is still running. However there are some situations where you know you are alone on the system and want the fastest index creation. Especially when you have many indexes to create because you are importing a dump, or running a full schema DDL. CI/CD may also re-ceate the schema frequently and you definitely don’t want the overhead to synchronize with potential concurrent DML.

As an example, I’ve run the following:

This 100 index creation ran in 1 hour, with each creation taking between 30 and 60 seconds. The reason is that, to support backfill of concurrent DML, there are multiple changes that need to be in sync for all yb-tserver. This goes though yb-master heartbeats, which introduces some waits. Note that I’m writing this on version 2.9 and it may be improved in the future.

BACKFILL

On another run, looking at pg_stat_statements shows the INDEX BACKFILL which is short, around 50 milliseconds, but the CREATE INDEX is around 40 seconds. Because there are a lot of wait operations for this online index creation:

Note that BACKFILL INDEX is an internal command. You cannot run it yourself, but it is monitored by pg_stat_statements.

NONCONCURRENTLY

The solution when you are alone on the database is to disable backfill. The index will still be created on a snapshot view of the table (which is empty in my case) but without trying to backfill concurrent transactions. This is fine here as I know I have no concurrent DML.

This 100 index creation ran in 4 minutes with each creation taking between 30 and 50 milliseconds.

This is for no-DML only!

NONCONCURRENTLY must be used only when you can guarantee that no concurrent DML will happen on the table. This will not be verified by the database engine, as this is the way to run it faster. It is different from an offline index creation, like in PostgreSQL without the CONCURRENTLY keyword, because the table is not locked in YugabyteDB. This means that CREATE INDEX NONCONCURRENTLY will leave the index in an incorrect state if there are modifications to the underlying table. And this is not visible from pg_index.

So this is to be used only to load a schema without other activity on it.

I want to be sure you know the consequences. Here is an example of bad things that can happen if you incorrectly mention NONCONCURRENTLY:

This runs 5 times a 1000 rows insert with same value for demo.a and demo.b, and the first iteration already occurred.

I’ve created the index with nonconcurrently when, obviously, there was some DML at the same time (we see that another insert of 1000 rows happened during that time)

The Seq Scan on the table shows the 2000 rows, which is right given the feedback of two INSERT 0 1000.

However, from the index I see only a snapshot from the point in time where the CREATE INDEX happened:

Then a new iteration occurs:

Now, here is what is seen from the table (Seq Scan) and index (Index Scan):

The index continues to be maintained, but missed the rows that were inserted during the create, because there were no backfill.

Unfortunately, in the current version (2.9) there’s no indication about it, so be careful:

In summary:

There’s a reason why online index creation is the default: it guarantees the consistency of data whatever happens on your database.

Yes, you can create indexes faster with NONCONCURRENTLY, and there are good cases for that (restore, migration, CI/CD) but it is your responsibility to ensure that there are no concurrent DML.

--

--

Developer Advocate at Yugabyte, Open Source distributed SQL database 🚀 Also Oracle ACE Director, Oracle Certified Master, AWS Data Hero, OakTable member

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Franck Pachot

Developer Advocate at Yugabyte, Open Source distributed SQL database 🚀 Also Oracle ACE Director, Oracle Certified Master, AWS Data Hero, OakTable member