PostgreSQL: measuring query activity (WAL size generated, shared buffer reads, filesystem reads,…)

Franck Pachot
5 min read · May 27, 2019

When I want to know if my application scales, I need to understand the work done by my queries. There is no need to run a huge amount of data from many concurrent threads: if I can get the relevant statistics behind a single unit test, then I can infer how it will scale. For example, reading millions of pages to fetch a few rows will cause shared buffer contention. Or generating dozens of megabytes of WAL for a small update will cause waits on disk writes, and penalize the backup RTO or the replication lag.

I’ll show some examples. From psql, I collect the statistics (which are cumulative from the start of the instance) before:

select *,pg_current_wal_lsn() from pg_stat_database where datname=current_database() \gset

and calculate the difference to show the delta:

select blks_hit-:blks_hit "blk hit", blks_read-:blks_read "blk read",
       tup_inserted-:tup_inserted "ins", tup_updated-:tup_updated "upd",
       tup_deleted-:tup_deleted "del", tup_returned-:tup_returned "tup ret",
       tup_fetched-:tup_fetched "tup fch", xact_commit-:xact_commit "commit",
       xact_rollback-:xact_rollback "rbk",
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),:'pg_current_wal_lsn')) "WAL",
       pg_size_pretty(temp_bytes-:temp_bytes) "temp"
  from pg_stat_database where datname=current_database();

Most of the statistics come from pg_stat_database. The WAL size is calculated from the current WAL write pointer exposed with pg_current_wal_lsn(), and the size of the difference is calculated with pg_wal_lsn_diff(). I use \gset to store the "before" values as psql substitution variables and use them to compute the delta afterwards.
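
Put together, a measurement is a sandwich: snapshot the counters, run the statement under test, then show the delta. A minimal sketch (t_probe is a hypothetical throwaway table, and only a few of the counters are shown here):

select *,pg_current_wal_lsn() from pg_stat_database where datname=current_database() \gset
create table t_probe as select 1 x; -- the statement under test
select tup_inserted-:tup_inserted "ins", blks_hit-:blks_hit "blk hit",
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),:'pg_current_wal_lsn')) "WAL"
  from pg_stat_database where datname=current_database();
drop table t_probe;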

This is better shown with examples. I create a DEMO database and a DEMO table:

demo=# create table DEMO(n int primary key, flag char, text varchar(1000));
CREATE TABLE

Insert 10000 rows

I’ve run the following insert:

insert into DEMO
select generate_series, 'N',lpad('x',1000,'x') from generate_series(1,10000);
INSERT 0 10000

I’ve run my statistics queries before and after this insert, in order to get the delta.

I’ve inserted about 10MB of data (10000 rows with a 1000-byte text and some additional small columns). All those new rows must be logged in the WAL, which is the reason for the 11MB of WAL generated. The pages where the rows were written had to be read from disk in order to update them, which is the reason for the 1504 block reads (that’s about 11MB in 8k blocks). We also see 30000 additional block hits in the buffer cache, and this is the index maintenance: those 10000 rows were inserted one by one, updating the B*Tree index, which may have a height of 3, and that’s 30000 pages touched. Fortunately, they stayed in the shared buffers, which is why there’s no block read each time.
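
The index maintenance can be cross-checked with the pageinspect contrib extension, if it is installed (a sketch, assuming the primary key index got the default name demo_pkey):

create extension if not exists pageinspect;
select level from bt_metap('demo_pkey'); -- root level; leaf pages are level 0, so the B-Tree height is level+1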

This matches with the size of my table:

select relkind, relname, relpages, pg_size_pretty(relpages::bigint*8*1024)
  from pg_class natural join (select oid relnamespace, nspname from pg_namespace) nsp
 where nspname='public';

11MB of data, nearly 1500 pages — that’s what I got from the statistics.

Count the rows

Now, I’ll check how many blocks have to be read when counting all rows:

select count(*) from DEMO ;

All table blocks were read, fortunately from the shared buffers. The reason is in the execution plan: a full table scan (Seq Scan). Actually, I would expect an Index Only Scan for this query because all rows can be counted from the index. But the MVCC implementation of PostgreSQL versions only the table tuples, not the index entries, so it has to visit the table, unless a recent vacuum has updated the visibility map.
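
The plan and the buffer accesses can also be seen per query with EXPLAIN (a sketch; the exact buffer counts depend on the run):

explain (analyze, buffers, costs off) select count(*) from DEMO;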

Vacuum

Let’s vacuum to update the visibility map:

vacuum DEMO;

Many block reads: that’s like a full table scan, with minimal updates here because there are no dead tuples to remove.
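
The effect on the visibility map can be checked with the pg_visibility contrib extension, if it is installed (a sketch):

create extension if not exists pg_visibility;
select * from pg_visibility_map_summary('demo'); -- pages marked all-visible / all-frozen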

Count from vacuumed table

Now, counting the rows again:

select count(*) from DEMO ;

Finally I have a real Index Only Scan, accessing only 45 buffers to count the 10000 index entries. That’s the most important point about PostgreSQL performance: MVCC allows mixing read and write workloads, thanks to non-blocking reads, but ideally a vacuum should occur between massive write and massive read use cases. Again, there’s no need to run a massive pgbench concurrent workload to observe this. Small queries are sufficient as long as we look at the right statistics.
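
Those 45 buffers can be compared with the size of the primary key index (a sketch, assuming the default index name demo_pkey; the index-only scan reads the index pages plus a few visibility map pages):

select relpages, pg_size_pretty(pg_relation_size('demo_pkey')) from pg_class where relname='demo_pkey';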

Update

I update the ‘flag’ column to set half of the rows to ‘Y’:

update DEMO set flag='Y' where n%2=0;

This is the operation where PostgreSQL MVCC is less efficient. I changed only one byte per row, but all updated rows were copied to new blocks. As I’m touching half the rows, they fit in half the blocks: these are the 743 blocks read from disk. And the old versions had to be marked as such… in total, nearly 30000 buffer accesses to update 5000 flags. And the worst part is the WAL generation: 16MB, as the new tuples must be logged, as well as the old versions being marked. And PostgreSQL must do full-page logging even when only a few rows/columns are modified.

More about this in a previous blog post, where I was using strace to get the size of WAL written. Now, using the delta of the WAL write pointer, I see the same figures.
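
To isolate the WAL volume of a single statement, the same \gset trick works with just the WAL pointer (a sketch; re-running the update on already-flagged rows still creates new row versions, so the figure is only indicative):

select pg_current_wal_lsn() \gset
update DEMO set flag='Y' where n%2=0;
select pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),:'pg_current_wal_lsn')) "WAL for this update";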

Delete

Now, deleting the non-flagged rows:

delete from DEMO where flag='N';

Deleting 5000 rows here has to scan all blocks to find them (that’s the 10000 tuples returned), which is about 1500 buffers accessed. And the 5000 rows found are marked as deleted, which is 5000 additional buffer accesses.
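
After the delete, the old row versions remain as dead tuples until the next vacuum, which can be checked in pg_stat_user_tables (a sketch; these counters are approximate):

select n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  from pg_stat_user_tables where relname='demo';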

Why?

This post is here mainly to show the simple query I use to get SQL execution statistics, including WAL writes, which are probably the most useful but are unfortunately missing from pg_stat_database. I’m also advocating for small test cases that are fully understood, rather than general benchmarks that are difficult to analyze. It is, in my opinion, the best way to understand how it works, both for educational purposes and to guarantee scalable applications.


Franck Pachot

Developer Advocate for YugabyteDB (Open-Source, PostgreSQL-compatible Distributed SQL Database). Oracle Certified Master and AWS Data Hero.