What is a “truly distributed UDF” ?
This is a re-post from dev.to which is the main place where I blog:
https://dev.to/franckpachot — original article from Dec 9, 2022
YugabyteDB has always supported User Defined Functions (UDFs) in its PostgreSQL compatible API because… it is PostgreSQL compatible.
I’ve been working with Oracle database since version 7.1 and UDFs, as PL/SQL function were already there (as far as I remember). But it was like an anti-pattern when used from a SQL query because, running in another engine, context switches between SQL and PL/SQL was killing the performance. In Oracle 12c came the possibility to run them in the SQL engine (Pragma UDF) and in Oracle 19c came the real in-lining of UDFs with SQL Macros. With those, UDFs can finally be used for what developers are asking for: encapsulate complex processing logic directly within SQL.
PostgreSQL has a great support for UDFs, both in functionalities and performance and this is available in YugabyteDB. A simple CREATE FUNCTION covers the features known in other databases as UDF, SQL Macro or Parameterized View.
Recently, Cockroach announced that they added support for UDFs because this was the most frequently requested features. Building their database from scratch, it comes with lot of limitations (no use in indexes, views, in other functions, no WITH clause or subqueries). Most of the exemples here and the previous posts, which are PostgreSQL compatible, doesn’t work in the current version of CRDB. But the good news from it is that users are asking for UDFs when a database doesn’t support it. No surprises: the SQL features added to traditional databases in the past 20 years were added because developers needed them. That’s why Yugabyte was anticipating this by re-using the PostgreSQL code to get all SQL features from already proven code.
Here are some examples with scalar expressions (to be used in the SELECT, WHERE, HAVING, GROUP BY, ORDER BY clauses) and table functions (used in the FROM clause).
Scalar functions
I’m taking those functions from the examples in pg_tle, an extension that helps packaging those functions. The functions calculate the Manhattan and Euclidian distance:
CREATE FUNCTION my_dist(x1 float8, y1 float8, x2 float8, y2 float8, norm int)
RETURNS float8
AS $UDF$
SELECT (abs(x2 - x1) ^ norm + abs(y2 - y1) ^ norm) ^ (1::float8 / norm);
$UDF$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
CREATE FUNCTION manhattan_dist(x1 float8, y1 float8, x2 float8, y2 float8)
RETURNS float8
AS $$
SELECT dist(x1, y1, x2, y2, 1);
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
CREATE FUNCTION euclidean_dist(x1 float8, y1 float8, x2 float8, y2 float8)
RETURNS float8
AS $$
SELECT dist(x1, y1, x2, y2, 2);
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
This is typically the kind of code that you want to define only once and then use in SQL queries, without additional latency between application and database, or between different engines in the database.
Here is an example creating a table with a 8x8 points and looking at Manhattan distance and min/max Euclidian distance between those points
CREATE TABLE demo as select a,b from generate_series(1,8) a, generate_series(1,8) b;
select manhattan_dist(x.a, x.b, y.a, y.b)
,min(euclidean_dist(x.a, x.b, y.a, y.b))
,max(euclidean_dist(x.a, x.b, y.a, y.b))
from demo x , demo y
group by manhattan_dist(x.a, x.b, y.a, y.b)
order by 1
;
The result:
Note that this works only on PostgreSQL and PostgreSQL-compatible databases like YugabyteDB or Amazon Aurora. I tried in CockroachDB 22.2 where UDFs were introduced. I was able to create the first dist
function after changing a few things (syntax error for PARALLEL SAFE
and ERROR: unsupported binary operator: <float> ^ <int>
) but the second function fails with ERROR: unknown function: dist(): function undefined
as if the first one is not visible.
In YugabyteDB the execution of the function occurs in the PostgreSQL backend. It is not pushed down to the storage, like the functions in the previous post, because there are multiple tables involved here. But it is still distributed because connections are distributed though the cluster in YugabyteDB.
In the execution plan, it shows as Nested Loop
but there's only one call to the storage (Storage Table Read Requests: 1
), and loops (loops=64
) occur only on the Materialize
result:
Table functions
SQL is all about tables, sets of records, and you can use UDF to return rows. Why this rather than a view? Because you can pass parameters. For example, I’ve created my demo
table above with a select on two generate_series()
. This is the kind of logic that you want to encapsulate in a function:
create or replace function chessboard(size int default 64)
returns table(a int, b int) as $UDF$
select a,b
from generate_series(1,chessboard.size) a
, generate_series(1,chessboard.size) b;
$UDF$ language SQL;
select * from chessboard(3);
This is easy to test and use:
Function is a simple way to encapsulate complex processing logic directly within SQL. The concept exists in all languages and can be used the same in PostgreSQL or YugabyteDB: modular development, code re-use, unit tests… The only reason why there’s not big shout out about it is that this exists for a long time in PostgreSQL, with all feature and optimizations, and came seamlessly into Yugabyte thanks to the fork/merge approach of the query layer.
Two articles follow on the dev.to series:
- EXPLAIN (ANALYZE, DIST) 🚀 YugabyteDB distributed execution plan
- SQL Macros in YugabyteDB with PostgreSQL UDF
1What is a “truly distributed UDF” ?2EXPLAIN (ANALYZE, DIST) 🚀 YugabyteDB distributed execution plan3SQL Macros in YugabyteDB with PostgreSQL UDF