Postgres notes

Personal notes on Postgres.

Some notes on Postgres. Sort of a part of a personal wiki I’m working on.

Run v8 inside PG for triggers, etc: https://plv8.github.io.

The pipelining feature is neat: https://www.postgresql.org/docs/devel/libpq-pipeline-mode.html. Really helpful when you’re using a write-only pattern with multiple statements that need to be sequential.

The function unnest will unnest arrays all the way down, which can be confusing when you’re using custom types like create type element_id as (st int4, mid int8, cap int8);. Just something to know.

If you need to use a lock inside your application query code, there’s a pg fn for that: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS

Using postgres for an experimental analytics app prototype, and I’m having trouble writing data fast enough. (Comparing against SQLite which, with simple data, peaks around 3_000_000 inserts/second.) Don’t have time to explore at the moment, but using COPY FROM with a csv file might be an option. What would happen if I just start streaming data into the csv as I use a COPY FROM query? Does pg wait and sort of keep the query running while it waits for EOF? Would it be faster to use prepared statements and just write directly? Or wait until the file is done, then do a COPY FROM? As long as the total load time is faster than doing single or batched inserts, it’d be worth it.

Or maybe there’s a way to write arrow files and tell PG to use them as tables using a FDW? Something to explore more.

Large object details: https://www.postgresql.org/docs/current/lo-interfaces.html#LO-CREATE.

As usual, Citus has some good thoughts on performance: https://www.citusdata.com/blog/2017/09/29/what-performance-can-you-expect-from-postgres/.

A little old, but generally good info on storing huge files inside pg. https://stackoverflow.com/questions/9605922/are-there-performance-issues-storing-files-in-postgresql.

Simple example of loop in function.

count := 0
LOOP
    RAISE NOTICE count;
    count := count + 1;
    EXIT WHEN count = 5;
END LOOP;

create
or replace
function do_thing()
returns void
security definer
set search_path = public
as $$
declare
count int;
begin
    count := 0
    loop
        raise notice count;
        count := count + 1;
        exit when count = 5;
    end loop;
end;
$$ language plpgsql;

Cool little trick to hash an entire row: https://stackoverflow.com/questions/3878499/finding-the-hash-value-of-a-row-in-postgresql.

select md5(cast((o.*) as text))
from objects o;

Updating enums.

alter type role_enum rename value 'object_owner' to 'object_admin';

Quick function to cast unknown text to uuid, defaulting to null/zero uuid.

create or replace
function cast_to_uuid(text)
returns uuid
language plpgsql
as $$
begin    
    return cast($1::text as uuid);
    exception
    when invalid_text_representation then
    return '00000000-0000-0000-0000-000000000000'::uuid;
end;
$$;

On pgtap:

The docs are just okay, and this Supabase doc is actually a better primer: https://supabase.com/docs/guides/database/extensions/pgtap.

If you’re using pgtap for supabase, the usebasejump/supabase-test-helpers GitHub repo has some good functions to get you started with auth and testing RLS.

The select diag(...) helper is okay, but you’re better off using RAISE NOTICE as it will tell you line numbers.

Other links:

Rough and old (but still useful) plpgsql cheatsheet: https://www.postgresonline.com/special_feature.php?sf_name=postgresql83_plpgsql_cheatsheet&outputformat=html.

I guess you can write anonymous code blocks if you want to: https://www.enterprisedb.com/postgres-tutorials/how-use-block-structure-write-anonymous-blocks-and-divide-larger-blocks-logical

do language plpgsql $$
declare
begin
raise notice 'hello there';
end;
$$;

RLS is nice, and for things like PostgREST it’s pretty useful. But there are a lot of things to think about if you’re exposing it over the open internet using only RLS. For example, like Supabase does.

One big thing is simply column-level security. If we’re going to use RLS to say some users can only select/update/insert/delete some rows, it follows that you might want to say that only some users to be able to select or update some columns.

There aren’t a lot of good ways to do this. The sloppy way is to use CREATE VIEW based on a table, and omit some data from it. But you can’t insert into it. A more complete way is to use GRANT to control columns themselves.

---  revoke all
revoke all privileges on objects from authenticated;
---  explicitly grant by action and columns
grant select (id, title, size, created_at) on objects to authenticated;

A much weirder alternative is to use column encryption. (Suggested here https://www.enterprisedb.com/postgres-tutorials/how-implement-column-and-row-level-security-postgresql.) But then you end up having to load the key for specific queries, which is just about as much work as using CREATE VIEW.

On the whole there aren’t many rock-solid solutions here. RLS + CLS are pretty enterprise-ish features, so it makes sense, as big organizations would be the ones willing to pay the tax to use them. Still, it seems very natural to want to do CLS, and expose some subset of your database to users/clients/partners without defining an entire API. For all its faults, SQL is far more expressive that any API/RPC/GraphQL wrapper.

For RLS policies, USING statements check existing table rows, while WITH CHECK are checking for new rows.

NOTIFY is neat.

Here (https://coussej.github.io/2015/09/15/Listening-to-generic-JSON-notifications-from-PostgreSQL-in-Go) is a pretty good post on how to use notify/listen logic. You basically create a trigger function on a table, and use pg_notify to send some value over a channel. Then, using external application code, like Go or something, you create a listener using the API, and listen on a specific channel name.

It seems like best practices are to create a custom record type or use JSON as the payload, but only include the data that your listener needs to know about. Then in your notification loop, you either use just that data, or you query to enrich your data. This is because your channel payload is limited to 8000 bytes.

Something like:

create or replace function notify_event() returns trigger as
$$
begin
    -- This is where you put your logic to build the event.
    -- Doesn't have to be JSON, could be whatever you'd like.
    -- Could even use table for a queuing system.
    perform pg_notify('events', '{"key":21921}');
    return null;
end;
$$ language plpgsql;

PostgREST (https://postgrest.org/en/stable/) has become more popular over the last couple of years. When combined with functions, it may be one of the best ways to move quickly with pg. I’d never expose it to the public internet without doing really thorough testing of RLS, policies, and permissions. But for an internal service? Seems about right.

You can write pg functions in sql, C, pgsql, TCL, Perl, and Python: https://www.postgresql.org/docs/current/xplang.html. My take on the PL languages is this; sql for functions makes sense, C is just a way to allow for external languages to be used, pgsql is basically just sql that lets you do for-loops and variables, but TCL, Perl, and Python all seem like a bridge too far. What are we doing here? Who let these clowns in?

Writing functions in an external C-ish language is possible. Good blog post about it here: https://www.opsdash.com/blog/postgresql-triggers-golang.html.

You basically write an isolated function and use GCC compile it to a .so file, and then register that in pg by loading it via path, and telling pg to consider it written in C.

It’s a little rough, because you have to do all the error handling yourself inside your code, and depending on how you allocate memory, it could be an issue, but it’s possible.

Supabase is a solid way to work with PG locally, with a focus on a sort of “monolith-in-a-box” set of features. Comes with pgsodium, pg_stat_statements, pgcrypto, pgjwt, uuid-ossp. and pg_graphql installed. Also has an okay UI for running queries and goofing around if you don’t want to install DBeaver.

Backups are generally done with https://www.postgresql.org/docs/current/app-pgdump.html, or https://github.com/wal-g/wal-g. Both offer point-in-time snapshots, but I’ve heard from a couple of DBAs that WAL-G is the more robust solution that companies are using if they’re running their own DBs. Managed PG providers go lower than that, and just use block-level snapshots of a disk, because the instances are attached to logical volumes on physical SSDs.

Running cron tasks in pg seems to be done mostly with https://github.com/citusdata/pg_cron, but at the cost of running an additional sort-of “side-car” service. Many managed services that offer pg have this extension available, but maybe not enabled, by default.

I love nearly all of pg’s superpowers, but this is one where I question the requirement. Seems like there aught to be a better way to do scheduling stuff for pg, and it’s not a check-by-jowl service emulating a unix scheduler.

Keeping an eye on this space.

Monitoring the progress of a query seems super difficult. There are crude tricks you can use like using a total, and then incrementing a value towards that total. More on that here: https://www.brianlikespostgres.com/postgres-query-progress-bar.html.

There’s nothing like sqlite’s sqlite3_progress_handler, but that makes sense as pg is not an embedded db. (What would that even look like? Something inside like NOTIFY/LISTEN?)

There are more complex things you can do, some of them outlined here: https://postgres.ai/blog/20220114-progress-bar-for-postgres-queries-lets-dive-deeper. The tl;dr is that it’s somewhat possible, but you need to do a lot of weird things, and possibly patch the core, and even then you get down to it there are still a lot of caveats.

I’m keeping an eye on this space as well.

Foreign data wrappers are interesting. I think of them as roughly webhooks for postgres. postgres_fdw appears to be the most common – people have multiple databases or clusters per service, and they want to run queries across all of them. Although, as long as you properly conformed to pg’s wire protocol, you could use a postgres_fdw with almost any service. (I’m sure there’s a company out there doing exactly this to offer a BI or data lake service.)

Good summary: https://thoughtbot.com/blog/postgres-foreign-data-wrapper
List of wrappers: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

When using VIEW and MATERIALIZED VIEW, one is stored, and the other is not. A plain VIEW is just a query that is named like a table, and is run when it’s used. But a MATERIALIZED VIEW is stored, and needs to be refreshed using the REFRESH statement.

I think of this as a VIEW being a reference to a query, and a MATERIALIZED VIEW is a cached copy of a query’s results, refreshed on demand.

Additionally, it’s apparently more efficient that a simple copied table, even if you’re running a query before coping. An MV stores some join data and index data as well. Additionally, you can use REFRESH MATERIALIZED VIEW CONCURRENTLY to only update rows that changed, as long as you don’t use a WHERE clause in the query, and don’t mind locking access to the view.

A good summary of MVs is here: https://www.timescale.com/blog/how-postgresql-views-and-materialized-views-work-and-how-they-influenced-timescaledb-continuous-aggregates/

Table inheritance could be useful, but also could be a foot-gun. My gut tells me this is what VIEW and MATERIALIZED VIEW are for, along with proper schema design.

https://www.postgresql.org/docs/current/tutorial-inheritance.html

The search path is a shorthand way of setting the schemas to use in a query, so we don’t have to write it out each time. Eg: SET search_path TO public; I prefer to set it on all functions, just to be explicit about it.

Quick and dirty way to check email validity inside a plpgsql function.

create
    or replace function "users.create"(email text)
    returns uuid
    language plpgsql
    security definer
    set search_path = public
as
$$
declare
    param_email text;
    output_id   uuid;
begin
    -- Using lowercase here because I use it elsewhere in my system.
    param_email := lower(email);
    output_id := gen_random_uuid();
    if
        param_email !~ '^[a-z0-9._%-]+@[a-z0-9.-]+[.][a-z]+$' then
        return 0;
    end if;
    -- Actually create the user here....
end;
$$;

A lot of opinions on doing this over at StackOverflow:

This isn’t a replacement to email validation/confirmation, but just helps me normalize things.

Using COMMENTS is nice, but depending on the object (class?) they’re on, it can be difficult to look them up. Easy for tables, but hard for policies, for example. Which is a bummer because that’s the thing that I’d like to use them on the most. Policies for RLS are the most complex parts of a schema, and a part where getting it wrong, or being unclear is very bad.

Here’s what I want to do:

create
policy "objects.update"
on objects for
update
using (
    id in (
        select object_id
        from permissions
        where individual_id = auth.uid()
    )
);
comment on policy
"objects.update"
on "objects"
is 'Objects can be updated by individuals with access.';

While you can use commands like \d through psql, doing it without the command line gets harder.

--- List descriptions, but doesn't include _policy_ comments...
SELECT obj_description(oid)
FROM pg_class;

--- Gives you details on the policy logic, but doesn't include comments...
select *
from pg_policies;

--- Need to do something like this to get policy comments specifically...
select *
from pg_description
where objoid = (
    select oid
    from pg_policy
    where polname = 'objects.update'
);

--- Now use a join, give it column names, and  wrap it in a function.
create or replace
function util_policy_comments()
returns table (
    name    text,
    comment text
)
language plpgsql
as $$
begin
return query
select p.polname::text as name, d.description::text as comment
from pg_description d
inner join pg_policy p on
p.oid = d.objoid;
end;
$$;

You can get there in the end, but it’s just a little weird.

There are blank spaces in the pg error code ranges, which lets you build out your own custom errors.

create or replace
function __custom_raise_permissions_error()
returns integer
language plpgsql as
$$
begin
raise 'permission denied' using errcode = 'D3GA2';
return 1;
end;
$$;

If your database is absorbing some service logic (hey, a lot of people are using PostgREST these days) then it might be useful to add some errors that are specific to your schema.

The main drawbacks are that you’re now closer to writing code in the database, and some extensions implement their own error codes, so you’ve got to watch out for that too.

Useful docs pages that I keep coming back to.

Edited May 5, 2023

postgres | sql | notes | code

2023-01-19