少為人知 PostgreSQL 的功能

今天的文章筆記來自 Lesser Known PostgreSQL Features,主要是在講一些 PostgreSQL 比較少人會用到的功能,自己也覺得滿有趣的。主要是紀錄自己不知道或是沒想過的功能。

如何知道 Upsert 哪些 row 是 insert 還是 update ?

給個例子:

1
2
3
4
5
CREATE TABLE users (
name TEXT PRIMARY KEY,
age INTEGER NOT NULL
);
INSERT INTO users (name, age) values ('foo', 20);

接著 insert one record and update one record:

1
2
3
4
5
6
7
8
9
INSERT INTO users (name, age) values ('foo', 30), ('bar', 40)
ON CONFLICT (name) DO UPDATE SET age = EXCLUDED.age
RETURNING *, (xmax = 0 ) AS inserted;

name | age | inserted
------+-----+----------
foo | 30 | f
bar | 40 | t
(2 rows)

關鍵在於透過 (xmax = 0) 可以判斷出該 row 是 insert 還是 update:

The identity (transaction ID) of the deleting transaction, or zero for an undeleted row version. It is possible for this column to be nonzero in a visible row version. That usually indicates that the deleting transaction hasn’t committed yet, or that an attempted deletion was rolled back.

當 PostgreSQL update a record 實際的行為是為該 record 再額外建立一個新版本 (每個 record 的版本通常又稱為 tuple),舊版本會跟著 auto vacuum 而刪掉,而 xmax 的作用:

  1. 會是紀錄該 tuple 是被哪個 tx 刪除或是更新的,只有 tx id 介於 xmin 和 xmax 才能看到該 tuple,當沒有任何 tx id < xmas 就代表該 tuple 可以被 vacuum 了
  2. 所以當這個 row 是被 insert 的話,因為沒有所謂的舊版本,所以 xmax = 0。

詳細解釋可以參考 Laurenz Albestackoverflow 的回答,這位是 PG 大神我幾乎每次在 stackoverflow 找 PG 解答都會看到他回答,每次回答都很精闢。

這樣的 query 可不可靠?理論上可靠,除非 PG 改底層的實作但機率不高。

匹配多種模式

通常這種模糊搜尋 query 都會用 like 來做處理,最常見是找是否存在包含這個字串,所以通常會這樣寫:

1
2
SELECT * FROM users
WHERE email LIKE '%@gmail.com';

如果也想找 yahoo.com 呢?

1
2
3
4
5
SELECT *
FROM users
WHERE
email LIKE '%@gmail.com'
OR email LIKE '%@yahoo.com'

因為 like 本身是不支援 regular expression 的,所以可以改用 SIMILAR TO 會更方便:

1
2
3
SELECT *
FROM users
WHERE email SIMILAR TO '%@gmail.com|%@yahoo.com'

當然也可以用 regular expression operator:

1
2
3
SELECT *
FROM users
WHERE email ~ '@gmail\.com$|@yahoo\.com$'

另外也能用 ANY + ARRAY 的做法:

1
2
3
SELECT *
FROM users
WHERE email ~ ANY(ARRAY['@gmail\.com$', '@yahoo\.com$'])

這樣的好處在 application layer 可以用 array type variable 的方式來寫 query。

找到 sequence 現在的值

在 PG 如果要用 auto_increment 的功能,就要用 sequence 特殊欄位。

找到的方式有以下幾種:

  1. currval

    限制是如果當前 session 沒有使用過 nextval () 就不能在該 session 使用 currval

    1
    2
    CREATE SEQUENCE sequence1;
    SELECT currval('sequence1');

    必須要先這樣:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    SELECT nextval('sequence1');
    nextval
    ---------
    1
    (1 row)

    SELECT currval('sequence1');
    currval
    ---------
    1
    (1 row)

    所以這個方式並不太實用

  2. 從 PG 10 開始有提供 pg_sequences view 來取得 sequences 的資訊

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    SELECT * FROM pg_sequences WHERE sequencename = 'sequence1';
    -[ RECORD 1 ]-+--------------------
    schemaname | public
    sequencename | sequence1
    sequenceowner | postgres
    data_type | bigint
    start_value | 1
    min_value | 1
    max_value | 9223372036854775807
    increment_by | 1
    cycle | f
    cache_size | 1
    last_value | 1
  3. 使用 pg_sequence_last_value

    1
    2
    3
    SELECT pg_sequence_last_value('sequence1');
    -[ RECORD 1 ]----------+--
    pg_sequence_last_value | 1

    但奇怪的是這個 function 是沒有在官方文件紀錄的,所以使用上可能要小心 (?

  4. 當然你可以直接對該 sequence 進行 select

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
       SELECT * FROM sequence1;
    -[ RECORD 1 ]--
    last_value | 1
    log_cnt | 32
    is_called | t

    ## 如何在 psql 使用 COPY

    可以使用 `\Copy` 或是 `COPY`,但是 `\Copy` 無法支援 multi-line SQL

    ```sql
    postgres=# \COPY (
    \copy: parse error at end of line

折衷的方式是建立 temp view

1
2
3
4
5
6
7
postgres=# CREATE TEMPORARY VIEW user_view AS
SELECT name, age
FROM users;
CREATE VIEW

postgres=# \COPY (SELECT * FROM user_view) TO users.csv WITH CSV HEADER;
COPY 2

不然就是用 COPY:

1
2
3
4
5
db=# COPY (
SELECT name, age
FROM users;
) TO STDOUT WITH CSV HEADER \g users.csv
COPY 5

透過用 \g 將 STDOUT 輸出成 local file。

避免對 auto generated key 進行設值

通常在 pg 用 auto_increment 的功能都會使用 serial datatype:

1
2
3
4
5
CREATE TABLE sale (
id SERIAL PRIMARY KEY,
sold_at TIMESTAMPTZ,
amount INT
);

可是如果使用 serial 的話會有一些問題,因此從 PG 10 開始推薦使用 identity columns

1
2
3
4
5
CREATE TABLE sale (
id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
sold_at TIMESTAMPTZ,
amount INT
);
1
2
3
4
5
6
7
8
postgres=# INSERT INTO sale (sold_at, amount) VALUES (now(), 1000);
INSERT 0 1

postgres=# SELECT * FROM sale;
id | sold_at | amount
----+-------------------------------+--------
1 | 2023-12-30 23:00:42.928397+08 | 1000
(1 row)

另外 IDENTITY 可以限制該 column 只有自動產生而不能用 query 直接指定值:

1
2
3
4
5
6
7
8
9
10
11
12
13
CREATE TABLE sale (
id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
sold_at TIMESTAMPTZ,
amount INT
);

postgres=# INSERT INTO sale (sold_at, amount) VALUES (now(), 1000);
INSERT 0 1

postgres=# INSERT INTO sale (id, sold_at, amount) VALUES (2, now(), 1000);
ERROR: cannot insert into column "id"
DETAIL: Column "id" is an identity column defined as GENERATED ALWAYS.
HINT: Use OVERRIDING SYSTEM VALUE to override.

就算設定了 ALWAYS 也可透過 OVERRIDING SYSTEM VALUE 來強制設定。因此最好的情況都是設定 ALWAYS。

Dollar Quoting 來處理特殊字元在字串中

在使用 text column 的時候,裡面的值可能會包含一些 escape characters,像是 single quote,需要用這樣的方式來處理:

1
2
3
4
5
postgres=# SELECT 'Kenny''s Pizza';
?column?
---------------
Kenny's Pizza
(1 row)

但是可以用 Dollar Quoting 的方式來處理:

1
2
3
4
5
6
postgres=# SELECT $$a long
string with new lines
and 'single quotes'
and "double quotes

PostgreSQL doesn't mind ;)$$ AS text;

既可以處理 escape characters 也能處理多行。

而如果字串需要放 $ 的字元的話那可以使用 Dollar Quoting 搭配 tag 的方式:

1
2
3
4
5
6
7
8
9
10
11
postgres=# SELECT $JSON${
"name": "Kenny's Pizza",
"tagline": "Best value for your $$"
}$JSON$ AS json;
json
-----------------------------------------
{ +
"name": "Kenny's Pizza", +
"tagline": "Best value for your $$"+
}
(1 row)

也能透過這樣的方式產生 jsonb objects

1
2
3
4
5
6
7
8
postgres=# SELECT $JSON${
"name": "Kenny's Pizza",
"tagline": "Best value for your $$"
}$JSON$::jsonb AS json;
json
----------------------------------------------------------------
{"name": "Kenny's Pizza", "tagline": "Best value for your $$"}
(1 row)

對 database object 下註解

這個挺有趣的,可以對任何的 database object 下註解,雖然實務上用處不大,因為這種東西感覺很難維護。久了就年久失修了。

例如,對 Table 下註解

1
postgres=# COMMENT ON TABLE sale IS 'Sales made in the system';

透過 \dt+ 來看到 comment 的內容:

1
2
3
4
5
6
postgres=# \dt+ sale
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+------+-------+----------+-------------+---------------+------------+--------------------------
public | sale | table | postgres | permanent | heap | 8192 bytes | Sales made in the system
(1 row)

也可以對 table 上的 column 下註解:

1
2
3
4
5
6
7
8
9
postgres=# COMMENT ON COLUMN sale.sold_at IS 'When was the sale finalized';
COMMENT

postgres=# \d+ sale
Column │ Type │ Description
──────────┼──────────────────────────┼─────────────────────────────
id │ integer
sold_at │ timestamp with time zone │ When was the sale finalized
amount │ integer

也可以用 dollar quoting 來寫 multi-line 的註解:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
COMMENT ON FUNCTION generate_random_string IS $docstring$
Generate a random string at a given length from a list of possible characters.

Parameters:

- length (int): length of the output string
- characters (text): possible characters to choose from

Example:

postgres=# SELECT generate_random_string(10);
generate_random_string
────────────────────────
o0QsrMYRvp

postgres=# SELECT generate_random_string(3, 'AB');
generate_random_string
────────────────────────
ABB
$docstring$;

可用 \df+ 來看 function 註解:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
postgres=# \df+ generate_random_string
List of functions
-[ RECORD 1 ]-------+---------------------------------------------------------------------------------------------------------------
Schema | public
Name | generate_random_string
Result data type | text
Argument data types | length integer, characters text DEFAULT '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'::text
Type | func
Volatility | volatile
Parallel | unsafe
Owner | postgres
Security | invoker
Access privileges |
Language | plpgsql
Internal name |
Description | +
| Generate a random string at a given length from a list of possible characters. +
| +
| Parameters: +
| +
| - length (int): length of the output string +
| - characters (text): possible characters to choose from +
| +
| Example: +
| +
| postgres=# SELECT generate_random_string(10); +
| generate_random_string +
| ──────────────────────── +
| o0QsrMYRvp +
| +
| postgres=# SELECT generate_random_string(3, 'AB'); +
| generate_random_string +
| ──────────────────────── +
| ABB +
|

Autocomplete Reserved Words in Uppercase

在 psql 中可以設定 keyword 的 autocomplete 自動變全大寫:

1
2
3
postgres=# \set COMP_KEYWORD_CASE upper
postgres=# selec <tab>
postgres=# SELECT

Sleep for interval

通常如果要在 pg 上用 sleep 的效果,都會用 pg_sleep 來指定要等待的秒數:

1
2
3
4
5
6
7
postgres=# \timing
Timing is on.
postgres=# SELECT pg_sleep(3);
-[ RECORD 1 ]
pg_sleep |

Time: 3001.924 ms (00:03.002)

可以使用 pg_sleep_for 來指定 human read 的時間長度:

1
postgres=# SELECT pg_sleep_for('4 minutes 15 seconds');

pg_sleep_for 可以接受 interval type 的參數。

在 PG 13 之後不需要 extension 就能產生 uuid v4

在 PG 13 之前都會需要用 uuid-ossp extension 來產生 uuid v4,但是從 PG 13 後就有內建的 function 可以產生了:

1
2
3
postgres=# SELECT gen_random_uuid() AS uuid;
-[ RECORD 1 ]------------------------------
uuid | 00ffb15d-4cc2-432c-a1ed-e9111fa1c6e5

除非你要產生非 v4 的 uuid,才會需要 uuid-ossp extension。不然之後就不需要在特地使用這個 extension 了。

產生可重複的 random data

產生可重複的 random data 對於測試或是 benchmark 有很大的幫助。

一般來說隨機的資料,會使用 random function :

1
2
3
4
5
6
7
8
9
postgres=# SELECT
random() AS random_float,
ceil(random() * 10) AS random_int_0_10,
'2023-01-01'::date + interval '1 days' * ceil(random() * 365) AS random_day_in_2023;

-[ RECORD 1 ]------+--------------------
random_float | 0.5202183827443916
random_int_0_10 | 6
random_day_in_2023 | 2023-11-27 00:00:00

如果希望下一次 random data 也能一樣,則可以設定 setseed

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
postgres=# SELECT setseed(0.4050);
-[ RECORD 1 ]
setseed |

postgres=# SELECT
random() AS random_float,
ceil(random() * 10) AS random_int_0_10,
'2023-01-01'::date + interval '1 days' * ceil(random() * 365) AS random_day_in_2023
FROM generate_series(1, 2);
-[ RECORD 1 ]------+--------------------
random_float | 0.6672612966396358
random_int_0_10 | 5
random_day_in_2023 | 2023-03-13 00:00:00
-[ RECORD 2 ]------+--------------------
random_float | 0.3497101761996504
random_int_0_10 | 7
random_day_in_2023 | 2023-04-23 00:00:00

postgres=# SELECT setseed(0.4050);
-[ RECORD 1 ]
setseed |

postgres=# SELECT
random() AS random_float,
ceil(random() * 10) AS random_int_0_10,
'2023-01-01'::date + interval '1 days' * ceil(random() * 365) AS random_day_in_2023
FROM generate_series(1, 2);
-[ RECORD 1 ]------+--------------------
random_float | 0.6672612966396358
random_int_0_10 | 5
random_day_in_2023 | 2023-03-13 00:00:00
-[ RECORD 2 ]------+--------------------
random_float | 0.3497101761996504
random_int_0_10 | 7
random_day_in_2023 | 2023-04-23 00:00:00

就算是在不同的 database 只要 setseed 一樣就可以產生一樣的結果。

Search Path 的另外用法

在 PG 中 search path 預設是:

1
2
3
postgres=# SHOW search_path;
-[ RECORD 1 ]----------------
search_path | "$user", public

通常如果沒有設定相同 user name 的 schema 預設都會從 public schema 開始找對應的 object

文章內提供了一種用法,假設今天該 table 有新舊版的用戶,其中舊版所需要的 column 在新版是要拿掉了,那麼我們可以建立該 user 同名的 schema 並使用 view 的方式來 query 該 table:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
postgres=# CREATE SCHEMA kenny;
CREATE SCHEMA

postgres=# CREATE TABLE kenny.foo (value text);
CREATE TABLE

postgres=# INSERT INTO kenny.foo VALUES ('B');
INSERT 0 1

postgres=# \conninfo
You are connected to database "postgres" as user "kenny"

postgres=# SELECT * FROM foo;
value
───────
B

這樣的話搜尋的順序是先 kenny.foo 再來才是 public.foo

當然你也可以直接更新 search path 的設定:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
postgres=# CREATE SCHEMA test;
CREATE SCHEMA

postgres=# CREATE TABLE test.foo (value text);
CREATE TABLE

postgres=# INSERT INTO test.foo VALUES ('C');
INSERT 0 1

postgres=# SHOW search_path;
search_path
─────────────────
"$user", public

postgres=# SELECT * FROM foo;
value
───────
A

postgres=# SET search_path TO test, "$user", public;
SET

postgres=# SELECT * FROM foo;
value
───────
C

找到重疊的 range

常見的情境是定會議室的時間必須與其他會議室不會有重疊,因此要下 query 來檢查是否有重疊到:

1
2
3
4
5
6
7
CREATE TABLE meetings (
starts_at timestamp without time zone not null,
ends_at timestamp without time zone not null
);

INSERT INTO meetings (starts_at, ends_at) VALUES ('2021-10-01 10:00:00', '2021-10-01 10:30:00'),
('2021-10-01 11:15:00', '2021-10-01 12:00:00'), ('2021-10-01 12:30:00', '2021-10-01 12:45:00');

一般來說要找出 overlap 的 where query 會這樣寫:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
WITH new_meetings AS (
SELECT
id,
starts_at::timestamp as starts_at,
ends_at::timestamp as ends_at
FROM (VALUES
('A', '2021-10-01 11:10 UTC', '2021-10-01 11:55 UTC'),
('B', '2021-10-01 11:20 UTC', '2021-10-01 12:05 UTC'),
('C', '2021-10-01 11:20 UTC', '2021-10-01 11:55 UTC'),
('D', '2021-10-01 11:10 UTC', '2021-10-01 12:05 UTC'),
('E', '2021-10-01 11:15 UTC', '2021-10-01 12:00 UTC'),
('F', '2021-10-01 12:00 UTC', '2021-10-01 12:10 UTC'),
('G', '2021-10-01 11:00 UTC', '2021-10-01 11:15 UTC')
) as t(
id, starts_at, ends_at
)
)
SELECT
*
FROM
meetings, new_meetings
WHERE
new_meetings.starts_at BETWEEN meetings.starts_at and meetings.ends_at
OR new_meetings.ends_at BETWEEN meetings.starts_at and meetings.ends_at
OR meetings.starts_at BETWEEN new_meetings.starts_at and new_meetings.ends_at
OR meetings.ends_at BETWEEN new_meetings.starts_at and new_meetings.ends_at;

starts_at │ ends_at │ id │ starts_at │ ends_at
─────────────────────┼─────────────────────┼────┼─────────────────────┼────────────────────
2021-10-01 11:15:002021-10-01 12:00:00 │ A │ 2021-10-01 11:10:002021-10-01 11:55:00
2021-10-01 11:15:002021-10-01 12:00:00 │ B │ 2021-10-01 11:20:002021-10-01 12:05:00
2021-10-01 11:15:002021-10-01 12:00:00 │ C │ 2021-10-01 11:20:002021-10-01 11:55:00
2021-10-01 11:15:002021-10-01 12:00:00 │ D │ 2021-10-01 11:10:002021-10-01 12:05:00
2021-10-01 11:15:002021-10-01 12:00:00 │ E │ 2021-10-01 11:15:002021-10-01 12:00:00
2021-10-01 11:15:002021-10-01 12:00:00 │ F │ 2021-10-01 12:00:002021-10-01 12:10:00
2021-10-01 11:15:002021-10-01 12:00:00 │ G │ 2021-10-01 11:00:002021-10-01 11:15:00

但問題是 F 跟 G 的情況應該要算是不 overlap 因為剛好是接續前後 meeting,原因在於 BETWEEN 是 inclusive 的,所以 query 要改成:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
WITH new_meetings AS (/* ... */)
SELECT
*
FROM
meetings, new_meetings
WHERE
(new_meetings.starts_at > meetings.starts_at AND new_meetings.starts_at < meetings.ends_at)
OR
(new_meetings.ends_at > meetings.starts_at AND new_meetings.ends_at < meetings.ends_at)
OR
(meetings.starts_at > new_meetings.starts_at AND meetings.starts_at < new_meetings.ends_at)
OR
(meetings.ends_at > new_meetings.starts_at AND meetings.ends_at < new_meetings.ends_at)
OR
(meetings.starts_at = new_meetings.starts_at AND meetings.ends_at = new_meetings.ends_at);

starts_at │ ends_at │ id │ starts_at │ ends_at
─────────────────────┼─────────────────────┼────┼─────────────────────┼────────────────────
2021-10-01 11:15:002021-10-01 12:00:00 │ A │ 2021-10-01 11:10:002021-10-01 11:55:00
2021-10-01 11:15:002021-10-01 12:00:00 │ B │ 2021-10-01 11:20:002021-10-01 12:05:00
2021-10-01 11:15:002021-10-01 12:00:00 │ C │ 2021-10-01 11:20:002021-10-01 11:55:00
2021-10-01 11:15:002021-10-01 12:00:00 │ D │ 2021-10-01 11:10:002021-10-01 12:05:00
2021-10-01 11:15:002021-10-01 12:00:00 │ E │ 2021-10-01 11:15:002021-10-01 12:00:00

這樣 query 很複雜,因此可以使用 PG 提供的 OVERLAP

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
WITH new_meetings AS (
SELECT
id,
starts_at::timestamptz as starts_at,
ends_at::timestamptz as ends_at
FROM (VALUES
('A', '2021-10-01 11:10 UTC', '2021-10-01 11:55 UTC'),
('B', '2021-10-01 11:20 UTC', '2021-10-01 12:05 UTC'),
('C', '2021-10-01 11:20 UTC', '2021-10-01 11:55 UTC'),
('D', '2021-10-01 11:10 UTC', '2021-10-01 12:05 UTC'),
('E', '2021-10-01 11:15 UTC', '2021-10-01 12:00 UTC'),
('F', '2021-10-01 12:00 UTC', '2021-10-01 12:10 UTC'),
('G', '2021-10-01 11:00 UTC', '2021-10-01 11:15 UTC')
) as t(
id, starts_at, ends_at
)
)
SELECT
*
FROM
meetings, new_meetings
WHERE
(new_meetings.starts_at, new_meetings.ends_at)
OVERLAPS (meetings.starts_at, meetings.ends_at);
starts_at | ends_at | id | starts_at | ends_at
---------------------+---------------------+----+---------------------+---------------------
2021-10-01 11:15:00 | 2021-10-01 12:00:00 | A | 2021-10-01 11:10:00 | 2021-10-01 11:55:00
2021-10-01 11:15:00 | 2021-10-01 12:00:00 | B | 2021-10-01 11:20:00 | 2021-10-01 12:05:00
2021-10-01 11:15:00 | 2021-10-01 12:00:00 | C | 2021-10-01 11:20:00 | 2021-10-01 11:55:00
2021-10-01 11:15:00 | 2021-10-01 12:00:00 | D | 2021-10-01 11:10:00 | 2021-10-01 12:05:00
2021-10-01 11:15:00 | 2021-10-01 12:00:00 | E | 2021-10-01 11:15:00 | 2021-10-01 12:00:00
(5 rows)

總結

感謝這篇文章讓我又學到了不少新奇的東西或用法~