Postgres - 深入探討 index engine 的行為

發表於 2021-08-08 更新於 2024-04-28 分類於 Postgres Disqus：文章字數： 13k 所需閱讀時間 ≈ 24 分鐘

Database 最值得學習的不外乎是 index 的設計，為什麼呢？因為有了 index 才可以加速查詢，但是內部的 index engine 是怎麼做到的？如果我們可以了解細節的差異的話，我們在設計要怎麼放 index 在什麼欄位，我覺得會更有感覺。

今天這篇文章會介紹 Postgres 內部的 index engine 是怎麼運作，不僅僅單純的 index scan，事實上 Postgres 會因應不同的情況採取不同的 scan 策略。

index

index 對於資料庫而言是一個額外且輔助的 structure，資料庫是可以不要有 index 的，在小資料的情況下，確實有了 index 也許 Database 給你的搜尋策略仍舊會是 sequential scan，但大資料的情況下，若有 index 的話提升的查詢效率是非常快的。

目前從 Postgres 9.6 開始提供了共有以下不同類型的 index：

Hash
B-tree
GiST
SP-GiST
GIN
RUM
BRIN
Bloom

可以的話，我會寫上面每一個 index 的細節文章。

在開始講 index engine 的運作前，我們先了解以下的名詞：

Index Access Method

Postgres 管理每個不同類型的 index type 的行為是透過 access method interface 去定義的，所以其實是可以建立自己的 index type，這個 access method interface 預計是在下篇文章細講。
Page and Block

在 Postgres 的定義上兩者都算是同義詞，指的是所有對象的儲存單位，每一個 Page 預設為 8KB，所以資料庫上的資料會放在硬碟上叫做 Page or Block 的地方。但通常習慣上在內存會叫做 Page，在 Disk 會叫做 Block，但兩者意義是一樣的。
TID

前面說到資料會放在 Page 上，那麼會怎麼放呢？Page 上面會有好幾個叫做 tuple 的元組，每一個 tuple 代表每一 row 的資料，然後這些 tuple 再被存放到 Page 上。

而 TID 指的就是 tuple 的 id，其實也就是指那個 row identifier 的意思，而 TID 的值是一個 pair 其內容為 (block number, tuple index within the block)，來看個例子：

TID = (42,9)，代表 tuple 的資料在第 43 block 內的第 9 個 element。

這也說明了為什麼 index 可以快速定位到資料？這是因為 index 本質上都是會儲存與 TID 的配對關係，所以當我們有 index，index 可以直接拿到 TID，透過拿取 TID 的值，就能快速定位到是在哪個 page 內的第幾個 tuple。

index 的優劣

我們知道 index 是因為為了提升讀取的效率，而產生出的額外的 structure，但是要知道如果它參考的 TID 的實際資料被更新 / 刪除 / 插入，Postgres 都需要額外花 cost 去更新 index，所以對於寫入的效率而言，有 index 是一定會下降的。當今天有個 transaction 會影響 index reference 的 TID 的話，都需要在同一個 transaction 內，去更新 index。

Index Engine

因為有 access method 的關係，不同的 index type 其 index engine 的統一行為如下：

Read data from corresponding versions of table rows.
Fetch row versions TID by TID or in a batch using a prebuilt bitmap. (而 TID 的部分是從 access method 這邊獲取的)
Check visibility of row versions for the current transaction taking into account its isolation level.

但是 index 會根據資料量或是優化的策略，採用不同的 scan 策略。為了等等可以做 index 的實驗，先來建立一個 table 來做示範：

1	create table t(a integer, b text, c boolean);

接著 Insert data 進去：

insert into t(a,b,c)
  select s.id, chr((32+random()*94)::integer), random() < 0.01
  from generate_series(1,100000) as s(id)
  order by random();

insert data 這邊有做了一些處理：

a column 用 Postgres 內建的 generate_series 來建立 1 ~ 100000 個數字
b column 用 char + random () * 94 來 insert ASCII 的相關 character
c column 用 random () 會產生 0 ~ 1 之間的隨機數字，根據機率而言理論上會有 1% 的欄位會是 TRUE，其他都是 FALSE
最後用 order by random () 代表 insert 進去的 data 是沒有經過排序過的，理論上會是打亂後的數據

Index scan

我們來建立個 index，當我們沒有指定 index type 前，預設都會是使用 btree index：

1	create index on t(a);

接著來執行 analyze：

1	analyze t;

為什麼會需要執行 analyze？因為 analyze 的用意是：analyze 會收集有關資料庫中資料表內容的統計資訊，並將結果儲在在 pg_statistic 系統目錄中。查詢計劃程序會使用這些統計資訊來幫助決定查詢的最有效執行計劃。也就是說為了避免可能 Postgres 沒有給予我們好的 query plan，這邊特定運行這個指令，事實上這個步驟，根據配置的設定會是定期運行的，

接著來執行這個查詢：

1	explain (costs off) select * from t where a = 1;

會得到 the optimizer 給我們的 scan 的策略是採取怎樣的：

          QUERY PLAN          
-------------------------------
 Index Scan using t_a_idx on t
   Index Cond: (a = 1)
(2 rows)

沒錯，這樣就是 index scan。 a = 1 這個 condition 可以這樣表示：indexed-field = expression，index engine 會先在 Btree 找出 a = 1 位在 leaf node，找到之後則訪問 leaf node 所指向的 TID，這邊其實是透過 access method 回傳 TID，但是由於 Postgres 是採用 MVCC 來實現 isolation 的，所以這邊回傳的 TID 可能會有多種，每一種 TID 就代表一個 row version，最後 index engine 必須根據 isolation level 來對照這個 row 是否是 visibility，如果是可見的代表可以取這個 TID 所對應的 Page 的 real data 的位置，最後 return 資料回來。

Bitmap scan

前面 Index scan 有其局限性，當你要 select 的資料量很多，比如說我的 where 條件是：

1	explain (costs off) select * from t where a <= 100;

             QUERY PLAN            
------------------------------------
 Bitmap Heap Scan on t
   Recheck Cond: (a <= 100)
   ->  Bitmap Index Scan on t_a_idx
         Index Cond: (a <= 100)
(4 rows)

這邊不再會是用 index scan，而是會採用 Bitmap scan 的方式，會什麼會轉換這樣的方式？是因為當 a <= 100 所指向的 TID 量很多，加上 TID 所指向的 Page 可能很多是 same page，為了盡可能重複訪問同一個 page，來提升效率。

這邊 index engine 會為每一個有加上 index 欄位在 memory 做成一個 bitmap，這邊的話會針對 a column index 拿到的所有 TIDs 所指向的 row versions 做成一個 bitmap，每個 Page 只會讀取一次，接著第二步有所謂的 Recheck Cond，這個指的是重新檢查 index condition 這邊，因為如果資料量很大，這個 bitmap 不一定能完全 fit 在 memory 上，所以 index engine 會採用 lossy bitmap 的形式，bitmap 這邊提供是在哪一個 Page，但是並不知道是位於 Page 上的哪個位置，所以 Recheck 的含義是，要對 Page 上的每個 tuple 進行條件的 check。

要怎麼 bitmap 是不是有用到 lossy 來構建，可以透過 explain analyze 來看到更多訊息：

             QUERY PLAN            
------------------------------------
 Bitmap Heap Scan on t  (cost=5.22..252.62 rows=104 width=7) (actual time=0.071..0.195 rows=100 loops=1)
   Recheck Cond: (a <= 100)
   Heap Blocks: exact=693 lossy=3732
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..5.20 rows=104 width=0) (actual time=0.040..0.041 rows=100 loops=1)
         Index Cond: (a <= 100)
 Planning Time: 0.220 ms
 Execution Time: 0.227 ms
(7 rows)

Heap Blocks: exact=693 lossy=3732 這樣的意思代表：在總共 4425 個 Page 中，有 693 個儲存詳細資訊的 tuple（包括 tuple pointer），而其他 3732 個 Page 在 bitmap 中是有損的（只有 Page）

OK，那麼是哪個配置參數控制 Bitmap 產生有損還是無損的呢？是透過 work_mem 參數來控制的，一旦 RAM 不足夠放入 Bitmap 就需要產生 Recheck Cond and lossy bitmap，但是在 explain 這邊不論有沒有用到 lossy bitmap 都會顯示 recheck cond 的過程，但實際上是沒有執行的。

接著我們在 b column 建立 index：

1 2	create index on t(b); analyze t;

如果我們加入兩個 index column 的查詢會怎樣呢？

1	explain (costs off) select * from t where a <= 100 and b = 'a';

             QUERY PLAN            
------------------------------------
 Bitmap Heap Scan on t
   Recheck Cond: ((a <= 100) AND (b = 'a'::text))
   ->  BitmapAnd
         ->  Bitmap Index Scan on t_a_idx
               Index Cond: (a <= 100)
         ->  Bitmap Index Scan on t_b_idx
               Index Cond: (b = 'a'::text)
(7 rows)

多 index 欄位的查詢滿大可能會變成 bitmap 來提升查詢的速率，前面說到每一個 column 所得到的 TID 的 row versions 會做成一個 bitmap，這邊會產生兩個 bitmap 並將這些 bitmap 進行 and 計算，讓每一個 page 只會被訪問一次。

這邊要注意的是 Bitmap Scan 怎樣發揮最理想？是當 Page 的數據物理排序與 index 紀錄不一樣才最好，如果當這兩者都一樣，那代表出現訪問同一個 Page 可能性變低了，因為比如說 a <= 100 的資料都在同一個 page 不會散亂在其他 page 上，所以這邊不需要 bitmap scan，也不用 index scan，直接 sequential scan 效果會更好。

我們可以透過 postgres 內部的系統表來查詢某 table 的 column 的 correlation 的值為何，這個值代表的是 Page 裡面的資料物理排序與 table 的 logical 排序的相關性。

1	select attname, correlation from pg_stats where tablename = 't';

attname | correlation  
---------+--------------
 a       | 0.0061246315
 c       |       0.9406
 b       |            1
(3 rows)

意思是當相關性越高 (越接近 1)，就越接近 sequential scan，代表 data 的 Page 都是接連再一起的意思，如果這時候還用 Bitmap scan 的策略的話，反而會提升 cost，因為會造成 random scan 的效果，在這種情況下，建議使用 index scan 策略或是 sequential scan 都可能會比 bitmap scan 更理想。

相反來說相關性越低，就更適合使用 bitmap scan，因為其資料本來就散亂在不同 Page 上。

Sequential scan

再講 Sequential scan 我覺得需要先介紹何謂： Database Cardinality 跟 Index Selectivity。

Database Cardinality V.S Index Selectivity

Database Cardinality

Cardinality 指的是這個 column 有多少個 distinct 的 value。

it’s the number of distinct values in a table column, relative to the number of rows in the table.

通常如果我們會說 high cardinality or medium cardinality or low cardinality，回過頭來看我們的 t table 的 a column 的值：

1	select count(distinct a) from t;

count  
--------
 100000
(1 row)

是從 1 ~ 100000，所以相對來說這個 column 的 cardinality 非常的高。

Index Selectivity

Index Selectivity 可以想成是 index engine 搜尋的時候，有效降低搜尋的範圍的指標。

來看例子：

1	select count(distinct c) from t;

count 
-------
     2
(1 row)

我們知道 c column 本身只存在 TRUE 或是 FALSE 的值，並且 FALSE 的值佔大多數，所以當如果 index 設在 c column 上，當我要找 not c 的 record：

create index on t(c);
anaylze t;
-- 禁用 seqential scan
set enable_seqscan=off;

1	explain analyze select * from t where not c;

             QUERY PLAN            
------------------------------------
 Bitmap Heap Scan on t  (cost=1871.67..3304.67 rows=99000 width=7) (actual time=12.141..33.558 rows=99038 loops=1)
   Filter: (NOT c)
   Heap Blocks: exact=439
   ->  Bitmap Index Scan on t_c_idx  (cost=0.00..1846.92 rows=99000 width=0) (actual time=12.050..12.050 rows=99038 loops=1)
         Index Cond: (c = false)
 Planning Time: 0.088 ms
 Execution Time: 39.161 ms
(7 rows)

要花 39 ms 才可以撈取資料完畢。

再來看，如果使用 sequential scan 呢：

1 2	set enable_seqscan=on; explain analyze select * from t where not c;

             QUERY PLAN            
------------------------------------
 Seq Scan on t  (cost=0.00..1443.00 rows=99000 width=7) (actual time=0.210..23.194 rows=99038 loops=1)
   Filter: (NOT c)
   Rows Removed by Filter: 962
 Planning Time: 0.085 ms
 Execution Time: 29.655 ms
(5 rows)

很明顯 sequential scan 比較快，也不需要花費 memory 建立 bitmap。

所以有一件事情是需要記住的：需要盡可能讓索引減少過濾的範圍，否則變成 random io 去存取 Page 上的 tuple 並不會比 sequential io 好。

以我們剛剛的例子並沒有舉例的很好，如果今天我要找 TRUE 的 record 呢？由於 TRUE 只佔一部分的值，所以在 c column 建立 index 是會比較快搜尋到的。

summary

所以當我們在下 index 要在哪個 column 的時候，可先看這兩個指標，看最常使用的 query 的條件欄位，本身的 Database Cardinality 高不高，一般來說像是性別的欄位它的 Database Cardinality 就不高，以及它的 Index Selectivity 會偏低，因為我的條件欄位本身就算下了，可能也需要過濾很多候選的 record。

回過來看，會使用 sequential scan 策略，通常都是使用 index 反而會花更多 cost。

例如：

1	explain (costs off) select * from t where a <= 40000;

            QUERY PLAN       
------------------------
 Seq Scan on t
   Filter: (a <= 40000)
(2 rows)

就算我們有在 a column 加上了 index，優化器還是會採用 sequential scan，這是因為 a <= 40000 match 到的 record 很多，這會增加讀取 index page 的 cost，而且是採用 random io 會比 sequential io 慢得多，但是這邊要注意的是 SSD 的 random io 與 sequential io 的成本差距並不大，而優化器在計算 cost 的時候會用到 “seq_page_cost” and “random_page_cost” 這兩個參數來計算 cost，預設的值記得是 1 : 4，在 SSD 的情況下推薦 random_page_cost 也改成 1 or 1.1 會比較不會誤判，詳情可以參考這篇文章：https://amplitude.engineering/how-a-single-postgresql-config-change-improved-slow-query-performance-by-50x-85593b8991b0

Covering indexes

Index only scan

如果可以從 index 上面直接獲取到實際的 value，而不用到 page 拿資料，速度會更快，這樣的方式其實就是 index only scan，來看個例子：

1 2	vacuum t; explain (costs off) select a from t where a < 100;

             QUERY PLAN            
------------------------------------
 Index Only Scan using t_a_idx on t
   Index Cond: (a < 100)
(2 rows)

然而，設計上並不是我們真的可以從 index 上獲取到我們想要的資料，流程上還是會需要透過 index 拿到許多 match 到的 row version，但是前面說到還需要額外檢查 visibility in the current transaction，在 index only scan 的策略下，Postgres 會 maintain 一個叫做 visibility map 的東西，用來標注更改時間不夠長的 page，這些 page 會被當作可見的，無論開始時間和隔離級別如何，如果 index 這邊返回的 TID 在這些 page 上，就可以避免 visibility 的檢查。

那麼維護 visibility map 是透過 vacuum 的操作，因此定期 vacuum 是很重要的，會提高 index only scan 的效率，如果 index only scan 的效率不高，則 the optimizer 會選擇使用一般的 index scan 策略。

我們可以來看每次 query 的 heap fetch 次數為何：

1	explain (analyze, costs off) select a from t where a < 100;

             QUERY PLAN            
------------------------------------
 Index Only Scan using t_a_idx on t (actual time=0.356..0.373 rows=99 loops=1)
   Index Cond: (a < 100)
   Heap Fetches: 0
 Planning Time: 0.158 ms
 Execution Time: 1.681 ms
(5 rows)

Heap Fetches 為 0，代表不需要存取 table，直接存取 index 就可以拿到資料，所以說這個數字越小越好。

Include index

在 Postgres 11 開始推出 include index，其實也是 cover index 的概念，來看例子：

1	CREATE INDEX ON t(a) INCLUDE (b);

INCLUDE 意思是把 b column 的值也存在 index 上，這樣當我同時想撈 a column and b column 的值，可以用到 index only scan 的策略，但是這也代表 b column 並不會成為索引的 search key，所以如果建立這種 index，使用 where a and b 只有針對 a column 做 index only scan 的策略。

在 INCLUDE 出來前，會採用 CREATE INDEX 會變成這樣：

1	CREATE INDEX ON t(a,b);

如果我們的 query 如果不太會把 b column 當作條件的話，這樣所帶來的成本就有點高了，因為要維護兩個 search key。INCLUDE 的方式 b column 的值不需要存在 non-leaf 上。

而 INCLUDE 的方式也可以用在 unique index 上。

Indexes on several fields

前面說過在 include 的做法出來前，會透過建立 index 在多個 field 來實現：

1 2	create index on t(a,b); anaylyze t;

explain (costs off) select * from t where a <= 100 and b = 'a';
             QUERY PLAN            
------------------------------------
 Index Scan using t_a_b_idx on t
   Index Cond: ((a <= 100) AND (b = 'a'::text))
(2 rows)

如果 where 條件是使用 a, b 兩個 column，the optimizer 會偏向用 index scan 策略而不是採用 bitmap scan 的方式，因為 multi-column index 使得拿到所有 TIDs 的 cost 變低，可以想像 search key 會先從 a column 開始找，找完符合 a column 接著會去找符合 b column 的條件，但由於 index 是建立在這兩個 column 上的，加上 b = ‘a’ 這條件本身可以拿到的 record 數也不多，所以採用 index scan 會比較快。

如果你下這樣的 query：

explain (costs off) select * from t where a <= 100;
                         QUERY PLAN              
--------------------------------------
 Bitmap Heap Scan on t
   Recheck Cond: (a <= 100)
   ->  Bitmap Index Scan on t_a_b_idx
         Index Cond: (a <= 100)
(4 rows)

由於 record 數量多，the opitimizer 又會傾向用 Bitmap scan 的方式來增加效率。

此外，這樣的 multi-column index 的方式，要考量到你下 query 的時機而決定要不要用，例如你常常會需要用到 a column 當作條件以及 a + b 兩個 column 當作條件，那麼就適合這樣用。但如果你單純查 b column 的話，這樣的 index 是無法幫助到的。

除了考量到上面所說的，排序的條件也是可以考慮進去的，比如說，假設 CREATE INDEX 的欄位是 (a, b)，這代表的意思是 a 和 b 在 btree 上由小到大排序，此時這個 index 適合的 ORDER BY 指令就是 ORDER BY a, b 或者是 ORDER BY a DESC, b DESC，但如果說查詢的時候是要求 a ASC, b DESC 這種，那麼該 index 的順序就沒有幫助了，所以在建立這樣的 index 的時候也可以根據你想要的情境去選擇排列的順序，例如：

1	CREATE INDEX ON t(a ASC, b DESC);

這樣就可以滿足不同 multi-column index 下，不同欄位的排序組合。然後通常越常用到的 column 當作條件的話，則最左邊的會以這邊 column 為優先。因為你這個 column 最長當作條件，他又能有效的降低候選 record，這個 column 就很適合在 multi-column index 的情況下放最左邊。

Indexes on expressions

如果你的條件是需要透過一些計算而產生的話，例如：

explain (costs off) select * from t where lower(b) = 'a';

                QUERY PLAN                
------------------------------------------
 Seq Scan on t
   Filter: (lower((b)::text) = 'a'::text)
(2 rows)

在還沒建立 index 前會是採用 Seq Scan，即使你有 create index on t (b)，這樣的 index 是無法在上面的 query 被採用的。因此，你可以這樣做：

create index on t(lower(b));
analyze t;
explain (costs off) select * from t where lower(b) = 'a';

                     QUERY PLAN                    
----------------------------------------------------
 Bitmap Heap Scan on t
   Recheck Cond: (lower((b)::text) = 'a'::text)
   ->  Bitmap Index Scan on t_lower_idx
         Index Cond: (lower((b)::text) = 'a'::text)
(4 rows)

這樣的做法其實就是預先算法 lower (b) 的值並存在 index btree 上，當作 search key，因此如果計算的方式要花許多 cost 這樣會降低 write 的 performance。

Partial indexes

部分索引指的是建立 table 的子集的 index 的效果，會這樣建立 index，通常也是因為考量到 table 裡面的 record 可能是 highly nonuniform distribution，例如想透過 index 來找出相對於其他 record 而言最不常見的，當然你會覺得說這跟建立一般的 index 差別在哪？

我們先來建立 c column 的 index

create index on t(c);
analyze t;
explain (costs off) select * from t where c;
                QUERY PLAN          
-------------------------------
 Index Scan using t_c_idx on t
   Index Cond: (c = true)
   Filter: c
(3 rows)

前面 insert data 可以知道只有 c column 只有 1% 的 record 會是 TRUE，所以這樣的 index 會很有幫助。

如果用 where not c，就會變成 seq scan：

explain (costs off) select * from t where not c;
    QUERY PLAN    
-------------------
 Seq Scan on t
   Filter: (NOT c)
(2 rows)

很合理，因為 not c 的 record 數量太多，the opitimizer 會寧願使用 seq scan 速度會比較快， cost 也比較低。

然而這樣建立 index 的方式會使得佔用的 page 數量較多：

select relpages from pg_class where relname='t_c_idx';
 relpages
----------
      276
(1 row)

雖然只有 1% 的 record 但是這樣的 index size 需要佔據 276 的 page 的容量了。但是如果我們的情境大多只需要找出 where c 的 record，那這樣的 index 因為還要幫 false 的 record 建立，而且還不會使用到，太浪費了。

這時候部分索引的優勢就出來了：

create index on t(c) where c;
anaylyze t;
select relpages from pg_class where relname='t_c_idx1';
 relpages
----------
        5
(1 row)

page 數量大幅度的減少，而且要知道 page 數量少，也代表 search key 也會變快，因為少了很多無謂的搜尋， btree level 也會變矮。這樣的 index 對於 size and performance may be pretty significant.

Sorting

這邊 sorting 其實就是前面說的 indexes on serveral fields 一樣要注意的細節，因為 index 在 btree 架構下是會存在 sorted 的結構的，因此在建立 index 可以指定排序的順序。

先來看如果沒用 index scan 的效果：

set enable_indexscan=off;
explain (costs off) select * from t order by a;
     QUERY PLAN      
---------------------
 Sort
   Sort Key: a
   ->  Seq Scan on t
(3 rows)

需要 seq scan 還需要花 sort 的 cost。

如果用 index scan：

set enable_indexscan=on;
explain (costs off) select * from t order by a;
          QUERY PLAN          
-------------------------------
 Index Scan using t_a_idx on t
(1 row)

只花費了 index scan 的 cost，也不需要花費 sort 的 cost。

Concurrent building

這邊要講的是 create index 的時候會 acquires a SHARE lock for the table.

例如：

select mode, granted from pg_locks where relation = 't'::regclass;
   mode    | granted
-----------+---------
 ShareLock | t
(1 row)

這時候如果 create index 期間有其他 transaction 要 insert or update or delete 都會被 block，直到 create index 操作完成並且 release lock 出來。

因此可以這樣做：

1	create index concurrently on t(a);

詳情可以參考我之前寫的文章：Postgres-zero-downtime-migration - 該注意的細節

總結

這邊介紹了深入探討了 index engine 採用的各種不同 scan 的策略，以及優缺點。這些內容都是參考了，Postgres 官方文件還有這篇文章：https://postgrespro.com/blog/pgsql/3994098
最後統整出來的，花了不少心思～所以每次下 index 都值得好好想想，怎樣下才是最符合自己的情境，以及常用 explain 指令來做測試。