Postgresql-Performance
使用索引的 Postgres 查詢永遠不會完成
所以我有以下3張表:
create table imdb_dev.dim_title ( id uuid not null, tconst varchar(10) not null, title_type_id uuid not null, primary_title text, original_title text, is_adult boolean, start_year integer, end_year integer, runtime_in_minutes integer, constraint dim_title_pkey primary key (id), constraint dim_title_title_type_id_fkey foreign key (title_type_id) references imdb_dev.dim_title_type ); create table imdb_dev.dim_title_type ( id uuid not null, title_type text not null, constraint dim_title_type_pkey primary key (id) ); create table staging.title_basics ( tconst varchar(10) not null, titletype varchar(20) not null, primarytitle text, originaltitle text, isadult boolean, startyear integer, endyear integer, runtimeminutes integer, genres text, constraint title_basics_pkey primary key (tconst, titletype) ) partition by LIST (titletype);
我想選擇存在的行,
title_basics
然後插入dim_title
. 兩者dim_title
都有title_basics
大約 800 萬行,而dim_title_type
只是一個只有 12 行的映射表。這是我用來選擇我需要插入的行的查詢。由於某種原因,它只是繼續執行並且永遠不會完成:
SELECT md5(tconst)::UUID AS id, tconst, tt.id AS title_type_id, tb.primarytitle AS primary_title, tb.originaltitle AS original_title, tb.isadult AS is_adult, tb.startyear AS start_year, tb.endyear AS end_year, tb.runtimeminutes AS runtime_in_minutes FROM "funbro"."staging"."title_basics" tb LEFT JOIN "funbro"."imdb_dev"."dim_title_type" tt ON tt.title_type = tb.titleType WHERE md5(tconst)::UUID NOT IN (select id from "funbro"."imdb_dev"."dim_title") ;
我肯定做錯了什麼,但我看不到那是什麼。這是查詢計劃:
Gather (cost=1001.27..533568642515.39 rows=4040707 width=97) Workers Planned: 2 -> Hash Left Join (cost=1.27..533568237444.69 rows=1683628 width=97) Hash Cond: ((tb.titletype)::text = tt.title_type) -> Parallel Append (cost=0.00..533568197457.27 rows=1683627 width=73) -> Parallel Seq Scan on title_basics_tvepisode tb_4 (cost=0.00..381512705366.11 rows=1231620 width=75) Filter: (NOT (SubPlan 1)) SubPlan 1 -> Materialize (cost=0.00..289561.20 rows=8081413 width=16) -> Seq Scan on dim_title (cost=0.00..209693.13 rows=8081413 width=16) -> Parallel Seq Scan on title_basics_short tb_3 (cost=0.00..52946232635.92 rows=170924 width=63) Filter: (NOT (SubPlan 1)) -> Parallel Seq Scan on title_basics_movie tb_1 (cost=0.00..37574087716.28 rows=121299 width=65) Filter: (NOT (SubPlan 1)) -> Parallel Seq Scan on title_basics_video tb_7 (cost=0.00..20256814175.33 rows=65394 width=81) Filter: (NOT (SubPlan 1)) -> Parallel Seq Scan on title_basics_tvseries tb_6 (cost=0.00..19194394831.39 rows=61965 width=68) Filter: (NOT (SubPlan 1)) -> Parallel Seq Scan on title_basics_tvmovie tb_5 (cost=0.00..12033907879.19 rows=38848 width=77) Filter: (NOT (SubPlan 1)) -> Parallel Seq Scan on title_basics_others tb_2 (cost=0.00..10050046434.90 rows=32444 width=79) Filter: (NOT (SubPlan 1)) -> Hash (cost=1.12..1.12 rows=12 width=25) -> Seq Scan on dim_title_type tt (cost=0.00..1.12 rows=12 width=25)
這些是執行 Postgres 13 的機器的規格。磁碟是 SSD。Postgres 配置是預設配置:
好的,找出問題所在。
首先,我認為按照建議添加索引作為表達式
staging.title_basics
會使查詢規劃器使用索引,但我這樣做了,但仍然面臨同樣的問題:Postgres 將繼續進行 Seq Scan:create index title_basics_id_uuid on staging.title_basics (( md5(tconst)::UUID ));
真正的問題是
NOT IN
在where
謂詞中使用效率非常低。為 id 為 null 的 LEFT JOIN 更改它會使查詢在不到 1 分鐘的時間內完成。參考:wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_NOT_IN最終查詢:
SELECT md5(tb.tconst)::UUID as id, tt.id AS title_type_id, tb.primarytitle AS primary_title, tb.originaltitle AS original_title, tb.isadult AS is_adult, tb.startyear AS start_year, tb.endyear AS end_year, tb.runtimeminutes AS runtime_in_minutes FROM "funbro"."staging"."title_basics" tb LEFT JOIN "funbro"."imdb_dev"."dim_title_type" tt ON tt.title_type = tb.titleType LEFT JOIN imdb_dev.dim_title t ON md5(tb.tconst)::UUID = t.id WHERE t.id is null ;