Postgresql
優化查詢以獲取 x 在另一個表的值對之間的行
PostgreSQL 9.4.5
在我的數據庫中有 2 個表,vcentries和exons。
我想查詢 vcentries 中的行,其中pos位於表外顯子中的位置對列表 exonstart 和 exonend 之間。
查詢結果:select * from exons where exons.genename = ‘NM_001301824’
總查詢執行時間:11 毫秒。檢索到 8 行。
這相當於在沒有跨表連接的情況下提取單個外顯子的結果:
Select pos,chrom,alt,ref from vcfentries where chrom = '1' and pos > 33546679 and pos < 33547159
總查詢執行時間:11 毫秒。檢索到 89 行。
目前這些是我在桌子上的索引
查詢特定外顯子是有效的:
select vcfentries.pos,vcfentries.chrom,vcfentries.alt,vcfentries.ref,exons.exonnumber from exons , vcfentries where vcfentries.pos BETWEEN exons.exonstart and exons.exonend and exons.genename = 'NM_001301824' and exonnumber = 6 and vcfentries.chrom = exons.chrom
總查詢執行時間:72 毫秒。檢索到 137 行。
"Nested Loop (cost=438468.31..2324377.64 rows=809587 width=24)" " -> Index Only Scan using exonspkey on exons (cost=0.42..8.44 rows=1 width=26)" " Index Cond: ((genename = 'NM_001301824'::text) AND (exonnumber = 8))" " -> Bitmap Heap Scan on vcfentries (cost=438467.89..2317823.87 rows=654532 width=16)" " Recheck Cond: ((pos >= exons.exonstart) AND (pos <= exons.exonend) AND (chrom = exons.chrom))" " -> Bitmap Index Scan on vcfentries_pos_chrom_idx (cost=0.00..438304.26 rows=654532 width=0)" " Index Cond: ((pos >= exons.exonstart) AND (pos <= exons.exonend) AND (chrom = exons.chrom))"
當查詢所有這些時,性能會下降。突然,它跳入分鐘範圍:
select * from exons , vcfentries where vcfentries.pos BETWEEN exons.exonstart and exons.exonend and exons.genename = 'NM_001037501' and vcfentries.chrom = exons.chrom
總查詢執行時間:325389 毫秒。檢索到 2331 行。
"Hash Join (cost=58.73..11528494.14 rows=11334216 width=24)" " Output: vcfentries.pos, vcfentries.chrom, vcfentries.alt, vcfentries.ref, exons.exonnumber" " Hash Cond: (vcfentries.chrom = exons.chrom)" " Join Filter: ((vcfentries.pos >= exons.exonstart) AND (vcfentries.pos <= exons.exonend))" " -> Seq Scan on coeus.vcfentries (cost=0.00..7170736.76 rows=141378976 width=16)" " Output: vcfentries.pos, vcfentries.chrom, vcfentries.alt, vcfentries.ref, vcfentries.analysisid, vcfentries.filter, vcfentries.info_ac, vcfentries.info_af, vcfentries.info_an, vcfentries.info_baseqranksum, vcfentries.info_clippingranksum, vcfentrie (...)" " -> Hash (cost=58.56..58.56 rows=14 width=26)" " Output: exons.exonnumber, exons.exonstart, exons.exonend, exons.chrom" " -> Bitmap Heap Scan on coeus.exons (cost=4.53..58.56 rows=14 width=26)" " Output: exons.exonnumber, exons.exonstart, exons.exonend, exons.chrom" " Recheck Cond: (exons.genename = 'NM_001301824'::text)" " -> Bitmap Index Scan on exons_genename_idx (cost=0.00..4.53 rows=14 width=0)" " Index Cond: (exons.genename = 'NM_001301824'::text)"
外顯子表大小適中,只有大約 400K 行。
vcentries 表很大,有幾億行,但使用索引查詢的速度可以接受。
我無法優化此查詢。當我在嘗試使用顯式連接時執行解釋時,我得到了相同的執行計劃。
任何想法為什麼它會創建如此糟糕的執行計劃以及任何建議的修復或更好的查詢?
我為此實現的解決方案是添加一個 plpgsql 函式,該函式執行每個單獨的查詢並將結果輸出為表。
我定義了一個輸出記錄類型並使用了這個函式:
DROP FUNCTION coeus.getexonvcf(text); CREATE OR REPLACE FUNCTION coeus.getexonvcf(text) RETURNS SETOF coeus.exonvcf AS $BODY$ DECLARE r bigint; s coeus.exonvcf; BEGIN FOR r IN SELECT exonnumber FROM coeus.exons where exons.genename = $1 LOOP for s IN select ex.genename,ex.exonnumber,ex.direction,ex.padding, vcf.* from coeus.exons ex, coeus.vcfentries vcf where vcf.pos BETWEEN ex.exonstart and ex.exonend and ex.genename = $1 and ex.exonnumber = r and vcf.chrom = ex.chrom LOOP RETURN NEXT s; END LOOP; END LOOP; RETURN; END $BODY$ LANGUAGE 'plpgsql' ;
這避免了任何全表掃描,並在一秒多一點的時間內執行。
總查詢執行時間:1493 毫秒。檢索到 2331 行。
對於具有大量外顯子條目的基因,如 Titin (363),它的速度很慢,但在一般情況下,它的速度可以接受。
我建議在這種情況下使用 CTE。看起來像這樣的東西:
with exons (genename, exonnumber, crom, exonstart, exonend, padding, direction) as (select genename, exonnumber, crom, exonstart, exonend, padding, direction from exons where enename = 'NM_001037501') select * from vcfentries vcf join exons exo using (chrom) where vcf.pos between exo.exonstart and exo.exonend;