Query
在 Postgresql 查詢中有效地選擇多個連續範圍的開始和結束
我在一個表中有大約 10 億行數據,其名稱和整數範圍在 1-288 之間。對於給定的name,每個int都是唯一的,並且範圍內並非所有可能的整數都存在——因此存在間隙。
此查詢生成一個範例案例:
--what I have: SELECT * FROM ( VALUES ('foo', 2), ('foo', 3), ('foo', 4), ('foo', 10), ('foo', 11), ('foo', 13), ('bar', 1), ('bar', 2), ('bar', 3) ) AS baz ("name", "int")
我想為每個名稱和連續整數序列生成一個查找表。每個這樣的行將包含:
name – name列的值
start – 連續序列中的第一個整數
end – 連續序列中的最終值
span – end - start + 1
此查詢為上述範例生成範例輸出:
--what I need: SELECT * FROM ( VALUES ('foo', 2, 4, 3), ('foo', 10, 11, 2), ('foo', 13, 13, 1), ('bar', 1, 3, 3) ) AS contiguous_ranges ("name", "start", "end", span)
因為我有這麼多行,所以效率越高越好。也就是說,我只需要執行一次此查詢,因此這不是絕對要求。
提前致謝!
編輯:
我應該補充一點,歡迎 PL/pgSQL 解決方案(請解釋任何花哨的技巧——我還是 PL/pgSQL 的新手)。
測試視圖:
create view v as select * from ( values ('foo', 2), ('foo', 3), ('foo', 4), ('foo', 10), ('foo', 11), ('foo', 13), ('bar', 1), ('bar', 2), ('bar', 3) ) as baz ("name", "int");
詢問:
with recursive t("name", "int") as ( select "name", "int", 1 as span from v union all select "name", v."int", t.span+1 as span from v join t using ("name") where v."int"=t."int"+1 ) select "name", "start", "start"+span-1 as "end", span from( select "name", ("int"-span+1) as "start", max(span) as span from ( select "name", "int", max(span) as span from t group by "name", "int" ) z group by "name", ("int"-span+1) ) z;
結果:
name | start | end | span ------+-------+-----+------ foo | 2 | 4 | 3 foo | 13 | 13 | 1 bar | 1 | 3 | 3 foo | 10 | 11 | 2 (4 rows)
我很想知道它在你的十億行表上的表現。
您可以使用視窗函式來做到這一點。基本思想是使用
lead
和lag
視窗函式將行拉到目前行的前面和後面。然後我們可以計算我們是否有序列的開始或結束:create temp view temp_view as select n, val, (lead <> val + 1 or lead is null) as islast, (lag <> val - 1 or lag is null) as isfirst, (lead <> val + 1 or lead is null) and (lag <> val - 1 or lag is null) as orphan from ( select n, lead(val, 1) over( partition by n order by n, val), lag(val, 1) over(partition by n order by n, val ), val from test order by n, val ) as t ; select * from temp_view; n | val | islast | isfirst | orphan -----+-----+--------+---------+-------- bar | 1 | f | t | f bar | 2 | f | f | f bar | 3 | t | f | f bar | 24 | t | t | t bar | 42 | t | t | t foo | 2 | f | t | f foo | 3 | f | f | f foo | 4 | t | f | f foo | 10 | f | t | f foo | 11 | t | f | f foo | 13 | t | t | t (11 rows)
(我使用了一個視圖,所以下面的邏輯會更容易理解。)所以現在我們知道該行是開始還是結束。我們必須將其折疊成一行:
select n as "name", first, coalesce (last, first) as last, coalesce (last - first + 1, 1) as span from ( select n, val as first, -- this will not be excellent perf. since were calling the view -- for each row sequence found. Changing view into temp table -- will probably help with lots of values. ( select min(val) from temp_view as last where islast = true -- need this since isfirst=true, islast=true on an orphan sequence and last.orphan = false and first.val < last.val and first.n = last.n ) as last from (select * from temp_view where isfirst = true) as first ) as t ; name | first | last | span ------+-------+------+------ bar | 1 | 3 | 3 bar | 24 | 24 | 1 bar | 42 | 42 | 1 foo | 2 | 4 | 3 foo | 10 | 11 | 2 foo | 13 | 13 | 1 (6 rows)
對我來說看起來是正確的:)