在 Postgresql 查詢中有效地選擇多個連續範圍的開始和結束

June 1, 2019

我在一個表中有大約 10 億行數據，其名稱和整數範圍在 1-288 之間。對於給定的name，每個int都是唯一的，並且範圍內並非所有可能的整數都存在——因此存在間隙。
此查詢生成一個範例案例：
--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
             ('foo', 3),
             ('foo', 4),
             ('foo', 10),
             ('foo', 11),
             ('foo', 13),
             ('bar', 1),
             ('bar', 2),
             ('bar', 3)
    ) AS baz ("name", "int")
我想為每個名稱和連續整數序列生成一個查找表。每個這樣的行將包含：
name – name列的值
start – 連續序列中的第一個整數
end – 連續序列中的最終值
span – end - start + 1
此查詢為上述範例生成範例輸出：
--what I need:
SELECT * 
FROM ( VALUES ('foo', 2, 4, 3),
             ('foo', 10, 11, 2),
             ('foo', 13, 13, 1),
             ('bar', 1, 3, 3)
    ) AS contiguous_ranges ("name", "start", "end", span)
因為我有這麼多行，所以效率越高越好。也就是說，我只需要執行一次此查詢，因此這不是絕對要求。
提前致謝！
編輯：
我應該補充一點，歡迎 PL/pgSQL 解決方案（請解釋任何花哨的技巧——我還是 PL/pgSQL 的新手）。

怎麼用with recursive

測試視圖：

create view v as 
select *
from ( values ('foo', 2),
             ('foo', 3),
             ('foo', 4),
             ('foo', 10),
             ('foo', 11),
             ('foo', 13),
             ('bar', 1),
             ('bar', 2),
             ('bar', 3)
    ) as baz ("name", "int");

詢問：

with recursive t("name", "int") as ( select "name", "int", 1 as span from v
                                    union all
                                    select "name", v."int", t.span+1 as span
                                    from v join t using ("name")
                                    where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
     from ( select "name", "int", max(span) as span 
            from t
            group by "name", "int" ) z
     group by "name", ("int"-span+1) ) z;

結果：

name | start | end | span
------+-------+-----+------
foo  |     2 |   4 |    3
foo  |    13 |  13 |    1
bar  |     1 |   3 |    3
foo  |    10 |  11 |    2
(4 rows)

我很想知道它在你的十億行表上的表現。

您可以使用視窗函式來做到這一點。基本思想是使用lead和lag視窗函式將行拉到目前行的前面和後面。然後我們可以計算我們是否有序列的開始或結束：

create temp view temp_view as
   select
       n,
       val,
       (lead &lt;&gt; val + 1 or lead is null) as islast,
       (lag &lt;&gt; val - 1 or lag is null) as isfirst,
       (lead &lt;&gt; val + 1 or lead is null) and (lag &lt;&gt; val - 1 or lag is null) as orphan
   from
   (
       select
           n,
           lead(val, 1) over( partition by n order by n, val),
           lag(val, 1) over(partition by n order by n, val ),
           val
       from test
       order by n, val
   ) as t
;  
select * from temp_view;
n  | val | islast | isfirst | orphan 
-----+-----+--------+---------+--------
bar |   1 | f      | t       | f
bar |   2 | f      | f       | f
bar |   3 | t      | f       | f
bar |  24 | t      | t       | t
bar |  42 | t      | t       | t
foo |   2 | f      | t       | f
foo |   3 | f      | f       | f
foo |   4 | t      | f       | f
foo |  10 | f      | t       | f
foo |  11 | t      | f       | f
foo |  13 | t      | t       | t
(11 rows)

（我使用了一個視圖，所以下面的邏輯會更容易理解。）所以現在我們知道該行是開始還是結束。我們必須將其折疊成一行：

select
   n as "name",
   first,
   coalesce (last, first) as last,
   coalesce (last - first + 1, 1) as span
from
(
   select
   n,
   val as first,
   -- this will not be excellent perf. since were calling the view
   -- for each row sequence found. Changing view into temp table 
   -- will probably help with lots of values.
   (
       select min(val)
       from temp_view as last
       where islast = true
       -- need this since isfirst=true, islast=true on an orphan sequence
       and last.orphan = false
       and first.val &lt; last.val
       and first.n = last.n
   ) as last
   from
       (select * from temp_view where isfirst = true) as first
) as t
;

name | first | last | span 
------+-------+------+------
bar  |     1 |    3 |    3
bar  |    24 |   24 |    1
bar  |    42 |   42 |    1
foo  |     2 |    4 |    3
foo  |    10 |   11 |    2
foo  |    13 |   13 |    1
(6 rows)

對我來說看起來是正確的:)

引用自：https://dba.stackexchange.com/questions/17045

在 Postgresql 查詢中有效地選擇多個連續範圍的開始和結束

相關問答

在 PostgreSQL REPMGR 主從環境中，HOT STANDBY 設置是否需要相同的硬體配置？

用過去 1 小時的時間值減去目前時間值

簡單 SQL 查詢中的語法錯誤

如何通過外鍵進行查詢，同時僅將值與另一列中用作外鍵的值在同一行

Postgres 按多列排序

在連接後添加一個條件時，包含多個連接的查詢變得太慢