Counting missing values in each column of a table in PostgreSQL

  • March 28, 2021

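Question: Which column of the fortune500 table has the most missing values? To find out, I checked each column individually as shown below, which is very tedious.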

SELECT 
   SUM(CASE WHEN ticker IS NULL THEN 1 ELSE 0 END) AS ticker_null_num, 
   SUM(CASE WHEN profits_change IS NULL THEN 1 ELSE 0 END) AS profits_change_null_num,
   SUM(CASE WHEN industry IS NULL THEN 1 ELSE 0 END) AS industry_null_num
FROM fortune500;

I also ran the following query separately for each column:

SELECT count(*) - count(ticker) AS missing
 FROM fortune500;
etc....

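My question: Is there a better, dynamic way to do this? The approach above is cumbersome and time-consuming, and I may have 50-60 columns in a table. What should I do in that case instead of this manual method? Can someone show me a good way to get the number of missing values for every column, sorted in descending order? For example: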

COLUMN_NAME    MISSING_VALUES_COUNT
col1               60
col2               50
col3               45
col4               40
etc.....  

You can convert each row to JSON, which dynamically generates one row per column:

select colname, 
      count(cols.value) as non_null_values,
      (select count(*) from the_table) - count(cols.value) as missing
from the_table t
 cross join jsonb_each(jsonb_strip_nulls(to_jsonb(t))) as cols(colname, value)
group by colname;

But this will not be very fast on big tables.
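One thing to be aware of with the JSON-based query above: jsonb_strip_nulls() removes NULL keys before jsonb_each() expands the row, so a column that is NULL in every row produces no rows at all and drops out of the result. Here is a minimal sketch of a variant that keeps such columns by counting JSON null values instead of stripping them. It uses the same placeholder name the_table, and it assumes that no json/jsonb column stores a literal JSON null, since that would be counted as missing too. Like the query above, it is unlikely to be fast on big tables.

```sql
-- Sketch only: the_table is the placeholder table name used above.
-- to_jsonb(t) keeps NULL columns as JSON null, so every column appears
-- in the output, including columns that are NULL in every row.
SELECT cols.colname,
       count(*) FILTER (WHERE jsonb_typeof(cols.value) = 'null') AS missing
FROM   the_table t
CROSS  JOIN LATERAL jsonb_each(to_jsonb(t)) AS cols(colname, value)
GROUP  BY cols.colname
ORDER  BY missing DESC, cols.colname;
```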

This basic form of the query needs only a single sequential scan and is as efficient as it gets:

SELECT x.*
FROM  (
  SELECT count(*) AS ct
       , count(ticker) AS ticker
       , count(profits_change) AS profits_change
       , count(industry) AS industry
    -- , more?
  FROM   fortune500
  ) t
CROSS  JOIN LATERAL (
  VALUES
     ('ticker', ct - ticker)
   , ('profits_change', ct - profits_change)
   , ('industry', ct - industry)
-- , more?
  ) x(column_name, missing_values)
  ORDER  BY missing_values DESC, column_name DESC;
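To see the shape of the result, the same pattern can be tried on a tiny throwaway table. The table name fortune500_demo and its three rows are made up purely for illustration; only the three column names come from the question.

```sql
-- Hypothetical demo data, not taken from the real fortune500 table.
CREATE TEMP TABLE fortune500_demo (
  ticker         text
, profits_change numeric
, industry       text
);

INSERT INTO fortune500_demo VALUES
  ('A' , 1.5 , NULL)
, (NULL, NULL, 'Tech')
, ('C' , NULL, NULL);

-- One pass over the table: count(*) minus count(col) is the NULL count.
-- The LATERAL VALUES list then turns the single wide row into one row per column.
SELECT x.*
FROM  (
  SELECT count(*) AS ct
       , count(ticker) AS ticker
       , count(profits_change) AS profits_change
       , count(industry) AS industry
  FROM   fortune500_demo
  ) t
CROSS  JOIN LATERAL (
  VALUES
     ('ticker', ct - ticker)
   , ('profits_change', ct - profits_change)
   , ('industry', ct - industry)
  ) x(column_name, missing_values)
ORDER  BY missing_values DESC, column_name DESC;

--  column_name    | missing_values
--  profits_change |              2
--  industry       |              2
--  ticker         |              1
```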

This function dynamically generates and executes such a query for all columns of a given table:

CREATE OR REPLACE FUNCTION f_count_nulls(_tbl regclass)
 RETURNS TABLE (column_name text, missing_values bigint)
 LANGUAGE plpgsql STABLE PARALLEL SAFE AS
$func$
BEGIN
  RETURN QUERY EXECUTE (
  SELECT format(
  $$
  SELECT x.*
  FROM  (SELECT count(*) AS ct, %s FROM %s) t
  CROSS  JOIN LATERAL (VALUES %s) x(col, nulls)
  ORDER  BY nulls DESC, col DESC
  $$, string_agg(format('count(%1$I) AS %1$I', attname), ', ')
    , $1
    , string_agg(format('(%1$L, ct - %1$I)', attname), ', ')
     )
  FROM   pg_catalog.pg_attribute
  WHERE  attrelid = $1
  AND    attnum > 0
  AND    NOT attisdropped
  -- more filters?
  );
END
$func$;

Call:

SELECT * FROM f_count_nulls('public.fortune500');  -- optionally schema-qualified

Produces the requested result.

All identifiers are handled safely (quoted where necessary, no SQL injection).
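The "-- more filters?" placeholder in the function's catalog query leaves room for narrowing the column list. One possible filter, shown here as an assumption rather than part of the function above, is to skip columns declared NOT NULL, since they cannot contain missing values. Running the catalog query on its own is a quick way to preview which columns would be counted:

```sql
-- Sketch: list the columns the generated query would cover for public.fortune500.
SELECT attname
FROM   pg_catalog.pg_attribute
WHERE  attrelid = 'public.fortune500'::regclass
AND    attnum > 0           -- skip system columns
AND    NOT attisdropped     -- skip dropped columns
AND    NOT attnotnull       -- assumed extra filter: NOT NULL columns have no NULLs
ORDER  BY attnum;
```

If every column of a table is declared NOT NULL, this filter leaves nothing to aggregate, so a function using it would need to handle that case separately.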

Source: https://dba.stackexchange.com/questions/276131