PostgreSQL
Counting missing values in each column of a table in PostgreSQL
Question: which column of the table fortune500 has the most missing values? To find out, I checked each column individually as shown below, which is very tedious.
SELECT SUM(CASE WHEN ticker IS NULL THEN 1 ELSE 0 END) AS ticker_null_num,
       SUM(CASE WHEN profits_change IS NULL THEN 1 ELSE 0 END) AS profits_change_null_num,
       SUM(CASE WHEN industry IS NULL THEN 1 ELSE 0 END) AS industry_null_num
FROM fortune500;
I also ran the following query for each column separately:
SELECT count(*) - count(ticker) AS missing FROM fortune500;
etc.
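My question: is there a better, dynamic way to do this? The manual approach is cumbersome and takes a long time, since I have tables with 50-60 columns. What should I do in that case instead? Can someone help me find the number of missing values per column, sorted in descending order? For example:
COLUMN_NAME   MISSING_VALUES_COUNT
col1          60
col2          50
col3          45
col4          40
etc.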
You can convert each row to JSON to dynamically generate one row per column:
select colname,
       count(cols.value) as non_null_values,
       (select count(*) from the_table) - count(cols.value) as missing
from the_table t
  cross join jsonb_each(jsonb_strip_nulls(to_jsonb(t))) as cols(colname, value)
group by colname;
But this won't be very fast on large tables.
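One caveat with the JSON query: jsonb_strip_nulls drops every key whose value is null, so a column that is NULL in every row never reaches jsonb_each and is missing from the result entirely. A possible variant, sketched below, keeps the null keys and counts them directly. The table demo_nulls and its data are made up for illustration, and a json/jsonb column holding a JSON null would be counted as missing too.

```sql
-- Hypothetical demo table; names and data are assumptions for illustration.
CREATE TEMP TABLE demo_nulls (id int, ticker text, industry text);
INSERT INTO demo_nulls VALUES
  (1, 'AAA', NULL),
  (2, NULL,  NULL),
  (3, 'CCC', NULL);

-- to_jsonb() turns SQL NULL into JSON null, so every column yields one
-- key/value pair per row, including columns that are NULL everywhere.
SELECT cols.colname,
       count(*) FILTER (WHERE jsonb_typeof(cols.value) = 'null') AS missing
FROM   demo_nulls AS t
CROSS  JOIN LATERAL jsonb_each(to_jsonb(t)) AS cols(colname, value)
GROUP  BY cols.colname
ORDER  BY missing DESC, cols.colname;
-- industry: 3, ticker: 1, id: 0
```

Like the original, this still expands every row into one JSON pair per column, so it is no faster on large tables.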
This basic form of the query needs only a single sequential scan and is about as efficient as it gets:
SELECT x.*
FROM  (
   SELECT count(*) AS ct
        , count(ticker) AS ticker
        , count(profits_change) AS profits_change
        , count(industry) AS industry
     -- , more?
   FROM   fortune500
   ) t
CROSS  JOIN LATERAL (
   VALUES
     ('ticker', ct - ticker)
   , ('profits_change', ct - profits_change)
   , ('industry', ct - industry)
  -- , more?
   ) x(column_name, missing_values)
ORDER  BY missing_values DESC, column_name DESC;
This function dynamically builds and executes that query for all columns of a given table:
CREATE OR REPLACE FUNCTION f_count_nulls(_tbl regclass)
  RETURNS TABLE (column_name text, missing_values bigint)
  LANGUAGE plpgsql STABLE PARALLEL SAFE AS
$func$
BEGIN
   -- Build the query text from the system catalog, then run it.
   RETURN QUERY EXECUTE (
   SELECT format(
      $$
      SELECT x.*
      FROM  (SELECT count(*) AS ct, %s FROM %s) t
      CROSS  JOIN LATERAL (VALUES %s) x(col, nulls)
      ORDER  BY nulls DESC, col DESC
      $$, string_agg(format('count(%1$I) AS %1$I', attname), ', ')    -- one count() per column
        , $1                                                         -- the table itself
        , string_agg(format('(%1$L, ct - %1$I)', attname), ', ')     -- one (name, nulls) row per column
      )
   FROM   pg_catalog.pg_attribute
   WHERE  attrelid = $1
   AND    attnum > 0          -- skip system columns
   AND    NOT attisdropped    -- skip dropped columns
   -- more filters?
   );
END
$func$;
Call:
SELECT * FROM f_count_nulls('public.fortune500'); -- optionally schema-qualified
Produces the requested result.
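As one possible answer to the "-- more filters?" placeholder in the function, columns declared NOT NULL could be skipped, since they can never contain missing values. The sketch below only shows the catalog lookup with that extra condition. The table demo_cols is hypothetical, and whether such columns should be hidden from the report (rather than shown with a count of 0) is a judgment call.

```sql
-- Hypothetical demo table; the name and columns are assumptions.
CREATE TEMP TABLE demo_cols (id int NOT NULL, ticker text, industry text);

-- Same catalog filter as in the function, plus attnotnull to drop
-- columns that carry a NOT NULL constraint.
SELECT attname
FROM   pg_catalog.pg_attribute
WHERE  attrelid = 'demo_cols'::regclass
AND    attnum > 0
AND    NOT attisdropped
AND    NOT attnotnull
ORDER  BY attnum;
-- returns ticker and industry; id is filtered out
```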
All identifiers are handled safely (quoted where needed, no SQL injection).
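To see what that safe handling looks like, the format() specifiers used in the function can be tried on their own: %I quotes a value as an identifier only when needed, and %L always quotes it as a string literal. The sample names below are made up for illustration.

```sql
SELECT format('%I', 'ticker')        AS plain_identifier    -- ticker
     , format('%I', 'Profit Change') AS quoted_identifier   -- "Profit Change"
     , format('%L', 'O''Reilly')     AS quoted_literal;     -- 'O''Reilly'
```

The table name is covered separately: the parameter is of type regclass, so the input is resolved against the catalog, and the call fails if no such table exists.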
Related: