Extracting data from a very large CSV file

August 6, 2019

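I have a 40 GB CSV file with more than 60 million rows for data analysis. Each row has a unique identifier (some numbers). For example, the identifier on the first row will repeat about 150,000 rows later.
I would like a way to run through the whole file, extract the rows that share the same identifier, and write them to a new CSV file. Is there a good, automated way to do this? Note that the file is very large, and Excel cannot open it.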

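Copy into PostgreSQL
Psha... Microsoft sheeple. PostgreSQL is clearly the superior platform for this task. No need to worry about downloading things and figuring out an arcane configuration manager; just use your package manager (brew, in my case) and run brew install postgresql
Then, assuming more /tmp/my.csv shows something like this: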
uuid,foo,bar
00000000-0000-0000-0000-000000000001,bing,bong
00000000-0000-0000-0000-000000000002,bork,björk
... lots more rows
00000000-0000-0000-0000-000000000001,blat,splunge
...then you can save the following as script.sql, run it with psql -f ./script.sql, and you're done! Chewy copy pasta:
drop table if exists myCsv, myOkays, myDupes;

create temporary table myCsv (
    "uuid" uuid
   ,foo    text
   ,bar    text
);

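-- COPY ... FROM '<path>' reads the file on the database server and needs
-- superuser or pg_read_server_files rights; psql's \copy reads it client-side.
copy myCsv
from '/tmp/my.csv' delimiter ',' csv header;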

-- Create empty result tables with matching columns (limit 0 copies no rows).
select 0::int AS "dupe_count", * into myDupes from myCsv limit(0);
select * into myOkays from myCsv limit(0);

-- Find every uuid that appears more than once, then copy all of its rows.
with dupes as (
   select "uuid", count(*) as "dupe_count"
   from myCsv
   group by "uuid"
   having count(*) > 1
)
insert into myDupes
select d."dupe_count", mc.* 
from myCsv as mc
join dupes d on d."uuid" = mc."uuid";

-- Every row whose uuid is not among the duplicates goes into myOkays.
insert into myOkays
select mc.* 
from myCsv as mc
where not exists ( 
   select 1
   from myDupes d 
   where d."uuid" = mc."uuid"
);
Hey, presto! You now have:
Your original file
A table containing all the duplicate records
A table containing only the records you want
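A quick sanity check on the result could look like the sketch below. It is not part of the original answer; it assumes it runs in the same psql session as the script above (for example, appended to the end of script.sql), because myCsv is a temporary table and disappears when the session ends. The column aliases are made up for illustration.

```sql
-- Assumption: executed in the same session as script.sql, while the
-- temporary table myCsv still exists.
select
    (select count(*) from myCsv)   as total_rows  -- rows loaded from the CSV
   ,(select count(*) from myDupes) as dupe_rows   -- rows whose uuid repeats
   ,(select count(*) from myOkays) as okay_rows;  -- rows whose uuid is unique
```

Every loaded row lands in exactly one of the two result tables, so dupe_rows plus okay_rows should equal total_rows.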
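Of course, you'll want to do some further data cleaning. You may even want to export the data back to a CSV on the file system for use in your data science tools. But the exact details of any errors you run into are probably best left for another question.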
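Exporting back to CSV might be sketched as follows. This is an illustration, not part of the original answer: the output paths /tmp/my_dupes.csv and /tmp/my_okays.csv are hypothetical, and a server-side COPY ... TO needs superuser or pg_write_server_files rights plus a path the server process can write to. If that is not available, psql's \copy meta-command accepts the same options and writes the file on the client side instead.

```sql
-- Hypothetical output paths; adjust them to a directory the server can write to.
-- The duplicates are sorted so rows sharing a uuid end up next to each other.
-- Note that the export includes the extra dupe_count column.
copy (select * from myDupes order by "uuid")
to '/tmp/my_dupes.csv' with (format csv, header);

-- The rows whose uuid occurs only once.
copy myOkays
to '/tmp/my_okays.csv' with (format csv, header);
```

To pull out the rows for a single identifier only, the query inside the first copy could take a where clause on "uuid".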

Source: https://dba.stackexchange.com/questions/240439
