
How do I import a huge CSV file into Cassandra?

  • January 7, 2020

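I am trying to import a very long CSV file into Cassandra. It comes from Open Food Data and lists information about food products: ingredients, nutrition facts, labels, and so on. Most of the data is crowdsourced. The file is published on data.gouv.fr, the French public open-data platform.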

The command I tried

I tried the following command with all of the columns, which I collected with a Python script:

cqlsh> COPY bouffe(code, url, creator, created_t, created_datetime, last_modified_t, last_modified_datetime, product_name, generic_name, quantity, packaging, packaging_tags, brands, brands_tags, categories, categories_tags, categories_fr, origins, origins_tags, manufacturing_places, manufacturing_places_tags, labels, labels_tags, labels_fr, emb_codes, emb_codes_tags, first_packaging_code_geo, cities, cities_tags, purchase_places, stores, countries, countries_tags, countries_fr, ingredients_text, allergens, allergens_fr, traces, traces_tags, traces_fr, serving_size, no_nutriments, additives_n, additives, additives_tags, additives_fr, ingredients_from_palm_oil_n, ingredients_from_palm_oil, ingredients_from_palm_oil_tags, ingredients_that_may_be_from_palm_oil_n, ingredients_that_may_be_from_palm_oil, ingredients_that_may_be_from_palm_oil_tags, nutrition_grade_uk, nutrition_grade_fr, pnns_groups_1, pnns_groups_2, states, states_tags, states_fr, main_category, main_category_fr, image_url, image_small_url, energy_100g, energy-from-fat_100g, fat_100g, saturated-fat_100g, butyric-acid_100g, caproic-acid_100g, caprylic-acid_100g, capric-acid_100g, lauric-acid_100g, myristic-acid_100g, palmitic-acid_100g, stearic-acid_100g, arachidic-acid_100g, behenic-acid_100g, lignoceric-acid_100g, cerotic-acid_100g, montanic-acid_100g, melissic-acid_100g, monounsaturated-fat_100g, polyunsaturated-fat_100g, omega-3-fat_100g, alpha-linolenic-acid_100g, eicosapentaenoic-acid_100g, docosahexaenoic-acid_100g, omega-6-fat_100g, linoleic-acid_100g, arachidonic-acid_100g, gamma-linolenic-acid_100g, dihomo-gamma-linolenic-acid_100g, omega-9-fat_100g, oleic-acid_100g, elaidic-acid_100g, gondoic-acid_100g, mead-acid_100g, erucic-acid_100g, nervonic-acid_100g, trans-fat_100g, cholesterol_100g, carbohydrates_100g, sugars_100g, sucrose_100g, glucose_100g, fructose_100g, lactose_100g, maltose_100g, maltodextrins_100g, starch_100g, polyols_100g, fiber_100g, proteins_100g, casein_100g, 
serum-proteins_100g, nucleotides_100g, salt_100g, sodium_100g, alcohol_100g, vitamin-a_100g, beta-carotene_100g, vitamin-d_100g, vitamin-e_100g, vitamin-k_100g, vitamin-c_100g, vitamin-b1_100g, vitamin-b2_100g, vitamin-pp_100g, vitamin-b6_100g, vitamin-b9_100g, folates_100g, vitamin-b12_100g, biotin_100g, pantothenic-acid_100g, silica_100g, bicarbonate_100g, potassium_100g, chloride_100g, calcium_100g, phosphorus_100g, iron_100g, magnesium_100g, zinc_100g, copper_100g, manganese_100g, fluoride_100g, selenium_100g, chromium_100g, molybdenum_100g, iodine_100g, caffeine_100g, taurine_100g, ph_100g, fruits-vegetables-nuts_100g, fruits-vegetables-nuts-estimate_100g, collagen-meat-protein-ratio_100g, cocoa_100g, chlorophyl_100g, carbon-footprint_100g, nutrition-score-fr_100g, nutrition-score-uk_100g, glycemic-index_100g, water-hardness_100g) FROM 'bouffe.csv' WITH HEADER = true;

But it gives me the following errors:

...
Failed to import 23 rows: ParseError - Invalid row length 84 should be 163,  given up without retries
Failed to import 47 rows: ParseError - Invalid row length 77 should be 163,  given up without retries
Failed to import 73 rows: ParseError - Invalid row length 52 should be 163,  given up without retries
Failed to import 5000 rows: Error - new-line character seen in unquoted field - do you need to open the file in universal-newline mode?,  given up after 1 attempts
Failed to import 2 rows: ParseError - Invalid row length 32 should be 163,  given up without retries
Failed to import 56 rows: ParseError - Invalid row length 69 should be 163,  given up without retries
Exceeded maximum number of insert errors 1000 Avg. rate:    7467 rows/s
Failed to process 192457 rows; failed rows written to import_k1_bouffe.err
Exceeded maximum number of insert errors 1000
Processed: 185000 rows; Rate:    4855 rows/s; Avg. rate:    7407 rows/s
185000 rows imported from 0 files in 24.977 seconds (0 skipped).
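
The varying row lengths in this log (84, 77, 52 and so on instead of 163) are what one would expect if the file is split on a character other than its real delimiter. A minimal Python sketch for checking this is shown here; the function name `field_count_histogram` and the sample data are made up for illustration and are not part of the original post.

```python
import csv
import io
from collections import Counter


def field_count_histogram(text, delimiter):
    """Count how many rows have each number of fields for a given delimiter.

    Hypothetical helper: quote characters are treated as plain text so that
    stray quotes cannot merge several lines into one row.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    return Counter(len(row) for row in reader)


# Made-up sample: tab-separated, with a comma inside one text field.
sample = "code\tname\tlabels\n1\tjam\tbio, vegan\n2\ttea\torganic\n"

by_tab = field_count_histogram(sample, "\t")
by_comma = field_count_histogram(sample, ",")
print("split on tab:  ", dict(by_tab))
print("split on comma:", dict(by_comma))
```

If one delimiter yields a single field count for nearly every row while another yields many different counts, the first is most likely the delimiter the file really uses. For the real file, the text would come from reading `bouffe.csv` rather than from an inline string.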

Before that, I created the table:

create ColumnFamily Bouffe
(Code varchar PRIMARY KEY,
url varchar,
...
)

When I ask Cassandra to describe my table, I get:

cqlsh:k1> DESCRIBE TABLE bouffe;

CREATE TABLE k1.bouffe (
   code int PRIMARY KEY,
   additives text,
   additives_fr text,
   additives_n text,
   additives_tags text,
   alcohol_100g text,
   allergens text,
   allergens_fr text,
   alpha_linolenic_acid_100g text,
   arachidic_acid_100g text,
   arachidonic_acid_100g text,
   behenic_acid_100g text,
   beta_carotene_100g text,
   bicarbonate_100g text,
   biotin_100g text,
   brands text,
   brands_tags text,
   butyric_acid_100g text,
   caffeine_100g text,
   calcium_100g text,
   capric_acid_100g text,
   caproic_acid_100g text,
   caprylic_acid_100g text,
   carbohydrates_100g text,
   carbon_footprint_100g text,
   casein_100g text,
   categories text,
   categories_fr text,
   categories_tags text,
   cerotic_acid_100g text,
   chloride_100g text,
   chlorophyl_100g text,
   cholesterol_100g text,
   chromium_100g text,
   cities text,
   cities_tags text,
   cocoa_100g text,
   collagen_meat_protein_ratio_100g text,
   copper_100g text,
   countries text,
   countries_fr text,
   countries_tags text,
   created_datetime text,
   created_t text,
   creator text,
   dihomo_gamma_linolenic_acid_100g text,
   docosahexaenoic_acid_100g text,
   eicosapentaenoic_acid_100g text,
   elaidic_acid_100g text,
   emb_codes text,
   emb_codes_tags text,
   energy_100g text,
   energy_from_fat_100g text,
   erucic_acid_100g text,
   fat_100g text,
   fiber_100g text,
   first_packaging_code_geo text,
   fluoride_100g text,
   folates_100g text,
   fructose_100g text,
   fruits_vegetables_nuts_100g text,
   fruits_vegetables_nuts_estimate_100g text,
   gamma_linolenic_acid_100g text,
   generic_name text,
   glucose_100g text,
   glycemic_index_100g text,
   gondoic_acid_100g text,
   image_small_url text,
   image_url text,
   ingredients_from_palm_oil text,
   ingredients_from_palm_oil_n text,
   ingredients_from_palm_oil_tags text,
   ingredients_text text,
   ingredients_that_may_be_from_palm_oil text,
   ingredients_that_may_be_from_palm_oil_n text,
   ingredients_that_may_be_from_palm_oil_tags text,
   iodine_100g text,
   iron_100g text,
   labels text,
   labels_fr text,
   labels_tags text,
   lactose_100g text,
   last_modified_datetime text,
   last_modified_t text,
   lauric_acid_100g text,
   lignoceric_acid_100g text,
   linoleic_acid_100g text,
   magnesium_100g text,
   main_category text,
   main_category_fr text,
   maltodextrins_100g text,
   maltose_100g text,
   manganese_100g text,
   manufacturing_places text,
   manufacturing_places_tags text,
   mead_acid_100g text,
   melissic_acid_100g text,
   molybdenum_100g text,
   monounsaturated_fat_100g text,
   montanic_acid_100g text,
   myristic_acid_100g text,
   nervonic_acid_100g text,
   no_nutriments text,
   nucleotides_100g text,
   nutrition_grade_fr text,
   nutrition_grade_uk text,
   nutrition_score_fr_100g text,
   nutrition_score_uk_100g text,
   oleic_acid_100g text,
   omega_3_fat_100g text,
   omega_6_fat_100g text,
   omega_9_fat_100g text,
   origins text,
   origins_tags text,
   packaging text,
   packaging_tags text,
   palmitic_acid_100g text,
   pantothenic_acid_100g text,
   ph_100g text,
   phosphorus_100g text,
   pnns_groups_1 text,
   pnns_groups_2 text,
   polyols_100g text,
   polyunsaturated_fat_100g text,
   potassium_100g text,
   product_name text,
   proteins_100g text,
   purchase_places text,
   quantity text,
   salt_100g text,
   saturated_fat_100g text,
   selenium_100g text,
   serum_proteins_100g text,
   serving_size text,
   silica_100g text,
   sodium_100g text,
   starch_100g text,
   states text,
   states_fr text,
   states_tags text,
   stearic_acid_100g text,
   stores text,
   sucrose_100g text,
   sugars_100g text,
   taurine_100g text,
   traces text,
   traces_fr text,
   traces_tags text,
   trans_fat_100g text,
   url text,
   vitamin_a_100g text,
   vitamin_b12_100g text,
   vitamin_b1_100g text,
   vitamin_b2_100g text,
   vitamin_b6_100g text,
   vitamin_b9_100g text,
   vitamin_c_100g text,
   vitamin_d_100g text,
   vitamin_e_100g text,
   vitamin_k_100g text,
   vitamin_pp_100g text,
   water_hardness_100g text,
   zinc_100g text
) WITH bloom_filter_fp_chance = 0.01
   AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
   AND comment = ''
   AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
   AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
   AND crc_check_chance = 1.0
   AND dclocal_read_repair_chance = 0.1
   AND default_time_to_live = 0
   AND gc_grace_seconds = 864000
   AND max_index_interval = 2048
   AND memtable_flush_period_in_ms = 0
   AND min_index_interval = 128
   AND read_repair_chance = 0.0
   AND speculative_retry = '99PERCENTILE';

What the data looks like

Some values do not fall under any column header: they appear after the last column, water-hardness_100g.

[Screenshot, apparently from LibreOffice, showing values after the water-hardness_100g column]

So, how do I import a huge CSV file into Cassandra?

My current idea is to use Python to create a new CSV file in which the missing fields are filled with NaN.
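
As a rough illustration of that idea, the sketch below pads short rows up to the header width. The helper name `pad_row` and the sample rows are hypothetical, and the `"NaN"` filler simply follows the wording above; whether Cassandra accepts that value depends on each column's type.

```python
def pad_row(row, width, fill="NaN"):
    """Return a copy of `row` padded with `fill` up to `width` fields.

    Rows that already have `width` or more fields are returned unchanged,
    so over-long rows still need separate handling.
    """
    missing = width - len(row)
    return list(row) + [fill] * max(missing, 0)


# Made-up example with a three-column header.
header = ["code", "url", "creator"]
rows = [["1", "http://example.org"], ["2"], ["3", "u", "c"]]
padded = [pad_row(r, len(header)) for r in rows]
print(padded)
```

Padding only makes the field counts match. If rows are short because the file is being split on the wrong delimiter or because of quoting problems, padding would still leave values in the wrong columns.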

Use the Cassandra bulk loader (sstableloader). How to perform the load is described in detail here: https://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated

A bit late, but I took a quick look at the data.

The following settings should work:

Separated by: only select 'Tab'
Text delimiter: use none

Some fields contain quotation marks, but the fields themselves are not quoted.
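
The same settings can be sketched in Python with the standard `csv` module: split on tabs only and treat quotation marks as ordinary characters. The function name `read_tab_separated` and the sample line are invented for illustration.

```python
import csv
import io


def read_tab_separated(stream):
    """Yield rows split on tabs only, treating quote characters as plain text."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Made-up sample: the second row contains an unbalanced quotation mark.
sample = io.StringIO('code\tproduct_name\tquantity\n123\tPain "complet\t500 g\n')
rows = list(read_tab_separated(sample))
for row in rows:
    print(len(row), row)
```

For the real file, the stream would be something like `open("bouffe.csv", newline="", encoding="utf-8")`, where the encoding is an assumption. With `csv.QUOTE_NONE`, a stray quotation mark inside a field cannot swallow the following tabs or line breaks, which is the situation the note above describes.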

Source: https://dba.stackexchange.com/questions/192088