Csv
如何將一個巨大的 csv 文件導入 Cassandra?
我正在嘗試將一個 csv 文件導入到很長的 Cassandra 中。這些是食品:成分、營養、標籤……它來自 Open Food Data。列出食品產品的資訊:成分、營養資訊、標籤等。大部分數據來自眾包資訊。法國公共data.gouv.fr開放平台的文件這個信封
我試過的命令
我對能夠使用 python 腳本收集的所有列嘗試以下命令:
cqlsh> COPY bouffe(code, url, creator, created_t, created_datetime, last_modified_t, last_modified_datetime, product_name, generic_name, quantity, packaging, packaging_tags, brands, brands_tags, categories, categories_tags, categories_fr, origins, origins_tags, manufacturing_places, manufacturing_places_tags, labels, labels_tags, labels_fr, emb_codes, emb_codes_tags, first_packaging_code_geo, cities, cities_tags, purchase_places, stores, countries, countries_tags, countries_fr, ingredients_text, allergens, allergens_fr, traces, traces_tags, traces_fr, serving_size, no_nutriments, additives_n, additives, additives_tags, additives_fr, ingredients_from_palm_oil_n, ingredients_from_palm_oil, ingredients_from_palm_oil_tags, ingredients_that_may_be_from_palm_oil_n, ingredients_that_may_be_from_palm_oil, ingredients_that_may_be_from_palm_oil_tags, nutrition_grade_uk, nutrition_grade_fr, pnns_groups_1, pnns_groups_2, states, states_tags, states_fr, main_category, main_category_fr, image_url, image_small_url, energy_100g, energy-from-fat_100g, fat_100g, saturated-fat_100g, butyric-acid_100g, caproic-acid_100g, caprylic-acid_100g, capric-acid_100g, lauric-acid_100g, myristic-acid_100g, palmitic-acid_100g, stearic-acid_100g, arachidic-acid_100g, behenic-acid_100g, lignoceric-acid_100g, cerotic-acid_100g, montanic-acid_100g, melissic-acid_100g, monounsaturated-fat_100g, polyunsaturated-fat_100g, omega-3-fat_100g, alpha-linolenic-acid_100g, eicosapentaenoic-acid_100g, docosahexaenoic-acid_100g, omega-6-fat_100g, linoleic-acid_100g, arachidonic-acid_100g, gamma-linolenic-acid_100g, dihomo-gamma-linolenic-acid_100g, omega-9-fat_100g, oleic-acid_100g, elaidic-acid_100g, gondoic-acid_100g, mead-acid_100g, erucic-acid_100g, nervonic-acid_100g, trans-fat_100g, cholesterol_100g, carbohydrates_100g, sugars_100g, sucrose_100g, glucose_100g, fructose_100g, lactose_100g, maltose_100g, maltodextrins_100g, starch_100g, polyols_100g, fiber_100g, proteins_100g, casein_100g, serum-proteins_100g, nucleotides_100g, salt_100g, sodium_100g, alcohol_100g, vitamin-a_100g, beta-carotene_100g, vitamin-d_100g, vitamin-e_100g, vitamin-k_100g, vitamin-c_100g, vitamin-b1_100g, vitamin-b2_100g, vitamin-pp_100g, vitamin-b6_100g, vitamin-b9_100g, folates_100g, vitamin-b12_100g, biotin_100g, pantothenic-acid_100g, silica_100g, bicarbonate_100g, potassium_100g, chloride_100g, calcium_100g, phosphorus_100g, iron_100g, magnesium_100g, zinc_100g, copper_100g, manganese_100g, fluoride_100g, selenium_100g, chromium_100g, molybdenum_100g, iodine_100g, caffeine_100g, taurine_100g, ph_100g, fruits-vegetables-nuts_100g, fruits-vegetables-nuts-estimate_100g, collagen-meat-protein-ratio_100g, cocoa_100g, chlorophyl_100g, carbon-footprint_100g, nutrition-score-fr_100g, nutrition-score-uk_100g, glycemic-index_100g, water-hardness_100g) FROM 'bouffe.csv' WITH HEADER = true;
但它給了我以下錯誤:
... Failed to import 23 rows: ParseError - Invalid row length 84 should be 163, given up without retries Failed to import 47 rows: ParseError - Invalid row length 77 should be 163, given up without retries Failed to import 73 rows: ParseError - Invalid row length 52 should be 163, given up without retries Failed to import 5000 rows: Error - new-line character seen in unquoted field - do you need to open the file in universal-newline mode?, given up after 1 attempts Failed to import 2 rows: ParseError - Invalid row length 32 should be 163, given up without retries Failed to import 56 rows: ParseError - Invalid row length 69 should be 163, given up without retries Exceeded maximum number of insert errors 1000 Avg. rate: 7467 rows/s Failed to process 192457 rows; failed rows written to import_k1_bouffe.err Exceeded maximum number of insert errors 1000 Processed: 185000 rows; Rate: 4855 rows/s; Avg. rate: 7407 rows/s 185000 rows imported from 0 files in 24.977 seconds (0 skipped).
之前,我創建了:
create ColumnFamily Bouffe (Code varchar PRIMARY KEY, url varchar, ... )
當我要求 cassandra 描述我的桌子時,我有:
cqlsh:k1> DESCRIBE TABLE bouffe; CREATE TABLE k1.bouffe ( code int PRIMARY KEY, additives text, additives_fr text, additives_n text, additives_tags text, alcohol_100g text, allergens text, allergens_fr text, alpha_linolenic_acid_100g text, arachidic_acid_100g text, arachidonic_acid_100g text, behenic_acid_100g text, beta_carotene_100g text, bicarbonate_100g text, biotin_100g text, brands text, brands_tags text, butyric_acid_100g text, caffeine_100g text, calcium_100g text, capric_acid_100g text, caproic_acid_100g text, caprylic_acid_100g text, carbohydrates_100g text, carbon_footprint_100g text, casein_100g text, categories text, categories_fr text, categories_tags text, cerotic_acid_100g text, chloride_100g text, chlorophyl_100g text, cholesterol_100g text, chromium_100g text, cities text, cities_tags text, cocoa_100g text, collagen_meat_protein_ratio_100g text, copper_100g text, countries text, countries_fr text, countries_tags text, created_datetime text, created_t text, creator text, dihomo_gamma_linolenic_acid_100g text, docosahexaenoic_acid_100g text, eicosapentaenoic_acid_100g text, elaidic_acid_100g text, emb_codes text, emb_codes_tags text, energy_100g text, energy_from_fat_100g text, erucic_acid_100g text, fat_100g text, fiber_100g text, first_packaging_code_geo text, fluoride_100g text, folates_100g text, fructose_100g text, fruits_vegetables_nuts_100g text, fruits_vegetables_nuts_estimate_100g text, gamma_linolenic_acid_100g text, generic_name text, glucose_100g text, glycemic_index_100g text, gondoic_acid_100g text, image_small_url text, image_url text, ingredients_from_palm_oil text, ingredients_from_palm_oil_n text, ingredients_from_palm_oil_tags text, ingredients_text text, ingredients_that_may_be_from_palm_oil text, ingredients_that_may_be_from_palm_oil_n text, ingredients_that_may_be_from_palm_oil_tags text, iodine_100g text, iron_100g text, labels text, labels_fr text, labels_tags text, lactose_100g text, last_modified_datetime text, last_modified_t text, lauric_acid_100g text, lignoceric_acid_100g text, linoleic_acid_100g text, magnesium_100g text, main_category text, main_category_fr text, maltodextrins_100g text, maltose_100g text, manganese_100g text, manufacturing_places text, manufacturing_places_tags text, mead_acid_100g text, melissic_acid_100g text, molybdenum_100g text, monounsaturated_fat_100g text, montanic_acid_100g text, myristic_acid_100g text, nervonic_acid_100g text, no_nutriments text, nucleotides_100g text, nutrition_grade_fr text, nutrition_grade_uk text, nutrition_score_fr_100g text, nutrition_score_uk_100g text, oleic_acid_100g text, omega_3_fat_100g text, omega_6_fat_100g text, omega_9_fat_100g text, origins text, origins_tags text, packaging text, packaging_tags text, palmitic_acid_100g text, pantothenic_acid_100g text, ph_100g text, phosphorus_100g text, pnns_groups_1 text, pnns_groups_2 text, polyols_100g text, polyunsaturated_fat_100g text, potassium_100g text, product_name text, proteins_100g text, purchase_places text, quantity text, salt_100g text, saturated_fat_100g text, selenium_100g text, serum_proteins_100g text, serving_size text, silica_100g text, sodium_100g text, starch_100g text, states text, states_fr text, states_tags text, stearic_acid_100g text, stores text, sucrose_100g text, sugars_100g text, taurine_100g text, traces text, traces_fr text, traces_tags text, trans_fat_100g text, url text, vitamin_a_100g text, vitamin_b12_100g text, vitamin_b1_100g text, vitamin_b2_100g text, vitamin_b6_100g text, vitamin_b9_100g text, vitamin_c_100g text, vitamin_d_100g text, vitamin_e_100g text, vitamin_k_100g text, vitamin_pp_100g text, water_hardness_100g text, zinc_100g text ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} AND comment = '' AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'} AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1.0 AND dclocal_read_repair_chance = 0.1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = '99PERCENTILE';
數據是什麼樣的
有些列不在列標題下方
water-hardness_100g
:因此,如何將一個巨大的 csv 文件導入 Cassandra ?
我目前的想法是用 python 創建一個 csv 文件,以便填充
,
與NaN
.
使用 Cassandra BulkLoader(以前稱為 SSTableloader)。此處詳細說明瞭如何執行載入的詳細資訊https://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
有點晚了,但我快速查看了數據。
以下設置應該沒問題:
Separated by: only select 'Tab' Text delimiter: use none
有些欄位有引號,但欄位本身沒有引號。