Importing large CSV files into MongoDB
I wanted to import some dummy data into MongoDB to test its aggregation functions. A nice source is the TPC-H benchmark, whose data generation kit can produce arbitrary volumes of data from 1 GB to 100 GB. You can download the kit from the TPC website: http://www.tpc.org/tpch/
The generated CSV files have no header line, but you can find the column names in the TPC-H specification PDF. For the customer table they are:
custkey|name|address|nationkey|phone|acctbal|mktsegment|comment
MongoDB's import possibilities are very limited. mongoimport basically only handles comma-separated (or tab-separated) values, and it fails if the data itself contains commas. So I wrote a little Python script that converts CSV data into the JSON format mongoimport expects. The first line of the CSV file has to contain the column names. mongoimport uses a special JSON format: one document per line, without commas between documents and without surrounding square brackets (you can also import JSON arrays, but their size is very limited). The following lines prepare the TPC-H customer file with a header, convert it to JSON and import it into my MongoDB instance.
echo "custkey|name|address|nationkey|phone|acctbal|mktsegment|comment" > header_customer.tbl cat header_customer.tbl customer.tbl > customer_with_header.tbl ./csv2mongodbjson.py -c customer_with_header.tbl -j customer.json -d '|' mongoimport --db test --collection customer --file customer.json
For a CSV file with 150,000 lines the conversion takes about 3 seconds.
Converting CSV files to MongoDB JSON format
csv2mongodbjson.py
#!/usr/bin/python
import csv
from optparse import OptionParser

# converts an array of csv-columns to a mongodb json line
def convert_csv_to_json(csv_line, csv_headings):
    json_elements = []
    for index, heading in enumerate(csv_headings):
        json_elements.append(heading + ": \"" + unicode(csv_line[index], 'UTF-8') + "\"")
    line = "{ " + ', '.join(json_elements) + " }"
    return line

# parsing the commandline options
parser = OptionParser(description="parses a csv-file and converts it to mongodb json format. The csv file has to have the column names in the first line.")
parser.add_option("-c", "--csvfile", dest="csvfile", action="store", help="input csvfile")
parser.add_option("-j", "--jsonfile", dest="jsonfile", action="store", help="json output file")
parser.add_option("-d", "--delimiter", dest="delimiter", action="store", help="csv delimiter")
(options, args) = parser.parse_args()

# parsing and converting the csvfile
csvreader = csv.reader(open(options.csvfile, 'rb'), delimiter=options.delimiter)
column_headings = csvreader.next()

jsonfile = open(options.jsonfile, 'wb')
while True:
    try:
        csv_current_line = csvreader.next()
        json_current_line = convert_csv_to_json(csv_current_line, column_headings)
        print >>jsonfile, json_current_line
    except csv.Error as e:
        print "Error parsing csv: %s" % e
    except StopIteration as e:
        print "=== Finished ==="
        break

jsonfile.close()
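To check that the import worked, counting the documents should match the number of data lines in the CSV file. Something like this in the mongo shell (assuming the default local mongod and the database and collection names used above):

mongo test --eval "db.customer.count()"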