Importing large CSV files into MongoDB

I wanted to import some dummy data into MongoDB to test the aggregation functions. A nice source is the TPC-H test data, which can be generated in arbitrary volumes from 1 GB to 100 GB. You can download the data generation kit from the website: http://www.tpc.org/tpch/

In the generated CSV files the header is missing, but you can find the column names in the PDF specification. For the customer table they are:

custkey|name|address|nationkey|phone|acctbal|mktsegment|comment

The MongoDB import possibilities are very limited. Basically mongoimport can only read comma-separated (or tab-separated) values, and it fails when the data itself contains commas. So I wrote a little Python script which converts CSV data to the mongoimport JSON format. The first line of the CSV file has to contain the column names. In the following commands I'm preparing the TPC-H file with a header line, converting it to JSON, and then importing it into my MongoDB. mongoimport uses a special JSON format: one document per line, without separating commas or square brackets. You can also import JSON arrays, but their size is very limited.
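To make the target format concrete, here is a small sketch of what one line of that mongoimport JSON looks like. The sample row and headings are made up for illustration; `json.dumps` is used to quote and escape each value so the line is also valid JSON:

```python
import json

# hypothetical sample values, just for illustration
headings = ["custkey", "name", "mktsegment"]
row = ["1", "Customer#000000001", "BUILDING"]

# one document per line; json.dumps quotes and escapes each value
doc = "{ " + ", ".join(
    '"%s": %s' % (h, json.dumps(v)) for h, v in zip(headings, row)
) + " }"
print(doc)  # { "custkey": "1", "name": "Customer#000000001", "mktsegment": "BUILDING" }
```

Each input row becomes one such line in the output file, which is exactly what mongoimport expects.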

echo "custkey|name|address|nationkey|phone|acctbal|mktsegment|comment" > header_customer.tbl
cat header_customer.tbl customer.tbl > customer_with_header.tbl
./csv2mongodbjson.py -c customer_with_header.tbl -j customer.json -d '|'
mongoimport --db test --collection customer --file customer.json
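Once the import has finished, the aggregation functions mentioned above can be tried out. As an illustrative sketch (assuming a local mongod and the pymongo driver, neither of which the commands above set up), counting customers per market segment could look like this:

```python
# hypothetical example: count customers per market segment,
# assuming the "test.customer" collection imported above
pipeline = [
    {"$group": {"_id": "$mktsegment", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]

# with pymongo installed, the pipeline would be run like this:
# from pymongo import MongoClient
# for doc in MongoClient().test.customer.aggregate(pipeline):
#     print(doc)
```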

For a CSV file with 150,000 lines the conversion takes about 3 seconds.

Converting CSV-Files to Mongo-DB JSON format

csv2mongodbjson.py

#!/usr/bin/python
import csv
import json
from optparse import OptionParser

# converts an array of csv columns to one mongoimport json line
def convert_csv_to_json(csv_line, csv_headings):
    json_elements = []
    for index, heading in enumerate(csv_headings):
        # json.dumps quotes the value and escapes embedded quotes/backslashes
        value = unicode(csv_line[index], 'UTF-8')
        json_elements.append("\"" + heading + "\": " + json.dumps(value))
    return "{ " + ", ".join(json_elements) + " }"

# parsing the command-line options
parser = OptionParser(description="Parses a csv file and converts it to mongodb json format. The csv file has to have the column names in the first line.")
parser.add_option("-c", "--csvfile", dest="csvfile", action="store", help="input csv file")
parser.add_option("-j", "--jsonfile", dest="jsonfile", action="store", help="json output file")
parser.add_option("-d", "--delimiter", dest="delimiter", action="store", help="csv delimiter")

(options, args) = parser.parse_args()

# parsing and converting the csv file
csvreader = csv.reader(open(options.csvfile, 'rb'), delimiter=options.delimiter)
column_headings = csvreader.next()
jsonfile = open(options.jsonfile, 'wb')

while True:
    try:
        csv_current_line = csvreader.next()
        json_current_line = convert_csv_to_json(csv_current_line, column_headings)
        print >>jsonfile, json_current_line
    except csv.Error as e:
        # report the broken line and keep going
        print "Error parsing csv: %s" % e
    except StopIteration:
        print "=== Finished ==="
        break

jsonfile.close()