Published by Nick on 17 Apr 2008 at 01:36 pm
Living in CSV Hell
As part of my new job im left dealing with a whole series of databases, and one of the regular jobs is to update them from various sources. Normally that involves extracting data as a CSV file from the main database, which then needs to be edited and checking in excel because all the table names need changing so that when you import it into the database system used within the office it matches the fields used there. Also the conversion to csv seems to introduce random characters in fields, so as part of the process you also have to check all the data and make sure it doesnt contain any of these random characters before finally importing it and hoping everything works.
Now incase this didnt sound too painful, while most of the time there are only a few new records, im about to get to the one time of year where we receive in excess of 2500 new records. And while in previous years it may have been accepted that you need to manually check them, i *really* dont want to have to waste a day (and my sanity!) checking all this data.
So we get to my problem(s), not only am i stuck using windows (bad!), but ive also got a crappy CRT monitor (my eyes!), and finally i have to spend half my time working with a custom frontend to an oracle database which is written in java and helpfully launches from inside IE and is so obtuse and badly designed i feel sick everytime i use it!
However, getting away from my severe java allergies :p im hitting a roadblock in terms of automating at least some of this, originally i was going to try out my python skills, till i realised i really dont know anything, so i resorted to using PHP. In both cases though, despite some googling there seem to be very few, if any, decent tutorials/guides to working with CSV files. So my question plea is for any guides/tutorials or any decent documentation for working with CSV files in PHP. Im open to pretty much anything as the best ive found is for pretty much just opening the file and reading it.
Now all i have to do is dust off my PHP skills and hope i can throw something together that works!




Leifbk on 17 Apr 2008 at 5:25 pm #
That’s the kind of problem space that Perl was designed for, and in which it really shines. There’s even a special CSV module that takes care of most of the common problems in that area.
Even if I do a lot of coding in PHP, I’ll switch to Perl anytime I need to do some real data munging.
regards, Leif
Brian on 18 Apr 2008 at 12:56 am #
Many databases can read delimited data files directly. mysql for example can do it via LOAD DATA INFILE. I’d be surprised to find a database that can’t do it.
Ruby also has a couple nice CSV libraries. I’ve used them a lot. PHP isn’t really a shell scripting kind of language.
Parsing CSV files isn’t quite trivial, so resist writing your own parser if you can. Everyone’s first gut reaction is to use a regular expression, but you have to handle commas inside quoted strings and escaped quotes inside quoted strings etc.
Nick on 18 Apr 2008 at 3:13 pm #
Yeah, i had thought of perl, but sadly my knowledge of perl is pretty much zero.
With regards to using mysql, i was tempted here too as it would make it *really* easy to dump into a database and then allow me to do all the work to it i wanted, but after thinking it through some more ive come up with what i hope is a simpler solution that doesnt even care that its a CSV file.
As all i really want to do is change all the column titles and remove some blank lines at the top im just using good old fopen(). Its not elegant but importantly its something i can throw together quickly and it still saves me time rather than doing it by hand.
Ive also been able to throw in some basic checking, as im not concerned about whats actually in each cell at this point as im just checking for erroneous characters (sometimes we get random question marks or other bits appear that throws off the target database). Normally this would all be checked by hand and only take a minute, however for a few weeks a year we get thousands of applications to check, so even a simple script like this would allow us to automatically do at least some basic checking and flag records to check later by hand, hopefully saving us a lot of time.
Dont worry though Brian, im not about to try and run this as a shell script, its all going into a basic webpage on our intranet site so files can be uploaded and then download back when their fixed along with a page listing any suspect entries.
Thanks for both your ideas, thankfully a nights sleep allowed me to regain enough of my sanity to come up with a simple(r) solution that doesnt tax my frankly limited programming skills!