adam.nz aboutpostsprojectscontact

Processing CSVs on the Command Line by Adam Shand

I'm doing my taxes and need to process a bunch of CSV files that I've downloaded from my bank to give to my accountant. It's pretty easy to do this by hand, but it's tedious and boring. You could also do it in Excel but where's the fun in that? My goals are to:

  • Search all my downloaded CSV files for transactions between 1 April 2007 and 31 March 2008
  • Sort the matched lines by date
  • Pull out all transactions relevant to my accountant

Assuming that the CSVs you've downloaded from your bank(s) have the date field as the first value and are formatted DD/MM/YYYY (the separator doesn't matter DD-MM-YYYY will work fine), it's wonderfully straight forward:

18/12/2007,-79.32,"Interest","Debit Interest",,"From   0xxx 0","3xxxxx-xx",
01/02/2008,-231.27,"Interest","Debit Interest",,"From   0xxx 0","3xxxxx-xx",

Grab all the records for the relevant tax year:

# egrep -h "^../(0[4-9]|1[012])/2007|0[1-3]/2008" ~/Documents/Money/Accounts/*.csv > /tmp/tax.csv

Then sort the results by date:

# sort -n -t, -k 1.7,1.10 -k 1.4,1.5 -k 1.1,1.2 /tmp/tax.csv > /tmp/tax_sorted_0708.csv

You now have all your 07/08 tax year transactions conveniently sorted and stored in a file called tax_sorted_0708.csv. It's now trivial to pull out the relevant records for your accountant:

# egrep -i "interest|vodafone|telecom|etc" /tmp/tax_sorted_0708.csv > /tmp/tax_final_0708.csv

Of course, you can also put it all together into a single line like this:

# egrep -h "^../(0[4-9]|1[012])/2007|0[1-3]/2008" ~/Documents/Money/Accounts/*.csv | sort -n -t, -k 1.7,1.10 -k 1.4,1.5 -k 1.1,1.2 | egrep -i "interest|vodafone|telecom" > /tmp/tax_final_0708.csv

It took me a few minutes to realise that I could pull out the cross year records with a simple regex, but it was how to sort the results by date stumped me for a bit. I knew I'd be able to do it in Perl pretty easily but was looking for a simple command line tool to sort by date and I couldn't find a purpose written one.

Then I remembered the sub-field capabilities of trusty old sort. Using -t, allows you to set the field delimiter to comma and using -k 1.7,1.10 allows you to sort based on the 7 to 10th characters inside the first field (YYYY). Using multiple keys allows you to sort by multiple fields or sub-fields. In this case -k 1.7,1.10 -k 1.4,1.5 -k 1.1,1.2 allows me to sort first by year (YYYY), then by month (MM) and finally by day (DD).

Perfect! I love Unix. :-)

Update 1: For reasons I don't yet understand, the above sort command works fine on Linux, but not on Mac. I'm assuming that this is because of the differences between the GNU version and the BSD version or sort. From reading the man pages, it looks like they should both work, but I'm obviously missing something.

Update 2: The difference is that on Mac, you must specify the start and stop character for the -k parameter. On Linux, if you don't specify the second position, it defaults to the last character in the field. I've updated the above examples to use the version which works on either, for the curious here's the difference.

This works with GNU sort, but not with Mac sort:

# sort -n -t, -k 1.7 -k 1.4 -k 1.1 /tmp/tax.csv

This works with both GNU and Mac sort. The difference is that you are explicitly saying sort first on character 7 through 10 on field one, then character 4 and 5 of field one and finally sort by characters 1 and 2 of field one):

# sort -n -t, -k 1.7,1.10 -k 1.4,1.5 -k 1.1,1.2

Share your thoughts?

0 comments

Copyheart 1994–2024 Adam Shand. Sharing is an act of love.