Using csvkit to format, clean, and fix CSV files

Haven’t you seen valid CSV files in a while? Me too. The csvkit package is a Python-based set of tools to process CSV files: fix and clean, convert between different delimiting and quoting, grep, and even query data. We’ll focus on changing the format and fixing CSV files in this tutorial.

But before we start, let’s install csvkit:

pip install csvkit

This will install a whole bunch of tools:

csv2ods    csvcut     csvgrep    csvjson    csvpy      csvsql     csvstat    
csvclean   csvformat  csvjoin    csvlook    csvsort    csvstack

We’ll focus only on a couple of tools from this list.

Removing CSV header

Remove CSV header

If we want to remove the header row from the CSV file:

csvformat -K 1 data.csv > out.csv

We can use -K 2 to remove the first two lines, or any number to remove the first N lines.

Adding header row to CSV

A quick way to insert a header row is to use the --no-header-row option. This will insert a mock header with columns named “a,b,c…”:

csvformat --no-header-row data.csv > data-with-header.csv

Removing column(s) from CSV file

Drop CSV column

To drop a column from the CSV file:

csvcut -C 3,4 data.csv > out.csv

If we only want specific columns to stay while removing everything else, we can use:

csvcut -c 1,2,5 data.csv > out.csv

Changing CSV delimiter

To convert CSV to a tab-separated (TSV) file:

csvformat -D ";" data.csv
id;name;price
1;Phone;123
7;
2;TV, Screens;34
3;Boot;5

Changing commas to tabs as CSV delimiter

To convert CSV to a tab-separated (TSV) file:

csvformat -T data.csv
id  name    price
1   Phone   123
7
2   TV, Screens 34
3   Boot    5

Changing CSV quoting (e.g double quotes to single quotes)

We can use a custom quoting symbol for CSV:

csvformat -Q "'" data.csv
id,name,price
1,Phone,123
7,
2,'TV, Screens',34
3,Boot,5

Cleaning invalid CSV rows

In many cases, we have to deal with broken CSV files. We can filter out invalid records from CSV files:

csvclean data.csv

This tool will create 2 files:

data_err.csv
data_out.csv

Now we can analyze all errors in the data_err.csv file:

cat data_err.csv
line_number,msg,id,name,price
2,"Expected 3 columns, found 2 columns",7,
3,"Expected 3 columns, found 4 columns",2,TV, Screens,34

Working with compressed CSV files

All tools from csvkit understands gzip compression, so we don’t need to decompress:

csvformat compressed.csv.gz

Piping CSV data

We can also pipe CSV data directly to csvkit commands, which makes it useful to process CSV output from other programs on the fly:

echo '1,2,3,"hi"' | csvformat
1,2,3,hi
Published 2 years ago in #programming about #python and #csv by Denys Golotiuk

Edit this article on Github
Denys Golotiuk in 2024

I'm a database architect, focused on OLAP databases for storing & processing large amounts of data in an efficient manner. Let me know if you have questions or need help.

Contact me via golotyuk@gmail.com or mrcrypster @ github