Using csvkit to format, clean, and fix CSV files
Haven’t you seen valid CSV files in a while? Me too. The csvkit
package is a Python-based set of tools to process CSV files: fix and clean, convert between different delimiting and quoting, grep, and even query data. We’ll focus on changing the format and fixing CSV files in this tutorial.
But before we start, let’s install csvkit
:
pip install csvkit
pip install
— installs Python packagecsvkit
— what we want to install
This will install a whole bunch of tools:
csv2ods csvcut csvgrep csvjson csvpy csvsql csvstat
csvclean csvformat csvjoin csvlook csvsort csvstack
We’ll focus only on a couple of tools from this list.
Removing CSV header
If we want to remove the header row from the CSV file:
csvformat -K 1 data.csv > out.csv
csvformat
— formatting tool for CSV files-K 1
— skips one line (first row) from the CSV filedata.csv
— file to remove CSV header fromout.csv
— resulting CSV without a header row
We can use -K 2
to remove the first two lines, or any number to remove the first N
lines.
Adding header row to CSV
A quick way to insert a header row is to use the --no-header-row
option. This will insert a mock header with columns named “a,b,c…”:
csvformat --no-header-row data.csv > data-with-header.csv
--no-header-row
— will add header row
Removing column(s) from CSV file
To drop a column from the CSV file:
csvcut -C 3,4 data.csv > out.csv
csvcut
— a tool that removes rows/columns from CSV files-C
— remove given columns (by indexes, starting from 1, uppercase “C”)3,4
— removes the third and fourth columns for a given file
If we only want specific columns to stay while removing everything else, we can use:
csvcut -c 1,2,5 data.csv > out.csv
-c
— keep specified columns only (lowercase “c”)1,2,5
— we’ll have first, second, and fifth columns in the resulting file
Changing CSV delimiter
To convert CSV to a tab-separated (TSV) file:
csvformat -D ";" data.csv
id;name;price
1;Phone;123
7;
2;TV, Screens;34
3;Boot;5
csvformat
— a tool to change CSV formatting-D
— specify delimiter instead of a comma;
— delimiter we want to usedata.csv
— input CSV file to change
Changing commas to tabs as CSV delimiter
To convert CSV to a tab-separated (TSV) file:
csvformat -T data.csv
id name price
1 Phone 123
7
2 TV, Screens 34
3 Boot 5
-T
— here we ask to use tabs as a delimiter for output
Changing CSV quoting (e.g double quotes to single quotes)
We can use a custom quoting symbol for CSV:
csvformat -Q "'" data.csv
id,name,price
1,Phone,123
7,
2,'TV, Screens',34
3,Boot,5
-Q
— set quoting symbol'
— we want to use a single quote
Cleaning invalid CSV rows
In many cases, we have to deal with broken CSV files. We can filter out invalid records from CSV files:
csvclean data.csv
csvclean
— filters the given CSV file and creates valid (and file with errors)data.csv
— file to filter invalid records from
This tool will create 2 files:
data_err.csv
data_out.csv
data_out.csv
— this file will contain only valid CSV recordsdata_err.csv
— this file will contain invalid records along with error details
Now we can analyze all errors in the data_err.csv
file:
cat data_err.csv
line_number,msg,id,name,price
2,"Expected 3 columns, found 2 columns",7,
3,"Expected 3 columns, found 4 columns",2,TV, Screens,34
Working with compressed CSV files
All tools from csvkit
understands gzip compression, so we don’t need to decompress:
csvformat compressed.csv.gz
.gz
— we can work with compressed files
Piping CSV data
We can also pipe CSV data directly to csvkit
commands, which makes it useful to process CSV output from other programs on the fly:
echo '1,2,3,"hi"' | csvformat
1,2,3,hi
echo '1,2,3,"hi"'
— sample command that output some CSV|
— pipe our output to thecsvkit
toolcsvformat
— a tool to accept piped CSV data
Edit this article on Github