Reading CSV, TSV, and invalid CSV files with Golang
CSV is still one of the most popular formats for organizing tabular data. Golang is a good fit for processing CSV thanks to its performance and ease of use. Let’s see how to handle the most common cases.
Reading a CSV file
To read CSV files, it’s recommended to use the reader from the standard encoding/csv package. We’re going to use the following data.csv for the examples:
cat data.csv
id,name,price
1,Phone,123
2,TV,34
3,Boot,5
In most cases we want to read and process CSV files line by line, so that even large files never have to fit into memory:
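package main
import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
)
func main() {
	f, err := os.Open("data.csv")
	if err != nil {
		log.Fatal(err) // e.g. the file does not exist
	}
	defer f.Close()
	r := csv.NewReader(f)
	for {
		row, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err) // stop on malformed rows or I/O errors
		}
		fmt.Println(row)
	}
}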
[id name price]
[1 Phone 123]
[2 TV 34]
[3 Boot 5]
encoding/csv - the package that allows us to read CSV files
os.Open("data.csv") - opens the data.csv file for reading
csv.NewReader(f) - uses the opened file as the source for the CSV reader
row, err := r.Read() - reads the next line from our CSV file
if err == io.EOF { - this is triggered when we reach the end of the file
fmt.Println(row) - prints the row that was read from the CSV file
If we know we work with small CSV files, we can use the ReadAll() method to read the entire file at once:
package main
import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
)
func main() {
	f, err := os.Open("data.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	r := csv.NewReader(f)
	rows, err := r.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(rows)
}
[[id name price] [1 Phone 123] [2 TV 34] [3 Boot 5]]
r.ReadAll() - reads the entire CSV file
rows - contains a slice of rows (each row is itself a slice of strings)
Reading TSV files and other custom delimiters
In some cases, CSV files are not actually comma-delimited (the “C” in “CSV” stands for comma), and other characters are used to separate columns. Use the Comma property of the reader to define the delimiter in this case. Let’s read a tab-separated file (tabs are used to separate columns):
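package main
import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
)
func main() {
	f, err := os.Open("data.tsv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	r := csv.NewReader(f)
	r.Comma = '\t' // columns are separated by tabs
	for {
		row, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(row)
	}
}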
r.Comma = '\t' - we can use any single character here to match the delimiter used in the file
Reading CSV with custom quoting symbols
Double quotes should be used to quote values in CSV files, but someone might have decided to use something else when creating the CSV you have to deal with.
Unfortunately, the encoding/csv reader doesn’t support custom quote characters. In such cases, we can use external tools to reformat the files before we feed them to our program. Let’s take the following single-quoted CSV file as an example:
cat data-custom.csv
id,name,price
1,Phone,123
2,'TV, Screens',34
3,Boot,5
We can use the Python csvkit toolset to change the quoting:
csvformat -q "'" data-custom.csv > data-standard.csv
csvformat - this tool reformats the given file based on the specified rules
-q "'" - here we state that our file uses single quotes for quoting
data-standard.csv - the formatted CSV will be written to this file
This will produce the following file:
id,name,price
1,Phone,123
2,"TV, Screens",34
3,Boot,5
As we can see, now we have double quotes and this file can be used with our Golang program.
Dealing with broken/invalid CSV files
Broken CSV files are common. Let’s try to handle the following broken CSV:
id,name,price
1,Phone,123
7,
2,TV, Screens,34
3,Boot,
7, - broken, because it has fewer than 3 columns
2,TV, Screens,34 - broken, because the second column is not quoted but has a comma in it
While processing this file, the encoding/csv reader returns errors for invalid rows, which we can catch and handle however we want:
package main
import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
)
func main() {
	f, err := os.Open("data.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	r := csv.NewReader(f)
	for {
		row, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Println(err) // report the invalid row and move on
			continue
		}
		fmt.Println(row)
	}
}
[id name price]
[1 Phone 123]
record on line 3: wrong number of fields
record on line 4: wrong number of fields
[3 Boot ]
err != nil - checks if we got an error for the current row
fmt.Println(err) - outputs the error
continue - we do not want to process (or print, as in the example) invalid rows, so we skip them
Another option is to use the csvclean tool from the csvkit toolset to filter invalid rows out of the CSV file.