How manage ingesting errors in ClickHouse

In many cases, we have to work with broken or invalid data files before ingesting data into ClickHouse. ClickHouse supports a huge amount of input formats but in case of an invalid format, we get the following error:

Code: 117. DB::Exception: Expected end of line

We have several strategies here to use (and combine) to achieve desired results.

Skipping errors

ClickHouse has several settings options to control errors while ingesting data from text formats (e.g., CSV or TSV):

Both settings will ask ClickHouse to allow a certain amount of errors, either in absolute value or in percent to total imported records. Invalid records will be skipped while valid records will be imported successfully. ClickHouse throws an exception when it reaches the first limit - either absolute or ratio. If limits are not reached, no error is raised.

It’s a good approach to use both settings options at a query time to have better control over queries:

cat data.csv | clickhouse-client --progress -q \
"INSERT INTO some_table SETTINGS input_format_allow_errors_num = 5, input_format_allow_errors_ratio = 0.1 FORMAT CSV"

Fixing input files

Skipping errors can be a quick fix, but you might want to try fixing source files instead. If dealing with delimited text formats (like CSV), take a look at fixing data with CSVkit.

Further reading

Published a year ago in #data about #clickhouse by Denys Golotiuk

Edit this article on Github
Denys Golotiuk in 2024

I'm a database architect, focused on OLAP databases for storing & processing large amounts of data in an efficient manner. Let me know if you have questions or need help.

Contact me via golotyuk@gmail.com or mrcrypster @ github