How to manage ingesting errors in ClickHouse
In many cases, we have to work with broken or invalid data files before ingesting data into ClickHouse. ClickHouse supports a huge amount of input formats but in case of an invalid format, we get the following error:
Code: 117. DB::Exception: Expected end of line
We have several strategies here to use (and combine) to achieve desired results.
Skipping errors
ClickHouse has several settings options to control errors while ingesting data from text formats (e.g., CSV or TSV):
input_format_allow_errors_num
— total number of acceptable invalid records during ingest operation (e.g. value of5
will allow 5 invalid records),input_format_allow_errors_ratio
— acceptable portion of invalid records during ingest operation, specified as a float number from0
(meaning no errors are allowed) to1
(meaning 100% of records can be invalid).
Both settings will ask ClickHouse to allow a certain amount of errors, either in absolute value or in percent to total imported records. Invalid records will be skipped while valid records will be imported successfully. ClickHouse throws an exception when it reaches the first limit - either absolute or ratio. If limits are not reached, no error is raised.
It’s a good approach to use both settings options at a query time to have better control over queries:
cat data.csv | clickhouse-client --progress -q \
"INSERT INTO some_table SETTINGS input_format_allow_errors_num = 5, input_format_allow_errors_ratio = 0.1 FORMAT CSV"
data.csv
— file to ingest data into ClickHouse from,SETTINGS
— allows setting query options,5
— we limit the total amount of allowed invalid records to 5,0.1
— we limit the percentage of invalid records to 10%.
Fixing input files
Skipping errors can be a quick fix, but you might want to try fixing source files instead. If dealing with delimited text formats (like CSV), take a look at fixing data with CSVkit.
Further reading
- input_format_allow_errors_num settings option in ClickHouse
- input_format_allow_errors_ratio settings option in ClickHouse
Edit this article on Github