Compression & Encoding
Compression & Encoding
IoTDB employs various encoding and compression techniques to enhance storage efficiency and reduce I/O operations during data writing and reading. Below is a detailed explanation of the supported encoding and compression methods.
1. Encoding Methods
IoTDB supports multiple encoding methods tailored for different data types to optimize storage and performance.
- PLAIN
The default encoding method, meaning no encoding is applied. It supports multiple data types and offers high time efficiency for compression and decompression, but has relatively slow storage efficiency.
- TS_2DIFF
Second-order differential encoding (TS_2DIFF) is suitable for encoding monotonically increasing or decreasing sequences. It is not ideal for encoding data with significant fluctuations.
- RLE
Run-Length Encoding (RLE) is ideal for sequences where certain values appear consecutively. It is not effective for sequences where most consecutive values differ.
RLE can also be used to encode floating-point numbers, while it is necessary to specify decimal precision when creating time series. It is suitable to store sequence data where floating-point values appear continuously, but is not recommended for sequences requiring high decimal precision or those with large fluctuations.
Both RLE and TS_2DIFF encoding for
float
anddouble
have precision limitations, with a default of two decimal places. GORILLA encoding is recommended instead.
- GORILLA
A lossless encoding method suitable for sequences where consecutive values are close to each other. It is not effective for data with large fluctuations.
Currently, there are two versions of GORILLA encoding implementation, it is recommended to use GORILLA
instead of GORILLA_V1
(which is deprecated).
Usage restrictions:
- When using GORILLA encoding for
INT32
data, ensure that the sequence does not contain values equal toInteger.MIN_VALUE
. - When using GORILLA encoding for
INT64
data, ensure that the sequence does not contain values equal toLong.MIN_VALUE
.
- DICTIONARY
A lossless encoding method suitable for data with low cardinality (i.e., a limited number of unique values). It is not recommended for high-cardinality data.
- ZIGZAG
Maps signed integers to unsigned integers, making it suitable for small integer values.
- CHIMP
A lossless encoding method designed for streaming floating-point data compression. It is efficient for sequences with small variations and low random noise.
Usage restrictions:
- When using CHIMP encoding for
INT32
data, ensure that the sequence does not contain values equal toInteger.MIN_VALUE
. - When using CHIMP encoding for
INT64
data, ensure that the sequence does not contain values equal toLong.MIN_VALUE
.
- SPRINTZ
A lossless encoding method combining prediction, ZigZag encoding, bit packing, and run-length encoding. It is best suited for time-series data with small absolute differences (i.e., low fluctuation) and is not effective for data with large variations.
- RLBE
A lossless encoding method combining differential encoding, bit packing, run-length encoding, Fibonacci encoding, and concatenation. It is suitable for time-series data with a small and steadily increasing trend but is not effective for highly fluctuating data.
Data Types and Supported Encoding Methods
The following table summarizes the recommended and supported encoding methods for each data type:
Data Type | Recommended Encoding | Supported Encoding Methods |
---|---|---|
BOOLEAN | RLE | PLAIN, RLE |
INT32 | TS_2DIFF | PLAIN, RLE, TS_2DIFF, GORILLA, ZIGZAG, CHIMP, SPRINTZ, RLBE |
DATE | TS_2DIFF | PLAIN, RLE, TS_2DIFF, GORILLA, ZIGZAG, CHIMP, SPRINTZ, RLBE |
INT64 | TS_2DIFF | PLAIN, RLE, TS_2DIFF, GORILLA, ZIGZAG, CHIMP, SPRINTZ, RLBE |
TIMESTAMP | TS_2DIFF | PLAIN, RLE, TS_2DIFF, GORILLA, ZIGZAG, CHIMP, SPRINTZ, RLBE |
FLOAT | GORILLA | PLAIN, RLE, TS_2DIFF, GORILLA, CHIMP, SPRINTZ, RLBE |
DOUBLE | GORILLA | PLAIN, RLE, TS_2DIFF, GORILLA, CHIMP, SPRINTZ, RLBE |
TEXT | PLAIN | PLAIN, DICTIONARY |
STRING | PLAIN | PLAIN, DICTIONARY |
BLOB | PLAIN | PLAIN |
Error Handling: If the data type entered by the user does not match the specified encoding method, the system will display an error message. For example:
IoTDB> create timeseries root.ln.wf02.wt02.status WITH DATATYPE=BOOLEAN, ENCODING=TS_2DIFF
Msg: 507: encoding TS_2DIFF does not support BOOLEAN
2. Compression Methods
When the time series is written and encoded as binary data according to the specified type, IoTDB applies compression techniques to further enhance storage efficiency. While both encoding and compression aim to optimize storage, encoding techniques are typically designed for specific data types (e.g., second-order differential encoding is only suitable for INT32 or INT64, and storing floating-point numbers requires multiplying them by 10ⁿ to convert them into integers) before converting the data into a binary stream. Compression methods like SNAPPY operate on the binary stream, making them independent of the data type.
2.1 Supported Compression Methods
IoTDB allows specifying the compression method of a column when creating a time series. Currently, IoTDB supports the following compression methods:
- UNCOMPRESSED
- SNAPPY
- LZ4 (Recommended)
- GZIP
- ZSTD
- LZMA2
2.2 Compression Ratio Statistics
IoTDB provides compression ratio statistics to monitor the effectiveness of compression. The statistics are stored in:data/datanode/system/compression_ratio
- ratio_sum: The total sum of memtable compression ratios.
- memtable_flush_time: The total number of memtable flushes.
The average compression ratio can be calculated as:Average Compression Ratio = ratio_sum / memtable_flush_time