Monitoring Panel Deployment
The IoTDB monitoring panel is one of the supporting tools of the IoTDB Enterprise Edition. It addresses monitoring of IoTDB and its operating system, covering operating-system resource monitoring, IoTDB performance monitoring, and hundreds of kernel metrics, to help users keep track of cluster health and carry out cluster tuning and operations. This article takes a common 3C3D cluster (3 ConfigNodes and 3 DataNodes) as an example to show how to enable the system monitoring module in an IoTDB instance and visualize the monitoring indicators with Prometheus + Grafana.
Instructions for using the monitoring panel tool can be found in the Instructions section of the documentation.
1. Installation Preparation
- Install IoTDB: IoTDB V1.0 or later, Enterprise Edition, must be installed first. Contact sales or technical support to obtain it.
- Obtain the IoTDB monitoring panel installation package: the monitoring panel is provided for the Enterprise Edition of the IoTDB database. Contact sales or technical support to obtain it.
2. Installation Steps
2.1 Enable Monitoring Indicator Collection in IoTDB
- Enable the monitoring configuration items. The monitoring-related configuration items in IoTDB are disabled by default. Before deploying the monitoring panel, enable the relevant items as listed below (note that the services must be restarted after the monitoring configuration is enabled); a command-line sketch for doing this follows the tables below.
Configuration | Configuration File | Description |
---|---|---|
cn_metric_reporter_list | conf/iotdb-system.properties | Uncomment the item and set the value to PROMETHEUS |
cn_metric_level | conf/iotdb-system.properties | Uncomment the item and set the value to IMPORTANT |
cn_metric_prometheus_reporter_port | conf/iotdb-system.properties | Uncomment the item and keep the default port 9091, or set another port as long as it does not conflict with ports already in use |
dn_metric_reporter_list | conf/iotdb-system.properties | Uncomment the item and set the value to PROMETHEUS |
dn_metric_level | conf/iotdb-system.properties | Uncomment the item and set the value to IMPORTANT |
dn_metric_prometheus_reporter_port | conf/iotdb-system.properties | Uncomment the item and keep the default port 9092, or set another port as long as it does not conflict with ports already in use |
Taking the 3C3D cluster as an example, the monitoring configuration that needs to be modified is as follows:
Node IP | Host Name | Cluster Role | Configuration File Path | Configuration |
---|---|---|---|---|
192.168.1.3 | iotdb-1 | confignode | conf/iotdb-system.properties | cn_metric_reporter_list=PROMETHEUS<br>cn_metric_level=IMPORTANT<br>cn_metric_prometheus_reporter_port=9091 |
192.168.1.4 | iotdb-2 | confignode | conf/iotdb-system.properties | cn_metric_reporter_list=PROMETHEUS<br>cn_metric_level=IMPORTANT<br>cn_metric_prometheus_reporter_port=9091 |
192.168.1.5 | iotdb-3 | confignode | conf/iotdb-system.properties | cn_metric_reporter_list=PROMETHEUS<br>cn_metric_level=IMPORTANT<br>cn_metric_prometheus_reporter_port=9091 |
192.168.1.3 | iotdb-1 | datanode | conf/iotdb-system.properties | dn_metric_reporter_list=PROMETHEUS<br>dn_metric_level=IMPORTANT<br>dn_metric_prometheus_reporter_port=9092 |
192.168.1.4 | iotdb-2 | datanode | conf/iotdb-system.properties | dn_metric_reporter_list=PROMETHEUS<br>dn_metric_level=IMPORTANT<br>dn_metric_prometheus_reporter_port=9092 |
192.168.1.5 | iotdb-3 | datanode | conf/iotdb-system.properties | dn_metric_reporter_list=PROMETHEUS<br>dn_metric_level=IMPORTANT<br>dn_metric_prometheus_reporter_port=9092 |
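As a reference, the six items above can also be set from the command line instead of editing the file by hand. The following is a minimal sketch that assumes the default installation layout (conf/iotdb-system.properties relative to the IoTDB installation directory), that the items exist in the file in commented-out form, and that GNU sed/grep are available; adjust the values to your environment.

```Bash
# Run on every node; assumes the current directory is the IoTDB installation directory.
CONF=conf/iotdb-system.properties

# Uncomment (or append) each monitoring item with the values used in this article.
for kv in \
  "cn_metric_reporter_list=PROMETHEUS" \
  "cn_metric_level=IMPORTANT" \
  "cn_metric_prometheus_reporter_port=9091" \
  "dn_metric_reporter_list=PROMETHEUS" \
  "dn_metric_level=IMPORTANT" \
  "dn_metric_prometheus_reporter_port=9092"; do
  key=${kv%%=*}
  if grep -q "^#\?\s*${key}=" "$CONF"; then
    # Replace the existing (possibly commented) line in place.
    sed -i "s|^#\?\s*${key}=.*|${kv}|" "$CONF"
  else
    # Append the item if it is missing from the file.
    echo "${kv}" >> "$CONF"
  fi
done
```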
- Restart all nodes. After modifying the metric configuration on the three nodes, restart the ConfigNode and DataNode services on all nodes:
```Bash
./sbin/stop-standalone.sh      # Stop the ConfigNode and DataNode first
./sbin/start-confignode.sh -d  # Start the ConfigNode
./sbin/start-datanode.sh -d    # Start the DataNode
```
- After restarting, confirm the running status of each node through the client; if every node's status is Running, the configuration is successful:
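For example, the status can be checked with the show cluster statement in the IoTDB CLI; the host, port, user, and password below are illustrative defaults and should be adjusted to your deployment.

```Bash
# Connect with the CLI (example values: host 192.168.1.3, default port 6667, default credentials)
./sbin/start-cli.sh -h 192.168.1.3 -p 6667 -u root -pw root

# Then, inside the CLI, check that every ConfigNode and DataNode reports Running:
# IoTDB> show cluster
```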

2.2 Install and configure Prometheus
Taking Prometheus installed on server 192.168.1.3 as an example.
- Download the Prometheus installation package; V2.30.3 or later is required. It can be downloaded from the Prometheus official website (https://prometheus.io/docs/introduction/first_steps/)
- Unzip the installation package and enter the unzipped folder:
```Bash
tar xvfz prometheus-*.tar.gz
cd prometheus-*
```
- Modify the configuration file prometheus.yml as follows:
- Add a confignode job to collect monitoring data from the ConfigNodes
- Add a datanode job to collect monitoring data from the DataNodes
```YAML
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "confignode"
    static_configs:
      - targets: ["iotdb-1:9091","iotdb-2:9091","iotdb-3:9091"]
    honor_labels: true
  - job_name: "datanode"
    static_configs:
      - targets: ["iotdb-1:9092","iotdb-2:9092","iotdb-3:9092"]
    honor_labels: true
```
- Start Prometheus. The default retention time for Prometheus monitoring data is 15 days; in production it is recommended to increase it to 180 days or more so that historical monitoring data can be traced back further. The startup command is as follows:
```Bash
./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=180d
```
- Confirm that the startup was successful. Open http://192.168.1.3:9090 in a browser to access Prometheus, and open the Targets page under Status. When the State of every target is Up, the configuration is successful and the components are connected.
- Clicking the link on the left of each target in Targets opens the web monitoring page of the corresponding node and shows its monitoring information:
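Besides the web UI, the endpoints can also be checked from the command line. The sketch below assumes the ports configured earlier (9091/9092 for IoTDB metrics, 9090 for Prometheus) and the host names used in this example.

```Bash
# Check that a ConfigNode and a DataNode expose Prometheus-format metrics
curl -s http://iotdb-1:9091/metrics | head
curl -s http://iotdb-1:9092/metrics | head

# Ask Prometheus itself whether all scrape targets are healthy ("up")
curl -s http://192.168.1.3:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
```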

2.3 Install Grafana and configure the data source
Taking Grafana installed on server 192.168.1.3 as an example.
- Download the Grafana installation package; version 8.4.2 or later is required. It can be downloaded from the Grafana official website (https://grafana.com/grafana/download)
- Unzip and enter the corresponding folder
```Bash
tar -zxvf grafana-*.tar.gz
cd grafana-*
```
- Start Grafana:
```Bash
./bin/grafana-server web
```
- Log in to Grafana. Open http://192.168.1.3:3000 (or the modified port) in a browser to access Grafana; the default initial username and password are both admin.
- Configure the data source. Find Data sources under Connections, add a new data source, and add Prometheus as the data source type.

- When configuring the data source, pay attention to the URL at which Prometheus is reachable. After configuring it, click Save & Test; a "Data source is working" prompt indicates that the configuration is successful.
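If you prefer not to configure the data source through the UI, Grafana can also pick it up from a provisioning file at startup. The snippet below is a minimal sketch assuming the Prometheus address used in this example and Grafana's standard conf/provisioning/datasources directory in a tarball installation; restart Grafana after adding it.

```Bash
# Write a data source provisioning file (run from the Grafana installation directory)
cat > conf/provisioning/datasources/iotdb-prometheus.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://192.168.1.3:9090
    isDefault: true
EOF
```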

2.4 Import IoTDB Grafana Dashboards
- Enter Grafana and select Dashboards:
- Click the Import button on the right side
- Import the dashboard by uploading its JSON file
- Select the JSON file of one of the panels in the IoTDB monitoring panel package, using the Apache IoTDB ConfigNode Dashboard as an example (see the installation preparation section of this article for how to obtain the monitoring panel installation package):
- Select Prometheus as the data source and click Import
- Afterwards, you can see the imported Apache IoTDB ConfigNode Dashboard monitoring panel
- Similarly, the Apache IoTDB DataNode Dashboard, Apache Performance Overview Dashboard, and Apache System Overview Dashboard can be imported in the same way, giving the following monitoring panels:
At this point, all IoTDB monitoring panels have been imported, and the monitoring information can be viewed at any time.
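When several panels need to be imported, the upload step can also be scripted against the Grafana HTTP API instead of the UI. This is a sketch assuming default admin credentials and an illustrative file name; dashboards whose JSON still contains template inputs may still need the data source to be selected once in the UI.

```Bash
# Import a dashboard JSON file through the Grafana API (requires jq).
# del(.id) drops any stale dashboard id so Grafana treats this as a create/overwrite.
jq '{dashboard: del(.id), overwrite: true}' "Apache IoTDB ConfigNode Dashboard.json" \
  | curl -s -u admin:admin -H "Content-Type: application/json" \
         -X POST --data-binary @- http://192.168.1.3:3000/api/dashboards/db
```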
3. Appendix: Detailed Explanation of Monitoring Indicators
3.1 System Dashboard
This panel displays the current usage of system CPU, memory, disk, and network resources, as well as partial status of the JVM.
CPU
- CPU Cores:CPU cores
- CPU Utilization:
- System CPU Utilization:The average CPU load and busyness of the entire system during the sampling time
- Process CPU Utilization:The proportion of CPU occupied by the IoTDB process during sampling time
- CPU Time Per Minute:The total CPU time of all processes in the system per minute
Memory
- System Memory:The current usage of system memory.
- Commited VM Size: The size of virtual memory allocated by the operating system to running processes.
- Total Physical Memory:The total amount of available physical memory in the system.
- Used Physical Memory:The total amount of memory already used by the system. Contains the actual amount of memory used by the process and the memory occupied by the operating system buffers/cache.
- System Swap Memory:Swap Space memory usage.
- Process Memory:The usage of memory by the IoTDB process.
- Max Memory: The maximum amount of memory that the IoTDB process can request from the operating system (set by the memory allocation settings in the confignode-env / datanode-env configuration files)
- Total Memory:The total amount of memory that the IoTDB process has currently requested from the operating system.
- Used Memory:The total amount of memory currently used by the IoTDB process.
Disk
- Disk Space:
- Total Disk Space:The maximum disk space that IoTDB can use.
- Used Disk Space:The disk space already used by IoTDB.
- Logs Per Minute:The average number of logs at each level of IoTDB per minute during the sampling time.
- File Count:Number of IoTDB related files
- All:All file quantities
- TsFile:Number of TsFiles
- Seq:Number of sequential TsFiles
- Unseq: Number of unsequential TsFiles
- WAL: Number of WAL files
- Cross-Temp: Number of temp files produced by cross-space compaction
- Inner-Seq-Temp: Number of temp files produced by inner-space compaction in the sequential space
- Inner-Unseq-Temp: Number of temp files produced by inner-space compaction in the unsequential space
- Mods: Number of tombstone files
- Open File Handles:Number of file handles opened by the system
- File Size:The size of IoTDB related files. Each sub item corresponds to the size of the corresponding file.
- Disk Utilization (%): Equivalent to the %util indicator in iostat; to some extent it reflects how busy the disk is. Each sub item corresponds to one disk (the iostat example after this list shows the matching fields).
- Disk I/O Throughput: The average I/O throughput of each disk in the system over a period of time. Each sub item corresponds to one disk.
- Disk IOPS: Equivalent to the r/s, w/s, rrqm/s, and wrqm/s indicators in iostat, i.e. the number of I/O operations the disk performs per second. Read and write count single I/O operations; because of the block device's scheduling algorithm, several adjacent I/Os can in some cases be merged into one, and merge-read and merge-write count how often multiple I/Os were merged into one.
- Disk I/O Latency (Avg): Equivalent to await in iostat, the average latency of each I/O request. Read and write requests are recorded separately.
- Disk I/O Request Size (Avg): Equivalent to avgrq-sz in iostat, reflecting the size of each I/O request. Read and write requests are recorded separately.
- Disk I/O Queue Length (Avg): Equivalent to avgqu-sz in iostat, the average length of the I/O request queue.
- I/O Syscall Rate: The frequency at which the process invokes read and write system calls, similar to IOPS.
- I/O Throughput: The I/O throughput of the process, divided into actual-read/write and attempt-read/write. Actual read and actual write count the bytes for which the process really causes block-device I/O, excluding the part served by the Page Cache.
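Since several of the disk panels mirror iostat fields, the dashboard values can be cross-checked on the host with iostat from the sysstat package, for example:

```Bash
# Extended device statistics, refreshed every second, three samples
# (columns such as r/s, w/s, rrqm/s, wrqm/s and %util correspond to the panels above)
iostat -x 1 3
```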
JVM
- GC Time Percentage: The proportion of time the node JVM spent on GC within the past one-minute window
- GC Allocated/Promoted Size: The average size of objects promoted to the old generation per minute by the node JVM, as well as the size of objects newly allocated in the young generation, the old generation, and non-generational space
- GC Live Data Size: The size of long-lived objects in the node JVM and the maximum allowed size of the old generation
- Heap Memory:JVM heap memory usage.
- Maximum heap memory:The maximum available heap memory size for the JVM.
- Committed heap memory:The size of heap memory that has been committed by the JVM.
- Used heap memory:The size of heap memory already used by the JVM.
- PS Eden Space: The size of the PS Eden space.
- PS Old Space: The size of the PS Old generation.
- PS Survivor Space: The size of the PS Survivor space.
- ...(CMS/G1/ZGC, etc)
- Off-Heap Memory: Off-heap memory usage.
- Direct Memory: Off-heap direct memory.
- Mapped Memory: Off-heap mapped memory.
- GCs Per Minute:The average number of garbage collection attempts per minute by the node JVM, including YGC and FGC
- GC Latency Per Minute:The average time it takes for node JVM to perform garbage collection per minute, including YGC and FGC
- GC Events Breakdown Per Minute:The average number of garbage collections per minute by node JVM due to different reasons, including YGC and FGC
- GC Pause Time Breakdown Per Minute:The average time spent by node JVM on garbage collection per minute due to different reasons, including YGC and FGC
- JIT Compilation Time Per Minute:The total time JVM spends compiling per minute
- Loaded & Unloaded Classes:
- Loaded:The number of classes currently loaded by the JVM
- Unloaded: The number of classes unloaded by the JVM since system startup
- Active Java Threads:The current number of surviving threads in IoTDB. Each sub item represents the number of threads in each state.
Network
eno refers to the physical network card connected to the external network, while lo refers to the loopback (virtual) interface.
- Network Speed:The speed of network card sending and receiving data
- Network Throughput (Receive/Transmit):The size of data packets sent or received by the network card, calculated from system restart
- Packet Transmission Rate:The speed at which the network card sends and receives packets, and one RPC request can correspond to one or more packets
- Active TCP Connections:The current number of socket connections for the selected process (IoTDB only has TCP)
3.2 Performance Overview Dashboard
Cluster Overview
- Total CPU Cores:Total CPU cores of cluster machines
- DataNode CPU Load:CPU usage of each DataNode node in the cluster
- Disk
- Total Disk Space: Total disk size of cluster machines
- DataNode Disk Utilization: The disk usage rate of each DataNode in the cluster
- Total Time Series: The total number of time series managed by the cluster (including replicas), the actual number of time series needs to be calculated in conjunction with the number of metadata replicas
- Cluster Info: Number of ConfigNode and DataNode nodes in the cluster
- Up Time: The duration of cluster startup until now
- Total Write Throughput: The total number of writes per second in the cluster (including replicas), and the actual total number of writes needs to be analyzed in conjunction with the number of data replicas
- Memory
- Total System Memory: Total memory size of cluster machine system
- Total Swap Memory: Total size of cluster machine swap memory
- DataNode Process Memory Utilization: Memory usage of each DataNode in the cluster
- Total Files:Total number of cluster management files
- Cluster System Overview:Overview of cluster machines, including average DataNode node memory usage and average machine disk usage
- Total DataBases: The total number of databases managed by the cluster (including replicas)
- Total DataRegions: The total number of DataRegions managed by the cluster
- Total SchemaRegions: The total number of SchemaRegions managed by the cluster
Node Overview
- CPU Cores: The number of CPU cores in the machine where the node is located
- Disk Space: The disk size of the machine where the node is located
- Time Series: Number of time series managed by the machine where the node is located (including replicas)
- System Overview: System overview of the machine where the node is located, including CPU load, process memory usage ratio, and disk usage ratio
- Write Throughput: The write speed per second of the machine where the node is located (including replicas)
- System Memory: The system memory size of the machine where the node is located
- Swap Memory:The swap memory size of the machine where the node is located
- File Count: Number of files managed by nodes
Performance
- Session Idle Time:The total idle time and total busy time of the session connection of the node
- Client Connections: The client connection status of the node, including the total number of connections and the number of active connections
- Operation Latency: The time consumption of various types of node operations, including average and P99
- Average Interface Latency: The average time consumption of each Thrift interface of the node
- P99 Interface Latency: P99 time consumption of each Thrift interface of the node
- Total Tasks: The number of system tasks for each node
- Average Task Latency: The average time spent on various system tasks of a node
- P99 Task Latency: P99 time consumption for various system tasks of nodes
- Operations Per Second: The number of operations per second for a node
- Mainstream Process
- Operations Per Second (Stage-wise): The number of operations per second for each stage of the node's main process
- Average Stage Latency: The average time consumption of each stage in the main process of a node
- P99 Stage Latency: P99 time consumption for each stage of the node's main process
- Schedule Stage
- Schedule Operations Per Second: The number of operations per second in each sub stage of the node schedule stage
- Average Schedule Stage Latency:The average time consumption of each sub stage in the node schedule stage
- P99 Schedule Stage Latency: P99 time consumption for each sub stage of the schedule stage of the node
- Local Schedule Sub Stages
- Local Schedule Operations Per Second: The number of operations per second in each sub stage of the local schedule node
- Average Local Schedule Stage Latency: The average time consumption of each sub stage in the local schedule stage of the node
- P99 Local Schedule Latency: P99 time consumption for each sub stage of the local schedule stage of the node
- Storage Stage
- Storage Operations Per Second: The number of operations per second in each sub stage of the node storage stage
- Average Storage Stage Latency: Average time consumption of each sub stage in the node storage stage
- P99 Storage Stage Latency: P99 time consumption for each sub stage of node storage stage
- Engine Stage
- Engine Operations Per Second: The number of operations per second in each sub stage of the node engine stage
- Average Engine Stage Latency: The average time consumption of each sub stage in the engine stage of a node
- P99 Engine Stage Latency: P99 time consumption of each sub stage in the node engine stage
System
- CPU Utilization: CPU load of nodes
- CPU Latency Per Minute: The CPU time per minute of a node, with the maximum value related to the number of CPU cores
- GC Latency Per Minute:The average GC time per minute for nodes, including YGC and FGC
- Heap Memory: Node's heap memory usage
- Off-Heap Memory: Off-heap memory usage of the node
- Total Java Threads: Number of Java threads on nodes
- File Count:Number of files managed by nodes
- File Size: Node management file size situation
- Logs Per Minute: Different types of logs per minute for nodes
3.3 ConfigNode Dashboard
This panel displays the performance of all management nodes in the cluster, including partitioning, node information, and client connection statistics.
Node Overview
- Database Count: Number of databases for nodes
- Region
- DataRegion Count: Number of DataRegions of the node
- DataRegion Status: The status of the node's DataRegions
- SchemaRegion Count: Number of SchemaRegions of the node
- SchemaRegion Status: The status of the node's SchemaRegions
- System Memory Utilization: The system memory size of the node
- Swap Memory Utilization: Node's swap memory size
- ConfigNodes Status: The running status of the ConfigNode in the cluster where the node is located
- DataNodes Status:The DataNode situation of the cluster where the node is located
- System Overview: System overview of nodes, including system memory, disk usage, process memory, and CPU load
NodeInfo
- Node Count: The number of nodes in the cluster where the node is located, including ConfigNode and DataNode
- ConfigNode Status: The status of the ConfigNode node in the cluster where the node is located
- DataNode Status: The status of the DataNode node in the cluster where the node is located
- SchemaRegion Distribution: The distribution of SchemaRegions in the cluster where the node is located
- SchemaRegionGroup Leader Distribution: The distribution of leaders in the SchemaRegionGroup of the cluster where the node is located
- DataRegion Distribution: The distribution of DataRegions in the cluster where the node is located
- DataRegionGroup Leader Distribution:The distribution of leaders in the DataRegionGroup of the cluster where the node is located
Protocol
- Client Count
- Active Clients: The number of active clients in each thread pool of a node
- Idle Clients: The number of idle clients in each thread pool of a node
- Borrowed Clients Per Second: The number of clients borrowed per second from each thread pool of the node
- Created Clients Per Second: The number of clients created per second in each thread pool of the node
- Destroyed Clients Per Second: The number of clients destroyed per second in each thread pool of the node
- Client Time Statistics
- Average Client Active Time: The average active time of clients in each thread pool of a node
- Average Client Borrowing Latency: The average borrowing waiting time of clients in each thread pool of a node
- Average Client Idle Time: The average idle time of clients in each thread pool of a node
Partition Table
- SchemaRegionGroup Count: The number of SchemaRegionGroups in the Database of the cluster where the node is located
- DataRegionGroup Count: The number of DataRegionGroups in the Database of the cluster where the node is located
- SeriesSlot Count: The number of SeriesSlots in the Database of the cluster where the node is located
- TimeSlot Count: The number of TimeSlots in the Database of the cluster where the node is located
- DataRegion Status: The DataRegion status of the cluster where the node is located
- SchemaRegion Status: The status of the SchemaRegions of the cluster where the node is located
Consensus
- Ratis Stage Latency: The time consumption of each stage of the node's Ratis
- Write Log Entry Latency: The time required to write a log for the Ratis of a node
- Remote / Local Write Latency: The time consumption of remote and local writes for the Ratis of nodes
- Remote / Local Write Throughput: Remote and local QPS written to node Ratis
- RatisConsensus Memory Utilization: Memory usage of Node Ratis consensus protocol
3.4 DataNode Dashboard
This panel displays the monitoring status of all data nodes in the cluster, including write time, query time, number of stored files, etc.
Node Overview
- Total Managed Entities: The number of entities managed by the node
- Write Throughput: The write speed per second of the node
- Memory Usage: The memory usage of the node, including the memory usage of various parts of IoT Consensus, the total memory usage of SchemaRegion, and the memory usage of various databases.
Protocol
- Node Operation Time Consumption
- Average Operation Latency: The average time spent on various operations of a node
- P50 Operation Latency: The median time spent on various operations of a node
- P99 Operation Latency: P99 time consumption for various operations of nodes
- Thrift Statistics
- Thrift Interface QPS: QPS of various Thrift interfaces of nodes
- Average Thrift Interface Latency: The average time consumption of each Thrift interface of a node
- Thrift Connections: The number of Thrift connections of each type on the node
- Active Thrift Threads: The number of active Thrift threads of each type on the node
- Client Statistics
- Active Clients: The number of active clients in each thread pool of a node
- Idle Clients: The number of idle clients in each thread pool of a node
- Borrowed Clients Per Second: The number of clients borrowed per second from each thread pool of the node
- Created Clients Per Second: The number of clients created per second in each thread pool of the node
- Destroyed Clients Per Second: The number of clients destroyed per second in each thread pool of the node
- Average Client Active Time: The average active time of clients in each thread pool of a node
- Average Client Borrowing Latency: The average borrowing waiting time of clients in each thread pool of a node
- Average Client Idle Time: The average idle time of clients in each thread pool of a node
Storage Engine
- File Count: Number of files of various types managed by nodes
- File Size: Sizes of the various types of files managed by the node
- TsFile
- Total TsFile Size Per Level: The total size of TsFile files at each level of node management
- TsFile Count Per Level: Number of TsFile files at each level of node management
- Average TsFile Size Per Level: The average size of TsFile files at each level of node management
- Total Tasks: Number of Tasks for Nodes
- Task Latency: The time consumption of tasks for nodes
- Compaction
- Compaction Read/Write Throughput: The compaction read and write throughput of the node per second
- Compactions Per Minute: The number of compactions performed by the node per minute
- Compaction Chunk Status: The number of Chunks in different states compacted by the node
- Compacted-Points Per Minute: The number of data points compacted by the node per minute
Write Performance
- Average Write Latency: Average node write time, including writing wal and memtable
- P50 Write Latency: Median node write time, including writing wal and memtable
- P99 Write Latency: P99 for node write time, including writing wal and memtable
- WAL
- WAL File Size: Total size of WAL files managed by nodes
- WAL Files:Number of WAL files managed by nodes
- WAL Nodes: Number of WAL nodes managed by nodes
- Checkpoint Creation Time: The time required to create various types of CheckPoints for nodes
- WAL Serialization Time (Total): Total time spent on node WAL serialization
- Data Region Mem Cost: Memory usage of different DataRegions of nodes, total memory usage of DataRegions of the current instance, and total memory usage of DataRegions of the current cluster
- Serialize One WAL Info Entry Cost: Node serialization time for a WAL Info Entry
- Oldest MemTable Ram Cost When Cause Snapshot: MemTable size when node WAL triggers oldest MemTable snapshot
- Oldest MemTable Ram Cost When Cause Flush: MemTable size when node WAL triggers oldest MemTable flush
- WALNode Effective Info Ratio: The effective information ratio of different WALNodes of nodes
- WAL Buffer
- WAL Buffer Latency: The time spent by the node WAL flushing the SyncBuffer, in synchronous and asynchronous modes
- WAL Buffer Used Ratio: The usage rate of the WAL Buffer of the node
- WAL Buffer Entries Count: The number of entries in the WAL Buffer of a node
- Flush Statistics
- Average Flush Latency: The total time spent on node Flush and the average time spent on each sub stage
- P50 Flush Latency: The total time spent on node Flush and the median time spent on each sub stage
- P99 Flush Latency: The total time spent on node Flush and the P99 time spent on each sub stage
- Average Flush Subtask Latency: The average time consumption of each node's Flush subtask, including sorting, encoding, and IO stages
- P50 Flush Subtask Latency: The median time consumption of each subtask of the Flush node, including sorting, encoding, and IO stages
- P99 Flush Subtask Latency: P99 time consumption of each Flush subtask of the node, including the sorting, encoding, and IO stages
- Pending Flush Task Num: The number of Flush tasks in a blocked state for a node
- Pending Flush Sub Task Num: Number of Flush subtasks blocked by nodes
- Tsfile Compression Ratio of Flushing MemTable: The compression ratio of the TsFiles produced when the node flushes MemTables
- Flush TsFile Size of DataRegions: The TsFile size produced by each flush of the node in different DataRegions
- Size of Flushing MemTable: The size of the MemTables being flushed by the node
- Points Num of Flushing MemTable: The number of data points flushed by the node in different DataRegions
- Series Num of Flushing MemTable: The number of time series in the MemTables flushed by the node in different DataRegions
- Average Point Num of Flushing MemChunk: The average number of data points per MemChunk flushed by the node
Schema Engine
- Schema Engine Mode: The metadata engine mode of the node
- Schema Consensus Protocol: The metadata consensus protocol of the node
- Schema Region Number: The number of SchemaRegions managed by the node
- Schema Region Memory Overview: The amount of memory used by the node's SchemaRegions
- Memory Usage per SchemaRegion: The average memory usage of the node's SchemaRegions
- Cache MNode per SchemaRegion: The number of cache nodes in each SchemaRegion of the node
- MLog Length and Checkpoint: The total length and checkpoint position of the current mlog of each SchemaRegion of the node (valid only for SimpleConsensus)
- Buffer MNode per SchemaRegion: The number of buffer nodes in each SchemaRegion of the node
- Activated Template Count per SchemaRegion: The number of activated templates in each SchemaRegion of the node
- Time Series statistics
- Timeseries Count per SchemaRegion: The average number of time series in each SchemaRegion of the node
- Series Type: The number of time series of each type on the node
- Time Series Number: The total number of time series on the node
- Template Series Number: The total number of template time series on the node
- Template Series Count per SchemaRegion: The number of series created through templates in each SchemaRegion of the node
- IMNode Statistics
- Pinned MNode per SchemaRegion: The number of pinned IMNode nodes in each SchemaRegion of the node
- Pinned Memory per SchemaRegion: The memory usage of the pinned IMNode nodes in each SchemaRegion of the node
- Unpinned MNode per SchemaRegion: The number of unpinned IMNode nodes in each SchemaRegion of the node
- Unpinned Memory per SchemaRegion: The memory usage of the unpinned IMNode nodes in each SchemaRegion of the node
- Schema File Memory MNode Number: The global number of pinned and unpinned IMNode nodes
- Release and Flush MNode Rate: The number of IMNodes released and flushed by the node per second
- Cache Hit Rate: Cache hit rate of nodes
- Release and Flush Thread Number: The current number of active Release and Flush threads on the node
- Time Consumed of Release and Flush (avg): The average time taken by node-triggered cache release and buffer flushing
- Time Consumed of Release and Flush (99%): P99 time taken by node-triggered cache release and buffer flushing
Query Engine
- Time Consumption In Each Stage
- Average Query Plan Execution Time: The average time spent on node queries at each stage
- P50 Query Plan Execution Time: Median time spent on node queries at each stage
- P99 Query Plan Execution Time: P99 time consumption for node query at each stage
- Execution Plan Distribution Time
- Average Query Plan Dispatch Time: The average time spent on node query execution plan distribution
- P50 Query Plan Dispatch Time: Median time spent on node query execution plan distribution
- P99 Query Plan Dispatch Time: P99 of node query execution plan distribution time
- Execution Plan Execution Time
- Average Query Execution Time: The average execution time of node query execution plan
- P50 Query Execution Time:Median execution time of node query execution plan
- P99 Query Execution Time: P99 of node query execution plan execution time
- Operator Execution Time
- Average Query Operator Execution Time: The average execution time of node query operators
- P50 Query Operator Execution Time: Median execution time of node query operator
- P99 Query Operator Execution Time: P99 of node query operator execution time
- Aggregation Query Computation Time
- Average Query Aggregation Execution Time: The average computation time for node aggregation queries
- P50 Query Aggregation Execution Time: Median computation time for node aggregation queries
- P99 Query Aggregation Execution Time: P99 of node aggregation query computation time
- File/Memory Interface Time Consumption
- Average Query Scan Execution Time: The average time spent querying file/memory interfaces for nodes
- P50 Query Scan Execution Time: Median time spent querying file/memory interfaces for nodes
- P99 Query Scan Execution Time: P99 time consumption for node query file/memory interface
- Number Of Resource Visits
- Average Query Resource Utilization: The average number of resource visits for node queries
- P50 Query Resource Utilization: Median number of resource visits for node queries
- P99 Query Resource Utilization: P99 for node query resource access quantity
- Data Transmission Time
- Average Query Data Exchange Latency: The average time spent on node query data transmission
- P50 Query Data Exchange Latency: Median query data transmission time for nodes
- P99 Query Data Exchange Latency: P99 for node query data transmission time
- Number Of Data Transfers
- Average Query Data Exchange Count: The average number of data transfers queried by nodes
- Query Data Exchange Count: The quantile of the number of data transfers queried by nodes, including the median and P99
- Task Scheduling Quantity And Time Consumption
- Query Queue Length: Node query task scheduling quantity
- Average Query Scheduling Latency: The average time spent on scheduling node query tasks
- P50 Query Scheduling Latency: Median time spent on node query task scheduling
- P99 Query Scheduling Latency: P99 of node query task scheduling time
Query Interface
- Load Time Series Metadata
- Average Timeseries Metadata Load Time: The average time taken for node queries to load time series metadata
- P50 Timeseries Metadata Load Time: Median time spent on loading time series metadata for node queries
- P99 Timeseries Metadata Load Time: P99 time consumption for node query loading time series metadata
- Read Time Series
- Average Timeseries Metadata Read Time: The average time taken for node queries to read time series
- P50 Timeseries Metadata Read Time: The median time taken for node queries to read time series
- P99 Timeseries Metadata Read Time: P99 time consumption for node query reading time series
- Modify Time Series Metadata
- Average Timeseries Metadata Modification Time:The average time taken for node queries to modify time series metadata
- P50 Timeseries Metadata Modification Time: Median time spent on querying and modifying time series metadata for nodes
- P99 Timeseries Metadata Modification Time: P99 time consumption for node query and modification of time series metadata
- Load Chunk Metadata List
- Average Chunk Metadata List Load Time: The average time it takes for node queries to load Chunk metadata lists
- P50 Chunk Metadata List Load Time: Median time spent on node query loading Chunk metadata list
- P99 Chunk Metadata List Load Time: P99 time consumption for node query loading Chunk metadata list
- Modify Chunk Metadata
- Average Chunk Metadata Modification Time: The average time it takes for node queries to modify Chunk metadata
- P50 Chunk Metadata Modification Time: The median time spent by node queries modifying Chunk metadata
- P99 Chunk Metadata Modification Time: P99 time consumption for node query and modification of Chunk metadata
- Filter According To Chunk Metadata
- Average Chunk Metadata Filtering Time: The average time spent on node queries filtering by Chunk metadata
- P50 Chunk Metadata Filtering Time: Median filtering time for node queries based on Chunk metadata
- P99 Chunk Metadata Filtering Time: P99 time consumption for node query filtering based on Chunk metadata
- Constructing Chunk Reader
- Average Chunk Reader Construction Time: The average time spent on constructing Chunk Reader for node queries
- P50 Chunk Reader Construction Time: Median time spent on constructing Chunk Reader for node queries
- P99 Chunk Reader Construction Time: P99 time consumption for constructing Chunk Reader for node queries
- Read Chunk
- Average Chunk Read Time: The average time taken for node queries to read Chunks
- P50 Chunk Read Time: Median time spent querying nodes to read Chunks
- P99 Chunk Read Time: P99 time spent on querying and reading Chunks for nodes
- Initialize Chunk Reader
- Average Chunk Reader Initialization Time: The average time spent initializing Chunk Reader for node queries
- P50 Chunk Reader Initialization Time: Median time spent initializing Chunk Reader for node queries
- P99 Chunk Reader Initialization Time:P99 time spent initializing Chunk Reader for node queries
- Constructing TsBlock Through Page Reader
- Average TsBlock Construction Time from Page Reader: The average time it takes for node queries to construct TsBlock through Page Reader
- P50 TsBlock Construction Time from Page Reader: The median time spent on constructing TsBlock through Page Reader for node queries
- P99 TsBlock Construction Time from Page Reader: P99 time spent by node queries constructing TsBlock through Page Reader
- Constructing TsBlock Through Merge Reader
- Average TsBlock Construction Time from Merge Reader: The average time taken by node queries to construct TsBlock through Merge Reader
- P50 TsBlock Construction Time from Merge Reader: The median time spent by node queries constructing TsBlock through Merge Reader
- P99 TsBlock Construction Time from Merge Reader: P99 time spent by node queries constructing TsBlock through Merge Reader
Query Data Exchange
Time consumed by data exchange during queries.
- Obtain TsBlock through source handle
- Average Source Handle TsBlock Retrieval Time: The average time taken for node queries to obtain TsBlock through source handle
- P50 Source Handle TsBlock Retrieval Time: The median time taken by node queries to obtain TsBlock through source handle
- P99 Source Handle TsBlock Retrieval Time: P99 time taken by node queries to obtain TsBlock through source handle
- Deserialize TsBlock through source handle
- Average Source Handle TsBlock Deserialization Time: The average time taken for node queries to deserialize TsBlock through source handle
- P50 Source Handle TsBlock Deserialization Time: The median time taken for node queries to deserialize TsBlock through source handle
- P99 Source Handle TsBlock Deserialization Time: P99 time spent on deserializing TsBlock through source handle for node query
- Send TsBlock through sink handle
- Average Sink Handle TsBlock Transmission Time: The average time taken for node queries to send TsBlock through sink handle
- P50 Sink Handle TsBlock Transmission Time: The median time taken by node queries to send TsBlock through sink handle
- P99 Sink Handle TsBlock Transmission Time: P99 time taken by node queries to send TsBlock through sink handle
- Callback data block event
- Average Data Block Event Acknowledgment Time: The average time taken for node query callback data block event
- P50 Data Block Event Acknowledgment Time: Median time spent on node query callback data block event
- P99 Data Block Event Acknowledgment Time: P99 time consumption for node query callback data block event
- Get Data Block Tasks
- Average Data Block Task Retrieval Time: The average time taken for node queries to obtain data block tasks
- P50 Data Block Task Retrieval Time: The median time taken for node queries to obtain data block tasks
- P99 Data Block Task Retrieval Time: P99 time consumption for node query to obtain data block task
Query Related Resource
- MppDataExchangeManager:The number of shuffle sink handles and source handles during node queries
- LocalExecutionPlanner: The remaining memory that the node can allocate to query fragments
- FragmentInstanceManager: The context information of query fragment instances and the number of fragment instances running on the node
- Coordinator: The number of queries recorded on the node
- MemoryPool Size: The status of the memory pools related to node queries
- MemoryPool Capacity: The size of memory pools related to node queries, including maximum and remaining available values
- DriverScheduler Count: Number of queue tasks related to node queries
Consensus - IoTConsensus
- Memory Usage
- IoTConsensus Used Memory: The memory usage of the node's IoTConsensus, including total memory usage, queue usage, and synchronization usage
- Synchronization Status Between Nodes
- IoTConsensus Sync Index Size: The SyncIndex size of different DataRegions of the node's IoTConsensus
- IoTConsensus Overview: The total synchronization lag and the number of cached requests of the node's IoTConsensus
- IoTConsensus Search Index Growth Rate: The growth rate of the write SearchIndex of different DataRegions of the node's IoTConsensus
- IoTConsensus Safe Index Growth Rate: The growth rate of the synchronization SafeIndex of different DataRegions of the node's IoTConsensus
- IoTConsensus LogDispatcher Request Size: The size of the requests that the node's IoTConsensus synchronizes to other nodes for different DataRegions
- Sync Lag: The synchronization lag of different DataRegions of the node's IoTConsensus
- Min Peer Sync Lag: The minimum synchronization lag from different DataRegions of the node's IoTConsensus to their different replicas
- Peer Sync Speed Difference: The maximum difference in synchronization from different DataRegions of the node's IoTConsensus to their different replicas
- IoTConsensus LogEntriesFromWAL Rate: The rate at which the node's IoTConsensus obtains logs from the WAL for different DataRegions
- IoTConsensus LogEntriesFromQueue Rate: The rate at which the node's IoTConsensus obtains logs from the queue for different DataRegions
- Different Execution Stages Take Time
- The Time Consumed of Different Stages (avg): The average time spent in different execution stages of the node's IoTConsensus
- The Time Consumed of Different Stages (50%): The median time spent in different execution stages of the node's IoTConsensus
- The Time Consumed of Different Stages (99%): P99 of the time spent in different execution stages of the node's IoTConsensus
Consensus - DataRegion Ratis Consensus
- Ratis Consensus Stage Latency: The time consumption of different stages of node Ratis
- Ratis Log Write Latency: The time consumption of writing logs at different stages of node Ratis
- Remote / Local Write Latency: The time it takes for node Ratis to write locally or remotely
- Remote / Local Write Throughput(QPS): QPS written by node Ratis locally or remotely
- RatisConsensus Memory Usage:Memory usage of node Ratis
Consensus - SchemaRegion Ratis Consensus
- RatisConsensus Stage Latency: The time consumption of different stages of node Ratis
- Ratis Log Write Latency: The time consumption for writing logs at each stage of node Ratis
- Remote / Local Write Latency: The time it takes for node Ratis to write locally or remotely
- Remote / Local Write Throughput(QPS): QPS written by node Ratis locally or remotely
- RatisConsensus Memory Usage: Node Ratis Memory Usage