
docs: add performance tuning tips #1137

Merged · 8 commits · Aug 22, 2024

3 changes: 2 additions & 1 deletion docs/user-guide/operations/configuration.md
@@ -444,7 +444,8 @@ Available options:
| `global_write_buffer_reject_size` | String | `2GB` | Global write buffer size threshold above which write requests are rejected. If not set, it defaults to 2 times `global_write_buffer_size` |
| `sst_meta_cache_size` | String | `128MB` | Cache size for SST metadata. Set it to 0 to disable the cache.<br/>If not set, it defaults to 1/32 of OS memory, capped at 128MB. |
| `vector_cache_size` | String | `512MB` | Cache size for vectors and arrow arrays. Set it to 0 to disable the cache.<br/>If not set, it defaults to 1/16 of OS memory, capped at 512MB. |
| `page_cache_size` | String | `512MB` | Cache size for pages of SST row groups. Set it to 0 to disable the cache.<br/>If not set, it defaults to 1/16 of OS memory, capped at 512MB. |
| `page_cache_size` | String | `512MB` | Cache size for pages of SST row groups. Set it to 0 to disable the cache.<br/>If not set, it defaults to 1/8 of OS memory. |
| `selector_result_cache_size` | String | `512MB` | Cache size for time series selector results (e.g. `last_value()`). Set it to 0 to disable the cache.<br/>If not set, it defaults to 1/8 of OS memory. |
| `sst_write_buffer_size` | String | `8MB` | Buffer size for SST writing. |
| `scan_parallelism` | Integer | `0` | Parallelism to scan a region (default: 1/4 of CPU cores).<br/>- `0`: use the default value (1/4 of CPU cores).<br/>- `1`: scan in the current thread.<br/>- `n`: scan with parallelism n. |
| `inverted_index` | -- | -- | The options for inverted index in Mito engine. |
179 changes: 179 additions & 0 deletions docs/user-guide/operations/performance-tuning-tips.md
@@ -0,0 +1,179 @@
# Performance Tuning Tips

A GreptimeDB instance's default configuration may not fit all use cases, so it's important to tune the database configuration and usage according to your scenario.

GreptimeDB provides various metrics to help monitor and troubleshoot performance issues. The official repository provides [Grafana dashboard templates](https://github.com/GreptimeTeam/greptimedb/tree/main/grafana) for both standalone and cluster modes.

## Query

### Metrics

The following metrics help diagnose query performance issues:

| Metric | Type | Description |
|---|---|---|
| greptime_mito_read_stage_elapsed_bucket | histogram | The elapsed time of different phases of a query in the storage engine. |
| greptime_mito_cache_bytes | gauge | Size of cached contents. |
| greptime_mito_cache_hit | counter | Total count of cache hits. |
| greptime_mito_cache_miss | counter | Total count of cache misses. |


### Using cache for object stores

It's highly recommended to enable the object store read cache and the write cache in the storage engine. This could reduce query time by more than 10 times.

The read cache stores objects or ranges on the local disk to avoid fetching the same range from the remote again. The following example shows how to enable the read cache for S3.
- The `cache_path` is the directory to store cached objects.
- The `cache_capacity` is the capacity of the cache. It's recommended to leave at least 1/10 of the total disk space for it.

```toml
[storage]
type = "S3"
bucket = "ap-southeast-1-test-bucket"
root = "your-root"
access_key_id = "****"
secret_access_key = "****"
endpoint = "https://s3.amazonaws.com/"
region = "your-region"
cache_path = "/path/to/s3cache"
cache_capacity = "10G"
```

The write cache acts as a write-through cache that stores files on the local disk before uploading them to the object store. This reduces the first query latency. The following example shows how to enable the write cache.
- The `enable_experimental_write_cache` flag enables the write cache.
- The `experimental_write_cache_size` sets the capacity of the cache.
- The `experimental_write_cache_path` sets the path to store cached files. It is under the data home by default.
- The `experimental_write_cache_ttl` sets the TTL of the cached files.


```toml
[[region_engine]]
[region_engine.mito]
enable_experimental_write_cache = true
experimental_write_cache_size = "10G"
experimental_write_cache_ttl = "8h"
# experimental_write_cache_path = "/path/to/write/cache"
```

### Enlarging cache size

You can monitor the `greptime_mito_cache_bytes` and `greptime_mito_cache_miss` metrics to determine if you need to increase the cache size. The `type` label in these metrics indicates the type of cache.

If the `greptime_mito_cache_miss` metric is consistently high and increasing, or if the `greptime_mito_cache_bytes` metric reaches the cache capacity, you may need to adjust the cache size configurations of the storage engine.

Here's an example:

```toml
[[region_engine]]
[region_engine.mito]
# Cache size for the write cache. The `type` label value for this cache is `file`.
experimental_write_cache_size = "10G"
# Cache size for SST metadata. The `type` label value for this cache is `sst_meta`.
sst_meta_cache_size = "128MB"
# Cache size for vectors and arrow arrays. The `type` label value for this cache is `vector`.
vector_cache_size = "512MB"
# Cache size for pages of SST row groups. The `type` label value for this cache is `page`.
page_cache_size = "512MB"
# Cache size for time series selector (e.g. `last_value()`). The `type` label value for this cache is `selector_result`.
selector_result_cache_size = "512MB"

[region_engine.mito.index]
## The max capacity of the index staging directory.
staging_size = "10GB"
```

Some tips:
- Set `experimental_write_cache_size` to at least 1/10 of the disk space.
- Set `page_cache_size` to at least 1/4 of total memory if the database's memory usage is under 20%.
- Double the cache size if the cache hit ratio is less than 50%.
- If using a full-text index, set `staging_size` to at least 1/10 of the disk space.


### Enlarging scan parallelism

The storage engine limits the number of concurrent scan tasks to 1/4 of CPU cores for each query. Enlarging the parallelism can reduce the query latency if the machine's workload is relatively low.

```toml
[[region_engine]]
[region_engine.mito]
scan_parallelism = 8
```

### Using append-only tables if possible

In general, append-only tables have higher scan performance since the storage engine can skip merging and deduplication. What's more, the query engine can use statistics to speed up some queries if the table is append-only.

We recommend enabling [append_mode](/reference/sql/create.md#create-an-append-only-table) for a table if it doesn't require deduplication or if performance is prioritized over deduplication. For example, a log table should be append-only, as log messages may have the same timestamp.
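
For illustration, here is a minimal sketch of an append-only log table; the table and column names are made up, and the `append_mode` option is the one documented in the linked CREATE TABLE reference:

```sql
-- Hypothetical append-only table: duplicate rows are kept,
-- so the engine can skip merging and deduplication during scans.
CREATE TABLE IF NOT EXISTS app_logs (
  hostname STRING NULL,
  message STRING NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
) WITH (
  'append_mode' = 'true'
);
```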

## Ingestion

### Metrics

The following metrics help diagnose ingestion issues:

| Metric | Type | Description |
|---|---|---|
| greptime_mito_write_stage_elapsed_bucket | histogram | The elapsed time of different phases of processing a write request in the storage engine. |
| greptime_mito_write_buffer_bytes | gauge | The current estimated bytes allocated for the write buffer (memtables). |
| greptime_mito_write_rows_total | counter | The number of rows written to the storage engine. |
| greptime_mito_write_stall_total | gauge | The number of rows currently stalled due to high memory pressure. |
| greptime_mito_write_reject_total | counter | The number of rows rejected due to high memory pressure. |
| raft_engine_sync_log_duration_seconds_bucket | histogram | The elapsed time of flushing the WAL to disk. |
| greptime_mito_flush_elapsed | histogram | The elapsed time of flushing SST files. |


### Batching rows

Batching means sending multiple rows to the database in the same request. This can significantly improve ingestion throughput. A recommended starting point is 1,000 rows per batch. You can enlarge the batch size if latency and resource usage are still acceptable.
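
As an illustration, a batched request in SQL is a multi-row `INSERT`; the sketch below targets the `cpu` table defined in the Schema section of this guide, and the hostnames, values, and timestamps are made up:

```sql
-- One request carrying several rows; extend the VALUES list toward
-- roughly 1000 rows per batch as a starting point.
INSERT INTO cpu (hostname, usage_user, usage_system, usage_idle, ts) VALUES
  ('host-1', 42, 7, 51, '2024-08-22 10:00:00'),
  ('host-2', 18, 3, 79, '2024-08-22 10:00:00'),
  ('host-3', 65, 12, 23, '2024-08-22 10:00:00');
```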

### Writing by time window

Although GreptimeDB can handle out-of-order data, such data still affects performance. GreptimeDB infers a time window size from the ingested data and partitions the data into multiple time windows according to their timestamps. If the written rows are not within the same time window, GreptimeDB needs to split them, which affects write performance.

Generally, real-time data doesn't have the issue mentioned above, as it always uses the latest timestamps. If you need to import data spanning a long time range into the database, we recommend creating the table in advance and [specifying the compaction.twcs.time_window option](/reference/sql/create.md#create-a-table-with-custom-compaction-options).
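
For example, a sketch of pre-creating a table with a one-day time window before importing historical data might look like the following; the table and column names are illustrative, and the `compaction.type` and `compaction.twcs.time_window` options are those described in the linked reference:

```sql
-- Hypothetical table for backfilling a long time range with a 1-day compaction time window.
CREATE TABLE IF NOT EXISTS temperatures (
  sensor_id STRING NULL,
  temperature DOUBLE NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (sensor_id)
) WITH (
  'compaction.type' = 'twcs',
  'compaction.twcs.time_window' = '1d'
);
```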


## Schema

### Using multiple fields

When designing the schema, we recommend putting related metrics that are collected together into the same table. This can also improve write throughput and the compression ratio.


For example, the following three tables collect CPU usage metrics.

```sql
CREATE TABLE IF NOT EXISTS cpu_usage_user (
  hostname STRING NULL,
  usage_value BIGINT NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
);

CREATE TABLE IF NOT EXISTS cpu_usage_system (
  hostname STRING NULL,
  usage_value BIGINT NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
);

CREATE TABLE IF NOT EXISTS cpu_usage_idle (
  hostname STRING NULL,
  usage_value BIGINT NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
);
```

We can merge them into one table with three fields.

```sql
CREATE TABLE IF NOT EXISTS cpu (
  hostname STRING NULL,
  usage_user BIGINT NULL,
  usage_system BIGINT NULL,
  usage_idle BIGINT NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
);
```

@@ -433,7 +433,8 @@ fork_dictionary_bytes = "1GiB"
| `global_write_buffer_reject_size` | String | `2GB` | Write requests are rejected once the size of data in the write buffer exceeds `global_write_buffer_reject_size`. Defaults to 2 times `global_write_buffer_size` |
| `sst_meta_cache_size` | String | `128MB` | Cache size for SST metadata. Set it to 0 to disable the cache<br/>Defaults to 1/32 of memory, capped at 128MB |
| `vector_cache_size` | String | `512MB` | Cache size for in-memory vectors and arrow arrays. Set it to 0 to disable the cache<br/>Defaults to 1/16 of memory, capped at 512MB |
| `page_cache_size` | String | `512MB` | Cache size for SST data pages. Set it to 0 to disable the cache<br/>Defaults to 1/16 of memory, capped at 512MB |
| `page_cache_size` | String | `512MB` | Cache size for SST data pages. Set it to 0 to disable the cache<br/>Defaults to 1/8 of memory |
| `selector_result_cache_size` | String | `512MB` | Cache size for time series selector results such as `last_value()`. Set it to 0 to disable the cache<br/>Defaults to 1/16 of memory, capped at 512MB |
| `sst_write_buffer_size` | String | `8MB` | Write buffer size for SST |
| `scan_parallelism` | Integer | `0` | Scan parallelism (default: 1/4 of CPU cores)<br/>- `0`: use the default value (1/4 of CPU cores)<br/>- `1`: scan in a single thread<br/>- `n`: scan with parallelism n |
| `inverted_index.create_on_flush` | String | `auto` | Whether to build the index on flush<br/>- `auto`: automatically<br/>- `disable`: never |

@@ -0,0 +1,179 @@
# Performance Tuning Tips

The default configuration of a GreptimeDB instance may not fit all scenarios, so it's important to tune the database configuration and usage according to your scenario.

GreptimeDB provides various metrics to help monitor and troubleshoot performance issues. The official repository provides [Grafana dashboard templates](https://github.com/GreptimeTeam/greptimedb/tree/main/grafana) for both standalone and cluster modes.

## Query

### Metrics

The following metrics help diagnose query performance issues:

| Metric | Type | Description |
|---|---|---|
| greptime_mito_read_stage_elapsed_bucket | histogram | The elapsed time of different phases of a query in the storage engine. |
| greptime_mito_cache_bytes | gauge | Size of cached contents. |
| greptime_mito_cache_hit | counter | Total count of cache hits. |
| greptime_mito_cache_miss | counter | Total count of cache misses. |


### Enabling cache for object stores

We recommend enabling the read cache and the write cache when using object storage. This can reduce query time by more than 10 times.

The read cache stores objects or ranges of data on the local disk to avoid fetching the same data from the remote store again. The following example shows how to enable the read cache for S3.
- `cache_path` is the directory that stores cached objects.
- `cache_capacity` is the capacity of the cache. It's recommended to reserve at least 1/10 of the total disk space for the cache.

```toml
[storage]
type = "S3"
bucket = "ap-southeast-1-test-bucket"
root = "your-root"
access_key_id = "****"
secret_access_key = "****"
endpoint = "https://s3.amazonaws.com/"
region = "your-region"
cache_path = "/path/to/s3cache"
cache_capacity = "10G"
```

The write cache acts as a write-through cache: files are stored on the local disk before being uploaded to the object store. This reduces the latency of the first query. The following example shows how to enable the write cache.
- `enable_experimental_write_cache` enables the write cache.
- `experimental_write_cache_size` sets the capacity of the cache.
- `experimental_write_cache_path` sets the path for storing cached files. By default it is under the data home directory.
- `experimental_write_cache_ttl` sets the TTL of the cached files.


```toml
[[region_engine]]
[region_engine.mito]
enable_experimental_write_cache = true
experimental_write_cache_size = "10G"
experimental_write_cache_ttl = "8h"
# experimental_write_cache_path = "/path/to/write/cache"
```

### Enlarging cache size

You can monitor the `greptime_mito_cache_bytes` and `greptime_mito_cache_miss` metrics to determine whether the cache size needs to be increased. The `type` label in these metrics indicates the type of cache.

If the `greptime_mito_cache_miss` metric is consistently high and keeps increasing, or if the `greptime_mito_cache_bytes` metric reaches the cache capacity, you may need to adjust the cache size configuration of the storage engine.

Here's an example:

```toml
[[region_engine]]
[region_engine.mito]
# Cache size for the write cache. The `type` label value for this cache is `file`.
experimental_write_cache_size = "10G"
# Cache size for SST metadata. The `type` label value for this cache is `sst_meta`.
sst_meta_cache_size = "128MB"
# Cache size for vectors and arrow arrays. The `type` label value for this cache is `vector`.
vector_cache_size = "512MB"
# Cache size for pages of SST row groups. The `type` label value for this cache is `page`.
page_cache_size = "512MB"
# Cache size for time series selector results (e.g. `last_value()`). The `type` label value for this cache is `selector_result`.
selector_result_cache_size = "512MB"

[region_engine.mito.index]
## The max capacity of the index staging directory.
staging_size = "10GB"
```

Some tips:
- Set `experimental_write_cache_size` to at least 1/10 of the disk space.
- If the database's memory usage is under 20%, set `page_cache_size` to at least 1/4 of the total memory.
- If the cache hit ratio is below 50%, double the cache size.
- If using a full-text index, set `staging_size` to at least 1/10 of the disk space.

### Enlarging scan parallelism

The storage engine limits the number of concurrent scan tasks to 1/4 of CPU cores for each query. If the machine's workload is relatively low, enlarging the parallelism can reduce query latency.

```toml
[[region_engine]]
[region_engine.mito]
scan_parallelism = 8
```

### Using append-only tables if possible

In general, append-only tables have higher scan performance because the storage engine can skip merging and deduplication. In addition, the query engine can use statistics to speed up some queries if the table is append-only.

We recommend enabling [append_mode](/reference/sql/create.md#create-an-append-only-table) for a table if it doesn't require deduplication or if performance is prioritized over deduplication. For example, a log table should be append-only, as log messages may have the same timestamp.
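
For illustration, here is a minimal sketch of an append-only log table; the table and column names are made up, and the `append_mode` option is the one documented in the linked CREATE TABLE reference:

```sql
-- Hypothetical append-only table: duplicate rows are kept,
-- so the engine can skip merging and deduplication during scans.
CREATE TABLE IF NOT EXISTS app_logs (
  hostname STRING NULL,
  message STRING NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
) WITH (
  'append_mode' = 'true'
);
```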


## Ingestion

### Metrics

The following metrics help diagnose ingestion issues:

| Metric | Type | Description |
|---|---|---|
| greptime_mito_write_stage_elapsed_bucket | histogram | The elapsed time of different phases of processing a write request in the storage engine. |
| greptime_mito_write_buffer_bytes | gauge | The current estimated bytes allocated for the write buffer (memtables). |
| greptime_mito_write_rows_total | counter | The number of rows written to the storage engine. |
| greptime_mito_write_stall_total | gauge | The number of rows currently stalled due to high memory pressure. |
| greptime_mito_write_reject_total | counter | The number of rows rejected due to high memory pressure. |
| raft_engine_sync_log_duration_seconds_bucket | histogram | The elapsed time of flushing the WAL to disk. |
| greptime_mito_flush_elapsed | histogram | The elapsed time of flushing SST files. |


### Batching rows

Batching means sending multiple rows to the database in the same request. This can significantly improve ingestion throughput. A recommended starting point is 1,000 rows per batch. You can enlarge the batch size if latency and resource usage are still acceptable.
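
As an illustration, a batched request in SQL is a multi-row `INSERT`; the sketch below targets the `cpu` table defined in the Schema section of this guide, and the hostnames, values, and timestamps are made up:

```sql
-- One request carrying several rows; extend the VALUES list toward
-- roughly 1000 rows per batch as a starting point.
INSERT INTO cpu (hostname, usage_user, usage_system, usage_idle, ts) VALUES
  ('host-1', 42, 7, 51, '2024-08-22 10:00:00'),
  ('host-2', 18, 3, 79, '2024-08-22 10:00:00'),
  ('host-3', 65, 12, 23, '2024-08-22 10:00:00');
```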

### Writing by time window

Although GreptimeDB can handle out-of-order data, such data still affects performance. GreptimeDB infers a time window size from the ingested data and partitions the data into multiple time windows according to their timestamps. If the written rows are not within the same time window, GreptimeDB needs to split them, which affects write performance.

Generally, real-time data doesn't have this issue, as it always uses the latest timestamps. If you need to import data spanning a long time range into the database, we recommend creating the table in advance and [specifying the compaction.twcs.time_window option](/reference/sql/create.md#create-a-table-with-custom-compaction-options).
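
For example, a sketch of pre-creating a table with a one-day time window before importing historical data might look like the following; the table and column names are illustrative, and the `compaction.type` and `compaction.twcs.time_window` options are those described in the linked reference:

```sql
-- Hypothetical table for backfilling a long time range with a 1-day compaction time window.
CREATE TABLE IF NOT EXISTS temperatures (
  sensor_id STRING NULL,
  temperature DOUBLE NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (sensor_id)
) WITH (
  'compaction.type' = 'twcs',
  'compaction.twcs.time_window' = '1d'
);
```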


## Schema

### Using multiple fields

When designing the schema, we recommend putting related metrics that are collected together into the same table. This can also improve write throughput and the compression ratio.

For example, the following three tables collect CPU usage metrics.

```sql
CREATE TABLE IF NOT EXISTS cpu_usage_user (
  hostname STRING NULL,
  usage_value BIGINT NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
);

CREATE TABLE IF NOT EXISTS cpu_usage_system (
  hostname STRING NULL,
  usage_value BIGINT NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
);

CREATE TABLE IF NOT EXISTS cpu_usage_idle (
  hostname STRING NULL,
  usage_value BIGINT NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
);
```

We can merge them into one table with three fields.

```sql
CREATE TABLE IF NOT EXISTS cpu (
  hostname STRING NULL,
  usage_user BIGINT NULL,
  usage_system BIGINT NULL,
  usage_idle BIGINT NULL,
  ts TIMESTAMP(9) NOT NULL,
  TIME INDEX (ts),
  PRIMARY KEY (hostname)
);
```
1 change: 1 addition & 0 deletions sidebars.ts
@@ -177,6 +177,7 @@ const sidebars: SidebarsConfig = {
'user-guide/operations/compaction',
'user-guide/operations/monitoring',
'user-guide/operations/tracing',
'user-guide/operations/performance-tuning-tips',
],
},
'user-guide/upgrade',