Added performance section to vsphere README (#5353)

2019-01-29 20:32:48 -05:00
parent 6c6ff372ff
commit d207269a30
1 changed files with 80 additions and 0 deletions
--- a/plugins/inputs/vsphere/README.md
+++ b/plugins/inputs/vsphere/README.md
@@ -196,6 +196,86 @@ For setting up concurrency, modify `collect_concurrency` and `discover_concurren
  # discover_concurrency = 1
 ```

+## Performance Considerations
+
+### Realtime vs. historical metrics
+
+vCenter keeps two different kinds of metrics, known as realtime and historical metrics. 
+
+* Realtime metrics: Available at a 20-second granularity. These metrics are stored in memory and are very fast and cheap to query. Our tests have shown that a complete set of realtime metrics for 7000 virtual machines can be obtained in less than 20 seconds. Realtime metrics are only available for **ESXi hosts** and **virtual machine** resources, and vCenter only stores them for 1 hour.
+* Historical metrics: Available at 5-minute, 30-minute, 2-hour and 24-hour rollup levels. The vSphere Telegraf plugin only uses the 5-minute rollup. These metrics are stored in the vCenter database and can be expensive and slow to query. Historical metrics are the only type of metrics available for **clusters**, **datastores** and **datacenters**.
+
+For more information, refer to the vSphere documentation here: https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.wssdk.pg.doc_50%2FPG_Ch16_Performance.18.2.html
+
+This distinction has an impact on how Telegraf collects metrics. A single instance of an input plugin can have one and only one collection interval, which means that you typically set the collection interval based on the most frequently collected metric. Let's assume you set the collection interval to 1 minute. All realtime metrics will be collected every minute. Since the historical metrics are only available at a 5-minute interval, the vSphere Telegraf plugin automatically skips four out of five collection cycles for these metrics. This works fine in many cases. Problems arise when the collection of historical metrics takes longer than the collection interval. This causes error messages similar to the following to appear in the Telegraf logs:
+
+```2019-01-16T13:41:10Z W! [agent] input "inputs.vsphere" did not complete within its interval```
+
+This will disrupt the metric collection and can result in missed samples. The recommended workaround is to configure two instances of the vSphere plugin: one for the realtime metrics with a short collection interval and one for the historical metrics with a longer interval. You can use the ```*_metric_exclude``` settings to turn off the resource types you don't want to collect metrics for in each instance. For example:
+
+```
+# Realtime instance
+[[inputs.vsphere]]
+  interval = "60s"
+  vcenters = [ "https://someaddress/sdk" ]
+  username = "someuser@vsphere.local"
+  password = "secret"
+
+  insecure_skip_verify = true
+  force_discover_on_init = true
+
+  # Exclude all historical metrics
+  datastore_metric_exclude = ["*"]
+  cluster_metric_exclude = ["*"]
+  datacenter_metric_exclude = ["*"]
+
+  collect_concurrency = 5
+  discover_concurrency = 5
+
+# Historical instance
+[[inputs.vsphere]]
+
+  interval = "300s"
+
+  vcenters = [ "https://someaddress/sdk" ]
+  username = "someuser@vsphere.local"
+  password = "secret"
+
+  insecure_skip_verify = true
+  force_discover_on_init = true
+  host_metric_exclude = ["*"] # Exclude realtime metrics
+  vm_metric_exclude = ["*"] # Exclude realtime metrics
+
+  max_query_metrics = 256 
+  collect_concurrency = 3
+```
+
+### Configuring max_query_metrics setting
+
+The ```max_query_metrics``` setting determines the maximum number of metrics to attempt to retrieve in one call to vCenter. Generally speaking, a higher number means faster and more efficient queries. However, the number of metrics allowed in a query is typically limited by the ```config.vpxd.stats.maxQueryMetrics``` setting in vCenter. The value defaults to 64 on vSphere 5.5 and older and 256 on newer versions of vCenter. The vSphere plugin always checks this setting and automatically reduces the number if the limit configured in vCenter is lower than ```max_query_metrics``` in the plugin. This results in a log message similar to this:
+
+```2019-01-21T03:24:18Z W! [input.vsphere] Configured max_query_metrics is 256, but server limits it to 64. Reducing.```
+
+You may ask a vCenter administrator to increase this limit to help boost performance. 
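+
+As a minimal sketch, assuming a hypothetical vCenter that still uses the older default limit of 64, setting the plugin value to match the server limit means the plugin has nothing to reduce. The address and the value 64 are illustrative assumptions:
+
+```toml
+[[inputs.vsphere]]
+  vcenters = [ "https://someaddress/sdk" ]
+
+  # Assumption: config.vpxd.stats.maxQueryMetrics is 64 on this vCenter.
+  # Keeping max_query_metrics at or below that limit avoids the automatic reduction.
+  max_query_metrics = 64
+```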
+
+### Cluster metrics and the max_query_metrics setting
+
+Cluster metrics are handled a bit differently by vCenter. They are aggregated from ESXi and virtual machine metrics and may not be available when you query their most recent values. When this happens, vCenter will attempt to perform that aggregation on the fly. Unfortunately, all the subqueries needed internally in vCenter to perform this aggregation will count towards ```config.vpxd.stats.maxQueryMetrics```. This means that even a very small query may result in an error message similar to this:
+
+```2018-11-02T13:37:11Z E! Error in plugin [inputs.vsphere]: ServerFaultCode: This operation is restricted by the administrator - 'vpxd.stats.maxQueryMetrics'. Contact your system administrator```
+
+There are two ways of addressing this:
+* Ask your vCenter administrator to set ```config.vpxd.stats.maxQueryMetrics``` to a number that's higher than the total number of virtual machines managed by a vCenter instance.
+* Exclude the cluster metrics and use either the basicstats aggregator to calculate sums and averages per cluster or use queries in the visualization tool to obtain the same result. 
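+
+As a sketch of the first half of the second option, the historical instance could exclude cluster metrics with the same ```*_metric_exclude``` pattern used above. The aggregator or dashboard query that would then compute per-cluster sums and averages is not shown here, and the values are illustrative:
+
+```toml
+# Historical instance without cluster metrics
+[[inputs.vsphere]]
+  interval = "300s"
+  vcenters = [ "https://someaddress/sdk" ]
+
+  host_metric_exclude = ["*"]    # Exclude realtime metrics
+  vm_metric_exclude = ["*"]      # Exclude realtime metrics
+  cluster_metric_exclude = ["*"] # Avoid on-the-fly cluster aggregation in vCenter
+```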
+
+### Concurrency settings
+
+The vSphere plugin allows you to specify two concurrency settings:
+* ```collect_concurrency```: The maximum number of simultaneous queries for performance metrics allowed per resource type.
+* ```discover_concurrency```: The maximum number of simultaneous queries for resource discovery allowed.
+
+While a higher level of concurrency typically has a positive impact on performance, increasing these numbers too much can cause performance issues on the vCenter server. A rule of thumb is to set these parameters to the number of virtual machines divided by 1500, rounded up to the nearest integer.
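+
+As a rough worked example of this rule of thumb, assume a hypothetical vCenter managing 7000 virtual machines: 7000 / 1500 is about 4.67, which rounds up to 5. The inventory size is an assumption and the values below are illustrative, not tuned recommendations:
+
+```toml
+[[inputs.vsphere]]
+  # Assumed inventory: 7000 virtual machines, so ceil(7000 / 1500) = 5
+  collect_concurrency = 5
+  discover_concurrency = 5
+```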
+
 ## Measurements & Fields

 - Cluster Stats