telegraf/plugins/inputs/zipkin
README.md

Zipkin Plugin

This plugin implements the Zipkin HTTP server to gather trace and timing data needed to troubleshoot latency problems in microservice architectures.

Please Note: This plugin is experimental; its data schema may change based on its main use cases and the evolution of the OpenTracing standard.

Configuration:

[[inputs.zipkin]]
    path = "/api/v1/spans" # URL path for span data
    port = 9411 # Port on which Telegraf listens

The plugin accepts spans in JSON or Thrift if the Content-Type is application/json or application/x-thrift, respectively. If Content-Type is not set, the plugin assumes JSON.
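
As a rough illustration of the Content-Type rule above, the following sketch builds a JSON request for the configured path and port. The helper name `build_span_request` and the `localhost` host are assumptions made for this example; they are not part of the plugin.

```python
import json
import urllib.request


def build_span_request(spans, host="localhost", port=9411, path="/api/v1/spans"):
    """Build (but do not send) a POST request carrying a JSON list of spans.

    Hypothetical helper: host, port, and path mirror the sample
    configuration and should be adjusted to match your own.
    """
    body = json.dumps(spans).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}{path}",
        data=body,
        # Explicit JSON content type; Thrift payloads would use
        # application/x-thrift instead.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending requires a running Telegraf with the zipkin input enabled:
# urllib.request.urlopen(build_span_request([span]))
```

The request is only constructed here, so the sketch can be inspected without a listening server.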

Tracing:

This plugin uses annotations, tags, and fields to track data from spans.

  • TRACE: a set of spans that share a single root span. Traces are built by collecting all spans that share a traceId.

  • SPAN: a set of Annotations and BinaryAnnotations that correspond to a particular RPC.

  • Annotations: record an occurrence in time at the beginning and end of a request. A metric is output for each annotation and binary annotation of a span.

    Annotations may have the following values:

    • CS (client start): beginning of the span; the request is made.
    • SR (server receive): the server receives the request and starts processing it. Network latency and clock jitter account for the difference from CS.
    • SS (server send): the server has finished processing and sends the response back to the client. The time taken to process the request is the difference from SR.
    • CR (client receive): end of the span; the client receives the response from the server. The RPC is considered complete with this annotation.
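
The relationships between the four annotations can be sketched as simple timestamp differences. This is an illustration only: the function name `span_timings` and the result keys are made up for this example, and the input is assumed to follow the Zipkin v1 JSON model (a list of objects with "value" and "timestamp", in microseconds).

```python
def span_timings(annotations):
    """Derive timings (in microseconds) from a span's core annotations.

    Any timing whose two annotations are not both present is None.
    """
    ts = {a["value"]: a["timestamp"] for a in annotations}

    def delta(end, start):
        if end in ts and start in ts:
            return ts[end] - ts[start]
        return None

    return {
        "client_total": delta("cr", "cs"),       # whole RPC as seen by the client
        "server_processing": delta("ss", "sr"),  # time spent inside the server
        "request_transit": delta("sr", "cs"),    # network latency plus clock jitter
        "response_transit": delta("cr", "ss"),   # network latency plus clock jitter
    }
```

Because client and server clocks are not synchronized, the two transit values are only approximate.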

Tags

  • "id": The 64-bit ID of the span.
  • "parent_id": The ID of the span's parent. If there is no parent span, the parent ID is set to the span's own ID.
  • "trace_id": The 64-bit or 128-bit ID of a particular trace. Every span in a trace shares this ID. It is the concatenation of the high and low parts, converted to hexadecimal.
  • "name": Defines a span.
Annotations have these additional tags:
  • "service_name": Defines a service.
  • "annotation": The value of an annotation.
  • "endpoint_host": The IPv4 address concatenated with the listening port. If the port is not present, only the address is used.
Binary Annotations have these additional tags:
  • "service_name": Defines a service.
  • "annotation": The value of an annotation.
  • "endpoint_host": The IPv4 address concatenated with the listening port. If the port is not present, only the address is used.
  • "annotation_key": A label describing the annotation.
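
To make the "trace_id" description concrete, here is a minimal sketch of joining the high and low 64-bit halves into one hexadecimal string. The function name `format_trace_id` is hypothetical, and zero-padding the low half to 16 digits is an assumption about the exact formatting rather than something the text above specifies.

```python
def format_trace_id(low, high=0):
    """Render a trace ID from its 64-bit halves as lowercase hexadecimal."""
    if high == 0:
        # 64-bit trace ID: only the low half is present.
        return format(low, "x")
    # Assumption: the low half is zero-padded to 16 hex digits so the
    # two halves stay unambiguous after concatenation.
    return format(high, "x") + format(low, "016x")
```

For a 64-bit ID such as the one in the example trace, the result is just the hexadecimal form of the low half.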

Fields:

  • "duration_ns": The time in nanoseconds between the beginning and end of a span.
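
Zipkin's v1 model reports a span's "duration" in microseconds, so the nanosecond field presumably amounts to a unit conversion. A tiny sketch, with `duration_ns` as a hypothetical helper name:

```python
NANOSECONDS_PER_MICROSECOND = 1_000


def duration_ns(span):
    """Convert a span's microsecond "duration" into nanoseconds."""
    return span["duration"] * NANOSECONDS_PER_MICROSECOND
```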

Sample Queries:

Get All Span Names for Service my_web_server

SHOW TAG VALUES FROM "zipkin" WITH KEY = "name" WHERE "service_name" = 'my_web_server'
  • Description: returns a list of the names of the spans that have annotations with the service_name my_web_server.

Get All Service Names

SHOW TAG VALUES FROM "zipkin" WITH KEY = "service_name"
  • Description: returns a list of all distinct endpoint service names.

Find spans with longest duration

SELECT max("duration_ns") FROM "zipkin" WHERE "service_name" = 'my_service' AND "name" = 'my_span_name' AND time > now() - 20m GROUP BY "trace_id",time(30s) LIMIT 5
  • Description: finds the top 5 longest span durations in the last 20 minutes for service my_service and span name my_span_name.

This plugin creates high cardinality data, so we recommend using the TSI InfluxDB engine.
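
If you prefer to run these queries from a script rather than the influx shell, one option is InfluxDB 1.x's HTTP /query endpoint. The sketch below only builds the URL; the helper name `influx_query_url`, the `telegraf` database, and the local address on port 8086 are assumptions for this example.

```python
from urllib.parse import urlencode


def influx_query_url(query, db="telegraf", base="http://localhost:8086"):
    """Build a URL for InfluxDB 1.x's /query endpoint.

    The InfluxQL statement is URL-encoded so that quotes and spaces
    survive the trip.
    """
    return f"{base}/query?" + urlencode({"db": db, "q": query})


url = influx_query_url('SHOW TAG VALUES FROM "zipkin" WITH KEY = "service_name"')
```

Fetching `url` with any HTTP client should return the query result as JSON, provided InfluxDB is reachable at that address.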

How To Set Up InfluxDB To Work With Zipkin

Steps
  1. Update InfluxDB to >= 1.3, in order to use the new TSI engine.

  2. Generate a config file with the following command:

influxd config > /path/for/config/file
  3. Add the following to your config file, under the [data] section:
[data]
  index-version = "tsi1"
  4. Start influxd with your new config file:
influxd -config=/path/to/your/config/file
  5. Update your retention policy:
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 1d SHARD DURATION 30m

Example Input Trace:

Trace Example from Zipkin model

{
  "traceId": "bd7a977555f6b982",
  "name": "query",
  "id": "be2d01e33cc78d97",
  "parentId": "ebf33e1a81dc6f71",
  "timestamp": 1458702548786000,
  "duration": 13000,
  "annotations": [
    {
      "endpoint": {
        "serviceName": "zipkin-query",
        "ipv4": "192.168.1.2",
        "port": 9411
      },
      "timestamp": 1458702548786000,
      "value": "cs"
    },
    {
      "endpoint": {
        "serviceName": "zipkin-query",
        "ipv4": "192.168.1.2",
        "port": 9411
      },
      "timestamp": 1458702548799000,
      "value": "cr"
    }
  ],
  "binaryAnnotations": [
    {
      "key": "jdbc.query",
      "value": "select distinct `zipkin_spans`.`trace_id` from `zipkin_spans` join `zipkin_annotations` on (`zipkin_spans`.`trace_id` = `zipkin_annotations`.`trace_id` and `zipkin_spans`.`id` = `zipkin_annotations`.`span_id`) where (`zipkin_annotations`.`endpoint_service_name` = ? and `zipkin_spans`.`start_ts` between ? and ?) order by `zipkin_spans`.`start_ts` desc limit ?",
      "endpoint": {
        "serviceName": "zipkin-query",
        "ipv4": "192.168.1.2",
        "port": 9411
      }
    },
    {
      "key": "sa",
      "value": true,
      "endpoint": {
        "serviceName": "spanstore-jdbc",
        "ipv4": "127.0.0.1",
        "port": 3306
      }
    }
  ]
}
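
To tie the example trace back to the tags and fields described earlier, the following sketch flattens one span into one metric per annotation and binary annotation. It illustrates the documented schema and is not the plugin's actual code: the function names, the tuple layout, the fallback of parent_id to the span's own id, and the rendering of non-string values (such as `true`) as JSON text are all assumptions.

```python
import json


def endpoint_host(endpoint):
    """Join IPv4 address and port; omit the port when it is absent."""
    host = endpoint.get("ipv4", "")
    port = endpoint.get("port")
    return f"{host}:{port}" if port else host


def span_to_metrics(span):
    """Return a list of (measurement, tags, fields) tuples for one span."""
    base_tags = {
        "id": span["id"],
        # Assumption: a span without a parent reuses its own id.
        "parent_id": span.get("parentId") or span["id"],
        "trace_id": span["traceId"],
        "name": span["name"],
    }
    # Zipkin v1 durations are in microseconds.
    fields = {"duration_ns": span["duration"] * 1000}

    metrics = []
    for ann in span.get("annotations", []):
        tags = dict(base_tags)
        tags["service_name"] = ann["endpoint"]["serviceName"]
        tags["annotation"] = ann["value"]
        tags["endpoint_host"] = endpoint_host(ann["endpoint"])
        metrics.append(("zipkin", tags, dict(fields)))

    for ban in span.get("binaryAnnotations", []):
        value = ban["value"]
        tags = dict(base_tags)
        tags["service_name"] = ban["endpoint"]["serviceName"]
        # Assumption: non-string values are rendered as JSON text.
        tags["annotation"] = value if isinstance(value, str) else json.dumps(value)
        tags["endpoint_host"] = endpoint_host(ban["endpoint"])
        tags["annotation_key"] = ban["key"]
        metrics.append(("zipkin", tags, dict(fields)))

    return metrics
```

Applied to the example trace above, this yields four metrics: two for the "cs" and "cr" annotations and two for the "jdbc.query" and "sa" binary annotations.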