Introduction

Trace proxy metrics provide insight into the performance, behavior, and health of a system’s distributed tracing component. This document describes the trace metrics monitored and reported by our tracing system.

Trace Proxy Metrics

Labels

service_name, operation, app, instance, language: The name of the service, operation, app, instance, and language, respectively, that the metric belongs to.

transaction_type, transaction_category, transaction_sub_category: These labels describe the classification of spans based on span attributes.

infra_group: A Kubernetes 2.0 label. It is a combination of clusterName, namespaceName, and workloadName in the format “clusterName_namespaceName_workloadName”. Its value is “NO_GROUP” for non-Kubernetes 2.0 pipelines.

resourceName: Set to the first of the following source attributes that is present, checked in this order: “instance”, “resourceName”, “k8s.pod.name”, “host.name”.

apdex_threshold: The configured Apdex threshold value.

status_code: Status code of a span.

kind: The kind (type) of the span. It can be used to specify additional relationships between spans.
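
To illustrate how these labels appear on a metric, the following is a minimal Go sketch using the Prometheus client library that registers a histogram shaped like trace_operations_latency from the list below. The bucket boundaries and label names are taken from this document; the package layout, variable names, and label values in the observation are illustrative assumptions, not the trace proxy’s actual source.

    package main

    import (
        "github.com/prometheus/client_golang/prometheus"
    )

    // Illustrative only: a histogram shaped like trace_operations_latency,
    // with the label set described above and linear buckets 100..1000 µs.
    var traceOperationsLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "trace_operations_latency",
            Help:    "Span latency in microseconds (µs).",
            Buckets: prometheus.LinearBuckets(100, 100, 10), // {100, 200, ..., 1000}
        },
        []string{
            "service_name", "operation", "app", "instance",
            "transaction_type", "transaction_category", "transaction_sub_category",
            "language", "infra_group", "resourceName", "kind",
        },
    )

    func main() {
        prometheus.MustRegister(traceOperationsLatency)

        // Hypothetical observation: a 250 µs span on a Kubernetes 2.0 pipeline.
        traceOperationsLatency.WithLabelValues(
            "checkout", "GET /cart", "shop", "pod-1",
            "web", "http", "inbound",
            "go", "cluster_ns_workload", "pod-1", "SERVER",
        ).Observe(250)
    }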

Unless noted otherwise, every metric below carries the labels "service_name", "operation", "app", "instance", "transaction_type", "transaction_category", "transaction_sub_category", "language", "infra_group", "resourceName", and "kind".

trace_operations_latency (Histogram): Span latency in microseconds (µs). A Prometheus histogram with buckets {100, 200, ..., 1000}.
trace_root_operation_latency (Histogram): Root span latency in microseconds (µs). A Prometheus histogram with buckets {100, 200, ..., 1000}.
trace_operations_latency_ms (Gauge): Span latency in microseconds (µs).
trace_root_operation_latency_ms (Gauge): Root span latency in microseconds (µs).
trace_operations_failed (Counter): Number of error spans.
trace_operations_succeeded (Counter): Number of non-error spans.
trace_operations_total (Counter): Total number of spans.
trace_root_operations_failed (Counter): Number of root error spans.
trace_response_http_status (Counter): Number of HTTP spans with particular status codes. Additional labels: "method", "status_code".
trace_response_grpc_status (Counter): Number of gRPC spans with particular status codes. Additional label: "status_code".
trace_apdex_latency (Histogram): Similar to trace_operations_latency but uses custom bucket values based on the Apdex threshold (see the sketch after this list). Additional labels: "apdex_threshold", "error".
trace_root_span (Counter): Total number of root spans.
trace_spans_count (Counter): Total number of spans.
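
As a rough illustration of what “custom bucket values based on the Apdex threshold” could mean, the Go sketch below assumes the conventional Apdex bands (satisfied at or below the threshold T, tolerating at or below 4T) and the standard Apdex score formula. The helper names, multipliers, and formula are assumptions for illustration; the trace proxy’s actual bucket layout may differ.

    package apdex

    // apdexBuckets is a hypothetical helper showing Apdex-derived histogram
    // boundaries, assuming the conventional bands: satisfied (latency <= T)
    // and tolerating (latency <= 4T). Anything above 4T is frustrated.
    func apdexBuckets(thresholdMicros float64) []float64 {
        return []float64{
            thresholdMicros,     // satisfied: latency <= T
            4 * thresholdMicros, // tolerating: latency <= 4T
        }
    }

    // apdexScore computes the standard Apdex score from counts of satisfied,
    // tolerating, and total requests: (satisfied + tolerating/2) / total.
    func apdexScore(satisfied, tolerating, total float64) float64 {
        if total == 0 {
            return 0
        }
        return (satisfied + tolerating/2) / total
    }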

Trace Metrics

trace_duration_ms (Histogram): Processing time spent by a span in the trace proxy.
trace_send_dropped (Counter): Number of traces dropped by the sampler. In dry run mode, this remains 0, indicating that all traces are sent to OpsRamp.
trace_send_kept (Counter): Number of traces sent after applying the sampling rules. In dry run mode, this increments while trace_send_dropped remains 0.
trace_send_ejected_full (Counter): Number of traces sent because the cache exceeded its capacity.
trace_send_ejected_memsize (Counter): Number of traces that could not be kept in the cache; they are placed into a new cache and sent accordingly.
trace_send_expired (Counter): Number of traces sent because their timeout elapsed.
trace_send_got_root (Counter): Number of traces sent because they contain a root span.
trace_send_has_root (Counter): Count of spans that are root spans.
trace_send_no_root (Counter): Count of spans that are not root spans.
trace_sent_cache_hit (Counter): Incremented when the trace proxy receives a span for a trace that has already been sent; the earlier sampling decision determines whether the span is sent or dropped.

Collector Metrics

collector_cache_buffer_overrun (Counter): This value should remain zero; a positive value could indicate the need to grow the size of the collector’s circular buffer (the size of the circular buffer is set via the configuration field CacheCapacity). Note that if collector_cache_buffer_overrun is increasing, it does not necessarily mean that the cache is full: you may see this value increasing while collector_cache_entries remains low compared to collector_cache_capacity. This is due to the circular nature of the buffer, and can occur when traces stay unfinished for a long time under high-throughput traffic. Any time a trace persists for longer than it takes to accept as many traces as collector_cache_capacity (that is, to make a full circle around the ring), a cache buffer overrun is triggered; see the ring-buffer sketch after this list. Setting CacheCapacity therefore depends not only on trace throughput but also on trace duration (both of which are tracked via other metrics). When a cache buffer overrun is triggered, a trace has been sent to OpsRamp before it completed. Depending on your tracing strategy, this could result in an incorrect sampling decision for the trace: if all the fields that your sampling rules reference have already been received, the decision can still be correct; however, if some of those fields have not yet been received, the sampling decision could be incorrect.
collector_cache_capacity (Gauge): Equivalent to the value set in your configuration for CacheCapacity. Use this value in conjunction with collector_cache_entries to see how full the cache is over time.
collector_cache_entries (Histogram): Records avg, max, min, p50, p95, and p99 values, indicating how full the cache is over time.
collector_cache_size (Gauge): Length of the circular buffer of currently stored traces.
collector_incoming_queue (Histogram): Records avg, max, min, p50, p95, and p99 values, indicating how full the queue of spans is that were received from outside the trace proxy and need processing.
collector_peer_queue (Histogram): Records avg, max, min, p50, p95, and p99 values, indicating how full the queue of spans is that were received from other trace proxy peers and need processing.
collector_metrics_labels_series (Gauge): Number of series in each metric. metric_name is an additional label added to this metric.
collector_metrics_push_latency_ms: Time taken by the OpenTelemetry Collector to complete a metrics push request. Typically recorded in milliseconds, representing the duration from initiation to successful completion.
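
To make the overrun condition described for collector_cache_buffer_overrun concrete, here is a minimal Go sketch, not the trace proxy’s actual implementation: a fixed-size ring of in-flight traces in which a new arrival that wraps onto a slot still holding an unfinished trace ejects that trace (it is sent before completion) and increments an overrun counter, even while other slots are empty. All names are illustrative.

    package main

    import "fmt"

    // trace is a stand-in for an in-flight trace held by the collector cache.
    type trace struct{ id string }

    // ring is a minimal circular buffer sized like CacheCapacity.
    type ring struct {
        slots    []*trace
        next     int // index the next incoming trace will occupy
        overruns int // counterpart of collector_cache_buffer_overrun
    }

    func newRing(capacity int) *ring { return &ring{slots: make([]*trace, capacity)} }

    // add inserts a new trace. If the slot it wraps onto still holds an
    // unfinished trace, that trace is ejected (sent before completion) and
    // the overrun counter increments.
    func (r *ring) add(t *trace) (ejected *trace) {
        if old := r.slots[r.next]; old != nil {
            ejected = old
            r.overruns++
        }
        r.slots[r.next] = t
        r.next = (r.next + 1) % len(r.slots)
        return ejected
    }

    // finish removes a completed trace, freeing its slot.
    func (r *ring) finish(t *trace) {
        for i, s := range r.slots {
            if s == t {
                r.slots[i] = nil
                return
            }
        }
    }

    func main() {
        r := newRing(3) // CacheCapacity = 3 in this toy example
        longLived := &trace{id: "slow"}
        r.add(longLived) // stays unfinished
        r.add(&trace{id: "a"})
        r.add(&trace{id: "b"})
        // The buffer has made a full circle; the next add lands on "slow".
        if e := r.add(&trace{id: "c"}); e != nil {
            fmt.Println("overrun, ejected:", e.id, "total overruns:", r.overruns)
        }
    }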

Routing Metrics

incoming_router_batch: Incremented when the trace proxy’s batch event processing endpoint is hit by traffic from outside the trace proxy.
peer_router_batch: Incremented when the trace proxy’s batch event processing endpoint is hit by traffic from peer trace proxies.
incoming_router_dropped: Incremented when the trace proxy fails to add new spans to the incoming receive buffer while processing new events. Monitor this closely, as it indicates dropped spans.
peer_router_dropped: Incremented when the trace proxy fails to add new spans to the peer receive buffer while processing new events. Monitor this closely, as it indicates dropped spans.
incoming_router_event: Incremented when the trace proxy’s single event processing endpoint is hit by traffic from outside the trace proxy.
peer_router_event: Incremented when the trace proxy’s single event processing endpoint is hit by traffic from peer trace proxies.
incoming_router_nonspan: Incremented when the trace proxy accepts non-span events that are not part of a trace (incoming).
peer_router_nonspan: Incremented when the trace proxy accepts non-span events that are not part of a trace (from peers).
incoming_router_peer: Count of traces routed in from the traces generator (incoming).
peer_router_peer: Count of traces routed in from a peer traces generator.
incoming_router_proxied: Count of traces routed in from the traces generator (incoming) that reached the proxy.
peer_router_proxied: Count of traces routed in from a peer traces generator that reached the proxy.
incoming_router_span: Incremented when the trace proxy accepts events that are part of a trace (spans) from outside the trace proxy.
peer_router_span: Incremented when the trace proxy accepts events that are part of a trace (spans) from peer trace proxies.

Transmission Metrics

upstream_enqueue_errors: Count of spans that encountered errors while the event was being dispatched to OpsRamp.
peer_enqueue_errors: Count of spans that encountered errors while the event was being dispatched to a peer.
upstream_response_errors: Count of spans that received an error response, or a status code greater than 202, from upstream addresses.
peer_response_errors: Count of spans that received an error response, or a status code greater than 202, from peer addresses.
upstream_response_20x: Count of spans that received no error response and a status code less than 203 from upstream addresses.
peer_response_20x: Count of spans that received no error response and a status code less than 203 from peer addresses.

Sampling Metrics

dynsampler_num_dropped: Count of traces dropped due to dynamic sampling.
rulessampler_num_dropped: Count of traces dropped due to rules-based sampling.
dynsampler_num_kept: Count of traces not dropped due to dynamic sampling.
rulessampler_num_kept: Count of traces not dropped due to rules-based sampling.
dynsampler_sample_rate: Records avg, max, min, p50, p95, and p99 of the sample rate reported by the configured sampler.
rulessampler_sample_rate: Sample rate specified in the config section of the rules-based sampler.

Cuckoo Cache Metrics

This wraps a cuckoo filter implementation in a way that lets it run indefinitely without filling up. A cuckoo filter cannot be emptied (individual items can be deleted if you know what they are, but their names cannot be retrieved from the filter). Consequently, two filters are kept: current and future. The current filter is the one checked against, and additions go to both. The future filter is started after the current one, so that when the current filter gets too full, it can be discarded and replaced with the future filter, and a new, empty future filter is started. This is why the future filter is nil until the current filter reaches a load factor of 0.5.
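The following Go sketch illustrates the rotation scheme described above. It assumes a generic cuckoo-filter type with Insert, Lookup, and a load-factor accessor; the interface, type names, and the 0.9 “too full” threshold are illustrative assumptions, not the trace proxy’s actual code.

    package cuckoocache

    // filter is an assumed minimal interface over a cuckoo filter
    // implementation (insert/lookup plus a load-factor accessor).
    type filter interface {
        Insert(key []byte) bool
        Lookup(key []byte) bool
        LoadFactor() float64
    }

    // rotatingFilter keeps a "current" filter to check against and a "future"
    // filter that is started once current is half full, so the pair can run
    // forever without ever needing to empty a filter.
    type rotatingFilter struct {
        current, future filter
        newFilter       func() filter // constructor for a fresh, empty filter
    }

    func (r *rotatingFilter) Add(key []byte) {
        r.current.Insert(key)
        // The future filter stays nil until current reaches a 0.5 load factor.
        if r.future == nil && r.current.LoadFactor() >= 0.5 {
            r.future = r.newFilter()
        }
        if r.future != nil {
            r.future.Insert(key)
            // When current gets too full, discard it and promote future;
            // a new future will be started once the new current is half full.
            if r.current.LoadFactor() >= 0.9 { // illustrative threshold
                r.current, r.future = r.future, nil
            }
        }
    }

    func (r *rotatingFilter) Contains(key []byte) bool {
        return r.current.Lookup(key)
    }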

cuckoo_current_capacity: Capacity of the Cuckoo cache of dropped traces, as specified in the configuration section.
cuckoo_future_load_factor: Fraction of slots occupied in the future filter.
cuckoo_current_load_factor: Fraction of slots occupied in the current filter.