Introduction

Trace proxy metrics provide insight into the performance, behavior, and health of a system’s distributed tracing component. This document describes the trace metrics monitored and reported by our tracing system.

Trace Proxy Metrics

Labels

service_name, operation, app, instance, language: The name of the service, operation, app, instance, and language, respectively, that the metric belongs to.

transaction_type, transaction_category, transaction_sub_category: These labels describe the classification of spans based on span attributes.

infra_group: A Kubernetes 2.0 label. It is a combination of clusterName, namespaceName, and workloadName in the format “clusterName_namespaceName_workloadName”. Its value is “NO_GROUP” for non-Kubernetes 2.0 pipelines.

resourceName: Set to the first of the following source attributes that is present, checked in this order: “instance”, “resourceName”, “k8s.pod.name”, “host.name”.

apdex_threshold: The configured Apdex threshold value.

status_code: Status code of a span.

kind: The kind (type) of the span. It can be used to specify additional relationships between spans.
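
To illustrate how these labels appear on a metric, the following is a minimal Go sketch using the Prometheus client library that registers a histogram shaped like trace_operations_latency from the list below. The bucket boundaries and label names are taken from this document; the package layout, variable names, and label values in the observation are illustrative assumptions, not the trace proxy’s actual source.

    package main

    import (
        "github.com/prometheus/client_golang/prometheus"
    )

    // Illustrative only: a histogram shaped like trace_operations_latency,
    // with the label set described above and linear buckets 100..1000 µs.
    var traceOperationsLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "trace_operations_latency",
            Help:    "Span latency in microseconds (µs).",
            Buckets: prometheus.LinearBuckets(100, 100, 10), // {100, 200, ..., 1000}
        },
        []string{
            "service_name", "operation", "app", "instance",
            "transaction_type", "transaction_category", "transaction_sub_category",
            "language", "infra_group", "resourceName", "kind",
        },
    )

    func main() {
        prometheus.MustRegister(traceOperationsLatency)

        // Hypothetical observation: a 250 µs span on a Kubernetes 2.0 pipeline.
        traceOperationsLatency.WithLabelValues(
            "checkout", "GET /cart", "shop", "pod-1",
            "web", "http", "inbound",
            "go", "cluster_ns_workload", "pod-1", "SERVER",
        ).Observe(250)
    }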

Unless noted otherwise, every metric below carries the labels "service_name", "operation", "app", "instance", "transaction_type", "transaction_category", "transaction_sub_category", "language", "infra_group", "resourceName", and "kind".

trace_operations_latency (Histogram): Span latency in microseconds (µs). A Prometheus histogram with buckets {100, 200, ..., 1000}.
trace_root_operation_latency (Histogram): Root span latency in microseconds (µs). A Prometheus histogram with buckets {100, 200, ..., 1000}.
trace_operations_latency_ms (Gauge): Span latency in microseconds (µs).
trace_root_operation_latency_ms (Gauge): Root span latency in microseconds (µs).
trace_operations_failed (Counter): Number of error spans.
trace_operations_succeeded (Counter): Number of non-error spans.
trace_operations_total (Counter): Total number of spans.
trace_root_operations_failed (Counter): Number of root error spans.
trace_response_http_status (Counter): Number of HTTP spans with particular status codes. Additional labels: "method", "status_code".
trace_response_grpc_status (Counter): Number of gRPC spans with particular status codes. Additional label: "status_code".
trace_apdex_latency (Histogram): Similar to trace_operations_latency but uses custom bucket values based on the Apdex threshold (see the sketch after this list). Additional labels: "apdex_threshold", "error".
trace_root_span (Counter): Total number of root spans.
trace_spans_count (Counter): Total number of spans.
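
As a rough illustration of what “custom bucket values based on the Apdex threshold” could mean, the Go sketch below assumes the conventional Apdex bands (satisfied at or below the threshold T, tolerating at or below 4T) and the standard Apdex score formula. The helper names, multipliers, and formula are assumptions for illustration; the trace proxy’s actual bucket layout may differ.

    package apdex

    // apdexBuckets is a hypothetical helper showing Apdex-derived histogram
    // boundaries, assuming the conventional bands: satisfied (latency <= T)
    // and tolerating (latency <= 4T). Anything above 4T is frustrated.
    func apdexBuckets(thresholdMicros float64) []float64 {
        return []float64{
            thresholdMicros,     // satisfied: latency <= T
            4 * thresholdMicros, // tolerating: latency <= 4T
        }
    }

    // apdexScore computes the standard Apdex score from counts of satisfied,
    // tolerating, and total requests: (satisfied + tolerating/2) / total.
    func apdexScore(satisfied, tolerating, total float64) float64 {
        if total == 0 {
            return 0
        }
        return (satisfied + tolerating/2) / total
    }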

Trace Metrics

trace_duration_ms (Histogram): Processing time spent by a span in the trace proxy.
trace_send_dropped (Counter): Number of traces dropped by the sampler. In dry run mode, this remains 0, indicating that all traces are sent to OpsRamp.
trace_send_kept (Counter): Number of traces sent after applying the sampling rules. In dry run mode, this increments while trace_send_dropped remains 0.
trace_send_ejected_full (Counter): Number of traces sent because the cache exceeded its capacity.
trace_send_ejected_memsize (Counter): Number of traces that could not be kept in the cache; they are placed into a new cache and sent accordingly.
trace_send_expired (Counter): Number of traces sent because their timeout elapsed.
trace_send_got_root (Counter): Number of traces sent because they contain a root span.
trace_send_has_root (Counter): Count of spans that are root spans.
trace_send_no_root (Counter): Count of spans that are not root spans.
trace_sent_cache_hit (Counter): Incremented when the trace proxy receives a span for a trace that has already been sent; the earlier sampling decision determines whether the span is sent or dropped.

Collector Metrics

collector_cache_buffer_overrun (Counter): This value should remain zero; a positive value could indicate the need to grow the size of the collector’s circular buffer (the size of the circular buffer is set via the configuration field CacheCapacity). Note that if collector_cache_buffer_overrun is increasing, it does not necessarily mean that the cache is full: you may see this value increasing while collector_cache_entries remains low compared to collector_cache_capacity. This is due to the circular nature of the buffer, and can occur when traces stay unfinished for a long time under high-throughput traffic. Any time a trace persists for longer than it takes to accept as many traces as collector_cache_capacity (that is, to make a full circle around the ring), a cache buffer overrun is triggered; see the ring-buffer sketch after this list. Setting CacheCapacity therefore depends not only on trace throughput but also on trace duration (both of which are tracked via other metrics). When a cache buffer overrun is triggered, a trace has been sent to OpsRamp before it completed. Depending on your tracing strategy, this could result in an incorrect sampling decision for the trace: if all the fields that your sampling rules reference have already been received, the decision can still be correct; however, if some of those fields have not yet been received, the sampling decision could be incorrect.
collector_cache_capacity (Gauge): Equivalent to the value set in your configuration for CacheCapacity. Use this value in conjunction with collector_cache_entries to see how full the cache is over time.
collector_cache_entries (Histogram): Records avg, max, min, p50, p95, and p99 values, indicating how full the cache is over time.
collector_cache_size (Gauge): Length of the circular buffer of currently stored traces.
collector_incoming_queue (Histogram): Records avg, max, min, p50, p95, and p99 values, indicating how full the queue of spans is that were received from outside the trace proxy and need processing.
collector_peer_queue (Histogram): Records avg, max, min, p50, p95, and p99 values, indicating how full the queue of spans is that were received from other trace proxy peers and need processing.
collector_metrics_labels_series (Gauge): Number of series in each metric. metric_name is an additional label added to this metric.
collector_metrics_push_latency_ms: Time taken by the OpenTelemetry Collector to complete a metrics push request. Typically recorded in milliseconds, representing the duration from initiation to successful completion.
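
To make the overrun condition described for collector_cache_buffer_overrun concrete, here is a minimal Go sketch, not the trace proxy’s actual implementation: a fixed-size ring of in-flight traces in which a new arrival that wraps onto a slot still holding an unfinished trace ejects that trace (it is sent before completion) and increments an overrun counter, even while other slots are empty. All names are illustrative.

    package main

    import "fmt"

    // trace is a stand-in for an in-flight trace held by the collector cache.
    type trace struct{ id string }

    // ring is a minimal circular buffer sized like CacheCapacity.
    type ring struct {
        slots    []*trace
        next     int // index the next incoming trace will occupy
        overruns int // counterpart of collector_cache_buffer_overrun
    }

    func newRing(capacity int) *ring { return &ring{slots: make([]*trace, capacity)} }

    // add inserts a new trace. If the slot it wraps onto still holds an
    // unfinished trace, that trace is ejected (sent before completion) and
    // the overrun counter increments.
    func (r *ring) add(t *trace) (ejected *trace) {
        if old := r.slots[r.next]; old != nil {
            ejected = old
            r.overruns++
        }
        r.slots[r.next] = t
        r.next = (r.next + 1) % len(r.slots)
        return ejected
    }

    // finish removes a completed trace, freeing its slot.
    func (r *ring) finish(t *trace) {
        for i, s := range r.slots {
            if s == t {
                r.slots[i] = nil
                return
            }
        }
    }

    func main() {
        r := newRing(3) // CacheCapacity = 3 in this toy example
        longLived := &trace{id: "slow"}
        r.add(longLived) // stays unfinished
        r.add(&trace{id: "a"})
        r.add(&trace{id: "b"})
        // The buffer has made a full circle; the next add lands on "slow".
        if e := r.add(&trace{id: "c"}); e != nil {
            fmt.Println("overrun, ejected:", e.id, "total overruns:", r.overruns)
        }
    }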

Routing Metrics

incoming_router_batch: Incremented when the trace proxy’s batch event processing endpoint is hit by traffic from outside the trace proxy.
peer_router_batch: Incremented when the trace proxy’s batch event processing endpoint is hit by traffic from peer trace proxies.
incoming_router_dropped: Incremented when the trace proxy fails to add new spans to the incoming receive buffer while processing new events. Monitor this closely, as it indicates dropped spans.
peer_router_dropped: Incremented when the trace proxy fails to add new spans to the peer receive buffer while processing new events. Monitor this closely, as it indicates dropped spans.
incoming_router_event: Incremented when the trace proxy’s single event processing endpoint is hit by traffic from outside the trace proxy.
peer_router_event: Incremented when the trace proxy’s single event processing endpoint is hit by traffic from peer trace proxies.
incoming_router_nonspan: Incremented when the trace proxy accepts non-span events that are not part of a trace (incoming).
peer_router_nonspan: Incremented when the trace proxy accepts non-span events that are not part of a trace (from peers).
incoming_router_peer: Count of traces routed in from the traces generator (incoming).
peer_router_peer: Count of traces routed in from a peer traces generator.
incoming_router_proxied: Count of traces routed in from the traces generator (incoming) that reached the proxy.
peer_router_proxied: Count of traces routed in from a peer traces generator that reached the proxy.
incoming_router_span: Incremented when the trace proxy accepts events that are part of a trace (spans) from outside the trace proxy.
peer_router_span: Incremented when the trace proxy accepts events that are part of a trace (spans) from peer trace proxies.

Transmission Metrics

upstream_enqueue_errors: Count of spans that encountered errors while the event was being dispatched to OpsRamp.
peer_enqueue_errors: Count of spans that encountered errors while the event was being dispatched to a peer.
upstream_response_errors: Count of spans that received an error response, or a status code greater than 202, from upstream addresses.
peer_response_errors: Count of spans that received an error response, or a status code greater than 202, from peer addresses.
upstream_response_20x: Count of spans that received no error response and a status code less than 203 from upstream addresses.
peer_response_20x: Count of spans that received no error response and a status code less than 203 from peer addresses.

Sampling Metrics

dynsampler_num_dropped: Count of traces dropped due to dynamic sampling.
rulessampler_num_dropped: Count of traces dropped due to rules-based sampling.
dynsampler_num_kept: Count of traces not dropped due to dynamic sampling.
rulessampler_num_kept: Count of traces not dropped due to rules-based sampling.
dynsampler_sample_rate: Records avg, max, min, p50, p95, and p99 of the sample rate reported by the configured sampler.
rulessampler_sample_rate: Sample rate specified in the config section of the rules-based sampler.

Cuckoo Cache Metrics

This wraps a cuckoo filter implementation in a way that lets it run indefinitely without filling up. A cuckoo filter cannot be emptied (individual items can be deleted if you know what they are, but their names cannot be retrieved from the filter). Consequently, two filters are kept: current and future. The current filter is the one checked against, and additions go to both. The future filter is started after the current one, so that when the current filter gets too full, it can be discarded and replaced with the future filter, and a new, empty future filter is started. This is why the future filter is nil until the current filter reaches a load factor of 0.5.
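The following Go sketch illustrates the rotation scheme described above. It assumes a generic cuckoo-filter type with Insert, Lookup, and a load-factor accessor; the interface, type names, and the 0.9 “too full” threshold are illustrative assumptions, not the trace proxy’s actual code.

    package cuckoocache

    // filter is an assumed minimal interface over a cuckoo filter
    // implementation (insert/lookup plus a load-factor accessor).
    type filter interface {
        Insert(key []byte) bool
        Lookup(key []byte) bool
        LoadFactor() float64
    }

    // rotatingFilter keeps a "current" filter to check against and a "future"
    // filter that is started once current is half full, so the pair can run
    // forever without ever needing to empty a filter.
    type rotatingFilter struct {
        current, future filter
        newFilter       func() filter // constructor for a fresh, empty filter
    }

    func (r *rotatingFilter) Add(key []byte) {
        r.current.Insert(key)
        // The future filter stays nil until current reaches a 0.5 load factor.
        if r.future == nil && r.current.LoadFactor() >= 0.5 {
            r.future = r.newFilter()
        }
        if r.future != nil {
            r.future.Insert(key)
            // When current gets too full, discard it and promote future;
            // a new future will be started once the new current is half full.
            if r.current.LoadFactor() >= 0.9 { // illustrative threshold
                r.current, r.future = r.future, nil
            }
        }
    }

    func (r *rotatingFilter) Contains(key []byte) bool {
        return r.current.Lookup(key)
    }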

cuckoo_current_capacity: Capacity of the Cuckoo cache of dropped traces, as specified in the configuration section.
cuckoo_future_load_factor: Fraction of slots occupied in the future filter.
cuckoo_current_load_factor: Fraction of slots occupied in the current filter.