Excessive repetition

Medium risk

Logs that repeat at high volume during incidents or spikes. Error floods during an outage. Retry storms when a dependency fails. The same message, thousands of times, telling you one thing: something is broken.

Why it happens

A downstream service goes down. Every request to it fails. Every failure logs an error. You get 100,000 “Connection timeout” errors in an hour. They’re all real errors. But they’re all symptoms of one problem. The logging is correct. The volume is the issue.

Example

Before: an error flood during a database outage:

{"severity_text": "ERROR", "body": "Connection to postgres failed: timeout", "service.name": "order-service"}
{"severity_text": "ERROR", "body": "Connection to postgres failed: timeout", "service.name": "order-service"}
{"severity_text": "ERROR", "body": "Connection to postgres failed: timeout", "service.name": "order-service"}

100,000 identical errors. One underlying problem.

After: Tero generates a policy to sample high-volume, low-variance log events:

id: sample-postgres-timeout-order-service
name: Sample postgres timeout errors from order-service
description: Reduce error flood volume while preserving signal. Sampled by trace ID to keep correlated logs together.
log:
  match:
    - resource_attribute: service.name
      exact: order-service
    - log_field: body
      regex: "^Connection to postgres failed"
  keep: 1%

Sampling is by trace ID when possible. All logs from the same request stay together or drop together. You don’t lose context, just volume.
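To make the "stay together or drop together" behavior concrete, here is a minimal sketch of deterministic sampling keyed on the trace ID. It is not Tero's implementation. The function names, the `trace_id` record field, and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib


def keep_trace(trace_id: str, keep_ratio: float) -> bool:
    """Decide whether to keep every log that carries this trace ID.

    Hypothetical helper: the decision depends only on the trace ID, so all
    logs from one request share the same outcome.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace if its bucket falls below the requested fraction.
    return bucket < int(keep_ratio * 2**64)


def sample_logs(records: list[dict], keep_ratio: float = 0.01) -> list[dict]:
    """Filter records (dicts with an assumed "trace_id" key) by trace ID."""
    return [r for r in records if keep_trace(r["trace_id"], keep_ratio)]
```

Because the decision is a pure function of the trace ID, it needs no shared state: any edge node that sees a log from the same trace reaches the same verdict. Logs without a trace ID would need a different fallback, which this sketch leaves out.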

Recommended enforcement

Enforce at edge

Sample before data reaches your provider.

Set SLOs

Define acceptable repetition thresholds.

Unlike hot path logs, these aren’t code mistakes. The logging is correct. You just don’t need every instance. Sampling at the edge reduces volume while preserving signal.

How it works

Tero identifies excessive repetition by analyzing volume spikes and content variance in your context graph. A log event that suddenly spikes 100x with identical content is flagged.

The key is distinguishing repetition from legitimate high volume. 100,000 different users hitting an endpoint is high volume. 100,000 identical timeout errors is repetition. Tero looks at content variance to tell the difference.
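The distinction between repetition and legitimate high volume can be sketched with two numbers: how far volume has risen above a baseline, and how varied the content is. The code below is an illustration under stated assumptions, not Tero's detection logic. The function name, the use of a distinct-message ratio as a stand-in for "content variance", and both thresholds are hypothetical.

```python
def repetition_report(
    bodies: list[str],
    baseline_count: int,
    spike_factor: float = 100.0,
    max_distinct_ratio: float = 0.01,
) -> dict:
    """Flag a window of log bodies that is both spiking and nearly identical.

    Assumed inputs: `bodies` holds the log bodies seen in the current window,
    and `baseline_count` is the typical count for a window of the same length.
    """
    total = len(bodies)
    # Volume spike: current count relative to the baseline (at least 1).
    spike = total / max(baseline_count, 1)
    # Content variance proxy: share of distinct bodies among all bodies.
    distinct_ratio = len(set(bodies)) / total if total else 0.0
    flagged = (
        total > 0
        and spike >= spike_factor
        and distinct_ratio <= max_distinct_ratio
    )
    return {"spike": spike, "distinct_ratio": distinct_ratio, "flagged": flagged}
```

With these assumed thresholds, 100,000 identical timeout errors against a baseline of 1,000 are flagged, while 100,000 messages that each name a different user are not, even though the volume is the same. Real log bodies often embed IDs or timestamps, so a practical version would likely normalize those before comparing.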
