Jaeger-06: Jaeger V2 and a Hands-On Guide

This document covers the Jaeger V2 architecture and provides a complete hands-on guide to deployment and tuning.


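Part 1: Jaeger V2 Architecture in Detail

1. V2 Architecture Overview

Jaeger V2 is the next-generation architecture, rebuilt on top of the OpenTelemetry Collector. It packages Jaeger's core functionality as OTEL Collector extension components.

1.1 Key Changes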

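Feature | Jaeger V1 | Jaeger V2
Architecture | Separate binaries (collector, query, ingester, agent) | Extensions built on the OTEL Collector
Configuration | Command-line flags + environment variables | YAML configuration file
Deployment | Multiple separate binaries | A single jaeger binary
Protocols | Jaeger, Zipkin, OTLP (via receivers) | Any protocol the OTEL Collector supports
Extensibility | Limited (requires source changes) | Pluggable (OTEL extensions)
Agent | Separate jaeger-agent binary | No longer needed (SDKs send directly)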

1.2 Core Components (OTEL Extensions)

flowchart TB
    subgraph JaegerV2["Jaeger V2 (single binary)"]
        direction TB
        
        subgraph Receivers["Receivers"]
            OTLP["OTLP Receiver"]
            JAEGER["Jaeger Receiver"]
            ZIPKIN["Zipkin Receiver"]
        end
        
        subgraph Processors["Processors"]
            BATCH["Batch Processor"]
            ADAPTIV["Adaptive Sampling<br/>Processor"]
        end
        
        subgraph Extensions["Extensions"]
            JSTORAGE["jaeger_storage<br/>(storage management)"]
            JQUERY["jaeger_query<br/>(query service)"]
            RSAMP["remote_sampling<br/>(sampling strategies)"]
            HEALTH["healthcheckv2"]
        end
        
        subgraph Exporters["Exporters"]
            JEXPORTER["jaeger_storage_exporter<br/>(writes to storage)"]
        end
        
        Receivers --> Processors
        Processors --> Exporters
        Exporters --> JSTORAGE
        JQUERY --> JSTORAGE
        RSAMP --> JSTORAGE
    end
    
    SDK["Client SDK"] -->|"OTLP/Jaeger/Zipkin"| Receivers
    UI["Jaeger UI"] -->|"HTTP API"| JQUERY
    
    style JSTORAGE fill:#e8f5e9
    style JQUERY fill:#f3e5f5
    style RSAMP fill:#e1f5ff

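Component notes:

  1. jaeger_storage:

    • Manages multiple storage backend configurations (main, archive, etc.)
    • Supports Cassandra, Elasticsearch, OpenSearch, Badger, Memory, and others
    • Exposes a unified storage interface to the other components
  2. jaeger_query:

    • Serves the Jaeger Query API (V1-compatible)
    • Serves the Jaeger UI static assets
    • Supports querying archive storage
  3. remote_sampling:

    • Serves sampling strategies (file-based or adaptive)
    • Exposes gRPC and HTTP endpoints
    • Compatible with existing SDKs
  4. jaeger_storage_exporter:

    • Writes OTLP traces to jaeger_storage
    • Runs as an OTEL Collector exporter, not an extension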

2. V2 Configuration in Detail

2.1 Basic Configuration (All-in-One Mode)

# config.yaml
service:
  # Enabled extensions
  extensions: [jaeger_storage, jaeger_query, remote_sampling, healthcheckv2, pprof]
  
  # Data pipeline configuration
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors: [batch, adaptive_sampling]
      exporters: [jaeger_storage_exporter]
  
  # Telemetry configuration
  telemetry:
    resource:
      service.name: jaeger
    metrics:
      level: detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888
    logs:
      level: info

# Extension configuration
extensions:
  # Storage configuration
  jaeger_storage:
    backends:
      main_storage:
        memory:
          max_traces: 100000
  
  # Query service configuration
  jaeger_query:
    storage:
      traces: main_storage
    ui:
      config_file: ./config-ui.json
    grpc:
      endpoint: 0.0.0.0:16685
    http:
      endpoint: 0.0.0.0:16686
  
  # Sampling strategy configuration
  remote_sampling:
    adaptive:
      sampling_store: main_storage
      initial_sampling_probability: 0.1
    http:
      endpoint: 0.0.0.0:5778
    grpc:
      # 5779 avoids a clash with the Jaeger receiver's gRPC port (14250)
      endpoint: 0.0.0.0:5779
  
  # Health check
  healthcheckv2:
    use_v2: true
    http:
      endpoint: 0.0.0.0:13133
  
  # Profiling
  pprof:
    endpoint: 0.0.0.0:1777
# Receivers configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
      thrift_compact:
        endpoint: 0.0.0.0:6831
      thrift_binary:
        endpoint: 0.0.0.0:6832
  
  zipkin:
    endpoint: 0.0.0.0:9411

# Processors configuration
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  
  adaptive_sampling:
    # Must be used together with the remote_sampling extension

# Exporters configuration
exporters:
  jaeger_storage_exporter:
    trace_storage: main_storage

2.2 Production Configuration (Cassandra Storage)

service:
  extensions: [jaeger_storage, jaeger_query, remote_sampling]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]

extensions:
  jaeger_storage:
    backends:
      cassandra_main:
        cassandra:
          schema:
            keyspace: jaeger_v1_dc1
          connection:
            servers:
              - cassandra1:9042
              - cassandra2:9042
              - cassandra3:9042
            local_dc: dc1
            connections_per_host: 10
          write:
            timeout: 10s
            max_retries: 3
          read:
            timeout: 5s
            max_retries: 3
  
  jaeger_query:
    storage:
      traces: cassandra_main
    http:
      endpoint: 0.0.0.0:16686
  
  remote_sampling:
    file:
      path: ./sampling-strategies.json
    http:
      endpoint: 0.0.0.0:5778

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 8

processors:
  batch:
    timeout: 1s
    send_batch_size: 2048
    send_batch_max_size: 4096

exporters:
  jaeger_storage_exporter:
    trace_storage: cassandra_main

2.3 Kafka Buffering Mode

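# Collector side: spans are written to Kafka by the OTEL Collector kafka
# exporter rather than by a jaeger_storage backend. A second jaeger instance
# with a kafka receiver reads the topic and writes to storage.
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [kafka]

exporters:
  kafka:
    brokers:
      - kafka1:9092
      - kafka2:9092
    topic: jaeger-spans
    encoding: otlp_proto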

3. V2 Deployment Modes

3.1 All-in-One Mode (Testing/Development)

# Single-process deployment with all functionality included
docker run -p 16686:16686 -p 4317:4317 -p 4318:4318 \
  -v $(pwd)/config.yaml:/etc/jaeger/config.yaml \
  jaegertracing/jaeger:2.0.0 \
  --config /etc/jaeger/config.yaml

3.2 Split Deployment Mode (Production)

Collector configuration (collector-config.yaml):

service:
  extensions: [jaeger_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]

extensions:
  jaeger_storage:
    backends:
      main:
        cassandra:
          # Cassandra settings

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  jaeger_storage_exporter:
    trace_storage: main

Query configuration (query-config.yaml):

service:
  extensions: [jaeger_storage, jaeger_query]

extensions:
  jaeger_storage:
    backends:
      main:
        cassandra:
          # Cassandra settings (read-only)
  
  jaeger_query:
    storage:
      traces: main
    http:
      endpoint: 0.0.0.0:16686

Deployment:

# Deploy the Collector
docker run -p 4317:4317 \
  -v $(pwd)/collector-config.yaml:/etc/jaeger/config.yaml \
  jaegertracing/jaeger:2.0.0 \
  --config /etc/jaeger/config.yaml

# Deploy Query
docker run -p 16686:16686 \
  -v $(pwd)/query-config.yaml:/etc/jaeger/config.yaml \
  jaegertracing/jaeger:2.0.0 \
  --config /etc/jaeger/config.yaml

4. Migrating from V1 to V2

4.1 Configuration Mapping

V1 flag | V2 configuration
--collector.grpc-server.host-port=:14250 | receivers.jaeger.protocols.grpc.endpoint: "0.0.0.0:14250"
--collector.http-server.host-port=:14268 | receivers.jaeger.protocols.thrift_http.endpoint: "0.0.0.0:14268"
--collector.num-workers=50 | No direct equivalent; the closest setting is exporters.jaeger_storage_exporter.sending_queue.num_consumers: 50
--span-storage.type=cassandra | extensions.jaeger_storage.backends.main.cassandra
--query.max-clock-skew-adjust=1m | extensions.jaeger_query.max_clock_skew_adjust: "1m"

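4.2 Migration Steps

  1. Prepare the V2 configuration file: convert the V1 command-line flags into YAML
  2. Deploy in parallel: run V2 alongside V1 against the same storage backend
  3. Shift traffic: gradually move client traffic from V1 to V2
  4. Decommission V1: take V1 offline once V2 is confirmed stable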
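
Step 1 is mostly a lookup exercise, which the following sketch illustrates. It covers only the one-to-one mappings from section 4.1 and is not an official migration tool; `v1ToV2` and `translateFlag` are hypothetical names, and prefixing bare ":port" values with 0.0.0.0 is a choice made here to match the examples in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// v1ToV2 mirrors the one-to-one rows of the mapping table in section 4.1.
// It is illustrative and far from a complete list of V1 flags.
var v1ToV2 = map[string]string{
	"collector.grpc-server.host-port": "receivers.jaeger.protocols.grpc.endpoint",
	"collector.http-server.host-port": "receivers.jaeger.protocols.thrift_http.endpoint",
	"query.max-clock-skew-adjust":     "extensions.jaeger_query.max_clock_skew_adjust",
}

// translateFlag turns "--name=value" into a V2 config path and value.
// ok is false for malformed or unmapped flags.
func translateFlag(arg string) (path, value string, ok bool) {
	name, value, found := strings.Cut(strings.TrimPrefix(arg, "--"), "=")
	if !found {
		return "", "", false
	}
	path, ok = v1ToV2[name]
	if !ok {
		return "", "", false
	}
	// V1 accepts ":14250"; the V2 examples spell out the bind address.
	if strings.HasPrefix(value, ":") {
		value = "0.0.0.0" + value
	}
	return path, value, true
}

func main() {
	for _, arg := range []string{
		"--collector.grpc-server.host-port=:14250",
		"--query.max-clock-skew-adjust=1m",
		"--collector.num-workers=50",
	} {
		if path, value, ok := translateFlag(arg); ok {
			fmt.Printf("%s: %q\n", path, value)
		} else {
			fmt.Printf("# no direct mapping for %s\n", arg)
		}
	}
}
```

Flags without a direct counterpart are reported rather than guessed, since they usually need a manual decision about which V2 component takes over the behavior.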

Part 2: Practical Deployment Guide

1. Kubernetes Deployment

1.1 All-in-One Deployment

deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/jaeger:2.0.0
        args:
          - --config=/etc/jaeger/config.yaml
        ports:
        - containerPort: 16686  # UI
          name: ui
        - containerPort: 4317   # OTLP gRPC
          name: otlp-grpc
        - containerPort: 4318   # OTLP HTTP
          name: otlp-http
        - containerPort: 14250  # Jaeger gRPC
          name: jaeger-grpc
        - containerPort: 9411   # Zipkin
          name: zipkin
        - containerPort: 8888   # Prometheus metrics
          name: metrics
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        volumeMounts:
        - name: config
          mountPath: /etc/jaeger
      volumes:
      - name: config
        configMap:
          name: jaeger-config

---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: observability
  labels:
    app: jaeger  # matched by the ServiceMonitor selector
spec:
  type: ClusterIP
  selector:
    app: jaeger
  ports:
  - name: ui
    port: 16686
    targetPort: 16686
  - name: otlp-grpc
    port: 4317
    targetPort: 4317
  - name: otlp-http
    port: 4318
    targetPort: 4318
  - name: jaeger-grpc
    port: 14250
    targetPort: 14250
  - name: metrics  # scraped by Prometheus
    port: 8888
    targetPort: 8888

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: jaeger-config
  namespace: observability
data:
  config.yaml: |
    service:
      extensions: [jaeger_storage, jaeger_query, remote_sampling]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger_storage_exporter]
    
    extensions:
      jaeger_storage:
        backends:
          main:
            memory:
              max_traces: 100000
      jaeger_query:
        storage:
          traces: main
      remote_sampling:
        adaptive:
          sampling_store: main
    
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    
    processors:
      batch:
    
    exporters:
      jaeger_storage_exporter:
        trace_storage: main

Deploy:

kubectl apply -f deployment.yaml
kubectl port-forward -n observability svc/jaeger 16686:16686
# Open http://localhost:16686

1.2 Production-Grade Deployment (Separate Collector and Query)

collector-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-collector
  namespace: observability
spec:
  replicas: 3  # multiple replicas
  selector:
    matchLabels:
      app: jaeger-collector
  template:
    metadata:
      labels:
        app: jaeger-collector
    spec:
      containers:
      - name: collector
        image: jaegertracing/jaeger:2.0.0
        args:
          - --config=/etc/jaeger/collector-config.yaml
        ports:
        - containerPort: 4317
          name: otlp-grpc
        - containerPort: 4318
          name: otlp-http
        resources:
          requests:
            memory: "1Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "4000m"
        env:
        - name: CASSANDRA_SERVERS
          value: "cassandra-0.cassandra.observability.svc.cluster.local:9042,cassandra-1.cassandra.observability.svc.cluster.local:9042"
        volumeMounts:
        - name: config
          mountPath: /etc/jaeger
      volumes:
      - name: config
        configMap:
          name: jaeger-collector-config

---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-collector
  namespace: observability
spec:
  type: ClusterIP
  selector:
    app: jaeger-collector
  ports:
  - name: otlp-grpc
    port: 4317
    targetPort: 4317
  - name: otlp-http
    port: 4318
    targetPort: 4318

query-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger-query
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: jaeger-query
  template:
    metadata:
      labels:
        app: jaeger-query
    spec:
      containers:
      - name: query
        image: jaegertracing/jaeger:2.0.0
        args:
          - --config=/etc/jaeger/query-config.yaml
        ports:
        - containerPort: 16686
          name: ui
        - containerPort: 16685
          name: grpc
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        volumeMounts:
        - name: config
          mountPath: /etc/jaeger
      volumes:
      - name: config
        configMap:
          name: jaeger-query-config

---
apiVersion: v1
kind: Service
metadata:
  name: jaeger-query
  namespace: observability
spec:
  type: LoadBalancer  # or use an Ingress
  selector:
    app: jaeger-query
  ports:
  - name: ui
    port: 80
    targetPort: 16686

2. Performance Tuning

2.1 Collector Tuning

High-throughput scenario (> 10K spans/s):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16  # raise the message size limit

processors:
  batch:
    timeout: 200ms  # reduce batching latency
    send_batch_size: 2048  # increase the batch size
    send_batch_max_size: 4096

exporters:
  jaeger_storage_exporter:
    trace_storage: main
    sending_queue:
      enabled: true
      num_consumers: 50  # increase the number of consumers
      queue_size: 10000  # increase the queue size
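
Before raising queue_size, it helps to estimate what a full queue represents. The sketch below is back-of-the-envelope arithmetic under two assumptions that are not from the configuration itself: queue_size counts batches rather than individual spans, and the average span occupies about 1 KiB in memory. The function names are hypothetical.

```go
package main

import "fmt"

// bufferSeconds estimates how long a full sending queue could absorb traffic
// while the storage backend is unavailable.
// Assumption: queue_size counts batches and every batch holds batchSize spans.
func bufferSeconds(queueSize, batchSize int64, spansPerSecond float64) float64 {
	if spansPerSecond <= 0 {
		return 0
	}
	return float64(queueSize*batchSize) / spansPerSecond
}

// queueBytes estimates the memory held by a full queue.
// Assumption: avgSpanBytes is a guess and must be measured for real traffic.
func queueBytes(queueSize, batchSize, avgSpanBytes int64) int64 {
	return queueSize * batchSize * avgSpanBytes
}

func main() {
	// Values from the tuning example: queue_size 10000, send_batch_size 2048,
	// 10K spans/s of incoming traffic, and an assumed 1 KiB per span.
	fmt.Printf("buffer: %.0f s\n", bufferSeconds(10000, 2048, 10000))
	gib := float64(queueBytes(10000, 2048, 1024)) / float64(1<<30)
	fmt.Printf("memory when full: %.1f GiB\n", gib)
}
```

Under these assumptions a full queue holds roughly 34 minutes of traffic but would need about 19.5 GiB, well above the 8Gi memory limit used in this section, so queue_size and the memory limit should be sized together.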

Resource configuration:

resources:
  requests:
    memory: "2Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "8000m"

2.2 Storage Tuning

Cassandra:

jaeger_storage:
  backends:
    main:
      cassandra:
        connection:
          connections_per_host: 20  # increase the number of connections
        write:
          timeout: 30s
          max_retries: 5
          consistency: LOCAL_ONE  # relax the consistency requirement

Elasticsearch:

jaeger_storage:
  backends:
    main:
      elasticsearch:
        connection:
          servers:
            - http://es1:9200
            - http://es2:9200
          num_shards: 10  # increase the number of shards
          num_replicas: 2
        bulk:
          size: 10000000  # 10MB
          workers: 10

3. Monitoring and Alerting

3.1 Prometheus Metrics

ServiceMonitor (Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: jaeger
  namespace: observability
spec:
  selector:
    matchLabels:
      app: jaeger
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

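Key metrics:

# Collector receive rate
rate(otelcol_receiver_accepted_spans[5m])

# P95 batch size (spans per batch sent by the batch processor)
histogram_quantile(0.95, sum by (le) (rate(otelcol_processor_batch_batch_send_size_bucket[5m])))

# Storage write failure ratio (failed spans as a share of all attempted spans)
rate(otelcol_exporter_send_failed_spans[5m]) / (rate(otelcol_exporter_sent_spans[5m]) + rate(otelcol_exporter_send_failed_spans[5m]))

# P95 Query latency
histogram_quantile(0.95, sum by (le) (rate(jaeger_query_requests_duration_seconds_bucket[5m])))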
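
To make the rate and failure-ratio arithmetic concrete, the sketch below reproduces it on raw counter samples. It is a simplified approximation of what PromQL's rate() does (no extrapolation, a single reset check), and the function names and sample values are hypothetical.

```go
package main

import "fmt"

// perSecond approximates rate() from two samples of a monotonically
// increasing counter taken `seconds` apart.
func perSecond(prev, curr, seconds float64) float64 {
	if seconds <= 0 {
		return 0
	}
	if curr < prev {
		// Counter reset (process restart): count only what accrued since.
		return curr / seconds
	}
	return (curr - prev) / seconds
}

// failureRatio returns failed spans as a share of all attempted spans.
// Sent and failed are separate counters, so the denominator is their sum.
func failureRatio(sentRate, failedRate float64) float64 {
	total := sentRate + failedRate
	if total == 0 {
		return 0
	}
	return failedRate / total
}

func main() {
	// Hypothetical samples taken 300 s (5m) apart.
	sent := perSecond(100000, 122500, 300)
	failed := perSecond(500, 8000, 300)
	fmt.Printf("sent: %.0f spans/s, failed: %.0f spans/s\n", sent, failed)
	fmt.Printf("failure ratio: %.2f\n", failureRatio(sent, failed))
}
```

With these samples the exporter sends 75 spans/s and fails 25 spans/s, a failure ratio of 0.25. Dividing failed by sent alone would report 0.33 and overstate the problem.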

3.2 Alerting Rules

groups:
  - name: jaeger
    rules:
      - alert: JaegerHighDropRate
        expr: rate(otelcol_receiver_refused_spans[5m]) > 100
        for: 5m
        annotations:
          summary: "Jaeger dropping spans: {{ $value }} spans/sec"
      
      - alert: JaegerStorageWriteErrors
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 10
        for: 5m
        annotations:
          summary: "Jaeger storage write errors: {{ $value }} errors/sec"
      
      - alert: JaegerQueryHighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(jaeger_query_requests_duration_seconds_bucket[5m]))) > 2
        for: 5m
        annotations:
          summary: "Jaeger Query P95 latency > 2s"

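4. Troubleshooting

4.1 Common Problems

Problem 1: OTLP gRPC connection failure

Diagnosis:

# Check whether the port is listening
kubectl exec -n observability -it jaeger-xxx -- netstat -tuln | grep 4317

# Check firewall rules
kubectl exec -n observability -it jaeger-xxx -- iptables -L -n

# View the logs
kubectl logs -n observability -f jaeger-xxx | grep error

Resolution:

  • Confirm the receiver configuration is correct
  • Check the Kubernetes Service and Endpoints
  • Check network policies (NetworkPolicy)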
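
When the container image lacks tools such as netstat, a plain TCP dial from a client Pod answers the same question. The following is a minimal sketch; `canConnect` is a hypothetical helper, and it only proves that the port accepts TCP connections, not that the OTLP gRPC handshake succeeds.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// canConnect reports whether a TCP connection to addr succeeds within timeout.
func canConnect(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// Demo against a throwaway local listener so the program runs anywhere.
	// In a cluster, pass the collector address instead, for example
	// "jaeger-collector.observability.svc.cluster.local:4317".
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()
	fmt.Println("while listening:", canConnect(addr, time.Second))
	ln.Close()
	fmt.Println("after close:", canConnect(addr, time.Second))
}
```

A failed dial from inside the cluster points at the Service, Endpoints, or a NetworkPolicy; a successful dial with failing exports points at the receiver configuration or TLS settings instead.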

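Problem 2: UI is unreachable

Diagnosis:

# Check the Query service status
kubectl get svc -n observability jaeger-query

# Test with port forwarding (the Service exposes port 80, which targets 16686)
kubectl port-forward -n observability svc/jaeger-query 16686:80

Resolution:

  • Confirm the jaeger_query extension is enabled
  • Check the UI configuration file path
  • Review the Query logs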
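
Once port forwarding is active, a quick way to tell a UI asset problem from a query or storage problem is to call the JSON endpoint the UI itself uses. The sketch below assumes Query is reachable at http://localhost:16686 and that /api/services returns an object with a "data" array of service names. This endpoint is the UI's internal API rather than a formally versioned one, and the function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// servicesResponse models only the field of the response used here.
type servicesResponse struct {
	Data []string `json:"data"`
}

// parseServices extracts the service names from an /api/services response.
func parseServices(body []byte) ([]string, error) {
	var r servicesResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, fmt.Errorf("decode /api/services response: %w", err)
	}
	return r.Data, nil
}

// fetchServices calls the Query HTTP API and returns the known services.
func fetchServices(baseURL string) ([]string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/api/services")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseServices(body)
}

func main() {
	// Assumes the port-forward from the diagnosis step is running.
	services, err := fetchServices("http://localhost:16686")
	if err != nil {
		fmt.Println("query check failed:", err)
		return
	}
	fmt.Println("services:", services)
}
```

An empty list with a 200 status suggests Query and storage are reachable but no spans have been written yet, which shifts attention to the Collector.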

Part 3: Practical Examples

1. Microservice Tracing Integration

1.1 Go Service Integration

Install dependencies:

go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/sdk/trace

Code example:

package main

import (
    "context"
    "log"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    // Create the OTLP gRPC exporter
    exporter, err := otlptracegrpc.New(
        context.Background(),
        otlptracegrpc.WithEndpoint("jaeger-collector.observability.svc.cluster.local:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    // Create the Resource (service metadata)
    res, err := resource.New(
        context.Background(),
        resource.WithAttributes(
            semconv.ServiceName("my-service"),
            semconv.ServiceVersion("1.0.0"),
        ),
    )
    if err != nil {
        return nil, err
    }

    // Create the TracerProvider
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.AlwaysSample()),  // use remote sampling in production
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func main() {
    tp, err := initTracer()
    if err != nil {
        log.Fatal(err)
    }
    defer tp.Shutdown(context.Background())

    // Create a tracer
    tracer := otel.Tracer("my-service")

    // Create a span
    ctx, span := tracer.Start(context.Background(), "main-operation")
    defer span.End()

    // Business logic
    doWork(ctx)
}

func doWork(ctx context.Context) {
    tracer := otel.Tracer("my-service")
    _, span := tracer.Start(ctx, "do-work")
    defer span.End()

    // Add attributes
    span.SetAttributes(
        attribute.String("user.id", "12345"),
        attribute.Int("items.count", 10),
    )

    // Simulate work
    time.Sleep(100 * time.Millisecond)
}

1.2 Java Service Integration (Spring Boot)

Dependencies (pom.xml):

<dependencies>
    <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-api</artifactId>
        <version>1.31.0</version>
    </dependency>
    <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-sdk</artifactId>
        <version>1.31.0</version>
    </dependency>
    <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-exporter-otlp</artifactId>
        <version>1.31.0</version>
    </dependency>
    <dependency>
        <groupId>io.opentelemetry.instrumentation</groupId>
        <artifactId>opentelemetry-spring-boot-starter</artifactId>
        <version>1.31.0-alpha</version>
    </dependency>
</dependencies>

Configuration (application.yml):

otel:
  service:
    name: my-java-service
  exporter:
    otlp:
      endpoint: http://jaeger-collector:4317
  traces:
    sampler:
      type: always_on

2. Complete Deployment Example (Production)

Architecture diagram:

┌─────────────────────────────────────────────────────────┐
│                      Kubernetes Cluster                  │
│                                                          │
│  ┌──────────────┐   ┌──────────────┐   ┌─────────────┐│
│  │ Microservice │   │ Microservice │   │Microservice ││
│  │      A       │   │      B       │   │     C       ││
│  └──────┬───────┘   └──────┬───────┘   └──────┬──────┘│
│         │ OTLP gRPC        │                  │       │
│         └──────────────────┴──────────────────┘       │
│                            │                           │
│                            ▼                           │
│               ┌─────────────────────────┐              │
│               │  Jaeger Collector (x3)  │              │
│               │  (HPA: 3-10 replicas)   │              │
│               └───────────┬─────────────┘              │
│                           │                            │
│                           ▼                            │
│               ┌─────────────────────────┐              │
│               │   Cassandra Cluster     │              │
│               │   (3 nodes, RF=3)       │              │
│               └───────────┬─────────────┘              │
│                           │                            │
│         ┌─────────────────┴─────────────────┐         │
│         │                                     │         │
│         ▼                                     ▼         │
│  ┌──────────────┐                   ┌──────────────┐  │
│  │Jaeger Query  │                   │   Grafana    │  │
│  │    (x2)      │                   │ (Dashboards) │  │
│  └──────┬───────┘                   └──────────────┘  │
│         │                                              │
│         ▼                                              │
│  ┌──────────────┐                                     │
│  │  Jaeger UI   │                                     │
│  │  (Ingress)   │                                     │
│  └──────────────┘                                     │
└─────────────────────────────────────────────────────────┘

Deployment checklist:

  1. Cassandra StatefulSet (omitted; use a Cassandra Operator)
  2. Jaeger Collector Deployment (see above)
  3. Jaeger Query Deployment (see above)
  4. HPA autoscaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: jaeger-collector-hpa
  namespace: observability
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: jaeger-collector
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods  # requires a custom metrics adapter (e.g. prometheus-adapter) that exposes this metric
    pods:
      metric:
        name: otelcol_receiver_accepted_spans_rate
      target:
        type: AverageValue
        averageValue: "10000"  # each Pod handles 10K spans/s
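
How the autoscaler turns the 10K spans/s target into a replica count can be sketched with the core formula from the Kubernetes HPA documentation, desired = ceil(current × currentValue / targetValue), clamped to the configured bounds. The sketch below is a simplification: the real controller also applies a tolerance band, stabilization windows, and takes the highest result across all metrics. The function name and the sample loads are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the basic HPA scaling formula for one metric and
// clamps the result to [minReplicas, maxReplicas].
func desiredReplicas(current int, currentValue, targetValue float64, minReplicas, maxReplicas int) int {
	if targetValue <= 0 {
		return current
	}
	d := int(math.Ceil(float64(current) * currentValue / targetValue))
	if d < minReplicas {
		d = minReplicas
	}
	if d > maxReplicas {
		d = maxReplicas
	}
	return d
}

func main() {
	// 3 Pods averaging 15K spans/s against the 10K target: scale out.
	fmt.Println(desiredReplicas(3, 15000, 10000, 3, 10)) // prints 5
	// Heavy overload is capped at maxReplicas.
	fmt.Println(desiredReplicas(3, 50000, 10000, 3, 10)) // prints 10
	// Light load never drops below minReplicas.
	fmt.Println(desiredReplicas(5, 2000, 10000, 3, 10)) // prints 3
}
```

Because the cap is 10 replicas, sustained load above roughly 100K spans/s cannot be absorbed by scaling alone and would need a higher maxReplicas or more capacity per Pod.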

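Summary

This document covered the Jaeger V2 architecture together with a complete hands-on deployment guide.

Core advantages of Jaeger V2:

  1. Single binary: simplifies deployment and management
  2. YAML configuration: declarative and easy to keep under version control
  3. Pluggable extensions: works with the OTEL Collector ecosystem
  4. Backward compatibility: supports the protocols and APIs of V1

Key points for production deployments:

  1. High availability: run multiple replicas of both Collector and Query
  2. Storage choice: Cassandra or Elasticsearch is recommended for production
  3. Monitoring and alerting: integrate with Prometheus and alert on the key metrics
  4. Performance tuning: adjust batching parameters and resource settings to match traffic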