Jaeger-06: Jaeger V2 and Hands-On Guide
This document covers the Jaeger V2 architecture together with a complete hands-on guide to deployment and tuning.
Part 1: Jaeger V2 Architecture in Detail
1. V2 Architecture Overview
Jaeger V2 is the next-generation architecture, reimplemented on top of the OpenTelemetry Collector: Jaeger's core functionality is packaged as OTEL Collector extension components.
1.1 Key Changes
| Aspect | Jaeger V1 | Jaeger V2 |
|---|---|---|
| Architecture | Standalone binaries (collector, query, ingester, agent) | Extensions built on the OTEL Collector |
| Configuration | CLI flags + environment variables | YAML configuration file |
| Deployment | Multiple standalone binaries | A single jaeger binary |
| Protocols | Jaeger, Zipkin, OTLP (via receivers) | Every protocol the OTEL Collector supports |
| Extensibility | Limited (requires source changes) | Pluggable (OTEL extensions) |
| Agent | Standalone jaeger-agent binary | No longer needed (SDKs send directly) |
1.2 Core Components (OTEL Extensions)
flowchart TB
  subgraph JaegerV2["Jaeger V2 (single binary)"]
direction TB
subgraph Receivers["Receivers"]
OTLP["OTLP Receiver"]
JAEGER["Jaeger Receiver"]
ZIPKIN["Zipkin Receiver"]
end
subgraph Processors["Processors"]
BATCH["Batch Processor"]
ADAPTIV["Adaptive Sampling<br/>Processor"]
end
subgraph Extensions["Extensions"]
      JSTORAGE["jaeger_storage<br/>(storage management)"]
      JQUERY["jaeger_query<br/>(query service)"]
      RSAMP["remote_sampling<br/>(sampling strategies)"]
HEALTH["healthcheckv2"]
end
subgraph Exporters["Exporters"]
      JEXPORTER["jaeger_storage_exporter<br/>(writes to storage)"]
end
Receivers --> Processors
Processors --> Exporters
Exporters --> JSTORAGE
JQUERY --> JSTORAGE
RSAMP --> JSTORAGE
end
SDK["Client SDK"] -->|"OTLP/Jaeger/Zipkin"| Receivers
UI["Jaeger UI"] -->|"HTTP API"| JQUERY
style JSTORAGE fill:#e8f5e9
style JQUERY fill:#f3e5f5
style RSAMP fill:#e1f5ff
Extension descriptions:
- jaeger_storage:
  - Manages multiple storage backend configurations (main, archive, etc.)
  - Supports Cassandra, Elasticsearch, OpenSearch, Badger, Memory, and more
  - Exposes a unified storage interface to the other components
- jaeger_query:
  - Serves the Jaeger Query API (V1-compatible)
  - Serves the Jaeger UI static assets
  - Supports queries against archive storage
- remote_sampling:
  - Serves sampling strategies (file-based or adaptive)
  - Exposes both gRPC and HTTP endpoints
  - Compatible with existing SDKs
- jaeger_storage_exporter:
  - Writes OTLP traces into a jaeger_storage backend
  - Acts as an OTEL Collector exporter
2. V2 Configuration in Detail
2.1 Basic Configuration (All-in-One Mode)
# config.yaml
service:
  # Extensions to enable
extensions: [jaeger_storage, jaeger_query, remote_sampling, healthcheckv2, pprof]
  # Data pipeline configuration
pipelines:
traces:
receivers: [otlp, jaeger, zipkin]
processors: [batch, adaptive_sampling]
exporters: [jaeger_storage_exporter]
  # Telemetry configuration
telemetry:
resource:
service.name: jaeger
metrics:
level: detailed
readers:
- pull:
exporter:
prometheus:
host: 0.0.0.0
port: 8888
logs:
level: info
# Extension configuration
extensions:
  # Storage configuration
jaeger_storage:
backends:
main_storage:
memory:
max_traces: 100000
  # Query service configuration
jaeger_query:
storage:
traces: main_storage
ui:
config_file: ./config-ui.json
grpc:
host_port: 0.0.0.0:16685
http:
host_port: 0.0.0.0:16686
  # Sampling strategy configuration
remote_sampling:
adaptive:
sampling_store: main_storage
initial_sampling_probability: 0.1
http:
host_port: 0.0.0.0:5778
    grpc:
      # NOTE: this clashes with the jaeger receiver's gRPC endpoint below
      # (also 14250); move one of them if both are enabled
      host_port: 0.0.0.0:14250
  # Health check
healthcheckv2:
use_v2: true
http:
endpoint: 0.0.0.0:13133
  # Profiling (pprof)
pprof:
endpoint: 0.0.0.0:1777
# Receiver configuration
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
jaeger:
protocols:
grpc:
endpoint: 0.0.0.0:14250
thrift_http:
endpoint: 0.0.0.0:14268
thrift_compact:
endpoint: 0.0.0.0:6831
thrift_binary:
endpoint: 0.0.0.0:6832
zipkin:
endpoint: 0.0.0.0:9411
# Processor configuration
processors:
batch:
timeout: 1s
send_batch_size: 1024
adaptive_sampling:
    # Requires the remote_sampling extension
# Exporter configuration
exporters:
jaeger_storage_exporter:
trace_storage: main_storage
2.2 Production Configuration (Cassandra Storage)
service:
extensions: [jaeger_storage, jaeger_query, remote_sampling]
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [jaeger_storage_exporter]
extensions:
jaeger_storage:
backends:
cassandra_main:
cassandra:
schema:
keyspace: jaeger_v1_dc1
connection:
servers:
- cassandra1:9042
- cassandra2:9042
- cassandra3:9042
local_dc: dc1
connections_per_host: 10
write:
timeout: 10s
max_retries: 3
read:
timeout: 5s
max_retries: 3
jaeger_query:
storage:
traces: cassandra_main
http:
host_port: 0.0.0.0:16686
remote_sampling:
file:
path: ./sampling-strategies.json
http:
host_port: 0.0.0.0:5778
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
max_recv_msg_size_mib: 8
processors:
batch:
timeout: 1s
send_batch_size: 2048
send_batch_max_size: 4096
exporters:
jaeger_storage_exporter:
trace_storage: cassandra_main
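The production configuration above points remote_sampling at ./sampling-strategies.json. A minimal sketch of that file in Jaeger's strategies format follows; the service names and rates are illustrative, not defaults:

{
  "default_strategy": { "type": "probabilistic", "param": 0.1 },
  "service_strategies": [
    { "service": "checkout", "type": "probabilistic", "param": 0.5 },
    { "service": "search", "type": "ratelimiting", "param": 100 }
  ]
}

A probabilistic strategy samples the given fraction of traces, while ratelimiting caps the number of traces per second for that service.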
2.3 Kafka Buffering Mode
extensions:
jaeger_storage:
backends:
kafka_buffer:
kafka:
producer:
brokers:
- kafka1:9092
- kafka2:9092
topic: jaeger-spans
encoding: protobuf
exporters:
jaeger_storage_exporter:
trace_storage: kafka_buffer
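The snippet above covers only the producing (collector) side. A second jaeger instance, playing the role of the V1 ingester, has to consume the topic and write to the real backend. A rough sketch, assuming the upstream OTEL Collector kafka receiver (field names may differ across Jaeger 2.x releases, so verify against your version):

# ingester-config.yaml (sketch only)
service:
  extensions: [jaeger_storage]
  pipelines:
    traces:
      receivers: [kafka]
      processors: [batch]
      exporters: [jaeger_storage_exporter]
receivers:
  kafka:
    brokers:
      - kafka1:9092
      - kafka2:9092
    topic: jaeger-spans
processors:
  batch:
extensions:
  jaeger_storage:
    backends:
      main:
        cassandra:
          # Cassandra connection settings
exporters:
  jaeger_storage_exporter:
    trace_storage: main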
3. V2 Deployment Modes
3.1 All-in-One Mode (Testing/Development)
# Single-process deployment that bundles every role
docker run -p 16686:16686 -p 4317:4317 -p 4318:4318 \
-v $(pwd)/config.yaml:/etc/jaeger/config.yaml \
jaegertracing/jaeger:2.0.0 \
--config /etc/jaeger/config.yaml
3.2 Split Deployment Mode (Production)
Collector configuration (collector-config.yaml):
service:
extensions: [jaeger_storage]
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [jaeger_storage_exporter]
extensions:
jaeger_storage:
backends:
main:
cassandra:
          # Cassandra connection settings
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
exporters:
jaeger_storage_exporter:
trace_storage: main
Query configuration (query-config.yaml):
service:
extensions: [jaeger_storage, jaeger_query]
extensions:
jaeger_storage:
backends:
main:
cassandra:
          # Cassandra connection settings (read-only)
jaeger_query:
storage:
traces: main
http:
host_port: 0.0.0.0:16686
Deploy:
# Deploy the collector
docker run -p 4317:4317 \
-v $(pwd)/collector-config.yaml:/etc/jaeger/config.yaml \
jaegertracing/jaeger:2.0.0 \
--config /etc/jaeger/config.yaml
# Deploy the query service
docker run -p 16686:16686 \
-v $(pwd)/query-config.yaml:/etc/jaeger/config.yaml \
jaegertracing/jaeger:2.0.0 \
--config /etc/jaeger/config.yaml
4. Migrating from V1 to V2
4.1 Configuration Mapping
| V1 flag | V2 configuration |
|---|---|
| `--collector.grpc-server.host-port=:14250` | `receivers.jaeger.protocols.grpc.endpoint: "0.0.0.0:14250"` |
| `--collector.http-server.host-port=:14268` | `receivers.jaeger.protocols.thrift_http.endpoint: "0.0.0.0:14268"` |
| `--collector.num-workers=50` | `exporters.jaeger_storage_exporter.sending_queue.num_consumers: 50` |
| `--span-storage.type=cassandra` | `extensions.jaeger_storage.backends.main.cassandra` |
| `--query.max-clock-skew-adjust=1m` | `extensions.jaeger_query.max_clock_skew_adjust: "1m"` |
4.2 Migration Steps
- Prepare the V2 configuration file: convert the V1 CLI flags to YAML (a worked example follows this list)
- Deploy in parallel: run V2 against the same storage backend as V1
- Shift traffic: gradually move client traffic from V1 to V2
- Decommission V1: shut V1 down once V2 is confirmed stable
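As a worked example for step 1, a V1 collector started with --span-storage.type=cassandra and --cassandra.servers=cassandra1 translates into a V2 snippet along these lines (the shape follows the production configuration in Part 1; verify the key names against your Jaeger 2.x release):

extensions:
  jaeger_storage:
    backends:
      main:
        cassandra:
          connection:
            servers:
              - cassandra1:9042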
Part 2: Hands-On Deployment Guide
1. Kubernetes Deployment
1.1 All-in-One Deployment
deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: observability
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/jaeger:2.0.0
args:
- --config=/etc/jaeger/config.yaml
ports:
- containerPort: 16686 # UI
name: ui
- containerPort: 4317 # OTLP gRPC
name: otlp-grpc
- containerPort: 4318 # OTLP HTTP
name: otlp-http
- containerPort: 14250 # Jaeger gRPC
name: jaeger-grpc
- containerPort: 9411 # Zipkin
name: zipkin
- containerPort: 8888 # Prometheus metrics
name: metrics
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2000m"
volumeMounts:
- name: config
mountPath: /etc/jaeger
volumes:
- name: config
configMap:
name: jaeger-config
---
apiVersion: v1
kind: Service
metadata:
name: jaeger
namespace: observability
spec:
type: ClusterIP
selector:
app: jaeger
ports:
- name: ui
port: 16686
targetPort: 16686
- name: otlp-grpc
port: 4317
targetPort: 4317
- name: otlp-http
port: 4318
targetPort: 4318
- name: jaeger-grpc
port: 14250
targetPort: 14250
---
apiVersion: v1
kind: ConfigMap
metadata:
name: jaeger-config
namespace: observability
data:
config.yaml: |
service:
extensions: [jaeger_storage, jaeger_query, remote_sampling]
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [jaeger_storage_exporter]
extensions:
jaeger_storage:
backends:
main:
memory:
max_traces: 100000
jaeger_query:
storage:
traces: main
remote_sampling:
adaptive:
sampling_store: main
receivers:
otlp:
protocols:
grpc:
http:
processors:
batch:
exporters:
jaeger_storage_exporter:
trace_storage: main
Deploy:
kubectl apply -f deployment.yaml
kubectl port-forward -n observability svc/jaeger 16686:16686
# Open http://localhost:16686
1.2 Production-Grade Deployment (Separate Collector and Query)
collector-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-collector
namespace: observability
spec:
  replicas: 3 # multiple replicas for high availability
selector:
matchLabels:
app: jaeger-collector
template:
metadata:
labels:
app: jaeger-collector
spec:
containers:
- name: collector
image: jaegertracing/jaeger:2.0.0
args:
- --config=/etc/jaeger/collector-config.yaml
ports:
- containerPort: 4317
name: otlp-grpc
- containerPort: 4318
name: otlp-http
resources:
requests:
memory: "1Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "4000m"
env:
- name: CASSANDRA_SERVERS
value: "cassandra-0.cassandra.observability.svc.cluster.local:9042,cassandra-1.cassandra.observability.svc.cluster.local:9042"
volumeMounts:
- name: config
mountPath: /etc/jaeger
volumes:
- name: config
configMap:
name: jaeger-collector-config
---
apiVersion: v1
kind: Service
metadata:
name: jaeger-collector
namespace: observability
spec:
type: ClusterIP
selector:
app: jaeger-collector
ports:
- name: otlp-grpc
port: 4317
targetPort: 4317
- name: otlp-http
port: 4318
targetPort: 4318
query-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-query
namespace: observability
spec:
replicas: 2
selector:
matchLabels:
app: jaeger-query
template:
metadata:
labels:
app: jaeger-query
spec:
containers:
- name: query
image: jaegertracing/jaeger:2.0.0
args:
- --config=/etc/jaeger/query-config.yaml
ports:
- containerPort: 16686
name: ui
- containerPort: 16685
name: grpc
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2000m"
volumeMounts:
- name: config
mountPath: /etc/jaeger
volumes:
- name: config
configMap:
name: jaeger-query-config
---
apiVersion: v1
kind: Service
metadata:
name: jaeger-query
namespace: observability
spec:
  type: LoadBalancer # or expose via an Ingress (example below)
selector:
app: jaeger-query
ports:
- name: ui
port: 80
targetPort: 16686
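If an Ingress is preferred over a LoadBalancer, a hypothetical equivalent looks like this (the host name and ingress class are placeholders):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: jaeger-query
  namespace: observability
spec:
  ingressClassName: nginx
  rules:
  - host: jaeger.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: jaeger-query
            port:
              number: 80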
2. Performance Tuning
2.1 Collector Tuning
High-throughput scenario (> 10K spans/s):
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16 # raise the message size limit
processors:
batch:
    timeout: 200ms # flush sooner to reduce batching latency
    send_batch_size: 2048 # larger batches
    send_batch_max_size: 4096
exporters:
jaeger_storage_exporter:
trace_storage: main
sending_queue:
enabled: true
      num_consumers: 50 # more parallel consumers
      queue_size: 10000 # deeper queue
Resource settings:
resources:
requests:
memory: "2Gi"
cpu: "2000m"
limits:
memory: "8Gi"
cpu: "8000m"
2.2 Storage Tuning
Cassandra:
jaeger_storage:
backends:
main:
cassandra:
connection:
          connections_per_host: 20 # more connections per host
write:
timeout: 30s
max_retries: 5
          consistency: LOCAL_ONE # relax consistency for higher throughput
Elasticsearch:
jaeger_storage:
backends:
main:
elasticsearch:
connection:
servers:
- http://es1:9200
- http://es2:9200
        num_shards: 10 # more shards for write parallelism
num_replicas: 2
bulk:
size: 10000000 # 10MB
workers: 10
3. Monitoring and Alerting
3.1 Prometheus Metrics
ServiceMonitor (Prometheus Operator):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: jaeger
namespace: observability
spec:
selector:
matchLabels:
app: jaeger
endpoints:
- port: metrics
interval: 30s
path: /metrics
Key metrics:
# Collector ingest rate
rate(otelcol_receiver_accepted_spans[5m])
# Batch send size (p95); note this histogram tracks batch size, not latency
histogram_quantile(0.95, rate(otelcol_processor_batch_batch_send_size_bucket[5m]))
# Storage write failure ratio
rate(otelcol_exporter_send_failed_spans[5m]) / rate(otelcol_exporter_sent_spans[5m])
# Query latency (p95)
histogram_quantile(0.95, rate(jaeger_query_requests_duration_seconds_bucket[5m]))
3.2 Alerting Rules
groups:
- name: jaeger
rules:
- alert: JaegerHighDropRate
expr: rate(otelcol_receiver_refused_spans[5m]) > 100
for: 5m
annotations:
summary: "Jaeger dropping spans: {{ $value }} spans/sec"
- alert: JaegerStorageWriteErrors
expr: rate(otelcol_exporter_send_failed_spans[5m]) > 10
for: 5m
annotations:
summary: "Jaeger storage write errors: {{ $value }} errors/sec"
- alert: JaegerQueryHighLatency
        expr: histogram_quantile(0.95, rate(jaeger_query_requests_duration_seconds_bucket[5m])) > 2
for: 5m
annotations:
summary: "Jaeger Query P95 latency > 2s"
4. Troubleshooting
4.1 Common Issues
Issue 1: OTLP gRPC connection failures
Diagnose:
# Check whether the port is listening
kubectl exec -it jaeger-xxx -- netstat -tuln | grep 4317
# Check firewall rules
kubectl exec -it jaeger-xxx -- iptables -L -n
# Inspect the logs
kubectl logs -f jaeger-xxx | grep error
Fix:
- Confirm the receiver configuration is correct
- Check the Kubernetes Service and Endpoints
- Check NetworkPolicies
- If the port is reachable but exports still fail, send a few test spans as shown below
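A quick way to send test spans without touching application code is the OpenTelemetry telemetrygen utility (assuming it is installed; the flags below come from its traces subcommand):
# Generate 10 test traces against the collector
telemetrygen traces --otlp-endpoint jaeger-collector:4317 --otlp-insecure --traces 10
If these spans appear in the UI, the pipeline itself is healthy and the problem lies in the application's SDK configuration.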
Issue 2: UI is unreachable
Diagnose:
# Check the Query service status
kubectl get svc jaeger-query
# Test with a port-forward
kubectl port-forward svc/jaeger-query 16686:16686
Fix:
- Confirm the jaeger_query extension is enabled
- Check the UI config file path
- Inspect the Query service logs
Part 3: Hands-On Examples
1. Microservice Tracing Integration
1.1 Go Service Integration
Install dependencies:
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/sdk/trace
Code example:
package main
import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)
func initTracer() (*sdktrace.TracerProvider, error) {
	// Create the OTLP gRPC exporter
exporter, err := otlptracegrpc.New(
context.Background(),
otlptracegrpc.WithEndpoint("jaeger-collector.observability.svc.cluster.local:4317"),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return nil, err
}
	// Create the resource (service metadata)
res, err := resource.New(
context.Background(),
resource.WithAttributes(
semconv.ServiceName("my-service"),
semconv.ServiceVersion("1.0.0"),
),
)
if err != nil {
return nil, err
}
	// Create the TracerProvider
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.AlwaysSample()), // use remote sampling in production (see the sketch below)
)
otel.SetTracerProvider(tp)
return tp, nil
}
func main() {
tp, err := initTracer()
if err != nil {
log.Fatal(err)
}
defer tp.Shutdown(context.Background())
	// Create a tracer
tracer := otel.Tracer("my-service")
	// Start a span
ctx, span := tracer.Start(context.Background(), "main-operation")
defer span.End()
	// Business logic
doWork(ctx)
}
func doWork(ctx context.Context) {
tracer := otel.Tracer("my-service")
_, span := tracer.Start(ctx, "do-work")
defer span.End()
	// Add span attributes
span.SetAttributes(
attribute.String("user.id", "12345"),
attribute.Int("items.count", 10),
)
	// Simulate some work
time.Sleep(100 * time.Millisecond)
}
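For the remote sampling mentioned in the comment above, the SDK can pull strategies from the remote_sampling extension instead of always sampling. A minimal sketch using the contrib sampler (go get go.opentelemetry.io/contrib/samplers/jaegerremote); the endpoint matches the HTTP port 5778 configured in Part 1, while the refresh interval and fallback rate are illustrative:

package main

import (
	"time"

	"go.opentelemetry.io/contrib/samplers/jaegerremote"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newRemoteSampler returns a sampler that periodically fetches sampling
// strategies from Jaeger's remote sampling endpoint.
func newRemoteSampler() sdktrace.Sampler {
	return jaegerremote.New(
		"my-service", // must match the service.name resource attribute
		jaegerremote.WithSamplingServerURL("http://jaeger-collector.observability.svc.cluster.local:5778"),
		jaegerremote.WithSamplingRefreshInterval(30*time.Second),
		// Fallback sampler used until the first strategy fetch succeeds.
		jaegerremote.WithInitialSampler(sdktrace.TraceIDRatioBased(0.1)),
	)
}

Pass it to the provider with sdktrace.WithSampler(newRemoteSampler()) in place of AlwaysSample().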
1.2 Java Service Integration (Spring Boot)
Dependencies (pom.xml):
<dependencies>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-api</artifactId>
<version>1.31.0</version>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-sdk</artifactId>
<version>1.31.0</version>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-exporter-otlp</artifactId>
<version>1.31.0</version>
</dependency>
<dependency>
<groupId>io.opentelemetry.instrumentation</groupId>
<artifactId>opentelemetry-spring-boot-starter</artifactId>
<version>1.31.0-alpha</version>
</dependency>
</dependencies>
Configuration (application.yml):
otel:
service:
name: my-java-service
exporter:
otlp:
endpoint: http://jaeger-collector:4317
traces:
sampler:
type: always_on
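The same settings can also be supplied through the standard OpenTelemetry autoconfiguration environment variables, which is often more convenient in Kubernetes manifests (values are illustrative):

env:
  - name: OTEL_SERVICE_NAME
    value: my-java-service
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://jaeger-collector:4317
  - name: OTEL_TRACES_SAMPLER
    value: always_on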
2. Complete Deployment Example (Production)
Architecture diagram:
┌─────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐│
│ │ Microservice │ │ Microservice │ │Microservice ││
│ │ A │ │ B │ │ C ││
│ └──────┬───────┘ └──────┬───────┘ └──────┬──────┘│
│ │ OTLP gRPC │ │ │
│ └──────────────────┴──────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Jaeger Collector (x3) │ │
│ │ (HPA: 3-10 replicas) │ │
│ └───────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Cassandra Cluster │ │
│ │ (3 nodes, RF=3) │ │
│ └───────────┬─────────────┘ │
│ │ │
│ ┌─────────────────┴─────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │Jaeger Query │ │ Grafana │ │
│ │ (x2) │ │ (Dashboards) │ │
│ └──────┬───────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Jaeger UI │ │
│ │ (Ingress) │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
Deployment checklist:
- Cassandra StatefulSet (omitted here; use the Cassandra Operator)
- Jaeger Collector Deployment (see above)
- Jaeger Query Deployment (see above)
- HPA for autoscaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: jaeger-collector-hpa
namespace: observability
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: jaeger-collector
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Pods
pods:
metric:
name: otelcol_receiver_accepted_spans_rate
target:
type: AverageValue
        averageValue: "10000" # target: 10K spans/s per Pod
Note that a Pods-type metric like this is only visible to the HPA if a custom-metrics adapter (for example prometheus-adapter) exposes it; the metric name shown is illustrative.
Summary
This document has walked through the Jaeger V2 architecture and a complete hands-on deployment guide:
Key advantages of Jaeger V2:
- Single binary: simpler deployment and operations
- YAML configuration: declarative and easy to version-control
- Pluggable extensions: full access to the OTEL Collector ecosystem
- Backward compatible: supports all V1 protocols and APIs
Key points for production deployment:
- High availability: run multiple replicas of the Collector and Query services
- Storage selection: Cassandra or Elasticsearch is recommended for production
- Monitoring and alerting: integrate Prometheus and alert on the key metrics
- Performance tuning: adjust batching parameters and resources to match traffic volume