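Overview
The transport layer is a core layer of the Linux network stack and is responsible for end-to-end data transfer. This article walks through how TCP and UDP are implemented, covering connection management, congestion control, flow control, and related optimizations.
1. Transport Layer Architecture
1.1 Core Responsibilities
The Linux transport layer provides the following key functions:
- Reliable TCP transport: connection-oriented, reliable data delivery
- Connectionless UDP transport: datagram delivery without a connection
- Port management: allocating and binding transport-layer ports
- Connection management: establishing, maintaining, and closing TCP connections
- Flow control: keeping the sender from overwhelming the receiver
- Congestion control: avoiding network congestion and improving throughput
- Error detection and recovery: detecting transmission errors and recovering from them
1.2 Architecture Diagram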
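graph TB
subgraph "Application layer"
APP[Application]
SYSCALL[System calls]
end
subgraph "Socket layer"
SOCKET[Socket interface]
SOCKOPS[Socket operations]
SOCKBUF[Socket buffers]
end
subgraph "Transport protocols"
subgraph "TCP subsystem"
TCP_SOCKET[TCP Socket]
TCP_INPUT[TCP input processing]
TCP_OUTPUT[TCP output processing]
TCP_TIMER[TCP timers]
TCP_CONG[Congestion control]
end
subgraph "UDP subsystem"
UDP_SOCKET[UDP Socket]
UDP_INPUT[UDP input processing]
UDP_OUTPUT[UDP output processing]
UDP_HASH[UDP hash table]
end
subgraph "Connection management"
LISTEN[Listen management]
CONNECT[Connection setup]
CLOSE[Connection teardown]
BIND[Port binding]
end
subgraph "Hash table management"
HASH_TABLE[Socket hash tables]
HASH_LISTEN[Listener hash table]
HASH_ESTAB[Established hash table]
HASH_TIME_WAIT[TIME_WAIT hash table]
end
end
subgraph "Network layer"
IP_LAYER[IP layer]
ROUTE[Route lookup]
NEIGH[Neighbor subsystem]
end
%% Application to socket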
APP --> SYSCALL
SYSCALL --> SOCKET
SOCKET --> SOCKOPS
SOCKET --> SOCKBUF
%% TCP processing flow
SOCKOPS --> TCP_SOCKET
TCP_SOCKET --> TCP_INPUT
TCP_SOCKET --> TCP_OUTPUT
TCP_INPUT --> TCP_TIMER
TCP_OUTPUT --> TCP_CONG
%% UDP processing flow
SOCKOPS --> UDP_SOCKET
UDP_SOCKET --> UDP_INPUT
UDP_SOCKET --> UDP_OUTPUT
UDP_SOCKET --> UDP_HASH
%% Connection management
TCP_SOCKET --> LISTEN
TCP_SOCKET --> CONNECT
TCP_SOCKET --> CLOSE
UDP_SOCKET --> BIND
%% Hash table management
TCP_SOCKET --> HASH_TABLE
LISTEN --> HASH_LISTEN
CONNECT --> HASH_ESTAB
CLOSE --> HASH_TIME_WAIT
UDP_SOCKET --> HASH_TABLE
%% Down to the network layer
TCP_OUTPUT --> IP_LAYER
UDP_OUTPUT --> IP_LAYER
IP_LAYER --> ROUTE
ROUTE --> NEIGH
%% Receive direction
IP_LAYER --> TCP_INPUT
IP_LAYER --> UDP_INPUT
style TCP_SOCKET fill:#e1f5fe
style UDP_SOCKET fill:#f3e5f5
style HASH_TABLE fill:#e8f5e8
style TCP_CONG fill:#fff3e0
2. TCP Implementation
2.1 TCP Header Structure
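As a rough illustration of the header layout, the sketch below decodes the fixed 20-byte TCP header in user space with Python's `struct` module. It follows the standard field order (ports, sequence number, acknowledgment number, data offset and flags, window, checksum, urgent pointer). The function name `parse_tcp_header` and the returned dictionary are assumptions made for this example and are unrelated to the kernel's `struct tcphdr` accessors.

```python
import struct

# Classic TCP flag bits in the low byte of the offset/flags field.
TCP_FLAGS = {0x01: "FIN", 0x02: "SYN", 0x04: "RST",
             0x08: "PSH", 0x10: "ACK", 0x20: "URG"}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (options are skipped)."""
    if len(segment) < 20:
        raise ValueError("a TCP header needs at least 20 bytes")
    (src, dst, seq, ack, off_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_len": (off_flags >> 12) * 4,  # data offset is in 32-bit words
        "flags": [name for bit, name in TCP_FLAGS.items() if off_flags & bit],
        "window": window,
        "checksum": checksum,
        "urgent": urgent,
    }


# A hand-built SYN segment: data offset 5 (20 bytes), only the SYN bit set.
syn = struct.pack("!HHIIHHHH", 40000, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
parsed = parse_tcp_header(syn)
```

Here `parsed["flags"]` contains only `"SYN"` and `parsed["header_len"]` is 20, since the hand-built segment carries no options.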
2.2 TCP Connection Establishment
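From user space, connection establishment is driven by `listen()`, `connect()` and `accept()`. The sketch below runs both ends over loopback in a single process so the handshake can be observed end to end. It assumes a working loopback interface, and the helper name `loopback_handshake` is made up for this example.

```python
import socket


def loopback_handshake(payload: bytes = b"ping") -> bytes:
    """Open a TCP connection over loopback and pass one payload through it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
        srv.listen(1)                # socket enters LISTEN
        addr = srv.getsockname()
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
            # Blocking connect() returns once the client side is ESTABLISHED.
            cli.connect(addr)
            # accept() dequeues the fully established connection.
            conn, _peer = srv.accept()
            with conn:
                cli.sendall(payload)
                return conn.recv(len(payload), socket.MSG_WAITALL)


echoed = loopback_handshake(b"hello")
```

Running `ss -tan` in another terminal while a longer-lived variant of this is active shows the sockets in the LISTEN and ESTAB states.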
2.3 TCP State Machine
stateDiagram-v2
[*] --> CLOSED: initial state
CLOSED --> LISTEN: passive open
CLOSED --> SYN_SENT: active open
LISTEN --> SYN_RECV: SYN received
LISTEN --> CLOSED: close
SYN_SENT --> ESTABLISHED: SYN+ACK received, send ACK
SYN_SENT --> SYN_RECV: SYN received, send SYN+ACK
SYN_SENT --> CLOSED: connect failed or RST
SYN_RECV --> ESTABLISHED: ACK received
SYN_RECV --> LISTEN: RST received
SYN_RECV --> CLOSED: timeout
ESTABLISHED --> FIN_WAIT1: active close, send FIN
ESTABLISHED --> CLOSE_WAIT: FIN received, send ACK
FIN_WAIT1 --> FIN_WAIT2: ACK received
FIN_WAIT1 --> CLOSING: FIN received, send ACK
FIN_WAIT1 --> TIME_WAIT: FIN+ACK received, send ACK
FIN_WAIT2 --> TIME_WAIT: FIN received, send ACK
CLOSE_WAIT --> LAST_ACK: send FIN
CLOSING --> TIME_WAIT: ACK received
LAST_ACK --> CLOSED: ACK received
TIME_WAIT --> CLOSED: 2MSL timeout
note right of ESTABLISHED: data transfer state
note right of TIME_WAIT: wait 2MSL (Linux uses a fixed 60 s, TCP_TIMEWAIT_LEN)
note right of CLOSE_WAIT: passive close, waiting for the application to close
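The state diagram can be checked mechanically by encoding it as a transition table. The sketch below is a simplified model covering the common paths only. The event names (`active_open`, `rcv_synack`, and so on) are invented for the example, whereas the kernel drives these transitions from functions such as `tcp_rcv_state_process()`.

```python
# (current state, event) -> next state, following the diagram above.
TRANSITIONS = {
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open"): "SYN_SENT",
    ("LISTEN", "rcv_syn"): "SYN_RECV",
    ("SYN_SENT", "rcv_synack"): "ESTABLISHED",
    ("SYN_RECV", "rcv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT1",
    ("ESTABLISHED", "rcv_fin"): "CLOSE_WAIT",
    ("FIN_WAIT1", "rcv_ack"): "FIN_WAIT2",
    ("FIN_WAIT1", "rcv_fin"): "CLOSING",
    ("FIN_WAIT2", "rcv_fin"): "TIME_WAIT",
    ("CLOSING", "rcv_ack"): "TIME_WAIT",
    ("CLOSE_WAIT", "close"): "LAST_ACK",
    ("LAST_ACK", "rcv_ack"): "CLOSED",
    ("TIME_WAIT", "timeout"): "CLOSED",
}


def run(events, state="CLOSED"):
    """Apply events in order and return the final state."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not valid in state {state}")
        state = TRANSITIONS[key]
    return state


# Active opener that also closes first ends in TIME_WAIT, then CLOSED.
active = run(["active_open", "rcv_synack", "close", "rcv_ack", "rcv_fin", "timeout"])
# Passive opener whose peer closes first goes through CLOSE_WAIT and LAST_ACK.
passive = run(["passive_open", "rcv_syn", "rcv_ack", "rcv_fin", "close", "rcv_ack"])
```

Only the side that closes first passes through TIME_WAIT in this model, which matches the diagram.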
2.4 Three-Way Handshake Sequence (Active/Passive)
sequenceDiagram
participant CAPP as Client application
participant CSOCK as Client socket layer
participant CTCP as Client TCP
participant IP as IP layer
participant SNET as Network
participant STCP as Server TCP
participant SSOCK as Server socket layer
participant SAPP as Server application
CAPP->>CSOCK: connect()
CSOCK->>CTCP: tcp_v4_connect()
CTCP->>CTCP: tcp_connect() builds SYN
CTCP->>IP: tcp_transmit_skb(SYN)
IP->>SNET: send SYN
SNET->>STCP: SYN arrives
STCP->>STCP: tcp_v4_rcv() -> listener queue
STCP->>STCP: create half-open connection, send SYN+ACK
STCP->>IP: tcp_transmit_skb(SYN+ACK)
IP->>SNET: send SYN+ACK
SNET->>CTCP: SYN+ACK arrives
CTCP->>CTCP: tcp_rcv_synsent_state_process()
CTCP->>CTCP: tcp_finish_connect() connection established
CTCP->>IP: tcp_send_ack()
IP->>SNET: send ACK
SNET->>STCP: ACK arrives
STCP->>SSOCK: connection moved to accept queue
SSOCK->>SAPP: accept() returns connected socket
2.5 TCP Send Path Sequence
sequenceDiagram
participant APP as Application
participant SOCK as Socket layer
participant TCP as TCP
participant IP as IP layer
participant DEV as Device layer
APP->>SOCK: send()/write()
SOCK->>TCP: tcp_sendmsg_locked()
TCP->>TCP: tcp_push() / tcp_write_xmit()
TCP->>IP: __tcp_transmit_skb()
IP->>IP: ip_queue_xmit() -> ip_local_out()
IP->>DEV: dst_output() -> dev_queue_xmit()
DEV-->>Network: ndo_start_xmit()
2.6 TCP Receive Path Sequence
sequenceDiagram
participant NIC as NIC
participant DRV as Driver/NAPI
participant NET as Network layer (IP)
participant TCP as TCP
participant SOCK as Socket layer
participant APP as Application
NIC->>DRV: DMA receive + interrupt
DRV->>DRV: NAPI poll() -> napi_gro_receive()
DRV->>NET: netif_receive_skb()
NET->>NET: ip_rcv() -> ip_local_deliver()
NET->>TCP: ip_protocol_deliver_rcu()
TCP->>TCP: tcp_v4_rcv() / tcp_rcv_established()
TCP->>SOCK: enqueue to receive buffer and wake up reader
APP->>SOCK: recv()/read()
SOCK->>APP: copy data to user space
2.7 Timeout Retransmission and Fast Retransmit Sequence
sequenceDiagram
participant TCP as TCP sender
participant TMR as Timer
participant NET as Network
participant RCV as Receiver
TCP->>NET: send segment (seq=N)
activate TMR
TMR-->>TCP: RTO armed via inet_csk_reset_xmit_timer() (icsk_rto)
NET-->>RCV: segment arrives or is lost
alt segment lost and no ACK received
TMR-->>TCP: RTO expires, tcp_write_timer()
TCP->>TCP: tcp_retransmit_timer()
TCP->>NET: retransmit segment N, reduce ssthresh, enter loss recovery
else 3 or more duplicate ACKs
RCV-->>TCP: ACK(N) repeated 3 times
TCP->>TCP: tcp_fastretrans_alert()
TCP->>NET: fast retransmit segment N, enter Fast Recovery
end
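To make the two retransmission triggers concrete, the sketch below computes an RTO from RTT samples using the textbook smoothing rules from RFC 6298, and counts duplicate ACKs against a threshold of three. It is a floating-point approximation, not the kernel's code: Linux keeps scaled integer values and layers RACK and other heuristics on top. The 200 ms floor and 120 s ceiling mirror `TCP_RTO_MIN` and `TCP_RTO_MAX`, while the class and function names are assumptions for this example.

```python
class RtoEstimator:
    """RFC 6298 style RTO calculation (seconds, floating point)."""
    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4

    def __init__(self, min_rto=0.2, max_rto=120.0):
        self.min_rto, self.max_rto = min_rto, max_rto
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any sample

    def on_rtt_sample(self, rtt):
        if self.srtt is None:          # first measurement
            self.srtt, self.rttvar = rtt, rtt / 2
        else:                          # rttvar is updated before srtt
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + self.K * self.rttvar
        self.rto = min(max(rto, self.min_rto), self.max_rto)

    def on_timeout(self):
        """Exponential backoff after the retransmission timer fires."""
        self.rto = min(self.rto * 2, self.max_rto)


def fast_retransmit_trigger(acks, threshold=3):
    """Return the ACK number whose duplicate count reaches threshold, else None."""
    last, dups = None, 0
    for ack in acks:
        if ack == last:
            dups += 1
            if dups == threshold:
                return ack
        else:
            last, dups = ack, 0
    return None


est = RtoEstimator()
est.on_rtt_sample(0.1)   # srtt = 0.1, rttvar = 0.05, rto = 0.3
first_rto = est.rto
est.on_timeout()         # backoff doubles the RTO
```

The first sample gives an RTO of about 0.3 s, and one timeout doubles it to about 0.6 s.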
2.8 Connection Termination (Four-Way Close) Sequence
sequenceDiagram
participant A as Active closer
participant B as Passive closer
participant NET as Network
A->>NET: FIN
NET->>B: FIN arrives
B->>A: ACK
B-->>Application: enters CLOSE_WAIT
B->>NET: FIN
NET->>A: FIN arrives
A->>B: ACK (enters TIME_WAIT)
A-->>A: CLOSED after the 2MSL timer
3. UDP Implementation
3.1 UDP Header Structure and Sockets
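The UDP header is only eight bytes: source port, destination port, length (header plus payload) and checksum. The sketch below packs and unpacks it in user space. The helper names are made up for the example, and the checksum is left at zero, which IPv4 treats as "no checksum computed".

```python
import struct

UDP_HEADER_LEN = 8


def build_udp_header(src_port: int, dst_port: int, payload: bytes, checksum: int = 0) -> bytes:
    """Pack the 8-byte UDP header; length covers header plus payload."""
    return struct.pack("!HHHH", src_port, dst_port, UDP_HEADER_LEN + len(payload), checksum)


def parse_udp_header(datagram: bytes) -> dict:
    """Unpack the UDP header from the start of a datagram."""
    if len(datagram) < UDP_HEADER_LEN:
        raise ValueError("a UDP header needs 8 bytes")
    src, dst, length, checksum = struct.unpack("!HHHH", datagram[:UDP_HEADER_LEN])
    return {"src_port": src, "dst_port": dst, "length": length, "checksum": checksum}


payload = b"abc"
datagram = build_udp_header(5000, 53, payload) + payload
header = parse_udp_header(datagram)
```

For the three-byte payload the length field comes out as 11.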
3.2 UDP Packet Processing
3.3 UDP Send/Receive Sequence
sequenceDiagram
participant APP as Application
participant SOCK as Socket layer
participant UDP as UDP
participant IP as IP layer
participant DEV as Device
participant DRV as Driver/NAPI
APP->>SOCK: sendto()
SOCK->>UDP: udp_sendmsg()
UDP->>IP: ip_append_data()/udp_push_pending_frames()
IP->>DEV: ip_local_out() -> dev_queue_xmit()
DEV-->>Network: transmit frame
DRV-->>IP: receive path netif_receive_skb()
IP->>UDP: __udp4_lib_rcv()
UDP->>SOCK: udp_queue_rcv_skb() enqueues
APP->>SOCK: recvfrom()
SOCK->>APP: return data
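The user-space side of this sequence is a `sendto()` on one socket and a `recvfrom()` on another. The sketch below sends a single datagram over loopback, with no handshake and no connection state. It assumes a working loopback interface, and the helper name `udp_roundtrip` is an assumption for the example.

```python
import socket


def udp_roundtrip(payload: bytes = b"hello") -> bytes:
    """Send one datagram over loopback and return what the receiver got."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tx:
        rx.bind(("127.0.0.1", 0))    # the kernel picks a free port
        rx.settimeout(2.0)           # avoid blocking forever if the datagram is dropped
        tx.sendto(payload, rx.getsockname())
        data, _peer = rx.recvfrom(2048)
        return data


received = udp_roundtrip(b"datagram")
```

Unlike the TCP case, nothing here guarantees delivery, so the receive timeout is the only protection against a lost datagram.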
4. TCP Congestion Control Algorithms
4.1 Congestion Control Framework
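The kernel framework is pluggable: each algorithm registers a table of callbacks (`struct tcp_congestion_ops`) and TCP invokes them on ACKs and losses. The sketch below imitates that shape in Python with a Reno-like algorithm. The `Conn` class, the registry, and the `on_ack`/`on_loss` helpers are all assumptions made for illustration, and slow start is simplified to stop exactly at `ssthresh`.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    cwnd: int = 10       # congestion window, in segments
    ssthresh: int = 64   # slow-start threshold, in segments
    cwnd_cnt: int = 0    # ACKed segments accumulated toward the next +1


class RenoLike:
    """Callback set loosely modelled on the ssthresh/cong_avoid hooks."""
    name = "reno_like"

    def ssthresh(self, conn):
        return max(conn.cwnd // 2, 2)   # halve the window on loss

    def cong_avoid(self, conn, acked):
        if conn.cwnd < conn.ssthresh:   # slow start: grow by ACKed segments
            conn.cwnd = min(conn.cwnd + acked, conn.ssthresh)
            return
        conn.cwnd_cnt += acked          # congestion avoidance: +1 per window
        if conn.cwnd_cnt >= conn.cwnd:
            conn.cwnd_cnt -= conn.cwnd
            conn.cwnd += 1


REGISTRY = {}


def register(algo):
    REGISTRY[algo.name] = algo


def on_ack(conn, algo_name, acked):
    REGISTRY[algo_name].cong_avoid(conn, acked)


def on_loss(conn, algo_name):
    conn.ssthresh = REGISTRY[algo_name].ssthresh(conn)
    conn.cwnd = conn.ssthresh
    conn.cwnd_cnt = 0


register(RenoLike())
c = Conn(cwnd=10, ssthresh=12)
on_ack(c, "reno_like", 4)    # slow start, capped at ssthresh
after_slow_start = c.cwnd
on_ack(c, "reno_like", 12)   # one full window ACKed in congestion avoidance
after_avoidance = c.cwnd
on_loss(c, "reno_like")
```

Swapping in a different algorithm only requires registering another object with the same two methods, which is the point of the callback table design.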
4.2 CUBIC Congestion Control
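CUBIC grows the window along a cubic curve anchored at the window size before the last loss, W_max: W(t) = C(t - K)^3 + W_max, with K = cbrt(W_max(1 - beta) / C). The sketch below evaluates that curve in floating point with the commonly cited constants C = 0.4 and beta = 0.7. The kernel's tcp_cubic.c uses scaled integer arithmetic and additional heuristics, so treat this as the formula only; the function names are assumptions for the example.

```python
def cubic_k(w_max: float, c: float = 0.4, beta: float = 0.7) -> float:
    """Time (seconds) for the curve to climb back to w_max after a loss."""
    return (w_max * (1 - beta) / c) ** (1 / 3)


def cubic_window(t: float, w_max: float, c: float = 0.4, beta: float = 0.7) -> float:
    """Window (segments) t seconds after the last loss event."""
    k = cubic_k(w_max, c, beta)
    return c * (t - k) ** 3 + w_max


w_max = 100.0
k = cubic_k(w_max)
w_at_loss = cubic_window(0.0, w_max)       # starts at beta * w_max
w_at_k = cubic_window(k, w_max)            # plateau at w_max
w_probe = cubic_window(k + 2.0, w_max)     # convex probing beyond w_max
```

The curve starts at beta * W_max right after the loss, flattens as it approaches W_max, and then accelerates again to probe for more bandwidth.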
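4.3 BBR Congestion Control
BBR (Bottleneck Bandwidth and RTT) is a congestion control algorithm developed by Google. Rather than reacting to packet loss, it models the path from measured bottleneck bandwidth and minimum RTT.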
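The core of the BBR model is two estimates, the maximum recent delivery rate and the minimum RTT, whose product is the bandwidth-delay product (BDP). The sketch below keeps only that part. The state machine (startup, drain, bandwidth probing, RTT probing) is omitted, the bandwidth filter is windowed by sample count rather than round trips, and the minimum RTT never expires. The class name and gains are assumptions for illustration.

```python
from collections import deque


class BbrModel:
    """Toy path model: max-filtered bandwidth and min-filtered RTT."""

    def __init__(self, bw_window: int = 10):
        self.bw_samples = deque(maxlen=bw_window)  # recent delivery rates, bytes/s
        self.min_rtt = float("inf")                # seconds

    def on_ack(self, delivered_bytes: float, interval_s: float, rtt_s: float):
        self.bw_samples.append(delivered_bytes / interval_s)
        self.min_rtt = min(self.min_rtt, rtt_s)

    @property
    def btl_bw(self) -> float:
        return max(self.bw_samples, default=0.0)

    def bdp(self) -> float:
        """Bandwidth-delay product in bytes (0 until a sample arrives)."""
        if not self.bw_samples:
            return 0.0
        return self.btl_bw * self.min_rtt

    def pacing_rate(self, gain: float = 1.0) -> float:
        return gain * self.btl_bw

    def cwnd_bytes(self, gain: float = 2.0) -> float:
        return gain * self.bdp()


m = BbrModel()
m.on_ack(delivered_bytes=125_000, interval_s=0.01, rtt_s=0.02)  # 12.5 MB/s
m.on_ack(delivered_bytes=100_000, interval_s=0.01, rtt_s=0.03)  # 10 MB/s
```

With these two samples the model keeps the higher bandwidth and the lower RTT, giving a BDP of roughly 250 kB and a window of about twice that.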
5. Socket Hash Table Management
5.1 Hash Table Structure
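Conceptually, an incoming TCP segment is first matched against established connections by its full 4-tuple, and only if that fails against listening sockets by local address and port. The sketch below models that two-step lookup with plain dictionaries. It ignores the bind and TIME_WAIT tables, bucket locking, network namespaces and scoring, and the class and method names are assumptions for this example.

```python
class SockTable:
    """Toy two-level lookup: established 4-tuple first, then listeners."""
    ANY = "0.0.0.0"

    def __init__(self):
        self.established = {}  # (saddr, sport, daddr, dport) -> socket
        self.listening = {}    # (laddr, lport) -> socket

    def add_listener(self, laddr, lport, sk):
        self.listening[(laddr, lport)] = sk

    def add_established(self, saddr, sport, daddr, dport, sk):
        self.established[(saddr, sport, daddr, dport)] = sk

    def lookup(self, saddr, sport, daddr, dport):
        """Find the socket for a segment from saddr:sport to daddr:dport."""
        sk = self.established.get((saddr, sport, daddr, dport))
        if sk is not None:
            return sk
        # Fall back to a listener bound to the exact address, then the wildcard.
        for key in ((daddr, dport), (self.ANY, dport)):
            sk = self.listening.get(key)
            if sk is not None:
                return sk
        return None


table = SockTable()
table.add_listener("0.0.0.0", 80, "listener:80")
table.add_established("10.0.0.2", 40000, "10.0.0.1", 80, "conn:40000")
```

A segment from the known peer resolves to the established socket, a SYN from a new peer falls through to the wildcard listener, and a segment to an unbound port finds nothing.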
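6. Performance Tuning Notes
- Congestion control choice: pick the algorithm that fits the network environment
- Hash table tuning: size the socket hash tables to match the expected number of connections
- Cache locality: lay out data structures so that hot fields share cache lines
- Memory management: reduce allocation and free overhead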
7. Key Function Call Paths (Transport Layer)
- Application send (TCP): __sys_sendto -> sock_sendmsg -> inet_sendmsg -> tcp_sendmsg_locked -> tcp_push -> tcp_write_xmit -> __tcp_transmit_skb -> ip_queue_xmit
- Application receive (TCP)
  - Driver: poll -> napi_gro_receive -> netif_receive_skb -> ip_rcv -> ip_local_deliver -> tcp_v4_rcv -> tcp_rcv_established (data queued on the socket); the application's recv() then reads it through tcp_recvmsg
- TCP connection setup/listen
  - Connect: __sys_connect -> inet_stream_connect -> tcp_v4_connect -> tcp_connect -> three-way handshake
  - Listen: __sys_listen -> inet_listen -> inet_csk_listen_start
  - Accept: inet_accept -> inet_csk_accept
- Timeout retransmission/fast retransmit
  - RTO: inet_csk_reset_xmit_timer -> tcp_retransmit_timer -> tcp_retransmit_skb
  - Fast: tcp_fastretrans_alert -> fast retransmit and congestion recovery
- Application send (UDP): __sys_sendto -> sock_sendmsg -> inet_sendmsg -> udp_sendmsg -> ip_append_data / udp_push_pending_frames -> ip_local_out
- Application receive (UDP)
  - Driver: poll -> netif_receive_skb -> ip_rcv -> ip_local_deliver -> __udp4_lib_rcv -> udp_queue_rcv_skb; the application's recvfrom() then reads it through udp_recvmsg
7.1 Call Paths by Module
- Connection management (Connect/Listen/Accept/Close/Time-Wait)
  - Active connect: __sys_connect -> inet_stream_connect -> tcp_v4_connect -> tcp_connect -> tcp_transmit_skb(SYN) -> three-way handshake -> tcp_finish_connect
  - Listen: __sys_listen -> inet_listen -> inet_csk_listen_start
  - Accept: __sys_accept4 -> inet_accept -> inet_csk_accept -> returns the connected socket
  - Active close: __sys_close / close -> sock_close -> __sock_release -> inet_release -> tcp_close -> tcp_send_fin -> four-way close -> tcp_time_wait
  - Passive close: tcp_v4_rcv (FIN received) -> tcp_fin() / tcp_ack -> tcp_send_ack -> CLOSE_WAIT; once the application closes, tcp_close -> tcp_send_fin -> LAST_ACK (the passive side does not enter TIME_WAIT)
  - TIME-WAIT: tcp_time_wait -> inet_twsk_schedule -> inet_twsk_deschedule -> resources reclaimed after expiry
- TCP output (send, segmentation, hardware offload)
  - Regular send: tcp_sendmsg_locked -> tcp_push -> tcp_write_xmit -> __tcp_transmit_skb -> ip_queue_xmit -> ip_local_out -> dst_output -> dev_queue_xmit
  - TSO/GSO: dev_queue_xmit -> validate_xmit_skb -> skb_gso_segment (software GSO) or hardware TSO -> driver ndo_start_xmit
  - With VLAN: __vlan_hwaccel_put_tag -> dev_queue_xmit
- TCP input (receive and ACK processing)
  - Receive path: driver poll -> napi_gro_receive -> netif_receive_skb -> ip_rcv -> ip_local_deliver -> ip_protocol_deliver_rcu -> tcp_v4_rcv -> tcp_rcv_established -> enqueue on the socket
  - ACK processing: tcp_ack -> tcp_clean_rtx_queue -> tcp_fastretrans_alert (decides on fast retransmit / congestion recovery)
  - Delayed ACK: inet_csk_schedule_ack -> timer tcp_delack_timer -> tcp_send_ack
- Retransmission and timers (RTO/TLP/Keepalive)
  - RTO timeout: inet_csk_reset_xmit_timer(…ICSK_TIME_RETRANS…) arms the timer; on expiry tcp_write_timer -> tcp_retransmit_timer -> tcp_retransmit_skb
  - Fast retransmit: 3 or more consecutive DupACKs -> tcp_fastretrans_alert -> retransmit and enter Fast Recovery
  - TLP (optional): tcp_send_loss_probe / tcp_tlp_timer -> triggers a probe retransmission
  - Keepalive: periodic timer tcp_keepalive_timer -> tcp_write_wakeup (sends a probe)
- Congestion control (CUBIC/BBR examples)
  - Generic callbacks: cong_control / cong_avoid / ssthresh / undo_cwnd, implemented through tcp_congestion_ops
  - Cubic: bictcp_cong_avoid -> bictcp_update -> adjusts tp->snd_cwnd
  - BBR: bbr_main -> bbr_update_model_and_state -> bbr_update_control_parameters
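- Socket hashing/lookup (bind, listen, established, TIME-WAIT)
  - Established socket lookup (matching a segment to its connection): __inet_lookup_established
  - Listener lookup: inet_lookup_listener
  - Bind table: inet_bind_bucket / inet_hashinfo.bhash
  - TIME-WAIT migration: inet_twsk_hashdance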
- UDP output
  - Regular send: udp_sendmsg -> ip_append_data -> udp_push_pending_frames -> ip_local_out -> dev_queue_xmit
  - Cork and fragmentation: udp_sendmsg (corked) -> multiple ip_append_data calls -> udp_push_pending_frames
- UDP input
  - Regular receive: __udp4_lib_rcv -> udp_queue_rcv_skb -> udp_recvmsg
  - Multicast/broadcast: __udp4_lib_rcv -> __udp4_lib_mcast_deliver
  - UDP GRO: udp_gro_receive -> udp_gro_complete
- TCP Fast Open (TFO)
  - Active side: tcp_v4_connect -> tcp_connect -> tcp_send_syn_data (SYN carries data)
  - Passive side: tcp_v4_rcv (SYN+TFO) -> tcp_fastopen_create_child -> accepts the data and creates the child connection
  - SYN-ACK handling: tcp_rcv_synsent_state_process -> tcp_rcv_fastopen_synack
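- SACK/out-of-order reassembly
  - Receiving SACK: tcp_sacktag_write_queue -> updates the scoreboard counters tp->sacked_out / tp->lost_out
  - Out of order: tcp_data_queue_ofo -> insert into ofo_queue / merge -> tcp_ofo_queue drains the queue once the gap is filled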
- GRO/GSO/TSO key paths
  - GRO receive: napi_gro_receive -> inet_gro_receive / tcp_gro_receive -> napi_gro_flush
  - GSO send: dev_queue_xmit -> validate_xmit_skb -> gso_segment -> driver TSO