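Zero-Copy and Direct I/O

Chapter Overview

In high-performance I/O workloads, the cost of copying data often becomes the bottleneck. This chapter looks closely at the zero-copy techniques (sendfile, splice, mmap) and at Direct I/O, so that you understand how to reduce data copies and improve I/O performance.

Learning objectives:

Understand the data-copy path of traditional I/O
Master the principles and use cases of the main zero-copy techniques
Learn to bypass the PageCache with Direct I/O
Be able to choose the right I/O optimization for a given scenario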

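Core Concepts

1. The Problem with Traditional I/O

A typical file transfer:

The application wants to send a file's contents over the network

1. read() system call
   Disk → kernel PageCache → user-space buffer
   (2 copies: DMA copy + CPU copy)

2. write() system call
   User-space buffer → kernel socket buffer → NIC
   (2 copies: CPU copy + DMA copy)

Total: 4 copies + 4 context switches (per read()/write() pair)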

Detailed flow (arrows show the direction in which data moves):

┌─────────────────────────────────────────────┐
│                 User space                  │
│                                             │
│   read()              write()               │
│     ▲                   │                   │
└─────┼───────────────────┼───────────────────┘
      │ CPU copy          │ CPU copy
┌─────┼───────────────────┼───────────────────┐
│     │   Kernel space    │                   │
│     │                   │                   │
│  ┌──┴──────┐     ┌──────▼──────┐            │
│  │PageCache│     │Socket Buffer│            │
│  └──▲──────┘     └──────┬──────┘            │
│     │                   │                   │
│    DMA                 DMA                  │
│     │                   │                   │
└─────┼───────────────────┼───────────────────┘
      │                   │
   ┌──┴───┐            ┌──▼───┐
   │ Disk │            │ NIC  │
   └──────┘            └──────┘

Copies: 4
- DMA: disk → PageCache
- CPU: PageCache → user buffer
- CPU: user buffer → socket buffer
- DMA: socket buffer → NIC

Context switches: 4
- read() enters the kernel
- read() returns to user mode
- write() enters the kernel
- write() returns to user mode

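Performance costs:

CPU copies waste CPU cycles
Repeated context switches add overhead
The user buffer takes up extra memory
The copies pollute the CPU cache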
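
The counts above are for a single read()/write() pair. To see how they scale with file size, here is a small back-of-the-envelope model in C. It is only a sketch: the names io_cost and traditional_cost are hypothetical, and the model assumes every read() fills the buffer and every write() accepts the whole chunk, and it ignores the final read() that returns 0.

```c
#include <assert.h>
#include <stddef.h>

/* Rough per-transfer cost of a read()/write() copy loop (illustrative model). */
struct io_cost {
    size_t syscalls;          /* read() calls plus write() calls */
    size_t context_switches;  /* user<->kernel transitions */
    size_t cpu_copied_bytes;  /* bytes moved by the CPU rather than by DMA */
};

struct io_cost traditional_cost(size_t file_size, size_t buf_size) {
    assert(buf_size > 0);
    /* Number of buffer-sized chunks, rounding up for a final partial chunk. */
    size_t chunks = file_size / buf_size + (file_size % buf_size != 0);

    struct io_cost c;
    c.syscalls = 2 * chunks;              /* one read() and one write() per chunk */
    c.context_switches = 2 * c.syscalls;  /* each call enters and leaves the kernel */
    c.cpu_copied_bytes = 2 * file_size;   /* PageCache -> user, then user -> socket buffer */
    return c;
}
```

Under this model a 10000-byte file sent through a 4096-byte buffer takes 3 chunks, so 6 system calls and 12 context switches, and the CPU moves 20000 bytes in total.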

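2. Zero-Copy Techniques at a Glance

Zero-copy: reducing or eliminating data copies between kernel space and user space.

Main techniques:

1. sendfile
   - Transfers data directly inside the kernel
   - Suited to file → socket

2. splice
   - Zero-copy built on pipes
   - More flexible: works with many kinds of fd, as long as one end is a pipe

3. mmap
   - Memory-mapped files
   - Suited to random access

4. MSG_ZEROCOPY (Linux 4.14+)
   - Zero-copy network sends
   - Suited to large transfers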
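
MSG_ZEROCOPY is only named in the list above, so here is a rough sketch of its opt-in step in C, assuming a TCP socket on Linux 4.14 or later. The helper names enable_zerocopy and send_zerocopy are hypothetical. The fallback numeric values for SO_ZEROCOPY and MSG_ZEROCOPY are taken from the generic Linux headers and are an assumption for toolchains whose libc headers do not define them yet.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60          /* assumed: value from asm-generic/socket.h */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000  /* assumed: value from linux/socket.h */
#endif

/* Opt the socket in to zero-copy sends. Returns 0 on success, -1 with errno set
 * (for example on kernels or socket types without support). */
int enable_zerocopy(int sock_fd) {
    int one = 1;
    return setsockopt(sock_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Queue buf for transmission without copying it into the socket buffer.
 * The kernel keeps referencing buf after send() returns, so the caller must
 * not modify or free it until a completion notification has been read from
 * the socket error queue (recvmsg() with MSG_ERRQUEUE). */
ssize_t send_zerocopy(int sock_fd, const void *buf, size_t len) {
    assert(buf != NULL);
    return send(sock_fd, buf, len, MSG_ZEROCOPY);
}
```

A caller would typically run enable_zerocopy() once after creating the socket and fall back to a plain send() if it fails. Reading the completion notifications is left out of this sketch.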

Techniques in Detail

1. sendfile

System call:

#include <sys/sendfile.h>

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

How it works:

Traditional path:
Disk → PageCache → user buffer → socket buffer → NIC
(4 copies)

sendfile path:
Disk → PageCache → socket buffer → NIC
(3 copies; no CPU copy into user space)

Optimized sendfile (DMA gather copy, requires scatter-gather support in the NIC):
Disk → PageCache → NIC
(2 copies, both DMA)

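Flow diagram:

┌─────────────────────────────────────┐
│             User space              │
│   sendfile(socket_fd, file_fd)      │
└─────────────┬───────────────────────┘
              │ (1 system call)
┌─────────────▼───────────────────────┐
│            Kernel space             │
│                                     │
│  ┌──────────┐      ┌──────────┐     │
│  │PageCache │─desc→│Socket    │     │
│  │          │      │Buffer    │     │
│  └────▲─────┘      └────┬─────┘     │
│       │                 │           │
│      DMA           DMA gather       │
└───────┼─────────────────┼───────────┘
        │                 │
     ┌──┴──┐           ┌──▼──┐
     │Disk │           │ NIC │
     └─────┘           └─────┘

Advantages:
- Only 2 copies (both DMA) when the NIC supports scatter-gather
- Only 2 context switches (one system call)
- No CPU copy of the payload

Limitations:
- in_fd must be a file that supports mmap-like operations (it cannot be a socket); out_fd had to be a socket before Linux 2.6.33 and may be any file since then
- The data cannot be modified in transit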
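C example:

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdio.h>
#include <unistd.h>

void send_file_traditional(int socket_fd, const char *filename) {
    int file_fd = open(filename, O_RDONLY);
    if (file_fd < 0) {
        perror("open");
        return;
    }

    char buffer[4096];
    ssize_t bytes_read;

    // Traditional path: read + write
    while ((bytes_read = read(file_fd, buffer, sizeof(buffer))) > 0) {
        // write() may accept fewer bytes than requested, so loop until the chunk is sent
        ssize_t off = 0;
        while (off < bytes_read) {
            ssize_t n = write(socket_fd, buffer + off, bytes_read - off);
            if (n < 0) {
                perror("write");
                close(file_fd);
                return;
            }
            off += n;
        }
    }
    if (bytes_read < 0) {
        perror("read");
    }

    close(file_fd);
}

void send_file_zero_copy(int socket_fd, const char *filename) {
    int file_fd = open(filename, O_RDONLY);
    if (file_fd < 0) {
        perror("open");
        return;
    }

    struct stat file_stat;
    if (fstat(file_fd, &file_stat) < 0) {
        perror("fstat");
        close(file_fd);
        return;
    }

    // Zero-copy path: sendfile. A single call may send less than requested,
    // so repeat until the whole file is sent; the kernel advances offset.
    off_t offset = 0;
    while (offset < file_stat.st_size) {
        ssize_t sent = sendfile(socket_fd, file_fd, &offset,
                                file_stat.st_size - offset);
        if (sent < 0) {
            perror("sendfile");
            break;
        }
        if (sent == 0) {
            break;  // the file was truncated while sending
        }
    }

    close(file_fd);
}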

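Go example:

package main

import (
    "io"
    "net"
    "os"
    "syscall"
)

// Traditional path: an explicit read/write loop through a user-space buffer.
// io.Copy is not used here because it switches to sendfile on its own
// when the destination is a *net.TCPConn and the source is an *os.File.
func sendFileTraditional(conn net.Conn, filename string) error {
    file, err := os.Open(filename)
    if err != nil {
        return err
    }
    defer file.Close()

    buf := make([]byte, 32*1024)
    for {
        n, rerr := file.Read(buf)
        if n > 0 {
            if _, werr := conn.Write(buf[:n]); werr != nil {
                return werr
            }
        }
        if rerr == io.EOF {
            return nil
        }
        if rerr != nil {
            return rerr
        }
    }
}

// Zero-copy path (Go uses sendfile automatically)
func sendFileZeroCopy(conn net.Conn, filename string) error {
    file, err := os.Open(filename)
    if err != nil {
        return err
    }
    defer file.Close()

    // On Linux, io.Copy from an *os.File to a TCP connection uses sendfile automatically
    _, err = io.Copy(conn, file)
    return err
}

// Calling sendfile manually (Linux)
func sendFileManual(conn *net.TCPConn, filename string) error {
    file, err := os.Open(filename)
    if err != nil {
        return err
    }
    defer file.Close()

    stat, err := file.Stat()
    if err != nil {
        return err
    }

    // Get a duplicate of the underlying socket fd
    connFile, err := conn.File()
    if err != nil {
        return err
    }
    defer connFile.Close()

    // Call sendfile until the whole file is sent; the kernel advances offset
    offset := int64(0)
    for offset < stat.Size() {
        n, err := syscall.Sendfile(
            int(connFile.Fd()),
            int(file.Fd()),
            &offset,
            int(stat.Size()-offset),
        )
        if err != nil {
            return err
        }
        if n == 0 {
            break // the file was truncated while sending
        }
    }

    return nil
}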

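2. splice

System call:

#define _GNU_SOURCE
#include <fcntl.h>

ssize_t splice(int fd_in, loff_t *off_in,
               int fd_out, loff_t *off_out,
               size_t len, unsigned int flags);

Characteristics:

More flexible: works with many kinds of file descriptors
At least one end must be a pipe
Calls can be chained

How it works:

File → pipe → socket
(zero-copy: the data moves within kernel space)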

C example:

#define _GNU_SOURCE  // required for splice() and SPLICE_F_MOVE
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <stdio.h>

#define PIPE_SIZE 65536

void splice_file_to_socket(int socket_fd, const char *filename) {
    int file_fd = open(filename, O_RDONLY);
    if (file_fd < 0) {
        perror("open");
        return;
    }

    // Create the pipe
    int pipe_fd[2];
    if (pipe(pipe_fd) < 0) {
        perror("pipe");
        close(file_fd);
        return;
    }

    ssize_t bytes = 0;
    int failed = 0;

    // File → pipe
    while (!failed &&
           (bytes = splice(file_fd, NULL, pipe_fd[1], NULL,
                           PIPE_SIZE, SPLICE_F_MOVE)) > 0) {
        // Pipe → socket: drain everything that was just queued
        ssize_t sent = 0;
        while (sent < bytes) {
            ssize_t s = splice(pipe_fd[0], NULL, socket_fd, NULL,
                               bytes - sent, SPLICE_F_MOVE);
            if (s <= 0) {
                if (s < 0) perror("splice to socket");
                failed = 1;  // stop: leftover pipe data would corrupt the stream
                break;
            }
            sent += s;
        }
    }
    if (!failed && bytes < 0) {
        perror("splice from file");
    }

    close(pipe_fd[0]);
    close(pipe_fd[1]);
    close(file_fd);
}

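3. mmap (Memory-Mapped I/O)

System call:

#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

How it works:

The file is mapped into the process's virtual address space

┌──────────────────────────┐
│  Process virtual memory  │
│  ┌────────────────────┐  │
│  │  mmap region       │  │
│  │  (mapped to file)  │  │
│  └────────┬───────────┘  │
└───────────┼──────────────┘
            │ (page-table mapping)
┌───────────▼──────────────┐
│        PageCache         │
│  ┌────────────────────┐  │
│  │  File data         │  │
│  └────────┬───────────┘  │
└───────────┼──────────────┘
            │
         ┌──▼──┐
         │Disk │
         └─────┘

Characteristics:
- Reading and writing the file looks like reading and writing memory
- Fewer system calls
- Memory can be shared between processes
- Lazy loading (data is read only on a page fault)

Use cases:

Random access to large files
Shared memory between processes
Database file access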
C example:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void process_file_mmap(const char *filename) {
    int fd = open(filename, O_RDWR);
    if (fd < 0) {
        perror("open");
        return;
    }

    struct stat sb;
    if (fstat(fd, &sb) < 0) {
        perror("fstat");
        close(fd);
        return;
    }
    if (sb.st_size < 3) {
        // Nothing to replace; mmap also rejects a zero-length mapping
        close(fd);
        return;
    }
    size_t size = (size_t)sb.st_size;

    // Map the file into memory
    char *mapped = mmap(NULL, size,
                        PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    if (mapped == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return;
    }

    // Work on the file as if it were memory
    // Example: find and replace
    for (size_t i = 0; i + 3 <= size; i++) {
        if (memcmp(mapped + i, "old", 3) == 0) {
            memcpy(mapped + i, "new", 3);
        }
    }

    // Make sure the data is written back to disk
    if (msync(mapped, size, MS_SYNC) < 0) {
        perror("msync");
    }

    munmap(mapped, size);
    close(fd);
}
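
The "shared between processes" property listed earlier can be illustrated with a short C sketch, assuming a Linux or other POSIX system. A parent maps a file with MAP_SHARED, a forked child writes through the mapping, and the parent then sees the bytes without calling read(). The function name share_via_mmap and the path passed to it are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Maps `path` with MAP_SHARED, lets a child process write `msg` into the
 * mapping, then copies what the parent sees into `out`. Returns 0 on success. */
int share_via_mmap(const char *path, const char *msg, char *out, size_t out_len) {
    size_t len = strlen(msg) + 1;  /* include the terminating NUL */
    assert(out_len >= len);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)len) < 0) { close(fd); return -1; }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED) return -1;

    pid_t pid = fork();
    if (pid < 0) { munmap(map, len); return -1; }
    if (pid == 0) {             /* child: write through the shared mapping */
        memcpy(map, msg, len);
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) { munmap(map, len); return -1; }
    memcpy(out, map, len);      /* parent: the child's write is visible here */
    munmap(map, len);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With MAP_PRIVATE instead of MAP_SHARED the child's write would land in a private copy-on-write page and the parent would still see the zero-filled file contents.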

Go example:

package main

import (
    "os"

    "golang.org/x/sys/unix"
)

func processFileMmap(filename string) error {
    file, err := os.OpenFile(filename, os.O_RDWR, 0644)
    if err != nil {
        return err
    }
    defer file.Close()

    stat, err := file.Stat()
    if err != nil {
        return err
    }
    if stat.Size() == 0 {
        return nil // mmap rejects a zero-length mapping
    }

    // Map the file
    data, err := unix.Mmap(
        int(file.Fd()),
        0,
        int(stat.Size()),
        unix.PROT_READ|unix.PROT_WRITE,
        unix.MAP_SHARED,
    )
    if err != nil {
        return err
    }
    defer unix.Munmap(data)

    // Work on the file as if it were a slice
    for i := 0; i < len(data)-2; i++ {
        if string(data[i:i+3]) == "old" {
            copy(data[i:], "new")
        }
    }

    // Sync to disk
    return unix.Msync(data, unix.MS_SYNC)
}

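4. Direct I/O

Characteristics:

Bypasses the PageCache
Transfers data directly between the user buffer and the disk
Requires alignment: the buffer address, file offset, and transfer size are typically multiples of the logical block size (512 bytes or 4 KB)

Use cases:

Database systems (which manage their own cache)
Sequential reads and writes of large files
Avoiding PageCache pollution

Limitations:

The benefit shows mainly on SSDs or other fast storage
The application has to manage its own caching
The buffer must be aligned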
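
The alignment requirement can be made explicit with a few small helpers. The following C sketch is illustrative only: the names align_up, is_aligned, and alloc_direct_buffer are hypothetical, and the alignment is assumed to be a power of two such as 512 or 4096.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round n up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t n, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Nonzero if the pointer value is a multiple of align. */
int is_aligned(const void *p, size_t align) {
    return ((uintptr_t)p % align) == 0;
}

/* Allocate a zero-filled buffer whose address and length are both multiples
 * of align, as O_DIRECT transfers require. Returns NULL on failure; release
 * the buffer with free(). */
void *alloc_direct_buffer(size_t size, size_t align) {
    void *buf = NULL;
    size_t padded = align_up(size, align);
    if (posix_memalign(&buf, align, padded) != 0) {
        return NULL;
    }
    memset(buf, 0, padded);  /* so the padding never contains stale heap data */
    return buf;
}
```

For example, align_up(5000, 4096) yields 8192, so a 5000-byte payload would be written as two 4 KB blocks, with the file trimmed back to 5000 bytes afterwards if the logical size matters.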

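C example:

#define _GNU_SOURCE  // required for O_DIRECT
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/stat.h>

#define BLOCK_SIZE 4096

// Placeholder for application-specific processing (hypothetical stub)
static void process_data(const void *buf, size_t len) {
    (void)buf;
    (void)len;
}

void read_file_direct_io(const char *filename) {
    // O_DIRECT flag
    int fd = open(filename, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return;
    }

    // The buffer must be aligned
    void *buffer;
    if (posix_memalign(&buffer, BLOCK_SIZE, BLOCK_SIZE) != 0) {
        // posix_memalign returns the error code and does not set errno
        fprintf(stderr, "posix_memalign failed\n");
        close(fd);
        return;
    }

    ssize_t bytes_read;
    while ((bytes_read = read(fd, buffer, BLOCK_SIZE)) > 0) {
        // Process the data
        process_data(buffer, bytes_read);
    }
    if (bytes_read < 0) {
        perror("read");
    }

    free(buffer);
    close(fd);
}

void write_file_direct_io(const char *filename, const void *data, size_t size) {
    int fd = open(filename, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return;
    }

    // Aligned buffer: round the size up to a multiple of BLOCK_SIZE
    void *aligned_buffer;
    size_t aligned_size = (size + BLOCK_SIZE - 1) & ~((size_t)BLOCK_SIZE - 1);

    if (posix_memalign(&aligned_buffer, BLOCK_SIZE, aligned_size) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        close(fd);
        return;
    }

    memset(aligned_buffer, 0, aligned_size);  // zero the padding
    memcpy(aligned_buffer, data, size);

    if (write(fd, aligned_buffer, aligned_size) != (ssize_t)aligned_size) {
        perror("write");
    } else if (ftruncate(fd, (off_t)size) < 0) {
        // Trim the padding so the file keeps its logical size
        perror("ftruncate");
    }

    free(aligned_buffer);
    close(fd);
}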

Performance Comparison

Benchmark code

// zero_copy_benchmark.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/sendfile.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <sys/mman.h>

double get_time() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

// Method 1: traditional read/write
void method_read_write(const char *src, const char *dst) {
    int src_fd = open(src, O_RDONLY);
    int dst_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    
    char buffer[4096];
    ssize_t bytes;
    
    while ((bytes = read(src_fd, buffer, sizeof(buffer))) > 0) {
        write(dst_fd, buffer, bytes);
    }
    
    close(src_fd);
    close(dst_fd);
}

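// Method 2: sendfile (file → file requires Linux 2.6.33 or later)
void method_sendfile(const char *src, const char *dst) {
    int src_fd = open(src, O_RDONLY);
    int dst_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    struct stat st;
    fstat(src_fd, &st);

    // sendfile may copy less than requested; with a NULL offset each call
    // continues from the current file position of src_fd
    off_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t n = sendfile(dst_fd, src_fd, NULL, remaining);
        if (n <= 0) {
            break;
        }
        remaining -= n;
    }

    close(src_fd);
    close(dst_fd);
}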

// Method 3: mmap
void method_mmap(const char *src, const char *dst) {
    int src_fd = open(src, O_RDONLY);
    int dst_fd = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);

    struct stat st;
    fstat(src_fd, &st);

    // The destination must already have its final size before it is mapped
    ftruncate(dst_fd, st.st_size);

    void *src_map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, src_fd, 0);
    void *dst_map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, dst_fd, 0);

    if (src_map != MAP_FAILED && dst_map != MAP_FAILED) {
        memcpy(dst_map, src_map, st.st_size);
    }

    if (src_map != MAP_FAILED) munmap(src_map, st.st_size);
    if (dst_map != MAP_FAILED) munmap(dst_map, st.st_size);

    close(src_fd);
    close(dst_fd);
}

int main() {
    const char *src = "/tmp/test_file.dat";
    const char *dst = "/tmp/output.dat";
    
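    // Create the test file (100 MB). It was just written, so the reads below
    // are served from the PageCache rather than from the disk.
    system("dd if=/dev/zero of=/tmp/test_file.dat bs=1M count=100 2>/dev/null");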
    
    double start, end;
    
    // Test 1: read/write
    start = get_time();
    method_read_write(src, dst);
    end = get_time();
    printf("read/write:  %.3f seconds\n", end - start);
    
    // Test 2: sendfile
    start = get_time();
    method_sendfile(src, dst);
    end = get_time();
    printf("sendfile:    %.3f seconds\n", end - start);
    
    // Test 3: mmap
    start = get_time();
    method_mmap(src, dst);
    end = get_time();
    printf("mmap:        %.3f seconds\n", end - start);
    
    unlink(src);
    unlink(dst);
    
    return 0;
}

Build and run:

gcc -o zero_copy_benchmark zero_copy_benchmark.c
./zero_copy_benchmark

# Typical output (results vary with hardware):
# read/write:  0.250 seconds
# sendfile:    0.080 seconds
# mmap:        0.120 seconds