
No More Manual Tinkering: One-Command Automated Deployment of NVIDIA Drivers and CUDA on Ubuntu 20.04/22.04 with Ansible

news 2026/6/15 5:57:15

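Deep Automation: A Hands-On Guide to Deploying NVIDIA Drivers and CUDA End to End with Ansible

In lab server clusters and dev/test environments, setting up the GPU compute stack is often the biggest pain point for a technical team. Picture this: 20 newly delivered servers all need the same deep learning environment. Installing by hand means repeating the driver download, dependency installation, and configuration edits dozens of times, which is slow, tedious, and prone to small mistakes that leave the machines subtly different from one another. With an "infrastructure as code" approach built on Ansible, a single Playbook brings every machine to the same, consistent state.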

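1. Environment Preparation and Basic Ansible Configuration

1.1 Setting Up the Ansible Control Node

Ansible's strength lies in its agentless architecture: the software is installed only on the control node, which then manages all target machines over SSH. On Ubuntu, the following commands set up the base environment: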

    # Refresh package lists and install the prerequisites
    sudo apt update && sudo apt upgrade -y
    sudo apt install -y software-properties-common

    # Add the official Ansible PPA and install Ansible
    sudo add-apt-repository --yes --update ppa:ansible/ansible
    sudo apt install -y ansible

    # Verify the installation
    ansible --version | head -n 1

Once the control node is ready, define the target server group in /etc/ansible/hosts. For example, create a dedicated group for the GPU server cluster:

    [gpu_servers]
    server1 ansible_host=192.168.1.101
    server2 ansible_host=192.168.1.102
    server3 ansible_host=192.168.1.103

    [gpu_servers:vars]
    ansible_user=admin
    ansible_ssh_private_key_file=~/.ssh/gpu_cluster_key

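1.2 Pre-Flight Checks on Target Nodes

Before the actual deployment, use Ansible ad-hoc commands to quickly verify connectivity and collect basic information from each node: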

    # Check connectivity to all nodes
    ansible gpu_servers -m ping

    # Collect distribution information from each node
    ansible gpu_servers -m setup -a "filter=ansible_distribution*"
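The same checks could also be expressed as tasks, so that a Playbook run stops early on an unsupported host. The following is a minimal sketch, not part of the original Playbook: the task names and the list of accepted releases are illustrative, and it assumes `pciutils` (which provides `lspci`) is installed on the targets.

```yaml
# Hypothetical pre-flight tasks; relies on facts gathered at the start of the play
- name: Assert a supported Ubuntu release
  assert:
    that:
      - ansible_distribution == "Ubuntu"
      - ansible_distribution_version in ["20.04", "22.04"]
    fail_msg: "Unsupported OS: {{ ansible_distribution }} {{ ansible_distribution_version }}"

- name: Look for an NVIDIA PCI device
  # shell (not command) because the line uses a pipe
  shell: lspci | grep -i nvidia
  register: gpu_probe
  changed_when: false
  failed_when: gpu_probe.rc != 0
```

Failing here is cheaper than discovering halfway through a driver installation that a host has no NVIDIA device or runs a different release.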

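For the driver installation to go smoothly, the target systems need a basic build toolchain. The following Playbook fragment installs these dependencies automatically: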

    - name: Install essential build tools
      apt:
        name: ["build-essential", "cmake", "linux-headers-generic"]
        state: present
        update_cache: yes

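2. Automated NVIDIA Driver Installation

2.1 Using the Official Repository

Ubuntu's built-in graphical driver installer is simple, but it does not scale to batch deployment. Instead, Ansible can add the official NVIDIA repository and install a specific driver version: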

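    # apt_repository has no key_url parameter, so the signing key is fetched in its own task
    - name: Download the NVIDIA repository signing key
      get_url:
        url: "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu{{ ansible_distribution_version | replace('.', '') }}/x86_64/3bf863cc.pub"
        dest: /usr/share/keyrings/nvidia-cuda-archive-keyring.asc
        mode: "0644"

    - name: Add NVIDIA GPU driver repository
      apt_repository:
        repo: "deb [signed-by=/usr/share/keyrings/nvidia-cuda-archive-keyring.asc] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu{{ ansible_distribution_version | replace('.', '') }}/x86_64 /"
        state: present

    - name: Install specific version of NVIDIA driver
      apt:
        name: "nvidia-driver-535"
        state: present
        update_cache: yes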

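Tip: choose the driver version according to the actual GPU model and the CUDA release you need; the ubuntu-drivers devices command lists the recommended version.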
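The `ubuntu-drivers devices` lookup from the tip can be run across the whole group and printed per host. A small sketch, assuming the `ubuntu-drivers-common` package provides the command on the targets; the registered variable name `driver_candidates` is made up for this example.

```yaml
- name: Ensure the ubuntu-drivers tool is available
  apt:
    name: ubuntu-drivers-common
    state: present

- name: Query the recommended drivers
  command: ubuntu-drivers devices
  register: driver_candidates
  changed_when: false

- name: Show the recommendation for each host
  debug:
    var: driver_candidates.stdout_lines
```

The output is only informational here; picking the version still remains a deliberate choice made in the Playbook variables.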

2.2 Safely Disabling the nouveau Driver

The default nouveau driver must be disabled before the NVIDIA driver is installed. Ansible can handle this step automatically:

    - name: Blacklist nouveau driver
      blockinfile:
        path: /etc/modprobe.d/blacklist-nouveau.conf
        block: |
          blacklist nouveau
          options nouveau modeset=0
        create: yes

    - name: Update initramfs
      command: update-initramfs -u

    - name: Reboot to apply changes
      reboot:
        msg: "Rebooting to disable nouveau"
        connect_timeout: 5
        reboot_timeout: 600
        pre_reboot_delay: 0
        post_reboot_delay: 30

The following task verifies that nouveau has actually been disabled:

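    - name: Verify nouveau is disabled
      # shell is required here: the command module does not interpret pipes
      shell: lsmod | grep nouveau
      register: nouveau_status
      changed_when: false
      # grep exits non-zero when nothing matches, so only a non-empty match counts as failure
      failed_when: nouveau_status.stdout != ""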

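3. Automated CUDA Toolkit Deployment

3.1 Installing a Specific CUDA Version

Different deep learning frameworks require different CUDA Toolkit versions. The following Playbook shows how to install CUDA 11.6 from the official NVIDIA repository: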

    - name: Install CUDA toolkit
      apt:
        name:
          - "cuda-toolkit-11-6"
          - "libcudnn8"
          - "libcudnn8-dev"
        state: present

    - name: Set CUDA environment variables
      blockinfile:
        path: /etc/profile.d/cuda.sh
        block: |
          export PATH=/usr/local/cuda-11.6/bin${PATH:+:${PATH}}
          export LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
        create: yes
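To switch CUDA versions without editing the tasks, the version could be taken from a variable. This is a sketch under the assumption that a play-level variable `cuda_version` holds the package suffix (for example "11-6"); the derived directory name follows the `/usr/local/cuda-11.6` pattern used above.

```yaml
# Assumes cuda_version is defined at play level, e.g. cuda_version: "11-6"
- name: Install the CUDA toolkit selected by cuda_version
  apt:
    name: "cuda-toolkit-{{ cuda_version }}"
    state: present
    update_cache: yes

- name: Export the matching CUDA paths for login shells
  copy:
    dest: /etc/profile.d/cuda.sh
    mode: "0644"
    # "11-6" becomes "11.6" to match the install directory name
    content: |
      export PATH=/usr/local/cuda-{{ cuda_version | replace('-', '.') }}/bin${PATH:+:${PATH}}
      export LD_LIBRARY_PATH=/usr/local/cuda-{{ cuda_version | replace('-', '.') }}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

Because `copy` rewrites the whole file, this variant replaces the `blockinfile` task rather than complementing it; use one or the other.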

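3.2 Managing Driver and CUDA Version Compatibility

NVIDIA drivers and CUDA releases have strict compatibility requirements. With Ansible's template module, a version compatibility table can be rendered onto each host for reference: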

    - name: Generate version compatibility table
      template:
        src: cuda_versions.j2
        dest: /etc/cuda_versions.txt

Example content of the corresponding Jinja2 template file, cuda_versions.j2:

    Supported CUDA and Driver Version Combinations:
    -------------------------------------------------
    | CUDA Version | Min Driver Version | Tested With |
    |--------------|--------------------|-------------|
    | 12.x         | 525.60.13          | 530.30.02   |
    | 11.8         | 450.80.02          | 520.56.06   |
    | 11.6         | 450.80.02          | 510.47.03   |
    | 11.4         | 450.80.02          | 470.82.01   |
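A table in a text file documents the requirement but does not enforce it. One possible way to enforce it is to compare the installed driver version against the minimum from the table. The sketch below assumes the driver is already installed so that `nvidia-smi` is available; `min_driver_version` is a hypothetical variable filled in from the 11.6 row.

```yaml
- name: Read the installed driver version
  command: nvidia-smi --query-gpu=driver_version --format=csv,noheader
  register: driver_query
  changed_when: false

- name: Assert the driver meets the minimum for the chosen CUDA release
  assert:
    that:
      # One line is printed per GPU; the first is enough since all GPUs share one driver
      - driver_query.stdout_lines[0] is version(min_driver_version, '>=')
    fail_msg: "Driver {{ driver_query.stdout_lines[0] }} is older than {{ min_driver_version }}"
  vars:
    min_driver_version: "450.80.02"
```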

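4. Advanced Configuration and Verification

4.1 Persistence Mode and GPU Health Monitoring

For production environments, enabling NVIDIA persistence mode and setting up monitoring is recommended: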

    - name: Enable NVIDIA persistence mode
      copy:
        dest: /etc/systemd/system/nvidia-persistenced.service
        content: |
          [Unit]
          Description=NVIDIA Persistence Daemon
          Wants=syslog.target

          [Service]
          Type=forking
          ExecStart=/usr/bin/nvidia-persistenced --verbose
          ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

          [Install]
          WantedBy=multi-user.target

    - name: Start and enable persistence service
      systemd:
        name: nvidia-persistenced
        state: started
        enabled: yes
        # Reload unit files so systemd picks up the unit written above
        daemon_reload: yes
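Whether persistence mode is actually active can be checked with an `nvidia-smi` query. The following is a rough sketch of such a health check; the variable name `persistence_query` is made up, and the check relies on `nvidia-smi` reporting the mode as the text "Enabled" or "Disabled".

```yaml
- name: Query persistence mode on every GPU
  command: nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
  register: persistence_query
  changed_when: false

- name: Fail if any GPU reports persistence mode as disabled
  assert:
    that:
      - "'Disabled' not in persistence_query.stdout"
    fail_msg: "Persistence mode is off on at least one GPU: {{ persistence_query.stdout_lines }}"
```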

4.2 Automated Verification

After deployment, automated tasks should confirm that the environment is configured correctly:

    - name: Verify NVIDIA driver installation
      command: nvidia-smi
      register: nvidia_smi
      changed_when: false

    - name: Display GPU information
      debug:
        msg: "{{ nvidia_smi.stdout_lines[0:10] | join('\n') }}"

    - name: Verify CUDA compiler
      # Full path: /etc/profile.d/cuda.sh is not sourced for command tasks
      command: /usr/local/cuda/bin/nvcc --version
      register: nvcc_version
      changed_when: false
      ignore_errors: yes

    # Note: starting with CUDA 11.6 the samples are no longer bundled with the
    # toolkit, so this block assumes they are present at the path below
    - name: Check CUDA samples compilation
      block:
        - name: Copy CUDA samples
          copy:
            src: /usr/local/cuda/samples/
            dest: /tmp/cuda_samples
            remote_src: yes

        - name: Compile deviceQuery sample
          command: make
          args:
            chdir: /tmp/cuda_samples/1_Utilities/deviceQuery

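5. Multi-Environment Strategy and Best Practices

5.1 Conditional Deployment

Conditionals make it possible to deploy differently depending on the GPU model: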

    # ansible_local.gpu_info is a custom local fact; it must be provided by a
    # fact file under /etc/ansible/facts.d on each target
    - name: Install drivers based on GPU model
      block:
        - name: Install drivers for Tesla series
          apt:
            name: "nvidia-driver-{{ tesla_driver_version }}"
          when: "'Tesla' in ansible_local.gpu_info.model"

        - name: Install drivers for GeForce series
          apt:
            name: "nvidia-driver-{{ geforce_driver_version }}"
          when: "'GeForce' in ansible_local.gpu_info.model"
      vars:
        tesla_driver_version: "470"
        geforce_driver_version: "535"
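The conditions above read `ansible_local.gpu_info.model`, which Ansible does not provide by default. One way it might be populated is an executable fact script that prints JSON. The sketch below is an assumption about how that could look, not the original setup: it takes the first NVIDIA line from `lspci` (so `pciutils` must be installed), and whether that string contains "Tesla" or "GeForce" depends on the PCI ID database on the host.

```yaml
- name: Ensure the local facts directory exists
  file:
    path: /etc/ansible/facts.d
    state: directory
    mode: "0755"

- name: Install a hypothetical gpu_info fact script
  copy:
    dest: /etc/ansible/facts.d/gpu_info.fact
    mode: "0755"
    # Executable .fact files must print JSON; this one emits {"model": "<lspci description>"}
    content: |
      #!/bin/sh
      model=$(lspci | grep -i nvidia | head -n 1 | sed 's/.*: //')
      printf '{"model": "%s"}\n' "$model"

- name: Re-read local facts so ansible_local.gpu_info is available
  setup:
    filter: ansible_local
```

The script does not escape special characters, so a device description containing a double quote would produce invalid JSON; a more careful version would need to handle that case.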

5.2 Complete Playbook Example

Combining all of the components above yields the complete deployment Playbook:

    ---
    - name: Deploy NVIDIA Driver and CUDA Toolkit
      hosts: gpu_servers
      become: yes
      vars:
        cuda_version: "11-6"
        driver_version: "535"
      tasks:
        - include_tasks: pre_checks.yml
        - include_tasks: disable_nouveau.yml
        - include_tasks: install_drivers.yml
        - include_tasks: install_cuda.yml
        - include_tasks: post_verification.yml

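In a real project, this automated approach cut environment deployment for 50 servers from three person-days to 30 minutes and eliminated the configuration differences caused by manual operation. Its value is even more pronounced in CI/CD pipelines where environments are rebuilt frequently.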
