
Stop Memorizing MDP Formulas! Build a Reinforcement Learning Snake Game in Python to Understand Markov Decision Processes

news 2026/5/28 1:42:42

Building Snake in Python: Understanding Markov Decision Processes from Scratch

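In coffee shops, I often see students frowning over thick reinforcement learning textbooks; the abstract symbols and derivations really can be intimidating. Then one day I had my students write a simple Snake game in Python, and it suddenly clicked for them: "So an MDP is just the rules of the game!" Learning an abstract concept through a concrete project like this often works far better than memorizing formulas.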

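1. Project setup: building the game skeleton

We start with the most basic Snake implementation. This version needs no graphical interface; a character matrix is enough to show the game state clearly: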

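    import random

    import numpy as np


    class SnakeGame:
        def __init__(self, width=10, height=10):
            self.width = width
            self.height = height
            self.reset()

        def reset(self):
            # Later sections call reset() and _check_collision() without showing
            # them; these are minimal assumed versions.
            self.snake = [(self.width // 2, self.height // 2)]  # snake starts at the center
            self.direction = (1, 0)  # initially moving right
            self.food = self._generate_food()
            self.score = 0
            self.game_over = False

        def _generate_food(self):
            # Resample until the food lands on a cell the snake does not occupy
            while True:
                food = (random.randint(0, self.width - 1),
                        random.randint(0, self.height - 1))
                if food not in self.snake:
                    return food

        def _check_collision(self, position):
            # True if the position is outside the board or on the snake's body
            x, y = position
            if x < 0 or x >= self.width or y < 0 or y >= self.height:
                return True
            return position in self.snake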
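The character-matrix display mentioned above is not shown in the listing. A minimal sketch of one way to draw it follows, assuming the same `(x, y)` convention as the game (y is the row index and grows downward). The function name `render` and the symbols `H`, `o`, `*`, `.` are illustrative choices, not part of the original code.

```python
def render(width, height, snake, food):
    """Return the board as text: 'H' head, 'o' body, '*' food, '.' empty."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    food_x, food_y = food
    grid[food_y][food_x] = "*"
    for i, (x, y) in enumerate(snake):
        # The first element of the snake list is the head
        grid[y][x] = "H" if i == 0 else "o"
    return "\n".join("".join(row) for row in grid)


# A 5x3 board: head at (2, 1), one body segment at (1, 1), food at (4, 0)
board = render(5, 3, snake=[(2, 1), (1, 1)], food=(4, 0))
print(board)
```

Because the function takes plain values rather than a game object, it can be called as `render(game.width, game.height, game.snake, game.food)` after every step to watch an episode unfold in the terminal.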

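This basic skeleton already contains several key elements of an MDP:

State space (𝒮): made up of the snake's body positions, the food position, and the direction of movement
Action set (𝒜): conceptually the four directions {up, down, left, right}; the code in section 2.2 uses three relative actions (go straight, turn right, turn left), which rules out reversing direction
Reward function (ℛ): +1 for eating food, -10 for hitting a wall or the snake's own body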

2. Mapping the MDP five-tuple to code

Let's map each theoretical concept onto the code, one element at a time.
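For reference, the five-tuple is usually written as below. This is the standard textbook form rather than anything specific to this project; in the Snake code, the discount factor γ shows up as `discount_factor = 0.95` in the agent.

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma), \qquad \gamma \in [0, 1]

\mathcal{P}(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)

\mathcal{R}(s, a) = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right]
```

The "Markov" part is the assumption that the next state and reward depend only on the current state and action, not on the earlier history.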

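2.1 Implementing the state space (𝒮)

In Snake, the complete state should include:

The head coordinates (x, y)
The list of coordinates of each body segment
The food coordinates
The current direction of movement

Storing all of that would make the state space enormous, so the `get_state` method below compresses it into 11 binary features: where the food lies relative to the head, the current heading, and whether a collision is imminent.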

    # SnakeGame method (add inside the class from section 1)
    def get_state(self):
        head_x, head_y = self.snake[0]
        food_x, food_y = self.food
        direction_x, direction_y = self.direction

        # Position of the food relative to the head (y grows downward)
        food_left = food_x < head_x
        food_right = food_x > head_x
        food_up = food_y < head_y
        food_down = food_y > head_y

        # Danger detection: would going straight, left, or right cause a collision?
        danger_straight = self._check_collision(
            (head_x + direction_x, head_y + direction_y))
        danger_left = self._check_collision(
            (head_x + direction_y, head_y - direction_x))
        danger_right = self._check_collision(
            (head_x - direction_y, head_y + direction_x))

        return np.array([
            food_left, food_right, food_up, food_down,
            direction_x == 1, direction_x == -1,
            direction_y == 1, direction_y == -1,
            danger_straight, danger_left, danger_right
        ], dtype=int)
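Because every feature is 0 or 1, the 11-element vector can take at most 2^11 = 2048 distinct values, which is small enough for a lookup table. Some of those combinations can never occur (the food cannot be both left and right of the head), so 2048 is an upper bound. The sketch below shows one way to pack the vector into a single integer and unpack it again; `encode_state` and `decode_state` are hypothetical helper names introduced only for this illustration.

```python
import numpy as np


def encode_state(features):
    """Pack a 0/1 feature vector into one integer (first feature = highest bit)."""
    index = 0
    for bit in features:
        index = (index << 1) | int(bit)
    return index


def decode_state(index, n_features=11):
    """Inverse of encode_state: recover the 0/1 feature vector from the integer."""
    return np.array([(index >> shift) & 1
                     for shift in range(n_features - 1, -1, -1)])


# Food to the right, heading right, no danger in any direction
features = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0])
idx = encode_state(features)
print(idx)                 # 2**9 + 2**6 = 576
print(decode_state(idx))   # the original feature vector
print(2 ** len(features))  # upper bound on the number of distinct states: 2048
```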

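2.2 Action set (𝒜) and state transitions (𝒫)

Snake's action space is discrete. A player steers in four absolute directions, but the `step` method here takes three relative actions (0 = straight, 1 = turn right, 2 = turn left). In this basic version the movement itself is deterministic: the same state and action always produce the same new head position. The only randomness is where new food appears after the snake eats.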

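    # SnakeGame method (add inside the class from section 1)
    def step(self, action):
        # Relative actions: 0 = straight, 1 = turn right, 2 = turn left.
        # y grows downward, so a right turn cycles right -> down -> left -> up,
        # which matches danger_right / danger_left in get_state.
        dx, dy = self.direction
        if action == 1:  # turn right
            self.direction = (-dy, dx)
        elif action == 2:  # turn left
            self.direction = (dy, -dx)

        # Compute the new head position
        new_head = (self.snake[0][0] + self.direction[0],
                    self.snake[0][1] + self.direction[1])

        # Collision check: hitting a wall or the body ends the episode
        if self._check_collision(new_head):
            self.game_over = True
            return self.get_state(), -10, True

        # Move the snake by adding the new head
        self.snake.insert(0, new_head)

        # Check whether the food was eaten
        if new_head == self.food:
            self.score += 1
            self.food = self._generate_food()
            return self.get_state(), 1, False

        self.snake.pop()  # no food eaten: drop the tail so the length stays the same
        return self.get_state(), -0.1, False  # small penalty encourages reaching food quickly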
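The relative-turn logic is easier to check in isolation. The sketch below pulls it out into a pure function, assuming the screen convention used by `get_state` (y grows downward, so "up" is `(0, -1)`). The name `turn` is hypothetical and is not part of the original class.

```python
def turn(direction, action):
    """Return the new heading after a relative action.

    0 = straight, 1 = turn right (clockwise on screen), 2 = turn left.
    """
    dx, dy = direction
    if action == 1:
        return (-dy, dx)
    if action == 2:
        return (dy, -dx)
    return (dx, dy)


# Four right turns starting from "right" should visit down, left, up, right
heading = (1, 0)
visited = []
for _ in range(4):
    heading = turn(heading, 1)
    visited.append(heading)
print(visited)
```

A right turn followed by a left turn returns the original heading, and no sequence of single actions can reverse the snake in one step. That is why three relative actions are enough.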

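2.3 Design tips for the reward function (ℛ)

A well-designed reward function is key to making reinforcement learning work. In Snake, we can consider:

Event	Reward	Design intent
Eating food	+1	Primary goal
Hitting a wall or itself	-10	Avoid dying
Each step	-0.1	Encourage efficiency
Moving closer to food	+0.2 (added to the step penalty)	Guide behavior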

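    # SnakeGame method; returns (reward, done).
    # Call it before new_head is inserted, while self.snake[0] is still the old head.
    def _calculate_reward(self, new_head):
        if self._check_collision(new_head):
            return -10, True

        if new_head == self.food:
            return 1, False

        reward = -0.1  # base penalty for every step

        # Change in Manhattan distance to the food
        old_dist = (abs(self.snake[0][0] - self.food[0])
                    + abs(self.snake[0][1] - self.food[1]))
        new_dist = (abs(new_head[0] - self.food[0])
                    + abs(new_head[1] - self.food[1]))
        if new_dist < old_dist:
            reward += 0.2  # bonus for getting closer

        return reward, False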
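What the agent ultimately optimizes is not a single reward but the discounted sum of rewards over an episode. The sketch below computes that return for two short, hand-written reward sequences built from the table values (step penalty -0.1, food +1, crash -10), using γ = 0.95 as in the agent defined later. The sequences are illustrative inputs, not measured results, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma=0.95):
    """Compute G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... by working backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


path_to_food = [-0.1, -0.1, 1.0]   # two plain steps, then eat
crash = [-0.1, -10.0]              # one step, then hit the wall

print(round(discounted_return(path_to_food), 4))  # 0.7075
print(round(discounted_return(crash), 4))         # -9.6
```

The large gap between the two returns is what pushes the agent to treat survival as more important than any single piece of food.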

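3. Introducing the Q-learning algorithm

With the MDP framework in place, we can now bring in one of the simplest reinforcement learning algorithms: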

    # Uses random and numpy (np) as imported in section 1
    class QLearningAgent:
        def __init__(self, state_size, action_size):
            self.state_size = state_size
            self.action_size = action_size
            # One row per possible binary feature vector (simplified state encoding)
            self.q_table = np.zeros((2 ** state_size, action_size))
            self.learning_rate = 0.1
            self.discount_factor = 0.95
            self.epsilon = 0.1

        def _state_index(self, state):
            # Read the 0/1 feature vector as a binary number to get a table row
            return int("".join(str(int(bit)) for bit in state), 2)

        def get_action(self, state):
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < self.epsilon:
                return random.randint(0, self.action_size - 1)
            return int(np.argmax(self.q_table[self._state_index(state)]))

        def learn(self, state, action, reward, next_state, done):
            state_idx = self._state_index(state)
            next_idx = self._state_index(next_state)

            # Q-learning update: Q(s,a) += lr * (r + gamma * max Q(s',a') - Q(s,a))
            current_q = self.q_table[state_idx, action]
            max_next_q = np.max(self.q_table[next_idx])
            # (1 - done) drops the bootstrap term on terminal transitions
            target = reward + self.discount_factor * max_next_q * (1 - int(done))
            self.q_table[state_idx, action] = (
                current_q + self.learning_rate * (target - current_q))
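To make the update rule concrete, here is a single update worked through by hand with the same hyperparameters as the agent (learning rate 0.1, discount factor 0.95). The function `q_update` is a standalone sketch of the formula, and the Q-values fed into it are made-up numbers chosen for easy arithmetic.

```python
def q_update(current_q, reward, max_next_q, done,
             learning_rate=0.1, discount_factor=0.95):
    """One Q-learning step: move current_q a fraction of the way toward the target."""
    # On a terminal transition there is no next state to bootstrap from
    target = reward + discount_factor * max_next_q * (1 - int(done))
    return current_q + learning_rate * (target - current_q)


# Eating food: target = 1 + 0.95 * 2.0 = 2.9, so Q moves from 0.0 to 0.29
eat = q_update(current_q=0.0, reward=1.0, max_next_q=2.0, done=False)

# Crashing: target = -10 (next-state value ignored), so Q moves from 0.5 to -0.55
crash = q_update(current_q=0.5, reward=-10.0, max_next_q=2.0, done=True)

print(round(eat, 4), round(crash, 4))
```

Note how the crash update ignores `max_next_q` entirely. Without the `(1 - done)` factor, the value of a state that no longer matters would leak into the estimate.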

4. Training and visualization

The training loop ties the game and the learning process together:

    def train_agent(episodes=1000):
        env = SnakeGame(8, 8)
        # 3 actions: straight / turn right / turn left
        agent = QLearningAgent(state_size=11, action_size=3)

        for episode in range(episodes):
            state = env.get_state()
            total_reward = 0

            while not env.game_over:
                # Choose and execute an action
                action = agent.get_action(state)
                next_state, reward, done = env.step(action)

                # Learn from the transition
                agent.learn(state, action, reward, next_state, done)
                state = next_state
                total_reward += reward

            # Report progress every 100 episodes
            if episode % 100 == 0:
                print(f"Episode {episode}, Score: {env.score}, Reward: {total_reward:.1f}")

            # Requires SnakeGame to provide reset(), restoring the initial state
            env.reset()

        return agent
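Per-episode scores are noisy, so printing every hundredth one gives only a rough picture. A common way to make progress visible is to smooth the scores with a moving average and draw them as text bars. The sketch below assumes the scores were collected into a list during training, which the loop above does not do as written. The `scores` values are placeholder inputs for demonstration, not real training results, and `moving_average` and `text_bar` are hypothetical helper names.

```python
import numpy as np


def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


def text_bar(value, scale=2, char="#"):
    """Draw a value as a row of characters for a quick terminal plot."""
    return char * int(round(value * scale))


scores = [0, 1, 0, 2, 3, 2, 4, 5]  # placeholder data, not measured results
smoothed = moving_average(scores, window=4)
for i, value in enumerate(smoothed):
    print(f"{i:2d} {value:5.2f} {text_bar(value)}")
```

In a real run, a larger window (for example 50 or 100 episodes) would be more appropriate, since a single episode's score depends heavily on where the food happens to appear.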

With this project, the abstract concepts of an MDP become tangible: