Pandas - read_html()
📅 Published: 2026/6/21 23:07:43
When I ran the code below:

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_population')

I got this error:

HTTPError: HTTP Error 403: Forbidden

I can open that URL in a browser.
 
ChatGPT said:

This is a common issue when calling pandas.read_html() directly on a Wikipedia URL.
Even though you can open the page in a browser, Wikipedia rejects requests whose User-Agent header is missing or identifies a generic script. read_html() fetches URLs through urllib, which sends a default User-Agent such as Python-urllib/3.x, so the request is often rejected with HTTP 403 Forbidden.

✅ Why your code failed

  • Browser requests carry headers such as User-Agent and cookies, so Wikipedia accepts them.

  • The default request made by pandas.read_html() identifies itself as a script, so Wikipedia treats it as an automated bot and answers with 403 Forbidden.
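
To see what the default request looks like, the sketch below inspects the User-Agent that a plain urllib opener attaches to every request. It makes no network call. The helper name default_urllib_user_agent is made up for this illustration, and the exact version suffix depends on your Python installation.

```python
import urllib.request


def default_urllib_user_agent() -> str:
    """Return the User-Agent that a plain urllib opener sends by default."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request
    headers = dict(opener.addheaders)
    return headers["User-agent"]


# Prints something like "Python-urllib/3.x", which does not look like a browser
print(default_urllib_user_agent())
```

A value of this form tells the server that a script, not a browser, is making the request.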

✅ How to fix it

Supply a browser-like User-Agent header yourself: download the page with requests, then pass the returned HTML to read_html().

✔ Working code

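from io import StringIO

import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
headers = {"User-Agent": "Mozilla/5.0"}  # browser-like User-Agent

response = requests.get(url, headers=headers)
response.raise_for_status()   # raise on 4xx/5xx instead of parsing an error page

# Wrap the HTML in StringIO: recent pandas versions deprecate passing a raw HTML string
dfs = pd.read_html(StringIO(response.text))
print(len(dfs))   # number of tables found on the page
dfs[0].head()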

This works because:

  • requests.get() sends the browser-like User-Agent set in headers instead of the library's default one.

  • Wikipedia returns the page.

  • pd.read_html() parses tables from the HTML content.

You should see the first table as expected.
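
If you would rather not depend on requests, the same idea can be sketched with the standard library alone. This is a minimal sketch, not the only way to do it: build_request, fetch_html and read_tables are hypothetical helper names, the 30-second timeout is an arbitrary choice, and pandas is assumed to be installed with an HTML parser such as lxml.

```python
from io import StringIO
import urllib.request

USER_AGENT = "Mozilla/5.0"  # same generic value used in the answer above


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Create a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page and decode it as text (HTTPError is raised on 4xx/5xx)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)


def read_tables(url: str):
    """Return every table on the page as a list of DataFrames."""
    import pandas as pd  # imported here so the helpers above work without pandas

    return pd.read_html(StringIO(fetch_html(url)))
```

Calling read_tables("https://en.wikipedia.org/wiki/List_of_countries_by_population") should then return the same list of DataFrames as the requests version. Which table holds the data you want may change as the page is edited, so check len() and a few entries rather than relying on index 0.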

 

You said:
In the code you gave: headers = {"User-Agent": "Mozilla/5.0"} But my browser is Chrome, will it still work or do I need to modify the code?
 
ChatGPT said:

You do not need to modify anything.
"Mozilla/5.0" is a generic, browser-like User-Agent value, and it is generally enough for Wikipedia to accept the request.

Why it works

Most modern browsers (including Chrome, Edge, Firefox) send a User-Agent string that begins with:

Mozilla/5.0

For example, your actual Chrome User-Agent probably looks like:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/142.0.0.0 Safari/537.36

In practice, Wikipedia does not appear to require the full string. The short value is generally accepted, whereas the default User-Agent sent by a script is rejected.

✔ Conclusion

  • headers = {"User-Agent": "Mozilla/5.0"} is enough

  • No need to pretend to be Chrome specifically

  • The code will work exactly the same

 
