PySpark - CountVectorizer

📅 Published: 2026/6/20 18:37:30

In Spark, I ran the code snippet below:

from pyspark.ml.feature import CountVectorizer

# vocabSize=7 caps the vocabulary at 7 distinct words (indices 0 to 6).
# The fitted model counts how many times each vocabulary word appears in each document.
cv = CountVectorizer(inputCol='words', outputCol='features', vocabSize=7)
model = cv.fit(df)  # df must have an array-of-strings column named 'words'
result = model.transform(df)
result.show(truncate=False)

The output is shown below:

+---+------------------------------+-------------------------------+
|id |words                         |features                       |
+---+------------------------------+-------------------------------+
|0  |[apple, banana, orange, grape]|(7,[0,1,3,4],[1.0,1.0,1.0,1.0])|
|1  |[apple, banana, cherry, pear] |(7,[1,4,5,6],[1.0,1.0,1.0,1.0])|
|2  |[banana, cherry, grape, kiwi] |(7,[1,2,3,5],[1.0,1.0,1.0,1.0])|
|3  |[apple, pear, kiwi, orange]   |(7,[0,2,4,6],[1.0,1.0,1.0,1.0])|
|4  |[cherry, grape, kiwi, orange] |(7,[0,2,3,5],[1.0,1.0,1.0,1.0])|
+---+------------------------------+-------------------------------+

Please help me understand the values in the features column.

 

✅ Understanding the features Column

Each features value is a SparseVector, represented as:

(size, [indices], [values])

Where:

  • size = length of the fitted vocabulary (here 7, the limit set by vocabSize=7)

  • indices = vocabulary positions of the words that appear in the row, in ascending order

  • values = how many times each of those words appears in the row (CountVectorizer emits raw counts, not TF-IDF weights)
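
To make the triple concrete, here is a minimal pure-Python sketch that expands the first row's sparse vector into its dense form. The helper name to_dense is made up for illustration; in PySpark itself, a SparseVector offers toArray() for the same purpose.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a plain list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Row 0 from the output table: (7,[0,1,3,4],[1.0,1.0,1.0,1.0])
row0 = to_dense(7, [0, 1, 3, 4], [1.0, 1.0, 1.0, 1.0])
print(row0)  # [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
```

Positions 2, 5 and 6 stay at 0.0 because the words at those vocabulary positions do not occur in row 0.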


✅ 1. Vocabulary Built by CountVectorizer

Spark ranks words by descending frequency across all rows and keeps the top vocabSize words (here, 7).

Let’s manually count frequencies:

Across all rows:

word     count
apple    3
banana   3
orange   3
grape    3
cherry   3
pear     2
kiwi     3

All except pear (2) appear 3 times.
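
The manual tally can be double-checked with a few lines of plain Python. This is only a sketch: the docs list is retyped from the words column of the output table, and collections.Counter stands in for the counting Spark does internally.

```python
from collections import Counter

docs = [
    ["apple", "banana", "orange", "grape"],
    ["apple", "banana", "cherry", "pear"],
    ["banana", "cherry", "grape", "kiwi"],
    ["apple", "pear", "kiwi", "orange"],
    ["cherry", "grape", "kiwi", "orange"],
]

# Total occurrences of each word across all rows.
counts = Counter(word for doc in docs for word in doc)
for word, n in sorted(counts.items()):
    print(word, n)
```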

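So Spark will create a vocabulary of length 7, which covers all seven distinct words.
Six of those words are tied at 3 occurrences. Ties are not broken alphabetically, and the order among tied words should not be relied on. Only pear, the least frequent word, has a predictable position: last, at index 6.

So the vocabulary below is only a guess at the ordering:

Index → Word
0 → apple
1 → banana
2 → cherry
3 → grape
4 → orange
5 → kiwi
6 → pear

(Print model.vocabulary to see the actual order.)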

 

model.vocabulary
['orange', 'banana', 'kiwi', 'grape', 'apple', 'cherry', 'pear']

 

ChatGPT said:

Great. Now that we have the actual vocabulary, we can decode every features vector precisely.

Your vocabulary is:

Index → Word
0 → orange
1 → banana
2 → kiwi
3 → grape
4 → apple
5 → cherry
6 → pear
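
The same mapping can be held in code. The sketch below uses a plain Python list with the values printed by model.vocabulary; word_to_index is an illustrative name, not part of the Spark API.

```python
vocabulary = ['orange', 'banana', 'kiwi', 'grape', 'apple', 'cherry', 'pear']

# A word's position in the list is the index used in the features vectors.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

print(vocabulary[3])           # grape
print(word_to_index["apple"])  # 4
```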

 


✅ Now decode each row

Row 0

words = [apple, banana, orange, grape]
features = (7,[0,1,3,4],[1.0,1.0,1.0,1.0])

Indices → words:

  • 0 → orange ✔

  • 1 → banana ✔

  • 3 → grape ✔

  • 4 → apple ✔

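This is exactly the set of words in the input list. The word order differs only because the vector stores indices in ascending order.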


Row 1

words = [apple, banana, cherry, pear]
features = (7,[1,4,5,6],[1.0,1.0,1.0,1.0])

Indices → words:

  • 1 → banana ✔

  • 4 → apple ✔

  • 5 → cherry ✔

  • 6 → pear ✔

Matches the input list.


Row 2

words = [banana, cherry, grape, kiwi]
features = (7,[1,2,3,5],[1.0,1.0,1.0,1.0])

Indices → words:

  • 1 → banana ✔

  • 2 → kiwi ✔

  • 3 → grape ✔

  • 5 → cherry ✔

Matches the input list.


Row 3

words = [apple, pear, kiwi, orange]
features = (7,[0,2,4,6],[1.0,1.0,1.0,1.0])

Indices → words:

  • 0 → orange ✔

  • 2 → kiwi ✔

  • 4 → apple ✔

  • 6 → pear ✔

Matches the input list.


Row 4

words = [cherry, grape, kiwi, orange]
features = (7,[0,2,3,5],[1.0,1.0,1.0,1.0])

Indices → words:

  • 0 → orange ✔

  • 2 → kiwi ✔

  • 3 → grape ✔

  • 5 → cherry ✔

Matches the input list.
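
The five checks above can also be run mechanically. This is a small sketch in plain Python, with the words and indices retyped from the output table; decode is a hypothetical helper, not a Spark function.

```python
vocabulary = ['orange', 'banana', 'kiwi', 'grape', 'apple', 'cherry', 'pear']

# (words, indices) pairs copied from the output table.
rows = [
    (["apple", "banana", "orange", "grape"], [0, 1, 3, 4]),
    (["apple", "banana", "cherry", "pear"], [1, 4, 5, 6]),
    (["banana", "cherry", "grape", "kiwi"], [1, 2, 3, 5]),
    (["apple", "pear", "kiwi", "orange"], [0, 2, 4, 6]),
    (["cherry", "grape", "kiwi", "orange"], [0, 2, 3, 5]),
]


def decode(indices, vocabulary):
    """Map vector indices back to vocabulary words."""
    return [vocabulary[i] for i in indices]


for row_id, (words, indices) in enumerate(rows):
    decoded = decode(indices, vocabulary)
    # Compare as sets: the vector stores indices in ascending order,
    # not in the order the words appeared in the row.
    print(row_id, decoded, set(decoded) == set(words))
```

Every row should report True, confirming that each vector names the same words as its input list.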


🎯 Summary: How to interpret the features column

Each features entry:

(size, [indices], [values])

means:

  • indices correspond to positions in model.vocabulary

  • values are the counts of those words

  • Example: index 3 means the word "grape", because model.vocabulary[3] is "grape"

Your sparse vectors are exactly correct given the vocabulary.
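
Going the other way, the sketch below rebuilds a features triple from a list of words. It is a simplified pure-Python imitation under the default settings as I understand them (raw counts, out-of-vocabulary words ignored), not Spark's actual implementation. The encode helper and the second test input, with a repeated "kiwi" and the unseen word "mango", are made up for illustration.

```python
from collections import Counter

vocabulary = ['orange', 'banana', 'kiwi', 'grape', 'apple', 'cherry', 'pear']


def encode(words, vocabulary):
    """Return a (size, indices, values) triple of raw counts for `words`."""
    word_to_index = {w: i for i, w in enumerate(vocabulary)}
    # Words missing from the vocabulary are skipped.
    counts = Counter(word_to_index[w] for w in words if w in word_to_index)
    indices = sorted(counts)
    values = [float(counts[i]) for i in indices]
    return len(vocabulary), indices, values


# Row 0 of the original data reproduces the vector from the output table.
print(encode(["apple", "banana", "orange", "grape"], vocabulary))
# (7, [0, 1, 3, 4], [1.0, 1.0, 1.0, 1.0])

# Hypothetical input: a repeated word gives a count above 1.
print(encode(["kiwi", "kiwi", "mango"], vocabulary))
# (7, [2], [2.0])
```

The second call shows why values holds counts rather than flags: "kiwi" occurs twice, so index 2 carries 2.0.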

 
