PySpark - OneHotEncoder

📅 Published: 2026/6/18 17:54:42

 

from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

# Reuse the active session if there is one, otherwise start a new one
spark = SparkSession.builder.getOrCreate()

data = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
columns = ["input1", "input2"]
df = spark.createDataFrame(data, columns)

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"],
)
encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

 

+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |
+------+------+-------------+-------------+

 

 

OneHotEncoder output format

Each encoded column is a SparseVector, shown as:

(size, [indices], [values])

Where:

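  • size = the vector length, which by default is the number of categories minus 1
    (because OneHotEncoder drops the last category to avoid multicollinearity)

  • indices = the position that is set, i.e. the category index of this row; empty ([]) for the dropped last category

  • values = [1.0] when a position is set; empty ([]) for the dropped last category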
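
The notation can be reproduced without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: the helper name `one_hot_sparse` and its tuple return value are made up for illustration, and it assumes category indices start at 0.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values), mimicking the SparseVector display.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} outside 0..{num_categories - 1}")
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    # Only reachable when drop_last is True and index is the last category.
    return (size, [], [])


for idx in range(3):
    print(idx, one_hot_sparse(idx, num_categories=3))
```

For a column with 3 categories, the three results line up with the output1 column in the table above.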


📌 Vocabulary sizes in your case

From the output:

For output1

(2,[0],[1.0]) → size = 2  

Meaning input1 has 3 categories:

0.0 → index 0
1.0 → index 1
2.0 → index dropped (all zeros)

For output2

(1,[0],[1.0]) → size = 1

Meaning input2 has 2 categories:

0.0 → index 0
1.0 → dropped (all zeros)

 


🔍 Now decode your rows


▶ Row 1

input1 = 0.0
input2 = 1.0

output1 = (2,[0],[1.0])
output2 = (1,[],[])

  • input1=0 → category index 0 → one-hot at index 0
    → (2,[0],[1.0])

  • input2=1 is the last category, so the encoder drops it
    → all zeros (1,[],[])


▶ Row 2

input1 = 1.0
input2 = 0.0

output1 = (2,[1],[1.0])
output2 = (1,[0],[1.0])

  • input1=1 → index 1 → (2,[1],[1.0])

  • input2=0 → index 0 → (1,[0],[1.0])


▶ Row 3

input1 = 2.0
input2 = 1.0

output1 = (2,[],[])
output2 = (1,[],[])

  • input1=2 → last category, dropped → (2,[],[])

  • input2=1 → last category, dropped → (1,[],[])
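
The decoding of the three rows can also be written as a small function. This is a sketch under the assumption that the vectors came from a one-hot encoder with at most one active index; `decode_category` is a hypothetical helper, not a PySpark call.

```python
def decode_category(size, indices, drop_last=True):
    """Recover the category index from the parts of a one-hot SparseVector.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    if len(indices) > 1:
        raise ValueError("a one-hot vector has at most one active index")
    if indices:
        return indices[0]
    if drop_last:
        # All zeros means the dropped last category, whose index equals size.
        return size
    raise ValueError("an all-zero vector has no category when nothing is dropped")


# (size, indices) pairs for output1 and output2, taken from the three rows
rows = [
    ((2, [0]), (1, [])),   # row 1
    ((2, [1]), (1, [0])),  # row 2
    ((2, []), (1, [])),    # row 3
]
for out1, out2 in rows:
    print(decode_category(*out1), decode_category(*out2))
```

Each printed pair should match the input1 and input2 values of the corresponding row.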


🎯 Summary

output1

  • 3 categories in input1 → output vector size = 2

  • Encoded as:

    input1=0.0 → (2,[0],[1.0])
    input1=1.0 → (2,[1],[1.0])
    input1=2.0 → (2,[],[])     # dropped category

output2

  • 2 categories in input2 → output vector size = 1

  • Encoded as:

    input2=0.0 → (1,[0],[1.0])
    input2=1.0 → (1,[],[])     # dropped category

If you want, I can also show you how to:

✔ keep all categories (no drop)
✔ extract the category → vector index mapping
✔ convert SparseVector to dense vector for readability

 
You said:
Yes, I want.
 
ChatGPT said:

Great — here are the three things you asked for, clearly explained with ready-to-use PySpark code:

✔ (A) Keep all categories (no drop)
✔ (B) Get the category → vector index mapping
✔ (C) Convert SparseVector to DenseVector for readability


🅰 A. Keep all categories (no drop)

By default, PySpark OneHotEncoder drops the last category to avoid multicollinearity (dummy variable trap).

To keep all categories:

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"],
    dropLast=False,  # ← keep every category
)
encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

If a column has k categories, the output vector will have k elements (instead of k-1).
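
To see why the last category is dropped by default, it helps to compare both layouts side by side. The sketch below is plain Python rather than Spark, and `one_hot_dense` is a made-up helper: with every category kept, each row sums to exactly 1.0, which is the linear dependence (with an intercept term) that the default avoids.

```python
def one_hot_dense(index, num_categories, drop_last):
    """Dense one-hot row; hypothetical helper, not part of PySpark."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]


k = 3
kept = [one_hot_dense(i, k, drop_last=False) for i in range(k)]
dropped = [one_hot_dense(i, k, drop_last=True) for i in range(k)]

print(kept)     # every row has k elements
print(dropped)  # every row has k - 1 elements; the last category is all zeros
print([sum(row) for row in kept])  # row sums when all categories are kept
```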


🅱 B. Get category → vector index mapping

After fitting:

encoder_model.categorySizes

This returns the number of categories for each input column.

Example output:

[3, 2]

Meaning:

  • input1 has 3 distinct categories → output1 vector length = 2 (or 3 if dropLast=False)

  • input2 has 2 distinct categories → output2 vector length = 1 (or 2 if dropLast=False)
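
As far as I understand it, when the input column carries no category metadata the fitted size is the largest index seen plus 1, not a count of distinct values. A pure-Python sketch of that rule follows; `category_size` is a made-up helper that mirrors my reading of the behavior, not a PySpark function.

```python
def category_size(values):
    """Sketch of the sizing rule: largest index + 1.

    Hypothetical helper; it illustrates the rule, it is not Spark's code.
    """
    for v in values:
        if v < 0 or v != int(v):
            raise ValueError(f"expected a non-negative integer index, got {v}")
    return int(max(values)) + 1


print(category_size([0.0, 1.0, 2.0]))     # input1
print(category_size([1.0, 0.0, 1.0]))     # input2
print(category_size([10.0, 20.0, 50.0]))  # gaps still count toward the size
```

For dense indices like the ones in this example the result equals the number of distinct categories; for values with gaps it does not.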

You can also get the full mapping.

If the input columns were originally strings that were indexed and then encoded,
you can inspect the StringIndexer model to see which label got which index.

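But since your inputs were already numeric, Spark treats each value as the category index itself, so the values must be non-negative integers (stored as doubles here).

So the mapping is simply:

Category value → vector index

Spark does not remap values with gaps (e.g., 10, 20, 50): the vector is sized by the largest index plus 1 (51 in that case), so run such columns through StringIndexer first to get compact indices.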

To see the mapping explicitly:

for i, size in enumerate(encoder_model.categorySizes):
    print(f"Column {i} categories: 0..{size - 1}")

Example output:

Column 0 categories: 0..2
Column 1 categories: 0..1

If you want a fully explicit mapping:

for col, size in zip(["input1", "input2"], encoder_model.categorySizes):
    print(f"Mapping for {col}:")
    for idx in range(size):
        # With the default dropLast=True, the last category has no slot (all zeros)
        print(f"  category {idx} → vector index {idx}")

 


🅲 C. Convert SparseVector to DenseVector

Sometimes you want to see the full vector (e.g., [0,1,0]) instead of:

(3,[1],[1.0])

Use a UDF:

from pyspark.sql import functions as F
from pyspark.ml.linalg import DenseVector, VectorUDT

# The UDF return type must be VectorUDT(), the SQL type for ML vectors
to_dense = F.udf(lambda v: DenseVector(v.toArray()), VectorUDT())

df_dense = (
    encoded_df
    .withColumn("output1_dense", to_dense("output1"))
    .withColumn("output2_dense", to_dense("output2"))
)
df_dense.show(truncate=False)

 

Example output

 
+------+------+-------------+-------------+-------------+-------------+
|input1|input2|output1      |output2      |output1_dense|output2_dense|
+------+------+-------------+-------------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |[1.0,0.0]    |[0.0]        |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|[0.0,1.0]    |[1.0]        |
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |[0.0,0.0]    |[0.0]        |
+------+------+-------------+-------------+-------------+-------------+

This makes the encoding much easier to inspect.


🎉 Summary

✔ Keep all categories

 
dropLast=False

✔ See number of categories per column

 
encoder_model.categorySizes

✔ Convert SparseVector → DenseVector

 
DenseVector(v.toArray())

 
