    from pyspark.ml.feature import CountVectorizer

    # vocabSize is set to 7, so the vocabulary holds at most 7 unique words
    # (indices 0 to 6).
    # The model counts how many times each vocabulary word appears in each document.
    # df is assumed to already exist, with an id column and a words column (array of strings).
    cv = CountVectorizer(inputCol='words', outputCol='features', vocabSize=7)
    model = cv.fit(df)            # learns the vocabulary from df
    result = model.transform(df)  # adds the sparse features column
    result.show(truncate=False)

    +---+------------------------------+-------------------------------+
    |id |words                         |features                       |
    +---+------------------------------+-------------------------------+
    |0  |[apple, banana, orange, grape]|(7,[0,1,3,4],[1.0,1.0,1.0,1.0])|
    |1  |[apple, banana, cherry, pear] |(7,[1,4,5,6],[1.0,1.0,1.0,1.0])|
    |2  |[banana, cherry, grape, kiwi] |(7,[1,2,3,5],[1.0,1.0,1.0,1.0])|
    |3  |[apple, pear, kiwi, orange]   |(7,[0,2,4,6],[1.0,1.0,1.0,1.0])|
    |4  |[cherry, grape, kiwi, orange] |(7,[0,2,3,5],[1.0,1.0,1.0,1.0])|
    +---+------------------------------+-------------------------------+

✅ 1. Vocabulary Built by CountVectorizer

Spark sorts words by descending frequency, keeping the top vocabSize=7.

Let’s manually count frequencies:

Across all rows:

word	count
apple	3
banana	3
orange	3
grape	3
cherry	3
pear	2
kiwi	3

All except pear (2) appear 3 times.
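The manual tally can be reproduced with a few lines of plain Python. This is only a sketch: the `docs` list is copied by hand from the words column, and it uses `collections.Counter` rather than Spark itself.

```python
from collections import Counter

# The five rows of the words column, copied from the table above.
docs = [
    ["apple", "banana", "orange", "grape"],
    ["apple", "banana", "cherry", "pear"],
    ["banana", "cherry", "grape", "kiwi"],
    ["apple", "pear", "kiwi", "orange"],
    ["cherry", "grape", "kiwi", "orange"],
]

# Count every occurrence of every word across all rows.
counts = Counter(word for doc in docs for word in doc)

for word, n in sorted(counts.items()):
    print(word, n)
```

Seven distinct words come out, with pear at 2 and every other word at 3.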

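There are exactly 7 unique words, so all of them fit within vocabSize=7 and the vocabulary has length 7.
When multiple words have the same frequency, do not assume they come out in alphabetical order. The tied words below are not alphabetical, so always read the order from the fitted model.

For this run, the fitted model reports:

    model.vocabulary
    ['orange', 'banana', 'kiwi', 'grape', 'apple', 'cherry', 'pear']

The six words with count 3 come first, and pear (count 2) comes last.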
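As a quick sanity check on that vocabulary, the sketch below confirms two things: the counts never increase as the index grows, and the tied words are not in alphabetical order. The `counts` dict and `vocabulary` list are typed in by hand from the count table and the model output above; nothing here calls Spark.

```python
# Counts from the manual tally and the vocabulary reported by the fitted model.
counts = {"apple": 3, "banana": 3, "orange": 3, "grape": 3,
          "cherry": 3, "pear": 2, "kiwi": 3}
vocabulary = ["orange", "banana", "kiwi", "grape", "apple", "cherry", "pear"]

# Frequency of each vocabulary entry, in index order.
freqs = [counts[w] for w in vocabulary]

# Descending frequency means no entry is more frequent than the one before it.
is_descending = all(a >= b for a, b in zip(freqs, freqs[1:]))

# Compare the words tied at count 3 against their alphabetical order.
tied = [w for w in vocabulary if counts[w] == 3]
ties_alphabetical = tied == sorted(tied)

print(freqs)              # [3, 3, 3, 3, 3, 3, 2]
print(is_descending)      # True
print(ties_alphabetical)  # False
```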

Great, now that we have the actual vocabulary, we can decode every features vector precisely.

Your vocabulary, by index, is:

    0 → orange
    1 → banana
    2 → kiwi
    3 → grape
    4 → apple
    5 → cherry
    6 → pear

✅ Now decode each row

Row 0

0 → orange ✔
1 → banana ✔
3 → grape ✔
4 → apple ✔

Exactly matches the input list.

Row 1

1 → banana ✔
4 → apple ✔
5 → cherry ✔
6 → pear ✔

Matches the input list.

Row 2

1 → banana ✔
2 → kiwi ✔
3 → grape ✔
5 → cherry ✔

Matches the input list.

Row 3

0 → orange ✔
2 → kiwi ✔
4 → apple ✔
6 → pear ✔

Matches the input list.

Row 4

0 → orange ✔
2 → kiwi ✔
3 → grape ✔
5 → cherry ✔

Matches the input list.
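The row-by-row decoding above can also be checked mechanically. The sketch below looks up each index in the vocabulary and compares the result with the row's input words, ignoring order. The `rows` dict is a hand-copied version of the output table (an assumption of this example, not something Spark produces in this shape).

```python
vocabulary = ["orange", "banana", "kiwi", "grape", "apple", "cherry", "pear"]

# row id -> (input words, indices from the features column)
rows = {
    0: (["apple", "banana", "orange", "grape"], [0, 1, 3, 4]),
    1: (["apple", "banana", "cherry", "pear"], [1, 4, 5, 6]),
    2: (["banana", "cherry", "grape", "kiwi"], [1, 2, 3, 5]),
    3: (["apple", "pear", "kiwi", "orange"], [0, 2, 4, 6]),
    4: (["cherry", "grape", "kiwi", "orange"], [0, 2, 3, 5]),
}

results = {}
for row_id, (words, indices) in rows.items():
    # Map each index back to its vocabulary word.
    decoded = [vocabulary[i] for i in indices]
    # Indices are sorted, so compare as sets of words rather than by position.
    matches = sorted(decoded) == sorted(words)
    results[row_id] = (decoded, matches)
    print(row_id, decoded, matches)
```

Every row prints `True`, which agrees with the manual check marks.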

🎯 Summary: How to interpret the features column

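Each features entry is a sparse vector written as (size, [indices], [values]):

size is the vector length, which equals the vocabulary size (7 here)
indices correspond to positions in model.vocabulary
values are the counts of those words (all 1.0 here, because no word repeats within a row)
Example: index 3 means word "grape" because model.vocabulary[3] = "grape"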

Your sparse vectors are exactly correct given the vocabulary.
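To make the sparse format concrete, here is a small plain-Python sketch that expands one entry into a dense list and into a word-to-count mapping. The helper names `to_dense` and `to_word_counts` are made up for this illustration. In PySpark itself, a SparseVector exposes `size`, `indices` and `values`, and has a `toArray()` method for the dense form.

```python
def to_dense(size, indices, values):
    """Expand (size, indices, values) into a full-length list of counts."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def to_word_counts(indices, values, vocabulary):
    """Pair each non-zero position with its vocabulary word."""
    return {vocabulary[i]: v for i, v in zip(indices, values)}


vocabulary = ["orange", "banana", "kiwi", "grape", "apple", "cherry", "pear"]

# Row 0 of the features column: (7,[0,1,3,4],[1.0,1.0,1.0,1.0])
size, indices, values = 7, [0, 1, 3, 4], [1.0, 1.0, 1.0, 1.0]

dense = to_dense(size, indices, values)
word_counts = to_word_counts(indices, values, vocabulary)

print(dense)        # [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
print(word_counts)  # {'orange': 1.0, 'banana': 1.0, 'grape': 1.0, 'apple': 1.0}
```

The zeros at positions 2, 5 and 6 are kiwi, cherry and pear, which do not occur in row 0.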

PySpark - CountVectorizer

✅ Understanding the `features` Column
