Word-frequency counting and word-cloud generation (in arbitrary shapes). The result looks like this:

(Word cloud generated from 水浒传, Water Margin)

The following packages are required:

• jieba: for word segmentation
• wordcloud: for generating the word cloud

python -m pip install jieba wordcloud

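1. Word-frequency counting

1.1 Segmenting and counting with jieba

Segment the text with jieba [1], then use a dictionary to count how often each word occurs. The steps are:

1. Read the txt file
2. Segment it with jieba to get a list of words: word_list = jieba.lcut(t1)
3. Count word frequencies with a dictionary
4. Sort the words by frequency and print them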

import jieba

# 1. Read the text
with open('水浒传.txt', 'r', encoding='utf-8') as f:
    t1 = f.read()

# 2. Segment with jieba to get a list of words
word_list = jieba.lcut(t1)

# 3. Count word frequencies
counts = {}
for word in word_list:
    counts[word] = counts.get(word, 0) + 1

# 4. Sort by frequency (descending) and print the top 30
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)

for word, count in items[:30]:  # slicing is safe even with fewer than 30 words
    print(f"{word:<10}{count:>5}")
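
As an aside, the counting and sorting in steps 3 and 4 can also be done with collections.Counter from the standard library. A minimal sketch, assuming the text has already been segmented; the short sample_words list is made up for illustration and stands in for the output of jieba.lcut:

```python
from collections import Counter

# Hypothetical stand-in for word_list = jieba.lcut(t1)
sample_words = ['宋江', '道', '李逵', '宋江', '，', '武松', '宋江', '李逵']

counts = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs, highest first
top = counts.most_common(2)

for word, count in top:
    print(f"{word:<10}{count:>5}")
```

A Counter is a dict subclass, so the rest of the code in this section can keep treating counts as a dictionary.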

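However, the top of this ranking turns out to be punctuation and function words, so we filter them out. Single-character tokens are dropped (which removes punctuation and most particles), along with a hand-picked set of uninformative words:

excludes = {'两个', '一个', '说道', '犹言', '哪里'}  # ...

cleaned_list = []  # filtered word list
for word in word_list:
    # skip single-character tokens and excluded words
    if len(word) <= 1 or word in excludes:
        continue
    cleaned_list.append(word)

word_list = cleaned_list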

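In addition, some tokens are variants of the same word. For example, 宋江 (Song Jiang) and 宋江道 ("Song Jiang said") are counted as two separate words. We can merge them by replacing one with the other in the text before segmentation:

t1 = t1.replace('宋江道', '宋江')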
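
Replacing text before segmentation works for a handful of cases. Another option is to map variants onto a canonical word after segmentation. A sketch under that assumption; the aliases mapping and the normalize helper are hypothetical names, and the sample tokens are made up:

```python
# Hypothetical mapping from variant tokens to the word they should count as
aliases = {'宋江道': '宋江', '李逵道': '李逵'}


def normalize(words, aliases):
    """Return a new list with every variant replaced by its canonical word."""
    return [aliases.get(word, word) for word in words]


# Made-up tokens standing in for the output of jieba.lcut
sample_words = ['宋江道', '宋江', '李逵道', '武松']
merged = normalize(sample_words, aliases)
print(merged)
```

Keeping the variants in one dictionary makes it easy to extend the list without adding one replace call per name.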

The complete code:

import jieba

# 1. Read the text
with open('水浒传.txt', 'r', encoding='utf-8') as f:
    t1 = f.read()

# Merge variants of the same word before segmentation
t1 = t1.replace('宋江道', '宋江')

# 2. Segment with jieba to get a list of words
word_list = jieba.lcut(t1)

excludes = {'两个', '一个', '说道', '犹言', '哪里', '如何', '只见', '这里', '这个',
            '妇人', '便是', '起来', '问道', '人马', '之意', '不是', '我们', '甚么',
            '三个', '只是', '因此', '不知', '且说', '正是', '如此', '不曾', '不敢',
            '不得', '却是', '看时', '如今', '次日', '来到'}

cleaned_list = []  # filtered word list
for word in word_list:
    # skip single-character tokens and excluded words
    if len(word) <= 1 or word in excludes:
        continue
    cleaned_list.append(word)

word_list = cleaned_list

# 3. Count word frequencies
counts = {}
for word in word_list:
    counts[word] = counts.get(word, 0) + 1

# 4. Sort by frequency (descending) and print
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
print([x[0] for x in items[:50]])

for word, count in items[:30]:  # slicing is safe even with fewer than 30 words
    print(f"{word:<10}{count:>5}")

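1.2 Part-of-speech tags

Besides plain segmentation, jieba can also return the part of speech of each word [2]. For example, t is a time word, m is a numeral, n is a noun, nr is a person name, x is a non-morpheme character…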

import jieba
import jieba.posseg as pseg

words = pseg.cut("今天大部分地区多云。")  # jieba default mode
for word, flag in words:
    print(word, flag)

# Output:
今天 t
大部分 m
地区 n
多云 nr
。 x

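The part-of-speech tag can be used to count only words of a certain type.

...
# 2. Segment with jieba to get a list of (word, tag) pairs
word_list = pseg.lcut(t1)

cleaned_list = []  # filtered word list
for word, label in word_list:
    if label == 'nr':  # person name
        cleaned_list.append(word)

word_list = cleaned_list
...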
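
Putting the filter and the counting together could look like the following sketch. To keep it runnable without a text file, the tagged list is a hand-written stand-in for pseg.lcut(t1) (made-up pairs, not real jieba output), and count_by_flag is a hypothetical helper name:

```python
from collections import Counter


def count_by_flag(tagged_words, wanted_flags):
    """Count only the words whose part-of-speech flag is in wanted_flags."""
    return Counter(word for word, flag in tagged_words if flag in wanted_flags)


# Hand-written (word, flag) pairs standing in for pseg.lcut(t1)
tagged = [('宋江', 'nr'), ('来到', 'v'), ('梁山', 'ns'),
          ('李逵', 'nr'), ('宋江', 'nr'), ('。', 'x')]

name_counts = count_by_flag(tagged, {'nr'})
print(name_counts.most_common())
```

Passing a set of flags rather than a single flag makes it possible to count several categories at once.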

In addition, jieba's paddle mode can be used to obtain more accurate segmentation and tagging. It requires paddlepaddle-tiny: pip install paddlepaddle-tiny==1.6.1.

import jieba
import jieba.posseg as pseg

words = pseg.cut("your_text")  # jieba default mode
jieba.enable_paddle()  # enable paddle mode; supported since version 0.40, not in earlier versions
words2 = pseg.cut("your_text", use_paddle=True)  # paddle mode

References

[1] jieba on GitHub: https://github.com/fxsjy/jieba
[2] Related blog post: https://blog.csdn.net/Yellow_python/article/details/83991967

2. Generating the word-cloud image

2.1 Generating a word cloud with wordcloud

The segmented words can be turned into a word cloud with wordcloud [1].

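import wordcloud as wc

# Generate the word cloud from the space-joined words
txt1 = " ".join(word_list)

# font_path must point to a font with Chinese glyphs (msyh.ttc is Microsoft YaHei)
w1 = wc.WordCloud(width=1000, height=700, background_color='white',
                  max_words=200, font_path='msyh.ttc')
w1.generate(txt1)
w1.to_file("水浒.png")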
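
If the frequencies from section 1 are already available, they can be passed to the word cloud directly instead of re-joining the words into a string. A sketch, assuming WordCloud.generate_from_frequencies (which takes a word-to-frequency dict); top_frequencies and render_cloud are hypothetical helper names, the sample counts are made up, and wordcloud is imported inside render_cloud so the helper above it runs on its own:

```python
def top_frequencies(counts, n):
    """Keep the n most frequent words as a {word: count} dict."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


def render_cloud(frequencies, out_path, font_path='msyh.ttc'):
    """Draw a word cloud from a {word: count} dict and save it as an image."""
    from wordcloud import WordCloud  # third-party; needed only for rendering

    cloud = WordCloud(width=1000, height=700, background_color='white',
                      font_path=font_path)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)


# Made-up counts standing in for the counts dict from section 1
sample_counts = {'宋江': 30, '李逵': 12, '武松': 9, '吴用': 5}
freqs = top_frequencies(sample_counts, 3)
print(freqs)
# render_cloud(freqs, 'cloud.png')  # uncomment once wordcloud and the font are available
```

Working from frequencies keeps the cleaning done in section 1 intact, since the word cloud no longer tokenizes the text itself.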

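But a plain rectangle is rather dull; it would look better if the words filled the outline of a figure. All we need is a mask image (a purely black-and-white image), passed to WordCloud through the mask parameter. White areas of the mask are left empty, and the words are drawn in the remaining area.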

from PIL import Image
import numpy as np
from wordcloud import WordCloud

# read the mask image
bird_mask = np.array(Image.open("pics/masked_bird.png"))

w2 = WordCloud(background_color="white", max_words=200, mask=bird_mask,
               contour_width=3, contour_color='steelblue', font_path='msyh.ttc')

# generate the word cloud from txt1, the space-joined words built in the previous step
w2.generate(txt1)
w2.to_file("水浒2.png")

(the mask image)

(the generated word cloud)
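
A mask does not have to come from a picture; it is just an array in which white (255) pixels are blocked and all other pixels can hold words. A sketch that builds a circular mask with NumPy, in the style of the example in the wordcloud repository; circle_mask is a hypothetical helper name, and the size and radius are arbitrary:

```python
import numpy as np


def circle_mask(size, radius):
    """Return a size x size array: 0 inside the circle, 255 (blocked) outside."""
    center = size // 2
    y, x = np.ogrid[:size, :size]
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    return (255 * outside).astype(np.uint8)


mask = circle_mask(300, 130)
print(mask.shape, mask[150, 150], mask[0, 0])
# The array can then be passed as WordCloud(mask=mask, ...)
```

Inspecting a few pixels like this is also a quick way to check that a mask loaded from a file has the expected values before generating the word cloud.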

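2.2 Obtaining a mask image

Everyday pictures are usually color images containing an object and a background. How do we turn one into a black-and-white mask like this? It takes only two steps:

1. Cutout: use Paint 3D or another cutout tool to separate the object from the background
2. Binarization: use PIL or OpenCV to binarize the cut-out object

(original image)

1. Cutout: separate the object from the background with Paint 3D [2]:

(after the cutout)

2. Binarization: binarize the cutout result with PIL [3]: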

# Binarize the image
from PIL import Image

img = Image.open('pics/bird_mask.png')
gray = img.convert('L')  # mode 'L' is grayscale, values 0-255; 0 is black, 255 is white

threshold = 10
# Gray values > threshold map to 0 (black); values <= threshold map to 1 (white)
table = [0 if i > threshold else 1 for i in range(256)]
photo = gray.point(table, '1')
photo.save("masked.png")
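
The lookup table is the heart of this step: point() replaces every gray value i with table[i]. The sketch below isolates it in a hypothetical helper, binarize_table, and wraps the conversion in a hypothetical to_mask function (Pillow is imported inside it so the table can be inspected on its own); the file names are placeholders:

```python
def binarize_table(threshold):
    """Lookup table for Image.point: gray values above threshold map to 0 (black),
    all others map to 1 (white)."""
    return [0 if value > threshold else 1 for value in range(256)]


def to_mask(src_path, dst_path, threshold=10):
    """Convert the image at src_path to a black-and-white mask saved at dst_path."""
    from PIL import Image  # third-party (Pillow); needed only for the conversion

    gray = Image.open(src_path).convert('L')
    gray.point(binarize_table(threshold), '1').save(dst_path)


table = binarize_table(10)
print(table[:12])
# to_mask('cutout.png', 'mask.png')  # placeholder file names
```

With a low threshold such as 10, only the nearly black background turns white, so the object itself ends up black and is the area where words are drawn.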

(after binarization)

References

[1] wordcloud documentation: https://github.com/amueller/word_cloud
[2] Cutting out an object with Paint 3D: https://zhuanlan.zhihu.com/p/139511541
[3] Converting to a binary image with PIL: https://zhuanlan.zhihu.com/p/138960514