Notes on sklearn functions

A running record of some sklearn functions I have used.

1. About CountVectorizer

Purpose: count how many times each word occurs in each training text. Word order within a text is ignored, which makes this the bag-of-words approach. For example:

from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello how are you', 'you and me', 'bye bye']    # list of texts already split into words; each element is one text
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

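# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())       # print the vocabulary built from all words
# Output: ['and', 'are', 'bye', 'hello', 'how', 'me', 'you']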

print(cv.vocabulary_)         # print the mapping from each word to its column index as a dict
# Output: {'hello': 3, 'how': 4, 'are': 1, 'you': 6, 'and': 0, 'me': 5, 'bye': 2}

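print(cv_fit)       # print the counts in sparse form
# Output (the exact layout and entry order depend on the SciPy version).
# Each line reads (i, j) k: i is the index of the text, j is the index of the word in the vocabulary, and k is how many times that word occurs in that text.
  (0, 6)	1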
  (0, 1)	1
  (0, 4)	1
  (0, 3)	1
  (1, 5)	1
  (1, 0)	1
  (1, 6)	1
  (2, 2)	2

print(cv_fit.toarray())   # convert the sparse result to a dense array
# Output:
[[0 1 0 1 1 0 1]
 [1 0 0 0 0 1 1]
 [0 0 2 0 0 0 0]]
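
A fitted CountVectorizer can also be applied to text it has not seen before. The sketch below reuses the three example texts and adds a made-up new sentence (an assumption for illustration only); it suggests that words missing from the learned vocabulary, here 'world', are simply dropped by transform:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hello how are you', 'you and me', 'bye bye']
cv = CountVectorizer()
cv.fit(texts)  # learn the vocabulary only, without transforming

# 'world' was never seen during fit, so it has no column and is ignored;
# 'hello' sits in column 3 of the vocabulary and occurs twice
new_counts = cv.transform(['hello hello world']).toarray()
print(new_counts)
# Output: [[0 0 0 2 0 0 0]]
```

Keeping fit and transform separate like this is what lets the same vocabulary be reused on test data.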

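If you pass the stop_words parameter when creating the CountVectorizer, those stop words are left out of the counts. For example, define a stop word list for the example above: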

stop_words = ['you', 'and']

and change the line that creates the vectorizer to:

cv = CountVectorizer(stop_words=stop_words)

Nothing else needs to change. The resulting vocabulary no longer contains the stop words:

['are', 'bye', 'hello', 'how', 'me']
{'hello': 2, 'how': 3, 'are': 0, 'me': 4, 'bye': 1}

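The example above uses English text, but the same approach works for Chinese text, provided the text has already been segmented into words separated by spaces. Note that the default token pattern only keeps tokens of two or more characters, so single-character words are dropped unless token_pattern is changed.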
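
As a rough illustration for Chinese text, the sketch below assumes the sentences were segmented beforehand and joined with spaces; the two sentences are invented for this example. It compares the default tokenizer with a token_pattern that also accepts single-character tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: these texts were already segmented into words and joined
# with spaces; the sentences themselves are made up for illustration.
zh_texts = ['我 喜欢 机器 学习', '我 喜欢 编程']

# Default token pattern: only tokens with at least two characters survive
cv_default = CountVectorizer()
cv_default.fit(zh_texts)
default_vocab = sorted(cv_default.vocabulary_)

# Relaxed pattern: single-character tokens such as '我' are kept as well
cv_single = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
cv_single.fit(zh_texts)
single_vocab = sorted(cv_single.vocabulary_)

print(default_vocab)   # does not contain '我'
print(single_vocab)    # contains '我'
```

Whether single-character words should be kept depends on the task; many of them are function words that would end up in a stop word list anyway.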

2. About TfidfVectorizer

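TfidfVectorizer computes TF and IDF values to measure how important a word is to a document within a corpus. Put simply, a word's importance increases with the number of times it appears in a given document and decreases with the number of documents in the corpus that contain it.
TF (Term Frequency), in the textbook definition: number of times the word appears in the document / total number of words in the document
IDF (Inverse Document Frequency), in the textbook definition: log [ total number of documents / (number of documents containing the word + 1) ] (the +1 keeps the denominator from being 0; other constants also work)
The product TF * IDF is used as the score.
Note that TfidfVectorizer with default settings uses a variant of these formulas: TF is the raw count, IDF is ln [ (1 + total number of documents) / (1 + number of documents containing the word) ] + 1, and each document vector is then scaled to unit L2 length.
For example: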

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['hello how are you', 'how to do this']
stop_words = ['are','to']

tfidf = TfidfVectorizer(stop_words=stop_words)
tfidf_fit = tfidf.fit_transform(texts)

# print the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(tfidf.get_feature_names_out().tolist())
# Output:
['do', 'hello', 'how', 'this', 'you']

# print each term with its column index (vocabulary_ stores indices, not frequencies)
for term, num in tfidf.vocabulary_.items():
    print(term, num)
# Output:
hello 1
how 2
you 4
do 0
this 3



# print the TF-IDF matrix
print(tfidf_fit.toarray())
# Output:
[[0.         0.6316672  0.44943642 0.         0.6316672 ]
 [0.6316672  0.         0.44943642 0.6316672  0.        ]]

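# The raw TF-IDF matrix above is hard to read; print it term by term instead:
feature_names = tfidf.get_feature_names_out()   # column labels
dense = tfidf_fit.toarray()                     # convert once, outside the loops
for i, row in enumerate(dense):
    print('doc number:', i)
    for name, value in zip(feature_names, row):
        print(name, ":", value)
# Output: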
doc number: 0
do : 0.0
hello : 0.6316672017376245
how : 0.4494364165239821
this : 0.0
you : 0.6316672017376245
doc number: 1
do : 0.6316672017376245
hello : 0.0
how : 0.4494364165239821
this : 0.6316672017376245
you : 0.0
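
The numbers above can be reproduced by hand, which helps show what TfidfVectorizer does with its default settings (raw counts as TF, smoothed IDF, L2 normalization of each row). The sketch below is only an illustration: the count matrix is typed in manually from the two example texts, and the variable names are my own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['hello how are you', 'how to do this']
tfidf = TfidfVectorizer(stop_words=['are', 'to'])
sk_matrix = tfidf.fit_transform(texts).toarray()

# Raw counts per document after removing the stop words,
# columns ordered as in the vocabulary: do, hello, how, this, you
counts = np.array([[0, 1, 1, 0, 1],
                   [1, 0, 1, 1, 0]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1      # smoothed IDF (the smooth_idf=True default)
weighted = counts * idf                        # TF * IDF, with the raw count as TF
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # unit L2 length per row

print(np.round(manual, 4))
print(np.allclose(manual, sk_matrix))
```

'how' appears in both documents, so its IDF is 1 and it ends up with the smallest weight (about 0.449) in each row, while the words unique to one document get about 0.632.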

3. About the LabelEncoder class

The LabelEncoder class encodes labels as integers starting from 0. For example:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['a', 'b', 'c', 'd'])

labels = le.transform(['b', 'c', 'a', 'a', 'd', 'd'])
print(labels)

# Output:
[1 2 0 0 3 3]
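
The encoding can also be reversed. As a small sketch built on the same four labels, classes_ shows which label each integer stands for, and inverse_transform maps integers back to the original labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['a', 'b', 'c', 'd'])

# The position of a label in classes_ is its integer code
classes = le.classes_.tolist()
# Map the integer codes back to the original labels
decoded = le.inverse_transform([1, 2, 0, 0, 3, 3]).tolist()

print(classes)
# Output: ['a', 'b', 'c', 'd']
print(decoded)
# Output: ['b', 'c', 'a', 'a', 'd', 'd']
```

This is handy after a classifier predicts integer codes and the results need to be reported with the original label names.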

4. About Pipeline

Put simply, Pipeline is a management tool: you add several processing steps to it and they run in order as one workflow. For example, given a dataset to classify, one approach is:

Standardize the features with StandardScaler
Reduce dimensionality with PCA
Classify with LogisticRegression

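Done by hand, you pass the data to StandardScaler, pass its output to PCA, and so on. The nuisance is the glue code in between, such as the intermediate variables that hold each intermediate result. Pipeline combines the three steps into one workflow: define the steps once, feed in the data at the start, and the intermediate results are handled for you. Once defined, a Pipeline is used like any other estimator: just call fit. For example: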

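from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# X_train, y_train: training features and labels, prepared beforehand
pi_line = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1)),
                    ])

pi_line.fit(X_train, y_train)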
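
For a version that runs end to end, the sketch below uses the iris dataset bundled with sklearn as a stand-in for "a dataset to classify"; the dataset choice and the split parameters are assumptions made for illustration, not part of the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: iris stands in for the dataset; 30% is held out for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

pi_line = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1)),
                    ])
pi_line.fit(X_train, y_train)

# predict and score send the test data through the fitted scaler and PCA
# before it reaches the classifier, so no manual transform calls are needed
predictions = pi_line.predict(X_test)
accuracy = pi_line.score(X_test, y_test)
print(predictions[:5], accuracy)
```

Because the scaler and PCA are fitted only on the training data inside fit, the test data never influences the preprocessing, which is one of the practical reasons to prefer a Pipeline over hand-written glue code.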

Compiled by 极客之音. Article link: https://www.bmabk.com/index.php/post/116707.html