ML之NB:利用NB朴素贝叶斯算法(CountVectorizer/TfidfVectorizer+去除停用词)进行分类预测、评估
目录
输出结果
设计思路
核心代码
输出结果
设计思路

核心代码
class CountVectorizer Found at: sklearn.feature_extraction.text
class CountVectorizer(BaseEstimator, VectorizerMixin):
"""Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts using
scipy.sparse.csr_matrix.
If you do not provide an a-priori dictionary and you do not use an analyzer
that does some kind of feature selection then the number of features will
be equal to the vocabulary size found by analyzing the data.
Read more in the :ref:`User Guide `.
Parameters
----------
input : string {'filename', 'file', 'content'}
If 'filename', the sequence passed as an argument to fit is
expected to be a list of filenames that need reading to fetch
the raw content to analyze.
If 'file', the sequence items must have a 'read' method (file-like
object) that is called to fetch the bytes in memory.
Otherwise the input is expected to be the sequence strings or
bytes items are expected to be analyzed directly.
encoding : string, 'utf-8' by default.
If bytes or files are given to analyze, this encoding is used to
decode.
decode_error : {'strict', 'ignore', 'replace'}
Instruction on what to do if a byte sequence is given to analyze that
contains characters not of the given `encoding`. By default, it is
'strict', meaning that a UnicodeDecodeError will be raised. Other
values are 'ignore' and 'replace'.
strip_accents : {'ascii', 'unicode', None}
Remove accents during the preprocessing step.
'ascii' is a fast method that only works on characters that have
an direct ASCII mapping.
'unicode' is a slightly slower method that works on any characters.
None (default) does nothing.
analyzer : string, {'word', 'char', 'char_wb'} or callable
Whether the feature should be made of word or character n-grams.
Option 'char_wb' creates character n-grams only from text inside
word boundaries; n-grams at the edges of words are padded with space.
If a callable is passed it is used to extract the sequence of features
out of the raw, unprocessed input.
preprocessor : callable or None (default)
Override the preprocessing (string transformation) stage while
preserving the tokenizing and n-grams generation steps.
tokenizer : callable or None (default)
Override the string tokenization step while preserving the
preprocessing and n-grams generation steps.
Only applies if ``analyzer == 'word'``.
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different
n-grams to be extracted. All values of n such that min_n
关注
打赏
热门博文
- Computer:C语言/C++语言的简介、发展历史、应用领域、编程语言环境IDE安装、学习路线之详细攻略
- DBMS/Database:数据库管理的简介、安装(注意事项等)、学习路线(基于SQLSever深入理解SQL命令语句综合篇《初级→中级→高级》/几十项代码案例集合)之详细攻略
- DayDayUp之Job:牛客网—算法工程师—剑指offer之66道在线编程(解决思路及其代码)——1~20
- High&NewTech:一文了解计算机思维、数学思维的本质区别,以及算法和程序的认知比较
- Algorithm:【Algorithm算法进阶之路】之十大经典排序算法
- DataScience:数据生成之在原始数据上添加小量噪声(可自定义噪声)进而实现构造新数据(dataframe格式数据存储案例)
- CV:Image.open 和cv2.imread的简介、区别及PIL.Image格式/OpenCV格式相互转换代码实现之详细攻略
- Py之shap:shap.explainers.shap_values函数的简介、解读(shap_values[1]索引为1的原因)、使用方法之详细攻略
- Py之PaddleFL:PaddleFL/paddle_fl的简介、安装、使用方法之详细攻略
- Python语言学习:Python语言学习之正则表达式常用函数之re.search方法【输出仅一个匹配结果(内容+位置)】、re.findall方法【输出所有匹配结果(内容)】案例集合之详细攻略