
Tokenizers


Japanese

Tokenizing a single sentence

% echo "ＭｅＣａｂで形態素解析を行うとこうなる．" | /Users/admin/Documents/mecab/bin/mecab -Owakati

Tokenizing an entire file

% /Users/admin/Documents/mecab/bin/mecab INPUT -o OUTPUT -O wakati
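The two commands above can also be driven from a script. The following is a minimal sketch, not part of the original post: the helper names `wakati_command` and `tokenize_sentence` are hypothetical, the binary path is copied from the commands above, and UTF-8 input and output are assumed (this depends on the dictionary MeCab was built with).

```python
import subprocess
from typing import List, Optional

# Path copied from the commands above; adjust it to your own installation.
MECAB = "/Users/admin/Documents/mecab/bin/mecab"


def wakati_command(input_path: Optional[str] = None,
                   output_path: Optional[str] = None,
                   mecab: str = MECAB) -> List[str]:
    """Build the argument list for a MeCab run in wakati (space-separated) mode."""
    cmd = [mecab, "-O", "wakati"]
    if input_path:
        cmd.append(input_path)         # tokenize a whole file
    if output_path:
        cmd += ["-o", output_path]     # write the result to a file instead of stdout
    return cmd


def tokenize_sentence(sentence: str, mecab: str = MECAB) -> List[str]:
    """Pipe one sentence through MeCab and return the tokens as a list."""
    result = subprocess.run(
        wakati_command(mecab=mecab),
        input=sentence + "\n",
        capture_output=True,
        encoding="utf-8",              # assumption: UTF-8 dictionary
        check=True,                    # raise if MeCab exits with an error
    )
    return result.stdout.split()


if __name__ == "__main__":
    print(tokenize_sentence("ＭｅＣａｂで形態素解析を行うとこうなる．"))
    subprocess.run(wakati_command("INPUT", "OUTPUT"), check=True)
```

Passing the arguments as a list avoids shell quoting problems with Japanese text; the actual tokens depend on the installed dictionary.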

MeCab parameter configuration
MeCab installation
An excellent summary (in Japanese)
MeCab configuration file

Chinese

Run Tokenization.py to segment the text with Jieba.
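The contents of Tokenization.py are not shown here, so the following is only a sketch of what such a script might do. It assumes the third-party `jieba` package is installed; `segment_lines` is a hypothetical helper name, and the `cut` parameter exists only so that another segmenter can be swapped in.

```python
from typing import Callable, Iterable, List, Optional


def segment_lines(lines: Iterable[str],
                  cut: Optional[Callable[[str], List[str]]] = None) -> List[str]:
    """Segment each non-empty line and join its tokens with single spaces."""
    if cut is None:
        import jieba                   # third-party: pip install jieba
        cut = jieba.lcut               # returns a list of tokens (precise mode)
    segmented = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                   # skip blank lines
        tokens = [tok for tok in cut(line) if tok.strip()]
        segmented.append(" ".join(tokens))
    return segmented


if __name__ == "__main__":
    for row in segment_lines(["我来到北京清华大学"]):
        print(row)
```

The output has one space-separated sentence per line, the same shape that MeCab's wakati mode produces for Japanese.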

Common Chinese segmentation methods:

Method	Algorithm	Related links	Reference
Jieba	Uses a prefix dictionary for efficient word-graph scanning and builds a directed acyclic graph (DAG) of all possible word combinations. Dynamic programming then finds the most probable combination based on word frequency. For unknown words, an HMM-based model is used with the Viterbi algorithm.	Github	Sun, J. "'Jieba' Chinese word segmentation tool." (2012).
THULAC (THU Lexical Analyzer for Chinese)	Based on a structured perceptron	Github, paper (2009)	Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016.
StanfordSegmenter	Based on CRF	Github, Tutorials, paper (2005), paper (2008)	
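To make the Jieba row concrete, here is a toy sketch of the prefix-dictionary scan, the DAG, and the dynamic-programming step. It is not Jieba's actual code: the dictionary `FREQ` and its counts are made up for illustration, all function names are hypothetical, and the HMM/Viterbi fallback for unknown words is left out (unknown characters are simply emitted one by one).

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy dictionary (word -> count); not Jieba's real data.
FREQ: Dict[str, int] = {
    "研究": 50, "研究生": 30, "生命": 40, "生": 10, "命": 10,
    "的": 100, "起源": 20,
}


def build_prefix_dict(freq: Dict[str, int]) -> Dict[str, int]:
    """Store every word plus all of its proper prefixes (prefixes get count 0)."""
    prefix: Dict[str, int] = {}
    for word, count in freq.items():
        prefix[word] = count
        for k in range(1, len(word)):
            prefix.setdefault(word[:k], 0)
    return prefix


def build_dag(sentence: str, prefix: Dict[str, int]) -> Dict[int, List[int]]:
    """Map each start index to the end indices of dictionary words starting there."""
    dag: Dict[int, List[int]] = {}
    for i in range(len(sentence)):
        ends: List[int] = []
        j = i
        frag = sentence[i]
        # Stop scanning as soon as the fragment is not a prefix of any word.
        while j < len(sentence) and frag in prefix:
            if prefix[frag] > 0:
                ends.append(j)
            j += 1
            frag = sentence[i:j + 1]
        dag[i] = ends or [i]           # fall back to a single character
    return dag


def segment(sentence: str, freq: Dict[str, int] = FREQ) -> List[str]:
    """Pick the path through the DAG with the highest total log-probability."""
    prefix = build_prefix_dict(freq)
    dag = build_dag(sentence, prefix)
    log_total = math.log(sum(freq.values()))
    n = len(sentence)
    route: Dict[int, Tuple[float, int]] = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):     # right to left, reusing suffix scores
        route[i] = max(
            (math.log(prefix.get(sentence[i:j + 1]) or 1) - log_total
             + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:                       # follow the best end index from each start
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


if __name__ == "__main__":
    print(segment("研究生命的起源"))   # ['研究', '生命', '的', '起源']
```

With these made-up counts, the path 研究 / 生命 scores higher than 研究生 / 命, which shows why the frequency-weighted search matters: a greedy longest-match scan would have chosen 研究生 first.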

Get the code from here.
