ITA: Information Systems
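Qian Aibing (钱爱兵) 1, Jiang Lan (江岚) 2
(1. School of Economics, Trade and Management, Nanjing University of Chinese Medicine, Nanjing, Jiangsu 210046; 2. Department of Information Management, Nanjing University, Nanjing, Jiangsu 210093)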
Chinese Web Page Keyword Extraction Based on Improved TF-IDF
A Case Study of News Web Pages
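Abstract: Drawing on the content characteristics of news web pages, this paper describes how keywords in Chinese web pages are composed. It improves the classic TF-IDF weighting formula to build a candidate-keyword scoring formula that takes multiple influencing factors into account, and it improves SharpICTCLAS word segmentation by adding position tags. Words with higher scores are selected as candidate keywords; the position tags are then used to optimize extraction by recombining candidate keywords that segmentation has "chopped up" into the keywords that are finally extracted. Experimental results show that the method clearly outperforms the baseline method and extracts satisfactory keywords.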
Keywords: term frequency; inverse document frequency; news web pages; keyword extraction
Abstract: This paper describes the characteristics of the keywords of Chinese Web pages in combination with the characteristics of the content of Web news and, based on an improved version of the classic TF-IDF weighting formula, proposes a candidate keyword grading and weighting formula that takes a variety of impact factors into account. Moreover, the paper improves SharpICTCLAS by adding a position tag. The method selects the words with high scores as candidate keywords and tries to link them together according to their positions in the Web news. Finally, the formal keywords are extracted. The experimental results show that the proposed method significantly outperforms the baseline method and that the quality of the extracted keywords is satisfactory.
Keywords: term frequency; inverse document frequency; Web news; keyword extraction
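The pipeline summarized in the abstract (score words, keep the high-scoring ones as candidates, then recombine adjacent candidates using their positions) can be illustrated with a minimal Python sketch. It uses only the classic TF-IDF weight, not the improved multi-factor formula, whose details are not given in this excerpt. The function names `tf_idf_scores` and `merge_adjacent`, the assumption that the document is itself part of the corpus, and the rule "join consecutive candidate tokens" are all illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter


def tf_idf_scores(doc_tokens, corpus):
    """Classic TF-IDF: tf(t, d) * log(N / df(t)).

    Assumes doc_tokens is one of the documents in corpus, so every
    term in the document has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


def merge_adjacent(doc_tokens, candidates):
    """Join runs of consecutive candidate tokens into single phrases.

    The list index of each token plays the role of a position tag
    (an assumption made for this sketch).
    """
    phrases, run = [], []
    for token in doc_tokens:
        if token in candidates:
            run.append(token)
        elif run:
            phrases.append("".join(run))
            run = []
    if run:
        phrases.append("".join(run))
    # Drop repeated phrases while keeping first-occurrence order.
    return list(dict.fromkeys(phrases))
```

For example, if segmentation splits a news sentence into `["南京", "大学", "发布", "报告"]` and both `"南京"` and `"大学"` are kept as candidates, `merge_adjacent` returns `["南京大学"]`, the kind of recombined keyword the abstract describes.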
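At present, many scholars in China and abroad have done a great deal of research on keyword extraction and have proposed many representative methods. Jian Lifeng (简立峰) adopted [...]

[...] directly affects the results of keyword extraction. In summary, compared with research on English keyword extraction, Chinese keyword [...]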