  • Systems Science Seminar Series
Information Extraction and Text Summarization in Thai Language
Speaker: Thanaruk Theeramunkong (SIIT, Thammasat University, Thailand)
Time: 9:00 a.m., October 21, 2011    Venue: Room 712, Siyuan Building
Abstract:
In this work, we study named entity recognition and text summarization for the Thai language. Named entity recognition is a nontrivial and challenging information extraction task in Thai because Thai text has no explicit word, phrase, or sentence boundaries.

In the first work, we propose a method that exploits the concept of character clusters, i.e., sequences of inseparable characters. The method groups characters into clusters and then uses statistics among characters and their clusters to extract Thai words and recognize named entities simultaneously. It integrates two phases, a word-segmentation model and a named-entity-recognition model. Context features are used to learn the parameters of these two discriminative probabilistic models, both conditional random fields (CRFs), which rank the generated word and named-entity candidates. Moreover, three alternative discriminative probabilistic approaches, called (1) the phase-independent approach, (2) the phase-merging approach, and (3) the phase-cascading approach, are proposed and compared.

In the second work, we study a number of techniques for constructing a comprehensive summary from multiple documents. For summarizing multiple news articles related to a specific event, we studied a method that finds relationships among entities using association rule mining, and we proposed a graph-based summarization method that constructs a summarization graph by modelling text portions as nodes and the relationships among them as edges. An ideal summary should include only the important descriptions common to these articles, together with some dominant differences among them.
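The graph-based idea in the second work can be illustrated with a minimal sketch. It is not the speaker's method: the abstract derives relationships through association rule mining, whereas this sketch swaps in plain word overlap (Jaccard similarity) as the edge weight and ranks nodes by the sum of their edge weights. The function names `tokenize`, `build_graph`, and `summarize` are hypothetical, and the whitespace tokenizer assumes English-like input; Thai text would first need word segmentation, as the abstract notes.

```python
from itertools import combinations


def tokenize(text):
    # Assumption: whitespace-delimited text; Thai would need a word segmenter.
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    return words - {""}


def build_graph(portions):
    """Model text portions as nodes and return weighted edges {(i, j): weight}.

    The weight is the Jaccard overlap of the two portions' word sets,
    a stand-in for the mined relationships described in the abstract.
    """
    tokens = [tokenize(p) for p in portions]
    edges = {}
    for i, j in combinations(range(len(portions)), 2):
        union = tokens[i] | tokens[j]
        if not union:
            continue
        weight = len(tokens[i] & tokens[j]) / len(union)
        if weight > 0:
            edges[(i, j)] = weight
    return edges


def summarize(portions, k=2):
    """Pick the k most connected portions, returned in their original order."""
    score = [0.0] * len(portions)
    for (i, j), weight in build_graph(portions).items():
        score[i] += weight
        score[j] += weight
    # Higher score first; ties are broken by earlier position.
    top = sorted(range(len(portions)), key=lambda i: (-score[i], i))[:k]
    return [portions[i] for i in sorted(top)]


if __name__ == "__main__":
    docs = [
        "the flood hit the city",
        "the flood damaged the city roads",
        "stocks rose today",
    ]
    print(summarize(docs, k=2))
```

Portions that share content with many others score highest, which loosely mirrors the goal of keeping descriptions common to several articles. Capturing the dominant differences among articles would need an additional criterion that this sketch does not attempt.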