The Peking University Multi-view Chinese Treebank 1.0

PMT 1.0

Following the proposed multi-view treebanking schema for Chinese, we constructed the Peking University Multi-view Chinese Treebank (PMT), version 1.0, which contains 14,463 sentences and about 336K words and supports both the phrase structure (PS) view and the dependency structure (DS) view. Our treebanking is based on the work of Yu et al. (2003), who built a segmented and POS-tagged Chinese corpus (the PFR Corpus) and released a sub-corpus of about 1.1M words for free. We took the first 14,463 sentences of that corpus, kept the original word segmentation standard, and simplified the POS tagset according to a set of mapping rules described in our COLING 2014 paper (shown in Table 1). Each sentence was then annotated as a projective dependency tree (the dependency tagset is shown in Table 2) according to the annotation framework described in that paper. Because the dependency annotation was designed in advance with DS-to-PS conversion in mind, each annotated dependency tree can be converted into a constituent tree by our conversion script without information loss.
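To make the term "projective dependency tree" concrete, here is a minimal sketch of a projectivity check. It is an illustration only, not part of the PMT tools: the function name `is_projective` and the input format (a list of 1-based head indices, with 0 for an artificial root) are assumptions, and the actual PMT file format is not specified here. A tree is treated as projective when no two arcs cross, counting the arc from the artificial root.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Return True if no two dependency arcs cross.

    heads[i] is the 1-based index of the head of token i + 1;
    0 marks the artificial root (an assumed convention).
    """
    # Represent each arc by its left and right endpoints.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # Two arcs cross when exactly one endpoint of the second
            # lies strictly inside the span of the first.
            if left1 < left2 < right1 < right2:
                return False
    return True


if __name__ == "__main__":
    print(is_projective([2, 0, 2]))     # token 2 is the root, arcs are nested
    print(is_projective([3, 4, 0, 3]))  # arcs 1-3 and 2-4 cross
```

The quadratic loop is fine for sentence-length inputs; a check like this could be used to validate annotations before running a DS-to-PS conversion that assumes projectivity.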

PMT 1.0 contains all the articles of People’s Daily from January 1st to January 10th, 1998. For syntactic parsing experiments, we use sentences 12001-13000 as the development set and sentences 13001-14463 as the test set. The remaining sentences (1-12000) are used as training data. The experimental results based on automatic POS tags are shown in Table 5.
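The split described above can be reproduced with a few lines of code. This is a sketch under the assumption that the treebank has already been loaded into a Python list in corpus order, one item per sentence; the function name `split_pmt` and the loading step are hypothetical and not part of the released resource.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Sentence numbers are 1-based in the announcement.
TRAIN_END = 12000   # sentences 1-12000
DEV_END = 13000     # sentences 12001-13000
TEST_END = 14463    # sentences 13001-14463


def split_pmt(sentences: Sequence[T]) -> Tuple[List[T], List[T], List[T]]:
    """Split sentences (in corpus order) into train, dev, and test sets."""
    if len(sentences) != TEST_END:
        raise ValueError(f"expected {TEST_END} sentences, got {len(sentences)}")
    train = list(sentences[:TRAIN_END])
    dev = list(sentences[TRAIN_END:DEV_END])
    test = list(sentences[DEV_END:TEST_END])
    return train, dev, test


if __name__ == "__main__":
    # Stand-in data: use the sentence numbers themselves as the sentences.
    train, dev, test = split_pmt(range(1, TEST_END + 1))
    print(len(train), len(dev), len(test))
```

With this split the development set has 1,000 sentences and the test set has 1,463.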

A tool for Chinese word segmentation, POS tagging, and dependency parsing, together with a DS-to-PS converter that follows the annotation standard of PMT 1.0, can be downloaded from ZPar at SourceForge.

This resource is available free of charge. If you are interested in acquiring it, please fill in this form and send it to wanghf@pku.edu.cn.

If you have any questions, suggestions, or comments, please feel free to contact us at qiulikun@pku.edu.cn or wanghf@pku.edu.cn.

For a detailed introduction, please see our COLING 2014 paper:

Likun Qiu, Yue Zhang, Peng Jin and Houfeng Wang. 2014. Multi-view Chinese Treebanking. In Proceedings of the 25th International Conference on Computational Linguistics (COLING). Dublin, Ireland, August.

The Peking University Multi-view Chinese Treebank, Version 1.0

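Based on the multi-view treebank annotation framework proposed by the Institute of Computational Linguistics (ICL) at Peking University, three groups began building the Peking University Multi-view Chinese Treebank in 2012: the Key Laboratory of Computational Linguistics (Ministry of Education) at Peking University, the Shandong Provincial Key Laboratory of Language Resource Development and Application (Ludong University), and the university-level Key Laboratory of Intelligent Information Processing at Leshan Normal University. The version released here is 1.0. It contains 14,463 sentences and 336K words, and supports two syntactic views: dependency grammar and phrase structure grammar.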

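The treebank is built on the word-segmented and POS-tagged corpus of the January 1998 People's Daily, which ICL released free of charge more than ten years ago. Version 1.0 covers the first ten days of January 1998. The POS tagset has been simplified by merging some tags; the merging rules are shown in Table 1. The tagset used for the dependency view is shown in Table 2, and the phrase structure view follows the tagset of the Penn Chinese Treebank. Because the dependency annotation scheme was designed in advance for this purpose, our view conversion algorithm can automatically convert dependency trees into phrase structure trees.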

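Using PMT 1.0, we ran automatic parsing experiments for both dependency grammar and phrase structure grammar. Sentences 12001 to 13000 serve as the development set, sentences 13001 to 14463 as the test set, and sentences 1 to 12000 as the training set. The results are shown in Table 5. With automatically assigned POS tags, the dependency UAS and the phrase structure accuracy reach 83.28% and 84.84%, respectively.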
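For readers unfamiliar with UAS (unlabeled attachment score), the following sketch shows how the metric is commonly computed: the fraction of tokens whose predicted head matches the gold head. The function name `uas` and the head-list input format are assumptions made for illustration. Whether the reported figure excludes punctuation is not stated here, so this sketch counts every token.

```python
from typing import List, Sequence


def uas(gold: Sequence[List[int]], pred: Sequence[List[int]]) -> float:
    """Unlabeled attachment score over a set of sentences.

    Each sentence is a list of head indices, one per token
    (an assumed format; 0 marks the root).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_heads, pred_heads in zip(gold, pred):
        if len(gold_heads) != len(pred_heads):
            raise ValueError("sentence length mismatch between gold and pred")
        # Count tokens whose predicted head equals the gold head.
        correct += sum(1 for g, p in zip(gold_heads, pred_heads) if g == p)
        total += len(gold_heads)
    return correct / total if total else 0.0


if __name__ == "__main__":
    gold = [[2, 0, 2], [0, 1]]
    pred = [[2, 0, 1], [0, 1]]
    print(f"UAS = {uas(gold, pred):.2%}")  # 4 of 5 heads are correct
```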

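The word segmentation, POS tagging, and dependency parsing tools developed on PMT 1.0, together with the dependency-to-phrase-structure view converter, are freely available from ZPar at SourceForge.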

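This resource is free of charge. If you are interested in using it, please fill in the form and send it to wanghf@pku.edu.cn; we will respond and send you the resource as soon as possible after receiving the form.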

We warmly welcome your questions, suggestions, and comments on this resource. Please contact qiulikun@pku.edu.cn or wanghf@pku.edu.cn.
