Description

This synonym corpus is based on 在线成语词典 (see my other repository, Chinese-fixed-phrases-idioms), 哈工大同义词词林扩展版, 汉语大辞典的近义词大全, 在线成语词典, 在线近义词查询, and other sources.

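For 哈工大同义词词林, I keep only groups in which at least two words appear together. Lone entries that have no actual synonym, such as “Aa01C05@ 众学生”, are discarded. When different sources give a word partly or entirely different synonyms, I take the union of those synonyms. Every source other than 哈工大同义词词林 states which words are synonyms of which target word, whereas 哈工大同义词词林 simply lists each group of synonymous words. In practice, I treat the first word of each group as the target word and the remaining words as its synonyms (stored in a list). This yields 18,589 synonym entries in total, saved as a dictionary in synonyms.json.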
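
The grouping and merging rules above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the corpus: the function names `parse_cilin_line` and `merge_entries` are hypothetical, and the sketch assumes each 哈工大同义词词林 line is a category code followed by whitespace-separated words, as in the “Aa01C05@ 众学生” example.

```python
from collections import defaultdict


def parse_cilin_line(line):
    """Turn one group line into (target, synonyms), or None for a lone word.

    Assumes the first token is the category code (e.g. "Aa01C05@") and the
    remaining tokens are the words of the group.
    """
    words = line.split()[1:]
    if len(words) < 2:
        # A single word has no synonym partner, so the line is discarded.
        return None
    # The first word is taken as the target; the rest are its synonyms.
    return words[0], words[1:]


def merge_entries(entries):
    """Union the synonyms of each target across sources, keeping first-seen order."""
    merged = defaultdict(list)
    for target, synonyms in entries:
        for synonym in synonyms:
            if synonym != target and synonym not in merged[target]:
                merged[target].append(synonym)
    return dict(merged)
```

With made-up placeholder words, `merge_entries([("甲", ["乙"]), ("甲", ["丙", "乙"])])` gives `{"甲": ["乙", "丙"]}`, which is the union behaviour described above.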

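synonyms_expanded_narrow.json and synonyms_expanded_broad.json are expansions of synonyms.json, and each contains 52,157 synonym entries. The narrow version treats every synonym of a target word as a target word in its own right, with the original target word becoming one of its synonyms. For example, if A has the synonyms B and C, then B and C each have A as a synonym. If B is also a synonym of D, then B has the synonyms A and D, and so on. The broad version assumes wider connections among synonyms: since A has the synonyms B and C, B and C are taken to be synonyms of each other as well, so B has the synonyms A and C, and C has the synonyms A and B. If B and C are also linked to other words as synonyms, their synonym lists grow larger still.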
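
The two expansion rules can be illustrated with a short sketch. The function names `expand_narrow` and `expand_broad` are hypothetical, and the sketch assumes that both expanded versions keep the original entries of synonyms.json; the actual files may have been produced differently.

```python
def _add(index, word, synonym):
    """Record `synonym` under `word`, skipping self-links and duplicates."""
    if synonym != word and synonym not in index.setdefault(word, []):
        index[word].append(synonym)


def expand_narrow(synonyms):
    """Narrow rule: every synonym also gets its original target as a synonym."""
    expanded = {target: list(syns) for target, syns in synonyms.items()}
    for target, syns in synonyms.items():
        for synonym in syns:
            _add(expanded, synonym, target)
    return expanded


def expand_broad(synonyms):
    """Broad rule: narrow rule, plus synonyms of one target are linked to each other."""
    expanded = expand_narrow(synonyms)
    for syns in synonyms.values():
        for synonym in syns:
            for sibling in syns:
                _add(expanded, synonym, sibling)
    return expanded


# The A/B/C/D example from the text: A has synonyms B and C, and D has synonym B.
base = {"A": ["B", "C"], "D": ["B"]}
narrow = expand_narrow(base)  # B maps to A and D; C maps to A
broad = expand_broad(base)    # B maps to A, D and C; C maps to A and B
```

Under these assumptions both expanded dictionaries have the same set of keys, and only the length of each synonym list differs.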

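Clearly, the narrow version defines synonymy more conservatively and is more reliable, but it fails to connect some potential synonym pairs. The broad version builds as wide a synonym network as it can, but many of the resulting pairs do not actually hold. For example, in synonyms.json the synonyms of “暗娼” are “私娼” and “野鸡”, but there are no entries in the reverse direction. Looking up “私娼” in synonyms_expanded_narrow.json returns only “暗娼”, whereas in synonyms_expanded_broad.json the synonyms of “私娼” are “暗娼” and “野鸡”, which is clearly the more accurate result. Looking up “野鸡”, however, the narrow version returns “非法”, “雉”, and “暗娼”, which correspond to the different senses of “野鸡” in different contexts, while the broad version returns “山鸡”, “越轨”, “非法”, “地下”, “私自”, “黑”, “非法定”, “翟”, “私”, “暗娼”, “不法”, “非官方”, “私娼”, “雉”, “伪”, and “暗”, a mixed bag of good and bad matches.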

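Of course, because of polysemy, the right synonym for a word that has several groups of synonyms cannot be found simply by reading it off the dictionary. In such cases, a language model built with statistical or machine learning methods could be used to disambiguate.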

The contents of this corpus are based on several reputable sources: 在线成语词典 (see my other repository, Chinese-fixed-phrases-idioms), 哈工大同义词词林扩展版, 汉语大辞典的近义词大全, 在线成语词典, and 在线近义词查询.

For 哈工大同义词词林扩展版, I discarded entries that contain only a single word or phrase, since they provide no actual synonyms. Also, because 哈工大同义词词林扩展版 is the only source that gives a list of synonyms rather than word-synonym pairs, the first word in each list is always taken as the target word and the rest as its synonyms. This results in a corpus of 18,589 word-synonym pairs, saved as a dictionary in synonyms.json.
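
A minimal sketch of reading the corpus is shown here. It assumes that synonyms.json is a UTF-8 JSON object mapping each target word to a list of synonyms; `load_synonyms` and `lookup` are hypothetical helper names, not part of this repository.

```python
import json


def load_synonyms(path):
    """Load a synonym dictionary (target word -> list of synonyms) from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def lookup(synonyms, word):
    """Return the synonyms listed for `word`, or an empty list if it is not a target word."""
    return synonyms.get(word, [])
```

The same two helpers should work unchanged for synonyms_expanded_narrow.json and synonyms_expanded_broad.json if they share this layout.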

synonyms_expanded_narrow.json and synonyms_expanded_broad.json are expanded versions of synonyms.json, and each has 52,157 synonym pairs. In the narrowly expanded version, each synonym of a word in synonyms.json becomes a target word in its own right, with the original word as its synonym. That is, if A has synonyms B and C, then B and C both have A as their synonym. Additionally, if B or C is a synonym of another word, say D, then D is also a synonym of B or C. The broadly expanded version assumes a broader connection between words. In the same example, if B and C are synonyms of A, then B has the synonyms A and C, and C has the synonyms A and B, and so on. Likewise, if B or C is also a synonym of another word, its synonyms also include all the synonyms that word possesses.

Because of polysemy in natural language, a synonym corpus will not automatically give you the correct synonym for a word that has different meanings, and thus different sets of synonyms, in different contexts. One way around this is to build a statistical or machine learning language model that uses the linguistic context to disambiguate when looking up the synonyms of a given word.
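
As a very rough illustration of the statistical route, the sketch below ranks candidate synonyms by how often they co-occur with the surrounding context words in a tokenized corpus. This is a simple co-occurrence count standing in for a real language model, not something this repository provides; the function names and the two toy sentences are made up for the example.

```python
from collections import Counter


def build_cooccurrence(sentences):
    """Count, for each ordered word pair, the number of sentences containing both words."""
    counts = Counter()
    for tokens in sentences:
        unique = set(tokens)
        for a in unique:
            for b in unique:
                if a != b:
                    counts[(a, b)] += 1
    return counts


def rank_synonyms(candidates, context, counts):
    """Order candidates by total co-occurrence with the context words (highest first)."""
    scored = [(sum(counts[(c, w)] for w in context), c) for c in candidates]
    # sorted() is stable, so tied candidates keep their original order.
    return [c for _, c in sorted(scored, key=lambda pair: -pair[0])]


# Toy, hand-tokenized sentences used only to exercise the functions.
toy_corpus = [["山", "里", "有", "雉"], ["非法", "大学", "被", "取缔"]]
counts = build_cooccurrence(toy_corpus)
candidates = ["非法", "雉", "暗娼"]  # narrow-version synonyms of 野鸡
ranked = rank_synonyms(candidates, ["大学"], counts)
```

A real system would need a large segmented corpus and a stronger model, but the interface stays the same: candidates from the synonym dictionary go in, along with the context, and a context-sensitive ordering comes out.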