Japanese lexical properties database (JLPD) |
[Updated: 20200422] |
Within the present revision of the website, this page is still under construction.
Once the revision is complete, the page will outline the development of an ontology of Japanese lexical properties (Joyce & Hodošček, 2014) as a framework for constructing a large-scale lexical resource on the Japanese lexicon. As such, it will draw on a number of earlier and current projects to construct various component databases, such as ones of compound word morphology and semantic transparency (Joyce, Masuda, & Ogawa, 2014; Masuda & Joyce, 2005, 2019; Masuda et al., 2012).
However, the current version continues to make publicly available the Corpus Word Lists reported in Joyce, Hodošček, & Nishina (2012), which were extracted from the Balanced Corpus of Contemporary Written Japanese (BCCWJ). |
For details of other aspects, it will perhaps be more expedient to consult some of the other papers, which are downloadable from my Output page. |
Corpus word lists |
There are three tables of information providing (1) corpus type and token counts, (2) listings of the files within the two compressed archives, and (3) explanations of the Excel column labels. |
Download links |
The corpus word lists are grouped according to the two word unit definitions used within the BCCWJ project (NINJAL, 2011) and UniDic (Den et al., 2007), namely short unit words (SUW) and long unit words (LUW). |
SUW data: | 14 files compressed into one archive (13 MB) | download |
LUW data: | 17* files compressed into one archive (115 MB) | download |
| * The LUW noun word list was split into 4 Excel files, as it was too large for one worksheet. |
Respecting standard courtesies in these matters, individuals who utilize the corpus word lists within published works are politely asked to cite the Joyce et al. (2012) paper. |
Corpus type and token counts |
Word Class | Units | Tokens | Lemma Types | Orthographic Types | LOTR | SOFR |
Nouns | SUW | 35,450,054 | 99,229 | 141,649 | 0.70 | 0.77 |
Nouns | LUW | 23,993,514 | 1,935,336 | 2,037,164 | 0.95 | 0.96 |
Verbs | SUW | 14,117,717 | 9,311 | 27,540 | 0.34 | 0.39 |
Verbs | LUW | 10,556,859 | 95,640 | 127,960 | 0.75 | 0.82 |
i-Adjectives | SUW | 1,584,323 | 779 | 3,061 | 0.25 | 0.22 |
i-Adjectives | LUW | 1,331,979 | 11,776 | 15,815 | 0.74 | 0.82 |
Adverbs | SUW | 1,825,691 | 3,050 | 7,078 | 0.43 | 0.35 |
Adverbs | LUW | 2,266,356 | 31,556 | 37,462 | 0.84 | 0.88 |
Others | SUW | 51,724,211 | 63,339 | 98,464 | 0.64 | 0.77 |
Others | LUW | 45,436,956 | 322,207 | 329,962 | 0.98 | 0.98 |
Totals | SUW | 104,701,996 | 175,708 | 277,792 | 0.63 | 0.74 |
Totals | LUW | 83,585,664 | 2,396,515 | 2,548,363 | 0.94 | 0.96 |
Based on Table 3 from Joyce et al. (2012: 262), which presents token, lemma type, and orthographic type counts for the four main word lists (by word class: nouns, verbs, i-adjectives, and adverbs) and the 'others' group. |
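As a quick sanity check on the table, the LOTR values appear consistent with dividing lemma types by orthographic types (for example, 99,229 / 141,649 ≈ 0.70 for SUW nouns). The Python sketch below recomputes that ratio from the counts listed above. Reading LOTR as a lemma-to-orthographic-type ratio is an inference from the numbers rather than a definition given on this page, and the names `COUNTS` and `lotr` are purely illustrative. The SOFR column is left as reported, since its derivation is not given here.

```python
# Counts copied from the table above:
# (lemma types, orthographic types, reported LOTR).
COUNTS = {
    ("Nouns", "SUW"): (99_229, 141_649, 0.70),
    ("Nouns", "LUW"): (1_935_336, 2_037_164, 0.95),
    ("Verbs", "SUW"): (9_311, 27_540, 0.34),
    ("Verbs", "LUW"): (95_640, 127_960, 0.75),
    ("i-Adjectives", "SUW"): (779, 3_061, 0.25),
    ("i-Adjectives", "LUW"): (11_776, 15_815, 0.74),
    ("Adverbs", "SUW"): (3_050, 7_078, 0.43),
    ("Adverbs", "LUW"): (31_556, 37_462, 0.84),
    ("Others", "SUW"): (63_339, 98_464, 0.64),
    ("Others", "LUW"): (322_207, 329_962, 0.98),
    ("Totals", "SUW"): (175_708, 277_792, 0.63),
    ("Totals", "LUW"): (2_396_515, 2_548_363, 0.94),
}


def lotr(lemma_types, orth_types):
    """Assumed definition: lemma types divided by orthographic types."""
    return lemma_types / orth_types


for (word_class, unit), (lemmas, orths, reported) in COUNTS.items():
    computed = round(lotr(lemmas, orths), 2)
    print(f"{word_class:<13}{unit}  computed={computed:.2f}  reported={reported:.2f}")
```

Under this reading, a value close to 1 means that most lemmas have a single orthographic form, which matches the higher ratios reported for the LUW lists.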
File names and uncompressed sizes in respective archives |
SUW file name | File size | LUW file name | File size |
SUW-Adjective-i.xlsx | 160 KB | LUW-Adjective-i.xlsx | 836 KB |
SUW-Adjective-na.xlsx | 180 KB | LUW-Adjective-na.xlsx | 2,011 KB |
SUW-Adverb.xlsx | 357 KB | LUW-Adverb.xlsx | 2,025 KB |
SUW-Affix.xlsx | 85 KB | LUW-Affix.xlsx | 43 KB |
SUW-Auxiliary-Symbol.xlsx | 110 KB | LUW-Auxiliary-Symbol.xlsx | 114 KB |
SUW-Auxiliary-Verb.xlsx | 13 KB | LUW-Auxiliary-Verb.xlsx | 29 KB |
SUW-Conjunction.xlsx | 11 KB | LUW-Conjunction.xlsx | 45 KB |
SUW-Interjection.xlsx | 77 KB | LUW-Interjection.xlsx | 86 KB |
SUW-Noun.xlsx | 7,025 KB | LUW-Noun-1.xlsx | 29,622 KB |
 | | LUW-Noun-2.xlsx | 29,703 KB |
 | | LUW-Noun-3.xlsx | 29,635 KB |
 | | LUW-Noun-4.xlsx | 29,379 KB |
SUW-Particle.xlsx | 27 KB | LUW-Particle.xlsx | 37 KB |
SUW-Prenominal.xlsx | 13 KB | LUW-Prenominal.xlsx | 15 KB |
SUW-Proper-Noun.xlsx | 4,281 KB | LUW-Proper-Noun.xlsx | 15,279 KB |
SUW-Symbol.xlsx | 75 KB | LUW-Symbol.xlsx | 88 KB |
SUW-Verb.xlsx | 1,422 KB | LUW-Verb.xlsx | 6,881 KB |
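After extracting either archive, it can be useful to confirm that every listed file is present. The following Python sketch rebuilds the expected file names from the listing above and reports any that are missing from a directory. The function names and the directory argument are hypothetical, and the sketch assumes the workbooks sit at the top level of the extraction directory, since the page does not state the archive names or their internal layout.

```python
from pathlib import Path

# Word-class labels shared by the SUW and LUW listings above.
WORD_CLASSES = [
    "Adjective-i", "Adjective-na", "Adverb", "Affix", "Auxiliary-Symbol",
    "Auxiliary-Verb", "Conjunction", "Interjection", "Noun", "Particle",
    "Prenominal", "Proper-Noun", "Symbol", "Verb",
]


def expected_files(unit):
    """Return the file names listed for one unit type ("SUW" or "LUW")."""
    if unit not in ("SUW", "LUW"):
        raise ValueError("unit must be 'SUW' or 'LUW'")
    names = []
    for word_class in WORD_CLASSES:
        if unit == "LUW" and word_class == "Noun":
            # The LUW noun list is split across four workbooks.
            names.extend(f"LUW-Noun-{part}.xlsx" for part in range(1, 5))
        else:
            names.append(f"{unit}-{word_class}.xlsx")
    return names


def missing_files(directory, unit):
    """List expected files absent from an extraction directory (hypothetical path)."""
    root = Path(directory)
    return [name for name in expected_files(unit) if not (root / name).is_file()]


# The counts match the 14 SUW and 17 LUW files stated under the download links.
print(len(expected_files("SUW")), len(expected_files("LUW")))
```

For example, `missing_files("extracted/LUW", "LUW")` would return an empty list when all 17 workbooks are in place (the path is a placeholder).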
Excel column labels |
Column Label | Explanation |
Lemma | Essentially equates to a headword within UniDic for SUWs, and to a combination of one or more UniDic headwords for LUWs |
OrthVar | Number of orthographic variants for a lemma |
Etymol | Etymology of the lemma: 和 = Native-Japanese, 漢 = Sino-Japanese, 混 = Mixed Japanese + Sino-Japanese, 外 = Foreign loan, 固 = Proper noun, 記号 = Symbols, 不明 = Unknown |
Freq | Frequency of the lemma within the BCCWJ corpus |
OrthBase | Orthographic variant (base form) of the lemma |
PronOrthB | Pronunciation of the orthographic base form |
OrthBLen | Character length of the orthographic base form |
OrthBFreq | Frequency of a particular orthographic base form |
OrthCover | Ratio of total lemma frequency covered by a particular orthographic base form |
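To make the relationship between the frequency columns concrete, the Python sketch below computes OrthCover as OrthBFreq divided by Freq, which is one reading of the description above. The variant labels and all frequencies in the example are invented for illustration and are not taken from the word lists; the helper name `orth_cover` is likewise hypothetical.

```python
def orth_cover(orth_b_freq, lemma_freq):
    """Share of a lemma's total frequency accounted for by one orthographic base form."""
    if lemma_freq <= 0:
        raise ValueError("lemma frequency must be positive")
    return orth_b_freq / lemma_freq


# Invented rows for a single lemma: (OrthBase, OrthBFreq). Not real BCCWJ data.
variants = [("variant-A", 600), ("variant-B", 300), ("variant-C", 100)]

# Freq is the lemma's total frequency; here it is assumed to equal
# the sum of the frequencies of its orthographic base forms.
freq = sum(count for _, count in variants)

# OrthVar would then be the number of variant rows for the lemma.
orth_var = len(variants)

for orth_base, orth_b_freq in variants:
    print(orth_base, orth_b_freq, f"{orth_cover(orth_b_freq, freq):.2f}")
```

Under these assumptions, the OrthCover values for all variants of one lemma sum to 1.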
References |
Den, Yasuharu, Ogiso, Toshinobu, Ogura, Hideki, Yamada, Atsushi, Minematsu, Nobuaki, Uchimoto, Kiyotaka, & Koiso, Hanae. (2007). Kōpasu nihongogaku no tame no gengo shigen: Keitaisokaisekiyō denshijisho no kaihatsu to ōyō [The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics]. Nihongo Kagaku [Japanese Linguistics], 22, 101-122. |
Joyce, Terry, Hodošček, Bor, & Nishina, Kikuko. (2012). Orthographic representation and variation within the Japanese writing system: Some corpus-based observations. (Special issue: Units of language - units of writing, edited by Terry Joyce and David Roberts), Written Language and Literacy, 15(2), 254-278. DOI: 10.1075/wll.15.2.07joy |
National Institute for Japanese Language and Linguistics (NINJAL) [Kokuritsu Kokugo Kenkyūjo]. (2011). Tokuteiryōiki kenkyū nihongo kōpasu kenkyū seika hōkoku [Priority-Area Research “Japanese Corpus”: Research Report] [DVD format of data and research reports]. Tokyo: General Headquarters, Priority-Area Research “Japanese Corpus”. |