Corpus word lists |
[Updated: 06/01/2014] |
The primary purpose of this webpage is to make publicly available the BCCWJ-corpus-based Japanese word lists created and reported on in Joyce, Hodošček & Nishina (2012), together with three tables of information providing (1) corpus type and token counts, (2) listings of the files within the two compressed archives, and (3) explanations of the Excel column labels. |
Download links |
The corpus word lists are grouped according to the two word unit definitions used within the BCCWJ project (NINJAL, 2011) and UniDic (Den et al., 2007), namely short unit words (SUW) and long unit words (LUW). |
SUW data: | 14 files compressed into one archive (13 MB) | download |
LUW data: | 17* files compressed into one archive (115 MB) | download |
| *The LUW noun word list was split into four Excel files, as it was too large for one worksheet. |
Respecting standard courtesies in these matters, individuals who use the corpus word lists in published works are politely asked to cite the Joyce et al. (2012) paper. |
Corpus type and token counts |
Word Class | Units | Tokens | Lemma Types | Orthographic Types | LOTR | SOFR |
Nouns | SUW | 35,450,054 | 99,229 | 141,649 | 0.70 | 0.77 |
| LUW | 23,993,514 | 1,935,336 | 2,037,164 | 0.95 | 0.96 |
Verbs | SUW | 14,117,717 | 9,311 | 27,540 | 0.34 | 0.39 |
| LUW | 10,556,859 | 95,640 | 127,960 | 0.75 | 0.82 |
i-Adjectives | SUW | 1,584,323 | 779 | 3,061 | 0.25 | 0.22 |
| LUW | 1,331,979 | 11,776 | 15,815 | 0.74 | 0.82 |
Adverbs | SUW | 1,825,691 | 3,050 | 7,078 | 0.43 | 0.35 |
| LUW | 2,266,356 | 31,556 | 37,462 | 0.84 | 0.88 |
Others | SUW | 51,724,211 | 63,339 | 98,464 | 0.64 | 0.77 |
| LUW | 45,436,956 | 322,207 | 329,962 | 0.98 | 0.98 |
Totals | SUW | 104,701,996 | 175,708 | 277,792 | 0.63 | 0.74 |
| LUW | 83,585,664 | 2,396,515 | 2,548,363 | 0.94 | 0.96 |
Based on Table 3 from Joyce et al. (2012: 262), which presents token, lemma type, and orthographic type counts for the four main word lists (by word class: nouns, verbs, i-adjectives, and adverbs) and the 'others' group. |
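The page does not define the LOTR column. The printed values are consistent with reading it as lemma types divided by orthographic types, rounded to two decimal places. That reading is an inference from the numbers in the table, not a definition taken from the paper. The sketch below checks it against every row and also confirms that the per-class counts add up to the Totals rows. SOFR cannot be derived from the counts shown here, so it is left out. The function names `lotr` and `column_sums` are illustrative only.

```python
# Counts copied from the table above, keyed by (word class, unit):
# (tokens, lemma types, orthographic types, LOTR as printed).
ROWS = {
    ("Nouns", "SUW"): (35_450_054, 99_229, 141_649, 0.70),
    ("Nouns", "LUW"): (23_993_514, 1_935_336, 2_037_164, 0.95),
    ("Verbs", "SUW"): (14_117_717, 9_311, 27_540, 0.34),
    ("Verbs", "LUW"): (10_556_859, 95_640, 127_960, 0.75),
    ("i-Adjectives", "SUW"): (1_584_323, 779, 3_061, 0.25),
    ("i-Adjectives", "LUW"): (1_331_979, 11_776, 15_815, 0.74),
    ("Adverbs", "SUW"): (1_825_691, 3_050, 7_078, 0.43),
    ("Adverbs", "LUW"): (2_266_356, 31_556, 37_462, 0.84),
    ("Others", "SUW"): (51_724_211, 63_339, 98_464, 0.64),
    ("Others", "LUW"): (45_436_956, 322_207, 329_962, 0.98),
}

# Totals rows as printed: (tokens, lemma types, orthographic types, LOTR).
TOTALS = {
    "SUW": (104_701_996, 175_708, 277_792, 0.63),
    "LUW": (83_585_664, 2_396_515, 2_548_363, 0.94),
}


def lotr(lemma_types: int, orth_types: int) -> float:
    """Assumed reading of LOTR: lemma types / orthographic types, 2 d.p."""
    return round(lemma_types / orth_types, 2)


def column_sums(unit: str) -> tuple:
    """Sum tokens, lemma types and orthographic types over all word classes."""
    selected = [row for (_, u), row in ROWS.items() if u == unit]
    return tuple(sum(row[i] for row in selected) for i in range(3))


if __name__ == "__main__":
    for (word_class, unit), (_, lemmas, orths, printed) in ROWS.items():
        computed = lotr(lemmas, orths)
        print(f"{word_class:<13}{unit}  computed={computed:.2f}  table={printed:.2f}")
    for unit, printed_totals in TOTALS.items():
        print(unit, "sums match Totals row:", column_sums(unit) == printed_totals[:3])
```

Under this reading, a low LOTR (for example SUW i-adjectives) means that each lemma has, on average, several orthographic forms, while a value near 1 means close to one form per lemma.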
File names and uncompressed sizes in respective archives |
SUW file name | File size | LUW file name | File size |
SUW-Adjective-i.xlsx | 160 KB | LUW-Adjective-i.xlsx | 836 KB |
SUW-Adjective-na.xlsx | 180 KB | LUW-Adjective-na.xlsx | 2,011 KB |
SUW-Adverb.xlsx | 357 KB | LUW-Adverb.xlsx | 2,025 KB |
SUW-Affix.xlsx | 85 KB | LUW-Affix.xlsx | 43 KB |
SUW-Auxiliary-Symbol.xlsx | 110 KB | LUW-Auxiliary-Symbol.xlsx | 114 KB |
SUW-Auxiliary-Verb.xlsx | 13 KB | LUW-Auxiliary-Verb.xlsx | 29 KB |
SUW-Conjunction.xlsx | 11 KB | LUW-Conjunction.xlsx | 45 KB |
SUW-Interjection.xlsx | 77 KB | LUW-Interjection.xlsx | 86 KB |
SUW-Noun.xlsx | 7,025 KB | LUW-Noun-1.xlsx | 29,622 KB |
| | LUW-Noun-2.xlsx | 29,703 KB |
| | LUW-Noun-3.xlsx | 29,635 KB |
| | LUW-Noun-4.xlsx | 29,379 KB |
SUW-Particle.xlsx | 27 KB | LUW-Particle.xlsx | 37 KB |
SUW-Prenominal.xlsx | 13 KB | LUW-Prenominal.xlsx | 15 KB |
SUW-Proper-Noun.xlsx | 4,281 KB | LUW-Proper-Noun.xlsx | 15,279 KB |
SUW-Symbol.xlsx | 75 KB | LUW-Symbol.xlsx | 88 KB |
SUW-Verb.xlsx | 1,422 KB | LUW-Verb.xlsx | 6,881 KB |
Excel column labels |
Column Label | Explanation |
Lemma | Essentially equates to the headword within UniDic for SUWs, and to a combination of one or more headwords within UniDic for LUWs |
OrthVar | Number of orthographic variants for a lemma |
Etymol | Etymology of the lemma: 和 = Native-Japanese, 漢 = Sino-Japanese, 混 = Mixed Japanese + Sino-Japanese, 外 = Foreign loan, 固 = Proper noun, 記号 = Symbols, 不明 = Unknown |
Freq | Frequency of the lemma within the BCCWJ corpus |
OrthBase | Orthographic variation of the lemma |
PronOrthB | Pronunciation of the orthographic base form |
OrthBLen | Character length of the orthographic base form |
OrthBFreq | Frequency of a particular orthographic base form |
OrthCover | Ratio of total lemma frequency covered by a particular orthographic base form |
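To make the relationship between the frequency columns concrete, here is a minimal sketch of how OrthVar, OrthBLen, and OrthCover could be derived for one lemma from its OrthBase and OrthBFreq values. It rests on two assumptions not stated on this page: that Freq equals the sum of OrthBFreq over all variants of the lemma, and that OrthCover is OrthBFreq divided by Freq. The frequencies below are invented for illustration and are not taken from the word lists; `OrthRow` and `describe` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class OrthRow:
    lemma: str         # Lemma column
    orth_base: str     # OrthBase column
    orth_b_freq: int   # OrthBFreq column


def describe(rows):
    """Derive Freq, OrthVar, OrthBLen and OrthCover for the rows of one lemma.

    Assumes Freq is the sum of OrthBFreq over the lemma's variants and
    OrthCover = OrthBFreq / Freq; neither is confirmed by the page.
    """
    freq = sum(r.orth_b_freq for r in rows)
    return [
        {
            "Lemma": r.lemma,
            "OrthVar": len(rows),
            "Freq": freq,
            "OrthBase": r.orth_base,
            "OrthBLen": len(r.orth_base),  # length in characters
            "OrthBFreq": r.orth_b_freq,
            "OrthCover": round(r.orth_b_freq / freq, 2),
        }
        for r in rows
    ]


# Invented frequencies for illustration only (not real BCCWJ counts).
EXAMPLE = [
    OrthRow("分かる", "分かる", 600),
    OrthRow("分かる", "わかる", 350),
    OrthRow("分かる", "解る", 30),
    OrthRow("分かる", "判る", 20),
]

if __name__ == "__main__":
    for record in describe(EXAMPLE):
        print(record)
```

If the assumptions hold, the OrthCover values for a single lemma sum to 1, up to rounding, which gives a quick consistency check when working with the spreadsheets.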
References |
Den, Yasuharu, Toshinobu Ogiso, Hideki Ogura, Atsushi Yamada, Nobuaki Minematsu, Kiyotaka Uchimoto, & Hanae Koiso. (2007). Kōpasu nihongogaku no tame no gengo shigen: Keitaisokaisekiyō denshijisho no kaihatsu to ōyō [The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics], Nihongo Kagaku [Japanese Linguistics], 22, 101-122. |
Joyce, Terry, Hodošček, Bor, & Nishina, Kikuko. (2012). Orthographic representation and variation within the Japanese writing system: Some corpus-based observations. (Special issue: Units of language - units of writing, edited by Terry Joyce and David Roberts), Written Language and Literacy, 15(2), 254-278. DOI: 10.1075/wll.15.2.07joy |
National Institute for Japanese Language and Linguistics (NINJAL) [Kokuritsu Kokugo Kenkyūjo]. (2011). Tokuteiryōiki kenkyū nihongo kōpasu kenkyū seika hōkoku [Priority-Area Research “Japanese Corpus”: Research Report] [DVD format of data and research reports]. Tokyo: General Headquarters, Priority-Area Research “Japanese Corpus”. |