Corpus word lists
[Updated: 06/01/2014]
The primary purpose of this webpage is to make publicly available the BCCWJ-corpus-based Japanese word lists created and reported on in Joyce, Hodošček & Nishina (2012), together with three tables of information providing (1) corpus type and token counts, (2) listings of the files within the two compressed archives, and (3) explanations of the Excel column labels.

Download links
The corpus word lists are grouped according to the two word unit definitions used within the BCCWJ project (NINJAL, 2011) and UniDic (Den et al., 2007), namely, short unit words (SUW) and long unit words (LUW).
SUW data: 14 files compressed into one archive (13 MB) download
LUW data: 17* files compressed into one archive (115 MB) download
*The LUW noun word list was split into four Excel files because it was too large for a single worksheet.
Respecting standard courtesies in these matters, individuals who use the corpus word lists in published works are politely asked to cite the Joyce et al. (2012) paper.

Tables of information
Corpus type and token counts
Word Class    Units  Tokens       Lemma Types  Orthographic Types  LOTR  SOFR
Nouns         SUW    35,450,054   99,229       141,649             0.70  0.77
Nouns         LUW    23,993,514   1,935,336    2,037,164           0.95  0.96
Verbs         SUW    14,117,717   9,311        27,540              0.34  0.39
Verbs         LUW    10,556,859   95,640       127,960             0.75  0.82
i-Adjectives  SUW    1,584,323    779          3,061               0.25  0.22
i-Adjectives  LUW    1,331,979    11,776       15,815              0.74  0.82
Adverbs       SUW    1,825,691    3,050        7,078               0.43  0.35
Adverbs       LUW    2,266,356    31,556       37,462              0.84  0.88
Others        SUW    51,724,211   63,339       98,464              0.64  0.77
Others        LUW    45,436,956   322,207      329,962             0.98  0.98
Totals        SUW    104,701,996  175,708      277,792             0.63  0.74
Totals        LUW    83,585,664   2,396,515    2,548,363           0.94  0.96
Based on Table 3 from Joyce et al. (2012: 262), which presents token, lemma type, and orthographic type counts for the four main word lists (by word class: nouns, verbs, i-adjectives, and adverbs) and the 'others' group.
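The LOTR values in the table are consistent with dividing the lemma type count by the orthographic type count and rounding to two decimals. That reading is inferred from the figures themselves and is not stated on this page, so the Python sketch below should be taken as a check under that assumption; the names `COUNTS` and `lemma_orth_ratio` are hypothetical.

```python
# Counts copied from the table above:
# (lemma types, orthographic types, LOTR as listed).
# Assumption: LOTR = lemma types / orthographic types (inferred, not documented here).
COUNTS = {
    ("Nouns", "SUW"): (99_229, 141_649, 0.70),
    ("Nouns", "LUW"): (1_935_336, 2_037_164, 0.95),
    ("Verbs", "SUW"): (9_311, 27_540, 0.34),
    ("Verbs", "LUW"): (95_640, 127_960, 0.75),
    ("i-Adjectives", "SUW"): (779, 3_061, 0.25),
    ("i-Adjectives", "LUW"): (11_776, 15_815, 0.74),
    ("Adverbs", "SUW"): (3_050, 7_078, 0.43),
    ("Adverbs", "LUW"): (31_556, 37_462, 0.84),
    ("Others", "SUW"): (63_339, 98_464, 0.64),
    ("Others", "LUW"): (322_207, 329_962, 0.98),
    ("Totals", "SUW"): (175_708, 277_792, 0.63),
    ("Totals", "LUW"): (2_396_515, 2_548_363, 0.94),
}


def lemma_orth_ratio(lemma_types, orth_types):
    """Ratio of lemma types to orthographic types."""
    return lemma_types / orth_types


# Print the recomputed ratio next to the value listed in the table.
for (word_class, unit), (lemmas, orths, listed) in COUNTS.items():
    ratio = lemma_orth_ratio(lemmas, orths)
    print(f"{word_class:<13}{unit}  computed {ratio:.2f}  listed {listed:.2f}")
```

A ratio close to 1 means that most lemmas have a single orthographic form, which is the pattern shown by the LUW rows.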

File names and uncompressed sizes in respective archives
SUW file name              File size  LUW file name              File size
SUW-Adjective-i.xlsx       160 KB     LUW-Adjective-i.xlsx       836 KB
SUW-Adjective-na.xlsx      180 KB     LUW-Adjective-na.xlsx      2,011 KB
SUW-Adverb.xlsx            357 KB     LUW-Adverb.xlsx            2,025 KB
SUW-Affix.xlsx             85 KB      LUW-Affix.xlsx             43 KB
SUW-Auxiliary-Symbol.xlsx  110 KB     LUW-Auxiliary-Symbol.xlsx  114 KB
SUW-Auxiliary-Verb.xlsx    13 KB      LUW-Auxiliary-Verb.xlsx    29 KB
SUW-Conjunction.xlsx       11 KB      LUW-Conjunction.xlsx       45 KB
SUW-Interjection.xlsx      77 KB      LUW-Interjection.xlsx      86 KB
SUW-Noun.xlsx              7,025 KB   LUW-Noun-1.xlsx            29,622 KB
                                      LUW-Noun-2.xlsx            29,703 KB
                                      LUW-Noun-3.xlsx            29,635 KB
                                      LUW-Noun-4.xlsx            29,379 KB
SUW-Particle.xlsx          27 KB      LUW-Particle.xlsx          37 KB
SUW-Prenominal.xlsx        13 KB      LUW-Prenominal.xlsx        15 KB
SUW-Proper-Noun.xlsx       4,281 KB   LUW-Proper-Noun.xlsx       15,279 KB
SUW-Symbol.xlsx            75 KB      LUW-Symbol.xlsx            88 KB
SUW-Verb.xlsx              1,422 KB   LUW-Verb.xlsx              6,881 KB
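Because the LUW noun list is spread over four files, it usually needs to be recombined before analysis. The following is a minimal Python sketch, assuming that the LUW archive has been extracted to a local directory, that each workbook keeps its data on the first sheet with the column labels in the first row, and that pandas and openpyxl are installed. The function names and the directory argument are hypothetical.

```python
from pathlib import Path


def luw_noun_parts(directory):
    """Return the paths of the four LUW noun files named in the table above."""
    return [Path(directory) / f"LUW-Noun-{i}.xlsx" for i in range(1, 5)]


def load_luw_nouns(directory):
    """Concatenate the four LUW noun workbooks into one DataFrame.

    Assumptions: pandas and openpyxl are installed, and every workbook
    has the same header row on its first sheet.
    """
    import pandas as pd  # imported lazily so the path helper works without pandas

    frames = [pd.read_excel(path) for path in luw_noun_parts(directory)]
    return pd.concat(frames, ignore_index=True)
```

A call such as `load_luw_nouns("LUW")` would then return a single table. Given the uncompressed sizes listed above (roughly 30 MB per file), loading all four parts may take some time and memory.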

Excel column labels
Column Label Explanation
Lemma Essentially, equates to headword within UniDic for SUWs, and combination of one or more headwords within UniDic for LUWs
OrthVar Number of orthographic variants for a lemma
Etymol Etymology of the lemma:
和 = Native-Japanese, 漢 = Sino-Japanese, 混 = Mixed Japanese + Sino-Japanese,
外 = Foreign loan, 固 = Proper noun, 記号 = Symbols, 不明 = Unknown
Freq Frequency of the lemma within the BCCWJ corpus
OrthBase Orthographic variant (base form) of the lemma
PronOrthB Pronunciation of the orthographic base form
OrthBLen Character length of the orthographic base form
OrthBFreq Frequency of a particular orthographic base form
OrthCover Ratio of total lemma frequency covered by a particular orthographic base form
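The relation between Freq, OrthBFreq, and OrthCover described above can be sketched in Python as follows. The rows are hypothetical placeholders rather than entries from the word lists, and the function name `orth_cover` is an assumption. Whether the files store OrthCover as a fraction or as a percentage is not stated here, so the sketch uses a fraction.

```python
# Hypothetical rows keyed by the column labels above (illustrative values only).
rows = [
    {"Lemma": "lemma_a", "Freq": 100, "OrthBase": "form_1", "OrthBFreq": 60},
    {"Lemma": "lemma_a", "Freq": 100, "OrthBase": "form_2", "OrthBFreq": 30},
    {"Lemma": "lemma_a", "Freq": 100, "OrthBase": "form_3", "OrthBFreq": 10},
]


def orth_cover(row):
    """Share of the lemma's total frequency covered by one orthographic base form."""
    return row["OrthBFreq"] / row["Freq"]


# Print each orthographic base form with its coverage of the lemma frequency.
for row in rows:
    print(row["Lemma"], row["OrthBase"], f"{orth_cover(row):.2f}")
```

In this made-up case the three forms together account for the whole lemma frequency, so their coverage values sum to 1.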

Den, Yasuharu, Ogiso, Toshinobu, Ogura, Hideki, Yamada, Atsushi, Minematsu, Nobuaki, Uchimoto, Kiyotaka, & Koiso, Hanae. (2007). Kōpasu nihongogaku no tame no gengo shigen: Keitaisokaisekiyō denshijisho no kaihatsu to ōyō [The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics]. Nihongo Kagaku [Japanese Linguistics], 22, 101-122.
Joyce, Terry, Hodošček, Bor, & Nishina, Kikuko. (2012). Orthographic representation and variation within the Japanese writing system: Some corpus-based observations. (Special issue: Units of language - units of writing, edited by Terry Joyce and David Roberts), Written Language and Literacy, 15(2), 254-278. DOI: 10.1075/wll.15.2.07joy
National Institute for Japanese Language and Linguistics (NINJAL) [Kokuritsu Kokugo Kenkyūjo]. (2011). Tokuteiryōiki kenkyū nihongo kōpasu kenkyū seika hōkoku [Priority-Area Research “Japanese Corpus”: Research Report] [DVD format of data and research reports]. Tokyo: General Headquarters, Priority-Area Research “Japanese Corpus”.