Terry Joyce: Corpus word lists

Corpus word lists

[Updated: 06/01/2014]

The prime purpose of this webpage is to make publicly available the BCCWJ-corpus-based Japanese word lists created and reported in Joyce, Hodošček & Nishina (2012), together with three tables of information providing (1) corpus type and token counts, (2) listings of the files within the two compressed archives, and (3) explanations of the Excel column labels.

Download links
The corpus word lists are grouped according to the two word unit definitions used within the BCCWJ project (NINJAL, 2011) and UniDic (Den et al., 2007), namely short unit words (SUW) and long unit words (LUW).
SUW data:	14 files compressed into one archive (13 MB)	download
LUW data:	17* files compressed into one archive (115 MB)	download
	*The LUW noun word list was split into four Excel files because it was too large for a single worksheet.
Respecting standard courtesies in these matters, individuals who use the corpus word lists within published works are politely asked to cite the Joyce et al. (2012) paper.

Tables of information

Corpus type and token counts
Word Class	Units	Tokens	Lemma Types	Orthographic Types	LOTR	SOFR
Nouns	SUW	35,450,054	99,229	141,649	0.70	0.77
Nouns	LUW	23,993,514	1,935,336	2,037,164	0.95	0.96
Verbs	SUW	14,117,717	9,311	27,540	0.34	0.39
Verbs	LUW	10,556,859	95,640	127,960	0.75	0.82
i-Adjectives	SUW	1,584,323	779	3,061	0.25	0.22
i-Adjectives	LUW	1,331,979	11,776	15,815	0.74	0.82
Adverbs	SUW	1,825,691	3,050	7,078	0.43	0.35
Adverbs	LUW	2,266,356	31,556	37,462	0.84	0.88
Others	SUW	51,724,211	63,339	98,464	0.64	0.77
Others	LUW	45,436,956	322,207	329,962	0.98	0.98
Totals	SUW	104,701,996	175,708	277,792	0.63	0.74
Totals	LUW	83,585,664	2,396,515	2,548,363	0.94	0.96
Based on Table 3 from Joyce et al. (2012: 262), which presents token, lemma type, and orthographic type counts for the four main word lists (according to word class: nouns, verbs, i-adjectives, and adverbs) and the 'others' group.
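This page does not spell out the LOTR column, but the reported values are consistent with lemma types divided by orthographic types. The following is a minimal sketch that rechecks this reading against the table; the function names and the rounding tolerance are assumptions made for illustration, and SOFR is left out because it cannot be derived from the counts shown here.

```python
# Counts copied from the table above:
# (word class, unit) -> (lemma types, orthographic types, reported LOTR)
ROWS = {
    ("Nouns", "SUW"): (99_229, 141_649, 0.70),
    ("Nouns", "LUW"): (1_935_336, 2_037_164, 0.95),
    ("Verbs", "SUW"): (9_311, 27_540, 0.34),
    ("Verbs", "LUW"): (95_640, 127_960, 0.75),
    ("i-Adjectives", "SUW"): (779, 3_061, 0.25),
    ("i-Adjectives", "LUW"): (11_776, 15_815, 0.74),
    ("Adverbs", "SUW"): (3_050, 7_078, 0.43),
    ("Adverbs", "LUW"): (31_556, 37_462, 0.84),
    ("Others", "SUW"): (63_339, 98_464, 0.64),
    ("Others", "LUW"): (322_207, 329_962, 0.98),
    ("Totals", "SUW"): (175_708, 277_792, 0.63),
    ("Totals", "LUW"): (2_396_515, 2_548_363, 0.94),
}


def lemma_to_orth_ratio(lemma_types, orth_types):
    """Assumed reading of LOTR: lemma types divided by orthographic types."""
    return lemma_types / orth_types


def check_rows(rows=ROWS, tol=0.005):
    """Return the rows whose recomputed ratio differs from the reported
    two-decimal value by more than tol."""
    mismatches = []
    for key, (lemma, orth, reported) in rows.items():
        if abs(lemma_to_orth_ratio(lemma, orth) - reported) > tol:
            mismatches.append(key)
    return mismatches
```

Under this reading, a ratio close to 1 (as for the LUW rows) means that most lemmas have a single orthographic form, whereas lower ratios (as for SUW verbs and i-adjectives) indicate more orthographic forms per lemma.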

File names and uncompressed sizes in respective archives
SUW file name	File size	LUW file name	File size
SUW-Adjective-i.xlsx	160 KB	LUW-Adjective-i.xlsx	836 KB
SUW-Adjective-na.xlsx	180 KB	LUW-Adjective-na.xlsx	2,011 KB
SUW-Adverb.xlsx	357 KB	LUW-Adverb.xlsx	2,025 KB
SUW-Affix.xlsx	85 KB	LUW-Affix.xlsx	43 KB
SUW-Auxiliary-Symbol.xlsx	110 KB	LUW-Auxiliary-Symbol.xlsx	114 KB
SUW-Auxiliary-Verb.xlsx	13 KB	LUW-Auxiliary-Verb.xlsx	29 KB
SUW-Conjunction.xlsx	11 KB	LUW-Conjunction.xlsx	45 KB
SUW-Interjection.xlsx	77 KB	LUW-Interjection.xlsx	86 KB
SUW-Noun.xlsx	7,025 KB	LUW-Noun-1.xlsx	29,622 KB
		LUW-Noun-2.xlsx	29,703 KB
		LUW-Noun-3.xlsx	29,635 KB
		LUW-Noun-4.xlsx	29,379 KB
SUW-Particle.xlsx	27 KB	LUW-Particle.xlsx	37 KB
SUW-Prenominal.xlsx	13 KB	LUW-Prenominal.xlsx	15 KB
SUW-Proper-Noun.xlsx	4,281 KB	LUW-Proper-Noun.xlsx	15,279 KB
SUW-Symbol.xlsx	75 KB	LUW-Symbol.xlsx	88 KB
SUW-Verb.xlsx	1,422 KB	LUW-Verb.xlsx	6,881 KB
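After extracting an archive, the file listing above can be used to confirm that nothing is missing. The sketch below rebuilds the expected names from the naming pattern in the table; the helper names and the extraction directory are hypothetical, and nothing is assumed about the archive format itself.

```python
from pathlib import Path

# Word-class labels shared by both archives, taken from the table above.
WORD_CLASSES = [
    "Adjective-i", "Adjective-na", "Adverb", "Affix", "Auxiliary-Symbol",
    "Auxiliary-Verb", "Conjunction", "Interjection", "Noun", "Particle",
    "Prenominal", "Proper-Noun", "Symbol", "Verb",
]


def expected_files(unit):
    """List the .xlsx names the table gives for unit 'SUW' or 'LUW'."""
    names = []
    for word_class in WORD_CLASSES:
        if unit == "LUW" and word_class == "Noun":
            # The LUW noun list is split into four numbered files.
            names.extend(f"LUW-Noun-{part}.xlsx" for part in range(1, 5))
        else:
            names.append(f"{unit}-{word_class}.xlsx")
    return names


def missing_files(directory, unit):
    """Return expected names not found in a (hypothetical) extraction directory."""
    present = {path.name for path in Path(directory).glob("*.xlsx")}
    return [name for name in expected_files(unit) if name not in present]
```

For example, `missing_files("extracted/LUW", "LUW")` would return an empty list once all 17 LUW files are present in that directory.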

Excel column labels
Column Label	Explanation
Lemma	Essentially equates to the headword within UniDic for SUWs, and to a combination of one or more UniDic headwords for LUWs
OrthVar	Number of orthographic variants for a lemma
Etymol	Etymology of the lemma: 和 = Native-Japanese, 漢 = Sino-Japanese, 混 = Mixed Japanese + Sino-Japanese, 外 = Foreign loan, 固 = Proper noun, 記号 = Symbols, 不明 = Unknown
Freq	Frequency of the lemma within the BCCWJ corpus
OrthBase	Orthographic variant (base form) of the lemma
PronOrthB	Pronunciation of the orthographic base form
OrthBLen	Character length of the orthographic base form
OrthBFreq	Frequency of a particular orthographic base form
OrthCover	Ratio of total lemma frequency covered by a particular orthographic base form
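The relationship between the frequency columns can be illustrated with a short sketch. It assumes that Freq is the sum of OrthBFreq over all orthographic base forms of a lemma, that OrthVar counts those forms, and that OrthCover is OrthBFreq divided by Freq; these are readings of the explanations above, not definitions confirmed by the word lists. The rows and the `summarize` helper are invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows (Lemma, OrthBase, OrthBFreq); values are invented.
ROWS = [
    ("lemma-A", "variant-1", 60),
    ("lemma-A", "variant-2", 30),
    ("lemma-A", "variant-3", 10),
    ("lemma-B", "variant-1", 5),
]


def summarize(rows):
    """Derive Freq, OrthVar and OrthCover per row under the assumed definitions."""
    freq = defaultdict(int)      # Freq: total frequency of each lemma
    variants = defaultdict(set)  # distinct orthographic base forms per lemma
    for lemma, orth, count in rows:
        freq[lemma] += count
        variants[lemma].add(orth)
    return [
        {
            "Lemma": lemma,
            "OrthBase": orth,
            "OrthBFreq": count,
            "Freq": freq[lemma],
            "OrthVar": len(variants[lemma]),
            "OrthCover": count / freq[lemma],
        }
        for lemma, orth, count in rows
    ]
```

With the invented rows, the first form of `lemma-A` accounts for 60 of 100 occurrences, so its OrthCover would be 0.6, while the only form of `lemma-B` has an OrthCover of 1.0.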

References

Den, Yasuharu, Toshinobu Ogiso, Hideki Ogura, Atsushi Yamada, Nobuaki Minematsu, Kiyotaka Uchimoto, & Hanae Koiso. (2007). Kōpasu nihongogaku no tame no gengo shigen: Keitaisokaisekiyō denshijisho no kaihatsu to ōyō [The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics], Nihongo Kagaku [Japanese Linguistics], 22, 101-122.

Joyce, Terry, Hodošček, Bor, & Nishina, Kikuko. (2012). Orthographic representation and variation within the Japanese writing system: Some corpus-based observations. (Special issue: Units of language - units of writing, edited by Terry Joyce and David Roberts), Written Language and Literacy, 15(2), 254-278. DOI: 10.1075/wll.15.2.07joy

National Institute for Japanese Language and Linguistics (NINJAL) [Kokuritsu Kokugo Kenkyūjo]. (2011). Tokuteiryōiki kenkyū nihongo kōpasu kenkyū seika hōkoku [Priority-Area Research “Japanese Corpus”: Research Report] [DVD format of data and research reports]. Tokyo: General Headquarters, Priority-Area Research “Japanese Corpus”.