Monday, September 1, 2014

Creating A Word List For Language Study

So I'm trying to come up with a “master list” of English vocabulary words that I can use as a base to translate into other languages. My thought is that if I come up with a list of words that covers a broad gamut of the concepts that English speakers express, then translating those words into the target language should provide a good basis for plundering towards fluency in that language, whether through flash cards, oral materials that reinforce those words, or other language learning materials.

Don't ask WHY I'm doing this. There is no good answer for that. It's just a function of the set of things that I do.

I came up with the seed of this idea a few years ago, when I made a Spanish deck of flash cards for a friend of mine. I simply took the 1000 most frequent English words, put them in a single column in a spreadsheet, copied that column and pasted it into Google Docs, translated the whole column into Spanish, and then pasted the Spanish version into the next column over from the English version in the spreadsheet. I proofread a little and did some tweaking to correct some errors (since I'm fluent in Spanish), and then imported the whole thing into Anki. Then I wrote a short script that generated a second set of flash cards reversing the first set (the first set was all English-Spanish, so the new set was Spanish-English). There, an easy-peasy way to make 2000 flash cards (actually, it was fewer than 2000, because some of the words were duplicates).
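If I were doing that reversal step again today, it might look something like this in Python (a rough sketch only; the file names and the tab-separated layout are my assumptions about how an Anki import file would be set up, not a record of exactly what I did back then):

    # Read an English-Spanish deck and write a Spanish-English version,
    # skipping repeated fronts so the reversed deck has no duplicate cards.
    seen = set()
    with open("english_spanish.txt", encoding="utf-8") as src, \
         open("spanish_english.txt", "w", encoding="utf-8") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            english, spanish = fields[0], fields[1]
            if spanish in seen:
                continue  # this Spanish word already fronts a card
            seen.add(spanish)
            dst.write(f"{spanish}\t{english}\n")

Anki will import a tab-separated text file like spanish_english.txt directly, one card per line.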

So, I wanted to put a little more thought and effort into making a list that would take one to a pretty sophisticated knowledge of the language. I wanted an even mix of frequency and utility, with several levels progressing from basic to advanced. In doing so, I encountered some interesting problems (well, interesting to me anyway), which I'll talk about later.

I started with the Dolch Word List. This is a list of 315 words compiled by Edward Dolch in 1936 and published in 1948. It is subdivided into 220 utility words, graded from pre-primer through third grade, plus a separate list of 95 nouns. The list has a few words that, while I wouldn't call them “archaic” due to the age of the list, have probably shifted somewhat in their frequency and their utility. But it's a good start for my first layer.

Next I went to the 1000 most frequent words in English and broke it into two halves: the first 500 and the next 500. I picked out all the words in each half that had not already been covered by the Dolch List (by creating a formula in OpenOffice Calc that weeded out the words already present), and these two lists became my next two layers. Now I've got 1031 words, which is interesting in that it means there are 31 words in the Dolch List that are not in the 1000 most frequent. One slight problem I see in dealing with frequency is that word frequencies in another language won't match English frequencies exactly, but I figure that most languages used by modern industrial societies speak to a similar enough range of human experiences that it will be close enough. If I get to analyzing a language like Yanomamo, all bets may be off on that.
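The same weeding can be done outside the spreadsheet, and it generalizes to every layer that comes later. Here's a minimal Python sketch of the layering logic, assuming each source list is a plain text file with one word per line (the file names are placeholders of my own):

    def load_words(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip().lower() for line in f if line.strip()]

    # Each layer keeps only the words not claimed by an earlier layer.
    sources = ["dolch.txt", "freq_top_500.txt", "freq_next_500.txt"]
    seen, layers = set(), []
    for path in sources:
        layer = [w for w in load_words(path) if w not in seen]
        seen.update(layer)
        layers.append(layer)

    print([len(layer) for layer in layers])  # words each layer contributes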

Then I turned to the word list of Basic English (turning away from the focus on frequency, back to utility). This is a simplified subset of the English language designed for teaching English as a Second Language. There are about 850 words in Basic English. I took all the Basic English words that had not already appeared in the earlier layers, and those became my fourth layer. There is a similar but larger controlled vocabulary called Special English, consisting of about 1500 words, which is the vocabulary used for broadcasting by the Voice of America (the theory being that a simplified vocabulary will be much more understandable to foreign listeners). I decided not to draw from Special English, because I figured it would mostly duplicate my next lists.

Then I turned to the headwords of the General Service List (GSL), a list of about 2000 words published in 1953, drawn from a compendium of common English written sources for the purpose of language learning. The GSL is the first list I ran across that substantially segments out lemmas. A lemma is the core form of a word, the dictionary headword under which related forms are grouped; for example, “running,” “runner,” and “ran” all belong to the lemma “run.” Not only can there be different inflected forms, but the word can also be transformed into a different part of speech, such as a noun made from a verb. And some of the forms can be substantially different, as in the formation of “better” from “good.” So there are two GSL lists, one of just the core words (headwords), and one with the inflected forms of those words. I took the headwords, eliminated duplicates, and this was my fifth layer.
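For what it's worth, there are tools that will collapse inflected forms back down to their headwords. A quick Python sketch using NLTK's WordNet lemmatizer (assuming NLTK is installed and its WordNet data has been downloaded; note that it handles inflections and even irregulars like “better,” but not derivations like “runner”):

    # requires: import nltk; nltk.download("wordnet")
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    # Inflections collapse back to the headword...
    print(wnl.lemmatize("running", pos="v"))  # run
    print(wnl.lemmatize("ran", pos="v"))      # run
    print(wnl.lemmatize("better", pos="a"))   # good
    # ...but a derived form in another part of speech does not.
    print(wnl.lemmatize("runner", pos="n"))   # runner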

At this point, I'm starting to notice some fairly significant problems. The biggest one is that categories of words that should be kept together are not kept together, and it is hard for me to tell whether the sets of those words are missing important members. For instance, some colors are in the Dolch list, some are in Basic English, and some are in the GSL. I haven't even gotten to colors like “mauve” or “turquoise”, just the basic palette. And numbers are all over the place, too. Geographical directions, articles of clothing, kitchen utensils... the most basic functional groupings are scattered and incomplete.
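One way I could chase this down is to audit each functional group against a canonical member list and see which layer, if any, each member landed in. A rough Python sketch of the idea (the toy layers and the color set here are just illustrations, not my real data):

    # Toy layers standing in for the real ones, just to show the audit.
    layers = {
        "Dolch": {"red", "blue", "yellow", "one", "two"},
        "Basic English": {"green", "brown", "grey"},
        "GSL": {"black", "white"},
    }
    colors = ["black", "blue", "brown", "green", "grey",
              "orange", "purple", "red", "white", "yellow"]

    for color in colors:
        homes = [name for name, words in layers.items() if color in words]
        print(color, "->", ", ".join(homes) if homes else "MISSING")

Run against the real layers, the MISSING lines would tell me exactly which members of each group need to be added by hand.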

Another problem is that my duplicate-weeding has eliminated homographs (words spelled alike). Yes, my list is homophobic (discriminatory, though, only in a homographic fashion). Because words like “march” and “may” already appear in the list (“august” actually comes later), those months have been wiped out of my list, and I have to add them back in.
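The underlying trouble is that I deduplicated on raw spelling. One fix would be to carry a sense tag along with each word, so that “may” the modal and “May” the month count as different entries. A minimal sketch of that idea (the sense tags are my own invention):

    # Track entries as (word, sense) pairs so that homographs survive
    # deduplication instead of being silently collapsed.
    entries = [
        ("may", "modal verb"),
        ("may", "month"),
        ("march", "to walk in step"),
        ("march", "month"),
    ]
    seen, kept = set(), []
    for entry in entries:
        if entry not in seen:
            seen.add(entry)
            kept.append(entry)
    print(kept)  # all four survive; plain-string dedup would keep only two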

Then I turned to the New General Service List (NGSL), a 2013 update that has about 800 more words than the original GSL, but covers more concepts in fewer words and eliminates some words that have fallen out of favor in the 60 years since the first list was created. Once again, I'm still just adding in the headwords.

My next stop is the Academic Word List (AWL), a supplement to the GSL of about 570 headwords that are a little more erudite or hoity-toity. So now I've got a little over 3000 words, mostly just headwords.

My next two layers are the inflected and derived forms from the lemmatized versions of the GSL and the AWL. There are a LOT of words here, though not a lot of new concepts. But this is going to help with structure, morphology, lexicology, syntax, and grammar. Now my complete list is a little over 9000 words.
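To keep 9000 entries manageable, those form layers probably want to hang off their headwords rather than stand as independent cards. A sketch of that structure, assuming the lemmatized lists can be parsed into headword/form pairs (the sample pairs here are hypothetical stand-ins for the real files):

    from collections import defaultdict

    # Hypothetical input: (headword, related form) pairs pulled from
    # the lemmatized GSL/AWL lists; the real files would need parsing.
    pairs = [("run", "running"), ("run", "ran"), ("run", "runner"),
             ("good", "better"), ("good", "best")]

    families = defaultdict(list)
    for headword, form in pairs:
        families[headword].append(form)

    # One Anki-ready line per headword, with its family on the back.
    for headword, forms in families.items():
        print(f"{headword}\t{', '.join(forms)}")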

So that's where I am with this right now. It still needs a lot of shaping. I need to firm up the functional groupings, fill in any holes in the structure, and redistribute the levels of learning. I know folks like Charles Berlitz and Paul Pimsleur have probably dealt with matters like this in their own way, but I just want to take a fresh look at it in a methodical fashion, albeit one that does not arise from any disciplined training whatsoever. What's the worst that can happen?