So I'm trying to come up with a “master
list” of English vocabulary words that I can use as a base to
translate into other languages. My thought is that if I come up with
a list of words that covers a good deal of the gamut of concepts that
English speakers express, then translating those words into the
target language should provide a good basis for working toward
fluency in that language, whether through flash cards, oral
materials that reinforce those words, or other language learning
materials.
Don't ask WHY I'm doing this. There is
no good answer for that. It's just a function of the set of things
that I do.
I came up with the seed of this idea a
few years ago, when I made a Spanish deck of flash cards for a friend
of mine. I simply took the 1000 most frequent English words, put
them in a single column in a spreadsheet, copied that column and
pasted it in Google Docs, translated the whole column into Spanish,
and then pasted the Spanish version in the next column over from the
English version in the spreadsheet. I proofread a little and did
some tweaking to correct some errors (since I'm fluent in Spanish),
and then imported the whole thing into Anki. Then I wrote a short
script that generated a second set of flash cards with the fields
reversed (the first set was all English-Spanish, so the new set was
Spanish-English). There: an easy-peasy way to make 2000 flash cards
(actually, it was fewer than 2000, because some of the words were
duplicates).
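The reversal step can be sketched in Python, assuming the deck lives in a simple two-column table (the sample rows here are placeholders, not the actual 1000-word list):

```python
import csv
import io

# A few sample rows standing in for the English-Spanish deck.
deck = [("dog", "perro"), ("water", "agua"), ("water", "agua")]  # note a duplicate

# Swap the fields on each card to get the Spanish-English direction.
reversed_deck = [(es, en) for en, es in deck]

# Combine both directions and drop exact duplicates while keeping order,
# which is why the final deck comes out smaller than 2x the original list.
combined = list(dict.fromkeys(deck + reversed_deck))

# Write something Anki can import: one card per line, tab-separated.
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(combined)
```

Anki imports a tab-separated file like this directly, one card per line.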
So, I wanted to put a little more
thought and effort into making a list that would take one to a pretty
sophisticated knowledge of the language. I wanted to have an even
mix of frequency and utility, and have the list have several levels
that progressed from basic to advanced. In doing so, I encountered
some interesting problems (well, interesting to me anyway), which
I'll talk about later.
I started with the Dolch Word List.
This is a list of 315 words created by Edward William Dolch in 1936,
and published in 1948. It is subdivided into pre-primer through
third-grade utility words, plus a separate list of nouns. The list has a
few words that, while I wouldn't call them “archaic” due to the
age of the list, probably have shifted somewhat in their frequency
and their utility. But it's a good start for my first layer.
Next I went to the 1000 most frequent
words in English, and broke it into two halves, the first 500, and
the next 500. I picked out all the words in each half that had not
already been covered by the Dolch List (by creating a formula in
OpenOffice Calc that weeded out the words already present), and these two
lists became my next two layers. Now I've got 1031 words, which is
interesting in that it means there are 31 words in the Dolch List
that are not in the 1000 most frequent. One slight problem I see in
dealing with frequency is that the frequency of words in another
language might not be the same as the frequency of English words, but
I figure that most languages used by modern industrial societies will
speak to a similar enough range of human experiences that it will be
close enough. If I get to analyzing a language like Yanomamo, all
bets may be off on that.
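The job that Calc formula was doing can be sketched as a running set-difference in Python. The word lists below are tiny stand-ins for the Dolch list and the frequency slices, not the real data:

```python
# Words already covered by the earlier layer (stand-in for the Dolch list).
dolch = {"the", "and", "run", "blue"}

# A frequency slice (stand-in for the first 500 most frequent words).
top_500 = ["the", "of", "and", "run", "time", "people"]

covered = set(dolch)
layer_2 = []
for word in top_500:
    if word not in covered:   # the same membership test the Calc formula performed
        layer_2.append(word)
        covered.add(word)     # so later layers will skip this word too
```

Carrying `covered` forward from layer to layer is what keeps each new layer free of everything that came before it.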
Then I turned to the list of words in
Basic English (turning away from the focus on frequency back to
utility). This is a simplified subset of the English language that
is designed for teaching English as a Second Language. There are
about 850 words in Basic English. I took all the words in Basic
English that were not already present in the earlier layers, and
those became my fourth layer. There is a superset of
Basic English called Special English, that consists of about 1500
words, and is the vocabulary used for broadcasting by the Voice of
America (the theory being that a simplified vocabulary will be much
more understandable to foreign listeners). I decided not to draw
from Special English, because I figured it would mostly duplicate my
next lists.
Then I turned to the headwords of
the General Service List, a list of about 2000 words created in 1953,
drawn from a compendium of common English written sources for the
purpose of language learning. The General Service List (GSL) is the
first list I ran across that substantially separates out
lemmatisation. A lemma is the core dictionary form of a word, under
which its inflected forms are grouped; for example, “running,”
“runner,” and “ran” all trace back to the lemma “run.”
Not only can there be different inflections, but a word can also
shift into a different part of speech, such as a noun made from
a verb. And some of the forms can be substantially different from
their lemma, as in the formation of “better” from “good.” So there are two GSL
lists, one of just core words (headwords), and one with inflected
forms of those words. I took the headwords, eliminated the
duplicates, and this became my fifth layer.
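The headword idea can be illustrated with a toy mapping in Python. The table here is hand-written for the example, not a real lemmatiser:

```python
# Many surface forms map back to one headword, including irregular
# pairs like "better" -> "good". This mapping is illustrative only.
form_to_headword = {
    "running": "run", "runner": "run", "ran": "run",
    "better": "good", "best": "good",
}

def headword(form: str) -> str:
    """Collapse an inflected form to its headword (identity if unknown)."""
    return form_to_headword.get(form, form)

# Collapsing a mixed list of forms leaves only the headwords.
headwords = {headword(w) for w in ["running", "ran", "better", "walk"]}
```

This is why a headwords-only layer is so much shorter than the inflected-forms layer built on top of it: each headword stands in for a whole family of forms.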
At this point, I'm starting to note
some fairly significant problems. The biggest one is that categories
of words that should be kept together are not kept together, and it
is hard for me to tell if the sets of those words are missing
important members. For instance, some colors are in the Dolch list,
some are in Basic English, and some are in the GSL. I haven't even
gotten to colors like “mauve” or “turquoise”, just the basic
palette. And numbers are all over the place, too. Geographical
directions, articles of clothing, kitchen utensils... the most basic
functional groupings are scattered and incomplete.
Another problem is that I have
eliminated homophones. Yes, my list is homophobic (discriminatory,
though, only in a homophonic fashion). Because there are words like
“march”, “may”, and “august” (actually, “august”
comes later), those months have been wiped out from my list, and I
have to add them back in.
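One hedge against this, sketched in Python with a few sample words: deduplicate case-sensitively, so the proper noun “May” survives alongside the common word “may” (this assumes the source lists preserve capitalization, which mine evidently did not):

```python
# Case-sensitive deduplication: "May" and "may" are different strings,
# so the month is not wiped out by the modal verb. Sample words only.
seen = set()
kept = []
for word in ["may", "march", "May", "March", "may"]:
    if word not in seen:      # "May" != "may", so both survive
        seen.add(word)
        kept.append(word)
```

A case-insensitive comparison (e.g. testing `word.lower()`) would collapse the months into their homographs, which is exactly the bug I am now repairing by hand.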
Then I turned to the New General
Service List (NGSL), which has about 800 more words than the GSL, but
offers coverage of more concepts for its size, and eliminates some
words that have fallen out of favor in the 60 years since the first
list was created. Once again, I'm still just adding in the headwords.
My next stop is the Academic Word List
(AWL), a supplement to the GSL, with words that are a little more
erudite or hoity-toity. So now I've got a little over 3000 words,
mostly just headwords.
My next two layers are from lists of
lemmas of the GSL and the AWL. There are a LOT of words here, though
not a lot of new concepts. But this is going to help with structure,
morphology, lexicology, syntax and grammar. Now my complete list is
a little over 9000 words.
So that's where I am at with this right
now. It still needs a lot of shaping. I need to firm up the
functional groupings, fill in any holes in the structure, and
redistribute the levels of learning. I know folks like Charles
Berlitz and Paul Pimsleur have probably dealt with matters like this
in their own ways, but I just want to take a fresh look at it in a
methodical fashion, albeit one that does not arise from any
disciplined training whatsoever. What's the worst that can happen?