Monday, September 1, 2014

Creating A Word List For Language Study

So I'm trying to come up with a “master list” of English vocabulary words that I can use as a base to translate into other languages. My thought is that if I come up with a list of words that covers a good deal of the gamut of concepts that English speakers express, then translating those words into the target language should provide a good basis for plundering towards fluency in that language, whether through flash cards, oral materials that reinforce those words, or other language learning materials.

Don't ask WHY I'm doing this. There is no good answer for that. It's just a function of the set of things that I do.

I came up with the seed of this idea a few years ago, when I made a Spanish deck of flash cards for a friend of mine. I simply took the 1000 most frequent English words, put them in a single column in a spreadsheet, copied that column and pasted it in Google Docs, translated the whole column into Spanish, and then pasted the Spanish version in the next column over from the English version in the spreadsheet. I proofread a little and did some tweaking to correct some errors (since I'm fluent in Spanish), and then imported the whole thing into Anki. Then I wrote some instructions in a programming language that created a second set of flash cards reversing the order of the first set (the first set was all English-Spanish, so I created a set that was Spanish-English). There, an easy-peasy way to make 2000 flash cards (actually, it was fewer than 2000 because some of the words were duplicates).
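For anyone who wants to try the reversal step outside of Anki, here is a minimal Python sketch. It is not the set of instructions I used back then: the function names `make_cards` and `to_tsv` and the sample word pairs are made up for illustration, and it assumes the cards end up as tab-separated text, one card per line.

```python
import csv
import io


def make_cards(pairs):
    """Return forward and reversed cards, dropping exact duplicates.

    `pairs` is a list of (english, spanish) tuples. Hypothetical helper,
    not the instructions used for the original deck.
    """
    cards = []
    seen = set()
    for english, spanish in pairs:
        # Emit the English-Spanish card, then its Spanish-English mirror.
        for front, back in ((english, spanish), (spanish, english)):
            if (front, back) not in seen:
                seen.add((front, back))
                cards.append((front, back))
    return cards


def to_tsv(cards):
    """Render cards as tab-separated text, one card per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(cards)
    return buffer.getvalue()


# Made-up sample data; the repeated pair shows why the deck came in under 2000.
sample = [("house", "casa"), ("dog", "perro"), ("house", "casa")]
cards = make_cards(sample)
print(to_tsv(cards), end="")
```

The repeated pair in the sample is only written once in each direction, which is the same effect that kept my deck under 2000 cards.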

So, I wanted to put a little more thought and effort into making a list that would take one to a pretty sophisticated knowledge of the language. I wanted to have an even mix of frequency and utility, and have the list have several levels that progressed from basic to advanced. In doing so, I encountered some interesting problems (well, interesting to me anyway), which I'll talk about later.

I started with the Dolch Word List. This is a list of 315 words created by Charles Dolch in 1936, and published in 1948. It is subdivided into pre-primer through third grade utility words, and a separate list of nouns. The list has a few words that, while I wouldn't call them “archaic” due to the age of the list, probably have shifted somewhat in their frequency and their utility. But it's a good start for my first layer.

Next I went to the 1000 most frequent words in English, and broke them into two halves, the first 500 and the next 500. I picked out all the words in each half that had not already been covered by the Dolch List (by creating a formula in OpenOffice Calc that weeded out the words already present), and these two lists became my next two layers. Now I've got 1031 words, which is interesting in that it means there are 31 words in the Dolch List that are not in the 1000 most frequent. One slight problem I see in dealing with frequency is that the frequency of words in another language might not be the same as the frequency of English words, but I figure that most languages used by modern industrial societies will speak to a similar enough range of human experiences that it will be close enough. If I get to analyzing a language like Yanomamo, all bets may be off on that.
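The weeding-out step can also be sketched in Python rather than a spreadsheet formula. This is only a rough sketch under my own assumptions: `build_layers` is a made-up name, the three short lists are stand-ins for the real ones, and words are compared without regard to capitalization.

```python
def build_layers(word_lists):
    """Turn ordered word lists into layers with no repeats.

    Each layer keeps only the words not seen in an earlier list.
    Comparison ignores case; the first spelling seen is the one kept.
    """
    seen = set()
    layers = []
    for words in word_lists:
        layer = []
        for word in words:
            key = word.strip().lower()
            if key and key not in seen:
                seen.add(key)
                layer.append(word.strip())
        layers.append(layer)
    return layers


# Tiny stand-in lists; the real ones would be loaded from files.
dolch = ["the", "and", "jump", "funny"]
first_500 = ["the", "of", "and", "to"]
next_500 = ["jump", "energy", "to"]

layers = build_layers([dolch, first_500, next_500])
for number, layer in enumerate(layers, start=1):
    print(number, layer)
```

Because the lists are processed in order, a word always lands in the earliest layer that contains it, which is what lets the layers progress from basic to advanced.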

Then I turned to the list of words in Basic English (turning away from the focus on frequency back to utility). This is a simplified subset of the English language that is designed for teaching English as a Second Language. There are about 850 words in Basic English. I used all the words in the list of Basic English that were not already in the previous layers, and those became my fourth layer. There is a superset of Basic English called Special English, that consists of about 1500 words, and is the vocabulary used for broadcasting by the Voice of America (the theory being that a simplified vocabulary will be much more understandable by foreign listeners). I decided not to draw from Special English, because I figured it would mostly duplicate my next lists.

Then I turned to the General Service Headwords List, a list of about 2000 words created in 1953, taken from a compendium of common English written sources for the purpose of language learning. The General Service List (GSL) is the first list I ran across to substantially segment out lemmatisations. A lemma is the core (dictionary) form of a word, and lemmatisation groups the different forms of the word under it, for example “running” and “ran” under “run.” Related words can also be formed by changing the part of speech, such as the noun “runner” made from the verb “run.” And some of the forms can be substantially different from the core word, as in the formation of “better” from “good.” So there are two GSL lists, one of just core words (headwords), and one with the inflected forms of those words. I took the headwords, eliminated duplicates, and this was my fifth layer.

At this point, I'm starting to note some fairly significant problems. The biggest one is that categories of words that should be kept together are not kept together, and it is hard for me to tell if the sets of those words are missing important members. For instance, some colors are in the Dolch list, some are in Basic English, and some are in the GSL. I haven't even gotten to colors like “mauve” or “turquoise”, just the basic palette. And numbers are all over the place, too. Geographical directions, articles of clothing, kitchen utensils...the most basic functional groupings are scattered and incomplete.

Another problem is that I have eliminated homophones. Yes, my list is homophobic (discriminatory, though, only in a homophonic fashion). Because there are words like “march”, “may”, and “august” (actually, “august” comes later), those months have been wiped out from my list, and I have to add them back in.
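One way to spot these casualties is to check a known category against the master list with capitalization intact. The following Python sketch is just an illustration: `missing_members` is a made-up helper, the `master` list is a tiny stand-in, and it assumes the master list keeps capital letters on proper nouns.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def missing_members(category, master_list):
    """Return category members that are not in the list with exact spelling."""
    present = set(master_list)
    return [word for word in category if word not in present]


# Stand-in master list: "march" and "may" survived only as common words.
master = ["march", "may", "june", "April", "august"]
missing = missing_members(MONTHS, master)
print(missing)
```

The same check would work for colors, numbers, or any other functional grouping, as long as there is a reference list of what the complete set should look like.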

Then, I turned to the New General Service List (NGSL), which has about 800 more words than the GSL, but covers more concepts in fewer words, and eliminates some words that have fallen out of favor in the 60 years since the first list was created. Once again, I'm still just adding in the headwords.

My next stop is the Academic Word List (AWL), a supplement to the GSL, with words that are a little more erudite or hoity-toity. So now I've got a little over 3000 words, mostly just headwords.

My next two layers are from lists of the inflected and derived forms of the GSL and AWL headwords. There are a LOT of words here, though not a lot of new concepts. But this is going to help with structure, morphology, lexicology, syntax and grammar. Now my complete list is a little over 9000 words.

So that's where I am at with this right now. It still needs a lot of shaping. I need to define the functional groupings more clearly, fill in any holes in the structure, and redistribute the levels of learning. I know folks like Charles Berlitz and Paul Pimsleur have probably dealt with matters like this in their own way, but I just want to take a fresh look at it in a methodical fashion, albeit one that does not arise from any disciplined training whatsoever. What's the worst that can happen?

Saturday, August 30, 2014

New Rewrite of Polish Verb Conjugation Post

I just rewrote my post on Polish Verb Conjugation to include more detail about the sites I had listed, and to add some new sites that I have found that are pretty good resources.  Also, some of the sites I had listed before have changed their design, so I updated the post to reflect any adjustments in the navigation through their sites.  Many of the sites have also expanded the verbs that they offer, and filled in blank spots in their conjugation tables.  So if you haven't looked at this post in some time, you may want to take another look.  This post is usually listed on the "most popular posts" widget on the right side of the main page of my blog, as it seems to be one of the most visited pages here.

Saturday, April 26, 2014

Rubbing And Spreading: -cierać, -trzeć

It has been a while since I posted a blog post here.  I have been studying Polish just about every day, but haven't taken the time to create a blog post.  Perhaps my recently ended campaign for US Congress kept me a little too busy to make blog posts. :-)

Verbs that end with -cierać, -trzeć seem to have meanings associated with "rubbing" or "spreading" for the most part.  Most of these verbs follow the pattern of -cierać for the imperfective and -trzeć for the perfective though "patrzeć" is a little different, as we shall see.

These verbs tend to conjugate in "-am, -asz" for the -cierać form, and in "-trę, -trzesz" for the -trzeć form.
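As a quick check on that pattern, here is a small Python sketch that builds the first and second person singular from the infinitive. It is only a sketch of the two patterns described above, not a general Polish conjugator: the function name `present_singular` is made up, and "patrzeć" is rejected on purpose because it follows a different pattern.

```python
def present_singular(verb):
    """Return (1st person, 2nd person) singular for the two patterns above.

    Handles only -cierać verbs (-am, -asz) and -trzeć verbs (-trę, -trzesz).
    "patrzeć" and its prefixed forms conjugate differently and are rejected.
    """
    if verb.endswith("patrzeć"):
        raise ValueError("patrzeć conjugates as -rzę, -rzysz; not covered here")
    if verb.endswith("trzeć"):
        stem = verb[: -len("rzeć")]  # "dotrzeć" -> "dot"
        return stem + "rę", stem + "rzesz"
    if verb.endswith("cierać"):
        stem = verb[: -len("ać")]  # "docierać" -> "docier"
        return stem + "am", stem + "asz"
    raise ValueError(f"unsupported verb: {verb}")


for infinitive in ("docierać", "dotrzeć", "wycierać", "wytrzeć"):
    print(infinitive, present_singular(infinitive))
```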

docierać, dotrzeć - to reach, to arrive, to get through to

This verb seems to be the farthest from the "rubbing" or "spreading" meaning of the root.

dotrzeć do czegoś - to get through to something

nacierać, natrzeć - to rub (in); to charge, to attack

natrzeć na (+ accusative) - to charge at...

obcierać, obetrzeć - to wipe; to chafe (skin)

Buty mnie obcierają - My shoes are chafing against my feet

pocierać, potrzeć - to rub

przecierać, przetrzeć - to wipe; to sieve; to wear through

przycierać, przytrzeć - to wear down, to chafe, to fray

rozpościerać, rozpostrzeć - to spread (e.g. blanket, wings); to stretch

rozpościerać skrzydła - to spread one's wings

rozpościerać się, rozpostrzeć się - to extend

ścierać, zetrzeć - to scrape, to abrade; to rub out, to wipe off, to clean; to grate (e.g. carrots)

ścierać kurze z (+ genitive) - to dust...

ścierać się, zetrzeć się - to wear out, to wear off, to wear thin; to clash

wcierać, wetrzeć - to rub in

wcierać krem w skórę - to rub cream into the skin

wycierać, wytrzeć - to wipe off

Proszę wytrzeć buty - Please wipe your shoes

"Patrzeć/popatrzeć" is different from the above verbs in that it never takes on the -cierać form, and simply uses the "po-" prefix for the perfective.  It also conjugates as "-rzę, -rzysz."

patrzeć, popatrzeć - to look (at) 

patrzeć na siebie w lustrze - to look at oneself in the mirror 
patrzeć przez okno - to look through the window
patrzeć w lusterko - to look in the mirror
Miło byłoby popatrzeć jak bawi się z Aaronem - It would have been nice to see her play with Aaron
Nienawidziła patrzeć na siebie w lustrze - She hated looking at herself in the mirror
Tylko patrzeć jak... - It won't be long before...
patrzeć na kogoś z góry - to look down one’s nose at somebody

I am including "rozprzestrzeniać, rozprzestrzenić" here mostly because it is a synonym for "rozpościerać, rozpostrzeć" and because the "-trzeć" root seems to elongate to "-trzeniać, -trzenić."

rozprzestrzeniać, rozprzestrzenić - to spread / -niam -niasz, -nię -nisz

szybko się rozprzestrzeniać - to spread like wildfire

rozprzestrzeniać się, rozprzestrzenić się - to disperse / -niam -niasz, -nię -nisz

rozprzestrzeniać się z szybkością błyskawicy - to spread like wildfire