Monday, September 1, 2014

Creating A Word List For Language Study

So I'm trying to come up with a “master list” of English vocabulary words that I can use as a base to translate into other languages. My thought is that if I come up with a list of words that covers a good deal of the gamut of concepts that English speakers express, then translating those words into the target language should provide a good basis for plundering towards fluency in that language, either in flash cards, the creation of oral materials that reinforce those words, or other language learning materials.

Don't ask WHY I'm doing this. There is no good answer for that. It's just a function of the set of things that I do.

I came up with the seed of this idea a few years ago, when I made a Spanish deck of flash cards for a friend of mine. I simply took the 1000 most frequent English words, put them in a single column in a spreadsheet, copied that column and pasted it in Google Docs, translated the whole column into Spanish, and then pasted the Spanish version in the next column over from the English version in the spreadsheet. I proofread a little and did some tweaking to correct some errors (since I'm fluent in Spanish), and then imported the whole thing into Anki. Then I wrote some instructions in the programming language that generated a new set of flash cards that reversed the order of the first set (the first set was all English-Spanish, so I created a set that was Spanish-English). There, an easy-peasy way to make 2000 flash cards (actually, it was fewer than 2000 because some of the words were duplicates).
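The reversing step can be sketched in a few lines of Python. This is only an illustration under assumptions, not the instructions actually used back then: the function names `make_cards` and `to_tsv` are made up, the word pairs are a tiny sample, and the output is plain tab-separated text of the kind a flash card program can usually import.

```python
import csv
import io


def make_cards(pairs):
    """Return forward and reversed cards, skipping exact duplicate cards."""
    seen = set()
    cards = []
    for english, spanish in pairs:
        # Emit the English-Spanish card, then its Spanish-English mirror.
        for front, back in ((english, spanish), (spanish, english)):
            if (front, back) not in seen:
                seen.add((front, back))
                cards.append((front, back))
    return cards


def to_tsv(cards):
    """Serialize cards as tab-separated lines, one card per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(cards)
    return buffer.getvalue()


# Tiny made-up sample; the repeated pair stands in for duplicated words.
pairs = [("the", "el"), ("house", "casa"), ("the", "el")]
cards = make_cards(pairs)
print(to_tsv(cards))
```

Because duplicates are dropped, three input pairs yield four cards rather than six, which is the same effect that kept the real deck under 2000 cards.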

So, I wanted to put a little more thought and effort into making a list that would take one to a pretty sophisticated knowledge of the language. I wanted to have an even mix of frequency and utility, and have the list have several levels that progressed from basic to advanced. In doing so, I encountered some interesting problems (well, interesting to me anyway), which I'll talk about later.

I started with the Dolch Word List. This is a list of 315 words created by Charles Dolch in 1936, and published in 1948. It is subdivided into pre-primer through third grade utility words, and a separate list of nouns. The list has a few words that, while I wouldn't call them “archaic” due to the age of the list, probably have shifted somewhat in their frequency and their utility. But it's a good start for my first layer.

Next I went to the 1000 most frequent words in English, and broke it into two halves, the first 500, and the next 500. I picked out all the words in each half that had not already been covered by the Dolch List (by creating a formula in Open Office Calc that weeded out the words already present), and these two lists became my next two layers. Now I've got 1031 words, which is interesting in that it means there are 31 words in the Dolch List that are not in the 1000 most frequent. One slight problem I see in dealing with frequency is that the frequency of words in another language might not be the same as the frequency of English words, but I figure that most languages used by modern industrial societies will speak to a similar enough range of human experiences that it will be close enough. If I get to analyzing a language like Yanomamo, all bets may be off on that.
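The weeding-out step can be sketched in Python roughly like this. It is an illustration, not the actual Open Office Calc formula, and the three word lists are tiny made-up samples rather than the real Dolch or frequency lists; `build_layers` is a hypothetical name.

```python
def build_layers(sources):
    """For each source list, in order, keep only words not seen in an earlier layer."""
    seen = set()
    layers = []
    for words in sources:
        layer = []
        for word in words:
            key = word.lower()  # compare case-insensitively
            if key not in seen:
                seen.add(key)
                layer.append(word)
        layers.append(layer)
    return layers


# Made-up sample data standing in for the real lists.
dolch = ["the", "run", "kitty"]
first_500 = ["the", "of", "run"]
next_500 = ["kitty", "paper", "of"]

layers = build_layers([dolch, first_500, next_500])
total = sum(len(layer) for layer in layers)
print(layers)
print(total)
```

The total is always the size of the union of the lists, which is why a 1031-word result from the Dolch list plus the 1000 most frequent words implies 31 Dolch words outside the top 1000.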

Then I turned to the list of words in Basic English (turning away from the focus on frequency back to utility). This is a simplified subset of the English language that is designed for teaching English as a Second Language. There are about 850 words in Basic English. I used all the words in the list of Basic English that were not already present in the previous layers, and those became my fourth layer. There is a superset of Basic English called Special English that consists of about 1500 words, and is the vocabulary used for broadcasting by the Voice of America (the theory being that a simplified vocabulary will be much more understandable by foreign listeners). I decided not to draw from Special English, because I figured it would mostly duplicate my next lists.

Then I turned to the General Service Headwords List, a list of about 2000 words created in 1953 taken from a compendium of common English written sources for the purpose of language learning. The General Service List (GSL) is the first list I ran across to substantially segment out lemmatisations. Lemmatisation groups the different forms of a word under one core word (the lemma, or headword), for example “running,” “runner,” and “ran” under the word “run.” Not only can there be different inflections, but the word can also be transformed into a different part of speech, such as a noun made from a verb. And some of the forms can be substantially different from the headword, as in the formation of “better” from “good.” So there are two GSL lists, one of just core words (headwords), and one with inflected forms of those words. I took the headwords, and eliminated duplicates, and this was my fifth layer.
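To make the headword idea concrete, here is a small Python sketch that files word forms under their headwords. The lookup table is hand-written from the examples above and is purely illustrative; the real GSL files are not used here, and `group_by_headword` is a made-up name.

```python
# Hand-written toy table: word form -> headword (an assumption for illustration).
FORM_TO_HEADWORD = {
    "running": "run",
    "runner": "run",
    "ran": "run",
    "better": "good",
}


def group_by_headword(words, table):
    """Map each headword to the list of its other forms found in `words`."""
    groups = {}
    for word in words:
        headword = table.get(word, word)  # unknown words count as their own headword
        forms = groups.setdefault(headword, [])
        if word != headword and word not in forms:
            forms.append(word)
    return groups


words = ["run", "running", "ran", "good", "better", "runner"]
groups = group_by_headword(words, FORM_TO_HEADWORD)
print(groups)
```

Keeping only the dictionary keys gives a headwords-only layer, while the values hold the forms that a later layer could add back in.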

At this point, I'm starting to note some fairly significant problems. The biggest one is that categories of words that should be kept together are not kept together, and it is hard for me to tell if the sets of those words are missing important members. For instance, some colors are in the Dolch list, some are in Basic English, and some are in the GSL. I haven't even gotten to colors like “mauve” or “turquoise”, just the basic palette. And numbers are all over the place, too. Geographical directions, articles of clothing, kitchen utensils...the most basic functional groupings, are scattered and incomplete.
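One way to spot the holes is to check the master list against hand-made category sets. The Python sketch below assumes such sets exist; the two categories and their members are examples chosen for illustration, not a vetted inventory, and `missing_members` is a hypothetical helper.

```python
# Example category sets (assumptions, deliberately small).
CATEGORIES = {
    "colors": {"red", "orange", "yellow", "green", "blue",
               "purple", "black", "white", "brown"},
    "directions": {"north", "south", "east", "west"},
}


def missing_members(master_list, categories):
    """Return, per category, the members that are absent from the master list."""
    present = {word.lower() for word in master_list}
    report = {}
    for name, members in categories.items():
        missing = sorted(members - present)
        if missing:
            report[name] = missing
    return report


master_list = ["red", "blue", "north", "South"]
report = missing_members(master_list, CATEGORIES)
print(report)
```

Each missing word can then be assigned to whichever layer the rest of its group lives in, so that a category stays together.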

Another problem is that I have eliminated homophones. Yes, my list is homophobic (discriminatory, though, only in a homophonic fashion). Because there are words like “march”, “may”, and “august” (actually, “august” comes later), those months have been wiped out from my list, and I have to add them back in.
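The collision is easy to reproduce. In this Python sketch (an illustration with made-up function names and a toy word list), a case-insensitive dedupe swallows the month names, while a variant that matches a set of protected words case-sensitively keeps them.

```python
def dedupe_case_insensitive(words):
    """Drop any word whose lowercase form was already seen."""
    seen, kept = set(), []
    for word in words:
        if word.lower() not in seen:
            seen.add(word.lower())
            kept.append(word)
    return kept


def dedupe_with_protected(words, protected):
    """Same as above, except words in `protected` are matched case-sensitively."""
    seen, kept = set(), []
    for word in words:
        key = word if word in protected else word.lower()
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return kept


words = ["march", "may", "March", "May", "august", "August"]
months = {"March", "May", "August"}

print(dedupe_case_insensitive(words))        # the months are gone
print(dedupe_with_protected(words, months))  # the months survive
```

The protected set has to be built by hand, which amounts to the same thing as adding the months back in afterwards.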

Then, I turned to the New General Service List (NGSL), which has about 800 more words than the GSL, but has coverage of more concepts in fewer words, and eliminates some words that have fallen out of favor in the 60 years since the first list was created. Once again, I'm still just adding in the headwords.

My next stop is the Academic Word List (AWL), a supplement to the GSL, with words that are a little more erudite or hoity-toity. So now I've got a little over 3000 words, mostly just headwords.

My next two layers are from lists of the inflected and derived forms of the GSL and AWL headwords. There are a LOT of words here, though not a lot of new concepts. But this is going to help with structure, morphology, lexicology, syntax and grammar. Now my complete list is a little over 9000 words.

So that's where I am at with this right now. It still needs a lot of shaping. I need to get more defined on functional grouping, fill in any holes in the structure, and redistribute the levels of learning. I know folks like Charles Berlitz and Paul Pimsleur have probably dealt with matters like this in their own way but I just want to take a fresh look at it in a methodical fashion, albeit one that does not arise from any disciplined training whatsoever. What's the worst that can happen?

Saturday, August 30, 2014

New Rewrite of Polish Verb Conjugation Post

I just rewrote my post on Polish Verb Conjugation to include more detail about the sites I had listed, and to add some new sites that I have found that are pretty good resources.  Also, some of the sites I had listed before have changed their design, so I updated the post to reflect any adjustments in the navigation through their sites.  Many of the sites have also expanded the verbs that they offer, and filled in blank spots in their conjugation tables.  So if you haven't looked at this post in some time, you may want to take another look.  This post is usually listed on the "most popular posts" widget on the right side of the main page of my blog, as it seems to be one of the most visited pages here.

Saturday, April 26, 2014

Rubbing And Spreading: -cierać, -trzeć

It has been a while since I posted a blog post here.  I have been studying Polish just about every day, but haven't taken the time to create a blog post.  Perhaps my recently ended campaign for US Congress kept me a little too busy to make blog posts. :-)

Verbs that end with -cierać, -trzeć seem to have meanings associated with "rubbing" or "spreading" for the most part.  Most of these verbs follow the pattern of -cierać for the imperfective and -trzeć for the perfective, though "patrzeć" is a little different, as we shall see.

These verbs tend to conjugate in "-am, -asz" for the -cierać form, and in "-trę, -trzesz" for the -trzeć form.
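As a rough sketch, the two patterns can be written out in Python. Only the first two endings of each pattern come from the text above; the remaining four persons are filled in from the standard paradigms as I understand them, so treat them as an assumption and check them against a conjugation table. The function name `conjugate` is made up.

```python
# Present-tense endings, ordered ja, ty, on/ona/ono, my, wy, oni/one.
AM_ENDINGS = ["am", "asz", "a", "amy", "acie", "ają"]
TRZEC_ENDINGS = ["trę", "trzesz", "trze", "trzemy", "trzecie", "trą"]


def conjugate(infinitive):
    """Conjugate a -cierać or -trzeć verb following the two patterns above."""
    if infinitive.endswith("cierać"):
        stem = infinitive[:-2]  # drop "ać"
        return [stem + ending for ending in AM_ENDINGS]
    # "patrzeć" and "popatrzeć" do not follow the -trzeć pattern, so exclude them.
    if infinitive.endswith("trzeć") and not infinitive.endswith("patrzeć"):
        stem = infinitive[:-5]  # drop "trzeć"
        return [stem + ending for ending in TRZEC_ENDINGS]
    raise ValueError(f"pattern not covered by this sketch: {infinitive}")


print(conjugate("wycierać"))
print(conjugate("wytrzeć"))
```

The imperfective forms describe the present, while the same endings on the perfective verb give a future meaning.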

docierać, dotrzeć - to reach, to arrive, to get through to

This verb seems to be the farthest from the "rubbing" or "spreading" meaning of the root.

dotrzeć do czegoś - to get through to something

nacierać, natrzeć - to rub (in); to charge, to attack

natrzeć na (+ accusative) - to charge at...

obcierać, obetrzeć - to wipe; to chafe (skin)

Buty mnie obcierają - My shoes are chafing against my feet

pocierać, potrzeć - to rub

przecierać, przetrzeć - to wipe; to sieve; to wear through

przycierać, przytrzeć - to wear down, to chafe, to fray

rozpościerać, rozpostrzeć - to spread (e.g. blanket, wings); to stretch

rozpościerać skrzydła - to spread one's wings

rozpościerać się, rozpostrzeć się - to extend

ścierać, zetrzeć - to scrape, to abrade; to rub out, to wipe off, to clean; to grate (e.g. carrots)

ścierać kurze z (+ genitive) - to dust...

ścierać się, zetrzeć się - to wear out, to wear off, to wear thin; to clash

wcierać, wetrzeć - to rub in

wcierać krem w skórę - to rub cream into the skin

wycierać, wytrzeć - to wipe off

Proszę wytrzeć buty - Please wipe your shoes

"Patrzeć/popatrzeć" is different from the above verbs in that it never takes on the -cierać form, and simply uses the "po-" prefix for the perfective.  It also conjugates as "-rzę, -rzysz."

patrzeć, popatrzeć - to look (at) 

patrzeć na siebie w lustrze - to look at oneself in the mirror 
patrzeć przez okno - to look through the window
patrzeć w lusterko - to look in the mirror
Miło byłoby popatrzeć jak bawi się z Aaronem - It would have been nice to see her play with Aaron
Nienawidziła patrzeć na siebie w lustrze - She hated looking at herself in the mirror
Tylko patrzeć jak... - It won't be long before...
patrzeć na kogoś z góry - to look down one's nose at somebody

I am including "rozprzestrzeniać, rozprzestrzenić" here mostly because it is a synonym for "rozpościerać, rozpostrzeć" and because the "-trzeć" root seems to elongate to "-trzeniać, -trzenić."

rozprzestrzeniać, rozprzestrzenić - to spread / -niam -niasz, -nię -nisz

szybko się rozprzestrzeniać - to spread like wildfire

rozprzestrzeniać się, rozprzestrzenić się - to disperse / -niam -niasz, -nię -nisz

rozprzestrzeniać się z szybkością błyskawicy - to spread like wildfire

Wednesday, December 4, 2013

Mixing It Up In Polish (Big Time)

I had previously posted this in three parts, but recently decided to consolidate it into one post and add some more words to it, as well as alphabetize the list by the first entry.  I did that because I was having a hard time figuring out whether I had already included words, so alphabetizing them and having them in one list will make it easier to determine whether I have already considered any given words.  From now on I will just add words to this post when I have identified more easily confused words.

When you are learning Polish, there are a whole lot of words that can be easily confused with each other.  For an English speaker, this is compounded because all the words are tongue twisters anyway.  You have no idea how many times I repeated the words "coś zjeść" (something to eat) over and over again before I could finally say them somewhat reliably.

Plus, when you learn new words that are similar to other words you knew, it can suddenly create a minefield of confusion.

Here are some words that I have gotten mixed up about at one time or another:

bezbronny – defenseless, unprotected
nieuchronny – inevitable

chwała – glory
chwila – moment, while

czeluść – abyss, depths
czułość – sensitivity, tenderness, sentimentality

cześć – reverence, worship (also used as "hello" or "goodbye" informally)
część – portion, part, section, piece
sześć – six

dokonany – accomplished, executed
pokonany – defeated

dowód – evidence, proof
powód – reason, cause, ground, motive

flet – flute
flota – fleet

gałka – knob, ball (like an eyeball), scoop (as in ice cream)
pałka – club (cudgel)

grzywka – fringe (hair)
grzywna – fine
grzywa – mane

kaczka – duck
paczka – package
taczka – wheelbarrow
teczka – briefcase, folder

komar – mosquito, gnat
konar – bough, branch

koparka – excavator
kopiarka – photocopier

kosa – scythe
koza – goat

kotlina – basin, hollow
kotwica – anchor

lawina – avalanche
macica – womb

leżak – deckchair
lizak – lollipop

lina – rope, cable, line
linia – line, route

łaska – favor, grace, mercy, generosity
łuska – (fish) scale, husk, (ammunition) shell

mąka – flour (I picture myself going into a sklep spożywczy [grocery store] and asking for a kilo of wheat torture, please)
męka – torture, torment

nadludzki – superhuman
przeludnienie – overpopulation

nadmiernie – excessively
niezmiernie – extremely, immensely, very

nawias – bracket, parenthesis
zawias – hinge

niezniszczalny – indestructible
znieczulenie – anesthesia

oparty – based, grounded, founded
uparty – stubborn, obstinate

opór – opposition, resistance
upór – obstinacy, stubbornness, determination

odprawa – briefing, clearance, gratuity, rebuff
oprawa – frame, rim, cover, binding (book)

pochodzenie – origin, descent
pogodzenie – reconciliation, resignation

początkowy – initial, preliminary, elementary
porządkowy – serial, ordinal

podeszwa – sole
poszewka – pillowcase

pokrywka – lid, cover
pokrzywa – stinging nettle

poprawienie – improvement, correction, revision
uprawnienie – entitlement, right, authorization

poszewka – pillowcase
soczewka – lens

potwierdzenie – confirmation, corroboration
stwierdzenie – statement, assertion

poważanie – respect, esteem, deference
poważnie – seriously, gravely, with dignity

powtórnie – once again, one more time
powtórzenie – repetition

pozbawiony – deprived
rozbawiony – amused

pozór – pretense, appearance
pożar – blaze, conflagration

przyczyna – cause, reason
przyzwoity – decent, proper

przygoda – adventure
przyroda – nature

ręcznik – towel
rzecznik – spokesman
rzeźnik – butcher

szczepionka – vaccine
szczypiorek – chive

skazany – condemned, doomed
wskazany – advisable

spinacz – paper clip
szpinak – spinach
wspinacz – climber

sporny – controversial, debatable
spójny – coherent

stały – solid, constant, permanent, direct (current)
trwały – permanent, durable, enduring, lasting (probably both these words mean pretty much the same thing except for maybe the “direct current” [prąd stały] connotation; I've seen both “stały związek” and “trwały związek” for “steady relationship”)

ścierka – dishcloth
ścieżka – path

świt – dawn, daybreak
świta – suite, retinue, entourage

trujący – poisonous, toxic
trwający – lasting

uderzenie – blow, stroke, hit
zdarzenie – event, occurrence
zderzenie – collision, crash

uległy – submissive, docile, compliant
upadły – bankrupt, fallen

ułożony – arranged, well-mannered
złożony – complex, composite, compound

uwieńczony – crowned, adorned with wreaths
uwięziony – trapped, stuck, imprisoned

wadliwy – defective, faulty
wątpliwy – questionable, doubtful

wezwanie – call, summons
wyznanie – confession, admission, religion
wyzwanie – challenge
zerwanie – rupture
zeznanie – testimony

władanie – reign, possession
włamanie – burglary

właśnie – just, exactly
własny – (someone's) own

wpływ – influence, impact
wstyd – shame, disgrace

wygląd – appearance, looks
wzgląd – regard, consideration, respect

wykaz – list, statement
wyraz – expression, word

zabarwienie – tinge, tint
zbawienie – rescue, deliverance, salvation, redemption

zabieg – procedure, treatment, operation
zasięg – range, reach

zbocze – slope
zboże – corn, cereal, grain
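Many of the groups above differ by a single letter, so a short script can help surface candidates in a longer vocabulary list. The Python sketch below uses a plain edit distance (Levenshtein) with a threshold of one change; the threshold and the function names are assumptions, and pairs that differ by more than one letter will be missed.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def confusable_pairs(words, max_distance=1):
    """Return all pairs of distinct words within `max_distance` edits."""
    unique = sorted(set(words))
    return [(a, b) for a, b in combinations(unique, 2)
            if edit_distance(a, b) <= max_distance]


words = ["kaczka", "paczka", "taczka", "teczka", "kosa", "koza", "świt"]
for pair in confusable_pairs(words):
    print(pair)
```

Running the same check before adding a new word also answers the question of whether something close to it is already in the list.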

Other potentially confusing items:

Phrases with "północ" can be confusing because the word can either mean "north" or "midnight":

na północy – in the north
na północ – to the north
z północy – from the north

o północy – at midnight

"Południe" can mean either "south" or "noon" as well:

na południe – to the south
na południu – in the south
z południa – from the south

po południu – in the afternoon
w południe – at midday

There are also a whole bunch of words that have to do with thinking and/or mental processes that either contain "-myśl" or "-mysł" (where the accent mark switches between the "s" and the "l").  There are just so many that it makes my head swim.

Also, I have mixed up a lot of words that start with "przy-" and "prze-".

I'd be interested in hearing what words you have gotten mixed up with other words in any languages you were studying.

Friday, November 22, 2013

Cutting Edge: Ciąć

The root "-ciąć" means "to cut."  There are several variations of this word, and most of them have pretty close-knit meanings; that is, the meanings of the derivative words don't seem to wander much from the main root, unlike with other Polish verbs.  The main distinguishing feature of this group of words seems to be the irregular present-tense conjugation:

tnę
tniesz
tnie
tniemy
tniecie
tną

This conjugation seems to follow all the "ciąć" words through the different variations with different prefixes.  On occasion, one might stumble upon the slangy imaginary infinitive "tnąć," which is an incorrect form of "ciąć" reverse engineered and bastardized from the irregular conjugation, though there does exist a participle "tnąc" (with no accent mark on the final consonant).

The words that end in the variation "-cinać" have a more regular "-am -asz" conjugation, for example:

obcinam
obcinasz
obcina
obcinamy
obcinacie
obcinają
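Both patterns can be sketched in Python. The forms are the ones listed above; the function names are made up. One caveat that is my own understanding rather than something stated above: prefixes ending in a consonant (ob-, od-, pod-, roz-, w-, ś-) appear to take a linking vowel in the perfective (for example "obetnę"), so this sketch only handles prefixes ending in a vowel and refuses the rest.

```python
# Present-tense forms, ordered ja, ty, on/ona/ono, my, wy, oni/one.
CIAC_FORMS = ["tnę", "tniesz", "tnie", "tniemy", "tniecie", "tną"]
AM_ENDINGS = ["am", "asz", "a", "amy", "acie", "ają"]
VOWELS = "aeiouy"


def conjugate_cinac(infinitive):
    """Regular -am/-asz pattern for verbs ending in -cinać."""
    if not infinitive.endswith("cinać"):
        raise ValueError(f"not a -cinać verb: {infinitive}")
    stem = infinitive[:-2]  # drop "ać"
    return [stem + ending for ending in AM_ENDINGS]


def conjugate_ciac(infinitive):
    """Irregular tnę/tniesz pattern for ciąć and vowel-final prefixes."""
    if not infinitive.endswith("ciąć"):
        raise ValueError(f"not a -ciąć verb: {infinitive}")
    prefix = infinitive[:-4]  # drop "ciąć"
    if prefix and prefix[-1] not in VOWELS:
        # Consonant-final prefixes are assumed to need a linking vowel.
        raise ValueError(f"prefix not covered by this sketch: {prefix}-")
    return [prefix + form for form in CIAC_FORMS]


print(conjugate_cinac("obcinać"))
print(conjugate_ciac("przeciąć"))
```

Calling `conjugate_ciac("ciąć")` with no prefix simply returns the six base forms.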

The main word related to this root is:

ciąć, pociąć - to cut, to cut through, to clip, to chop, to hack

ciąć na kawałki - to cut into pieces
ciąć na plastry - to cut into slices, to slice
ciąć na kostkę - to cut into cubes, to dice
ciąć się o coś - to fight over something

docinać, dociąć - to cut (all the way) through, to cut extra, to cut to fit; to taunt

docinek - taunt (noun)

nacinać, naciąć - to cut down

nadcinać, nadciąć - to nick, to score

nadcięcie - (surgical) incision; nick, notch (noun)

obcinać, obciąć - to cut, to clip

obcinać włosy - to have one's hair cut
obcinać paznokcie - to cut one's nails

odcinać, odciąć - to cut off, to chop off, to cut away

pociąć - to cut up, to chop, to shred; to incise

podcinać, podciąć - to trim, to prune, to clip

podcinać komuś nogi - to trip somebody (up)
podcinać/podciąć sobie żyły - to slit one's wrists
podciąć skrzydła - to take the wind out of somebody’s sails

przecinać, przeciąć - to cut, to slice; to cut short, to interrupt

przeciąć sobie palec - to cut one's finger
przeciąć wstęgę - to cut a ribbon
przeciąć zakład - to close a bet
przecinać na pół - to cut in half
przecinać ciszę - to break silence
przecinać kłótnię - to interrupt an argument

przecinek - comma (noun)

przycinać, przyciąć - to trim, to crop; to catch

przycinać na wymiar - to trim down to size
przycinać drzewa/krzewy - to prune trees/bushes
przyciąć sobie język - to bite one's tongue

przycinanie - pruning (noun)

rozcinać, rozciąć - to cut, to cleave, to sever

rozcięcie - dissection, slit; vent (noun)

ścinać, ściąć - to cut off, to fell (tree); to smash (ball in sports); to coagulate

ścinać zakręty - to cut corners
ściąć żniwa - to cut down the harvest
ściąć się - to flunk

ucinać, uciąć - to break off, to cut short; to have, to do

ucinać/uciąć sobie drzemkę - to take a nap

wcinać, wciąć - to cut into

wcięcie - indentation, notch (noun)

wycinać, wyciąć - to cut out

zacinać, zaciąć - to cut; to whip, to lash; to clench, to set; to jam

zacinać się, zaciąć się - to cut oneself; to be stuck; to persist; to stammer

Coś się zacięło - Something got stuck

Friday, November 15, 2013

Cursing In Polish

Probably some of the most important words you can learn in any language (after you learn where you can run off to when you are doing the pee dance, and where you can get something to eat that is not currently squirming) are swear words.  Expressing frustration, anger, ridicule and disgust are important parts of creating shades of meaning, and Polish seems to be especially expressive in this regard, next to maybe Russian.  The Polish nation has been through a wide range of historically frustrating experiences (to put it mildly), and this has colored the vernacular with a wide palette.

One site that has a plethora of cussage is "YouSwear," which has a number of Polish curse words, among other languages (curiously enough, they also have curse words in "Chicken").  There is also the Toolpaq Guide To Polish Curse Words, which is more systematic and discriminating in its treatment.  Another list is on Nawcon, which has a page called "Polish Language Swearing."

A site that is not quite as comprehensive is the "Cursing And Swearing Dictionary," which, nonetheless, seems to cover some of the basics.  There is another short list on insults.net, and a short blog post on Transparent, which discusses usage somewhat but not nearly enough.  Also, there are some words on Memrise, but you can only see five words a page and there are a total of twenty.

Doubtless there are other sites out there as well...let me know if you find the gold mine.

Thursday, November 14, 2013

Just Had To Remove A Link To An Infected Site

I just had to remove a link in my list of language learning blogs to a site that is apparently infected with malware.  It was called speakingadventure-dot-com (it's spelled out...DON'T GO THERE!!!).  I did not click on the link at any time after that site was compromised so I doubt I was infected.  The link had been on there for quite some time but the site must have recently gotten infected as Google Chrome would no longer let me on my blog site and gave a malware warning about content from this other site.

Though I could not get to my blog at all, I could get to the maintenance pages, and I just searched all the gadgets for any trace of the site that Chrome was telling me harbored the infection.  And, bingo, I found a link to it and just deleted it, resolving the problem in less than ten minutes after getting the warning and being blocked from my site.

So keep in mind that at any moment, any link that you have posted on your site could become infected somehow.  Or, even worse, someone could hack into your site and it could become infected as well.  Sobering thought.

Whew...I'm glad it was not something more involved, and I'm glad I figured it out quickly.

Sunday, September 15, 2013

Fried...Or, Really, Fraj-ed

The only two words I have found so far in Polish with the root "fraj-" in them are:

"frajer, frajerka" which means "sucker"

"frajda," which means "fun" or "for kicks"
e.g. ale frajda! (what a blast!)