The job here is to check a hypothesis:
All Swedish t-words (nouns whose definite singular ends in -t) that end with a
consonant in the indefinite singular stay unchanged in the indefinite plural.
To check this we need a dictionary that stores inflection information. I found Lexin.
By parsing the XML and looping through its entries, we can look for exceptions to the rule.
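As a minimal sketch of the check on single entries, here is the idea on a made-up sample. The tag names (`lemma-entry`, `form`, `pos`, `inflection`) are the ones the loop relies on, but the sample entries and the helper `plural_matches_form` are hypothetical and not copied from Lexin. The sketch also assumes the inflection field lists the definite singular first and the indefinite plural second, and it compares whole words, which is stricter than the two-character suffix comparison used in the real run.

```python
import xml.etree.ElementTree as ET

# Made-up entries in the shape the loop expects; not copied from Lexin.
SAMPLE = """
<dictionary>
  <lemma-entry>
    <form>hus</form>
    <pos>subst.</pos>
    <inflection>huset hus husen</inflection>
  </lemma-entry>
  <lemma-entry>
    <form>vin</form>
    <pos>subst.</pos>
    <inflection>vinet viner vinerna</inflection>
  </lemma-entry>
</dictionary>
"""


def plural_matches_form(entry):
    """Return True if the second inflected form equals the base form."""
    form = entry.find('form').text
    inflections = entry.find('inflection').text.split(' ')
    return len(inflections) > 1 and inflections[1] == form


sample_root = ET.fromstring(SAMPLE)
results = {entry.find('form').text: plural_matches_form(entry)
           for entry in sample_root.findall('lemma-entry')}
print(results)
```

Here "hus" follows the rule and "vin" does not, which is the kind of entry the full loop is meant to flag.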
import xml.etree.ElementTree as ET

tree = ET.parse('LEXIN/LEXIN.xml')
root = tree.getroot()

count = {'word': 0, 'noun': 0, 't-noun': 0,
         'consonant t-noun': 0, 'irregular consonant t-noun': 0}

for word in root.findall('lemma-entry'):
    count['word'] += 1
    # Skip entries that have no part-of-speech tag.
    try:
        pos = word.find('pos').text
    except AttributeError:
        continue
    if pos != "subst.":
        continue
    count['noun'] += 1
    form = word.find('form').text
    inflection = word.find('inflection').text
    if inflection is None:
        continue
    inflections = inflection.split(' ')
    # t-words: the first inflected form ends in -t.
    if not inflections[0].endswith('t'):
        continue
    count['t-noun'] += 1
    if form[-1] not in "bcdfghjklmnpqrstvwxz":
        continue
    count['consonant t-noun'] += 1
    if len(inflections) < 2:
        continue
    # The second inflected form is the plural that gets printed below.
    plural = inflections[1]
    # Skip plurals that contain annotations rather than a plain word.
    if '(' in plural or '.' in plural:
        continue
    # Rough comparison: the base form should end with the same two
    # characters as the plural if the word is unchanged.
    ending = plural[-2:]
    if not form.endswith(ending):
        count['irregular consonant t-noun'] += 1
        print(form + ", " + plural)

count
akvarium, akvarier
ankar, ankare
antikvariat, antikvariaten
arbets~namn, -namnen
auditorium, auditorier
betsel, betslen
blommo~gram, -grammen
bot~färdig, -färdiga
cell~gift, -gifter
decennium, decennier
diarium, diarier
evangelium, evangelier
fett, fetter
finger, fingrar
foster~land, -länder
garn, garner
gymnasium, gymnasier
hand~gången, -gångna
hem~land, -länder
idog, idoga
i-land, i-länder
imperium, imperier
indicium, indicier
jubileum, jubileer
juver, juvren
kartotek, kartoteken
kassett~däck, -däcken
kol~hydrat, -hydrater
kollegium, kollegier
kompendium, kompendier
kranium, kranier
krematorium, krematorier
kriterium, kriterier
laboratorium, laboratorier
lill~finger, -fingrar
luder, ludren
lång~finger, -fingrar
maskulinum, maskuliner
mass~medium, -medier
medium, medier
motgift, motgifter
museum, museer
mysterium, mysterier
mörk~hyad, -hyade
neutrum, neutrer
nidings~dåd, -dåden
observatorium, observatorier
pek~finger, -fingrar
plisserad, plisserade
podium, podier
privilegium, privilegier
protein, proteiner
prång, prången
pur~ung, -unga
ring~finger, -fingrar
rums~ren, -rena
röd~vin, -viner
sammelsurium, sammelsurier
sanatorium, sanatorier
seminarium, seminarier
skam~grepp, -greppen
skepps~brott, -brotten
smink, sminker
solarium, solarier
stadium, stadier
stipendium, stipendier
studium, studier
supinum, supiner
symposium, symposier
syn~skadad, -skadade
territorium, territorier
tjänste~fel, -felen
tomte~bloss, -blossen
tröst~äter, -ät!
tänk~värd, -värda
u-land, u~länder
uppblåsbar, uppblåsbara
utvandrar~land, -länder
utvecklings~land, -länder
vin, viner
vitamin, vitaminer
vuxen~gymnasium, -gymnasier
{'consonant t-noun': 1896, 'irregular consonant t-noun': 82, 'noun': 10782, 't-noun': 2582, 'word': 19718}
Well, there are a few exceptions. Most of the flagged entries are false positives, but some are true exceptions:
fett, finger, garn, gift, land, protein, smink, vin, vitamin
Oh, and the Latin words ending in -um:
akvarium, auditorium, decennium, diarium, gymnasium, imperium, jubileum, kollegium, kompendium, kranium, krematorium, kriterium, laboratorium, medium, museum, mysterium, podium, privilegium, sammelsurium, sanatorium, seminarium, solarium, stadium, stipendium, symposium, territorium
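Many flagged entries are compounds of the same few words, which is why the lists above name only the head word (for example "land" rather than "hem~land"). A rough sketch of that grouping follows, assuming that `~` and `-` in the printed forms mark a compound boundary. The helper `head_word` and the short `flagged` sample are hypothetical and only illustrate the idea.

```python
import re
from collections import defaultdict

# A few of the flagged forms from the output, used as sample input.
flagged = ["hem~land", "i-land", "foster~land",
           "röd~vin", "vin", "lill~finger", "finger"]


def head_word(form):
    """Return the last component of a form split on '~' or '-'."""
    return re.split(r'[~-]', form)[-1]


# Group every flagged form under its head word.
groups = defaultdict(list)
for form in flagged:
    groups[head_word(form)].append(form)

for head, forms in groups.items():
    print(head + ": " + ", ".join(forms))
```

Compounds written without a marker, such as "motgift", would not be split this way and still need to be matched by hand.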
irregular = ["fett", "finger", "garn", "gift", "land", "protein", "smink", "vin",
             "vitamin", "akvarium", "auditorium", "decennium", "diarium",
             "gymnasium", "imperium", "jubileum", "kollegium", "kompendium",
             "kranium", "krematorium", "kriterium", "laboratorium", "medium",
             "museum", "mysterium", "podium", "privilegium", "sammelsurium",
             "sanatorium", "seminarium", "solarium", "stadium", "stipendium",
             "symposium", "territorium"]
print("Chance of making a mistake when following the rule: %0.4f %%" %
      (100*float(len(irregular))/count['consonant t-noun'],))
Chance of making a mistake when following the rule: 1.8460 %
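The arithmetic behind that figure can be checked in isolation. The sketch below uses the numbers reported above: 35 hand-picked exceptions, 82 flagged entries, and 1896 consonant-final t-nouns. The helper `error_rate` is a hypothetical name introduced here. Treating every flagged entry as an error gives a rough upper bound, since that count still includes the false positives.

```python
def error_rate(exceptions, total):
    """Return the share of exceptions among all candidates, in percent."""
    return 100.0 * exceptions / total


confirmed_rate = error_rate(35, 1896)  # hand-picked true exceptions
flagged_rate = error_rate(82, 1896)    # everything the loop flagged

print("Confirmed exceptions: %0.4f %%" % confirmed_rate)
print("All flagged entries:  %0.4f %%" % flagged_rate)
```

The first line reproduces the 1.8460 % reported above; the second comes out at a little over 4 %.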