An evidence-based language-acquisition method

"The Martians" were a group of prominent Hungarian scientists of Jewish descent (mostly, but not exclusively, physicists and mathematicians) who emigrated to the United States in the early half of the 20th century. They included, among others, Theodore von Kármán, John von Neumann, Paul Halmos, Eugene Wigner, Edward Teller, George Pólya, John G. Kemeny and Paul Erdős. -- Wikipedia

The language-acquisition method of these geniuses? They memorized books.

This is an executable script, not an article

This is a Google Colaboratory notebook. Upon execution, it generates a deck of cards for German spaced repetition learning.

Resources and tools

300k+ German / English sentence pairs: tatoeba.org
German word frequency list: DeReWo by Das Leibniz-Institut für Deutsche Sprache
Anki, an Open Source spaced repetition learning tool
Execution environment: Google Colaboratory
Code repo: https://gist.github.com/lorinc/edbe6ef72eb6f8259ab6c30b715170e7
Notebook rendering: nbviewer.jupyter.org

TL;DR

The results of this script are freely and conveniently available.

Download and install Anki, an Open Source spaced repetition learning tool for Linux, Android and Windows.
From this tool, get the shared deck of learning cards this script generates.
Learn every day while you commute, get fluent.

In [0]:

%%capture
%%bash

# cleaning up residues from past executions and sample data folder

rm *
rm -rf sample_data

# downloading the tatoeba corpus

wget -nv http://downloads.tatoeba.org/exports/sentences_detailed.tar.bz2 \
         http://downloads.tatoeba.org/exports/user_languages.tar.bz2 \
         http://downloads.tatoeba.org/exports/links.tar.bz2

# downloading a 100k German word frequency list
wget -nv http://www1.ids-mannheim.de/fileadmin/kl/derewo/DeReKo-2014-II-MainArchive-STT.100000.freq.7z

# 7z is already pre-installed on hosted free Colab
7z e DeReKo-2014-II-MainArchive-STT.100000.freq.7z
mv DeReKo-2014-II-MainArchive-STT.100000.freq freq.csv

# extracting tatoeba corpus

tar xvjf sentences_detailed.tar.bz2 
tar xvjf user_languages.tar.bz2
tar xvjf links.tar.bz2

# cleaning up
rm *.bz2
rm *.7z
rm *.readme

# show files
ls -la

In [0]:

%%bash

# in tatoeba CSVs null is represented as a '\N' string

# selecting reference users, whose translations will be used
grep -P '^eng\t[45]' user_languages.csv > eng_users.csv
grep -P '^deu\t[45]' user_languages.csv > deu_users.csv

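# German sentences: must have an owner, be 41 to 99 characters long, and
# contain no sentence-ending punctuation before the last character
# (the field separator is set to tab in the BEGIN block)
awk '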
  BEGIN {FS="\t"};
  {
    if (($2 == "deu" &&
      $4 != "\\N" &&
      length($3) > 40 &&
      length($3) < 100 &&
      substr($3, 1, length($3)-1) !~ /[\.\?\!]/))
    print $0
  } ' sentences_detailed.csv > deu.csv

# English sentences: must have an owner
# (the field separator is set to tab in the BEGIN block)
awk '
  BEGIN {FS="\t"};
  {
    if (($2 == "eng" && $4 != "\\N")) print $0
  } ' sentences_detailed.csv > eng.csv
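The German filter above packs several conditions into one awk expression. As a rough illustration only, the same rule can be spelled out in Python. The function name `keep_german_row`, the sample rows and the user names are made up for this sketch; the column order (id, language, text, owner) is assumed from how the awk script addresses `$2`, `$3` and `$4`. Python counts characters, while awk's `length` may count bytes depending on the locale, so borderline sentences could be treated differently.

```python
import re

# Sentence-ending punctuation that must not appear before the final character.
INNER_STOP = re.compile(r"[.?!]")

def keep_german_row(row, min_len=40, max_len=100):
    """Return True if a tab-split row passes the German sentence filter.

    Assumed column order: id, language, text, owner, ...
    """
    if len(row) < 4:
        return False
    lang, text, owner = row[1], row[2], row[3]
    return bool(
        lang == "deu"
        and owner != r"\N"                    # null is stored as the string \N
        and min_len < len(text) < max_len     # strictly between 40 and 100 characters
        and not INNER_STOP.search(text[:-1])  # no second sentence inside the text
    )

# Made-up sample rows in the assumed tab-separated layout.
rows = [
    "1\tdeu\tIch habe zwei Jahre in Rio de Janeiro gearbeitet.\talice\t\\N\t\\N",
    "2\tdeu\tJa. Ich weiß, dass ich hier nicht willkommen bin.\tbob\t\\N\t\\N",
    "3\tdeu\tIch kann diesen Rechner nicht reparieren, glaube ich.\t\\N\t\\N\t\\N",
    "4\teng\tI worked in Rio de Janeiro for two years, as you know.\tcarol\t\\N\t\\N",
]
kept = [line.split("\t")[0] for line in rows if keep_german_row(line.split("\t"))]
print(kept)
```

Only the first row survives: the second contains a full stop in the middle, the third has no owner, and the fourth is not German.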

In [0]:

# pulling data into Pandas DataFrames

import warnings
import pandas as pd

# suppressing futurewarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# on_bad_lines='warn' (pandas >= 1.3) replaces the removed
# error_bad_lines / warn_bad_lines arguments; mangle_dupe_cols
# was removed in pandas 2.0 and is not needed here
freq = pd.read_csv('freq.csv',
                 sep='\t',
                 header=None,
                 names=['word', 'lemma', 'POS_tag', 'POS_confidence'])

links = pd.read_csv('links.csv',
                 delimiter='\t',
                 on_bad_lines='warn',
                 index_col=0,
                 header=None)

eng_sentences = pd.read_csv('eng.csv',
                 delimiter='\t',
                 on_bad_lines='warn',
                 index_col=0,
                 usecols=[0,2,3],
                 names=['id', 'text', 'owner'],
                 header=None)

deu_sentences = pd.read_csv('deu.csv',
                 delimiter='\t',
                 on_bad_lines='warn',
                 index_col=0,
                 usecols=[0,2,3],
                 names=['id', 'text', 'owner'],
                 header=None)

eng_users = pd.read_csv('eng_users.csv',
                 delimiter='\t',
                 on_bad_lines='warn',
                 index_col=0,
                 usecols=[2],
                 names=['owner'],
                 header=None)

deu_users = pd.read_csv('deu_users.csv',
                 delimiter='\t',
                 on_bad_lines='warn',
                 index_col=0,
                 usecols=[2],
                 names=['owner'],
                 header=None)

In [0]:

translations = links.join(deu_sentences, how='right')\
     .dropna().set_index(1)\
     .join(eng_sentences, how='right', lsuffix='_deu', rsuffix='_eng')\
     .dropna().loc[:,['text_deu','text_eng']].reset_index(drop=True)
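The chained join above is terse, so here is a toy version of the same idea, written with `merge` instead of index joins. All ids and sentences below are invented, and the column names `src` and `dst` are hypothetical labels for the two columns of the links table. It is a sketch of the pairing logic, not a drop-in replacement for the cell.

```python
import pandas as pd

# Toy stand-ins for the three tables; ids and texts are made up.
links = pd.DataFrame({"src": [1, 1, 2, 3], "dst": [10, 99, 11, 12]})
deu = pd.DataFrame({"id": [1, 2], "text": ["Guten Morgen.", "Gute Nacht."]})
eng = pd.DataFrame({"id": [10, 11, 12],
                    "text": ["Good morning.", "Good night.", "Hello."]})

pairs = (
    links
    .merge(deu, left_on="src", right_on="id")  # links that start at a German sentence
    .merge(eng, left_on="dst", right_on="id",  # ...and end at an English sentence
           suffixes=("_deu", "_eng"))
    [["text_deu", "text_eng"]]
    .reset_index(drop=True)
)
print(pairs)
```

Links that point to a sentence missing from either table (here `dst` 99 and `src` 3) drop out, which is the role `dropna()` plays in the notebook's version.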

In [0]:

# cleaning / trimming down the frequency list
# and also damping the rarity as it grows (log scale)

import numpy as np  # pd.np was removed in pandas 2.0

bad_POS = ['TRUNC','$(','$,','$.','156259594','XY', 'CARD', 'NE']
bad_lemma = ['UNKNOWN', 'unknown']

POS_filter = ~freq.POS_tag.isin(bad_POS)
lemma_filter = ~freq.lemma.isin(bad_lemma)

# applying both masks at once, so neither is reindexed against a filtered frame
freq = freq[POS_filter & lemma_filter].copy()

# the row index is used as the frequency rank; its truncated log is the rarity
freq['log_freq'] = (
    np.log(
        freq.index
        .to_numpy()
        .astype(np.int64)
        + 1 # offsetting by one to avoid np.log(0)
    )
    .astype(int)
)

# preparing the German sentences to be probed against the frequency list
word_lists = (
    translations['text_deu']
    .str.replace(r'[,:]', '', regex=True)  # regex=True is required since pandas 2.0
    .str.split()
)

# left join the frequency list to every single word list
# and deriving the median rarity of the words in the sentence
translations['rarity'] = (
    word_lists
      .apply(
          lambda word_list: 
            pd.DataFrame(word_list)
              .merge(freq, left_on=0, right_on='word', how='left')['log_freq']
              .median()
      )
)

# calculating the complexity value
translations['complexity'] = (
    translations.text_deu.str.len()
    *
    translations.rarity
)

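# sorting sentences by complexity, then resetting index
# ready to export
# (the result is assigned back; an inplace reset on the sorted copy
# would leave translations itself unsorted)
translations = (
    translations
      .sort_values('complexity', ascending=True)
      .reset_index(drop=True)
)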
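To make the scoring easier to follow, here is a small self-contained sketch of the same formula: sentence length multiplied by the median truncated log-rank of the words found in the frequency list. The six-word list `ranked_words` and the function name `complexity` are made up for illustration, and a plain dictionary lookup stands in for the per-sentence merge used in the cell. Words missing from the list are skipped, mirroring how the pandas median ignores missing values.

```python
import math
from statistics import median

# Hypothetical mini frequency list, most frequent word first (rank 0).
ranked_words = ["ich", "nicht", "das", "kann", "Rechner", "reparieren"]
log_rank = {w: int(math.log(rank + 1)) for rank, w in enumerate(ranked_words)}

def complexity(sentence):
    """Sentence length times the median truncated log-rank of its known words."""
    words = sentence.replace(",", "").replace(":", "").split()
    known = [log_rank[w] for w in words if w in log_rank]
    if not known:
        return float("nan")  # nothing matched the list
    return len(sentence) * median(known)

print(complexity("ich kann das nicht"))             # 18 characters, median 0.5
print(complexity("ich kann das nicht reparieren"))  # 29 characters, median 1
```

As in the notebook, matching is exact: a capitalized sentence-initial word or a word with a trailing full stop will not be found unless the list contains that exact form.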

In [0]:

%%capture
from google.colab import files

url = ('https://translate.google.com/' +
       'translate_tts?ie=UTF-8&tl=de-DE&client=tw-ob&q=')

# making the German text Google Translate URL compatible
translations['audio'] = (
    url
    + translations['text_deu'].str.replace('[ \'\"]', '+', regex=True)
    + '+'
)
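The replacement above only handles spaces and quotes. A possible alternative, shown here as a sketch rather than as what the notebook does, is to percent-encode the whole sentence with the standard library's `urllib.parse.quote_plus`, which also covers umlauts, commas and question marks. The helper name `tts_url` is hypothetical; the endpoint and its parameters are copied from the cell above, and whether the service accepts the encoded form in every case is an assumption.

```python
from urllib.parse import quote_plus

# Same endpoint and parameters as in the cell above.
TTS_URL = ("https://translate.google.com/"
           "translate_tts?ie=UTF-8&tl=de-DE&client=tw-ob&q=")

def tts_url(sentence):
    """Build the audio URL with the sentence encoded as a query value."""
    # quote_plus turns spaces into '+' and percent-encodes other reserved
    # or non-ASCII characters as UTF-8 bytes.
    return TTS_URL + quote_plus(sentence)

print(tts_url("Ich weiß, dass ich hier nicht willkommen bin."))
```

With the DataFrame from this notebook, the column could then be built with `translations['text_deu'].map(tts_url)`.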

In [0]:

# taking a look at the results. Remember - grammar does not count into
# complexity, only how common the words are in the sentence
(translations[['text_deu', 'text_eng', 'complexity']]
   .sample(50)
   .set_index(['text_deu','text_eng'])
   .sort_values('complexity'))

Out[0]:

text_deu	text_eng	complexity
Tom sagte, dass nicht nur er es hasse, das zu tun.	Tom said that he wasn't the only one who hated doing that.	175.0
Das ist nur ein vorübergehender Rückschlag.	This is only a temporary setback.	180.0
Ich weiß, dass ich hier nicht willkommen bin.	I know I'm not welcome here.	184.0
Ich kann diesen Rechner nicht reparieren.	I can't fix this computer.	184.5
Ich möchte so weit weg von hier, wie ich kann.	I want to get as far away from here as I can.	188.0
Ich war erfolgreich, weil ich Glück hatte.	The reason I succeeded was because I was lucky.	193.5
Ich habe zwei Jahre in Rio de Janeiro gearbeitet.	I worked in Rio de Janeiro for two years.	196.0
Sein Gehalt wurde um zehn Prozent erhöht.	His salary was increased by ten percent.	210.0
Ich sollte wohl besser allein hineingehen.	I think I should go in alone.	210.0
Ich habe es niemandem gesagt, selbst Tom nicht.	I haven't told anyone, not even Tom.	211.5
Wir unterhielten uns bei einer Tasse Kaffee.	We talked over a cup of coffee.	220.0
Tom hat seinen Schirm in der Klasse vergessen.	Tom left his umbrella in the classroom.	230.0
Sie hätten mir die Wahrheit sagen sollen.	You should've told me the truth.	231.0
Werden Sie Tom anrufen, oder möchten Sie, dass ich das tu?	Are you going to call Tom or do you want me to?	236.0
Wir können nicht einfach nur herumsitzen und gar nicht tun.	We can't just sit around and do nothing.	240.0
Tom spielt nicht nur Mundharmonika, sondern auch Gitarre.	Not only does Tom play the harmonica, he plays the guitar, too.	256.5
Das Mädchen mit den blauen Augen ist Jane.	The girl with blue eyes is Jane.	258.0
Er hat gute Aussichten, gewählt zu werden.	He has good chances of being chosen.	258.0
Wie kann man die Gefahren des Internets meiden?	How can you avoid the dangers of the Internet?	258.5
Ich arbeitete den ganzen Tag auf dem Bauernhof.	I worked on the farm all day.	258.5
Dies allein reicht schon aus, um uns zu überzeugen.	This alone is enough to convince us.	260.0
Wir sind ein paar Wochen hinter unserem Zeitplan.	We're a few weeks behind schedule.	269.5
Hat Tom gesagt, wo Mary hingegangen sein könnte?	Did Tom say where Mary might've gone?	269.5
Maria sah so aus, als hätte sie seit Tagen nicht geschlafen.	Mary looked like she hadn't slept in days.	274.5
Ich kann nicht glauben, dass es das ist, was Tom wirklich beunruhigt.	I can't believe that's what's really bothering Tom.	276.0
Tom hat seinen Enkelkindern sehr viel Geld hinterlassen.	Tom left his grandchildren a lot of money.	280.0
Nicht weit von meinem Haus gibt es einen Fluss.	There's a river near my house.	282.0
Niemand kann die Tatsache leugnen, dass die Erde rund ist.	No one can deny the fact that the earth is round.	290.0
Tom arbeitet gewöhnlich von neun bis halb sechs.	Tom usually works from nine to five-thirty.	294.0
Wenn du mich nur gefragt hättest, dann hätte ich es getan.	If you'd just asked me, I would've done it.	300.0
Lass uns hier verschwinden, bevor es zu spät ist!	Let's get out of here before it's too late.	300.0
Tom hätte diese Stelle haben können, hätte er sie gewollt.	Tom could've had this job if he'd wanted it.	305.0
Ich habe den ganzen Nachmittag vergeblich gewartet.	I waited all afternoon in vain.	306.0
Ich frage mich, was ich zum Abendessen kochen soll.	I'm wondering what to cook for dinner.	306.0
Du sagtest uns nicht, was er in diesem Brief geschrieben hatte.	You didn't tell us what he had written in this letter.	315.0
Layla nannte der Polizei einen falschen Namen.	Layla gave the police a fake name.	322.0
Ihr glaubt wohl, ich begehe einen Fehler, oder?	You think I'm making a mistake, don't you?	329.0
Was immer er für Fehler haben mag, Geiz gehört nicht dazu.	Whatever faults he may have, meanness is not one of them.	330.0
Es wurden bedeutende Fortschritte erzielt.	Great progress has been made.	336.0
Tom entschuldigte sich dafür, zu spät gekommen zu sein.	Tom excused himself for being late.	342.0
Tom und Maria gingen händchenhaltend den Pfad entlang.	Tom and Mary walked down the path, holding hands.	357.5
Der Revolutionsrat kam zusammen, um eine Strategie zu planen.	The revolutionary council met to plan strategy.	366.0
Kennst du irgendwelche finnischen Zungenbrecher?	Do you know any Finnish tongue-twisters?	408.0
Wir hätten vorher anrufen und einen Tisch bestellen sollen.	We should have phoned ahead and reserved a table.	420.0
Niemand wusste, dass Tom ein ehemaliger Gefangener war.	No one knew Tom was an ex-con.	440.0
Ich gebe meinen Hunden jeden Abend zwei Becher Hundefutter.	I feed my dog two cups of dog food every evening.	442.5
Solange du dich ruhig verhältst, kannst du in diesem Zimmer bleiben.	As long as you keep quiet, you can stay in this room.	483.0
Wenn man sie Englisch sprechen hört, könnte man annehmen, sie sei Amerikanerin.	If you heard her speak English, you would take her for an American.	486.0
Natürlich ist es nur der Anfang unserer Aufgabe, unsere gemeine Menschlichkeit zu erkennen.	Of course, recognizing our common humanity is only the beginning of our task.	552.0
Die Fahrgäste, die nach Hogwarts fahren, mögen sich bitte auf Bahnsteig neundreiviertel begeben.	Passengers for Hogwarts, please make your way to platform nine and three-quarters.	686.0

In [0]:

# writing the result to a tab-separated file (downloaded in the next cell)
(translations[['text_deu', 'text_eng', 'audio', 'complexity']]
    .to_csv('export.csv', sep='\t', encoding='utf-8', header=False))

In [0]:

# downloading script results to your machine
files.download('export.csv')

This is what the solution side of the flashcard looks like on my phone

If you use Anki, format your cards something like this so that the text-to-speech audio plays automatically when you flip the card:

{{FrontSide}}

<hr/>

{{Back}}

<br/><br/>

{{#Back URL}}
<iframe
  src="{{Back URL}}"
  style="border:2px solid black; padding: 25px; width: 340px; height: 120px;"
></iframe>
{{/Back URL}}