Hierarchical topic modelling of Kiva loan descriptions

Goals

  • Create a visual explorer for a static set of 775K English loan descriptions from Kiva.org
  • Use (hierarchical) topic modelling
  • Publish the explorer

Approach

  • Use gensim to derive a flat topic model over (part of) the Kiva corpus, taking the tutorial as a guideline
  • Organize the resulting topics into a hierarchy
  • Convert that hierarchy into a JSON data file compatible with Hierarchie
  • Visualize everything with Hierarchie

Data preprocessing

Step 1: download the Kiva data dump (JSON format) and extract it into data/static

Step 2: the 'description' fields in the Kiva data dump often mix multiple languages (due to manual translations), so the language codes are not reliable. Therefore, we:

  • split the descriptions into paragraphs
  • run language detection on each paragraph (using the langid library)
  • store the recombined paragraphs and their language codes in a new processed_description field

The data are written to a locally installed MongoDB database 'kiva', collection 'loans'.
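
The paragraph-level tagging described above can be sketched roughly as follows. This is not the code in src/load_kiva_loans_to_mongodb.py: the function names, the rule for splitting paragraphs (on line breaks) and the shape of the result (one recombined text per language) are all assumptions made for illustration. The detector is passed in as a callable so that the sketch runs without extra packages; toy_detect is a deliberately crude stand-in.

```python
import re

def tag_description(description, detect):
    """Split a description into paragraphs and tag each with a language code.

    `detect` is any callable mapping a text to a language code.
    """
    # Assumption: paragraphs are separated by one or more line breaks.
    parts = re.split(r"(?:\r?\n)+", description)
    paragraphs = [p.strip() for p in parts if p.strip()]
    return [{"language": detect(p), "text": p} for p in paragraphs]

def recombine(tagged):
    """Group paragraphs by detected language and join them again."""
    by_lang = {}
    for item in tagged:
        by_lang.setdefault(item["language"], []).append(item["text"])
    return {lang: "\n".join(texts) for lang, texts in by_lang.items()}

def toy_detect(text):
    """Hypothetical stand-in detector, for demonstration only."""
    spanish_markers = {"el", "la", "de", "una", "para", "su"}
    words = set(re.findall(r"\w+", text.lower()))
    return "es" if len(words & spanish_markers) >= 2 else "en"

description = (
    "Maria sells vegetables at the local market.\n\n"
    "Maria vende verduras en el mercado de su barrio."
)
processed = recombine(tag_description(description, toy_detect))
print(processed)
```

With langid installed, the detector could be replaced by something like `lambda text: langid.classify(text)[0]`, since `langid.classify` returns a (language, score) pair.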

In [10]:
# Next line commented out because we only want to run this once
# nohup python src/load_kiva_loans_to_mongodb.py

Step 3: load (a subset of) the English data from MongoDB, and convert it to the Blei LDA-C format, using a gensim utility function

In [6]:
!python src/convert_mongodb_to_blei_ldac.py --dataDir data/topicmodelling \
                                            --corpusBaseName kiva \
                                            --stopwordFile=data/topicmodelling/kiva_stopwords.tsv \
                                            --startYear 1990 \
                                            --maxNrDocs 800000 \
                                            --filterBelow 10 \
                                            --filterAbove 0.5 \
                                            --filterKeepN 1000
Creating MongoDB cursor ... done
Number of loans in 'en' since 1990: 774952
First pass: streaming from MongoDB ...
creating the dictionary ...
read 5000 documents ...
read 10000 documents ...
read 15000 documents ...
read 20000 documents ...
read 25000 documents ...
read 30000 documents ...
read 35000 documents ...
read 40000 documents ...
read 45000 documents ...
read 50000 documents ...
read 55000 documents ...
read 60000 documents ...
read 65000 documents ...
read 70000 documents ...
read 75000 documents ...
read 80000 documents ...
read 85000 documents ...
read 90000 documents ...
read 95000 documents ...
read 100000 documents ...
read 105000 documents ...
read 110000 documents ...
read 115000 documents ...
read 120000 documents ...
read 125000 documents ...
read 130000 documents ...
read 135000 documents ...
read 140000 documents ...
read 145000 documents ...
read 150000 documents ...
read 155000 documents ...
read 160000 documents ...
read 165000 documents ...
read 170000 documents ...
read 175000 documents ...
read 180000 documents ...
read 185000 documents ...
read 190000 documents ...
read 195000 documents ...
read 200000 documents ...
read 205000 documents ...
read 210000 documents ...
read 215000 documents ...
read 220000 documents ...
read 225000 documents ...
read 230000 documents ...
read 235000 documents ...
read 240000 documents ...
read 245000 documents ...
read 250000 documents ...
read 255000 documents ...
read 260000 documents ...
read 265000 documents ...
read 270000 documents ...
read 275000 documents ...
read 280000 documents ...
read 285000 documents ...
read 290000 documents ...
read 295000 documents ...
read 300000 documents ...
read 305000 documents ...
read 310000 documents ...
read 315000 documents ...
read 320000 documents ...
read 325000 documents ...
read 330000 documents ...
read 335000 documents ...
read 340000 documents ...
read 345000 documents ...
read 350000 documents ...
read 355000 documents ...
read 360000 documents ...
read 365000 documents ...
read 370000 documents ...
read 375000 documents ...
read 380000 documents ...
read 385000 documents ...
read 390000 documents ...
read 395000 documents ...
read 400000 documents ...
read 405000 documents ...
read 410000 documents ...
read 415000 documents ...
read 420000 documents ...
read 425000 documents ...
read 430000 documents ...
read 435000 documents ...
read 440000 documents ...
read 445000 documents ...
read 450000 documents ...
read 455000 documents ...
read 460000 documents ...
read 465000 documents ...
read 470000 documents ...
read 475000 documents ...
read 480000 documents ...
read 485000 documents ...
read 490000 documents ...
read 495000 documents ...
read 500000 documents ...
read 505000 documents ...
read 510000 documents ...
read 515000 documents ...
read 520000 documents ...
read 525000 documents ...
read 530000 documents ...
read 535000 documents ...
read 540000 documents ...
read 545000 documents ...
read 550000 documents ...
read 555000 documents ...
read 560000 documents ...
read 565000 documents ...
read 570000 documents ...
read 575000 documents ...
read 580000 documents ...
read 585000 documents ...
read 590000 documents ...
read 595000 documents ...
read 600000 documents ...
read 605000 documents ...
read 610000 documents ...
read 615000 documents ...
read 620000 documents ...
read 625000 documents ...
read 630000 documents ...
read 635000 documents ...
read 640000 documents ...
read 645000 documents ...
read 650000 documents ...
read 655000 documents ...
read 660000 documents ...
read 665000 documents ...
read 670000 documents ...
read 675000 documents ...
read 680000 documents ...
read 685000 documents ...
read 690000 documents ...
read 695000 documents ...
read 700000 documents ...
read 705000 documents ...
read 710000 documents ...
read 715000 documents ...
read 720000 documents ...
read 725000 documents ...
read 730000 documents ...
read 735000 documents ...
read 740000 documents ...
read 745000 documents ...
read 750000 documents ...
read 755000 documents ...
read 760000 documents ...
read 765000 documents ...
read 770000 documents ...
filtering the dictionary ... done
wrote data/topicmodelling/kiva_dict.bin ... and data/topicmodelling/kiva_dict.txt ... done
Second pass: streaming from MongoDB ... saving into data/topicmodelling/kiva.lda-c (Blei corpus format) ... done
Number of documents converted: 774952
Vocabulary size: 1000
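
To make the conversion step more concrete, here is a small pure-Python sketch of the two ideas involved: pruning the vocabulary by document frequency and writing documents in the LDA-C format, where each line holds the number of distinct terms followed by id:count pairs. It is not the conversion script itself. The script's --filterBelow, --filterAbove and --filterKeepN flags presumably map onto the no_below, no_above and keep_n arguments of gensim's Dictionary.filter_extremes, and the actual file is presumably written with gensim's BleiCorpus, but both of those are assumptions. The function names and toy documents below are made up.

```python
from collections import Counter

def build_vocabulary(docs, no_below, no_above, keep_n):
    """Keep tokens found in at least `no_below` documents and in at most a
    fraction `no_above` of all documents; then keep the `keep_n` most common."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = [(tok, df) for tok, df in doc_freq.items()
            if no_below <= df <= max_docs]
    # Most frequent first; ties broken alphabetically for a stable result.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return {tok: idx for idx, (tok, _) in enumerate(kept[:keep_n])}

def to_ldac_line(tokens, vocab):
    """Format one document as '<nr distinct terms> id:count id:count ...'."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    pairs = " ".join("%d:%d" % (i, c) for i, c in sorted(counts.items()))
    return ("%d %s" % (len(counts), pairs)).strip()

# Toy corpus (hypothetical tokens, already lowercased and stopword-filtered).
docs = [
    ["loan", "rice", "rice", "family"],
    ["loan", "shop", "family"],
    ["loan", "rice", "farm"],
    ["loan", "shop", "stock"],
]
vocab = build_vocabulary(docs, no_below=2, no_above=0.75, keep_n=1000)
lines = [to_ldac_line(tokens, vocab) for tokens in docs]
print(vocab)
print("\n".join(lines))
```

In this toy corpus "loan" is dropped because it occurs in every document, and "farm" and "stock" are dropped because they occur in only one.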

Topic modelling

In [8]:
!python src/model_topics.py --dataDir data/topicmodelling \
                            --modelDir data/topicmodelling \
                            --corpusBaseName kiva \
                            --nrTopics 64 \
                            --nrWords 8
Loading Blei corpus file data/topicmodelling/kiva.lda-c ... done
<gensim.corpora.bleicorpus.BleiCorpus object at 0x108ba6a90>
Dictionary(1000 unique tokens: [u'neighbors', u'sector', u'managed', u'lack', u'eldest']...)
Making topic model ... done
0.217*region + 0.179*applied + 0.129*borrowed + 0.123*pakistan + 0.082*sheep + 0.070*times + 0.045*develop + 0.036*rented
0.039*started + 0.031*ago + 0.024*start + 0.024*support + 0.022*selling + 0.020*decided + 0.018*thanks + 0.018*increase
0.179*college + 0.174*hardworking + 0.109*studies + 0.071*word + 0.065*degree + 0.046*teacher + 0.044*continues + 0.036*profession
0.687*food + 0.091*beverages + 0.089*foods + 0.045*25,000 + 0.028*serves + 0.012*prepare + 0.011*selling + 0.007*cooking
0.082*living + 0.073*earn + 0.055*income + 0.048*hopes + 0.048*rice + 0.038*village + 0.038*support + 0.028*family
0.124*poultry + 0.114*fellowship + 0.108*god + 0.092*chickens + 0.072*attain + 0.069*chicken + 0.069*eggs + 0.068*build
0.060*meet + 0.053*customers + 0.049*needs + 0.041*increase + 0.031*demand + 0.027*family + 0.024*sales + 0.021*good
0.190*member + 0.147*program + 0.086*joined + 0.065*pmpc + 0.057*help + 0.056*foundation + 0.041*recently + 0.037*paglaum
0.075*money + 0.074*save + 0.072*requested + 0.067*family + 0.066*enough + 0.052*works + 0.050*hard + 0.040*old
0.173*community + 0.106*services + 0.081*costs + 0.066*providing + 0.046*service + 0.044*new + 0.044*provides + 0.042*access
0.191*higher + 0.176*low + 0.175*price + 0.170*prices + 0.046*competition + 0.040*al + 0.037*wholesale + 0.034*commercial
0.349*clothing + 0.096*shoes + 0.075*tuition + 0.062*merchandise + 0.061*sales + 0.058*cosmetics + 0.046*selling + 0.038*baby
0.198*get + 0.151*ahead + 0.114*maria + 0.083*forward + 0.070*desire + 0.066*greatest + 0.053*borrower + 0.046*spouse
0.321*nwtf + 0.067*dream + 0.054*sell + 0.042*past + 0.042*charcoal + 0.039*selling + 0.036*expand + 0.034*also
0.179*pigs + 0.175*raising + 0.076*earns + 0.058*healthy + 0.057*old + 0.055*living + 0.046*raise + 0.041*activities
0.177*products + 0.109*goods + 0.079*etc + 0.070*canned + 0.067*sells + 0.044*sell + 0.044*like + 0.043*shampoo
0.177*electricity + 0.090*funds + 0.089*soon + 0.079*ensure + 0.078*including + 0.055*onions + 0.052*mainly + 0.041*along
0.155*php + 0.134*philippines + 0.106*additional + 0.056*future + 0.053*earns + 0.047*secure + 0.046*income + 0.039*hard
0.286*fish + 0.189*fishing + 0.100*pig + 0.089*blessed + 0.073*mary + 0.055*firewood + 0.046*sell + 0.039*cassava
0.063*happy + 0.059*well + 0.049*says + 0.037*likes + 0.031*hand + 0.028*good + 0.028*time + 0.027*gets
0.320*lenders + 0.235*entrepreneur + 0.200*village + 0.127*engaged + 0.076*northern + 0.017*english + 0.013*kind + 0.003*eight
0.366*like + 0.320*would + 0.117*partner + 0.054*future + 0.046*describes + 0.033*aspires + 0.019*domestic + 0.016*committed
0.221*dreams + 0.212*clothes + 0.096*selling + 0.063*goes + 0.041*sell + 0.041*woman + 0.041*sells + 0.036*shirts
0.344*man + 0.190*photo + 0.095*wife + 0.077*father + 0.071*equipment + 0.050*furniture + 0.042*right + 0.032*pesos
0.063*lives + 0.049*city + 0.048*works + 0.045*house + 0.037*located + 0.035*old + 0.035*area + 0.034*home
0.315*borrowing + 0.211*institution + 0.205*communal + 0.090*raised + 0.088*often + 0.059*economy + 0.032*rest + 0.000*recently
0.151*bank + 0.109*members + 0.095*member + 0.070*time + 0.065*cycle + 0.056*health + 0.036*community + 0.030*payments
0.126*farming + 0.073*harvest + 0.060*land + 0.060*farm + 0.040*fertilizer + 0.039*crops + 0.037*farmers + 0.030*seeds
0.092*income + 0.086*husband + 0.065*family + 0.062*expenses + 0.042*help + 0.039*needs + 0.038*cover + 0.036*household
0.185*materials + 0.110*home + 0.106*making + 0.083*cement + 0.078*sustaining + 0.055*wood + 0.039*make + 0.037*raw
0.237*district + 0.143*province + 0.101*lives + 0.086*requested + 0.068*inputs + 0.055*cambodia + 0.044*old + 0.042*anticipated
0.188*son + 0.111*university + 0.085*pay + 0.070*studying + 0.058*noodles + 0.053*parts + 0.047*year + 0.042*education
0.241*living + 0.182*improve + 0.122*family + 0.109*conditions + 0.027*better + 0.026*income + 0.025*situation + 0.024*new
0.207*previous + 0.108*back + 0.106*new + 0.094*total + 0.082*paid + 0.077*loans + 0.070*used + 0.059*third
0.073*sewing + 0.055*machine + 0.049*training + 0.041*hair + 0.033*tools + 0.033*beauty + 0.033*salon + 0.030*tailoring
0.092*local + 0.082*small + 0.080*successful + 0.078*capital + 0.056*working + 0.052*assistance + 0.043*stable + 0.042*financially
0.264*group + 0.102*women + 0.081*members + 0.050*one + 0.037*fund + 0.032*leader + 0.029*first + 0.019*use
0.206*rural + 0.180*field + 0.105*biggest + 0.075*kiva’s + 0.068*benefit + 0.059*sector + 0.055*cooperative + 0.049*areas
0.263*repaid + 0.252*successfully + 0.218*involved + 0.072*loans + 0.043*individual + 0.041*56 + 0.038*adult + 0.026*completed
0.092*school + 0.075*daughters + 0.068*students + 0.066*restaurant + 0.065*study + 0.063*two + 0.051*sons + 0.043*stories
0.316*water + 0.119*monthly + 0.083*week + 0.054*aged + 0.049*days + 0.048*family + 0.042*piped + 0.035*every
0.106*shop + 0.071*use + 0.062*hopes + 0.050*kes + 0.047*future + 0.043*operates + 0.041*retail + 0.039*stock
0.040*work + 0.027*able + 0.026*family + 0.021*help + 0.019*works + 0.017*wants + 0.015*lives + 0.014*day
0.292*store + 0.124*general + 0.081*items + 0.054*grocery + 0.051*groceries + 0.050*sell + 0.048*runs + 0.030*variety
0.193*build + 0.184*house + 0.057*expanding + 0.056*building + 0.048*sand + 0.045*basis + 0.044*construction + 0.043*labor
0.081*maize + 0.061*farmer + 0.059*income + 0.056*milk + 0.042*dairy + 0.040*cows + 0.037*cattle + 0.037*animals
0.104*rice + 0.101*sugar + 0.070*oil + 0.062*flour + 0.050*cooking + 0.045*bread + 0.040*meat + 0.035*beans
0.269*livestock + 0.172*feed + 0.125*agriculture + 0.110*fattening + 0.070*primarily + 0.068*gain + 0.064*agricultural + 0.056*shown
0.158*day + 0.113*average + 0.083*per + 0.073*credit + 0.072*every + 0.071*within + 0.069*usd + 0.046*month
0.073*daughter + 0.069*school + 0.054*lives + 0.051*house + 0.049*old + 0.049*husband + 0.045*applying + 0.039*faces
0.211*vegetables + 0.103*vending + 0.083*fruits + 0.074*fruit + 0.069*stall + 0.053*bananas + 0.049*vegetable + 0.049*standing
0.081*grade + 0.074*manage + 0.073*born + 0.069*due + 0.069*supports + 0.059*became + 0.054*honest + 0.054*financial
0.108*school + 0.057*fees + 0.044*challenge + 0.042*kenya + 0.027*uganda + 0.025*major + 0.024*profits + 0.023*pay
0.308*coffee + 0.258*drinks + 0.195*soft + 0.093*weather + 0.088*peru + 0.037*mr. + 0.013*wife + 0.003*thus
0.099*repair + 0.087*fertilizers + 0.084*transportation + 0.077*motorcycle + 0.073*transport + 0.058*maintenance + 0.057*driver + 0.054*resell
0.283*access + 0.117*loans + 0.093*! + 0.092*stocks + 0.069*taken + 0.067*institutions + 0.057*said + 0.055*cloth
0.416*one + 0.245*child + 0.136*year + 0.063*old + 0.050*wheat + 0.042*fourth + 0.028*south + 0.007*youngest
0.072*traditional + 0.070*ingredients + 0.066*profit + 0.056*bags + 0.046*live + 0.042*plans + 0.038*soro + 0.037*yiriwaso
0.102*businesses + 0.084*community + 0.069*families + 0.052*microfinance + 0.049*groups + 0.044*intends + 0.041*repay + 0.039*share
0.168*poor + 0.110*production + 0.102*poverty + 0.073*financial + 0.067*development + 0.066*francs + 0.045*country + 0.043*organization
0.240*cash + 0.134*hours + 0.104*hoping + 0.099*manages + 0.068*expects + 0.066*housewife + 0.065*assist + 0.059*net
0.251*brac + 0.205*ages + 0.159*generate + 0.104*rosa + 0.101*renovate + 0.062*2011 + 0.031*currently + 0.027*small
0.126*amount + 0.099*old + 0.059*past + 0.056*purchase + 0.045*hopes + 0.045*lives + 0.043*requesting + 0.041*two
0.055*life + 0.040*quality + 0.038*products + 0.038*better + 0.036*able + 0.027*customers + 0.026*improve + 0.025*good
Writing model file data/topicmodelling/kiva.lda_model ... done
Creating complete topic/word matrix in memory:
topic 1/64 ...
topic 2/64 ...
topic 3/64 ...
topic 4/64 ...
topic 5/64 ...
topic 6/64 ...
topic 7/64 ...
topic 8/64 ...
topic 9/64 ...
topic 10/64 ...
topic 11/64 ...
topic 12/64 ...
topic 13/64 ...
topic 14/64 ...
topic 15/64 ...
topic 16/64 ...
topic 17/64 ...
topic 18/64 ...
topic 19/64 ...
topic 20/64 ...
topic 21/64 ...
topic 22/64 ...
topic 23/64 ...
topic 24/64 ...
topic 25/64 ...
topic 26/64 ...
topic 27/64 ...
topic 28/64 ...
topic 29/64 ...
topic 30/64 ...
topic 31/64 ...
topic 32/64 ...
topic 33/64 ...
topic 34/64 ...
topic 35/64 ...
topic 36/64 ...
topic 37/64 ...
topic 38/64 ...
topic 39/64 ...
topic 40/64 ...
topic 41/64 ...
topic 42/64 ...
topic 43/64 ...
topic 44/64 ...
topic 45/64 ...
topic 46/64 ...
topic 47/64 ...
topic 48/64 ...
topic 49/64 ...
topic 50/64 ...
topic 51/64 ...
topic 52/64 ...
topic 53/64 ...
topic 54/64 ...
topic 55/64 ...
topic 56/64 ...
topic 57/64 ...
topic 58/64 ...
topic 59/64 ...
topic 60/64 ...
topic 61/64 ...
topic 62/64 ...
topic 63/64 ...
topic 64/64 ...
done
Writing topic/word matrix file data/topicmodelling/kiva_topic_words_matrix.h5 .../usr/local/lib/python2.7/site-packages/pandas/io/pytables.py:2486: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->axis0] [items->None]

  warnings.warn(ws, PerformanceWarning)
/usr/local/lib/python2.7/site-packages/pandas/io/pytables.py:2486: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->block0_items] [items->None]

  warnings.warn(ws, PerformanceWarning)
done
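
Each topic line printed above lists the highest-weighted words of one topic, with each coefficient giving the word's probability within that topic. A small helper like the following (a hypothetical convenience function, not part of the scripts in src/) turns such a line back into (weight, word) pairs, for example to see how much of a topic's probability mass its top words cover.

```python
def parse_topic(line):
    """Turn '0.286*fish + 0.189*fishing' into [(0.286, 'fish'), (0.189, 'fishing')]."""
    pairs = []
    for term in line.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((float(weight), word))
    return pairs

topic = parse_topic("0.286*fish + 0.189*fishing + 0.100*pig + 0.089*blessed")
# Share of the topic's probability mass covered by the listed words.
top_mass = sum(weight for weight, _ in topic)
print(topic, top_mass)
```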

Infer topic distribution over the document set

In [9]:
!python src/infer_document_topic_distributions.py --modelDir data/topicmodelling \
                                                  --modelBaseName kiva \
                                                  --maxNrDocs 250000
Reading topic/word matrix file data/topicmodelling/kiva_topic_words_matrix.h5 ... done
Loading model file data/topicmodelling/kiva.lda_model ... done
Loading Blei corpus file data/topicmodelling/kiva.lda-c ... done
<gensim.corpora.bleicorpus.BleiCorpus object at 0x1117a8790>
Processed 5000 documents ...
Processed 10000 documents ...
Processed 15000 documents ...
Processed 20000 documents ...
Processed 25000 documents ...
Processed 30000 documents ...
Processed 35000 documents ...
Processed 40000 documents ...
Processed 45000 documents ...
Processed 50000 documents ...
Processed 55000 documents ...
Processed 60000 documents ...
Processed 65000 documents ...
Processed 70000 documents ...
Processed 75000 documents ...
Processed 80000 documents ...
Processed 85000 documents ...
Processed 90000 documents ...
Processed 95000 documents ...
Processed 100000 documents ...
Processed 105000 documents ...
Processed 110000 documents ...
Processed 115000 documents ...
Processed 120000 documents ...
Processed 125000 documents ...
Processed 130000 documents ...
Processed 135000 documents ...
Processed 140000 documents ...
Processed 145000 documents ...
Processed 150000 documents ...
Processed 155000 documents ...
Processed 160000 documents ...
Processed 165000 documents ...
Processed 170000 documents ...
Processed 175000 documents ...
Processed 180000 documents ...
Processed 185000 documents ...
Processed 190000 documents ...
Processed 195000 documents ...
Processed 200000 documents ...
Processed 205000 documents ...
Processed 210000 documents ...
Processed 215000 documents ...
Processed 220000 documents ...
Processed 225000 documents ...
Processed 230000 documents ...
Processed 235000 documents ...
Processed 240000 documents ...
Processed 245000 documents ...
Processed 250000 documents ...
Processed 255000 documents ...
Processed 260000 documents ...
Processed 265000 documents ...
Processed 270000 documents ...
Processed 275000 documents ...
Processed 280000 documents ...
Processed 285000 documents ...
Processed 290000 documents ...
Processed 295000 documents ...
Processed 300000 documents ...
Processed 305000 documents ...
Processed 310000 documents ...
Processed 315000 documents ...
Processed 320000 documents ...
Processed 325000 documents ...
Processed 330000 documents ...
Processed 335000 documents ...
Processed 340000 documents ...
Processed 345000 documents ...
Processed 350000 documents ...
Processed 355000 documents ...
Processed 360000 documents ...
Processed 365000 documents ...
Processed 370000 documents ...
Processed 375000 documents ...
Processed 380000 documents ...
Processed 385000 documents ...
Processed 390000 documents ...
Processed 395000 documents ...
Processed 400000 documents ...
Processed 405000 documents ...
Processed 410000 documents ...
Processed 415000 documents ...
Processed 420000 documents ...
Processed 425000 documents ...
Processed 430000 documents ...
Processed 435000 documents ...
Processed 440000 documents ...
Processed 445000 documents ...
Processed 450000 documents ...
Processed 455000 documents ...
Processed 460000 documents ...
Processed 465000 documents ...
Processed 470000 documents ...
Processed 475000 documents ...
Processed 480000 documents ...
Processed 485000 documents ...
Processed 490000 documents ...
Processed 495000 documents ...
Processed 500000 documents ...
Processed 505000 documents ...
Processed 510000 documents ...
Processed 515000 documents ...
Processed 520000 documents ...
Processed 525000 documents ...
Processed 530000 documents ...
Processed 535000 documents ...
Processed 540000 documents ...
Processed 545000 documents ...
Processed 550000 documents ...
Processed 555000 documents ...
Processed 560000 documents ...
Processed 565000 documents ...
Processed 570000 documents ...
Processed 575000 documents ...
Processed 580000 documents ...
Processed 585000 documents ...
Processed 590000 documents ...
Processed 595000 documents ...
Processed 600000 documents ...
Processed 605000 documents ...
Processed 610000 documents ...
Processed 615000 documents ...
Processed 620000 documents ...
Processed 625000 documents ...
Processed 630000 documents ...
Processed 635000 documents ...
Processed 640000 documents ...
Processed 645000 documents ...
Processed 650000 documents ...
Processed 655000 documents ...
Processed 660000 documents ...
Processed 665000 documents ...
Processed 670000 documents ...
Processed 675000 documents ...
Processed 680000 documents ...
Processed 685000 documents ...
Processed 690000 documents ...
Processed 695000 documents ...
Processed 700000 documents ...
Processed 705000 documents ...
Processed 710000 documents ...
Processed 715000 documents ...
Processed 720000 documents ...
Processed 725000 documents ...
Processed 730000 documents ...
Processed 735000 documents ...
Processed 740000 documents ...
Processed 745000 documents ...
Processed 750000 documents ...
Processed 755000 documents ...
Processed 760000 documents ...
Processed 765000 documents ...
Processed 770000 documents ...
475 winning (0.06%); 0.67% weight in [region applied borrowed pakistan sheep times develop rented]
73127 winning (9.44%); 5.46% weight in [started ago start support selling decided thanks increase]
136 winning (0.02%); 0.72% weight in [college hardworking studies word degree teacher continues profession]
63 winning (0.01%); 0.56% weight in [food beverages foods 25,000 serves prepare selling cooking]
33504 winning (4.32%); 2.47% weight in [living earn income hopes rice village support family]
273 winning (0.04%); 0.59% weight in [poultry fellowship god chickens attain chicken eggs build]
31849 winning (4.11%); 3.10% weight in [meet customers needs increase demand family sales good]
5630 winning (0.73%); 1.23% weight in [member program joined pmpc help foundation recently paglaum]
34069 winning (4.40%); 2.74% weight in [money save requested family enough works hard old]
24826 winning (3.20%); 2.21% weight in [community services costs providing service new provides access]
26 winning (0.00%); 0.49% weight in [higher low price prices competition al wholesale commercial]
1551 winning (0.20%); 0.86% weight in [clothing shoes tuition merchandise sales cosmetics selling baby]
77 winning (0.01%); 0.75% weight in [get ahead maria forward desire greatest borrower spouse]
1582 winning (0.20%); 0.92% weight in [nwtf dream sell past charcoal selling expand also]
4406 winning (0.57%); 1.26% weight in [pigs raising earns healthy old living raise activities]
4342 winning (0.56%); 1.39% weight in [products goods etc canned sells sell like shampoo]
245 winning (0.03%); 0.74% weight in [electricity funds soon ensure including onions mainly along]
17558 winning (2.27%); 2.56% weight in [php philippines additional future earns secure income hard]
1370 winning (0.18%); 0.55% weight in [fish fishing pig blessed mary firewood sell cassava]
4230 winning (0.55%); 1.88% weight in [happy well says likes hand good time gets]
30 winning (0.00%); 0.69% weight in [lenders entrepreneur village engaged northern english kind eight]
334 winning (0.04%); 1.27% weight in [like would partner future describes aspires domestic committed]
1882 winning (0.24%); 0.75% weight in [dreams clothes selling goes sell woman sells shirts]
308 winning (0.04%); 0.60% weight in [man photo wife father equipment furniture right pesos]
46233 winning (5.97%); 4.16% weight in [lives city works house located old area home]
0 winning (0.00%); 0.39% weight in [borrowing institution communal raised often economy rest recently]
5308 winning (0.68%); 1.44% weight in [bank members member time cycle health community payments]
21246 winning (2.74%); 2.63% weight in [farming harvest land farm fertilizer crops farmers seeds]
26007 winning (3.36%); 3.05% weight in [income husband family expenses help needs cover household]
1560 winning (0.20%); 0.88% weight in [materials home making cement sustaining wood make raw]
563 winning (0.07%); 0.84% weight in [district province lives requested inputs cambodia old anticipated]
1676 winning (0.22%); 0.89% weight in [son university pay studying noodles parts year education]
6894 winning (0.89%); 1.96% weight in [living improve family conditions better income situation new]
2276 winning (0.29%); 0.95% weight in [previous back new total paid loans used third]
6750 winning (0.87%); 1.10% weight in [sewing machine training hair tools beauty salon tailoring]
1696 winning (0.22%); 1.36% weight in [local small successful capital working assistance stable financially]
21624 winning (2.79%); 2.39% weight in [group women members one fund leader first use]
0 winning (0.00%); 0.81% weight in [rural field biggest kiva’s benefit sector cooperative areas]
14 winning (0.00%); 0.60% weight in [repaid successfully involved loans individual 56 adult completed]
1235 winning (0.16%); 0.88% weight in [school daughters students restaurant study two sons stories]
1715 winning (0.22%); 1.04% weight in [water monthly week aged days family piped every]
38299 winning (4.94%); 2.56% weight in [shop use hopes kes future operates retail stock]
149146 winning (19.25%); 9.51% weight in [work able family help works wants lives day]
10037 winning (1.30%); 1.73% weight in [store general items grocery groceries sell runs variety]
1413 winning (0.18%); 1.04% weight in [build house expanding building sand basis construction labor]
23737 winning (3.06%); 2.06% weight in [maize farmer income milk dairy cows cattle animals]
8970 winning (1.16%); 1.57% weight in [rice sugar oil flour cooking bread meat beans]
66 winning (0.01%); 0.58% weight in [livestock feed agriculture fattening primarily gain agricultural shown]
4903 winning (0.63%); 1.21% weight in [day average per credit every within usd month]
24381 winning (3.15%); 2.54% weight in [daughter school lives house old husband applying faces]
1886 winning (0.24%); 0.86% weight in [vegetables vending fruits fruit stall bananas vegetable standing]
46 winning (0.01%); 0.65% weight in [grade manage born due supports became honest financial]
41496 winning (5.35%); 2.60% weight in [school fees challenge kenya uganda major profits pay]
125 winning (0.02%); 0.40% weight in [coffee drinks soft weather peru mr. wife thus]
2053 winning (0.26%); 0.87% weight in [repair fertilizers transportation motorcycle transport maintenance driver resell]
104 winning (0.01%); 0.63% weight in [access loans ! stocks taken institutions said cloth]
422 winning (0.05%); 0.85% weight in [one child year old wheat fourth south youngest]
5011 winning (0.65%); 1.12% weight in [traditional ingredients profit bags live plans soro yiriwaso]
7876 winning (1.02%); 1.29% weight in [businesses community families microfinance groups intends repay share]
442 winning (0.06%); 0.89% weight in [poor production poverty financial development francs country organization]
14 winning (0.00%); 0.38% weight in [cash hours hoping manages expects housewife assist net]
246 winning (0.03%); 0.45% weight in [brac ages generate rosa renovate 2011 currently small]
17439 winning (2.25%); 1.83% weight in [amount old past purchase hopes lives requesting two]
50150 winning (6.47%); 4.53% weight in [life quality products better able customers improve good]
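
The two figures per topic in the summary above can be read as follows, assuming that "winning" counts the documents in which the topic has the highest probability and that "weight" is the topic's average probability over all documents. Both readings are assumptions about src/infer_document_topic_distributions.py, not taken from its code. Under those assumptions, a minimal sketch on toy data looks like this; the input uses sparse (topic id, probability) pairs, the form gensim returns when a model is applied to a bag-of-words document.

```python
def summarize(doc_topics, nr_topics):
    """Return (winning counts, mean weights) per topic.

    `doc_topics` holds one sparse list of (topic_id, probability) per document.
    """
    winning = [0] * nr_topics
    weight_sum = [0.0] * nr_topics
    for dist in doc_topics:
        dense = [0.0] * nr_topics
        for topic_id, prob in dist:
            dense[topic_id] = prob
        # The "winning" topic is the one with the highest probability.
        best = max(range(nr_topics), key=dense.__getitem__)
        winning[best] += 1
        for t in range(nr_topics):
            weight_sum[t] += dense[t]
    nr_docs = len(doc_topics)
    return winning, [w / nr_docs for w in weight_sum]

# Toy document/topic distributions (made-up numbers).
doc_topics = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.2), (2, 0.8)],
    [(0, 0.6), (1, 0.1), (2, 0.3)],
    [(1, 0.6), (2, 0.4)],
]
winning, weight = summarize(doc_topics, nr_topics=3)
print(winning, weight)
```

This also shows why a topic can win no documents and still carry weight, as two topics in the summary above do: it only needs to be a minor component of many documents.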

Organize the "flat" topic list into a hierarchy for visualization purposes
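
The script prints the hierarchy twice: first as a recursive list of topic ids, then as a nested structure with b_name, b_size and children keys. The following sketch shows one way to get from the first form to the second. It is an illustration, not the code in src/build_topic_hierarchy.py: the function names are hypothetical, the a_words field is left out because it requires the trained model, and the naming and size rules (joined leaf ids, summed leaf sizes) are inferred from the printed output. The example ids and sizes are copied from that output.

```python
import json

def leaves(node):
    """Return the topic ids below a node, in order of appearance."""
    if isinstance(node, int):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def build_node(node, topic_sizes):
    """Convert a recursive list of topic ids into a nested dict."""
    ids = leaves(node)
    result = {
        "b_name": "topic_" + "_".join(str(i) for i in ids),
        "b_size": sum(topic_sizes[i] for i in ids),
    }
    if not isinstance(node, int):
        result["children"] = [build_node(child, topic_sizes) for child in node]
    return result

# A small fragment of the recursive hierarchy, with leaf sizes as printed.
hierarchy = [[1, 42, 63], 30, 24]
sizes = {1: 42309, 42: 73725, 63: 35069, 30: 6478, 24: 32244}
tree = build_node(hierarchy, sizes)
print(json.dumps(tree, indent=2))
```

For the [1, 42, 63] cluster this yields the name topic_1_42_63 and the size 151103, matching the corresponding node in the output below.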

In [36]:
!python src/build_topic_hierarchy.py --modelDir data/topicmodelling \
                                     --modelBaseName kiva \
                                     --nrClusters 16 \
                                     --nrWords 7
Reading topic/word matrix file data/topicmodelling/kiva_topic_words_matrix.h5 ... done
Hierarchically clustering topics ...recursive hierarchy:
[[[[1, 42, 63], 30, 24, 4, 41, 46, 34, 58, 9, 28, 52, 44, 49, 6, 19, 26],
  37,
  2,
  17,
  14,
  32,
  8,
  33,
  20,
  31,
  29,
  0,
  36,
  39,
  62,
  56],
 [[45, 57], 15, 5, 47, 27, 18, 59, 43, 11, 23, 12, 7, 50, 54, 40, 16],
 13,
 55,
 21,
 25,
 48,
 38,
 3,
 22,
 53,
 10,
 51,
 35,
 60,
 61]
 done
Loading model file data/topicmodelling/kiva.lda_model ... done
Building nested hierarchy in memory ... done
{u'topic_data': [{u'a_words': [u'family',
                               u'group',
                               u'income',
                               u'lives',
                               u'community',
                               u'living',
                               u'school'],
                  u'b_name': u'topic_1_42_63_30_24_4_41_46_34_58_9_28_52_44_49_6_19_26_37_2_17_14_32_8_33_20_31_29_0_36_39_62_56',
                  u'b_size': 553701,
                  u'children': [{u'a_words': [u'community',
                                              u'family',
                                              u'house',
                                              u'lives',
                                              u'school',
                                              u'income',
                                              u'able'],
                                 u'b_name': u'topic_1_42_63_30_24_4_41_46_34_58_9_28_52_44_49_6_19_26',
                                 u'b_size': 398051,
                                 u'children': [{u'a_words': [u'able',
                                                             u'work',
                                                             u'life',
                                                             u'family',
                                                             u'started',
                                                             u'help',
                                                             u'quality'],
                                                u'b_name': u'topic_1_42_63',
                                                u'b_size': 151103,
                                                u'children': [{u'a_words': [u'started',
                                                                            u'ago',
                                                                            u'start',
                                                                            u'support',
                                                                            u'selling',
                                                                            u'decided',
                                                                            u'thanks'],
                                                               u'b_name': u'topic_1',
                                                               u'b_size': 42309},
                                                              {u'a_words': [u'work',
                                                                            u'able',
                                                                            u'family',
                                                                            u'help',
                                                                            u'works',
                                                                            u'wants',
                                                                            u'lives'],
                                                               u'b_name': u'topic_42',
                                                               u'b_size': 73725},
                                                              {u'a_words': [u'life',
                                                                            u'quality',
                                                                            u'products',
                                                                            u'better',
                                                                            u'able',
                                                                            u'customers',
                                                                            u'improve'],
                                                               u'b_name': u'topic_63',
                                                               u'b_size': 35069}]},
                                               {u'a_words': [u'district',
                                                             u'province',
                                                             u'lives',
                                                             u'requested',
                                                             u'inputs',
                                                             u'cambodia',
                                                             u'old'],
                                                u'b_name': u'topic_30',
                                                u'b_size': 6478},
                                               {u'a_words': [u'lives',
                                                             u'city',
                                                             u'works',
                                                             u'house',
                                                             u'located',
                                                             u'old',
                                                             u'area'],
                                                u'b_name': u'topic_24',
                                                u'b_size': 32244},
                                               {u'a_words': [u'living',
                                                             u'earn',
                                                             u'income',
                                                             u'hopes',
                                                             u'rice',
                                                             u'village',
                                                             u'support'],
                                                u'b_name': u'topic_4',
                                                u'b_size': 19162},
                                               {u'a_words': [u'shop',
                                                             u'use',
                                                             u'hopes',
                                                             u'kes',
                                                             u'future',
                                                             u'operates',
                                                             u'retail'],
                                                u'b_name': u'topic_41',
                                                u'b_size': 19875},
                                               {u'a_words': [u'rice',
                                                             u'sugar',
                                                             u'oil',
                                                             u'flour',
                                                             u'cooking',
                                                             u'bread',
                                                             u'meat'],
                                                u'b_name': u'topic_46',
                                                u'b_size': 12137},
                                               {u'a_words': [u'sewing',
                                                             u'machine',
                                                             u'training',
                                                             u'hair',
                                                             u'tools',
                                                             u'beauty',
                                                             u'salon'],
                                                u'b_name': u'topic_34',
                                                u'b_size': 8531},
                                               {u'a_words': [u'businesses',
                                                             u'community',
                                                             u'families',
                                                             u'microfinance',
                                                             u'groups',
                                                             u'intends',
                                                             u'repay'],
                                                u'b_name': u'topic_58',
                                                u'b_size': 9961},
                                               {u'a_words': [u'community',
                                                             u'services',
                                                             u'costs',
                                                             u'providing',
                                                             u'service',
                                                             u'new',
                                                             u'provides'],
                                                u'b_name': u'topic_9',
                                                u'b_size': 17156},
                                               {u'a_words': [u'income',
                                                             u'husband',
                                                             u'family',
                                                             u'expenses',
                                                             u'help',
                                                             u'needs',
                                                             u'cover'],
                                                u'b_name': u'topic_28',
                                                u'b_size': 23664},
                                               {u'a_words': [u'school',
                                                             u'fees',
                                                             u'challenge',
                                                             u'kenya',
                                                             u'uganda',
                                                             u'major',
                                                             u'profits'],
                                                u'b_name': u'topic_52',
                                                u'b_size': 20147},
                                               {u'a_words': [u'build',
                                                             u'house',
                                                             u'expanding',
                                                             u'building',
                                                             u'sand',
                                                             u'basis',
                                                             u'construction'],
                                                u'b_name': u'topic_44',
                                                u'b_size': 8058},
                                               {u'a_words': [u'daughter',
                                                             u'school',
                                                             u'lives',
                                                             u'house',
                                                             u'old',
                                                             u'husband',
                                                             u'applying'],
                                                u'b_name': u'topic_49',
                                                u'b_size': 19714},
                                               {u'a_words': [u'meet',
                                                             u'customers',
                                                             u'needs',
                                                             u'increase',
                                                             u'demand',
                                                             u'family',
                                                             u'sales'],
                                                u'b_name': u'topic_6',
                                                u'b_size': 24043},
                                               {u'a_words': [u'happy',
                                                             u'well',
                                                             u'says',
                                                             u'likes',
                                                             u'hand',
                                                             u'good',
                                                             u'time'],
                                                u'b_name': u'topic_19',
                                                u'b_size': 14593},
                                               {u'a_words': [u'bank',
                                                             u'members',
                                                             u'member',
                                                             u'time',
                                                             u'cycle',
                                                             u'health',
                                                             u'community'],
                                                u'b_name': u'topic_26',
                                                u'b_size': 11185}]},
                                {u'a_words': [u'rural',
                                              u'field',
                                              u'biggest',
                                              u'kiva\u2019s',
                                              u'benefit',
                                              u'sector',
                                              u'cooperative'],
                                 u'b_name': u'topic_37',
                                 u'b_size': 6275},
                                {u'a_words': [u'college',
                                              u'hardworking',
                                              u'studies',
                                              u'word',
                                              u'degree',
                                              u'teacher',
                                              u'continues'],
                                 u'b_name': u'topic_2',
                                 u'b_size': 5595},
                                {u'a_words': [u'php',
                                              u'philippines',
                                              u'additional',
                                              u'future',
                                              u'earns',
                                              u'secure',
                                              u'income'],
                                 u'b_name': u'topic_17',
                                 u'b_size': 19862},
                                {u'a_words': [u'pigs',
                                              u'raising',
                                              u'earns',
                                              u'healthy',
                                              u'old',
                                              u'living',
                                              u'raise'],
                                 u'b_name': u'topic_14',
                                 u'b_size': 9802},
                                {u'a_words': [u'living',
                                              u'improve',
                                              u'family',
                                              u'conditions',
                                              u'better',
                                              u'income',
                                              u'situation'],
                                 u'b_name': u'topic_32',
                                 u'b_size': 15200},
                                {u'a_words': [u'money',
                                              u'save',
                                              u'requested',
                                              u'family',
                                              u'enough',
                                              u'works',
                                              u'hard'],
                                 u'b_name': u'topic_8',
                                 u'b_size': 21213},
                                {u'a_words': [u'previous',
                                              u'back',
                                              u'new',
                                              u'total',
                                              u'paid',
                                              u'loans',
                                              u'used'],
                                 u'b_name': u'topic_33',
                                 u'b_size': 7391},
                                {u'a_words': [u'lenders',
                                              u'entrepreneur',
                                              u'village',
                                              u'engaged',
                                              u'northern',
                                              u'english',
                                              u'kind'],
                                 u'b_name': u'topic_20',
                                 u'b_size': 5316},
                                {u'a_words': [u'son',
                                              u'university',
                                              u'pay',
                                              u'studying',
                                              u'noodles',
                                              u'parts',
                                              u'year'],
                                 u'b_name': u'topic_31',
                                 u'b_size': 6903},
                                {u'a_words': [u'materials',
                                              u'home',
                                              u'making',
                                              u'cement',
                                              u'sustaining',
                                              u'wood',
                                              u'make'],
                                 u'b_name': u'topic_29',
                                 u'b_size': 6798},
                                {u'a_words': [u'region',
                                              u'applied',
                                              u'borrowed',
                                              u'pakistan',
                                              u'sheep',
                                              u'times',
                                              u'develop'],
                                 u'b_name': u'topic_0',
                                 u'b_size': 5167},
                                {u'a_words': [u'group',
                                              u'women',
                                              u'members',
                                              u'one',
                                              u'fund',
                                              u'leader',
                                              u'first'],
                                 u'b_name': u'topic_36',
                                 u'b_size': 18547},
                                {u'a_words': [u'school',
                                              u'daughters',
                                              u'students',
                                              u'restaurant',
                                              u'study',
                                              u'two',
                                              u'sons'],
                                 u'b_name': u'topic_39',
                                 u'b_size': 6816},
                                {u'a_words': [u'amount',
                                              u'old',
                                              u'past',
                                              u'purchase',
                                              u'hopes',
                                              u'lives',
                                              u'requesting'],
                                 u'b_name': u'topic_62',
                                 u'b_size': 14207},
                                {u'a_words': [u'one',
                                              u'child',
                                              u'year',
                                              u'old',
                                              u'wheat',
                                              u'fourth',
                                              u'south'],
                                 u'b_name': u'topic_56',
                                 u'b_size': 6558}]},
                 {u'a_words': [u'store',
                               u'farming',
                               u'water',
                               u'clothing',
                               u'products',
                               u'member',
                               u'general'],
                  u'b_name': u'topic_45_57_15_5_47_27_18_59_43_11_23_12_7_50_54_40_16',
                  u'b_size': 143246,
                  u'children': [{u'a_words': [u'maize',
                                              u'farmer',
                                              u'income',
                                              u'milk',
                                              u'dairy',
                                              u'cows',
                                              u'traditional'],
                                 u'b_name': u'topic_45_57',
                                 u'b_size': 24628,
                                 u'children': [{u'a_words': [u'maize',
                                                             u'farmer',
                                                             u'income',
                                                             u'milk',
                                                             u'dairy',
                                                             u'cows',
                                                             u'cattle'],
                                                u'b_name': u'topic_45',
                                                u'b_size': 15951},
                                               {u'a_words': [u'traditional',
                                                             u'ingredients',
                                                             u'profit',
                                                             u'bags',
                                                             u'live',
                                                             u'plans',
                                                             u'soro'],
                                                u'b_name': u'topic_57',
                                                u'b_size': 8677}]},
                                {u'a_words': [u'products',
                                              u'goods',
                                              u'etc',
                                              u'canned',
                                              u'sells',
                                              u'sell',
                                              u'like'],
                                 u'b_name': u'topic_15',
                                 u'b_size': 10764},
                                {u'a_words': [u'poultry',
                                              u'fellowship',
                                              u'god',
                                              u'chickens',
                                              u'attain',
                                              u'chicken',
                                              u'eggs'],
                                 u'b_name': u'topic_5',
                                 u'b_size': 4570},
                                {u'a_words': [u'livestock',
                                              u'feed',
                                              u'agriculture',
                                              u'fattening',
                                              u'primarily',
                                              u'gain',
                                              u'agricultural'],
                                 u'b_name': u'topic_47',
                                 u'b_size': 4485},
                                {u'a_words': [u'farming',
                                              u'harvest',
                                              u'land',
                                              u'farm',
                                              u'fertilizer',
                                              u'crops',
                                              u'farmers'],
                                 u'b_name': u'topic_27',
                                 u'b_size': 20389},
                                {u'a_words': [u'fish',
                                              u'fishing',
                                              u'pig',
                                              u'blessed',
                                              u'mary',
                                              u'firewood',
                                              u'sell'],
                                 u'b_name': u'topic_18',
                                 u'b_size': 4286},
                                {u'a_words': [u'poor',
                                              u'production',
                                              u'poverty',
                                              u'financial',
                                              u'development',
                                              u'francs',
                                              u'country'],
                                 u'b_name': u'topic_59',
                                 u'b_size': 6911},
                                {u'a_words': [u'store',
                                              u'general',
                                              u'items',
                                              u'grocery',
                                              u'groceries',
                                              u'sell',
                                              u'runs'],
                                 u'b_name': u'topic_43',
                                 u'b_size': 13387},
                                {u'a_words': [u'clothing',
                                              u'shoes',
                                              u'tuition',
                                              u'merchandise',
                                              u'sales',
                                              u'cosmetics',
                                              u'selling'],
                                 u'b_name': u'topic_11',
                                 u'b_size': 6693},
                                {u'a_words': [u'man',
                                              u'photo',
                                              u'wife',
                                              u'father',
                                              u'equipment',
                                              u'furniture',
                                              u'right'],
                                 u'b_name': u'topic_23',
                                 u'b_size': 4631},
                                {u'a_words': [u'get',
                                              u'ahead',
                                              u'maria',
                                              u'forward',
                                              u'desire',
                                              u'greatest',
                                              u'borrower'],
                                 u'b_name': u'topic_12',
                                 u'b_size': 5809},
                                {u'a_words': [u'member',
                                              u'program',
                                              u'joined',
                                              u'pmpc',
                                              u'help',
                                              u'foundation',
                                              u'recently'],
                                 u'b_name': u'topic_7',
                                 u'b_size': 9541},
                                {u'a_words': [u'vegetables',
                                              u'vending',
                                              u'fruits',
                                              u'fruit',
                                              u'stall',
                                              u'bananas',
                                              u'vegetable'],
                                 u'b_name': u'topic_50',
                                 u'b_size': 6640},
                                {u'a_words': [u'repair',
                                              u'fertilizers',
                                              u'transportation',
                                              u'motorcycle',
                                              u'transport',
                                              u'maintenance',
                                              u'driver'],
                                 u'b_name': u'topic_54',
                                 u'b_size': 6741},
                                {u'a_words': [u'water',
                                              u'monthly',
                                              u'week',
                                              u'aged',
                                              u'days',
                                              u'family',
                                              u'piped'],
                                 u'b_name': u'topic_40',
                                 u'b_size': 8058},
                                {u'a_words': [u'electricity',
                                              u'funds',
                                              u'soon',
                                              u'ensure',
                                              u'including',
                                              u'onions',
                                              u'mainly'],
                                 u'b_name': u'topic_16',
                                 u'b_size': 5713}]},
                 {u'a_words': [u'nwtf',
                               u'dream',
                               u'sell',
                               u'past',
                               u'charcoal',
                               u'selling',
                               u'expand'],
                  u'b_name': u'topic_13',
                  u'b_size': 7160},
                 {u'a_words': [u'access',
                               u'loans',
                               u'!',
                               u'stocks',
                               u'taken',
                               u'institutions',
                               u'said'],
                  u'b_name': u'topic_55',
                  u'b_size': 4890},
                 {u'a_words': [u'like',
                               u'would',
                               u'partner',
                               u'future',
                               u'describes',
                               u'aspires',
                               u'domestic'],
                  u'b_name': u'topic_21',
                  u'b_size': 9813},
                 {u'a_words': [u'borrowing',
                               u'institution',
                               u'communal',
                               u'raised',
                               u'often',
                               u'economy',
                               u'rest'],
                  u'b_name': u'topic_25',
                  u'b_size': 3045},
                 {u'a_words': [u'day',
                               u'average',
                               u'per',
                               u'credit',
                               u'every',
                               u'within',
                               u'usd'],
                  u'b_name': u'topic_48',
                  u'b_size': 9393},
                 {u'a_words': [u'repaid',
                               u'successfully',
                               u'involved',
                               u'loans',
                               u'individual',
                               u'56',
                               u'adult'],
                  u'b_name': u'topic_38',
                  u'b_size': 4680},
                 {u'a_words': [u'food',
                               u'beverages',
                               u'foods',
                               u'25,000',
                               u'serves',
                               u'prepare',
                               u'selling'],
                  u'b_name': u'topic_3',
                  u'b_size': 4324},
                 {u'a_words': [u'dreams',
                               u'clothes',
                               u'selling',
                               u'goes',
                               u'sell',
                               u'woman',
                               u'sells'],
                  u'b_name': u'topic_22',
                  u'b_size': 5775},
                 {u'a_words': [u'coffee',
                               u'drinks',
                               u'soft',
                               u'weather',
                               u'peru',
                               u'mr.',
                               u'wife'],
                  u'b_name': u'topic_53',
                  u'b_size': 3098},
                 {u'a_words': [u'higher',
                               u'low',
                               u'price',
                               u'prices',
                               u'competition',
                               u'al',
                               u'wholesale'],
                  u'b_name': u'topic_10',
                  u'b_size': 3787},
                 {u'a_words': [u'grade',
                               u'manage',
                               u'born',
                               u'due',
                               u'supports',
                               u'became',
                               u'honest'],
                  u'b_name': u'topic_51',
                  u'b_size': 5044},
                 {u'a_words': [u'local',
                               u'small',
                               u'successful',
                               u'capital',
                               u'working',
                               u'assistance',
                               u'stable'],
                  u'b_name': u'topic_35',
                  u'b_size': 10551},
                 {u'a_words': [u'cash',
                               u'hours',
                               u'hoping',
                               u'manages',
                               u'expects',
                               u'housewife',
                               u'assist'],
                  u'b_name': u'topic_60',
                  u'b_size': 2963},
                 {u'a_words': [u'brac',
                               u'ages',
                               u'generate',
                               u'rosa',
                               u'renovate',
                               u'2011',
                               u'currently'],
                  u'b_name': u'topic_61',
                  u'b_size': 3450}]}
 Dumping object hierarchy into JSON file data/topicmodelling/kivadata.json ... done
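
A quick sanity check on the dumped hierarchy: in the merged nodes printed in full above, b_size equals the sum of the b_size values of the leaf topics below it (for example 15951 + 8677 = 24628 for topic_45_57). The sketch below checks this on a sub-tree copied from the output. The helpers leaf_size_sum and check_sizes are hypothetical names, not part of the project's scripts, and it is only an assumption that data/topicmodelling/kivadata.json uses the same keys as the printed structure.

```python
import json

# Sub-tree copied from the printed hierarchy above (a_words omitted for brevity).
node = {
    "b_name": "topic_45_57",
    "b_size": 24628,
    "children": [
        {"b_name": "topic_45", "b_size": 15951},
        {"b_name": "topic_57", "b_size": 8677},
    ],
}


def leaf_size_sum(n):
    """Sum b_size over all leaf topics below n (hypothetical helper)."""
    children = n.get("children")
    if not children:
        return n["b_size"]
    return sum(leaf_size_sum(c) for c in children)


def check_sizes(n):
    """Return the names of merged nodes whose b_size differs from their leaf sum."""
    bad = []
    if n.get("children"):
        if n["b_size"] != leaf_size_sum(n):
            bad.append(n["b_name"])
        for c in n["children"]:
            bad.extend(check_sizes(c))
    return bad


# Round-trip through JSON, as the dump step does, then run the check.
round_tripped = json.loads(json.dumps(node))
mismatches = check_sizes(round_tripped)
print(mismatches)
```

To run the same check on the real file, replace the round-trip with json.load on the opened data/topicmodelling/kivadata.json, assuming its nodes carry the b_name, b_size and children keys shown above.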

References