We'd like to test how Taylor Salo integrated MALLET into NeuroSynth, and whether that integration works in a Docker container.
First, let's import some dependencies and text to work with.
For testing, we'll use an XML file separately downloaded from PubMed. In the spirit of NeuroSynth, we downloaded Tal Yarkoni's bibliography. Thanks, Tal!
from bs4 import BeautifulSoup
import pandas as pd

with open('../neurosynth/tests/data/yarkoni_pubmed.xml') as infile:
    xml_file = infile.read()

# The 'lxml' HTML parser lowercases tag names, so all lookups below use lowercase.
soup = BeautifulSoup(xml_file, 'lxml')
try:
    assert isinstance(soup, BeautifulSoup)
except AssertionError:
    print('Check file type! Must be HTML or XML.')

# Compare the number of titles and abstracts to spot incomplete records.
titles = soup.find_all('articletitle')
abstracts = soup.find_all('abstract')
if len(titles) != len(abstracts):
    print('Warning: Some articles do not have abstracts on PubMed!')
    print('Only articles with complete data will be included.')
Warning: Some articles do not have abstracts on PubMed!
Only articles with complete data will be included.
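To see which records trigger this warning, one option is to walk the articles and collect the PMIDs of those without an abstract element. The sketch below is only illustrative: it runs on a small made-up XML string rather than the real file, uses Python's built-in `html.parser` so it does not depend on lxml, and the helper name `pmids_missing_abstracts` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a PubMed export; the PMIDs and text are illustrative only.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>With abstract</ArticleTitle>
        <Abstract><AbstractText>Some text.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article><ArticleTitle>Commentary</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def pmids_missing_abstracts(xml_text):
    """Return the PMIDs of articles that have no abstract element."""
    # html.parser lowercases tag names, so lookups use lowercase.
    soup = BeautifulSoup(xml_text, "html.parser")
    missing = []
    for article in soup.find_all("pubmedarticle"):
        if article.find("abstract") is None:
            missing.append(article.find("pmid").get_text())
    return missing


print(pmids_missing_abstracts(SAMPLE_XML))
```

Pointing the same helper at `xml_file` should list the articles that the filter will drop, assuming each record carries a `PMID` element.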
Three articles do not have abstracts, perhaps because they are commentaries. We need to filter the results to keep only articles with abstracts, and then load the matching articles into a pandas DataFrame.
abstracts = []
pmids = []
articles = soup.find_all('pubmedarticle')
for a in articles:
    # Keep only articles that contain an abstract element.
    if a.find_all('abstract'):
        # This is a little messy, but pulls out the
        # results in plain text without another loop.
        # Note: only the first abstracttext element of each article is kept.
        abstracts.append(a.find_all('abstracttext')[0].get_text())
        pmids.append(a.find_all(idtype='pubmed')[0].get_text())

df = pd.DataFrame({'pmid': pmids,
                   'abstract': abstracts})
df.head()
| | abstract | pmid |
|---|---|---|
| 0 | Compassion is critical for societal wellbeing.... | 27018610 |
| 1 | Open access, open data, open source and other ... | 27387362 |
| 2 | The functional organization of human medial fr... | 27307242 |
| 3 | Social scientists often seek to demonstrate th... | 27031707 |
| 4 | Decades of animal and human neuroimaging resea... | 26831091 |
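Before handing a table like this to a topic model, it may be worth checking that the PMIDs are unique and that no abstract is blank. The following is a minimal sketch on a made-up toy frame with the same two columns; `check_abstract_frame` is a hypothetical helper, not part of NeuroSynth.

```python
import pandas as pd


def check_abstract_frame(frame):
    """Return a list of problems found in a pmid/abstract DataFrame."""
    problems = []
    if frame["pmid"].duplicated().any():
        problems.append("duplicate PMIDs")
    # Treat whitespace-only strings as empty abstracts.
    if frame["abstract"].str.strip().eq("").any():
        problems.append("empty abstracts")
    return problems


# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({"pmid": ["1", "2", "2"],
                    "abstract": ["Some text.", "", "More text."]})
print(check_abstract_frame(toy))
```

An empty list means neither problem was detected.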
We have a test dataset! Let's see how it plays with MALLET.
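Since part of the goal is to confirm that this works inside a container, a quick pre-flight check of the environment can help. The sketch below assumes that Java is needed to run MALLET and that a `mallet` launcher might be on the `PATH`; `topic_models` may well locate MALLET some other way, so a "not found" result here is not conclusive. The helper name `find_executables` is hypothetical.

```python
import shutil


def find_executables(names=("java", "mallet")):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_executables().items():
    print(f"{name}: {path or 'not found on PATH'}")
```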
import os
import subprocess
import shutil
import sys
sys.path.append(os.path.abspath('..'))
from neurosynth.analysis.reduce import topic_models
weights_df, keys_df = topic_models(df)
keys_df.head()
MALLET toolbox found!
Abstracts folder not found. Creating abstract files...
Generating topics...
| topic | terms |
|---|---|
| topic_000 | decision making neuroimaging gains demonstrate... |
| topic_001 | social anxiety disorder generalized type givin... |
| topic_002 | orthographic visual language widespread neighb... |
| topic_003 | connectivity findings global found state lpfc ... |
| topic_004 | data human provide coactivation brain map api ... |
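Each topic's terms appear to be stored as one space-separated string, which is awkward to query. As a rough sketch, assuming that layout holds, the strings can be split into one row per topic and term. The toy frame below mimics the shape of `keys_df` with made-up terms; it is not real output.

```python
import pandas as pd

# Toy stand-in for keys_df: a "terms" column indexed by topic name.
toy_keys = pd.DataFrame(
    {"terms": ["alpha beta gamma", "delta epsilon"]},
    index=pd.Index(["topic_000", "topic_001"], name="topic"),
)

# Split each string on whitespace, then expand to one row per term.
term_lists = toy_keys["terms"].str.split()
long_form = term_lists.explode().reset_index(name="term")
print(long_form)
```

The long format makes it straightforward to ask, for example, which topics contain a given term via `long_form[long_form["term"] == "alpha"]`.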