import pyreadstat
dtafile = './SimData/FifthDayData.dta'
df, meta = pyreadstat.read_dta(dtafile,
usecols=['index', 'Gender', 'Name', 'ID',
'Response'])
df.head()
| | index | ID | Name | Response | Gender |
|---|---|---|---|---|---|
| 0 | 0 | 1 | John | 0.453733 | 0 |
| 1 | 1 | 2 | Billie | 0.257360 | 0 |
| 2 | 2 | 3 | Robert | 0.443393 | 0 |
| 3 | 3 | 4 | Don | 0.423592 | 0 |
| 4 | 4 | 5 | Joseph | 0.571355 | 0 |
meta.__doc__
'\n This class holds metadata we want to give back to python\n '
import pandas as pd
Here we use pandas read_stata to read a .dta file into a pandas DataFrame. Note that you need to download FifthDayData.dta and put it in a subfolder called "SimData", next to this notebook. Alternatively, change dtafile to point to wherever your .dta file is.
dtafile = './SimData/FifthDayData.dta'
df = pd.read_stata(dtafile)
df.head()
df.tail()
| | index | ID | Name | Day | Age | Response | Gender |
|---|---|---|---|---|---|---|---|
| 195 | 195 | 196 | Francisca | Fifth | 27 | 0.260849 | 1 |
| 196 | 196 | 197 | Nia | Fifth | 20 | 0.431105 | 1 |
| 197 | 197 | 198 | Christina | Fifth | 29 | 0.231316 | 1 |
| 198 | 198 | 199 | Marta | Fifth | 26 | 0.424948 | 1 |
| 199 | 199 | 200 | Julia | Fifth | 25 | 0.280474 | 1 |
As can be seen in the output above, the file contains a column named index. We use the index_col parameter to set this column as the DataFrame index:
dtafile = './SimData/FifthDayData.dta'
df = pd.read_stata(dtafile, index_col='index')
from IPython.display import display, HTML
display(HTML(df.head().style.render()))
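The effect of index_col is easiest to see on a small file written locally. The following is a minimal sketch, not the notebook's data: the frame contents and the temporary file name are invented for illustration. It relies on to_stata writing the DataFrame index as a column named index by default (write_index=True), which is one plausible origin of an index column like the one in FifthDayData.dta.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for FifthDayData.dta (values invented for illustration).
sample = pd.DataFrame({"ID": [1, 2, 3], "Response": [0.45, 0.26, 0.44]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.dta")
    sample.to_stata(path)  # write_index=True by default, so an "index" column is stored

    # Without index_col, "index" comes back as an ordinary column.
    plain = pd.read_stata(path)
    # With index_col, the same column becomes the DataFrame index.
    indexed = pd.read_stata(path, index_col="index")

print(list(plain.columns))
print(list(indexed.columns), indexed.index.name)
```

Comparing the two column lists shows that index_col only changes how the stored column is used; the file itself is the same.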
Here we read a Stata file from a URL:
%matplotlib inline
url = 'http://www.principlesofeconometrics.com/stata/broiler.dta'
df = pd.read_stata(url)
df.plot.scatter(x='pchick',
y='cpi')
<matplotlib.axes._subplots.AxesSubplot at 0x1edf0279588>
url = 'http://www.principlesofeconometrics.com/stata/broiler.dta'
df.head()
| | year | q | y | pchick | pbeef | pcor | pf | cpi | qproda | pop | meatex | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1950.0 | 14.3 | 7863.0 | 69.500000 | 31.200001 | 59.799999 | NaN | 24.100000 | 2628500.0 | 151.684006 | NaN | 41.0 |
| 1 | 1951.0 | 15.1 | 7953.0 | 72.900002 | 36.500000 | 72.099998 | NaN | 26.000000 | 2843000.0 | 154.287003 | NaN | 42.0 |
| 2 | 1952.0 | 15.3 | 8071.0 | 73.099998 | 36.200001 | 71.300003 | NaN | 26.500000 | 2851200.0 | 156.953995 | NaN | 43.0 |
| 3 | 1953.0 | 15.2 | 8319.0 | 71.300003 | 28.500000 | 62.700001 | NaN | 26.700001 | 2953900.0 | 159.565002 | NaN | 44.0 |
| 4 | 1954.0 | 15.8 | 8276.0 | 64.400002 | 27.400000 | 63.400002 | NaN | 26.900000 | 3099700.0 | 162.391006 | NaN | 45.0 |
url = 'http://www.principlesofeconometrics.com/stata/broiler.dta'
cols = ['year', 'q', 'y',
'pchick', 'pcor']
df = pd.read_stata(url, columns=cols)
df.head()
| | year | q | y | pchick | pcor |
|---|---|---|---|---|---|
| 0 | 1950.0 | 14.3 | 7863.0 | 69.500000 | 59.799999 |
| 1 | 1951.0 | 15.1 | 7953.0 | 72.900002 | 72.099998 |
| 2 | 1952.0 | 15.3 | 8071.0 | 73.099998 | 71.300003 |
| 3 | 1953.0 | 15.2 | 8319.0 | 71.300003 | 62.700001 |
| 4 | 1954.0 | 15.8 | 8276.0 | 64.400002 | 63.400002 |
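The columns parameter works the same way on local files, which makes it easy to try without a network connection. This is a minimal sketch under stated assumptions: the small frame below is a stand-in for broiler.dta (only a few of its columns, two rows), and the temporary file name is invented.

```python
import os
import tempfile

import pandas as pd

# Stand-in for broiler.dta: a few columns only, for illustration.
wide = pd.DataFrame({
    "year": [1950.0, 1951.0],
    "q": [14.3, 15.1],
    "y": [7863.0, 7953.0],
    "pchick": [69.5, 72.9],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "wide.dta")
    wide.to_stata(path, write_index=False)  # do not store the index as a column

    # Only the listed columns are returned, in the order given.
    subset = pd.read_stata(path, columns=["year", "pchick"])

print(subset)
```

Reading only the columns you need can be worthwhile for wide Stata files, since the unused variables never end up in the DataFrame.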
In this example, we are going to use pandas to_stata to save a .dta file to our hard drive. pyreadstat's write_dta does the same job:
pyreadstat.write_dta(df, 'broilerdata_edited.dta')
df.to_stata('broilerdata_edited.dta')
To save a .csv file as a Stata file, we just use read_csv and to_stata:
df = pd.read_csv("./SimData/FifthDayData.csv")
df.to_stata("./SimData/FifthDayData.dta")
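To check that such a conversion preserves the data, the file can be read back and compared. The sketch below is illustrative only: it creates its own tiny CSV in a temporary directory (contents and file names are invented) instead of using FifthDayData.csv, and it passes write_index=False so that no extra index column is stored in the .dta file.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = os.path.join(tmpdir, "example.csv")
    dta_path = os.path.join(tmpdir, "example.dta")

    # Create a tiny CSV to convert (contents invented for illustration).
    with open(csv_path, "w") as f:
        f.write("ID,Response\n1,0.45\n2,0.26\n")

    df_csv = pd.read_csv(csv_path)
    df_csv.to_stata(dta_path, write_index=False)  # CSV -> Stata

    # Read the Stata file back to confirm the round trip.
    roundtrip = pd.read_stata(dta_path)

print(roundtrip)
```

Integer columns may come back with a narrower integer dtype than they had in the CSV-derived frame, so comparing values is more reliable than comparing dtypes.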
To save a .xlsx file as a Stata file, we just use read_excel and, again, to_stata:
df = pd.read_excel("./SimData/example_concat.xlsx")
df.to_stata("./SimData/example_concat.dta")
from IPython.display import display, HTML
display( HTML( df.head().style.render()))
import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Via https://stackoverflow.com/a/52572919/454773
def setup_screenshot(driver, path):
    # Temporarily resize the window to the full page so the screenshot is not cropped
    original_size = driver.get_window_size()
    required_width = driver.execute_script('return document.body.parentNode.scrollWidth')
    required_height = driver.execute_script('return document.body.parentNode.scrollHeight')
    driver.set_window_size(required_width, required_height)
    # driver.save_screenshot(path)  # has scrollbar
    driver.find_element(By.TAG_NAME, 'body').screenshot(path)  # avoids scrollbar
    driver.set_window_size(original_size['width'], original_size['height'])
def getTableImage(url, fn='dummy_table', basepath='.', path='.', delay=5, height=420, width=800):
    ''' Render HTML file in browser and grab a screenshot. '''
    browser = webdriver.Chrome()
    browser.get(url)
    # Give the HTML some time to load
    time.sleep(delay)
    imgpath = '{}/{}.png'.format(path, fn)
    imgfn = '{}/{}'.format(basepath, imgpath)
    imgfile = '{}/{}'.format(os.getcwd(), imgfn)
    setup_screenshot(browser, imgfile)
    browser.quit()
    # Remove the temporary HTML file that was rendered
    os.remove(imgfile.replace('.png', '.html'))
    #print(imgfn)
    return imgpath
def getTablePNG(tablehtml, basepath='.', path='testpng', fnstub='testhtml'):
    ''' Save HTML table as: {basepath}/{path}/{fnstub}.png '''
    # Check and create the same directory that the file is written to
    outdir = '{}/{}'.format(basepath, path)
    if not os.path.exists(outdir):
        os.makedirs(outdir)
    fn = '{cwd}/{basepath}/{path}/{fn}.html'.format(cwd=os.getcwd(), basepath=basepath, path=path, fn=fnstub)
    tmpurl = 'file://{fn}'.format(fn=fn)
    with open(fn, 'w') as out:
        out.write(tablehtml)
    return getTableImage(tmpurl, fnstub, basepath, path)
# The first argument must be a string containing HTML, e.g. s = df.style.render()
# (display() returns None, so its result cannot be passed here)
getTablePNG(df.head().style.render(), path=".")
df.to_stata('broilerdata_edited.dta')
Help on method to_stata in module pandas.core.frame:

to_stata(fname, convert_dates=None, write_index=True, encoding='latin-1', byteorder=None, time_stamp=None, data_label=None, variable_labels=None, version=114, convert_strl=None) method of pandas.core.frame.DataFrame instance
    Export DataFrame object to Stata dta format.

    Writes the DataFrame to a Stata dataset file.
    "dta" files contain a Stata dataset.

    Parameters
    ----------
    fname : str, buffer or path object
        String, path object (pathlib.Path or py._path.local.LocalPath) or
        object implementing a binary write() function. If using a buffer
        then the buffer will not be automatically closed after the file
        data has been written.
    convert_dates : dict
        Dictionary mapping columns containing datetime types to stata
        internal format to use when writing the dates. Options are 'tc',
        'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer
        or a name. Datetime columns that do not have a conversion type
        specified will be converted to 'tc'. Raises NotImplementedError if
        a datetime column has timezone information.
    write_index : bool
        Write the index to Stata dataset.
    encoding : str
        Default is latin-1. Unicode is not supported.
    byteorder : str
        Can be ">", "<", "little", or "big". default is `sys.byteorder`.
    time_stamp : datetime
        A datetime to use as file creation date. Default is the current
        time.
    data_label : str, optional
        A label for the data set. Must be 80 characters or smaller.
    variable_labels : dict
        Dictionary containing columns as keys and variable labels as
        values. Each label must be 80 characters or smaller.

        .. versionadded:: 0.19.0

    version : {114, 117}, default 114
        Version to use in the output dta file. Version 114 can be used
        read by Stata 10 and later. Version 117 can be read by Stata 13
        or later. Version 114 limits string variables to 244 characters or
        fewer while 117 allows strings with lengths up to 2,000,000
        characters.

        .. versionadded:: 0.23.0

    convert_strl : list, optional
        List of column names to convert to string columns to Stata StrL
        format. Only available if version is 117. Storing strings in the
        StrL format can produce smaller dta files if strings have more than
        8 characters and values are repeated.

        .. versionadded:: 0.23.0

    Raises
    ------
    NotImplementedError
        * If datetimes contain timezone information
        * Column dtype is not representable in Stata
    ValueError
        * Columns listed in convert_dates are neither datetime64[ns]
          or datetime.datetime
        * Column listed in convert_dates is not in DataFrame
        * Categorical label contains more than 32,000 characters

        .. versionadded:: 0.19.0

    See Also
    --------
    read_stata : Import Stata data files.
    io.stata.StataWriter : Low-level writer for Stata data files.
    io.stata.StataWriter117 : Low-level writer for version 117 files.

    Examples
    --------
    >>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon',
    ...                               'parrot'],
    ...                    'speed': [350, 18, 361, 15]})
    >>> df.to_stata('animals.dta')  # doctest: +SKIP
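Several of the parameters listed in the help text can be combined in one call. The following is a rough sketch rather than a recommended recipe: it reuses the animals frame from the docstring example, while the label texts and the temporary file name are invented for illustration. The labels are read back through the reader object that read_stata returns when iterator=True is passed.

```python
import os
import tempfile

import pandas as pd

# Same frame as in the docstring example above.
animals = pd.DataFrame({
    "animal": ["falcon", "parrot", "falcon", "parrot"],
    "speed": [350, 18, 361, 15],
})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "animals.dta")
    animals.to_stata(
        path,
        write_index=False,                 # skip the extra "index" column
        version=117,                       # readable by Stata 13 or later
        data_label="Animal speeds",        # invented label, max 80 characters
        variable_labels={"speed": "Top speed (km/h)"},  # invented label
    )

    # iterator=True returns a StataReader, which exposes the stored labels.
    with pd.read_stata(path, iterator=True) as reader:
        labels = reader.variable_labels()
        file_label = reader.data_label

print(file_label)
print(labels)
```

Variable labels are metadata only; the column names and values in the DataFrame are unaffected by them.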