#!/usr/bin/env python # coding: utf-8 # ![image.png](attachment:image.png) # ## Modin-spreadsheet # Modin-spreadsheet is a Jupyter notebook widget that allows users to interact with Modin DataFrames in a spreadsheet-like fashion while taking advantage of the underlying capabilities of Modin. The widget makes it quick and easy to explore, sort, filter, edit data and export reproducible code. # # This tutorial will showcase how to use modin-spreadsheet. To follow along, just run the cells; no editing required! # In[ ]: # This notebook expects that Modin and Ray are installed, e.g. by `pip install modin[ray]`. # For all ways to install Modin see official documentation at: # https://modin.readthedocs.io/en/latest/installation.html import modin.pandas as pd # modin-spreadsheet is the pip distribution name, modin_spreadsheet is the import name import modin_spreadsheet # ### Create a Modin DataFrame # The following cells creates a DataFrame using a NYC taxi dataset. # In[ ]: columns_names = [ "trip_id", "vendor_id", "pickup_datetime", "dropoff_datetime", "store_and_fwd_flag", "rate_code_id", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude", "passenger_count", "trip_distance", "fare_amount", "extra", "mta_tax", "tip_amount", "tolls_amount", "ehail_fee", "improvement_surcharge", "total_amount", "payment_type", "trip_type", "pickup", "dropoff", "cab_type", "precipitation", "snow_depth", "snowfall", "max_temperature", "min_temperature", "average_wind_speed", "pickup_nyct2010_gid", "pickup_ctlabel", "pickup_borocode", "pickup_boroname", "pickup_ct2010", "pickup_boroct2010", "pickup_cdeligibil", "pickup_ntacode", "pickup_ntaname", "pickup_puma", "dropoff_nyct2010_gid", "dropoff_ctlabel", "dropoff_borocode", "dropoff_boroname", "dropoff_ct2010", "dropoff_boroct2010", "dropoff_cdeligibil", "dropoff_ntacode", "dropoff_ntaname", "dropoff_puma", ] parse_dates=["pickup_datetime", "dropoff_datetime"] # In[ ]: df = pd.read_csv('https://modin-datasets.s3.amazonaws.com/trips_data.csv', names=columns_names, header=None, parse_dates=parse_dates) # In[ ]: df # ### Generate a spreadsheet widget with the DataFrame # `modin-spreadsheet.show_grid` takes in a DataFrame, optional configuration options, and returns a `SpreadsheetWidget`, which contains all the logic for displaying the spreadsheet view of the DataFrame. The object returned will not be rendered unless displayed. # In[ ]: spreadsheet = modin_spreadsheet.show_grid(df, show_toolbar=True, grid_options={'forceFitColumns': False}) # ### Displaying the Spreadsheet # The widget is displayed when the widget is returned by an input cell or passed to the `display` function e.g. `display(spreadsheet)`. When displayed, the SpreadsheetWidget will generate a transformation history cell that contains a record of the transformations applied to the DataFrame unless the cell already exists or the feature is disabled. # # ### Basic Usage # `Modin-spreadsheet` creates a copy of the input DataFrame, so changes do not alter the original DataFrame. # # **Filter** - Each column can be filtered according to its datatype using the filter button to the right of the column header. Any number of columns can be filtered simultaneously.\ # **Sort** - Each column can be sorted by clicking on the column header. Assumptions on the order of the data should only be made according to the latest sort i.e. the 2nd last sort may not be in order even if grouped by the duplicates in the last sorted column.\ # **Cell Edit** - Double click on a cell to edit its value.\ # **Add Row**(toolbar) - Click on the `Add Row` button in the toolbar to duplicate the last row in the DataFrame.\ # **Remove Row**(toolbar) - Select row(s) on the spreadsheet and click the `Remove Row` button in the toolbar to remove them.\ # **Reset Filters**(toolbar) - Click on the `Reset Filters` button in the toolbar to remove all filters on the data.\ # **Reset Sort**(toolbar) - Click on the `Reset Sort` button in the toolbar to remove any sorting on the data. # # ### Transformation History and Reproducible Code # The widget records the history of transformations, such as filtering, that occur on the spreadsheet. These transformations are updated in the `spreadsheet transformation history` cell as they happen and can be easily copied for reproducibility. The history can be cleared using the `Clear History` button in the toolbar. # # **Try making some changes to the spreadsheet!** # In[ ]: spreadsheet # ## Modin-spreadsheet API # The API on `SpreadsheetWidget` allows users to duplicate some of the functionality on the GUI, but also provides other functionality such as applying the history on another DataFrame or getting the DataFrame that matches the spreadsheet state. # In[ ]: # Duplicates the `Reset Filters` button spreadsheet.reset_filters() # In[ ]: # Duplicates the `Reset Sort` button spreadsheet.reset_sort() # In[ ]: # Duplicates the `Clear History` button spreadsheet.clear_history() # In[ ]: # Gets the modified DataFrame that matches the changes to the spreadsheet spreadsheet.get_changed_df() # ### Retrieving and Applying Transformation History # The transformation history can be retrieved as a list of code snippets using the `get_history` API. The `apply_history` API will apply the transformations on the input DataFrame and return the resultant DataFrame. # In[ ]: spreadsheet.get_history() # In[ ]: another_df = df.copy() spreadsheet.apply_history(another_df)