Hi, this notebook will show you almost all the columns operation availables in Optimus. For row operation, IO, ML and DL please go to the examples folder in the repo

In [1]:

%load_ext autoreload
%autoreload 2

In [2]:

import sys

In [3]:

sys.path.append("..")

Columns Operations¶

In this notebook you can see a detailed overview ok all the columns operations available in Optimus. You can access the operation via df.cols.

In [4]:

from optimus import Optimus

C:\Users\argenisleon\Anaconda3\lib\site-packages\socks.py:58: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Callable

    You are using PySparkling of version 2.4.10, but your PySpark is of
    version 2.3.1. Please make sure Spark and PySparkling versions are compatible. 
`formatargspec` is deprecated since Python 3.5. Use `signature` and the `Signature` object directly

In [5]:

# Create optimus
op = Optimus()

Create dataframe¶

In [6]:

from pyspark.sql.types import StringType, IntegerType, ArrayType

df = op.create.df(
    [
        ("words", "str", True),
        ("num", "int", True),
        ("animals", "str", True),
        ("thing", StringType(), True),
        ("two strings", StringType(), True),
        ("filter", StringType(), True),
        ("num 2", "string", True),
        ("col_array", ArrayType(StringType()), True),
        ("col_int", ArrayType(IntegerType()), True)

    ]
    ,
    [
        ("  I like     fish  ", 1, "dog", "housé", "cat-car", "a", "1", ["baby", "sorry"], [1, 2, 3]),
        ("    zombies", 2, "cat", "tv", "dog-tv", "b", "2", ["baby 1", "sorry 1"], [3, 4]),
        ("simpsons   cat lady", 2, "frog", "table", "eagle-tv-plus", "1", "3", ["baby 2", "sorry 2"], [5, 6, 7]),
        (None, 3, "eagle", "glass", "lion-pc", "c", "4", ["baby 3", "sorry 3"], [7, 8])
    ])

df.table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Create Columns¶

Spark¶

You can not create multiple columns at the same time
You need to use the lit function. lit???

Pandas¶

Assing function seems to do the job https://stackoverflow.com/questions/12555323/adding-new-column-to-existing-dataframe-in-python-pandas

Create a column with a constant value¶

In [7]:

df = df.cols.append("new_col_1", 1)
df.table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int) not nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]	1
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]	1
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

In [17]:

df.table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int) not nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]	1
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]	1
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Append 2 or multiples dataframes columns like¶

In [79]:

df_col = op.create.df(
    [
        ("new_col", "str", True),
       

    ],[
        ("q"),("w"), ("e"), ("r"),

    ])

In [87]:

df.cols.append(df_col).table()

run profiler
send to queue
<dispatched append>
run profiler
send to queue
<function cols.<locals>.select at 0x000002335A5F1C80>
run profiler
send to queue
<function cols.<locals>.count at 0x000002335A7611E0>

Viewing 4 of 4 rows / 11 columns

200 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int) nullable	new_col 11 (string) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]	1	q
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]	1	r
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]	1	e
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]	1	w

Viewing 4 of 4 rows / 11 columns

200 partition(s)

Create multiple columns with a constant value¶

In [7]:

from pyspark.sql.functions import *

df.cols.append([
    ("new_col_2", 2.22),
    ("new_col_3", lit(3))
]).table()

Viewing 4 of 4 rows / 12 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)	new_col_2 11 (double)	new_col_3 12 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1	2.22	3
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1	2.22	3
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1	2.22	3
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1	2.22	3

Viewing 4 of 4 rows / 12 columns

8 partition(s)

Create multiple columns with a constant string, a new column with existing columns value and an array¶

In [ ]:

df.cols.append([
    ("new_col_4", "test"),
    ("new_col_5", df['num'] * 2),
    ("new_col_6", [1, 2, 3])
]).table()

Select columns¶

Spark¶

You can not select columns by string and index at the same time

Pandas¶

You can not select columns by string and index at the same time

In [9]:

df.table()
columns = ["words", 1, "animals", 3]
df.cols.select(columns).table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Viewing 4 of 4 rows / 4 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé
⸱⸱⸱⸱zombies	2	cat	tv
simpsons⸱⸱⸱cat⸱lady	2	frog	table
None	3	eagle	glass

Viewing 4 of 4 rows / 4 columns

8 partition(s)

Select columns with a Regex¶

In [10]:

df.cols.select("n.*", regex=True).table()

Viewing 4 of 4 rows / 3 columns

8 partition(s)

num 1 (int) nullable	num 2 2 (string) nullable	new_col_1 3 (int)
1	1	1
2	2	1
2	3	1
3	4	1

Viewing 4 of 4 rows / 3 columns

8 partition(s)

Select all the columns of type string¶

In [11]:

df.cols.select("*", data_type="str").table()

['col_int', 'num', 'new_col_1', 'col_array'] column(s) was not processed because is not ['str']

Viewing 4 of 4 rows / 6 columns

8 partition(s)

words 1 (string) nullable	num 2 2 (string) nullable	two strings 3 (string) nullable	animals 4 (string) nullable	thing 5 (string) nullable	filter 6 (string) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	cat-car	dog	housé	a
⸱⸱⸱⸱zombies	2	dog-tv	cat	tv	b
simpsons⸱⸱⸱cat⸱lady	3	eagle-tv-plus	frog	table	1
None	4	lion-pc	eagle	glass	c

Viewing 4 of 4 rows / 6 columns

8 partition(s)

Rename Column¶

Spark¶

You can not rename multiple columns using Spark Vanilla API

Pandas¶

Almost the same behavior https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

In [12]:

df.cols.rename('num', 'number').table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	number 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Rename multiple columns and uppercase all the columns¶

In [13]:

df.cols.rename([('num', 'number'), ("animals", "gods")], str.upper).table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

WORDS 1 (string) nullable	NUM 2 (int) nullable	ANIMALS 3 (string) nullable	THING 4 (string) nullable	TWO STRINGS 5 (string) nullable	FILTER 6 (string) nullable	NUM 2 7 (string) nullable	COL_ARRAY 8 (array<string>) nullable	COL_INT 9 (array<int>) nullable	NEW_COL_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Convert to lower case¶

In [14]:

df.cols.rename(str.lower).table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Convert to uppercase¶

In [15]:

df.cols.rename(str.upper).table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

WORDS 1 (string) nullable	NUM 2 (int) nullable	ANIMALS 3 (string) nullable	THING 4 (string) nullable	TWO STRINGS 5 (string) nullable	FILTER 6 (string) nullable	NUM 2 7 (string) nullable	COL_ARRAY 8 (array<string>) nullable	COL_INT 9 (array<int>) nullable	NEW_COL_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Cast a columns¶

This is a opinionated way to handle column casting. One of the first thing that every data cleaning process need to acomplish is define a data dictionary. Because of that we prefer to create a tuple like this:

df.cols().cast( [("words","str"), ("num","int"), ("animals","float"), ("thing","str")] )

Spark¶

Can not cast multiple columns

Pandas¶

with astype() https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html

In [11]:

df.cols.cast(columns=[("num", "string"), ("num 2", "integer")]).dtypes

Out[11]:

[('words', 'string'),
 ('num', 'string'),
 ('animals', 'string'),
 ('thing', 'string'),
 ('two strings', 'string'),
 ('filter', 'string'),
 ('num 2', 'int'),
 ('col_array', 'array<string>'),
 ('col_int', 'array<int>')]

Cast a column to string¶

In [17]:

df.cols.cast("num", "string").dtypes

Out[17]:

[('words', 'string'),
 ('num', 'string'),
 ('animals', 'string'),
 ('thing', 'string'),
 ('two strings', 'string'),
 ('filter', 'string'),
 ('num 2', 'string'),
 ('col_array', 'array<string>'),
 ('col_int', 'array<int>'),
 ('new_col_1', 'int')]

Cast all columns to string¶

In [18]:

df.cols.cast("*", "string").dtypes

Out[18]:

[('words', 'string'),
 ('num', 'string'),
 ('animals', 'string'),
 ('thing', 'string'),
 ('two strings', 'string'),
 ('filter', 'string'),
 ('num 2', 'string'),
 ('col_array', 'string'),
 ('col_int', 'string'),
 ('new_col_1', 'string')]

Cast a column to Vectors¶

In [19]:

from pyspark.ml.linalg import Vectors

df.cols.cast("col_int", Vectors)

Out[19]:

DataFrame[words: string, num: int, animals: string, thing: string, two strings: string, filter: string, num 2: string, col_array: array<string>, col_int: vector, new_col_1: int]

Keep columns¶

Spark¶

You can you df.select() to get the columns you want

Pandas¶

Via drop()

In [ ]:

from pyspark.sql.functions import *

In [ ]:

df.withColumn("num", col("num").cast(StringType()))

In [21]:

df.table()
df.cols.keep("num").table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Viewing 4 of 4 rows / 1 columns

8 partition(s)

num 1 (int) nullable
1
2
2
3

Viewing 4 of 4 rows / 1 columns

8 partition(s)

Move columns¶

Spark¶

Do not exist in spark

Pandas¶

Do not exist in pandas

In [22]:

df.cols.move("words", "after", "thing").table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

num 1 (int) nullable	animals 2 (string) nullable	thing 3 (string) nullable	words 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
1	dog	housé	⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
2	cat	tv	⸱⸱⸱⸱zombies	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
2	frog	table	simpsons⸱⸱⸱cat⸱lady	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
3	eagle	glass	None	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Sorting Columns¶

Spark¶

You can not sort columns using Spark Vanilla API

Pandas¶

df.reindex_axis(sorted(df.columns), axis=1)

Sort in Alphabetical order¶

In [23]:

df.cols.sort().table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

animals 1 (string) nullable	col_array 2 (array<string>) nullable	col_int 3 (array<int>) nullable	filter 4 (string) nullable	new_col_1 5 (int)	num 6 (int) nullable	num 2 7 (string) nullable	thing 8 (string) nullable	two strings 9 (string) nullable	words 10 (string) nullable
dog	['baby',⸱'sorry']	[1,⸱2,⸱3]	a	1	1	1	housé	cat-car	⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱
cat	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	b	1	2	2	tv	dog-tv	⸱⸱⸱⸱zombies
frog	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1	1	2	3	table	eagle-tv-plus	simpsons⸱⸱⸱cat⸱lady
eagle	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	c	1	3	4	glass	lion-pc	None

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Sort in Reverse Alphabetical order¶

In [24]:

df.cols.sort(order="desc").table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	two strings 2 (string) nullable	thing 3 (string) nullable	num 2 4 (string) nullable	num 5 (int) nullable	new_col_1 6 (int)	filter 7 (string) nullable	col_int 8 (array<int>) nullable	col_array 9 (array<string>) nullable	animals 10 (string) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	cat-car	housé	1	1	1	a	[1,⸱2,⸱3]	['baby',⸱'sorry']	dog
⸱⸱⸱⸱zombies	dog-tv	tv	2	2	1	b	[3,⸱4]	['baby⸱1',⸱'sorry⸱1']	cat
simpsons⸱⸱⸱cat⸱lady	eagle-tv-plus	table	3	2	1	1	[5,⸱6,⸱7]	['baby⸱2',⸱'sorry⸱2']	frog
None	lion-pc	glass	4	3	1	c	[7,⸱8]	['baby⸱3',⸱'sorry⸱3']	eagle

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Drop columns¶

Spark¶

You can not delete multiple colums

Pandas¶

Almost the same as pandas

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

Drop one columns¶

In [25]:

df2 = df.cols.drop("num")
df2.table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	animals 2 (string) nullable	thing 3 (string) nullable	two strings 4 (string) nullable	filter 5 (string) nullable	num 2 6 (string) nullable	col_array 7 (array<string>) nullable	col_int 8 (array<int>) nullable	new_col_1 9 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Drop multiple columns¶

In [26]:

df2 = df.cols.drop(["num", "words"])
df2.table()

Viewing 4 of 4 rows / 8 columns

8 partition(s)

animals 1 (string) nullable	thing 2 (string) nullable	two strings 3 (string) nullable	filter 4 (string) nullable	num 2 5 (string) nullable	col_array 6 (array<string>) nullable	col_int 7 (array<int>) nullable	new_col_1 8 (int)
dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 8 columns

8 partition(s)

In [27]:

df.table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Chaining¶

.cols y .rows attributes are used to organize and encapsulate optimus functionality apart from Apache Spark Dataframe API.

At the same time it can be helpfull when you look at the code because every line is self explained.

The past transformations were done step by step, but this can be achieved by chaining all operations into one line of code, like the cell below. This way is much more efficient and scalable because it uses all optimization issues from the lazy evaluation approach.

In [28]:

df.table()
df \
    .cols.rename([('num', 'number')]) \
    .cols.drop(["number", "words"]) \
    .withColumn("new_col_2", lit("spongebob")) \
    .cols.append("new_col_1", 1) \
    .cols.sort(order="desc") \
    .rows.drop(df["num 2"] == 3) \
    .table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Viewing 3 of 3 rows / 9 columns

8 partition(s)

two strings 1 (string) nullable	thing 2 (string) nullable	num 2 3 (string) nullable	new_col_2 4 (string)	new_col_1 5 (int)	filter 6 (string) nullable	col_int 7 (array<int>) nullable	col_array 8 (array<string>) nullable	animals 9 (string) nullable
cat-car	housé	1	spongebob	1	a	[1,⸱2,⸱3]	['baby',⸱'sorry']	dog
dog-tv	tv	2	spongebob	1	b	[3,⸱4]	['baby⸱1',⸱'sorry⸱1']	cat
lion-pc	glass	4	spongebob	1	c	[7,⸱8]	['baby⸱3',⸱'sorry⸱3']	eagle

Viewing 3 of 3 rows / 9 columns

8 partition(s)

Unnest Columns¶

With unnest you can convert one column into multiple ones. it can hadle string, array and vectors

Spark¶

Can split strings with split()

Pandas¶

via str.split()

In [29]:

df.table()
df.cols.unnest("two strings", "-") \
    .table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Viewing 4 of 4 rows / 13 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)	two strings_0 11 (string) nullable	two strings_1 12 (string) nullable	two strings_2 13 (string) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1	cat	car	None
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1	dog	tv	None
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1	eagle	tv	plus
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1	lion	pc	None

Viewing 4 of 4 rows / 13 columns

8 partition(s)

Only get the first element¶

In [30]:

df.cols.unnest("two strings", "-", index=1).table()

Viewing 4 of 4 rows / 11 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)	two strings_1 11 (string) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1	car
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1	tv
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1	tv
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1	pc

Viewing 4 of 4 rows / 11 columns

8 partition(s)

Unnest array of string¶

In [31]:

df \
    .cols.unnest("col_array") \
    .table()

Viewing 4 of 4 rows / 12 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)	col_array_0 11 (string) nullable	col_array_1 12 (string) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1	baby	sorry
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1	baby⸱1	sorry⸱1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1	baby⸱2	sorry⸱2
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1	baby⸱3	sorry⸱3

Viewing 4 of 4 rows / 12 columns

8 partition(s)

Unnest and array of ints¶

In [32]:

df \
    .cols.unnest(["col_int"]) \
    .table()

Viewing 4 of 4 rows / 13 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)	col_int_0 11 (int) nullable	col_int_1 12 (int) nullable	col_int_2 13 (int) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1	1	2	3
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1	3	4	None
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1	5	6	7
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1	7	8	None

Viewing 4 of 4 rows / 13 columns

8 partition(s)

Spits in 3 parts¶

In [33]:

df \
    .cols.unnest(["two strings"], splits=3, separator="-") \
    .table()

Viewing 4 of 4 rows / 13 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)	two strings_0 11 (string) nullable	two strings_1 12 (string) nullable	two strings_2 13 (string) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1	cat	car	None
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1	dog	tv	None
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1	eagle	tv	plus
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1	lion	pc	None

Viewing 4 of 4 rows / 13 columns

8 partition(s)

Unnest a Vector¶

In [ ]:

# from pyspark.ml.linalg import Vectors

# df1 = op.sc.parallelize([
#     ("assert", Vectors.dense([1, 2, 3])),
#     ("require", Vectors.sparse(3, {1: 2}))
# ]).toDF()

In [ ]:

# df1 \
#     .cols.unnest(["vector"]) \
#     .table()

In [ ]:

# df = df.cols.append("new_col_1", 1)

Impute¶

Fill missing data¶

In [35]:

df_fill = op.spark.createDataFrame([(1.0, float("nan"), "1"),
                                    (2.0, float("nan"), "nan"),
                                    (float("nan"), 3.0, None),
                                    (4.0, 4.0, "2"),
                                    (5.0, 5.0, "2")
                                    ], ["a", "b", "c"]
                                   )

In [36]:

df_fill.table()

Viewing 5 of 5 rows / 3 columns

8 partition(s)

a 1 (double) nullable	b 2 (double) nullable	c 3 (string) nullable
1.0	nan	1
2.0	nan	nan
nan	3.0	None
4.0	4.0	2
5.0	5.0	2

Viewing 5 of 5 rows / 3 columns

8 partition(s)

In [37]:

df_fill.cols.impute(["a", "b"], "continuous", "median").table()

1 values imputed for column(s) 'a'
2 values imputed for column(s) 'b'

Viewing 5 of 5 rows / 5 columns

8 partition(s)

a 1 (float) nullable	b 2 (float) nullable	c 3 (string) nullable	a_imputed 4 (float) nullable	b_imputed 5 (float) nullable
1.0	nan	1	1.0	4.0
2.0	nan	nan	2.0	4.0
nan	3.0	None	2.0	3.0
4.0	4.0	2	4.0	4.0
5.0	5.0	2	5.0	5.0

Viewing 5 of 5 rows / 5 columns

8 partition(s)

In [38]:

df_fill.cols.impute(["c"], "categorical").table()

Including 'nan' as Null in processing 'c'
2 values imputed for column(s) 'c' with '2'

Viewing 5 of 5 rows / 3 columns

8 partition(s)

a 1 (double) nullable	b 2 (double) nullable	c 3 (string) nullable
1.0	nan	1
2.0	nan	2
nan	3.0	2
4.0	4.0	2
5.0	5.0	2

Viewing 5 of 5 rows / 3 columns

8 partition(s)

Get columns by type¶

Spark¶

Not implemented in Spark Vanilla

Pandas¶

In [39]:

df.cols.select_by_dtypes("int").table()

['words', 'filter', 'two strings', 'animals', 'col_array', 'col_int', 'thing', 'num 2'] column(s) was not processed because is not ['int']

Viewing 4 of 4 rows / 2 columns

8 partition(s)

num 1 (int) nullable	new_col_1 2 (int)
1	1
2	1
2	1
3	1

Viewing 4 of 4 rows / 2 columns

8 partition(s)

Apply custom function¶

Spark have few ways to transform data rdd, Columns Expression, UDF and Pandas UDF. apply() and apply_expr() try to make a consistent way to call this expression without knowing the implementation details.

Spark¶

You need to declare a UDF Spark function

Pandas¶

Almost the same behavior that Optimus

In [40]:

df.table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Create a function that only apply to string value in column filter¶

Sometimes there are columns with for example with numbers even when are supposed to be only of words or letters.

In order to solve this problem, apply_by_dtypes() function can be used.

In the next example we replace a number in a string column with "new string"

In [41]:

def func(val, attr):
    return attr


df.cols.apply_by_dtypes("filter", func, "string", "new string", data_type="integer").table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	new⸱string	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Create a UDF function that sum a values(32 in this case) to two columns¶

In [42]:

def func(val, attr):
    return val + attr


df.cols.apply(["num", "new_col_1"], func, "int", 32, "udf").table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	33	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	33
⸱⸱⸱⸱zombies	34	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	33
simpsons⸱⸱⸱cat⸱lady	34	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	33
None	35	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	33

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Create a Pandas UDF function that sum a values(32 in this case) to two columns¶

In [43]:

def func(val, attr):
    return val + attr


df.cols.apply(["num", "new_col_1"], func, "int", 10).table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int) nullable
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	11	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	11
⸱⸱⸱⸱zombies	12	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	11
simpsons⸱⸱⸱cat⸱lady	12	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	11
None	13	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	11

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Select row where column "filter" is "integer"¶

In [ ]:

from optimus.audf import filter_row_by_data_type as fbdt
df.rows.select(fbdt("filter", "integer")).table()

Create an abstract dataframe to filter a rows where the value of column "num"> 1¶

In [ ]:

from optimus.audf import abstract_udf as audf

def func(val, attr):
    return val > 1

df.rows.select(audf("num", func, "boolean")).table()

Create an abstract dataframe (Pandas UDF) to pass two arguments to a function a apply a sum operation¶

In [ ]:

from optimus.audf import abstract_udf as audf

def func(val, attr):
    return val + attr[0] + attr[1]

df.withColumn("num_sum", audf("num", func, "int", [10, 20])).table()

Apply a column expression to when the value of "num" or "num 2" is grater than 2¶

In [ ]:

from pyspark.sql import functions as F

def func(col_name, attr):
    return F.when(F.col(col_name) > 2, 10).otherwise(1)

df.cols.apply_expr(["num", "num 2"], func).table()

Convert to uppercase¶

In [48]:

def func(col_name, attr):
    return F.upper(F.col(col_name))


df.cols.apply_expr(["two strings", "animals"], func).table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	DOG	housé	CAT-CAR	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	CAT	tv	DOG-TV	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	FROG	table	EAGLE-TV-PLUS	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	EAGLE	glass	LION-PC	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Using apply with a condition¶

In [49]:

def func(val, attr):
    return 10

col = "num"

df.cols.apply(col, func, "int", when=df["num"] > 1).table()

df.cols.apply(col, func, "int", when=fbdt(col, "int")).table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	10	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	10	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	10	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	10	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	10	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	10	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	10	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

Count Nulls¶

In [50]:

import numpy as np

df_null = op.spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float('nan')),
     (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))

In [51]:

df_null.table()

Viewing 7 of 7 rows / 3 columns

8 partition(s)

session 1 (bigint) nullable	timestamp1 2 (bigint) nullable	id2 3 (double) nullable
1	1	None
1	2	5.0
1	3	nan
1	4	None
1	5	10.0
1	6	nan
1	6	nan

Viewing 7 of 7 rows / 3 columns

8 partition(s)

In [52]:

df_null.cols.count_na("id2")

Out[52]:

In [53]:

df_null.cols.count_na("*")

Out[53]:

{'session': 0, 'timestamp1': 0, 'id2': 5}

Count uniques¶

Spark¶

Pandas¶

In [54]:

df.cols.count_uniques("*")

Out[54]:

{'words': 3,
 'num': 3,
 'animals': 4,
 'thing': 4,
 'two strings': 4,
 'filter': 4,
 'num 2': 4,
 'col_array': 3,
 'col_int': 4,
 'new_col_1': 1}

Unique¶

Spark¶

An abstraction of distinct to be use in multiple columns at the same time

Pandas¶

Similar behavior than pandas

In [55]:

df.table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	new_col_1 10 (int)
⸱⸱I⸱like⸱⸱⸱⸱⸱fish⸱⸱	1	dog	housé	cat-car	a	1	['baby',⸱'sorry']	[1,⸱2,⸱3]	1
⸱⸱⸱⸱zombies	2	cat	tv	dog-tv	b	2	['baby⸱1',⸱'sorry⸱1']	[3,⸱4]	1
simpsons⸱⸱⸱cat⸱lady	2	frog	table	eagle-tv-plus	1	3	['baby⸱2',⸱'sorry⸱2']	[5,⸱6,⸱7]	1
None	3	eagle	glass	lion-pc	c	4	['baby⸱3',⸱'sorry⸱3']	[7,⸱8]	1

Viewing 4 of 4 rows / 10 columns

8 partition(s)

In [29]:

df_distinct = op.create.df(
    [
        ("words", "str", True),
        ("num", "int", True)
    ],
    [
        ("  I like     fish  ", 1),
        ("    zombies", 2),
        ("simpsons   cat lady", 2),
        (None, 3),
        (None, 0)
    ])

In [42]:

df_distinct.cols.unique("num").table()

Viewing 4 of 4 rows / 1 columns

200 partition(s)

num 1 (int) nullable
1
3
2
0

Viewing 4 of 4 rows / 1 columns

200 partition(s)

Count Zeros¶

In [43]:

df_zeros = df_distinct
df_zeros.cols.count_zeros("*")

Out[43]:

[{'num': 1}]

Column Data Types¶

In [44]:

df.cols.dtypes('*')

Out[44]:

{'words': 'string',
 'num': 'int',
 'animals': 'string',
 'thing': 'string',
 'two strings': 'string',
 'filter': 'string',
 'num 2': 'string',
 'col_array': 'array<string>',
 'col_int': 'array<int>'}

Replace¶

In [45]:

df.table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Replace "dog","cat" in column "animals" by the word "animals"¶

In [46]:

df.cols.replace("animals", ["dog", "cat"], "animals").table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	animals	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	animals	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Replace "dog-tv", "cat", "eagle", "fish" in columns "two strings","animals" by the string "animals"¶

In [47]:

df.cols.replace(["two strings", "animals"], ["dog-tv", "cat", "eagle", "fish"], "animals").table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	animals-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	animals	tv	animals	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	animals-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	animals	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Replace "dog" by "dog_1" and "cat" by "cat_1" in columns "animals"¶

In [48]:

df.cols.replace("animals", [("dog", "dog_1"), ("cat", "cat_1")]).table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Replace in column "animals", "dog" by "pet"¶

In [49]:

df.cols.replace("animals", "dog", "animal").table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	animal	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Replace a,b,c by % in all columns¶

In [50]:

df.cols.replace("*", ["a", "b", "c"], "%").table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (string) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	%%t-%%r	%	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zom%ies	2	%%t	tv	dog-tv	%	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅%%t⋅l%dy	2	frog	t%%le	e%gle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	e%gle	gl%ss	lion-p%	%	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Replace 3 and 2 by 10 in a numeric columns¶

In [51]:

df.cols.replace('num', ["3", 2], 10).table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (string) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	10	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	10	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	10	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Replace 3 by 6 and 2 by 12 in a numeric columns¶

In [52]:

df.cols.replace('num', [("3", 6), (2, 12)]).table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (string) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Replace as words¶

In [58]:

df.cols.replace("animals", "dog", "animal", search_by="words").table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	animal	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

In [57]:

df.cols.replace("animals", "dog", "animal", search_by="words", output_cols="friends").table()

Viewing 4 of 4 rows / 10 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	friends 10 (string) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	animal	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]	dog
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]	cat
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]	frog
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]	eagle

Viewing 4 of 4 rows / 10 columns

8 partition(s)

In [61]:

df.cols.replace("*", "1", "11", search_by="chars", output_cols="new").table()

Viewing 4 of 4 rows / 16 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	wordsnew 10 (string) nullable	numnew 11 (string) nullable	animalsnew 12 (string) nullable	thingnew 13 (string) nullable	two stringsnew 14 (string) nullable	filternew 15 (string) nullable	num 2new 16 (string) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]	⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	11	dog	housé	cat-car	a	11
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]	⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]	simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	11	3
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]	None	3	eagle	glass	lion-pc	c	4

Viewing 4 of 4 rows / 16 columns

8 partition(s)

In [62]:

df.cols.replace("animals", [("dog", "dog_1"), ("cat", "cat_1")], "words").table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

In [63]:

df.cols.replace("animals", ["dog", "cat"], "animals", "words").table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	animals	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	animals	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Use Regex¶

In [66]:

df.cols.replace_regex('*', '.*[Cc]at.*', 'cat_1').table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (string) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat_1	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	cat_1	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
cat_1	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

Nest¶

Merge two columns in a column vector¶

Match the string as a word not as a substring¶

In [15]:

df = df.cols.copy("num", "num_1")
df.cols.nest(["num", "num_1"], output_col="col_nested", shape="vector").table()

Viewing 4 of 4 rows / 11 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	num_1 10 (int) nullable	col_nested 11 (vector) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]	1	[1.0,1.0]
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]	2	[2.0,2.0]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]	2	[2.0,2.0]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]	3	[3.0,3.0]

Viewing 4 of 4 rows / 11 columns

8 partition(s)

Merge two columns in a string columns¶

In [16]:

df.cols.nest(["animals", "two strings"], output_col="col_nested", shape="string").table()

Viewing 4 of 4 rows / 12 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable	num_1 10 (int) nullable	animalscol_nested 11 (string) not nullable	two stringscol_nested 12 (string) not nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	['baby',⋅'sorry']	[1,⋅2,⋅3]	1	dogcat-car	dogcat-car
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]	2	catdog-tv	catdog-tv
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]	2	frogeagle-tv-plus	frogeagle-tv-plus
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]	3	eaglelion-pc	eaglelion-pc

Viewing 4 of 4 rows / 12 columns

8 partition(s)

Merge three columns in an array¶

In [76]:

df.cols.nest(["animals", "two strings", "num 2"], output_col="col_nested", shape="array").table()

Viewing 4 of 4 rows / 12 columns

8 partition(s)

words 1 (string) nullable	num 2 (string) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (string) nullable	col_int 9 (string) nullable	animalscol_nested 10 (array<string>) not nullable	two stringscol_nested 11 (array<string>) not nullable	num 2col_nested 12 (array<string>) not nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	1	[baby,⋅sorry]	[1,⋅2,⋅3]	['dog',⋅'cat-car',⋅'1']	['dog',⋅'cat-car',⋅'1']	['dog',⋅'cat-car',⋅'1']
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	2	[baby⋅1,⋅sorry⋅1]	[3,⋅4]	['cat',⋅'dog-tv',⋅'2']	['cat',⋅'dog-tv',⋅'2']	['cat',⋅'dog-tv',⋅'2']
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	[baby⋅2,⋅sorry⋅2]	[5,⋅6,⋅7]	['frog',⋅'eagle-tv-plus',⋅'3']	['frog',⋅'eagle-tv-plus',⋅'3']	['frog',⋅'eagle-tv-plus',⋅'3']
None	3	eagle	glass	lion-pc	c	4	[baby⋅3,⋅sorry⋅3]	[7,⋅8]	['eagle',⋅'lion-pc',⋅'4']	['eagle',⋅'lion-pc',⋅'4']	['eagle',⋅'lion-pc',⋅'4']

Viewing 4 of 4 rows / 12 columns

8 partition(s)

Histograms¶

In [82]:

df = op.load.csv("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.csv")

In [83]:

df.table()

Viewing 10 of 19 rows / 8 columns

1 partition(s)

id 1 (int) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (int) nullable	product 5 (string) nullable	price 6 (int) nullable	birth 7 (string) nullable	dummyCol 8 (string) nullable
1	Luis	Alvarez$$%!	123	Cake	10	1980/07/07	never
2	André	Ampère	423	piza	8	1950/07/08	gonna
3	NiELS	Böhr//((%%	551	pizza	8	1990/07/09	give
4	PAUL	dirac$	521	pizza	8	1954/07/10	you
5	Albert	Einstein	634	pizza	8	1990/07/11	up
6	Galileo	⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI	672	arepa	5	1930/08/12	never
7	CaRL	Ga%%%uss	323	taco	3	1970/07/13	gonna
8	David	H$$$ilbert	624	taaaccoo	3	1950/07/14	let
9	Johannes	KEPLER	735	taco	3	1920/04/22	you
10	JaMES	M$$ax%%well	875	taco	3	1923/03/12	down

Viewing 10 of 19 rows / 8 columns

1 partition(s)

In [84]:

df.cols.hist("price", 10)

Out[84]:

[{'count': 1, 'lower': 1.0, 'upper': 1.9},
 {'count': 1, 'lower': 1.9, 'upper': 2.8},
 {'count': 4, 'lower': 2.8, 'upper': 3.6999999999999997},
 {'count': 1, 'lower': 3.6999999999999997, 'upper': 4.6},
 {'count': 2, 'lower': 4.6, 'upper': 5.5},
 {'count': 0, 'lower': 5.5, 'upper': 6.4},
 {'count': 0, 'lower': 6.4, 'upper': 7.300000000000001},
 {'count': 6, 'lower': 7.300000000000001, 'upper': 8.200000000000001},
 {'count': 2, 'lower': 8.200000000000001, 'upper': 9.100000000000001},
 {'count': 2, 'lower': 9.100000000000001, 'upper': 10.0}]

In [85]:

df.cols.frequency("billingId")

Out[85]:

{'billingId': [{'value': 992, 'count': 1},
  {'value': 912, 'count': 1},
  {'value': 886, 'count': 1},
  {'value': 875, 'count': 1},
  {'value': 812, 'count': 1},
  {'value': 735, 'count': 1},
  {'value': 672, 'count': 1},
  {'value': 634, 'count': 1},
  {'value': 624, 'count': 1},
  {'value': 553, 'count': 1}]}

Statistics¶

Quantile Statistics¶

In [86]:

print(df.cols.min("billingId"))
print(df.cols.percentile(['billingId', 'price'], [0.05, 0.25, 0.5, 0.75, 0.95]))
print(df.cols.max("billingId"))
print(df.cols.median(["billingId", "price"]))
print(df.cols.range(["billingId", "price"]))
print(df.cols.std(["billingId", "price"]))

111
{'billingId': {'0.05': 111.0, '0.25': 111.0, '0.5': 111.0, '0.75': 111.0, '0.95': 111.0}, 'price': {'0.05': 1.0, '0.25': 1.0, '0.5': 1.0, '0.75': 1.0, '0.95': 1.0}}
992
{'billingId': 111.0, 'price': 1.0}
{'billingId': {'min': 111, 'max': 992}, 'price': {'min': 1, 'max': 10}}
{'billingId': 280.19735, 'price': 2.95285}

In [87]:

print(df.cols.min("*"))

{'id': 1, 'firstName': '(((   Heinrich )))))', 'lastName': '             GALiLEI', 'billingId': 111, 'product': '110790', 'price': 1, 'birth': '1899/01/01', 'dummyCol': '#'}

In [88]:

print(df.cols.kurt("billingId"))
print(df.cols.mean("billingId"))
print(df.cols.skewness("billingId"))
print(df.cols.sum("billingId"))
print(df.cols.variance("billingId"))
print(df.cols.mad("billingId"))

-1.04115
556.0
-0.2137
10564
78510.55556
0.0

Calculate Median Absolute deviation¶

In [89]:

df.cols.mad("price")

Out[89]:

0.0

In [90]:

df.cols.mad("price", more=True)

Out[90]:

{'mad': 0.0, 'median': 1.0}

Calculate precentiles¶

In [91]:

print(df.cols.percentile(['price'], [0.05, 0.25, 0.5, 0.75, 0.95]))

{'0.05': 1.0, '0.25': 1.0, '0.5': 1.0, '0.75': 1.0, '0.95': 1.0}

Calculate Mode¶

In [92]:

print(df.cols.mode(["price", "billingId"]))

[{'price': 8}, {'billingId': None}]

String Operations¶

In [93]:

df.table()

Viewing 10 of 19 rows / 8 columns

1 partition(s)

id 1 (int) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (int) nullable	product 5 (string) nullable	price 6 (int) nullable	birth 7 (string) nullable	dummyCol 8 (string) nullable
1	Luis	Alvarez$$%!	123	Cake	10	1980/07/07	never
2	André	Ampère	423	piza	8	1950/07/08	gonna
3	NiELS	Böhr//((%%	551	pizza	8	1990/07/09	give
4	PAUL	dirac$	521	pizza	8	1954/07/10	you
5	Albert	Einstein	634	pizza	8	1990/07/11	up
6	Galileo	⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI	672	arepa	5	1930/08/12	never
7	CaRL	Ga%%%uss	323	taco	3	1970/07/13	gonna
8	David	H$$$ilbert	624	taaaccoo	3	1950/07/14	let
9	Johannes	KEPLER	735	taco	3	1920/04/22	you
10	JaMES	M$$ax%%well	875	taco	3	1923/03/12	down

Viewing 10 of 19 rows / 8 columns

1 partition(s)

In [94]:

df \
    .cols.trim("lastName") \
    .cols.lower("lastName") \
    .cols.upper(["product", "firstName"]) \
    .cols.reverse("firstName") \
    .table()

Viewing 10 of 19 rows / 8 columns

1 partition(s)

id 1 (int) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (int) nullable	product 5 (string) nullable	price 6 (int) nullable	birth 7 (string) nullable	dummyCol 8 (string) nullable
1	SIUL	alvarez$$%!	123	CAKE	10	1980/07/07	never
2	ÉRDNA	ampère	423	PIZA	8	1950/07/08	gonna
3	SLEIN	böhr//((%%	551	PIZZA	8	1990/07/09	give
4	LUAP	dirac$	521	PIZZA	8	1954/07/10	you
5	TREBLA	einstein	634	PIZZA	8	1990/07/11	up
6	OELILAG	galilei	672	AREPA	5	1930/08/12	never
7	LRAC	ga%%%uss	323	TACO	3	1970/07/13	gonna
8	DIVAD	h$$$ilbert	624	TAAACCOO	3	1950/07/14	let
9	SENNAHOJ	kepler	735	TACO	3	1920/04/22	you
10	SEMAJ	m$$ax%%well	875	TACO	3	1923/03/12	down

Viewing 10 of 19 rows / 8 columns

1 partition(s)

Calculate the interquartile range¶

In [95]:

df.cols.iqr("price")

Out[95]:

5.0

In [96]:

df.cols.iqr("price", more=True)

Out[96]:

{'iqr': 5.0, 'q1': 3.0, 'q2': 8.0, 'q3': 8.0}

Calculate Zscore¶

In [97]:

df.cols.z_score("price").table()

Viewing 10 of 19 rows / 8 columns

1 partition(s)

id 1 (int) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (int) nullable	product 5 (string) nullable	price 6 (double) nullable	birth 7 (string) nullable	dummyCol 8 (string) nullable
1	Luis	Alvarez$$%!	123	Cake	1.336800040638705	1980/07/07	never
2	André	Ampère	423	piza	0.6594882909731277	1950/07/08	gonna
3	NiELS	Böhr//((%%	551	pizza	0.6594882909731277	1990/07/09	give
4	PAUL	dirac$	521	pizza	0.6594882909731277	1954/07/10	you
5	Albert	Einstein	634	pizza	0.6594882909731277	1990/07/11	up
6	Galileo	⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI	672	arepa	0.3564793335252382	1930/08/12	never
7	CaRL	Ga%%%uss	323	taco	1.0337910831908155	1970/07/13	gonna
8	David	H$$$ilbert	624	taaaccoo	1.0337910831908155	1950/07/14	let
9	Johannes	KEPLER	735	taco	1.0337910831908155	1920/04/22	you
10	JaMES	M$$ax%%well	875	taco	1.0337910831908155	1923/03/12	down

Viewing 10 of 19 rows / 8 columns

1 partition(s)

Cleaning and Date Operations Operations¶

In [98]:

df.cols.years_between("birth", "yyyyMMdd", ).table()

Viewing 10 of 19 rows / 8 columns

1 partition(s)

id 1 (int) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (int) nullable	product 5 (string) nullable	price 6 (int) nullable	birth 7 (float) nullable	dummyCol 8 (string) nullable
1	Luis	Alvarez$$%!	123	Cake	10	None	never
2	André	Ampère	423	piza	8	None	gonna
3	NiELS	Böhr//((%%	551	pizza	8	None	give
4	PAUL	dirac$	521	pizza	8	None	you
5	Albert	Einstein	634	pizza	8	None	up
6	Galileo	⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI	672	arepa	5	None	never
7	CaRL	Ga%%%uss	323	taco	3	None	gonna
8	David	H$$$ilbert	624	taaaccoo	3	None	let
9	Johannes	KEPLER	735	taco	3	None	you
10	JaMES	M$$ax%%well	875	taco	3	None	down

Viewing 10 of 19 rows / 8 columns

1 partition(s)

In [99]:

df.cols.remove("*", ["&", "%"]).table()

Viewing 10 of 19 rows / 8 columns

1 partition(s)

id 1 (string) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (string) nullable	product 5 (string) nullable	price 6 (string) nullable	birth 7 (string) nullable	dummyCol 8 (string) nullable
1	Luis	Alvarez$$!	123	Cake	10	1980/07/07	never
2	André	Ampère	423	piza	8	1950/07/08	gonna
3	NiELS	Böhr//((	551	pizza	8	1990/07/09	give
4	PAUL	dirac$	521	pizza	8	1954/07/10	you
5	Albert	Einstein	634	pizza	8	1990/07/11	up
6	Galileo	⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI	672	arepa	5	1930/08/12	never
7	CaRL	Gauss	323	taco	3	1970/07/13	gonna
8	David	H$$$ilbert	624	taaaccoo	3	1950/07/14	let
9	Johannes	KEPLER	735	taco	3	1920/04/22	you
10	JaMES	M$$axwell	875	taco	3	1923/03/12	down

Viewing 10 of 19 rows / 8 columns

1 partition(s)

In [100]:

df.cols.remove_accents("lastName").table()

Viewing 10 of 19 rows / 8 columns

1 partition(s)

id 1 (int) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (int) nullable	product 5 (string) nullable	price 6 (int) nullable	birth 7 (string) nullable	dummyCol 8 (string) nullable
1	Luis	Alvarez$$%!	123	Cake	10	1980/07/07	never
2	André	Ampere	423	piza	8	1950/07/08	gonna
3	NiELS	Bohr//((%%	551	pizza	8	1990/07/09	give
4	PAUL	dirac$	521	pizza	8	1954/07/10	you
5	Albert	Einstein	634	pizza	8	1990/07/11	up
6	Galileo	⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI	672	arepa	5	1930/08/12	never
7	CaRL	Ga%%%uss	323	taco	3	1970/07/13	gonna
8	David	H$$$ilbert	624	taaaccoo	3	1950/07/14	let
9	Johannes	KEPLER	735	taco	3	1920/04/22	you
10	JaMES	M$$ax%%well	875	taco	3	1923/03/12	down

Viewing 10 of 19 rows / 8 columns

1 partition(s)

In [101]:

df.cols.remove_special_chars("lastName").table()

Viewing 10 of 19 rows / 8 columns

1 partition(s)

id 1 (int) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (int) nullable	product 5 (string) nullable	price 6 (int) nullable	birth 7 (string) nullable	dummyCol 8 (string) nullable
1	Luis	Alvarez	123	Cake	10	1980/07/07	never
2	André	Ampère	423	piza	8	1950/07/08	gonna
3	NiELS	Böhr	551	pizza	8	1990/07/09	give
4	PAUL	dirac	521	pizza	8	1954/07/10	you
5	Albert	Einstein	634	pizza	8	1990/07/11	up
6	Galileo	⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI	672	arepa	5	1930/08/12	never
7	CaRL	Gauss	323	taco	3	1970/07/13	gonna
8	David	Hilbert	624	taaaccoo	3	1950/07/14	let
9	Johannes	KEPLER	735	taco	3	1920/04/22	you
10	JaMES	Maxwell	875	taco	3	1923/03/12	down

Viewing 10 of 19 rows / 8 columns

1 partition(s)

In [102]:

df.cols.clip("billingId", 100, 200).table()

Viewing 10 of 19 rows / 8 columns

1 partition(s)

id 1 (int) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (int) nullable	product 5 (string) nullable	price 6 (int) nullable	birth 7 (string) nullable	dummyCol 8 (string) nullable
1	Luis	Alvarez$$%!	123	Cake	10	1980/07/07	never
2	André	Ampère	200	piza	8	1950/07/08	gonna
3	NiELS	Böhr//((%%	200	pizza	8	1990/07/09	give
4	PAUL	dirac$	200	pizza	8	1954/07/10	you
5	Albert	Einstein	200	pizza	8	1990/07/11	up
6	Galileo	⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI	200	arepa	5	1930/08/12	never
7	CaRL	Ga%%%uss	200	taco	3	1970/07/13	gonna
8	David	H$$$ilbert	200	taaaccoo	3	1950/07/14	let
9	Johannes	KEPLER	200	taco	3	1920/04/22	you
10	JaMES	M$$ax%%well	200	taco	3	1923/03/12	down

Viewing 10 of 19 rows / 8 columns

1 partition(s)

In [103]:

df_abs = op.create.df(
    [
        ("words", "str", True),
        ("num", "int", True),
        ("animals", "str", True),
        ("thing", StringType(), True),
        ("two strings", StringType(), True),
        ("filter", StringType(), True),
        ("num 2", "string", True),
        ("col_array", ArrayType(StringType()), True),
        ("col_int", ArrayType(IntegerType()), True)

    ]
    ,
    [
        ("  I like     fish  ", -1, "dog", "housé", "cat-car", "a", "-1", ["baby", "sorry"], [1, 2, 3]),
        ("    zombies", -2, "cat", "tv", "dog-tv", "b", "-2", ["baby 1", "sorry 1"], [3, 4]),
        ("simpsons   cat lady", -2, "frog", "table", "eagle-tv-plus", "1", "3", ["baby 2", "sorry 2"], [5, 6, 7]),
        (None, 3, "eagle", "glass", "lion-pc", "c", "4", ["baby 3", "sorry 3"], [7, 8])
    ])

In [104]:

df_abs.cols.abs(["num", "num 2"]).table()

Viewing 4 of 4 rows / 9 columns

8 partition(s)

words 1 (string) nullable	num 2 (int) nullable	animals 3 (string) nullable	thing 4 (string) nullable	two strings 5 (string) nullable	filter 6 (string) nullable	num 2 7 (string) nullable	col_array 8 (array<string>) nullable	col_int 9 (array<int>) nullable
⋅⋅I⋅like⋅⋅⋅⋅⋅fish⋅⋅	1	dog	housé	cat-car	a	-1	['baby',⋅'sorry']	[1,⋅2,⋅3]
⋅⋅⋅⋅zombies	2	cat	tv	dog-tv	b	-2	['baby⋅1',⋅'sorry⋅1']	[3,⋅4]
simpsons⋅⋅⋅cat⋅lady	2	frog	table	eagle-tv-plus	1	3	['baby⋅2',⋅'sorry⋅2']	[5,⋅6,⋅7]
None	3	eagle	glass	lion-pc	c	4	['baby⋅3',⋅'sorry⋅3']	[7,⋅8]

Viewing 4 of 4 rows / 9 columns

8 partition(s)

In [105]:

df.cols.qcut("billingId", 5).table()

Viewing 10 of 19 rows / 9 columns

1 partition(s)

id 1 (int) nullable	firstName 2 (string) nullable	lastName 3 (string) nullable	billingId 4 (int) nullable	product 5 (string) nullable	price 6 (int) nullable	birth 7 (string) nullable	dummyCol 8 (string) nullable	billingId_QCUT 9 (double) nullable
1	Luis	Alvarez$$%!	123	Cake	10	1980/07/07	never	0.0
2	André	Ampère	423	piza	8	1950/07/08	gonna	1.0
3	NiELS	Böhr//((%%	551	pizza	8	1990/07/09	give	2.0
4	PAUL	dirac$	521	pizza	8	1954/07/10	you	2.0
5	Albert	Einstein	634	pizza	8	1990/07/11	up	3.0
6	Galileo	⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI	672	arepa	5	1930/08/12	never	3.0
7	CaRL	Ga%%%uss	323	taco	3	1970/07/13	gonna	1.0
8	David	H$$$ilbert	624	taaaccoo	3	1950/07/14	let	2.0
9	Johannes	KEPLER	735	taco	3	1920/04/22	you	3.0
10	JaMES	M$$ax%%well	875	taco	3	1923/03/12	down	4.0

Viewing 10 of 19 rows / 9 columns

1 partition(s)

In [ ]: