%load_ext autoreload
%autoreload 2
import sys
sys.path.append("..")
from optimus import Optimus
op = Optimus("spark")
df = op.load.csv("data/foo.csv")
data/foo.csv
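`op.load.csv` reads the file, treats the first row as a header, and infers column types (which is why `id`, `billingId`, and `price` come back as `int32` below). A minimal sketch of that kind of header-plus-inference pass using only the standard library — the in-memory sample and the naive `infer` helper are illustrative, not Optimus internals:

```python
import csv
import io

# A tiny in-memory stand-in for a CSV file (hypothetical sample rows).
raw = "id,firstName,price\n1,Luis,10\n2,André,8\n"

rows = list(csv.DictReader(io.StringIO(raw)))

def infer(values):
    """Naive type inference: int if every value parses as int, else str."""
    try:
        [int(v) for v in values]
        return int
    except ValueError:
        return str

types = {col: infer([r[col] for r in rows]) for col in rows[0]}
print(types)  # → {'id': <class 'int'>, 'firstName': <class 'str'>, 'price': <class 'int'>}
```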
import os
df
| id (int32) | firstName (object) | lastName (object) | billingId (int32) | product (object) | price (int32) | birth (object) | dummyCol (object) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Luis | Alvarez$$%! | 123 | Cake | 10 | 1980/07/07 | never |
| 2 | André | Ampère | 423 | piza | 8 | 1950/07/08 | gonna |
| 3 | NiELS | Böhr//((%% | 551 | pizza | 8 | 1990/07/09 | give |
| 4 | PAUL | dirac$ | 521 | pizza | 8 | 1954/07/10 | you |
| 5 | Albert | Einstein | 634 | pizza | 8 | 1990/07/11 | up |
| 6 | Galileo | ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI | 672 | arepa | 5 | 1930/08/12 | never |
| 7 | CaRL | Ga%%%uss | 323 | taco | 3 | 1970/07/13 | gonna |
| 8 | David | H$$$ilbert | 624 | taaaccoo | 3 | 1950/07/14 | let |
| 9 | Johannes | KEPLER | 735 | taco | 3 | 1920/04/22 | you |
| 10 | JaMES | M$$ax%%well | 875 | taco | 3 | 1923/03/12 | down |
| 11 | Isaac | Newton | 992 | pasta | 9 | 1999/02/15 | never⋅ |
df.cols.lower()
| id (object) | firstName (object) | lastName (object) | billingId (object) | product (object) | price (object) | birth (object) | dummyCol (object) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | luis | alvarez$$%! | 123 | cake | 10 | 1980/07/07 | never |
| 2 | andré | ampère | 423 | piza | 8 | 1950/07/08 | gonna |
| 3 | niels | böhr//((%% | 551 | pizza | 8 | 1990/07/09 | give |
| 4 | paul | dirac$ | 521 | pizza | 8 | 1954/07/10 | you |
| 5 | albert | einstein | 634 | pizza | 8 | 1990/07/11 | up |
| 6 | galileo | ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅galilei | 672 | arepa | 5 | 1930/08/12 | never |
| 7 | carl | ga%%%uss | 323 | taco | 3 | 1970/07/13 | gonna |
| 8 | david | h$$$ilbert | 624 | taaaccoo | 3 | 1950/07/14 | let |
| 9 | johannes | kepler | 735 | taco | 3 | 1920/04/22 | you |
| 10 | james | m$$ax%%well | 875 | taco | 3 | 1923/03/12 | down |
| 11 | isaac | newton | 992 | pasta | 9 | 1999/02/15 | never⋅ |
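`df.cols.lower()` lowercases every string cell (note that in this build all columns come back typed as `object` afterwards, including `id`). The same per-cell transform sketched over plain dictionaries — sample values taken from the rows above:

```python
rows = [
    {"firstName": "Luis", "lastName": "Alvarez$$%!"},
    {"firstName": "NiELS", "lastName": "Böhr//((%%"},
]

# Apply str.lower to every string value, leaving other types untouched.
lowered = [
    {k: v.lower() if isinstance(v, str) else v for k, v in row.items()}
    for row in rows
]
print(lowered[1]["firstName"])  # → niels
```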
df.cols.upper()
| id (object) | firstName (object) | lastName (object) | billingId (object) | product (object) | price (object) | birth (object) | dummyCol (object) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | LUIS | ALVAREZ$$%! | 123 | CAKE | 10 | 1980/07/07 | NEVER |
| 2 | ANDRÉ | AMPÈRE | 423 | PIZA | 8 | 1950/07/08 | GONNA |
| 3 | NIELS | BÖHR//((%% | 551 | PIZZA | 8 | 1990/07/09 | GIVE |
| 4 | PAUL | DIRAC$ | 521 | PIZZA | 8 | 1954/07/10 | YOU |
| 5 | ALBERT | EINSTEIN | 634 | PIZZA | 8 | 1990/07/11 | UP |
| 6 | GALILEO | ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALILEI | 672 | AREPA | 5 | 1930/08/12 | NEVER |
| 7 | CARL | GA%%%USS | 323 | TACO | 3 | 1970/07/13 | GONNA |
| 8 | DAVID | H$$$ILBERT | 624 | TAAACCOO | 3 | 1950/07/14 | LET |
| 9 | JOHANNES | KEPLER | 735 | TACO | 3 | 1920/04/22 | YOU |
| 10 | JAMES | M$$AX%%WELL | 875 | TACO | 3 | 1923/03/12 | DOWN |
| 11 | ISAAC | NEWTON | 992 | PASTA | 9 | 1999/02/15 | NEVER⋅ |
df.cols.std("id")
df.cols.mean("id")
df.cols.kurtosis("id")
df.cols.median("id")
10.0
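The basic statistics over the visible `id` values (1–11) can be cross-checked with the standard library; kurtosis is omitted here since it needs scipy or a manual formula. The `10.0` shown above comes from the transcript's full dataset (`count_uniques` later reports 19 distinct ids), so it need not match the 11 rows displayed:

```python
import statistics

ids = list(range(1, 12))  # the 11 id values visible in the table above

print(statistics.mean(ids))    # → 6
print(statistics.median(ids))  # → 6
print(statistics.stdev(ids))   # sample standard deviation, sqrt(11)
```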
df.cols.abs("id")
df.cols.reverse("id")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-15-942f6678b894> in <module>
      1 df.cols.abs("id")
----> 2 df.cols.reverse("id")

~\Documents\Optimus\optimus\engines\spark\columns.py in reverse(columns)
    641         """
    642         # TODO: make this in one pass.
--> 643         df = self.root
    644
    645         columns = parse_columns(
NameError: name 'self' is not defined
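The traceback above points at a bug in this Optimus build: `reverse` is defined without `self` in its parameter list, so the instance is bound to `columns` and the body's reference to `self.root` is an undefined name at call time. A minimal reproduction of the same mistake (the `Columns` class here is illustrative, not Optimus code):

```python
class Columns:
    def __init__(self):
        self.root = "df"

    # Bug reproduction: 'self' is missing from the signature, so the
    # instance lands in 'columns' and 'self' is an undefined name.
    def reverse(columns):
        return self.root  # NameError: name 'self' is not defined

try:
    Columns().reverse()
except NameError as e:
    print(e)  # → name 'self' is not defined
```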
df.cols.var(["id"])
df.cols.skew(["id"])
0.0
df.cols.rename("id","id1")
df.cols.drop("id")
| firstName (object) | lastName (object) | billingId (int32) | product (object) | price (int32) | birth (object) | dummyCol (object) |
| --- | --- | --- | --- | --- | --- | --- |
| Luis | Alvarez$$%! | 123 | Cake | 10 | 1980/07/07 | never |
| André | Ampère | 423 | piza | 8 | 1950/07/08 | gonna |
| NiELS | Böhr//((%% | 551 | pizza | 8 | 1990/07/09 | give |
| PAUL | dirac$ | 521 | pizza | 8 | 1954/07/10 | you |
| Albert | Einstein | 634 | pizza | 8 | 1990/07/11 | up |
| Galileo | ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅GALiLEI | 672 | arepa | 5 | 1930/08/12 | never |
| CaRL | Ga%%%uss | 323 | taco | 3 | 1970/07/13 | gonna |
| David | H$$$ilbert | 624 | taaaccoo | 3 | 1950/07/14 | let |
| Johannes | KEPLER | 735 | taco | 3 | 1920/04/22 | you |
| JaMES | M$$ax%%well | 875 | taco | 3 | 1923/03/12 | down |
| Isaac | Newton | 992 | pasta | 9 | 1999/02/15 | never⋅ |
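`df.cols.drop("id")` returns a frame without the named column, as the table above shows. Sketched over plain dictionaries (the sample row is made up to mirror the data):

```python
rows = [{"id": 1, "firstName": "Luis", "price": 10}]

# Rebuild each row without the dropped key; the original rows are untouched.
dropped = [{k: v for k, v in row.items() if k != "id"} for row in rows]
print(dropped)  # → [{'firstName': 'Luis', 'price': 10}]
```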
df.cols.names()
['id', 'firstName', 'lastName', 'billingId', 'product', 'price', 'birth', 'dummyCol']
df.cols.count_uniques()
{'id': 19, 'firstName': 19, 'lastName': 19, 'billingId': 19, 'product': 13, 'price': 8, 'birth': 19, 'dummyCol': 13}
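`count_uniques` reports the number of distinct values per column (the 19s indicate the full dataset has more rows than the 11 displayed earlier). Equivalent counting over a plain list — this sample uses only the visible `product` values, so its count differs from the 13 reported above:

```python
products = ["Cake", "piza", "pizza", "pizza", "pizza", "arepa",
            "taco", "taaaccoo", "taco", "taco", "pasta"]

# Distinct values per column: just the size of the value set.
uniques = {"product": len(set(products))}
print(uniques)  # → {'product': 7}
```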