This notebook is a short tutorial about the DataFrameTransformer, DataFrameAnalyzer, DataFrameProfiler and Utilities modules.
DataFrameTransformer is a dedicated module for easily applying transformations to a DataFrame.
DataFrameProfiler is a dedicated module for running a basic profile of a DataFrame.
DataFrameAnalyzer is a dedicated module for plotting and inspecting the important features of a Spark DataFrame.
# Import optimus
import optimus as op
The Utilities class is a helper class with functions to read CSV files and to set up checkpointing (to persist or temporarily save DataFrames).
# Instance of Utilities class
tools = op.Utilities()
# Reading the dataframe; in this case the local file
# system (the hard drive of the PC) is used.
df = tools.read_csv(path="foo.csv", sep=',')
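Besides reading CSV files, the same Utilities instance can be pointed at a checkpoint folder before running heavy transformation pipelines. The call below is only a sketch: the method name and arguments are assumptions based on the Optimus API and may differ in your version, so it is left commented out.
# Hypothetical sketch (method name and arguments assumed, verify against your Optimus version):
# point Optimus at a local folder where intermediate DataFrames can be checkpointed.
# tools.set_check_point_folder("checkpoints/", "local")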
Initially it is a good idea to see a general view of the DataFrame to be analyzed.
In the following cell, a basic profile of the DataFrame is shown. This overview presents basic information about the DataFrame: how many variables it has, how many values are missing and in which columns, and the type of each variable, together with descriptive statistics and a frequency plot for every variable.
# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()
Dataset info
Number of variables | 8 |
---|---|
Number of observations | 19 |
Total Missing (%) | 0.0% |
Total size in memory | 0.0 B |
Average record size in memory | 0.0 B |
Variables types
Numeric | 3 |
---|---|
Categorical | 2 |
Date | 0 |
Text (Unique) | 3 |
Rejected | 0 |
Warnings
billingId
Numeric
Distinct count | 19 |
---|---|
Unique (%) | 100.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 556 |
---|---|
Minimum | 111 |
Maximum | 992 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 111 |
---|---|
5-th percentile | 115.5 |
Q1 | 373 |
Median | 553 |
Q3 | 773.5 |
95-th percentile | 920 |
Maximum | 992 |
Range | 881 |
Interquartile range | 400.5 |
Descriptive statistics
Standard deviation | 280.2 |
---|---|
Coef of variation | 0.50395 |
Kurtosis | -1.0412 |
Mean | 556 |
MAD | 225.05 |
Skewness | -0.2137 |
Sum | 10564 |
Variance | 78511 |
Memory size | 0.0 B |
birth
Categorical, Unique
First 3 values |
---|
1990/07/11 |
1899/01/01 |
1990/07/09 |
Last 3 values |
---|
1954/07/10 |
1994/01/04 |
1920/04/22 |
First 20 values
1 | 1990/07/11 |
---|---|
2 | 1899/01/01 |
3 | 1990/07/09 |
4 | 1999/02/15 |
5 | 1921/05/03 |
6 | 1956/11/30 |
7 | 1950/07/08 |
8 | 1930/08/12 |
9 | 1958/03/26 |
10 | 1923/03/12 |
11 | 1950/07/14 |
12 | 2000/03/22 |
13 | 1980/07/07 |
14 | 1970/07/13 |
15 | 1993/12/08 |
16 | 1997/06/27 |
17 | 1954/07/10 |
18 | 1994/01/04 |
19 | 1920/04/22 |
dummyCol
Categorical
Distinct count | 13 |
---|---|
Unique (%) | 68.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%)
---|---|---
you | 3 | 15.8%
gonna | 3 | 15.8%
never | 2 | 10.5%
# | 2 | 10.5%
never | 1 | 5.3%
down | 1 | 5.3%
run | 1 | 5.3%
desert | 1 | 5.3%
up | 1 | 5.3%
and | 1 | 5.3%
around | 1 | 5.3%
let | 1 | 5.3%
give | 1 | 5.3%
firstName
Categorical, Unique
First 3 values |
---|
Luis |
Arthur |
JAMES |
Last 3 values |
---|
NiELS |
JaMES |
Fred |
First 20 values
1 | Luis |
---|---|
2 | Arthur |
3 | JAMES |
4 | William |
5 | Marie |
6 | ((( Heinrich ))))) |
7 | PAUL |
8 | CaRL |
9 | Isaac |
10 | Johannes |
11 | Galileo |
12 | André |
13 | Emmy%% |
14 | Albert |
15 | David |
16 | Max!!! |
17 | NiELS |
18 | JaMES |
19 | Fred |
id
Numeric
Distinct count | 19 |
---|---|
Unique (%) | 100.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 10 |
---|---|
Minimum | 1 |
Maximum | 19 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 1.9 |
Q1 | 5.5 |
Median | 10 |
Q3 | 14.5 |
95-th percentile | 18.1 |
Maximum | 19 |
Range | 18 |
Interquartile range | 9 |
Descriptive statistics
Standard deviation | 5.6273 |
---|---|
Coef of variation | 0.56273 |
Kurtosis | -1.2067 |
Mean | 10 |
MAD | 4.7368 |
Skewness | 0 |
Sum | 190 |
Variance | 31.667 |
Memory size | 0.0 B |
lastName
Categorical, Unique
First 3 values |
---|
Alvarez$$%! |
M$$ax%%well |
Nöether$ |
Last 3 values |
---|
Hoy&&&le |
Ampère |
Ga%%%uss |
First 20 values
1 | Alvarez$$%! |
---|---|
2 | M$$ax%%well |
3 | Nöether$ |
4 | Planck!!! |
5 | COM%%%pton |
6 | KEPLER |
7 | Einstein |
8 | H$$$ilbert |
9 | Hertz |
10 | Chadwick |
11 | CURIE |
12 | Gilbert### |
13 | Newton |
14 | GALiLEI |
15 | Böhr//((%% |
16 | dirac$ |
17 | Hoy&&&le |
18 | Ampère |
19 | Ga%%%uss |
price
Numeric
Distinct count | 8 |
---|---|
Unique (%) | 42.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 6.0526 |
---|---|
Minimum | 1 |
Maximum | 10 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 1.9 |
Q1 | 3 |
Median | 8 |
Q3 | 8 |
95-th percentile | 10 |
Maximum | 10 |
Range | 9 |
Interquartile range | 5 |
Descriptive statistics
Standard deviation | 2.9528 |
---|---|
Coef of variation | 0.48786 |
Kurtosis | -1.4482 |
Mean | 6.0526 |
MAD | 2.6814 |
Skewness | -0.22564 |
Sum | 115 |
Variance | 8.7193 |
Memory size | 0.0 B |
product
Categorical
Distinct count | 13 |
---|---|
Unique (%) | 68.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%)
---|---|---
pizza | 4 | 21.1%
taco | 3 | 15.8%
pasta | 2 | 10.5%
taaaccoo | 1 | 5.3%
piza | 1 | 5.3%
hamburguer | 1 | 5.3%
BEER | 1 | 5.3%
pizzza | 1 | 5.3%
arepa | 1 | 5.3%
Rice | 1 | 5.3%
110790 | 1 | 5.3%
Cake | 1 | 5.3%
null | 1 | 5.3%
If you want more information for data exploration, Optimus has the DataFrameAnalyzer, which provides several functions for analyzing your dataset.
# Instance of analyzer class
analyzer = op.DataFrameAnalyzer(df=df)
DataFrameAnalyzer has a method called column_analyze. This method checks every row of the DataFrame and tries to parse each element to determine whether it is a string or a number. It can also show 20 distinct values of each column classified according to the inferred datatype; for example, a number can be stored as a string, so this function recognizes numbers inside a string column by trying to parse each value.
The function can also plot numerical or categorical histograms.
Initially it is a good idea to see a general view of the DataFrame to be analyzed.
In the following cell, the basic results of analyzing the DataFrame are shown. They include a table that specifies the datatypes found in each column of the DataFrame, along with other features. For this particular case, the datatype table is shown in order to visualize a sample of the column content.
analyzer_tables = analyzer.column_analyze(column_list="*", print_type=True, plots=False)
Column name: id | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 1 Max value: 19
<optimus.df_analyzer.DataTypeTable at 0x112a7aeb8>
end of __analyze 6.364997863769531
Column name: firstName | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112a3da90>
end of __analyze 1.980332851409912
Column name: lastName | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112a1b438>
end of __analyze 1.8731741905212402
Column name: billingId | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 111 Max value: 992
<optimus.df_analyzer.DataTypeTable at 0x112723400>
end of __analyze 3.079150915145874
Column name: product | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 18 | 94.74 % |
Integer | 1 | 5.26 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112ac4828>
end of __analyze 1.797947883605957
Column name: price | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 1 Max value: 10
<optimus.df_analyzer.DataTypeTable at 0x112a50198>
end of __analyze 2.50827693939209
Column name: birth | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112a1ba58>
end of __analyze 1.4246118068695068
Column name: dummyCol | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112a370f0>
end of __analyze 1.3449747562408447 Total execution time: 20.480989933013916
The results obtained by running the analyzer detail the presence of special characters, string columns that may contain numbers, and None or empty-string values in the columns.
You can also plot histograms for individual columns using the plot_hist
function. This is an interesting feature because you are plotting directly from a Spark DataFrame:
analyzer.plot_hist("price","numerical")
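Since the analyzer reports product as a string column, its value distribution can be plotted the same way. This is a sketch that assumes the same plot_hist(column, type) call shown above and that 'categorical' is the accepted type label for non-numeric columns:
# Sketch: categorical histogram for the product column
# ('categorical' as the type argument is an assumption based on the description above).
analyzer.plot_hist("product", "categorical")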
DataFrameTransformer is a specialized class for DataFrame transformations. Transformations are optimized as much as possible by internally using native Spark transformation functions.
# Instance of transformer class
transformer = op.DataFrameTransformer(df)
transformer.show()
| id|firstName| lastName|billingId| product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis| Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|
| 6| Galileo| GALiLEI| 672| arepa| 5|1930/08/12| never|
| 7| CaRL| Ga%%%uss| 323| taco| 3|1970/07/13| gonna|
| 8| David| H$$$ilbert| 624|taaaccoo| 3|1950/07/14| let|
| 9| Johannes| KEPLER| 735| taco| 3|1920/04/22| you|
| 10| JaMES| M$$ax%%well| 875| taco| 3|1923/03/12| down|

only showing top 10 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
# Trimming string blank spaces:
transformer.trim_col("*")
# Printing trimmed dataFrame:
print('Trimmed dataFrame:')
transformer.show(5)
Original dataFrame:

| id|firstName| lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis|Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|

only showing top 5 rows

Trimmed dataFrame:

| id|firstName| lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis|Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|

only showing top 5 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
# Remove special chars:
transformer.remove_special_chars("*").clear_accents("*")
# This can also be done by passing a Regex if you want something more personalized
#####################################################################
#transformer.remove_special_chars_regex("*",'[^\w\s]').clear_accents("*")
#####################################################################
# Printing dataFrame without special chars and accents:
print('Removing special chars and accents dataFrame:')
transformer.show(5)
Original dataFrame:

| id|firstName| lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis|Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|

only showing top 5 rows

Removing special chars and accents dataFrame:

| id|firstName|lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis| Alvarez| 123| Cake| 10|19800707| never|
| 2| Andre| Ampere| 423| piza| 8|19500708| gonna|
| 3| NiELS| Bohr| 551| pizza| 8|19900709| give|
| 4| PAUL| dirac| 521| pizza| 8|19540710| you|
| 5| Albert|Einstein| 634| pizza| 8|19900711| up|

only showing top 5 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
# Droping a column:
transformer.drop_col("dummyCol")
# Printing dataFrame without the dummy column:
print('Dataframe without dummy column:')
transformer.show(5)
Original dataFrame:

| id|firstName|lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis| Alvarez| 123| Cake| 10|19800707| never|
| 2| Andre| Ampere| 423| piza| 8|19500708| gonna|
| 3| NiELS| Bohr| 551| pizza| 8|19900709| give|
| 4| PAUL| dirac| 521| pizza| 8|19540710| you|
| 5| Albert|Einstein| 634| pizza| 8|19900711| up|

only showing top 5 rows

Dataframe without dummy column:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| Luis| Alvarez| 123| Cake| 10|19800707|
| 2| Andre| Ampere| 423| piza| 8|19500708|
| 3| NiELS| Bohr| 551| pizza| 8|19900709|
| 4| PAUL| dirac| 521| pizza| 8|19540710|
| 5| Albert|Einstein| 634| pizza| 8|19900711|

only showing top 5 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
print('Setting all letters to lowerCase:')
transformer.lower_case("*")
transformer.show(5)
Original dataFrame:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| Luis| Alvarez| 123| Cake| 10|19800707|
| 2| Andre| Ampere| 423| piza| 8|19500708|
| 3| NiELS| Bohr| 551| pizza| 8|19900709|
| 4| PAUL| dirac| 521| pizza| 8|19540710|
| 5| Albert|Einstein| 634| pizza| 8|19900711|

only showing top 5 rows

Setting all letters to lowerCase:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|19800707|
| 2| andre| ampere| 423| piza| 8|19500708|
| 3| niels| bohr| 551| pizza| 8|19900709|
| 4| paul| dirac| 521| pizza| 8|19540710|
| 5| albert|einstein| 634| pizza| 8|19900711|

only showing top 5 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
# Printing the new date format:
print('Dataframe with the new date format:')
transformer.date_transform("birth", "yyyyMMdd", "dd-MM-YYYY") \
.show(5)
Original dataFrame:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|19800707|
| 2| andre| ampere| 423| piza| 8|19500708|
| 3| niels| bohr| 551| pizza| 8|19900709|
| 4| paul| dirac| 521| pizza| 8|19540710|
| 5| albert|einstein| 634| pizza| 8|19900711|

only showing top 5 rows

Dataframe with the new date format:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|07-07-1980|
| 2| andre| ampere| 423| piza| 8|08-07-1950|
| 3| niels| bohr| 551| pizza| 8|09-07-1990|
| 4| paul| dirac| 521| pizza| 8|10-07-1954|
| 5| albert|einstein| 634| pizza| 8|11-07-1990|

only showing top 5 rows
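The two format arguments passed to date_transform are date pattern strings in the style used by Spark's date functions (for example 'yyyyMMdd' or 'dd-MM-YYYY'). As a hypothetical variation of the cell above, ISO-style dates could have been produced instead; it is left commented out because birth has already been converted at this point:
# Hypothetical alternative to the cell above (not run in this notebook):
# transformer.date_transform("birth", "yyyyMMdd", "yyyy-MM-dd").show(5)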
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
print("Printing calculation of client age from birth date")
transformer.age_calculate("birth", "dd-MM-YYYY", "clientAge") \
.show(5)
Original dataFrame:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|07-07-1980|
| 2| andre| ampere| 423| piza| 8|08-07-1950|
| 3| niels| bohr| 551| pizza| 8|09-07-1990|
| 4| paul| dirac| 521| pizza| 8|10-07-1954|
| 5| albert|einstein| 634| pizza| 8|11-07-1990|

only showing top 5 rows

Printing calculation of client age from birth date

| id|firstName|lastName|billingId|product|price| birth|clientAge|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|07-07-1980| 37.7823|
| 2| andre| ampere| 423| piza| 8|08-07-1950| 67.7769|
| 3| niels| bohr| 551| pizza| 8|09-07-1990| 27.7796|
| 4| paul| dirac| 521| pizza| 8|10-07-1954| 63.7903|
| 5| albert|einstein| 634| pizza| 8|11-07-1990| 27.7796|

only showing top 5 rows
# Printing original dataframe:
print ("Original dataframe")
transformer.show(5)
print ("Renaming some columns of dataFrame")
transformer.rename_col(columns=[("clientAge", "age")])
transformer.show(5)
Original dataframe

| id|firstName|lastName|billingId|product|price| birth|clientAge|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|07-07-1980| 37.7823|
| 2| andre| ampere| 423| piza| 8|08-07-1950| 67.7769|
| 3| niels| bohr| 551| pizza| 8|09-07-1990| 27.7796|
| 4| paul| dirac| 521| pizza| 8|10-07-1954| 63.7903|
| 5| albert|einstein| 634| pizza| 8|11-07-1990| 27.7796|

only showing top 5 rows

Renaming some columns of dataFrame

| age| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|---|
|37.7823| 1| luis| alvarez| 123| cake| 10|07-07-1980|
|67.7769| 2| andre| ampere| 423| piza| 8|08-07-1950|
|27.7796| 3| niels| bohr| 551| pizza| 8|09-07-1990|
|63.7903| 4| paul| dirac| 521| pizza| 8|10-07-1954|
|27.7796| 5| albert|einstein| 634| pizza| 8|11-07-1990|

only showing top 5 rows
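Note that rename_col takes a list of (current_name, new_name) tuples, so several columns can be renamed in a single call. The snippet below is an illustrative sketch only, with hypothetical new names, and is not part of the executed flow:
# Illustrative sketch (not run here): renaming several columns at once
# with the same list-of-tuples argument used above.
# transformer.rename_col(columns=[("firstName", "first_name"), ("lastName", "last_name")])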
# Printing original dataframe:
print ("Original dataframe")
transformer.show(5)
# This action moves the age column just after the lastName column
print ("age column moved")
transformer.move_col("age", "lastName", "after")
transformer.show(5)
Original dataframe

| age| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|---|
|37.7823| 1| luis| alvarez| 123| cake| 10|07-07-1980|
|67.7769| 2| andre| ampere| 423| piza| 8|08-07-1950|
|27.7796| 3| niels| bohr| 551| pizza| 8|09-07-1990|
|63.7903| 4| paul| dirac| 521| pizza| 8|10-07-1954|
|27.7796| 5| albert|einstein| 634| pizza| 8|11-07-1990|

only showing top 5 rows

age column moved

| id|firstName|lastName| age|billingId|product|price| birth|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez|37.7823| 123| cake| 10|07-07-1980|
| 2| andre| ampere|67.7769| 423| piza| 8|08-07-1950|
| 3| niels| bohr|27.7796| 551| pizza| 8|09-07-1990|
| 4| paul| dirac|63.7903| 521| pizza| 8|10-07-1954|
| 5| albert|einstein|27.7796| 634| pizza| 8|11-07-1990|

only showing top 5 rows
The core of this function is the user-defined function provided through the lambda passed in the 'func' argument.
In this example, cells with values less than 20 are multiplied by 20; the rest stay intact.
# Printing original dataframe:
print ("Original dataframe")
transformer.show(5)
print('Multiplying by 20 a number if the value in the cell is less than 20:')
# Multiplying values below 20 by 20:
func = lambda cell: (cell * 20) if ((cell != None) and (cell < 20)) else cell
transformer.set_col(['price'], func, 'integer')
transformer.show(20)
Original dataframe

| id|firstName|lastName| age|billingId|product|price| birth|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez|37.7823| 123| cake| 10|07-07-1980|
| 2| andre| ampere|67.7769| 423| piza| 8|08-07-1950|
| 3| niels| bohr|27.7796| 551| pizza| 8|09-07-1990|
| 4| paul| dirac|63.7903| 521| pizza| 8|10-07-1954|
| 5| albert|einstein|27.7796| 634| pizza| 8|11-07-1990|

only showing top 5 rows

Multiplying by 20 a number if the value in the cell is less than 20:

| id| firstName|lastName| age|billingId| product|price| birth|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 37.7823| 123| cake| 200|07-07-1980|
| 2| andre| ampere| 67.7769| 423| piza| 160|08-07-1950|
| 3| niels| bohr| 27.7796| 551| pizza| 160|09-07-1990|
| 4| paul| dirac| 63.7903| 521| pizza| 160|10-07-1954|
| 5| albert|einstein| 27.7796| 634| pizza| 160|11-07-1990|
| 6| galileo| galilei| 87.7849| 672| arepa| 100|12-08-1930|
| 7| carl| gauss| 47.7876| 323| taco| 60|13-07-1970|
| 8| david| hilbert| 67.7769| 624| taaaccoo| 60|14-07-1950|
| 9| johannes| kepler| 97.7876| 735| taco| 60|22-04-1920|
| 10| james| maxwell| 94.7796| 875| taco| 60|12-03-1923|
| 11| isaac| newton| 18.7903| 992| pasta| 180|15-02-1999|
| 12| emmy| noether| 24.7903| 234| pasta| 180|08-12-1993|
| 13| max| planck| 23.7930| 111|hamburguer| 80|04-01-1994|
| 14| fred| hoyle| 20.7849| 553| pizzza| 160|27-06-1997|
| 15| heinrich | hertz| 61.7769| 116| pizza| 160|30-11-1956|
| 16| william| gilbert| 59.7849| 886| beer| 40|26-03-1958|
| 17| marie| curie| 17.7930| 912| rice| 20|22-03-2000|
| 18| arthur| compton|118.7769| 812| 110790| 100|01-01-1899|
| 19| james|chadwick| 96.7930| 467| null| 200|03-05-1921|
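set_col is not limited to numeric columns; the same (columns, func, datatype) pattern can be applied to string columns, assuming 'string' is accepted as the datatype argument. A hypothetical sketch, not executed in this notebook:
# Hypothetical sketch (not run here): upper-casing the product column with set_col.
# func_str = lambda cell: cell.upper() if cell is not None else cell
# transformer.set_col(['product'], func_str, 'string')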
After the transformation process detailed in the previous cells, it is a good idea to analyze the DataFrame again and check whether the transformations have solved the issues related to special characters, numbers present in columns that are supposed to contain only letters, and so on.
# Setting the new dataFrame transformed into the analyzer class
analyzer.set_data_frame(transformer.get_data_frame)
analyzer_table = analyzer.column_analyze("*", print_type=True, plots=True)
General description

Features | Name or Quantity
---|---
File Name | file with no path
Columns | 8
Rows | 19
Column name: id | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 1 Max value: 19
<optimus.df_analyzer.DataTypeTable at 0x112a40400>
end of __analyze 4.61762809753418
Column name: firstName | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112abada0>
end of __analyze 2.2484610080718994
Column name: lastName | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112f60da0>
end of __analyze 2.093127727508545
Column name: age | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 0 | 0.00 % |
Float | 19 | 100.00 % |
<optimus.df_analyzer.DataTypeTable at 0x11305f198>
end of __analyze 5.873319864273071
Column name: billingId | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 111 Max value: 992
<optimus.df_analyzer.DataTypeTable at 0x1137320f0>
end of __analyze 4.413137912750244
Column name: product | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 18 | 94.74 % |
Integer | 1 | 5.26 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112fc9ac8>
end of __analyze 1.9454371929168701
Column name: price | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 20 Max value: 200
<optimus.df_analyzer.DataTypeTable at 0x112b03a90>
end of __analyze 4.271012783050537
Column name: birth | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x11308bf28>
end of __analyze 2.1021029949188232 Total execution time: 27.68661904335022
It can be seen from the output of the analyzer that there are columns containing numbers even when the column (product, for example) is supposed to hold only words or letters.
In order to solve this problem, the operation_in_type function of the DataFrameTransformer class can be used.
operation_in_type is useful for operating on the elements of a column that match a certain datatype. In this particular example, the last output cell shows that the 'product' column contains values that do not fit the rest of the data: they are not product names but numbers or empty strings. operation_in_type can take care of them and clean the column.
In the following example, operation_in_type is run in order to convert every string that can be parsed as an integer into a null (None) value. Notice how the 110790 value in the product column has been changed, while the rest of the column has remained intact.
# This function makes changes in the specified column, but only in the cells
# that are recognized as the specified dataType.
transformer.operation_in_type([('product', 'integer', None)]).show()
| id|firstName|lastName| age|billingId|price| birth| product|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez|37.7823| 123| 200|07-07-1980| cake|
| 2| andre| ampere|67.7769| 423| 160|08-07-1950| piza|
| 3| niels| bohr|27.7796| 551| 160|09-07-1990| pizza|
| 4| paul| dirac|63.7903| 521| 160|10-07-1954| pizza|
| 5| albert|einstein|27.7796| 634| 160|11-07-1990| pizza|
| 6| galileo| galilei|87.7849| 672| 100|12-08-1930| arepa|
| 7| carl| gauss|47.7876| 323| 60|13-07-1970| taco|
| 8| david| hilbert|67.7769| 624| 60|14-07-1950|taaaccoo|
| 9| johannes| kepler|97.7876| 735| 60|22-04-1920| taco|
| 10| james| maxwell|94.7796| 875| 60|12-03-1923| taco|

only showing top 10 rows
Sometimes there are values that look different but actually refer to the same thing. In the product column, for example, there are values such as 'taaaccoo' and 'piza'. It can be inferred that the correct values are 'taco' and 'pizza'. This problem can be solved with the lookup function of the DataFrameTransformer class.
transformer.lookup('product', str_to_replace='taco', list_str=['taaaccoo'])
transformer.lookup('product', str_to_replace='pizza', list_str=['piza', 'pizzza'])
transformer.show(20)
| id| firstName|lastName| age|billingId|price| birth| product|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 37.7823| 123| 200|07-07-1980| cake|
| 2| andre| ampere| 67.7769| 423| 160|08-07-1950| pizza|
| 3| niels| bohr| 27.7796| 551| 160|09-07-1990| pizza|
| 4| paul| dirac| 63.7903| 521| 160|10-07-1954| pizza|
| 5| albert|einstein| 27.7796| 634| 160|11-07-1990| pizza|
| 6| galileo| galilei| 87.7849| 672| 100|12-08-1930| arepa|
| 7| carl| gauss| 47.7876| 323| 60|13-07-1970| taco|
| 8| david| hilbert| 67.7769| 624| 60|14-07-1950| taco|
| 9| johannes| kepler| 97.7876| 735| 60|22-04-1920| taco|
| 10| james| maxwell| 94.7796| 875| 60|12-03-1923| taco|
| 11| isaac| newton| 18.7903| 992| 180|15-02-1999| pasta|
| 12| emmy| noether| 24.7903| 234| 180|08-12-1993| pasta|
| 13| max| planck| 23.7930| 111| 80|04-01-1994|hamburguer|
| 14| fred| hoyle| 20.7849| 553| 160|27-06-1997| pizza|
| 15| heinrich | hertz| 61.7769| 116| 160|30-11-1956| pizza|
| 16| william| gilbert| 59.7849| 886| 40|26-03-1958| beer|
| 17| marie| curie| 17.7930| 912| 20|22-03-2000| rice|
| 18| arthur| compton|118.7769| 812| 100|01-01-1899| null|
| 19| james|chadwick| 96.7930| 467| 200|03-05-1921| null|
As can be seen above, the strings specified in the 'list_str' argument have been replaced with the 'str_to_replace' value.
The previous transformations were done step by step, but they can also be chained into a single expression, as in the cell below. Thanks to the chaining feature of the DataFrameTransformer class, this is a more efficient and scalable way to apply several transformations, because it takes full advantage of Spark's lazy evaluation.
# Instantiate DataFrameTransformer
transformer = op.DataFrameTransformer(df)
# Get original dataFrame to show it.
transformer.show(20)
# Chaining function transformations
transformer.trim_col("*") \
.remove_special_chars("*") \
.clear_accents("*") \
.lower_case("*") \
.drop_col("dummyCol") \
.date_transform("birth", "yyyyMMdd", "dd-MM-YYYY") \
.age_calculate("birth", "dd-MM-YYYY", "clientAge") \
.operation_in_type([('product', 'integer', None)]) \
.lookup('product', str_to_replace='taco', list_str=['taaaccoo']) \
.lookup('product', str_to_replace='pizza', list_str=['piza', 'pizzza'])
transformer.show(20)
| id| firstName| lastName|billingId| product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis| Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|
| 6| Galileo| GALiLEI| 672| arepa| 5|1930/08/12| never|
| 7| CaRL| Ga%%%uss| 323| taco| 3|1970/07/13| gonna|
| 8| David| H$$$ilbert| 624| taaaccoo| 3|1950/07/14| let|
| 9| Johannes| KEPLER| 735| taco| 3|1920/04/22| you|
| 10| JaMES| M$$ax%%well| 875| taco| 3|1923/03/12| down|
| 11| Isaac| Newton| 992| pasta| 9|1999/02/15| never |
| 12| Emmy%%| Nöether$| 234| pasta| 9|1993/12/08| gonna|
| 13| Max!!!| Planck!!!| 111|hamburguer| 4|1994/01/04| run |
| 14| Fred| Hoy&&&le| 553| pizzza| 8|1997/06/27| around|
| 15|((( Heinrich )))))| Hertz| 116| pizza| 8|1956/11/30| and|
| 16| William| Gilbert###| 886| BEER| 2|1958/03/26| desert|
| 17| Marie| CURIE| 912| Rice| 1|2000/03/22| you|
| 18| Arthur| COM%%%pton| 812| 110790| 5|1899/01/01| #|
| 19| JAMES| Chadwick| 467| null| 10|1921/05/03| #|

| id| firstName|lastName|billingId|price| birth|clientAge| product|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| 10|07-07-1980| 37.7823| cake|
| 2| andre| ampere| 423| 8|08-07-1950| 67.7769| pizza|
| 3| niels| bohr| 551| 8|09-07-1990| 27.7796| pizza|
| 4| paul| dirac| 521| 8|10-07-1954| 63.7903| pizza|
| 5| albert|einstein| 634| 8|11-07-1990| 27.7796| pizza|
| 6| galileo| galilei| 672| 5|12-08-1930| 87.7849| arepa|
| 7| carl| gauss| 323| 3|13-07-1970| 47.7876| taco|
| 8| david| hilbert| 624| 3|14-07-1950| 67.7769| taco|
| 9| johannes| kepler| 735| 3|22-04-1920| 97.7876| taco|
| 10| james| maxwell| 875| 3|12-03-1923| 94.7796| taco|
| 11| isaac| newton| 992| 9|15-02-1999| 18.7903| pasta|
| 12| emmy| noether| 234| 9|08-12-1993| 24.7903| pasta|
| 13| max| planck| 111| 4|04-01-1994| 23.7930|hamburguer|
| 14| fred| hoyle| 553| 8|27-06-1997| 20.7849| pizza|
| 15| heinrich | hertz| 116| 8|30-11-1956| 61.7769| pizza|
| 16| william| gilbert| 886| 2|26-03-1958| 59.7849| beer|
| 17| marie| curie| 912| 1|22-03-2000| 17.7930| rice|
| 18| arthur| compton| 812| 5|01-01-1899| 118.7769| null|
| 19| james|chadwick| 467| 10|03-05-1921| 96.7930| null|
General description

Features | Name or Quantity
---|---
File Name | file with no path
Columns | 8
Rows | 19
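Finally, once the cleaning is done you will usually want the underlying Spark DataFrame back for further processing. As shown earlier when feeding the analyzer, the transformer exposes it through get_data_frame:
# Retrieving the transformed Spark DataFrame from the transformer
# (get_data_frame is the same attribute passed to analyzer.set_data_frame above).
df_final = transformer.get_data_frame
df_final.printSchema()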