This notebook is a short tutorial about the DataFrameTransformer, DataFrameAnalyzer, DataFrameProfiler and Utilities modules.
DataFrameTransformer is a dedicated module for easily applying transformations to a DataFrame.
DataFrameProfiler is a dedicated module for running a basic profile of a DataFrame.
DataFrameAnalyzer is a dedicated module for plotting and inspecting the important features of a Spark DataFrame.
# Import optimus
import optimus as op
The Utilities class is a helper class with functions to read CSV files and to set up checkpointing (to persist or temporarily save DataFrames).
# Instance of Utilities class
tools = op.Utilities()
# Reading the dataframe; in this case the local file
# system (the hard drive of the PC) is used.
df = tools.read_csv(path="foo.csv", sep=',')
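Besides reading CSV files, the same Utilities instance can be pointed at a checkpoint folder before running heavy transformation pipelines. The call below is only a sketch: the method name and arguments are assumptions based on the Optimus API and may differ in your version, so it is left commented out.
# Hypothetical sketch (method name and arguments assumed, verify against your Optimus version):
# point Optimus at a local folder where intermediate DataFrames can be checkpointed.
# tools.set_check_point_folder("checkpoints/", "local")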
Initially it is a good idea to see a general view of the DataFrame to be analyzed.
In the following cell, a basic profile of the DataFrame is shown. This overview presents basic information about the DataFrame: how many variables it has, how many values are missing and in which columns, and the type of each variable, together with descriptive statistics and a frequency plot for every variable.
# Instance of profiler class
profiler = op.DataFrameProfiler(df)
profiler.profiler()
Dataset info
Number of variables | 8 |
---|---|
Number of observations | 19 |
Total Missing (%) | 0.0% |
Total size in memory | 0.0 B |
Average record size in memory | 0.0 B |
Variables types
Numeric | 3 |
---|---|
Categorical | 2 |
Date | 0 |
Text (Unique) | 3 |
Rejected | 0 |
Warnings
billingId
Numeric
Distinct count | 19 |
---|---|
Unique (%) | 100.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 556 |
---|---|
Minimum | 111 |
Maximum | 992 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 111 |
---|---|
5-th percentile | 115.5 |
Q1 | 373 |
Median | 553 |
Q3 | 773.5 |
95-th percentile | 920 |
Maximum | 992 |
Range | 881 |
Interquartile range | 400.5 |
Descriptive statistics
Standard deviation | 280.2 |
---|---|
Coef of variation | 0.50395 |
Kurtosis | -1.0412 |
Mean | 556 |
MAD | 225.05 |
Skewness | -0.2137 |
Sum | 10564 |
Variance | 78511 |
Memory size | 0.0 B |
birth
Categorical, Unique
First 3 values |
---|
1990/07/11 |
1899/01/01 |
1990/07/09 |
Last 3 values |
---|
1954/07/10 |
1994/01/04 |
1920/04/22 |
First 20 values
1 | 1990/07/11 |
---|---|
2 | 1899/01/01 |
3 | 1990/07/09 |
4 | 1999/02/15 |
5 | 1921/05/03 |
6 | 1956/11/30 |
7 | 1950/07/08 |
8 | 1930/08/12 |
9 | 1958/03/26 |
10 | 1923/03/12 |
11 | 1950/07/14 |
12 | 2000/03/22 |
13 | 1980/07/07 |
14 | 1970/07/13 |
15 | 1993/12/08 |
16 | 1997/06/27 |
17 | 1954/07/10 |
18 | 1994/01/04 |
19 | 1920/04/22 |
dummyCol
Categorical
Distinct count | 13 |
---|---|
Unique (%) | 68.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%)
---|---|---
you | 3 | 15.8%
gonna | 3 | 15.8%
never | 2 | 10.5%
# | 2 | 10.5%
never | 1 | 5.3%
down | 1 | 5.3%
run | 1 | 5.3%
desert | 1 | 5.3%
up | 1 | 5.3%
and | 1 | 5.3%
around | 1 | 5.3%
let | 1 | 5.3%
give | 1 | 5.3%
firstName
Categorical, Unique
First 3 values |
---|
Luis |
Arthur |
JAMES |
Last 3 values |
---|
NiELS |
JaMES |
Fred |
First 20 values
1 | Luis |
---|---|
2 | Arthur |
3 | JAMES |
4 | William |
5 | Marie |
6 | ((( Heinrich ))))) |
7 | PAUL |
8 | CaRL |
9 | Isaac |
10 | Johannes |
11 | Galileo |
12 | André |
13 | Emmy%% |
14 | Albert |
15 | David |
16 | Max!!! |
17 | NiELS |
18 | JaMES |
19 | Fred |
id
Numeric
Distinct count | 19 |
---|---|
Unique (%) | 100.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 10 |
---|---|
Minimum | 1 |
Maximum | 19 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 1.9 |
Q1 | 5.5 |
Median | 10 |
Q3 | 14.5 |
95-th percentile | 18.1 |
Maximum | 19 |
Range | 18 |
Interquartile range | 9 |
Descriptive statistics
Standard deviation | 5.6273 |
---|---|
Coef of variation | 0.56273 |
Kurtosis | -1.2067 |
Mean | 10 |
MAD | 4.7368 |
Skewness | 0 |
Sum | 190 |
Variance | 31.667 |
Memory size | 0.0 B |
lastName
Categorical, Unique
First 3 values |
---|
Alvarez$$%! |
M$$ax%%well |
Nöether$ |
Last 3 values |
---|
Hoy&&&le |
Ampère |
Ga%%%uss |
First 20 values
1 | Alvarez$$%! |
---|---|
2 | M$$ax%%well |
3 | Nöether$ |
4 | Planck!!! |
5 | COM%%%pton |
6 | KEPLER |
7 | Einstein |
8 | H$$$ilbert |
9 | Hertz |
10 | Chadwick |
11 | CURIE |
12 | Gilbert### |
13 | Newton |
14 | GALiLEI |
15 | Böhr//((%% |
16 | dirac$ |
17 | Hoy&&&le |
18 | Ampère |
19 | Ga%%%uss |
price
Numeric
Distinct count | 8 |
---|---|
Unique (%) | 42.1% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 6.0526 |
---|---|
Minimum | 1 |
Maximum | 10 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 1.9 |
Q1 | 3 |
Median | 8 |
Q3 | 8 |
95-th percentile | 10 |
Maximum | 10 |
Range | 9 |
Interquartile range | 5 |
Descriptive statistics
Standard deviation | 2.9528 |
---|---|
Coef of variation | 0.48786 |
Kurtosis | -1.4482 |
Mean | 6.0526 |
MAD | 2.6814 |
Skewness | -0.22564 |
Sum | 115 |
Variance | 8.7193 |
Memory size | 0.0 B |
product
Categorical
Distinct count | 13 |
---|---|
Unique (%) | 68.4% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%)
---|---|---
pizza | 4 | 21.1%
taco | 3 | 15.8%
pasta | 2 | 10.5%
taaaccoo | 1 | 5.3%
piza | 1 | 5.3%
hamburguer | 1 | 5.3%
BEER | 1 | 5.3%
pizzza | 1 | 5.3%
arepa | 1 | 5.3%
Rice | 1 | 5.3%
110790 | 1 | 5.3%
Cake | 1 | 5.3%
null | 1 | 5.3%
If you want more information for data exploration, Optimus has the DataFrameAnalyzer, which provides several functions for analyzing your dataset.
# Instance of analyzer class
analyzer = op.DataFrameAnalyzer(df=df)
DataFrameAnalyzer has a method called column_analyze. This method checks every row of the DataFrame and tries to parse each element to determine whether it is a string or a number. It can also show 20 distinct values of each column classified according to the inferred datatype; for example, a number can be stored as a string, so this function recognizes numbers inside a string column by trying to parse each value.
The function can also plot numerical or categorical histograms.
Initially it is a good idea to see a general view of the DataFrame to be analyzed.
In the following cell, the basic results of analyzing the DataFrame are shown. They include a table that specifies the datatypes found in each column of the DataFrame, along with other features. For this particular case, the datatype table is shown in order to visualize a sample of the column content.
analyzer_tables = analyzer.column_analyze(column_list="*", print_type=True, plots=False)
Column name: id | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 1 Max value: 19
<optimus.df_analyzer.DataTypeTable at 0x112a7aeb8>
end of __analyze 6.364997863769531
Column name: firstName | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112a3da90>
end of __analyze 1.980332851409912
Column name: lastName | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112a1b438>
end of __analyze 1.8731741905212402
Column name: billingId | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 111 Max value: 992
<optimus.df_analyzer.DataTypeTable at 0x112723400>
end of __analyze 3.079150915145874
Column name: product | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 18 | 94.74 % |
Integer | 1 | 5.26 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112ac4828>
end of __analyze 1.797947883605957
Column name: price | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 1 Max value: 10
<optimus.df_analyzer.DataTypeTable at 0x112a50198>
end of __analyze 2.50827693939209
Column name: birth | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112a1ba58>
end of __analyze 1.4246118068695068
Column name: dummyCol | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112a370f0>
end of __analyze 1.3449747562408447 Total execution time: 20.480989933013916
The results obtained by running the analyzer detail the presence of special characters, string columns that may contain numbers, and None or empty-string values in the columns.
You can also plot histograms for individual columns using the plot_hist
function. This is an interesting feature because you are plotting directly from a Spark DataFrame:
analyzer.plot_hist("price","numerical")
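Since the analyzer reports product as a string column, its value distribution can be plotted the same way. This is a sketch that assumes the same plot_hist(column, type) call shown above and that 'categorical' is the accepted type label for non-numeric columns:
# Sketch: categorical histogram for the product column
# ('categorical' as the type argument is an assumption based on the description above).
analyzer.plot_hist("product", "categorical")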
DataFrameTransformer is a specialized class for DataFrame transformations. Transformations are optimized as much as possible by internally using native Spark transformation functions.
# Instance of transformer class
transformer = op.DataFrameTransformer(df)
transformer.show()
| id|firstName| lastName|billingId| product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis| Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|
| 6| Galileo| GALiLEI| 672| arepa| 5|1930/08/12| never|
| 7| CaRL| Ga%%%uss| 323| taco| 3|1970/07/13| gonna|
| 8| David| H$$$ilbert| 624|taaaccoo| 3|1950/07/14| let|
| 9| Johannes| KEPLER| 735| taco| 3|1920/04/22| you|
| 10| JaMES| M$$ax%%well| 875| taco| 3|1923/03/12| down|

only showing top 10 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
# Trimming string blank spaces:
transformer.trim_col("*")
# Printing trimmed dataFrame:
print('Trimmed dataFrame:')
transformer.show(5)
Original dataFrame:

| id|firstName| lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis|Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|

only showing top 5 rows

Trimmed dataFrame:

| id|firstName| lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis|Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|

only showing top 5 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
# Remove special chars:
transformer.remove_special_chars("*").clear_accents("*")
# This can also be done by passing a Regex if you want something more personalized
#####################################################################
#transformer.remove_special_chars_regex("*",'[^\w\s]').clear_accents("*")
#####################################################################
# Printing dataFrame without special chars and accents:
print('Removing special chars and accents dataFrame:')
transformer.show(5)
Original dataFrame:

| id|firstName| lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis|Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|

only showing top 5 rows

Removing special chars and accents dataFrame:

| id|firstName|lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis| Alvarez| 123| Cake| 10|19800707| never|
| 2| Andre| Ampere| 423| piza| 8|19500708| gonna|
| 3| NiELS| Bohr| 551| pizza| 8|19900709| give|
| 4| PAUL| dirac| 521| pizza| 8|19540710| you|
| 5| Albert|Einstein| 634| pizza| 8|19900711| up|

only showing top 5 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
# Droping a column:
transformer.drop_col("dummyCol")
# Printing dataFrame without the dummy column:
print('Dataframe without dummy column:')
transformer.show(5)
Original dataFrame:

| id|firstName|lastName|billingId|product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis| Alvarez| 123| Cake| 10|19800707| never|
| 2| Andre| Ampere| 423| piza| 8|19500708| gonna|
| 3| NiELS| Bohr| 551| pizza| 8|19900709| give|
| 4| PAUL| dirac| 521| pizza| 8|19540710| you|
| 5| Albert|Einstein| 634| pizza| 8|19900711| up|

only showing top 5 rows

Dataframe without dummy column:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| Luis| Alvarez| 123| Cake| 10|19800707|
| 2| Andre| Ampere| 423| piza| 8|19500708|
| 3| NiELS| Bohr| 551| pizza| 8|19900709|
| 4| PAUL| dirac| 521| pizza| 8|19540710|
| 5| Albert|Einstein| 634| pizza| 8|19900711|

only showing top 5 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
print('Setting all letters to lowerCase:')
transformer.lower_case("*")
transformer.show(5)
Original dataFrame:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| Luis| Alvarez| 123| Cake| 10|19800707|
| 2| Andre| Ampere| 423| piza| 8|19500708|
| 3| NiELS| Bohr| 551| pizza| 8|19900709|
| 4| PAUL| dirac| 521| pizza| 8|19540710|
| 5| Albert|Einstein| 634| pizza| 8|19900711|

only showing top 5 rows

Setting all letters to lowerCase:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|19800707|
| 2| andre| ampere| 423| piza| 8|19500708|
| 3| niels| bohr| 551| pizza| 8|19900709|
| 4| paul| dirac| 521| pizza| 8|19540710|
| 5| albert|einstein| 634| pizza| 8|19900711|

only showing top 5 rows
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
# Printing the new date format:
print('Dataframe with the new date format:')
transformer.date_transform("birth", "yyyyMMdd", "dd-MM-YYYY") \
.show(5)
Original dataFrame:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|19800707|
| 2| andre| ampere| 423| piza| 8|19500708|
| 3| niels| bohr| 551| pizza| 8|19900709|
| 4| paul| dirac| 521| pizza| 8|19540710|
| 5| albert|einstein| 634| pizza| 8|19900711|

only showing top 5 rows

Dataframe with the new date format:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|07-07-1980|
| 2| andre| ampere| 423| piza| 8|08-07-1950|
| 3| niels| bohr| 551| pizza| 8|09-07-1990|
| 4| paul| dirac| 521| pizza| 8|10-07-1954|
| 5| albert|einstein| 634| pizza| 8|11-07-1990|

only showing top 5 rows
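The two format arguments passed to date_transform are date pattern strings in the style used by Spark's date functions (for example 'yyyyMMdd' or 'dd-MM-YYYY'). As a hypothetical variation of the cell above, ISO-style dates could have been produced instead; it is left commented out because birth has already been converted at this point:
# Hypothetical alternative to the cell above (not run in this notebook):
# transformer.date_transform("birth", "yyyyMMdd", "yyyy-MM-dd").show(5)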
# Printing of original dataFrame:
print('Original dataFrame:')
transformer.show(5)
print("Printing calculation of client age from birth date")
transformer.age_calculate("birth", "dd-MM-YYYY", "clientAge") \
.show(5)
Original dataFrame:

| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|07-07-1980|
| 2| andre| ampere| 423| piza| 8|08-07-1950|
| 3| niels| bohr| 551| pizza| 8|09-07-1990|
| 4| paul| dirac| 521| pizza| 8|10-07-1954|
| 5| albert|einstein| 634| pizza| 8|11-07-1990|

only showing top 5 rows

Printing calculation of client age from birth date

| id|firstName|lastName|billingId|product|price| birth|clientAge|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|07-07-1980| 37.7823|
| 2| andre| ampere| 423| piza| 8|08-07-1950| 67.7769|
| 3| niels| bohr| 551| pizza| 8|09-07-1990| 27.7796|
| 4| paul| dirac| 521| pizza| 8|10-07-1954| 63.7903|
| 5| albert|einstein| 634| pizza| 8|11-07-1990| 27.7796|

only showing top 5 rows
# Printing original dataframe:
print ("Original dataframe")
transformer.show(5)
print ("Renaming some columns of dataFrame")
transformer.rename_col(columns=[("clientAge", "age")])
transformer.show(5)
Original dataframe

| id|firstName|lastName|billingId|product|price| birth|clientAge|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| cake| 10|07-07-1980| 37.7823|
| 2| andre| ampere| 423| piza| 8|08-07-1950| 67.7769|
| 3| niels| bohr| 551| pizza| 8|09-07-1990| 27.7796|
| 4| paul| dirac| 521| pizza| 8|10-07-1954| 63.7903|
| 5| albert|einstein| 634| pizza| 8|11-07-1990| 27.7796|

only showing top 5 rows

Renaming some columns of dataFrame

| age| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|---|
|37.7823| 1| luis| alvarez| 123| cake| 10|07-07-1980|
|67.7769| 2| andre| ampere| 423| piza| 8|08-07-1950|
|27.7796| 3| niels| bohr| 551| pizza| 8|09-07-1990|
|63.7903| 4| paul| dirac| 521| pizza| 8|10-07-1954|
|27.7796| 5| albert|einstein| 634| pizza| 8|11-07-1990|

only showing top 5 rows
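Note that rename_col takes a list of (current_name, new_name) tuples, so several columns can be renamed in a single call. The snippet below is an illustrative sketch only, with hypothetical new names, and is not part of the executed flow:
# Illustrative sketch (not run here): renaming several columns at once
# with the same list-of-tuples argument used above.
# transformer.rename_col(columns=[("firstName", "first_name"), ("lastName", "last_name")])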
# Printing original dataframe:
print ("Original dataframe")
transformer.show(5)
# This action moves the age column just after the lastName column
print ("age column moved")
transformer.move_col("age", "lastName", "after")
transformer.show(5)
Original dataframe

| age| id|firstName|lastName|billingId|product|price| birth|
|---|---|---|---|---|---|---|---|
|37.7823| 1| luis| alvarez| 123| cake| 10|07-07-1980|
|67.7769| 2| andre| ampere| 423| piza| 8|08-07-1950|
|27.7796| 3| niels| bohr| 551| pizza| 8|09-07-1990|
|63.7903| 4| paul| dirac| 521| pizza| 8|10-07-1954|
|27.7796| 5| albert|einstein| 634| pizza| 8|11-07-1990|

only showing top 5 rows

age column moved

| id|firstName|lastName| age|billingId|product|price| birth|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez|37.7823| 123| cake| 10|07-07-1980|
| 2| andre| ampere|67.7769| 423| piza| 8|08-07-1950|
| 3| niels| bohr|27.7796| 551| pizza| 8|09-07-1990|
| 4| paul| dirac|63.7903| 521| pizza| 8|10-07-1954|
| 5| albert|einstein|27.7796| 634| pizza| 8|11-07-1990|

only showing top 5 rows
The core of this function is the user-defined function provided through the lambda passed in the 'func' argument.
In this example, cells with values less than 20 are multiplied by 20; the rest stay intact.
# Printing original dataframe:
print ("Original dataframe")
transformer.show(5)
print('Multiplying by 20 a number if the value in the cell is less than 20:')
# Multiplying values below 20 by 20:
func = lambda cell: (cell * 20) if ((cell != None) and (cell < 20)) else cell
transformer.set_col(['price'], func, 'integer')
transformer.show(20)
Original dataframe

| id|firstName|lastName| age|billingId|product|price| birth|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez|37.7823| 123| cake| 10|07-07-1980|
| 2| andre| ampere|67.7769| 423| piza| 8|08-07-1950|
| 3| niels| bohr|27.7796| 551| pizza| 8|09-07-1990|
| 4| paul| dirac|63.7903| 521| pizza| 8|10-07-1954|
| 5| albert|einstein|27.7796| 634| pizza| 8|11-07-1990|

only showing top 5 rows

Multiplying by 20 a number if the value in the cell is less than 20:

| id| firstName|lastName| age|billingId| product|price| birth|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 37.7823| 123| cake| 200|07-07-1980|
| 2| andre| ampere| 67.7769| 423| piza| 160|08-07-1950|
| 3| niels| bohr| 27.7796| 551| pizza| 160|09-07-1990|
| 4| paul| dirac| 63.7903| 521| pizza| 160|10-07-1954|
| 5| albert|einstein| 27.7796| 634| pizza| 160|11-07-1990|
| 6| galileo| galilei| 87.7849| 672| arepa| 100|12-08-1930|
| 7| carl| gauss| 47.7876| 323| taco| 60|13-07-1970|
| 8| david| hilbert| 67.7769| 624| taaaccoo| 60|14-07-1950|
| 9| johannes| kepler| 97.7876| 735| taco| 60|22-04-1920|
| 10| james| maxwell| 94.7796| 875| taco| 60|12-03-1923|
| 11| isaac| newton| 18.7903| 992| pasta| 180|15-02-1999|
| 12| emmy| noether| 24.7903| 234| pasta| 180|08-12-1993|
| 13| max| planck| 23.7930| 111|hamburguer| 80|04-01-1994|
| 14| fred| hoyle| 20.7849| 553| pizzza| 160|27-06-1997|
| 15| heinrich | hertz| 61.7769| 116| pizza| 160|30-11-1956|
| 16| william| gilbert| 59.7849| 886| beer| 40|26-03-1958|
| 17| marie| curie| 17.7930| 912| rice| 20|22-03-2000|
| 18| arthur| compton|118.7769| 812| 110790| 100|01-01-1899|
| 19| james|chadwick| 96.7930| 467| null| 200|03-05-1921|
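set_col is not limited to numeric columns; the same (columns, func, datatype) pattern can be applied to string columns, assuming 'string' is accepted as the datatype argument. A hypothetical sketch, not executed in this notebook:
# Hypothetical sketch (not run here): upper-casing the product column with set_col.
# func_str = lambda cell: cell.upper() if cell is not None else cell
# transformer.set_col(['product'], func_str, 'string')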
After the transformation process detailed in the previous cells, it is a good idea to analyze the DataFrame again and check whether the transformations have solved the issues related to special characters, numbers present in columns that are supposed to contain only letters, and so on.
# Setting the new dataFrame transformed into the analyzer class
analyzer.set_data_frame(transformer.get_data_frame)
analyzer_table = analyzer.column_analyze("*", print_type=True, plots=True)
General description

Features | Name or Quantity
---|---
File Name | file with no path
Columns | 8
Rows | 19
Column name: id | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 1 Max value: 19
<optimus.df_analyzer.DataTypeTable at 0x112a40400>
end of __analyze 4.61762809753418
Column name: firstName | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112abada0>
end of __analyze 2.2484610080718994
Column name: lastName | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112f60da0>
end of __analyze 2.093127727508545
Column name: age | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 0 | 0.00 % |
Float | 19 | 100.00 % |
<optimus.df_analyzer.DataTypeTable at 0x11305f198>
end of __analyze 5.873319864273071
Column name: billingId | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 111 Max value: 992
<optimus.df_analyzer.DataTypeTable at 0x1137320f0>
end of __analyze 4.413137912750244
Column name: product | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 18 | 94.74 % |
Integer | 1 | 5.26 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x112fc9ac8>
end of __analyze 1.9454371929168701
Column name: price | ||
Column datatype: int | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 0 | 0.00 % |
Integer | 19 | 100.00 % |
Float | 0 | 0.00 % |
Min value: 20 Max value: 200
<optimus.df_analyzer.DataTypeTable at 0x112b03a90>
end of __analyze 4.271012783050537
Column name: birth | ||
Column datatype: string | ||
Datatype | Quantity | Percentage |
---|---|---|
None | 0 | 0.00 % |
Empty str | 0 | 0.00 % |
String | 19 | 100.00 % |
Integer | 0 | 0.00 % |
Float | 0 | 0.00 % |
<optimus.df_analyzer.DataTypeTable at 0x11308bf28>
end of __analyze 2.1021029949188232 Total execution time: 27.68661904335022
It can be seen from the output of the analyzer that there are columns containing numbers even when the column (product, for example) is supposed to hold only words or letters.
In order to solve this problem, the operation_in_type function of the DataFrameTransformer class can be used.
operation_in_type is useful for operating on the elements of a column that match a certain datatype. In this particular example, the last output cell shows that the 'product' column contains values that do not fit the rest of the data: they are not product names but numbers or empty strings. operation_in_type can take care of them and clean the column.
In the following example, operation_in_type is run in order to convert every string that can be parsed as an integer into a null (None) value. Notice how the 110790 value in the product column has been changed, while the rest of the column has remained intact.
# This function makes changes in the specified column, but only in the cells
# that are recognized as the specified dataType.
transformer.operation_in_type([('product', 'integer', None)]).show()
| id|firstName|lastName| age|billingId|price| birth| product|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez|37.7823| 123| 200|07-07-1980| cake|
| 2| andre| ampere|67.7769| 423| 160|08-07-1950| piza|
| 3| niels| bohr|27.7796| 551| 160|09-07-1990| pizza|
| 4| paul| dirac|63.7903| 521| 160|10-07-1954| pizza|
| 5| albert|einstein|27.7796| 634| 160|11-07-1990| pizza|
| 6| galileo| galilei|87.7849| 672| 100|12-08-1930| arepa|
| 7| carl| gauss|47.7876| 323| 60|13-07-1970| taco|
| 8| david| hilbert|67.7769| 624| 60|14-07-1950|taaaccoo|
| 9| johannes| kepler|97.7876| 735| 60|22-04-1920| taco|
| 10| james| maxwell|94.7796| 875| 60|12-03-1923| taco|

only showing top 10 rows
Sometimes there are values that look different but actually refer to the same thing. In the product column, for example, there are values such as 'taaaccoo' and 'piza'. It can be inferred that the correct values are 'taco' and 'pizza'. This problem can be solved with the lookup function of the DataFrameTransformer class.
transformer.lookup('product', str_to_replace='taco', list_str=['taaaccoo'])
transformer.lookup('product', str_to_replace='pizza', list_str=['piza', 'pizzza'])
transformer.show(20)
| id| firstName|lastName| age|billingId|price| birth| product|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 37.7823| 123| 200|07-07-1980| cake|
| 2| andre| ampere| 67.7769| 423| 160|08-07-1950| pizza|
| 3| niels| bohr| 27.7796| 551| 160|09-07-1990| pizza|
| 4| paul| dirac| 63.7903| 521| 160|10-07-1954| pizza|
| 5| albert|einstein| 27.7796| 634| 160|11-07-1990| pizza|
| 6| galileo| galilei| 87.7849| 672| 100|12-08-1930| arepa|
| 7| carl| gauss| 47.7876| 323| 60|13-07-1970| taco|
| 8| david| hilbert| 67.7769| 624| 60|14-07-1950| taco|
| 9| johannes| kepler| 97.7876| 735| 60|22-04-1920| taco|
| 10| james| maxwell| 94.7796| 875| 60|12-03-1923| taco|
| 11| isaac| newton| 18.7903| 992| 180|15-02-1999| pasta|
| 12| emmy| noether| 24.7903| 234| 180|08-12-1993| pasta|
| 13| max| planck| 23.7930| 111| 80|04-01-1994|hamburguer|
| 14| fred| hoyle| 20.7849| 553| 160|27-06-1997| pizza|
| 15| heinrich | hertz| 61.7769| 116| 160|30-11-1956| pizza|
| 16| william| gilbert| 59.7849| 886| 40|26-03-1958| beer|
| 17| marie| curie| 17.7930| 912| 20|22-03-2000| rice|
| 18| arthur| compton|118.7769| 812| 100|01-01-1899| null|
| 19| james|chadwick| 96.7930| 467| 200|03-05-1921| null|
As can be seen above, the strings specified in the 'list_str' argument have been replaced with the 'str_to_replace' value.
The previous transformations were done step by step, but they can also be chained into a single expression, as in the cell below. Thanks to the chaining feature of the DataFrameTransformer class, this is a more efficient and scalable way to apply several transformations, because it takes full advantage of Spark's lazy evaluation.
# Instantiate DataFrameTransformer
transformer = op.DataFrameTransformer(df)
# Get original dataFrame to show it.
transformer.show(20)
# Chaining function transformations
transformer.trim_col("*") \
.remove_special_chars("*") \
.clear_accents("*") \
.lower_case("*") \
.drop_col("dummyCol") \
.date_transform("birth", "yyyyMMdd", "dd-MM-YYYY") \
.age_calculate("birth", "dd-MM-YYYY", "clientAge") \
.operation_in_type([('product', 'integer', None)]) \
.lookup('product', str_to_replace='taco', list_str=['taaaccoo']) \
.lookup('product', str_to_replace='pizza', list_str=['piza', 'pizzza'])
transformer.show(20)
| id| firstName| lastName|billingId| product|price| birth|dummyCol|
|---|---|---|---|---|---|---|---|
| 1| Luis| Alvarez$$%!| 123| Cake| 10|1980/07/07| never|
| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|
| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|
| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|
| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|
| 6| Galileo| GALiLEI| 672| arepa| 5|1930/08/12| never|
| 7| CaRL| Ga%%%uss| 323| taco| 3|1970/07/13| gonna|
| 8| David| H$$$ilbert| 624| taaaccoo| 3|1950/07/14| let|
| 9| Johannes| KEPLER| 735| taco| 3|1920/04/22| you|
| 10| JaMES| M$$ax%%well| 875| taco| 3|1923/03/12| down|
| 11| Isaac| Newton| 992| pasta| 9|1999/02/15| never |
| 12| Emmy%%| Nöether$| 234| pasta| 9|1993/12/08| gonna|
| 13| Max!!!| Planck!!!| 111|hamburguer| 4|1994/01/04| run |
| 14| Fred| Hoy&&&le| 553| pizzza| 8|1997/06/27| around|
| 15|((( Heinrich )))))| Hertz| 116| pizza| 8|1956/11/30| and|
| 16| William| Gilbert###| 886| BEER| 2|1958/03/26| desert|
| 17| Marie| CURIE| 912| Rice| 1|2000/03/22| you|
| 18| Arthur| COM%%%pton| 812| 110790| 5|1899/01/01| #|
| 19| JAMES| Chadwick| 467| null| 10|1921/05/03| #|

| id| firstName|lastName|billingId|price| birth|clientAge| product|
|---|---|---|---|---|---|---|---|
| 1| luis| alvarez| 123| 10|07-07-1980| 37.7823| cake|
| 2| andre| ampere| 423| 8|08-07-1950| 67.7769| pizza|
| 3| niels| bohr| 551| 8|09-07-1990| 27.7796| pizza|
| 4| paul| dirac| 521| 8|10-07-1954| 63.7903| pizza|
| 5| albert|einstein| 634| 8|11-07-1990| 27.7796| pizza|
| 6| galileo| galilei| 672| 5|12-08-1930| 87.7849| arepa|
| 7| carl| gauss| 323| 3|13-07-1970| 47.7876| taco|
| 8| david| hilbert| 624| 3|14-07-1950| 67.7769| taco|
| 9| johannes| kepler| 735| 3|22-04-1920| 97.7876| taco|
| 10| james| maxwell| 875| 3|12-03-1923| 94.7796| taco|
| 11| isaac| newton| 992| 9|15-02-1999| 18.7903| pasta|
| 12| emmy| noether| 234| 9|08-12-1993| 24.7903| pasta|
| 13| max| planck| 111| 4|04-01-1994| 23.7930|hamburguer|
| 14| fred| hoyle| 553| 8|27-06-1997| 20.7849| pizza|
| 15| heinrich | hertz| 116| 8|30-11-1956| 61.7769| pizza|
| 16| william| gilbert| 886| 2|26-03-1958| 59.7849| beer|
| 17| marie| curie| 912| 1|22-03-2000| 17.7930| rice|
| 18| arthur| compton| 812| 5|01-01-1899| 118.7769| null|
| 19| james|chadwick| 467| 10|03-05-1921| 96.7930| null|
General description

Features | Name or Quantity
---|---
File Name | file with no path
Columns | 8
Rows | 19
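Finally, once the cleaning is done you will usually want the underlying Spark DataFrame back for further processing. As shown earlier when feeding the analyzer, the transformer exposes it through get_data_frame:
# Retrieving the transformed Spark DataFrame from the transformer
# (get_data_frame is the same attribute passed to analyzer.set_data_frame above).
df_final = transformer.get_data_frame
df_final.printSchema()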