pandas-profiling Meteorites example

Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

Import libraries

In [1]:
import pandas as pd
import pandas_profiling
import numpy as np

Load and prepare example dataset

We add some synthetic variables to illustrate pandas-profiling's capabilities.

In [2]:
df = pd.read_csv("/tmp/Meteorite_Landings.csv", parse_dates=['year'], encoding='UTF-8')

# Note: pandas nanosecond timestamps only cover roughly 1677-09-21 to 2262-04-11,
# so years outside that range are coerced to NaT and ignored in this analysis
df['year'] = pd.to_datetime(df['year'], errors='coerce')

# Example: Constant variable
df['source'] = "NASA"

# Example: Highly correlated variables
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5, size=len(df))

# Example: Duplicate observations (copies of the first 10 rows, with " copy" appended to the name)
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([df, duplicates_to_add], ignore_index=True)
In [3]:
duplicates_to_add.columns
Out[3]:
Index([      u'name',          u'id',    u'nametype',    u'recclass',
          u'mass (g)',        u'fall',        u'year',      u'reclat',
           u'reclong', u'GeoLocation',      u'source', u'reclat_city'],
      dtype='object')
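
The preparation cell relies on `errors='coerce'` to turn values that cannot be stored as timestamps into NaT. The following is a minimal sketch of that behaviour using made-up values (they are not taken from the meteorite file). Whether the out-of-range year becomes NaT depends on the pandas version and its timestamp resolution, so only the unparseable string is guaranteed to be coerced here.

```python
import pandas as pd

# Hypothetical sample values, not taken from the meteorite dataset.
raw_years = pd.Series(["1880-01-01", "1951-01-01", "0860-01-01", "not a date"])

# errors="coerce" replaces unparseable (and, with nanosecond resolution,
# out-of-range) values with NaT instead of raising an exception.
years = pd.to_datetime(raw_years, errors="coerce")

# Earliest timestamp representable with nanosecond resolution.
print(pd.Timestamp.min)

# Number of values that were coerced to NaT.
print(years.isna().sum())
```

Counting the NaT values after conversion is a quick way to see how many rows the date handling silently drops from the analysis.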

Inline report without saving object

In [4]:
pandas_profiling.ProfileReport(df)
Out[4]:

Overview

Dataset info

Number of variables 12
Number of observations 45726
Total Missing (%) 4.1%
Total size in memory 4.5 MiB
Average record size in memory 104.0 B

Variables types

Numeric 4
Categorical 4
Date 1
Text (Unique) 1
Rejected 2

Warnings

  • GeoLocation has 7315 / 16.0% missing values Missing
  • GeoLocation has a high cardinality: 17101 distinct values Warning
  • mass (g) is highly skewed (γ1 = 76.918)
  • recclass has a high cardinality: 466 distinct values Warning
  • reclat has 7315 / 16.0% missing values Missing
  • reclat has 6438 / 14.1% zeros
  • reclat_city is highly correlated with reclat (ρ = 0.99417) Rejected
  • reclong has 7315 / 16.0% missing values Missing
  • reclong has 6214 / 13.6% zeros
  • source has constant value NASA Rejected
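
The warnings above can be approximated with plain pandas. This is only an illustrative sketch, not pandas-profiling's implementation: the function name `basic_warnings` and the `cardinality_threshold` default are assumptions made for the example, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd

def basic_warnings(df, cardinality_threshold=50):
    """Return simple data-quality messages per column (illustrative only)."""
    messages = []
    n = len(df)
    for col in df.columns:
        s = df[col]
        n_missing = int(s.isna().sum())
        n_distinct = int(s.nunique(dropna=True))
        is_numeric = pd.api.types.is_numeric_dtype(s)
        if n_missing:
            messages.append(f"{col} has {n_missing} / {n_missing / n:.1%} missing values")
        if n_distinct == 1 and n_missing == 0:
            messages.append(f"{col} has constant value {s.iloc[0]}")
        elif not is_numeric and n_distinct > cardinality_threshold:
            # cardinality_threshold is a hypothetical cut-off chosen for this sketch
            messages.append(f"{col} has a high cardinality: {n_distinct} distinct values")
        if is_numeric:
            n_zeros = int((s == 0).sum())
            if n_zeros:
                messages.append(f"{col} has {n_zeros} / {n_zeros / n:.1%} zeros")
    return messages

# Toy data standing in for the meteorite DataFrame.
toy = pd.DataFrame({
    "source": ["NASA"] * 4,
    "reclat": [0.0, np.nan, -71.5, 0.0],
})
for line in basic_warnings(toy):
    print(line)
```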

Variables

GeoLocation
Categorical

Distinct count 17101
Unique (%) 44.5%
Missing (%) 16.0%
Missing (n) 7315
Value                      Count  Frequency (%)
(0.000000, 0.000000)       6214   13.6%
(-71.500000, 35.666670)    4761   10.4%
(-84.000000, 168.000000)   3040   6.6%
(-72.000000, 26.000000)    1505   3.3%
(-79.683330, 159.750000)   657    1.4%
(-76.716670, 159.666670)   637    1.4%
(-76.183330, 157.166670)   539    1.2%
(-79.683330, 155.750000)   473    1.0%
(-84.216670, 160.500000)   263    0.6%
(-86.366670, -70.000000)   226    0.5%
Other values (17090)       20096  43.9%
(Missing)                  7315   16.0%

fall
Categorical

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Value  Count  Frequency (%)
Found  44609  97.6%
Fell   1117   2.4%

id
Numeric

Distinct count 45716
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 26884
Minimum 1
Maximum 57458
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 2388.8
Q1 12681
Median 24256
Q3 40654
95-th percentile 54891
Maximum 57458
Range 57457
Interquartile range 27972

Descriptive statistics

Standard deviation 16863
Coef of variation 0.62727
Kurtosis -1.1601
Mean 26884
MAD 14490
Skewness 0.26653
Sum 1229293495
Variance 284380000
Memory size 357.2 KiB
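
The quantile and descriptive statistics listed above can be recomputed with pandas. The sketch below is not the library's own code. It assumes that "Coef of variation" is the standard deviation divided by the mean (which matches the figures shown) and that "MAD" means the mean absolute deviation. The helper name `describe_numeric` and the toy values are hypothetical.

```python
import pandas as pd

def describe_numeric(s):
    """Compute a subset of the report's statistics for a numeric Series."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    mean = s.mean()
    return {
        "range": s.max() - s.min(),
        "iqr": q3 - q1,                    # interquartile range
        "std": s.std(),                    # sample standard deviation (ddof=1)
        "cv": s.std() / mean,              # coefficient of variation
        "mad": (s - mean).abs().mean(),    # mean absolute deviation (assumed meaning of MAD)
        "skewness": s.skew(),
        "kurtosis": s.kurt(),              # excess kurtosis
    }

# Toy values standing in for a column such as df['id'].
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
stats = describe_numeric(toy)
for name, value in stats.items():
    print(name, value)
```

As a check against the report, the id column has a standard deviation of 16863 and a mean of 26884, and their ratio agrees with the listed coefficient of variation of 0.62727 up to rounding.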
Value                 Count  Frequency (%)
417                   2      0.0%
398                   2      0.0%
1                     2      0.0%
6                     2      0.0%
392                   2      0.0%
370                   2      0.0%
379                   2      0.0%
2                     2      0.0%
390                   2      0.0%
10                    2      0.0%
Other values (45706)  45706  100.0%

Minimum 5 values

Value  Count  Frequency (%)
1      2      0.0%
2      2      0.0%
4      1      0.0%
5      1      0.0%
6      2      0.0%

Maximum 5 values

Value  Count  Frequency (%)
57454  1      0.0%
57455  1      0.0%
57456  1      0.0%
57457  1      0.0%
57458  1      0.0%

mass (g)
Numeric

Distinct count 12577
Unique (%) 27.6%
Missing (%) 0.3%
Missing (n) 131
Infinite (%) 0.0%
Infinite (n) 0
Mean 13278
Minimum 0
Maximum 60000000
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 1.1
Q1 7.2
Median 32.61
Q3 202.9
95-th percentile 4000
Maximum 60000000
Range 60000000
Interquartile range 195.7

Descriptive statistics

Standard deviation 574930
Coef of variation 43.298
Kurtosis 6798.4
Mean 13278
MAD 25113
Skewness 76.918
Sum 605430000
Variance 3.3054e+11
Memory size 357.2 KiB
Value                 Count  Frequency (%)
1.3                   171    0.4%
1.2                   140    0.3%
1.4                   138    0.3%
2.1                   130    0.3%
2.4                   126    0.3%
1.6                   120    0.3%
0.5                   119    0.3%
1.1                   116    0.3%
3.8                   114    0.2%
1.5                   111    0.2%
Other values (12566)  44310  96.9%
(Missing)             131    0.3%

Minimum 5 values

Value  Count  Frequency (%)
0.0    19     0.0%
0.01   2      0.0%
0.013  1      0.0%
0.02   1      0.0%
0.03   1      0.0%

Maximum 5 values

Value       Count  Frequency (%)
28000000.0  1      0.0%
30000000.0  1      0.0%
50000000.0  1      0.0%
58200000.0  1      0.0%
60000000.0  1      0.0%

nametype
Categorical

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Value   Count  Frequency (%)
Valid   45651  99.8%
Relict  75     0.2%

recclass
Categorical

Distinct count 466
Unique (%) 1.0%
Missing (%) 0.0%
Missing (n) 0
Value               Count  Frequency (%)
L6                  8287   18.1%
H5                  7143   15.6%
L5                  4797   10.5%
H6                  4529   9.9%
H4                  4211   9.2%
LL5                 2766   6.0%
LL6                 2043   4.5%
L4                  1253   2.7%
H4/5                428    0.9%
CM2                 416    0.9%
Other values (456)  9853   21.5%

reclat
Numeric

Distinct count 12739
Unique (%) 33.2%
Missing (%) 16.0%
Missing (n) 7315
Infinite (%) 0.0%
Infinite (n) 0
Mean -39.107
Minimum -87.367
Maximum 81.167
Zeros (%) 14.1%

Quantile statistics

Minimum -87.367
5-th percentile -84.355
Q1 -76.714
Median -71.5
Q3 0
95-th percentile 34.494
Maximum 81.167
Range 168.53
Interquartile range 76.714

Descriptive statistics

Standard deviation 46.386
Coef of variation -1.1861
Kurtosis -1.4769
Mean -39.107
MAD 43.937
Skewness 0.49132
Sum -1502100
Variance 2151.7
Memory size 357.2 KiB
Value                 Count  Frequency (%)
0.0                   6438   14.1%
-71.5                 4761   10.4%
-84.0                 3040   6.6%
-72.0                 1506   3.3%
-79.68333             1130   2.5%
-76.71667             680    1.5%
-76.18333             539    1.2%
-84.21667             263    0.6%
-86.36667             226    0.5%
-86.71667             217    0.5%
Other values (12728)  19611  42.9%
(Missing)             7315   16.0%

Minimum 5 values

Value      Count  Frequency (%)
-87.36667  4      0.0%
-87.03333  3      0.0%
-86.93333  3      0.0%
-86.71667  217    0.5%
-86.56667  17     0.0%

Maximum 5 values

Value     Count  Frequency (%)
72.68333  1      0.0%
72.88333  1      0.0%
76.13333  1      0.0%
76.53333  1      0.0%
81.16667  1      0.0%

reclat_city
Highly correlated

This variable is highly correlated with reclat and should be ignored for analysis

Correlation 0.99417
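
The rejection above follows from how reclat_city was built: it is reclat plus Gaussian noise with a standard deviation of 5, which is small compared with the spread of the latitudes. A rough sketch with synthetic latitudes is shown below. The uniform latitudes, the fixed seed, and the 0.9 cut-off are assumptions made for illustration and are not taken from the report.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical latitudes; the real column comes from the CSV file.
reclat = pd.Series(rng.uniform(-90, 90, size=1000))

# Same construction as in the preparation cell: original value plus noise.
reclat_city = reclat + rng.normal(scale=5, size=len(reclat))

rho = reclat.corr(reclat_city)  # Pearson correlation is the default method
threshold = 0.9                 # assumed cut-off, for illustration only
print(f"rho = {rho:.3f}, rejected = {rho > threshold}")
```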

reclong
Numeric

Distinct count 14641
Unique (%) 38.1%
Missing (%) 16.0%
Missing (n) 7315
Infinite (%) 0.0%
Infinite (n) 0
Mean 61.053
Minimum -165.43
Maximum 354.47
Zeros (%) 13.6%

Quantile statistics

Minimum -165.43
5-th percentile -90.427
Q1 0
Median 35.667
Q3 157.17
95-th percentile 168
Maximum 354.47
Range 519.91
Interquartile range 157.17

Descriptive statistics

Standard deviation 80.655
Coef of variation 1.3211
Kurtosis -0.73139
Mean 61.053
MAD 67.606
Skewness -0.17438
Sum 2345100
Variance 6505.3
Memory size 357.2 KiB
Value                 Count  Frequency (%)
0.0                   6214   13.6%
35.66667              4985   10.9%
168.0                 3040   6.6%
26.0                  1506   3.3%
159.75                657    1.4%
159.66667             637    1.4%
157.16667             542    1.2%
155.75                473    1.0%
160.5                 263    0.6%
-70.0                 228    0.5%
Other values (14630)  19866  43.4%
(Missing)             7315   16.0%

Minimum 5 values

Value       Count  Frequency (%)
-165.43333  9      0.0%
-165.11667  17     0.0%
-163.16667  1      0.0%
-162.55     1      0.0%
-157.86667  1      0.0%

Maximum 5 values

Value      Count  Frequency (%)
175.13333  1      0.0%
175.73028  1      0.0%
178.08333  1      0.0%
178.2      1      0.0%
354.47333  1      0.0%

source
Constant

This variable is constant and should be ignored for analysis

Constant value NASA

year
Date

Distinct count 246
Unique (%) 0.5%
Missing (%) 0.7%
Missing (n) 312
Infinite (%) 0.0%
Infinite (n) 0
Minimum 1688-01-01 00:00:00
Maximum 2101-01-01 00:00:00

name
Categorical, Unique

First 3 values
Elephant Moraine 87822
MacAlpine Hills 02497
Dominion Range 08408
Last 3 values
Northwest Africa 3240
Roberts Massif 04204
Elephant Moraine 87797

First 10 values

Value        Count  Frequency (%)
Aachen       1      0.0%
Aachen copy  1      0.0%
Aarhus       1      0.0%
Aarhus copy  1      0.0%
Abajo        1      0.0%

Last 10 values

Value           Count  Frequency (%)
Österplana 062  1      0.0%
Österplana 063  1      0.0%
Österplana 064  1      0.0%
Łowicz          1      0.0%
Święcany        1      0.0%

Sample

   name      id   nametype  recclass     mass (g)  fall  year                 reclat     reclong     GeoLocation               source  reclat_city
0  Aachen    1    Valid     L5           21        Fell  1880-01-01 00:00:00  50.77500   6.08333     (50.775000, 6.083330)     NASA    50.710757
1  Aarhus    2    Valid     H6           720       Fell  1951-01-01           56.18333   10.23333    (56.183330, 10.233330)    NASA    60.481559
2  Abee      6    Valid     EH4          107000    Fell  1952-01-01           54.21667   -113.00000  (54.216670, -113.000000)  NASA    59.575745
3  Acapulco  10   Valid     Acapulcoite  1914      Fell  1976-01-01           16.88333   -99.90000   (16.883330, -99.900000)   NASA    19.983026
4  Achiras   370  Valid     L6           780       Fell  1902-01-01           -33.16667  -64.95000   (-33.166670, -64.950000)  NASA    -30.702776

Save report to file

In [5]:
pfr = pandas_profiling.ProfileReport(df)
pfr.to_file("/tmp/example.html")
In [6]:
pfr