Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh
I previously converted the downloaded CSV to a Parquet table, but that step is not required. As long as you have a Spark DataFrame loaded, you are good to go.
import spark_df_profiling
df = sqlContext.read.parquet("/Users/Julio/Downloads/Meteorite_Landings.parquet").cache()
df
DataFrame[name: string, id: bigint, nametype: string, recclass: string, mass_g: double, fall: string, reclat: double, reclong: double, GeoLocation: string, source: string, reclat_city: double, year: date]
Spark DataFrames have the built-in method `.describe()`. Let's see what it shows:
df.describe().show()
+-------+------------------+------+---------+----------+-------------------+
|summary|                id|mass_g|   reclat|   reclong|        reclat_city|
+-------+------------------+------+---------+----------+-------------------+
|  count|             45726| 45726|    45726|     45726|              45726|
|   mean|26883.906202160695|   NaN|      NaN|       NaN|                NaN|
| stddev| 16863.44556599258|   NaN|      NaN|       NaN|                NaN|
|    min|                 1|   0.0|-87.36667|-165.43333|-103.79172917787167|
|    max|             57458|   NaN|      NaN|       NaN|                NaN|
+-------+------------------+------+---------+----------+-------------------+
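The NaN entries in the mean, stddev, and max rows are consistent with missing values having been stored as floating-point NaN rather than as nulls, although the source does not confirm this. `describe()` counts non-null values, and a NaN is not null, which would explain why every column reports a count of 45726. A minimal pure-Python sketch of how a single NaN propagates through a mean (the toy values and the helper names `mean` and `mean_ignoring_nan` are hypothetical):

```python
import math

def mean(values):
    """Plain arithmetic mean; any NaN in the input makes the result NaN."""
    return sum(values) / len(values)

def mean_ignoring_nan(values):
    """Mean over the entries that are not NaN."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept)

# Toy stand-in for a column such as mass_g with one value stored as NaN.
masses = [21.0, 720.0, float("nan"), 107000.0]

print(mean(masses))               # nan
print(mean_ignoring_nan(masses))  # mean of the three real values
```

This is only an illustration of the arithmetic; it does not reproduce how Spark itself aggregates the column.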
Now let's use `spark_df_profiling`:
report = spark_df_profiling.ProfileReport(df)
report
Dataset info
Number of variables | 12 |
---|---|
Number of observations | 45726 |
Total Missing (%) | 4.1% |
Total size in memory | 0.0 B |
Average record size in memory | 0.0 B |
Variables types
Numeric | 4 |
---|---|
Categorical | 4 |
Date | 1 |
Text (Unique) | 1 |
Rejected | 2 |
Warnings
- GeoLocation has 7315 / 19.0% missing values (Missing)
- GeoLocation has a high cardinality: 17100 distinct values (Warning)
- mass_g is highly skewed (γ1 = 76.916)
- recclass has a high cardinality: 466 distinct values (Warning)
- reclat has 7315 / 19.0% missing values (Missing)
- reclat has 6438 / 14.1% zeros
- reclat_city is highly correlated with reclat (ρ = 0.99423) (Rejected)
- reclong has 7315 / 19.0% missing values (Missing)
- reclong has 6214 / 13.6% zeros
- source has constant value NASA (Rejected)

GeoLocation
Categorical
Distinct count | 17100 |
---|---|
Unique (%) | 44.5% |
Missing (%) | 19.0% |
Missing (n) | 7315 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
(0.000000, 0.000000) | 6214 | 13.6% |
(-71.500000, 35.666670) | 4761 | 10.4% |
(-84.000000, 168.000000) | 3040 | 6.6% |
(-72.000000, 26.000000) | 1505 | 3.3% |
(-79.683330, 159.750000) | 657 | 1.4% |
(-76.716670, 159.666670) | 637 | 1.4% |
(-76.183330, 157.166670) | 539 | 1.2% |
(-79.683330, 155.750000) | 473 | 1.0% |
(-84.216670, 160.500000) | 263 | 0.6% |
(-86.366670, -70.000000) | 226 | 0.5% |
(0.000000, 35.666670) | 223 | 0.5% |
(-86.716670, -141.500000) | 217 | 0.5% |
(-85.666670, 175.000000) | 185 | 0.4% |
(-24.850000, -70.533330) | 178 | 0.4% |
(-85.633330, -68.700000) | 105 | 0.2% |
(-72.954880, 160.473280) | 74 | 0.2% |
(58.583330, 13.433330) | 64 | 0.1% |
(-76.716670, 159.333330) | 42 | 0.1% |
(-72.778890, 75.313610) | 39 | 0.1% |
(-72.983890, 75.246390) | 38 | 0.1% |
Other values (17080) | 18931 | 41.4% |
(Missing) | 7315 | 16.0% |
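The GeoLocation summary reports Missing (%) as 19.0%, while the frequency table lists the same 7315 missing rows as 16.0%. One reading of the arithmetic, which the source does not confirm, is that the two figures use different denominators: all 45726 observations versus only the rows that have a value. A quick check (the variable names are illustrative):

```python
total_rows = 45726  # Number of observations from the dataset info
missing = 7315      # Missing (n) for GeoLocation

# Missing rows as a share of all rows.
share_of_all = 100 * missing / total_rows
# Missing rows relative to the rows that do have a value.
share_of_present = 100 * missing / (total_rows - missing)

print(f"{share_of_all:.1f}%")      # 16.0%
print(f"{share_of_present:.1f}%")  # 19.0%
```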
fall
Categorical
Distinct count | 2 |
---|---|
Unique (%) | 0.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Found | 44609 | 97.6% |
Fell | 1117 | 2.4% |
id
Numeric
Distinct count | 45716 |
---|---|
Unique (%) | 100.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 26884 |
---|---|
Minimum | 1 |
Maximum | 57458 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 1 |
---|---|
5-th percentile | 2388.8 |
Q1 | 12681 |
Median | 24256 |
Q3 | 40654 |
95-th percentile | 54891 |
Maximum | 57458 |
Range | 57457 |
Interquartile range | 27972 |
Descriptive statistics
Standard deviation | 16863 |
---|---|
Coef of variation | 0.62727 |
Kurtosis | -1.1601 |
Mean | 26884 |
MAD | 14490 |
Skewness | 0.26652 |
Sum | 1229300000 |
Variance | 284380000 |
Memory size | 0.0 B |
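Several rows in this block follow directly from others, which makes the report easy to sanity-check. A short sketch using the figures reported for id; the mean and standard deviation are the unrounded values from the `describe()` output, and the variable names are illustrative:

```python
# Figures reported for the id column.
minimum, maximum = 1, 57458
q1, q3 = 12681, 40654
mean = 26883.906202160695
stddev = 16863.44556599258

value_range = maximum - minimum    # Range
iqr = q3 - q1                      # Interquartile range
variance = stddev ** 2             # Variance is the squared standard deviation
coef_of_variation = stddev / mean  # Coef of variation

print(value_range)                  # 57457
print(iqr)                          # 27973; the report's 27972 presumably uses unrounded quartiles
print(round(coef_of_variation, 5))  # 0.62727
print(variance)                     # close to the reported 284380000
```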
mass_g
Numeric
Distinct count | 12577 |
---|---|
Unique (%) | 27.6% |
Missing (%) | 0.3% |
Missing (n) | 131 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 13278 |
---|---|
Minimum | 0 |
Maximum | 60000000 |
Zeros (%) | 0.0% |
Quantile statistics
Minimum | 0 |
---|---|
5-th percentile | 1.0978 |
Q1 | 7.1907 |
Median | 32.598 |
Q3 | 202.86 |
95-th percentile | 3999.9 |
Maximum | 60000000 |
Range | 60000000 |
Interquartile range | 195.67 |
Descriptive statistics
Standard deviation | 574930 |
---|---|
Coef of variation | 43.298 |
Kurtosis | 6797.7 |
Mean | 13278 |
MAD | 25113 |
Skewness | 76.916 |
Sum | 605430000 |
Variance | 330540000000 |
Memory size | 0.0 B |
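The skewness warning for mass_g reflects a long right tail: the median is about 32.6 while the maximum is 60000000, so a handful of very large meteorites dominate the mean. A minimal sketch of the population skewness, γ1 = m3 / m2^(3/2), on toy data; whether spark_df_profiling uses exactly this estimator is an assumption, and the function name `skewness` and the sample values are hypothetical:

```python
def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
# Mostly small masses plus one very large one, loosely mimicking mass_g.
right_tailed = [1.0, 7.0, 33.0, 200.0, 4000.0, 60000000.0]

print(skewness(symmetric))     # 0.0
print(skewness(right_tailed))  # positive: the long tail is on the right
```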
name
Categorical, Unique
First 3 values |
---|
Abee |
Asco |
Aleppo |
Last 3 values |
---|
Allende |
Alessandria |
Akaba |
First 20 values
1 | Abee |
---|---|
2 | Asco |
3 | Aleppo |
4 | Al Rais |
5 | Arbol Solo |
6 | Ash Creek |
7 | Northwest Africa 5815 |
8 | Anlong |
9 | Aomori |
10 | Aldsworth |
11 | Akyumak |
12 | Aachen |
13 | Ambapur Nagla |
14 | Alta'ameem |
15 | Aarhus |
16 | Archie |
17 | Almahata Sitta |
18 | Andhara |
19 | Adzhi-Bogdo (stone) |
20 | Aïr |
Last 20 values
45707 | Alais |
---|---|
45708 | Arroyo Aguiar |
45709 | Aguada |
45710 | Angra dos Reis (stone) |
45711 | Alexandrovsky |
45712 | Akwanga |
45713 | Alfianello |
45714 | Appley Bridge |
45715 | Achiras |
45716 | Adhi Kot |
45717 | Akbarpur |
45718 | Andover |
45719 | Acapulco |
45720 | Albareto |
45721 | Apt |
45722 | Agen |
45723 | Andura |
45724 | Allende |
45725 | Alessandria |
45726 | Akaba |
nametype
Categorical
Distinct count | 2 |
---|---|
Unique (%) | 0.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
Valid | 45651 | 99.8% |
Relict | 75 | 0.2% |
recclass
Categorical
Distinct count | 466 |
---|---|
Unique (%) | 1.0% |
Missing (%) | 0.0% |
Missing (n) | 0 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Value | Count | Frequency (%) |
---|---|---|
L6 | 8287 | 18.1% |
H5 | 7143 | 15.6% |
L5 | 4797 | 10.5% |
H6 | 4529 | 9.9% |
H4 | 4211 | 9.2% |
LL5 | 2766 | 6.0% |
LL6 | 2043 | 4.5% |
L4 | 1253 | 2.7% |
H4/5 | 428 | 0.9% |
CM2 | 416 | 0.9% |
H3 | 386 | 0.8% |
L3 | 365 | 0.8% |
CO3 | 335 | 0.7% |
Ureilite | 300 | 0.7% |
Iron, IIIAB | 285 | 0.6% |
LL4 | 268 | 0.6% |
CV3 | 256 | 0.6% |
Diogenite | 241 | 0.5% |
Howardite | 240 | 0.5% |
LL | 225 | 0.5% |
Other values (446) | 6952 | 15.2% |
reclat
Numeric
Distinct count | 12739 |
---|---|
Unique (%) | 33.2% |
Missing (%) | 19.0% |
Missing (n) | 7315 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | -39.107 |
---|---|
Minimum | -87.367 |
Maximum | 81.167 |
Zeros (%) | 14.1% |
Quantile statistics
Minimum | -87.367 |
---|---|
5-th percentile | -84.355 |
Q1 | -76.714 |
Median | -71.529 |
Q3 | -0.16289 |
95-th percentile | 34.494 |
Maximum | 81.167 |
Range | 168.53 |
Interquartile range | 76.551 |
Descriptive statistics
Standard deviation | 46.386 |
---|---|
Coef of variation | -1.1861 |
Kurtosis | -1.4768 |
Mean | -39.107 |
MAD | 43.937 |
Skewness | 0.4913 |
Sum | -1502100 |
Variance | 2151.7 |
Memory size | 0.0 B |
reclat_city
Highly correlated
This variable is highly correlated with `reclat` and should be ignored for analysis
Correlation | 0.99423 |
---|
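The rejection rests on the Pearson correlation coefficient ρ between two numeric columns. A small pure-Python sketch of that computation follows; the cutoff of 0.9 is an assumed, illustrative threshold rather than a value taken from the report, and `pearson`, `REJECT_THRESHOLD`, and the sample coordinates are hypothetical:

```python
import math

REJECT_THRESHOLD = 0.9  # assumed cutoff, for illustration only

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy latitudes and a near-duplicate column, loosely mimicking reclat vs reclat_city.
reclat = [-84.0, -71.5, 0.0, 35.7, 58.6]
reclat_city = [-83.2, -72.1, 0.4, 36.0, 57.9]

rho = pearson(reclat, reclat_city)
print(rho > REJECT_THRESHOLD)  # True: the second column adds almost no information
```

Dropping one column of such a pair keeps downstream analysis from counting the same signal twice.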
reclong
Numeric
Distinct count | 14641 |
---|---|
Unique (%) | 38.1% |
Missing (%) | 19.0% |
Missing (n) | 7315 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Mean | 61.053 |
---|---|
Minimum | -165.43 |
Maximum | 354.47 |
Zeros (%) | 13.6% |
Quantile statistics
Minimum | -165.43 |
---|---|
5-th percentile | -90.466 |
Q1 | -0.0024196 |
Median | 35.666 |
Q3 | 157.17 |
95-th percentile | 167.72 |
Maximum | 354.47 |
Range | 519.91 |
Interquartile range | 157.17 |
Descriptive statistics
Standard deviation | 80.655 |
---|---|
Coef of variation | 1.3211 |
Kurtosis | -0.73145 |
Mean | 61.053 |
MAD | 67.606 |
Skewness | -0.17437 |
Sum | 2345100 |
Variance | 6505.3 |
Memory size | 0.0 B |
source
Constant
This variable is constant and should be ignored for analysis
Constant value | NASA |
---|
year
Date
Distinct count | 244 |
---|---|
Unique (%) | 0.5% |
Missing (%) | 0.7% |
Missing (n) | 312 |
Infinite (%) | 0.0% |
Infinite (n) | 0 |
Minimum | 1688-01-01 |
---|---|
Maximum | 2101-01-01 |
report.to_file("/tmp/example.html")