Individual Project

"Predicting a Pulsar Star"

Author: Aleksei Demin

Part 1: Feature and data explanation

Introduction

A pulsar is a highly magnetized rotating neutron star that emits a beam of electromagnetic radiation. This radiation can be observed only when the beam of emission is pointing toward Earth (much like the way a lighthouse can be seen only when the light is pointed in the direction of an observer), and is responsible for the pulsed appearance of emission. Neutron stars are very dense, and have short, regular rotational periods. This produces a very precise interval between pulses that ranges from milliseconds to seconds for an individual pulsar. Pulsars are believed to be one of the candidates for the source of ultra-high-energy cosmic rays (see also centrifugal mechanism of acceleration).

Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. Thus a potential signal detection, known as a 'candidate', is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional information, each candidate could potentially describe a real pulsar. In practice, however, almost all detections are caused by radio frequency interference (RFI) and noise, which makes legitimate signals hard to find.
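The averaging over rotations described above is usually called folding. The snippet below is only a minimal sketch of the idea on synthetic data: the function name fold_profile, the period of 64 samples and the pulse position are assumptions made for illustration and are not taken from the dataset or from the survey pipeline.

```python
import numpy as np

def fold_profile(samples, period):
    """Average a 1-D time series over consecutive windows of `period` samples."""
    n_rot = len(samples) // period
    # Drop the incomplete last rotation, then stack one rotation per row
    trimmed = np.asarray(samples[: n_rot * period], dtype=float)
    return trimmed.reshape(n_rot, period).mean(axis=0)

# Synthetic example (assumed values): a weak pulse buried in unit-variance noise
rng = np.random.default_rng(0)
period, n_rot, pulse_bin = 64, 500, 20
signal = rng.normal(0.0, 1.0, size=period * n_rot)
signal[pulse_bin::period] += 1.0   # one pulse per rotation, same phase each time

profile = fold_profile(signal, period)
print(profile.argmax())            # bin where the folded profile peaks
```

A single rotation hides the pulse in the noise, while the average over 500 rotations reduces the noise level by roughly the square root of the number of rotations, so the pulse bin stands out.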

Dataset information

Kaggle link

The provided dataset contains 16,259 spurious examples caused by RFI/noise and 1,639 real pulsar examples. Each row lists the variables first, and the class label is the final entry. The class labels used are 0 (negative) and 1 (positive).

Data explanation

Each object is described by 8 continuous variables and a single class target variable.

The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are similarly obtained from the DM-SNR curve.

Description
1 Mean of the integrated profile
2 Standard deviation of the integrated profile
3 Excess kurtosis of the integrated profile
4 Skewness of the integrated profile
5 Mean of the DM-SNR curve
6 Standard deviation of the DM-SNR curve
7 Excess kurtosis of the DM-SNR curve
8 Skewness of the DM-SNR curve
target_class Target variable - Object is a pulsar or not

Small clarification about skewness and kurtosis

Skewness assesses the extent to which a variable's distribution is symmetrical. If the distribution stretches toward the right or left tail, it is referred to as skewed. Kurtosis measures how peaked and heavy-tailed a distribution is compared with a normal distribution: a high value indicates a narrow central peak with heavy tails. Excess kurtosis is kurtosis minus 3, so a normal distribution has an excess kurtosis of 0.
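As an illustration, both statistics can be computed from standardized values. This is a sketch using the population (biased) moment estimators; whether the dataset authors used exactly these estimators is an assumption, and the two small arrays are made up for the example.

```python
import numpy as np

def skewness(x):
    """Third standardized moment (population estimator)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (population estimator)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

symmetric = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])    # made-up values
right_tailed = np.array([0.0, 0.0, 0.0, 1.0, 10.0])  # made-up values

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric sample has zero skewness and negative excess kurtosis (flatter than a normal distribution), while the sample with one large value has positive skewness.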

Small clarification about DM-SNR

Example of DM-SNR curve

One of the parameters that describes a candidate is the dispersion measure (DM). Dispersion is caused by the interstellar medium and is different for every pulsar, depending on its distance and the number of electrons in the interstellar medium in the direction of the pulsar. Dispersion causes the lower frequencies of the signal to arrive later than the higher frequencies. This smears out, or disperses, the pulse. This smearing will completely obliterate the pulse if the signal is not dedispersed before folding.
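To give a feeling for the size of the effect, the delay between two observing frequencies can be estimated with the standard cold-plasma dispersion relation, delay = K * DM * (f_lo^-2 - f_hi^-2), with K ≈ 4.148808e3 s MHz^2 pc^-1 cm^3. This formula and constant are not stated in the dataset description and are brought in here as an assumption; the function name and the example values (DM = 100 pc cm^-3, 1400 and 1500 MHz) are also illustrative.

```python
# Dispersion constant in s * MHz^2 * pc^-1 * cm^3 (standard value, assumed here)
K_DM = 4.148808e3

def dispersion_delay_s(dm, f_lo_mhz, f_hi_mhz):
    """Extra arrival time (seconds) at f_lo relative to f_hi for a given DM in pc cm^-3."""
    return K_DM * dm * (f_lo_mhz ** -2 - f_hi_mhz ** -2)

# Illustrative values, not taken from the dataset
delay = dispersion_delay_s(dm=100.0, f_lo_mhz=1400.0, f_hi_mhz=1500.0)
print(f"{delay * 1e3:.1f} ms")
```

A delay of this order across the band is comparable to or longer than the rotation period of a fast pulsar, which is why the pulse disappears without dedispersion.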

SNR - Pulse profile signal-to-noise ratio.

Part 2: Primary data analysis

Import the necessary packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import itertools

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, learning_curve, validation_curve
from sklearn.ensemble import RandomForestClassifier

import lightgbm as lgb

warnings.filterwarnings("ignore")
%matplotlib inline

Read the data

In [2]:
data = pd.read_csv('pulsar_stars.csv')
data.head()
Out[2]:
Mean of the integrated profile Standard deviation of the integrated profile Excess kurtosis of the integrated profile Skewness of the integrated profile Mean of the DM-SNR curve Standard deviation of the DM-SNR curve Excess kurtosis of the DM-SNR curve Skewness of the DM-SNR curve target_class
0 140.562500 55.683782 -0.234571 -0.699648 3.199833 19.110426 7.975532 74.242225 0
1 102.507812 58.882430 0.465318 -0.515088 1.677258 14.860146 10.576487 127.393580 0
2 103.015625 39.341649 0.323328 1.051164 3.121237 21.744669 7.735822 63.171909 0
3 136.750000 57.178449 -0.068415 -0.636238 3.642977 20.959280 6.896499 53.593661 0
4 88.726562 40.672225 0.600866 1.123492 1.178930 11.468720 14.269573 252.567306 0

Data information

In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 Mean of the integrated profile                  17898 non-null float64
 Standard deviation of the integrated profile    17898 non-null float64
 Excess kurtosis of the integrated profile       17898 non-null float64
 Skewness of the integrated profile              17898 non-null float64
 Mean of the DM-SNR curve                        17898 non-null float64
 Standard deviation of the DM-SNR curve          17898 non-null float64
 Excess kurtosis of the DM-SNR curve             17898 non-null float64
 Skewness of the DM-SNR curve                    17898 non-null float64
target_class                                     17898 non-null int64
dtypes: float64(8), int64(1)
memory usage: 1.2 MB
In [4]:
data.describe()
Out[4]:
Mean of the integrated profile Standard deviation of the integrated profile Excess kurtosis of the integrated profile Skewness of the integrated profile Mean of the DM-SNR curve Standard deviation of the DM-SNR curve Excess kurtosis of the DM-SNR curve Skewness of the DM-SNR curve target_class
count 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000 17898.000000
mean 111.079968 46.549532 0.477857 1.770279 12.614400 26.326515 8.303556 104.857709 0.091574
std 25.652935 6.843189 1.064040 6.167913 29.472897 19.470572 4.506092 106.514540 0.288432
min 5.812500 24.772042 -1.876011 -1.791886 0.213211 7.370432 -3.139270 -1.976976 0.000000
25% 100.929688 42.376018 0.027098 -0.188572 1.923077 14.437332 5.781506 34.960504 0.000000
50% 115.078125 46.947479 0.223240 0.198710 2.801839 18.461316 8.433515 83.064556 0.000000
75% 127.085938 51.023202 0.473325 0.927783 5.464256 28.428104 10.702959 139.309331 0.000000
max 192.617188 98.778911 8.069522 68.101622 223.392140 110.642211 34.539844 1191.000837 1.000000

Correlation between fields

In [5]:
data.corr()
Out[5]:
Mean of the integrated profile Standard deviation of the integrated profile Excess kurtosis of the integrated profile Skewness of the integrated profile Mean of the DM-SNR curve Standard deviation of the DM-SNR curve Excess kurtosis of the DM-SNR curve Skewness of the DM-SNR curve target_class
Mean of the integrated profile 1.000000 0.547137 -0.873898 -0.738775 -0.298841 -0.307016 0.234331 0.144033 -0.673181
Standard deviation of the integrated profile 0.547137 1.000000 -0.521435 -0.539793 0.006869 -0.047632 0.029429 0.027691 -0.363708
Excess kurtosis of the integrated profile -0.873898 -0.521435 1.000000 0.945729 0.414368 0.432880 -0.341209 -0.214491 0.791591
Skewness of the integrated profile -0.738775 -0.539793 0.945729 1.000000 0.412056 0.415140 -0.328843 -0.204782 0.709528
Mean of the DM-SNR curve -0.298841 0.006869 0.414368 0.412056 1.000000 0.796555 -0.615971 -0.354269 0.400876
Standard deviation of the DM-SNR curve -0.307016 -0.047632 0.432880 0.415140 0.796555 1.000000 -0.809786 -0.575800 0.491535
Excess kurtosis of the DM-SNR curve 0.234331 0.029429 -0.341209 -0.328843 -0.615971 -0.809786 1.000000 0.923743 -0.390816
Skewness of the DM-SNR curve 0.144033 0.027691 -0.214491 -0.204782 -0.354269 -0.575800 0.923743 1.000000 -0.259117
target_class -0.673181 -0.363708 0.791591 0.709528 0.400876 0.491535 -0.390816 -0.259117 1.000000
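The last row of the matrix is the most useful one for modelling. A possible helper for ranking features by the absolute value of their correlation with the target is sketched below; the function name rank_by_target_correlation and the synthetic demo frame are assumptions for illustration, and with the real data one would pass data and 'target_class' instead.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(frame, target):
    """Return feature-target Pearson correlations ordered by absolute value."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic demo frame (made-up columns), only to show the call
rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=500)
demo = pd.DataFrame({
    "strong": label + rng.normal(0, 0.3, size=500),
    "weak": label + rng.normal(0, 3.0, size=500),
    "noise": rng.normal(0, 1.0, size=500),
    "target_class": label,
})

ranking = rank_by_target_correlation(demo, "target_class")
print(ranking)
```

In Out[5] the same ordering would put the excess kurtosis of the integrated profile first (0.79), followed by its skewness (0.71) and mean (-0.67).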

Missing values

In [6]:
data.isnull().sum()
Out[6]:
 Mean of the integrated profile                  0
 Standard deviation of the integrated profile    0
 Excess kurtosis of the integrated profile       0
 Skewness of the integrated profile              0
 Mean of the DM-SNR curve                        0
 Standard deviation of the DM-SNR curve          0
 Excess kurtosis of the DM-SNR curve             0
 Skewness of the DM-SNR curve                    0
target_class                                     0
dtype: int64

Target class

In [7]:
data['target_class'].value_counts()
Out[7]:
0    16259
1     1639
Name: target_class, dtype: int64
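The counts above translate into class shares as follows. This is plain arithmetic on the two numbers from Out[7]; with the DataFrame at hand, data['target_class'].value_counts(normalize=True) gives the same shares directly.

```python
# Class counts copied from Out[7]
counts = {0: 16259, 1: 1639}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts[0] / counts[1]   # negatives per positive

print(total)
print({label: round(share, 4) for label, share in shares.items()})
print(round(imbalance_ratio, 2))
```

Only about 9% of the candidates are real pulsars, which agrees with the mean of target_class in Out[4] (0.091574), so the classes are clearly imbalanced.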

Some conclusions after primary data analysis

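  • the maximum values lie far above the 75th percentile for several features (for example, skewness of the DM-SNR curve: 1191.0 versus 139.3), which suggests outliers or heavy right tails
  • within each curve, excess kurtosis and skewness are strongly correlated with each other (0.95 for the integrated profile, 0.92 for the DM-SNR curve); for the integrated profile they are also strongly negatively correlated with the mean (-0.87 and -0.74), so the features are far from independent
  • all columns are numeric and contain no missing values, so no type conversion or imputation is needed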
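The suspicion about outliers can be quantified, for instance with the common 1.5 * IQR rule. The sketch below is one possible check and not part of the original analysis; the function name iqr_outlier_mask, the factor k=1.5 and the small sample array are assumptions.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up sample with one obviously extreme value
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
mask = iqr_outlier_mask(sample)
print(mask.sum(), "flagged out of", mask.size)
```

Applied column by column, for example iqr_outlier_mask(data[col]).mean(), this gives the share of flagged rows per feature. For strongly skewed features a large share is expected and does not by itself mean the values are wrong.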

Part 3: Primary visual data analysis

Preprocessing

In [8]:
columns_old = [' Mean of the integrated profile', ' Standard deviation of the integrated profile', ' Excess kurtosis of the integrated profile', ' Skewness of the integrated profile',
           ' Mean of the DM-SNR curve', ' Standard deviation of the DM-SNR curve', ' Excess kurtosis of the DM-SNR curve',
           ' Skewness of the DM-SNR curve', 'target_class']
columns_new = ['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
           'mean_dmsnr', 'std_dmsnr', 'kurtosis_dmsnr', 'skewness_dmsnr', 'target_class']

# Select the original columns (a KeyError is raised if a name does not match,
# note the leading spaces) and give them short names
df = data[columns_old].copy()
df.columns = columns_new

df.head()
Out[8]:
mean_profile std_profile kurtosis_profile skewness_profile mean_dmsnr std_dmsnr kurtosis_dmsnr skewness_dmsnr target_class
0 140.562500 55.683782 -0.234571 -0.699648 3.199833 19.110426 7.975532 74.242225 0
1 102.507812 58.882430 0.465318 -0.515088 1.677258 14.860146 10.576487 127.393580 0
2 103.015625 39.341649 0.323328 1.051164 3.121237 21.744669 7.735822 63.171909 0
3 136.750000 57.178449 -0.068415 -0.636238 3.642977 20.959280 6.896499 53.593661 0
4 88.726562 40.672225 0.600866 1.123492 1.178930 11.468720 14.269573 252.567306 0

First of all let's look at our target proportions

In [9]:
# Sort by class label so that the wedge labels always match classes 0 and 1
class_counts = data["target_class"].value_counts().sort_index()

plt.figure(figsize=(6,6))
plt.pie(class_counts.values,
        labels=["Non-pulsar stars","Pulsar stars"],
        autopct="%1.0f%%",wedgeprops={"linewidth":4,"edgecolor":"white"})
plt.title("Proportion of target variable in dataset")
plt.show()

Distribution of feature variables

In [10]:
columns_new = ['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
           'mean_dmsnr', 'std_dmsnr', 'kurtosis_dmsnr', 'skewness_dmsnr']
length  = len(columns_new)
colors  = ["r","g","b","m","y","c","k","orange"]

plt.figure(figsize=(13,20))
# One panel per feature on a 4 x 2 grid
for j,(i,k) in enumerate(zip(columns_new,colors)):
    # subplot() requires integer grid sizes, hence floor division
    plt.subplot(length//2,length//4,j+1)
    # distplot is deprecated since seaborn 0.11; histplot(kde=True) is its successor
    sns.distplot(df[i],color=k)
    plt.title(i)
    plt.subplots_adjust(hspace = .3)
    plt.axvline(df[i].mean(),color = "k",linestyle="dashed",label="MEAN")
    # This line marks the value of the std on the feature axis, not mean +/- std
    plt.axvline(df[i].std(),color = "b",linestyle="dotted",label="STANDARD DEVIATION")
    plt.legend(loc="upper right")
plt.show()

Comparing the mean and std of features between target classes

In [11]:
compare = df.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
           'mean_dmsnr', 'std_dmsnr', 'kurtosis_dmsnr', 'skewness_dmsnr']].mean().reset_index()
compare = compare.drop("target_class",axis =1)

compare.plot(kind="bar",width=.6,figsize=(13,6),colormap="Set2")
plt.grid(True,alpha=.3)
plt.title("Comparing mean of features for target class")

compare1 = df.groupby("target_class")[['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
           'mean_dmsnr', 'std_dmsnr', 'kurtosis_dmsnr', 'skewness_dmsnr']].std().reset_index()
compare1 = compare1.drop("target_class",axis=1)
compare1.plot(kind="bar",width=.6,figsize=(13,6),colormap="Set2")
plt.grid(True,alpha=.3)
plt.title("Comparing std of features for target class")
plt.show()

PairPlot of all features

In [12]:
sns.pairplot(data=df,
             palette="husl",
             hue="target_class",
             vars=['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile',
           'mean_dmsnr', 'std_dmsnr', 'kurtosis_dmsnr', 'skewness_dmsnr'])

plt.tight_layout()
plt.show()