## Solar flare forecasting (an ML approach)

Research plan

- Dataset and feature description
- Exploratory data analysis
- Visual analysis of the features
- Patterns, insights, peculiarities of data
- Data preprocessing
- Feature engineering and description
- Cross-validation, hyperparameter tuning
- Validation and learning curves
- Prediction for hold-out and test samples
- Model evaluation with metrics description
- Conclusions

### Part 1. What are solar flares and why do we need to forecast them?

The Sun produces solar flares: great bursts of electromagnetic energy and particles that affect the Earth and the near-Earth environment. These flares have the power to blow out transformers on power grids and disrupt satellite systems. Predicting such events in order to minimize their negative impact is a long-standing task, and a difficult one because of the rarity of these events. Success at it has not improved significantly over the last 60 years. This was actually the topic of my Ph.D. research, which I did without any machine learning. Even in the era of big data, there has been no major breakthrough. The most common approach is described in the paper by Bobra et al., 2014. Its main drawback is that it ignores the time dependence of the features. Here I try to apply knowledge about working with time series to these features. All data for this project can be downloaded from the [link]

#### A picture of the Sun

Below is a picture of our Sun in one of its spectral lines, $H_\alpha$ (the most beautiful one). You can see bright regions on the solar surface; these are called active regions, and in most cases solar flares arise from such regions. There is a great site (https://solarmonitor.org/) where information about the Sun is aggregated. Let's look at the Sun and its active regions.

In [1]:
import matplotlib.image as mpimg
import wget
import os
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
%matplotlib inline
# look at the annotated image from solarmonitor
file_url = 'https://solarmonitor.org/data/2014/05/14/pngs/bbso/bbso_halph_fd_20140514_053834.png'
IMG_PATH = '../../img/'
file_name = file_url.split(sep='/')[-1]
# download the image only if it is not cached yet
if not os.path.isfile(os.path.join(IMG_PATH, file_name)):
    wget.download(file_url, os.path.join(IMG_PATH, file_name))
img = mpimg.imread(os.path.join(IMG_PATH, file_name))
plt.figure(figsize=(12, 12))
imgplot = plt.imshow(img)

#### Active regions

Active regions are called active because of the strength of the magnetic field in these regions. Magnetic field strengths can be obtained from so-called solar magnetograms. A magnetogram of the full solar disk at the same moment is shown in the next picture.

In [2]:
# look at the annotated magnetogram from solarmonitor
file_url = 'https://solarmonitor.org/data/2014/05/14/pngs/shmi/shmi_maglc_fd_20140514_224622.png'
IMG_PATH = '../../img/'
file_name = file_url.split(sep='/')[-1]
# download the image only if it is not cached yet
if not os.path.isfile(os.path.join(IMG_PATH, file_name)):
    wget.download(file_url, os.path.join(IMG_PATH, file_name))
img = mpimg.imread(os.path.join(IMG_PATH, file_name))
plt.figure(figsize=(12, 12))
imgplot = plt.imshow(img)


A solar flare occurs when magnetic energy that has built up in the solar atmosphere is suddenly released. Solar flares are a frequent occurrence when the Sun is active in the years around solar maximum: many solar flares can occur on a single day during this period! Around solar minimum, solar flares might occur less than once per week.

#### The classification of solar flares

Solar flares are classified as A (smallest), B, C, M or X (strongest) according to the peak flux (in watts per square metre, W/m2) of 1 to 8 Ångström X-rays near Earth, as measured by the XRS instrument on board the GOES-15 satellite, which is in geostationary orbit over the Pacific Ocean. Some (mostly stronger) solar flares can launch huge clouds of solar plasma into space, which we call a coronal mass ejection. When a coronal mass ejection arrives at Earth, it can cause a geomagnetic storm and intense auroral displays.
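This letter-plus-number notation maps directly to a peak flux value: the letter fixes a decade (A corresponds to $10^{-8}$ W/m², and each subsequent letter is ten times stronger), while the number multiplies it. A minimal sketch of this conversion (the function name is my own, not part of any library; a library version is used later in the notebook):

```python
def goes_class_to_flux(goes_class):
    """Convert a GOES class string such as 'M1.0' to peak X-ray flux in W/m^2."""
    # decade set by the class letter
    base = {'A': 1e-8, 'B': 1e-7, 'C': 1e-6, 'M': 1e-5, 'X': 1e-4}
    # the numeric part is a multiplier within the decade
    return base[goes_class[0]] * float(goes_class[1:])

goes_class_to_flux('M1.0')  # 1e-05 W/m^2, the threshold used as "major flare" below
goes_class_to_flux('C6.1')  # 6.1e-06 W/m^2
```

Note that the scale is logarithmic: an X1.0 flare is ten times stronger than an M1.0 flare and a hundred times stronger than a C1.0 flare.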

#### Solar flare prediction

Most techniques are based on the complexity of the photospheric magnetic field of the Sun's active regions. There are a large number of different characteristics that can be used to describe magnetic field complexity. Due to the many empirical assumptions made during their calculation, they are hardly reproducible. For HMI/SDO vector magnetograms, an automated active region tracking system exists, called the Spaceweather HMI Active Region Patch (SHARP). For each active region, key features called SHARP parameters are calculated and made available online. Computation of these features is based on SDO vector magnetograms.

An example of an active region is shown in the picture below:

## Data description

To define a solar flare, we only consider flares with a Geostationary Operational Environmental Satellite (GOES) X-ray flux peak magnitude above the M1.0 level. This allows us to focus only on major flares. For the purposes of this study, we define a positive event to be an active region that flares with a peak magnitude above the M1.0 level, as defined by the GOES database. A negative event is an active region that does not have such an event within a 24-hour time span. To collect active regions for the negative class, we will also gather information about all regions with an X-ray flux peak magnitude above the C1.0 level. So in our training set, the same active region could be positive at one time and negative at another. At each time moment, our target value will be 1 if there is an event with flux above the M1.0 level in the next 24 hours, and 0 otherwise.
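To make this labeling rule concrete, here is a minimal sketch on a toy event table. The column names follow the `goes_events` frame built later in the notebook; the helper name and the toy data are mine, and flux is in units of $10^{-6}$ W/m², so the M1.0 threshold is 10.0:

```python
import pandas as pd

M1_FLUX = 10.0  # the M1.0 level, in units of 1e-6 W/m^2

def label_region(obs_time, harp, events, horizon=pd.Timedelta(hours=24)):
    """Target = 1 if the region produces a flare above M1.0 within the horizon."""
    window = events[(events.harp_number == harp)
                    & (events.start_time > obs_time)
                    & (events.start_time <= obs_time + horizon)]
    return int((window.flux >= M1_FLUX).any())

# toy events for one region: an M2.0 flare (flux 20.0) and a C1.2 flare (flux 1.2)
events = pd.DataFrame({
    'harp_number': [54, 54],
    'start_time': pd.to_datetime(['2010-06-12 00:30', '2010-06-13 06:08']),
    'flux': [20.0, 1.2],
})
label_region(pd.Timestamp('2010-06-11 12:00'), 54, events)  # 1: M2.0 within 24 h
label_region(pd.Timestamp('2010-06-12 12:00'), 54, events)  # 0: only a C1.2 follows
```

The same region thus contributes positive examples before its M-class flare and negative ones afterwards, exactly as described above.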

To do that, we need to describe the complexity of each active region with features. I (like most other researchers) did it with the so-called SHARP features.

The Solar Dynamics Observatory's (SDO) Helioseismic and Magnetic Imager is the first instrument to continuously map the vector magnetic field of the Sun. The SDO takes the most data of any NASA satellite in history, approximately 2 terabytes per day, making it an ideal data source for such a problem. Using these data, we can characterize active regions on the Sun. For the time frame from 2010 May to 2018 December, we focus on 18 parameters calculated from the SHARP vector magnetic field data. They characterize various physical and geometrical qualities of the active region.

1. USFLUX is the total unsigned flux.
2. MEANGAM is the mean angle of field from radial.
3. MEANGBT is the mean gradient of total field.
4. MEANGBZ is the mean gradient of vertical field.
5. MEANGBH is the mean gradient of the horizontal field.
6. MEANJZD is the mean vertical current density.
7. TOTUSJZ is the total unsigned vertical current.
8. MEANALP is the mean characteristic twist parameter.
9. MEANJZH is the mean current helicity.
10. TOTUSJH is the total unsigned current helicity.
11. ABSNJZH is the absolute value of the net current helicity.
12. SAVNCPP is the sum of the modulus of the net current per polarity.
13. MEANPOT is the mean photospheric magnetic free energy.
14. TOTPOT is the total photospheric magnetic free energy density.
15. MEANSHR is the mean shear angle.
16. SHRGT45 is the fraction of area with shear greater than 45 degrees.
17. R_VALUE is the sum of flux near polarity inversion line.
18. AREA_ACR is the area of strong field pixels in the active region.

The following section of code initializes the start and end dates of the data set used in this study and also fetches the set of possible positive events and the mapper from NOAA active region numbers to the HARPNUMs used in our database.

### Get all info about solar flares from GOES

This part contains the code for gathering solar data. Data from two instruments are collected here: GOES and SDO. There is a special package, sunpy, for handling solar data.

In [3]:
#pip install sunpy
#pip install suds-py3
#pip install drms
from datetime import timedelta
import datetime
import sunpy
from sunpy.time import TimeRange
from sunpy.instr import goes
import numpy as np
import pandas as pd

In [4]:
DOWNLOAD = False
DATA_PATH = '../../data/solar_flares'
time_range = TimeRange('2010/06/01 00:10', '2018/12/01 00:20')
if DOWNLOAD:
    # fetch all events of class C1 and above from the GOES event list
    goes_events = goes.get_goes_event_list(time_range, 'C1')
    goes_events = pd.DataFrame(goes_events)
    goes_events.to_csv(os.path.join(DATA_PATH, 'goes_events.csv'), index=False)
else:
    # load the previously saved event list (cache file name assumed)
    goes_events = pd.read_csv(os.path.join(DATA_PATH, 'goes_events.csv'))
# region number 0 means the event was not attributed to any region
goes_events['noaa_active_region'] = goes_events['noaa_active_region'].replace(0, np.nan)
goes_events.dropna(inplace=True)
goes_events.drop(['goes_location', 'event_date', 'end_time', 'peak_time'], axis=1, inplace=True)
goes_events.start_time = goes_events.start_time.astype('datetime64[ns]')

In [5]:
goes_events.head()

Out[5]:
goes_class noaa_active_region start_time
0 M2.0 11081.0 2010-06-12 00:30:00
1 C1.0 11080.0 2010-06-12 03:57:00
2 C6.1 11081.0 2010-06-12 09:02:00
3 M1.0 11079.0 2010-06-13 05:30:00
4 C1.2 11081.0 2010-06-13 06:08:00

### Active region detection

There are different approaches to active region detection. One of them, with manual correction, is carried out every day at NOAA; active regions in this catalog have NOAA numbers. The SDO team has its own fully automated system of AR detection, and its regions are called HARPs. They also compute plenty of magnetic field complexity parameters. So I used HARP regions with their features, but the GOES flux information exists only for NOAA regions. HARP and NOAA regions do not coincide, but there is a mapping between the two catalogs. Below is the code for mapping between the HARP and NOAA regions.

In [6]:
#download the NOAA-to-HARP mapper
if os.path.isfile(os.path.join(DATA_PATH, 'all_harps_with_noaa_ars.txt')):
    num_mapper = pd.read_csv(os.path.join(DATA_PATH, 'all_harps_with_noaa_ars.txt'), sep=' ')
else:
    # the mapping file is published by JSOC (download URL assumed)
    num_mapper = pd.read_csv('http://jsoc.stanford.edu/doc/data/hmi/harpnum_to_noaa/all_harps_with_noaa_ars.txt', sep=' ')
    num_mapper.to_csv(os.path.join(DATA_PATH, 'all_harps_with_noaa_ars.txt'), sep=' ')

In [7]:
num_mapper.tail()

Out[7]:
HARPNUM NOAA_ARS
1314 7304 12721
1315 7305 12722
1316 7310 12723
1317 7312 12724
1318 7313 12725
In [8]:
def convert_noaa_to_harpnum(noaa_ar):
    """
    Convert a NOAA active region number to a HARPNUM.

    Args:
        noaa_ar: NOAA active region number
    Returns:
        The matching HARPNUM, or None if there is no match.
    """
    idx = num_mapper[num_mapper['NOAA_ARS'].str.contains(str(int(noaa_ar)))]
    return None if idx.empty else int(idx.HARPNUM.values[0])

goes_events['harp_number'] = goes_events['noaa_active_region'].apply(convert_noaa_to_harpnum)
goes_events.dropna(inplace=True)


The event class can be mapped to flux, which is continuous. This can be done with the method flareclass_to_flux from goes.

In [9]:
#GOES flare classes are better converted to flux values, using the method flareclass_to_flux from goes
goes_events['flux'] = goes_events['goes_class'].apply(lambda x: 1e06*goes.flareclass_to_flux(x).value)
goes_events.head()

Out[9]:
goes_class noaa_active_region start_time harp_number flux
0 M2.0 11081.0 2010-06-12 00:30:00 54.0 20.0
1 C1.0 11080.0 2010-06-12 03:57:00 51.0 1.0
2 C6.1 11081.0 2010-06-12 09:02:00 54.0 6.1
3 M1.0 11079.0 2010-06-13 05:30:00 49.0 10.0
4 C1.2 11081.0 2010-06-13 06:08:00 54.0 1.2

One region can produce many flares of different classes. We have more than 1300 events. Let's look at the countplot for harp_number.

In [15]:
import seaborn as sns
sns.countplot(x='harp_number', data = goes_events)

Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1194a39b0>

Data with the main features of active regions can be taken from the SDO database. There is a special package for accessing these data, drms. We will download the meta information with all the keywords using drms.

In [16]:
#list of keywords we want to download; the keywords are computed for HARP regions
#walk through all the HARP regions, download the features and save them to disk (very time consuming)
import drms
c = drms.Client()
list_keywords = ['T_REC,CRVAL1,CRLN_OBS,USFLUX,MEANGBT,MEANJZH,MEANPOT,SHRGT45,TOTUSJH,MEANGBH,MEANALP,MEANGAM,MEANGBZ,MEANJZD,TOTUSJZ,SAVNCPP,TOTPOT,MEANSHR,AREA_ACR,R_VALUE,ABSNJZH']
harp_list = pd.unique(goes_events.harp_number)
for harp in harp_list:
    str_query = f'hmi.sharp_cea_720s[{str(int(harp))}]'
    if os.path.isfile(os.path.join(DATA_PATH+'/keys_regions', str_query+'.csv')):
        print(f'Harp number {harp} already exist')
    else:
        print(f'load region with Harp number {harp}')
        keys = c.query(str_query, key=list_keywords)
        keys.to_csv(os.path.join(DATA_PATH+'/keys_regions', str_query+'.csv'))


Harp number 54.0 already exist



### Part 2. Exploratory data analysis

Now there is a CSV file for each HARP region, and we can analyze the evolution of the different parameters over time. It is believed that the complexity of the magnetic field changes before flares and that there are special patterns in the evolution of the features.

In [17]:
def plot_harp_features_flares(harp, goes_events=goes_events, DATA_PATH=DATA_PATH, feature_key='R_VALUE'):
    str_query = f'hmi.sharp_cea_720s[{str(int(harp))}]'
    # load the features saved for this HARP region
    df = pd.read_csv(os.path.join(DATA_PATH+'/keys_regions', str_query+'.csv'))
    df['T_REC'] = drms.to_datetime(df.T_REC)
    df.set_index('T_REC', inplace=True)

    # keep only the records where the region is visible on the solar disk
    is_visible = abs(df['CRVAL1']-df['CRLN_OBS']) < 60
    df = df[is_visible]

    flux = goes_events[goes_events['harp_number']==harp][['start_time','flux']].set_index('start_time')
    fig, ax1 = plt.subplots(figsize=(15,5))
    first_date = df.index.values[0]
    dates_to_show = pd.date_range(pd.Timestamp(first_date).strftime('%m/%d/%Y'), periods=14, freq='d')
    labels = dates_to_show.strftime('%b %d')
    color = 'tab:green'
    # feature evolution on the left axis, flare flux as bars on the right axis
    ax1.plot(df.index, df[feature_key], color=color)
    ax2 = ax1.twinx()
    ax2.bar(flux.index, flux.flux, width=0.05, facecolor='indianred')
    plt.setp(ax1, xticks=dates_to_show, xticklabels=labels);

In [18]:
harp = 6327
plot_harp_features_flares(harp, feature_key = 'R_VALUE')