Operational solar flare prediction model using Deep Flare Net

We developed an operational solar flare prediction model using deep neural networks, named Deep Flare Net (DeFN). DeFN issues probabilistic forecasts of solar flares in two categories (≥M-class versus <M-class events, or ≥C-class versus <C-class events) occurring in the next 24 h after observations, as well as the maximum class of flares occurring in the next 24 h. DeFN is set to run every 6 h and has been in operation since January 2019. The input database of solar observation images taken by the Solar Dynamics Observatory (SDO) is downloaded from the data archive operated by the Joint Science Operations Center (JSOC) of Stanford University. Active regions are automatically detected from magnetograms, and 79 features are extracted from each region nearly in real time using multiwavelength observation data. Flare labels are attached to the feature database, and the database is then standardized and input into DeFN for prediction. DeFN was pretrained using the datasets obtained from 2010 to 2015. The model was evaluated with the true skill statistic (TSS) and achieved predictions with TSS = 0.80 for ≥M-class flares and TSS = 0.63 for ≥C-class flares. For comparison, we evaluated the operational forecast results from January 2019 to June 2020. We found that operational DeFN forecasts achieved TSS = 0.70 (0.84) for ≥C-class flares with a probability threshold of 50 (40) %, although there were very few M-class flares during this period and we should continue monitoring the results for a longer time. Here, we adopted a chronological split to divide the database into two parts for training and testing. The chronological split appears suitable for evaluating operational models. Furthermore, we propose the use of time-series cross-validation. This procedure achieved TSS = 0.70 for ≥M-class flares and 0.59 for ≥C-class flares using the datasets obtained from 2010 to 2017.


Introduction
The mechanism of solar flares is a long-standing puzzle. Solar flares emit X-rays, highly energetic particles, and coronal mass ejections (CMEs) into interplanetary space in the heliosphere, whereby these flares become one of the origins of space weather phenomena (e.g., Schwenn et al. 2005; Fletcher et al. 2011; Liu et al. 2014; Möstl et al. 2015). The prediction of flares is essential for reducing the damage to technological infrastructure on Earth. Solar flares are triggered by a newly emerging magnetic flux or a magnetohydrodynamic instability that releases excess magnetic energy stored in the solar atmosphere (e.g., Aulanier et al. 2010; Shibata & Magara 2011; Cheung & Isobe 2014; Wang et al. 2017; Toriumi & Wang 2019). Such phenomena are monitored by the Solar Dynamics Observatory (SDO; Pesnell et al. 2012) and the Geostationary Operational Environmental Satellites (GOES), and the observation data are used for flare prediction.
Recently, the application of supervised machine learning methods, especially deep neural networks (DNNs), to solar flare prediction has been a hot topic, and their successful application in research has been reported (Huang et al. 2018; Nishizuka et al. 2018; Park et al. 2018; Chen et al. 2019; Domijan et al. 2019; Liu et al. 2019; Zheng et al. 2019; Bhattacharjee et al. 2020; Jiao et al. 2020; Li et al. 2020; Panos & Kleint 2020; Yi et al. 2020). However, there is insufficient discussion of how to make these methods available for real-time operations in space weather forecasting offices, including the methods for validation and verification of the models. Currently, new physical and geometrical (topological) features are being applied to flare prediction using machine learning (e.g., Wang et al. 2020a; Deshmukh et al. 2020), and it has been noted that training sets may be sensitive to which period in the solar cycle they are drawn from (Wang et al. 2020b).
It has been one year since we started operating our flare prediction model using DNNs, which we named Deep Flare Net (DeFN: Nishizuka et al. 2018). Here, we evaluate the prediction results during real-time operations at the NICT space weather forecasting office in Tokyo, Japan. In this paper, we introduce the operational version of DeFN in sections 2 and 3, and we show the prediction results in section 4.
We propose the use of time-series cross-validation (CV) to evaluate operational models in section 5.
We summarize our results and discuss the selection of a suitable evaluation method for models used in operational settings in section 6.


Flare Forecasting Tool in Real-Time Operation

Procedures of Operational DeFN
DeFN is designed to predict solar flares occurring in the 24 h following the observation of magnetogram images, categorized into two categories: (≥M-class and <M-class) or (≥C-class and <C-class). In the operational DeFN forecasting system, observation images are automatically downloaded, active regions (ARs) are detected, and 79 physics-based features are extracted for each region. Each feature is standardized using its average value and standard deviation and is input into the DNN model, DeFN. The output is the flare occurrence probabilities for the two categories. Finally, the maximum class of flares occurring in the following 24 h is forecast by taking the maximum probability of the forecasts.
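The per-forecast steps described above can be summarized in a short sketch. The snippet below is a minimal illustration, not the actual operational code: the standardization constants, region probabilities, and helper names are hypothetical placeholders used only to show how standardized features and per-region probabilities combine into a maximum-class forecast.

```python
import numpy as np

def standardize(features, train_mean, train_std):
    """Standardize the extracted features with the training-set mean and std."""
    return (features - train_mean) / train_std

def forecast_max_class(region_probs_ge_m, region_probs_ge_c, threshold=0.5):
    """Combine per-region probabilities into a disk-level maximum-class forecast."""
    p_m = max(region_probs_ge_m, default=0.0)   # largest P(>=M-class) over regions
    p_c = max(region_probs_ge_c, default=0.0)   # largest P(>=C-class) over regions
    if p_m >= threshold:
        return ">=M"
    if p_c >= threshold:
        return ">=C"
    return "<C"

# Toy example: standardize one region's feature vector (3 of the 79 features shown)
x = np.array([1200.0, 3.5e21, 45.0])
mu = np.array([800.0, 1.0e21, 30.0])
sigma = np.array([400.0, 2.0e21, 20.0])
print(standardize(x, mu, sigma))

# Toy example: three detected regions with made-up probabilities
print(forecast_max_class([0.05, 0.62, 0.10], [0.30, 0.90, 0.45]))  # -> ">=M"
```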
Operational DeFN was redesigned for automated real-time forecasting with operational redundancy. All the programs, written in IDL and Python, are driven by cron scripts at the prescribed forecast issuance times. There are a few differences from the original DeFN used for research, as explained in the next subsection. A generalized flow chart of operational DeFN is shown in Figure 1.

NRT Observation Data
The first difference between the development DeFN and the operational DeFN is the use of near real-time (NRT) observation data. We use line-of-sight magnetograms and vector magnetograms taken by the Helioseismic and Magnetic Imager (HMI; Scherrer et al. 2012; Schou et al. 2012; Hoeksema et al. 2014) on board SDO; ultraviolet (UV) and extreme ultraviolet (EUV) images obtained by the Atmospheric Imaging Assembly (AIA; Lemen et al. 2012) through the 1600 Å and 131 Å filters; and the full-disk integrated X-ray emission over the range of 1-8 Å observed by GOES. For visualization, we also use white-light images taken by HMI and EUV images obtained by AIA through the 304 Å and 193 Å filters. The time cadence of the vector magnetograms is 12 min, that of the line-of-sight magnetograms is 45 s, those of the 1600 Å and 131 Å filters are both 12 s, and that of GOES is less than 1 min.
The data products of SDO are provided by the Joint Science Operations Center (JSOC) of Stanford University. The HMI NRT data are generally processed and made available for transfer within 90 min after observation (Leka et al. 2018). This is why DeFN was designed to download the observation dataset from 1 h earlier. If the observation data are unavailable because of processing or transfer delays, the download target is moved back in time, from 1 to 5 h earlier, in the operational DeFN system. If no data can be found within the previous 5 h, the data are considered missing. Here, the limit of 5 h was determined by trial and error. Forecasting takes 20-40 min for each prediction; thus, it is reasonable to set the forecasting cadence to as often as once per hour. The 1 h cadence is comparable to the timescale of the evolution of the magnetic field configuration in active regions due to flux emergence or changes before and after a flare. However, DeFN started operating in the minimum phase of solar activity, so we started forecasting with a 6 h cadence instead of a 1 h cadence.
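The fallback described above (download the data from 1 h earlier, step back up to 5 h, otherwise treat the data as missing) can be sketched as follows. This is a minimal illustration under the stated assumptions; the `data_exists` check is a hypothetical placeholder for the actual availability test against JSOC.

```python
from datetime import datetime, timedelta

def find_available_data(issue_time, data_exists, max_lookback_hours=5):
    """Return the observation time to use for a forecast issued at issue_time.

    Starts 1 h before the issuance time and steps back hour by hour, up to
    5 h, as described above. `data_exists` is a caller-supplied (hypothetical)
    function that checks whether NRT data are available for a given time.
    """
    for lag in range(1, max_lookback_hours + 1):
        t_obs = issue_time - timedelta(hours=lag)
        if data_exists(t_obs):
            return t_obs
    return None   # treated as missing data; the forecast is skipped

# Usage with a dummy availability check (pretend only data older than 3 h exist):
issue = datetime(2019, 5, 6, 3, 0)
print(find_available_data(issue, lambda t: t <= issue - timedelta(hours=3)))
```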
The NRT vector magnetograms taken by HMI/SDO are used for operational forecasts, whereas the calibrated HMI 'definitive' series of vector magnetograms is used for scientific research. The NRT vector magnetograms are accessed from the data series 'hmi.Bharp_720s_nrt' with the segments 'field', 'inclination', 'azimuth', and 'ambig'. These segments contain the field strength, inclination angle, azimuth angle, and the disambiguation of the photospheric magnetic field, respectively.
Additionally, the NRT line-of-sight magnetograms are downloaded from the data series 'hmi.M_720s_nrt', and the NRT white-light images are from the 'hmi.Ic_noLimbDark_720s_nrt' (jsoc2) series. The NRT data of the AIA 131 Å, 193 Å, 304 Å, and 1600 Å filters are retrieved from the 'aia.lev1_nrt2' (jsoc2) series.
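For reference, the open-source `drms` Python client is a common way to query JSOC series such as those listed above. The sketch below is an assumption-laden illustration, not the operational downloader: the record-set syntax, segment list (taken from the text), and time are examples only, and access to NRT series may additionally require a registered JSOC export e-mail.

```python
import drms  # third-party JSOC client (pip install drms)

# Query the NRT vector-magnetogram series for one time step, asking for the
# segments named in the text. Record-set syntax and series names follow the
# description above and are illustrative only.
client = drms.Client()
keys, segs = client.query(
    "hmi.Bharp_720s_nrt[][2019.05.06_03:00:00_TAI]",
    key="HARPNUM, T_REC, NOAA_AR",
    seg="field, inclination, azimuth, ambig",
)
print(keys.head())
print(segs.head())
```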
Note that the HMI NRT vector magnetogram does not cover the full disk, in contrast to the HMI definitive series data. HMI Active Region Patches (HARPs) are automatically detected in the pipeline of HMI data processing, and the HMI NRT vector magnetogram is limited to the HARP areas plus a buffer, on which we overlay the active-region frames detected by DeFN and extract 79 physics-based features (Figure 2; see also the detection algorithms and extracted features in Nishizuka et al. (2017, 2018) and details of the HMI NRT vector magnetogram in Leka et al. (2018)). Furthermore, the correlation between the HMI NRT data and the definitive data has not been fully characterized statistically. A future task is to reveal how the difference between the HMI NRT and definitive series data affects the forecasting results. The same comments apply to the AIA NRT and definitive series data.

Implementation of Operational DeFN
Operational DeFN runs autonomously every 6 h by default, forecasting at 03:00, 09:00, 15:00, and 21:00 UT. The forecasting time of 03:00 UT (12:00 JST) was set so that the forecast is available before the daily forecasting meeting of NICT at 14:30 JST. The weights of the multilayer perceptrons of DeFN were trained with the 2010-2014 observation datasets, and we selected representative hyperparameters using the 2015 observation datasets.
For the classification problem, parameters are optimized to minimize the cross entropy loss function.
However, since the flare occurrence ratio is imbalanced, we adopted a loss function with normalizations of prior distributions. It is the sum of the weighted cross entropy:

$$L = -\sum_{n}\sum_{k} w_k\, p(y^*_{nk}) \log p(y_{nk}). \qquad (1)$$

Here, $p(y^*_{nk})$ is the initial probability of the correct labels $y^*_{nk}$, i.e., 1 or 0, whereas $p(y_{nk})$ is the estimated probability. The components of $y^*_{nk}$ are 1 or 0; thus, $p(y^*_{nk}) = y^*_{nk}$. $w_k$ is the weight of each class and is the inverse of the class occurrence ratio, i.e., $w_k = 1/r_k$, where $r_k$ is the occurrence rate of class $k$. Theoretically, the positive and negative events, i.e., whether ≥M-class flares occur or not, are predicted in the following manner, which is commonly used in machine learning:

$$\hat{y} = \begin{cases} 1 & \text{if } p(y_{n1}) \geq p_{\mathrm{th}}, \\ 0 & \text{otherwise}. \end{cases} \qquad (2)$$

Here, $\hat{y}$ is the prediction result, and the threshold $p_{\mathrm{th}}$ is usually fixed. For example, in the case of two-class classification, events with a probability greater than 50 % are output as positive. When we use the model as a probabilistic prediction model, we also tried smaller threshold values for safety in operations, although they have no obvious theoretical meaning.
Note that the loss function weights cannot be selected arbitrarily. The positive-to-negative event ratios of ≥M-class to <M-class and of ≥C-class to <C-class flares, which are called the occurrence frequency or the climatological base rate, are 1:50 and 1:4, respectively, during 2010-2015. Only when the cross entropy is weighted by the inverse ratio of positive to negative events does it become theoretically valid to output the prediction by equation (2). Therefore, we used the base rate as the weight of the cross entropy.
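A minimal sketch of this class-weighted cross entropy, with weights set to the inverse of the climatological base rate, is shown below. It is an illustration of equation (1) in plain numpy rather than the actual DeFN training code, and the sample values are made up.

```python
import numpy as np

def weighted_cross_entropy(y_true, p_pred, base_rate):
    """Class-weighted cross entropy as in equation (1).

    y_true   : (N, 2) one-hot correct labels y*_nk (1 or 0)
    p_pred   : (N, 2) predicted probabilities p(y_nk)
    base_rate: (2,)  climatological occurrence rate of each class,
               e.g. [1/51, 50/51] for >=M versus <M during 2010-2015
    The weight of each class is the inverse of its occurrence rate.
    """
    w = 1.0 / np.asarray(base_rate)                  # w_k = 1 / r_k
    eps = 1e-12                                      # avoid log(0)
    return -np.mean(np.sum(w * y_true * np.log(p_pred + eps), axis=1))

# Toy example with two samples (first column = >=M-class, second = <M-class):
y = np.array([[1, 0], [0, 1]], dtype=float)
p = np.array([[0.7, 0.3], [0.1, 0.9]], dtype=float)
print(weighted_cross_entropy(y, p, base_rate=[1 / 51, 50 / 51]))
```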
The DNN model of the operational DeFN was developed as in Nishizuka et al. (2018). Because the full HMI and AIA datasets obtained from 2010 to 2015 were too large to save and analyze, the cadence was reduced to 1 h, although in general a larger amount of data is useful for better predictions.
The time-series CV is stricter than a K-fold CV on data split by active region. It might be true that a K-fold CV on data split by active region can also prevent data from a single active region from being used in both training and testing (e.g., Bobra & Couvidat 2015). However, a K-fold CV on data split by active region allows the training set to contain future samples from different active regions. This may affect the prediction results when there is long-term variation of solar activity. Moreover, the number of active regions that produced X-class and M-class flares is not large, so a K-fold CV on data split by active region may be biased and unequal.
Indeed, operational solar flare prediction is performed under a very strict condition in which no future data are available. Our aim is not to reject a K-fold CV on data split by active region; rather, it is to discuss more appropriate CV schemes for an operational setting.

Graphical Output
The graphical output is automatically generated and shown on a website (Figure 3). The website was designed to be easy to understand for professional space weather forecasters, who are often not scientists.
Prediction results for both the full-disk and region-by-region images are shown on the website, and the risk level is indicated by one of three marks: "Danger flares", "Warning", and "Quiet". Images are updated every 6 h as new data are downloaded. Details of the DeFN website are described below:
• Solar full-disk images and detected ARs: Operational DeFN is optimized for forecasting with the default probability threshold of 50 %.
That is, operational DeFN forecasts flares if the real occurrence probability, which is forecast by the non-weighted cross entropy loss function, is greater than the climatological event rate, and it does not forecast flares if the real occurrence probability is less than the climatological event rate (see also Nishizuka et al. 2020). Therefore, the normalized forecasted probability of ≥M-class flares sometimes becomes larger than that of ≥C-class flares.
• Full-disk probability forecasts and alert marks: The full-disk flare occurrence probability of ≥M-class flares, $P_{FD}$, is calculated as

$$P_{FD} = 1 - \prod_{AR_i \in S} (1 - P_i),$$

where $S$ is the set of ARs on the disk, $i$ is the index of each AR on the disk, $AR_i$ is a member of the set $S$, and $P_i$ is the probabilistic forecast at each $AR_i$ (Leka et al. 2018). The risk level is indicated by a mark based on $P_{FD}$ and falls into three categories: "Danger flares" ($P_{FD} \geq 80$ %), "Warning" ($P_{FD} \geq 50$ %), and "Quiet" ($P_{FD} < 50$ %). This is analogous to weather forecasting, e.g., sunny, cloudy, and rainy. A minimal sketch of this calculation is given after this list.
• List of comments and remarks: Forecasted probabilities (percentages), comments, and remarks are summarized in a list.
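The sketch below illustrates the full-disk probability and alert-mark assignment described in the list above, assuming the independent-region combination $P_{FD} = 1 - \prod_i (1 - P_i)$ reconstructed above; the region probabilities are made-up numbers.

```python
import numpy as np

def full_disk_probability(region_probs):
    """Combine per-region >=M-class probabilities P_i into a full-disk
    probability P_FD, assuming independent regions."""
    p = np.asarray(region_probs, dtype=float)
    return 1.0 - np.prod(1.0 - p)

def alert_mark(p_fd):
    """Map the full-disk probability to the three risk-level marks."""
    if p_fd >= 0.80:
        return "Danger flares"
    if p_fd >= 0.50:
        return "Warning"
    return "Quiet"

# Example with three active regions (illustrative probabilities):
p_fd = full_disk_probability([0.40, 0.30, 0.10])
print(round(p_fd, 2), alert_mark(p_fd))   # -> 0.62 Warning
```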

Details of Operations and Redundancy
Operational DeFN has provided stable forecasts since January 2019. In this subsection, we explain the redundancy and operational details of operational DeFN.
• Forecast outages: A forecast gives the category with the maximum flare probability in each of the categories for the 24 h following the forecast issue time. A forecast is normally issued every 6 h. If problems occur when downloading or processing data, the forecast is skipped and handled as a forecast failure.
• Data outages of SDO/HMI and AIA: When there are delays in HMI/SDO data processing, no applicable NRT data are available for a forecast. In this case, the NRT data to download are moved back in time, from 1 to 5 h earlier. In such a case, the forecasting target window changes from 24 h to 25-29 h, although operational DeFN is not retrained. If no data can be found within the previous 5 h, the "no data" value is assigned and the forecast is skipped.
• No sunspots or ARs with strong magnetic fields on the disk: If no sunspots or ARs are detected with the threshold of 140 G on the disk image of the line-of-sight magnetogram, feature extraction is skipped, a forecast of "no flare" with a probability of 1 % is issued, and the "no sunspot" value is assigned (a minimal sketch of this check is given after this list).
• Forecasts at ARs near/across the limb: DeFN is currently not applicable to limb events. If an AR is detected across the limb, it is excluded from the forecast targets.
• Retraining: DeFN can be retrained on demand, and a newly trained model can be used for forecasting. Currently, the pretrained model is fixed and has not been changed so far.
• Alternative data input after the SDO era (option): Since DeFN is designed to detect ARs and extract features by itself, it can be revised and trained to use other space- and ground-based observation data, even when SDO data are no longer available.
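The "no sunspot" branch in the list above can be sketched as follows. The simple 140 G check on the line-of-sight magnetogram is a deliberately crude stand-in for the actual detection algorithm of Nishizuka et al. (2017); the magnetogram and return values are illustrative only.

```python
import numpy as np

def has_strong_field_region(los_magnetogram_gauss, threshold_g=140.0):
    """Return True if any pixel of the line-of-sight magnetogram exceeds
    the 140 G detection threshold (in absolute value)."""
    return bool(np.any(np.abs(los_magnetogram_gauss) >= threshold_g))

def issue_forecast(los_magnetogram_gauss):
    """If no strong-field region is present, issue the 'no flare' forecast
    with a 1 % probability, as described in the list above."""
    if not has_strong_field_region(los_magnetogram_gauss):
        return {"forecast": "no flare", "probability": 0.01, "flag": "no sunspot"}
    return {"forecast": "run feature extraction and DeFN", "probability": None, "flag": None}

# Toy quiet-Sun magnetogram (pixel values in gauss):
quiet_sun = np.random.normal(0.0, 10.0, size=(512, 512))
print(issue_forecast(quiet_sun)["forecast"])   # almost certainly 'no flare'
```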

Operational Benchmark
The purpose of machine-learning techniques is to maximize performance on unseen data, which is called generalization performance. Because generalization performance is hard to measure, it is usually approximated by test-set performance, where there is no overlap between the training and test sets.
On the other hand, as the community continues to use a fixed test set, the performance of newly proposed models will appear to improve year by year. In reality, generalization performance is not constantly improving; rather, there will be more models that are effective only for that test set. This is partly because models with lower performance than the state of the art are not reported. In other words, there are more and more models that are not necessarily valid for unseen datasets. It is essentially indistinguishable whether an improvement is due to better generalization performance or to a method that is effective only for the test set.
The above facts are well known in the machine learning community, and evaluation conditions are mainly divided into two types: basic and strict. Under the strict evaluation condition, an independent evaluation body evaluates each model using the test set only once, and the test set is not published to the model builders (see, e.g., Ardila et al. 2019). Solar flare prediction is, by its nature, a prediction of future solar activity using a present observation dataset, and the only data available to researchers are past data. This is consistent with the strict evaluation condition in the machine learning community.
In this section, we evaluate our operational forecasting results; we call this the "operational benchmark" in this paper. In the machine learning community, a benchmark using a fixed test set is used only for basic benchmark tests. The basic approach is simple but is known to be insufficient, because no one can guarantee that the test set is used only once. In strict machine learning benchmarks, evaluation with a completely unseen test set is required: only the organizer can see this test set, which cannot be seen by individual researchers. This is because, if researchers use the test set many times, they implicitly tend to select models that are effective only for that fixed test set.
We think that the evaluation methods for operational solar flare prediction models are not limited to evaluations using a fixed test set. However, this paper does not deny the value of performance evaluation using a fixed test set. The purpose of this paper is to show that operational evaluation is important. From a fairness perspective, the strict benchmarking approach takes precedence over the basic approach. Our operational evaluation is based on the strict benchmarking approach. We did not retrain our model after the deployment of our system.

Forecast Results and Evaluation
We evaluated the forecast results from January 2019 to June 2020, when operational DeFN was running in real time. During this period, 24 C-class flares and one M-class flare were observed. The M-class flare was observed on 6 May 2019 as M1.0; it was originally reported as C9.9 and later corrected to M1.0. The forecast results are shown in Table 2. Each contingency table shows the prediction results for ≥M-class and ≥C-class flares. Operational DeFN was originally trained with a probability threshold of 50 % to decide the classification, but in operations, users can change it according to their purposes. In Table 2, we show three cases for ≥M-class and ≥C-class predictions using different probability thresholds, namely, 50 %, 45 %, and 40 %, for reference.
Each skill score can be computed from the items shown in the contingency tables, but not vice versa. This is a well-known fact: no matter how many skill scores are shown, they do not carry more information than one contingency table. According to Table 2, flare occurrence was very rare and imbalanced in the solar minimum phase.
Most of the forecasts are true negatives. When we decrease the probability threshold, the number of forecast events increases. We evaluated our results with the four verification metrics in Table 3: accuracy, TSS, false alarm ratio (FAR), and Heidke skill score (HSS) (Murphy 1993; Barnes et al. 2009; Kubo et al. 2017). The trends of the contingency tables are similar to those evaluated in the model development phase (Table 2). However, there are two differences. First, the data used were the NRT data, whereas the definitive series was used for development; in this case, however, there was negligible difference between them. Second, the evaluation methods are different. The operational DeFN was evaluated on the actual data from 2019 to 2020, whereas the development model was validated with the 2010-2014 dataset and tested with the 2015 dataset. It appears that the chronological split provides more suitable evaluation results for operations than the common methods, namely, shuffle-and-split CV and K-fold CV.
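For reference, the four verification metrics can be computed directly from a 2x2 contingency table using the standard definitions below; the numbers in the example are made up purely to illustrate the calculation and are not the values in Tables 2 and 3.

```python
def skill_scores(tp, fn, fp, tn):
    """Verification metrics computed from a 2x2 contingency table.

    tp: forecast yes / observed yes   fn: forecast no / observed yes
    fp: forecast yes / observed no    tn: forecast no / observed no
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    tss = tp / (tp + fn) - fp / (fp + tn)                 # true skill statistic
    far = fp / (tp + fp) if (tp + fp) else float("nan")   # false alarm ratio
    expected = ((tp + fn) * (tp + fp) + (tn + fn) * (tn + fp)) / n
    hss = (tp + tn - expected) / (n - expected)           # Heidke skill score
    return {"accuracy": accuracy, "TSS": tss, "FAR": far, "HSS": hss}

# Made-up contingency table, for illustration only:
print(skill_scores(tp=8, fn=3, fp=5, tn=184))
```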

Time-series CV
Here, we propose the use of time-series CV for the evaluation of operational forecasting models. In previous papers on flare prediction, we used hold-out CV, where a subset of the data split chronologically was reserved for validation and testing, rather than the naïve K-fold CV. This is because care is needed when splitting time-series data to prevent data leakage (Nishizuka et al. 2018). To accurately evaluate prediction models in an operational setting, the training data must not include data from events that occur chronologically after the events used for testing.
The time-series CV is illustrated in Figure 4. In this procedure, there is a series of testing datasets, each consisting of a set of observations and used to estimate the prediction error. The corresponding training dataset consists only of observations that occurred prior to the observations that form the testing dataset and is used for parameter tuning. Thus, the model is never tested on data that pre-date the training set. Furthermore, the training dataset is divided into training and validation datasets. The model prediction accuracy is calculated by averaging over the testing datasets. This procedure is called rolling-forecasting-origin-based CV (Tashman 2000). In this paper, we call it time-series CV; it provides an almost unbiased estimate of the true error (Varma & Simon 2006).
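A minimal sketch of the rolling-forecasting-origin splits over yearly datasets is given below. The split boundaries are illustrative; the exact training/validation/testing years used in Table 4 may differ.

```python
def time_series_cv_splits(years, n_train_min=2):
    """Rolling-forecasting-origin splits over yearly datasets.

    Each split trains on all years before the validation year, validates on
    the next year, and tests on the year after that, so no future data leak
    into training. `n_train_min` is an illustrative minimum training length.
    """
    splits = []
    for i in range(n_train_min, len(years) - 1):
        splits.append({
            "train": years[:i],
            "validation": [years[i]],
            "test": [years[i + 1]],
        })
    return splits

for s in time_series_cv_splits(list(range(2010, 2018))):
    print(s)
# e.g. {'train': [2010, 2011], 'validation': [2012], 'test': [2013]} ... up to test=[2017]
```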
Note that time-series CV has the following advantages: (i) it is the standard validation scheme in time-series prediction; (ii) a single chronological split does not always reflect a low generalization error (Bishop 2006); in other words, the trained model is not guaranteed to work for an unseen test set. To avoid this, time-series CV applies multiple chronological splits. The ability to correctly predict new examples that differ from those used for training is known as generalization performance (Bishop 2006). Therefore, time-series CV is more generic and appropriate.
The evaluation results obtained by time-series CV using the 2010-2017 datasets are summarized in Table 4. The datasets were chronologically split to form the training, validation, and testing datasets. TSS is largest with the 2010-2014 datasets for training, the 2015 dataset for validation, and the 2016 dataset for testing. This is probably because it is not possible to obtain a reliable forecast based on the small training dataset obtained from 2010 to 2012. By averaging over the five testing datasets, we found that TSS is 0.70 for ≥M-class flares and 0.59 for ≥C-class flares. This procedure will be even more suitable for an observation dataset covering a longer time period.

Summary and Discussion
We developed an operational flare prediction model using DNNs, based on the research version of DeFN (Nishizuka et al. 2018). When we compared the evaluation results, we observed no significant difference between the pretrained and operational results. This means that, at least during the period from January 2019 to June 2020, the difference between NRT and definitive series science data did not greatly affect the forecasts. We found that the TSS of 0.63 for ≥C-class flares obtained with the pretrained model was maintained and even increased to 0.70 (0.83) for operational forecasts with a probability threshold of 50 (40) %. This suggests that the chronological split is more suitable for the training and validation of an operational model than shuffle-and-split CV.
Here, we discuss how to train and evaluate machine learning models for operational forecasting. For an exact comparison, it is desirable to use the same datasets among participants. If this is not possible, there are three points that require attention.

(i) Observation Database: The ratio of positive to negative events should not be artificially changed, and datasets should not be artificially selected. The positive-to-negative ratio should follow the climatological event rate and be kept natural. This is because some metrics, especially HSS, are affected by controlling the positive-to-negative event ratio of the datasets, which results in a difference from the operational evaluations.
For operational evaluations, it is also desirable to include ARs near the limb, although they are excluded in most papers because the magnetogram values there are unreliable owing to the projection effect. Currently, limb flares are not considered in machine learning models, but they will also need to be considered in the near future, using GOES X-ray statistics as in human forecasting or magnetograms reproduced from STEREO EUV images (Kim et al. 2019).
(ii) Datasets for Training and Testing: We recommend that a chronological split or time-series CV be used for the training and evaluation of operational models. Although K-fold CV using random shuffling is common in solar flare prediction, it is problematic for a time-series dataset divided into training and testing sets when the variation between neighboring samples is very small, e.g., in the slow time evolution of the magnetic field.
If two neighboring, very similar samples are divided between the training and testing sets, the model becomes biased to overpredict flares. It might be true that a K-fold CV on data split by active region can also prevent data from a single active region from being used in both training and testing. However, a K-fold CV on data split by active region allows the training set to contain future samples from different active regions. Therefore, from the point of view of generalization performance, a time-series CV is stricter and more suitable for operational evaluation. A minimal sketch contrasting the two splitting schemes is given below.
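The sketch below contrasts the two schemes with toy samples: a K-fold-style split by active region can place chronologically later samples in the training set, whereas a chronological split cannot. The sample times and AR labels are made up for illustration.

```python
from datetime import datetime

def training_contains_future(train_times, test_times):
    """True if any training sample is observed after the earliest test sample,
    i.e. the training set 'sees the future' relative to the test period."""
    return max(train_times) > min(test_times)

# Toy samples: (observation time, active-region id)
samples = [
    (datetime(2014, 1, 1), "AR1"), (datetime(2015, 6, 1), "AR2"),
    (datetime(2016, 3, 1), "AR3"), (datetime(2017, 9, 1), "AR4"),
]

# Split by active region: hold out AR2 for testing.
train = [t for t, ar in samples if ar != "AR2"]
test = [t for t, ar in samples if ar == "AR2"]
print(training_contains_future(train, test))   # True: 2016/2017 data train a 2015 test

# Chronological (time-series) split: test only on the latest year.
train = [t for t, ar in samples if t < datetime(2017, 1, 1)]
test = [t for t, ar in samples if t >= datetime(2017, 1, 1)]
print(training_contains_future(train, test))   # False
```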
(iii) Selection of Metrics: The ranking of models is easily affected by the selection of the metric. Depending on the purpose, users should select their preferred model by looking at the contingency tables and skill scores of each model. With the understanding that each skill score evaluates one aspect of performance, verification methods should be discussed in the space weather community (see also Pagano et al. 2019; Cinto et al. 2020).
In this paper, we showed contingency tables of our prediction results. No matter how many skill scores are shown, they do not carry more information than one contingency table. We evaluated our prediction results as a deterministic forecasting model. The ROC curve and the reliability diagram, which are shown in Barnes et al. (2016) and Leka et al. (2019), can also be reproduced from the contingency tables if they are related to the deterministic forecast.
We demonstrated the performance of a machine learning model in an operational flare forecasting scenario. The same methods and discussion of prediction using machine learning algorithms can be applied to other forecasting models of space weather in the magnetosphere and ionosphere. Our future aim is to extend our model to predicting CMEs and social impacts on Earth by extending our database to include geoeffective phenomena and technological infrastructures.

Availability of data and materials
The code is available at https://github.com/komeisugiura/defn18. In the README file, we explain the architecture and selected hyperparameters. The feature database of DeFN is available at the NICT world data center (http://wdc.nict.go.jp/IONO/wdc/). The SDO data are available from the SDO data center (https://sdo.gsfc.nasa.gov/data/) and JSOC (https://jsoc.stanford.edu/). The GOES data are available at https://services.swpc.noaa.gov/json/goes/.

Competing interests
The authors declare that they have no competing interests.