In this study, different ML approaches are implemented for tropospheric profiling that require no external information about temperature. ML refers to algorithms that are capable of improving their performance based on past experience (Michie et al. 1994). For this purpose, I tested ANN and RF models with two separate inputs, bending angle and refractivity, which resulted in four trained models. The whole process involved the following steps: (1) data acquisition, (2) preprocessing, (3) hyperparameter tuning, (4) training and testing the models, and finally, (5) validation with RAOBs.
Artificial Neural Network
ANN is a neurologically inspired ML algorithm that reflects the behavior of the network of neurons in the human brain (Hassoun 1995). ANNs can deal with highly nonlinear problems and learn directly from virtually any kind of data. Owing to this flexibility and good performance, ANNs have found application in a wide variety of fields, such as image classification, speech recognition, risk management, and weather forecasting. One of the most commonly used ANN topologies is the feedforward multilayer perceptron (MLP), which was employed in the present study. The MLP is a supervised network that transforms input data into output based on experience gained during training on a data set (Gardner and Dorling 1998). The model typically consists of three types of layers: an input layer, one or more hidden layers, and an output layer, interconnected by multiple fundamental processing units called neurons or nodes. The number of hidden layers and the number of nodes in these layers are arbitrary and depend on the complexity of the problem and the amount of available data. An example of the MLP architecture used in this study is presented in Fig. 1. In a fully connected MLP network, each neuron in a given layer is connected to every neuron in the adjacent layer, and the strength of a particular connection is expressed by a numerical weight determined during training. The learning process is usually performed iteratively using the backpropagation algorithm. The input data are repeatedly fed into the neural network, multiplied by the connection weights, summed up, and passed to the next layer. Eventually, in the last layer, the model's error is estimated from the differences between the predicted and true outputs. In the next step, the calculated error is fed back and used to adjust the connection weights, which minimizes the model's error and brings the outputs closer to the targets.
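As an illustration of the MLP described above, the following sketch trains a small feedforward network on synthetic data with scikit-learn's `MLPRegressor`; the layer size, toy data, and solver are illustrative assumptions, not the configuration used in this study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy data: 8 input features, one continuous target (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 8))
y = X @ rng.uniform(size=8) + 0.1 * rng.standard_normal(500)

# One hidden layer of 50 neurons; the connection weights are adjusted
# iteratively from the prediction error during training
mlp = MLPRegressor(hidden_layer_sizes=(50,), activation="relu",
                   solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(round(mlp.score(X, y), 3))  # coefficient of determination on training data
```

The `n_layers_` attribute of the fitted model counts all three layer types (input, hidden, output) described in the text.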
Random Forest regression
RF is a nonparametric statistical learning model based on a large ensemble of decision trees. First demonstrated by Breiman (2001), RF is nowadays one of the most widely and successfully used machine learning algorithms for both classification and regression tasks. However, RF has not yet been fully explored in GNSS meteorology, especially in RO tropospheric profiling (Łoś et al. 2020). RF belongs to the group of bootstrap aggregation, or bagging, algorithms. Bagging refers to random subsampling, with replacement, of the original training data set and features to generate multiple base learning models. In RF, decision trees serve as the base models; each tree is constructed independently for its combination of selected variables, and there is no interaction between individual trees. The final RF result is calculated as the mean of the outputs of all individual trees. Bagging has been shown to lower variance and improve stability and, in contrast to a single decision tree, which is sensitive to the data set used, to prevent overfitting (Breiman 1996; Ali et al. 2012).
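The averaging of independently grown trees described above can be demonstrated with scikit-learn's `RandomForestRegressor`; the data and settings below are toy assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data (illustrative only)
rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 6))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(400)

# Each of the 100 trees is grown on a bootstrap resample of (X, y)
rf = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=1)
rf.fit(X, y)

# The ensemble output equals the mean of the individual tree predictions
tree_mean = np.mean([tree.predict(X[:5]) for tree in rf.estimators_], axis=0)
print(np.allclose(tree_mean, rf.predict(X[:5])))
```

The final `allclose` check verifies the averaging property stated in the text: the forest's prediction is exactly the mean over its trees.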
Step 1: data acquisition
Input: RO bending angle and refractivity profiles
Near-real-time RO profiles from the COSMIC-2 constellation, gathered in the latitude band 45°S–45°N between October 1, 2019, and December 31, 2020, were the main products used in this study (Fig. 2). COSMIC-2 is a follow-on to the greatly beneficial COSMIC program, led by the Taiwanese National Space Organization, the U.S. National Oceanic and Atmospheric Administration, and other agencies (Schreiner et al. 2020). COSMIC-2 consists of a set of six satellites, which were successfully launched into low-inclination orbits on June 25, 2019. The satellites produce more than 4000 atmospheric soundings a day within the ±50° latitude band, providing better insight into weather and climate. RO data at various processing levels are published in near real time, by 0200 UTC the following day, and are freely available from the COSMIC Data Analysis and Archive Center (CDAAC) website (UCAR COSMIC Program 2019). I focused on the atmPrf and wetPf2 level 2 products. The first contains vertical profiles of bending angle and refractivity, which constitute the inputs for the ML models. The wetPf2 files include meteorological profiles of temperature, pressure, and water vapor with 100 m vertical resolution, derived through the 1DVar approach with the ECMWF analysis as a background. In the 1DVar approach, a cost function J is minimized to estimate, with maximum likelihood, the optimal tropospheric state profile x:
$$J\left( x \right) = \left( h\left( x \right) - y^{0} \right)^{T} \left( O + F \right)^{-1} \left( h\left( x \right) - y^{0} \right) + \left( x - x^{b} \right)^{T} B^{-1} \left( x - x^{b} \right)$$
(1)
where y^{0} is the observation vector, h is the observation operator, and x^{b} is the background meteorological profile. The matrices B, O, and F are the background, observation, and observation operator error covariance matrices, respectively (Poli et al. 2002). These so-called wet profiles, together with RAOBs, were then used to assess the accuracy of the ANN/RF retrievals. To balance the number of observations required to train the models against the need to profile the troposphere as close to the surface as possible, only RO profiles that reached an altitude of 1 km or below were considered in this study.
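For concreteness, the cost function of equation (1) can be evaluated numerically. The sketch below uses a toy two-element state, a linear observation operator, and diagonal error covariance matrices, all of which are illustrative assumptions rather than the operational 1DVar setup.

```python
import numpy as np

def cost(x, y_obs, h, x_b, B, OF):
    """1DVar cost: (h(x)-y)^T (O+F)^-1 (h(x)-y) + (x-x_b)^T B^-1 (x-x_b)."""
    r = h(x) - y_obs          # misfit to the observations
    d = x - x_b               # departure from the background
    return r @ np.linalg.solve(OF, r) + d @ np.linalg.solve(B, d)

# Toy setup: linear observation operator (assumption for illustration)
H = np.array([[1.0, 0.0], [0.0, 2.0]])
h = lambda x: H @ x
x_b = np.array([1.0, 1.0])            # background profile
y_obs = h(np.array([1.2, 0.9]))       # synthetic observations
B = 0.5 * np.eye(2)                   # background error covariance
OF = 0.1 * np.eye(2)                  # observation + operator errors (O + F)

# The background alone does not fit the observations, so J(x_b) > 0;
# the 1DVar solution is the state that minimizes this trade-off.
print(cost(x_b, y_obs, h, x_b, B, OF))
```

Evaluating the cost at the state that generated the observations leaves only the background term, illustrating how the two terms pull the solution in different directions.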
Target: ERA5 meteorological profiles
The target variable consists of the meteorological profiles of temperature, pressure, and water vapor partial pressure. Specifically, meteorological data from the ERA5 reanalysis were employed in the training and testing processes. ERA5 is the most recent global atmospheric reanalysis produced by the ECMWF and replaces the previously used and very popular ERA-Interim reanalysis (Hersbach et al. 2020). The major developments encompass higher 1 h temporal and 0.25° (31 km) spatial resolution, improvements in the Data Assimilation (DA) system, and rapid preliminary availability within 5 days. The DA system implemented in ERA5 is based on hybrid incremental 4DVar with 12 h windows and assimilates more than 200 types of conventional meteorological data and satellite observations. Meteorological data are available on 37 pressure levels, with a resolution of 25 hPa between 1000 and 750 hPa, 50 hPa down to the 250 hPa level, and 16 irregularly spaced levels above, with the top level at 1 hPa. However, ERA5 does not provide the water vapor partial pressure V_{p}, which constitutes part of the ML output. Instead, it must be calculated from the pressure P and specific humidity q (Wallace and Hobbs 2006):
$$V_{p} = \frac{q \cdot P}{\frac{M_{w}}{M_{d}} + \left( 1 - \frac{M_{w}}{M_{d}} \right) \cdot q}$$
(2)
where M_{w} = 18.0152 g mol^{−1} and M_{d} = 28.9644 g mol^{−1} are the molar masses of water vapor and dry air, respectively.
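Equation (2) translates directly into code; pressure is in hPa, specific humidity in kg/kg, and the example values are hypothetical.

```python
# Constants from the text: molar masses of water vapor and dry air
M_W = 18.0152    # g/mol
M_D = 28.9644    # g/mol
EPS = M_W / M_D  # ratio M_w/M_d, ~0.622

def vapor_pressure(q, P):
    """Water vapor partial pressure (same unit as P) per equation (2)."""
    return q * P / (EPS + (1.0 - EPS) * q)

# Hypothetical example: q = 10 g/kg of near-surface air at P = 1000 hPa
print(round(vapor_pressure(0.010, 1000.0), 2))  # ~16 hPa
```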
Radiosonde observations
In this study, RAOBs served as an additional validation data source. Meteorological profiles from radiosonde stations located up to 70 km from the mean RO tangent point were downloaded from the National Oceanic and Atmospheric Administration Earth System Research Laboratory (NOAA/ESRL) radiosonde database (Govett 2020). The database provides temperature, pressure, and dew point depression measurements at no fewer than 21 mandatory constant-pressure levels. The dew point depression T_{dd} can be transformed into the water vapor partial pressure based on the Clausius–Clapeyron equation (Perry 1950):
$$V_{p} = e_{s0} \cdot e^{\frac{l_{v}}{R_{v}}\left( \frac{1}{T_{0}} - \frac{1}{T_{d}} \right)}$$
(3)
where e_{s0} = 6.11 hPa is the reference saturation vapor pressure, l_{v} = 2.5·10^{6} J kg^{−1} denotes the latent heat of vaporization of water, R_{v} = 461.525 J K^{−1} kg^{−1} is the gas constant for water vapor, T_{0} = 273.15 K is the reference temperature, and T_{d} = T − T_{dd} stands for the dew point temperature in K.
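Equation (3) can be sketched as follows; the constants come from the text, and the example input is hypothetical.

```python
import math

E_S0 = 6.11      # reference saturation vapor pressure, hPa
L_V = 2.5e6      # latent heat of vaporization of water, J/kg
R_V = 461.525    # gas constant for water vapor, J/(K kg)
T_0 = 273.15     # reference temperature, K

def vapor_pressure_from_tdd(T, T_dd):
    """V_p (hPa) from temperature T (K) and dew point depression T_dd (K)."""
    T_d = T - T_dd  # dew point temperature
    return E_S0 * math.exp((L_V / R_V) * (1.0 / T_0 - 1.0 / T_d))

# Saturated air at 0 degC (T_dd = 0) recovers the reference value 6.11 hPa
print(vapor_pressure_from_tdd(273.15, 0.0))
```

A larger dew point depression means drier air, so V_p decreases monotonically with T_dd at fixed temperature.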
Step 2: preprocessing
After acquiring the required data and before training, it was necessary to perform preprocessing, which included vertical and 3D interpolation of the RO and ERA5 profiles, splitting the data into training and testing subsets, and finally, collocation of the RO and radiosonde observations and their vertical interpolation.
The models' inputs comprised the latitude, month, and hour of the RO event as well as the vertical profiles of bending angle or refractivity, linearly interpolated to 100 m resolution between 1 and 20 km, resulting in 190 fixed levels. The upper boundary of 20 km was chosen as a compromise between computational speed, model complexity, and the altitude above which water vapor becomes negligible.
To derive the target profiles of temperature, pressure, and water vapor at the RO location, 3D (horizontal and vertical) interpolation was applied to the ERA5 meteorological data. Since ERA5 offers a high 1 h temporal resolution, temporal interpolation to the time of the RO observation was omitted. First, the vertical spacing of the ERA5 data was adjusted to the RO altitude grid (100 m resolution within 1–20 km altitude). I applied different interpolation strategies depending on the meteorological parameter. The temperature was interpolated linearly and the water vapor partial pressure exponentially, while the pressure at a particular height h was calculated from the pressure P_{i} of the adjacent layer i at height h_{i} (Boehm and Schuh 2004; Wallace and Hobbs 2006):
$$P = P_{i} \cdot e^{-\frac{\left( h - h_{i} \right) \cdot g_{m}}{R_{d} \cdot T_{v}}}$$
(4)
where R_{d} is the gas constant for dry air, g_{m} denotes the acceleration due to gravity, which can be calculated as a function of latitude and height (Kraus 2007), and T_{v} stands for the virtual temperature, i.e., the temperature that dry air would need in order to have the same density as the moist air at the same pressure:
$$T_{v} = \frac{T \cdot P}{P - \left( 1 - \frac{M_{w}}{M_{d}} \right) \cdot V_{p}}$$
(5)
Afterward, the vertically uniform ERA5 profiles were interpolated horizontally to the mean RO tangent point position using bilinear interpolation.
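The vertical adjustment of equations (4) and (5) can be sketched as below. The constant value of g_m is a simplification (the study computes it as a function of latitude and height), and the input values are hypothetical.

```python
import math

R_D = 287.058             # gas constant for dry air, J/(K kg) (assumed value)
G_M = 9.80665             # gravity, m/s^2 (simplified to a constant here)
EPS = 18.0152 / 28.9644   # M_w / M_d from equation (2)

def virtual_temperature(T, P, V_p):
    """Equation (5): T_v (K) from temperature T (K), pressure P and vapor pressure V_p."""
    return T * P / (P - (1.0 - EPS) * V_p)

def pressure_at(h, h_i, P_i, T_v):
    """Equation (4): hydrostatic extrapolation of P_i (hPa) from height h_i to h (m)."""
    return P_i * math.exp(-(h - h_i) * G_M / (R_D * T_v))

# Moist air is less dense than dry air, so T_v exceeds the actual temperature
T_v = virtual_temperature(288.15, 1000.0, 15.0)
print(T_v > 288.15, round(pressure_at(100.0, 0.0, 1000.0, T_v), 1))
```

The 100 m upward step reduces the pressure by roughly 12 hPa, consistent with the hydrostatic scale height implied by equation (4).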
In the next step, the pairs of input RO and target ERA5 profiles were subdivided into training and testing data sets. Since the total number of available RO observations exceeded 1.75 million, subsampling was performed to reduce the computational cost and satisfy memory constraints during the estimation of model parameters. 10,000 random samples for each month between October 2019 and September 2020, giving 120,000 profiles in total, entered the learning set, and 30,000 random observations between October and December 2020 were used for model testing (Fig. 2). It should be noted that the Earth's topography was one of the factors constraining the ability of the RO technique to sound the atmosphere below the 1 km threshold set in this study. Therefore, most (100,150; 83.5%) of the RO profiles used for training were located over water, and the remaining 19,850 (16.5%) observations occurred over land. Similarly, the test data set consisted of 24,270 (80.9%) and 5730 (19.1%) samples over the oceans and land, respectively. The RO events included in the testing data set were collocated with nearby RAOBs using a 2 h time window and a maximum spatial distance of 70 km between the mean tangent point of the RO profile and the location of the radiosonde station. This resulted in 477 collocated RAOB-RO pairs, whose distribution is presented in Fig. 2b.
Originally, RAOBs are provided at constant pressure levels, which correspond to different geometric heights depending on the current weather conditions. Hence, to evaluate the ML model performance, it was first necessary to determine the radiosonde data at common altitudes. Since the vertical resolution of radiosonde measurements is relatively sparse, I interpolated the meteorological data to 10 fixed height levels of 1.5, 3.1, 5.8, 7.5, 9.6, 10.9, 12.4, 14.2, 16.6, and 18.7 km, which approximately correspond to the mandatory pressure levels of 850, 700, 500, 400, 300, 250, 200, 150, 100, and 70 hPa.
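A minimal sketch of this vertical interpolation, assuming a hypothetical radiosonde temperature profile (heights in km, temperatures in K):

```python
import numpy as np

# The 10 fixed validation heights listed in the text, in km
FIXED_LEVELS = np.array([1.5, 3.1, 5.8, 7.5, 9.6,
                         10.9, 12.4, 14.2, 16.6, 18.7])

# Hypothetical radiosonde report: geometric heights and temperatures
raob_h = np.array([0.1, 1.4, 3.0, 5.9, 9.2, 12.0, 16.1, 20.5])   # km
raob_T = np.array([298.0, 290.0, 280.0, 262.0, 230.0, 215.0,
                   205.0, 210.0])                                 # K

# Linear interpolation in height onto the common grid
T_fixed = np.interp(FIXED_LEVELS, raob_h, raob_T)
print(T_fixed.shape)
```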
Although RF does not require any additional preprocessing before training, for the ANN it is recommended to normalize the data, which speeds up learning, stabilizes the algorithm, and accelerates convergence. In this study, for convenience, min–max normalization was applied to the input features (bending angle/refractivity, latitude, month, and hour) of both algorithms (ANN and RF), transforming them into the 0–1 range. Note that the vertical input profiles were scaled separately at each altitude level.
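Per-level min–max scaling can be sketched as follows; the profile matrix is random toy data standing in for, e.g., refractivity profiles, with one column per altitude level.

```python
import numpy as np

# Toy profile matrix: 100 events x 190 altitude levels (illustrative values)
rng = np.random.default_rng(2)
profiles = rng.uniform(5.0, 50.0, size=(100, 190))

# Min-max normalization applied separately at each level (column-wise)
col_min = profiles.min(axis=0)
col_max = profiles.max(axis=0)
scaled = (profiles - col_min) / (col_max - col_min)

print(scaled.min(), scaled.max())  # every column now spans exactly 0-1
```

Scaling each level independently preserves the vertical shape information while removing the large dynamic range between the lower and upper troposphere.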
Step 3: hyperparameters tuning
Hyperparameters are configuration parameters that control the learning process and, unlike the model parameters, must be set before training. Hyperparameters can significantly influence model performance, and selecting the optimal set is one of the biggest challenges in model building. To mitigate this problem, several hyperparameter optimization techniques have been developed, such as Random Search, Bayesian optimization, and Gaussian Processes (Bergstra et al. 2011; Bergstra and Bengio 2012). In this study, I used Random Search, separately for the bending angle/refractivity and ANN/RF models. The Random Search approach scans a large hyperparameter domain by selecting and evaluating n random combinations; n was set to 200 in this research. Each set of hyperparameters was evaluated using the popular K-fold cross-validation. In K-fold cross-validation, the training data set is randomly split into K equal-sized subsets, where one partition is used for testing and the remaining K−1 subsets serve for model learning. The process is repeated K times, and the final output is estimated as the average of the K fitting results.
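This procedure maps directly onto scikit-learn's `RandomizedSearchCV`, shown here on toy data with trimmed grids (n_iter = 5 and K = 3 instead of the n = 200 candidates used in the study) to keep the sketch fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for the training set (illustrative assumption)
rng = np.random.default_rng(3)
X = rng.uniform(size=(300, 10))
y = X.sum(axis=1) + 0.1 * rng.standard_normal(300)

# Candidate values; the study scanned wider grids (e.g. 50-600 trees)
param_dist = {"n_estimators": list(range(50, 201, 50)),
              "max_depth": list(range(4, 27, 2))}

# Each of the 5 random combinations is scored by 3-fold cross-validation
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```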
ANN and RF are characterized by different sets of available hyperparameters. For RF, the following hyperparameters were considered, with the numbers in parentheses specifying the ranges of values tested:

1. N estimators — the number of trees in the forest (50, 60, …, 600).

2. Max depth — the maximum depth of a tree (4, 6, …, 26).

3. Min samples split — the minimum number of samples required to split a node (30, 40, …, 90, 100, 150, 200, …, 500).

4. Min samples leaf — the minimum number of samples required at a leaf node (10, 15, 20, …, 45, 50, 60, …, 100).

5. Max features — the number of features to consider when looking for the best split (10, 15, 20, …, 190, 194).

6. Bootstrap — defines whether some samples may be used multiple times in a single tree (True, False).
Note that the pre-pruning technique was used to mitigate the risk of overfitting: the tested ranges of hyperparameters such as max depth, min samples split, and min samples leaf were constrained to prevent the full growth of single trees.
For the ANN, I adjusted parameters related to the network structure (1–3) and the training algorithm (4–6):

1. Layers — the number of hidden layers (1, 2, 3).

2. Neurons — the number of units in each hidden layer (195, 200, 205, …, 500).

3. Dropout rate — the fraction of neurons randomly ignored during training (0, 0.1, …, 0.5).

4. Activation function (linear, rectified linear unit).

5. Epochs (200, 250, 300, …, 1600).

6. Batch size — the number of samples processed before the internal ANN parameters are updated (50, 100, …, 500, 1000, 1500, …, 9500, 10,000).
The weights and learning rates of the ANN were determined during training using the popular and effective Adam optimization algorithm, first presented by Kingma and Ba (2014).
It must be pointed out that many different sets of hyperparameters may yield similar performance. Hence, to simplify the processing and save computational time, the RF models with the lowest number of trees and the ANNs with the minimum number of hidden layers and neurons were chosen. Eventually, the following ML models emerged as the best during hyperparameter optimization:

- bending angle based ANN: 1 hidden layer, 455 neurons, dropout rate of 0, linear activation function, 850 epochs, and batch size of 50 samples;

- refractivity based ANN: 1 hidden layer, 405 neurons, dropout rate of 0, linear activation function, 1150 epochs, and batch size of 50 samples;

- bending angle based RF: 300 trees, max depth of 20, min samples split of 30, min samples leaf of 10, 20 max features, and the bootstrap option set to False;

- refractivity based RF: 190 trees, max depth of 18, min samples split of 30, min samples leaf of 10, 15 max features, and the bootstrap option set to True.
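For reference, the bending angle RF configuration above would translate into scikit-learn as follows; the parameter names are scikit-learn's and are assumed here, since this section does not restate the study's software stack.

```python
from sklearn.ensemble import RandomForestRegressor

# Selected hyperparameters of the bending angle based RF model
rf_bending = RandomForestRegressor(n_estimators=300,
                                   max_depth=20,
                                   min_samples_split=30,
                                   min_samples_leaf=10,
                                   max_features=20,
                                   bootstrap=False)
print(rf_bending.get_params()["n_estimators"])
```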