Streamflow Prediction Based on In-Situ Data Using Machine Learning Methods (SVM and ANN)


Streamflow prediction, or estimation, has long been a popular topic in hydrology. Its applications range from drought management and flood prediction to hydropower operation. A well-performing streamflow model is therefore very important for these applications.

There are already many physically based models that estimate downstream flow from precipitation, land-use parameters and temperature. In this project, however, machine learning is used to predict the streamflow. Specifically, a Support Vector Machine (SVM) and an Artificial Neural Network (ANN) are used.

The packages used in this project are the scikit-learn SVM regression model and a TensorFlow ANN model. R², the Nash-Sutcliffe coefficient (NS coefficient for short), RMSE and a plot of the test data are used to evaluate whether each machine learning model performs well. The NS coefficient is commonly used for evaluating hydrological models. The last part of the evaluation compares each model with the final boss: the Naive Prediction model, which is introduced near the end.
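The three scores named above can be sketched with NumPy as follows. This is a minimal sketch, not the project's actual code: the function names are made up for illustration, and R² is assumed here to be the squared Pearson correlation, which would explain why R² and the NS coefficient can differ for the same prediction.

```python
import numpy as np

def rmse(obs, sim):
    """Root mean squared error between observed and simulated values."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def nash_sutcliffe(obs, sim):
    """NS coefficient: 1 - (squared error) / (variance of observations)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def r_squared(obs, sim):
    """Assumption: R² taken as the squared Pearson correlation."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(np.corrcoef(obs, sim)[0, 1] ** 2)

# Toy example: a prediction with a constant bias of 0.5.
obs = np.array([1.0, 2.0, 3.0, 4.0])
sim = obs + 0.5
print(rmse(obs, sim), nash_sutcliffe(obs, sim), r_squared(obs, sim))
```

In this toy case the biased prediction is perfectly correlated with the observations, so the correlation-based R² stays at 1 while the NS coefficient drops to 0.8. That is one reason to report both.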

Study Area

TouChien River Basin and in-situ Stations (Hsinchu, Taiwan)

The study area is in Hsinchu, Taiwan. The map above shows the TouChien River basin, the YuLou River, the NeiWan streamflow station (red dot) and the MeiHua rainfall station (purple dot). The catchment area of NeiWan station is 139.07 km². The average annual precipitation of the Hsinchu area is around 1,720 mm/yr and the average temperature is 22.4 °C (at 22 m altitude).

For the observed data, MeiHua rainfall station provides daily precipitation, daily temperature and humidity data. In this project, daily precipitation, temperature and streamflow data are used for the SVM and ANN.

Input Features and Data

According to many physically based hydrological models, streamflow is highly correlated with precipitation, temperature and the streamflow of the past few days, since precipitation has a significant lag effect on groundflow, groundwater and streamflow. Streamflow, groundflow and groundwater also affect each other. To capture those lag effects, the model should use not only the data observed one day before but also data observed several days before.

In this project, yesterday's temperature, the precipitation of the past five days and the streamflow of the past two days are selected as input features (8 features in total). The streamflow is also log-transformed, since the training data set contains many extremely large streamflow values.
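The feature layout described above could be assembled roughly like this. It is only a sketch under stated assumptions: `build_features` is a hypothetical helper, "past five days" is read as lags 1 to 5, the log transform is taken to be base 10, and the data are synthetic placeholders rather than the station records.

```python
import numpy as np

def build_features(temp, precip, flow, n_precip_lags=5, n_flow_lags=2):
    """Build the 8-column lagged feature matrix and the log-flow target."""
    temp = np.asarray(temp, float)
    precip = np.asarray(precip, float)
    log_flow = np.log10(np.asarray(flow, float))  # assumes flow > 0
    start = max(n_precip_lags, n_flow_lags, 1)    # first day with all lags available
    days = np.arange(start, len(log_flow))
    cols = [temp[days - 1]]                                            # temperature, t-1
    cols += [precip[days - k] for k in range(1, n_precip_lags + 1)]    # precipitation, t-1..t-5
    cols += [log_flow[days - k] for k in range(1, n_flow_lags + 1)]    # log streamflow, t-1..t-2
    return np.column_stack(cols), log_flow[days]

# Synthetic placeholder series, 30 days long.
rng = np.random.default_rng(0)
n = 30
temp = rng.uniform(10, 30, n)
precip = rng.gamma(0.5, 10.0, n)
flow = rng.uniform(1.0, 100.0, n)
X, y = build_features(temp, precip, flow)
print(X.shape, y.shape)  # (25, 8) (25,)
```

The first five days are dropped because they do not have a full set of lagged values.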

For the NeiWan streamflow station model, data from 1991 to 2010 (7305 days in total) are used. The first 80% of the data is used as the training data set, and the last 20% is used as test data to evaluate the performance of the model. In the last part of the project, another streamflow station in Hsinchu, ShangPing station, is used to test whether the machine learning method also works at a different location. The data from ShangPing station cover 1991 to 2004 (~5000 days in total). The same 80%/20% train-test split is used for ShangPing station.
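Because the split is "first 80%, last 20%", it is chronological rather than shuffled. A minimal sketch, assuming a hypothetical helper `chronological_split` and placeholder arrays with one row per day:

```python
import numpy as np

def chronological_split(X, y, train_fraction=0.8):
    """Keep time order: earliest rows for training, latest rows for testing."""
    n_train = int(round(len(X) * train_fraction))
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Placeholder arrays with 7305 rows (one per day) and 8 features.
X = np.arange(7305 * 8, dtype=float).reshape(7305, 8)
y = np.arange(7305, dtype=float)
X_train, X_test, y_train, y_test = chronological_split(X, y)
print(len(X_train), len(X_test))  # 5844 1461
```

Keeping the time order means the test period lies entirely after the training period, so lagged streamflow values from the test period are never seen during training.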

SVM Model and Performance

The SVM model is trained with the ‘rbf’ kernel, C=10000, gamma=0.1 and epsilon=0.1.
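With scikit-learn, these settings correspond to something like the sketch below. The hyperparameters are the ones stated above; the training data are random placeholders, and whether the original features were scaled beforehand is not stated, so no scaling is shown.

```python
import numpy as np
from sklearn.svm import SVR

# Hyperparameters as stated in the text.
svr = SVR(kernel="rbf", C=10000, gamma=0.1, epsilon=0.1)

# Placeholder data standing in for the 8 lagged features and log streamflow.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=200)

svr.fit(X_train, y_train)
y_pred = svr.predict(X_train[:5])
print(y_pred.shape)  # (5,)
```

Note that epsilon=0.1 defines a tube around the fitted function within which errors are not penalized, which is measured here in units of log streamflow.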

The R² and NS coefficient of the test data for the SVM model are both 0.506, and the RMSE is 0.274. However, the test data plot below shows that the larger the streamflow, the worse the model performs. The model cannot capture extremely large streamflow, which is a serious weakness for flood prediction. Many of the predicted values are also stuck around log10(streamflow)=1, which is not a good result for streamflow estimation.

The SVM model may perform poorly in streamflow estimation because SVM tends to treat extreme values as outliers, so they do not significantly influence the fitted model. The training data set (~6000 observations and 8 features) also seems a little too large for SVM, which could further flatten the influence of extremely large streamflow.

ANN Model and Performance

The ANN model used in this study has 3 hidden layers with 100, 80 and 40 nodes, respectively. The activation function is ReLU and the loss function is mean squared error.
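In TensorFlow's Keras API, that architecture might look like the sketch below. The hidden layer sizes, activation and loss follow the description above. The single linear output node, the Adam optimizer and the helper name `build_ann` are assumptions, since the text does not specify them.

```python
import tensorflow as tf

def build_ann(n_features=8):
    """Three ReLU hidden layers (100, 80, 40) and one regression output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(80, activation="relu"),
        tf.keras.layers.Dense(40, activation="relu"),
        tf.keras.layers.Dense(1),  # assumption: linear output for log streamflow
    ])
    # Loss is mean squared error as stated; the optimizer is an assumption.
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_ann()
model.summary()
```

Training would then be a call such as `model.fit(X_train, y_train, epochs=..., batch_size=...)`, with the number of epochs and the batch size chosen by the user, since neither is given in the text.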

The R² of the test data for the ANN model is 0.945, the NS coefficient is 0.932 and the RMSE is 0.102. Compared with the SVM model, the performance is much better in terms of R², NS coefficient and RMSE. But how much better is that?

The figure below is the test data plot. The ANN model clearly fixes the SVM model's failure to capture extremely large streamflow, and its predictions no longer sit around 1 most of the time. Although the ANN model still seems to underestimate large streamflow a bit, it is much better than the SVM model.

The ANN model likely performs better than the SVM model because it does not average out the influence of extreme values. In cases of extremely large streamflow, it tends to activate different nodes from those used in normal conditions. In fact, ANN models have been used in many regression problems on natural phenomena because they do not leave out extreme values and can accept large training data sets. SVM, on the other hand, is mostly used for classification and for regressions that need to capture the overall trend.

It seems that the ANN model predicts streamflow well enough to be applicable. However, there is still a final evaluation: Naive Prediction.

Final Boss: Comparison with the Naive Prediction Model

When the predicted variable is auto-correlated (meaning that the data are correlated with the data from one or several time steps before), there is always a very easy approach: assume that the value is the same as it was one time step before. This approach is called “Naive Prediction” or the “Naive Approach”, since scientists often treat it as a joke, saying that even a baby could make this prediction.
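The naive approach described above amounts to shifting the series by one time step. A minimal sketch with made-up log streamflow values (the helper name `naive_prediction` is hypothetical):

```python
import numpy as np

def naive_prediction(series):
    """Return (observed, predicted), where today's prediction is yesterday's value."""
    series = np.asarray(series, float)
    return series[1:], series[:-1]

# Made-up log10 streamflow values with one sudden peak.
log_flow = np.array([1.0, 1.2, 2.5, 1.8, 1.3])
obs, pred = naive_prediction(log_flow)
naive_rmse = float(np.sqrt(np.mean((obs - pred) ** 2)))
print(obs, pred, naive_rmse)
```

The predicted series is simply the observed series delayed by one step, so its largest errors occur on the rising and falling limbs of a peak.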

In hydrological models, land loading functions and streamflow prediction models, streamflow is highly auto-correlated, both physically and statistically. As a result, many physically based models and even machine learning models cannot beat the simple Naive Prediction model. (That is why I call it the “final boss”!) In past experience, a model must be very accurate just to catch up with the Naive Prediction model.

To compare the Naive Prediction model with the SVM and ANN models, R², NS coefficient and RMSE are used again. For the Naive Prediction model they are 0.884, 0.881 and 0.135, respectively. The test data plot is shown below. Although the plot looks very accurate, that is only because the prediction is the observed series shifted by one time step, so we cannot “eyeball” the real performance of the model. The statistics above show that the Naive Prediction model performs between the SVM model and the ANN model.

The comparison above shows that the ANN model not only performs well statistically but also beats the final boss, the Naive Prediction model. However, does the same structure work at another site? That is the last evaluation of the ANN model.

Using the ANN Model at Another Streamflow Station

As mentioned in the “Input Features and Data” section, data from another streamflow station are used for the last test. ShangPing streamflow station is also on the TouChien River but has a much larger catchment area. The ANN model structure is the same as before, and so are the input features and the train-test split. This test verifies whether the ANN model also works at a different station.

The R², NS coefficient and RMSE of the test data at ShangPing station are 0.738, 0.361 and 0.357, respectively. The test data plot is shown below. In the figure, the predicted data deviate strongly from the test data after the 800th sample, while the rest of the prediction appears more accurate.

After some investigation, it turns out that several typhoons hit Taiwan in that period, causing extremely large precipitation. Because such extreme values never appeared in the training data set, the ANN model is much less accurate there. However, if the test samples after the 800th are dropped, the R², NS coefficient and RMSE become 0.934, 0.932 and 0.083, respectively, which is as good as the performance at NeiWan station.

The test above shows that the ANN model is a very good choice for predicting streamflow, but when it receives an unseen extreme input, it sometimes overreacts and gives a bad streamflow estimate. On the other hand, if extremely large predictions could be dropped manually, like the extreme prediction above (around 10¹⁰ cms, which is not possible for that area), the ANN model would perform very well. The rules for dropping extreme predictions vary between stations and places, and many more data sets would be needed to define them, so they are not covered in this project.
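One very simple form such a rule could take is a cap on physically plausible flow. The sketch below is purely illustrative: the cap of 10⁵ cms, the helper name `mask_implausible` and the numbers are all made up, and a real rule would have to be derived per station, as noted above.

```python
import numpy as np

def mask_implausible(log_pred, max_log_flow):
    """True where the predicted log10 streamflow is below a plausibility cap."""
    return np.asarray(log_pred, float) <= max_log_flow

# Made-up values; the last prediction is around 10^10 cms.
log_obs = np.array([1.0, 1.5, 2.0, 3.0])
log_pred = np.array([1.1, 1.4, 2.2, 10.0])

keep = mask_implausible(log_pred, max_log_flow=5.0)  # hypothetical cap of 10^5 cms
rmse_all = float(np.sqrt(np.mean((log_obs - log_pred) ** 2)))
rmse_kept = float(np.sqrt(np.mean((log_obs[keep] - log_pred[keep]) ** 2)))
print(keep, rmse_all, rmse_kept)
```

Dropping a prediction this way removes it from the score but does not replace it with a usable estimate, so an operational system would still need a fallback for those days.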

Summary and Future Work

In this study, two machine learning methods, SVM and ANN, are tested for streamflow prediction. The SVM result is not promising, since its R², NS coefficient and RMSE are all much worse than those of Naive Prediction. The ANN model, on the other hand, performs very well and better than Naive Prediction. Even at another streamflow station, the same ANN structure still performs well if some extreme predicted data are dropped.

Despite the good performance of the ANN model, several improvements can still be made. One is to define rules for dropping extreme predicted data so that the predictions become more reasonable. Another concerns data scarcity: where in-situ data are lacking, it is unknown whether remote sensing data could be used instead.

In short, the ANN model is a good choice for streamflow prediction!

Currently a UCLA PhD Student.