
Introduction
Since 2008, Bitcoin has gained a prominent place in the international financial landscape, and conversations about cryptocurrencies have become a frequent subject on social media, attracting more investors every day.
The cryptocurrency market is still considered one of the most controversial subjects in the financial sector. It is built on decentralized trust, with no central authority: no government or central bank regulates its value the way national currencies are regulated. This leads some organizations to fear dealing in cryptocurrencies, assuming they threaten the traditional economic system. Others see cryptocurrencies as a solution to the lack of confidence in the financial system, and the success of the cryptocurrency market is undeniable.
Cryptocurrencies can be subjected to the same types of analysis as national currencies, including conclusions about the correlation between similar instruments. This study examines the data and suggests methods for forecasting the weighted average return for two of the major cryptocurrencies: Bitcoin and Ethereum. The analysis uses minute-by-minute, high-frequency Bitcoin and Ethereum market data dating back to 2018, obtained from Kaggle (1).
The most challenging part of predicting cryptocurrency returns is the high volatility of the data, since the crypto market is still at a very nascent stage compared to other investment instruments and currencies.
The lack of a clear pattern in the data that can be detected by the human eye makes the model highly prone to overfitting the training set.
Why predict the return rather than the close price itself?
Most financial studies involve returns, instead of prices, of assets.
Campbell, Lo, and MacKinlay (1997) gave two main reasons for using returns (2):
- First, for average investors, return of an asset is a complete and scale-free summary of the investment opportunity.
- Second, return series are easier to handle than price series because the former have more attractive statistical properties (e.g., stationarity).
Econometrics and Preliminary Data Analysis
Here we will perform a preliminary data analysis to get a better understanding of the dataset.
This dataset contains price information on historical trades for several crypto assets. In this post, the focus will be on two of the major cryptocurrencies in the market: Bitcoin and Ethereum.
Dataset Attributes
- Timestamp
- Asset ID
- Prices: Open, High, Low, and Close prices
- Count: the number of trades that took place during the minute
- Volume: the number of crypto asset units traded during the minute
- VWAP: the volume-weighted average price for the minute
- Target: 15-minute residualized return
Target Value
Return: definition and equation
The return provided in the dataset is a near-future (15-minute) return on the price.
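As a reference point, the one-period simple return of an asset with price $P_t$ can be written as (2):

$$R_t = \frac{P_t - P_{t-1}}{P_{t-1}} = \frac{P_t}{P_{t-1}} - 1$$

The Target column in this dataset is a forward-looking, 15-minute variant of this return that has additionally been residualized, as described in the dataset documentation (1).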
Visualizing the data

Ethereum

Note the high volatility of the data, reflected in the return time series, with some values that look like outliers (e.g., the high Bitcoin return around October 2019).
To confirm or rule that out, let us take a closer look at the signal around that time.
The following diagram focuses on Bitcoin over the period between Oct 2019 and Nov 2019.

After zooming in on the Bitcoin visualization, we note that the sudden shock in the return series is not an anomaly, since it is not a single isolated outlier. The shock in the graph is caused by the price surge in the market that took place around the end of October 2019.
Usually, some noise in the dataset can act as a regularization technique: it smooths the final model and helps it generalize better and avoid overfitting. With this highly volatile dataset, however, the noise would not do the training any good and might even cause the model to overfit, since the data becomes too complex with the added noise. For that reason, the noise and outliers need to be removed from the training dataset.
“Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well” (8).
Correlation between Cryptocurrencies
To get more insight into the linear relation between the two cryptocurrencies, let us visualize them on the same diagram.
Note that due to the difference in price scale between different cryptocurrencies, we have standardized the two time series to get a more meaningful visualization.

Note the high but variable correlation between the two assets. The dynamics clearly change over time, which is important to keep in mind when performing forecasts.
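As a rough illustration of this time-varying relationship, a rolling correlation between the two close-price series can be computed. The sketch below assumes the minute-level close prices have been loaded into two pandas Series, btc_close and eth_close (hypothetical names), aligned on the same timestamp index:

```python
import pandas as pd

# btc_close, eth_close: hypothetical pandas Series of minute-level close prices,
# aligned on the same DatetimeIndex

# Standardize each series so both can be overlaid on one scale
btc_std = (btc_close - btc_close.mean()) / btc_close.std()
eth_std = (eth_close - eth_close.mean()) / eth_close.std()
pd.concat([btc_std.rename("BTC"), eth_std.rename("ETH")], axis=1).plot()

# Correlation over a rolling 30-day window of minute bars
window = 30 * 24 * 60
btc_close.rolling(window).corr(eth_close).plot(
    title="30-day rolling correlation: BTC vs. ETH close price"
)
```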
The steep rise in the cryptocurrency market (Bitcoin and Ethereum in this example) around March 2021 illustrates the impact of COVID-19-related media coverage on the crypto market.
Return distribution

The distribution resembles a normal distribution in that it is centered around the mean, but it has a very sharp peak and a wide base.
Taking a closer look at the distribution of Bitcoin's return, note that the less frequent values are spread widely around the mean.

Price Distribution
Close price distribution of the two assets (Bitcoin on the left, Ethereum on the right)

Note that the close price distribution is positively skewed: most of the mass is piled at the lower prices, with a long tail toward higher (strictly positive) values, resulting in a relatively low mean and a high variance.
In the Bitcoin price distribution, we recognize four distinct peaks (local maxima) in the probability density. The distribution looks multimodal (3), consisting of four modes: the one on the left side of the Bitcoin figure is the major mode, while the others are minor modes.
Extreme Returns
To study the extreme values in the return (the forecasting target), we analyze the data using two key statistics: excess kurtosis and skewness.
These statistics describe the extremes of the dataset rather than focusing solely on the average.
Extreme positive returns are critical for holding a short position. A short seller wants to know when positive returns are extreme: if the forecast indicates a reversion to the mean, the short seller can identify the peak and time the position, knowing in advance (through the return forecast) that a reversion is expected. Extreme negative returns, on the other hand, matter for risk management: it is important to know when negative returns are extreme and a reversion to the mean, and to positive returns, is likely to follow.
Excess Kurtosis
The excess kurtosis of a normal random variable is zero. A distribution with positive excess kurtosis is said to have heavy tails (which is the case for the return distributions of the two cryptocurrencies), meaning that it puts more mass on the tails of its support than a normal distribution does.
In practice, this means that a random sample from such a distribution tends to contain more extreme values. Such a distribution is said to be leptokurtic.
Sample kurtosis can be calculated with the following equation:
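Following the notation of (2), with $T$ observations $x_1, \dots, x_T$:

$$\hat{K}(x) = \frac{1}{(T-1)\,\hat{\sigma}_x^{4}} \sum_{t=1}^{T} \left(x_t - \hat{\mu}_x\right)^{4}, \qquad \text{excess kurtosis} = \hat{K}(x) - 3$$

where $\hat{\mu}_x$ is the sample mean and $\hat{\sigma}_x$ is the sample standard deviation.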
Bitcoin and Ethereum return kurtosis, respectively:

| Bitcoin Return Kurtosis | Ethereum Return Kurtosis |
| --- | --- |
| 65.83791065718604 | 75.85892119668719 |
From the output, we can see that the excess kurtosis is quite high for both cryptocurrencies (leptokurtic) compared to a normal distribution, which means that we have a lot of extreme values.
Skewness in Return
Skewness in the return distribution shows up as asymmetry relative to the symmetric bell curve: the data piles up to the left of the curve (positive skewness) or to the right (negative skewness).
From the visualization, the return looks almost symmetrical; to get more precise numbers, and to determine whether the data is skewed to the right or the left, we calculated the skewness for both Bitcoin and Ethereum returns.
Sample skewness can be calculated with the following equation:
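Using the same notation as above (2):

$$\hat{S}(x) = \frac{1}{(T-1)\,\hat{\sigma}_x^{3}} \sum_{t=1}^{T} \left(x_t - \hat{\mu}_x\right)^{3}$$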
where $T$ is the number of observations, $\hat{\mu}_x$ is the sample mean, and $\hat{\sigma}_x$ is the sample standard deviation.
Bitcoin and Ethereum return skewness, respectively:

| Bitcoin Return Skewness | Ethereum Return Skewness |
| --- | --- |
| 1.4296588537329864 | 0.6970627924698324 |
Notice that both returns have positive skewness (Bitcoin's skewness is roughly twice Ethereum's), which means that we have a few large gains and frequent small losses.
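Both statistics are straightforward to compute with pandas. A minimal sketch, assuming the rows for a single asset sit in a DataFrame df (hypothetical name) with the return in the Target column:

```python
import pandas as pd

# df: hypothetical DataFrame of minute-level rows for a single asset (e.g., Bitcoin)
returns = df["Target"].dropna()

excess_kurtosis = returns.kurtosis()  # pandas reports excess kurtosis (normal = 0)
skewness = returns.skew()

print(f"excess kurtosis: {excess_kurtosis:.2f}, skewness: {skewness:.2f}")
```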
These highly frequent extreme values raise the bar for any ML forecasting model.
Stationarity
A stationary process has a mean, variance, and autocorrelation structure that do not change over time (4).
A time series is said to be strictly stationary if its unconditional joint probability distribution does not change when shifted in time. This is a very strong condition that is hard to verify empirically.
A weaker version of stationarity is often assumed instead. Weak stationarity implies that the time plot of the data shows the values fluctuating with constant variation around a fixed level.
In applications, weak stationarity enables one to make inference concerning future observations (e.g., prediction) (2).
Stationarity is important because many useful analytical tools and statistical tests rely on it.
Stationarity Test
In the finance literature, it is common to assume that an asset's return series is weakly stationary, while its price series tends to be nonstationary. The nonstationarity is mainly due to the fact that there is no fixed level for the price (2).
We will check both assumptions empirically using the ADF test.
Augmented Dickey-Fuller (ADF) Test: Unit-Root Test
In econometrics and statistics, the ADF test is used to check whether unit-root nonstationarity is present in a given time series; it therefore detects whether a time series is stationary with a certain level of confidence (5).
After applying the ADF test to the target return series, we found that the unit-root hypothesis is rejected (5) and the series is stationary. For the close price time series, the unit-root hypothesis could not be rejected, and the series is nonstationary. This can also be seen by looking at the expanding mean and standard deviation of Bitcoin's close price.
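A minimal sketch of how this check can be run with statsmodels, assuming the Bitcoin data is in a DataFrame btc (hypothetical name) with Close and Target columns:

```python
from statsmodels.tsa.stattools import adfuller

# btc: hypothetical DataFrame with 'Close' (price) and 'Target' (return) columns
# (in practice, the minute-level series can be downsampled first to keep the test fast)
for name, series in [("Close price", btc["Close"]), ("Target return", btc["Target"])]:
    stat, pvalue, *_ = adfuller(series.dropna())
    verdict = ("stationary (unit root rejected)" if pvalue < 0.05
               else "nonstationary (unit root not rejected)")
    print(f"{name}: ADF statistic = {stat:.2f}, p-value = {pvalue:.4f} -> {verdict}")

# Expanding mean and standard deviation of the close price (used for the plot below)
expanding_mean = btc["Close"].expanding().mean()
expanding_std = btc["Close"].expanding().std()
```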

Note how the expanding standard deviation makes a sizable jump around the beginning of 2021.
Conclusion: we have a nonstationary price time series and a stationary target return. This will help in future work when applying econometric models.
Trends and Seasonality
Now, let us take a closer look at the closing price time series by decomposing it into three main components:
- Trend: the overall movement of the series, i.e., an increasing or decreasing slope in the time series
- Seasonal component: explains periodic ups and downs
- Residuals: what is left over after the trend and seasonal components are removed
Decomposition is done using the seasonal decomposition functionality in the statsmodels API, which requires specifying the model:
The additive model is Y[t] = T[t] + S[t] + e[t]
The multiplicative model is Y[t] = T[t] * S[t] * e[t]
The results are obtained by first estimating the trend by applying a convolution filter to the data. The trend is then removed from the series and the average of this de-trended series for each period is the returned seasonal component (6).
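A minimal sketch of that call, assuming the Bitcoin close price has been resampled from minute bars to daily frequency (so an annual period is 365 observations):

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# btc: hypothetical DataFrame of minute-level data with a DatetimeIndex
daily_close = btc["Close"].resample("1D").last().dropna()

# Multiplicative decomposition with an annual period
result = seasonal_decompose(daily_close, model="multiplicative", period=365)
result.plot()  # trend, seasonal, and residual components
```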
The visualization here represents the annual decomposition of the Bitcoin close price time series:

Note that the results depend on the chosen model.
Several observations can be made from the data. The upward trend has accelerated since 2021.
We have a remarkably high residual, which means that many points are not captured by the trend and seasonal components under the multiplicative model; as a result, the multiplicative model is not the best fit for the data. There are other, more advanced decompositions, like Seasonal and Trend decomposition using Loess (STL) (7), that are worth exploring (outside the scope of this post).
Machine Learning Models and Forecasting
It is time to build a model to predict the return for the different cryptocurrencies.
Feature Engineering and Data Preparation
The feature set is composed of the prices (open, high, low, and close), volume, and count. The data was cleaned, gaps and missing values were imputed, and values were standardized. Initially, I partitioned the data into three subsets following the well-known 50%-25%-25% rule for training, validation, and testing, respectively. While experimenting, I changed the share of each split: I increased the training subset's share, since the first roughly 50% of the data does not seem representative of the rest, which does not do the training any good.
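A minimal sketch of such a chronological split (no shuffling, since the time ordering matters), assuming X and y are hypothetical NumPy arrays holding the features and targets sorted by timestamp:

```python
# X, y: hypothetical NumPy arrays of features and targets, ordered chronologically
n = len(X)
train_end = int(n * 0.50)   # initial 50% / 25% / 25% split, adjusted later
valid_end = int(n * 0.75)

X_train, y_train = X[:train_end], y[:train_end]
X_valid, y_valid = X[train_end:valid_end], y[train_end:valid_end]
X_test,  y_test  = X[valid_end:], y[valid_end:]
```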
Data partitions

Neural Network
Let us classify the algorithms we could use for forecasting the target into three broad categories:
- Deep learning models
- Econometric models
- Other regressors that do not fall under the first two categories
This post focuses on the first approach, deep learning, and on Recurrent Neural Networks (RNNs) and LSTMs in particular. Recurrent Neural Networks are a class of nets that can predict the future: they can analyze time series data such as stock prices and tell you when to buy or sell (8).

Image source (9)
In general, training a deep neural network can suffer from unstable gradients (exploding or vanishing), which can be mitigated by multiple techniques such as dropout, normalization layers, and good weight initialization.
In the backpropagation phase, the gradient of the loss function is computed for each layer, starting from the last layer back to the input layer, and the weights of each layer are updated based on the gradient of the error. With a large learning rate, the gradients can grow larger and larger, and the weights with them, which causes the algorithm to diverge. To avoid that, we need a small learning rate or a good learning-rate schedule. Using a saturating activation function also helps alleviate the problem.
The recurrent structure of an RNN creates a simple form of memory, since the output at a given point in time is a function of inputs from previous time steps. Each neuron in an RNN acts as a simple memory cell, but that cell is limited in the length of the series it can remember. This limitation can be addressed with Long Short-Term Memory (LSTM) cells in the recurrent layers, which help the network learn long series more accurately than plain RNN cells.
When forecasting time series, it is common to remove the trend and seasonality from the series before training and then add them back to the predictions. While this is not a required step for RNNs, it improves predictive performance in some cases, since the model does not have to learn the trend and seasonality too (8).
LSTMs have feedback connections and can process not only single data points (such as images) but also entire sequences of data. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell (10).

An LSTM can learn to recognize an important input and memorize it.
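To make this concrete, a minimal stacked-LSTM regressor in Keras might look like the sketch below; the window length, layer sizes, and feature count are illustrative assumptions, not the values used in the actual experiments:

```python
from tensorflow import keras

window, n_features = 60, 6  # illustrative: 60 past minutes, 6 features per step

model = keras.Sequential([
    keras.layers.LSTM(64, return_sequences=True, input_shape=(window, n_features)),
    keras.layers.Dropout(0.2),
    keras.layers.LSTM(64),
    keras.layers.Dense(1),  # single output: the predicted 15-minute return
])
model.compile(loss="mse", optimizer=keras.optimizers.Adam(learning_rate=1e-3))
model.summary()
```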
Implementation
Loss Function
The loss function is critical when training the model, and it should be chosen based on the data fed to the model. For example, when training a single model for a set of cryptocurrencies, it is worth considering loss functions other than MSE, especially if the target value range differs per asset, because you may not want to penalize every error as heavily as mean squared error does (11).
I experimented with the following loss functions:
- Mean Squared Error
- Mean Squared Logarithmic Error
- Mean Absolute Error Loss
- A custom loss function based on correlation as a metric (see the sketch below)
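For the last item, one possible formulation (my own illustrative sketch, not a standard Keras loss and not necessarily the exact function used in the experiments) is to minimize the negative Pearson correlation between predictions and targets:

```python
import tensorflow as tf

def negative_correlation_loss(y_true, y_pred):
    """Minimized when predictions are highly correlated with the targets."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    y_true_c = y_true - tf.reduce_mean(y_true)
    y_pred_c = y_pred - tf.reduce_mean(y_pred)
    cov = tf.reduce_mean(y_true_c * y_pred_c)
    denom = tf.math.reduce_std(y_true) * tf.math.reduce_std(y_pred) + 1e-8
    return -cov / denom  # ranges from about -1 (perfect correlation) to 1

# model.compile(loss=negative_correlation_loss, optimizer="adam")
```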
Activation Function
I used the hyperbolic tangent function, a saturating activation function that helps mitigate the exploding gradients problem. I experimented with other activation functions, but tanh gave the best results.
Number of Layers and Neurons in each Layer
For the number of layers and neurons, I followed the “stretch pants” approach attributed to Vincent Vanhoucke, a scientist at Google (8): go with a bigger network, with more layers and neurons than needed, and use early stopping and plenty of regularization to prevent the neural network from overfitting.
Hyperparameter tuning
I used RandomizedSearchCV, which randomly samples sets of hyperparameters, evaluates the score for each, and returns the set of hyperparameters that achieves the best score.
Parameter distributions explored in the search included (see the sketch after the list):
- Activation function
- Learning rate
- Loss function
- Optimization function
- Dropout rate
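Below is a rough sketch of that search, in the spirit of the approach in (8). It assumes the scikit-learn wrapper for Keras models (the legacy keras.wrappers.scikit_learn.KerasRegressor; on newer TensorFlow versions, the SciKeras equivalent with model__-prefixed parameter names) and purely illustrative parameter ranges:

```python
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from tensorflow import keras

window, n_features = 60, 6  # same illustrative assumptions as the earlier sketch

def build_model(n_units=64, dropout=0.2, activation="tanh",
                learning_rate=1e-3, loss="mse"):
    model = keras.Sequential([
        keras.layers.LSTM(n_units, activation=activation, return_sequences=True,
                          input_shape=(window, n_features)),
        keras.layers.Dropout(dropout),
        keras.layers.LSTM(n_units, activation=activation),
        keras.layers.Dense(1),
    ])
    model.compile(loss=loss,
                  optimizer=keras.optimizers.Adam(learning_rate=learning_rate))
    return model

# Legacy wrapper; on TF >= 2.6, swap in scikeras.wrappers.KerasRegressor
keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)

param_distributions = {
    "n_units": [32, 64, 128],
    "dropout": [0.1, 0.2, 0.3],
    "activation": ["tanh", "relu"],
    "learning_rate": reciprocal(1e-4, 1e-2),
    "loss": ["mse", "mae"],
}

search = RandomizedSearchCV(keras_reg, param_distributions, n_iter=10,
                            cv=TimeSeriesSplit(n_splits=3),
                            scoring="neg_mean_squared_error")
search.fit(X_train, y_train, epochs=5, verbose=0)
print(search.best_params_)
```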
Hyperparameters used after optimization
Future Work
That was the setup; in future work I will go into more detail about model training, validation, and testing.
What makes building a forecasting model challenging is that the crypto market is still nascent and emotional (as is the stock market in general), and news plays a key role in setting new prices for each asset; since news is hard to predict, crypto prices are hard to predict too.
The RNN model I created is still in development. While it is being improved, it is worth trying other RNN-based time series algorithms in parallel, such as Amazon SageMaker DeepAR. I also plan to do a deep dive into econometric models used for time series forecasting.
There is still a lot to do in analyzing the cryptocurrency market. I am planning a more advanced analysis to study the correlation between every pair of cryptocurrencies and to take a closer look at the correlation between the crypto market and the media, in terms of the nature of the news itself.
Because of the strong relationship mentioned above between news and crypto market prices, in future time series work I will include news/media in the feature engineering phase. That will be done by applying sentiment analysis to the news and extracting a positive/negative signal with a confidence score. Such a process could be achieved with a pre-trained sentiment analysis model or an off-the-shelf service such as Amazon Comprehend.
For navigating the hyperparameter space and for training, validating, and testing the model, I am planning to experiment with a more efficient, integrated cloud machine learning platform such as AWS SageMaker and to use services like Autopilot to automatically train and tune models while maintaining full control and visibility.
As for datasets, I will experiment with other financial time series datasets.
Noor Alsabahi, Lead Data Engineer
References
(1): https://www.kaggle.com/c/g-research-crypto-forecasting/data
(2): Ruey S. Tsay – Analysis of Financial Time Series
(3): https://en.wikipedia.org/wiki/Multimodal_distribution
(4): https://en.wikipedia.org/wiki/Stationary_process
(5): https://en.wikipedia.org/wiki/Augmented_Dickey–Fuller_test
(6): https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.seasonal_decompose.html
(7): https://otexts.com/fpp2/stl.html
(8): Aurelien Geron – Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow
(9): https://en.wikipedia.org/wiki/Recurrent_neural_network