Why you need Data Science to do Day Trading: A Bovespa example.

Published in Analytics Vidhya

12 min read · Feb 25, 2022

In January 2020, I decided to use machine learning to build a support system for Day Trade operations. Despite the time invested, I only saw relevant gains after almost a year of work on the project. The purpose of this article is therefore to show the importance of Data Science in supporting Day Trade operations, or trading operations of any other kind.

Day Trade Industry

It is characteristic of human beings to be drawn to opportunities for easy gains, whether or not they understand the risks that pursuing those gains entails. That’s why there are lotteries, gambling, bookmakers, and also Day Trade. However, there is undoubtedly far more analysis content about Day Trade than about all the other examples put together. Why is that? For an obvious reason: it makes sense!

But imagine if everything that made sense were, in fact, true. Everything we could think of and consolidate into a hypothesis would automatically be right… It would be wonderful, wouldn’t it? I don’t think so. The truth is that making sense is only the first step in a long journey of investigation toward what is considered the “state of the art”, that is, as close as possible to the truth at a given time in history. For example, the state of the art in preventing various diseases is the development of vaccines and mass immunization, because several studies have shown that it keeps the number of cases and deaths from these diseases as low as possible compared to the alternatives. Merely making sense, therefore, is rarely enough to make something true, but almost no one will explain that to you.

Now, what happens when we have a strong interest in easy earnings and trust anything that makes sense? Exactly! We get poorer. More specifically, in addition to losing money in day trade operations, we also consume analysis content that makes sense but gives no indication that it actually works. This is the kind of addictive cycle that drives the Day Trade analytics industry. A good example of the lack of validation of methods is Monkey Stocks. Every week they draw 5 random stocks and evaluate their performance, comparing this portfolio with the performance of brokers’ portfolios. The randomly drawn stocks are often among the best weekly and even yearly performers! This suggests that the methods used for recommendation are probably not validated, since one of the most fundamental tests is to compare a method against a random strategy (such as Monkey Stocks).
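To make the idea of a random baseline concrete, here is a minimal sketch. The returns below are made-up numbers, and the names `random_baseline` and `candidate_return` are my own illustrative assumptions; none of this is data from Monkey Stocks or from any broker.

```python
import numpy as np

def random_baseline(returns, n_picks=5, n_draws=10_000, seed=0):
    """Mean return of many randomly drawn portfolios of n_picks stocks."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    draws = np.empty(n_draws)
    for i in range(n_draws):
        # Draw n_picks distinct stocks at random, equally weighted
        picks = rng.choice(len(returns), size=n_picks, replace=False)
        draws[i] = returns[picks].mean()
    return draws

# Hypothetical weekly returns of 20 stocks (illustrative values only)
weekly_returns = np.linspace(-0.05, 0.05, 20)
baseline = random_baseline(weekly_returns)

# Hypothetical weekly return of the method being evaluated
candidate_return = 0.01
# Share of random portfolios that the method beats
beaten = (baseline < candidate_return).mean()
print(f"The method beats {beaten:.1%} of random portfolios")
```

If a recommendation method does not beat a clear majority of random portfolios over many weeks, there is little reason to believe it adds anything beyond luck.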

But how credible would I be if I made these claims without offering any evidence? So I take this opportunity to cite the work of Fernando Chague and Bruno Giovannetti from FGV in São Paulo. They followed everyone who started day trading between 2013 and 2016 and checked who was still trading in 2018. This is perhaps the most relevant evidence we have for understanding the situation of Day Traders in Brazil, and we will describe it in more detail in the next section.

Is there anyone making money with Day Trading in Brazil?

Almost nobody! More precisely, 99.43% of all people who tried to profit from Day Trade during the evaluated period did not continue their operations until 2018. Furthermore, among those who continued (0.57%), the average gross result (before income tax) was negative, that is, on average they lost money in the period.

It’s amazing how much these data tell us without much discussion. For example, they explain why there is so much Day Trade analysis content with no hint that it can work: virtually everyone who day trades quits after a short while. The business therefore becomes selling the dream of living off Day Trade income, without regard for the inherent risks of the activity. So, after using techniques that have never been evaluated, the “aspiring” Day Trader naturally ends up giving up on the dream after accumulating losses and spending money on barely reproducible “analytical” content…

It is important to remember that this kind of behavior comes naturally to those who seek out Day Trade. The hardest part of any successful strategy is restricting your operations to what was planned and changing course, if necessary, only at the right time. Although I did not rely on analysis content when I started, I recognize that the less time one has spent studying and operating, the harder it is to keep operations restricted to a strategy outlined a priori. This hypothesis is also supported by the study cited above: traders’ losses tended to increase over time, which can happen when new, unplanned operations are opened in an attempt to recover previous losses.

But after all, is it so hard to make money with Day Trading?

Without a doubt, it is the most difficult problem I have ever tried to solve, but I am not sure that only 0.57% of the people interested in Day Trade are capable of solving it. Specifically, I think the level of difficulty drops a lot if we use Data Science. So, in this article, I’m going to explain some methods for testing simple Day Trade strategies. With them, it is possible to check whether a given strategy can provide returns in the long run. Let’s go…

Downloading and importing data

B3’s daily data can be found on the Exchange’s official website (Link), where it is possible to download whole years as .txt. From there I use the bovespa2csv package to transform the .txt into a data frame.

from bovespa2csv.BovespaParser import BovespaParser as bp
# Import example of data from 2019 to 2020.
p = bp.BovespaParser()
p.read_txt('../input/b3-daily1920/COTAHIST_A2019.TXT')
p.read_txt('../input/b3-daily1920/COTAHIST_A2020.TXT')
# Final dataframe
df = p.df

Finally, we have a data frame in which each row corresponds to the data of one security on one trading day. The data frame has 26 columns.

Organizing the data

The database available on B3 contains all the spot market stocks, so we need to select only the data we are going to use. We therefore keep the stocks with the BDI code “02”, which corresponds to the standard lot. It is also necessary to convert the trading date values to datetime. Finally, we keep only the 10 most liquid stocks on B3, ranked by the median number of trades per session in 2019, and restrict the data to the 2020 trading sessions.

import numpy as np
import pandas as pd

# String to datetime
df['data do pregao'] = pd.to_datetime(df['data do pregao'], format='%Y%m%d')
# Sort data by the trading day and reset the index
df = df.sort_values('data do pregao').reset_index(drop=True)
# Remove excess spaces from stock names
df['codigo de negociação do papel'] = df['codigo de negociação do papel'].str.replace(' ', '')
# Keep only the stocks with BDI code "02"
df = df.loc[df['codigo bdi'].isin(['02']), :]
# Remove stocks that were not traded in 2020 (~ negates the boolean mask)
df = df.loc[~df['codigo de negociação do papel'].isin(['BOVA11', 'FIBR3', 'KROT3']), :]
# Get the 10 most liquid stocks, ranked by the median number of trades in 2019
selVar = 'numero de negocios efetuados com o papel mercado no pregao'
is2019 = df['data do pregao'] <= np.datetime64('2019-12-31T00:00:00')
ranking = df.loc[is2019].groupby('codigo de negociação do papel')[selVar].median().sort_values(ascending=False)
selectedStocks = list(ranking.index[:10])
# Keep only the selected stocks and the 2020 trading days
df = df.loc[df['codigo de negociação do papel'].isin(selectedStocks) & (df['data do pregao'] >= np.datetime64('2020-01-01T00:00:00')), :]
print("Selected Stocks:", selectedStocks)
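The liquidity ranking used above can be illustrated on a toy table. This is only a sketch: the column names `ticker` and `trades` and the helper `top_liquid` are hypothetical stand-ins for the much longer B3 column names, and the numbers are invented.

```python
import pandas as pd

# Toy data: 'ticker' and 'trades' are hypothetical stand-ins for the B3 columns
toy = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'CCC'],
    'trades': [100, 300, 50, 70, 500, 900],
})

def top_liquid(frame, n):
    """Return the n tickers with the highest median number of trades."""
    ranking = frame.groupby('ticker')['trades'].median().sort_values(ascending=False)
    return list(ranking.index[:n])

print(top_liquid(toy, 2))
```

The median is less sensitive than the mean to a few sessions with unusual activity. Note also that the ranking uses only 2019 data while the strategy is evaluated on 2020, so the stock selection does not peek at the test period.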

Feature engineering

In this article, we will propose and test some basic Day Trade strategies. For that, we need to remember that in a Day Trading strategy we want to buy stocks at the lowest possible value and sell at the highest possible value. We will therefore express the maximum and minimum values of the day, as well as the value at the end of the trading session, as a fraction of the opening value. We will also build a dataset with the results of a Buy and Hold strategy to compare with the results of our strategy.

# Absolute maximum variation from the stock's opening value
df['varMax'] = df['preco maximo do papel-mercado no pregao'] - df[
    'preco de abertura do papel-mercado no pregao']
# Absolute minimum variation from the stock's opening value
df['varMin'] = df['preco minimo do papel-mercado no pregao'] - df[
    'preco de abertura do papel-mercado no pregao']
# Variation between the opening price and the price of the last trade in the trading session
df['delta'] = df['preco do ultimo negocio do papel-mercado no pregao'] - df['preco de abertura do papel-mercado no pregao']
# Here, we normalize the maximum, minimum and delta variations by the stock's opening value
df['varMaxPerc'] = df['varMax'] / df['preco de abertura do papel-mercado no pregao']
df['varMinPerc'] = df['varMin'] / df['preco de abertura do papel-mercado no pregao']
df['deltaPerc'] = df['delta'] / df['preco de abertura do papel-mercado no pregao']

# Stocks performance (Buy and Hold): return of each close relative to the first open
bnhSeries = {}
for s in selectedStocks:
    dfStock = df.loc[df['codigo de negociação do papel'] == s, :].set_index('data do pregao')
    firstOpen = dfStock['preco de abertura do papel-mercado no pregao'].iloc[0]
    bnhSeries[s] = (dfStock['preco do ultimo negocio do papel-mercado no pregao'] - firstOpen) / firstOpen
# One column per stock, one row per trading day (DataFrame.append was removed in pandas 2.0)
bnhRes = pd.DataFrame(bnhSeries)

This way, it is possible to compare different stocks by their variation during the day, regardless of their price level, and consequently to compare the results of a hypothetical purchase at any value.
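As a small worked example of this normalization, consider two invented trading days with very different price levels. The short column names `open`, `high`, `low` and `close` are assumptions for readability; they are not the names used in the B3 file.

```python
import pandas as pd

# Two hypothetical trading days (prices are illustrative only)
toy = pd.DataFrame({
    'open':  [10.0, 20.0],
    'high':  [10.5, 20.2],
    'low':   [9.6, 19.0],
    'close': [10.1, 19.4],
})
# Same three features as in the article, expressed as a fraction of the open
toy['varMaxPerc'] = (toy['high'] - toy['open']) / toy['open']
toy['varMinPerc'] = (toy['low'] - toy['open']) / toy['open']
toy['deltaPerc'] = (toy['close'] - toy['open']) / toy['open']
print(toy.round(3))
```

After the division by the opening price, a drop of 0.40 on a stock that opened at 10.00 (a 4% drop) and a drop of 1.00 on a stock that opened at 20.00 (a 5% drop) are on the same scale and can be compared directly.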

Identifying Possible Strategies

Now we need to identify interesting strategies. I always prefer to start any analysis as simple as possible, so in the present case, let’s think about the change in stock value during the day.

First, let’s try to optimize the purchase value. So we need to investigate the variable varMinPerc. For this, we plot a histogram using the plotly package.

import plotly.express as ex

fig = ex.histogram(df['varMinPerc'], template='simple_white')
fig.update_layout(showlegend=False)
fig.show()

We can see that the vast majority of values are very close to 0 and that values lower than -5% occur, but rarely. Since the variation during the day matters for our strategy, does the value of the last trade of the day follow a similar distribution? Let’s plot the histogram to verify…

fig = ex.histogram(df['deltaPerc'], template='simple_white', labels={'value': 'deltaPerc'})
fig.update_layout(showlegend=False)
fig.show()

Interesting: the deltaPerc distribution resembles a normal distribution with a mean of 0 (or slightly below 0) and some outliers. There seems to be a tendency for the value of the last trade of the day to be very close to the opening value (deltaPerc = 0%). Therefore, always buying stocks once they have fallen by a specific amount below the open could give good results. To investigate this hypothesis, we can subtract the minimum variation of the day from the day’s variation, which gives the distance (in %) from the final value to the minimum value of the day.

# New feature creation
varMinDeltaPerc = (df['deltaPerc'] - df['varMinPerc'])
# Histogram plot
fig = ex.histogram(varMinDeltaPerc, template='simple_white', labels={'value': 'varMinDeltaPerc'})
fig.update_layout(showlegend=False)
fig.show()
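It helps to spell out what this new feature measures. Since both terms are divided by the same opening price, deltaPerc - varMinPerc simplifies to (close - minimum) / open. The sketch below checks this on one invented trading day; the prices and the variable names `open_price`, `low` and `close` are illustrative assumptions.

```python
# One hypothetical trading day (illustrative prices)
open_price, low, close = 10.0, 9.6, 10.1

varMinPerc = (low - open_price) / open_price
deltaPerc = (close - open_price) / open_price

# Feature used in the article
varMinDeltaPerc = deltaPerc - varMinPerc
# Equivalent form: distance from the day's minimum to the close, as a fraction of the open
same_thing = (close - low) / open_price

print(round(varMinDeltaPerc, 4), round(same_thing, 4))
```

Because the last trade of the day cannot happen below the day's minimum, this quantity is never negative. What the histogram adds is how far above the minimum the close usually sits.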

The values are concentrated above 0, which supports our previous hypothesis: a stock tends to close above the minimum value of the day and very close to the opening value, and this gap may be an opportunity. Therefore, predicting the minimum value of the stock during the day should give important results! However, since the purpose of this article is to provide simple solutions, let’s consider a simple strategy: we track the 10 stocks we selected and buy whenever one of them falls by a specific percentage below its opening value. We then sell at the end of the trading session at the prevailing price. Is there an ideal minimum variation value for this strategy to work?

Simulating a Day Trading Scenario

To test our strategy, we can simulate a Day Trading scenario with the dataset we have and vary the purchase value parameter to look for conditions that yield a profit. So we first need a function that evaluates the results of buying at a variation of ‘x’ % relative to the open.

def analyse(df, x):
    res = df.loc[df.varMinPerc < x, ['data do pregao', 'codigo de negociação do papel', 'varMinPerc', 'deltaPerc']]
    res['result'] = (res['deltaPerc'] - x)
    return res[['data do pregao', 'codigo de negociação do papel', 'varMinPerc', 'result']]
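To see what the function computes, here is a minimal sketch of the same rule on three invented trading days. The helper name `simulate` and the toy values are assumptions for illustration; the logic mirrors the function above.

```python
import pandas as pd

def simulate(frame, x):
    """Keep days whose minimum fell below x; result = close minus entry, as a fraction of the open."""
    hit = frame.loc[frame['varMinPerc'] < x].copy()
    # Entry at open * (1 + x), exit at the last trade: (close - entry) / open = deltaPerc - x
    hit['result'] = hit['deltaPerc'] - x
    return hit

# Three hypothetical days (illustrative values only)
toy = pd.DataFrame({
    'varMinPerc': [-0.05, -0.01, -0.03],
    'deltaPerc':  [-0.01,  0.00, -0.03],
})
out = simulate(toy, x=-0.02)
print(out)
```

With x = -2%, the second day is skipped because the stock never fell that far. On the first day we would have bought at 2% below the open and sold at 1% below it, a gain of 1% of the opening price; on the third day the stock closed at its minimum, for a loss of 1%. Two assumptions are worth keeping in mind: the order is filled at exactly x below the open, with no trading costs, and the result is measured as a fraction of the opening price rather than of the purchase price (dividing by 1 + x would give the return on the amount actually invested).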

Next, we need to define a list of x-values to test. We will use values from 0 to -0.049, which correspond to variations from 0 to -4.9% relative to the opening value. At the end of each test, we save a new row of results with the daily average, the accumulated total gain and the compound gain, and collect the rows in the resDict dataframe.

rows = []
for v in range(50):
    x = -(v / 10 ** 3)
    res = analyse(df, x)
    # Each of the 10 stocks receives 1/10 of the capital; sum the results per day
    daily = res.groupby('data do pregao')['result'].sum() / 10
    rows.append({
        'x': x,
        'mediaDia': daily.mean(),
        'totalAcumulado': daily.sum(),
        'totalComposto': (daily + 1).cumprod().iloc[-1],
    })
# Build the dataframe at once (DataFrame.append was removed in pandas 2.0)
resDict = pd.DataFrame(rows)
# Finally, we plot the results of resDict
ex.line(resDict, x='x', y='mediaDia', template='simple_white').show()
ex.line(resDict, x='x', y='totalAcumulado', template='simple_white').show()
ex.line(resDict, x='x', y='totalComposto', template='simple_white').show()

It is clear from the graph that values of x below -2.5% show gains in the simulations. Furthermore, the lower the value of x, the greater the average daily gain. However, recalling the distribution of the minimum daily variation we saw earlier, drops of -5% are not very frequent, so such a threshold may not give the same return in the long term: the shares would be bought more rarely than with thresholds closer to zero. But thanks to Data Science, we can also test this hypothesis by plotting the total compound result.

Now we can see that, although very low values of x still offer an interesting gain, the biggest gains are at -4%. Furthermore, values of x above -2.5% (closer to zero) offer little return or substantial losses.
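The three summary numbers used in this comparison differ in a way that is easy to see on a tiny example. The sketch below uses three invented daily results; the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical daily results of the strategy (already divided among the stocks)
daily = pd.Series([0.01, -0.02, 0.03])

mean_daily = daily.mean()                         # average result per traded day
total_simple = daily.sum()                        # gains added up, no reinvestment
total_compound = (daily + 1).cumprod().iloc[-1]   # growth factor when gains are reinvested

print(round(mean_daily, 4), round(total_simple, 4), round(total_compound, 4))
```

The daily average rewards thresholds that trade rarely but well, while the two totals also depend on how many days the threshold is actually reached. This is why a very low x can have the best daily average without having the best total. Note also that days on which no stock reaches the threshold do not appear in the grouped results, so the daily average is an average over the days with trades only.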

Plotting the Results

Now we have an interesting strategy to consider. However, there is still a feature of these results that we have not tested: the variability of the results over the proposed period. Specifically, we need to know if this is a good strategy most of the time, or if the results are too extreme, which could cause big gains but also big losses. For this, we plot the results per day throughout 2020.

x = -0.04
serieRes = analyse(df, x)
# Daily result, with 1/10 of the capital allocated to each stock
dfPlot = serieRes.groupby('data do pregao')[['result']].sum() / 10
dfPlot['Resultado Acumulado'] = dfPlot['result'].cumsum() + 1
dfPlot['Resultado Acumulado Composto'] = (dfPlot['result'] + 1).cumprod()
dfPlot["Buy n' Hold"] = bnhRes.mean(axis=1) + 1
fig = ex.line(dfPlot, y=['Resultado Acumulado', 'Resultado Acumulado Composto', "Buy n' Hold"], template='simple_white')
fig.update_layout(legend=dict(
    yanchor="bottom",
    y=0.01,
    xanchor="right",
    x=1
))
fig.show()
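Besides looking at the curve, the variability can be summarized with a few numbers. The sketch below is one possible way to do it, not part of the original notebook: the function `variability_summary` and the sample values are my own assumptions.

```python
import pandas as pd

def variability_summary(daily):
    """Share of positive days, worst day and maximum drawdown of the compounded curve."""
    curve = (daily + 1).cumprod()
    # Drawdown: how far the curve sits below its highest value so far
    drawdown = curve / curve.cummax() - 1
    return {
        'positive_share': float((daily > 0).mean()),
        'worst_day': float(daily.min()),
        'max_drawdown': float(drawdown.min()),
    }

# Hypothetical daily results (illustrative values only)
daily = pd.Series([0.02, -0.01, 0.01, -0.03, 0.02])
summary = variability_summary(daily)
print(summary)
```

Applied to `dfPlot['result']`, a summary like this would show whether the gains come from many small positive days or from a few extreme ones, which is exactly the question raised above.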

If we had bought the selected shares on every trading session of 2020 in which their value fell 4% or more below the opening price, we would have had a positive return of between 11 and 12% for the year, which is much better than losing 1.5% by buying all the shares and selling them on the last day of the year (Buy n’ Hold). It’s not something that will make us rich, but without a doubt, it has a very good chance of giving positive results, as we had good returns in several simulations.

In addition, as this is an extremely simple strategy, a lot can be improved, such as testing other stock selection criteria (standard deviation of value, volume), other frequencies at which the most liquid stocks are selected (monthly, daily), different numbers of stocks and so on. Machine learning and statistical techniques such as LSTM or autoregression are further options that can help a lot. In my case, I developed a system based on Random Forests, and I’ve already tested other regressors such as SVM and Linear Regression, to predict the next day’s minimum value and recommend which shares to buy. However, I intend to write a scientific paper explaining these methods, and publishing them here first would make publication in a journal unfeasible.

Conclusion

So far nothing can better validate a day trading technique than an extensive analysis of its results. Therefore, when we use data to validate techniques, we can see that good results in day trading may not be that difficult to achieve. Always remember that when it comes to day trading (and almost every subject), just making sense means absolutely nothing.

The Jupyter Notebook with the experiments is available on Kaggle through the following link.