Extending the Trade War (App) – Further Explorations with Google Trends and the Market


This is a continuation of the original article describing the app:
https://kerpanic.wordpress.com/2018/12/20/trading-the-trade-war-sentiment-based-trading-using-google-trends/
If you have not read the previous article, I highly recommend reading it first, as it serves as an introduction to the web app at https://tradewargoogletrends.herokuapp.com/

New Features and Parameters

Since the original web app was released, some requests and suggestions have come in, and some new world events have occurred, like the Federal Reserve rate hikes and several Trump antics. Here are all the updates to the web app.

  • Using Google Trends Topics instead of Search Terms: Topics in Google Trends are more all-encompassing, as they cover all the search terms related to the topic you choose. Hence, to cover more ground, I switched from fetching search terms to fetching topics (see the sketch after this list). A detailed comparison of search terms and topics is listed here.
  • Other Keywords: In the previous version, the keyword was limited to “Trade War”. The new keywords include “recession”, “financial crisis”, “debt”, “federal reserve system” and also something to satisfy these two best friends (look for it!). These terms are all somewhat related to the trade war, and allow us to expand the analysis of how Google Trends sentiment affects market prices. It is interesting to see how these terms correlate with the ups and downs of individual stocks, especially now that we are in a volatile period. I personally like “financial crisis” or “recession” as a “chicken-little” warning indicator – try it out!
  • Wider Range for Parameters: To deal with the behaviour of different keywords, a wider percentage range for “Upper Sell Threshold” and “Lower Buy Threshold” has been included.
  • More Tickers: There are now more tickers for popular stocks, including the S&P 500 and Dow indexes. I’ve also added the tickers of stocks I have a personal interest in.
  • Lighter red/blue line to indicate triggers on partial data: Google Trends values measure the proportion of searches for a particular keyword relative to all other searches on Google. This means that when Google Trends flags the latest data point as partial, we cannot reliably call it a buy or sell trigger even if it crosses a threshold. We can only say it is “headed that way”, and we can only confirm it once the week being measured has ended. A lighter blue/red line indicates such a situation.
  • Date Picker for the Start of Calculations: This sets the point in time from which the buy/sell calculations start. It comes in useful if you want to test the Google Trends strategy from a different point in history.
  • Addressing Heroku App Startup Times: On the free tier of Heroku, the app automatically sleeps after 30 minutes of idleness, after which there is a roughly 10-second startup delay on first access. By pinging it at regular intervals with https://uptimerobot.com you can keep it alive and maintain the app’s responsiveness. I am really cheap.
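
As an aside, here is a minimal sketch of how weekly topic data can be pulled with the pytrends library, an unofficial Google Trends API. This is not the app’s actual code, and the topic ID shown is an illustrative placeholder:

from pytrends.request import TrendReq

pytrends = TrendReq()
# A topic is passed as its Trends topic ID rather than a plain search term;
# "/m/0d07ph" is a placeholder, not the app's actual topic ID.
pytrends.build_payload(["/m/0d07ph"], timeframe="2017-01-01 2018-12-31")
df = pytrends.interest_over_time()
print(df.tail())  # the isPartial column flags in-progress weeks (the lighter lines above)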
[Figure: AAPL price with the “recession” trend’s buy/sell triggers. Oh, if only I had sold…]

Thoughts

The more I mess around with the web app, the more I believe in its ability to draw from the wisdom of the crowd. That said, not all the keywords presented seem effective, and I leave it to your own judgement to see which ones are. Personally, I think “trade war”, “financial crisis” and “recession” correlate best with trends in the market, but YMMV. Perhaps this is because these are terms most people can relate to and search for.

Essentially, the Google searches represent the sentiment of the people, albeit aggregated weekly. At the very least, I will look at what these graphs say before making any buys or sells, especially in these turbulent conditions, to gauge how we are all thinking as a whole.

One drawback, though, is that Google Trends data comes in on a weekly basis, which may be too slow compared to the volatile market in this climate. A further step could be to pull Wikipedia search trends (these are available daily, I think), or to analyse Trump’s incessant tweets to further confirm the positive/negative sentiment at a given point in time. The possibilities are endless, and I am positive that the professional quantitative analysts out there are doing more than this.

All that said, if you mess around with the app enough, you’ll find that buy-and-hold is still a pretty solid strategy!

Trading the Trade War – Sentiment-Based Trading using Google Trends

You can find the web application, made using Dash, here: https://tradewargoogletrends.herokuapp.com/

Prospect Theory and Loss Aversion

Prospect Theory is a theory in cognitive psychology that describes the way people choose between probabilistic alternatives that involve risk, where the probabilities of outcomes are known. The theory states that people make decisions based on the potential value of losses and gains rather than the final outcome, and that people evaluate these losses and gains using certain heuristics. A layman’s way to think of Prospect Theory is as an analysis of decision-making under risk.

One easy illustration of loss aversion in prospect theory is when we are faced with two choices:

  1. Get $50 straightaway
  2. Flip a coin – get $100 if heads, nothing if tails.

Most people go for choice (1), because we can’t bear the thought of getting nothing should we choose (2) and the coin come up tails. We are loss-avoiding creatures, even though (1) and (2) have the same expected value: 0.5 × $100 + 0.5 × $0 = $50.

Relating Prospect Theory to Google Trends

Dr. Tobias Preis from Warwick Business School suggested in 2013 that Google Trends could be used to predict stock movements in his paper “Quantifying Trading Behavior in Financial Markets Using Google Trends”. 

Here is a link to his presentation. A simple trading strategy proposed by him is, roughly: sell when the search volume of a chosen keyword rises sharply from one week to the next, and buy when it falls.

Why does this strategy work? According to prospect theory, we tend to search more when bad news happens. We over-react to bad news (searching frantically and worrying) and under-react to routine news like reliable growth in a company. So when something bad happens, many people search for it, causing an upward spike in Google Trends – and that is generally the time to sell. Conversely, when the Google Trends data point drops, fewer people are searching for the topic and no bad sentiment is evident. Arguably, these “calmer” times are better times to buy.

Of course, for the individual investor, there are a number of issues with the trading strategy above, even if the paper claims it to be profitable.

  1. The number of trades will be too large if we make a transaction every week. At $20 per transaction, the transaction costs balloon and prove to be too much for the individual investor.
  2. Individual investors do not have the advantage of speed – by the time it is their turn to buy or sell, the effect may already be priced in.

Hence, for the individual investor, we can only make use of this knowledge in a broad sense, perhaps as a danger alert for bad times ahead. In this article, we will investigate how this knowledge may be applied to the individual retail investor’s trading activities.

Defining the Terms for the Experiment

We will use some popular stocks like AAPL and GOOG to run the experiment. Before we begin, we have to define some terms that we will use throughout the article as well as the web application, as follows:

  • Upper Sell Threshold: The percentage increase in the Google Trends value from one data point to the next beyond which we sell. The program’s default is 45%.
  • Lower Buy Threshold: The percentage decrease in the Google Trends value from one data point to the next beyond which we buy. The program’s default is -40%.
  • Keyword: The keyword entered into Google Trends. The default keyword used here is “Trade War” as a topic in Google Trends.
  • Shares to Buy: The number of shares to buy in each transaction.
  • Shares to Sell: The number of shares to sell in each transaction.
  • Initial Money: The initial amount of money we have for making transactions.

Rules of the Game

There are some other important rules to consider in our experiment as well:

  • Bench-marking against Buy-and-Hold: Naturally, we benchmark this trading strategy against buy-and-hold, where we buy the stock at the first buying opportunity (the same first buying opportunity as the Google Trends strategy) and hold it until the end of the experimental period. We then check whether the strategy has earned more or less than buy-and-hold.
  • Transactions are counted and transaction costs accounted for: Each transaction costs $20, multiplied by the number of transactions the strategy generates. This adds realism with the individual retail investor in mind.
  • If Not Enough Funds to Buy: If we do not have the “cash” on hand to buy more shares, we buy the maximum number of shares we can afford.
  • If Not Enough Shares to Sell: If we do not hold the stipulated number of shares, we sell whatever shares remain. In this sense, if we have not bought anything yet, we cannot sell anything, even on a “sell” line.
  • Period of Testing: The experiment runs over the Trade War timeline, from early 2017 to the current date. The web app will be kept alive and running throughout the trade war.
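
Putting the definitions and rules above together, here is a minimal sketch of the backtest logic in Python. It is an illustration under stated assumptions, not the web app’s actual code: trend and price are assumed to be weekly Google Trends values and closing prices already aligned by date.

def backtest(trend, price, sell_thresh=0.45, buy_thresh=-0.40,
             shares_per_buy=10, shares_per_sell=10, cash=10000.0, fee=20.0):
    shares = 0
    for t in range(1, len(trend)):
        if trend[t - 1] == 0:
            continue  # no baseline to compute a percentage change from
        change = (trend[t] - trend[t - 1]) / trend[t - 1]
        if change >= sell_thresh and shares > 0:
            n = min(shares_per_sell, shares)  # sell whatever shares remain
            shares -= n
            cash += n * price[t] - fee
        elif change <= buy_thresh:
            n = min(shares_per_buy, int((cash - fee) // price[t]))  # buy what we can afford
            if n > 0:
                shares += n
                cash -= n * price[t] + fee
    return cash + shares * price[-1]  # final value, marked to the last price

Comparing the result of backtest against a single buy-and-hold purchase at the first buy trigger reproduces the comparison the app makes.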

Conclusion

You can find the web application, made using Dash, here: https://tradewargoogletrends.herokuapp.com/

Running the experiment, we are able to “devise” a strategy that out-performs buy-and-hold. This suggests there is some truth to the claim that Google Trends can aid investment strategies.

In the web app, the red dotted lines are “sell” actions and the blue dotted lines are “buy” actions. You can choose your own parameters and run the experiment to your liking.

Disclaimer: Do not blame me for any loss of money should you decide to follow this strategy. This is a purely academic endeavour that explores the link between prospect theory, trends and the stock market.

That said, there is an emotional hurdle to overcome if we are to abide fully by the strategy. For example, loss aversion kicks in when the algorithm tells me to sell at a point where I will clearly lose money. Greed is also something I had to overcome, when the algorithm told me to sell while the trend was clearly going up.

I will keep the web application alive on Heroku so that it can serve as a continuous point of reference for this article. It can be a little slow on first access – the app has to start itself up, which is the default behaviour on the free hosting tier (I’m cheap). For me personally, this app is useful as a “chicken-little” early warning signal. It’ll be interesting to see how the future plays out!

Let me know what you think at madstrum@gmail.com! I would be interested in ideas and suggestions for the web app. Thanks!

Lastly, I would like to thank Mr. Eric Tham from the NUS-ISS Sentiment Mining course for introducing us to this as well as other finance-related topics.

You can find Part 2 of this article here, where I describe some extensions I made to the app, and some findings. 


FIFA World Cup 2018 (Part 2) – Quarters Predictions


Previous article here: https://kerpanic.wordpress.com/2018/06/27/world-cup-2018-the-final-16-predictions/

Data obtained from here: https://www.kaggle.com/agostontorok/soccer-world-cup-2018-winner/data

All code here: https://github.com/lppier/fifa18_final16

Enhancements

  • Matches that resulted in a draw are no longer considered, as there are no draws in the final-16 knockout stage.
  • One-hot encoding is used to remove the possibility of the model treating the team number as a ranked value (see the sketch after this list).
  • Added the latest results from the matches played so far into the data!
  • Added an ensemble voting classifier aggregating results from kNN, AdaBoost and a neural network.
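
As an illustration of the one-hot encoding step (the toy data and column names here are mine, not the notebook’s):

import pandas as pd

# Team IDs are categorical, not ordinal; one-hot encoding stops the model
# from reading team 12 as "greater than" team 3.
matches = pd.DataFrame({"team_1": [3, 7, 3], "team_2": [7, 12, 12], "win": [1, 0, 1]})
encoded = pd.get_dummies(matches, columns=["team_1", "team_2"])
print(encoded)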

Metrics from kNN Classifier Model

area under curve: 0.8912393162393163
accuracy: 0.8893333333333333
precision: [ 0.84711779  0.93732194]
recall: [ 0.93888889  0.84358974]
fscore: [ 0.89064559  0.8879892 ]
kNN Quarters Prediction
1 means a win for the 1st country.
France vs Argentina : 0
Uruguay vs Portugal : 0
Spain vs Russia : 1
Croatia vs Denmark : 1
Brazil vs Mexico : 1
Belgium vs Japan : 1
Sweden vs Switzerland : 0
Colombia vs England : 0

Metrics from Adaboost Model

area under curve: 0.8912393162393163
accuracy: 0.8893333333333333
precision: [ 0.84711779  0.93732194]
recall: [ 0.93888889  0.84358974]
fscore: [ 0.89064559  0.8879892 ]
Adaptive Boosting Prediction
1 means a win for the 1st country.
France vs Argentina : 0
Uruguay vs Portugal : 0
Spain vs Russia : 1
Croatia vs Denmark : 0
Brazil vs Mexico : 0
Belgium vs Japan : 1
Sweden vs Switzerland : 0
Colombia vs England : 1

Metrics from Neural Networks Model

area under curve: 0.8912393162393163
accuracy: 0.8893333333333333
precision: [ 0.84711779  0.93732194]
recall: [ 0.93888889  0.84358974]
fscore: [ 0.89064559  0.8879892 ]

Neural Networks Quarters Prediction
1 means a win for the 1st country.
France vs Argentina : 0
Uruguay vs Portugal : 0
Spain vs Russia : 1
Croatia vs Denmark : 0
Brazil vs Mexico : 1
Belgium vs Japan : 1
Sweden vs Switzerland : 0
Colombia vs England : 0

Voting Ensemble Results

Basically, among the three classifiers, the majority wins. It’s interesting to note that Adaptive Boosting actually predicted that Brazil would lose, but the majority vote has Brazil winning.

A similar situation occurs in Croatia vs Denmark, where the majority vote is for Denmark to win.
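
Here is a sketch of such an ensemble using scikit-learn’s VotingClassifier in place of hand-rolled voting. X_train, y_train and X_quarters are hypothetical stand-ins for the prepared features, and the model settings are illustrative:

from sklearn.ensemble import VotingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# voting="soft" predicts from the average of the three models' probabilities.
# The scheme here is slightly different: a hard majority vote decides, with
# the averaged probability reported alongside, so the two can disagree
# (as the Croatia vs Denmark result below shows).
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("ada", AdaBoostClassifier()),
        ("nn", MLPClassifier(max_iter=1000)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)  # X_train, y_train: hypothetical prepared data
print(ensemble.predict(X_quarters))  # ensemble decision per match
print(ensemble.predict_proba(X_quarters))  # averaged win probabilities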

Ensemble Prediction (Voting scheme, followed by averaging among the 3 models)
1 means a win for the 1st country using voting scheme.
France vs Argentina : 0 Probability 0.7537506238216767
Uruguay vs Portugal : 0 Probability 0.5634269455951803
Spain vs Russia : 1 Probability 0.8396692244968408
Croatia vs Denmark : 0 Probability 0.4291791223437185 <- Probability here suggests different result from voting!
Brazil vs Mexico : 1 Probability 0.8330163908699761
Belgium vs Japan : 1 Probability 0.8396997204649211
Sweden vs Switzerland : 0 Probability 0.6988071986353249
Colombia vs England : 0 Probability 0.5637759883599832

It’ll be fun to see how it goes!

World Cup 2018 – Predictions (Part 1)

All code can be found here: https://github.com/lppier/fifa18_final16


I thought that it would be interesting to build a prediction model to predict the results of the clashes between the final 16 FIFA World Cup 2018 teams.

Data obtained from here: https://www.kaggle.com/agostontorok/soccer-world-cup-2018-winner/data

Main Python notebook here: https://github.com/lppier/fifa18_final16/blob/master/fifa2018.ipynb

Two datasets were used:

Dataset 1: A FIFA World Ranking database of all the countries that play soccer competitively.

Dataset 2: Results of all international soccer matches since 1872.

Data Preparation

  • Considering that line-up changes do affect the odds, I elected to take in data only from the last World Cup onwards.
  • Only the 2018 rankings were used; earlier rankings were not considered.
  • Only World Cup match data was used (qualifiers or otherwise); during model exploration, using only World Cup data yielded better accuracy and overall metrics.
  • Data balancing was done to push up the accuracy of the model.
  • Data was split into 80% training data and 20% test data. The test data was not touched until the end; 5-fold cross-validation on the training data alone was used to validate model quality.
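
A sketch of this split-and-validate scheme (X and y are hypothetical stand-ins for the prepared match features and win/loss labels):

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training data only; the 20% test set stays untouched.
scores = cross_val_score(RandomForestClassifier(random_state=42), X_train, y_train, cv=5)
print(scores.mean())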

Model

  • I used Orange to do a quick run of all the scikit-learn classifier algorithms, of which Logistic Regression, Random Forest and Naive Bayes emerged as the best classifiers for this particular problem.
  • A random forest model was eventually chosen, and hyper-parameter tuning was done on it.
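
A sketch of the hyper-parameter tuning step (the grid values are illustrative, not the ones actually used):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
)
grid.fit(X_train, y_train)  # X_train, y_train: hypothetical prepared training data
print(grid.best_params_, grid.best_score_)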

Predicted Outcome: Germany wins the Germany – South Korea match, with a predicted probability of 77%.

More to come soon after the final 16 are in…

@FakeDonaldTrump – fun with Recurrent Neural Networks


Recurrent Neural Networks are useful for things like translation and natural language processing. I decided to give them a whirl on Donald Trump’s tweets, which I got from http://www.trumptwitterarchive.com, an archive that records every tweet Trump posts.

This particular implementation, heavily adapted from Martin Gorner’s Shakespeare generator, makes use of layered GRU RNN cells. Basically, the network feeds the output state of the RNN cells back in as the initial input state. For more information, check out his lecture here: https://www.youtube.com/watch?v=vq2nnJ4g6N0&t=107m25s

The gist of the code is that words are fed character by character, one-hot encoded, into the RNN, and the training target is the input shifted right by one character. E.g. when the input is “hello”, the target will be “ello!”, as h’s next predicted character is e, e’s next predicted character is l, and so on. The final output state of the network is fed back into the input state during training. Every 150 batches, we ask @FakeDonaldTrump to output 1000 characters. A minimal sketch of such a network follows; after that, let’s see what he says.
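
Here is a minimal sketch of such a character-level GRU network in Keras. It is not the Martin Gorner-derived code actually used (which feeds one-hot vectors and carries the output state across batches); tweets.txt is a hypothetical dump of the archive, and an Embedding layer stands in for the one-hot input.

import numpy as np
import tensorflow as tf

text = open("tweets.txt", encoding="utf-8").read()  # hypothetical tweet dump
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}

seq_len = 40
xs, ys = [], []
# The target is the input sequence shifted right by one character.
for i in range(0, len(text) - seq_len - 1, seq_len):
    xs.append([char2idx[c] for c in text[i:i + seq_len]])
    ys.append([char2idx[c] for c in text[i + 1:i + seq_len + 1]])
xs, ys = np.array(xs), np.array(ys)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(chars), 64),  # stands in for one-hot encoding
    tf.keras.layers.GRU(256, return_sequences=True),  # layered GRU cells
    tf.keras.layers.GRU(256, return_sequences=True),
    tf.keras.layers.Dense(len(chars), activation="softmax"),  # next-char distribution
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(xs, ys, batch_size=128, epochs=10)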

ogooateefouA4 7o7+eeitnnoto ou7add4o4 e :eo de ot:etaaon alro til ia4 atitil4a   oter:drd i tootetttie&+oa e+el+a 7: oeteie: i7e++it &  e::::at :d tdodd &i&e l iea se t o na + +ei+at++deioa a0:att+e t+to+ o +t7taodd tiod tiottui &tid::&i:4ltle &llt&&+oatiit   tlea +osoi+tolt:+ie7 e & +e+ tdi&ae&  dtt t+s & 7ao ai ss  -tt &t4&soe1o tt es t:ta4i+ia att 1 do dd+to&tlol e&:oaaa4ila  4+raoai rliol e te rrilnttoal++oieio 0oatt:0+:111o it4o aa ti:d eoeieer+rit e&oa ai& ol  oo&4a idd+lae eottl ioet++ol+lt7ato :ei7a40ee01satoi1aaoe oo t:eto4t:tto  :tee  oeooid taelad   e i  +  +  l arit&i otole& +&a llal    e &+&+a ttte  r te  itet&dd&o&eida &latiel+tae o il1 +:&eia:__ _ noee :+  no aeo d++o tl noeo: oot: tau::a4ai oeo o roaeeiteie 4oa +laotit 4iit4+a+:i:1tFat+a:  ot:+:--ot: eto aFdoe:oieteoiiao tiiea  o++oit iteo ieti ao+oaa ii+t::i:i:aaoeoo-iei :t: +o+: :ie:a ee ad&  eed  eaioaien&iiol oel ei&iit&at ++o &&a eio +tt:  d+ i :oi&lal i:ll   + ll:ennenetoi  eaaata    aito+ dd daolatli& r+aa a

Eh, he’s not making sense at the beginning. It’s fine, he just needs some more training.

.AT.C.S. Norle groeg tho gleat on of the fithry to @Mothampr whttp://t.co/AuzQA5558 https://t.co/K275870327,00-15-2012 17:54:50
Amenacand of @Benalyorert Trump on Thampso,09-22-2018 16:35:43
 shtice in e lots ane way and then's to bar hame it toer far of tha ar ort lost nnws the wom a ament of Indad. Chard you show the frinad tonat calfice in expansionstanted stank stary-wank.,01-10-2013 08:27:00
@Bumcranal @novere wst @BanckObama on helos ove fondents!,00-18-2012 17:50:41
Mutthe a beanth the semply thank dost fur toteeting wo lle fice on @Nontary hon as enthey to at a tama yon Chinct rock a toree frolly one we time reald inerss it the fest ant wome of to got the ass well ne are wouk to griese in asting shor a lakge- if cridice and tomeela gis and thit at e firnor spech ruplitictanith..02-19-2012 10:53:26
@Donarca lued @Momincins stro groat of the @crempecand weakt at as that weth and betight at estere ar thisedort faigs.,10-08-2013 18:48:16
That stor the Coolle cone the wase to selice

Still learning English.

aing and the U.S. and so make at the Unitinate Apprentice.,03-03-2012 21:39:38
It is a great confidence and the people and the people and the problem is a live this is a great comment on the people and a look of the world be the problem in the world will be.,01-08-2013 20:39:57
I am a stating the world be a great campaign is a great political controtical country in tonorrow.,08-09-2013 18:49:58
Thank you @CNN is allow many precest of @MittRomney will be in the probecome to state the people on the Universe at 7:00 A.M. A great perpon.,10-18-2013 20:39:57
The provise of the person is the protest of the U.S. Americans will be in the U.S. Americans are now. They want to see working the probection. It will be interviewed by the problems in the people. http://t.co/Rudds0R,08-08-2012 18:55:40
Thank you @MarickObama and the people who will be an and a for me to the problems in the president is a great people and the U.S. and to the president in their start the president of the people and stron

At epoch 2, he’s talking to @MarickObama – is there such a person?

anis and the Democrats are always all the statement.,08-15-2013 18:18:48
I will be interviewed on @foxandfriends and @Mike_Pence and @TheBrade is a total mest on @FoxNews.,05-05-2013 15:55:15
@AlexSalmond @TrumpTowerNY "@realDonaldTrump has an explain to hear @MittRomney. http://t.co/5656hhkk http://t.co/khookkCA Trump Ister and his secord http://t.co/5AAkgkgk "@realDonaldTrump @TrumpTorerto,12-05-2013 20:44:58
I would be a friend @Mike_Pence in the world on the world is not a great hotol on @foxandfriends at 8 PM.,02-19-2013 18:58:18
@TruckAddresti Thanks.,01-25-2013 21:48:48
@macheller Thanks.,01-25-2013 21:36:46
@Roberten @TrumpTowerNY. Thank you for yourself!,02-16-2013 18:56:15
@TrumpTowerNY Thank you for the world of you think you.,08-26-2013 18:46:58
I would be a great guy and the worst than the world is a good last week and the same thing is a good lack. I would have always all talking about the world is.

He’s learning the Twitter structure pretty fast! Look, he has an interview on @foxandfriends.

enticky. http://t.co/kkqyy88I,01-03-2014 21:37:34
If you dont think you cant can be a complete disaster as a wonderful person who will be the worst income back to the U.S.A. I was allowed to be a good start. http://t.co/kkyL855h,02-05-2012 21:33:33
To be an amazing people to take a big decision. I hope you have to love what you're doing.,12-16-2012 14:32:22
@thericaster Thanks Cark. Thanks Can!,02-05-2013 18:42:58
@michaellerlan Thanks Carly.,02-06-2013 18:54:58
@marklevinshow Thank and thanks.,02-25-2013 18:34:37
@Markleyerton Thanks Jay.,02-06-2013 21:00:37
@marismandaris Thanks.,11-05-2012 19:04:55
@MarkBurnert Thanks.,01-05-2013 18:52:56
@MichaelLerrine1 Thanks Carly.,02-05-2013 19:34:58
@MikeAlaysJo Thanks Cark.,02-06-2012 18:36:57
@MarkBarterime Thanks Can.,01-06-2012 21:30:03
The example if the problem is the worst star in the world. The best way to do it and thinking with a bill be personable for the possibilety. The border is an attack to the U.S. and they are n

See that he has learnt to use the @ to reference people, and is hallucinating new ones – there is no @MarkBarterime on Twitter! He says thanks a lot… hmmm.

end is total listen for the U.S. and to the U.S. http://t.co/sW0L00M0 They are now a part of the people without them any consumer.,09-16-2013 21:16:02
@James_Comeye Thanks Jamil.,02-25-2013 18:55:55
@Republicans That's good luck.,02-27-2013 13:58:45
@MattyNower That's why you will not be a great guy.,02-27-2013 15:55:15
@Repablie Thank you for your nice words on the great honor! http://t.co/zoonnnzV,02-13-2013 20:18:02
@stanning Thanks James.,02-27-2013 21:58:57
@sandon Thanks Jamie. 12-20-2013 20:26:15
@Janieler @CelebApprentice True!,01-10-2013 15:55:57
@JonesPaters Thanks.,01-10-2013 15:15:55
It s Tuesday. How many million dellars are staying to take a disaster.,04-25-2013 20:56:55
@Joan_Rivers @TrumPTowerNY Thanks Jan!,01-28-2013 15:18:15
@Jaces_Comment Thanks.,12-25-2013 18:57:57
@Taniello Thanks Scott.,01-20-2013 21:13:55
 I disn't think you have to be a country in them in your business and the best people will be the bisgest season. - Think Big,01-18-2013 21:28:55

Look, he’s started to use http:// internet addresses! And there are a few recognizable celebrities here.

 is the beginning they were great!,07-17-2013 19:16:22
@BarackObama is a campaign. He will be a terrible job as a second situation for tax contril.,10-13-2013 21:21:22
@Brandopero True!,01-10-2013 19:31:17
@Matterney2011 Thanks Jeff!,03-20-2013 19:21:17
@Brentardaney The Art of the Deal is great!,02-03-2013 20:33:37
@sarlynewa @TraceAdkins Thanks.,07-17-2013 15:39:39
@maraldashard Thanks Jeff!,07-23-2013 19:21:29
@BarackObama has a complete disaster and that will be the biggest state of the U.S.A.,09-27-2013 11:19:22
I am self funding a big day for an amazing people. I have a great time in Iowa. He is a good start. I am a total loser!,12-06-2014 21:29:22
I am in New Hampshire this morning. The fact that the United States is to all with the military. I will be there for a great honor to help it.,07-29-2013 16:17:36
I am self funning for the @WSJ contestant of the @WhiteHouse the race to see that his speech are so latghing at a lot of millions of dollars.,12-23-2013 19:47:26
The perve

He doesn’t have nice things to say about Obama… associating him with words like “terrible” and “disaster”, not unlike the real guy. Yeah, of course his book “The Art of the Deal is great!”… wait, did he say he is a total loser?

It was a fun exercise, and really added a lot to my understanding of RNNs. Hope to get the time to do more of this!

All code and data can be found here: https://github.com/lppier/Recurrent_Neural_Networks

A Gentle Guide to Recommender Systems with Surprise


Recommender systems are useful for recommending items to users based on their past preferences. Broadly, recommender systems can be split into content-based and collaborative-filtering types.

Content-based recommendations: Recommend items to users based on their past buying records/ratings. One way to do this is to build a predictive model on a table of, say, characteristics of items bought by the user, then run through a list of new items and try to predict whether the user would buy each one. This can be done with typical binary classification supervised learning methods like logistic regression.

The disadvantage of this method is that there is no serendipity – the items recommended tend to be those you already know you want. There is no “hey, how did the system know I wanted this?” moment. Which brings us to…

Collaborative-Filtering Types: Match users to people with similar tastes. Users with similar tastes are put in a “basket” algorithmically, and recommendations are based on what those users like as a whole. There are three approaches: user-user collaborative filtering, item-item collaborative filtering and matrix factorization.

We will concentrate on collaborative filtering for the purposes of this article. We will use the Surprise Python package (http://surpriselib.com/), an excellent open-source library by Nicolas Hug that implements most of the fundamental algorithms, along with its built-in MovieLens dataset.

User-user Collaborative Filtering

\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}

where N_i^k(u) is the set of the k nearest neighbours of user u who have rated item i.

Things to note:

  • The output is the prediction of user u’s rating on item i.
  • We utilize the similarity measure between user u and user v in this case.

Other than that, the algorithm is already coded for you in Surprise.

http://surprise.readthedocs.io/en/stable/knn_inspired.html
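
A minimal usage sketch with Surprise’s built-in MovieLens 100k dataset (the similarity settings are illustrative):

from surprise import Dataset, KNNBasic, accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin("ml-100k")  # built-in MovieLens dataset
trainset, testset = train_test_split(data, test_size=0.2)

# user_based=True selects user-user similarity.
algo = KNNBasic(sim_options={"name": "cosine", "user_based": True})
algo.fit(trainset)
accuracy.rmse(algo.test(testset))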

Item-Item Collaborative Filtering

This means that instead of using user similarity, we use an item similarity measure to calculate the prediction.

\hat{r}_{ui} = \frac{\sum_{j \in N_u^k(i)} \text{sim}(i, j) \cdot r_{uj}}{\sum_{j \in N_u^k(i)} \text{sim}(i, j)}

where N_u^k(i) is the set of the k nearest neighbours of item i that user u has rated.

Note that similarity in the above equation is now between item i and item j, instead of user u and v as before.

Advantages of item-based filtering over user-based filtering:

  1. Scales Better: User-based filtering does not scale well, as users’ likes and interests may change frequently, which means the recommendations need to be re-trained frequently; item similarities are comparatively stable.
  2. Computationally Cheaper: In many cases there are far more users than items, so it makes sense to compute similarities over items instead.
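
In Surprise, switching from user-based to item-based filtering is a single option on the same kNN algorithms:

from surprise import KNNBasic

# user_based=False switches the similarity computation from users to items.
algo = KNNBasic(sim_options={"name": "cosine", "user_based": False})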

A famous example of item-based filtering is Amazon’s recommendation engine.

https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf

Matrix Factorization

While user-based or item-based collaborative filtering methods are simple and intuitive, matrix factorization techniques are usually more effective because they allow us to discover the latent features underlying the interactions between users and items – features we don’t observe directly. The famous singular value decomposition (SVD) approach shown here uses gradient descent to minimize the squared error between predicted and actual ratings, eventually arriving at the best model.

(Reference: math420-UPS-spring-2014-gower-netflix-SVD.pdf)

In Surprise’s SVD, the predicted rating is

\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u

and training minimizes the regularized squared error

\sum_{r_{ui} \in R_{\text{train}}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left( b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)

by stochastic gradient descent, where p_u and q_i are the latent user and item factors, and b_u and b_i are the user and item biases.

Again, Surprise has done the hard plumbing for you, and all that is needed is to utilize the SVD() class.

See http://surprise.readthedocs.io/en/stable/matrix_factorization.html for more information.
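
A sketch of the tuning step with Surprise’s GridSearchCV (the grid values are illustrative):

from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin("ml-100k")

# Brute-force search over a small hyper-parameter grid, scored by 5-fold CV.
param_grid = {"n_epochs": [20, 30], "lr_all": [0.002, 0.005], "reg_all": [0.02, 0.1]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=5)
gs.fit(data)

print(gs.best_score["rmse"], gs.best_params["rmse"])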

In the above code, we use GridSearchCV to do a brute-force search for the best hyper-parameters for the SVD algorithm. After cross-validating that these are indeed the best values, we use them to train on the training set.

Eventually, we evaluate the model on the test set.

Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MAE (testset)     0.7191  0.7158  0.7138  0.7166  0.7254  0.7181  0.0040  
RMSE (testset)    0.9173  0.9162  0.9125  0.9179  0.9271  0.9182  0.0048  
Fit time          1.04    1.11    1.03    1.01    0.90    1.02    0.07    
Test time         2.08    2.06    2.06    2.05    2.06    2.06    0.01    
{'test_mae': array([0.71909501, 0.71579784, 0.71384567, 0.71656901, 0.72541331]), 'fit_time': (1.0419056415557861, 1.1055827140808105, 1.0349535942077637, 1.01346755027771, 0.9016950130462646), 'test_rmse': array([0.91726677, 0.91622589, 0.91245293, 0.91793969, 0.92711428]), 'test_time': (2.0783369541168213, 2.0616250038146973, 2.0627968311309814, 2.049354314804077, 2.058506727218628)}

SVD: Test Set
RMSE: 0.9000

All the code from this article can be found here:

https://github.com/lppier/Recommender_Systems

Tuning Hyper-Parameters in R Using MLR


Recently, I was trying to tune hyper-parameters for support vector machines using the ksvm function. As ksvm is the default utility used by rattle for support vector machines, and I did most of my exploratory work in rattle, I needed something that worked with existing rattle code.

The mlr library in R contains a utility that allows you to do this. In this gist are examples for Support Vector Machines and Random Forests.

If you are using some of the more generic models in R, you may be able to use the following utility, tune.

https://www.rdocumentation.org/packages/e1071/versions/1.6-8/topics/tune

Update: Yet another alternative method to do grid-search in R is to use the caret library.