As I have been working on my 'Almost Live' projects, one of the biggest challenges I face is how to separate good past performance data from bad past performance data. It's pretty easy for me to look at a running line and say, "That is bad because the horse stumbled coming out of the gate". Unfortunately, there are lots of different factors that can weigh in and make a race bad: poor starts, being bumped, running wide, a muddy track, and so on. I was not looking forward to the daunting task of writing endless if and switch statements to apply some general rules. Then, almost out of the blue, I remembered reading somewhere (still can't remember where) a statement made in an article that went something like this, "Google uses Bayesian filters like Microsoft uses if statements." Bingo! I had my solution. If you are not familiar with Bayesian filters, read on, since you will see that they are easy to create (simple ones at least) and easy to use.
Google does indeed use Bayesian filters a lot, as do a lot of other software companies. The easiest way to explain a Bayesian filter, and fortunately a way that closely mimics what I want to use a Bayesian filter for, is in the area of spam email. It is very easy for a person to recognize most spam, but it's a lot harder for a computer. Bayesian filters provide a mathematical way to analyze a given email to determine if it is spam or not. The biggest drawback to the Bayesian filter approach is that you need a lot of historical data to use as part of the analysis to determine what makes an email message spam or not. If you don't have a lot of historical data, you can "train" your filter over time. The longer you train it, the more "spam aware" it will become and the better it will be at filtering spam. So let's take a look at how you would do this conceptually.
Let's say that we do not have any background data. As each new email comes in, we mark it as either a good email or a bad (spam) email. The simplest approach is to have the filter parse that email and build tables that represent good and bad emails. What is being parsed is the information (the words) in the email. You can also add other information such as the sender's email address, IP address, whatever, but it means you will have to parse and store more types of data. We will just stick to parsing the content of the email for now. If the first email is tagged by the human reader as a good message, the filter might construct some data like this based on the message "Hello Jeff, The dog bit the boy"...
| Word | Count |
| --- | --- |
| Hello | 1 |
| Jeff | 1 |
| The | 2 |
| dog | 1 |
| bit | 1 |
| boy | 1 |
Basically, each word is logged and the count of that word is incremented by one for each time it appears. This is a running total, so as more emails are marked as good, the table gets bigger and the word counts will start to vary considerably. With this data, we can now make determinations of how likely a given word is to appear in a good email. The exact same process is done for bad emails. So we now have a list of words that appear in good and bad emails. Of course, many words will appear in both types of emails, and your filter may do things like remove the most common duplicates or look for word pairings or groupings.
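To make the counting step concrete, here is a minimal sketch in Python (not the C# code I actually use, and not a full filter). The names `tokenize`, `train`, `good_counts` and `bad_counts` are made up for illustration, and I am assuming words are counted without regard to case, which is why "The" and "the" land in the same row of the table above.

```python
import re
from collections import Counter

# One running table per category (hypothetical names, for illustration only).
good_counts = Counter()
bad_counts = Counter()

def tokenize(text):
    """Split a message into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def train(text, is_good):
    """Add the words of one human-tagged message to the matching table."""
    table = good_counts if is_good else bad_counts
    table.update(tokenize(text))

# The first email, tagged as good by the human reader.
train("Hello Jeff, The dog bit the boy", is_good=True)

for word, count in good_counts.items():
    print(word, count)
```

Every later call to `train` simply adds to the same running totals, so the tables grow as more messages are tagged.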
So what do we do now that we have a list of good email words and bad email words? Bayesian filters are called Bayesian because they use Bayes' Theorem. Without going into a lot of detail, the simplest explanation is that the probability that an email is spam, given the words it contains, is equal to the probability of those words appearing in a spam email times the probability that any particular email is spam, divided by the probability of those words appearing in any kind of email. So it looks something like this:
P(is spam, given its words) = P(words in spam) x P(is spam) / P(words in any email)
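Plugging some numbers into that formula may help. The sketch below is Python and the numbers are invented purely for the arithmetic: suppose a word shows up in 40% of spam and 10% of good email, and half of all email is spam. The function name `prob_spam_given_word` is hypothetical. The bottom of the fraction, P(words in any email), is worked out by adding up the spam and non-spam cases.

```python
def prob_spam_given_word(p_word_in_spam, p_word_in_good, p_spam):
    """Bayes' Theorem for a single word (illustrative numbers only)."""
    p_good = 1.0 - p_spam
    # P(word in any email): the word can arrive in a spam or a good message.
    p_word = p_word_in_spam * p_spam + p_word_in_good * p_good
    return p_word_in_spam * p_spam / p_word

result = prob_spam_given_word(p_word_in_spam=0.4, p_word_in_good=0.1, p_spam=0.5)
print(f"{result:.2f}")  # 0.40 * 0.50 / 0.25 -> prints 0.80
```

So with these made-up inputs, seeing that one word moves the chance of spam from 50% to 80%.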
Easy, right? ;-) I won't go into the actual implementation in this post. I'll save that for next week when I post a screencast on building the filter itself. I will tell you now that the solution works great. I will show you how I implemented both the filter and the filter trainer using C#. The results have been very impressive so far, with the filter correctly handling well over 90% of the races. A few slip by here or there, but it is much, much better than anything I could have come up with myself.
How did I use this with past race data? Basically, I am word parsing the past performance comment line (this is a word description of how the horse did in that particular race and will call out things like stumbles, bumps, wide trips, etc.) and then I also parse data on how the horse did at different positions in the race. Was he way in the back, did he lose a lot of ground, etc.? I also take note of track conditions. Again, you will get a better feel for it next week.
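Until then, here is a rough idea of the shape such a filter could take for comment lines alone. This is a Python sketch of a textbook naive Bayes classifier with add-one smoothing, not my C# implementation, and it leaves out the position and track condition data entirely. The class name `CommentFilter`, its methods, and the four training comments are all invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(comment):
    """Split a comment line into lowercase words."""
    return re.findall(r"[a-z]+", comment.lower())

class CommentFilter:
    """Hypothetical naive Bayes filter over comment-line words."""

    def __init__(self):
        self.counts = {"good": Counter(), "bad": Counter()}
        self.examples = {"good": 0, "bad": 0}

    def train(self, comment, label):
        """Record one human-tagged comment line ('good' or 'bad')."""
        self.counts[label].update(tokenize(comment))
        self.examples[label] += 1

    def prob_bad(self, comment):
        """Estimate the probability that a comment describes a bad race."""
        vocab = set(self.counts["good"]) | set(self.counts["bad"])
        total_examples = sum(self.examples.values())
        log_score = {}
        for label in ("good", "bad"):
            # Smoothed prior: share of training lines carrying this label.
            log_p = math.log((self.examples[label] + 1) / (total_examples + 2))
            total_words = sum(self.counts[label].values())
            for word in tokenize(comment):
                # Add-one smoothing so an unseen word never gives zero.
                p = (self.counts[label][word] + 1) / (total_words + len(vocab) + 1)
                log_p += math.log(p)
            log_score[label] = log_p
        # Turn the two log scores back into a probability.
        top = max(log_score.values())
        weight = {k: math.exp(v - top) for k, v in log_score.items()}
        return weight["bad"] / (weight["good"] + weight["bad"])

race_filter = CommentFilter()
# Made-up comment lines, tagged by hand.
race_filter.train("stumbled start, bumped", "bad")
race_filter.train("wide trip, no rally", "bad")
race_filter.train("stalked pace, drew clear", "good")
race_filter.train("led throughout, driving", "good")

print(race_filter.prob_bad("bumped start, wide") > 0.5)    # leans bad
print(race_filter.prob_bad("drew clear, driving") > 0.5)   # leans good
```

Working in logarithms keeps the multiplication of many small probabilities from underflowing, and the smoothing means a word the filter has never seen nudges the score rather than wiping it out.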
All in all, it is a very interesting and useful exercise and has a fair amount of applicability. If you are working with data and you need to filter or categorize it, Bayesian filters may be for you.