In Part 2 we dive into some of the fundamental theory behind the Hardball Model that easily translates to other sports; we also explain some of the key equations and functions that we are using to make this as easy as possible to run and maintain. Be sure to catch up on Part 1 here for more background on this experiment. Note, this is coming from the perspective of a recreational bettor and reflects on lessons learned over the years by handicapping with the assistance of a statistical model.
Some info on my background… I like math, I use statistics and probability every day in my actual paying job (handicapping is my hobby) and I think Excel and spreadsheets in general are excellent tools for simple problem solving. I also know you don’t have to be an expert in math and statistics to build and run an analytical model for projecting game scores and win %tages.
A couple of key concepts that specifically apply to the Hardball Model and that can be extended to apply to sports in general:
- Think of a team performance as a percentile. What is a percentile? Well, pretty much a grade except we don’t grade on a curve, an average performance, or a “C” would be the 50th percentile, a “B” would be like the 84th percentile (also 1 standard deviation over the average), an “A” would be like the 95th percentile (also 2 standard deviations over the average) “D” and “F” performances are like the 16th and 5th percentiles, respectively. There is no sure way to know what kind of grade a team will earn going into a game, you can use situations and information about injuries and line-ups to make your best guess, but on any given day a team could give you their “A” or their “F” effort. This is how scores like 13–1 or 18–2 happen, the are rare but possible on any given night but they exist at the end of the spectrum (tail of the distribution) where you have one team gives you and “A” and their opponent plops an “F”.
- Over the course of the season a team and player will generally perform in a way that approximates a normal distribution. A normal distribution, often visualized as a bell curve and is simply another way of saying they will have good games and bad games but most will come in close to the average. This is often discussed in the context of NFL teams as the 10–3–3 rule where out of 16 games a team can be expected to perform well above average for 3 games, have 3 stinkers and play 10 about as you would expect. For the purpose of projecting scores and win %tages, it is important to be able to go back and forth between a %tage and where that puts you on the normal distribution curve. Thinking like this gives you lots of flexibility to apply adjustments because you can nudge the expectation up and down by using 0.5 or 1 standard deviation and then a 50th percentile expectation becomes an 84th percentile or and 84th percentile becomes a 95th. When you operate in this space you have the added bonus of the fact that there is no upper or lower bound where you run into issues where you will predict a negative score or a false %tage. Try it out… go 6 standard deviations over the 50th percentile you will be at 99.999% but never go over 100%. Similarly by using a continuous distribution to look up runs, you can never sample a result that is physically impossible.
- Define the major components on offense and defense that you need to predict the expected performance for a given team relative to the league average. If the league average performance for a given team is 4 runs, we want to deviate from that by assessing how much better Team A is on offense than the average team and how good is the pitcher the are facing relative to the average? You can use any metric you see fit to solve this part of the problem, there is no right answer and it doesn’t really matter because once you have the backbone of your algorithm established you can put lots of other knobs and adjustment factors as you want to account for the elements of batting and pitching you think are the most important to capture. For the example presented in Part 1, we were using some stats from baseball-reference.com to carry this out, as a result the projections in a lot of cases were pretty far away from the consensus lines, not necessarily biased but suggesting the true expectations for each teams weren’t very strong. I got immediate, excellent feedback from the readers to explore the advanced stats at fangraphs.com and after incorporating batting stats like wRC+ and pitching stats like xFIP, the projections immediately improved relative to the market at-large meaning that significantly less adjustment were needed post-facto.
- Checks and balances are the last important element you need to incorporate. We are constructing a web of look-ups and references that can get complicated so when in doubt add checks like calling up the team name again just to make sure you are getting what you think you are getting from the database. Also once you get some projected score, turn your knobs and make sure they are doing what you think they are doing. My favorite way to do this second part is to take a set of results, go back to the original projection and see how many standard deviations each team and pitcher had to be off their average by to produce the observed results. When you do this you’ll see, okay I bumped the offense by 1 standard deviation and their runs expected went from 5 to 7, check! I bumped the pitcher down 2 standard deviations and the opponents expected runs went up from 3 to 8. That sort of exercise will immediately give you a feel for how to apply the adjustments going forward and can always be used to back-calc performance of each team using the model.
There are lost of useful equations to develop an analytical model, this is far from an exhaustive list and I’ve found that the same functions work in excel and Google sheets so far:
lookup(a,A:A,B:B) This is by far the most important function to operate the Hardball Model. To keep things more well organized I have the model running on the main tab in the sheet and use lookup to pull in data from the other tabs. You lookup a value “a” in a column on another tab “A:A” and have it return the value from the same row in column “B:B”. I use this function to bring the days games from the schedule tab then use it to bring in the pitchers, the stats for the team and once we calc the expected percentile performance for a given team I use lookup to go to the runs distribution and bring back the expected runs.
if(a=””,””,do something) This is more of an OCD thing where I have live equations in all the cells on the sheet but if there is no game for that row, or if I haven’t updated the Pitchers or odds yet then I use the if statement to keep things blank. If you didn’t already know, double quotations is how you tell a spreadsheet to print a blank cell. It’s worth it in my opinion to write these statements in from the beginning to keep things tidy and reduce the noise. You don’t want to have a bunch of errors or erroneous projections jamming up your sheet. Especially if you are sharing it with collaborators or the community at large. If statements obviously play a very important role otherwise in carry out logical functions so best advice if to get comfortable using these no matter what because you’ll need them.
$A$2 You probably already know this but sometimes when you are writing an equation you will want to constantly reference a single cell or a given column and then copy/paste that all over the place. If you use $ in front of the column letter or row number then it will forever look in the same place, not relative to the cell you are working in. Using these regularly will help you work more efficiently and prevent mistakes and you already probably use the regularly so carrying on…
norm.dist(x,mean,standard_dev,cumulative) This is a helpful way to go back and forth between percentile and number of standard deviations. It’s my opinion that working entirely in the standard deviation space and then converting to percentile or runs scored at the last step is an easier way to go. If you’re not used to it, it’s hard to rationalize what a -2.5 or a 1.75 means…. 0 means average negative is below average and positive is above average and once you get the hang of operating in this space you may find it easier. For me it’s important because it is just as hard to go from 50 to 84 as it is to go from 84 to 95 and then to go from 95 to 99. It turns out all changes in percentile are not equal but it’s just as easy to go from 0 to 1 as it is to go from 1 to 2 and then 2 to 3 and that makes more sense intuitively if you are working in terms of standard deviation.
Log5 This is a beauty of an equation to convert a performance statistic in to a winning percentage. I know not everyone agrees on it’s usefulness but until I find a better way I don’t know what else to use. Considering it was developed FOR BASEBALL specifically it seems like a good place to start for the hardball model.
=1/(IF(ODDS<1,-100/ODDS+1,ODDS/100+1)) This is the equation you need to convert US odds to an implied winning percentage. If you get nothing else from this post take this equation and use it to convert a price you see at your sportsbook to know what the expected win percentage is you need to beat in order to have an edge.
In part 3 we’ll go into probably the most important aspect of all of this which is what the hell to do with the projections. As noted in Part 1, it’s not rational to use a model and bet blindly, it will almost certainly get you in trouble. Even the best model results are a fundamental starting point and you need to have some strategy in mind to determine what you should do once you have this tool in hand.
Thanks for reading and Best-of-Luck!