Modeling Tennis: Basic Philosophy

For many of us in the handicapping community, late January means transitioning from NFL betting to our other passions. Surely there are folks adjusting their focus to college ball, NBA, golf or prepping for MLB, but for me this time of year means ramping up the tennis handicap. With daily matches worldwide from New Year’s through Thanksgiving, tennis betting is a degenerate’s delight. The first Grand Slam of the season, the Australian Open, just kicked off in mid-January and the final slam, the US Open, ends in early September, which makes tennis season a perfect complement to the NFL if you like to keep your focus narrowed down to just one or two sports at a time. This article addresses some of the basic elements of tennis modeling strategy through some simple questions and answers to help you prepare for match betting in the 2019 season.

I’ve never played tennis or even watched it. Is it possible to overcome that?

Absolutely. It takes surprisingly little time to learn the rules and basic info about the players that matter, plus experience comes quickly when you follow the sport on a daily basis. Furthermore, if you understand the basics of probability theory and analytical modeling for sports betting, you’ll find that tennis is almost uniquely suited to a data-based approach. The evaluation requires fewer assumptions and carries fundamentally smaller uncertainty, since there are only two players involved in the match and their strengths and weaknesses are well represented by their past results.


Of course if you find a passion for the sport through betting on match play, you will learn the ins-and-outs of the game along the way, but by no means do you need to know what a baseliner, a dagger, a tweener or a semi-western two-handed backhand grip is to be a successful tennis bettor.

Where do I go to get data and what am I even looking for?

There are two important databases you need: a player database and a database of past results. A third database with historical odds is also helpful if you want to regress to market prices, but it’s not mandatory. For the first two, the great Jeff Sackmann has compiled the most complete and useful player and match database at Tennis Abstract; the site is extremely useful and the underlying raw data libraries are available for download on his GitHub page.

For player data it’s useful to know age, rank, handedness, height and country of origin. For match data we care about opponent, scoreline, date, tournament location, tournament level (e.g. Slam, Masters, 500, 250 or Challenger) and, if you want to get fancy, hold/break percentages. (Note: a hold is when a player wins their own service game and a break is when they win a game on their opponent’s serve. Combined hold-break percentage is an excellent indicator of skill level and recent form.)
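As a rough illustration of the hold/break note above, here is a minimal sketch of how the combined figure could be computed. The `ServeReturnStats` container and its field names are hypothetical, not columns from any particular data source; map them onto whatever your match database provides.

```python
from dataclasses import dataclass


@dataclass
class ServeReturnStats:
    """Hypothetical per-player totals over some sample of matches."""
    service_games: int      # games in which the player served
    service_games_won: int  # service games the player held
    return_games: int       # games in which the opponent served
    return_games_won: int   # opponent service games the player broke


def hold_pct(stats: ServeReturnStats) -> float:
    """Percentage of own service games won."""
    return 100.0 * stats.service_games_won / stats.service_games


def break_pct(stats: ServeReturnStats) -> float:
    """Percentage of opponent service games won."""
    return 100.0 * stats.return_games_won / stats.return_games


def hold_break_total(stats: ServeReturnStats) -> float:
    """Combined hold + break percentage."""
    return hold_pct(stats) + break_pct(stats)


# Made-up numbers purely for illustration.
sample = ServeReturnStats(service_games=200, service_games_won=170,
                          return_games=200, return_games_won=50)
print(hold_pct(sample), break_pct(sample), hold_break_total(sample))
```

When service and return games are roughly equal in number, a combined total above 100 means the player is winning more than half of all games played, which is why the single number works as a quick skill and form indicator.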


Aside from Tennis Abstract, useful historical odds data can be scraped from odds-tracking sites (the example shown below tracked the moneyline prices for the recent RBA-Tsitsipas match). By far the best scorebug, including a snapshot of player performance info, is FlashScore, which has both an app and a web interface. The official ATP (men’s tour) and WTA (women’s tour) websites and apps are so laughably bad it’s surprising they can put on so many complex events throughout the world without screwing anything up.


Okay, I’ve got my data, now what?

Alright, let’s break down the important aspects for evaluating player strengths. Believe it or not, a player’s career win/loss record means a lot in terms of how good they are generally. Obviously the context of the wins and losses is important, but win percentage is great as a high-level starting point because virtually all match play is knockout-tournament style: you need to win to advance, so you have a self-sorting system. Wins at the main tour level mean more than wins at the Challenger level, and recent wins are more meaningful than wins from several years prior when evaluating a player’s current strength. The most common and effective way of converting these past results into a meaningful metric is to use an Elo-style rating system, ideally adjusting for strength of opponent (ranking is a good proxy) and using a weighted time window so that improvement or regression over the years is captured appropriately.
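To make the Elo idea concrete, here is a minimal sketch of a standard Elo update applied to match results in date order. The starting rating of 1500, the K-factor of 32 and the tournament-level multipliers are all assumptions chosen for illustration, not values from any published tennis model; tune them against your own data.

```python
from collections import defaultdict

START_RATING = 1500.0  # assumed rating for a player with no history
BASE_K = 32.0          # assumed update size per match

# Hypothetical multipliers so main-tour results move ratings more
# than Challenger results. These numbers are illustrative only.
LEVEL_WEIGHT = {"Slam": 1.0, "Masters": 1.0, "500": 0.9,
                "250": 0.9, "Challenger": 0.75}


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo win probability for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_ratings(ratings, winner: str, loser: str, level: str = "250") -> float:
    """Apply one result in place and return the points exchanged."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = BASE_K * LEVEL_WEIGHT.get(level, 1.0) * (1.0 - exp_win)
    ratings[winner] += delta
    ratings[loser] -= delta
    return delta


# Made-up results, processed oldest first.
ratings = defaultdict(lambda: START_RATING)
results = [("Player A", "Player B", "Slam"),
           ("Player A", "Player C", "250"),
           ("Player C", "Player B", "Challenger")]
for winner, loser, level in results:
    update_ratings(ratings, winner, loser, level)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because the update depends on the pre-match expectation, beating a stronger opponent earns more points than beating a weaker one, and because results are processed in order, recent matches influence the current rating more than older ones. An explicit time-decay weight would be a further refinement not shown here.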

The critical next step, and something that you learn almost immediately when you start to cap tennis, is that the playing surface matters significantly. Tennis matches are played on hard courts, clay courts or grass and each surface has characteristics that influence the speed, spin and height of the ball off the bounce, all factors that help/hurt a player based on their strengths and weaknesses. Again a great way to figure out who plays well on clay is to look at career record splits for that specific surface. As you get more familiar with players on tour you quickly identify them as “clay-courters” or “serve-bots” or “all-around” players.
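The surface splits mentioned above are straightforward to compute once results are tagged by surface. The sketch below assumes each result is a hypothetical `(surface, won)` pair for a single player; the input shape is an illustration, not the layout of any specific dataset.

```python
from collections import defaultdict


def surface_splits(results):
    """Return {surface: win fraction} from (surface, won) pairs."""
    tally = defaultdict(lambda: [0, 0])  # surface -> [wins, matches]
    for surface, won in results:
        tally[surface][1] += 1
        if won:
            tally[surface][0] += 1
    return {surface: wins / played
            for surface, (wins, played) in tally.items()}


# Made-up career results for one player.
career = [("Clay", True), ("Clay", True), ("Clay", False),
          ("Hard", False), ("Grass", True), ("Hard", True)]
for surface, win_rate in surface_splits(career).items():
    print(f"{surface}: {win_rate:.0%}")
```

Splits built on only a handful of matches are noisy, so in practice you would likely want a minimum sample size or some blending with the overall rating before trusting them.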


So far so good, what else can the data tell us that needs to be modeled?

After coming up with a basic strength metric for a given player and then adjusting for the specific surface, useful information can be mined from past head-to-head results. Unlike in many other sports, trends in tennis do matter. If two players of relatively equal strength on a given surface have a lopsided head-to-head record, it’s probably not an accident. Winning a match at the highest level is almost as much mental as physical, and a path to victory has been shown to be repeatable, so a player with past success can have an opponent at a disadvantage before the match even starts. Head-to-head records against a given style, height, handedness and ranking tier can all be teased out of results and used to adjust the win probability for a given matchup. An example of the Djokovic/Nishikori h2h history is summarized below:


Beyond head-to-head, a player’s past success at a given tournament is often indicative of how they will perform. When players find success at a given tour stop, whether it’s the city itself, the micro-characteristics of the court/conditions, time of year or the confidence that comes with past success, patterns are apparent for almost every pro suggesting positive results year-over-year at certain locations.


Finally, fatigue and current form are both aspects that can be specifically accounted for in a model. Creative schemes for quantifying these effects have been published, but in summary the number of games played over a two-week window is negatively correlated with performance and can be used to account for fatigue. Current form is subjective in general, but victories over superior players and losses to inferior players can be quantified to adjust an Elo rating appropriately.
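The fatigue input described above can be built with a simple trailing-window count. This sketch assumes a hypothetical history of `(date, games_played)` pairs for one player and only produces the raw feature; how much to penalize a rating for a given workload is left open and would need to be fit against results.

```python
from datetime import date, timedelta


def games_in_window(history, match_day: date, window_days: int = 14) -> int:
    """Total games played in the window_days before match_day.

    history: iterable of (played_on, games) pairs, where games is the
    total number of games in that match. Matches on match_day itself
    are excluded.
    """
    start = match_day - timedelta(days=window_days)
    return sum(games for played_on, games in history
               if start <= played_on < match_day)


# Made-up schedule for one player.
history = [(date(2019, 1, 1), 30),
           (date(2019, 1, 10), 25),
           (date(2019, 1, 14), 40),
           (date(2019, 1, 20), 22)]
print(games_in_window(history, date(2019, 1, 20)))
```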

There have to be other key factors or intangibles to consider though, right?

Intangibles such as playing on a home court, recent retirements, motivation and injuries also need to be factored into a handicap in some way, but they are difficult to quantify. Believe it or not, motivation can often be teased out of social media: player Instagram posts, quotes during media availability and coaches’ statements can shed light on how a player is approaching a given event or portion of the calendar. Whether a player is jockeying for ranking points is also worth considering when framing a small or medium-sized tournament at the outset. Plenty of players enter a tournament strictly for a paycheck, especially the smaller tournaments, and it’s not always easy to see those low-motivation spots coming (presumably everyone except Nick Kyrgios wants to win a major, so usually no surprises there).


Alright I’m done, I’ve calculated my win probability for this match-up, now what?

So let’s say, for example, you calculate that Kei Nishikori has an 18% win probability vs Novak Djokovic in the Australian Open quarterfinals and he’s priced on the moneyline at +600, which has a break-even percentage of 14.3%. Despite the edge, betting on the upset is likely beyond most bettors’ risk tolerance; thankfully there are other ways to attack this market. A decent strategy when you identify value on any underdog is to test path-to-victory and game-script narratives that would support an upset, to see if there is any merit to backing a handicap (similar to taking the points on an underdog in football or the run line in baseball) or an over on the match. For most matches there are betting markets for both game handicaps and set handicaps, which may provide lower-risk options to support the value side. Similarly, an “over games” or “over sets” bet in a match is effectively backing the underdog to give his opponent a test while avoiding the potential loss if he comes up just short. In an effort to keep this basic, I’ll stop there and give you the opportunity to go through this decision-making process yourself and learn what works for you and what doesn’t. Finally, the example above is purely hypothetical: I actually calculate a 91% chance Djoker defeats Nishikori and have action on the correct score 3-0 at -110 odds (which I make a 58% likelihood) in their quarterfinal match.
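The price-versus-probability comparison above comes down to two small formulas: the break-even probability implied by an American moneyline, and the expected value of a bet given your own win probability. The function names below are hypothetical helpers written for this illustration.

```python
def breakeven_prob(american_odds: float) -> float:
    """Win probability at which a bet at these odds breaks even."""
    if american_odds > 0:
        return 100.0 / (american_odds + 100.0)
    return -american_odds / (-american_odds + 100.0)


def expected_value(win_prob: float, american_odds: float,
                   stake: float = 1.0) -> float:
    """Expected profit for the given stake at your estimated win_prob."""
    if american_odds > 0:
        profit_if_win = stake * american_odds / 100.0
    else:
        profit_if_win = stake * 100.0 / -american_odds
    return win_prob * profit_if_win - (1.0 - win_prob) * stake


# The two bets discussed in the text: (label, model probability, odds).
bets = [("Nishikori moneyline", 0.18, 600),
        ("Djokovic 3-0 correct score", 0.58, -110)]
for label, prob, odds in bets:
    print(f"{label}: break-even {breakeven_prob(odds):.1%}, "
          f"EV per unit {expected_value(prob, odds):+.3f}")
```

A bet has positive expected value whenever your modeled probability exceeds the break-even probability. The size of the gap, together with your tolerance for variance, is what should drive the choice between the moneyline and the lower-risk handicap or over markets.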

Best of luck trying your hand at tennis modeling. It’s well worth your time to explore this highly liquid market with dozens of matches per day, all year round, especially now that football is over. Plus it’s easier than beating the NBA.