The story of SportsXML

With the NFL season just around the corner, I’m feeling a little strange not being bogged down with the usual statsfeed preparation work (I guess there is a plus side to shutting down a project). In any case, I thought I might take a second to dump out a few ‘insider’ details about some of the things I was doing inside of statsfeed…just in case someone wants to pick up the ball on their own and start to run with it.

A big part of offering up a sports statistics service is dealing with accuracy…and in my opinion the best way to deal with accuracy is to do as much comparison between sources as possible. In a perfect world, you would have a computer program that pulled in statistics from a variety of sources, compared them, and then made an educated decision on what values to go with. This is essentially what I have attempted to build into the back end of statsfeed.com over the past 7 years. Since I don’t actually live in a perfect world (and probably never will), the system wasn’t 100% automatic or accurate. But it did handle a large amount of the grunt work…leaving just a simple report for an actual human (I never was able to remove that human component 100%) to review and approve (some day AI is going to be able to complete this system for me).

But I’m getting ahead of myself here. There were a handful of technical challenges that I was able to solve, and those are some of the things I think it’s time to share. The first challenge is of course getting the stats from the various sources…while this challenge was pretty interesting and took up a good chunk of the first 4+ years of developing/tweaking the system, I’m not going to talk about it right now (let me just say it’s a heavy dose of scraping and custom API implementations, mostly done in Java). Tonight, I would rather touch a little on the second part of the challenge…comparing the various stats. While it sounds like a fairly simple concept, it’s actually a very involved technical challenge.

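
To make the "compare, then decide" idea concrete, here is a minimal Python sketch of the naive version. It is not the statsfeed implementation (which, per the post, was mostly Java); the strict-majority rule, the function name reconcile, the source name third_source, and the sample numbers are all assumptions made for illustration.

```python
from collections import Counter

def reconcile(stat_by_source):
    """Pick a value for one stat from several sources.

    stat_by_source maps a source name to the value that source reported.
    Returns (value, needs_review). The strict-majority rule is an
    assumption made for this sketch, not the rule statsfeed used.
    """
    counts = Counter(stat_by_source.values())
    value, votes = counts.most_common(1)[0]
    # Accept automatically only when more than half of the sources agree;
    # anything else goes on the report for a human to review.
    needs_review = votes * 2 <= len(stat_by_source)
    return value, needs_review

# Hypothetical rushing-yards values for one player from three sources.
reports = {"sportsline": 117, "foxsports": 117, "third_source": 121}
value, needs_review = reconcile(reports)
print(value, needs_review)
```

With only two sources that disagree, a rule like this can never auto-accept, which is one reason a human review report stays in the loop.
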
In a general sense, we first need to get everything into the same format and order. Then we need to do as much matching as possible with some basic assumptions…and we have to do this in as close to real time as possible.

Without going into too many details, let me give you a quick example that highlights the problem. Let’s say that you are comparing stats for the Falcons vs. Packers game between sportsline and foxsports. Now let’s say that Michael Turner scored two touchdowns in the first quarter…and each one went for 24 yards. Are you starting to see the problem? Assuming we have a way to actually know that M. Turner on sportsline is the same as Michael Turner on foxsports (a challenge/solution I may talk about another day), how do we make the computer know that there really were two different touchdowns scored (and that it’s not just a duplication problem)? Or, on the flip side, if there is a duplication problem, how do we make sure the computer knows that? Trust me when I tell you, thinking about this stuff for too long can really hurt your head (and it is why, even after 7 years of providing the service, I have problems, fixes, and tweaks to make during almost every game).

Anyway, let me get to my solution for the night…in order to compare apples to apples, we first have to make sure everything is an apple. So what I did was invent my own dead simple, but I think extremely flexible, Sports XML format (and I was lucky enough to also nab the sportsxml.com domain name with the plan of eventually open sourcing it…I guess this is finally step one in that process, no?). Every feed that I pull into the system first gets translated to my SportsXML format. Below are the basic DTDs for that format:

Team DTD

<!ELEMENT sportsxml ( content ) >
<!ELEMENT content ( data+, player+ ) >
<!ELEMENT data ( #PCDATA ) >
<!ATTLIST data category NMTOKEN #REQUIRED >
<!ELEMENT player ( data+ ) >
<!ATTLIST player id NMTOKEN #REQUIRED >

Schedule DTD

<!ELEMENT sportsxml ( schedule+ ) >
<!ELEMENT schedule ( game+ ) >
<!ATTLIST schedule weeknumber NMTOKEN #REQUIRED >
<!ELEMENT game ( data+ ) >
<!ATTLIST game id NMTOKEN #REQUIRED >
<!ELEMENT data ( #PCDATA ) >
<!ATTLIST data category ( awayscore | awayteam | gameday | gametime | homescore | hometeam | weekday ) #REQUIRED >
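
As an illustration of how a scraped schedule might be translated into this shape, here is a small Python sketch using the standard library's ElementTree. The function name build_schedule, the game id g1, and the sample values (including which team is home) are hypothetical; only the element names, attribute names, and category list come from the DTD above.

```python
import xml.etree.ElementTree as ET

# Category names copied from the Schedule DTD's enumerated attribute list.
SCHEDULE_CATEGORIES = {
    "awayscore", "awayteam", "gameday", "gametime",
    "homescore", "hometeam", "weekday",
}

def build_schedule(weeknumber, games):
    """Build a <sportsxml> schedule tree from {game_id: {category: value}}."""
    root = ET.Element("sportsxml")
    schedule = ET.SubElement(root, "schedule", weeknumber=str(weeknumber))
    for game_id, fields in games.items():
        game = ET.SubElement(schedule, "game", id=game_id)
        for category, value in fields.items():
            # Reject anything the DTD's enumeration would not accept.
            if category not in SCHEDULE_CATEGORIES:
                raise ValueError(f"category not allowed by the DTD: {category}")
            ET.SubElement(game, "data", category=category).text = str(value)
    return root

# Hypothetical sample data for a single game.
games = {"g1": {"awayteam": "Falcons", "hometeam": "Packers", "weekday": "Sunday"}}
doc = build_schedule(1, games)
print(ET.tostring(doc, encoding="unicode"))
```

The category check only mirrors the DTD's enumeration; it is not a substitute for validating against the DTD itself, which the Python standard library does not do.
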

Boxscore DTD

<!ELEMENT sportsxml ( content, game ) >
<!ELEMENT content ( data+ ) >
<!ELEMENT data ( #PCDATA ) >
<!ATTLIST data category NMTOKEN #REQUIRED >
<!ELEMENT game ( data+, player+ ) >
<!ATTLIST game gamename NMTOKEN #REQUIRED >
<!ATTLIST game gamestatus NMTOKEN #REQUIRED >
<!ELEMENT player ( data+ ) >
<!ATTLIST player playerid NMTOKEN #REQUIRED >
<!ATTLIST player teamid NMTOKEN #FIXED "17" >
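
To show what reading a boxscore in this format might look like, here is a short Python sketch. The sample document, the player id p33, and the category names (source, quarter, rushtd) are made up for illustration, since the Boxscore DTD leaves category as an open NMTOKEN. Data items are kept as a list rather than a dictionary so that repeated entries, such as two touchdowns of the same length, are not silently collapsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical boxscore document shaped to fit the Boxscore DTD.
BOXSCORE_XML = """
<sportsxml>
  <content>
    <data category="source">example</data>
  </content>
  <game gamename="ATL-GB" gamestatus="final">
    <data category="quarter">4</data>
    <player playerid="p33" teamid="17">
      <data category="rushtd">24</data>
      <data category="rushtd">24</data>
    </player>
  </game>
</sportsxml>
"""

def player_stats(xml_text):
    """Return {playerid: [(category, value), ...]} preserving repeats."""
    root = ET.fromstring(xml_text)
    stats = {}
    for player in root.findall("game/player"):
        items = [(d.get("category"), d.text) for d in player.findall("data")]
        stats[player.get("playerid")] = items
    return stats

print(player_stats(BOXSCORE_XML))
```
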

Roster DTD

<!ELEMENT sportsxml ( roster+ ) >
<!ELEMENT roster ( player+ ) >
<!ATTLIST roster team CDATA #REQUIRED >
<!ELEMENT player ( data+ ) >
<!ATTLIST player id NMTOKEN #REQUIRED >
<!ELEMENT data ( #PCDATA ) >
<!ATTLIST data category ( playername | position ) #REQUIRED >
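
A roster in this format lends itself to a simple id-based lookup table. The Python sketch below shows one way to load it; the function name load_roster, the player id p33, and the sample roster are assumptions for illustration. It does not attempt the harder name-matching problem (M. Turner vs. Michael Turner) mentioned earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical roster document shaped to fit the Roster DTD.
ROSTER_XML = """
<sportsxml>
  <roster team="Atlanta Falcons">
    <player id="p33">
      <data category="playername">Michael Turner</data>
      <data category="position">RB</data>
    </player>
  </roster>
</sportsxml>
"""

def load_roster(xml_text):
    """Return {player_id: (playername, position, team)} for every roster."""
    root = ET.fromstring(xml_text)
    lookup = {}
    for roster in root.findall("roster"):
        team = roster.get("team")
        for player in roster.findall("player"):
            fields = {d.get("category"): d.text for d in player.findall("data")}
            lookup[player.get("id")] = (
                fields.get("playername"),
                fields.get("position"),
                team,
            )
    return lookup

print(load_roster(ROSTER_XML))
```
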

As you can hopefully grok from these DTDs, there really isn’t much to the SportsXML format. My idea was that, since ultimately you are triggering actions off of the data anyway, I didn’t want the markup to be too complex or get in the way of the important stuff (the data). In any case, the format worked for my needs…and I think it is flexible enough to handle a lot of other problems. But I’ll leave that up to others to actually determine.
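
Once every feed is in the same shape, the two-identical-touchdowns problem from earlier can at least be detected by comparing counts rather than sets. The Python sketch below is one possible approach, not the one statsfeed used; the event tuples, the player id p33, the category name rushtd, and the function name compare_events are hypothetical.

```python
from collections import Counter

def compare_events(events_a, events_b):
    """Compare two sources' events as multisets, so repeats are counted.

    Each event is a hashable tuple such as (playerid, category, value).
    """
    a, b = Counter(events_a), Counter(events_b)
    return {
        "matched": a & b,   # events both sources report, with shared counts
        "only_a": a - b,    # extra occurrences seen only in the first source
        "only_b": b - a,    # extra occurrences seen only in the second source
    }

# Hypothetical case: one source lists two 24-yard touchdowns, the other one.
sportsline = [("p33", "rushtd", "24"), ("p33", "rushtd", "24")]
foxsports = [("p33", "rushtd", "24")]
result = compare_events(sportsline, foxsports)
print(result["only_a"])
```

Note that this only reports that the counts differ. It cannot tell whether the first source duplicated a touchdown or the second source missed one, so a mismatch like this would still need another source or a human to settle it.
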

This is the personal blog of Kevin Marshall (a.k.a Falicon) where he often digs into side projects he's working on for digdownlabs.com and other random thoughts he's got on his mind.

Kevin has a day job as CTO of Veritonic and is spending nights & weekends hacking on Share Game Tape. You can also check out some of his open source code on GitHub or connect with him on Twitter @falicon or via email at kevin at falicon.com.

If you have comments, thoughts, or want to respond to something you see here I would encourage you to respond via a post on your own blog (and then let me know about the link via one of the routes mentioned above).