Working with Events in SQL Server (Pt. 2 – SQL for Events)
In this tutorial we will continue looking at the concept of events, how they relate to football fixtures and how they can be analysed using SQL Server.
This is the second part of this three part series moves us away from the theory and database set-up and now focuses on the core SQL code itself which is needed to work with such events.
The first part of this tutorial series Working with Events in SQL Server (Pt.1) should be completed before attempting this tutorial. It is important this series is completed in such an order as the concepts discussed in the first tutorial will built upon further in this tutorial.
The Starting Set-up
In the last part of the previous tutorial in this series (Working with Events in SQL Server (Pt.1)) we imported the events spreadsheet into our database for the English Premier League 19/20 season and tested it was working ok. At this point we should now have a football stats database (ideally called football-stats-db if you are following along) and the following two tables imported into it.
NB – You can, of course, use any league and season you like, the concept is the same, just remember to alter league and season name values accordingly when following along.
Once you are at this stage you will be ready to continue with this tutorial.
Getting All Events for a Fixture
Let’s start off by revisiting how we select an individual fixture from our fixtures table, I will continue using the Liverpool vs Norwich City fixture which was referred to in the last tutorial.
As can be seen above this is a fairly basic SQL statement which brings back only the Liverpool vs Norwich fixture form the EPL 19/20 table.
Next, we can bring back the events for this fixture (and only this fixture). To do this we now need to make use of a JOIN clause and join together the Fixtures table to the Events table like so:
As can be seen above, when we join together these two tables we get back a total of 13 rows. It looks like these rows are all duplicates at first glance, however, if you scroll to the far right of the table you will see that whilst the fixture details are indeed duplicated, each row will have a unique set of results as per the event it represents (as per below image).
NB – Remember back to the previous tutorial, a single fixture can have multiple events, hence here we have the same fixture details per line but different event details for each line.
Whilst this might not seem too impressive (yet) this really is the basic concept in action – a single fixture linked to all of the events which happened during said fixture. Now we have this set-up and the ability to link these two tables correctly we can start delivering some even more interesting results.
Getting the First Goal Time for Fixtures
Now let’s get into something a bit more relevant, finding the first goal timings for a specific fixture and then extending this concept out to all fixtures within a specified league and season.
To start with let’s copy the query build in the last section and modify it slightly so as to bring back a smaller number of (relevant) columns:
As can be seen above, I have added a series of relevant columns onto the “SELECT” clause and also converted the MINUTE column into an integer (a number as oppose to a text value) so as we can order these events a little later on.
Next, if we look at the TYPE column we can see we have multiple different types of event currently being returned using this query, this includes substitutions and yellow cards which we are not interested in right now! To remove these types and return only goal based events we can modify the SQL code as follows:
If you look at the SQL above you will see I have added an additional predicate to the WHERE clause (as is highlighted above) which will only return the event types specified (these are ‘goal’, ‘own-goal’ and ‘penalty’).
NB – When using the Pro-Data service: goals, own-goals and penalty’s are all treated as different events! They are, of course, still all “goals” but recording them in this way gives info on how they were actually scored which could be useful in certain circumstances (e.g. selecting all fixtures with a penalty scored).
Next, now we have our list of goals and the minute each goal was scored, we can go about finding the first goal for this fixture. Naturally, as humans looking at this data set we can easily see it was scored at 7 minutes in, however, we now need to alter our SQL so as the query will now select this value for us:
As can be seen above, we now have some SQL which is delivering to us the first goal time for this specific fixture. Since I have made few small adjustments to the code this time around let’s go through them line by line to see what is now happening here.
Line 15: Notice how I have removed the TYPE column from the list of columns which are returned, this is due to “grouping” which I will explain in line 22.
Line 16: Notice on this line I have updated the column which was converting the MINUTE column to an integer. I have encased it inside of a MIN function which will tell SQL Server to fetch only the smallest value of all rows available (i.e. the smallest time value being the first goal).
Line 22: Whenever using an “aggregate” functions such as MIN within SQL Server (as we did on line 15) we MUST group the returned data set (this is because the MIN function is working on the whole group of rows as opposed to a single row at a time).
Also note that on line 22 the set of columns we group by must match exactly the columns selected at the beginning under the SELECT clause (minus the MIN column itself which is returned regardless of the other group members).
Finally, in this section, we will now get the first goal timings for all fixtures that made up the 2019 / 2020 English Premier League. This is actually really easy and simply involves removing the WHERE clause which specifies the FIX_ID be used for the Liverpool vs Norwich game we have been working on so far.
Once removed (and re-adjusted so that the WHERE clause now filters the event type) we will have the following SQL code which will return the first goal time for all games:
NB – This query will naturally exclude any games which finish 0-0, so be aware of this when counting games etc.
Getting Fixtures with Late Goals
At this point, thanks to the above query which finds the earliest goal per fixture for an entire season, we now have a base piece of SQL which can be easily modified to generate other useful data sets concerning events.
Next, let’s assume we now wanted to see what percentage of fixtures in our season (Premier League 2019 / 2020) saw a last minute / late goal scored. For the purposes of this example I will say a late goal is anything on or after the 90th minute, but you can easily adjust this number as you like.
To start off let’s copy the previous query from the last section and make a few small modifications to the SQL, these are shown in the below image:
As can be seen above, I have made a few modifications to the SQL code and this query now returns for me all games with a goal on or after the 90th minute (that’s 67 in total for this particular season).
Let’s go through these modifications line-by-line again (for the above image) and see why each one is necessary.
Line 39: Here I have updated the time used to identify a late goal, this is because the timings in the MINUTE column only go up to a maxim value of 90! As such, any events happening after the 90th minute will make use of the EXTRA_MINUTE column and we must add these columns together to get the correct time for such a late goal.
NB – Don’t worry too much about this setup for now, just be aware we must convert both values to integers so they can be added together for use later on in the query.
Line 42: Here we have added another condition to our WHERE clause which checks that the “Last_Goal_Time” value is equal to or greater than 90 minutes. You might already have noticed this is very similar to the expression on line 37 minus having the MAX() function wrapped around it!
Now, at this point we have more or less got everything we need to answer our original question (“What percentage of games have a late goal”). We still need to modify our query a little further though and in this next step we will finally get our answer.
Looking at the modified query above we now have the answer to the question, let’s go through the changes.
Firstly: Notice that I have removed the GROUP BY clause from the bottom of the original statement, this is because we do not have any other fields making up this result so it is not needed any more.
NB – If, for example, we had include multiple seasons and wanted to find the number of fixtures “per season” we would simply include the season name in the columns being selected and then group by seasons at the bottom of the query!
Secondly: Notice how the SELECT clause now looks very different to anything else we have seen so far in this tutorial, this is because (thanks to the flexibility of SQL Server) it now actually contains an additional SQL query embedded into the column as a part of getting this columns final value.
Starting at line 37 (above image still) we can see an opening bracket and on line 41 a closing bracket, everything which happens between these brackets all combines to deliver a single value to our main SQL statement (this value being the percentage of games with a late goal)
Line 38 is where we count the number of games with a late goal (the core query we are building upon), notice how the line ends with a “/” symbol to indicate we divide the number of games with a late goal (from original query) by this next number…
On line 39 we see some more brackets but this time they contain an entire SQL statement in its own right. Basically this SQL statement will return a single number which is a count of the total number of games in the 19/20 season, thanks to the brackets this final number is then fed into our formula which continues on from line 38.
Line 40 is where we complete the formula for working out the percentage of games with a late goal (so, this formula is: number of late goal games / total number of games * 100).
After line 40 the query is basically the same as it was when we simply selected all games with a late goal, all of the “magic” for this query happens in the SELECT clause!
NB – Whilst this ability to use a sub SQL statement is supported in SQL Server it is important to note it might not be supported by other database systems! There are actually several methods to obtaining such a figure but if you do stay within the SQL Server ecosystem this can be a useful trick to know about!
Where to Next
As was mentioned earlier, this is the first in a three part series of tutorials for understanding and working with events. After completing this tutorials you will be ready to move onto part three of the series where we look at the process for building up more complex event-related queries.
Remember, if you have experienced any problems whilst following this tutorial please get in touch with me via the Pro-Data support form or feel free to leave a comment below this article.