Working with Events in SQL Server (Pt. 3 – Advanced Events)
In this tutorial we will continue looking at the concept of “events” relating to football fixtures and how they can be analysed using SQL Server.
This is the third part of this series which now focuses on more advanced SQL, the problem solving process and the building of queries which allow us to work with events on a much deeper level.
The second part of this tutorial series Working with Events in SQL Server (Pt.2) should be completed before attempting this tutorial. It is important this series is completed in such an order as the concepts discussed in the first tutorial will built upon further in this tutorial.
The Problem Solving Process
In the last part of the previous tutorial in this series (Working with Events in SQL Server (Pt.2)) we built some basic SQL queries used to retrieve events for a fixture (or for all fixtures within a season). We then went on to perform some basic analysis on these results such as working out the percentage of games which have a late goal scored (scored over 90 minutes in).
Whilst the question which were answered in the last tutorial were fairly straight forwards (e.g. what percentage of games had a late goal scored) other questions will not be so simple to answer (even if, to the human ear they don’t sound too much different from what we have already covered).
For example, if I were to ask a question of my data such as “in what percentage of games did the home team score first” or “in how many games did the home team win from behind” the SQL needed to answer such questions would start becoming more complicated.
As such, when answering more complex questions you might find yourself needing to break the problem up into smaller pieces, creating a separate piece of SQL to answer each piece and then combine all of these smaller pieces to form a final SQL statement which will answer the question.
NB – This will hopefully become more clear when following along with the below example which I will build up and explain how the pieces come together
Before starting I think it is also valuable to stress that there are potentially lots of different ways of solving the same problem using SQL, this is thanks both to the flexibility of the SQL language itself alongside the numerous different add-ons to the SQL language available when working within SQL Server itself. As such I would advise not to worry too much if you have a go at solving a problem using SQL and it comes out different to the way I might have solved the same problem, focus instead on accuracy and (as much as is reasonable) reference back to the raw data and make spot checks to ensure that your query is doing what you think it is correctly.
With all of this said, let’s start looking at a more advanced query…
Percentage of Games in which the Home Team Scored First
In this section I am looking to find the percentage of games whereby the home team scored the first goal. As before I will continue using the EPL 19 / 20 season which is already imported into the football-stats-db database and ready to go!
Firstly, as per the first section on problem solving, let’s start this query off by thinking about the basic components needed to build up such a result.
First: Since we are looking specifically for home teams which have scored first we will need to differentiate between home team goals and away team goals (so far we have lumped all goals together in our calculations). This will also (probably) require us to store the results of each into temporary structures such as temporary tables for working with each set of results in the next step.
Secondly: Once we have split up home and away goals into temporary tables we will need to (for each and every fixture) filter out all but the earliest goal for each fixture. Once we know this a secondary analysis will be needed to compare the first home and away goal for each fixture and filter out any which are scored by the home team only (again, we will probably need another temporary table for this result).
Thirdly: Once we have the latest temporary table (containing all first goals scored by home teams) we can start doing our percentage calculations, these will be similar to what we did in the previous tutorial and involve counting all games (regardless of goals) in total to then calculate a percentage figure.
NB – This above is just a rough through process, something I usually do when faced with a potentially complex question to answer. This process is not about answering the question itself here and now, but instead thinking through a series of steps which should lead to the answer. I believe this way of thinking is a good habit to get into versus simply trying to write out the query straight away!
With the theory out of the way let’s start building up our query, first lets get all home goals into their own temporary table.
As can be seen above this query will filter out only home goals (scored by the home team) and store them in a separate (and temporary) table for use later on in the query. Next, let’s go through what is happening in the above query line by line:
Line 6: Here we will drop the temporary table if it exists, if you run this query more than once the temporary table will still exist from the last time the query was run. As a result this line will (if it already exists) delete the table prior to re-creating it for the current run of the query.
Line 7: Here we simply end the current statement (which is deleting the temporary table). Anything after this semi-colon will be treated as a separate / new SQL statement.
Lines 8 & 9: Here we see the SELECT clause and the columns to retrieve across both of the Fixtures and Events tables (note that they are joined a little further down). Also note that since this temporary table will only store home team goals I have added a column called H_or_A which will always contain a hard-coded value of “HOME” alongside each row. This will help identify which team scored goals later on, but this will become clearer as to why I am doing this later on also.
Line 10: Here we write the result of this query to our temporary table (the INTO statement will automatically create such a table is it doesn’t already exist, which of course it wont because we deleted it earlier on). Normally SQL Server will output the results of our queries to the screen so we can read them, this INTO clause simply turns that same output into a table which we can then reference further down in our query.
Lines 11 & 12: These two lines make up the JOIN clause between the two tables involved in this query. Line 11 stipulates the join be on the FIX_ID in which we have already covered how this works. Line 12 is where we make sure we only receive home team goals by matching the home team ID (from the Fixtures table) to the team ID for the lines in the Events table.
NB – Because we don’t have any way of telling whether the home or away team was responsible for any given event, we must make us of this link with the Fixtures table in a similar fashion to the above.
Line 15: This line is where we filter out only goal events from the numerous other events such as Red / Yellow cards etc.
Line 17: This line is temporary for running the query so far and checking the results, it is not a key part of the main query.
Next, we need to effectively repeat this query and make a second temporary table for all away goals:
In the above SQL sample we now have goal events for both home and away teams being filtered and stored in two separate temporary tables. The main difference to note from the first query (for home goals) is that we have changed the H_or_A label to “AWAY” and the join (on line 26) now joins the fixtures table based upon the away team’s ID instead of that of the home team.
Next, with these two tables created we need to work on the second part of our query, this is the seeking of the lowest goal time per fixture (i.e. the first goal). This value will, of course, need to be sought across both of the temporary tables to then (in the next step) filter out any games where the home value is lower than the away value (i.e. the home team scored first). Let’s have a look at some SQL to perform this task:
As can be seen above, we now have some SQL which is delivering the lowest goal time (i.e. the first goal) per fixture and (crucially) adding label of “HOME” or “AWAY” to help us identify which team scored first. This might look complicated at first (and there are a couple of new concepts introduced here) so let’s go through this SQL line by line:
Line 30 & 36: Here we start using a new function available within SQL Server, this is the common table expression (CTE). If you think back to earlier on in this tutorial we made use of temporary tables (for splitting up home and away games) which are great when we need to obtain a result and then (in effect) query that result. The CTE helps us to perform the same task as with a temporary table (i.e. query the result of a prior query) but helps makes the process easier to mange (for example, no need to think about deleting tables first etc.).
On line 30 we start creating this CTE (and name it “cte”), everything between the opening bracket on this line and the closing bracket on line 36 is part of this CTE.
Starting on line 31 and finishing on line 35 is the actual query that makes up our CTE table, we are combining both of the temporary tables we created earlier via use of the UNION clause (which will effectively just combine the two tables into one longer table of the same column structure).
Line 37 & 38: By the time we reach this line the temporary table built using the CTE function will be ready to use, we can now make a new query based upon the results of this CTE table. The select statement itself returns the H_or_A and FIX_ID columns and then, via the MIN function, also returns the lowest numeric goal time value (NB -we are still combining MINUTE & EXTRA_MINUTE as before).
Line 39: Here we specify that we are now selecting from the CTE table.
Line 40: Remember, whenever we use an aggregate function in the SELECT clause (such as MIN) we must use a GROUP BY clause to group by the other columns which are present as we do here.
If we now look at the results after this query we will see we now have the first (earliest) goal time for each fixture and a label indicating whether it was the home team or away team who scored it (see below):
Moving on, now we have this result set the next thing we need to do is filter out only home games (remember our original question is to find the percentage of fixtures in which the home team scored first). The simple modification (line 44 below) will be enough to now only show the goals scored by home teams:
At this point we can now see a list of all games in which the home team scored first, all that remains now is to calculate what this number is as a percentage of all games!
As can be see above, we now have our answer, 48.94% of all games in this league and season saw the home team score the first goal! Let’s now look at the modifications made to get to this final result:
Lines 30 & 41: Notice I have now put the results from my CTE expression into another temporary table, lines 30 & 41 are where we set this table up and write the result of the CTE into it.
Lines 43 to 47: This section might look unusual at first glance, but it is here we use a sub-query to solve an issue known as a tie-breaker problem. Basically, it “could” be possible that we have two goals scored in the same minute by both teams, if this were to happen we would need a way to determine which goal was scored first other than the goal time.
Because making use of the minute would be no good (again they have the same minute) we would need to make use of the event ID instead. This event ID will, regardless of the minute of the event, always be in ascending order for all events in a game (so, the first goal will always have a lower event ID than the second and so on.).
So, with the above in mind, if we just look at this sub-query for a second we will see it returns the lowest event ID for any given fixture to the outer query. We will also see (on line 43) that the event ID for the outer query must be equal to this lowest event ID found using the inner query for each fixture (i.e. when running the outer query we use this mechanism to filter out (for each fixture ID) all but the lowest event IDs (i.e. the first goal for each fixture).
Also note that whilst we don’t use the JOIN keyword in this sub-query we are still joining this inner sub-query to the query that operates outside of it, this is done on line 46 where we say that the fixture ID for the outer query must match that of the inner sub-query (in other words we get the lowest event ID per fixture ID, not just overall for the whole table).
Line 47, which is still a part of this sub-query, also makes use of the HOME or AWAY labels which we set-up earlier on to ensure that we filter out only the events (goals) for our home teams.
Lines 51 to 57: Between these lines we make a new statement to select form the new temporary table (mentioned just prior). You will note that lines 52 to 56 are very similar to those used in the last tutorial whereby we count the number of home goals (form the temp table) and then divide by the total number of games obtained by using the sub-statement on line 54.
Line 50: Notice that before these latest modifications, below the GROUP BY clause which is now on line 49, we had an ORDER BY clause. I have removed this line as we cant send pre-ordered data into a new table (even the temporary table now being used in this part of the query).
Finally, as per the results box at the very bottom of the above screen-shot we can now see our answer:
For reference I have included below a copy of the full script to reach this result:
I hope you have enjoyed going though this tutorial series on events, I have tried to built it up in such a way as to go through the process of getting to the answers using SQL rather than just providing a load of copy and paste solutions from the get go.
Once you understand these principals and processes you should be able to start thinking of your own questions to ask of the events based data and then finding ways of answering them for yourself. Using the SQL code in this series as a starting template can be a great way of building up you knowledge further and quicker than simply learning SQL outside of this tutorial and trying to write them from scratch.
Where to Next
After completing this three part series please keep an eye out for other, more specific, SQL based tutorials in the Tutorials section of the Pro-Data members website.
Remember, if you have experienced any problems whilst following this tutorial please get in touch with me via the Pro-Data support form or feel free to leave a comment below this article.