We briefly mentioned the idea of rows referencing other rows back when we introduces SQL. We will talk about it in some more detail here.
This idea of relationships between tables is a major cornerstone of relational databases. Without it you simply have structured data, not related structured data. You will end up working with both and it is important to understand how relational data works and why it is used.
So why do we care? Lets go back to the table of BSG characters and the planets they are from.
We could just have the homeworld
column contain the text ‘Caprica’ or ‘Gemenon’ and that might work. But what if the name of a planet were to change? We would need to make sure to go back and update every instance of that planet name. Doable, but not awesome.
Another issue is that planets have associated data. They have a capital city, language and population. This is where it gets tricky. If we simply included that information in the row of a person we would basically double the size of our database. What used to be rows of 5 attributes becomes a row of 9 attributes. If we had millions of people in our database that would be a major difference.
Yet another issue is what if multiple things reference a planet? Maybe space ships are based there. Maybe we have a list of episodes and what planets are in them. Now we have 4 or 5 tables all referencing the same planet. It becomes a huge problem if that planets information is updated to figure out all the tables it exists in and update all the references. But if we don’t then we have two versions of the planet data in our database. Some current and some old.
So it should be pretty apparent that relationships help solve a variety of problems. Instead of having a lot of copies of a planet, we store a relationship between a person and a planet by storing the persons homeworld
as the primary key of the planet they are from.
The tool we will be using to do all of this relationship work are foreign keys. We say a table has a foreign key reference to another table. When it does the values of that column containing a foreign key reference must be contained in the column of the table it is referencing or be NULL
. If it is some other value the DBMS will reject the operation.
You need to be a little bit careful deleting things. Depending on configuration the DBMS may not let you delete things that are being referenced. If we had people with Caprica as their homeworld, we would not be able to delete the Caprica row until all those references were changed or removed.
This kind of relationship, between people and their home planet is a one-to-many relationship. A person is related to one planet but a planet can be related to many people.
The structure should be fairly intuitive. The thing that references one thing stores the key of the thing it is referencing. The thing being referenced does not need to do anything special in this case.
Lets take a look at a different kind of relationship, the many-to-many relationship. We can imagine in this universe that people might hold certificates in the military to do different things.
id | title |
---|---|
1 | Raptor |
2 | Viper |
3 | Mechanic |
4 | Command |
The problem is that people might hold more than one certification. We can’t use the trick we did with people and homeworlds. You might be tempted to add multiple certification columns to your bsg_people
table, but that is a recipe for disaster when some some overachiever comes through and gets certified in everything and now you need 20 columns.
cid | pid |
---|---|
2 | 2 |
4 | 2 |
4 | 3 |
2 | 4 |
4 | 6 |
1 | 7 |
So we make a new relationship table. This table holds two foreign keys. One referencing a person and one referencing a certification. Each row represents a person holding a particular certification. In this case cid
is the primary key of the certificate that is held and pid
is the primary key of the person who holds the certificate.
So now we have all this nice related data, we need a way to put it all together. That is where joins come in. They come in different flavors, we are going to talk about three, the inner, left and right join.
Inner joins are probably the most common type of join. At least it is the variety of join that I use the most in my database interactions. An inner join provides the intersection of two tables. Lets say we want to join these two tables.
id | fname | lname | homeworld | age |
---|---|---|---|---|
5 | Gaius | Baltar | 3 | NULL |
6 | Saul | Tigh | NULL | 71 |
7 | Karl | Agathon | 1 | NULL |
8 | Galen | Tyrol | 1 | 32 |
id | name | population | language | capital |
---|---|---|---|---|
1 | Gemenon | 2800000000 | Old Gemenese | Oranu |
2 | Leonis | 2600000000 | Leonese | Luminere |
3 | Caprica | 4900000000 | Caprican | Caprica City |
The command for this would be
SELECT *
FROM bsg_people
INNER JOIN bsg_planets
ON bsg_people.homeworld = bsg_planets.id;
And it would generate the following table
id | fname | lname | homeworld | age | id | name | population | language | capital |
---|---|---|---|---|---|---|---|---|---|
5 | Gaius | Baltar | 3 | NULL | 3 | Caprica | 4900000000 | Caprican | Caprica City |
7 | Karl | Agathon | 1 | NULL | 1 | Gemenon | 2800000000 | Old Gemenese | Oranu |
8 | Galen | Tyrol | 1 | 32 | 1 | Gemenon | 2800000000 | Old Gemenese | Oranu |
The newlines are just added for clarity, this can be done on a single line or broken up for readability. It is often convention to break it on keywords like is done here.
So lets look at what we added to the SELECT
statement to make this happen. We SELECT
from columns from a table like usual, in this case we selected everything. We could limit this to specific columns if they are uniquely named by specifying the column name, if they are not unique like id
in this case we would need to prefix it with the table it is coming from, like we did with bsg_planets.id
. So that is all fairly similar to what we did before except we can select columns from multiple tables if we want.
The INNER JOIN
is where things get different. We specify the table we want to join our first table to. In inner joins the order does not really matter.
Finally we specify a join condition. In this case the condition that means the rows go together is if the homeworld
attribute of a person is equal to the id
of a planet.
This query generates a list of all people who have a homeworld along with the details of their homeworld. We can see that some people in the original table are not included because they don’t have a homeworld. Likewise some planets are not included because no one is from that planet.
Left and right joins are syntactically the same as an inner joins other than the name of the join, but the result sets are different.
SELECT *
FROM bsg_people
LEFT JOIN bsg_planets
ON bsg_people.homeworld = bsg_planets.id
id | fname | lname | homeworld | age | id | name | population | language | capital |
---|---|---|---|---|---|---|---|---|---|
5 | Gaius | Baltar | 3 | NULL | 3 | Caprica | 4900000000 | Caprican | Caprica City |
6 | Saul | Tigh | NULL | 71 | NULL | NULL | NULL | NULL | NULL |
7 | Karl | Agathon | 1 | NULL | 1 | Gemenon | 2800000000 | Old Gemenese | Oranu |
8 | Galen | Tyrol | 1 | 32 | 1 | Gemenon | 2800000000 | Old Gemenese | Oranu |
A left join includes every row from the table on the left (the first table we list) and matching rows on the right. So now we also get people who have no homeworld.
A right join does the same thing but lists all the columns on the right, even if there is not a matching column on the left.
SELECT *
FROM bsg_people
RIGHT JOIN bsg_planets
ON bsg_people.homeworld = bsg_planets.id
id | fname | lname | homeworld | age | id | name | population | language | capital |
---|---|---|---|---|---|---|---|---|---|
7 | Karl | Agathon | 1 | NULL | 1 | Gemenon | 2800000000 | Old Gemenese | Oranu |
8 | Galen | Tyrol | 1 | 32 | 1 | Gemenon | 2800000000 | Old Gemenese | Oranu |
NULL | NULL | NULL | NULL | NULL | 2 | Leonis | 2600000000 | Leonese | Luminere |
5 | Gaius | Baltar | 3 | NULL | 3 | Caprica | 4900000000 | Caprican | Caprica City |
The syntax can be a bit confusing for joins, but once you get the hang of it there is not a whole lot of variation to it. Start by trying to join the tables we did here. Then if you go to the sakila database you will see lots of tables like film_actor or film_category. Try to see if you can get all of these joined together into a big table, you may need to filter on some of the results to keep the result set a reasonable size, but it is good practice to see if you can string together 3 or 4 tables and keep things organized.