SQL Overview

Introduction

SQL is the most commonly used language for database interactions being used in most of the most popular databases. SQL is a domain specific language designed for querying data sets to get back filtered or combined results. In particular it is designed to do this with relational databases. These are databases that have relations set up between multiple tables. This is in contrast to non-relational data like you might find in a single .csv file that may contain several columns, but none of the reference other tables.

Relational Databases

Relationships

BSG People and Planets

This is a simple two table example of a relational database. It contains people and planets from the show Battlestar Galactica.

Everything looks pretty standard with maybe 2 exceptions. Every row has an id. This isn’t strictly required, but is good practice that we will talk about later. The other is that there is this homeworld column that contains numbers rather than names of planets these characters might be from. This is a key aspect of relational databases. Those homeworld values reference the id of rows in the planet table. This is a key aspect of relational databases. One table can have columns containing values that reference another.

Columns

Columns in SQL should consist of atomic data from a single domain. Atomic in this case is referring to data which cannot be broken down further an maintain its meaning. Sometimes this is not followed exactly, sometimes it helps, sometimes it makes life harder. For example one could make arguments both for and against including area codes as part of a phone number. In most cases the phone number, including the area code, is the atomic value. It is used to make phone calls. But in some cases you might want to do analysis based on area codes in which case really the area code is a separate datum.

The other requirement is that the column contain values from one domain. The domain is the set of all possible values for an attribute. In SQL there are many domains. For example there are int ranging from tiny to big. There is date and datetime. There are various forms of character arrays (strings). Whatever the case, the column can only contain values from a single domain. We cannot for example, have in some cases int and in other cases varchar (which is a kind of string).

Rows

Rows in a relational database are a tuple of data (not specifically the Python tuple). Strictly speaking they are unordered, but named, a bit like a Python dictionary. However, as you interact with them there will be an order inherent in how they are given back to you from the DBMS (database management system, basically the thing you interact with you use the database).

A very important feature of the tuples of a table is that they are unique there are never two tuples that contain the exact same data. This is the reason many database tables have a column that contains unique id values. That column makes sure that every tuple is unique.

Primary Keys

The thing that identifies a row as being unique is the primary key. Every table needs to have a column or columns that make up a primary key. Then every new row needs to have those columns be not null and unique. An easy way to solve this, as mentioned before, is to have a column that contains an integer that gets incremented every time a row is added. You might see syntax that looks like this

CREATE TABLE my_table (id INT NOT NULL AUTO_INCREMENT,
...
...
PRIMARY KEY (id)
);

This table has an attribute called id that automatically increments every time a new row is added and is the pimrary key of the table. We will talk a bit about table creation syntax later.

Differences to Things you Know About

Dictionaries

Python dictionaries are actually a common way rows returned from a SQL DBMS are stored. You access each attribute using its column name. This is probably a pretty the best analog to rows in Python, however there is no enforcement of domain.

2d Arrays

A 2d array does make a table with rows and columns, but everything is accessed by index (eg. [4][2]) to access the thing at (4,2) in the table. In contrast, in a relational database table rows are unordered and attributes are accessed by name.

You will probably never get back data in a multidimensional array. You will much more commonly get a list of dicts or maybe tuples.

JSON

You could probably devise a way to backup a SQL database into JSON format pretty easily. The main differences between data structured using JSON come in the form of domain enforcement. JSON does not let you specify what domain values need to be from. So it is a good way to store structured data, it does not do much in the way of enforcing what sorts of things are allowed to go where.

Parts of a Relational Database

DBMS

The Database Management System or DBMS is the brains behind the database. It is like the Python executable on your computer. The DBMS is in charge of interpreting the commands you give it and turning them into actions to write or read the appropriate data from the database.

Interface

So how the heck do you interact with a database? This is actually a little tricker than you might imagine.

Take MySQL for example, it is a really popular DBMS. When you install it you get access to the mysql command line tool. You can use this to type in queries and it spits back results in the command line. It is a lot like the Python interpreter.

To actually be able to visualize the tables in a meaningful way you typically need an additional application. Think of something like Pycharm. In the case of MySQL there is their own product, MySQL Workbench and then there are a host of options provided by 3rd parties. A really common one is PHPMyAdmin which is a web based tool written in PHP which lets you interact with the database.

Finally there are libraries. In Python one option is PyMySQL, it provides functionality to connect to and interact with a MySQL database. But it only works with MySQL. If you wanted to work with say, PostreSQL, a different SQL DBMS you would need a different Python library like psycopg. These libraries let you directly work with the database.

More commonly you will be interacting with a database via something like an Object Relational Mapper (ORM). An ORM lets you create objects in your code representing entities in your database (even ones that span multiple tables) and it will map the database properties to them. So you might have code that looked like

w_adama = select("WHERE fname = 'William' AND lname = 'Adama'")
w_adama.age = 62
w_adama.save()

This would load a person into the w_adama variable, let you modify the contents and then save it back to the database. Typically this is a lot less work than writing the select query, executing it, mapping the properties to an object then writing the query to correctly save stuff back to the database.

A popular ORM is SQLAlchemy for Python. If you end up wanting to try one out, I would suggest using this.

Activity

Get your database set up on the ONID server so that you can follow along with the rest of the material in this section. The database file can be found back in the resources of the overview page and this video will walk you through importing it to your ONID database.

Review

This exploration is designed to give you a sense of what is out there in the relational database world. If you go into data related fields you will almost certainly end up working with a SQL based database at some point. But it will almost certainly be with the tools and libraries that your team already uses.

In addition to all of that, these relational databases are not really well suited to what we aim to accomplish in this course. They are good for organizing data with a lot of breadth. You can have companies, with divisions, who have managers, who have employees who have salaries and heal insurance plans and so forth. They are really good at that sort of thing.

They are less good at storing billions of tweets or logging LIGO strain data at 16 kHz to detect gravity waves. These sorts of data tend to be pretty simple, a user, a time and a post or simply a list of strain measurements at a known frequency from a known start time.