BASICS OF SQL

Linh Mai
7 min readNov 30, 2020

Structured Query Language or also preferred as SQL or Sequel. This is a domain-specific language used in programming and designed for managing data held in a Relational Database Management System (RDBMS), or for stream processing in a Relational Data Stream Management System (RDSMS). SQL was initially developed by IBM’s researchers — Donald D. Chamberlin and Raymond F. Boyce after learning about the relational model from Edgar F. Codd in the early 1970s. It is an essential language that any data scientist should be familiar with to handle structured data. Therefore, in this post, I am going to talk about basic knowledge about SQL and some of the common syntax in SQLite3.

CONNECT TO A DATABASE WITH SQLITE3

SQL has been around since the 1970s, so there are numerous dialects of the language such as MySQL, SQLite, and PostgreSQL. Each of these dialects has particularities such as specific functions or keywords for that specific implementation. However, they all have the same basic structures including keywords SELECT for querying databases and the same general database architecture. In this post, I am focusing on the syntax for SQLite3 — the latest version of SQLite.

1. SQL AND RELATIONAL DATABASES

Relational databases accessed by SQL which is a domain-specific language for querying and manipulating data. Databases are a collection of two-dimensional tables that have fixed columns and rows. Each table is very similar to a CSV file or a spreadsheet in Microsoft Excel. Relational data just simply means that pieces of data are related to each other. For instance, from the database structure example below, data on the ‘Customers’ table has some relationships to data on the ‘Orders’ table, indicating that a customer (‘CustomerID’) may be associated with a specific order (‘OrderID).

On the other hand, in each table, there is a key symbol next to a column name such as the ‘SupplierID’ column in the ‘Suppliers’ table. It indicates that the ‘SupplierID’ column is the primary key for the ‘Suppliers’ table. A primary key is a unique identifier for a column and it can only be unique values for that column entry. On the other hand, the primary key of this table can also appear on another table such as the ‘SupplierID’ column appears on the ‘Products’ table. In this case, this key is called a foreign key which means the primary key from a different table. This is the main idea of how data on different tables are associated with each other in a relational database. For example, if you were told a specific ‘CustomerID’, and then given a list of order data that included the ‘CustomerID’, you could determine which orders were placed by that customer by matching up the primary and foreign keys.

2. CONNECT TO A DATABASE

Before we can start using SQLite in Python, we need to import the sqlite3 module. Following that, we connect it to the database that we want to work with by creating an conn object which simply stands for ‘connect’. Once we have a connection conn. The cur object is called to fetch results. Inside of it, the cursor() method is executed to allow keeping track of which result set is which since we can run multiple queries before we are done fetching the results of the first query.

import sqlite3
conn = sqlite3.connect('database_name.db')
cur = conn.cursor()

3. QUERY AND SHOW THE RESULTS

After creating your cursor object, you now can execute queries with execute() (NOTE: using triple quotation marks for being able o use multiple query lines within the same string. However, using execute() only execute the queries. In order to show the result after execution, we need to use fetcha() right after that. We can also use fetcha()method with pandas to wrap the results into dataframes and access columns’ names in the dataframe with description

import pandas as pd
cur.execute("""SELECT * FROM Employees;""")
df = pd.DataFrame(cur.fetcha())
df.columns = [x[0] for x in cur.description]

SOME COMMON SYNTAX IN SQLite

  • Creating a new SQL database and tables from an existing dataframe
# Create a new SQL database named movie_runtime
conn = sqlite3.connect('movie_runtime.db')
cur = conn.cursor()
# Create a new table 'run_time' in the SQL database movie_runtime
# Need to clarify the data type for each column
cur.execute('''CREATE TABLE run_time(tconst INTEGER,
primary_title TEXT,
original_title TEXT,
start_year INTEGER,
runtime_minutes REAL,
genres TEXT);''')
conn.commit()
# Get from Pandas DataFrame 'imdb_basic' to SQL
imdb_basic.to_sql('run_time', conn, if_exists='replace')
  • SELECT … FROM … clause

The very first syntax to query with SQL is SELECT-FROM. With SELECT, you need to put in the column names that you want to extract the data from. If you want to choose all the columns in a table, just simply use an asterisk (*). As you can guess, we put the table name after FROM, so SQL can identify the where the columns come from to make the selection. For instance, we want to choose all columns in the table named ‘Employees’ like the above code snip.

  • WHERE clause

We use this clause to filter query results by some condition(s). For example, we want to extract the customer information in the cities of Boston or Madrid

cur.execute("""SELECT * FROM Customers
WHERE City='Boston' OR City='Madrid';""")
df = pd.DataFrame(cur.fetcha())
df.columns = [x[0] for x in cur.description]
  • ORDER BY and LIMIT clauses

With ORDER BY clause, we can sort the results by a particular feature. We can limit the output to a number of results by using LIMIT clause. For instance, we want to get fifteen customer number and name in the cities of Boston or Madrid with a credit limit greater than $50,000. Show the results from largest to smallest values of credit limit.

cur.execute("""SELECT CustomerID, CustomerName, CreditLimit
FROM Customers
WHERE (City='Boston' OR City='Madrid')
AND (CreditLimit >= 50000)
ORDER BY CreditLimit DESC
LIMIT 15;""")
df = pd.DataFrame(cur.fetcha())
df.columns = [x[0] for x in cur.description]
  • BETWEEN … AND … clause

We use this clause when we want to extract the information of a column in desires value range.

cur.execute("""SELECT column_name(s)
FROM table_name
WHERE (conditions)
BETWEEN value1 AND value2;""").fetcha()
  • Aggregate functions

These SQL aggregate functions are statements that can get the average of a column’s values ( AVG ), the minimum and the maximum values from a column ( MIN and MAX), the sum of values in a column ( SUM ), or counting the number of records that meet certain conditions ( COUNT ).

cur.execute('''SELECT COUNT (owner_id) 
FROM cats
WHERE owner_id=1;''').fetcha()
  • GROUP BY function

We use this function to group results by more than 1 given columns. Typically, GROUP BY also involves an aggregate function such as COUNT or AVG. For example, count the number of each breed and group by breed and owner ID

cur.execute('''SELECT breed, owner_id,
COUNT (breed)
FROM cats
GROUP BY breed, owner_id;''').fetcha()
  • JOIN … USING … or JOIN … ON …

Of course, the main purpose of using SQL is to extract information from different tables, not simply just in one column. This is when JOIN ... USING... comes in handy. There are numerous types of JOIN . By default, simply using JOIN will mean INNER JOIN which will return all the matched records from the left table and the right table. Other JOINS are:

# Joining table 'orderdetails' and table 'products' with the same column in both tables 'productCode'
cur.execute('''SELECT * FROM orderdetails
JOIN products
ON orderdetails.productCode = products.productCode
LIMIT 10;''')
df = pd.DataFrame(cur.fetcha())
df.columns = [x[0] for x in cur.description]

A more concise way to join table is to replace ON with USING . If we know both tables have the same column name, we can put the column name in the USING clause without stating the table names

cur.execute('''SELECT * FROM orderdetails
JOIN products
USING(productCode)
LIMIT 10;''').fetcha()
  • Aliasing AS

An alias is a shorthand for a table or a column name. It helps reducing the amount of typing to enter a query and it is also easier to read. Aliasing is especially useful with JOIN, GROUP BY and aggregates functions. The alias only exists for the duration of the query.

cur.execute('''SELECT * FROM orderdetails AS o
JOIN products AS p
ON o.productCode = p.productCode
LIMIT 10;''').fetcha()

We can also use alias to rename our aggregate to a more descriptive name

cur.execute('''SELECT city COUNT(employeeNumber) AS numEmployees
FROM offices
JOIN employees
USING(officeCode)
# using '1' to group by the 1st column that we are
selecting from the beginning

GROUP BY 1
ORDER BY numEmployees DESC;''')
df = pd.DataFrame(cur.fetcha())
df.columns = [x[0] for x in cur.description]
  • HAVING clause

This clause is very similar to WHERE clause, but it is used to filter data selections on conditions after the GROUP BY function. We can use both WHERE and HAVING clause in the same query.

For instance, we want to extract a list of customers who have made at least 2 purchases of over $50,000 each:

cur.execute('''SELECT customerName,
COUNT(amount) AS num_purchases_over_50k
FROM customers
JOIN payments
USING(customerNumber)
WHERE amount >= 50000
GROUP BY customerName
HAVING count(amount) >= 2
ORDER BY count(amount);''')
df = pd.DataFrame(cur.fetcha())
df.columns = [x[0] for x in cur.description]

CONCLUSION

There are much more in what we can do in SQL. This post is just a brief introduction to SQL. However, if you know the basics and how it works, you can easily learn other SQL syntax in a faster time.

--

--

Linh Mai
0 Followers

Data Analyst with a professional background in Chemical Research