Advanced SQL - top tips for dealing with real world data
Learn about joining on messy data, window functions and using arrays in SQL
In this talk I aim to teach experienced SQL users something they didn’t know.
I covered 3 key topics, I’ll share an overview of each below
Joining on messy data, what can go wrong and how to avoid it.
Window functions, what makes them great and an example of how to use them to create sessions.
Arrays, why data engineers use them and how to access data stored in them easily.
Follow along with code snippets:
Joining on messy data, what can go wrong and how to avoid it
When interviewing candidates I’ve found that many people who’ve only used SQL in a learning context don’t really have a deep understanding of how joins work. When presented with messy data they don’t have an intuition about what the result of the query will be.
I made one of my biggest SQL errors early in my career for the same reason, it was only when my head of department asked why our sales were 7x too high that I found my mistake.
The goal of this section is to help you avoid making the same mistake.
Example
The error will crop up whenever you have duplicate rows in one table that you aren’t accounting for, as in this example:
SELECT order_id
, sum(price) as order_value
FROM sales s
INNER JOIN product_details p
ON s.product_id = p.product_id
GROUP BY 1
This is caused by the row with the error producing two rows in the output
This is fundamental to how SQL works. When you join on a key, here product_id, all matches will be returned.
If table product_id = 1010103
appears n times in sales
, and m times in product_details
, you will always get n*m rows out.
Take a look at the recording to see this in action.
Solutions
When using a new table for the first time always check for duplicates if you require uniqueness.
Example SQL
SELECT COUNT(1) AS rows
, COUNT(DISTINCT product_id) AS distinct_products
FROM product_details p
If you find a duplicate you want to find out why it exists and put the right fix in place
Check you are using the right data model - e.g. is there a column showing the most recent row that you didn’t use?
Write a subquery to deduplicate your code
Fix the bug, Identify why there is an issue and resolve it in the data transformations
Read the slides or watch the recording for more details.
Arrays and Maps
When you have a very large amount of data, billions of rows, tables sometimes can’t be fully de-normalised. Arrays and Maps allow data to be stored in a standardised way even when this is the case.
Methods to query Arrays or Maps
Access one element at a time
SELECT product_name
, allergens[0] one_allergen
, additional_details['Use By'] use_by_date
FROM products
Pivot into rows
Note, each SQL version has its own syntax for this. If this doesn’t work google “<your SQL version> cross join unnest”.
SELECT p.product_name
, t.val as allergen
FROM products p
CROSS JOIN UNNEST (allergens) as t(val)
SELECT p.product_name
, t.key AS key
, t.val AS value
FROM products p
CROSS JOIN UNNEST(additional_details) AS t(key, val)
Window Functions
Window functions can replace complex SQL with more simple SQL that is easier to manage and edit.
Aggregation over specific rows
Example: Calculating total spend
Without a window function
WITH order_total AS (
SELECT order_id
, SUM(price) AS order_value
FROM sales_with_product_details s
GROUP BY 1
)
SELECT s.order_id
, s.product_id
, s.price
, s.category
, ot.order_value
FROM sales_with_product_details s
INNER JOIN order_total ot
ON s.order_id = ot.order_id
With a window function
SELECT order_id
, product_id
, price
, category
, SUM(price) OVER (PARTITION BY order_id) AS order_value
FROM sales_with_product_details s
QUALIFY - like WHERE or HAVING, but for window functions
SELECT order_id
, product_id
, price
, category
, SUM(price) OVER (PARTITION BY order_id) AS order_value
, SUM(price) OVER (PARTITION BY order_id, category) AS category_value
FROM sales_with_product_details s
QUALIFY SUM(price) OVER (PARTITION BY order_id, category) >= 5
Creating a Moving Average
SELECT date
, units_sold
, AVG(units_sold) OVER (ORDER BY date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) AS units_sold_2d_avg
FROM product_sales s
GROUP BY 1, 2
Note: Use 6 days preceding for weekly averages
Read the slides or watch the recording for more details.
Top Features I didn’t have time for in the talk
STRING_AGG - an aggregate function that joins strings
SELECT schema, table
, STRING_AGG(user_name, ‘, ‘) as users
FROM table_views
GROUP BY 1, 2
Add distinct to remove duplicates
SELECT schema, table
, STRING_AGG(DISTINCT user_name, ‘, ‘) as users
FROM table_views
GROUP BY 1, 2
Not all versions of SQL have this, so check your documentation.
Grouping Sets
Grouping sets allow you to group at multiple granularities without unioning two tables together.
Here we want sales by category (shown by a null brand), and category and brand.
Code without grouping sets
SELECT category
, sum(units_sold) as units_sold
FROM product_sales s
GROUP BY 1
UNION ALL
SELECT category, brand
, sum(units_sold) as units_sold
FROM product_sales s
GROUP BY 1,2
Code with grouping sets
SELECT category, brand
, sum(units_sold) as units_sold
FROM product_sales s
GROUP BY
GROUPING SETS (
(category, brand),
(category)
)
ROLLUP and CUBE automatically provide all levels of grouping