Advanced SQL - top tips for dealing with real world data

Learn about joining on messy data, window functions and using arrays in SQL

Data Deciphered

Sep 10, 2024

Slides

In this talk I aim to teach experienced SQL users something they didn’t know.

I covered 3 key topics, I’ll share an overview of each below

Joining on messy data, what can go wrong and how to avoid it.
Window functions, what makes them great and an example of how to use them to create sessions.
Arrays, why data engineers use them and how to access data stored in them easily.

Follow along with code snippets:

Joining on messy data, what can go wrong and how to avoid it

When interviewing candidates I’ve found that many people who’ve only used SQL in a learning context don’t really have a deep understanding of how joins work. When presented with messy data they don’t have an intuition about what the result of the query will be.

I made one of my biggest SQL errors early in my career for the same reason, it was only when my head of department asked why our sales were 7x too high that I found my mistake.

The goal of this section is to help you avoid making the same mistake.

Example

The error will crop up whenever you have duplicate rows in one table that you aren’t accounting for, as in this example:

SELECT order_id
, sum(price) as order_value
FROM sales s
INNER JOIN product_details p
ON s.product_id = p.product_id
GROUP BY 1

Try the SQL yourself

This is caused by the row with the error producing two rows in the output

This is fundamental to how SQL works. When you join on a key, here product_id, all matches will be returned.

If table product_id = 1010103 appears n times in sales, and m times in product_details, you will always get n*m rows out.

Take a look at the recording to see this in action.

Solutions

When using a new table for the first time always check for duplicates if you require uniqueness.

Example SQL

SELECT COUNT(1) AS rows
, COUNT(DISTINCT product_id) AS distinct_products
FROM product_details p

If you find a duplicate you want to find out why it exists and put the right fix in place

Check you are using the right data model - e.g. is there a column showing the most recent row that you didn’t use?
Write a subquery to deduplicate your code
Fix the bug, Identify why there is an issue and resolve it in the data transformations

Read the slides or watch the recording for more details.

Arrays and Maps

When you have a very large amount of data, billions of rows, tables sometimes can’t be fully de-normalised. Arrays and Maps allow data to be stored in a standardised way even when this is the case.

Methods to query Arrays or Maps

Access one element at a time

SELECT product_name
, allergens[0] one_allergen
, additional_details['Use By'] use_by_date
FROM products

Pivot into rows

Note, each SQL version has its own syntax for this. If this doesn’t work google “<your SQL version> cross join unnest”.

SELECT p.product_name
, t.val as allergen
FROM products p
CROSS JOIN UNNEST (allergens) as t(val)

SELECT p.product_name
, t.key AS key
, t.val AS value
FROM products p
CROSS JOIN UNNEST(additional_details) AS t(key, val)

Window Functions

Window functions can replace complex SQL with more simple SQL that is easier to manage and edit.

Aggregation over specific rows

Example: Calculating total spend

Without a window function

WITH order_total AS (
SELECT order_id
, SUM(price) AS order_value
FROM sales_with_product_details s
GROUP BY 1
)
SELECT s.order_id
, s.product_id
, s.price
, s.category
, ot.order_value
FROM sales_with_product_details s
INNER JOIN order_total ot
ON s.order_id = ot.order_id

With a window function

SELECT order_id
, product_id
, price
, category
, SUM(price) OVER (PARTITION BY order_id) AS order_value
FROM sales_with_product_details s

Try the code

QUALIFY - like WHERE or HAVING, but for window functions

SELECT order_id
, product_id
, price
, category
, SUM(price) OVER (PARTITION BY order_id) AS order_value
, SUM(price) OVER (PARTITION BY order_id, category) AS category_value
FROM sales_with_product_details s
QUALIFY SUM(price) OVER (PARTITION BY order_id, category) >= 5

Creating a Moving Average

SELECT date
, units_sold
, AVG(units_sold) OVER (ORDER BY date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) AS units_sold_2d_avg
FROM product_sales s
GROUP BY 1, 2

Note: Use 6 days preceding for weekly averages

Read the slides or watch the recording for more details.

Top Features I didn’t have time for in the talk

STRING_AGG - an aggregate function that joins strings

SELECT schema, table
, STRING_AGG(user_name, ‘, ‘) as users
FROM table_views
GROUP BY 1, 2

Add distinct to remove duplicates

SELECT schema, table
, STRING_AGG(DISTINCT user_name, ‘, ‘) as users
FROM table_views
GROUP BY 1, 2

Not all versions of SQL have this, so check your documentation.

Grouping Sets

Grouping sets allow you to group at multiple granularities without unioning two tables together.

Here we want sales by category (shown by a null brand), and category and brand.

Code without grouping sets

SELECT category
, sum(units_sold) as units_sold
FROM product_sales s
GROUP BY 1

UNION ALL

SELECT category, brand
, sum(units_sold) as units_sold
FROM product_sales s
GROUP BY 1,2

Code with grouping sets

SELECT category, brand
, sum(units_sold) as units_sold
FROM product_sales s
GROUP BY
GROUPING SETS (
(category, brand),
(category)
)

ROLLUP and CUBE automatically provide all levels of grouping