This week I was doing some analysis and came across a common issue when using multiple data sets. I work with a variety of tables in our data warehouse, which vary in granularity. My task required blending a table that contains daily records with a table that contained monthly records. If I just LEFT JOIN on month(date) and year(date), and the foreign key, I would end up with duplicate records because the join would pull in monthly records at the daily level. My solution to this was to leverage SQL Analytic Functions.
Analytic Functions vary between systems and versions. At my office, we use HP Vertica as our data warehouse. But the one I used should be available on most, if not all, SQL products. ROW_NUMBER is an easy way to create a column that will count rows based on your input. It takes two parameters. The first, PARTITION BY, is optional. You can reset the counter on one or multiple columns or partitions. The second parameter is ORDER BY, which structures the results so your partition counter is as you desire.
As I work in the hotel industry, my daily table contains consumed room nights for our properties. The monthly table has other key metrics at the property level.
Here’s a basic representation of my tables:
I needed to pull one of those metrics to the daily level so I could compare two values Year over Year, aggregated monthly. I used ROW_NUMBER to initiate a counter that resets for every unique year(date), month(date) and property code. The actual syntax for this is as follows:
ROW_NUMBER() OVER (PARTITION BY year(date), month(date), property_code ORDER BY date, property_code)
If I use this statement on my daily table, the result would look as follows:
The result is a counter column that resets every distinct year/month and property code. This can be used to mark where I pull in the value from my monthly table, knowing it will not duplicate anywhere else in the result set. Taking it one step further, I used this function inside a CASE statement. In my SELECT query, I wrote:
CASE WHEN ROW_NUMBER() OVER (PARTITION BY year(date), month(date), property_code ORDER BY date, property_code) = 1 then monthly.value END AS column_name
Using my sample data from above, here’s how the results would appear:
The key here is the equal to 1. Each time the counter resets to 1, I pull in the desired value from my monthly table, else the column is null. My results are still at the daily level. Remember, I still need to LEFT JOIN the tables on month(date) and year(date), and property_code. My data is now ready to bring it all into Tableau and I can start my visualization.