Category Archives: SQL

Distance Calculation

I was tasked with an interesting request recently at work. It was to find how many hotels of an external brand were near one of our brands’ hotels. For those not familiar with the hotel industry, we leverage a data provider called STR. So no crazy web scrapers needed. We have all the hotel data required for the task.

The key fields required here are a unique identifier for each hotel and their respective latitude and longitude. My resulting data would need to have a row of every external brand hotel for each internal brand hotel. First, I created two tables in the database. One for all internal and the other for the external brand. Querying both tables together without specifying a join relation results in a row of every external brand hotel for each internal hotel. The only thing I needed to add was a calculation to determine the distance between two lat/lon coordinates.

We’re a Vertica shop. So I checked the reference guide and found the DISTANCEV function. It is as simple as passing in all four coordinates and it returns the distance in kilometers. I just multiplied that by 0.621371 to convert to miles. The task was to find number of external hotels within a 2 mile and 7 mile radius per internal hotel. There are a number of ways to get this but I decided to use Tableau. You never know when requirements will change and a viz will then be asked for. I pass this as a custom SQL query in Tableau’s data window.

screenhunter_43-sep-20-21-32

The output needed to be a list of our internal hotels with two columns. One counting the number of external hotels within 2 miles and another within 7 miles. Using Tableau’s Level of Detail function, I created the following.

if {fixed [choice_id], [external_str_nbr] : sum([distance])} <=2 then 1 else 0 end

For every unique pair of internal and external hotel ids, I check the distance between the two and if it is less than or equal to 2, I sum 1. This is repeated with less than or equal to 7 calculated field.

My end result is as follows:

screenhunter_43-sep-21-09-16

The way I have this setup, under 2 miles are also included in under 7. This was a fun task. I’ve never done anything related to distance measurement. Luckily, there are tools that make it so simple.

 

Share Button

SQL Join Daily and Monthly

This week I was doing some analysis and came across a common issue when using multiple data sets. I work with a variety of tables in our data warehouse, which vary in granularity. My task required blending a table that contains daily records with a table that contained monthly records. If I just LEFT JOIN on month(date) and year(date), and the foreign key, I would end up with duplicate records because the join would pull in monthly records at the daily level. My solution to this was to leverage SQL Analytic Functions.

Analytic Functions vary between systems and versions. At my office, we use HP Vertica as our data warehouse. But the one I used should be available on most, if not all, SQL products. ROW_NUMBER is an easy way to create a column that will count rows based on your input. It takes two parameters. The first, PARTITION BY, is optional. You can reset the counter on one or multiple columns or partitions. The second parameter is ORDER BY, which structures the results so your partition counter is as you desire.

As I work in the hotel industry, my daily table contains consumed room nights for our properties. The monthly table has other key metrics at the property level.

Here’s a basic representation of my tables:

ScreenHunter_29 Jun. 09 14.20

I needed to pull one of those metrics to the daily level so I could compare two values Year over Year, aggregated monthly. I used ROW_NUMBER to initiate a counter that resets for every unique year(date), month(date) and property code. The actual syntax for this is as follows:

ROW_NUMBER() OVER (PARTITION BY year(date), month(date), property_code ORDER BY date, property_code)

If I use this statement on my daily table, the result would look as follows:

ScreenHunter_23 Jun. 09 13.47

The result is a counter column that resets every distinct year/month and property code. This can be used to mark where I pull in the value from my monthly table, knowing it will not duplicate anywhere else in the result set. Taking it one step further, I used this function inside a CASE statement. In my SELECT query, I wrote:

CASE WHEN ROW_NUMBER() OVER (PARTITION BY year(date), month(date), property_code ORDER BY date, property_code) = 1 then monthly.value END AS column_name

Using my sample data from above, here’s how the results would appear:

ScreenHunter_30 Jun. 09 14.22

The key here is the equal to 1. Each time the counter resets to 1, I pull in the desired value from my monthly table, else the column is null. My results are still at the daily level. Remember, I still need to LEFT JOIN the tables on month(date) and year(date), and property_code. My data is now ready to bring it all into Tableau and I can start my visualization.

Share Button

Tableau in the office

I’m going to start sharing some things I do with Tableau in my professional life. I’ve done this once with my Time Duration post. Hopefully, “Tableau in the Office” becomes a recurring series and provides ideas/tips/tricks that you can employ in your daily Tableau routine.

This week I was asked by a colleague for some data on their jobs queue. As my division is the content supply chain for Discovery, incoming deliveries must be QC’d to ensure they meet our technical specifications. We call this process, Technical Evaluation or TE. Our media management system tracks these TE jobs with multiple date fields. We have job created date and job completed date. This is a high volume operation so new jobs are created every day as the team completes jobs in the queue.

To illustrate how I solved this, I’ve mocked up data in the following excel file. This file contains three columns: Created Date, Completed Date and Job Number. I randomly generated 1500 dates between January 1, 2014 and October 31, 2014 to populate the Created Date column. Then I populated the Completed Date column by randomly adding 1 to 15 days to the Created Date value.

I connected to my data source using the following custom SQL technique. For excel file, this would be:

SELECT
[Sheet1$].[created date] AS [date],
‘CREATED’ AS [SOURCE],
[Sheet1$].[job_num] AS [job_num]
FROM [Sheet1$]

UNION SELECT
[Sheet1$].[completed date] AS [date],
‘COMPLETED’ AS [SOURCE],
[Sheet1$].[job_num] AS [job_num]
FROM [Sheet1$]

The important thing to note here is that the first select statement labels the Created date as Date and the second labels Completed Date as Date. This way I have both date values in the same column when I bring the data into Tableau. Also, I added a SOURCE field to designate where the data is coming from.

Once in Tableau, I made a bar chart with Stack Marks turned off. Placed SOURCE in the color and size shelf. This is a great way to see how many jobs are being created versus completed for a given date range. I’m filtering out November in the view below.

job dates

This was my take on this problem. Let me know if you have done something similar in a different manner.

Share Button

Manual UTC Conversion

During a project last year I was creating a FACT table from one of many databases in my organization used to track program information. Using SQL Server, one of the fields that I wanted to pull was Air Date/Time of a program. When I first created my fact table, I noticed that the Air Date/Time field was in UTC. I suspect this database was created with a time OFFSET field but I didn’t have schema documentation for it.

I decided create my own conversion calculation to East Coast time. I’ve been working with SQL for a little time now but I’m no expert. So the challenge I faced was adjusting for Standard versus Day Light Savings because UTC does not change. I looked up the dates when we adjust our clocks from 2011 – 2025. The rest was the following SQL CASE statement:

CASE
WHEN [Table].[Date Field] BETWEEN ‘2011-03-13 07:00:00:000’ AND ‘2011-11-06 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2012-03-11 07:00:00:000’ AND ‘2012-11-04 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2013-03-10 07:00:00:000’ AND ‘2013-11-03 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2014-03-09 07:00:00:000’ AND ‘2014-11-02 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2015-03-08 07:00:00:000’ AND ‘2015-11-01 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2016-03-13 07:00:00:000’ AND ‘2016-11-06 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2017-03-12 07:00:00:000’ AND ‘2017-11-05 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2018-03-11 07:00:00:000’ AND ‘2018-11-04 06:00:00:000 THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2019-03-10 07:00:00:000’ AND ‘2019-11-03 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2020-03-08 07:00:00:000’ AND ‘2020-11-01 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2021-03-14 07:00:00:000’ AND ‘2021-11-07 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2022-03-13 07:00:00:000’ AND ‘2022-11-06 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2023-03-12 07:00:00:000’ AND ‘2023-11-05 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2024-03-10 07:00:00:000’ AND ‘2024-11-03 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
WHEN [Table].[Date Field] BETWEEN ‘2025-03-09 07:00:00:000’ AND ‘2025-11-02 06:00:00:000’ THEN DATEADD(hour,-4,[Table].[Date Field])
ELSE DATEADD(hour,-5,[Table].[Date Field]) END AS [Date Name]

I first created a parameter in Tableau to let the user decide to view the data between standard and daylight savings. But I wanted the users to avoid any confusion between the viz and any other area where one can see program air date/time. Secondly, for Tableau best data practices, do heavy data lifting calculations before connection to Tableau. Software like SQL Server are built for this type of data manipulation. Especially when working with millions of records, long winded calculations can slow Tableau performance.

Thanks to my manager for encouraging this post. I hope it saves anyone time who needs to convert UTC date fields. 

UPDATE: To perform the same in Tableau click here.

Share Button