Calculating Hourly Utilization in PostgreSQL
I recently received a request from a member of the site sales team at Volta for hourly station utilization. This metric is interesting to me because it requires calculation, specifically the percentage of time when someone is charging at that station. Since our data is stored in PostgreSQL, why not make it do the heavy lifting to answer this question? Turns out there are some great utilities baked in which make this an easy question to answer.
What is the utilization per hour for station X on August 13th, 2015?
For a single charging station on a single day this is an easy problem to answer, but at Volta we often deal with much larger volumes. Let’s explore a solution which will scale to looking at hundreds of thousands of charge events recorded from hundreds of stations, step by step.
Generate the Series
Since we want to know the average utilization per hour, we will start by creating a series containing every hour in the day. Developing the range for our data is easy:
SELECT
interval_start,
interval_start + INTERVAL '00:59:59' AS interval_stop
FROM
GENERATE_SERIES(
'8-13-2015'::TIMESTAMP,
'8-13-2015 23:59:59'::TIMESTAMP,
'1 hour'::INTERVAL
) AS interval_start;
I’ll call this subquery intervals from here on.
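For anyone who wants to sanity-check the shape of this series outside the database, here is a minimal Python sketch of the same logic (`hourly_intervals` is a hypothetical helper, not part of the SQL above):

```python
from datetime import datetime, timedelta

def hourly_intervals(day):
    """Mirror GENERATE_SERIES: one (interval_start, interval_stop) pair per hour."""
    intervals = []
    midnight = datetime(day.year, day.month, day.day)
    for hour in range(24):
        interval_start = midnight + timedelta(hours=hour)
        # Same '00:59:59' offset as the SQL's interval_stop column.
        interval_stop = interval_start + timedelta(minutes=59, seconds=59)
        intervals.append((interval_start, interval_stop))
    return intervals

intervals = hourly_intervals(datetime(2015, 8, 13))
print(len(intervals))  # 24
```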
Perfect, now we’ve got a timestamp for each of the 24 hours in the day. Note that we’ll need to wrap this in an outer SELECT because interval_start and interval_stop are computed columns, so they can’t be used in a WHERE clause of the same query.
Overlapping Events
With the magic of PostgreSQL’s OVERLAPS condition, introduced in 7.4, we can easily grab all events that share any amount of time with the interval:
SELECT
interval_start
FROM (...) AS intervals
LEFT OUTER JOIN charge_events
ON (interval_start, interval_stop)
OVERLAPS (charge_events.time_start, charge_events.time_stop)
AND charge_events.station_id = 97;
I’ll call this subquery interval_overlap from here on.
The LEFT OUTER JOIN is very important here: without this type of join you may lose intervals during which no events occurred. Written this way we always get back at least 24 rows, more if multiple events happen in the same interval.
The magic of using OVERLAPS is that events which started before the interval began, ended after it finished, (or both!), will still be captured, edge cases which are easy to miss. Our only problem now is how to count all that time up.
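To make those edge cases concrete, here is a rough Python approximation of the OVERLAPS check (two ranges overlap when each starts before the other ends; this is a simplified sketch, not the exact SQL standard definition):

```python
from datetime import datetime

def overlaps(a_start, a_stop, b_start, b_stop):
    """Rough equivalent of SQL's (a_start, a_stop) OVERLAPS (b_start, b_stop)."""
    return a_start < b_stop and b_start < a_stop

# Interval 10:00-10:59:59 on Aug 13; event started the previous day
# and was still running well past the interval.
interval = (datetime(2015, 8, 13, 10), datetime(2015, 8, 13, 10, 59, 59))
event = (datetime(2015, 8, 12, 22), datetime(2015, 8, 13, 14))
print(overlaps(*interval, *event))  # True: the event spans the whole interval
```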
Overlapping Time
Unfortunately, PostgreSQL doesn’t have a built-in function (that I know of) which returns the amount of overlap between two intervals, only whether or not they overlap. Instead we can compute an additional property to SELECT in the interval_overlap query. Fortunately there is an easy equation for this computation, well described in this blog post:
SELECT
interval_start,
CASE
WHEN charge_events.id IS NOT NULL THEN
LEAST(
time_stop - interval_start,
time_stop - time_start,
interval_stop - time_start,
interval_stop - interval_start
)
ELSE INTERVAL '00:00:00'
END AS overlap
FROM (...) AS intervals
LEFT OUTER JOIN charge_events
ON (interval_start, interval_stop)
OVERLAPS (charge_events.time_start, charge_events.time_stop)
AND charge_events.station_id = 97;
Once again we’ve created a computed column, so if we want to do additional math, like SUMming all of the calculated overlaps, we’ll need to wrap this up in another SELECT.
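Before summing, it’s worth convincing yourself that the LEAST trick works: for two overlapping ranges, the smallest of those four differences is the same as min(stops) − max(starts). A quick Python check (`overlap_duration` is a hypothetical helper mirroring the CASE expression):

```python
from datetime import datetime, timedelta

def overlap_duration(interval_start, interval_stop, time_start, time_stop):
    """Mirror the SQL LEAST(...): smallest of the four pairwise differences."""
    return min(
        time_stop - interval_start,
        time_stop - time_start,
        interval_stop - time_start,
        interval_stop - interval_start,
    )

# Interval 10:00-11:00; event 10:30-12:00 -> 30 minutes of shared time.
d = overlap_duration(
    datetime(2015, 8, 13, 10), datetime(2015, 8, 13, 11),
    datetime(2015, 8, 13, 10, 30), datetime(2015, 8, 13, 12),
)
print(d)  # 0:30:00
```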
SELECT
interval_start,
SUM(overlap) AS sum_overlap
FROM (...) AS interval_overlap
GROUP BY interval_start
ORDER BY interval_start;
I’ll call this subquery total_overlaps from here on.
Here is an SQL Fiddle with a subset of the data I used and the full query written out.
Averaging Over Wider Ranges
Let’s expand the question a bit:
What is the average utilization per hour for station X during the month of November?
When we adjust the dates in our GENERATE_SERIES query to cover the full month we get at least 720 rows, one for each hour in the month. Since we’re still looking for the average utilization at each hour, we can wrap this query in yet another SELECT and group it by the interval_start hour:
SELECT
interval_start::TIME AS start_hour,
AVG(sum_overlap)
FROM (...) AS total_overlaps
GROUP BY start_hour
ORDER BY start_hour;
By casting interval_start as TIME we can GROUP across multiple days. This entire query finishes in under a second running on my medium AWS RDS instance, not bad for a pretty computationally expensive query.
Here’s the fiddle for the increased range.
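The TIME cast is doing the same job as grouping by time of day. A rough Python equivalent of that final step, using hypothetical per-interval totals rather than real Volta data:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical (interval_start, sum_overlap) rows from total_overlaps.
totals = [
    (datetime(2015, 11, 1, 9), timedelta(minutes=30)),
    (datetime(2015, 11, 2, 9), timedelta(minutes=50)),
    (datetime(2015, 11, 1, 10), timedelta(minutes=10)),
]

# Group by time of day, like interval_start::TIME, then average each bucket.
by_hour = defaultdict(list)
for interval_start, sum_overlap in totals:
    by_hour[interval_start.time()].append(sum_overlap)

averages = {t: sum(ds, timedelta()) / len(ds) for t, ds in sorted(by_hour.items())}
print(averages[datetime(2015, 11, 1, 9).time()])  # 0:40:00
```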
Querying Multiple Stations
Our last twist will be the true test of this approach: querying multiple stations at once. Let’s make a couple of simple modifications to enable multiple stations.
What is the average utilization per hour for stations in Pasadena during the month of November?
First, to make sure our initial series returns a separate row for each hour at each station, we need to JOIN the stations in the intervals subquery:
SELECT
id AS station_id,
interval_start,
interval_start + INTERVAL '00:59:59' AS interval_stop
FROM
GENERATE_SERIES(
'8-13-2015'::TIMESTAMP,
'8-13-2015 23:59:59'::TIMESTAMP,
'1 hour'::INTERVAL
) AS interval_start
JOIN stations
ON id IN (96, 97);
This approach is more flexible than our initial attempt because it returns a record for each station for each interval, whether or not any events occurred during that interval. Now that we’re using IN we can list as many stations as we want, or even toss in a subquery to dynamically grab a whole group.
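That JOIN against stations is effectively a cross product between the station ids and the hourly series. In Python terms (a hypothetical sketch of the row shape, not the real schema):

```python
from datetime import datetime, timedelta
from itertools import product

station_ids = [96, 97]
series = [datetime(2015, 8, 13) + timedelta(hours=h) for h in range(24)]

# One row per (station_id, interval_start, interval_stop),
# like the JOIN stations ON id IN (96, 97) against GENERATE_SERIES.
rows = [
    (station_id, start, start + timedelta(minutes=59, seconds=59))
    for station_id, start in product(station_ids, series)
]
print(len(rows))  # 48: 24 hours x 2 stations
```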
Next we need to GROUP our results on the station_id. By selecting it in both the interval_overlap subquery and the outermost query, we can use it like so:
GROUP BY interval_start, station_id
Lastly, let’s wrap the entire query in the same average logic we used before:
SELECT
interval_start::TIME AS start_hour,
AVG(sum_overlap) AS average_overlap
FROM (...) AS total_overlaps
GROUP BY start_hour
ORDER BY start_hour;
Here’s another Fiddle showing the final query; note that I am not including a huge amount of data, so the result set may look strange.
Analysis and Performance
We used some interesting functions to build a powerful dataset, one which has enabled our team at Volta to highlight how highly utilized our stations are, which in turn increases demand for more. GENERATE_SERIES and OVERLAPS are my two favorite features that I’ve learned recently, in addition to BETWEEN.
By always working with at least one row per hour per station of the desired range we have achieved incredibly consistent performance: ~800ms per month per station. Here is a link to the EXPLAIN ANALYZE output for those interested.
This is just one of many approaches to solving this problem, hopefully it is helpful to anyone facing similar requirements. All thoughts and questions are welcome, especially if you know of another way to do this!