Calculating Hourly Utilization in PostgreSQL
I recently received a request from a member of the site sales team at Volta for hourly station utilization. This metric is interesting to me because it requires calculation, specifically the percentage of time when someone is charging at that station. Since our data is stored in PostgreSQL, why not make it do the heavy lifting to answer this question? Turns out there are some great utilities baked in which make this an easy question to answer.
What is the utilization per hour for station X on August 13th, 2015?
For a single charging station on a single day this is an easy problem to answer, but at Volta we often deal with much larger volumes. Let’s explore a solution which will scale to looking at hundreds of thousands of charge events recorded from hundreds of stations, step by step.
Generate the Series
Since we want to know the average utilization per hour, we will start by creating a series containing every hour in the day. Developing the range for our data is easy:
SELECT
interval_start,
interval_start + INTERVAL '00:59:59' AS interval_stop
FROM
GENERATE_SERIES(
'8-13-2015'::TIMESTAMP,
'8-13-2015 23:59:59'::TIMESTAMP,
'1 hour'::INTERVAL
) AS interval_start;
I’ll call this subquery intervals from here on.
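For anyone who wants to sanity-check the shape of this series outside the database, here is a minimal Python sketch of the same logic (`hourly_intervals` is a hypothetical helper, not part of the SQL above):

```python
from datetime import datetime, timedelta

def hourly_intervals(day):
    """Mirror GENERATE_SERIES: one (interval_start, interval_stop) pair per hour."""
    intervals = []
    midnight = datetime(day.year, day.month, day.day)
    for hour in range(24):
        interval_start = midnight + timedelta(hours=hour)
        # Same '00:59:59' offset as the SQL's interval_stop column.
        interval_stop = interval_start + timedelta(minutes=59, seconds=59)
        intervals.append((interval_start, interval_stop))
    return intervals

intervals = hourly_intervals(datetime(2015, 8, 13))
print(len(intervals))  # 24
```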
Perfect, now we’ve got a timestamp for each of the 24 hours in the day. Note that we’ll need to wrap this in an outer SELECT because interval_start and interval_stop are computed columns, so they can’t be used in a WHERE clause of the same query.
Overlapping Events
With the magic of PostgreSQL’s OVERLAPS condition, introduced in 7.4, we can easily grab all events that share any amount of time with the interval:
SELECT
interval_start
FROM (...) AS intervals
LEFT OUTER JOIN charge_events
ON (interval_start, interval_stop)
OVERLAPS (charge_events.time_start, charge_events.time_stop)
AND charge_events.station_id = 97;
I’ll call this subquery interval_overlap from here on.
The LEFT OUTER JOIN is very important here: without this type of join you may lose intervals during which no events occurred. Written this way we always get back at least 24 rows, more if multiple events happen in the same interval.
The magic of using OVERLAPS is that events which started before the interval began, ended after it finished, (or both!), will still be captured, edge cases which are easy to miss. Our only problem now is how to count all that time up.
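To make those edge cases concrete, here is a rough Python approximation of the OVERLAPS check (two ranges overlap when each starts before the other ends; this is a simplified sketch, not the exact SQL standard definition):

```python
from datetime import datetime

def overlaps(a_start, a_stop, b_start, b_stop):
    """Rough equivalent of SQL's (a_start, a_stop) OVERLAPS (b_start, b_stop)."""
    return a_start < b_stop and b_start < a_stop

# Interval 10:00-10:59:59 on Aug 13; event started the previous day
# and was still running well past the interval.
interval = (datetime(2015, 8, 13, 10), datetime(2015, 8, 13, 10, 59, 59))
event = (datetime(2015, 8, 12, 22), datetime(2015, 8, 13, 14))
print(overlaps(*interval, *event))  # True: the event spans the whole interval
```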
Overlapping Time
Unfortunately, PostgreSQL doesn’t have a built-in function (that I know of) which returns the amount of overlap between two intervals, only whether or not they overlap. Instead we can compute an additional property to SELECT in the interval_overlap query. Fortunately there is an easy equation for this computation, well described in this blog post:
SELECT
interval_start,
CASE
WHEN charge_events.id IS NOT NULL THEN
LEAST(
time_stop - interval_start,
time_stop - time_start,
interval_stop - time_start,
interval_stop - interval_start
)
ELSE INTERVAL '00:00:00'
END AS overlap
FROM (...) AS intervals
LEFT OUTER JOIN charge_events
ON (interval_start, interval_stop)
OVERLAPS (charge_events.time_start, charge_events.time_stop)
AND charge_events.station_id = 97;
Once again we’ve created a computed column, so if we want to do additional math, like SUMming all of the calculated overlaps, we’ll need to wrap this up in another SELECT.
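Before summing, it’s worth convincing yourself that the LEAST trick works: for two overlapping ranges, the smallest of those four differences is the same as min(stops) − max(starts). A quick Python check (`overlap_duration` is a hypothetical helper mirroring the CASE expression):

```python
from datetime import datetime, timedelta

def overlap_duration(interval_start, interval_stop, time_start, time_stop):
    """Mirror the SQL LEAST(...): smallest of the four pairwise differences."""
    return min(
        time_stop - interval_start,
        time_stop - time_start,
        interval_stop - time_start,
        interval_stop - interval_start,
    )

# Interval 10:00-11:00; event 10:30-12:00 -> 30 minutes of shared time.
d = overlap_duration(
    datetime(2015, 8, 13, 10), datetime(2015, 8, 13, 11),
    datetime(2015, 8, 13, 10, 30), datetime(2015, 8, 13, 12),
)
print(d)  # 0:30:00
```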
SELECT
interval_start,
SUM(overlap) AS sum_overlap
FROM (...) AS interval_overlap
GROUP BY interval_start
ORDER BY interval_start;
I’ll call this subquery total_overlaps from here on.
Here is an SQL Fiddle with a subset of the data I used and the full query written out.
Averaging Over Wider Ranges
Let’s expand the question a bit:
What is the average utilization per hour for station X during the month of November?
When we adjust the dates in our GENERATE_SERIES query to cover the full month we get at least 720 rows, one for each hour in the month. Since we’re still looking for the average utilization at each hour, we can wrap this query in yet another SELECT and group it by the interval_start hour:
SELECT
interval_start::TIME AS start_hour,
AVG(sum_overlap)
FROM (...) AS total_overlaps
GROUP BY start_hour
ORDER BY start_hour;
By casting interval_start as TIME we can GROUP across multiple days. This entire query finishes in under a second running on my medium AWS RDS instance, not bad for a pretty computationally expensive query.
Here’s the fiddle for the increased range.
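The TIME cast is doing the same job as grouping by time of day. A rough Python equivalent of that final step, using hypothetical per-interval totals rather than real Volta data:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical (interval_start, sum_overlap) rows from total_overlaps.
totals = [
    (datetime(2015, 11, 1, 9), timedelta(minutes=30)),
    (datetime(2015, 11, 2, 9), timedelta(minutes=50)),
    (datetime(2015, 11, 1, 10), timedelta(minutes=10)),
]

# Group by time of day, like interval_start::TIME, then average each bucket.
by_hour = defaultdict(list)
for interval_start, sum_overlap in totals:
    by_hour[interval_start.time()].append(sum_overlap)

averages = {t: sum(ds, timedelta()) / len(ds) for t, ds in sorted(by_hour.items())}
print(averages[datetime(2015, 11, 1, 9).time()])  # 0:40:00
```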
Querying Multiple Stations
Our last twist will be the true test of this approach: querying multiple stations at once. Let’s make a couple of simple modifications to enable multiple stations.
What is the average utilization per hour for stations in Pasadena during the month of November?
First, to make sure our initial series returns a separate row for each hour at each station, we need to JOIN the stations in the intervals subquery:
SELECT
id AS station_id,
interval_start,
interval_start + INTERVAL '00:59:59' AS interval_stop
FROM
GENERATE_SERIES(
'8-13-2015'::TIMESTAMP,
'8-13-2015 23:59:59'::TIMESTAMP,
'1 hour'::INTERVAL
) AS interval_start
JOIN stations
ON id IN (96, 97);
This approach is more flexible than our initial attempt because it returns a record for each station for each interval, whether or not any events occurred during that interval. Now that we’re using IN we can list as many stations as we want, or even toss in a subquery to dynamically grab a whole group.
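That JOIN against stations is effectively a cross product between the station ids and the hourly series. In Python terms (a hypothetical sketch of the row shape, not the real schema):

```python
from datetime import datetime, timedelta
from itertools import product

station_ids = [96, 97]
series = [datetime(2015, 8, 13) + timedelta(hours=h) for h in range(24)]

# One row per (station_id, interval_start, interval_stop),
# like the JOIN stations ON id IN (96, 97) against GENERATE_SERIES.
rows = [
    (station_id, start, start + timedelta(minutes=59, seconds=59))
    for station_id, start in product(station_ids, series)
]
print(len(rows))  # 48: 24 hours x 2 stations
```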
Next we need to GROUP our results on the station_id. By selecting it in both the interval_overlap subquery and the outermost query, we can use it like so:
GROUP BY interval_start, station_id
Lastly, let’s wrap the entire query in the same average logic we used before:
SELECT
interval_start::TIME AS start_hour,
AVG(sum_overlap) AS average_overlap
FROM (...) AS total_overlaps
GROUP BY start_hour
ORDER BY start_hour;
Here’s another Fiddle showing the final query; note that I am not including a huge amount of data, so the result set may look strange.
Analysis and Performance
We used some interesting functions to build a powerful dataset, one which has enabled our team at Volta to highlight how highly utilized our stations are, which in turn increases demand for more. GENERATE_SERIES and OVERLAPS are my two favorite features that I’ve learned recently, in addition to BETWEEN.
By always working with at least one row per hour per station of the desired range we have achieved incredibly consistent performance: ~800ms per month per station. Here is a link to the EXPLAIN ANALYZE output for those interested.
This is just one of many approaches to solving this problem, hopefully it is helpful to anyone facing similar requirements. All thoughts and questions are welcome, especially if you know of another way to do this!