Sessionizing your events data

When trying to understand a user’s journey through your product, you’ll likely turn to your company’s events data, which records users’ actions like clicks, form fills, and purchases. But without the context of when these events occurred, it’s hard to understand user behavior.

That’s where sessionization comes in. When you “sessionize” your event data, you group events into sessions that capture some time-bound record of product usage (usually, sessions are periods in which each event happened within some time period after the previous one). This lets you see events as part of a specific journey a user took through your product, rather than a collection of random actions. With your sessionized data, you can do lots of interesting analyses, like a funnel analysis.

At a high level, sessionization consists of these steps:

  1. Decide on a definition of a session
  2. For each event, calculate the time since the previous event (partitioned by user)
  3. Define sessions based on the time since the previous event
  4. Join each event to its corresponding session and aggregate to get session-level metrics

Below, we’ll walk through how these steps build toward this SQL code that sessionizes your event data:

 WITH first_pass AS (
      SELECT
        id
        , user_id
        , created_at 
        , TIMESTAMPDIFF(
          minute
          , LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at)
          , created_at
        ) AS minutes_since_prior_event
      FROM events
      WHERE user_id IS NOT NULL
    )

    , sessions AS (
      SELECT 
        id AS session_id
        , user_id
        , created_at AS session_start
        , LEAD(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS next_session_start
        , ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS user_session_sequence_no
      
      FROM first_pass
      WHERE minutes_since_prior_event > 30 OR minutes_since_prior_event IS NULL
    )

    SELECT 
      sessions.session_id AS session_id
      , sessions.user_id AS user_id
      , user_session_sequence_no
      , session_start AS first_event_at
      , MAX(events.created_at) AS last_event_at
      , CASE 
          WHEN TIMESTAMPDIFF(second, session_start, MAX(events.created_at)) > 0 
            THEN TIMESTAMPDIFF(second, session_start, MAX(events.created_at)) 
          ELSE 
           CASE 
             WHEN TIMESTAMPDIFF(minute, session_start, current_timestamp) < 30
               THEN TIMESTAMPDIFF(minute, session_start, current_timestamp)
             ELSE 60 * 30
           END
END AS session_duration_seconds
      , COUNT(*) AS events_in_session
    FROM sessions
    JOIN events 
      ON events.user_id = sessions.user_id
      AND events.created_at >= sessions.session_start
      AND (events.created_at < sessions.next_session_start OR sessions.next_session_start IS NULL)
    GROUP BY 1,2,3,4

Running this code will result in a table with key session metrics like this:

:notebook_with_decorative_cover: Note: This example SQL code is written for Snowflake SQL, but the logic can be easily adjusted to work for your SQL dialect.

1: Decide on a definition of a session

There are multiple ways you can define a session. One option is to use a time-based definition – for example, you can decide that a session will expire at midnight every day. Another option that’s helpful when looking at web traffic sessions is a campaign-based definition – for example, a new session would begin any time a user arrives at your site via a different campaign (Facebook, Google, etc.).

For the sake of this analysis, we’ll use a popular variation of a time-based definition: a “session timeout” length, or a maximum length of time between events that still qualifies a user as “active.” This is useful because it lets you analyze user behavior contained to a period of active usage, but it can also be difficult to nail down since users often multi-task among many apps. For example, with Omni, developers may be toggling between Omni and dbt as they update the model, and analysts may be switching between Omni and Google Slides while they’re building a metrics slide deck. In either case, we’d want to define some period where they’re ‘active’ in Omni.

:bulb:Tip: A 30-minute session timeout is standard, but you may want to check with your Engineering team to see if they already have a session timeout value. *Engineers may have a session timeout value defined to log users out after the session timeout period for security reasons, or to prompt users with a notification to re-engage them.

2: For each event, calculate the time elapsed since the previous event (partitioned by user)

To see how long has elapsed between each user’s events, you can run some SQL using window functions. Let’s assume your events data looks like this:

Then the first CTE of your SQL query can look like this:

 WITH first_pass AS (
      SELECT
        id
        , user_id
        , created_at 
        , TIMESTAMPDIFF(
          minute
          , LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at)
          , created_at
        ) AS minutes_since_prior_event
      FROM events
      WHERE user_id IS NOT NULL
    )

.
.
.

To result in this table:

3: Define sessions based on the time since the previous event

Now, we can start identifying unique sessions. This lets you zoom in from a bird’s eye view of many user’s actions in your product to a specific user’s experience with your product. This part is a bit complicated; we’ll walk through it step-by-step, but here’s the second CTE we’re building toward:

  WITH first_pass AS (
      SELECT
        id
        , user_id
        , created_at 
        , TIMESTAMPDIFF(
          minute
          , LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at)
          , created_at
        ) AS minutes_since_prior_event
      FROM events
      WHERE user_id IS NOT NULL
    )

    , sessions AS (
      SELECT 
        id AS session_id
        , user_id
        , created_at AS session_start
        , LEAD(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS next_session_start
        , ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS user_session_sequence_no
      
      FROM first_pass
      WHERE minutes_since_prior_event > 30 OR minutes_since_prior_event IS NULL
    )
  1. With our first_pass table, we can easily identify which events denote the “start” of a session: it’s the events that are either (a) the user’s first event ever or (b) more than 30 minutes since the previous event. So we start by filtering for only these events:

WHERE minutes_since_prior_event > 30 OR minutes_since_prior_event IS NULL

  1. By filtering in #1, we now have a table where each row represents a unique session. Since each event ID is unique, we can repurpose the event ID to become the session ID:

id AS session_id

  1. Next, we can add the start time of each user’s next session to demarcate the time boundaries for events that fall into each session:

LEAD(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS next_session_start

  1. We can also add a user_session_sequence_no so we can easily filter for a user’s 1st, 2nd… Nth session:

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS user_session_sequence_no

  1. Finally, we can add the user_id and session_start times:

, user_id

, created_at AS session_start

→ Altogether, this turns our first_pass table of events into a sessions table that looks like this:

4: Join each event to its corresponding session

We now have a list of sessions with a start and end time, as well as a list of events with timestamps. Next, we want to identify some more details about each session, including:

  • The timestamp of the last event in the session (i.e. when the session “ended”)
  • How long the session was
  • How many events occurred in the session

To do that, we can use a SQL join to identify which session each event belongs to and aggregate per session:

FROM sessions
    JOIN events 
      ON events.user_id = sessions.user_id
      AND events.created_at >= sessions.session_start
      AND (events.created_at < sessions.next_session_start OR sessions.next_session_start IS NULL)

In the join above, we’re checking to ensure:

  • The user_id of the event matches the user_id of the sessions (events.user_id = sessions.user_id)
  • The event happened at the exact same time or after the session started (events.created_at >= sessions.session_start)
  • The event happened before the next session started, or the event is part of the current session (events.created_at < sessions.next_session_start OR sessions.next_session_start IS NULL)

Once we complete the join, each event will be attached to a session :tada:

From here, all we need to do is aggregate on a per-session level, then calculate our final metrics:

  • The timestamp of the last event in the session: MAX(events.created_at) AS last_event_at

    • :bulb:Note: This is different from sessions.next_session_start, which is the beginning of the next session, not the last event in the current session.
  • The session duration (between first and last event):
    We use a CASE WHEN statement here to check whether or not a given session length returns 0. If so, this means that a given session is either (1) the current session (still in progress), or (2) it was a session with only 1 event. In the first case, we return the difference between the session start and the current time. In the second case, we return the maximum session length.

	CASE 
       WHEN TIMESTAMPDIFF(second, session_start, MAX(events.created_at)) > 0 
       THEN TIMESTAMPDIFF(second, session_start, MAX(events.created_at)) 
    ELSE 
           CASE 
             WHEN TIMESTAMPDIFF(minute, session_start, current_timestamp) < 30
               THEN TIMESTAMPDIFF(minute, session_start, current_timestamp)
             ELSE 60 * 30
END
  • The number of events in the session: COUNT(*) AS events_in_session

All these SQL blocks come together to produce this query:

  WITH first_pass AS (
      SELECT
        id
        , user_id
        , created_at 
        , TIMESTAMPDIFF(
          minute
          , LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at)
          , created_at
        ) AS  minutes_since_prior_event
      FROM events
      WHERE user_id IS NOT NULL
    )

    , sessions AS (
      SELECT 
        id AS session_id
        , user_id
        , created_at AS session_start
        , LEAD(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS next_session_start
        , ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at) AS user_session_sequence_no
      
      FROM first_pass
      WHERE minutes_since_prior_event > 30 OR minutes_since_prior_event IS NULL
    )

    SELECT 
      sessions.session_id AS session_id
      , sessions.user_id AS user_id
      , user_session_sequence_no
      , session_start AS first_event_at
      , MAX(events.created_at) AS last_event_at
      , CASE 
WHEN TIMESTAMPDIFF(second, session_start, MAX(events.created_at)) > 0 
THEN TIMESTAMPDIFF(second, session_start, MAX(events.created_at)) 
ELSE 
           CASE 
             WHEN TIMESTAMPDIFF(minute, session_start, current_timestamp) < 30
               THEN TIMESTAMPDIFF(minute, session_start, current_timestamp)
             ELSE 60 * 30
           END
      , COUNT(*) AS events_in_session
    FROM sessions
    JOIN events 
      ON events.user_id = sessions.user_id
      AND events.created_at >= sessions.session_start
      AND (events.created_at < sessions.next_session_start OR sessions.next_session_start IS NULL)
    GROUP BY 1,2,3,4

Ultimately, running this SQL block results in a session-level table like this: