How do I prevent my aggregations from counting duplicates when my query fans out?

Firstly, let’s talk about why we need to consider fan-out. Let’s say we have the following table of orders

If we want to get the sum of price on our orders, we simply do a sum of each of the values

select SUM(price) from orders

Now lets imagine we also have a table of individual order items. There is a one-to-many relationship between orders and order items (ie. each order can have many order items)

If we need to join these tables together, we get the following:

What happens if we do our simple sum on the joined data?

select SUM(price) from orders left join order_items on orders.order_id = order_items.order_id

We get 185 as the (incorrect) answer, because the order price column has fanned-out, meaning it has been duplicated across each of the order items.

Omni uses symmetric aggregates to allow aggregations across data that is duplicated. This article will go through some of the way Omni achieves this.

Duplication due to fan out in a join

This is by-far the most common scenario that requires symmetric aggregates. It is exactly the scenario described above, where you want to do an aggregation on a field from the “one” table when it is joined to a “many” table.

When defining relationship between tables in Omni, you set the relationship type (eg. one-to-many, many-to-one)

Omni can also infer this from the database - if the relationship is set to Unknown then you will get an Infer Relationship button

By setting this relationship type, you are telling Omni when to expect fan-out. It will then use the primary keys defined on each table to deduplicate the fields in the aggregate (or error if you don’t have a primary key set).

Duplication in a materialised table

Let’s say you pushed down the logic to join orders and order items into your transformations layer, eg. for improved performance so that you don’t always have to do the join on the fly. Because this is now a single materialised table, Omni has lost the context of the relationship between the underlying models. You need a new way to tell Omni that it needs to deduplicate a sum on the price field, based on the order ID. You can use the sum_distinct_on aggregate type for this. Your view file might would look like

dimensions:
  order_item_id:
    primary_key: true
  order_id: {}
  price: {}
measures:
  total_price:
    sql: ${price}
    aggregate_type: sum_distinct_on
    custom_primary_key_sql: ${order_id}

This will now calculate Total Price by first deduplicating based on the Order ID.

Duplication due to fan-out in an unnest

This is similar to the join scenario, except instead of joining two tables to each other, we’re doing an unnest join on an array. An example of this is where we have an orders table, and then an array of objects to represent the individual order items.

Omni will automatically write the unnest logic. All you need to do is add the custom_primary_key_sql parameter to the aggregations on the nested fields. For example, the view file code for this might look like:

dimensions:
  order_id: {}
    primary_key: true
  order_items: {} # this is the repeated field
  order_item_price: 
    parent_field: order_items
    nested_on_field: order_items
    sql: price
measures:
  total_price:
    sql: ${order_item_price}
    aggregate_type: sum_distinct_on
    custom_primary_key_sql: ${order_item_id}

This is telling Omni that the Primary Key for the table is the Order ID, but the distinct key to use for measures on the nested field is Order Item ID (ie. this is the equivalent of the primary key on the repeated data).