Often, we'll need to create groups of some kind on a
dimension, either to group long lists of members up into more
user-friendly groups, or to group numeric attributes such as Age or
measure values into bands or ranges. Analysis Services offers some
functionality to help us do this. But as usual, we'll get much more
flexibility if we design these groups into the dimension ourselves.
Grouping
First of all, let's consider why
we might want to group members on a large attribute hierarchy. Some
dimensions are not only very large (there are a lot of rows in the
dimension table) but they are also very flat, so they have very few
attributes on them that are related to each other and have very few
natural hierarchies. We might have a Customer dimension with millions of individual customers on it, and we might also have City and Country
attributes, but even then it might be the case that for a large city, a
user might drill down and see hundreds or thousands of customers. In
this situation, a user looking for an individual Customer might have
problems finding the one they want if they need to search through a very
long list; some client tools might also be slow to respond if they have
to display such a large number of members in a dialog or dimension
browser. Therefore, it makes sense to create extra attributes on such
dimensions to group members together to reduce the chance of this
happening.
Analysis Services can automatically create groups for you, using the DiscretizationMethod and DiscretizationBucketCount properties on an attribute. The DiscretizationMethod property allows you to choose how groups should be created: the EqualAreas option will try to create groups with a roughly equal number of members in them, the Clusters option will use a data mining algorithm to create groups of similar members, and the Automatic option will try to work out which of the preceding two options fits the data best. The DiscretizationBucketCount property specifies the number of groups that should be created. Full details of how this functionality works can be found at http://tinyurl.com/groupingatts. While it does what it is supposed to do, it rarely makes sense to
use it. The reason why can be seen from the following screenshot, which
shows the result of using the EqualAreas option to group a Weight attribute:
Clearly, this isn't very
user-friendly, and while you could try to tweak property values to get
the groups and group names you want, frankly, it is much easier to
create new columns in the views we're building our dimensions from to
get exactly the kind of grouping that you want. Nothing is going to be
more flexible than SQL for this job, and writing the necessary SQL code
is not hard: usually a simple CASE
statement will be sufficient. For example, the following T-SQL expression can be
used to create a new column in a view or a named calculation:
CASE WHEN Weight IS NULL OR Weight<0 THEN 'N/A'
WHEN Weight<10 THEN '0-10Kg'
WHEN Weight<20 THEN '10-20Kg'
ELSE '20Kg or more'
END
This yields much better results in the dimension when you build an attribute from it:
In this case, the names happen
to sort in the order you'd want to see them, but in other cases you might need an
additional column to use as the key for the new attribute. The point is
that in this situation, as in many others, a little extra time spent
modeling the relational data to get it the way you want it pays
dividends even when Analysis Services seems to offer you a quicker way
of getting things done.
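As a rough sketch of the additional key column mentioned above, a dimension view could expose both a numeric key and a name for each group. The view name dbo.vDimProduct, the source table dbo.Product, and the ProductKey and ProductName columns are hypothetical; only the Weight column and the band labels come from the example above.

```sql
-- Hypothetical view over a hypothetical dbo.Product table.
CREATE VIEW dbo.vDimProduct
AS
SELECT
    p.ProductKey,
    p.ProductName,
    p.Weight,
    -- Numeric key for the new attribute; it fixes the sort order of the groups
    CASE WHEN p.Weight IS NULL OR p.Weight < 0 THEN 0
         WHEN p.Weight < 10 THEN 1
         WHEN p.Weight < 20 THEN 2
         ELSE 3
    END AS WeightGroupKey,
    -- Name displayed to users for the same group
    CASE WHEN p.Weight IS NULL OR p.Weight < 0 THEN 'N/A'
         WHEN p.Weight < 10 THEN '0-10Kg'
         WHEN p.Weight < 20 THEN '10-20Kg'
         ELSE '20Kg or more'
    END AS WeightGroupName
FROM dbo.Product AS p;
```

The attribute would then use WeightGroupKey as its key column and WeightGroupName as its name column, ordered by key. Here 'N/A' sorts first; giving it a high key value instead would move it to the end of the list.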
Banding
Similarly, we might need to
create an entire dimension that acts as a way of grouping measure values
on a fact table. For example, we might have a measure that gives us the
total value of an order, and we might want to find the total number of
orders whose values fall into some predefined bandings such as 'High
Value', 'Medium Value' or 'Low Value'. In this case, again we would need
to create a dimension table to hold these bandings, but one problem we
might have to face is that the ranges used for the bandings might change
frequently as the users' requirements change—one day a 'High Value'
order might be one for more than €10000, the next it might be more than
€15000.
If we modeled our new
dimension using meaningless surrogate keys, we would have to perform a
lookup during our ETL to find out which band each row in the fact table
fell into and assign it the appropriate surrogate key:
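A minimal sketch of such a lookup follows, assuming a hypothetical staging table staging.Orders(OrderID, OrderValue) and a hypothetical dimension table dbo.DimOrderBand(BandKey, BandName, LowerBound, UpperBound), where a NULL UpperBound marks the open-ended top band. None of these names come from the text.

```sql
-- Range lookup during ETL: find the band each order falls into and
-- pick up that band's meaningless surrogate key for the fact row.
SELECT
    o.OrderID,
    o.OrderValue,
    b.BandKey
FROM staging.Orders AS o
JOIN dbo.DimOrderBand AS b
    ON  o.OrderValue >= b.LowerBound
    AND (o.OrderValue < b.UpperBound OR b.UpperBound IS NULL);
```

Because BandKey is stored on every fact row, the result of this join is only valid for the band boundaries in force when the ETL ran.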
But what would happen if the
user changed the bandings? We would then have to
reload our entire fact table, because potentially any order might now
fall into a new banding. A more flexible approach is to hardcode only
the granularity of the bandings into the fact table: for example, we
could say that our bandings could only have boundaries divisible by
€100. This would then allow us to use an expression in our fact table
ETL such as Floor(OrderValue/100) to
create a meaningful key; in our dimension, we would then create one row
per €100 up to what we think the maximum value of an order might be, and
then group these €100 ranges into the bandings our users wanted as
follows:
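One possible way to build such a dimension table is sketched below. The table name dbo.DimOrderValueBand, its columns, the maximum order value of just under €100000, and the €1000 threshold for 'Medium Value' are all assumptions made for illustration; the €10000 threshold for 'High Value' echoes the example above.

```sql
-- Hypothetical dimension table: one row per 100-euro range.
CREATE TABLE dbo.DimOrderValueBand (
    OrderValueKey int NOT NULL PRIMARY KEY,  -- equals FLOOR(OrderValue / 100)
    BandName      varchar(20) NOT NULL
);

-- Generate keys 0..999, covering order values up to 99,999.99 (assumed maximum).
WITH Numbers AS (
    SELECT TOP (1000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.DimOrderValueBand (OrderValueKey, BandName)
SELECT CAST(n AS int),
       CASE WHEN n * 100 >= 10000 THEN 'High Value'
            WHEN n * 100 >= 1000  THEN 'Medium Value'  -- assumed threshold
            ELSE 'Low Value'
       END
FROM Numbers;
```

In the fact table ETL, the matching key could then be derived with an expression such as CAST(FLOOR(OrderValue / 100) AS int), with no lookup against the dimension required.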
The advantage of this is that
so long as the granularity of the bandings doesn't change, we will never
need to reload our fact table. In Analysis Services terms, this
dimension would have two attributes: one built from the meaningful key,
and one to hold the name of the band; a Process Update would be all that
was necessary when the banding boundaries changed because only the
dimension table would have been changed.
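To illustrate a boundary change, suppose the dimension lives in a hypothetical table dbo.DimOrderValueBand(OrderValueKey, BandName), with one row per €100 range and OrderValueKey equal to the lower bound of the range divided by 100. Moving the start of 'High Value' from €10000 to €15000 might then look like this sketch, which assumes the band below it is called 'Medium Value':

```sql
-- Reassign the ranges from 10,000 up to (but not including) 15,000 euros
-- to the band below. The fact table is not touched.
UPDATE dbo.DimOrderValueBand
SET BandName = 'Medium Value'
WHERE OrderValueKey >= 100   -- 10,000 / 100
  AND OrderValueKey < 150    -- 15,000 / 100
  AND BandName = 'High Value';
```

After this statement runs, a Process Update on the dimension picks up the new band membership.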