Introduction

When aggregating by a list of "group-by" properties, the developer has the option of specifying Multiset Discrimination instead of sorting, in order to gather together all items whose properties compare equal to each other prior to aggregation. Normally, sorting is used to achieve this aim. Java's built-in method "Collections.sort" is used to sort the input before aggregation. It is a merge sort, which means it runs in O(n log n) time. For a lot of purposes, this is fine, and additionally, it has the added benefit that the returned AggregateValues are placed in the returned List in sorted order. For those applications that need to aggregate a lot of data, and the returned order does not matter, Multiset Discrimination is the answer. While multiset discrimination may return results that are "out of order", it is able to place items whose properties compare equal adjacent to each other, just like sorting can. However, unlike sorting, multiset discrimination runs in O(n) time in the worst case.

Basic Multiset Discrimination

All data can be eventually broken down to the bits and bytes of the data to analyze. At the base case, jAgg discriminates 16-bit values. It allocates an array "lists" of 216 Lists of items, and an array "used" of 216 integers to represent used indices in the array of Lists. Then it follows this algorithm:

  1. For each element of the input List to discriminate, access the proper List from "lists", using the element value as the index. Add the element to this list.
  2. If the list was empty, then add the element value as the index to the list of used indices, "used".
  3. For each element in the "used" indices, add the corresponding List obtained from "lists". Clear the List from "lists". Each of these lists is called an equivalence class, consisting of only items compared equal.
  4. Clear the list of used indices in "used" to prepare for reuse.

A constant level of operations is performed for each element in the input List, so the running time of this algorithm is O(n).

Object types that implement the interface Discriminator are capable of multiset discrimination on a particular type.

Multiset Discrimination On Bigger Types

Most object types can have more than 216 possible values. So jAgg performs multiset discrimination on those types hierarchically. It can't do its job on the entire value, so it does its job repeatedly, on labels. A label is a portion of a value that is used as the actual value during multiset discrimination. It performs multiset discrimination on the first label, or 16 bytes, of the value that it can, creating a set of equivalence classes. Then, on each equivalence class, it performs multiset discrimination on the next label, or 16 bytes, creating another set of equivalence classes. This process is repeated until the entire value is used up.

Objects that implement the Extractor interface are capable of taking an element, and extracting a label from that element for this purpose.

For primitive values such as bytes, chars, shorts, and booleans, only one pass is necessary. However, for other primitive types, such as integers, longs, floats, and doubles, multiple passes are necessary.

For more complex data types such as arrays, Strings, Collections, Dates, and Calendars, each element is considered, one at a time, until all elements are discriminated.

For this to work with a List of "group by" properties in jAgg, a PropertiesDiscriminator is employed that will perform multiset discrimination only on the properties specified.

Sometimes, null values are encountered when performing multiset discrimination. A NullDiscriminator is employed to separate all null values into their own equivalence class before passing on discrimination duties to another Discriminator.