Things I’ve learned while reading The Data Warehouse Toolkit from Ralph Kimball.
- A data warehouse almost always should have data expressed at the lowest possible grain of each dimension because queries need to cut through the database in very precise ways.
- Normalizing/snowflaking a fact table just to save disk space is a waste of time.
- Dimension tables must not be normalized/snowflaked but should remain as flat tables. Normalized/snowflaked dimension tables destroy the ability to browse which makes it undesirable from a user perspective. The saved disk space when normalizing is outweight by the easyness of use.
- Most DWHs need an explicit time dimension table.
- ‘Drilling down’ means to add dimensions to a row header. ‘Drillip up’ means to remove dimensions. An explicit hierarchy for drilling is not required.
- A semi-additive fact is a type of fact in a data warehouse that can be aggregated across some dimensions but not others. This means that while you can perform summation or other aggregate operations for certain dimensions, the same cannot be done for others. Examples: Inventory level or account balance. However, these measure may be usefully aggregated across time by averaging over the number of time periods.
- A ‘degenerate dimension’ is a dimension key with no corresponding dimension table. Instead, the attributes of the degenerate dimension are stored directly within the fact table. Example: Purchase order number.
- A transaction-level fact table must in most cases be accompanied by some sort of snapshot table to give a practical view of a process.
- Slowly Changing Dimensions are a key concept referring to the fact that dimensions may change over time (e.g. customer information or product details). There are three different ways to deal with SCDs:
- Type 1: Overwrite. when a change occurs in a dimension attribute, the existing value is simply overwritten with the new value. This means that no historical data is preserved for the changed attribute. This type is suitable for attributes where historical tracking is not important. The advantage is it’s straightforward and simple to implement as no history needs to be managed.
- Type 2: When a change occurs, a new record (row) is added to the dimension table with the new attribute values. The old record remains intact, allowing for historical tracking. Type 2 SCDs often include additional columns to indicate the validity period of each record (e.g., start date, end date, and current flag). This type is used when it’s important to keep a complete history of changes. The advantage is that a history is preserved at cost of more complex management.
- Type 3: a new column is added to the dimension table to track changes. This method typically retains the current and the previous value of an attribute but does not keep a complete history. This type is used when only the most recent change and the previous value need to be tracked. For example, keeping track of a customer’s current and previous address in the same record can be done with Type 3.
- Factless Fact Tables is a fact table containing only keys but no numerical facts as such. The keys map to dimension tables.
- To enhance querying performance its advised to pre-compute certain summaries and store them in a dedicated table, called aggregate.
- In the tradeoff between power and ease of use, ease of use should always come first. Users won’t be using a tool that’s too complicated.
- A good user interface is based on recognizing and pointing, not remembering and typing.