This is the first in my series of ‘Basics’ articles, introducing the simple concepts needed for creating a reporting system. In this post I will be outlining the format of the data warehouse. A data warehouse is essentially a database structured for analysis and reporting. One of its roles is to be a highly accurate, independent ‘authority’ on the company’s data. It very often combines information from different parts of the business, sorting out areas of conflict as it goes.
Data warehouses are most often designed around a dimensional model, or star schema. This structure is based on the idea of a central ‘Fact’ table and connected ‘Dimension’ tables surrounding it. The dimension tables store the entities that combine to produce some event or fact. The result of this combination is stored in the fact table. For instance, an employee from the Employee table sells a product from the Product table to a customer in the Customer table in a store from the Store table on a date from the Date table. The combination of these entities leads to an event which is stored in the Sales Fact table. This table would show an ID number for each of the dimensions mentioned plus some values related to the event e.g. how many of the item were sold, what the individual price was, what the total price was etc.
The dimension table typically stores the ID from the source system (‘business ID’ or ‘business key’), descriptive info and maybe dates to record historical changes.’Surrogate key’, shortened to SK, is often used to name the warehouse’s version of the ID. This differentiates it from the source ID, which may not be unique when storing historic changes or data from different systems.
The fact table typically records an SK for each dimension combining to create a fact, some measurements and dates relating to when the record was active.
One aspect of the dimensional model is that it is relatively denormalized, having fewer tables and joins than you would find in a transactional system database. This helps performance when querying large datasets. The dimension tables should contain all the relevant information related to that entity. For example, a store may be within an area. But rather than linking to a separate Area dimension, it is usually better to hold the area information within the store table, using extra fields. This can lead to repeated information as you populate the word Suffolk several times rather than linking to a Suffolk record in another table. But in a data warehouse this is OK as we are aiming for speed of reading the information out rather than speed of writing the information when there is a change.
The dimensional model is very powerful but also relatively easy for business users to understand. It makes it easier to summarize large numbers of records by date, geography, people etc. Another aspect of the data warehouse is that it tends to be updated on a regular schedule, typically once a day overnight, when there is the least activity on all systems. Data from the transactional systems will be transferred over, possibly manipulated in some way, then converted into the date warehouse’s format. This process is referred to as Extract Transform and Load (ETL). Once the data warehouse is loaded up with the company’s data it is ready to be analysed using a tool such as SQL Server Analysis Services (SSAS) and/or reported on with stored procedures and a reporting tool such as SSRS.