logo for information-management-architect.com
leftimage for information-management-architect.com

Data Warehouse Basics

Need to know data warehouse basics and how it helps organizations maximize return on information management investment?

What is a data warehouse?

A data warehouse is the foundation for information management. Data warehouse design includes:
  • Extracting data from operational systems;
  • Moving it into data warehouse structures;
  • Reorganizing and structuring the data for analysis purposes;
  • Moving it into reporting structures called data marts; and
  • Presenting it to knowledge workers using business intelligence analytic and decision support reporting tools.
Data Warehouse Basics Schematic

Business Intelligence Architecture

Staging area data warehouse basics


Staging areas are used to temporarily store data extracted from source systems so that it can be prepared for subsequent cleansing and transformation.

Table structures usually are very similar to the table structure or file layouts in the source systems and are a series of “flat” table structures. In other words, they have no indexes and no relationships to other tables in the staging area.

Data structures may be designed to include:
  • In-scope data e.g. only data in scope is loaded in the staging area; or
  • All data, e.g. all data in the source table or files is loaded in the staging area and subsequent transformations eliminate non-essential data.
Data is loaded into staging area using fast load techniques and is transformed to ensure basic compatibility with the staging area structures e.g.
  • Dates might be transformed from mm/dd/yy to yy/mm/dd; or
  • Numeric values might be converted to varchar (if not needed for reporting).
Data is normally archived from the staging area after subsequent processing has loaded the data warehouse.

What is a better data warehouse design for staging areas?

It is possible to design a generic meta data driven staging area data model that will:
  • Eliminate the need to change the staging area schema every time there is a new data source and staging area table;
  • Eliminate the need to change the target definition for load software, since everything will be loaded into the same set of target tables; and
  • Dramatically simplify archiving thru the use of effective table partitioning.
Clearinghouse data warehouse basics

Some designs utilize the concept of a clearinghouse—This approach cleans and transforms data in the staging area and prepares it for loading into the data warehouse.

Cleaning data might include things like passing address data thru address verification software, which will examine each address and return a corrected address. The corrected address is a standard address that can be compared with other standard addresses e.g.
  • 123 Main Street North, 12345 might be transformed to:
  • 123 NE Main St, 12345-9768.
Transformations include things such as:
  • Removal of operational data e.g. some source systems have text descriptions for status, or delivery comments that are not required for analysis purposes;
  • Adding dates and times if they are not present e.g. a snapshot date;
  • Calculating derived values, which may not be required in the source system but could help reporting performance e.g. the source system might include product ordered, unit price and quantity shipped fields and this data might be transformed to calculate total cost;
  • Creating relationships, e.g. source systems might have a customer file that contains customer and address information but the warehouse has several tables in a normalized design and hence requires additional foreign keys and relationships to ensure data integrity;
  • Summarizing data, or aggregating data, to satisfy performance requirements e.g. there may be a requirement to summarize sales by day rather than by individual invoice and the transformation routines need to complete this task before the data is loaded in the warehouse; and
  • Data that fails pre-determined quality checks may be transformed to “Unknown” to permit some reporting, or may be stored in the clearinghouse until a subsequent load corrects it.
Data warehouse basics

Data that has been cleaned and transformed in the staging and clearing house schemas is now loaded into a data warehouse structure.

Data warehouse schemas are built using the enterprise data model and the data warehouse data model.

They are intended to contain an integrated view of all data which can serve the needs of each department and provide a single point of “truth” for the data e.g. everyone who uses this data will be looking at the same data that has been cleansed and transformed using the same logic and software routines.

Data mart data warehouse basics

Data marts are a critical part of data warehouse design. They are created to store data required for analysis by specific departments e.g. manufacturing department, sales department or inventory control.

Data is extracted from the warehouse and transformed into facts and dimensions within the data mart.

One of the key data warehouse design challenges is determining what data to include as a fact and what data to include as a dimension.

Define data warehouse facts and dimensions

A fact is something that has occurred e.g. an invoice was sent, on a specific date, to customer A, at address B, for a specific quantity of product X, that was sold by sales representative Y.

A dimension is a way of analyzing the fact e.g.
  • Date or time period dimension answers questions such as what was the quantity of product X sold by:
    • Date;
    • Week;
    • Month;
    • Quarter;
    • Year; or
    • Fiscal year.
  • Product dimension answers additional questions such as what was the quantity of product X sold by?
    • Product type;
    • Product category;
    • Product market interest group; or
    • Any other product related attribute
  • Customer dimension answers additional questions such as:
    • Which customers purchased product X; or conversely
    • What products were purchased by customer A;
  • Address dimension answers additional questions such as what product was sold by:
    • City;
    • Sales territory;
    • State/province;
    • Country; or conversely
    • Which customers live in each of the address areas?
Why is it called a star schema?

A dimensional model usually has the fact in the middle with the dimensions surrounding it and it looks like a star, hence the name, star schema.

What is a snowflake?

Dimensions are usually de-normalized e.g. each row of an address dimension might contain:
  • Address id;
  • City name;
  • Sales territory name;
  • State/province name; and
  • Country name.
Sometimes, for performance reasons, the dimension is normalized e.g.
  • Address dimension contains:
    • Address id;
    • City Name;
    • Sales territory name;
    • State/province code; and
    • Country name
  • State/province dimension (or snowflake) contains
    • State /province code; and
    • State/province name.
The snowflake is related to the dimension and reduces the amount of data that needs to be duplicated in the address dimension e.g. only the state/province code, which is the primary key, needs to be maintained on the address dimension.

What is a star flake schema?


A star flake schema is simply a star schema, which also includes snowflake structures

Who is responsible for data warehouse design?

A data warehouse architect has overall responsibility for data warehouse design, which includes database design, and for explaining data warehouse basics to key stakeholders.

A data integration designer is responsible for creating design specifications for extract, transform and load software.

A business intelligence analyst/designer is responsible for creating design specifications for decision support and business intelligence reporting software.

Summary...

This page helps explain data warehouse basics. Data warehousing is the foundation for information management and involves extracting data from operational systems, moving it into data warehouse structures, reorganizing and structuring the data for analysis purposes, and moving it into reporting structures called data marts. 

A data architect usually has overall responsibility for data warehouse design.


footer for Information management page