logo for information-management-architect.com
Home
Strategy
Framework
Business Case Analysis
Project Planning
Requirements Analysis
Architecture & Design
Build Phase
Quality Assurance
Transition to Production
Management Information
Business Intelligence
Data Warehouse
Tools
Jobs
Contact David Bowman
leftimage for information-management-architect.com

Data Warehouse System Processes

Need to define data warehouse system processes for an information management project and want an overview of key architectural components?

What is a data warehouse?


Think of a data warehouse as a central storage facility which collects information from many sources, manages it for efficient storage and retrieval, and delivers it to many audiences, usually to meet decision support and business intelligence requirements.

Data warehousing includes the following components:
  • Extracting data from operational systems;
  • Moving it into data warehouse structures;
  • Reorganizing and structuring the data for analysis purposes; and
  • Moving it into reporting structures called data marts.

Data Warehouse SchematicSystem Processes
Data warehouses are usually intended to store very large amounts of data. Many data warehouses store well over 100GB of data and some very large ones store over 1-2 terabytes. That is a lot of data to extract, transform, load, store and manage.

Architecture and design must consider system processes that provide:
  • Efficient means of extracting the correct source data and moving it to staging areas;
  • Effective means of cleaning and transforming the data for efficient storage and retrieval; and
  • Optimal data management and storage architecture.
Why is load so important?

Most people think of data warehouses and think of data storage and retrieval. This is the primary function of the data warehouse but it involves a lot of moving parts and behind the scene processes to prepare the data for efficient storage and retrieval.

Data is usually extracted from source systems very late at night or early in the morning when all daily transactions are processed by the source online transaction processing (OLTP) systems.

Data is usually required for business intelligence and decision support users by start of business day and that leaves a very short “load window”, or time available to extract, transform and load data.

To ensure that service level agreements are met, architecture and design must ensure that system processes complete data load during the load window.

What are system processes?

System processes include:
  • Extract and load processes;
  • Transformation and data cleansing processes;
  • Data backup and archiving processes; and
  • Reporting management processes.
What are extract and load system processes?

The extract and load process includes all of the software and utilities required to:
  • Extract source system data and move it to the warehouse environment. Data can be pushed or pulled e.g.
    • Pushing data requires:
      • The source system to extract the data, send it to a secure pick-up area on its server and create a signal file to notify the data warehouse that it is ready; and 
      • The data warehouse to watch for the signal file, retrieve the data when it's ready, verify the signal file metadata, and load the data in the staging area.
    • Pulling requires the data warehouse to query the source system directly and extract the data as required.
  • Complete basic transformation to ensure that non-essential data is eliminated and other data is converted to appropriate data types;
  • Fast load data into a staging area where it can be subsequently manipulated;
  • Extract data from staging area and load clearing house tables;
  • Extract data from clearing house tables and load the data warehouse; and
  • Extract data from the data warehouse and load data marts.
What is a transformation and cleansing process?

Cleaning and transformation is required between:
  • Source system and staging area;
  • Staging area and clearing house;
  • Clearing house and data warehouse; and
  • Data warehouse and data marts.
Source system to staging area usually does not involve data cleansing but it will involve basic transformations such as:
  • Changing date formats to meet a common format e.g. a source system date might be DD/MM/YY and this needs to be converted to YY/MM/DD, or it might need to be converted to a date format compatible with the database management software; or
  • Adding leading zeros to numeric fields.
Staging area to clearinghouse involves significant cleansing and/or data quality checks.

Transformation routines need to ensure that:
  • The data is consistent e.g.
    • Numeric fields are numeric;
    • Addresses make sense;
    • Telephone numbers do not contain alpha characters.
  • The data is valid e.g.
    • Product codes exist in the master product list;
    • Customer numbers exit in the master customer list;
  • The data reconciles with data already in the warehouse e.g. if we are tracking physical phone locations and customer move work orders we may have a quality issue if we discover a new phone location with no move order.
Data cleansing routines need to be developed based on the information management data quality strategy e.g.
  • Addresses may be a critical part of analysis and need to be analyzed by an address verification services to ensure consistency before the data quality is checked for errors; and 
  • Operational data, e.g. text messages, source system operator notes, may need to be removed.
Clearing house to data warehouse transformation may involve:
  • Adding dates and times if they are not present e.g. a snapshot date;
  • Calculating derived values, which may not be required in the source system but could help reporting performance e.g. the source system might include product ordered, unit price and quantity shipped fields and this data might be transformed to calculate total cost;
  • Creating relationships, e.g. source systems might have a customer file that contains customer and address information but the warehouse has several tables in a normalized design and hence requires additional foreign keys and relationships to ensure data integrity;
  • Summarizing data, or aggregating data, to satisfy performance requirements e.g. there may be a requirement to summarize sales by day rather than by individual invoice and the transformation routines need to complete this task before the data is loaded in the warehouse; and
  • Creating keys required for data partitioning.
Data warehouse to data mart transformation will involve:
  • Creating fact tables;
  • Creating dimensions and handling slowly changing dimensions; and
  • Summarizing data if required for reporting performance.
What are data backup and archiving processes?

Data archiving and backup is required in each data warehouse environment e.g.

Moving data from source to staging area involves ensuring that:
  • Source system pick-up area data is deleted e.g. the account transactions we processed yesterday should be deleted, or archived, so that are not re-processed again today; and
  • Staging area data is archived, or deleted, either at the end of a successful data warehouse load or before the beginning of the next load to avoid re-processing.
Clearing house data should be:
  • Backed-up daily, if it is intended to store data that fails data quality validation; or
  • Deleted if appropriate.
Data warehouse and data marts require:
  • Archiving data that has passed its retention period, e.g.
    • Data that has passed a three year online retention period may need to be archived to tape and moved to near line storage; and
    • It may also need to be summarized before it is archived so that monthly or quarterly values are retained in the warehouse thus reducing storage demands;
  • Daily incremental or full backups (if appropriate); and
  • Weekly full back-ups or cold backups (if appropriate).
What are reporting management system processes?

Reporting management processes are required to ensure that business intelligence reporting tools direct queries to the data that will provide the quickest query response e.g.

Product sales data may be stored at a detail level by invoice, by date, by product, by sales territory but it may also be summarized by product, week and sales territory.

A query that only requires summary data for one product should be directed to the summary tables to avoid searching the detail invoice transactions and summarizing data that is already summarized.

Summary...


A data warehouse needs to process a large volume of data in a very short time frame. It requires the following system processes:
  • Extract and load processes;
  • Transformation and data cleansing processes;
  • Data backup and archiving processes; and
  • Reporting management processes.
To ensure that service level agreements are met, architecture and design must ensure that system processes complete data load during the load window.


footer for Information management page