Data Warehouse
System Processes
Need
to define data warehouse system processes for an information management
project and want an overview of key architectural components?
What is a data warehouse?
Think
of a data warehouse as a central storage facility which collects
information from many sources, manages it for efficient storage and
retrieval, and delivers it to many audiences, usually to meet decision
support and business intelligence requirements.
Data warehousing includes the following components:
- Extracting data from operational systems;
- Moving it into data warehouse structures;
- Reorganizing and structuring the data for
analysis purposes; and
- Moving it into reporting structures called data
marts.
Data
Warehouse Schematic
Data
warehouses are usually intended to store very large amounts of data.
Many data warehouses store well over 100GB of data and some very large
ones store over 1-2 terabytes. That
is a lot of data to extract, transform, load, store and manage.
Architecture and design must consider system processes that provide:
- Efficient means of extracting the correct
source data and moving it to staging areas;
- Effective means of cleaning and transforming
the data for efficient storage and retrieval; and
- Optimal data management and storage
architecture.
Why is load
so important?
Most
people think of data warehouses and think of data storage and
retrieval. This is
the primary function of the data warehouse but it involves a lot of
moving parts and behind the scene processes to prepare the data for
efficient storage and retrieval.
Data is usually extracted from
source systems very late at night or early in the morning when all
daily transactions are processed by the source online transaction
processing (OLTP) systems.
Data is usually required for
business intelligence and decision support users by start of business
day and
that leaves a very short “load window”, or time available to extract,
transform and load data.
To ensure that service level agreements
are met, architecture and design must ensure that system processes
complete data load during the load window.
What are
system processes?
System processes include:
- Extract and load processes;
- Transformation and data cleansing processes;
- Data backup and archiving processes; and
- Reporting management processes.
What are
extract and load system processes?
The extract and load process includes all of the software and utilities
required to:
- Extract source system data and move it to the
warehouse environment. Data can be pushed or pulled e.g.
- The
source system to extract the data, send it to a secure pick-up area on
its
server and create a signal file to notify the data warehouse
that
it is ready; and
- The
data warehouse to watch for the signal file, retrieve the data when
it's ready, verify the signal file metadata, and load the data in the
staging area.
- Pulling requires the data warehouse to query
the source system directly and extract the data as required.
- Complete
basic transformation to ensure that non-essential data is eliminated
and other data is converted to appropriate data types;
- Fast load data into a staging area where it can
be subsequently manipulated;
- Extract data from staging area and load
clearing house tables;
- Extract data from clearing house tables and
load the data warehouse; and
- Extract data from the data warehouse and load
data marts.
What is a
transformation and cleansing process?
Cleaning and transformation is required between:
- Source system and staging area;
- Staging area and clearing house;
- Clearing house and data warehouse; and
- Data warehouse and data marts.
Source
system to staging area usually does not involve data
cleansing but it will involve basic transformations such as:
- Changing
date formats to meet a common format e.g. a source system date might be
DD/MM/YY and this needs to be converted to YY/MM/DD, or it might need
to be converted to a date format compatible with the database
management software; or
- Adding leading zeros to numeric fields.
Staging area
to clearinghouse involves significant cleansing and/or
data quality checks.
Transformation
routines need to ensure that:
- The data is consistent e.g.
- Numeric fields are numeric;
- Telephone numbers do not contain alpha
characters.
- The data is valid e.g.
- Product codes exist in the master product
list;
- Customer numbers exit in the master customer
list;
- The
data reconciles with data already in the warehouse e.g. if we are
tracking physical phone locations and customer move work orders we may
have a
quality issue if we discover a new phone location with no move order.
Data
cleansing routines need to be developed based on the
information management data quality strategy e.g.
- Addresses
may be a critical part of analysis and need to be analyzed by an
address verification services to ensure consistency before the data
quality is checked for errors; and
- Operational data, e.g. text messages, source
system operator notes, may need to be removed.
Clearing
house to data warehouse transformation may involve:
- Adding dates and times if they are not present
e.g. a snapshot date;
- Calculating
derived values, which may not be required in the source system but
could help reporting performance e.g. the source system might include
product ordered, unit price and quantity shipped fields and this data
might be transformed to calculate total cost;
- Creating
relationships, e.g. source systems might have a customer file that
contains customer and address information but the warehouse has several
tables in a normalized design and hence requires additional foreign
keys and relationships to ensure data integrity;
- Summarizing
data, or aggregating data, to satisfy performance requirements e.g.
there may be a requirement to summarize sales by day rather than by
individual invoice and the transformation routines need to complete
this task before the data is loaded in the warehouse; and
- Creating keys required for data partitioning.
Data
warehouse to data mart transformation will involve:
- Creating fact tables;
- Creating dimensions and handling slowly
changing dimensions; and
- Summarizing data if required for reporting
performance.
What are
data backup and archiving processes?
Data archiving and backup is required in each data warehouse
environment e.g.
Moving data from source to staging area involves ensuring that:
- Source system pick-up area data is deleted e.g.
the
account transactions we processed yesterday should be deleted, or
archived, so that are not re-processed again today; and
- Staging
area data is archived, or deleted, either at the end of a successful
data
warehouse load or before the beginning of the next load to avoid
re-processing.
Clearing house data should be:
- Backed-up daily, if it is intended to store
data that fails data quality validation; or
- Deleted if appropriate.
Data warehouse and data marts require:
- Archiving data that has passed its retention
period, e.g.
- Data that has passed a three year online
retention period may need to be archived to tape and moved to near line
storage; and
- It
may also need to be summarized before it is archived so that monthly or
quarterly values are retained in the warehouse thus reducing storage
demands;
- Daily incremental or full backups (if
appropriate); and
- Weekly full back-ups or cold backups (if
appropriate).
What are
reporting management system processes?
Reporting
management processes are required to ensure that business intelligence
reporting tools direct queries to the data that will provide the
quickest query response e.g.
Product sales data may be stored at
a detail level by invoice, by date, by product, by sales territory but
it may also be summarized by product, week and sales territory.
A
query that only requires summary data for one product should be
directed to the summary tables to avoid searching the detail invoice
transactions and summarizing data that is already summarized.
Summary...
A data warehouse needs to process a large volume of data in a very
short time frame. It requires the following system processes:
- Extract and load processes;
- Transformation and data cleansing processes;
- Data backup and archiving processes; and
- Reporting management processes.
To
ensure that service level agreements are met, architecture and design
must ensure that system processes complete data load during the load
window.
|