|
Data Warehouse Basics
Need to know data warehouse basics and how
it helps organizations maximize return on information management
investment?
What is a
data warehouse?
A data warehouse is the foundation for information management. Data
warehouse design includes:
- Extracting data from operational systems;
- Moving it into data warehouse structures;
- Reorganizing and structuring the data for
analysis purposes;
- Moving it into reporting structures called data
marts; and
- Presenting it to knowledge workers using
business intelligence analytic and decision support reporting tools.
Data Warehouse Basics Schematic
Staging area data warehouse basics
Staging areas are used to temporarily store data extracted from source
systems so that it can be prepared for subsequent cleansing and
transformation.
Table structures usually are very similar to the table structure or
file layouts in the source systems and are a series of “flat” table
structures. In other words, they have no indexes and no relationships
to other tables in the staging area.
Data structures may be designed to include:
- In-scope data e.g. only data in scope is loaded in
the staging area; or
- All data, e.g. all data in the source table or
files is loaded in the staging area and subsequent transformations
eliminate non-essential data.
Data is loaded into staging area using fast load techniques and is
transformed to ensure basic compatibility with the staging area
structures e.g.
- Dates might be transformed from mm/dd/yy to
yy/mm/dd; or
- Numeric values might be converted to varchar
(if not needed for reporting).
Data is normally archived from the staging area after subsequent
processing has loaded the data warehouse.
What is a
better data warehouse design for staging areas?
It is possible to design a generic meta data driven staging area data
model that will:
- Eliminate the need to change the staging area
schema every time there is a new data source and staging area table;
- Eliminate the need to change the target
definition for load software, since everything will be loaded into the
same set of target tables; and
- Dramatically simplify archiving thru the use of
effective table partitioning.
Clearinghouse data warehouse basics
Some designs utilize the concept of a clearinghouse—This approach
cleans and transforms data in the staging area and prepares it for
loading into the data warehouse.
Cleaning data
might include things like passing address data thru address
verification software,
which will examine each address and return a corrected address. The
corrected address is a standard address that can be compared with other
standard addresses e.g.
- 123 Main Street North, 12345 might be transformed to:
- 123 NE Main St, 12345-9768.
Transformations
include things such as:
- Removal of operational data e.g. some source
systems have text descriptions for status, or delivery comments that
are not required for analysis purposes;
- Adding dates and times if they are not present
e.g. a snapshot date;
- Calculating derived values, which may not be
required in the source system but could help reporting performance e.g.
the source system might include product ordered, unit price and
quantity shipped fields and this data might be transformed to calculate
total cost;
- Creating relationships, e.g. source systems
might have a customer file that contains customer and address
information but the warehouse has several tables in a normalized design
and hence requires additional foreign keys and relationships to ensure
data integrity;
- Summarizing data, or aggregating data, to
satisfy performance requirements e.g. there may be a requirement to
summarize sales by day rather than by individual invoice and the
transformation routines need to complete this task before the data is
loaded in the warehouse; and
- Data that fails pre-determined quality checks
may be transformed to “Unknown” to permit some reporting, or may be
stored in the clearinghouse until a subsequent load corrects it.
Data warehouse basics
Data that has been cleaned and transformed in the staging and clearing
house schemas is now loaded into a data warehouse structure.
Data warehouse schemas are built using the enterprise data model and
the data warehouse data model.
They are intended to contain an integrated view of all data which can
serve the needs of each department and provide a single point of
“truth” for the data e.g. everyone who uses this data will be looking
at the same data that has been cleansed and transformed using the same
logic and software routines.
Data mart data warehouse basics
Data marts are a critical part of data warehouse design. They are
created to store data required for analysis by specific departments
e.g. manufacturing department, sales department or inventory control.
Data is extracted from the warehouse and transformed into facts and
dimensions within the data mart.
One of the key data warehouse design challenges is determining what
data to include as a fact and what data to include as a dimension.
Define data warehouse
facts and dimensions
A fact is
something that has occurred e.g. an invoice was sent, on a specific
date, to customer A, at address B, for a specific quantity of product
X, that was sold by sales representative Y.
A dimension is
a way of analyzing the fact e.g.
- Date or time period dimension answers questions such
as what was the
quantity of product X sold by:
- Date;
- Week;
- Month;
- Quarter;
- Year; or
- Fiscal year.
- Product dimension answers additional questions
such as what was the quantity of product X sold by?
- Product type;
- Product category;
- Product market interest group; or
- Any other product related attribute
- Customer dimension answers additional questions
such as:
- Which customers purchased product X; or
conversely
- What products were purchased by customer A;
- Address dimension answers additional
questions such as what product was sold by:
- City;
- Sales territory;
- State/province;
- Country; or conversely
- Which customers live in each of the address
areas?
Why is it
called a star schema?
A dimensional model usually has the fact in the middle with the
dimensions surrounding it and it looks like a star, hence the name,
star schema.
What is a
snowflake?
Dimensions are usually de-normalized e.g. each row
of an address
dimension might contain:
- Address id;
- City name;
- Sales territory name;
- State/province name; and
- Country name.
Sometimes, for performance reasons, the dimension is normalized e.g.
- Address dimension contains:
- Address id;
- City Name;
- Sales territory name;
- State/province
code; and
- Country name
- State/province dimension (or snowflake) contains
- State
/province code; and
- State/province name.
The snowflake is related to the dimension and reduces the amount of
data that needs to be duplicated in the address dimension e.g. only the
state/province
code, which is the primary key, needs to be maintained on
the address dimension.
What is a
star flake schema?
A star flake schema is simply a star schema, which also includes
snowflake structures
Who is
responsible for data warehouse design?
A data
warehouse architect has overall responsibility for data
warehouse design,
which includes database design, and for explaining data warehouse
basics to key stakeholders.
A data
integration designer is responsible for creating design
specifications for extract, transform and load software.
A business
intelligence analyst/designer is responsible for creating
design specifications for decision support and business intelligence
reporting software.
Summary...
This page helps explain data warehouse basics. Data warehousing is the
foundation for information management and
involves extracting data from operational systems, moving it into data
warehouse structures, reorganizing and structuring the data for
analysis purposes, and moving it into reporting structures called data
marts.
A data architect usually has overall responsibility for data warehouse
design.

|