
Data Integration Architecture

Improve service with David Bowman’s information management guidelines for data integration architecture

This site is designed for Information Technology professionals who need to improve service and require guidance and direction to help teams consistently produce error-free results.

It provides information management guidelines for data integration architecture.

What is an Enterprise Data Warehouse (EDW)?

An enterprise data warehouse (EDW) should be designed as a central repository of detailed, historical, enterprise-wide information.

Data Integration Architecture

Data integration involves obtaining and storing data, and the architecture and processes required to support those activities.

[Diagram: data warehouse architecture]

Information management solutions usually involve some or all of the following (the overall flow is sketched after this list):
  • Extracting data from operational source systems;
  • Transforming data as required to ensure consistency;
  • Loading data into a data warehouse or some other form of storage optimized for efficiency;
  • Extracting data from the data warehouse;
  • Loading departmental data marts optimized for reporting and analytics; and
  • Some form of reporting tool architecture, e.g. universe design or middleware design, to insulate users from the data warehouse.
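
At a high level those activities chain together into one flow. The sketch below is a deliberately simplified Python illustration with stubbed, hypothetical functions and data; it is not a prescription for how the integration modules should be built.

    # High-level sketch of the flow above, using stubbed, hypothetical functions.
    def extract_from_sources() -> list[dict]:
        # Stand-in for pulling or receiving files from operational source systems.
        return [{"source": "billing", "customer": "C001", "amount": "100.00"}]

    def transform(rows: list[dict]) -> list[dict]:
        # Stand-in for the transformations needed to ensure consistency.
        return [{**row, "amount": float(row["amount"])} for row in rows]

    def load_warehouse(rows: list[dict]) -> None:
        print(f"loaded {len(rows)} rows into the data warehouse")

    def load_data_marts(rows: list[dict]) -> None:
        print(f"loaded {len(rows)} rows into the reporting data marts")

    def run_daily_load() -> None:
        rows = extract_from_sources()
        rows = transform(rows)
        load_warehouse(rows)
        load_data_marts(rows)

    run_daily_load()
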
Information is a valuable corporate asset, and getting the right information to the right people at the right time is a key management objective.

Achieving this objective requires robust technology and optimal data integration architecture based on industry best practices.

Data Integration Architecture Environments

Data Integration Architecture for Landing Area
  • Should be established as the FTP landing site;
  • Should ensure that all files from source systems land in this area including files pulled directly from source systems by the extract, transform and load (ETL) software;
  • Should provide the capability to land data that is “pulled” from specified systems and data that is “pushed” from source systems;
  • Should provide the capability for an iterative daily load process;
  • Should provide the capability to correct previous transmission errors;
  • Should be configurable so that new interfaces and/or changes to interface schedules can be accomplished without requiring custom software development;
  • Should ensure that the only processing that occurs is basic error handling to confirm that files are received as expected (a minimal receipt check is sketched after this list); and
  • Should provide the capability to notify appropriate production support teams of any data transmission error.
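
The basic error handling guideline can be illustrated with a small receipt check. This is a minimal Python sketch, assuming a hypothetical landing directory layout, an expected-file manifest, and a notify_support() hook; a real implementation would be driven by interface metadata rather than a hard-coded dictionary.

    from pathlib import Path

    # Hypothetical manifest of the files expected from each source system.
    EXPECTED_FILES = {
        "billing": ["billing_accounts.dat", "billing_transactions.dat"],
        "crm": ["crm_customers.dat"],
    }

    def check_landing_area(landing_root: str) -> list[str]:
        """Return error messages for files that are missing or arrived empty."""
        errors = []
        for source, files in EXPECTED_FILES.items():
            for name in files:
                path = Path(landing_root) / source / name
                if not path.exists():
                    errors.append(f"{source}: expected file {name} did not arrive")
                elif path.stat().st_size == 0:
                    errors.append(f"{source}: file {name} arrived empty")
        return errors

    def notify_support(errors: list[str]) -> None:
        # Placeholder: in practice this would alert the production support team.
        for message in errors:
            print(f"TRANSMISSION ERROR: {message}")

    if __name__ == "__main__":
        problems = check_landing_area("/data/landing")
        if problems:
            notify_support(problems)
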
Data Integration Architecture for Staging Area
  • Should be the initial area where basic edit checks and error reporting occurs;
  • Should be the initial data integration area for data destined for the EDW and ultimately the integrated data marts and reporting data marts;
  • Should be designed to facilitate efficient processing of data records by extract, transform and load routines;
  • Should be a separate database from other components;
  • Should be metadata driven so that additional data files, columns and reference data can be accommodated without requiring custom ETL development;
  • Should not enforce referential integrity between tables;
  • Should ensure all data columns are defined as either varying character (e.g. VARCHAR) with a fixed maximum length, or numeric with a fixed length;
  • Should not be accessed by end users;
  • Should ensure all rows receive a generated unique identifier as a primary key;
  • Should ensure every row from the landing extract is loaded without transformations;
  • Should ensure that appropriate initial data integration processes validate either record counts or control field calculations and confirm that table-to-table data relationships exist e.g. one or more child rows per parent row, each child row has a parent row, no child row has multiple parent rows;
  • Should log issues in the technical metadata repository;
  • Should verify that all columns that are identified as logical primary identifiers of a row exist without nulls; and
  • Should verify that all columns required to have data do not contain nulls (basic checks of this kind are sketched after this list).
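
A few of the edit checks above can be sketched in a handful of lines. This is a minimal Python illustration that assumes staged rows are available as dictionaries and that a control record count accompanies each feed; the column names are hypothetical, and in practice the checks would be metadata driven and the issues logged to the technical metadata repository.

    # Minimal staging checks: record count validation and null checks on required columns.
    REQUIRED_COLUMNS = ["customer_id", "account_id", "effective_date"]

    def validate_feed(rows: list[dict], expected_count: int) -> list[str]:
        issues = []
        # Control check: the number of staged rows must match the control total.
        if len(rows) != expected_count:
            issues.append(f"record count {len(rows)} does not match control total {expected_count}")
        # Null checks: columns identified as required must be populated on every row.
        for i, row in enumerate(rows, start=1):
            for column in REQUIRED_COLUMNS:
                if row.get(column) in (None, ""):
                    issues.append(f"row {i}: required column {column} is null")
        return issues

    feed = [
        {"customer_id": "C001", "account_id": "A100", "effective_date": "2024-01-31"},
        {"customer_id": "C002", "account_id": None, "effective_date": "2024-01-31"},
    ]
    for issue in validate_feed(feed, expected_count=2):
        print("STAGING ISSUE:", issue)
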
Data Integration Architecture for Clearinghouse
  • Should be used to prepare all data for loading into the data warehouse;
  • Should contain a matching database used to match incoming source data with existing source data;
  • Should contain data that has passed initial inspection in the staging area and has been augmented with matching data from the matching area;
  • Should contain data that has failed statistical error reporting and is waiting for approval prior to loading to the EDW;
  • Should contain data that has passed all edits and statistical error reporting and is waiting for approval prior to loading to the EDW;
  • Should contain statistical data concerning data feeds that will be passed to the EDW and integrated data marts for reporting;
  • Should be used to provide matching services to match data received from outside sources with data contained in the EDW;
  • Should only load the EDW with new data or data that changed since it was initially acquired;
  • Should be a separate database environment;
  • Should contain one relational table for each physical table in the EDW;
  • Should ensure that table names mimic table names from the EDW;
  • Should ensure that column names mimic names from the EDW;
  • Should ensure that data remains in the clearinghouse environment until it is approved for loading to the EDW;
  • Should retain reference/control data tables permanently;
  • Should ensure that all reference/control tables are loaded first;
  • Should ensure that all parent tables are loaded before child tables;
  • Should not be accessed by any end user;
  • Should add audit columns to all tables;
  • Should generate unique identifiers as a primary key for each table;
  • Should ensure that all rows have a column added for a natural key;
  • Should ensure that the natural key is a concatenation of the unique generated key from a previous load cycle and the date grain for the EDW (see the sketch after this list);
  • Should ensure that ETL processes use the matching database environment to assign generated key values to each row based on the natural key;
  • Should ensure that ETL processes use defined business rules to perform integration;
  • Should ensure that ETL processes use persistent reference/control tables to assign the appropriate generated key values to replace the natural data field; 
  • Should ensure that ETL processes use persistent reference/control tables to replace null values with the designated default value; and
  • Should ensure that ETL processes add audit data to audit columns.
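
The natural key construction and audit handling described above can be sketched as follows. This is a minimal Python illustration with assumed column names; the actual date grain and audit fields would follow whatever the EDW design specifies.

    from datetime import datetime, timezone

    # Minimal clearinghouse preparation: build the natural key and add audit columns
    # before a row is held for approval and loading to the EDW.
    def prepare_for_clearinghouse(row: dict, load_date: str) -> dict:
        prepared = dict(row)
        # Natural key: concatenation of the generated key from the previous load
        # cycle and the date grain used by the EDW.
        prepared["natural_key"] = f"{row['prior_generated_key']}|{load_date}"
        # Audit columns recording when and by which process the row was prepared.
        prepared["audit_load_ts"] = datetime.now(timezone.utc).isoformat()
        prepared["audit_process"] = "clearinghouse_load"
        return prepared

    row = {"prior_generated_key": "100042", "customer_name": "Acme Ltd"}
    print(prepare_for_clearinghouse(row, load_date="2024-01-31"))
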
Data Integration Architecture for Matching Area
  • Should store multiple forms of identification data (e.g. SSN) required to match clearinghouse data with a unique enterprise data warehouse key that can be used for reporting purposes;
  • Should facilitate matching without the need for extensive EDW queries;
  • Should be a set of tables in the clearinghouse database environment;
  • Should have one table for each table in clearinghouse;
  • Should create one row for each natural key value of business data and reference/control data;
  • Should ensure that each row only contains the columns that make the row unique as defined by the business and the generated unique key value for a row;
  • Should not be accessed by anyone other than the ETL process and database administrator;
  • Should ensure each table name is the same as the clearinghouse table name with MTCH appended to the name;
  • Should ensure that column names in the matching area are the same as the column names in the clearinghouse;
  • Should ensure that rows in the matching database are never deleted;
  • Should add audit fields to identify initial creation of primary key value;
  • Should ensure that when a row of data is ready to be inserted into the clearinghouse database, the process uses the natural key values from the row to match against the natural key values in the matching database;
  • Should ensure that if a matching row is found, the incoming row uses the same generated key value; and
  • Should ensure that if a match is not found, a new row is created in the matching table with a newly generated key value, and the incoming row uses that generated key value (see the sketch after this list).
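
The last three guidelines amount to a lookup-or-create pattern. Below is a minimal Python sketch in which an in-memory dictionary stands in for a matching table and generated keys come from a simple counter; a real implementation would use the matching database tables and the warehouse's own key generation scheme.

    import itertools

    # Hypothetical stand-ins: a matching "table" keyed by natural key, and a
    # simple sequence for generated key values.
    matching_table: dict[str, int] = {}
    key_sequence = itertools.count(start=1)

    def assign_generated_key(natural_key: str) -> int:
        """Reuse the generated key if the natural key is already known, else create one."""
        if natural_key in matching_table:
            # Match found: the incoming row uses the same generated key value.
            return matching_table[natural_key]
        # No match: create a new matching row with a newly generated key value.
        new_key = next(key_sequence)
        matching_table[natural_key] = new_key
        return new_key

    print(assign_generated_key("100042|2024-01-31"))  # new key -> 1
    print(assign_generated_key("100042|2024-01-31"))  # same natural key -> 1 again
    print(assign_generated_key("100057|2024-01-31"))  # different natural key -> 2
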
Data Warehouse
  • Should be the source of all data required for query, analysis and reporting;
  • Should store data in its atomic state so that it will be available for recovery, reloads, and loads of new data marts and/or data extracts;
  • Should be the source for all data marts;
  • Should be a third normal form or better representation of the business data;
  • Should ensure that each table has a name that represents a business concept;
  • Should ensure that each column has a name that represents a business item in that business concept;
  • Should ensure that audit columns are added to each table;
  • Should only be accessed by ETL routines and/or database administrators;
  • Should ensure that load processes only load EDW tables with appropriate data from the clearinghouse;
  • Should ensure transform processes transform each column as specified by the business data transformation rules defined in the source-to-target mapping documents (a mapping-driven transform is sketched after this list); and
  • Should ensure that load processes populate audit columns as appropriate.
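
The source-to-target mapping rules referenced above can be represented as data and applied generically by the load process. This is a minimal Python sketch with invented mapping entries and column names; real mappings would come from the project's source-to-target mapping documents.

    # Minimal source-to-target mapping sketch. Each entry names the target column,
    # the source column it comes from, and an optional transformation rule.
    MAPPING = [
        {"target": "customer_name", "source": "cust_nm", "transform": str.strip},
        {"target": "account_open_date", "source": "open_dt", "transform": None},
        {"target": "country_code", "source": "cntry", "transform": str.upper},
    ]

    def apply_mapping(source_row: dict) -> dict:
        target_row = {}
        for rule in MAPPING:
            value = source_row.get(rule["source"])
            if value is not None and rule["transform"]:
                value = rule["transform"](value)
            target_row[rule["target"]] = value
        return target_row

    print(apply_mapping({"cust_nm": "  Acme Ltd ", "open_dt": "2024-01-15", "cntry": "ca"}))
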
Data Integration Architecture for Data Marts
  • Should ensure that only one version of the truth exists for each reference, control, dimension, and fact table;
  • Should ensure that data mart dimension tables have a name that begins with 'DIM' and the rest of the name conveys the business concept for that table;
  • Should ensure that dimension table column names are the same as the names in the EDW;
  • Should ensure that data mart fact tables have a name that begins with 'FACT' and the rest of the name conveys the business process that defines the measures/columns;
  • Should ensure that fact table columns are the same as in the EDW and ensure that new columns are created to represent a new measurement concept e.g. calculated columns;
  • Should use the generated unique key identifier added in the EDW as the primary key of each row in the data mart tables;
  • Should ensure that constraints are dropped before loading and re-applied after data loading;
  • Should ensure that a date dimension is added as a one-time load and has the ability to extend the number of rows if needed;
  • Should ensure that a semantic layer or business friendly mask is developed to insulate business intelligence/reporting users from the fact/dimension table/column names;
  • Should ensure data integration processes load each row from the EDW that matches selection criteria;
  • Should ensure data integration processes use the same column characteristics as the EDW;
  • Should ensure data integration processes calculate any fact table columns as required;
  • Should ensure data integration processes handle slowly changing dimensions (a type 2 update is sketched after this list); and
  • Should ensure data integration processes populate audit columns as appropriate.
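
One common way to handle slowly changing dimensions is the type 2 pattern, in which a changed attribute closes the current row and inserts a new current row. The sketch below is a Python illustration only, with assumed column names (customer_key, segment, start_date, end_date, current_flag); the guidelines above do not prescribe a specific SCD type.

    from datetime import date

    # Minimal type 2 slowly changing dimension update. The dimension is a list of
    # row dictionaries for illustration; a real load would operate on mart tables.
    def apply_scd2(dimension: list[dict], incoming: dict, load_date: date) -> None:
        current = next(
            (r for r in dimension
             if r["customer_key"] == incoming["customer_key"] and r["current_flag"]),
            None,
        )
        if current is None:
            # New dimension member: insert as the current row.
            dimension.append({**incoming, "start_date": load_date,
                              "end_date": None, "current_flag": True})
        elif current["segment"] != incoming["segment"]:
            # Tracked attribute changed: close the current row, insert a new one.
            current["current_flag"] = False
            current["end_date"] = load_date
            dimension.append({**incoming, "start_date": load_date,
                              "end_date": None, "current_flag": True})

    dim = [{"customer_key": 1, "segment": "Retail", "start_date": date(2023, 1, 1),
            "end_date": None, "current_flag": True}]
    apply_scd2(dim, {"customer_key": 1, "segment": "Corporate"}, date(2024, 1, 31))
    for row in dim:
        print(row)
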
Development and Test Environments
  • Should include personal schemas, used by individual developers for development and unit testing prior to certifying that code is ready for system testing;
  • Should include system test, used to certify that a related group of modules is built to design specifications and is ready for integration testing;
  • Should include integration test, used to certify a release is built to design specifications and is ready for quality assurance testing;
  • Should include quality assurance test, used to certify a release is built to requirement specifications and is ready for user acceptance testing;
  • Should include user acceptance test, used to certify that a release meets user expectations and is ready for production; and
  • Should include release test, used to certify that all components of a release package work as expected and that there is no adverse effect on production.
Data Integration Architecture Guidelines
  • Should support and track metadata, so that it can be reported upon and leveraged during on-going development work and data stewardship activities;
  • Should enforce a common approach towards developing the data integration modules, and enforce development standards;
  • Should use a minimum of physical steps to move and transform the data, with a minimum of data integration tasks;
  • Should provide for efficient identification and processing of changed data; change data capture is a challenging component of any data acquisition solution, and while there are tools to simplify the process, even a custom solution can be very successful (a simple comparison-based approach is sketched after this list);
  • Should leverage technology and design principles to ensure scalability; and
  • Should support manual data, e.g. a user-defined hierarchy for reporting.
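
A custom change data capture approach often reduces to comparing the current extract against the previous one by business key. The following is a minimal Python sketch under strong assumptions (full-snapshot comparison, a single business key per row); log-based or tool-based change data capture works quite differently.

    # Snapshot-comparison change data capture. Each snapshot maps a business key
    # to the row's non-key attributes; the keys and attributes are hypothetical.
    def detect_changes(previous: dict, current: dict) -> dict:
        inserts = {k: v for k, v in current.items() if k not in previous}
        updates = {k: v for k, v in current.items()
                   if k in previous and previous[k] != v}
        deletes = {k: previous[k] for k in previous if k not in current}
        return {"inserts": inserts, "updates": updates, "deletes": deletes}

    yesterday = {"C001": {"segment": "Retail"}, "C002": {"segment": "Corporate"}}
    today = {"C001": {"segment": "Retail"}, "C002": {"segment": "SMB"}, "C003": {"segment": "Retail"}}
    print(detect_changes(yesterday, today))
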
Data Integration Architecture Considerations

A typical warehouse might load many millions of records per day. The process involves extracting data from many source systems, verifying data quality, cleansing the data in some cases, loading it into staging areas for initial processing, moving it from staging to the warehouse and then into data marts, and sometimes aggregating the data to improve reporting performance.

In addition, data must be purged from the warehouse when it reaches the end of its retention period (a simple purge rule is sketched below), and backups must occur with no interruption to availability or performance.
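
As one illustration of the retention point, the purge rule itself is simple date arithmetic, although at warehouse volumes it is usually implemented by dropping partitions rather than deleting rows. The snippet below is a hypothetical Python sketch assuming a seven-year retention policy and a load_date column.

    from datetime import date, timedelta

    RETENTION_DAYS = 7 * 365  # assumed seven-year retention policy

    def purge_expired(rows: list[dict], today: date) -> list[dict]:
        """Keep only rows whose load_date is still within the retention window."""
        cutoff = today - timedelta(days=RETENTION_DAYS)
        return [row for row in rows if row["load_date"] >= cutoff]

    warehouse = [{"load_date": date(2015, 6, 1)}, {"load_date": date(2023, 6, 1)}]
    print(purge_expired(warehouse, today=date(2024, 1, 31)))  # the 2015 row is purged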

There are a lot of moving parts, and it is very important that the overall design considers all significant data warehouse ETL factors to ensure efficient storage and effective use of the data.
Summary...

Information is a valuable corporate asset, and getting the right information to the right people at the right time is a key management objective.

Achieving this objective requires robust technology and optimal data architecture based on industry best practices.

This site provided information management guidelines for data integration architecture.