
Data Integration Best Practices

Improve effectiveness with David Bowman’s information management guidelines for data integration best practices

This site is designed for Information Technology professionals who need to improve effectiveness and want guidance and direction to help teams consistently produce error-free results.

It provides information management guidelines for data integration best practices and roles and responsibilities.

Data Integration
  • Should focus on the movement of data between systems; and
  • Should provide direction on the extract-transform-load (ETL) facilities used for bulk data movement, including mechanisms that support continuous movement of discrete records, rows, or messages between systems, in line with data integration best practices.

Data Integration Architect
  • Should have overall responsibility for the data integration solution and all data integration deliverables including data flow diagrams and source system inventory;
  • Should be responsible for designing the extract, transform, and load solution, leading system and integration testing, and defining standards and architecture toolkits;
  • Should provide technical direction to the data warehouse team to ensure the integrity of the data warehouse; and
  • Should have overall responsibility for the quality of all data integration deliverables.
Data Integration Developers

Should be responsible for creating software code that extracts data from source systems, transforms it as necessary, and loads it into data warehouses and/or data marts. They are primarily required during the development, quality assurance, and transition-to-production phases. Key responsibilities include:
  • Helping data analysts run data profiling scripts and documenting the results (a minimal profiling sketch follows this list);
  • Creating detail design specifications based on architecture and design specifications;
  • Documenting satisfaction of requirements in design specifications to help with requirements traceability;
  • Participating in design specification quality assurance reviews involving a data integration architect and a data integration designer;
  • Creating unit test data to ensure that the code performs under all expected conditions;
  • Developing ETL software using common tools such as Informatica, Ab Initio, and Ascential DataStage;
  • Conducting unit tests in a personal schema to ensure that the code meets specification;
  • Documenting test results and storing test data and results in the project documentation repository; and
  • Participating in code reviews with data architects, lead developers, and other developers.
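A lightweight profiling script can make the first responsibility above more concrete. The sketch below is a minimal, illustrative example, assuming a source extract has already been loaded into a pandas DataFrame; the column names and sample data are hypothetical, and a real profile would normally also cover patterns, value distributions, and referential checks.

```python
import pandas as pd

def profile_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Column-level profile: row counts, nulls, distinct values, and min/max."""
    rows = []
    for col in df.columns:
        series = df[col]
        rows.append({
            "column": col,
            "dtype": str(series.dtype),
            "row_count": len(series),
            "null_count": int(series.isna().sum()),
            "distinct_count": int(series.nunique(dropna=True)),
            # min/max only make sense for numeric and datetime columns
            "min": series.min() if series.dtype.kind in "ifM" else None,
            "max": series.max() if series.dtype.kind in "ifM" else None,
        })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    # Hypothetical source extract; in practice this would come from the source system.
    customers = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "name": ["Acme", "Beta", None, "Delta"],
        "balance": [100.0, 250.5, 0.0, None],
    })
    print(profile_dataframe(customers).to_string(index=False))
```

The profile output can be stored with the profiling scripts in the project documentation repository so that later design decisions can be traced back to observed source data characteristics.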
Data Integration Best Practices Guidelines

Enterprise Conceptual Data Movement Model
  • Should be created as part of the information management strategy;
  • Should be part of the business model and show what data flows into, within, and out of the organization; and
  • Should be a “high-level” model that shows the movement of data from one application to another, e.g. an order entry system may capture customer name and address information and later send it to the application that handles billing and accounting so that bills can be sent out and money collected.
Application Conceptual Data Movement Model
  • Should provide definitions of each key source system or data source that is required;
  • Should confirm understanding of application data integration scope; and
  • Should be a sub-set of the enterprise conceptual data movement model.
Application Logical Data Movement Model
  • Should show data movement requirements at the dataset level, e.g. entity and table;
  • Should detail the transformation rules and target logical datasets, e.g. entities and tables;
  • Should be considered technology independent;
  • Should show each interface between systems e.g. an order entry system, which sends customer name and address information to the billing application, may actually send three or four files. Each file is called an interface. Each interface should be shown in the logical data movement model;
  • Should focus at the logical level on the capture of actual source tables and proposed target data stores; and
  • Should be supported by a source to target mapping document, which should include business rules and data transformation rules (a simple representation is sketched after this list).
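The source to target mapping document referenced above can be captured as structured data that both analysts and ETL developers work from. The sketch below is a minimal illustration in Python; the table names, column names, and transformation rules are hypothetical, and in practice the mapping would usually live in a spreadsheet or metadata repository rather than in code.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class MappingRule:
    """One row of a source to target mapping document."""
    source_table: str
    source_column: str
    target_table: str
    target_column: str
    transformation: Callable[[Any], Any]  # business/data transformation rule
    rule_description: str

# Hypothetical mapping entries for an order-entry-to-billing interface.
MAPPING = [
    MappingRule("ORDER_ENTRY.CUSTOMER", "CUST_NM", "BILLING.CUSTOMER", "CUSTOMER_NAME",
                lambda v: v.strip().title() if isinstance(v, str) else v,
                "Trim whitespace and standardize capitalization"),
    MappingRule("ORDER_ENTRY.CUSTOMER", "POSTAL_CD", "BILLING.CUSTOMER", "POSTAL_CODE",
                lambda v: v.replace(" ", "").upper() if isinstance(v, str) else v,
                "Remove spaces and upper-case postal codes"),
]

def apply_mapping(source_row: dict) -> dict:
    """Apply each transformation rule to produce the target column values."""
    return {
        rule.target_column: rule.transformation(source_row.get(rule.source_column))
        for rule in MAPPING
    }

if __name__ == "__main__":
    print(apply_mapping({"CUST_NM": "  acme corp ", "POSTAL_CD": "k1a 0b1"}))
```

Keeping business rules and transformation rules together in one mapping structure also helps with requirements traceability, since each target column can be traced back to its source column and rule description.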
Application Physical Data Movement Model
  • Should be a detailed representation of the data integration requirements at the dataset level, e.g. table level, detailing the transformation rules and target physical datasets, e.g. tables;
  • Should be considered technology dependent; and
  • Should allow for one or many physical models for each logical model.
Data Integration Best Practices for Architecture
  • Should support and track metadata, so that it can be reported upon and leveraged during on-going development work and data stewardship activities;
  • Should enforce a common approach towards developing the data integration modules, and enforce development standards;
  • Should use a minimum of physical steps to move and transform the data, with a minimum of data integration tasks;
  • Should provide for efficient identification and processing of changed data, e.g. change data capture is a challenging component of any data acquisition solution; tools exist to simplify this process, but even a custom solution can be very successful (a simple watermark-based sketch follows this list);
  • Should leverage technology and design principles to ensure scalability; and
  • Should support manual data e.g. user-defined hierarchy for reporting.
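When the source system lacks native change data capture tooling, a custom solution often takes the form of a watermark-based incremental extract. The sketch below is a minimal illustration using SQLite from the Python standard library; the table, columns, and timestamps are hypothetical, and it assumes the source maintains a reliable last-updated timestamp, which is not always the case.

```python
import sqlite3

def extract_changed_rows(conn: sqlite3.Connection, last_watermark: str):
    """Watermark-based change data capture: pull only rows changed since the
    last successful run. Assumes the source maintains a trustworthy
    last_updated_ts column; otherwise log-based or tool-based CDC is needed."""
    cursor = conn.execute(
        "SELECT customer_id, customer_name, last_updated_ts "
        "FROM customers WHERE last_updated_ts > ? ORDER BY last_updated_ts",
        (last_watermark,),
    )
    rows = cursor.fetchall()
    # Advance the watermark to the highest timestamp actually extracted,
    # not to "now", so nothing committed mid-run is silently skipped.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, "
                 "customer_name TEXT, last_updated_ts TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
        (1, "Acme", "2024-01-01T10:00:00"),
        (2, "Beta", "2024-01-02T09:30:00"),
    ])
    changed, watermark = extract_changed_rows(conn, "2024-01-01T12:00:00")
    print(changed, watermark)
```

The new watermark should only be persisted after the downstream load has committed, so that a failed run simply reprocesses the same window on restart.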
Data Integration Best Practices for Standards
  • Should be established to guide the analysis, design, and development of data movement processes that are consistent, usable and of high quality;
  • Should lead to greater consistency, which subsequently leads to greater productivity;
  • Should reduce ongoing maintenance costs;
  • Should improve readability of software, making it easier for developers to understand new code more quickly;
  • Should be the subject of project quality assurance checkpoints;
  • Should introduce common, consistent data integration analysis, design, and coding patterns;
  • Should provide reusable, enterprise-wide analysis, design, and construction components through data integration modeling processes using data integration tools, to ensure an acceptable level of data quality per business specifications;
  • Should introduce best practices and consistency in coding and naming standards;
  • Should reduce the costs to analyze, design, develop, and maintain source code deliverables;
  • Should integrate controls into the data integration process to ensure data quality and integrity;
  • Should standardize on a common data integration tool and ensure that it is included in the information management technology standards list, e.g. Ab Initio, Ascential DataStage (now from IBM), Informatica, or open source tools such as Talend or Pentaho;
  • Should standardize on a job-scheduling tool such as Control-M or Autosys; and
  • Should define and communicate accountabilities to all stakeholders for each key data integration role.
Data Integration Best Practices for Architecture and Design
  • Should be established for technical design qualities;
  • Should be established for functional qualities;
  • Should be established for non-functional qualities;
  • Should be established for constraints;
  • Should be established for run-time qualities, including configurability and supportability, correctness, reliability and availability, quality of service, safety properties, operational scalability and accuracy and completeness;
  • Should be established for development-time qualities, including localization, modifiability or extensibility, scalability, reusability and efficiency;
  • Should be established for separation of concerns;
  • Should be established for process flow;
  • Should be established for designing transactions at an atomic level;
  • Should be established for restart-ability (a minimal checkpoint sketch follows this list);
  • Should be established for error handling approach;
  • Should be established for parallelization; and
  • Should be established for job scheduling.
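Restart-ability, atomic transactions, and error handling often come together as a checkpointing pattern: the job is broken into atomic steps and the last successfully completed step is recorded, so a failed run resumes where it left off rather than reprocessing everything. The sketch below is a minimal, illustrative version using a local state file; the step names and checkpoint location are hypothetical, and production jobs would typically keep this state in the scheduler or a control table.

```python
import json
from pathlib import Path

CHECKPOINT = Path("job_checkpoint.json")  # hypothetical checkpoint location

def load_completed_steps() -> set:
    """Read the set of steps that finished successfully in a previous run."""
    return set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()

def mark_completed(step: str, completed: set) -> None:
    """Record a successfully completed step so a restart can skip it."""
    completed.add(step)
    CHECKPOINT.write_text(json.dumps(sorted(completed)))

def run_job(steps) -> None:
    """Run each atomic step once; on restart, skip anything already completed."""
    completed = load_completed_steps()
    for name, func in steps:
        if name in completed:
            print(f"skipping {name}: already completed in a previous run")
            continue
        try:
            func()
        except Exception as exc:
            # Fail fast and leave the checkpoint intact so the job is restartable.
            print(f"step {name} failed: {exc}")
            raise
        mark_completed(name, completed)
    CHECKPOINT.unlink(missing_ok=True)  # clean run: clear the checkpoint

if __name__ == "__main__":
    run_job([
        ("extract", lambda: print("extracting source data")),
        ("transform", lambda: print("applying transformation rules")),
        ("load", lambda: print("loading the target tables")),
    ])
```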
Data Integration Best Practices for Architecture and Design Goals
  • Should move data from source once, i.e. copy data from the source to the ETL environment quickly so that the source is only accessed once. The target architecture should enforce reuse of a single copy of data drawn from production sources. This will minimize resource utilization on the source system;
  • Should standardize data quality, i.e. provide a facility for standard centralized data quality checks. The facility should allow for required checks and optional checks, and each target should determine if it will subscribe to the optional checks. Systems of record should be responsible for data quality;
  • Should increase data integration processing efficiency, i.e. provide a capability through metadata and process control flow for tracking data dependencies and driving data processing as soon as possible based on the availability of required input files (see the availability-check sketch after this list);
  • Should maximize reuse, i.e. provide a facility for storing and cataloging clean data (both source and derived from calculations and/or joins). By storing the clean data, it can be utilized by new processes, minimizing the need for new processes to re-source the data, recheck quality, or re-compute derived values; and
  • Should focus on metadata i.e. use a centralized metadata repository to drive data quality and integrity, and data integration workflow.
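The processing-efficiency goal above, starting work as soon as required inputs are available, can be illustrated with a simple availability check over a landing area. The sketch below is illustrative only: the landing directory, required file names, and polling interval are assumptions, and a real implementation would normally delegate this to the job-scheduling tool or a metadata-driven control framework rather than a polling loop.

```python
import time
from pathlib import Path

LANDING_DIR = Path("/data/landing")                # hypothetical landing area
REQUIRED_INPUTS = {"orders.csv", "customers.csv"}  # hypothetical input files

def inputs_available(landing_dir: Path, required: set) -> bool:
    """True once every required input file has landed."""
    present = {p.name for p in landing_dir.glob("*") if p.is_file()}
    return required.issubset(present)

def wait_and_run(process, poll_seconds: int = 60, timeout_seconds: int = 3600) -> None:
    """Poll the landing area and start processing as soon as all inputs arrive."""
    waited = 0
    while waited <= timeout_seconds:
        if inputs_available(LANDING_DIR, REQUIRED_INPUTS):
            process()
            return
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError("required input files did not arrive within the allowed window")
```

Driving jobs off input availability rather than fixed clock times shortens the end-to-end batch window, because downstream processing starts the moment its dependencies are satisfied.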
Components Standards
  • Should provide facility for storage and retrieval of interim/staged ETL data sets as flat files;
  • Should provide a mechanism for creating target specific outputs by eliminating unwanted records and fields;
  • Should provide a mechanism for reformatting and calculation, creating new derived values;
  • Should provide mechanisms for moving data from sources to the ETL environment and from the ETL environment to targets;
  • Should ensure data set level validation, i.e. verify the integrity of copied data sets, e.g. checksums (a minimal sketch follows this list);
  • Should provide a standard and centralized method for testing and validating data at the row level;
  • Should provide a mechanism for moving data through ETL components as quickly as is feasible, allowing for jobs to start as soon as all inputs are available; and
  • Should use technical metadata to support these processes.
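Data set level validation via checksums, noted in the list above, can be as simple as comparing a hash of the copied data set with a hash of the original before downstream processing starts. The sketch below is a minimal example using SHA-256 from the Python standard library; the file paths are hypothetical.

```python
import hashlib
from pathlib import Path

def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(source: Path, target: Path) -> bool:
    """Data set level validation: the copy is accepted only if checksums match."""
    return file_checksum(source) == file_checksum(target)

if __name__ == "__main__":
    # Hypothetical paths for a source extract and its copy in the ETL environment.
    source = Path("extracts/orders_20240101.dat")
    target = Path("staging/orders_20240101.dat")
    if source.exists() and target.exists():
        print("checksums match" if verify_copy(source, target) else "checksum mismatch")
```

Row-level validation (data types, domains, referential checks) is a separate concern and is usually handled by the centralized data quality facility rather than by the file transfer step.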
Other Data Integration Best Practices
  • Should be established for data flow diagrams;
  • Should be established for data profiling including what the profile should contain, what tools should be used to create the profile and who should be involved with data profiling;
  • Should be established for data warehouse experts, e.g. contractors, with clearly defined job descriptions based on information requirements;
  • Should be established for data management services, e.g. outsourcing data mapping and data movement development, if these services are part of the data management practice;
  • Should be established for generic data formats;
  • Should consider metadata-driven data integration routines;
  • Should be established for standard source to target data mapping templates;
  • Should include data mapping activities as part of the requirements analysis phase;
  • Should be established to handle missing data;
  • Should be established to standardize specialized data extractor routines;
  • Should be established for data transmission including standards for signal files (a simple signal-file convention is sketched after this list);
  • Should be established for extracting data from source systems;
  • Should be established for data file documentation;
  • Should be established for data recovery procedures and routines;
  • Should be established to export data from data warehouses;
  • Should be established for using "fast load" techniques e.g. data pump;
  • Should be established for data import into data marts;
  • Should be established for master data management;
  • Should be established for data transfer between applications; and
  • Should be established for data replication between applications.
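Signal files, mentioned in the data transmission item above, are a common convention: the sender writes the data file completely and then drops an empty control file to indicate that the transfer is finished and the data is safe to read. The sketch below is a minimal illustration; the naming convention (a .done file next to the data file) and the landing directory are assumptions, not a standard.

```python
from pathlib import Path

def signal_path(data_path: Path) -> Path:
    """Hypothetical convention: <data file name>.done next to the data file."""
    return data_path.with_name(data_path.name + ".done")

def publish_with_signal(payload: bytes, data_path: Path) -> None:
    """Sender side: write the data file completely, then drop the signal file."""
    data_path.write_bytes(payload)
    signal_path(data_path).touch()

def ready_to_process(data_path: Path) -> bool:
    """Receiver side: pick up the data file only once its signal file exists."""
    return data_path.exists() and signal_path(data_path).exists()

if __name__ == "__main__":
    landing = Path("landing")
    landing.mkdir(exist_ok=True)
    data_file = landing / "orders_20240101.csv"
    publish_with_signal(b"order_id,amount\n1,100.00\n", data_file)
    print("ready to process:", ready_to_process(data_file))
```

Because the signal file is written only after the data file is complete, the receiver never picks up a partially transferred file.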
Summary...

Data integration should focus on the movement of data between systems and should provide direction on the extract-transform-load facilities used for bulk data movement, including mechanisms that support continuous movement of discrete records, rows, or messages between systems, in line with data integration best practices.

This site provides information management guidelines for data integration best practices and roles and responsibilities.