It is common practice to make changes to the underlying systems either to correct problems or to provide support for new features that are needed by the business. Changes can be in the form of adding a new source system to your existing Enterprise Data Warehouse (EDW).
This blog post examines the issue of adding new source systems in an EDW environment, how to manage customizations in an existing EDW, what type of analysis has to be made before the commencement of a project, in the impacted areas, and the solution steps required.
Enterprise Data Warehouse
An Enterprise Data warehouse (EDW) is a conceptual architecture that helps to store subject-oriented, integrated, time variant and non-volatile data for decision making. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. An EDW includes various source systems, ETL (E- extract, T - transform, and L - load), Staging Area, Data warehouse, various Data Marts and BI reporting as shown in EDW Architecture.
Figure 1: Enterprise Data Warehouse Architecture
Why do we need to add a new source system/data in an existing EDW?
An organization can have many transactional source systems. At the time of building EDW, the organization may or may not consider all the source systems. But over the time they need to add those left outsource systems or newly arrived source system into the existing EDW for their decision-making reports.
Challenges in Adding a New Source System/Data to the existing EDW
Let us illustrate with a real-time scenario.
Let us assume that the existing EDW has the below data model as shown in (Figure 2: Existing EDW Data Model). Business has decided to add data from a new source system (Figure 1: Source 4) to the existing EDW. This new source system will be populating data into all the existing dimension tables also has some more information which requires storing in new dimension table (Figure 3: Store Dimension).
Figure 2: Existing EDW Data Model
Adding data by creating a new dimension table in the EDW does not pose an issue because we will be able to create new ETL jobs, staging tables, DIM tables, Marts and, Reports accordingly.
The problem arises when we try populating new source data into the existing DIM/Fact tables or in Marts in EDW without proper analysis. This may corrupt the existing ETL jobs, upstream and downstream applications, reporting process and finally corrupting the EDW.
Figure 3: EDW Data Model after adding a new source system
What kind of analysis must be performed before commencing a project?
Analysis has to be performed in the below-mentioned areas:
New Source System/Data
Any organization that has many lines of business globally or locally stores the data in multiple source systems. No two source systems will be completely identical in terms of data model, tables ddl (data definition language) and information stored.
Before introducing a new source system to the EDW, data has to be analyzed in the row and column level (maximum length, type, and frequency of the data). It is recommended to use any profiling tool (like IDQ or IBM IIA) to get a complete picture of your data.
Target DIM/Fact tables
It is crucial to have a complete picture of the existing EDW DIM/Fact tables to which the new source system data will be a part of.
Accommodating the new Source System Data
Most of the EDW Table column types are defined as INTEGER but the problem arises when the new source system has the data type as ‘BIGINT”. It is a known fact that BIGINT value cannot reside in an INTEGER field.
In this scenario, we had to change the schema of the existing DW table from INT to BIGINT. But we must ensure that there is no impact on the existing data received from other source systems.
Similarly, we found many changes required to do in existing EDW to accommodate the new source system data.
After completing the analysis of source and target tables we have to create the BIBLE (mapping document) for smoothly marrying the new source system to the existing EDW. It should clearly have defined what ddls need to be changed in the staging and Target DB tables.
Changing the ddls of the table is not an easy task because the table already has a huge volume of data and indexed. It is recommended to consult a Database Architect or performed under the supervision of an expert.
Any changes in the existing target table ddls will bring huge ripples in the ETL jobs. As we know most of the EDW stores the data in SCD2/SCD3. For implementing SCD2/SCD3 we use lookups in target tables. In the above scenario, we modified the target table column datatype INTEGER to BIGINT. But what will happen with the existing ETL jobs that were doing a lookup in the same table and expecting INTEGER datatype. Obviously, it will start failing. So, it is mandatory to do rigorous impact analysis to the existing ETL jobs.
Upstream and Downstream Applications
The data stored in the EDW is uploaded from the upstream applications and consumed by downstream applications like reporting and data analysis applications. Any changes in ETL jobs or in EDW will impact these applications. Without doing proper impact analysis of these applications and adding new source data in the existing EDW can impact adversely.
Mastech InfoTrellis’ 11 plus years of expertise in building MDM, ETL, Data Warehouse, adding new source systems/data to an existing EDW, have helped us to perform a rigorous impact analysis and arrive at the right solution sets without corrupting the EDW.
About the Author
Anand Kumar, is an Associate Architect at Mastech InfoTrellis. He has 7.5 years of experience in design and development of various data integration and warehouse projects in different verticals. He is an IBM Certified Specialist in PureData System for Analytics v7.0 (Netezza data warehouse appliance) and Solution Developer in InfoSphere DataStage v8.5