Blogs

Informatica MDM MDE Batch Process in a nutshell

Jul 25, 2019 4:56:19 AM / by Satish Kumar


Overview

Data is siloed across wide variety of platforms in an enterprise environment and the data needs to be processed, cleansed, and mastered to ensure it is same across source systems for effective reporting and analysis. To cater to this need, Informatica provides Master Data Management (MDM) product called Multi-Domain Edition (MDE). To master the data in this tool, the data needs to be loaded into the Informatica MDM Hub. In Informatica MDM data can be loaded in two different modes (i) Batch and (ii) Real-time.

In this blog we will delve into Batch Processing for loading the data into the Informatica MDM Hub Store, and  the various features and options that are available. Before diving into the batch process, let’s understand how data is organized in Informatica MDM Hub Store.

About MDM Hub Store

Informatica MDM Hub follows three layered approach, typically seen in an Extract, Transform and Load (ETL) process, to load the data.The layers are:

  1. Landing – this forms a pre-staging layer, tables in this layer are used to receive batch loads from multiple source systems.
  2. Staging – tables in this layer are used to hold cleansed and standardized data before loading to the final target.
  3. Target – tables in his layer, referred as Base Objects (BO), contain the master data or golden records.

These tables and other related objects, along with master data, content metadata, rules for processing the source data and rules for defining the master data are stored in a database called as Operational Reference Store or ORS.

Figure 1: Informatica Hub Store

Batch Process

Informatica MDM Hub provides a sequence of steps and options to load the data, maintain lineage of data for tracking & other metadata for effective processing and finally consolidate the data to build the Best version of Truth (BVT). Let’s look at the batch processes at a high level and understand the features they offer.

Land Process

Land Process is external to Informatica MDM Hub and is executed using external batch processes or external applications that directly populates landing tables in the Hub Store.

Land Process is optional and can be skipped, along with other Land Process options as discussed in our blog, How to integrate Informatica Data Quality (IDQ) with Informatica MDM.

Figure 2: Land Process

Stage Process

Stage Process transfers source data from a “landing table to source specific staging table” associated with a base object. The movement of data from landing table to staging table is defined using Mappings offered by Informatica MDM Hub.

Figure 3: Stage Process

Mappings

Mappings define the movement of data from the source column in the landing table to the target column in staging table. Data from landing table is cleansed and standardized before loading to staging table. The cleansing and standardization can be done within Informatica MDM HUB or outside using external tools like Informatica Data Quality (IDQ) or Informatica Power Center.

Figure 4: Landing Table to Staging Table Mapping

Reject Records

During the Stage Process, records that have problems are rejected, transferred along with the reason for rejection to the Reject Table associated with the Staging Table. If there are multiple reasons for rejection, the first reason alone is persisted.

Advantage of this feature

This feature helps investigate the rejected records after running Stage Process. As this feature is is built into Informatica MDM, it saves considerable development and implementation time.

Audit Trail

Stage Process is capable of maintaining a copy/replica of source data in a table (called as RAW table) associated with each of the staging tables. This is possible only if audit trail is enabled for the staging table for configurable number of stage job runs or retention period.

Advantage of this feature

This feature is useful for auditing purposes and tracking data issues, identifying the missing data back to the source data. Informatica MDM Hub provides this feature with an easy configuration, saving considerable development and implementation time.

Delta Detection

Stage Process can detect the records in the landing table that are new or updated if delta detection is enabled for a staging table. The comparison for delta detection can be configured based on all columns or for any date column or specific set of columns as required.

Advantage of this feature

This feature is useful, if the data sent by source system is a full datset, to limit the volume to changed data only and thus improve performance in further processes.

Load Process

Load Process moves data from staging table to the corresponding base object. In addition to this, Load Process performs lookup, computes the trust score for merging of data and running the validation rules. Each base object has tables associated with it to capture the lineage of data, track the history of changes and other details.

Figure 5: Load Process

Reject Records

During the Load Process, records are rejected and transferred along with reason for rejection to the Reject Table associated with Base Object’s Staging Table for that source system.

Advantage of this feature

This feature is based on cental reject table for maintaining reject records for both Stage & Load Process. Informatica knowledge base article KB 90407 provides a good insight into the reject handling process.

Tokenize Process

Tokenization process generates match tokens and stores them in a match key (strip) table associated with the base object. The match tokens are subsequently used by the Match Process to identify suspects for matching.

Figure 6: Tokenize Process

Match & Consolidate Process

Matching is the process of identifying whether two records are similar or duplicate of the other either by exact (deterministic) or fuzzy (probabilistic) matching.

Consolidation is the process of consolidating data from matched records into a single, master record once the match pairs or suspects have been identified in the Match process. Records can flag for auto merge if the suspects are sufficiently similar or sent for manual merge if suspects are likely to be duplicates.

Figure 7: Load & Consolidate Process

Summary

The following figure provides a detailed perspective on the overall flow of data through the Informatica MDM Hub using batch processes, including individual processes, source systems, base objects, and support tables.

Figure 8: Informatica MDM MDE Batch Process (Image source: Informatica)

In Informatica MDM, Batch mode is often used for initial data load (first time business data is loaded), as it can be the most efficient way to populate large number of records. Batch mode provides necessary options and logic for loading data, for updating lineage of the data and other metadata for processing the data.

About the Author

Author: Sathish Kumar

Sathish Kumar is a Senior Consultant at Mastech Infotrellis having 6 plus years of experience in development and implementation of ETL and MDM solutions. He has good knowledge in various Informatica products like Power Center, IDQ, Informatica Cloud and Informatica MDM MDE.

Co-Author: Sachin Dedhia

Sachin Dedhia is an Architect at Mastech InfoTrellis and has 13 plus years of extensive experience in design and development of Data Integration (DI), Business Intelligence (BI) and Master Data Management (MDM) solutions. He leads the Informatica MDM practice providing his expertise on the projects, designing internal trainings, assets and accelerators.

Topics: Blog, Batch Process, Informatica, Land Process, Load Process, MDE, MDM, Stage Process

Satish Kumar

Written by Satish Kumar

Subscribe to Email Updates

Lists by Topic

see all

Posts by Topic

See all

Recent Posts