Thursday, September 8, 2011

Implementing Informatica Persistent Cache

You must have noticed that the "time" Informatica takes to build the lookup cache can be too much sometimes depending on the lookup table size/volume. Using Persistent Cache, you may save lot of your time. This article describes how to do it.

What is Persistent Cache?

lookupLookups are cached by default in Informatica. This means that Informatica by default brings in the entire data of the lookup table from database server to Informatica Server as a part of lookup cache building activity during session run. If the lookup table is too huge, this ought to take quite some time. Now consider this scenario - what if you are looking up to the same table different times using different lookups in different mappings? Do you want to spend the time of building the lookup cache again and again for each lookup? Off course not! Just use persistent cache option!
Yes, Lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes lookup cache files after a successful session run based on whether the Lookup cache is checked as persistent or not.

Where and when we shall use persistent cache:

Suppose we have a lookup table with same lookup condition and return/output ports and the lookup table is used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to populate the surrogate key in the fact tables based on their source system keys. Now if we cache the same Customer Dimension table multiple times in multiple mappings that would definitely affect the SLA loading timeline.
There can be some functional reasons also for selecting to use persistent cache. Please read the article Advantage and Disadvantage of Persistent Cache Lookup to know how persistent cache can be used to ensure data integrity in long running ETL sessions where underlying tables are also changing.

So the solution is to use Named Persistent Cache.

In the first mapping we will create the Named Persistent Cache file by setting three properties in the Properties tab of Lookup transformation.

session property

Lookup cache persistent: To be checked i.e. a Named Persistent Cache will be used.
Cache File Name Prefix: user_defined_cache_file_name i.e. the Named Persistent cache file name that will be used in all the other mappings using the same lookup table. Enter the prefix name only. Do not enter .idx or .dat
Re-cache from lookup source: To be checked i.e. the Named Persistent Cache file will be rebuilt or refreshed with the current data of the lookup table.
Next in all the mappings where we want to use the same already built Named Persistent Cache we need to set two properties in the Properties tab of Lookup transformation.

session property using cache

Lookup cache persistent: To be checked i.e. the lookup will be using a Named Persistent Cache that is already saved in Cache Directory and if the cache file is not there the session will not fail it will just create the cache file instead.
Cache File Name Prefix: user_defined_cache_file_name i.e. the Named Persistent cache file name that was defined in the mapping where the persistent cache file was created.

Note:

If there is any Lookup SQL Override then the SQL statement in all the lookups should match exactly even also an extra blank space will fail the session that is using the already built persistent cache file.
So if the incoming source data volume is high, the lookup table’s data volume that need to be cached is also high, and the same lookup table is used in many mappings then the best way to handle the situation is to use one-time build, already created persistent named cache.


The benefit and disadvantage of Informatica Persistent Cache Lookup

Persistent cache may be your choice of caching when it comes to lookup performance. But you should be aware of the hazards of persistent cache as well.

Overview

If you enable caching in a Lookup transformation, the Integration Service builds a cache in memory to store lookup data. If the lookup does not change between sessions, you can configure the transformation to use a persistent lookup cache. When you run the session containing persistent lookup cache, the Integration Service rebuilds the persistent cache if any cache file is missing or invalid. For details, please check How to Implement Informatica Persistent Cache.

Benefits of Persistent Cache Lookup

  • Depending on your situation, persistent cache can give you huge performance boost during the session runtime since the Integration Service does not need to rebuild the lookup cache again, thereby saving time.
  • The very fact that the data in a persistent cache file does not refresh or change in session run can be used to overcome major functional hurdles.
I would like to illustrate below one problematic situation that is often encountered in long running ETL while trying to load data in Fact tables and how the same can be overcome by using Persistent Cache lookups.

Problem: Ensuring Data Integrity in Long Running Load when underlying tables are changing

Suppose you have a requirement of producing cross subject area reports combining two fact tables from two different data marts – Sales and Marketing. The loading of both the facts are started at the 1st day of the month, however sales fact is loaded first and then the marketing fact is started.
Now suppose sales fact, being a very huge volume fact, takes more than a day to complete the load and there is a possibility that the surrogate keys flowing in to this fact from different SCD Type-II dimension tables get changed in the mean time during the daily refresh. Meaning the surrogate key of one customer populated in marketing fact will not be same with the surrogate key of the same customer populated in Sales Fact since they are loaded in two different days and mean time the key may have been changed.

Overcome using Persistent Cache Lookup

How to overcome this issue? If you think closely the issue basically boils down to ensuring that both facts get the same surrogate keys that can be easily achieved by using persistent cache lookup so that even if the underlying tables change data during daily loads, all the facts get the same set of keys.

Disadvantage of Persistent cache Lookup

Although persistent cache can give you considerable performance and other advantages, it comes with some hazards.
  • Recovering sessions after failure in midway may not be possible. Consider this – you have a typical update-else-insert logic in a target table over which you also have persistent cache lookup. This PC lookup on target is used to determine if a certain record coming from source is already present in target (Update) or not (Insert). Suppose this session got failed after inserting a few record in target. If this was a normal lookup, you could simply restart the session after fixing the cause of the error. But if this is a persistent cache lookup, you can not restart this job directly as because the lookup cache will not be holding the records that got inserted in the first time and as a result the lookup will fail to determine that these records are already existing in target and your mapping will try to insert them once again.
  • Persistent cache gives you performance boost by saving time in building the lookup while session run but it still takes the same time like a normal lookup for using the lookup data during session runtime. It is often observed that persistent cache shared across many sessions creates huge disk level I/O contention when the lookup is actually being used in the sessions. You need to monitor the disk IO performance using “iostat” or “vmstat” (UNIX) if you see huge transformation bottleneck without any apparent reason in sessions using persistent cache lookup.

1 comment:

  1. Informatica is one of the most trusted and effective ETL software which is known globally. Informatica is one data warehousing ETL tool. Informatica ha very go opportuities for job across the world.
    informatica training in chennai | informatica training institute in Chennai | informatica classes in Chennai

    ReplyDelete