Thursday, September 8, 2011


Implementing Informatica Persistent Cache

You must have noticed that the "time" Informatica takes to build the lookup cache can be too much sometimes depending on the lookup table size/volume. Using Persistent Cache, you may save lot of your time. This article describes how to do it.

What is Persistent Cache?

Lookups are cached by default in Informatica. This means that, as part of building the lookup cache during a session run, Informatica brings the entire data of the lookup table from the database server to the Informatica server. If the lookup table is huge, this can take quite some time. Now consider this scenario: what if you are looking up the same table multiple times, using different lookups in different mappings? Do you want to spend the time building the lookup cache again and again for each lookup? Of course not! Just use the persistent cache option.
Yes, Lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes lookup cache files after a successful session run based on whether the Lookup cache is checked as persistent or not.

Where and when we shall use persistent cache:

Suppose we have a lookup table with same lookup condition and return/output ports and the lookup table is used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to populate the surrogate key in the fact tables based on their source system keys. Now if we cache the same Customer Dimension table multiple times in multiple mappings that would definitely affect the SLA loading timeline.
There can also be functional reasons for choosing persistent cache. Please read the later section, The benefit and disadvantage of Informatica Persistent Cache Lookup, to learn how persistent cache can be used to ensure data integrity in long-running ETL sessions where the underlying tables are also changing.

So the solution is to use Named Persistent Cache.

In the first mapping we will create the Named Persistent Cache file by setting three properties in the Properties tab of Lookup transformation.

Image: session property

Lookup cache persistent: To be checked i.e. a Named Persistent Cache will be used.
Cache File Name Prefix: user_defined_cache_file_name i.e. the Named Persistent cache file name that will be used in all the other mappings using the same lookup table. Enter the prefix name only. Do not enter .idx or .dat
Re-cache from lookup source: To be checked i.e. the Named Persistent Cache file will be rebuilt or refreshed with the current data of the lookup table.
Next in all the mappings where we want to use the same already built Named Persistent Cache we need to set two properties in the Properties tab of Lookup transformation.

Image: session property using cache

Lookup cache persistent: To be checked, i.e. the lookup will use a Named Persistent Cache that is already saved in the Cache Directory. If the cache file is not there, the session will not fail; it will simply create the cache file instead.
Cache File Name Prefix: user_defined_cache_file_name i.e. the Named Persistent cache file name that was defined in the mapping where the persistent cache file was created.

Note:

If there is a Lookup SQL Override, then the SQL statement in all the lookups must match exactly; even an extra blank space will fail the session that is using the already built persistent cache file, as illustrated below.
So if the incoming source data volume is high, the volume of lookup table data that needs to be cached is also high, and the same lookup table is used in many mappings, then the best way to handle the situation is a one-time build of a named persistent cache that all the mappings reuse.
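As a hedged illustration (the table and column names below are hypothetical): if the cache-building lookup uses the following SQL override, every other lookup sharing the same named persistent cache file must carry exactly this text, down to the whitespace.

SELECT CUSTOMER_KEY, SOURCE_CUSTOMER_ID
FROM CUSTOMER_DIM

Any difference in the override text can cause the session that tries to reuse the saved cache file to fail, as noted above.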


The benefit and disadvantage of Informatica Persistent Cache Lookup

Persistent cache may be your choice of caching when it comes to lookup performance. But you should be aware of the hazards of persistent cache as well.

Overview

If you enable caching in a Lookup transformation, the Integration Service builds a cache in memory to store lookup data. If the lookup does not change between sessions, you can configure the transformation to use a persistent lookup cache. When you run a session containing a persistent lookup cache, the Integration Service rebuilds the persistent cache if any cache file is missing or invalid. For details, please check How to Implement Informatica Persistent Cache.

Benefits of Persistent Cache Lookup

  • Depending on your situation, persistent cache can give you huge performance boost during the session runtime since the Integration Service does not need to rebuild the lookup cache again, thereby saving time.
  • The very fact that the data in a persistent cache file does not refresh or change between session runs can be used to overcome major functional hurdles.
Below I illustrate one problematic situation that is often encountered in long-running ETL loads of fact tables, and how it can be overcome by using persistent cache lookups.

Problem: Ensuring Data Integrity in Long Running Load when underlying tables are changing

Suppose you have a requirement to produce cross-subject-area reports combining two fact tables from two different data marts – Sales and Marketing. The loading of both facts starts on the 1st day of the month; however, the Sales fact is loaded first and then the Marketing fact load is started.
Now suppose the Sales fact, being a very high volume fact, takes more than a day to load, and in the meantime the surrogate keys flowing into this fact from different SCD Type-II dimension tables may change during the daily refresh. That means the surrogate key of a customer populated in the Marketing fact will not be the same as the surrogate key of the same customer populated in the Sales fact, since the two facts are loaded on two different days and the key may have changed in between.

Overcome using Persistent Cache Lookup

How do we overcome this issue? If you think closely, the issue boils down to ensuring that both facts get the same surrogate keys. That can easily be achieved by using a persistent cache lookup: even if the underlying dimension tables change during the daily loads, both facts get the same set of keys from the saved cache.

Disadvantage of Persistent cache Lookup

Although persistent cache can give you considerable performance and other advantages, it comes with some hazards.
  • Recovering a session that failed midway may not be possible. Consider this – you have a typical update-else-insert logic on a target table, over which you also have a persistent cache lookup. This lookup on the target is used to determine whether a record coming from the source is already present in the target (update) or not (insert). Suppose this session fails after inserting a few records into the target. If this were a normal lookup, you could simply restart the session after fixing the cause of the error. But with a persistent cache lookup you cannot restart the job directly, because the lookup cache will not hold the records that were inserted in the first run; the lookup will fail to recognise that those records already exist in the target, and the mapping will try to insert them again.
  • Persistent cache gives you a performance boost by saving the time of building the lookup cache during the session run, but using the lookup data at runtime still takes the same time as a normal cached lookup. It is often observed that a persistent cache shared across many sessions creates heavy disk-level I/O contention when the lookup is actually being used. Monitor the disk I/O performance using “iostat” or “vmstat” (UNIX) if you see a large transformation bottleneck without any apparent reason in sessions using persistent cache lookups.

Implementing Informatica Partitions

Why use Informatica Pipeline Partition?

Identification and elimination of performance bottlenecks will obviously optimize session performance. After tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.

PowerCenter Informatica Pipeline Partition

Different Types of Informatica Partitions
We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.

Informatica Pipeline Partitioning Explained

Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel.
A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.
Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points we can define the partition type and add or delete partitions (i.e. change the number of partitions).
We can set the following attributes to partition a pipeline:
Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads. We cannot create partition points at Source instances or at Sequence Generator transformations.
Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.
Partition types: The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points. We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.
Database partitioning: The Integration Service queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database.
Pass-through: The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in the partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.
Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where we want each partition to process approximately the same numbers of rows i.e. load balancing.
Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key. The Integration Service uses all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.
Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We define the number of ports to generate the partition key.
Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.
We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning.
Add, delete, or edit partition points on the Partitions view on the Mapping tab of session properties of a session in Workflow Manager.
The PowerCenter® Partitioning Option increases the performance of PowerCenter through parallel data processing. This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.

Pushdown Optimization In Informatica

Pushdown Optimization, a feature of Informatica PowerCenter, allows developers to balance the data transformation load between the Integration Service and the database servers. This article describes pushdown techniques.

What is Pushdown Optimization?

Pushdown optimization is a way of load-balancing among servers in order to achieve optimal performance. Veteran ETL developers often come across issues when they need to determine the appropriate place to perform ETL logic. Suppose an ETL logic needs to filter out data based on some condition. One can either do it in the database by using a WHERE condition in the SQL query, or inside Informatica by using an Informatica Filter transformation. Sometimes we can even "push" some transformation logic to the target database instead of doing it on the source side (especially in the case of ELT rather than ETL). Such optimization is crucial for overall ETL performance.

How does Push-Down Optimization work?

One can push transformation logic to the source or target database using pushdown optimization. The Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the source or the target database which executes the SQL queries to process the transformations. The amount of transformation logic one can push to the database depends on the database, transformation logic, and mapping and session configuration. The Integration Service analyzes the transformation logic it can push to the database and executes the SQL statement generated against the source or target tables, and it processes any transformation logic that it cannot push to the database.

Using Pushdown Optimization

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to pushdown optimization.
Let us take an example. Image: Pushdown Optimization Example 1


Suppose a mapping contains a Filter transformation that filters out all employees except those with a DEPTNO greater than 40 (filter condition: DEPTNO > 40). The Integration Service can push the transformation logic to the database. It generates the following SQL statement to process the transformation logic:
INSERT INTO EMP_TGT(EMPNO, ENAME, SAL, COMM, DEPTNO)      
SELECT 
EMP_SRC.EMPNO,
EMP_SRC.ENAME,
EMP_SRC.SAL,
EMP_SRC.COMM,
EMP_SRC.DEPTNO
FROM EMP_SRC
WHERE (EMP_SRC.DEPTNO >40)
The Integration Service generates an INSERT SELECT statement and it filters the data using a WHERE clause. The Integration Service does not extract data from the database at this time.
We can configure pushdown optimization in the following ways:

Using source-side pushdown optimization:

The Integration Service pushes as much transformation logic as possible to the source database. The Integration Service analyzes the mapping from the source to the target or until it reaches a downstream transformation it cannot push to the source database and executes the corresponding SELECT statement.
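Continuing the earlier EMP_SRC example, with source-side pushdown the pushed filter would roughly correspond to a SELECT like the one below (a sketch based on the example above, not literal Integration Service output); the Integration Service then reads the filtered rows and performs the remaining transformation and load work itself.

SELECT EMP_SRC.EMPNO, EMP_SRC.ENAME, EMP_SRC.SAL, EMP_SRC.COMM, EMP_SRC.DEPTNO
FROM EMP_SRC
WHERE (EMP_SRC.DEPTNO > 40)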

Using target-side pushdown optimization:

The Integration Service pushes as much transformation logic as possible to the target database. The Integration Service analyzes the mapping from the target to the source or until it reaches an upstream transformation it cannot push to the target database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database and executes the DML.

Using full pushdown optimization:

The Integration Service pushes as much transformation logic as possible to both the source and target databases, which must be the same database for full pushdown to apply. If you configure a session for full pushdown optimization and the Integration Service cannot push all the transformation logic to the database, it performs source-side or target-side pushdown optimization instead. The Integration Service analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it analyzes the target. When it can push all transformation logic to the database, it generates an INSERT SELECT statement to run on the database; the statement incorporates transformation logic from all the transformations in the mapping. If the Integration Service can push only part of the transformation logic to the database, it does not fail the session; it pushes as much transformation logic as possible to the source and target databases and then processes the remaining transformation logic itself.
For example, a mapping contains the following transformations:
SourceDefn -> SourceQualifier -> Aggregator -> Rank -> Expression -> TargetDefn
SUM(SAL), SUM(COMM) Group by DEPTNO 
RANK PORT on SAL 
TOTAL = SAL+COMM
Image: Pushdown Optimization Example 2




The Rank transformation cannot be pushed to the database. If the session is configured for full pushdown optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator transformation to the source, processes the Rank transformation, and pushes the Expression transformation and target to the target database.
When we use pushdown optimization, the Integration Service converts the expression in the transformation or in the workflow link by determining equivalent operators, variables, and functions in the database. If there is no equivalent operator, variable, or function, the Integration Service itself processes the transformation logic. The Integration Service logs a message in the workflow log and the Pushdown Optimization Viewer when it cannot push an expression to the database. Use the message to determine the reason why it could not push the expression to the database.


How does Integration Service handle Push Down Optimization?

To push transformation logic to a database, the Integration Service might create temporary objects in the database. It creates a temporary sequence object in the database to push Sequence Generator transformation logic. It creates temporary views in the database when pushing a Source Qualifier transformation or a Lookup transformation with a SQL override, an unconnected relational lookup, or a filtered lookup.
1. To push Sequence Generator transformation logic to a database, we must configure the session for pushdown optimization with Sequence.
2. To enable the Integration Service to create the view objects in the database, we must configure the session for pushdown optimization with View.
3. After the database transaction completes, the Integration Service drops the sequence and view objects created for pushdown optimization.
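As a rough, hypothetical illustration of what such a temporary view amounts to (the actual object name and definition are generated internally by the Integration Service), pushing a filtered lookup on a customer dimension behaves roughly like:

-- hypothetical name and definition; the real view is generated and later dropped by the Integration Service
CREATE VIEW PM_LKP_VIEW_EXAMPLE AS
SELECT CUSTOMER_KEY, SOURCE_CUSTOMER_ID
FROM CUSTOMER_DIM
WHERE ACTIVE_FLAG = 'Y'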

Configuring Parameters for Pushdown Optimization

Depending on the database workload, we might want to use source-side, target-side, or full pushdown optimization at different times, and for that we can use the $$PushdownConfig mapping parameter. The settings in the $$PushdownConfig parameter override the pushdown optimization settings in the session properties. Create the $$PushdownConfig parameter in the Mapping Designer, select $$PushdownConfig for the Pushdown Optimization attribute in the session properties, and define the parameter value in the parameter file.
The possible values may be:
1. None, i.e. the Integration Service itself processes all the transformations,
2. Source [Seq View],
3. Target [Seq View],
4. Full [Seq View]
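A minimal parameter file sketch, assuming a hypothetical folder FOLDER1, workflow wf_load_sales and session s_m_load_sales (verify the exact value strings supported by your PowerCenter version):

[FOLDER1.WF:wf_load_sales.ST:s_m_load_sales]
$$PushdownConfig=Source

With such an entry in place, changing the pushdown behaviour between runs only requires editing the parameter file, not the session properties.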


Pushdown Optimization Viewer

Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database. Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the corresponding SQL statement that is generated for the specified selections. When we select a pushdown option or pushdown group, we do not change the pushdown configuration. To change the configuration, we must update the pushdown option in the session properties.

Database that supports Informatica Pushdown Optimization

We can configure sessions for pushdown optimization against any of the following databases: Oracle, IBM DB2, Teradata, Microsoft SQL Server, Sybase ASE, or databases that use ODBC drivers.
When we use native drivers, the Integration Service generates SQL statements using native database SQL. When we use ODBC drivers, the Integration Service generates SQL statements using ANSI SQL. The Integration Service can push more functions when it generates SQL statements using the native database language instead of ANSI SQL.

Handling Error when Pushdown Optimization is enabled

When the Integration Service pushes transformation logic to the database, it cannot track errors that occur in the database.
When the Integration Service runs a session configured for full pushdown optimization and an error occurs, the database handles the errors. When the database handles errors, the Integration Service does not write reject rows to the reject file.
If we configure a session for full pushdown optimization and the session fails, the Integration Service cannot perform incremental recovery because the database processes the transformations. Instead, the database rolls back the transactions. If the database server fails, it rolls back transactions when it restarts. If the Integration Service fails, the database server rolls back the transaction.

Wednesday, September 7, 2011

Informatica Performance Tuning - Complete Guide

This article is a comprehensive guide to the techniques and methodologies available for tuning the performance of the Informatica PowerCenter ETL tool. It is a one-stop performance tuning manual for Informatica.
Informatica performance tuning is not limited to mapping tuning, nor to finding source, target, or transformation bottlenecks. Optimizing Informatica performance requires in-depth knowledge of Informatica, the underlying databases, and the host operating systems, and involves tuning at all these layers. This article attempts to discuss all these aspects in a holistic manner to enable the reader to optimally tune the performance of the Informatica ETL tool.
The following picture depicts a mind-map of various elements that we will discuss in this article in the context of performance tuning.
Fig: Mind map showing various aspects of performance tuning

This article will discuss all the above elements of tuning and much more in great detail. Let's begin.

Methods to identify and Pin-point Performance Bottlenecks

  • Identifying performance bottleneck from session log
  • Identifying performance bottleneck from "Gather Performance Data" option

TUNING Look-Up Transformation

To me, look-up is the single most important (and difficult) transformation that we need to consider while tuning performance of Informatica jobs. The choice and use of correct type of Look-Up can drastically vary the session performance in Informatica. So let’s delve deeper into this.

To cache or not to cache?

When you cache a lookup, Informatica fires the lookup query to the database and brings the entire result set from the database to the cache file directory on the Informatica server. Informatica then uses this data whenever the lookup is called. By doing this, Informatica saves the time (and effort) of going back to the database every time the lookup is called in the mapping.
In the case of an uncached (non-cached) lookup, the opposite happens: Informatica goes to the database to fetch data every time the lookup is called inside the mapping. How often the lookup is called depends on the number of records coming from the source (exactly equal to the number of source records in the case of a connected lookup).
There are a couple of things that you need to consider while opting for a particular type of lookup. I tend to follow these general guidelines:
  1. If the number of records coming from the source is much smaller than the number of records present in the lookup table, then you should consider using an uncached lookup, because fewer records from the source mean fewer database calls.
  2. If the lookup table is small (less than, say, 2 million rows), it’s generally a good idea to cache the lookup table.
The above guidelines are based on the following mathematical approach, which suggests when to use a cached lookup and when to use an uncached lookup.
N = Number of records coming from source (for simplicity, let’s assume N also equals the number of times the lookup will be called)
M = Number of records retrieved from the Lookup query
t = time required to retrieve a single record from database based on the Lookup query
T = time required to retrieve all the records from database based on the Lookup query = M * t
L = Time required to build the lookup index cache for cached lookup = f(N)
C = Network Time required to do one round trip for data fetch from database
In case of an uncached lookup, total time required will be:
Tuncached = N * t + N * C
In case of a cached lookup, total time required to build the cache will be:
Tcached = T + L + C
In the above equation we can put L = f(N), and since C is generally a very small number, we can ignore it and rewrite the above equation as –
Tcached = T + f(N)
Now if we assume that the lookup table has an index defined on it and the index is being used while retrieving records from the database, then we can further assume –
T = M * t
Replacing the above value in our earlier equation for Tcached we get –
Tcached= M * t + f(N)
We should use uncached lookup if –
Tcached > Tuncached, i.e. M * t + f(N) > N * t + N * C
Dividing both sides by N, we get –
C + t < (M/N)*t + f(N)/N
For a given Informatica and database setup, C, t & f(N) are fixed.
So the above equation is more likely to be true if (M/N) ratio is on higher side, i.e. N << M – that is – number of records coming from source is considerably less than number of records present in the lookup table (Refer back to point no. 1 in the discussion above).
Similarly, if N >> M – that is, the number of records coming from the source is considerably higher than the number of records present in the lookup table – then we should use a cached lookup.
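To make this concrete, here is a purely illustrative calculation (all numbers are assumed). Suppose N = 10,000 source rows, M = 5,000,000 lookup rows, t = 1 ms per indexed fetch, C = 2 ms per round trip, and f(N) ≈ 20 seconds to build the index cache. Then:
Tuncached = N * (t + C) = 10,000 * 3 ms = 30 seconds
Tcached ≈ M * t + f(N) = 5,000,000 * 1 ms + 20 seconds = 5,020 seconds
Here N << M, so the uncached lookup wins by a wide margin. Swap the volumes (N = 5,000,000 and M = 10,000) and the cached lookup wins just as clearly.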

Effect of suppressing default ORDER BY clause in Lookup

When Informatica fires the lookup query to the database, it appends an ORDER BY clause at the end of the query. However, you can suppress this default behaviour by appending a comment "--" at the end of the override query. You should consider the following factors regarding the default Informatica lookup ORDER BY –
  1. The ORDER BY clause may create additional processing overhead on the database. So ensure that the SELECT list of the lookup query does not include any redundant columns, because those columns will also be included in the ORDER BY.
  2. Do not completely suppress the ORDER BY.
    I recommend you always perform ORDER BY at least on the columns which are being used in the joining condition. This is because Informatica creates its own index on the joining columns and therefore if the joining columns are ordered, Informatica would require less space (and time) to create the indexes.
  3. If you completely suppress the ORDER BY from the Lookup query then Informatica will generate a Warning like below in the session log –
WARNING! Found unsorted input when building Lookup cache of [XXX_XXX_XXX] (Current number of entries in the index cache = 1). Continue building the cache by insertion. This will cause it to take longer and the index cache may require more space.
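A minimal sketch of a lookup SQL override that follows the recommendation above (table and column names are hypothetical): order explicitly on the lookup condition column only, and end the query with "--" so that the ORDER BY Informatica would otherwise append gets commented out.

SELECT CUSTOMER_KEY, SOURCE_CUSTOMER_ID
FROM CUSTOMER_DIM
ORDER BY SOURCE_CUSTOMER_ID --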

Use of Persistent Lookup Cache

A persistent cache lookup can be the Holy Grail of cached lookups – if you know how to use it. Generally speaking, if the source data in the underlying lookup tables does not change between consecutive session runs, then you may use a persistent lookup cache. The section Implementing Informatica Persistent Cache above describes the detailed implementation process.
In short, to use a persistent cache, you have to check the following options in Lookup Transformation properties tab:
  • Lookup caching enabled
  • Lookup cache persistent
Once you do that, the cache file created by the Informatica session will *NOT* be deleted from the Cache directory, and the same cache file will be used in all consecutive runs. The advantage of doing this is that you need not spend time building the same cache every time the session executes. However, if the source data for the lookup changes in the meantime, then you must refresh the cache by either of the following two options:
  • Delete the cache file manually from the Cache directory Or,
  • Check the “Re-cache from lookup source” option in the Properties tab of the lookup
If the persistent cache lookup is reusable, then you can share the same lookup across multiple mappings without rebuilding the cache in each one of them. You can also have one additional mapping with the re-cache option enabled for this lookup, which you can run whenever you need to refresh the cache file.
Also note that there are some disadvantages of using persistent cache lookups. For details, see the section The benefit and disadvantage of Informatica Persistent Cache Lookup above.
Without going much deeper in to the disadvantage part, I would like to mention that there is one particular disadvantage that you must take into account while opting for persistent cache option for big lookup tables. If the cache file size of your lookup table is more than 2GB, most likely Informatica will create multiple cache files for one lookup wherein maximum file size for each file will be 2GB. If those are data cache files, Informatica will name them as .dat1, .dat2, .dat3 … etc. and corresponding index cache files will be named as .idx1, .idx2, .idx3 … etc.
Also note that in many flavors of UNIX (e.g. HP-UX 11i), NOLARGEFILES is a default option for the file system. This option prevents applications or users from creating files larger than 2GB. You can check whether the LARGEFILE option is enabled on your server by issuing the following command:
getconf FILESIZEBITS /mountpoint_name  
However, please note that irrespective of whether the LARGEFILE option is enabled or disabled, Informatica will not create a cache file sized above 2GB (this is true for both 32-bit and 64-bit versions of Informatica).
My personal opinion is that breaking a single data or index cache file into multiple files may slow down lookup performance. Hence, if your lookup cache size is more than 2GB, consider, if possible, joining to the lookup source table at the database level itself instead of building a lookup cache, as sketched below.
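A hedged sketch of that alternative, with hypothetical table and column names: instead of a lookup on CUSTOMER_DIM, the join is pushed into the Source Qualifier SQL override so that the database resolves the surrogate key.

SELECT T.TRANSACTION_ID, T.CUSTOMER_ID, T.SALES_AMOUNT, C.CUSTOMER_KEY
FROM TRANSACTIONS T
LEFT OUTER JOIN CUSTOMER_DIM C
  ON C.SOURCE_CUSTOMER_ID = T.CUSTOMER_ID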

Effect of choosing connected OR Unconnected Lookup

If all the other factors remain the same, the choice between connected and unconnected lookup can impact lookup performance in one particular case. In the case of a connected lookup, since the lookup sits in the data flow pipeline, it is called for each record coming from the source, irrespective of whether the data returned from the lookup is actually used later in the mapping. In the case of an unconnected lookup, this can be controlled by calling the lookup only when it is really needed.
To illustrate the difference, let’s consider the following example. Suppose you are loading sales data and if the sales is done in one specific geography location (say North America), then you want to populate one additional column in your target table with GST amount (which is based on which American State the sales was made). Suppose one record in the source is like this:
SalesID = 100067345
    SalesLocation = CA
    Geography = North America
    SalesAmount = 100.54
    ProductPartNo = PX937
In the case of a connected lookup, the lookup will be called for all the records, irrespective of whether the sales geography is North America or not. In the case of an unconnected lookup, you can write something like this inside an Expression transformation, which ensures the lookup is called only when the sales geography is North America:
IIF (SALES_GEOGRAPHY = 'North America', :LKP.LKP_GET_GST_PCT_PER_STATE(SALES_LOCATION), 0)
Suppose out of 1,000,000 records, the above condition matches only for 100,000 records (10%). Then you save calling the lookup unnecessarily for 90% of cases.

WHEN TO set Pre-build Lookup Cache OPTION (AND WHEN NOT TO)

Informatica may build the lookup cache even before the first row reaches the Lookup transformation (Pre-build scenario) or it may build the lookup cache only when the first row reaches the lookup transformation (On demand scenario). You can let Informatica decide what it wants to do (pre-build or on demand) or you may explicitly tell Informatica what you want to do. All you need to do is set the “Pre-Build Lookup Cache” option to “Always allowed” under the lookup transformation Properties tab to force Informatica build all the lookup caches beforehand.
So what’s the difference? How does it impact performance? Generally speaking, pre-building all the lookup caches beforehand may give you an occasional performance boost: once the source qualifier starts pulling data, the processing of that data need not wait any more for the lookup caches to finish building.
Performance boost of pre-building lookup caches is more apparent in cases where the source query itself is taking considerable amount of time to return rows so the same time can be simultaneously used to pre-build all the lookup caches instead of waiting for the source to return results and then starting to build other lookup caches on demand.
I said the performance benefit is occasional because there are a number of other factors that need to be considered before setting Pre-Build Lookup Cache to "Always allowed".
  • When Informatica pre-builds the lookup cache, how many caches it can build concurrently depends on various factors including “Additional Concurrent Pipelines for Lookup Cache Creation” property present in session level under “Config Object” tab. I will discuss this in more detail later below.
  • One should also check if the database where Informatica is firing the Lookup queries has enough capacity to process the Lookup queries simultaneously without hampering the individual query performance. This depends on various database level properties especially on database TEMP space availability, session level connection restrictions and degree-of-parallelism (DOP)
  • There can be scenarios where not all the lookups even need to be called (especially if the lookup is unconnected and called based on a condition – see the section “Effect of choosing connected OR Unconnected Lookup” above). There is really no point in pre-building those lookups; they should be built on demand only. This is also why the “Pre-Build Lookup Cache” option is available at the individual lookup level and not as a generic option at the session level.

USE OF Additional Concurrent Pipeline for Lookup Cache Creation OPTIONS

To me, this option which is available in session level under “Config Object” tab, is more of a limiting option rather than a performance augmenting option. This option determines how many additional concurrent pipelines Integration service can use while trying to build lookup caches simultaneously. I prefer setting it to “Auto”.
If you set it to Auto, Informatica will decide the value at run time while building the caches concurrently. Below I describe how Informatica will typically behave when you set this to “Auto”. Suppose one mapping has 4 lookup transformations that Informatica decided to build concurrently. It will give a message like below first in the session log:
Enabled using [2 (auto)] additional concurrent pipelines to build lookup caches. (Session likely will build or refresh [4] lookup caches; [1] on-demand only)
The above message means that Informatica has already started 2 pipelines to handle 2 of the lookups, so 2 more lookups are yet to be built. When Informatica starts building those 2 lookups, depending on whether the pipelines already allocated for the first 2 lookups are free by that time or not, it will generate one of the two messages below, respectively:
Starting additional concurrent pipeline to build the lookup cache needed by Lookup transformation [LKP_XXX_XXX_XXX]
Or
Lookup cache of [LKP_XXX_XXX_XXX], previously started by an additional concurrent pipeline, is now needed on demand.
The first message will appear if –
  • Informatica decides that it can “afford” one more additional pipeline and
  • No additional pipeline allocated previously for other lookups is yet free at the time of building this lookup
Now that we understand how the additional pipelines behave, let’s see how we can use this property to augment session performance. As I described before, the best thing to do is to set it to “Auto”, which will use the maximum possible number of pipelines whenever lookups need to be built concurrently. However, if you want to limit the concurrency, you can specify a number there, which means no more additional pipelines will be created once the count reaches the specified number; if another pipeline is required, Informatica waits for a current pipeline to free up so that it can be reused. If you set it to 0 (zero), all lookups will be built serially.

Tuning JOINER Transformation

Joiner transformation allows you to join two heterogeneous sources in the Informatica mapping. You can use this transformation to perform INNER and OUTER joins between two input streams.
For performance reasons, I recommend you use the JOINER transformation ONLY if any of the following conditions is true –
  1. Data sources are not relational in nature (joining two flat files)
  2. Data sources are heterogeneous in nature (e.g. Flat File and Database)
  3. Data sources are coming from different relational connections
  4. One or more column(s) used in the joining condition(s) of the JOINER is derived in the mapping
Except for the above mentioned conditions, it is better to use database-side joins. To know why, please read –
Database performs join faster than Informatica
Although the article in the above link is specific to the Oracle database, the conclusion holds true for most other databases.
However, if you have to use the joiner transformation, following are the additional points/actions that you must consider:
  • When joining two data sources, treat the data source containing the smaller number of records as the Master. This is because the cache size of the Joiner transformation depends on the master data (unless sorted input with the same source is used).
  • Ensure that both the master and detail input sources are sorted, and that both the “Sorted Input” and “Master Sort Order” properties are checked and set
  • Consider using cache partitioning for Joiner transformation if you have partition option available under your Informatica license. More details on this topic later
  • Check if the Data and Index cache sizes can be configured. More details on this next

Understanding Joiner Cache

Joiner Transformation needs a space to store the cache for the data and index. This cache can be either memory cache (stored in RAM) or disk cache (stored in hard drive disks) or both depending on various factors which I won’t discuss now. Obviously the memory cache is much faster than the disk cache. So enough system memory should be available to Informatica for faster Join operation. One can actually configure the amount of memory for Joiner data cache and index cache by the following two options under Joiner Transformation Properties:
  1. Joiner Data Cache Size
  2. Joiner Index Cache Size
If you click on these properties under the “Mapping” tab of the session, you can access the “Joiner Cache Calculator”, a small tool built into the Workflow Manager for calculating the required cache sizes for the join operation. You can use the values suggested by this calculator for the joiner data and index caches, or you can leave them as AUTO. If you do not leave them as Auto and input some values in those boxes, you must ensure that the allocated memory is available while the session executes. Otherwise the session will fail.
I prefer to leave Joiner Data and Index Cache sizes parameters to Auto and set the maximum values for the auto memory attributes in the session level. To know why, please read on.

Partitioning the Joiner cache

If the session containing the joiner is already partitioned, then one can take advantage of cache partitioning for the Joiner. As the name suggests, the joiner cache itself gets divided into different partitions in this case. The benefit is that Informatica accesses the cache in parallel for each partition, which is faster than accessing the full cache serially.
In order to achieve this cache partitioning, you need to create a partition point on the Joiner transformation by choosing the “Add Partition Point” option at the session level under the Mapping::Partitions tab. Then you can edit that partition point to add more than one Hash Auto-Keys partition to it. If you add 4 hash auto-key partitions to the Joiner transformation, then, by default, Informatica will automatically add 4 pass-through partitions in the source qualifier transformations of *both* the master and detail pipelines.
The benefit of choosing Hash Auto Keys partitioning in the Joiner transformation is that you need NOT explicitly tell Informatica how to divide the incoming data into individual partitions at the Joiner level. You set it to Hash Auto Keys and forget it; Informatica takes care of the rest.
However, as soon as you add Hash Auto Keys partitions at the Joiner level, your source qualifiers also get pass-through partitioned. Here you may override the Source Qualifier query and specify individual SQL queries for each partition at the source qualifier transformation level. Suppose your original source-side SQL query is like below:
SELECT AccountID, TransactionType, TransactionAmount 
FROM Transactions
You can override the above query for each partition level like below:
Partition #1:
SELECT AccountID, TransactionType, TransactionAmount 
FROM Transactions 
WHERE AccountType = 'SAVINGS'
Partition #2:
SELECT AccountID, TransactionType, TransactionAmount 
FROM Transactions 
WHERE AccountType = 'CURRENT'
Partition #3:
SELECT AccountID, TransactionType, TransactionAmount 
FROM Transactions 
WHERE AccountType = 'CHECK-IN'
Partition #4:
SELECT AccountID, TransactionType, TransactionAmount 
FROM Transactions 
WHERE AccountType = 'DEMAT'
The above method ensures that each of your source qualifier partitions is populated with a different set of data.
Alternatively you may also change the partition type in source qualifier level from Pass-Through to “Key range” and specify start and end range of values for each partition. You can also change the partition type to “Database Partitioning” if your source database is Oracle or DB2.
Another important point to note here is – if you add a SORTER transformation before the Joiner (which you should always do if your data is not already sorted from source), then you should also consider creating partition points and adding the same number of partitions to the Sorter transformation. If the partition type at the Sorter level is Hash Auto Keys, then you need not add any partition point at the Joiner transformation level.
Based on whether the joiner input data is sorted or not and whether the cache is partitioned or not, Informatica creates the following number of cache(s):
  • Data sorted: one cache on disk; caches in memory equal to the number of partitions
  • Data unsorted / not partitioned: one cache on disk; one cache in memory
  • Data unsorted / partitioned: caches on disk equal to the number of partitions; caches in memory equal to the number of partitions

So, this is all about tuning a joiner transformation.

Tuning Aggregator Transformation

Like the Joiner, the basic rule for tuning the Aggregator is to avoid the Aggregator transformation altogether unless –
  1. You really cannot do the aggregation in the source qualifier SQL query (e.g. flat file source)
  2. Fields used for aggregation are derived inside the mapping
If you have to do the aggregation using the Informatica Aggregator, then ensure that all the columns used in the group by are sorted in the same order as the group by, and that the “Sorted Input” option is checked in the aggregator properties. Ensuring the input data is sorted is an absolute must in order to achieve better performance, and we will soon see why.
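A minimal sketch, reusing the EMP_SRC example from the pushdown section and assuming the aggregator groups by DEPTNO only: the Source Qualifier query sorts on exactly the group-by column, in the same order, so that “Sorted Input” can be safely enabled on the aggregator.

SELECT DEPTNO, SAL, COMM
FROM EMP_SRC
ORDER BY DEPTNO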
Other things that need to be checked to increase aggregator performance are –
  1. Check if “Case-Sensitive String Comparison” option is really required. Keeping this option checked (default) slows down the aggregator performance
  2. Enough memory (RAM) is available to do the in memory aggregation. See below section for details.
  3. Aggregator cache is partitioned

How to (and when to) set aggregator Data and Index cache size

As I mentioned before, my advice is to leave the “Aggregator Data Cache Size” and “Aggregator Index Cache Size” options as Auto (default) at the transformation level and, if required, set either of the following at the session level (under the “Config Object” tab) to allow Informatica to allocate enough memory automatically for the transformation:
  1. Maximum Memory Allowed For Auto Memory Attributes
  2. Maximum Percentage of Total Memory Allowed For Auto Memory Attributes
However, if you do have to set the Data Cache / Index Cache size yourself, please note that the value you set is actually a RAM requirement (not a disk space requirement); hence your session will fail if Informatica cannot allocate the entire memory in RAM at session initialization. And yes, this can happen often, because you never know what other jobs are running on the server and how much RAM those jobs are occupying while you run this job.
Having understood the risk, let’s now see the benefit of manually configuring the index and data cache sizes. If you leave them at Auto and Informatica does not get enough memory during the session run, your job will not fail; instead Informatica will page out the data to hard disk. Since the I/O performance of a hard disk drive is roughly 1000 times slower than RAM, paging out to disk carries a performance penalty. By setting the data and index cache sizes manually, you ensure that Informatica blocks this memory at the beginning of the session run so that the cache is not paged out to disk and the entire aggregation actually takes place in RAM. Do this at your own risk.
Manually configuring index and data cache sizes can be beneficial if ensuring consistent session performance is your highest priority compared to session stability and operational steadiness. Basically you risk your operations (since it creates high chance of session failure) to obtain optimized performance.
The best way to determine the data and index cache sizes is to check the session log of an already executed session. The session log clearly shows these sizes in bytes. But these sizes depend on the row count, so keep some buffer (around 20% in most cases) on top of them and use those values for the configuration.
The other way to determine the index and data cache sizes is, of course, to use the built-in cache-size calculator accessible at the session level.
Fig. Aggregator - Cache Calculator
Using the Informatica Aggregator cache size calculator is a bit difficult (and rather inaccurate). The reason is that, to calculate the cache size properly, you need to know the number of groups that the aggregator is going to process. The number of groups is defined as below:
No. of Groups = product of the cardinalities of all the group-by columns
This means that if you group by store and product, and there are 150 distinct stores and 10 distinct products, then the number of groups will be 150 x 10 = 1500.
This is inaccurate because, in most cases, you cannot ascertain how many distinct stores and products will actually come in each load. You might have 150 stores and 10 products, but there is no guarantee that all the products will appear in every load. Hence the cache size you determine with this method is quite approximate.
You can, however, calculate the cache size using both of the methods discussed here and take the maximum of the two values to be on the safer side.

Sunday, June 19, 2011

ORACLE-CONDITIONAL SELECTIONS AND OPERATORS-2


We have two clauses used in this:
  • Where
  • Order by

USING WHERE

Syntax:
     select * from <table_name> where <condition>;
     The following are the different types of operators used in the where clause.

  • Arithmetic operators
  • Comparison operators
  • Logical operators

  • Arithmetic operators          -- highest precedence
    +, -, *, /
  • Comparison operators
    =, !=, >, <, >=, <=, <>
    between, not between
    in, not in
    is null, is not null
    like
  • Logical operators
    and
    or                            -- lowest precedence
    not

a) USING =, >, <, >=, <=, !=, <>
   
     Ex:
        SQL> select * from student where no = 2;


        NO NAME            MARKS
        ---  -------           ---------
         2   Saketh            200
         2   Naren             400
        
        SQL> select * from student where no < 2;

        NO NAME            MARKS
        ---  -------           ----------
         1   Sudha             100
         1   Jagan             300

        SQL> select * from student where no > 2;

         NO NAME            MARKS
         ---  -------           ----------
         3   Ramesh
         4   Madhu
         5   Visu
         6   Rattu

         SQL> select * from student where no <= 2;

         NO NAME            MARKS
         ---  -------           ----------
         1   Sudha             100
         2   Saketh            200
         1   Jagan             300
         2   Naren             400
      
         SQL> select * from student where no >= 2;

         NO NAME            MARKS
         ---  -------           ---------
         2   Saketh            200
         2   Naren             400
         3   Ramesh
         4   Madhu
         5   Visu
         6   Rattu

         SQL> select * from student where no != 2;

         NO NAME            MARKS
         ---  -------           ----------
         1   Sudha             100
         1   Jagan             300
         3   Ramesh
         4   Madhu
         5   Visu
         6   Rattu

         SQL> select * from student where no <> 2;

         NO NAME            MARKS
         ---  -------           ----------
         1   Sudha             100
         1   Jagan             300
         3   Ramesh
         4   Madhu
         5   Visu
         6   Rattu

b) USING AND
    
     This will give the output when all the conditions are true.
    
     Syntax:
          select * from <table_name> where <condition1> and <condition2> and .. <conditionn>;
    
     Ex:

         SQL> select * from student where no = 2 and marks >= 200;


         NO NAME            MARKS
         ---  -------           ---------
         2   Saketh            200
         2   Naren             400

c) USING OR

     This will give the output when either of the conditions is true.

     Syntax:
         select * from <table_name> where <condition1> or <condition2> or .. <conditionn>;

     Ex:
         SQL> select * from student where no = 2 or marks >= 200;

         NO NAME            MARKS
         ---  -------           ---------
         2   Saketh            200
         1   Jagan             300
         2   Naren             400

d) USING BETWEEN

     This will give the output based on a column, using its lower bound and upper bound.

     Syntax:
         select * from <table_name> where <col> between <lower bound> and <upper bound>;

     Ex:
         SQL> select * from student where marks between 200 and 400;

         NO NAME            MARKS
         ---  -------           ---------
         2   Saketh            200
         1   Jagan              300
         2   Naren              400

e) USING NOT BETWEEN

     This will give the output based on a column, for values that are not between its lower bound
     and upper bound.

     Syntax:
     select * from <table_name> where <col> not between <lower bound> and <upper bound>;

     Ex:
         SQL> select * from student where marks not between 200 and 400;

         NO NAME            MARKS
         ---  -------           ---------
         1   Sudha             100