Wednesday 18 February 2015

Informatica Best Practices - Part 2

  • Limit the number of rows as early in the mapping as possible (SQ, aggregators, joiners, filters) - the more you limit at the beginning of the data flow, the fewer rows are processed by all downstream transformations.
  • For generating IDs use IDENTITY columns or DB triggers. There are a number of reasons:

    1. fewer PowerCenter objects reduce development time and maintenance effort
    2. IDs are PowerCenter-independent, so any other app used to populate the target table will not cause issues
    3. while migrating the code between environments there is no need to worry about the sequence value
  • If for some reason you need the ID on the PowerCenter side, use a Sequence Generator rather than a Stored Procedure call, for performance reasons - the Sequence Generator is really fast as it simply hands out numbers from Integration Service memory. A Stored Procedure call requires connecting to the DB engine, invoking the Stored Procedure, and sending the number back to the IS. The communication overhead is huge and may heavily impact performance; it is also affected by connection speed and distance.
    If you’re afraid of losing the generated sequence number (which can happen when exporting/importing XML without the Retain Sequence Generator Value option), store the last ID in the database, use it as an offset added to the sequence value, and update it with the last ID generated this way.
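The persistence pattern above can be sketched in plain Python (this is an illustration, not PowerCenter code; the control table and its columns, `seq_control` and `last_id`, are hypothetical names for this example):

```python
import sqlite3

def init_control(conn):
    # hypothetical control table holding the last ID issued per sequence
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seq_control "
        "(seq_name TEXT PRIMARY KEY, last_id INTEGER)"
    )

def next_id_block(conn, seq_name, block_size):
    """Return the next block of IDs and persist the last one generated."""
    cur = conn.execute(
        "SELECT last_id FROM seq_control WHERE seq_name = ?", (seq_name,)
    )
    row = cur.fetchone()
    start = row[0] if row else 0
    ids = list(range(start + 1, start + 1 + block_size))
    # store the last generated ID so the next run (or a freshly imported
    # workflow) continues from the right place
    conn.execute(
        "INSERT OR REPLACE INTO seq_control (seq_name, last_id) VALUES (?, ?)",
        (seq_name, ids[-1]),
    )
    conn.commit()
    return ids
```

The key point is that the start value survives independently of the mapping's own state, so an XML export/import cannot lose it.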

  • There should be no unused ports in any transformation (except Source Qualifier) - the Source Qualifier should match the Source Definition 1-to-1, but in any other transformation there should be no unused ports. Why fetch and process data through half of the data flow, using extra memory, if you don’t need it?

  • Avoid implicit conversion, as it is slower and might lead to errors - there are a number of issues with implicit conversion (e.g. when you link an SQ string port to a decimal port in the following Expression):

    1. it’s simply not visible - anyone checking the mapping will not notice that any datatype conversion takes place
    2. the conversion might not behave as expected (e.g. rounding, trailing/leading spaces or zeros, etc.)
    3. there may be unhandled conversion errors
    Therefore use the appropriate conversion functions (e.g. TO_DECIMAL) explicitly.
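The same idea in plain Python (an analogy for an explicit TO_DECIMAL-style conversion, not Informatica syntax): the conversion is visible, trimming is deliberate, and conversion errors are handled instead of surfacing as unexplained row failures.

```python
from decimal import Decimal, InvalidOperation

def to_decimal(value, default=None):
    """Explicit, visible string-to-decimal conversion with error handling."""
    try:
        return Decimal(value.strip())   # explicit trim + convert
    except (InvalidOperation, AttributeError):
        return default                  # explicit fallback instead of a row error

print(to_decimal(" 12.30 "))            # Decimal('12.30') - scale preserved
print(to_decimal("abc", Decimal(0)))    # falls back to 0
```

Note that `Decimal` keeps the trailing zero, which a silent conversion through a float would not guarantee.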

    General:

    1. Remove all the 'Error (Transformation error)' messages from default values in Expression transformations - you will avoid unwanted error messages in the logs. These defaults are not checked at session initialization.
    2. Consider adding an Expression transformation right after each SQ and right before each Target - pass-through Expressions do not cause additional computation, and I couldn’t count how many times I had to make a “little change” before or after the whole data flow in an existing mapping. Having this dummy Expression transformation helps a lot. It’s also very useful when relinking all ports after some changes (using Autolink by name).


    Aggregator:

    1. Use as early as possible - aggregation limits the number of rows processed by every transformation downstream.
    2. Use 'Sorted Input' only if the data is already sorted - do NOT sort just for aggregation. This one is a bit tricky: an Aggregator with the “Sorted Input” property works faster and uses very little cache, but aggregating data does not require sorting. Therefore, when dealing with unsorted data, an Aggregator alone (without Sorted Input, large cache) will perform better than a Sorter (large cache) plus an Aggregator (with Sorted Input, small cache). If you do add the Sorter, the cache does not disappear - it is simply created for the Sorter instead. Adding a Sorter is a good idea only if you need the data sorted anyway.


    Expression:

    1. Use local variables for common computation (especially if complex) - if you need to perform the same computation for many output ports (e.g. datatype conversion on one of the input ports), you can define a variable port doing it once (e.g. TO_DECIMAL(input_port)) and use that port in the expressions of the output ports. This way the work is done once per row. This is especially important when the common logic is complex, because:
      1. the performance gain is even greater
      2. it’s easier to maintain - there is one place to check for errors and implement changes
    2. Avoid local variables if there is no reuse opportunity - if an output port does some computation that is not shared with other output ports, put it in the output port expression. Do not add a variable port: an Input -> Variable -> Output chain hides the expression from the Mapping Designer (i.e. it cannot be seen without opening the Expression transformation).
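A plain-Python analogy of variable ports (the port names `v_amount`, `out_net`, `out_tax` are invented for the example): the shared conversion is computed once per row and reused by every "output port", instead of being repeated in each output expression.

```python
from decimal import Decimal

def compute_row(amount_str, rate_str):
    """One 'row': the conversion (the common computation) happens once."""
    v_amount = Decimal(amount_str)   # variable-port analogue, done once
    v_rate = Decimal(rate_str)
    return {
        "out_net": v_amount * (1 - v_rate),   # both outputs reuse v_amount
        "out_tax": v_amount * v_rate,
    }
```

If the conversion logic ever changes, there is exactly one place to edit - the same maintenance benefit the variable port gives you.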

    Filter / Router:

    1. Do any complex expressions in an Expression transformation before the filter / router - this is especially important in the case of a Router: calculate once, then compare, instead of calculating many times. Remember that for every input row all group expressions are evaluated (a Router can have many outputs for one input row). It is a good idea to do the same for a Filter, as for example you can use variable ports in an Expression, which is not possible in a Filter.
    2. Keep in mind that the Router evaluates all group expressions against each input row
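A small Python sketch of why this matters: an expensive expression embedded in N group conditions runs N times per row, while precomputing it (as an upstream Expression transformation would) runs it once per row. The call counter makes the difference visible.

```python
calls = 0

def expensive_score(row):
    """Stand-in for a costly expression used in router group conditions."""
    global calls
    calls += 1
    return sum(row) % 100

rows = [(1, 2), (3, 4), (5, 6)]

# naive "router": each of the 3 group conditions recomputes the expression
for row in rows:
    groups = [expensive_score(row) < 10,
              expensive_score(row) < 50,
              expensive_score(row) < 90]
naive_calls = calls          # 3 rows x 3 groups = 9 evaluations

# better: compute once per row, then compare cheaply in each group
calls = 0
for row in rows:
    score = expensive_score(row)
    groups = [score < 10, score < 50, score < 90]

print(naive_calls, calls)    # 9 vs 3
```

The second variant mirrors the recommended mapping layout: Expression (compute) followed by Router (compare only).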


    Floating point data:

    1. When to use? Only for scientific-style calculations (e.g. physics), where perfect accuracy is not critical. Keep in mind that floating point numbers are not exact! E.g. 0.1 is stored as ~0.0999999999999999991, which is very close to 0.1, but not equal. You can end up with 0 stored as -0.00000001 (a real-life example). This can cause a lot of trouble when selecting all rows greater than or equal to zero.
      For financial, quantitative - in fact, almost any - calculations, use decimals.
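The trap described above is easy to reproduce in any language with IEEE 754 doubles; here it is in Python, together with the decimal arithmetic that behaves as expected:

```python
from decimal import Decimal

# Mathematically this is exactly zero, but 0.1, 0.2 and 0.3 have no exact
# binary representation, so the result is a tiny NEGATIVE residue.
t = 0.3 - 0.1 - 0.2
print(t)            # about -2.8e-17, not 0
print(t >= 0)       # False - the "select all rows >= 0" trap

# decimal arithmetic is exact here
d = Decimal("0.3") - Decimal("0.1") - Decimal("0.2")
print(d == 0)       # True
```

A filter like `amount >= 0` silently drops such a row even though the "real" value is zero, which is exactly the real-life problem mentioned above.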


    Lookup:

    1. Should have only the ports that are actually used - the more ports in a Lookup, the more data is fetched, and the more time and memory it takes to build the cache.
    2. Limit the number of rows fetched using a filter - use the Lookup Source Filter property. This limits the amount of data to fetch and keep in memory.

    Parameter files:

    1. Use one parameter file per workflow - there are many ways to use parameters and parameter files: you can define general parameters for all sessions in a workflow, or have each session use its own parameter file. What is the best way? There is no simple answer, and some purposes might need a different setup. In general, however, it’s good to have one parameter file per workflow: you won’t affect other workflows when changing the parameters, and the file stays quite simple (in most cases).
    2. Consider auto-generating parameter files from values stored in a DB - workflows can start (or end) with a generic session creating the parameter file. This allows easy maintenance of all parameter files and parameters, checks for naming conventions, etc.
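A minimal sketch of such a generator in Python (the control table `wf_params` and the folder/workflow names are hypothetical; the `[Folder.WF:workflow]` header is the section format PowerCenter expects in workflow parameter files):

```python
import sqlite3

def render_param_file(conn, folder, workflow):
    """Render a PowerCenter-style parameter file for one workflow
    from values kept in a hypothetical wf_params control table."""
    rows = conn.execute(
        "SELECT param_name, param_value FROM wf_params "
        "WHERE workflow = ? ORDER BY param_name",
        (workflow,),
    ).fetchall()
    lines = [f"[{folder}.WF:{workflow}]"]          # workflow-level section header
    lines += [f"{name}={value}" for name, value in rows]
    return "\n".join(lines) + "\n"
```

Keeping the values in a table makes it trivial to audit all parameters in one place and to enforce naming conventions with a simple query.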


    Sorter:

    1. Avoid if possible - do not sort the data unless it is really needed.
    2. Use as soon as possible - if the sorting is needed, sort data early in the mapping and try to take advantage of it using Sorted Input for as many transformations downstream as possible.
    3. Use as long as possible - try to arrange the mapping so that all transformations sharing the same sort order come one after the other. Avoid re-sorting separately for every transformation.
    4. Take advantage of DB sorting if a covering index is available - if a covering index is available in the DB, sort the data in the Source Qualifier and use Sorted Input in the transformations downstream.
    5. Don’t:
      1. Sort just for the Aggregator - although an Aggregator with “Sorted Input” checked works faster and uses (almost) no cache, it is not worth sorting data just for aggregation. This only moves the cache requirement from the Aggregator to the Sorter and in fact slows down overall performance, as the aggregation algorithm doesn’t require sorted input.
      2. Sort for the Joiner (especially with two Sorters!) - do not use two Sorters for the Detail and Master groups of a Joiner transformation. In such a case both pipelines require caching, while for unsorted data only the Master input is cached.

    Source Qualifier:

    1. Connect only the necessary ports from the SQ - the Source Qualifier determines what really needs to be fetched from the source. While it should match the Source Definition 1:1 with all ports linked (to avoid future mistakes and misunderstandings), it reads only the data for ports linked to the next transformation. This can be observed when generating the SQL statement (Properties -> Sql Query -> Generate SQL) - only the used ports are listed in the statement. This is especially important for wide sources, as the amount of data transferred can be limited this way.
    2. Avoid SQL override statements - an SQL override takes precedence over all other settings (i.e. port order, sorting, filters). It is also not visible at first glance and reduces code transparency, so it should be avoided. If you need to:
      1. sort the data - use "Number Of Sorted Ports" property with the proper ports order
      2. filter the data - use "Source Filter" property
      3. join multiple homogeneous sources - use "User Defined Join" property
    An SQL override should be used only when complex statements are needed (e.g. DB-specific functions or a subquery).
    The above applies (in general) to the Lookup transformation as well.

    Source Definition:

    1. All ports between the Source Definition and the SQ should be connected - as mentioned above, it is good to have the SQ matching the Source Definition, as it reduces the number of future mistakes.
    2. All the datatypes should match the Source Definition - in general, implicit datatype conversion should be avoided, as it may lead to errors and unexpected behavior.

    Update Strategy:

    1. Don’t use DD_REJECT - don’t process unneeded data; use a Filter transformation to drop unneeded rows instead. Use DD_REJECT (with “Forward Rejected rows”) only if you really need the rows in the .bad file.
