Thursday, 10 December 2015

12/10/2015 08:23:00 am

SQL Override Pros and Cons

This is one of the most frequently asked interview questions for an Informatica developer: "What are the pros and cons of using a SQL override?"




Pros 


  1. We can express complex join and filter conditions easily, and the query is executed directly at the database level instead of being generated by Informatica.
  2. We can tune the query to use indexes and get a better execution plan, and we can also add optimizer hints (see the sample override below).
  3. Some logic is much easier to express with advanced SQL features such as TOP, PIVOT or CTEs; achieving the same result with native Informatica transformations takes considerably more effort.
  4. When migrating from a legacy system to Informatica, we can often reuse the existing SQL queries directly in the override, which saves a lot of effort.
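To make pros 2 and 3 concrete, here is the kind of override you might put in the Source Qualifier - a sketch only, with invented table, column and index names - using an optimizer hint and a CTE, both of which are awkward to reproduce with plain transformations:

WITH recent_orders AS (
    -- CTE: latest completed order per customer, with an Oracle-style index hint
    SELECT /*+ INDEX(o idx_orders_cust) */
           o.customer_id,
           MAX(o.order_date) AS last_order_date
    FROM   orders o
    WHERE  o.order_status = 'COMPLETE'
    GROUP  BY o.customer_id
)
SELECT c.customer_id,
       c.customer_name,
       r.last_order_date
FROM   customers c
JOIN   recent_orders r
  ON   r.customer_id = c.customer_id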

Cons

  1. The number, order, datatype and precision of the columns returned by the override must match the ports connected in the downstream transformation, even though the column names themselves do not have to match.
  2. We lose the ability to perform source-side pushdown optimization the moment we use an override.
  3. Writing complex queries in the override hides logic inside SQL and complicates the code, which goes against the fundamental idea of using an ETL tool.



12/10/2015 02:44:00 am

Control-M Character in Files -- Everything Explained

This is a problem that every IT person faces when dealing with files across platforms. If you haven't hit it yet, I can guarantee you will one day.



There is a lot of information on the internet about this, but it is rarely explained clearly. I will try to put everything about the Control-M character in one place.

First of all, it is important to understand

"Line Endings" in different operating systems

A line ending is the character (or pair of characters) written to a file when you press the Enter key, and the major operating systems interpret it differently:

OS             Line Ending
Windows        CR followed by LF (CR/LF)
Unix           LF
Mac (classic)  CR

(Modern macOS uses LF, like Unix.)

CR = \r = Carriage Return
LF = \n = Line Feed


Why Does the Control-M Character Occur

Suppose we open, under Unix, a text file that was created under Windows. The text characters in the first line are all displayed correctly. At the end of the line we find a CR. This means nothing special to Unix, so Unix attempts to display the character. CR is a non-printable character, and under Unix many non-printable characters are mapped to control characters when displayed. In the case of CR, its displayable equivalent is Control-M, which most editors show as ^M.
The next character is an LF. Unix is fine with it, as it is the standard line-end character, so the editor moves to the start of the next line on the display and starts processing the next line.

How to View the Control-M Character in Unix

The following shows all tabs, vertical tabs, carriage returns, line feeds and other non-printable characters using backslash notation:
od -c Yourfile.txt

If you only want to see the Control-M character:

cat -v filename.txt
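If you just want to check whether a file contains carriage returns at all, counting the affected lines also works (assuming bash and GNU grep):

grep -c $'\r' filename.txt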


Possible Solutions

  1. Remove the character in the vi editor: open the file in vi and run :1,$s/^M//g . The ^M character is entered by pressing Ctrl-V followed by Ctrl-M (not by typing a caret and then M).
  2. Use a Unix command: there are various commands available to remove the Control-M character, e.g.  1) tr -d '\r' < infile.txt > outfile.txt  2) dos2unix < DOSfile.txt > Unixfile.txt
  3. Fix during FTP: the Control-M character will not appear if the file is transferred in ASCII mode, because the FTP client converts the line endings.
If you need to process such a file in Informatica, you can call the commands above in a pre-session command to remove the Control-M character.
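As a sketch of such a pre-session command (GNU sed; the file names are placeholders), sed can strip the carriage returns as well:

sed 's/\r$//' infile.txt > outfile.txt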

I hope it is now clear why the Control-M character occurs and how to deal with it.




Thursday, 12 November 2015

11/12/2015 01:53:00 pm

All about sftp

SFTP (SSH File Transfer Protocol) is a protocol packaged with SSH that works in a similar way to FTP, but over a secure connection. The advantage is the ability to leverage that secure connection to transfer files and traverse the file system on both the local and the remote system.

sftp performs all operations over an encrypted ssh session. It uses many of the features of ssh, such as public key authentication and data compression.





sftp can be used in four main ways:




Interactive session: in this mode, sftp connects and logs into the specified host, then enters its interactive command mode, where you type commands at a prompt. To launch an interactive session of sftp, use the following syntax:

sftp user@server.myhost.com


Automatic retrieval mode: we can use sftp to retrieve files automatically, without entering the interactive prompt, by giving the remote path on the command line.
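For example (the host and paths are placeholders), the following retrieves a single file and exits without ever showing the interactive prompt:

sftp user@server.myhost.com:/remote/dir/file.txt /local/dir/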


Batch mode: we can put sftp commands into a batch file and run them non-interactively, and such scripts can be scheduled from cron jobs.
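A minimal sketch (the host, paths and file names are invented): put the commands into a file, say sftp_batch.txt, containing

cd /remote/inbox
get daily_extract.txt /local/landing/daily_extract.txt
bye

and run it with the -b option (batch mode requires non-interactive authentication, typically a public key):

sftp -b sftp_batch.txt user@server.myhost.com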

Interactive mode in a remote directory: we can start an interactive session in a specific remote directory by specifying it on the command line: sftp user@server:Dirname









Wednesday, 15 April 2015

4/15/2015 10:07:00 am

Seven unusual facts about Informatica


Here are the top seven unusual facts that I wish I had known when I started my Informatica development career.




  1. A dynamic lookup will capture updates only if the output ports on the source table are connected to another transformation, even if the ports are not being used.
  2. If an XML over http web transformation call is unsuccessful, that pipeline dies on the spot if it is not first connected to a transformation other than a router. This is true even when other ports skip the http transformation.
  3. Just because the error returned is “Transformation Evaluation Error”, it does not make it true. This error is often seen when there is another problem with a source or target table. Make sure to look through all errors before applying a fix.
  4. A parameter remains unchanged throughout the session run while a variable can be set to a new value at any point during, before, or after the run.
  5. Even if “persistent” variable is not checked, Informatica will hold on to any variable set in the mapping for the next run. This can either be overwritten by using a parameter file or by using a pre-session assignment of a NULL value to the variable.
  6. If a zero ends up in a target number field when another number was expected, there is probably a varchar port somewhere upstream. 
  7. If the mapping creates a dynamic file name, a dummy file name must still be provided at the workflow level; the workflow does not know the name of the file until the mapping passes it over, and leaving it blank causes a fatal error.






Friday, 27 March 2015

3/27/2015 10:41:00 am

Seven Quirks in Informatica PowerCenter Development




Here are seven quirks that I wish I had known when I started my Informatica development career.

  1. If you find that your source queries or target pre/post SQL are not behaving as expected, check the workflow session to see if somebody overwrote the code at the workflow level. You can tell because a small revert button is visible.  The revert button doesn’t mean that the code is different; it means that any changes made at the mapping level will not be carried over to the workflow level, so when you run the workflow, the code at the workflow level is what gets executed.  This can be a huge time-waster if you keep editing code at the mapping level and then run the workflow only to find it is still not working as intended.
  2. When using Monitor, if you “open” rather than “connect”, you will not see the newly running workflows in real time. Instead, you’ll have to disconnect and reconnect (and get previous runs) in order to see them.
  3. While using Workflow Manager, if you right click on a session to edit a task and you already have one or more tasks in the task developer, you will open the wrong task for editing.  As a work around, make it a habit to right click and clear all within the white space of the task developer when you are finished editing each task to avoid wasting time by editing the wrong task.
  4. If you use an external monitor for development and drag the Edit Transformations window to it, the next time you open it you will no longer be able to expand the SQL Query box using the expand arrow.  Follow the instructions in this post to fix it.  Keep Informatica on your main screen and use the external monitor for your database, IDEs, email and whatever else you keep open.
  5. Pasting a query without expanding the SQL Query text box causes the query to fill several subsequent text boxes, overwriting what is already present.  If you expand the SQL Query text box and paste it, it will work just fine.  See quirk 4 if the text box doesn’t expand.
  6. The session log will show any queries that are being sent. Unfortunately, if you have a really long query because you are using best practices and not using select *, the log cuts off the bottom of the query where all the good stuff is (like the where clause, joins and what not).  Copy the query into a text file and then remove all columns that aren’t necessary to investigate the query being sent. Then run it again, the log will now have more room to show the query.
  7. If you are trying unsuccessfully to connect two ports and there is no pipeline error, check Layout -> Link Columns to make sure you did not accidentally unselect it. 

Wednesday, 18 February 2015

2/18/2015 05:23:00 am

Informatica Best practices -Part 2

General
 
  • Limit the number of rows as early in the mapping as possible (SQ, aggregators, joiners, filters) - the more you limit at the beginning of the data flow, the fewer rows will be processed by all downstream transformations.
  • For generating IDs use IDENTITY columns or DB triggers. There are a number of reasons:

    1. fewer PowerCenter objects reduce development time and maintenance effort
    2. IDs are PowerCenter independent, so any other app used to populate the target table will not cause any issues
    3. while migrating the code between environments there is no need to worry about the sequence value
      
  • If for some reason you need the ID on PowerCenter side, use Sequence Generator rather than Stored Procedure call for performance reasons - Sequence Generator is really fast as it simply gives you numbers in IS memory. Stored Procedure call requires connecting to DB engine, invoking the Stored Procedure, and sending the number back to IS. The communication overhead is huge and may heavily impact performance. This is also a factor of the connection speed and distance.
    If you’re afraid of losing the generated sequence number (which might happen when exporting/importing XML without the Retain Sequence Generator Value option), store the last ID in the DB, use it as the start value for the sequence, and update it afterwards with the last ID generated.

  • There should be no unused ports in any transformation (except Source Qualifier) - Source Qualifier should match Source Definition 1-to-1. But for any other transformation there should be no unused ports. Why fetch and process the data for half of the data flow and use extra memory, if you don’t need it?

  • Avoid implicit conversion, as it is slower and might lead to errors - there are a number of issues with implicit conversion (e.g. when you link an SQ string port to a decimal port in a downstream Expression):

    1. it’s simply not visible. Anyone checking the mapping will not notice that there is any conversion of datatypes
    2. the conversion might not behave as expected (e.g. rounding, trailing/leading spaces or zeros, etc.)
    3. there may be unhandled conversion errors
    Therefore use appropriate conversion functions (e.g. TO_DECIMAL) explicitly.


    1. Remove all the 'Error (Transformation error)' messages from default values in Expression transformations - you will avoid getting unwanted error messages in logs. It will not be checked at session initialization.
    2. Consider adding Expression transformation right after each SQ and right before each Target - pass-through expressions do not cause additional computation. But I couldn’t count how many times I had to make “little change” before or after the whole data flow in an existing mapping. Having this dummy expression transformation helps a lot. It’s also very useful when relinking all ports after some changes (using Autolink by name).


    Aggregators:


    1. Use as soon as possible - it limits the number of rows processed by any transformation downstream.
    2. Use 'Sorted Input' when the data is already sorted, but do NOT sort just for aggregation - this one is a bit tricky. An Aggregator with the “Sorted Input” property works faster and uses a much smaller cache, but aggregating data does not require sorting. Therefore, when dealing with unsorted data, an Aggregator (without Sorted Input, large cache) will perform better than a Sorter (large cache) + an Aggregator (with Sorted Input, small cache). If you do add the Sorter, the cache does not disappear - it is simply created for the Sorter instead. Adding a Sorter is a good idea only if you need the data sorted anyway.


    Expression:


    1. Use local variables for common computation (especially if complex) - if you need to perform some computation for many output ports (e.g. a datatype conversion on one of the input ports), you can define a variable port that does it once (e.g. TO_DECIMAL(input_port)) and reuse that port in the expressions of the output ports, so the work is done only once (see the sketch after this list). This is especially important when the common logic is complex, because:
      1. the performance gain is even greater
      2. it’s easier to maintain - there is one place to check for errors and to implement changes
    2. Avoid local variables if no code reuse opportunity is present - If the output port does some computation and it is not common for other output ports, use the output port expression. Do not add a variable port. Input -> Variable -> Output makes the expression not visible from Mapping Designer (i.e. without opening Expression Transformation window)
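    A sketch of item 1 (the port names and percentages are invented): the conversion is done once in a variable port and reused by both output ports.

    v_ORDER_AMT  (variable, decimal) = TO_DECIMAL(IN_ORDER_AMT_STR)
    OUT_NET_AMT  (output)            = v_ORDER_AMT - v_ORDER_AMT * 0.20
    OUT_TAX_AMT  (output)            = v_ORDER_AMT * 0.20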


    Filter / Router:


    1. Any complex expression should be computed in an Expression transformation before the Filter / Router - this is especially important for a Router: calculate once, then compare, instead of calculating many times. Remember that for every input row all group expressions are checked (a Router can produce many outputs for one input row). It is a good idea to do the same for a Filter, since, for example, you can use variable ports in an Expression but not in a Filter.
    2. Keep in mind that a Router evaluates all group expressions against each input row.


    Floats:


    1. When to use? Only for physics-style data where absolute accuracy is not that important. Keep in mind that floating point numbers are not exact! E.g. 0.1 is stored as ~0.0999999999999999991, which is very close to 0.1 but not equal to it. You can end up with 0 stored as -0.00000001 (a real-life example), and that causes a lot of trouble when selecting all rows greater than or equal to zero.
      For financial, quantitative - in fact almost any - calculations, use decimals.


    Lookup:


    1. Should have only the ports that are being used - the more ports in the Lookup, the more data is fetched, and the more time and memory it takes to build the cache.
    2. Limit the number of rows fetched using a filter - use the Lookup Source Filter property. This limits the amount of data fetched and kept in memory.
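    For example (hypothetical column names), a Lookup Source Filter such as the following keeps only current rows in the cache - note that the WHERE keyword is not included in the property:

    ACTIVE_FLAG = 'Y' AND END_DATE IS NULL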


    Parameter files:


    1. Use one parameter file per workflow - there are many ways to use parameters and parameter files: you can define general parameters for all sessions in a workflow, or have each session use its own parameter file. What is the best way? There is no simple answer, and some purposes need a different setup. In general, though, one parameter file per workflow works well: you won’t affect other workflows when changing the parameters, and the file stays quite simple in most cases.
    2. Consider auto-generating parameter files from values stored in the DB - workflows can start (or end) with a generic session that creates the parameter file. This allows easy maintenance of all parameter files and parameters, checks for naming conventions, etc.


    Sorters:

    1. Avoid if possible - do not sort the data unless it is really needed.
    2. Use as soon as possible - if the sorting is needed, sort data early in the mapping and try to take advantage of it using Sorted Input for as many transformations downstream as possible.
    3. Use as long as possible - Try to arrange the mapping in a way that all transformations sharing the sort type come one after the other. Avoid resorting separately for every transformation.
    4. Take advantage of DB sorting if a covering index is available - if a covering index exists in the DB, sort the data in the Source Qualifier and use Sorted Input for the transformations downstream.
    5. Don’t:
      1. Sort just for aggregator - Although Aggregator works faster and doesn’t use caching (almost) with “Sorted Input” property checked, it is not worth sorting data just for aggregation. This will just move the need for cache from Aggregator to Sorter and in fact will slow down the overall performance as the aggregation algorithm doesn’t require sorting.
      2. Sort for a Joiner (especially with a sorter on each input!) - do not use two Sorters for the Detail and Master inputs of a Joiner transformation. In that case both pipelines require caching, while for unsorted data only the Master input is cached.


    Source Qualifier:

    1. Connect only the necessary ports from SQ - Source Qualifier determines what really needs to be fetched from source. While it should match Source Definition 1:1 with all ports linked (to avoid future mistakes and misunderstandings), it reads only the data for ports linked to the next transformation. This can be observed while generating SQL statement (Properties->Sql Query->Generate SQL) - only the used ports will be listed in SQL statement. This is especially important for wide sources, as this way the amount of data transferred can be limited.
    2. Avoid using SQL override statements - a SQL override takes precedence over all other Source Qualifier settings (port order, sorting, filters). It is also not visible at first glance and reduces code transparency, so it should be avoided. If you need to:
      1. sort the data - use "Number Of Sorted Ports" property with the proper ports order
      2. filter the data - use "Source Filter" property
      3. join multiple homogeneous sources - use "User Defined Join" property
    SQL override should be used only when some complex statements are needed (e.g. use of DB-specific functions or subquery)
    The above also applies (in general) to the Lookup transformation.

    1. All ports between Source Definition and SQ should be connected - As mentioned above, it is good to have SQ matching Source Definition as it reduces the number of future mistakes.
    2. All the datatypes should match Source Definition - In general implicit datatype conversion should be avoided as it may lead to errors and unexpected behavior.


    Update Strategy:

    1. Don’t use DD_REJECT - don't process unneeded data. If the rejected rows are really needed in the .bad file, use "Forward Rejected Rows"; otherwise use a Filter transformation to drop the unneeded rows.

Tuesday, 10 February 2015

2/10/2015 02:10:00 pm

Dynamically Generated Parameter Files



There are a few other solutions available for dynamically generating parameter files, but this one is more generic and dynamic: it requires a metadata table populated with the workflow parameter details and then creates parameter files for all workflows in one go.


This mapping reads a metadata table created for workflow parameters. It uses a Transaction Control transformation together with the FileName port of the flat-file target to generate parameter files for any number of workflows.

This solution is quite simple and generic compared to other solutions in the marketplace, and there is no need to write complex UNIX scripts to achieve it.


1. Simple design
2. Uses a Transaction Control transformation to generate the parameter files dynamically
3. Uses the Oracle LISTAGG function to build the parameter lines (a sketch follows below)
4. Can be implemented in any Informatica PowerCenter project with minimal changes
5. Avoids writing complex Unix scripts to generate parameter files
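To give an idea of the query behind point 3 - a sketch only, where the metadata table WF_PARAM_METADATA and its columns are invented - LISTAGG can collapse all parameters of a workflow into one block of parameter-file text per workflow:

SELECT '[' || folder_name || '.WF:' || workflow_name || ']' || CHR(10) ||
       LISTAGG(param_name || '=' || param_value, CHR(10))
           WITHIN GROUP (ORDER BY param_name) AS param_file_text
FROM   wf_param_metadata
GROUP BY folder_name, workflow_name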

Please go through the link below for more details, as the solution has also been published on the Informatica Marketplace and carries the Informatica seal of approval.





Wednesday, 4 February 2015

2/04/2015 02:24:00 pm

Informatica Advanced Techniques -- File Watcher (Using a Java Transformation)


We are going to have a look at a file watcher utility in Informatica that watches for incoming files from an external server. It uses a Java transformation and creates an indicator/trigger file once a file has been downloaded completely.

It is particularly useful in scenarios where an external system sends a file to our server but we do not know from which job, so we cannot set up a dependency in the scheduling tool (Control-M, Autosys, etc.). Normally a time condition is used to cater for this, which is not always correct: the job fails if the file does not arrive on time.

We can use the file watcher together with the Informatica Event-Wait task to cater for these kinds of scenarios.

Design a mapping; let's call it m_File_Watcher_Start.

We need to create a Java transformation to cater for this.




Create one input port, STAGING_FILE, and three output ports:

FILE_SIZE
FILE_EXISTS
FILE_TIME

Write the code below in the Import Packages tab.



    import java.io.File; // Used to facilitate file and directory object representations
    import java.text.DecimalFormat; // Used to format file and directory objects size
    import java.util.*;
    import java.text.*;


Write the code below in the On Input Row tab.



// Populate the output ports for the current input row
FILE_SIZE = getFileSize(STAGING_FILE);   // helper method - see the sketch below
FILE_TIME = getFileTime(STAGING_FILE);   // helper method - see the sketch below

File f = new File(STAGING_FILE);

if (f.exists()) { FILE_EXISTS = "Y"; } else { FILE_EXISTS = "N"; }
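The helper methods getFileSize and getFileTime are not generated by the Java transformation; a minimal sketch of what they might look like (placed in the Helper Code tab; the size and date formats are assumptions chosen to match the sample data below) is:

// Sketch only - returns the file size formatted in KB, e.g. "50 KB"
private String getFileSize(String path) {
    File f = new File(path);
    if (!f.exists()) {
        return "0 KB";
    }
    DecimalFormat df = new DecimalFormat("#.##");
    return df.format(f.length() / 1024.0) + " KB";
}

// Sketch only - returns the last-modified timestamp, e.g. "04/02/2014 13:05:09"
private String getFileTime(String path) {
    File f = new File(path);
    SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy HH:mm:ss");
    return sdf.format(new Date(f.lastModified()));
}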

We now have the file size, the file timestamp, and whether the file exists.

We can create a table, FILE_WATCHER_DETAILS, to hold these details:
CREATE TABLE FILE_WATCHER_DETAILS
(
  STAGING_FILE   VARCHAR2(100 BYTE),
  TRIGGER_FILE   VARCHAR2(100 BYTE),
  FILE_TIME      VARCHAR2(100 BYTE),
  FILE_SIZE_KB   VARCHAR2(100 BYTE),
  FILE_EXISTS    VARCHAR2(1 BYTE),
  TRIGGER_DONE   VARCHAR2(1 BYTE),
  TRIGGER_PICKED VARCHAR2(1 BYTE)
)
Suppose we have a file Input_file.txt with a size of 50 KB; the data in the table will be:

STAGING_FILE    Input_file.txt
TRIGGER_FILE    Input_file.txt.trg (append .trg to the end of the file name)
FILE_TIME       04/02/2014 13:05:09
FILE_EXISTS     Y
FILE_SIZE_KB    50 KB
TRIGGER_DONE    N

Design a new mapping; let's call it m_File_Watcher_End.

Create a mapping similar to the earlier one, but with a slightly different Java transformation: two input ports, STAGING_FILE and TRIGGER_FILE (plus the previously recorded size OLD_FILE_SIZE_KB read from the table), and an output port TRIGGER_DONE, which is set to 'Y' when the file size has not changed.

We will call these two mappings from two sessions with a time interval of 5 minutes between them; this is done to make sure the file has been completely downloaded.
In the second Java transformation we check the new file size against the old size recorded 5 minutes earlier, and if the size is the same we create the trigger file.



// Compare the current file size with the size recorded 5 minutes ago;
// if it is unchanged, the download is assumed complete and the trigger file is created.
FILE_SIZE_KB = getFileSize(STAGING_FILE);

if (FILE_SIZE_KB.equals(OLD_FILE_SIZE_KB)) {
    File file = new File(TRIGGER_FILE);
    try {
        boolean fileCreated = file.createNewFile();
        TRIGGER_DONE = "Y";
    } catch (java.io.IOException ioe) {
        System.out.println("Error while creating empty file: " + ioe);
    }
} else {
    TRIGGER_DONE = "N";
}

The final values in the table will be:

STAGING_FILE    Input_file.txt
TRIGGER_FILE    Input_file.txt.trg
FILE_TIME       04/02/2014 13:05:09
FILE_EXISTS     Y
FILE_SIZE_KB    50 KB
TRIGGER_DONE    Y


In the end we need to create a workflow with the two sessions and a Timer task of 5 minutes between them.