ETL (Extract, Transform and Load) – Part 5 – Introduction to Data Warehouse

Creating a Data Flow Task

There are three types of data flow components in SSIS:
■■ Data flow source adapters
■■ Data flow transformations
■■ Data flow destination adapters

Defining Data Flow Source Adapters

A data flow source adapter is used to extract data from a source and make it available to other components inside a data flow.

ADO.NET source: This source provides connections to tables or queries through an ADO.NET provider.

CDC source: Change Data Capture (CDC) allows the retrieval of only changed data (from insert, update, or delete operations) from the source system. The CDC source uses an ADO.NET provider to connect to a CDC-enabled table.
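As a rough illustration, CDC must first be enabled on the source database and on each table you want to track before the CDC source can read changes from it. A minimal T-SQL sketch, assuming a hypothetical dbo.Orders source table:

```sql
-- Enable CDC at the database level (run once per source database).
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for one table; SQL Server creates the change table and
-- capture/cleanup jobs that the SSIS CDC source later reads from.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',   -- hypothetical source table
    @role_name     = NULL;        -- no gating role for reading changes
```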

Excel source: This source allows extraction from a Microsoft Excel worksheet in an Excel file.

Flat File source: This allows you to extract data from delimited or fixed-width files created with various code pages. It uses the Flat File connection manager.

ODBC source: This source connects to a defined ODBC source by using native ODBC, rather than the ODBC data provider in ADO.NET that was used in versions before SQL Server 2012.

OLE DB source: This source connects to installed OLE DB providers such as SQL Server, SQL Server Analysis Services (SSAS), and Oracle.

Raw File source : The Raw File source reads data from a native SSIS data file that was written by the Raw File destination. Because the representation of the data is native to the source, the data requires no translation and almost no parsing. This means that the Raw File source can read data more quickly than any of the other sources.

XML source: The XML source allows raw data to be extracted from an XML file.

Defining Data Flow Destination Adapters

Data flow destinations are similar to sources in that they use project or package connections. However, destinations are the endpoints in a data flow task, defining the location into which the data should be loaded.

ADO.NET destination: Used to insert data through an ADO.NET provider.

Data Mining Model Training: Allows you to pass data from the data flow into a data mining model in SSAS.

DataReader destination: Lets you pass data in an ADO.NET recordset that can be programmatically referenced.

Excel destination: Used for writing data into a specific sheet in Excel.

Flat File destination: Allows insertion of data into a flat file such as a comma-delimited or tab-delimited file.

ODBC destination: Allows you to insert data by using an ODBC provider. This is a new component in SQL Server 2012. It supports a batch or row-by-row data access mode for inserting the data. In a data warehouse environment, the batch mode is recommended because of the large volumes of data.

OLE DB destination: Uses the OLE DB provider to insert rows into a destination system that allows an OLE DB connection.

Partition Processing: Allows an SSAS partition to be processed directly from data flowing through the data flow.

Raw File destination : Stores data in native SSIS format as a binary file. Very useful in scenarios when you need to temporarily stage data—for example, when the destination
server is not available and you do not want to or cannot process the source data again. In SQL Server 2012, the Raw File destination can save all kinds of information (including the comparison flags for string columns).

Using Fast Parse

Fast Parse is a set of operations within SSIS that can be used for very fast loading of flat file data. When you are loading data using the Flat File source adapter and you do not need to parse any locale-sensitive data (such as date formats, decimal symbols such as the comma, or currency symbols), SSIS can use the Fast Parse option to load data very quickly. Fast Parse is supported only on a subset of date, time, and integer formats.

To set the Fast Parse property, use the Advanced Editor:
1. Right-click the Flat File source and then click Show Advanced Editor.
2. In the Advanced Editor dialog box, click the Input And Output Properties tab.
3. In the Inputs And Outputs pane, click the column for which you want to enable Fast Parse.
4. In the properties window, expand the Custom Properties node, and then set the
FastParse property to True.

Working with Data Flow Transformations

Transformations give you the ability to modify and manipulate data in the data flow.

Selecting Transformations:

A transformation can operate on one row of data at a time or on several rows of data at once. The operations of some transformations are similar to others; therefore, the transformations can be categorized into natural groupings of similar components. In addition to this natural grouping, this lesson describes the type of blocking that occurs inside the data flow task for each transformation.

There are three types of blocking:
■■ In non-blocking transformations, each row is passed through the transformation without any waits.
■■ A partial-blocking transformation waits until a sufficient number of rows is stored and then it passes them through.
■■ In a blocking transformation, all rows must be read into the transformation before the transformation can pass them through.

Logical Row-Level Transformations

The most common transformations are logical row-level transformations, which perform operations at the row level without needing other rows from the source. Because they work on one row at a time, these transformations usually perform well.

Some common uses of this type of calculation in data warehouse scenarios include the
creation of calculated columns from multiple source columns, mathematical calculations, conversion of data type values, and the replacement of NULL with other values.

Audit: Adds additional columns to each row based on system package variables such as ExecutionStartTime and PackageName.

Cache Transform: Allows you to write data to a cache by using the Cache connection manager. The data can then be used by the Lookup transformation. This is useful if you are using multiple Lookup transformations against the same data, because SSIS will cache the needed data only once instead of once per Lookup.

Character Map: Performs common text operations such as Uppercase and allows advanced linguistic bit-conversion operations.

Copy Column: Duplicates column values in each row into a new, named column.

Data Conversion: Creates a new column in each row based on a new data type converted from the existing column. An example is converting text to numeric data or text to Unicode text.

Derived Column: This is the most often used logical row-level transformation, because it enables the replacement of column values or the creation of new columns based on existing columns, variables, and parameters. For example, you can concatenate a firstname column and a lastname column to make a fullname column.

Multi-Input and Multi-Output Transformations

Multi-input and multi-output transformations can work with more than one data input or can generate more than one output, respectively. These transformations enable you to combine multiple branches of data flow paths into one, or to create multiple branches of data flow paths from one.

Data flow transformations:

CDC Splitter: Directs the stream of changed records into different outputs based on the type of change (insert, update, or delete).

Conditional Split: Routes or filters data to one or more outputs based on a Boolean expression; each row can be sent to only one output path. It routes data rows to different outputs depending on the content of the data. For instance, you can separate rows that need to be cleansed from rows that are ready to be loaded.

Lookup: Performs a lookup operation between the current row and an external dataset on one or more columns. It joins additional columns to the data flow by looking up values in a table. For instance, it can join on the 'employee id' column in the employee table to get 'hiredate' and 'employee name'.
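The reference dataset for a Lookup can be an entire table, but it is usually better to limit it with a query that returns only the needed columns. A minimal sketch, assuming a hypothetical dbo.Employee table that matches the example above:

```sql
-- Reference dataset for the Lookup transformation: return only the
-- column used for matching and the columns to add to the data flow.
SELECT EmployeeID,     -- matched against the input rows
       HireDate,       -- added to the data flow on a match
       EmployeeName    -- added to the data flow on a match
FROM dbo.Employee;
```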

Merge: Combines the rows of two similarly sorted inputs, one on top of the other, based on a defined sort key.

Merge Join: Joins the rows of two sorted inputs based on a defined join
column or columns, adding columns from each source.

■■ What is the difference between the Union All and the Merge transformation?

The Merge transformation is similar to Union All, but with Merge, the sources have
to be sorted and the sort position is preserved.

Multi-Row Transformations

Multi-row transformations perform work based on criteria from multiple input rows or generate multiple output rows from a single input row. Multi-row transformations can be more intensive in operation and memory overhead but are crucial for meeting business requirements.

Aggregate: Associates rows based on defined groupings and generates aggregations such as SUM, MAX, MIN, and COUNT.

Sort: Orders the input based on defined sort columns and sort directions.
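Conceptually, the Aggregate and Sort transformations do the same work as GROUP BY and ORDER BY in T-SQL, so when the data already sits in a relational staging table the work can often be pushed down to the database instead. A sketch with hypothetical table and column names:

```sql
-- Equivalent of an Aggregate transformation grouped by CustomerID ...
SELECT CustomerID,
       SUM(SalesAmount) AS TotalSales,
       COUNT(*)         AS OrderCount
FROM stg.SalesOrder               -- hypothetical staging table
GROUP BY CustomerID
-- ... and of a Sort transformation ordering the aggregated output.
ORDER BY CustomerID;
```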

Determining Appropriate ETL Strategy and Tools

The volume of data that must be managed by data warehouses is growing every day, and data integration is becoming the biggest issue, with increasing demand for the implementation of different analytical solutions, ranging from enterprise data warehouses to specific data marts needed for predictive analytics.

ETL Architecture

Data integration reference architecture defines the processes and environments that support the capture, quality checking, processing, and movement of data, whether it is transactional or bulk, to one or many targets. A typical ETL architecture consists of alternating processes and landing zones, where a landing zone is usually inside the database. Each process is followed by a landing zone. A standard architecture would be as follows (a sketch of how the landing zones could map to database schemas appears after the list):
■■ Extract the data (process)
■■ Initial Staging (landing zone)
■■ Data Quality (process)
■■ Clean Staging (landing zone)
■■ Transformation (process)
■■ Load-ready Publish (landing zone)
■■ Load Enterprise Data Warehouse (process)
■■ Enterprise Data Warehouse (landing zone)
■■ Load Data Marts (process)
■■ Data Marts (landing zone)
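One common way to keep these landing zones apart inside a single SQL Server database is to give each zone its own schema. The names below are purely illustrative, not prescribed by the architecture:

```sql
-- Hypothetical schemas, one per landing zone in the list above
-- (CREATE SCHEMA must be the only statement in its batch, hence GO).
CREATE SCHEMA stg;   -- Initial Staging: extracted data, unmodified
GO
CREATE SCHEMA cln;   -- Clean Staging: data that passed quality checks
GO
CREATE SCHEMA pub;   -- Load-ready Publish: transformed, ready to load
GO
CREATE SCHEMA dw;    -- Enterprise Data Warehouse tables
GO
CREATE SCHEMA dm;    -- Data Mart tables
```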

Lookup Transformations

When we are inserting data into a data warehouse fact table, we need to get the appropriate data warehouse key (usually a surrogate key) from the dimension table. As you may remember, columns in a fact table include foreign keys and measures, and the dimensions in your database define those foreign keys. The Lookup transformation is very useful in an ETL process for solving this problem, because it performs lookups by joining data in input columns with columns in a reference dataset.
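In relational terms, the lookup performed during the fact load is simply a join from the incoming business key to the dimension's surrogate key. A hedged sketch, assuming hypothetical stg.SalesOrder and dbo.DimCustomer tables:

```sql
-- Resolve the source business key (CustomerID) to the surrogate key
-- (CustomerKey) before inserting the rows into the fact table.
SELECT d.CustomerKey,        -- surrogate key stored in the fact table
       s.OrderDate,
       s.SalesAmount
FROM stg.SalesOrder  AS s
JOIN dbo.DimCustomer AS d
    ON d.CustomerID = s.CustomerID;   -- match on the business key
```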

Using a Lookup Transformation

You specify how the reference dataset will be stored in memory by selecting the appropriate cache mode:

■■ Full cache is the default mode. In this mode, the database is queried once during the
pre-execute phase of the data flow, and the entire reference set is stored into memory.
Lookup operations will be very fast during execution, but enough memory must be available to hold the entire reference set.

■■ Partial cache means that the lookup cache is empty at the beginning of the data
flow. When a new row comes in, the Lookup transformation checks its cache for the
matching values. If no match is found, it queries the database. If a match is found
in the database, the values are cached so they can be used the next time a matching
row comes in.

■■ The no cache mode will store only the last matched row, which means that the Lookup transformation will query the database for each row.

■■ What is the difference between the Lookup and Merge Join transformations?

 Answer :

The Lookup transformation does not need a sorted input; it is a non-blocking
transformation, and in cases when more matching rows exist from the referenced
dataset, only the first one will be retrieved. This means that the transformation will
never retrieve more rows than there are input rows. With the Merge Join transformation, more rows can be retrieved, because all matching data is retrieved.

Sorting the Data

The Sort transformation is expensive in terms of memory and processor usage, and it is also a fully blocking transformation. On the other hand, the Merge Join transformation needs sorted input, so the question is: what is the best way to develop the package?
If you are reading the data from a database—either from the staging area or directly from a transactional system—you can put the burden of sorting the data on the underlying database. You achieve this by specifying a custom SQL query with an ORDER BY clause in the data source. You then have to tell SSIS that the source output is sorted, by setting the IsSorted property to True and assigning SortKeyPosition values to the output columns in the source's Advanced Editor, because SSIS does not detect the ordering on its own.
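A minimal sketch of such a source query, assuming a hypothetical stg.Customer staging table; after entering it, you still have to mark the source output as sorted in the Advanced Editor as described above:

```sql
-- Let the database do the sorting instead of the Sort transformation.
SELECT CustomerID,
       CustomerName,
       City
FROM stg.Customer
ORDER BY CustomerID;   -- must match the join key of the Merge Join
```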

SSIS Variables

SSIS variables can be used to store values that are determined automatically at run time and that are to be used or reused in various elements of the solution. Some variables (such as the time the process started, the name of a dynamically created folder to be used for storing temporary files, or the number of CPUs) are set only once.

System and User Variables

There are two types of SSIS variables:

■■ System variables: represent specific properties of various SSIS objects (packages, tasks, containers, components, and event handlers). System variables are read-only; their values are set by SSIS.
■■ User-defined variables: defined by the SSIS developer and used to store various pieces of information acquired or created during execution.

Variable Scope:

The scope of a variable is determined through the SSIS object hierarchy; scope restricts the visibility and accessibility of a variable to a particular branch of the SSIS object hierarchy.

Package-scoped variables can be considered global.
Variables defined at the container level are accessible only to that particular container and to the other objects it contains.

Property Parameterization

Property parameterization, allowing specific SSIS object properties to be set dynamically, can be implemented in several ways:

■■ SSIS object properties associated with parameters can be set dynamically from the calling environment—even after deployment.

■■ Explicit assignment of property values from variables is available in some tasks and
components to facilitate the most common use cases and to simplify development.

■■ Assignment through expressions is available for most SSIS object properties, allowing property values to be computed by using a combination of constants, variables, and parameters, or by using expression functions.

Inferred Dimension Members:

In a data warehouse, it is possible for fact records to arrive with a source business key that has not yet been loaded into the dimension table. This problem is known in DW jargon as late-arriving dimensions or early-arriving facts. When inserting records into a fact table that cannot be resolved to a dimension surrogate key, the typical solution is as follows (a T-SQL sketch of the first two steps appears after the list):

1. Insert a row in the dimension table by using the source system key, and assign a surrogate key to this member. Mark this record to indicate that it was populated as an inferred member by adding a column to your dimension; for instance, add an InferredMember column and set its value to True.

2. Use the newly created surrogate key and assign it to the fact record.

3. Change the loading process for the dimension to check whether a record was populated as an inferred member. If this is the case, you must treat all the attributes as Type 1 SCD attributes regardless of your specification, except for the surrogate key and the business key. If you do not, every Type 2 SCD attribute will create a new row, because the inferred row has no values for these attributes.
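Steps 1 and 2 can be sketched in T-SQL as follows, assuming a hypothetical dbo.DimCustomer dimension with an identity surrogate key and an InferredMember flag column:

```sql
DECLARE @CustomerID  int = 12345;   -- business key arriving with the fact row
DECLARE @CustomerKey int;

-- Step 1: insert a placeholder dimension row and flag it as inferred.
INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, InferredMember)
VALUES (@CustomerID, N'Unknown', 1);

SET @CustomerKey = SCOPE_IDENTITY();   -- surrogate key just generated

-- Step 2: use the new surrogate key for the fact record.
INSERT INTO dbo.FactSales (CustomerKey, OrderDate, SalesAmount)
SELECT @CustomerKey, OrderDate, SalesAmount
FROM stg.SalesOrder
WHERE CustomerID = @CustomerID;
```

In an SSIS package, the same logic is typically implemented with a Lookup transformation whose no-match output feeds the insert of the inferred member, but the underlying idea is the same.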

Effectively Updating Dimensions

Although the Slowly Changing Dimension transformation is very easy to use and can quickly create the whole data flow process for updating dimensions, be aware that it does not work optimally in terms of performance, and thus it should only be used for small dimensions .

The algorithm for updating a dimension that has Type 1 and Type 2 SCD attributes is as
follows:

1. Check whether the attributes from the source row have changed.

2. Mark rows for Type 1 or Type 2 SCD change depending on the attribute specification for the dimension.

3. Apply the required update logic to the dimension:

■■ For Type 1 SCD, update the row.

■■ For Type 2 SCD, update the current row’s validity to indicate that it was valid until now, and insert the new row.

Set-Based Update Logic

One of the main reasons the Slowly Changing Dimension transformation is slow is that it performs row-level SQL updates. That is why it is important to implement set-based logic when you are designing SSIS processes that need to apply Type 1 or Type 2 SCD changes.
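A hedged sketch of such set-based logic, assuming changed source rows have been staged in a hypothetical stg.Customer table and that dbo.DimCustomer tracks Type 2 history with ValidFrom/ValidTo columns (ValidTo is NULL for the current row):

```sql
-- Type 1 SCD: overwrite the changed attribute for all affected rows at once.
UPDATE d
SET    d.Email = s.Email
FROM   dbo.DimCustomer AS d
JOIN   stg.Customer    AS s ON s.CustomerID = d.CustomerID
WHERE  d.ValidTo IS NULL
  AND  d.Email <> s.Email;

-- Type 2 SCD, part 1: expire the current rows whose Type 2 attribute changed.
UPDATE d
SET    d.ValidTo = GETDATE()
FROM   dbo.DimCustomer AS d
JOIN   stg.Customer    AS s ON s.CustomerID = d.CustomerID
WHERE  d.ValidTo IS NULL
  AND  d.City <> s.City;       -- City is the Type 2 attribute here

-- Type 2 SCD, part 2: insert new current versions for every business key
-- that no longer has a current row (run after the expiry update above).
INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, City, Email, ValidFrom, ValidTo)
SELECT s.CustomerID, s.CustomerName, s.City, s.Email, GETDATE(), NULL
FROM   stg.Customer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.DimCustomer AS d
                   WHERE d.CustomerID = s.CustomerID
                     AND d.ValidTo IS NULL);
```

Each statement touches many rows in a single operation, which is what makes this approach substantially faster than the row-by-row updates issued by the Slowly Changing Dimension transformation.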
