What is SQL

SQL stands for Structured Query Language.

SQL was designed to manage and maintain data in Relational Database Management Systems (RDBMS), which are built around the relational model and predicate logic.

It has three types of statements:

DDL (Data Definition Language)

Statements include : CREATE, ALTER, DROP, TRUNCATE

DML (Data Manipulation Language)

Statements include : SELECT, INSERT, UPDATE, DELETE

DCL (Data Control Language)     

Statements include :  Grant, Revoke
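A minimal illustration of the three statement classes, using a hypothetical dbo.Customers table and user name:

```sql
-- DDL: define the object
CREATE TABLE dbo.Customers
(
  custid   INT          NOT NULL,
  custname VARCHAR(100) NOT NULL
);

-- DML: manipulate the data
INSERT INTO dbo.Customers (custid, custname) VALUES (1, 'Contoso');
SELECT custid, custname FROM dbo.Customers;

-- DCL: control access (UserA is a hypothetical database user)
GRANT SELECT ON dbo.Customers TO UserA;

-- DDL again: remove the object
DROP TABLE dbo.Customers;
```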

 

What is a Database, a Schema and an Object?

Think of a database as a main box that contains smaller boxes (schemas), and those schemas in turn contain objects such as tables, views, stored procedures, and functions.

  • Because objects are contained in a schema, the main advantage of the schema level is security: you can grant each database user a specific set of privileges/rights.
  • For example:

– User A : allowed to use SELECT statements only.

– User B : allowed to see encrypted stored procedures, along with all other privileges.

– User C : denied the rights to delete, update, or execute any stored procedure.
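A sketch of how such schema-level privileges might be assigned (the Sales schema and user names are illustrative):

```sql
-- User A: SELECT only, on everything in the Sales schema
GRANT SELECT ON SCHEMA::Sales TO UserA;

-- User C: explicitly deny destructive and execution rights
DENY DELETE, UPDATE, EXECUTE ON SCHEMA::Sales TO UserC;
```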

A database contains schemas, and each schema can contain multiple objects.

Interview Questions

Q 1 : What are the types of system databases?

SQL Server has the following system databases:

Master The master database holds instance-wide metadata, server configuration, information about all databases in the instance, and initialization information. Its system views contain information about the system, hardware, indexes, columns, memory, and so on.

tempdb The tempdb database is where SQL Server stores temporary data such as work tables, sort space, row versioning information, and so on. SQL Server allows you to create temporary tables for your own use, and the physical location of those temporary tables is tempdb .

Resource The Resource database is a hidden, read-only database that holds the definitions of all system objects. When you query system objects in a database, they appear to reside in the sys schema of the local database, but in actuality their definitions reside in the Resource database. It contains information that was previously in the master database and was split out to make service pack upgrades easier to install.

Model The model database is used as a template for new databases. Every new database that you create is initially created as a copy of model. So if you want certain objects (such as data types) to appear in all new databases that you create, or certain database properties to be configured in a certain way in all new databases, you need to create those objects and configure those properties in the model database. Note that changes you apply to the model database will not affect existing databases.

msdb The msdb database is where a service called SQL Server Agent stores its data. SQL Server Agent is in charge of automation, which includes entities such as jobs, schedules, and alerts. The SQL Server Agent is also the service in charge of replication. The msdb database also holds information related to other SQL Server features such as Database Mail, Service Broker, backups, and more.

Q 2 : What is a Filegroup ?

The database is made up of data files and transaction log files . The data files hold object data, and the log files hold information that SQL Server needs to maintain transactions. Data files are organized in logical groups called filegroups.

A filegroup is the target for creating an object, such as a table or an index. The object data will be spread across the files that belong to the target filegroup. Filegroups are your way of controlling the physical locations of your objects.

A database must have at least one filegroup called PRIMARY, and can optionally have other user filegroups as well. The PRIMARY filegroup contains the primary data file (which has an .mdf extension) for the database, and the database’s system catalog. You can optionally add secondary data files (which have an .ndf extension) to PRIMARY. User filegroups contain only secondary data files. You can decide which filegroup is marked as the default filegroup.

This technique can support two main strategies:

■ Using multiple filegroups can increase performance by separating heavily used tables or indexes onto different disk subsystems.

■ Using multiple filegroups can organize the backup and recovery plan by containing static data in one filegroup and more active data in another. Spreading the secondary data files (.ndf) of a filegroup across multiple drives can also increase performance.
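A sketch of a database whose user filegroup is spread across two drives (database name, file paths, and filegroup name are illustrative):

```sql
CREATE DATABASE SalesDB
ON PRIMARY
    (NAME = SalesDB_data, FILENAME = 'D:\Data\SalesDB.mdf'),
FILEGROUP FG_Active
    (NAME = SalesDB_nd1, FILENAME = 'E:\Data\SalesDB_1.ndf'),
    (NAME = SalesDB_nd2, FILENAME = 'F:\Data\SalesDB_2.ndf')
LOG ON
    (NAME = SalesDB_log, FILENAME = 'G:\Log\SalesDB.ldf');

-- Place a table on the user filegroup instead of PRIMARY
CREATE TABLE dbo.Orders (orderid INT NOT NULL) ON FG_Active;
```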

Q 3 : What is a Primary key?

A primary key constraint enforces uniqueness of rows and also disallows NULL marks in the constraint attributes.
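For example, assuming a dbo.Employees table with an empid column:

```sql
ALTER TABLE dbo.Employees
  ADD CONSTRAINT PK_Employees PRIMARY KEY (empid);
```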

Q 4 : What is a Unique Constraint ?

A unique constraint enforces the uniqueness of rows, allowing you to implement the concept of alternate keys from the relational model in your database. Unlike with primary keys, you can define multiple unique constraints within the same table. Also, a unique constraint is not restricted to columns defined as NOT NULL.
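For example, an alternate key on a hypothetical ssn column of dbo.Employees:

```sql
ALTER TABLE dbo.Employees
  ADD CONSTRAINT UNQ_Employees_ssn UNIQUE (ssn);
```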

Q 5 : What is a Foreign Key Constraint ?

A foreign key enforces referential integrity. This constraint is defined on one or more attributes in what's called the referencing table and points to candidate key (primary key or unique constraint) attributes in what's called the referenced table. Note that NULL marks are allowed in the foreign key columns (for example, a mgrid column that references empid) even if there are no NULL marks in the referenced candidate key columns.
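A self-referencing example, assuming dbo.Employees has empid as its primary key and a nullable mgrid column:

```sql
ALTER TABLE dbo.Employees
  ADD CONSTRAINT FK_Employees_Employees
  FOREIGN KEY (mgrid) REFERENCES dbo.Employees (empid);
```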

Q 6 : What is a Check Constraint ?

A check constraint allows you to define a predicate that a row must meet to be entered into the table or to be modified. For example, the following check constraint ensures that the salary column in the Employees table will support only positive values.
ALTER TABLE dbo.Employees
ADD CONSTRAINT CHK_Employees_salary
CHECK(salary > 0.00);

Q 7 : What is the difference between where and having clause  ?

The WHERE clause is evaluated before rows are grouped, and therefore is evaluated
per row. The HAVING clause is evaluated after rows are grouped, and therefore
is evaluated per group.
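For example, against a hypothetical HR.Employees table with country and hiredate columns:

```sql
SELECT country, COUNT(*) AS numemps
FROM HR.Employees
WHERE hiredate >= '20200101'   -- WHERE: evaluated per row, before grouping
GROUP BY country
HAVING COUNT(*) > 5;           -- HAVING: evaluated per group, after grouping
```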

Q 8 : Why it is not allowed to refer to a column alias defined by the SELECT
clause in the WHERE clause?

Because the WHERE clause is logically evaluated before the SELECT clause in query processing, the alias does not yet exist when the WHERE clause is processed.
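A sketch of the problem and the usual workaround (column names are illustrative):

```sql
-- Fails: WHERE is processed before SELECT, so hireyear is unknown there
-- SELECT empid, YEAR(hiredate) AS hireyear
-- FROM HR.Employees
-- WHERE hireyear = 2020;

-- Works: repeat the expression (or wrap the query in a derived table/CTE)
SELECT empid, YEAR(hiredate) AS hireyear
FROM HR.Employees
WHERE YEAR(hiredate) = 2020;
```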

Q 9 : What are the performance benefits of the WHERE clause ?

It reduces network traffic, and when properly supported by indexes it can avoid full table scans.

Q 10 : What is the difference between self-contained and correlated subqueries?

Self-contained subqueries are independent of the outer query, whereas correlated subqueries have a reference to an element from a table in the outer query.

Q 11 : What is the difference between the APPLY and JOIN operators?

With a JOIN operator, both inputs represent static relations. With APPLY, the
left side is a static relation, but the right side can be a table expression with
correlations to elements from the left table.

Q 12 : What are two requirements for the queries involved in a set operator ?

The number of columns in the two queries needs to be the same  and the corresponding
columns need to have compatible types.

Q 13 : What makes a query a grouped query?

When you use an aggregate function, a GROUP BY clause, or both.

Q 14 : What are the clauses that you can use to define multiple grouping sets in the
same query?

GROUPING SETS, CUBE, and ROLLUP.

Q 15  : What is the difference between PIVOT and UNPIVOT?

PIVOT rotates data from a state of rows to a state of columns; UNPIVOT rotates
the data from columns to rows.

Q 16 : Can you store indexes from the same full-text catalog to different filegroups?

Yes. A full-text catalog is only a virtual object; full-text indexes are physical objects. You can store each full-text index from the same catalog in a different filegroup.

Q 17 : How do you search for synonyms of a word with the CONTAINS predicate?

You have to use the CONTAINS(FTcolumn, 'FORMSOF(THESAURUS, SearchWord1)') syntax.

Q 18 : Can a table or column name contain spaces, apostrophes, and other nonstandard characters?

Yes, but such identifiers must be delimited, for example with square brackets ([Order Details]) or double quotes.

Q 19 : What types of table compression are available?

Page and row level compression.

Q 20 : How does SQL Server enforce uniqueness in both primary key and unique constraints?

SQL Server uses unique indexes to enforce uniqueness for both primary key
and unique constraints.

Q 21 : What type of data does an inline function return?

Inline functions return tables, and accordingly, are often referred to as inline
table-valued functions.

Q 22 : What is the difference between a view and an inline function ?

An inline table-valued function can be thought of as a parameterized view, that is, a view that accepts parameters.

Q 23 : What is the difference between SELECT INTO and INSERT SELECT?

SELECT INTO creates the target table and inserts into it the result of the query.
INSERT SELECT inserts the result of the query into an already existing table.
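A sketch of both forms (table names are illustrative):

```sql
-- SELECT INTO: creates dbo.EmpBackup and fills it in one statement
SELECT empid, lastname
INTO dbo.EmpBackup
FROM HR.Employees;

-- INSERT SELECT: dbo.EmpBackup must already exist
INSERT INTO dbo.EmpBackup (empid, lastname)
SELECT empid, lastname
FROM HR.Employees
WHERE empid > 100;
```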

Q 24: Can we update rows in more than one table in one UPDATE statement?

No, we can use columns from multiple tables as the source, but update only
one table at a time.

Q 25 : How many columns with an IDENTITY property are supported in one table? And How do you obtain a new value from a sequence?

One.

We use the NEXT VALUE FOR function for it.

Q 26 : What is the purpose of the ON clause in the MERGE statement?

The ON clause determines whether a source row is matched by a target row,
and whether a target row is matched by a source row. Based on the result of
the predicate, the MERGE statement knows which WHEN clause to activate and
as a result, which action to take against the target.

Q 27 : What are the possible actions in the WHEN MATCHED clause?

UPDATE and DELETE.

Q 28 : How many WHEN MATCHED clauses can a single MERGE statement have?

Two—one with an UPDATE action and one with a DELETE action.
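A minimal MERGE sketch tying the last few questions together (table and column names are hypothetical):

```sql
MERGE INTO dbo.Customers AS tgt
USING dbo.CustomersStage AS src
  ON tgt.custid = src.custid   -- the ON clause matches source and target rows
WHEN MATCHED THEN
  UPDATE SET tgt.custname = src.custname
WHEN NOT MATCHED THEN
  INSERT (custid, custname) VALUES (src.custid, src.custname);
```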

Q 29: Why is it important for SQL Server to maintain the ACID quality of
transactions?

To ensure that the integrity of database data will not be compromised.

Q 30 : How does SQL Server implement transaction durability?

By first writing all changes to the database transaction log before making changes permanently to the database data on disk.

Q 31 : How many ROLLBACKs must be executed in a nested transaction to roll it back?

Only one ROLLBACK. A ROLLBACK always rolls back the entire transaction, no
matter how many levels the transaction has.

Q 32 : How many COMMITs must be executed in a nested transaction to ensure that
the entire transaction is committed?

One COMMIT for each level of the nested transaction. Only the last COMMIT
actually commits the entire transaction.
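The behavior can be observed with @@TRANCOUNT:

```sql
BEGIN TRAN;      -- @@TRANCOUNT = 1
  BEGIN TRAN;    -- @@TRANCOUNT = 2
  COMMIT TRAN;   -- inner COMMIT only decrements the counter: @@TRANCOUNT = 1
COMMIT TRAN;     -- @@TRANCOUNT = 0: only now is the transaction really committed
-- A single ROLLBACK TRAN at any level would have undone the entire transaction
```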

Q 33 : Can readers block readers?

No, because shared locks are compatible with other shared locks.

Q 34: Can readers block writers?

Yes, even if only momentarily, because any exclusive lock request has to wait
until the shared lock is released.

Q 35 : If two transactions never block each other, can a deadlock between them
result?

No. In order to deadlock, each transaction must already have locked a resource the other transaction wants, resulting in mutual blocking.

Q 36 : Can a SELECT statement be involved in a deadlock?

Yes. If the SELECT statement locks some resource that keeps a second transaction
from finishing, and the SELECT cannot finish because it is blocked by the
same transaction, the deadlock results.

Q 37 : If your session is in the READ COMMITTED isolation level, is it possible for one of your queries to read uncommitted data?

Yes, if the query uses the WITH (NOLOCK) or WITH (READUNCOMMITTED) table hint, which ignores locks. The session's isolation level does not change; only the read characteristics for that table do.

Q 38 : Is there a way to prevent readers from blocking writers and still ensure that
readers only see committed data?

Yes, that is the purpose of the READ COMMITTED SNAPSHOT option within the
READ COMMITTED isolation level. Readers see earlier versions of data changes
for current transactions, not the currently uncommitted data.

Q 39 : What is the result of the parsing phase of query execution?

The result of this phase, if the query passed the syntax check, is a tree of logical
operators known as a parse tree.

Q 40 : How do we measure the amount of disk I/O a query is performing?

We use the SET STATISTICS IO command.

Q 41 : Which DMO gives you detailed text of queries executed?

You can retrieve the text of batches and queries executed from the sys.dm_exec_sql_text DMO.

Q 42 :What are the two types of parameters for a T-SQL stored procedure? 

A T-SQL stored procedure can have input and output parameters.

Q 43 : Can a stored procedure span multiple batches of T-SQL code? 

No, a stored procedure can only contain one batch of T-SQL code.

Q 44 : What are the two types of DML triggers that can be created?

You can create AFTER and INSTEAD OF DML-type triggers.

Q 45 : If an AFTER trigger discovers an error, how does it prevent the DML command from completing?

An AFTER trigger issues a THROW or RAISERROR command to cause the transaction
of the DML command to roll back.

Q 46 : What are the two types of table-valued UDFS? And What type of UDF returns only a single value?

You can create inline or multistatement table-valued UDFs. A scalar UDF returns only a single value.

Q 47 : What kind of clustering key would you select for an OLTP environment?

For an OLTP environment, a short, unique, and sequential clustering key might be
the best choice.

Q 48 : Which clauses of a query should you consider supporting with an index?

The list of the clauses you should consider supporting with an index includes, but
is not limited to, the WHERE, JOIN, GROUP BY, and ORDER BY clauses.

Q 49 : How would you quickly update statistics for the whole database after an upgrade?

We should use the sys.sp_updatestats system procedure .

Q 50  : What are the commands that are required to work with a cursor?

DECLARE, OPEN, FETCH in a loop, CLOSE, and DEALLOCATE.
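A minimal cursor skeleton showing those commands in order (table and column names are illustrative):

```sql
DECLARE @empid INT;

DECLARE emp_cursor CURSOR FAST_FORWARD FOR
  SELECT empid FROM HR.Employees;

OPEN emp_cursor;
FETCH NEXT FROM emp_cursor INTO @empid;

WHILE @@FETCH_STATUS = 0
BEGIN
  PRINT @empid;                             -- process the current row
  FETCH NEXT FROM emp_cursor INTO @empid;
END;

CLOSE emp_cursor;
DEALLOCATE emp_cursor;
```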

Q 51 : When using the FAST_FORWARD option in the cursor declaration command,
what does it mean regarding the cursor properties?

It means that the cursor is read-only, forward-only.

Q 52 : How would you determine whether SQL Server used the batch processing mode for a specific iterator?

You can check the iterator’s Actual Execution Mode property.

Q 53 : Would you prefer using plan guides instead of optimizer hints?

With plan guides, you do not need to change the query text.

Q 54 : Why is the relational model called a set-based model ?

The relational model is based on the concepts of mathematical set theory. A SQL query operates on tables (sets of rows) and returns its result as a set of rows.

Q 55 : Give an example of the iterative model.

The iterative model uses the same concept as loop iteration in high-level languages such as C or Python: it processes rows one at a time, row by row. Iterative constructs are comparatively slower. Example: cursors.

Q 56 : What does a fast-forward cursor mean ?

It means that the cursor moves from the first row to the last and cannot go backward.

Q 57 : What are the scopes of temporary tables ?

There are two types of temporary tables in SQL Server: local and global.

Local temp tables are visible to the level that created them, to all inner batches, and to all inner levels of the call stack.

Global temp tables are visible to all sessions; they are destroyed when the session that created them terminates and no other session is referencing them.

Table variables are named with an @ sign, such as @TV1. They are accessible only to the batch that created them; they are not visible to other batches at the same level, nor to inner levels.
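The naming conventions side by side (names are illustrative):

```sql
CREATE TABLE #LocalTemp   (id INT);  -- one #: visible to the creating level and inner levels
CREATE TABLE ##GlobalTemp (id INT);  -- two #: visible to all sessions

DECLARE @TV1 TABLE (id INT);         -- table variable: visible only to the current batch
```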

Q 58: What is the difference between temp table and table variable?

Temp tables are similar to regular database tables: any changes made to a temp table inside a transaction can be rolled back. Changes to a table variable, by contrast, are not affected by a rollback of the user transaction.

Another difference is performance: SQL Server maintains statistics (histograms) on temp tables, so we can inspect those statistics and tune performance, for instance by creating indexes on the columns used for filtering.

Table variables have no statistics, so the optimizer often makes poor estimates for them, which can lead to full table scans and decreased performance.

Q 59 : What does SET NOCOUNT ON do ?

SET NOCOUNT ON suppresses messages such as "2 rows affected". Putting SET NOCOUNT ON at the beginning of a stored procedure prevents it from returning that message to the client.
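A sketch of the typical placement (procedure and table names are hypothetical):

```sql
CREATE PROC dbo.UpdateNames
AS
BEGIN
  SET NOCOUNT ON;  -- suppress the "n rows affected" message to the client
  UPDATE dbo.EmpBackup SET lastname = UPPER(lastname);
END;
```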

Q 60 : What does the GOTO statement do ?

With GOTO you can jump to a particular label from wherever you are.

For instance :

PRINT 'First';
GOTO Label_1;
PRINT 'Second';
Label_1:
PRINT 'End now';

Q 61 : Can the stored procedure have multiple batches ?

No.

Q 62 : What does RETURN statement do ?

It exits the stored procedure and returns control to the caller.

Q 63: What does @@ROWCOUNT do?

It returns the number of rows read or affected by the last SQL statement.

Q 64: Can the AFTER triggers be nested ?

Yes, they can be nested, meaning that a trigger on table T1 can insert rows into table T2 that also has a trigger on it, and so on. The maximum trigger nesting level in SQL Server is 32.

Q 65 : What does SCHEMABINDING statement do?

WITH SCHEMABINDING means that a schema object (such as a view or function) is bound to the objects it references, which can be tables, views, or procedures. For instance, you cannot change the structure of a table that is schema-bound to a view until you drop that view.

Q 66 : Can we use multiple select statements in a view ?

No. You can use only one SELECT statement, because a view is required to return a single result set.

However, you can combine two or more SELECT statements with UNION, since the final result is still one result set.

Q 67 : Can data be modified in a table with a view ? If yes then what precautions are there?

Yes, we can modify data in a table through a view instead of modifying the table directly.

There are a few precautions and restrictions to take into account:

  • The DML statement must target only one table, even if the view references multiple tables.
  • The view column being modified must not be the result of an aggregate function, or come from a view that uses GROUP BY, DISTINCT, or HAVING.
  • We cannot modify through a view that uses TOP or OFFSET-FETCH together with the WITH CHECK OPTION.
  • A view column cannot be modified if it is derived from UNION, UNION ALL, INTERSECT, or cross joins.

Q 68 : What are partitioned views?

With the help of views, we can partition a large table on one server or across several servers. Simply union the partitioned tables and create a view over the union; this is called a partitioned view. If the tables are spread across multiple SQL Server instances, it is called a distributed partitioned view.

Q 69 : What is inline table valued function ? 

It is essentially a parameterized view: unlike a regular view, it can take parameters to filter the rows it returns.

Q 70 : What is identity?

IDENTITY is a property of a numeric column. We typically use an identity column to generate surrogate keys, which the system generates automatically when you insert data. It has two values: the seed, which is the first value, and the step, which is the increment. We define both of these at column definition time.

Q 71 : What is SET IDENTITY_INSERT?

We can supply our own values for an identity column on INSERT by running SET IDENTITY_INSERT tablename ON. But we cannot update an identity column's value.
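For example, assuming dbo.Employees has an identity column empid:

```sql
SET IDENTITY_INSERT dbo.Employees ON;
INSERT INTO dbo.Employees (empid, lastname) VALUES (999, 'Doe');
SET IDENTITY_INSERT dbo.Employees OFF;
```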

Q 72 : How can we find last Identity value?

SCOPE_IDENTITY returns the last identity value generated in the session within the current scope (batch, procedure, or function).

@@IDENTITY returns the last identity value generated in the session, regardless of scope.

IDENT_CURRENT takes a table name as input and returns the last identity value generated for that table, regardless of session.

Q 73 : What will happen if we use Scope_Identity , @@Identity and Ident_current in different sessions?

SCOPE_IDENTITY and @@IDENTITY will return NULL, but IDENT_CURRENT will return the last value of the identity column regardless of session.

Q 74 : What is sequence ?

A sequence is an independent object in SQL Server, quite like an identity column.

All numeric types are accepted by a sequence, as with identity; the default is BIGINT.

A sequence has a number of properties that identity does not have:

INCREMENT BY : The increment value. The default is 1.

MINVALUE : The minimum value for the type .

MAXVALUE :The maximum value to be given for the type .

CYCLE | NO CYCLE : Whether to allow the sequence to cycle when it reaches its boundary. The default is NO CYCLE.

START WITH : It is the sequence start value.
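A sketch combining these properties (the sequence name is illustrative):

```sql
CREATE SEQUENCE dbo.SeqOrderIDs AS INT
  START WITH 1
  INCREMENT BY 1
  MINVALUE 1
  MAXVALUE 1000000
  NO CYCLE
  CACHE 50;
```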

Q 75 : How can we request next Value in sequence  ?

To request a new value from a sequence, run the following code (the sequence name is illustrative).
SELECT NEXT VALUE FOR dbo.SeqOrderIDs;

Q 76 : Can we change the datatype of sequence ? And can we change properties and values ?

No we cannot change the datatype but yes we can change properties and values.

Q 77 : What is cache in sequence ?

The CACHE property controls how often the current sequence value is written to disk. For instance, CACHE 100 means the value is written to disk only every 100 values. With NO CACHE, every request for a new sequence value causes a disk write, so performance is better with caching.

Q78 : What is APPLY operator ? 

The APPLY operator works on two input tables, of which the second can be a table expression. We refer to them as the "left" and "right" tables; the right table is usually a derived table or an inline table-valued function.

The APPLY operator applies the right table expression to each row from the left table and produces a result table with the unified result sets.

We will discuss this in a separate post.

Q78 : What is the difference between Cross and outer apply?

The APPLY operator has two types;

CROSS APPLY doesn’t return left rows that get an empty set back from the right side.

The OUTER APPLY preserves the left side, and therefore, does return left rows when the right side returns an empty set. NULLs are used as placeholders from the right side in the outer rows if they are empty.
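A common pattern, returning the three most recent orders per customer (table and column names are hypothetical):

```sql
SELECT C.custid, A.orderid, A.orderdate
FROM dbo.Customers AS C
CROSS APPLY
  (SELECT TOP (3) O.orderid, O.orderdate
   FROM dbo.Orders AS O
   WHERE O.custid = C.custid
   ORDER BY O.orderdate DESC) AS A;
-- With OUTER APPLY, customers with no orders would also appear,
-- with NULLs in the columns coming from the right side
```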

Q79 : What is Transaction ?

A transaction is a unit of work that can include multiple activities, such as querying and modifying data, and can also change data definitions.

Q80 : What is the SET IMPLICIT_TRANSACTIONS statement ?

This option is OFF by default in SQL Server. When it is ON, you do not need to start a transaction with a BEGIN TRAN statement (SQL Server opens one implicitly), but you must end it explicitly with COMMIT TRAN or ROLLBACK TRAN.

Q 81 : Define 4 properties of a transaction ?

Atomicity : Either all changes in the transaction take place or none do. If the system fails before a COMMIT, or an error occurs, the transaction is rolled back.

Consistency : The state of the data the database gives the user access to. The isolation level is part of consistency. Consistency also refers to the integrity rules the database enforces, such as primary key, foreign key, and unique constraints.

Isolation : Isolation controls the level of access transactions have to data being changed by other transactions, ensuring the required level of consistency. SQL Server supports two models for handling isolation: locking and row versioning.

Durability : It means that committed changes are durable. Whenever data is changed, the change is first written to the database's transaction log before it is written to the data files on disk. Once the change is committed and written to the log, it is considered safe even if the system fails.

Q 82 : What is redo and undo ?

When the system restarts after a failure, SQL Server examines the log. Changes that were committed but not yet written to the data files are rolled forward (redo).

Changes that were not committed are rolled back to their previous state (undo) once the system is up.

Q 83 : What is a lock ?

A lock is a control resource obtained by a transaction to protect data, preventing changes to or access to that data by other transactions.

Q 84 : What is an Exclusive Lock ? 

When we want to modify data, the transaction requests an exclusive lock on the data resource. Once it acquires the exclusive lock, no other transaction can access or modify that data until the lock is released.

In single statement transaction , the lock is held until the statement completes.

In multi statement transaction , the lock is held until all the statements are executed and the transaction is ended by Commit tran or rollback Tran.

If any transaction holds any type of lock on a resource, no other transaction can acquire an exclusive lock on it; and while a transaction holds an exclusive lock on a resource, no other transaction can acquire any lock on it.

Q 85 : What is shared lock ?

When a transaction reads data, it acquires a shared lock on the data resource. Multiple transactions can hold shared locks on the same resource simultaneously.

Q 86 : What is row versioning ? What is READ COMMITTED SNAPSHOT Isolation level ?

In Azure SQL database Read Committed Snapshot is the default isolation level.

Instead of locking, this isolation level relies on row versioning, so a transaction does not wait to acquire a shared lock before reading.

In the READ COMMITTED SNAPSHOT isolation level, if one transaction modifies a row and another transaction tries to read it, the reader sees the last committed state of the row as of the start of the statement (optimistic concurrency).

This differs from the READ COMMITTED isolation level, where, if a transaction is modifying rows, another transaction cannot read those rows until the first transaction completes (pessimistic concurrency).

Q 87 : What are lockable resources ? 

Lockable resources include: RIDs and keys (rows), pages, extents, tables (objects), and the database itself.

Q 88 : What are higher levels of granularity ?

To obtain a lock on a resource, your transaction must first obtain intent locks of the same mode on higher levels of granularity. For example, to get an exclusive lock on a row, your transaction must first acquire an intent exclusive lock on the page where the row resides and an intent exclusive lock on the object that owns the page.

To get a shared lock on a certain level of granularity, your transaction first needs to acquire intent shared locks on higher levels of granularity.

The purpose of intent locks is to efficiently detect incompatible lock requests on higher levels of granularity and prevent the granting of those.

For instance , if a transaction holds a lock on a row and another asks for an incompatible lock mode on the whole page(higher level) or table where that row resides, it is easy for SQL Server to identify the conflict because of the intent locks that the first transaction acquired on the page and table. Intent locks do not interfere with requests for locks on lower levels of granularity. For example, an intent lock on a page doesn’t prevent other transactions from acquiring incompatible lock modes on rows within the page.

Q 89 : What is Blocking ?

When one transaction holds a lock on a data resource and another transaction requests an incompatible lock on the same resource, the requester is blocked and enters a wait state.

Q 90 : What is isolation level  READ UNCOMMITTED ?

It is the lowest available isolation level. In this isolation level, a reader doesn't ask for a shared lock. A reader that doesn't ask for a shared lock can never be in conflict with a writer that is holding an exclusive lock. This means that the reader can read uncommitted changes (also known as dirty reads). It also means that the reader won't interfere with a writer that asks for an exclusive lock. In other words, a writer can change data while a reader that is running under the READ UNCOMMITTED isolation level reads data.

Q 91 : What is the isolation level READ COMMITTED ?

It does not allow dirty reads of uncommitted data. It requires the reader to acquire a shared lock, which prevents reading uncommitted data.

This means that if a writer is holding an exclusive lock, the reader’s shared lock request will be in conflict with the writer, and it has to wait.

As soon as the writer commits the transaction, the reader can get its shared lock, and what it reads are necessarily only committed changes.

Q 92 : What is the isolation level REPEATABLE READ ?

Under REPEATABLE READ, not only does a reader need a shared lock to be able to read, but it also holds the lock until the end of the transaction. This means that as soon as the reader has acquired a shared lock on a data resource to read it, no one can obtain an exclusive lock to modify that resource until the reader ends the transaction.

In other words, if a transaction running under REPEATABLE READ holds a shared lock and has not yet committed, no other transaction can update that data until the first transaction ends.

Q 93 :  What is isolation level SERIALIZABLE ?

One problem with REPEATABLE READ is that another transaction can still insert new rows into the range the first transaction has read; this phenomenon is called phantom reads. To overcome it, we use SERIALIZABLE, which blocks other transactions from inserting such rows.
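Isolation levels are set per session, for example:

```sql
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRAN;
  SELECT COUNT(*) FROM HR.Employees WHERE country = N'USA';
  -- other sessions cannot insert rows that satisfy this filter
  -- until the transaction ends, preventing phantom reads
COMMIT TRAN;
```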

What is an object?

An object is an entity, for example: a car, a house, a person, time, etc. Objects can be tangible or intangible.

Let's consider a car (tangible) as our object. An object has attributes, behaviour, and a unique ID.

A car has attributes (color, model), behaviour (accelerate, brake, change gear), and a unique registration number.

Now, let's take an intangible object, for example time. Time can have attributes like year, month, and day; behaviours such as set the year, set the month, and set the day; and a unique identifier such as a date of birth, a date of joining college, or a date of creation.

Common Linux Commands Part 1:

ls – list current directory contents

pwd – prints the present working directory

cd – change directory

file – print the file type

less – prints the contents within a file

cp – copy files or folder

mkdir – make a directory

mv – rename or move files

rm – remove file or directory

cat – displays the contents of the file

wc – displays the number of lines, words and bytes in a file.

head – displays the first few lines of the file

tail – displays the last few lines of the file

clear – clears the screen

history – displays the content of the previously executed commands on the shell from history list

chgrp – change a file's group ownership

passwd – change user’s password

chmod – change the permission mode of the file

chown – change file owner / group

ps – displays details of all the current processes

jobs – displays a list of all the active jobs

ping – send echo request to the network hosts

traceroute – displays the route packets trace to network host

netstat – displays the network connections, routing tables, interface stats etc

wget – network downloader

ssh – the openSSH ssh client to connect to remote host

locate – find files by name

find – search for files in directories

gzip – compress or expand files

tar – archiving utility

zip – package and compress file

grep – searches for the text in a file according to the regular expression given

sort – sort lines of text files

uniq – omits the repeated lines

cut – removes the section from each line of the file

diff – compares files and shows their differences
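Several of these commands are most useful when combined in a pipeline. As a small sketch (using printf to stand in for a real file), the following counts how often each line occurs and lists the most frequent first:

```shell
# sort groups identical lines together, uniq -c prefixes each group
# with its count, and the final sort -rn orders counts high to low.
printf 'apple\nbanana\napple\ncherry\napple\n' | sort | uniq -c | sort -rn
```

The same pipeline works on any text file, for example `cut -d: -f7 /etc/passwd | sort | uniq -c` to count the login shells in use.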

How to get details of SQL Server agent Jobs from Queries

All SQL Server Agent job details are logged to tables in the msdb database.

These tables contain information about each SQL Server Agent job, such as job ID, session ID, run date, enabled status, owner, job steps, last run outcomes, etc. The commonly used tables are:

select * from msdb.[dbo].[sysjobactivity]
select * from msdb.[dbo].[sysjobhistory]
select * from msdb.[dbo].[sysjobs]
select * from msdb.[dbo].[sysjobschedules]
select * from msdb.[dbo].[sysjobservers]
select * from msdb.[dbo].[sysjobsteps]
select * from msdb.[dbo].[sysjobstepslogs]
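As a sketch of how these tables fit together (the columns below are real msdb columns, but the query itself is only one way to do this), the following joins sysjobs to sysjobhistory to show each job's most recent outcome. In sysjobhistory, step_id = 0 is the job-level outcome row and run_status encodes the result (0 = failed, 1 = succeeded, 2 = retry, 3 = canceled):

```sql
-- Latest job-level outcome for every SQL Server Agent job.
SELECT  j.name      AS job_name,
        j.enabled,
        h.run_date,                     -- stored as an integer, YYYYMMDD
        CASE h.run_status
            WHEN 0 THEN 'Failed'
            WHEN 1 THEN 'Succeeded'
            WHEN 2 THEN 'Retry'
            WHEN 3 THEN 'Canceled'
        END         AS last_outcome
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobhistory AS h
  ON h.job_id = j.job_id
WHERE h.step_id = 0
  AND h.instance_id = (SELECT MAX(h2.instance_id)   -- most recent history row
                       FROM msdb.dbo.sysjobhistory AS h2
                       WHERE h2.job_id = j.job_id
                         AND h2.step_id = 0);
```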

Msg 8120, Level 16, State 1, Line 1 Column ” is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

We will be using the table [HR].[Employees] from the TSQL2012 database.

The error as described in the Title of the post :

"Msg 8120, Level 16, State 1, Line 1 Column 'HR.Employees.lastname' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."

The error occurred when I executed the following query:

select min(empid) as minimum,lastname from [HR].[Employees] 

The reason for the error is that the GROUP BY clause is missing at the end. When we use an aggregate function along with other columns in the SELECT clause, we must use a GROUP BY clause.

select min(empid) as minimum,lastname from [HR].[Employees] group by lastname

If you use more than one non-aggregated column in the SELECT clause with an aggregate function, then all of those columns must be included in the GROUP BY clause. For instance, if you execute the query below, it will throw the same error:

select min(empid) as minimum,lastname,firstname from [HR].[Employees] group by lastname

So we must include the firstname column as well for the query to execute successfully:

select min(empid) as minimum,lastname,firstname from [HR].[Employees] group by lastname,firstname

Msg 8152, Level 16, State 4, Line 1 String or binary data would be truncated.

This error is very common when we insert or update data.

It occurs when we try to insert data that is longer than the column into which we are inserting it.

For Instance : Lets see an example in our TSQL2012 database :

insert into [Production].Categories(categoryname,description) values ('No chocolate items','no chocolate')

It gave me an error :

Msg 8152, Level 16, State 4, Line 1
String or binary data would be truncated.
The statement has been terminated.

This happened because the categoryname column has the type nvarchar(15), which means it can hold at most 15 characters, while the value we tried to insert, 'No chocolate items', is 18 characters long.

To solve this problem, we can either increase the column length to 18 or shorten the value to fit within 15 characters.
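One sketch of the first fix (assuming the Production.Categories definition from the TSQL2012 script, where categoryname is NOT NULL): widen the column with ALTER TABLE, then retry the insert:

```sql
-- Widen categoryname so the 18-character value fits.
-- Note: if an index exists on categoryname, it must be dropped
-- before the column can be altered, and recreated afterwards.
ALTER TABLE Production.Categories
    ALTER COLUMN categoryname NVARCHAR(18) NOT NULL;

insert into Production.Categories(categoryname, description)
values ('No chocolate items', 'no chocolate');
```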

MS SQL script for creating the TSQL2012 database, which we will use.

We will be working with the attached script throughout our posts. You can copy and paste the script into Microsoft SQL Server 2012, as we will be working with the tables, views, stored procedures, functions and other objects it creates.

Copy, paste and execute the whole script in a new query tab in SQL Server 2012 Management Studio to create the TSQL2012 database.

The script is attached below in the TSQL2012.sql file.

ETL (Extract, Transform and Load) – Part 1 – Introduction to Data Warehouse

Analysing data from relational line-of-business (LOB) applications is not easy. The normalized relational schema used by an LOB application can consist of hundreds of tables, and it is difficult to discover where the data you need for a report is located.
In addition, LOB applications do not track data over time (they perform many DML operations), though many analyses depend on historical data.

A data warehouse (DW) is a centralized database for an enterprise that contains merged, cleansed and historical data. DW data is better suited for reporting purposes. The logical schema design of a DW is called a Star schema or a Snowflake schema, and it consists of dimension and fact tables.

The data in a DW comes from LOB/relational databases. By the time it arrives in the DW, the data has been cleansed and transformed. One way to refresh the DW with new data is through a nightly scheduled job; reports can then be run against the refreshed data.

The physical design of a DW is simpler than a relational database design, as it involves fewer table joins.

————————————————————————————————————————————-

CHAPTER 1 : Introduction to Star and Snowflake Schemas

First, let's learn why reporting is a problem with a normalized schema:

Let's take the AdventureWorks2012 sample database. Suppose the report should include the sales amount for Internet sales in different countries over multiple years. This query will end up joining almost 10 tables. The AdventureWorks2012 database schema is highly normalized; it's intended as an example schema to support LOB applications.

The goal of normalization is to have a complete and non-redundant schema.
Every piece of information must be stored exactly once. This way, you can enforce data integrity.

So, a query that joins 10 to 12 tables, as would be required for reporting sales by country
and year, would not be very fast. It can cause performance issues, as it reads huge amounts of sales data over multiple years, and thus interferes with the regular transactional work of inserting and updating data.

Another problem is that, in many cases, LOB databases are purged, typically after each new fiscal year starts. Even if you have all of the historical data for the sales transactions, you might have a problem showing the historical data correctly. For example, you might have only the latest customer address, which might prevent you from calculating historical sales by country correctly.

The AdventureWorks2012 sample database stores all data in a single database. However, in an enterprise, you might have multiple LOB applications, each of which might store data in its own database. You might also have part of the sales data in one database and part in another.
And you could have customer data in both databases, without a common identification. In such cases, you face the problems of how to merge all this data and how to identify which customer from one database is actually the same as a customer from another database.

Finally, data quality can be low. The old rule, "garbage in, garbage out," applies to analyses as well. Parts of the data could be missing; other parts could be wrong. Even with good data, you could still have different representations of the same data in different databases. For example, gender in one database could be represented with the letters F and M, and in another database with the numbers 1 and 2. We can apply a proper standard in our data warehouse to solve this problem.

Star Schema

In the figure below, you can easily see how the Star schema got its name: it resembles
a star. There is a single central table, called a fact table, surrounded by multiple tables called dimensions. One Star schema covers a particular business area; in this case, the schema covers Internet sales. An enterprise data warehouse covers multiple business areas and consists of multiple Star and Snowflake schemas.

star

The fact table is connected to all the dimensions with foreign keys. Usually, all foreign keys taken together uniquely identify each row in the fact table, and thus they form a unique key, so you can use all the foreign keys as a composite primary key of the fact table. The fact table is on the "many" side of its relationships with the dimensions. If you were to form a proposition from a row in a fact table, you might express it with a sentence such as, "Customer CC purchased product BB on date DD in quantity QQ for amount SS."
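A minimal sketch of such a schema (all table and column names here are illustrative, not from a specific sample database): the fact table's foreign keys to the dimensions together form its composite primary key:

```sql
-- Dimension tables (simplified).
CREATE TABLE DimCustomer (CustomerKey INT PRIMARY KEY, CustomerName NVARCHAR(50));
CREATE TABLE DimProduct  (ProductKey  INT PRIMARY KEY, ProductName  NVARCHAR(50));
CREATE TABLE DimDate     (DateKey     INT PRIMARY KEY, FullDate     DATE);

-- Fact table: foreign keys to each dimension plus the measures.
CREATE TABLE FactInternetSales (
    CustomerKey  INT NOT NULL REFERENCES DimCustomer(CustomerKey),
    ProductKey   INT NOT NULL REFERENCES DimProduct(ProductKey),
    OrderDateKey INT NOT NULL REFERENCES DimDate(DateKey),
    Quantity     INT   NOT NULL,
    SalesAmount  MONEY NOT NULL,
    -- All the foreign keys together uniquely identify a row.
    CONSTRAINT PK_FactInternetSales
        PRIMARY KEY (CustomerKey, ProductKey, OrderDateKey)
);
```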

As we  know, a data warehouse consists of multiple Star schemas. From a business perspective, these Star schemas are connected. For example, you have the same customers in sales as in accounting. You deal with many of the same products in sales, inventory, and production. Of course, your business is performed at the same time over all the different business areas. To represent the business correctly, you must be able to connect the multiple Star schemas in your data warehouse. The connection is simple – you use the same dimensions for each Star schema. In fact, the dimensions should be shared among multiple Star schemas. Dimensions have foreign key relationships with multiple fact tables. Dimensions which have connections to multiple fact tables are called shared or conformed dimensions.

shared dim

Snowflake Schema

You can imagine multiple dimensions designed in a similar normalized way, with a central fact table connected by foreign keys to dimension tables, which are connected with foreign keys to lookup tables, which are connected with foreign keys to their second-level lookup tables.

In this configuration, a star starts to resemble a snowflake. Therefore, a Star schema with normalized dimensions is called a Snowflake schema.

shared dim

In most long-term projects, you should design Star schemas. Because the Star schema is
simpler than a Snowflake schema, it is also easier to maintain. Queries on a Star schema are simpler and faster than queries on a Snowflake schema, because they involve fewer joins.

Hybrid Schema

In some cases, you can also employ a hybrid approach, using a Snowflake schema only for the first level of a dimension lookup table. In this type of approach, there are no additional levels of lookup tables; the first-level lookup table is denormalized. Figure 1-6 shows such a partially denormalized schema.

Quick Question : 

How do we connect multiple star schemas ?

Answer : 

Through shared dimensions .

Granularity Level

The number of dimensions connected to a fact table, together with the level of detail of those dimensions, defines the granularity level of the fact table.

Auditing and Lineage

A data warehouse may also contain auditing tables. For every update to each dimension and fact table in your DW, you should audit when the update was done, who performed it, and how many rows were affected. If you also audit how much time each load took, you can track load performance and take action if it slows down. You store this information in an auditing table.

You might also need to know where each row in a dimension and/or fact table came from and when it was added. In such cases, you must add appropriate columns to the dimension and fact tables. Such fine detailed auditing information is also called lineage.

ETL (Extract, Transform and Load) – Part 2 – Designing Dimensions

Dimensions give context to measures. Typical analysis includes pivot tables and pivot graphs. These pivot on one or more dimension columns used for analysis—these columns are called attributes in DW and OLAP terminology.

Columns with unique values identify rows. These columns are keys. In a data warehouse,
you need keys just like you need them in an LOB database. Keys uniquely identify entities. Therefore, keys are the second type of columns in a dimension.

Pivoting makes no sense if an attribute's values are continuous, or if an attribute has too
many distinct values. Imagine how a pivot table would look if it had 1,000 columns, or how a pivot graph would look with 1,000 bars. For pivoting, discrete attributes with a small number of distinct values are most appropriate. A bar chart with more than 10 bars becomes difficult to comprehend. Continuous columns or columns with unique values, such as keys, are not appropriate for analyses.

If you have a continuous column and you would like to use it in analyses as a pivoting attribute, you should discretize it. Discretizing means grouping or binning values to a few discrete groups. If you are using OLAP cubes, SSAS can help you. SSAS can discretize continuous attributes. However, automatic discretization is usually worse than discretization from a business perspective. Age and income are typical attributes that should be discretized from a business perspective. One year makes a big difference when you are 15 years old, and much less when you are 55 years old. When you discretize age, you should use narrower ranges for younger people and wider ranges for older people (these are used for Graph type pivoting).

A customer typically has an address, a phone number, and an email address. You do not analyze data on these columns. You do not need them for pivoting. However, you often need information such as the customer’s address on a report. If that data is not present in a DW, you will need to get it from an LOB database, probably with a distributed query. It is much simpler to store this data in your data warehouse. In addition, queries that use this data perform better, because the queries do not have to include data from LOB databases. Columns used in reports as labels only, not for pivoting, are called member properties.

In addition to the types of dimension columns already defined for identifying, naming,
pivoting, and labeling on a report, you can have columns for lineage information.

A dimension may contain the following types of columns:

■■ Keys Used to identify entities
■■ Name columns Used for human names of entities
■■ Attributes Used for pivoting in analyses
■■ Member properties Used for labels in a report
■■ Lineage columns Used for auditing, and never exposed to end users

Hierarchies

Figure 1-9 shows the DimCustomer dimension of the AdventureWorksDW2012 sample database.

dimcustomer

In the figure, the following columns are attributes (columns used for pivoting):
■■ BirthDate (after calculating age and discretizing the age)
■■ MaritalStatus
■■ Gender
■■ YearlyIncome (after discretizing)
■■ TotalChildren
■■ NumberChildrenAtHome
■■ EnglishEducation (other education columns are for translations)
■■ EnglishOccupation (other occupation columns are for translations)
■■ HouseOwnerFlag
■■ NumberCarsOwned
■■ CommuteDistance

All these attributes are unrelated. Pivoting on MaritalStatus, for example, is unrelated to
pivoting on YearlyIncome. None of these columns have any functional dependency between them, and there is no natural drill-down path through these attributes. Now look at the DimDate columns, as shown in Figure 1-10.

dimdate

Some attributes of the DimDate dimension include the following (not in the order shown in the figure):
■■ FullDateAlternateKey (denotes a date in date format)
■■ EnglishMonthName
■■ CalendarQuarter
■■ CalendarSemester
■■ CalendarYear

You will immediately notice that these attributes are connected. There is a functional dependency among them, so they break third normal form. They form a hierarchy. Hierarchies are particularly useful for pivoting and OLAP analyses—they provide a natural drill-down path. You perform divide-and-conquer analyses through hierarchies.

Hierarchies have levels. When drilling down, you move from a parent level to a child level. For example, a calendar drill-down path in the DimDate dimension goes through the following levels: CalendarYear ➝ CalendarSemester ➝ CalendarQuarter ➝ EnglishMonthName ➝ FullDateAlternateKey

At each level, you have members. For example, the members of the month level are, of course, January, February, March, April, May, June, July, August, September, October, November, and December. In DW and OLAP jargon, rows on the leaf level—the actual dimension rows—are called members. This is why dimension columns used in reports for labels are called member properties.

In a Snowflake schema, lookup tables show you levels of hierarchies. In a Star schema, you need to extract natural hierarchies from the names and content of columns. Nevertheless, because drilling down through natural hierarchies is so useful and welcomed by end users, you should use them as much as possible.

Slowly Changing Dimensions

There is one common problem with dimensions in a data warehouse: the data in the dimension changes over time. This is usually not a problem in an OLTP application; when a piece of data changes, you just update it. However, in a DW, you have to maintain history. The question that arises is how to maintain it. Do you want to update only the changed data, as in an OLTP application, and pretend that the value was always the last value, or do you want to maintain both the first and intermediate values? This problem is known in DW jargon as the Slowly Changing Dimension (SCD) problem.

The problem is best explained in an example. Table 1-1 shows original source OLTP data
for a customer.

OLTP1

The customer lives in Vienna, Austria, and is a professional. Now imagine that the customer moves to Ljubljana, Slovenia. In an OLTP database, you would just update the City column, resulting in the values shown in Table 1-2.

OLTP2

If you create a report, all the historical sales for this customer are now attributed to the
city of Ljubljana, and (on a higher level) to Slovenia. The fact that this customer contributed to sales in Vienna and in Austria in the past would have disappeared.

In a DW, you can have the same data as in an OLTP database. You could use the same key,
such as the business key, for your Customer dimension. You could update the City column when you get a change notification from the OLTP system, and thus overwrite the history.

 Type 1 SCD :  Type 1 means overwriting the history for an attribute and for all higher levels of hierarchies to which that attribute belongs.

The problem arises when you want to keep the historical data. What if we want the transactions the same customer made while living in Vienna? The solution to this problem is to add a surrogate key (a DW key).

 Type 2 SCD : When you implement Type 2 SCD, for the sake of simpler querying, you typically also add a flag to denote which row is current for a dimension member. Alternatively, you could add two columns showing the interval of validity of a value.

SCD2
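A minimal Type 2 SCD sketch (column names are illustrative): the surrogate CustomerKey distinguishes versions of the same business-key customer, and a current-row flag plus a validity interval mark which version is active:

```sql
CREATE TABLE DimCustomer (
    CustomerKey INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate (DW) key
    CustomerId  INT          NOT NULL,          -- business key from the OLTP system
    City        NVARCHAR(50) NOT NULL,
    Occupation  NVARCHAR(50) NOT NULL,
    IsCurrent   BIT          NOT NULL DEFAULT 1,
    ValidFrom   DATE         NOT NULL,
    ValidTo     DATE         NULL               -- NULL while the row is current
);

-- When the customer moves, close the old row and insert a new version.
UPDATE DimCustomer
SET IsCurrent = 0, ValidTo = '2024-01-15'
WHERE CustomerId = 17 AND IsCurrent = 1;

INSERT INTO DimCustomer (CustomerId, City, Occupation, IsCurrent, ValidFrom)
VALUES (17, N'Ljubljana', N'Professional', 1, '2024-01-15');
```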

You could have a mixture of Type 1 and Type 2 changes in a single dimension. For example, in Table 1-3, you might want to maintain the history for the City column but overwrite the history for the Occupation column. That raises yet another issue. When you want to update the Occupation column, you may find that there are two (and maybe more) rows for the same customer. The question is, do you want to update the last row only, or all the rows? Table 1-4 shows a version that updates the last (current) row only, whereas Table 1-5 shows all of the rows being updated.

scd3

Fact Table Column Types

Fact tables are collections of measurements associated with a specific business process. You store measurements in columns. Logically, this type of column is called a measure. Measures are the essence of a fact table. They are usually numeric and can be aggregated. They store values that are of interest to the business, such as sales amount, order quantity, and discount amount.

All foreign keys together usually uniquely identify each row and can be used as a composite primary key.

For example, suppose you start building a sales fact table from an order details table in a source system, and then add foreign keys that pertain to the order as a whole from the Order Header table in the source system. Tables 1-7, 1-8, and 1-9 illustrate an example of such a design process.

Table 1-7 shows a simplified example of an Orders Header source table. The OrderId
column is the primary key for this table. The CustomerId column is a foreign key from the Customers table. The OrderDate column is not a foreign key in the source table; however, it becomes a foreign key in the DW fact table, for the relationship with the explicit date dimension. Note, however, that foreign keys in a fact table can—and usually are—replaced with DW surrogate keys of DW dimensions.

orderheader1

Table 1-8 shows the source Order Details table. The primary key of this table is a composite one and consists of the OrderId and LineItemId columns. In addition, the source Order Details table has the ProductId foreign key column. The Quantity column is the measure.

orderdetail1

Table 1-9 shows the Sales Fact table created from the Orders Header and Order Details
source tables. The Order Details table was the primary source for this fact table. The OrderId, LineItemId, and Quantity columns are simply transferred from the source Order Details table.
The ProductId column from the source Order Details table is replaced with a surrogate DW ProductKey column. The CustomerId and OrderDate columns come from the source Orders Header table; these columns pertain to orders, not order details. However, in the fact table, they are replaced with the surrogate DW keys CustomerKey and OrderDateKey.

salesfact

You do not need the OrderId and LineItemId columns in this sales fact table. For analyses, you could create a composite primary key from the CustomerKey, OrderDateKey, and ProductKey columns.

Additivity of Measures

Additivity of measures is not exactly a data warehouse design problem. However, you should consider which aggregate functions you will use in reports for which measures, and which aggregate functions you will use when aggregating over which dimension.
The simplest types of measures are those that can be aggregated with the SUM aggregate
function across all dimensions, such as amounts or quantities. For example, if sales for product A were $200.00 and sales for product B were $150.00, then the total of the sales was $350.00.
If yesterday’s sales were $100.00 and sales for the day before yesterday were $130.00, then the total sales amounted to $230.00. Measures that can be summarized across all dimensions are called additive measures.

Some measures are not additive over any dimension. Examples include prices and percentages, such as a discount percentage. Typically, you use the AVERAGE aggregate function for such measures, or you do not aggregate them at all. Such measures are called non-additive measures.

For some measures, you can use SUM aggregate functions over all dimensions but time.
Some examples include levels and balances. Such measures are called semi-additive measures.
For example, if customer A has $2,000.00 in a bank account, and customer B has $3,000.00, together they have $5,000.00. However, if customer A had $5,000.00 in an account yesterday but has only $2,000.00 today, then customer A obviously does not have $7,000.00 altogether. You should take care how you aggregate such measures in a report. For time measures, you can calculate average value or use the last value as the aggregate.

Quick Question : 

■■ You are designing an accounting system. Your measures are debit, credit, and
balance. What is the additivity of each measure?

Answer: Debit and credit are additive measures, while balance is a semi-additive measure.

Summary : 

■■ Fact tables include measures, foreign keys, and possibly an additional primary key and lineage columns.
■■ Measures can be additive (e.g., sales amounts and quantities), non-additive (e.g., prices or percentages), or semi-additive (e.g., account balances)


ETL (Extract, Transform and Load) – Part 3 – Introduction to Data Warehouse

Creating a Data Warehouse Database

A DW contains transformed line-of-business (LOB) data. You load data into your DW on a schedule, mostly in an overnight job. The DW data is not online, real-time data. You do not need to back up the transaction log for your data warehouse as you would for an LOB database; therefore, the recovery model for your data warehouse should be Simple.

■■ In the Full recovery model, all transactions are fully logged, with all associated data. You have to regularly back up the log. You can recover data to any arbitrary point in time. Point-in-time recovery is particularly useful when human errors occur.

■■ The Bulk Logged recovery model is an adjunct of the Full recovery model that permits
high-performance bulk copy operations. Bulk operations, such as index creation or
bulk loading of text or XML data, can be minimally logged. For such operations, SQL
Server can log only the Transact-SQL command, without all the associated data. You
still need to back up the transaction log regularly.

■■ In the Simple recovery model, SQL Server automatically reclaims log space for committed transactions. SQL Server keeps log space requirements small, essentially eliminating the need to manage the transaction log space.
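For example (the database name TSQL2012DW is illustrative), you would set the Simple recovery model like this:

```sql
-- Switch the data warehouse to the Simple recovery model.
ALTER DATABASE TSQL2012DW SET RECOVERY SIMPLE;

-- Verify the current recovery model.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'TSQL2012DW';
```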

In your data warehouse, large fact tables typically occupy most of the space. You can
optimize querying and managing large fact tables through partitioning. Table partitioning has management advantages and provides performance benefits. Queries often touch only subsets of partitions, and SQL Server can efficiently eliminate other partitions early in the query execution process.

A database can have multiple data files, grouped in multiple filegroups. There is no single best practice as to how many filegroups you should create for your data warehouse. However, for most DW scenarios, having one filegroup for each partition is the most appropriate. For the number of files in a filegroup, you should consider your disk storage. Generally, you should create one file per physical disk.

Loading data from source systems is often quite complex. To mitigate the complexity,
you can implement staging tables in your DW. You can even implement staging tables and other objects in a separate database. You use staging tables to temporarily store source data before cleansing it or merging it with data from other sources. In addition, staging tables also serve as an intermediate layer between DW and source tables. If something changes in the source—for example if a source database is upgraded—you have to change only the query that reads source data and loads it to staging tables. After that, your regular ETL process should work just as it did before the change in the source system. The part of a DW containing staging tables is called the data staging area (DSA).

Implementing Dimensions

Implementing a dimension involves creating a table that contains all the needed columns. In addition to business keys, you should add a surrogate key to all dimensions that need Type 2 Slowly Changing Dimension (SCD) management. You should also add a column that flags the current row or two date columns that mark the validity period of a row when you implement Type 2 SCD management for a dimension.

You can use simple sequential integers for surrogate keys. SQL Server can autonumber them for you. You can use the IDENTITY property to generate sequential numbers.

A sequence is a user-defined, table-independent (and therefore schema-bound) object.
SQL Server uses sequences to generate a sequence of numeric values according to your specification. You can generate sequences in ascending or descending order, using a defined interval of possible values. You can even generate sequences that cycle (repeat).

As mentioned, sequences are independent objects, not associated with tables. You control the relationship between sequences and tables in your ETL application. With sequences, you can coordinate the key values across multiple tables.
You should use sequences instead of identity columns in the following scenarios:
■■ When you need to determine the next number before making an insert into a table.

■■ When you want to share a single series of numbers between multiple tables.

■■ When you need to restart the number series when a specified number is reached (that is, when you need to cycle the sequence).
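A minimal sketch (object and table names are illustrative): create a sequence and use NEXT VALUE FOR to generate surrogate keys. Because the sequence is independent of any table, the same number series can feed several tables, and you can fetch the next number before inserting:

```sql
-- One number series, usable across multiple tables.
CREATE SEQUENCE dbo.SeqDwKey AS INT
    START WITH 1
    INCREMENT BY 1;

CREATE TABLE dbo.DimProduct (
    ProductKey INT NOT NULL
        DEFAULT (NEXT VALUE FOR dbo.SeqDwKey) PRIMARY KEY,
    ProductName NVARCHAR(50) NOT NULL
);

-- Determine the next number before making an insert.
DECLARE @NextKey INT = NEXT VALUE FOR dbo.SeqDwKey;
```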

Implementing Fact Tables

After you implement dimensions, you need to implement fact tables in your data warehouse. You should always implement fact tables after you implement your dimensions. A fact table is on the “many” side of a relationship with a dimension, so the parent side must exist if you want to create a foreign key constraint.

You should partition a large fact table for easier maintenance and better performance.

Columns in a fact table include foreign keys and measures. Dimensions in your database
define the foreign keys. All foreign keys together usually uniquely identify each row of a fact table.

In production, you can remove foreign key constraints to achieve better load performance.
If the foreign key constraints are present, SQL Server has to check them during the load. However, we recommend that you retain the foreign key constraints during the development and testing phases. It is easier to create database diagrams if you have foreign keys defined. In addition, during the tests, you will get errors if constraints are violated. Errors inform you that there is something wrong with your data; when a foreign key violation occurs, it’s most likely that the parent row from a dimension is missing for one or more rows in a fact table. These types of errors give you information about the quality of the data you are dealing with.
If you decide to remove foreign keys in production, you should create your ETL process
so that it's resilient when foreign key errors occur. In your ETL process, you should add a row to a dimension when an unknown key appears in a fact table. A row added to a dimension during a fact table load is called an inferred member. Except for the key values, all other column values for an inferred member row in a dimension are unknown at fact table load time, and you should set them to NULL. This means that dimension columns (except keys) should allow NULLs.

dw1

Columnstore Indexes and Batch Processing

A columnstore index is just another nonclustered index on a table. The Query Optimizer considers using it during the query optimization phase just as it does any other index.

A columnstore index is often compressed even further than any data compression type
can compress the row storage—including page and Unicode compression. When a query
references a single column that is a part of a columnstore index, then SQL Server fetches only that column from disk; it doesn’t fetch entire rows as with row storage. This also reduces disk IO and memory cache consumption. Columnstore indexes use their own compression algorithm; you cannot use row or page compression on a columnstore index.

On the other hand, SQL Server has to return rows. Therefore, rows must be reconstructed when you execute a query. This row reconstruction takes some time and uses some CPU and memory resources. Very selective queries that touch only a few rows might not benefit from columnstore indexes. Columnstore indexes accelerate data warehouse queries but are not suitable for OLTP workloads.
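As a sketch (the table and columns are illustrative), this is the SQL Server 2012 syntax for creating a nonclustered columnstore index over the columns that reporting queries typically touch; remember that in SQL Server 2012 the table becomes read-only while the index exists:

```sql
-- Column-oriented storage for fast scans and aggregations.
CREATE NONCLUSTERED COLUMNSTORE INDEX CSI_FactSales
ON dbo.FactSales (CustomerKey, ProductKey, OrderDateKey, Quantity, SalesAmount);
```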

The columnstore index is divided into units called segments. Segments are stored as large objects, and consist of multiple pages. A segment is the unit of transfer from disk to memory.
Each segment has metadata that stores the minimum and maximum value of each column for that segment. This enables early segment elimination in the storage engine. SQL Server loads only those segments requested by a query into memory.

SQL Server 2012 includes another important improvement for query processing. In batch mode processing, SQL Server processes data in batches rather than processing one row at a time. In SQL Server 2012, a batch represents roughly 1000 rows of data. Each column within a batch is stored as a vector in a separate memory area, meaning that batch mode processing is vector-based. Batch mode processing interrupts a processor with metadata only once per batch rather than once per row, as in row mode processing, which lowers the CPU burden substantially.

Loading and Auditing Loads

Loading large fact tables can be a problem. You have only a limited time window in which to do the load, so you need to optimize the load operation. In addition, you might be required to track the loads.

Using Partitions

Loading even very large fact tables is not a problem if you can perform incremental loads. However, this means that data in the source should never be updated or deleted; data should be inserted only. This is rarely the case with LOB applications. In addition, even if you have the possibility of performing an incremental load, you should have a parameterized ETL procedure in place so you can reload portions of data loaded already in earlier loads. There is always a possibility that something might go wrong in the source system, which means that you will have to reload historical data. This reloading will require you to delete part of the data from your data warehouse.

Deleting large portions of fact tables might consume too much time, unless you perform a minimally logged deletion. A minimally logged deletion operation can be done by using the TRUNCATE TABLE command; however, this command deletes all the data from a table—and deleting all the data is usually not acceptable. More commonly, you need to delete only portions of the data.

Inserting huge amounts of data could consume too much time as well. You can do a minimally logged insert, but as you already know, minimally logged inserts have some limitations. Among other limitations, a table must either be empty, have no indexes, or use a clustered index only on an ever-increasing (or ever-decreasing) key, so that all inserts occur on one end of the index. However, you would probably like to have some indexes on your fact table—at least a columnstore index. With a columnstore index, the situation is even worse—the table becomes read only.

You can resolve all of these problems by partitioning a table. You can even achieve better
query performance by using a partitioned table, because you can create partitions in different filegroups on different drives, thus parallelizing reads. You can also perform maintenance procedures on a subset of filegroups, and thus on a subset of partitions only. That way, you can also speed up regular maintenance tasks. Altogether, partitions have many benefits.

Although you can partition a table on any attribute, partitioning over dates is most common in data warehousing scenarios. You can use any time interval for a partition. Depending on your needs, the interval could be a day, a month, a year, or any other interval. You can have as many as 15,000 partitions per table in SQL Server 2012.

In addition to partitioning tables, you can also partition indexes. Partitioned table and
index concepts include the following:

■■ Partition function : This is an object that maps rows to partitions by using values
from specific columns. The columns used for the function are called partitioning columns. A partition function performs logical mapping.

■■ Partition scheme : A partition scheme maps partitions to filegroups. A partition
scheme performs physical mapping.

■■ Aligned index : This is an index built on the same partition scheme as its base table.
If all indexes are aligned with their base table, switching a partition is a metadata operation only, so it is very fast. Columnstore indexes have to be aligned with their base
tables. Nonaligned indexes are, of course, indexes that are partitioned differently than
their base tables.

■■ Partition elimination : This is a Query Optimizer process in which SQL Server accesses only those partitions needed to satisfy query filters.

■■ Partition switching  : This is a process that switches a block of data from one table or
partition to another table or partition. You switch the data by using the ALTER TABLE T-SQL command. You can perform the following types of switches:
■ Reassign all data from a nonpartitioned table to an empty existing partition of a partitioned table.
■ Switch a partition of one partitioned table to a partition of another partitioned table.
■ Reassign all data from a partition of a partitioned table to an existing empty nonpartitioned table.
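The partition function and partition scheme concepts above can be sketched as follows. The boundary values, filegroup names (FG1 through FG4), and the dbo.FactSales table are all hypothetical; with RANGE RIGHT, three boundary values produce four partitions, so four filegroups are mapped.

```sql
-- Sketch: monthly partitions on a DATE column.
-- Filegroups FG1..FG4 are assumed to exist already.
CREATE PARTITION FUNCTION PF_OrderDate (DATE)
AS RANGE RIGHT FOR VALUES ('20120101', '20120201', '20120301');

-- The scheme performs the physical mapping of partitions to filegroups.
CREATE PARTITION SCHEME PS_OrderDate
AS PARTITION PF_OrderDate TO (FG1, FG2, FG3, FG4);

-- The table is created on the scheme, with OrderDate as the
-- partitioning column.
CREATE TABLE dbo.FactSales
(
  OrderDate   DATE  NOT NULL,
  CustomerKey INT   NOT NULL,
  SalesAmount MONEY NOT NULL
)
ON PS_OrderDate (OrderDate);
```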

Any time you create a large partitioned table, you should create two auxiliary nonindexed empty tables with the same structure, including constraints and data compression options.

For minimally logged deletions of large portions of data, you can switch a partition from
the fact table to the empty table version without the check constraint. Then you can truncate that table. The TRUNCATE TABLE statement is minimally logged. Your first auxiliary table is prepared to accept the next partition from your fact table for the next minimally logged deletion.

For minimally logged inserts, you can bulk insert new data to the second auxiliary table,
the one that has the check constraint. In this case, the INSERT operation can be minimally
logged because the table is empty. Then you create a columnstore index on this auxiliary
table, using the same structure as the columnstore index on your fact table. Now you can
switch data from this auxiliary table to a partition of your fact table. Finally, you drop the columnstore index on the auxiliary table, and change the check constraint to guarantee that all of the data for the next load can be switched to the next empty partition of your fact table. Your second auxiliary table is prepared for new bulk loads again.
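The switching pattern described above might look like the following sketch. All table names, partition numbers, and the column list are hypothetical; switching additionally requires that the auxiliary tables share the fact table's structure and reside on the correct filegroup, and that the check constraint matches the target partition's range.

```sql
-- Minimally logged deletion: switch the oldest partition out to the
-- auxiliary table without the check constraint, then truncate it.
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Old;
TRUNCATE TABLE dbo.FactSales_Old;   -- minimally logged

-- Minimally logged insert: after bulk loading dbo.FactSales_New
-- (empty, with the matching check constraint), build the matching
-- columnstore index, then switch the data into the fact table.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales_New
ON dbo.FactSales_New (OrderDate, CustomerKey, SalesAmount);

ALTER TABLE dbo.FactSales_New SWITCH TO dbo.FactSales PARTITION 4;
```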

ETL (Extract, Transform and Load) – Part 4 – Introduction to Data Warehouse

Creating SSIS Packages

Data movement represents an important part of data management. Data is transported
from client applications to the data server to be stored, and transported back from
the database to the client to be managed and used. In data warehousing, data movement
represents a particularly important element, considering the typical requirements of a data warehouse: the need to import data from one or more operational data stores, the need to cleanse and consolidate the data, and the need to transform data, which allows it to be stored and maintained appropriately in the data warehouse.

Based on the level of complexity, data movement scenarios can be divided into two
groups:
■■ Simple data movements, where data is moved from the source to the destination “as-is” (unmodified)
■■ Complex data movements, where the data needs to be transformed before it can be
stored, and where additional programmatic logic is required to accommodate the
merging of the new and/or modified data, arriving from the source, with existing data already present at the destination.

What constitutes a complex data movement? Three distinct elements can be observed in
any complex data movement process:

1. The data is extracted from the source (retrieved from the operational data store).

2. The data is transformed (cleansed, converted, reorganized, and restructured) to comply with the destination data model.

3. The data is loaded into the destination data store (such as a data warehouse).

Planning a Simple Data Movement

To determine whether the Import and Export Wizard is the right tool for a particular data movement, ask yourself a few simple questions:
■■ Will the data need to be transformed before it can be stored at the destination?
If no transformations are required, then the Import and Export Wizard might be the
right tool for the job.

■■ Is it necessary to merge source data with existing data at the destination?
If no data exists at the destination (for example, because the destination itself does not
yet exist), then using the Import and Export Wizard should be the right choice.

Quick Questions :

1. What is the SQL Server Import and Export Wizard?

Answer :

The Import and Export Wizard is a utility that provides a simplified interface
for developing data movement operations where data is extracted from a
source and loaded into a destination, without the need for any transformations.

2. What is the principal difference between simple and complex data movements?

Answer :

In simple data movements, data is copied from one data store into another one
unmodified, whereas in complex data movements, data is modified (transformed)
before being loaded into the destination data store.

Learning SSIS : Introducing Control Flow, Data Flow, and Connection Managers

■■ Connection managers Provide connections to data stores, either as data sources or
data destinations. Because the same data store can play the role of the data source as
well as the data destination, connection managers allow the connection to be defined
once and used many times in the same package (or project).

■■ Control flow Defines both the order of operations and the conditions under which
they will be executed. A package can consist of one or more operations, represented
by control flow tasks. Execution order is defined by how individual tasks are connected
to one another. Tasks that do not follow any preceding task as well as tasks that follow
the same preceding task are executed in parallel.

■■ Data flow Encapsulates the data movement components—the ETL:

■ One or more source components, designating the data stores from which the data will be extracted.
■ One or more destination components, designating the data stores into which the
data will be loaded.
■ One or more (optional) transformation components, designating the transformations
through which the data will be passed.

Quick Questions : 

1. What is a control flow?

Answer : 

In SSIS packages, the control flow defines the tasks used in performing data
management operations; it determines the order in which these tasks are executed
and the conditions of their execution.

2. What is a data flow?

Answer : 

In SSIS packages, the data flow is a special control flow task used specifically in
data movement operations and data transformations.

Implementing Control Flow :

SSIS supports a variety of data stores (such as files, [relational] database management systems, SQL Server Analysis Services databases, web servers, FTP servers, mail servers, and web services).

ADO connection manager : The ADO connection manager enables connections to ActiveX Data Objects (ADO) and is provided mainly for backward compatibility.

ADO.NET connection manager : The ADO.NET connection manager enables connections to data stores using a Microsoft .NET provider. It is compatible with SQL Server.

Analysis Services connection manager: The Analysis Services connection manager provides access to SSAS databases.

Excel connection manager : As the name suggests, the Excel connection manager provides access to data in Microsoft Excel workbooks.

Flat File connection manager : This connection manager provides access to flat files—delimited or fixed-width text files (such as comma-separated values files).

Connection Manager Scope :

SQL Server Database Tools (SSDT) support two connection manager definition techniques,
providing two levels of availability:
■■ Package-scoped connection managers are only available in the context of the SSIS
package in which they were created and cannot be reused by other SSIS packages in
the same SSIS project.

■■ Project-scoped connection managers are available to all packages of the project in
which they were created.

Planning a Complex Data Movement :

Typically, the transformation could be any or all of the following:
■■ Data cleansing :  Unwanted or invalid pieces of data are discarded or replaced with
valid ones. Many diverse operations fit this description—anything from basic cleanup
(such as string trimming or replacing decimal commas with decimal points) to quite
elaborate parsing (such as extracting meaningful pieces of data by using Regular
Expressions).

■■ Data normalization : In this chapter, we would like to avoid what could grow into a
lengthy debate about what exactly constitutes a scalar value, so the simplest definition
of normalization would be the conversion of complex data types into primitive
data types (for example, extracting individual atomic values from an XML document or
atomic items from a delimited string).

■■ Data type conversion : The source might use a different type system than the
destination. Data type conversion provides type-level translation of individual values
from the source data type to the destination data type (for example, translating a .NET
Byte[] array into a SQL Server VARBINARY(MAX) value).

■■ Data translation :  The source might use different domains than the destination.
Translation provides a domain-level replacement of individual values of the source domain with an equivalent value from the destination domain (for example, the character “F” designating a person’s gender at the source is replaced with the string “female” representing the same at the destination).

■■ Data validation : This is the verification and/or application of business rules against
individual values (for example, “a person cannot weigh more than a ton”), tuples (for
example, “exactly two different persons constitute a married couple”), and/or sets (for
example, “exactly one person can be President of the United States at any given time”).
■■ Data calculation and data aggregation : In data warehousing, specifically, a common requirement is to not only load individual values representing different facts or
measures, but also to load values that have been calculated (or pre-aggregated) from
the original values (for example, “net price” and “tax” exist at the source, but “price
including tax” is expected at the destination).
■■ Data pivoting and data unpivoting:  Source data might need to be restructured or
reorganized in order to comply with the destination data model (for example, data in
the entity-attribute-value (EAV) model might need to be restructured into columns, or vice versa).
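Two of the transformation types above can be sketched in T-SQL, with hypothetical schema, table, and column names.

```sql
-- Data translation (sketch): replace source-domain codes with the
-- equivalent destination-domain values.
SELECT PersonID,
       CASE Gender
         WHEN 'F' THEN 'female'
         WHEN 'M' THEN 'male'
       END AS Gender
FROM src.Person;

-- Data validation (sketch): enforce the business rule
-- "a person cannot weigh more than a ton" at the destination.
ALTER TABLE dw.Person
ADD CONSTRAINT CK_Person_Weight CHECK (WeightKg <= 1000);
```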

Tasks

Data Preparation Tasks : These tasks, shown in Table 4-3, are used to prepare data sources for further processing; the preparation can be as simple as copying the source to the server, or as complex as profiling the data, determining its informational value, or even discovering what it actually is.

File System task:  This task provides operations on file system objects (files and folders), such as copying, moving, renaming, deleting objects, creating folders, and setting object attributes.

FTP task : This task provides operations on file system objects on a remote file store via the File Transfer Protocol (FTP), such as receiving, sending, and deleting files, as well
as creating and removing directories.

Web Service task :  This task provides access to web services; it invokes web service methods, receives the results, and stores them in an SSIS variable or writes them to a file
connection.

XML task : This task provides XML manipulation against XML files and XML data, such as
validation (against a Document Type Definition or an XML Schema), transformations
(using XSLT), and data retrieval (using XPath expressions). It also supports
more advanced methods, such as merging two XML documents.

Data Profiling task : This task can be used in determining data quality and in data cleansing. It can be useful in the discovery of properties of an unfamiliar data set.

Workflow Tasks :

These tasks, shown in Table , facilitate workflow, which is the structure of the process in
terms of its relationships with the environment and related processes; these tasks automate the interaction between individual SSIS processes and/or the interaction between SSIS processes and external processes (processes that exist outside SQL Server).

Execute Package task : This task executes other SSIS packages, thus allowing the distribution of programmatic logic across multiple SSIS packages, which in turn increases the reusability of individual SSIS packages and enables a more efficient division of labor within the SSIS development team.

Execute Process task :  This task executes external processes (that is, processes external to SQL Server). The Execute Process task can be used to start any kind of Windows application.

Message Queue task : This task is used to send and receive messages to and from Microsoft Message Queuing (MSMQ) queues on the local server.

Send Mail task  :  The task allows the sending of email messages from SSIS packages by using the Simple Mail Transfer Protocol (SMTP).

Expression task :  This task is used in the workflow to process variables and/or parameters and to assign the results to other variables used by the SSIS process.

CDC Control task :  This task controls the life cycle of SSIS packages that rely on the SQL Server 2012 Change Data Capture (CDC) functionality. It provides CDC information
from the data source to be used in CDC-dependent data flows.

Data Movement Tasks

Bulk Insert task : This task allows the loading of data from formatted text files into a SQL Server database table (or view); the data is loaded unmodified (because transformations are not supported), which means that the loading process is fast and efficient.
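In T-SQL, the equivalent operation is the BULK INSERT statement. This is a sketch; the file path, table name, and options are hypothetical, and the real terminators must match your source file's format.

```sql
-- Sketch: load a comma-separated file into a table without
-- transformations; TABLOCK helps enable minimal logging.
BULK INSERT dbo.FactSales
FROM 'C:\Loads\sales.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
```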

Execute SQL task : This task executes SQL statements or stored procedures against a supported data store. The task supports the following data providers: EXCEL, OLE DB, ODBC, ADO, ADO.NET, and SQLMOBILE, so keep this in mind when planning connection managers.

Data flow task : This task is essential to data movements, especially complex data movements, because it provides all the elements of ETL (extract-transform-load); the architecture of the data flow task allows all of the transformations to be performed in flight and in memory, without the need for temporary storage.

SQL Server Administration Tasks

Transfer Database task : Use this task to copy or move a database from one SQL Server instance to another or create a copy of it on the same server. It supports two modes of operation:
■■ In online mode, the database is transferred by using SQL Server Management Objects (SMO), allowing it to remain online for the duration of the transfer.
■■ In offline mode, the database is detached from the source instance, copied to the destination file store, and attached at the destination instance, which takes less time compared to the online mode, but for the entire duration the database is inaccessible.

Transfer Error Messages task :  Use this task to transfer user-defined error messages from one SQL Server instance to another; you can transfer all user-defined messages or specify individual ones.

Transfer Jobs task :  Use this task to transfer SQL Server Agent Jobs from one SQL Server instance to another; you can transfer all jobs or specify individual ones.

Transfer Logins task :  Use this task to transfer SQL Server logins from one SQL Server instance to another; you can transfer all logins, logins mapped to users of one or more
specified databases, or individual users.

SQL Server Maintenance Tasks

Back Up Database task :  Use this task in your maintenance plan to automate full, differential, or transaction log backups of one or more system and/or user databases.
Filegroup and file level backups are also supported.

Check Database Integrity task :  Use this task in your maintenance plan to automate data and index page integrity checks in one or more system and/or user databases.

Execute SQL Server Agent Job task : Use this task in your maintenance plan to automate the invocation of SQL Server Agent Jobs to be executed as part of the maintenance plan.

Execute T-SQL Statement task : Use this task in your maintenance plan to execute Transact-SQL scripts as part of the maintenance plan.

You should not confuse the very basic Execute T-SQL Statement Task with the more advanced Execute SQL Task described earlier in this lesson. The Execute T-SQL Statement Task only provides a very basic interface, which will allow you to select the connection manager and specify the statement to execute; parameters, for instance, are not supported in this task.

History Cleanup task : Use this task in your maintenance plan to automate the purging of historical data about backups and restore operations, as well as SQL Server Agent and maintenance plan operations on your SQL Server instance.

Maintenance Cleanup task : Use this task in your maintenance plan to automate the removal of files left over by maintenance plan executions; you can configure the task to remove old backup files or maintenance plan text reports.

Notify Operator task : Use this task in your maintenance plan to send email messages to SQL Server Agent operators.

Rebuild Index task : Use this task in your maintenance plan to automate index rebuilds for one or more databases and one or more objects (tables or indexed views).

Reorganize Index task : Use this task in your maintenance plan to automate index reorganizations for one or more databases and one or more objects (tables or indexed views).

Shrink Database task : Use this task in your maintenance plan to automate database shrink operations.

Update Statistics task : Use this task in your maintenance plan to automate updates of statistics for one or more databases and one or more objects (tables or indexed views).

The Script Task

This special task exposes the SSIS programming model via its .NET Framework implementation to provide extensibility to SSIS solutions. The Script task allows you to integrate custom data management operations with SSIS packages. Customizations can be provided by using any of the programming languages supported by the Microsoft Visual Studio Tools for Applications (VSTA) environment (such as Microsoft Visual C# 2010 or Microsoft Visual Basic 2010).
Typically, the Script task would be used to provide functionality that is not provided by
any of the standard built-in tasks, to integrate external solutions with the SSIS solution, or to provide access to external solutions and services through their application programming interfaces (APIs).

Containers

When real-world concepts are implemented in SSIS, the resulting operations can be composed of one or more tasks. To allow tasks that logically form a single unit to also behave as a single unit, SSIS introduces containers. Containers provide structure (for example, tasks that represent the same logical unit can be grouped in a single container, both for improved readability and manageability), encapsulation (for example, tasks enclosed in a loop container will be executed repeatedly as a single unit), and scope (for example, container-scoped resources can be accessed by the tasks placed in the same container, but not by tasks placed outside).

For Loop container:  This container executes the encapsulated tasks repeatedly, based on an expression—the looping continues while the result of the expression is true; it is based on the same concept as the For loop in most programming
languages.
Foreach Loop container : This container executes the encapsulated tasks repeatedly, once per item of the selected enumerator; it is based on the same iterative concept as the
For-Each loop in most contemporary programming languages. It is suited for executing a set of operations repeatedly based on an enumerable collection of items (such as files in a folder, a set of rows in a table, or an
array of items).

Sequence container : This container has no programmatic logic other than providing structure to encapsulate tasks that form a logical unit, to provide a scope for SSIS
variables accessible exclusively to a specific set of tasks, or to provide
a transaction scope to a set of tasks.

There are three precedence constraint types, all of them equivalent in defining sequences but different in defining the conditions of execution:
■■ A success constraint allows the following operation to begin executing when the preceding operation has completed successfully (without errors).
■■ A failure constraint allows the following operation to begin executing only if the preceding operation has completed unsuccessfully (with errors).
■■ A completion constraint allows the following operation to begin executing when the
preceding operation has completed, regardless of whether the execution was successful
or not.