What is SQL

SQL stands for : Structured Query Language.

Was designed to manage and maintain data in Relational Database Management Systems(RDBMS) which revolves around relational model and predicate logics.

Has three types of statements :

DDL (Data Definition Language)

Statements include : Create, Alter, Drop

DML (Data Manipulation Language)

Statements include : Select , Insert, Update, Truncate, Delete

DCL (Data Control Language)

Statements include : Grant, Revoke

What is a Database, Schema and Object

Consider a Database as a main box which contains inside it different boxes(schemas) and those boxes(schemas) contain different objects like (tables,views, stored procedures and functions etc)

When It is said that Objects are contained in the Schema ,so the best advantage of the Schema level is Security . You can assign the different users of a database , a certain amount of privileges/rights ;

For example:

– Allowing User A : To use only select statements .

– Allowing User B : To see Encrypted Stored Procedures allowing all other privileges.

– Allowing User C : Not giving him rights to delete, update or execute any stored procedure.

The Below picture illustrates the design of Database ,schema and object . A schema can contain multiple objects in it .

objct2

Interview Questions

Q 1 : What are types of system databases:

The following are the types of databases:

■ Master database holds instance-wide metadata information, server configuration, information about all databases in the instance, and initialization information. the system views contain information about system, hardware, indexes, columns , memory etc.

■ tempdb The tempdb database is where SQL Server stores temporary data such as work tables, sort space, row versioning information, and so on. SQL Server allows you to create temporary tables for your own use, and the physical location of those temporary tables is tempdb .

■ Resource The Resource database is a hidden, read-only database that holds the definitions of all system objects. When you query system objects in a database, they appear to reside in the sys schema of the local database, but in actuality their definitions reside in the Resource database. contains information that was
previously in the master database and was split out from the master database to make service pack upgrades easier to install.

■ Model The model database is used as a template for new databases. Every new database that you create is initially created as a copy of model. So if you want certain objects (such as data types) to appear in all new databases that you create, or certain database properties to be configured in a certain way in all new databases, you need to create those objects and configure those properties in the model database. Note that changes you apply to the model database will not affect existing databases . The contains information that was previously in the master database and was split out from the master database to make service pack upgrades easier to install.

■ msdb The msdb database is where a service called SQL Server Agent stores its data. SQL Server Agent is in charge of automation, which includes entities such as jobs, schedules, and alerts. The SQL Server Agent is also the service in charge of replication. The msdb database also holds information related to other SQL Server features such as Database Mail, Service Broker, backups, and more.

Q 2 : What is a Filegroup ?

The database is made up of data files and transaction log files . The data files hold object data, and the log files hold information that SQL Server needs to maintain transactions. Data files are organized in logical groups called filegroups.

A filegroup is the target for creating an object, such as a table or an index. The object data will be spread across the files that belong to the target filegroup. Filegroups are your way of controlling the physical locations of your objects.

A database must have at least one filegroup called PRIMARY, and can optionally have other user filegroups as well. The PRIMARY filegroup contains the primary data file (which has an .mdf extension) for the database, and the database’s system catalog. You can optionally add secondary data files (which have an .ndf extension) to PRIMARY. User filegroups contain only secondary data files. You can decide which filegroup is marked as the default filegroup.

This technique can support two main strategies:

■ Using multiple filegroups can increase performance by separating heavily used tables or indexes onto different disk subsystems.

■ Using multiple filegroups can organize the backup and recovery plan by containing static data in one filegroup and more active data in another filegroup. We can allocate or spread filegroups nd1,nd2…ndn to multiple drives to increase performance.

Q 3 : What is a Primary key?

A primary key constraint enforces uniqueness of rows and also disallows NULL marks in the constraint attributes.

Q 4 : What is a Unique Constraints ?

A unique constraint enforces the uniqueness of rows, allowing you to implement the concept of alternate keys from the relational model in your database. Unlike with primary keys, you can define multiple unique constraints within the same table. Also, a unique constraint is not restricted to columns defined as NOT NULL.

Q 5 : What is a Foreign Key Constraints ?

A foreign key enforces referential integrity. This constraint is defined on one or more attributes in what’s called the referencing table and points to candidate key (primary key or unique constraint) attributes in what’s called the referenced table . Note that NULL marks are allowed in the foreign key columns (mgrid in the last example) even if
there are no NULL marks in the referenced candidate key columns.

Q 6 : What is a Check Constraints ?

A check constraint allows you to define a predicate that a row must meet to be entered into the table or to be modified. For example, the following check constraint ensures that the salary column in the Employees table will support only positive values.
ALTER TABLE dbo.Employees
ADD CONSTRAINT CHK_Employees_salary
CHECK(salary > 0.00);

Q 7 : What is the difference between where and having clause ?

The WHERE clause is evaluated before rows are grouped, and therefore is evaluated
per row. The HAVING clause is evaluated after rows are grouped, and therefore
is evaluated per group.

Q 8 : Why it is not allowed to refer to a column alias defined by the SELECT
clause in the WHERE clause?

Because the WHERE clause is logically evaluated first in query processing than the SELECT clause.

Q 9 : What is the performance wise benefits of the WHERE clause ?

It reduces network traffic and if properly used with indexes can reduce the full table scanning.

Q 10 : What is the difference between self-contained and correlated subqueries?

Self-contained sub queries are independent of the outer query, whereas correlated
sub queries have a reference to an element from the table in the outer query.

Q 11 : What is the difference between the APPLY and JOIN operators?

With a JOIN operator, both inputs represent static relations. With APPLY, the
left side is a static relation, but the right side can be a table expression with
correlations to elements from the left table.

Q 12 : What are two requirements for the queries involved in a set operator ?

The number of columns in the two queries needs to be the same and the corresponding
columns need to have compatible types.

Q 13 : What makes a query a grouped query?

When you use an aggregate function, a GROUP BY clause, or both.

Q 14 : What are the clauses that you can use to define multiple grouping sets in the
same query?

GROUPING SETS, CUBE, and ROLLUP.

Q 15 : What is the difference between PIVOT and UNPIVOT?

PIVOT rotates data from a state of rows to a state of columns; UNPIVOT rotates
the data from columns to rows.

Q 16 : Can you store indexes from the same full-text catalog to different filegroups?

Yes. A full-text catalog is a virtual object only; full-text indexes are physical objects.
You can store each full-text index from the same catalog to a different file group.

Q 17 : How do you search for synonyms of a word with the CONTAINS predicate?

You have to use the CONTAINS(FTcolumn, ‘FORMSOF(THESAURUS, SearchWord1)’) syntax.

Q 18 : Can a table or column name contain spaces, apostrophes, and other nonstandard characters?

Yes

Q 19 : What types of table compression are available?

Page and row level compression.

Q 20 : How SQL Server enforce uniqueness in both primary key and unique constraints?

SQL Server uses unique indexes to enforce uniqueness for both primary key
and unique constraints.

Q 21 : What type of data does an inline function return?

Inline functions return tables, and accordingly, are often referred to as inline
table-valued functions.

Q 22 : What is difference between view and an inline function ?

An inline table-valued function can be said as a parameterized view—that is, a
view that takes parameters.

Q 23 : What is the difference between SELECT INTO and INSERT SELECT?

SELECT INTO creates the target table and inserts into it the result of the query.
INSERT SELECT inserts the result of the query into an already existing table.

Q 24: Can we update rows in more than one table in one UPDATE statement?

No, we can use columns from multiple tables as the source, but update only
one table at a time.

Q 25 : How many columns with an IDENTITY property are supported in one table? And How do you obtain a new value from a sequence?

One.

We use NEXT VALUE FOR function for it.

Q 26 : What is the purpose of the ON clause in the MERGE statement?

The ON clause determines whether a source row is matched by a target row,
and whether a target row is matched by a source row. Based on the result of
the predicate, the MERGE statement knows which WHEN clause to activate and
as a result, which action to take against the target.

Q 27 : What are the possible actions in the WHEN MATCHED clause?

UPDATE and DELETE.

Q 28 : How many WHEN MATCHED clauses can a single MERGE statement have?

Two—one with an UPDATE action and one with a DELETE action.

Q 29: Why is it important for SQL Server to maintain the ACID quality of
transactions?

To ensure that the integrity of database data will not be compromised.

Q 30 : How does SQL Server implement transaction durability?

By first writing all changes to the database transaction log before making changes permanently to the database data on disk.

Q 31 : How many ROLLBACKs must be executed in a nested transaction to roll it back?

Only one ROLLBACK. A ROLLBACK always rolls back the entire transaction, no
matter how many levels the transaction has.

Q 32 : How many COMMITs must be executed in a nested transaction to ensure that
the entire transaction is committed?

One COMMIT for each level of the nested transaction. Only the last COMMIT
actually commits the entire transaction.

Q 33 : Can readers block readers?

No because shared locks are compatible with other shared locks.

Q 34: Can readers block writers?

Yes, even if only momentarily, because any exclusive lock request has to wait
until the shared lock is released.

Q 35 : If two transactions never block each other, can a deadlock between them
result?

No. In order to deadlock, each transaction must already have locked a resource the other transaction wants, resulting in mutual blocking.

Q 36 : Can a SELECT statement be involved in a deadlock?

Yes. If the SELECT statement locks some resource that keeps a second transaction
from finishing, and the SELECT cannot finish because it is blocked by the
same transaction, the deadlock results.

Q 37 : If your session is in the READ COMMITTED isolation level, is it possible for one of your queries to read uncommitted data?

Yes, if the query uses the WITH (NOLOCK) or WITH (READUNCOMMITTED)
table hint where WITH (NOLOCK) ignoring the locks . The session value for the isolation level does not change, just the characteristics for reading that table.

Q 38 : Is there a way to prevent readers from blocking writers and still ensure that
readers only see committed data?

Yes, that is the purpose of the READ COMMITTED SNAPSHOT option within the
READ COMMITTED isolation level. Readers see earlier versions of data changes
for current transactions, not the currently uncommitted data.

Q 39 : What is the result of the parsing phase of query execution?

The result of this phase, if the query passed the syntax check, is a tree of logical
operators known as a parse tree.

Q 40 : How we measure the amount of disk I/O a query is performing?

We use the SET STATISTICS IO command.

Q 41 : Which DMO gives you detailed text of queries executed?

You can retrieve the text of batches and queries executed from the

sys.dm_exec_sql_text DMO.

Q 42 :What are the two types of parameters for a T-SQL stored procedure?

A T-SQL stored procedure can have only an input and a output parameter.

Q 43 : Can a stored procedure span multiple batches of T-SQL code?

No, a stored procedure can only contain one batch of T-SQL code.

Q 44 : What are the two types of DML triggers that can be created?

You can create AFTER and INSTEAD OF DML-type triggers.

Q 45 : If an AFTER trigger discovers an error, how does it prevent the DML command from completing?

An AFTER trigger issues a THROW or RAISERROR command to cause the transaction
of the DML command to roll back.

Q 46 : What are the two types of table-valued UDFS? And What type of UDF returns only a single value?

You can create inline or multistatement table-valued UDFs. And scalar UDF returns only a single value.

Q 47 : What kind of clustering key would you select for an OLTP environment?

For an OLTP environment, a short, unique, and sequential clustering key might be
the best choice.

Q 48 : Which clauses of a query should you consider supporting with an index?

The list of the clauses you should consider supporting with an index includes, but
is not limited to, the WHERE, JOIN, GROUP BY, and ORDER BY clauses.

Q 49 : How would you quickly update statistics for the whole database after an upgrade?

We should use the sys.sp_updatestats system procedure .

Q 50 : What are the commands that are required to work with a cursor?

DECLARE, OPEN, FETCH in a loop, CLOSE, and DEALLOCATE.

Q 51 : When using the FAST_FORWARD option in the cursor declaration command,
what does it mean regarding the cursor properties?

It means that the cursor is read-only, forward-only.

Q 52 : How would you determine whether SQL Server used the batch processing mode for a specific iterator?

You can check the iterator’s Actual Execution Mode property.

Q 53 : Would you prefer using plan guides instead of optimizer hints?

With plan guides, you do not need to change the query text.

Q 54 : Why relational model is called set based model ?

Relational model means that it is based on the concepts of mathematical set theory. SQL queries that query on the SQL tables outputs the rows in the form of sets of rows.

Q 55 : Give example of iterative model?

Iterative model means the same concept which is used in loop iteration in high level languages such as C or python . In the same way , the iterative model works on rows as they go row by row. They are by comparison slower in performance .Example : Cursors.

Q 56 : What does fast forward cursor means ?

Means that cursor will start from the initiating point to the last element and will not go backward.

Q 57 : What are scopes of temporary tables ?

There are two types of temporary tables in SQL : local and global.

Local temp tables are visible to the level that created , across the all inner batches and to the all inner levels of call stack.

Global temp tables are where destroyed when the session that created it terminates or destroyed.

Table variables are named with @ sign : @TV1. They are only accessible to batch that created it .They are not visible to across batches at same level and not even to inner levels .

Q 58: What is the difference between temp table and table variable?

Temp tables are similar to regular database tables . Any data changes in temp tables during transaction can be rolled back . But changes in table variable in a transaction can not be rolled back.

Another difference is performance wise , for temp tables SQL maintains histograms . Means we can see their statistics and work on them to improve performance for instance : we can create indexes on columns for filtering out the data properly.

In case of table variables , it performs the full table scan which decreases the performance.

Q 59 : What does SET NOCOUNT ON do ?

NOCOUNT ON can remove the messages like : 2 rows affected returned .Putting
a SET NOCOUNT ON at the beginning of stored procedure prevents from returning that message to the client.

Q 60 : What does GOTO statement does ?

With GOTO you can jump to the particular label to from where you are at!

For instance :

PRINT ‘First’;
GOTO Label_1;
PRINT ‘Second’;
Label_1:
PRINT ‘End now’;

Q 61 : Can the stored procedure have multiple batches ?

No.

Q 62 : What does RETURN statement do ?

It exits the stored procedure and returns to caller statement or procedure .

Q 63: What does @@Rowcount does?

It counts the number of rows read or affected by the SQL statement .

Q 64: Can the AFTER triggers be nested ?

Yes they can be nested , means that trigger on a table T1 can have a trigger and that inserts rows into table T2 that is also having a trigger on it and so on .
The number of maximum nested triggers SQL can have is 32.

Q 65 : What does SCHEMABINDING statement do?

WITH SCHEMABINDING means that a schema object is dependent or have a some sort of bound with other object . Object can be a table ,view or procedure. For instance : You cannot change the table structure if it is schemabinded with another view until you drop that view.

Q 66 : Can we use multiple select statements in a view ?

No . Only you can use one select statement as it is required that a view returns one result set.

But , you can use union statement with two or more select statements as the final result would be one result set.

Q 67 : Can data be modified in a table with a view ? If yes then what precautions are there?

Yes we can modify data in a table through a view instead of directly modifying through the table.

There are few precautions and restriction that should be taken carefully :

The DML statement must use or point to one table only even if the view is made up of or referencing multiple tables.
The view column to be modified should not have a aggregate function on it in the table or a view whose column is resulted from GROUP BY , DISTINCT or HAVING .
We cannot modify a view if it is having TOP or offset fetch with the WITH CHECK OPTION.
The data in the view column cannot be modified if it is made up from Union, union All , intersect or cross joins.

Q 68 : What are partitioned views?

We can partition large table in SQL with help of views on one or across several servers. Simply you can use union on partitioned tables and create a view for it. It is called partitioned view. If the table is spread across multiple SQL instances then it is called distributed partitioned view.

Q 69 : What is inline table valued function ?

It is also a type of view called parameterized view . The difference is that only it can take parameter to filter the rows from the view.

Q 70 : What is identity?

It is a type of property for a column having a type numeric. We typically use the identity column to generate surrogate keys which are mostly generated by system automatically when you insert the data. It has two values in it . First is seed value which is the first value and second is step value which is incremental value. We define both of these values at definition.

Q 71 : What is SET IDENTITY_INSERT?

We can specify our own values for identity column for insert by SET IDENTITY_INSERT = ON . But we cannot update identity column value.

Q 72 : How can we find last Identity value?

The SCOPE_IDENTITY returns last identity value which is generated in the session in the current scope (batch, procedure or function) .

The @@IDENTITY returns the last identity value generated in the session inspite what is the scope.

The IDENT_CURRENT takes table as input and returns the last identity value being generated in the given table .

Q 73 : What will happen if we use Scope_Identity , @@Identity and Ident_current in different sessions?

Scope_Identity and @@Identity will return NULL but Ident_current will return the last value of the identity column whatever the session is.

Q 74 : What is sequence ?

Sequence is an independent object in SQL Server . It is quite like the identity column .

All numeric types are accepted by sequence as like the identity does. But By default is BIGINT.

There are number of properties which identity does not have :

■ INCREMENT BY : It Increments the value. The default value is 1.

■ MINVALUE : The minimum value for the type .

■ MAXVALUE :The maximum value to be given for the type .

■ CYCLE | NO CYCLE : It deals if to allow the sequence to cycle or not. The default value is NO CYCLE.

■ START WITH : It is the sequence start value.

Q 75 : How can we request next Value in sequence ?

To request new value from the sequence, run the following code .
SELECT NEXT VALUE FOR tableName;

Q 76 : Can we change the datatype of sequence ? And can we change properties and values ?

No we cannot change the datatype but yes we can change properties and values.

Q 77 : What is cache in sequence ?

This property means writing the sequence value to disk . For instance a CACHE with value 100 means it will write to disk after every 100 . Performance wise using NO CACHE has to write to disk each request of new sequence value. With CACHING performance is good.

Q78 : What is APPLY operator ?

The APPLY operator works on two input tables in which the second can be a table expression.
And we will refer to them as the “left” and “right” tables and the right table is usually a derived table or an inline table valued function.

The APPLY operator applies the right table expression to each row from the left table and produces a result table with the unified result sets.

We will discuss this in separate post.

Q78 : What is the difference between Cross and outer apply?

The APPLY operator has two types;

CROSS APPLY doesn’t return left rows that get an empty set back from the right side.

The OUTER APPLY preserves the left side, and therefore, does return left rows when the right side returns an empty set. NULLs are used as placeholders from the right side in the outer rows if they are empty.

Q79 : What is Transaction ?

Transaction is a unit of work that has many activities like querying a data and changing multiple data definition statement or queries.

Q80 : What is implicit_transactions? Statement ?

This statement is OFF by default in SQL . If you do not start your transaction with BEGIN statement it is fine but you have to specify COMMIT OR ROLLBACK TRAN to end the transaction.

Q 81 : Define 4 properties of a transaction ?

Atomicity: Either all changes in the transaction takes place or None. If a system fails before a COMMIT in a transaction or there is an error , the transaction is being rolled back.

Consistency : It is the state of data the database gives access to the user .The isolation level is also a part of consistency. Consistency also refers to the integrity rules it follows like (Primary Key , Foreign Key and Unique constraints ) etc .

Isolation : Isolation level means to handle and control the access level of data and ensure data is in the level of consistency . There are two types of isolation levels : Locking and row versioning .

Durability : It means that the data is durable . Whenever , the data is changed it is first logged into the Logs of the database before it can be written to disk . Once the data is Committed and written to logs it is considered safe even the system fails.

Q 82 : What is redo and undo ?

If the system fails , when it gets started it checks the logs and replays the data and checks for committed data , if its committed it rolls forward and Redo .

If the data is uncommitted it gets rolled back to previous state once the system is up .

Q 83 : What is a lock ?

Lock is a control resource obtained by the transaction to protect data and prevent data changes or access by another transaction .

Q 84 : What is an Exclusive Lock ?

When we want to modify a data , the transaction requests an exclusive lock on the data and its resources. If it acquires the exclusive lock it will not allow any other transaction to access or modify data until that transaction releases the lock .

In single statement transaction , the lock is held until the statement completes.

In multi statement transaction , the lock is held until all the statements are executed and the transaction is ended by Commit tran or rollback Tran.

If a transaction is holding any type of lock on a resource then it can not acquire Exclusive lock and no lock can acquire any resource if a transaction is having an Exclusive lock.

Q 85 : What is shared lock ?

When a transaction reads a data , the transaction acquires Shared lock on it and its resources . Multiple transactions can acquire shared locks simultaneously on a same resource.

Q 86 : What is row versioning ? What is READ COMMITTED SNAPSHOT Isolation level ?

In Azure SQL database Read Committed Snapshot is the default isolation level.

Instead of Locking technology , this isolation level works on row versioning , so the transaction does not wait for acquiring shared lock for reading .

In this READ COMMITTED SNAPSHOT isolation level , if the transaction modifies a row and another transaction tries to read the row , it will read the last committed state of the row that was available before the start of the statement (optimistic Concurrency) .

This case is very different in READ COMMITTED isolation level , if the transaction is modifying the row , another transaction cannot read the same rows of data until the first transaction completes (Pessimistic Concurrency ) .

Q 87 : What are lockable resources ?

Lockable resources are : RID, rows , tables , pages and database tables.

Q 88 : What is higher level of Granularity ?

To obtain lock on a resource , your transaction must first obtain intent locks of the same mode on higher levels of granularity . For example, to get an exclusive lock on a row, your transaction must first acquire an intent exclusive lock on the page where the row resides and an intent exclusive lock on the object that owns the page.

To get a shared lock on a certain level of granularity, your transaction first needs to acquire intent shared locks on higher levels of granularity.

The purpose of intent locks is to efficiently detect incompatible lock requests on higher levels of granularity and prevent the granting of those.

For instance , if a transaction holds a lock on a row and another asks for an incompatible lock mode on the whole page(higher level) or table where that row resides, it is easy for SQL Server to identify the conflict because of the intent locks that the first transaction acquired on the page and table. Intent locks do not interfere with requests for locks on lower levels of granularity. For example, an intent lock on a page doesn’t prevent other transactions from acquiring incompatible lock modes on rows within the page.

Q 89 : What is Blocking ?

When one transaction holds a data resource and another requests for the same resource , so requester gets blocked and enters into the wait state.

Q 90 : What is isolation level READ UNCOMMITTED ?

It the lowest available isolation level. In this isolation level, a reader doesn’t
ask for a shared lock. A reader that doesn’t ask for a shared lock can never be in conflict with a writer that is holding an exclusive lock. This means that the reader can read uncommitted changes (also known as dirty reads). It also means that the reader won’t interfere with a writer that asks for an exclusive lock. In other words, a writer can change data while a reader that is running under the READ UNCOMMITTED isolation level reads data.

Q 91 : What is the isolation level READ COMMITED ?

It does not allow dirty reads or uncommitted data . It allows the reader to acquire shared lock to prevent reading the uncommitted data.

This means that if a writer is holding an exclusive lock, the reader’s shared lock request will be in conflict with the writer, and it has to wait.

As soon as the writer commits the transaction, the reader can get its shared lock, but what it reads are necessarily only committed changes.

Q 92 : What is the isolation level REPEATABLE READ ?

It not only does a reader need a shared lock to be able to read, but it also holds the lock until the end of the transaction. This means that as soon as the reader has acquired a shared lock on a data resource to read it, no one can obtain an exclusive lock to modify that resource until the reader ends the transaction.

If a transaction is having a shared lock and it is not committed and it is in repeatable read isolation level, then it will not allow any other transaction to update the data of the first transaction until the first transaction is not committed.

Q 93 : What is isolation level SERIALIZABLE ?

One problem in REPEATABLE READ is that what if a another transaction enters a new row or rows into the table ? this phenomenon is called Phantom Reads . To overcome this problem , we use the SERIALIZABLE . It blocks the other transaction to enter the new rows.

Git

The Three States

Git has three main states that your files can reside in: committed, modified, and staged.

Committed means that the data is safely stored in your local database.

Modified means that you have changed the file but have not committed it to your database yet.

Staged means that you have marked a modified file in its current version to go into your next commit snapshot.

The Git directory is where Git stores the metadata and object database for your project. This is the most important part of Git, and it is what is copied when you clone a repository from another computer.

The working directory is a single checkout of one version of the project. These files are pulled out of the compressed database in the Git directory and placed on disk for you to use or modify.

The staging area is a file, generally contained in your Git directory, that stores information about what will go into your next commit.

The basic Git workflow goes something like this:

You modify files in your working directory.
You stage the files, adding snapshots of them to your staging area.
You do a commit, which takes the files as they are in the staging area and stores that snapshot permanently to your Git directory.

If a particular version of a file is in the Git directory, it’s considered committed. If it’s modified but has been added to the staging area, it is staged. And if it was changed since it was checked out but has not been staged, it is modified.

Initializing a Repository in an Existing Directory

$ git init

This creates a new subdirectory named .git that contains all your necessary repository files—a Git repository skeleton. At this point, nothing in your project is tracked yet.

Cloning an Existing Repository

If you want to get a copy of an existing Git repository—for example, a project you’d like to contribute to—the command you need is git clone. If you’re familiar with other VCS systems such as Subversion, you’ll notice that the command is “clone” and not “checkout.” This is an important distinction—instead of getting just a working copy, Git receives a full copy of nearly all data that the server has. Every version of every file for the history of the project is pulled down by default when you run git clone. In fact, if your server disk gets corrupted, you can often use nearly any of the clones on any client to set the server back to the state it was in when it was cloned

You clone a repository with git clone [url]. For example, if you want to clone the Git linkable library called libgit2, you can do so like this:

$ git clone https://github.com/libgit2/libgit2

That creates a directory named libgit2, initializes a .git directory inside it, pulls down all the data for that repository, and checks out a working copy of the latest version. If you go into the new libgit2 directory, you’ll see the project files in there, ready to be worked on or used.

Recording Changes to the Repository

You have a bona fide Git repository and a checkout or working copy of the files for that project. You need to make some changes and commit snapshots of those changes into your repository each time the project reaches a state you want to record.

Remember that each file in your working directory can be in one of two states: tracked or untracked. Tracked files are files that were in the last snapshot; they can be unmodified, modified, or staged. Untracked files are everything else—any files in your working directory that were not in your last snapshot and are not in your staging area. When you first clone a repository, all your files will be tracked and unmodified because you just checked them out and haven’t edited anything.

As you edit files, Git sees them as modified, because you’ve changed them since your last commit.

Data warehouse Definitions

What is an object?

An object is an Entity, for example: car, house , person, time etc. Object can be tangible or intangible.

Lets consider a car (tangible) which is our object . An object has an attribute, behaviour and has a unique Id.

Car has attribute (color, model) , behaviour ( accelerate, brake the car, change gear) and has a unique registration number.

Now , lets take an intangible object example: Time. A time can have attributes like: Year, month, day. A time can have its behaviour set, like set the Year, set the Month, set the day and it can have a unique identifier : Date of birth, date of joining the college, date of creation.

Information hiding:

Some characteristics of object are:

Information is stored within an object.

It is hidden from outside world.

It can only be manipulated by the object itself.

Example: An object called Alex who is a person and he is our object model, the outside world or any other person cannot know his name by seeing him or interacting with him. We cannot also know his exact age until he lets us know. We can get his name and age only when he wants us to let us know. So, we cannot access his name and age directly but we can ask him to tell his name.

Another example is contact numbers stored on a mobile phone. A mobile phone is an object. It is hiding phone numbers inside it. If i want to get a number starting from letter ‘S’ then i have to ask or access the phone to get me the list of numbers starting from ‘S’.

So information hiding simplifies the process. An object Alex enquires from object phone to tell him the numbers starting from ‘S’. So in short, Alex’s information like age or name is hidden until he tells you or he gives access to his details and you cannot access the phone numbers until you get access to the phone itself and perform the operations.

2. Encapsulation:

State and behaviour of an object are tightly coupled. You can perceive an object whose name is Alex and age is 36. You do not know how he has stored that information in his brain, when you ask him , he performs that operation and then lets you know his name and age. You do not know in which part of the brain he has that information and how he got it.

A phone has stored a phone number in digital format and knows how to convert in human readable format. A user does know how data is stored and how it is converted to human readable format.

3. Object has an interface:

An object encapsulates data and behaviour. So how objects interact with each other ? Each object provides an interface (operations).

So lets say , an object whose name is Alex. You ask Alex in english , what is your name ? So he will tell you his name. If you ask him any other thing like where are you going or what is your plan , you will never get his name. So in simple terms the only interface which Alex has allowed to get exposed to you or outer world is by asking what is your name. Any other way you try to get his name , you will not get it.

Another example can be: an object car , whose interface can be : Steer wheels, change gear, accelerate, apply brakes.

If you want to speed up the car you have to accelerate. If you want to change the direction of the car you will use its interface steer wheels, if you have to stop the car you have to apply the brakes interface, you cannot stop the car if you try to use any other interface!

Abstraction:

Abstraction is a way to cope with complexity and reduce the complexity .

Principle of abstraction: Capture those details about an object that are relevant to that perspective.

For example: Alex is a student of PhD in a school and teaches some undergrad student courses. He might be having attributes like studentId, badge no, age, year,CGPA, Employee Id, designation, salary.

So it seems he is a student as well as a teacher in the school. There can be other more perspectives to the object Alex, like his sports playing perspectice like his score, his match date, his jersey no. There can be his medical perspectives like his height, age, blood group, his vaccination dates etc. These are different perspectives related to Alex. To make things simple in OOP, we only relate the object to its perspective and this way we make things simple.

Lets take the example we highlighted in bold , where Alex is a PhD student as well as a teacher. There are two perspectives one as a student (studentId, badge no, age, year, CGPA) and one as a teacher(EmployeeId, designation, salary).

Similarly , Alex’s behaviour as a student can be: study, play sports, give exams.

Alex’s behaviour as a teacher can be deliver lecture, prepare exam, teach and take exam.

Lets take another example : A car with a driver’s perspective and engineer’s perspective.

Driver’s perspective: A car has 4 doors, white color, a steering wheel, good seats.

Engineer’s perspective: Structure of car, engine, number of rods, wheel type etc .

So each perspective is separate and making them separate will reduce complexity.

Classes:

In OOP Model: Some of the Objects exhibit identical characteristics ( information structure and behaviour).

We say they belong to the same class.

For example: There are 15 students in a class room. They exhibit almost same behaviour, they take notes, sit on a chair and give the exam. So we can instantiate each student as an object with identical characteristics.

Take another example of a wooden stamp which has a government seal. When you stamp it , it creates a different object but the characteristic is same, the government seal.

So, take an example: A student class , there are three students , Alex who studies maths, Katherine who studies chemistry and Jack studies Physics. We created three objects here as an instance of student class.

In Python, every piece of data is represented as an instance of some class. A class provides a set of behaviors in the form of member functions (also known as methods), with implementations that are common to all instances of that class. A class also serves as a blueprint for its instances, effectively determining the way that state information for each instance is represented in the form of attributes (also known as fields, instance variables, or data members).

Inheritance:

A child inherits characteristics of his parents.

Besides that a child may also have its own unique characteristics.

If class B inherits from class A, then class B will have all characteristics (attributes and behaviour) of class A . A parent class is called Base class and child class is called derived class.

Consider a Parent class Person having its child classes student, teacher and doctor.

A parent class shape having child classes circle, triangle and line.

A person parent class has (name,age, eat( ), walk( ) ) has a derived class student (class, year, study( ), give exam( ) ) and a class teacher (employeeId, designation, teach( ), take exam( ) ).

Generalization in Inheritance:

The derived class inherit general characteristics.

A teacher and a student both eat and walk. So we can put these characteristics in a base class called person.

Person(name,age, gender, walk( ), eat( ) ) .

Student (program,year, give exam( ), study( ) ) and Teacher( designation, employee Id, take exam( ), teach( ) ).

Student is a person. Teacher is a person.

2. Subtyping( Extension) :

Sub-typing means that derived class is behaviourally compatible with base class. Behaviourally compatible means that base class can be replaced by derived class.

3. Speciaization (Restriction):

Specialization means that derived class is behaviourally incompatible with base class.And it means that base class cannot always be replaced by derived class.

Suppose a base class called Person has a derived class Adult. Person <– Adult , now if we add a restriction in adult class that age < 18 then it cannot be an adult , so Person class then cannot be replaced by Adult class.

Another example is base class called Integer Set (add(element) ) which has derived class called Natural Set( add(element) ).

Integer Set (add element( ) ) <– Natural Set (add element( ) ).

Consider an integer Set class which includes negative, zero and positive numbers. Now lets look at Natural set ,which can only include 0,1,2,3 and so on.. It cannot include negative numbers. So a Natural Set class is a specialized class of Integer Set class.

4. Overriding:

A derived class may override a default behaviour from its parent class but may exhibit totally new behaviour or new implementation. This is called overriding.

The reasons for overriding can be:

Provide a new behaviour specific to the derived class

Restricting default behaviour.

Extend the default behaviour.

Improve performance.

For example : In Integer Set class we use add(element) to add two variables but in Natural Set we can use to add 5 elements.

Overriding concept is under inheritance in OO modeling not outside it.

Abstract Classes

An abstract class implements abstract concept.

Main purpose is to be inherited by other classes.

Cannot be instantiated.

Promotes re use.

In the person and student,teacher , doctor classes example, person class can be an abstract class. Firstly the abstract class are mostly at top of the hierarchy so it can be inherited. A person class can be an abstract class because in real life for example : we say he is a doctor , he is a student , she is a teacher, not he is a person, she is a person.

Take another example of a shape class which has a derived class called circle, line, triangle. We do not say it is a shape when we see a circle or a triangle in real life. We say it is a circle or a triangle.

Another example: class vehicle(color,model,year, accelerate( ), brake( ) ) , car(color, model, year, accelerate( ), brake( )), truck(color, model, year, accelerate( ), brake( )).

That is why we keep abstract class at the top because it exhibits general , common concept like : shape, person, vehicle.

Rather than the classes from which we can create objects from are called concrete classes. Like teacher , doctor, student from Person class . Car, truck , cycle from vehicle class. So we instantiate object from a concrete class .

Multiple Inheritance:

A class inherited from multiple classes is said to have multiple inheritance. For example : If i have to create a mermaid i have to inherit properties of a human and a fish. So in OO model a mermaid class will be made from human class and a fish class.

Association:

Alex drives a car. It can have 1….* association. Alex can drive his car or any other car.

Association can be two way, if Alex is friend of Katty then Katty is also friend of Alex.And this is 1..1 is a two way association.

There can be ternary association as well. For example: Student, course and teacher. Student * .. 1 teacher, Course *…1 teacher. One teacher can teach multiple courses to multiple students.

Composition:

An object may be composed of smaller objects.

The relationship between part objects and whole object is known as composition.

Composition is represented by a filled diamond towards the composed object. It is a strong relationship.

For example: Alex is composed of parts called: 2 legs, 2 arms, 1 head and 1 body.

Aggregation:

An object may contain collection(aggregate) of other objects.

The relationship between container and the contained object is called aggregation. And it is represented by an unfilled diamond towards the container object.

An room contains bed, a cupboard and chairs. But it is not formed or made with these objects. It contains it.

A garden contains many plants. It contains them.

Aggregation is a weaker relationship as it contains the items but not formed from it. Why we say it a weak relationship is because if we remove the room , we can still use the chair and the cupboard. Or we remove these items from the room we can still use the room.

Polymorphism

Polymorphism refers to existence of different forms of same entity.

Diamond and coal are different forms of Carbon.

It is a powerful tool to develop flexible and resusable system.

For example: you have open an editor and there is a print option , whatever file is open either a pdf file, png file or word file, it will print depending on the object.

We can add or create new objects depending on the requirement.

Lets take another example: We have shape class , and its derived class triangle, circle , line . Now we want to add a new derived class square. We can use the same draw( ) method . System will use the draw( ) depending on the object.

Lets work with examples now with Python:

Creating Object:

Instantiation:

The process of creating a new object of a class is called instantiation.

Consider a class as a blueprint and the object instance as a physical object.

When you create an object instance it allocates a space for that object in the memory.

It then calls its constructor function which is a special function with __init__( ) as a syntax in python, it initializes an initial state for the newly created object. Think of a constructor that it sets an initial stage. If we do not initialize or set attributes in constructor then we have to manually write that attributes value every time.

For example:

			
# without constructor
class Robot:
  pass
robot = Robot()
robot.name="Optimus"
robot.battery = 100
# with constructor
class Robot:
def__init__(self,name,battery=100):
  self.name= name
  self.battery = battery
robot = Robot("optimus")

		

Many of Python’s built-in classes support what is known as a literal form for designating new instances. For example, the command temperature = 98.6 results in the creation of a new instance of the float class; the term 98.6 in that expression is a literal form.

From a programmer’s perspective, yet another way to indirectly create a new instance of a class is to call a function that creates and returns such an instance. For example, Python has a built-in function named sorted that takes a sequence of comparable elements as a parameter and returns a new instance of the list class containing those elements in sorted order.

Constructor:

Internally, constructor is a call to the specially named init method that serves as the constructor of the class. Its primary responsibility is to establish the state of a newly created object with appropriate instance variables.

Functions:

When using a method of a class, it is important to understand its behavior. Some methods return information about the state of an object, but do not change that state. These are known as accessors. Other methods, such as the sort method of the list class, do change the state of an object. These methods are known as mutators or update methods.

Python’s Built-In Classes

Table provides a summary of commonly used, built-in classes in Python; we take particular note of which classes are mutable and which are immutable. A class is immutable if each object of that class has a fixed value upon instantiation that cannot subsequently be changed. For example, the float class is immutable. Once an instance has been created, its value cannot be changed (although an identifier referencing that object can be reassigned to a different value).

Class	Description	Immutable?
bool	Boolean value	Immutable
int	integer (arbitrary magnitude)	Immutable
float	floating-point number	Immutable
list	mutable sequence of objects	Mutable
tuple	immutable sequence of objects	Immutable
str	character string	Immutable
set	unordered set of distinct objects	Mutable
frozenset	immutable form of set class	Immutable
dict	associative mapping (aka dictionary)	Mutable

Operator Overloading

Python’s built-in classes provide natural semantics for many operators. For ex- ample, the syntax a + b invokes addition for numeric types, yet concatenation for sequence types. When defining a new class, we must consider whether a syntax like a + b should be defined when a or b is an instance of that class.

By default, the + operator is undefined for a new class. However, the author of a class may provide a definition using a technique known as operator overload- ing. This is done by implementing a specially named method. In particular, the + operator is overloaded by implementing a method named add , which takes the right-hand operand as a parameter and which returns the result of the expres- sion. That is, the syntax, a + b, is converted to a method call on object a of the form, a. add (b). Similar specially named methods exist for other operators.

Abstract Method:

In Python the abstract method is declared in base class but it is implemented or utilized or we can say overidden by subclass. Python enforces dis- allowing instantiation for any subclass that does not override the abstract methods with concrete implementations.

Dynamic dispatching:

In traditional object-oriented terminol- ogy, Python uses what is known as dynamic dispatch (or dynamic binding) to determine, at run-time, which implementation of a function to call based upon the type of the object upon which it is invoked.

Namespace

A namespace is an abstraction that manages all of the identifiers that are defined in a particular scope, mapping each name to its associated value. In Python, functions, classes, and modules are all first-class objects, and so the “value” associated with an identifier in a namespace may in fact be a function, class, or module.

We begin by exploring what is known as the instance namespace, which man- ages attributes specific to an individual object.

There is a separate class namespace for each class that has been defined. This namespace is used to manage members that are to be shared by all instances of a class, or used without reference to any particular instance.

Three such namespaces: a class namespace containing methods of the CreditCard class, another class namespace with meth- ods of the PredatoryCreditCard class, and finally a single instance namespace for a sample instance of the PredatoryCreditCard class. We note that there are two different definitions of a function named charge, one in the CreditCard class, and then the overriding method in the PredatoryCreditCard class. In similar fashion, there are two distinct init implementations. However, process month is a name that is only defined within the scope of the PredatoryCreditCard class. The instance namespace includes all data members for the instance (including the apr member that is established by the PredatoryCreditCard constructor).

Shallow and Deep copying:

Take example of list warmtones:

red = 123

green = 222

brown =111

warmtones = [red,green,brown]

palette = list(warmtones)

In this case, we explicitly call the list constructor, sending the first list as a param- eter. This causes a new list to be created, as shown in Figure 2.10; however, it is what is known as a shallow copy. The new list is initialized so that its contents are precisely the same as the original sequence. However, Python’s lists are referential

We prefer that palette be what is known as a deep copy of warmtones. In a deep copy, the new copy references its own copies of those objects referenced by the original version.

Common Linux Commands Part 1:

ls – list current directory contents

pwd – prints the present working directory

cd – change directory

file – print the file type

less – prints the contents within a file

cp – copy files or folder

mkdir – make a directory

mv – rename or move files

rm – remove file or directory

cat – displays the contents of the file

wc – displays the number of lines, words and bytes in a file.

head – displays the first few lines of the file

tail – displays the last few lines of the file

clear – clears the screen

history – displays the content of the previously executed commands on the shell from history list

chgrp – change the ownership of the file’s group

passwd – change user’s password

chmod – change the permission mode of the file

chown – change file owner / group

ps – displays details of all the current processes

jobs – displays a list of all the active jobs

ping – send echo request to the network hosts

traceroute – displays the route packets trace to network host

netstat – displays the network connections, routing tables, interface stats etc

wget – network downloader

ssh – the openSSH ssh client to connect to remote host

locate – find files by name

find – search for files in directories

gzip – compress or expand files

tar – archiving utility

zip – package and compress file

grep – searches for the text in a file according to the regular expression given

sort – sort lines of text files

uniq – omits the repeated lines

cut – removes the section from each line of the file

diff – examine the changes or differences in the files

How to get details of SQL Server agent Jobs from Queries

All the SQL Server Agent Jobs details are logged into the MSDB Database tables .

These tables contains information for each SQL server Agent job like : JobID, sessionID, RunDate , Enabled Status, Owner, Job Steps, Last Run outcomes etc . Tables which are commonly used are :

select * from [dbo].[sysjobactivity]

select * from [dbo].[sysjobhistory]

select * from [dbo].[sysjobs]

select * from [dbo].[sysjobschedules]

select * from [dbo].[sysjobservers]

select * from [dbo].[sysjobsteps]

select * from [dbo].[sysjobstepslogs]

Msg 8120, Level 16, State 1, Line 1 Column ” is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

We will be using the table ‘[HR].[Employees]’ from TSQL2012 database.

The error as described in the Title of the post :

"Msg 8120, Level 16, State 1, Line 1 Column 'HR.Employees.lastname' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."

The error occured when I executed the following query :

select min(empid) as minimum,lastname from [HR].[Employees]

The reason for the error is : We are missing the group by clause at the end .When we are using an aggregate function along with other columns in the select clause we must use the group by clause .

select min(empid) as minimum,lastname from [HR].[Employees] group by lastname

If you are using more than one column in the select clause while using the aggregate function , then it is necessary to include all the columns in the group by clause except the column on which the aggregate function is applied. For instance : if you use execute the below query , it will throw the same error:

select min(empid) as minimum,lastname,firstname from [HR].[Employees] group by lastname

So , we must include the firstname column as well to successfully execute it as below :

select min(empid) as minimum,lastname,firstname from [HR].[Employees] group by lastname,firstname

Msg 8152, Level 16, State 4, Line 1 String or binary data would be truncated.

This Error is very common when we insert or update data.

This error occurs when we try to insert data which is greater than the length of the column in which we are inserting the data.

For Instance : Lets see an example in our TSQL2012 database :

insert into [Production].Categories(categoryname,description) values ('No chocolate items','no chocolate')

It gave me an error :

Msg 8152, Level 16, State 4, Line 1
String or binary data would be truncated.
The statement has been terminated.

This happened because the Column Categoryname has the column type : nvarchar(15) , which means nvarchar can take upto 15 characters maximum. And the total length of characters tried to insert: ‘No chocolate items’ is of length 18 .

To solve this problem ,we can increase the length of column size to 18 . Or , we can reduce the characters accordingly to fit the size of 15 .

MS SQL Script for creating the database TSQL2012 ,which we will use.

We will be working on the script attached during our posts . You can copy and paste the script in Microsoft SQL 2012 as we will be working on the tables , views , stored procedures , functions and other objects in the script.

Copy , paste and execute the whole script in a new MS SQL 2012 management Studio on the New Query tab to create the TSQL2012 database.

Attached is the script below in the TSQL2012.sql file .

T-SQL Fundamentals 2012

ETL(Extract , Transform and Load) – Part 1 -Introduction to Data warehouse

The analyses of data from relational databases (LOB) applications is not an easy . The normalized relational schema used for an LOB application can consist of hundreds of tables. It is difficult to discover where the data you need for a report is located.
In addition, LOB applications do not track data over time(they perform many DML operations), though many analyses depend on historical data.

A Data warehouse is a centralized data database for an enterprise that contains merged, cleansed, and historical data. The DW data is more better for reporting purposes . The logical design of schema of DW is called Star schema and snowflake schema which consists of Dimension and fact Tables.

The data in DW comes from LOB / relational database . The data which comes in DW is more cleansed ,transformed . OF course , while coming into DW , the data must have gone through some changes so we can get the refreshed data in DW . One way to refresh or let the new data come into DW is through the night job schedule . The report for newly or refreshed data can then be read .

The physical design for the DW is simpler than the relational database design as it contains less tables joins .

————————————————————————————————————————————-

CHAPTER 1 : Introduction to Star and Snowflake Schemas

First learning why Reporting is a Problem with a Normalized Schema :

Lets take AdventureWorks2012 sample database . Now The report should include the sales amount for Internet sales in different countries over multiple years. This will end up with almost 10 tables. The AdventureWorks2012 database schema is very normalized; it’s intended as an example schema to support LOB applications.

The goal of normalization is to have a complete and non-redundant schema.
Every piece of information must be stored exactly once. This way, you can enforce data integrity.

So, a query that joins 10 to 12 tables, as would be required in reporting sales by countries
and years, would not be very fast and can cause performance issues as it reads huge amounts of data sales over multiple years and thus would interfere with the regular transactional work of insertion and updation of the data.

Another problem is in many cases, LOB databases are purged otypically after each new fiscal year starts. Even if you have all of the historical data for the sales transactions, you might have a problem showing the historical data correctly. For example, you might have only the latest customer address, which might prevent you from calculating historical sales by country correctly .

The AdventureWorks2012 sample database stores all data in a single database. However, in an enterprise, you might have multiple LOB applications, each of which might store data in its own database. You might also have part of the sales data in one database and part in another.
And you could have customer data in both databases, without a common identification. In such cases, you face the problems of how to merge all this data and how to identify which customer from one database is actually the same as a customer from another database.

Finally, data quality can be low. The old rule, “garbage in garbage out,” applies to analyses as well. Parts of the data could be missing; other parts could be wrong. Even with good data, you could still have different representations of the same data in different databases. For example, gender in one database could be represented with the letters F and M, and in another database with the numbers 1 and 2. So we can put a proper standard in our data warehouse to this problem.

Star Schema

In Figure below, you can easily identify how the Star schema as it resembles
a star. There is a single central table, called a fact table, surrounded by multiple tables called dimensions. One Star schema covers a particular business area. In this case, the schema covers Internet sales. An enterprise data warehouse covers multiple business areas and consists of multiple Star and Snowflake schemas.

The fact table is connected to all the dimensions with foreign keys. Usually, all foreign keys taken together uniquely identify each row in the fact table, and thus they all together form a unique key, so you can use all the foreign keys as a composite primary key of the fact table. The fact table is on the “many” side of its relationships with the dimensions. If you were to form a proposition from a row in a fact table, you might express it with a sentence such as, “Customer CC purchased product BB on date DD in quantity QQ for amount SS.” .

As we know, a data warehouse consists of multiple Star schemas. From a business perspective, these Star schemas are connected. For example, you have the same customers in sales as in accounting. You deal with many of the same products in sales, inventory, and production. Of course, your business is performed at the same time over all the different business areas. To represent the business correctly, you must be able to connect the multiple Star schemas in your data warehouse. The connection is simple – you use the same dimensions for each Star schema. In fact, the dimensions should be shared among multiple Star schemas. Dimensions have foreign key relationships with multiple fact tables. Dimensions which have connections to multiple fact tables are called shared or conformed dimensions.

shared dim

Snowflake Schema

You can imagine multiple dimensions designed in a similar normalized way, with a central fact table connected by foreign keys to dimension tables, which are connected with foreign keys to lookup tables, which are connected with foreign keys to their second-level lookup tables.

In this configuration, a star starts to resemble a snowflake. Therefore, a Star schema with normalized dimensions is called a Snowflake schema.

shared dim

In most long-term projects, you should design Star schemas. Because the Star schema is
simpler than a Snowflake schema, it is also easier to maintain. Queries on a Star schema are simpler and faster than queries on a Snowflake schema, because they involve fewer joins.

Hybrid Schema

In some cases, you can also employ a hybrid approach, using a Snowflake schema only for the first level of a dimension lookup table. In this type of approach, there are no additional levels of lookup tables; the first-level lookup table is denormalized. Figure 1-6 shows such a partially denormalized schema.

Quick Question :

How do we connect multiple star schemas ?

Answer :

Through shared dimensions .

Granularity Level

The number of dimensions connected with a fact table defines the granularity level.

Auditing and Lineage

A data warehouse may also contain auditing tables. For every update, you should audit who and when the update was done and by whom and how many rows were affected to each dimension and fact table in your DW. If you also audit how much time was needed for each load, you can calculate the performance and take action if it slows down. You store this information in an auditing table .

You might also need to know where each row in a dimension and/or fact table came from and when it was added. In such cases, you must add appropriate columns to the dimension and fact tables. Such fine detailed auditing information is also called lineage.

ETL(Extract , Transform and Load) – Part 2 – Designing Dimensions

Dimensions give context to measures. Typical analysis includes pivot tables and pivot graphs. These pivot on one or more dimension columns used for analysis—these columns are called attributes in DW and OLAP terminology.

Columns with unique values identify rows. These columns are keys. In a data warehouse,
you need keys just like you need them in an LOB database. Keys uniquely identify entities. Therefore, keys are the second type of columns in a dimension.

Pivoting makes no sense if an attribute’s values are continuous, or if an attribute has too
many distinct values. Imagine how a pivot table would look if it had 1,000 columns, or how a pivot graph would look with 1,000 bars. For pivoting, discrete attributes with a small number of distinct values is most appropriate. A bar chart with more than 10 bars becomes difficult to comprehend. Continuous columns or columns with unique values, such as keys, are not appropriate for analyses.

If you have a continuous column and you would like to use it in analyses as a pivoting attribute, you should discretize it. Discretizing means grouping or binning values to a few discrete groups. If you are using OLAP cubes, SSAS can help you. SSAS can discretize continuous attributes. However, automatic discretization is usually worse than discretization from a business perspective. Age and income are typical attributes that should be discretized from a business perspective. One year makes a big difference when you are 15 years old, and much less when you are 55 years old. When you discretize age, you should use narrower ranges for younger people and wider ranges for older people (these are used for Graph type pivoting).

A customer typically has an address, a phone number, and an email address. You do not analyze data on these columns. You do not need them for pivoting. However, you often need information such as the customer’s address on a report. If that data is not present in a DW, you will need to get it from an LOB database, probably with a distributed query. It is much simpler to store this data in your data warehouse. In addition, queries that use this data perform better, because the queries do not have to include data from LOB databases. Columns used in reports as labels only, not for pivoting, are called member properties.

In addition to the types of dimension columns already defined for identifying, naming,
pivoting, and labeling on a report, you can have columns for lineage information,

A dimension may contain the following types of columns:

■■ Keys Used to identify entities
■■ Name columns Used for human names of entities
■■ Attributes Used for pivoting in analyses
■■ Member properties Used for labels in a report
■■ Lineage columns Used for auditing, and never exposed to end users

Hierarchies

Figure 1-9 shows the DimCustomer dimension of the AdventureWorksDW2012 sample database.

dimcustomer

In the figure, the following columns are attributes (columns used for pivoting):
■■ BirthDate (after calculating age and discretizing the age)
■■ MaritalStatus
■■ Gender
■■ YearlyIncome (after discretizing)
■■ TotalChildren
■■ NumberChildrenAtHome
■■ EnglishEducation (other education columns are for translations)
■■ EnglishOccupation (other occupation columns are for translations)
■■ HouseOwnerFlag
■■ NumberCarsOwned
■■ CommuteDistance

All these attributes are unrelated. Pivoting on MaritalStatus, for example, is unrelated to
pivoting on YearlyIncome. None of these columns have any functional dependency between them, and there is no natural drill-down path through these attributes. Now look at the Dim- Date columns, as shown in Figure 1-10.

dimdate

Some attributes of the DimDate edimension include the following (not in the order shown in the figure):
■■ FullDateAlternateKey (denotes a date in date format)
■■ EnglishMonthName
■■ CalendarQuarter
■■ CalendarSemester
■■ CalendarYear

You will immediately notice that these attributes are connected. There is a functional dependency among them, so they break third normal form. They form a hierarchy. Hierarchies are particularly useful for pivoting and OLAP analyses—they provide a natural drill-down path. You perform divide-and-conquer analyses through hierarchies.

Hierarchies have levels. When drilling down, you move from a parent level to a child level. For example, a calendar drill-down path in the DimDate dimension goes through the following levels: CalendarYear ➝ CalendarSemester ➝ CalendarQuarter ➝ EnglishMonthName ➝ FullDateAlternateKey

At each level, you have members. For example, the members of the month level are, of course, January, February, March, April, May, June, July, August, September, October, November, and December. In DW and OLAP jargon, rows on the leaf level—the actual dimension rows—are called members. This is why dimension columns used in reports for labels are called member properties.

In a Snowflake schema, lookup tables show you levels of hierarchies. In a Star schema, you need to extract natural hierarchies from the names and content of columns. Nevertheless, because drilling down through natural hierarchies is so useful and welcomed by end users, you should use them as much as possible.

Slowly Changing Dimensions

There is one common problem with dimensions in a data warehouse: the data in the dimension changes over time. This is usually not a problem in an OLTP application; when a piece of data changes, you just update it. However, in a DW, you have to maintain history. The question that arises is how to maintain it. Do you want to update only the changed data, as in an OLTP application, and pretend that the value was always the last value, or do you want to maintain both the first and intermediate values? This problem is known in DW jargon as the Slowly Changing Dimension (SCD) problem.

The problem is best explained in an example. Table 1-1 shows original source OLTP data
for a customer.

OLTP1

The customer lives in Vienna, Austria, and is a professional. Now imagine that the customer moves to Ljubljana, Slovenia. In an OLTP database, you would just update the City column, resulting in the values shown in Table 1-2.

OLTP2

If you create a report, all the historical sales for this customer are now attributed to the
city of Ljubljana, and (on a higher level) to Slovenia. The fact that this customer contributed to sales in Vienna and in Austria in the past would have disappeared.

In a DW, you can have the same data as in an OLTP database. You could use the same key,
such as the business key, for your Customer dimension. You could update the City column when you get a change notification from the OLTP system, and thus overwrite the history.

Type 1 SCD : Type 1 means overwriting the history for an attribute and for all higher levels of hierarchies to which that attribute belongs.

The problem here would rise that you want to store the historical data . What if we want transactions of the same customer when he did in Vienna ? The solution to this problem would be to add a surrogate key (DW key ).

Type 2 SCD : When you implement Type 2 SCD, for the sake of simpler querying, you typically also add a flag to denote which row is current for a dimension member. Alternatively, you could add two columns showing the interval of validity of a value.

SCD2

You could have a mixture of Type 1 and Type 2 changes in a single dimension. For example, in Table 1-3, you might want to maintain the history for the City column but overwrite the history for the Occupation column. That raises yet another issue. When you want to update the Occupation column, you may find that there are two (and maybe more) rows for the same customer. The question is, do you want to update the last row only, or all the rows? Table 1-4 shows a version that updates the last (current) row only, whereas Table 1-5 shows all of the rows being updated.

scd3

Fact Table Column Types

Fact tables are collections of measurements associated with a specific business process. You store measurements in columns. Logically, this type of column is called a measure. Measures are the essence of a fact table. They are usually numeric and can be aggregated. They store values that are of interest to the business, such as sales amount, order quantity, and discount amount.

All foreign keys together usually uniquely identify each row and can be used as a composite primary key.

For example, suppose you start building a sales fact table from an order details table in a source system, and then add foreign keys that pertain to the order as a whole from the Order Header table in the source system. Tables 1-7, 1-8, and 1-9 illustrate an example of such a design process.

Table 1-7 shows a simplified example of an Orders Header source table. The OrderId
column is the primary key for this table. The CustomerId column is a foreign key from the Customers table. The OrderDate column is not a foreign key in the source table; however, it becomes a foreign key in the DW fact table, for the relationship with the explicit date dimension. Note, however, that foreign keys in a fact table can—and usually are—replaced with DW surrogate keys of DW dimensions.

orderheader1

Table 1-8 shows the source Order Details table. The primary key of this table is a composite one and consists of the OrderId and LineItemId columns. In addition, the Source Order Details table has the ProductId foreign key column. The Quantity column is the measure .

orderdetail1

Table 1-9 shows the Sales Fact table created from the Orders Header and Order Details
source tables. The Order Details table was the primary source for this fact table. The OrderId, LineItemId, and Quantity columns are simply transferred from the source Order Details table.
The ProductId column from the source Order Details table is replaced with a surrogate DW ProductKey column. The CustomerId and OrderDate columns take the source Orders Header table; these columns pertain to orders, not order details. However, in the fact table, they are replaced with the surrogate DW keys CustomerKey and OrderDateKey.

salesfact

You do not need the OrderId and LineItemId columns in this sales fact table. For analyses, you could create a composite primary key from the CustomerKey, OrderDateKey, and Product- Key columns.

Additivity of Measures

Additivity of measures is not exactly a data warehouse design problem. However, you should consider which aggregate functions you will use in reports for which measures, and which aggregate functions you will use when aggregating over which dimension.
The simplest types of measures are those that can be aggregated with the SUM aggregate
function across all dimensions, such as amounts or quantities. For example, if sales for product A were $200.00 and sales for product B were $150.00, then the total of the sales was $350.00.
If yesterday’s sales were $100.00 and sales for the day before yesterday were $130.00, then the total sales amounted to $230.00. Measures that can be summarized across all dimensions are called additive measures.

Some measures are not additive over any dimension. Examples include prices and percentages, such as a discount percentage. Typically, you use the AVERAGE aggregate function for such measures, or you do not aggregate them at all. Such measures are called non-additive measures.

For some measures, you can use SUM aggregate functions over all dimensions but time.
Some examples include levels and balances. Such measures are called semi-additive measures.
For example, if customer A has $2,000.00 in a bank account, and customer B has $3,000.00, together they have $5,000.00. However, if customer A had $5,000.00 in an account yesterday but has only $2,000.00 today, then customer A obviously does not have $7,000.00 altogether. You should take care how you aggregate such measures in a report. For time measures, you can calculate average value or use the last value as the aggregate.

Quick Question :

■■ You are designing an accounting system. Your measures are debit, credit, and
balance. What is the additivity of each measure?

Answer: Debit and credit are additive measure while balance is semi additive measure.

Summary :

■■ Fact tables include measures, foreign keys, and possibly an additional primary key and lineage columns.
■■ Measures can be additive(fixed total sum of sales ) , non-additive(price changes or percentage changes ), or semi-additive(credit and debit in account) 32

	Talha on Working on Sets 1
	How to Find All Sql… on How to Find All Sql Servers In…