Sunday, 24 August 2014

What is Data WareHouse?

There are two version of definitions for Data Warehouse.

1) Definition of R. Kimbal 2) Definition of Inmon 

According to R. Kimbal 

"A data warehouse is a copy of transactional data specifically structured for querying and analysis".
According to this definition:
The form of the stored data (RDBMS, flat file) has nothing to do with whether something is a data warehouse.
Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making.

According to  Inmon

A data warehouse is a
subject-oriented,
integrated,
nonvolatile, and
time-variant
collection of data in support of management’s decisions. The data warehouse contains granular corporate data.

Subject-Oriented?
Classical operations systems are organized around the applications of the company. For an insurance company, the applications may be auto, health, life, and casualty. The major subject areas of the insurance corporation might be customer, policy, premium, and claim. For a manufacturer, the major subject areas might be product, order, vendor, bill of material, and raw goods. For a retailer, the major subject areas may be product, SKU, sale, vendor, and so forth. Each
type of company has its own unique set of subjects.

Integrated
Of all the aspects of a data warehouse, integration is the most important. Data is  fed from multiple disparate sources into the data warehouse. As the data is fed it is
converted, reformatted, resequenced, summarized, and so forth. The result is that data—once it resides in the data warehouse—has a single physical corporate image.

 Nonvolatile
Data is updated in the operational environment as a regular matter of course, but warehouse data exhibits a very different set of characteristics. Data warehouse data is loaded (usually en masse) and accessed, but it is not updated (in the general sense). Instead, when data in the data warehouse is loaded, it is loaded in a snapshot, static format. When subsequent changes occur, a new snapshot record is written. In doing so a history of data is kept in the data warehouse. 

Time-variant
Time variancy implies that every unit of data in the data warehouse is accurate as of some one moment in time. In some cases, a record is time stamped. In other cases, a record has a date of transaction. But in every case, there is some form of time marking to show the moment in time during which the record is accurate. A 1 to 2 year time horizon is normal for operational systems; a 5-to-15-year time horizon is normal for the data warehouse. As a result of this difference in time horizons, the data warehouse contains much more history than any other environment

Summary:

While Creating a  Data WareHouse Data will be isolated at department level. Each Department deals with particular
 subject. For instance consider bank,
it has Insurance, Customer, Transnational etc details. While creating a WareHouse each department will be treated as a subject, further diluting the concept of subject, each department will have its own source file to be processed.

When we Receive files from different departments then, each dept has its own standards to be followed. for instance,

Encoding: (Gender field)

Dept A: Male/ Female
Dept B: M/F
Dept c: 0/1

Dept A,Dept B,Dept C Gender field will be converted into one unit or same format. i.e
 Integrated.

Non Volatile
 data will not under go for any changes or not frequent update or alter statements will be run against Data WareHouse.

Time-Variant
 is vital feature of DWH. When any source is processed, the Timestamp of source, Process date & time and Period will be recorded.



1 comment: