Saturday 27 September 2014

05:: Theoretical Aspects


•Limitations of Distributed Systems

1.Absence of Global clock
2.Absence of Shared Memory

Why are these an issue?

For instance, assume a money transfer transaction is in progress and there is no global clock or shared memory. 500$ is deducted from Account XYZ, whose total balance was 5000$, and transferred to Account ABC.


ACCOUNT XYZ --------- communication channel ---------> ACCOUNT ABC
    5000$                                                   0$

Divide the above transfer process into 3 recorded states:

1) The state of Account XYZ is recorded before sending the amount, i.e. 5000$
2) The state of the communication channel is recorded after the 500$ has been sent, i.e. 500$ in transit
3) The state of Account ABC is recorded before receiving the amount, i.e. 0$

In this scenario, the recorded global state adds up to 5000$ + 500$ + 0$ = 5500$, which is inconsistent, since only 5000$ actually exists in the system. To reason about and avoid such inconsistencies we need a way to order events across machines, and there are two classic approaches:

1) Lamport's Logical Clock

2) Vector Clock

https://www.youtube.com/watch?v=ELu_jVWqPNs

Follow the link above to understand Lamport's clock and the vector clock.
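As a rough illustration of the rules behind Lamport's clock (increment on local and send events, take the maximum plus one on receive), here is a minimal Python sketch; the process names are made up for the example.

# Minimal Lamport logical clock sketch (illustrative only).
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: just increment the clock.
        self.time += 1
        return self.time

    def send(self):
        # Sending counts as an event; attach this timestamp to the message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # On receive, jump past both the local clock and the message timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time

# Example: process P sends a message to process Q.
p, q = LamportClock(), LamportClock()
ts = p.send()       # P's clock becomes 1
q.receive(ts)       # Q's clock becomes max(0, 1) + 1 = 2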

Causal Ordering of Messages

The algorithms commonly used to implement causal ordering of messages are listed below (a small vector-clock sketch follows the list):

BSS
SES
Matrix Algorithm
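As a rough sketch of the vector-clock delivery rule used by BSS-style protocols (buffer a message until it is the next expected one from its sender and nothing causally earlier is missing), in Python; the number of processes and the message format are assumptions for the example.

# Sketch of BSS-style causal delivery using vector clocks (illustrative only).
N = 3                                   # assumed number of processes

def can_deliver(local_vt, msg_vt, sender):
    # The message must be the next one expected from `sender`...
    if msg_vt[sender] != local_vt[sender] + 1:
        return False
    # ...and must not depend on any message we have not yet delivered.
    return all(msg_vt[k] <= local_vt[k] for k in range(N) if k != sender)

def deliver(local_vt, msg_vt):
    # After delivery, merge the message's vector clock into the local one.
    return [max(a, b) for a, b in zip(local_vt, msg_vt)]

local = [0, 0, 0]                       # this process has delivered nothing yet
msg = [1, 0, 0]                         # first message sent by process 0
if can_deliver(local, msg, sender=0):
    local = deliver(local, msg)         # local becomes [1, 0, 0]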


Thursday 25 September 2014

04:: Communication Models


•High level constructs [Helps the program in using underlying communication network]
•Two Types of Communication Models
1) Message passing
2) Remote Procedure Calls


1) Message Passing Primitives

•Two basic communication primitives
-SEND(a,b) , a-> Message , b-> Destination
-RECEIVE(c,d), c-> Source , d-> Buffer for storing the message

Message Passing Primitive Design Issues

  • buffered vs. unbuffered
  • blocking vs. nonblocking
  • reliable vs. unreliable
  • synchronous vs. asynchronous
Blocking vs Non blocking

•Nonblocking
- SEND primitive returns control to the user process as soon as the message is copied from the user buffer to the kernel buffer
- Advantage: Programs have maximum flexibility in performing computation and communication in any order
- Drawback: Programming becomes tricky and difficult
•Blocking
- SEND primitive does not return control to the user process until the message has been sent or an acknowledgement has been received
- Advantage: Program's behavior is predictable
- Drawback: Lack of flexibility in programming

----> In other words, with a non-blocking SEND the process does not wait for any acknowledgement and can get on with other work, while with a blocking SEND the process stays dedicated to the call until the message has been transmitted.
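A small sketch of the difference, using Python's thread-safe queue as a stand-in for the kernel message buffer; the channel and message here are made up for the example.

# Blocking vs. non-blocking receive, sketched with Python's queue module.
import queue

channel = queue.Queue()          # stands in for the kernel message buffer

# Non-blocking: return immediately whether or not a message is there.
try:
    msg = channel.get_nowait()
except queue.Empty:
    msg = None                   # no message yet; go do other work

# Blocking: the caller waits here until a message actually arrives.
channel.put("hello")             # some sender deposits a message
msg = channel.get()              # returns "hello" without busy-waiting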

Synchronous vs Asynchronous

•Synchronous
-SEND primitive is blocked until corresponding RECEIVE primitive is executed at the target computer
•Asynchronous
-Messages are buffered
-SEND primitive does not block even if there is no corresponding execution of the RECEIVE primitive
-The corresponding RECEIVE primitive can be either blocking or non-blocking
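A rough sketch of the contrast, again with Python's standard queue module: an asynchronous send just buffers the message and returns, while a (roughly) synchronous sender waits until the receiver has actually consumed it. The mailbox and message names are made up for the example.

# Asynchronous (buffered) vs. roughly synchronous send.
import queue, threading

mailbox = queue.Queue()          # unbounded buffer => asynchronous semantics

def receiver():
    msg = mailbox.get()          # the corresponding RECEIVE
    print("got", msg)
    mailbox.task_done()          # signal that the message was consumed

threading.Thread(target=receiver, daemon=True).start()

mailbox.put("report")            # asynchronous SEND: returns immediately

# Approximate synchronous behaviour: block until the receiver has
# taken and processed everything that was sent.
mailbox.join()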

2) Remote Procedure Calls

RPC is an interaction between a client and a server
•Client invokes a procedure on the server
•Server executes the procedure and passes the result back to the client
•The calling process is suspended and proceeds only after getting the result from the server
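A minimal sketch of this interaction using Python's standard xmlrpc modules; the host, port and procedure name are placeholders, and the two halves would run as separate processes.

# --- Server process: exposes a procedure that clients can call remotely ---
from xmlrpc.server import SimpleXMLRPCServer

def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000))
server.register_function(add, "add")
server.serve_forever()           # blocks, handling incoming RPC requests

# --- Client process: the call looks local, but the client stub packs the
# --- arguments, ships them to the server and blocks for the result
from xmlrpc.client import ServerProxy

proxy = ServerProxy("http://localhost:8000/")
print(proxy.add(2, 3))           # prints 5 once the server replies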

RPC Design issues

•Structure
•Binding
•Parameter and Result Passing
•Error handling, semantics and Correctness

Structure


Binding

•Determines remote procedure and machine on which it will be executed
•Check compatibility of the parameters passed
•Use Binding Server

•Parameter and Result Passing

Stub procedures convert parameters and results into an appropriate format: the sender stub packs the parameters into a message, and on receipt they are converted to the local machine's representation (see the sketch after the list below).

•Pack parameters into a buffer
•Receiver Stub Unpacks the parameters
•Expensive if done on every call
-Send parameters along with code that helps identify the format, so that the receiver can do the conversion
-Alternatively, each data type may have a standard format: the sender converts data to the standard format and the receiver converts it from the standard format to its local representation
•Passing Parameters by Reference 
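A tiny sketch of the "standard format" idea, using Python's struct module to pack parameters in network byte order; the parameter names and types are assumptions for the example.

# Marshalling parameters into a standard format, as a sender stub might,
# and unmarshalling them on the receiving side (illustrative only).
import struct

def pack_params(account_id: int, amount: float) -> bytes:
    # "!" = network byte order, "i" = 32-bit int, "d" = 64-bit float
    return struct.pack("!id", account_id, amount)

def unpack_params(buffer: bytes):
    return struct.unpack("!id", buffer)

wire = pack_params(42, 500.0)                # sender stub packs into a buffer
account_id, amount = unpack_params(wire)     # receiver stub unpacks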

Error handling, Semantics and Correctness

•RPC may fail due to either computer or communication failure
•If the remote server is slow, the program invoking the remote procedure may call it twice
•If the client crashes after sending the RPC message
•If the client recovers quickly after a crash and reissues the RPC
•Orphan execution of remote procedures
•RPC Semantics
-At least once semantics
-Exactly Once
-At most once
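A rough sketch of how these semantics are often realized: the client retries on failure (at-least-once), while the server remembers request ids and returns cached results for duplicates (at-most-once); combining the two approximates exactly-once behaviour. All names here are invented for the example.

# Request-id cache on the server plus client retries (illustrative only).
import uuid

class Server:
    def __init__(self):
        self.seen = {}                        # request_id -> cached result

    def execute(self, request_id, x):
        # At-most-once: a duplicate request returns the cached result
        # instead of re-executing the procedure.
        if request_id in self.seen:
            return self.seen[request_id]
        result = x * 2                        # the "remote procedure"
        self.seen[request_id] = result
        return result

def call_at_least_once(server, x, retries=3):
    # At-least-once: keep the same request id and retry until an answer
    # comes back (here the call never actually fails).
    request_id = str(uuid.uuid4())
    for _ in range(retries):
        result = server.execute(request_id, x)
        if result is not None:
            return result

print(call_at_least_once(Server(), 21))       # 42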




Thursday 18 September 2014

03:: Issues with Distributed OS

Following are the issues.

  1. Global knowledge
  2. Naming
  3. Scalability
  4. Compatibility
  5. Process Synchronization
  6. Resource Management
  7. Security
  8. Structuring

1.Global Knowledge:

•No Global Memory
•No Global Clock
•Unpredictable Message Delays

Since there is neither shared memory nor a global clock in a distributed system, there is no centralized knowledge repository describing the resources being exchanged. Synchronizing message transfer is hard because there is no central or global place where timing is captured, and message delays are unpredictable.

2.Naming:


•Name refers to objects [ Files, Computers etc]
•Name Service Maps logical name to physical address
•Techniques
-LookUp Tables [Directories]
-Algorithmic
-Combination of above two
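As a trivial sketch of the lookup-table technique, a directory mapping logical names to physical addresses; the names and addresses are made up.

# Lookup-table (directory) style name service, illustrative only.
name_table = {
    "printer-1": ("10.0.0.5", 9100),
    "fileserver": ("10.0.0.7", 2049),
}

def resolve(logical_name):
    return name_table[logical_name]      # logical name -> physical address

host, port = resolve("printer-1")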

3.Scalability:

•Grow with time
•Scaling Dimensions – Size, Geographical & Administrative

•Techniques – Hiding Communication Latencies, Distribution & Caching


Hiding Communication Latencies:

Diagram (a) shows that whenever the user submits a request, it always hits the server, even for simple validation, which hinders the server's performance.

Diagram (b) shows that instead of hitting the server every time, much of the work can be processed on the client side, which reduces the burden on the server and improves its performance.

Distribution 

An example of dividing the DNS name space into zones

Scalability can also be addressed via distribution and caching. The picture above shows how the DNS name space can be divided into zones across different servers. For instance, "ac.in" is handled by an academic zone server; when a user resolves a name under "ac.in", the lookup is directed to the server that holds all "ac.in" related addresses.

SCALING TECHNIQUE  (REPLICATION):
•Replicate components across the distributed system
•Replication increases availability , balancing load distribution
•Consistency problem has to be handled

Ex: Clustering

4. Compatibility:

  • Interoperability among resources in system
  • Levels of Compatibility – Binary Level, Execution Level & Protocol Level
This can be understood with a simple example: a Java program written on a Windows machine can be executed on a Linux machine. That is compatibility.

5.Process Synchronization

  • Difficult because of unavailability of shared memory
  • Mutual Exclusion Problem 

6.Resource Management

Make both local and remote resources available
•Specific Location of resource should be hidden from user
•Techniques
-Data Migration [DFS, DSM]
-Computation Migration [RPC]
-Distributed Scheduling [ Load Balancing]

7.Security

•Authentication – an entity is what it claims to be
•Authorization – what privileges an entity has and making only those privileges available

This is best explained by analogy: logging in to a system is authentication, while the R-W-X permission bits in Unix are authorization.

8.Structuring

  • the monolithic kernel: one piece
  • the collective kernel structure: a collection of processes
  • object oriented: the services provided by the OS are implemented as a set of objects.
  • client-server: servers provide the services and clients use the services.

Wednesday 17 September 2014

Distributed System Architecture Types


•Minicomputer Model
•Workstation Model
•Workstation – Server Model
•Processor Pool Model
•Hybrid Model

•Minicomputer Model

This model was used in the mid-80s. Each minicomputer serves several terminals. For instance, if User A is assigned a terminal served by minicomputer A, and there are interconnected minicomputers B, C & D, each serving, say, 3 users, then User A can log in only through the machine served by minicomputer A.

Major resources like memory and local disk are centralized at the minicomputer.
EX: Unix based systems.

•Workstation Model

Each user is assigned a workstation. This is essentially an advanced version of the minicomputer model: each workstation is capable of doing its own computation, and each workstation serves a single user.

•Workstation – Server Model

Here, workstations are dedicated to users, so each workstation serves a single user, while minicomputers act as servers whose services are shared among the workstations.
EX: A printer installed on one minicomputer server can be used from all the workstations.

•Processor Pool Model

This is again an extension of the workstation–server model. Here the servers are combined into a pool, and when a request comes from a user it is handled by processors from the pool based on availability.

•Hybrid Model

  • Based upon workstation-server model but with additional pool of processors
  • Processors in the pool can be allocated dynamically
  • Gives guaranteed response time to interactive jobs
  • More expensive to build

Tuesday 16 September 2014

Introduction

Distributed Systems
“ A Distributed System is a collection of independent computers that appears to its users as a single coherent system ” [Tanenbaum]
“ A Distributed System is
-a system having several computers that do not share a memory or a clock
-Communication is via message passing
-Each computer has its own OS+Memory
[Shivaratri & Singhal]

This leads us to the types of multiprocessor system architecture.
Multiprocessor System Architecture Types
•Tightly Coupled Systems
•Loosely Coupled Systems

•Tightly Coupled Systems


The key point to notice is that memory is shared across the CPUs; communication is established through this shared memory.

•Loosely Coupled Systems



Key note: Loosely coupled systems are actually distributed systems, because, as per the definition given above, a system whose computers do not share memory and have their own clocks is a distributed system, and loosely coupled systems fit that definition.

What is the motivation to use a distributed/loosely coupled system?
•Resource Sharing: A printer connected to one server can be used by every individual user.
•Enhanced Performance: Rather than one machine with a huge configuration (say, terabytes of memory), it makes more sense to have multiple machines distributed across the network, with each user working on their own task.
•Improved Reliability & Availability: The phrase "dedicated machines" says it all; one machine failing does not take the whole system down.
•Modular Expandability: You can add a new user or machine without disturbing the existing environment.



Advance Operating System

Monday 8 September 2014

08:: Dimensional Modeling

Dimensional Modeling
• Used by most contemporary BI solutions
– "Right" mix of normalization and denormalization, often called Dimensional Normalization
– Some use for full data warehouse design
– Others use for data mart designs
• Consists of two primary types of tables
– 1) Dimension tables
– 2) Fact tables

Dimensional normalization
– Logical design technique that presents data in an intuitive way, allowing high-performance access
– Targets decision support information
– Focused on easy user navigation and a high-performance design

(vs)

• Transactional normalization
– Logical design technique to eliminate data redundancy, keep data consistent and store it efficiently
– Makes transactions simple and deterministic
– ER models for an enterprise are usually complex, often containing hundreds, or even thousands, of entities/tables


Fact & Dimension tables example


Dimension tables contain surrogate keys and are descriptive tables: they describe entities in detail. The fact table, in contrast, is a central table that contains the foreign keys of the dimension tables along with the measures. The image below depicts this.
In the picture, the FactInternetSales table is the fact table and the rest are dimension tables.
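A small sketch of how the fact table relates to its dimension tables, in plain Python; the rows and column names here are invented and only loosely mirror the picture.

# Star-schema query sketch: join the fact table to a dimension through keys.
dim_product = {1: "Bike", 2: "Helmet"}                 # key -> description
dim_customer = {10: "Alice", 11: "Bob"}

fact_internet_sales = [
    {"product_key": 1, "customer_key": 10, "sales_amount": 500.0},
    {"product_key": 2, "customer_key": 11, "sales_amount": 45.0},
]

# "Sales amount by product name": dimensions describe, the fact measures.
report = {}
for row in fact_internet_sales:
    name = dim_product[row["product_key"]]
    report[name] = report.get(name, 0.0) + row["sales_amount"]

print(report)      # {'Bike': 500.0, 'Helmet': 45.0}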

Fact and dimension table key points


Details of Dimension table

What does a Dimension Table capture??

Keys in dimension table?



Types of Dimension table?



1) Slowly Changing Dimensions
2) Conformed Dimension
3) Role Playing Dimension
4) De-Generate Dimension
5) Junk Dimension


1)Slowly Changing Dimensions:
Go through the link below to understand slowly changing dimensions; a small sketch of the common Type 2 approach follows the link.
https://www.youtube.com/watch?v=mUGvYgYX13U
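A minimal sketch of the Type 2 approach (expire the current row and insert a new row with a new surrogate key, preserving history); the column names, dates and cities are assumptions for the example.

# Type 2 slowly changing dimension, illustrative only.
from datetime import date

customer_dim = [
    {"surrogate_key": 1, "customer_id": "C100", "city": "Pune",
     "valid_from": date(2012, 1, 1), "valid_to": None, "current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    # Close off the currently valid row for this customer...
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            row["valid_to"] = change_date
            row["current"] = False
    # ...and add a new row with a fresh surrogate key.
    dim.append({
        "surrogate_key": max(r["surrogate_key"] for r in dim) + 1,
        "customer_id": customer_id, "city": new_city,
        "valid_from": change_date, "valid_to": None, "current": True,
    })

apply_scd2(customer_dim, "C100", "Mumbai", date(2014, 9, 1))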

2) Conformed Dimension
A dimension that is used by more than one fact table is called a conformed dimension.
Ex: a Product dimension related to both an Order fact and a Sales fact.
  • Same Dimension joins to multiple Fact Tables or is used across multiple Data Marts.
  • It is a master dimension and is used across multiple dimensional models.

Time, Product & Staff are conformed dimensions.

3) Role Playing Dimension:

In this type, a single dimension plays more than one role for the fact table; refer to the example below.
In the example, Dim_Date (the date dimension) plays the role of both Order_Date and Ship_Date.


4) De-Generate Dimension:

Degenerate dimension: a column in the key section of the fact table that has no associated dimension table but is still used for reporting and analysis. Such a column is called a degenerate dimension or line-item dimension.

For example, take a fact table with customer_id, product_id, branch_id, employee_id, bill_no and date in the key section and price, quantity and amount in the measure section. In this fact table, bill_no in the key section is a single value with no associated dimension table. Instead of creating a separate dimension table for that single value, we include it in the fact table to improve performance, so the column bill_no is a degenerate dimension or line-item dimension.

5) Junk Dimension

Consider a share trading firm whose fact table records the trades that take place. There may be attributes such as mode of trade (which indicates whether the user is trading by phone or online) that are not related to any of the regular dimensions such as account, date, indices or amount of shares.
These unrelated attributes are removed from the fact table and stored together in a separate dimension, the junk dimension, which is useful for providing this extra information.
 

Details of Fact Tables

What does a Fact table capture?

Fact Table Granularity
– The level of detail of the data contained in the fact table
– The description of a single instance (a record) of the fact
– Typically includes a time level and a distinct combination of other dimensions
• e.g. daily item totals by product, by store; weekly snapshot of store inventory by product

Additive Nature
• Additive: Facts that can be summed up/aggregated across all of the dimensions in the fact table (e.g., discrete numerical measures of activity, i.e., quantity sold, dollars sold)
• Semi-Additive: Facts that can be summed up for some of the dimensions in the fact table, but not the others (e.g., account balances, inventory level, distinct counts)
• Non-Additive: Facts that cannot be summed up for any of the dimensions present in the fact table (e.g., measurement of room temperature)
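A tiny sketch of why the distinction matters, with invented rows: an additive fact (quantity sold) sums meaningfully across every dimension, while a semi-additive fact (account balance) sums across accounts at a point in time but not across time.

# Additive vs. semi-additive facts, illustrative only.
sales = [   # additive: quantity_sold can be summed across store and day
    {"store": "S1", "day": "Mon", "quantity_sold": 5},
    {"store": "S1", "day": "Tue", "quantity_sold": 3},
    {"store": "S2", "day": "Mon", "quantity_sold": 7},
]
total_quantity = sum(r["quantity_sold"] for r in sales)                  # 15, meaningful

balances = [   # semi-additive: balances add across accounts, not across days
    {"account": "A", "day": "Mon", "balance": 100},
    {"account": "B", "day": "Mon", "balance": 250},
    {"account": "A", "day": "Tue", "balance": 120},
]
monday_total = sum(r["balance"] for r in balances if r["day"] == "Mon")  # 350, fine
meaningless = sum(r["balance"] for r in balances)                        # 470, not a real balance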

Types of Fact Tables??

1) Transactional
2) Snapshot or inventory
3) Accumulating Fact table


1) Transactional fact table

A transaction fact table is one that holds data at the grain of one row per transaction.
e.g. If a customer buys 3 different products at a point of sale, the fact table will have three records, one for each product sold. Basically, if three line items appear on the customer's receipt, we store three records in the fact table, as the granularity of the fact table is the transaction level.
e.g. Customer Bank transaction
Customer     Transaction Type   Amount   Date
Customer1    Credit             10000    01-01-2012
Customer1    Debit              5000     02-01-2012
Customer1    Credit             1000     03-01-2012
The grain of the above transaction table is one row per transaction.
Transaction fact tables are at the most detailed level and generally have a large number of dimensions associated with them.

2) Periodic snapshot fact table

As its name suggests, a periodic snapshot fact table stores a snapshot of the data taken at a particular point in time, i.e. one row for each period.
e.g. Let's take the example of credit/debit transactions made by a customer.
Customer     Transaction Type   Amount   Date
Customer1    Credit             10000    01-01-2012
Customer1    Debit              5000     02-01-2012
Customer1    Credit             1000     03-01-2012
The table above is a transaction fact table. Suppose we need to create a periodic snapshot fact table whose grain is a month and which stores the customer's balance at the end of each month; it would look like this:

Customer     Month      Amount
Customer1    Jan-2012   6000
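A small sketch of deriving that monthly snapshot row from the transaction rows above, treating debits as negative; the data structures are invented for the example.

# From the transaction grain to a monthly snapshot grain, illustrative only.
transactions = [
    {"customer": "Customer1", "type": "Credit", "amount": 10000, "date": "01-01-2012"},
    {"customer": "Customer1", "type": "Debit",  "amount": 5000,  "date": "02-01-2012"},
    {"customer": "Customer1", "type": "Credit", "amount": 1000,  "date": "03-01-2012"},
]

balance = 0
for t in transactions:                  # all rows fall in Jan-2012
    balance += t["amount"] if t["type"] == "Credit" else -t["amount"]

snapshot = {"customer": "Customer1", "month": "Jan-2012", "amount": balance}   # 6000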

3) Accumulating Fact Table

An accumulating fact table stores one row for the entire lifetime of an event. To understand it better, let's take the order processing process as an example.
Date         Order Status
01-01-2012   Customer Ordered Product
02-01-2012   Order Product Despatched from Warehouse
03-01-2012   Handed Over to Courier Company
04-01-2012   Delivered to Customer
If you look at the above events, you can see that each date has its own name,
e.g. customer order date, warehouse despatch date, etc.
In an accumulating fact table, the single row for the order is updated at each stage with the relevant dates and facts.

Tuesday 2 September 2014

07:: Data WareHouse Requirement Gathering

When you are gathering requirements for a data warehouse project, you have to keep the following things in mind; they can be discussed with the various stakeholders in the organization.


06::LifeCycle


05::Data Warehouse Architecture


Source Systems:

Say you own a business, for instance a bank. You will have a lot of applications designed for the bank, with daily transactions happening against their databases. That data is later denormalized and made ready to serve as input to the data warehouse system.

Staging Area:

Business logic is implemented here. You never dump the source data directly into your data marts; the data extracted from the source should be shaped so that it can serve as input to your master tables.

Presentation Area:

Here the actual reports are generated for analysis purposes.


Types of Architecture:








04:: Top-Down vs Bottom-Up (Ralph Kimball vs Bill Inmon)

Bill Inmon (Top-Down Approach):
Normalized data model
Enterprise view of data
Single, central storage of data
Takes longer to build
High exposure to risk and failure.

Kimball (Bottom-Up Approach):
De-normalized data model
Collection of conformed data marts which gives enterprise view
Inherently incremental
Less risk of failure and allows project team to learn and grow.

Monday 1 September 2014

DWH vs DataMart

A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. Enables strategic decision making.

A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease-of-use. Users of a data mart can expect to have data presented in terms that are familiar.
In practice, the terms data mart and data warehouse each tend to imply the presence of the other in some form. However, most writers using the term seem to agree that the design of a data mart tends to start from an analysis of user needs and that a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. A data warehouse is a central aggregation of data (which can be distributed physically); a data mart is a data repository that may derive from a data warehouse or not and that emphasizes ease of access and usability for a particular designed purpose.

DataWareHouse:
Corporate/Enterprise-wide
Union of all data marts
Data received from staging area
Queries on presentation resource
Structure for corporate view of data
Organized on E-R Model.
DataMart:
Departmental
A Single business process
STAR join(facts and Dim)
Technology optimal for data access and analysis
Structure to suit the departmental view of data

Summary:

A data mart is subject oriented. When you design a warehouse for a bank, you will have a lot of subjects to take care of under a single roof, for instance insurance, transactions, mortgages, etc. Each such subject gets its own data mart.