DW-CONCEPTS
What is a three-tier DW: A data warehouse can be thought of as a three-tier system in which a middle system provides usable data in a secure way to end-users. On either side of this middle system are the end users and the back-end data stores.
Data Warehousing: A Data Warehouse Is Subject-Oriented, Integrated, Time-Varying, Non-Volatile collection of data in support of the management’s Decision-Making Process.
Subject-oriented - Data that gives information about a particular subject instead of about a company's on-going operations
Integrated - Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole
Time-variant - All data in the data warehouse is identified with a particular time period
Non-volatile - Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business
What is a Data Warehouse?
A Collection of transformed and integrated data, stored for the purpose of providing strategic information to the entire enterprise. It usually contains historical data derived from transaction data, but it can include data from different sources.
When users are requesting access to large amount of historical information for reporting purpose, we should strongly consider a warehouse, the user will benefit when the information is organized in an efficient manner for this type of access.
What is the life cycle of DW?
1) Getting data from OLTP systems from diff data sources,
2) Analysis & staging - Putting in a staging layer- cleaning, purging, putting surrogate keys, SCM, dimensional modeling, Loading. 3) Writing of metadata
OLAP: An OLAP database is generally used to analyze data. It is optimized so that we can quickly retrieve data. An OLAP databases is generally created from the information we have put in an OLTP databases. In OLAP data stored in multi dimensional databases, so business people view it, allowing them to slice and dice the data to answer business questions. OLAP database will provide must faster response time for analytical queries,
OLTP: OLTP databases are designed for day-to-day transactions, typically an OLTP database has thousands of users connected to it and Performing transactions round the clock, Examples are ATMs, online shopping, railway reservation, online application filling, etc, These systems are optimized for quickly insert and update records; these systems are not designed for analysis and reporting
Difference between OLTP and DWH
OLTP DWH
Application Oriented Clerical User Subject Orient Knowledge User
Used to run business Performance Sensitive Used to analyze business Performance relaxed
Detailed data Few Records accessed at a time Summarized and refined Large volumes accessed at a time
Current up to date Read/Update Access Snapshot data Mostly Read (Batch Update)
Isolated Data No data redundancy Integrated Data Redundancy present
Repetitive access DB Size 100MB-100 GB Ad-hoc access DBSize 100 GB - few terabytes
BI : Bi is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.
Source : Here a source is a database where we extract the database. Generally
sources are read only and existing databases.
Target : Here Target is Warehouse Database.
ODS : Operational Data Store, It is typically used as staging file, Containing data
to be checked before loading into the warehouse.
Aggregates : These are special tables where the data in Aggregated form. Aggregate tables are structured to define
“Totals” of data stored in granular fact tables. Aggregate tables may be built after creating fact tables
Means at the staging level or as a post-loaded process in the warehouse DBMS itself.
ETL : EXTRACT,TRANSFORM AND LOAD, Process of extracting, transforming(or transporting) and loading
data from source systems into the data warehouse. These tools are particularly useful to aggregate
data from different database suppliers.
Meta Data : Data about the Data (or) Information about the Data.
Staging Area : It is a temporary storage area and is a collection of processes used to
prepare source system data for loading a data warehouse.
Stages: Source Data Extraction, Data transformation(Restructuring),
Data transformation(data cleaning, Value Transformations) Surrogate Key assignments.
Data Mart : A data mart is also Data warehouse, basically a subset of data warehouse.
Data mart is restricted to a single business process or single business group.
Union of data marts equal to data warehouse.
Data Mining : Some times called as Knowledge Discovery. Its a process of finding the
hidden relationships or hidden patterns of the data.
Raw Data : The data in native format, that is source data.
Summary Data : It is the analyzed form of the Data.
Integration Layer: This is the location,where we bring the different formats data from different locations to single format.
different locations to single format.
Data Model : A data model is a conceptual representation of the data structures that are required by a database.
The data structures includes the data objects, the associations between data objects and the rules
which govern operations on the objects.
Dimension : It is an object of the data w.r.t which the analysis is being performed.
Fact : It is an object that is being analyzed.
In multi dimensional analysis the fact w.r.t to n no of dimensions.
Cube : Cube is an appropriate structure used for representing the data in multi dimensional format.
Hyper Cube : A cube with more than three dimensions.
Local cube : A cube contained in a file.
Fact table : A fact table consists of a composite key and facts, where every attribute of the composite key
is foreign key of dimension table. Fact tables are fully normalized.
Fact less fact tables: Fact tables which do not have any facts are called fact less fact tables.
Dimension table : A dimension table contains detail description of the facts. (or) Detail
information of the facts. Dimension tables are de-normalized.
Conformed Dimensions: These are used to analyze facts from two or more data marts. Conformed
Dimensions are linked to multiple fact tables.
Conformed facts: If the definitions of measurements(facts) are highly consistent, we call them as conformed facts.
OLTP : Online transactional processing : Defines the transaction processing that
supports the daily business operations.
OLAP : ON-LINE ANYLITICAL PROCESSING, Means Multi-Dimensional analysis and reporting.
Warehouse key : This key is used for identifying data, when the data comes from multiple sources. It is also
known as surrogate key. Using this joins are simplified and comparing integers take less space.
Developing OLAP : Multi-Dimensionality (We need fact and dimensions)
Slices and Dies, Drill down (we need hierarchies).
MOLAP : Multi-dimensional Online Analytical and Processing:
In this Cubes are located in files. OLAP concepts implemented in a files.
It provides best performance ,but does not support large volumes of data.
Cognos Power Play transformer Implements MOLAP.
ROLAP : Relational Online Analytical and Processing.
It is also like MOLAP where the data process is with in the Relational
Database Management System. It supports large volumes of data.
Micro strategy implements these concepts.
DOLAP : Desktop Online Analytical and Processing.
The processing of data with in a client desktop Machines. Business objects implements these concepts.
HOLAP : Hybrid Online Analytical and Processing. It is combination of MOLAP and ROLAP.
The source data is usually stored using a ROLAP strategy and aggregations are stored
using a MOLAP strategy. Microsoft analytical server supports these concepts.
Data cleansing : The process of correcting errors and removing inconsistencies before importing data into the DW.
Data scrubbing : the process of filtering, merging, decoding and translating the source data
To create the validation data for data warehouse.
MDX (Multi dimensional Expressions): This is querying language for OLAP cubes.
Multi dimensional Analysis: A process of analysis that involves organizing and summarizing data in a multiple number of dimensions.
Replication : The physical copying of data from one database to another.
Slice and Dice : the ability to move between different combinations of dimensions when Viewing data.
Choose different categories for Ur Report, Drill down or Drill Up. Drag dimensions from the dimension line,
Swap rows, columns or layers, filter informationDrilling Across : Drilling Across adds more data to an existing row. If drilling down is requesting ever finer and granular data from the same fact table, then drilling across is the process of linking two or more fact tables at the same granularity, or, in other words, tables with the same set of grouping columns and dimensional constraints.
A drill across report can be created by using grouping columns that apply to all the fact tables used in the report.
Drill-across: Drill-across allows the user create a new report where a specific entity is used. For example, in a Product report you select a specific line (product); then you can drill-across to an Order-Line or Invoice-Line Report to display only the lines, where this product is referenced.
Hetrogenous Homogenous.
Stored in different schemas common structure.
Stored in different file or db types same database type.
Spread across in several countries same data center.
Different platform in H/W config. Same platform and hardware config.
Candidate key: When an entity has more than one possible key, each key is referred to as a candidate KEY
Granularity : The level of detail of the facts stored in a data warehouse.
Hierarchy : An organization of data into a logical tree structure defining parent-child relationships between the levels in a dimension. Controls data consolidation and drill down paths.
A typical time dimension would have a hierarchy based on date, week, month, quarter and year.
Level : A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.
Category : Items that match a specific description or classification. Categories can be different levels of information within a dimension. Any categories can be grouped into more general categories. For example, a set of dates could be grouped into a month, and months into quarters, and quarters into years.
Measure: A numeric value stored in a fact table or cube. Typical examples include sales value, sales volume, price, stock and headcount.
Additive measure: Measure of a fact that can be added across all dimensions
Non-additive measure: Measure of a fact that cannot be added across any of its dimensions. A typical example of a non-additive measure is a percentage.
Semi additive measure: Measure of a fact that can be added across only some of its dimensions. A typical example of a semi additive measure is a balance.
Fact Constellation: A fact constellation consists of two fact tables sharing a common dimensional table.
Monday, February 4, 2008
Subscribe to:
Posts (Atom)