Friday, July 4, 2008

Data warehouse architecture?

Data warehouse architecture?


Data Warehouse Design Process

1) Top-down, bottom-up approaches or a combination of both

a) Top-down: Starts with overall design and planning (mature)

b)Bottom-up: Starts with experiments and prototypes (rapid)

2) From software engineering point of view

a) Waterfall: structured and systematic analysis at each step before proceeding to the next

b) Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around
3) Typical data warehouse design process

a) Choose a business process to model, e.g., orders, invoices, etc.

b) Choose the grain (atomic level of data) of the business process

c) Choose the dimensions that will apply to each fact table record

d) Choose the measure that will populate each fact table record





Data warehouse implementation

Efficient Data Cube Computation

1) Data cube can be viewed as a lattice of cuboids

a) The bottom-most cuboid is the base cuboid

b) The top-most cuboid (apex) contains only one cell

c) How many cuboids in an n-dimensional cube with L levels?

2) Materialization of data cube

a) Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)

b) Selection of which cuboids to materialize

c) Based on size, sharing, access frequency, etc.


Indexing OLAP Data: Bitmap Index


1) Index on a particular column

2) Each value in the column has a bit vector: bit-op is fast

3) The length of the bit vector: # of records in the base table

4) The i-th bit is set if the i-th row of the base table has the value for the indexed column

5) Not suitable for high cardinality domains



Indexing OLAP Data: Join Indices



1) Traditional indices map the values to a list of record ids

a) It materializes relational join in JI file and speeds up relational join — a rather costly operation

2) In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table.


i) E.g. fact table: Sales and two dimensions city and product


a) A join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city

b) Join indices can span multiple dimensions





Efficient Processing OLAP Queries


1) Determine which operations should be performed on the available cuboids:

a) Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection

2) Determine to which materialized cuboids(s) the relevant operations should be applied.

Metadata Repository


1) Meta data is the data defining warehouse objects. It has the following kinds

i) Description of the structure of the warehouse

a) schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents

ii) Operational meta-data

a) data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)

iii) The algorithms used for summarization

iv) The mapping from operational environment to the data warehouse

v) Data related to system performance

a) Warehouse schema, view and derived data definitions

vi) Business data

a) business terms and definitions, ownership of data, charging policies

Data Warehouse Back-End Tools and Utilities

1) Data extraction:

a) get data from multiple, heterogeneous, and external sources

2) Data cleaning:

a) detect errors in the data and rectify them when possible

3) Data transformation:

a) convert data from legacy or host format to warehouse format

4) Load:

a) sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions

5) Refresh

a) propagate the updates from the data sources to the warehouse

No comments: