Sunday, August 21, 2022

Data Warehouse Series-Introduction

 

Data Warehouse:

A data warehouse is a central repository of information that can be analyzed to make more strategic decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources on a regular cadence. Business analysts, data engineers, data scientists, and decision-makers access the data through business intelligence (BI) tools, SQL clients.

Why do we need Data Warehouse?

Data and analytics have become indispensable for businesses to stay competitive. Business users rely on analytics tools to extract insights from their data, monitor KPIs, and support decision-making. Data warehouses power these reports, dashboards, and analytics tools by storing data efficiently to minimize the input and output (I/O) of data and deliver query results quickly to hundreds and thousands of users concurrently.

How is a data warehouse architected?

A data warehouse architecture is made up of tiers. The top tier is the front-end client that presents results through reporting, analysis, and data mining tools. It is generally known as the analytical/reporting layer. The middle tier consists of the analytics engine that is used to access and analyze the data. Here metadata is mapped with physical database tables and columns. The bottom tier of the architecture is the database server, where data is loaded and stored. The data is organized into tables and columns. Within each column, you can define a description of the data, such as integer, data field, or string. Tables can be organized inside of schemas, which you can think of as folders. When data is ingested, it is stored in various tables described by the schema. Query tools use the schema to determine which data tables to access and analyze.

When do we need a data warehouse?

When you need to take Informed decision making, consolidate data from many sources, perform historical data analysis, use data with quality, consistency, and accuracy, and generate analytic reports with the best performance than transactional databases.

Where do we use Data Warehouse?

Data warehouses are used in BI, reporting, and data analysis to extract and summarize data from operational databases. Ex: management wants to know the total revenues generated by each salesperson on a monthly basis for each product category. Transactional databases may not capture this data, but the data warehouse does.

Limitations:

1.      No support for unstructured data like images, text, IoT data, HL7, JSON, and XML.

2.      SQL-only — DWHs typically offer no support for Python or R

3.      Duplicated data —DWH/subject-area/(departmental) data marts, which results in duplicated data, lots of redundant ETL, and no single source of truth

4.      Expensive — commercial DWH charges you for Data storage & analysis.


The below picture represent the data flow in Data Warehouse





No comments:

Post a Comment

Spark- Window Function

  Window functions in Spark ================================================ -> Spark Window functions operate on a group of rows like pa...