Sunday, August 21, 2022

Apache Hive Series-Storage & Ingestion

 Hive- Data Ingestion and Data Storage


Hive is a data warehouse, designed mainly to process huge data in hdfs. The working of Apache Hive is simple. It translates the input program written in HiveQL (Similar to SQL) into one or more Java a MapReduce and Spark jobs. To know more about Hive architecture and components, refer to my previous post link given in the first comment.

To hold the data, a table should be created. The syntax is given below:

Syntax for table creation:

Create table test.emp
(sno int,
Empname string,
City string)
ROW FORMAT delimited fields terminated by ‘,’ LINES TERMINATED BY ‘/n’ STORED AS TEXTFILE.


Data Ingestion:

We can ingest the data with the hive table in 3 ways
1. Loading from files --must mention row format, otherwise, it will be loaded only null values
2. Inserting from sqoop, spark--, no need to specify the row, field, and line format, the hive itself will take care.
3. Copying the data file into /user/hive/warehouse path after table creation.

We need to add how fields; rows are presented and the file format of the source file when you are loading the data from files by the below line in the time of table creation itself.
"ROW FORMAT delimited fields terminated by ‘,’ LINES TERMINATED BY ‘/n’ STORED AS TEXTFILE"

While loading the data from the source file, the syntax will vary based on where it is located. If the source file is in the Linux box, mention the keyword ‘local’ before inpath.

 For example:
Load data local inpath ‘/home/test/emp_data.txt’ into table emp;

If the source file is present inside HDFS, no need to mention the keyword ‘local’ before inpath.

For example:
load data inpath ‘/home/test/emp_data.txt’ into table emp






Hive with MapReduce jobs:

Hive is running on the MapReduce engine, if needed, you can configure Tej, spark instead of MapReduce. Hive may/may not trigger MapReduce jobs for data processing. Hive does not use/trigger MapReduce job for simple load command whereas insertion, it is used MapReduce job.

Select, simple load using load commandno map reducer job used

Select with Aggregation function, Insertion- map reducer job used

Hive Data Storage:

Hive is a query engine, not a database as it does not have its own storage. The data files are stored in /user/hive/warehouse by default under the Hadoop ecosystem. To know more about hive tables, you can use below commands:

1. desc <tablename>-gives field name and its datatype
2. desc extended <tablename>-gives detailed table info like table, dbname, owner name, Fieldschema, data file location, bucket numbers,serde info, field delimitation, table types, table size, etc.







Data Engineering Series-Python skills for Data Engineers


            


Python skills required for Data Engineers

While interacting with data enthusiasts, one of the questions asked repeatedly is :
What level of python skill requires for a Data engineer?

Today we are going to see the most important python skills that require data engineers. To become a Data engineer, one should familiar with minimum python fundamentals and a few libraries used frequently. To know the important python libraries requires for a data engineer, please refer my previous post given in the first comment.

Python Basics:

Lists
Tuples
Directories & sets
Variables
Strings
Math expression
Loops
Python Functions
Error/File Handling
Unit Testing

To interact with Data sources(DBs)

Pycopg2
Mysql

Working with data:

Json
Jsonschema(validation)
Datetime
Numpy
Pandas

Connection to APIs:

Requests
logging

 






Data Warehouse Series-Introduction

 

Data Warehouse:

A data warehouse is a central repository of information that can be analyzed to make more strategic decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources on a regular cadence. Business analysts, data engineers, data scientists, and decision-makers access the data through business intelligence (BI) tools, SQL clients.

Why do we need Data Warehouse?

Data and analytics have become indispensable for businesses to stay competitive. Business users rely on analytics tools to extract insights from their data, monitor KPIs, and support decision-making. Data warehouses power these reports, dashboards, and analytics tools by storing data efficiently to minimize the input and output (I/O) of data and deliver query results quickly to hundreds and thousands of users concurrently.

How is a data warehouse architected?

A data warehouse architecture is made up of tiers. The top tier is the front-end client that presents results through reporting, analysis, and data mining tools. It is generally known as the analytical/reporting layer. The middle tier consists of the analytics engine that is used to access and analyze the data. Here metadata is mapped with physical database tables and columns. The bottom tier of the architecture is the database server, where data is loaded and stored. The data is organized into tables and columns. Within each column, you can define a description of the data, such as integer, data field, or string. Tables can be organized inside of schemas, which you can think of as folders. When data is ingested, it is stored in various tables described by the schema. Query tools use the schema to determine which data tables to access and analyze.

When do we need a data warehouse?

When you need to take Informed decision making, consolidate data from many sources, perform historical data analysis, use data with quality, consistency, and accuracy, and generate analytic reports with the best performance than transactional databases.

Where do we use Data Warehouse?

Data warehouses are used in BI, reporting, and data analysis to extract and summarize data from operational databases. Ex: management wants to know the total revenues generated by each salesperson on a monthly basis for each product category. Transactional databases may not capture this data, but the data warehouse does.

Limitations:

1.      No support for unstructured data like images, text, IoT data, HL7, JSON, and XML.

2.      SQL-only — DWHs typically offer no support for Python or R

3.      Duplicated data —DWH/subject-area/(departmental) data marts, which results in duplicated data, lots of redundant ETL, and no single source of truth

4.      Expensive — commercial DWH charges you for Data storage & analysis.


The below picture represent the data flow in Data Warehouse





Apache Hive Series-Architecture



Hive Architecture:

As you know, Hive is a query engine which is built upon Hadoop and uses MapReduce model for processing. Before going for optimization series, we need to understand its architecture.

The main components of Hive as follows:

UI(User Interface): It is an interface between User and Hive. It enables user to submit queries and other operations.

Client: It is the Command Line(CLI) interface through which we can submit the hive queries.

Server: It is known as apache thrift server, which accepts the requests from hive clients and submit to hive driver.

Driver: The driver is mainly to receive the queries from various resources like thrift, JDBC, ODBS, UI and CLIENTS. After receiving the queries, it transfers in to the complier.

Compiler: It performs query parsing, generate execution plan based on table and partition info of metastore.

Metastore: Here hive stores the meta-information about the databases like schema of the table, data types of the columns, location in the HDFS, etc

Executor:
It will execute the execution plan generated by complier. It manages dependencies within stage of DAG and execute it on system.

Job Execution flow:

The user will execute query from CLI/UI, once queries send by server, Driver designs a session handle for query to complier, which gets necessary metadata from metastore and make an execution plan and then transfer execute plan to driver, which send it to executor. The executor will execute(via MapReduce) and return the results to driver, finally the result will be submitted to UI by driver.




Spark- Window Function

  Window functions in Spark ================================================ -> Spark Window functions operate on a group of rows like pa...