Monday, September 26, 2022

Hive - View and Index

 

Hive commands

1. How to connect the Hive using python

You need to set this property in your hive-site.xml file.

<property>
<name>hive.server2.authentication</name>
<value>NOSASL</value>
</property>

To install you’ll need these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")

Now that you have the hive connection, you have options how to use it. You can just straight-up query:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
use_result(result)

…or to use the connection to make a Pandas dataframe:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)

2. What is view in hive ?

In other words, Apache Hive View is a searchable object in a database which we can define by the query. However, we can not store data in the view. Still, some refer to as a view as “virtual tables”.

Hence, we can query a view like we can a table. Moreover, by using joins it is possible to combine data from or more table. Also, it contains a subset of information.

Creating a Hive View

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], …) ]
[COMMENT table_comment]
AS SELECT …

Droping Hive View

DROP VIEW view_name

2. Index in Hive

Indexes are a pointer or reference to a record in a table as in relational databases. Indexing is a relatively new feature in Hive. In Hive, the index table is different than the main table. Indexes facilitate in making query execution or search operation faster. However, storing indexes require disk space and creating an index involves cost.

“EXPLAIN” query must be checked to evaluate the benefit through a query execution plan. Indexing in hive makes large dataset analysis relatively quicker by better query performance on operations.

There are two types of indexing in Hive:

  • Bitmap Indexing: This is used with columns having a few distinct values. It is known to store both the indexed column’s value and the list of rows as a bitmap. From Hive V0.8.0 onwards, the bitmap index handler is built-in in Hive.
  • Compact Indexing: This type of indexing is known to store the column value and storage blockid.

CREATE INDEX index_name
ON TABLE base_table_name (col_name, …)
AS index_type
[WITH DEFERRED REBUILD] [IDXPROPERTIES (property_name=property_value, …)] [IN TABLE index_table_name] [ [ ROW FORMAT …] STORED AS …
| STORED BY … ] [LOCATION hdfs_path] [TBLPROPERTIES (…)] [COMMENT “index comment”]

SHOW formatted index on base_table_name

ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD

DROP INDEX IF EXISTS index_name ON base_table_name

The index and the indexed table is deleted automatically if the table on which index was built is dropped. Similarly, if a partitioned table is indexed then on dropping the partitions, the indexes are also automatically deleted.

No comments:

Post a Comment

Spark- Window Function

  Window functions in Spark ================================================ -> Spark Window functions operate on a group of rows like pa...