Monday, September 26, 2022

Hive - Q&A - Part 2


Hive Interview Questions


11. Why does Hive not store metadata information in HDFS?

Ans. Hive stores metadata information in the metastore, which is backed by an RDBMS rather than HDFS. An RDBMS is used to achieve low latency, because HDFS read/write operations are time-consuming and poorly suited to the small, frequent updates that metadata requires.

12. What is the difference between local and remote metastore?

Ans. Local Metastore:

In this configuration, the metastore service runs in the same JVM as the Hive service and connects to a database running in a separate process, either on the same machine or on a remote machine.

Remote Metastore:

In this configuration, the metastore service runs in its own JVM, separate from the Hive service JVM, and Hive communicates with it over the Thrift network API.

13. What is the difference between the external table and managed table?

Ans. Managed table:

If one drops a managed table, the metadata information is deleted along with the table data from the Hive warehouse directory.

External table:

If one drops an external table, Hive deletes only the metadata information regarding the table and leaves the table data present in HDFS untouched.
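
A minimal sketch of the two table types (the table names, columns, and HDFS path below are illustrative):

  CREATE TABLE managed_sales (id INT, amount DOUBLE);      -- data is stored under the Hive warehouse directory

  CREATE EXTERNAL TABLE external_sales (id INT, amount DOUBLE)
  LOCATION '/data/external_sales';                          -- DROP TABLE removes only the metadata, not this data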

14. Is it possible to change the default location of a managed table?

Ans. Yes. By specifying the LOCATION '<hdfs_path>' clause while creating the table, we can change the default location of a managed table.
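
For example, a managed table can be created outside the default warehouse directory (the path is illustrative):

  CREATE TABLE employee (id INT, name STRING)
  LOCATION '/user/hive/custom_location/employee';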

15. What is the default database provided by Apache Hive for metastore?

Ans. By default, Hive provides an embedded Derby database instance, backed by the local disk, for the metastore. This is what we call the embedded metastore configuration.

16. When should we use SORT BY instead of ORDER BY?

Ans. We should use SORT BY instead of ORDER BY when we have to sort huge datasets. The SORT BY clause sorts the data using multiple reducers, each producing sorted output, whereas ORDER BY sorts all of the data together using a single reducer.

Hence, ORDER BY takes a long time to execute on a large volume of input.
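
A quick illustration, assuming a hypothetical sales table:

  SELECT * FROM sales ORDER BY amount;   -- total ordering, funnelled through a single reducer
  SELECT * FROM sales SORT BY amount;    -- each reducer's output is sorted, work runs in parallel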

17. What is a partition in Hive?

Ans. Hive organizes tables into partitions in order to group similar data together on the basis of a column or partition key.

Each table can have one or more partition keys to identify a particular partition. Physically, a Hive partition is simply a sub-directory inside the table directory.
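
For example, a partitioned table can be declared as follows (the table and columns are illustrative):

  CREATE TABLE sales (id INT, amount DOUBLE)
  PARTITIONED BY (country STRING, year INT);
  -- rows for country='US', year=2022 land under .../sales/country=US/year=2022/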

18. Why do we perform partitioning in Hive?

Ans. Partitioning provides granularity in a Hive table. It reduces query latency because a query scans only the relevant partitions instead of the whole dataset.
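
Continuing the illustrative sales table above, filtering on the partition columns lets Hive prune directories:

  SELECT SUM(amount) FROM sales
  WHERE country = 'US' AND year = 2022;   -- scans only that partition's sub-directory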

19. What is dynamic partitioning and when is it used?

Ans. In dynamic partitioning, the values for the partition columns are known only at runtime; in other words, they are determined while the data is being loaded into the Hive table (a minimal load sketch follows the usage notes below).

  • Usage:
  1. While we load data from an existing non-partitioned table, to improve sampling and thus decrease query latency.
  2. Also, while we do not know all the values of the partitions beforehand, since finding these partition values manually in a huge dataset is a tedious task.
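
A minimal sketch of a dynamic-partition load (the table and column names are illustrative; the two SET commands enable dynamic partitioning):

  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;

  INSERT OVERWRITE TABLE sales PARTITION (country, year)
  SELECT id, amount, country, year FROM raw_sales;   -- partition values are taken from the query output at runtime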

20. Why do we need buckets?

Ans. Basically, there are two main reasons for performing bucketing on a partition (a minimal DDL sketch follows the list):

  • A map-side join requires the data belonging to a unique join key to be present in the same bucket.
  • It allows us to decrease the query time and also makes the sampling process more efficient.
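
A bucketed table might be declared like this (the names and bucket count are illustrative):

  CREATE TABLE sales_bucketed (id INT, amount DOUBLE)
  PARTITIONED BY (country STRING)
  CLUSTERED BY (id) INTO 32 BUCKETS;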

21. How does Hive distribute the rows into buckets?

Ans. Hive determines the bucket number for a row by using the formula: hash_function(bucketing_column) modulo (num_of_buckets). The hash_function depends on the column data type; for an integer column it is simply:
hash_function(int_type_column) = value of int_type_column
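
As a worked example (with an illustrative bucket count): if a table has 32 buckets and the bucketing column is an INT user_id, a row with user_id = 103 goes to bucket 103 % 32 = 7.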

22. What is indexing and why do we need it?

Ans. A Hive index is a query optimization technique. Basically, we use it to speed up access to a column or set of columns in a Hive database. With an index, the database system does not need to read all rows in the table to find the selected data.
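
A sketch of creating a compact index, assuming a pre-Hive-3 release (indexing was removed in Hive 3.0); the table and index names are illustrative:

  CREATE INDEX sales_country_idx ON TABLE sales (country)
  AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  WITH DEFERRED REBUILD;

  ALTER INDEX sales_country_idx ON sales REBUILD;   -- populate the index data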

23. What is the use of HCatalog?

Ans. Basically, we use HCatalog to share data structures with external systems. It gives users of other Hadoop tools access to the Hive metastore, so they can read data from and write data to Hive's data warehouse.

24. How to optimize Hive performance?

Ans. There are several query optimization techniques we can use in Hive to optimize performance, such as (a few representative settings are sketched after the list):

  1. Tez-Execution Engine in Hive
  2. Usage of Suitable File Format in Hive
  3. Hive Partitioning
  4. Bucketing in Hive
  5. Vectorization In Hive
  6. Cost-Based Optimization in Hive (CBO)
  7. Hive Indexing
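
Several of these can be switched on through session settings; a minimal sketch using standard Hive properties (appropriate values depend on the cluster and Hive version):

  SET hive.execution.engine = tez;                  -- run queries on Tez instead of MapReduce
  SET hive.vectorized.execution.enabled = true;     -- process batches of rows (vectorization)
  SET hive.cbo.enable = true;                       -- enable the cost-based optimizer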

25. Can we change the data type of a column in a Hive table?

Ans. Yes. By using the REPLACE COLUMNS option of ALTER TABLE, we can change the data type of a column in a Hive table:
ALTER TABLE table_name REPLACE COLUMNS ……
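
A hedged example (the table and columns are illustrative; REPLACE COLUMNS restates the entire column list, so unchanged columns must be repeated):

  ALTER TABLE sales REPLACE COLUMNS (id BIGINT, amount DOUBLE);   -- id changed from INT to BIGINT

A single column's type can also be changed with the CHANGE COLUMN clause, e.g. ALTER TABLE sales CHANGE COLUMN id id BIGINT.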
