Impact of Data File Formats in Big Data
Overview
Let’s discuss the different kinds of data file formats used in Big Data. They are widely used yet rarely talked about, but they are the building blocks of every data engineering task, whether that is data processing, data visualization, or ML: at some point you have to retrieve data from somewhere.
If the storage/file format is wrong, it shows up as many common issues such as slow query runtimes, slow dashboard refreshes, and even certain Spark joins taking longer than they should.
Anyone aspiring to be a Data Engineer, Data Analyst, or ML Engineer should be aware of the impact file formats have in big data.
Why do we even need a data file format?
- Imagine you visit a grocery store where nothing is in order and items are scattered across random shelves. Would that make your shopping experience better? Probably not; in fact, you might never visit that store again.
- If that example makes sense, you can now imagine the impact unorganized data has on a company.
- Companies receive tens to thousands of gigabytes of data every day. If this data is not stored in a proper format, understanding it becomes difficult, sometimes impossible.
- The more time you spend sorting through the data, the more the company misses out on opportunities to retain customers or generate additional orders/revenue.
This is why data file formats are used in the data lake and play such a key role.
Different kinds of file formats
The leading data file formats used today are CSV, JSON, ORC, Parquet, and Avro.
Which one to choose, and why, depends entirely on the use case and the information you’re looking for.
Consider a use-case of sales in a company. Which file format would be the best way to store this information?
Use-case 1:
If you are looking at total sales from a table, then mostly a single column, sale_amount, needs to be scanned/queried.
Hence storing this table in a columnar format is the best way to go, as the sketch below shows.
File formats to choose: ORC, Parquet
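As a minimal sketch (assuming pyarrow is installed and using a hypothetical sales.parquet file with made-up fields), the snippet below writes a small sales table to Parquet and then reads back only the sale_amount column, which is exactly the access pattern a columnar format is built for:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sales table with a few illustrative rows
sales = pa.table({
    "user_id": [101, 102, 101, 103],
    "item_name": ["apple", "soap", "bread", "milk"],
    "sale_amount": [2.5, 4.0, 3.2, 1.8],
})

# Write the table in a columnar format
pq.write_table(sales, "sales.parquet")

# Read back only the column we care about; the other columns
# are never scanned, which is where the speed-up comes from
amounts = pq.read_table("sales.parquet", columns=["sale_amount"])
total_sales = sum(amounts.column("sale_amount").to_pylist())
print(total_sales)
```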
Use-case 2 :
If you are trying to identify consumer behavior (see the sketch after this list):
- What kind of items are customers placing the order for?
- Which category of item has the customer placed the most orders from?
To gather this information you have to scan row-level data spanning multiple columns such as user_id, item_name, and sale_amount.
Here, storing the data in a row-oriented format is the right choice.
File formats to choose: CSV, JSON, Avro
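As a rough sketch (assuming a hypothetical orders.jsonl file and the same illustrative fields), the snippet below stores each order as one JSON record per line and then reads whole rows back, which is the access pattern a row-oriented format serves well:

```python
import json
from collections import Counter

# Hypothetical orders, one record per row
orders = [
    {"user_id": 101, "item_name": "apple", "item_category": "produce", "sale_amount": 2.5},
    {"user_id": 102, "item_name": "soap", "item_category": "household", "sale_amount": 4.0},
    {"user_id": 101, "item_name": "bread", "item_category": "bakery", "sale_amount": 3.2},
]

# Write each order as one JSON line (row-oriented storage)
with open("orders.jsonl", "w") as f:
    for order in orders:
        f.write(json.dumps(order) + "\n")

# Read whole rows back: every field of a record arrives together,
# so per-customer behavior is easy to reconstruct
category_counts = Counter()
with open("orders.jsonl") as f:
    for line in f:
        order = json.loads(line)
        category_counts[order["item_category"]] += 1

print(category_counts.most_common(1))  # most-ordered category
```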
These two use-cases should be enough to showcase the power of data file formats.
Conclusion
It’s important to know that we don’t have a one-size-fits-all file format yet (anything can happen in the future). The question you must ask is “How will this data be used?”, and the file format should be chosen based on the answer.
Now that you are aware of the impact this can have, I believe you’ll know which data file format to choose when you write data into the data lake.