Saturday, September 17, 2022

Common Issues in Data Platform

 

In a data platform, we run into many common issues in day-to-day work. In this post, we will go through these issues and the actions that resolve them. Each one is given here as a one-liner.

Issue: Lack of data definition
Action: maintain a central catalog of data definitions and a business glossary

Issue: Cross-system mismatch
Action: map data across systems in a uniform fashion

Issue: Orphaned data
Action: all data files should be indexed/cataloged
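
As a rough illustration of the cataloging check, the sketch below compares the files found in storage with the files registered in a catalog. The function name `find_orphans` and the example paths are hypothetical; a real platform would read both lists from its storage layer and its catalog tool.

```python
def find_orphans(stored_paths, cataloged_paths):
    """Return stored files that have no catalog entry, in sorted order."""
    # Anything present in storage but absent from the catalog is orphaned.
    return sorted(set(stored_paths) - set(cataloged_paths))


# Hypothetical example data.
stored = ["/lake/sales/2022-09.parquet", "/lake/tmp/export_old.csv"]
cataloged = ["/lake/sales/2022-09.parquet"]
orphans = find_orphans(stored, cataloged)
```

Running a check like this on a schedule gives a list of files to either register or delete.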

Issue: Irrelevant Data
Action: identify and reconcile on a regular basis

Issue: Lack of history
Action: use snapshot-enabled tools

Issue: Mishandling of late data
Action: run a separate data pipeline for backfilling, with minimal downtime

Issue: Missing Attributes
Action: schema validation and reconciliation
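
A minimal sketch of schema validation in plain Python is shown below. The schema `EXPECTED_SCHEMA` and its attribute names are assumptions made up for illustration; in practice the expected schema would come from the data catalog or a schema registry.

```python
# Hypothetical expected schema: attribute name -> required Python type.
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}


def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems found in one record (empty means valid)."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing attribute: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"wrong type for {name}: {actual}")
    return problems
```

Records that return a non-empty list can be routed to a quarantine area and reconciled against the source later.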

Issue: Missing Values
Action: drop or impute
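
The two options can be sketched as follows. This is only one possible approach: it treats `None` as the missing marker and uses the column mean for imputation, both of which are assumptions; the right strategy depends on the data. The function names are hypothetical.

```python
from statistics import mean


def drop_missing(rows, column):
    """Keep only the rows where the column has a value."""
    return [row for row in rows if row.get(column) is not None]


def impute_mean(rows, column):
    """Fill missing values in the column with the mean of observed values."""
    observed = [row[column] for row in rows if row.get(column) is not None]
    if not observed:
        # Nothing to compute a mean from, so return unchanged copies.
        return [dict(row) for row in rows]
    fill = mean(observed)
    return [
        {**row, column: fill if row.get(column) is None else row[column]}
        for row in rows
    ]
```

Dropping is safer when few rows are affected; imputing keeps the row count stable but changes the distribution of the column.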

Issue: Missing Records
Action: attach metadata

Issue: Default Values
Action: cover in data catalog

Issue: Duplication of records
Action: check and purge/merge
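
Below is a small sketch of the merge option, assuming duplicates are identified by a single key attribute and that the first non-missing value for each field should win. Both assumptions, and the name `merge_duplicates`, are illustrative only.

```python
def merge_duplicates(rows, key):
    """Collapse rows that share the same key into one merged row each."""
    merged = {}
    for row in rows:
        k = row[key]
        if k not in merged:
            # First time this key is seen: keep a copy of the row.
            merged[k] = dict(row)
        else:
            # Duplicate: fill only the fields that are still missing.
            for field, value in row.items():
                if merged[k].get(field) is None and value is not None:
                    merged[k][field] = value
    return list(merged.values())
```

To purge instead of merge, the `else` branch can simply be dropped so that only the first row per key is kept.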

Issue: Attribute format inconsistency
Action: standardize attribute format throughout its lifecycle
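
As an example of standardizing a format, the sketch below converts date strings to ISO 8601 (`YYYY-MM-DD`). The list of accepted input formats in `KNOWN_FORMATS` is an assumption; it should be replaced with the formats the source systems actually produce.

```python
from datetime import datetime

# Hypothetical input formats seen across source systems, tried in order.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def to_iso_date(value):
    """Convert a date string in any known format to YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # Not this format, try the next one.
    raise ValueError(f"unrecognized date format: {value!r}")
```

Applying the conversion once at ingestion keeps the attribute in a single format for every downstream consumer.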

In summary, we should place controls at various levels to fix issues in our data platform:

- Fixes at the source system
- Fixes during the transformation process
- Continuous data profiling
- Guardrails at the metadata layer
- Data quality (DQ) checks and alerts at the consumption layer
