Data platforms run into a common set of issues in day-to-day operation. In this post, we will discuss those issues and the actions to take to resolve them. I have given each one as a one-liner.
Issue: Lack of data definitions
Action: maintain a central catalog of data definitions and a business glossary
Issue: Cross-system mismatch
Action: map data across systems in a uniform fashion
Issue: Orphaned data
Action: all data files should be indexed/cataloged
Issue: Irrelevant Data
Action: identify and reconcile on a regular basis
Issue: Lack of history
Action: use snapshot-enabled tools
Issue: Mishandling of late data
Action: run a separate data pipeline for backfilling, with minimal downtime
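One way to picture this is a small routing step that sends late records to the backfill path instead of the main one. This is only a sketch under assumptions: the `event_time` field, the watermark cutoff, and the function name `split_by_watermark` are made up for illustration and do not come from any specific tool.

```python
from datetime import datetime

def split_by_watermark(events, watermark):
    """Split events into (on_time, late) by comparing event_time to a cutoff.

    Events older than the watermark are treated as late and would be handed
    to a separate backfill pipeline (hypothetical) rather than the main one.
    """
    on_time = [e for e in events if e["event_time"] >= watermark]
    late = [e for e in events if e["event_time"] < watermark]
    return on_time, late

# Illustrative data: one record arrives for a day that is already closed.
watermark = datetime(2024, 1, 2)
events = [
    {"id": 1, "event_time": datetime(2024, 1, 2, 8, 0)},
    {"id": 2, "event_time": datetime(2024, 1, 1, 23, 0)},
]
on_time, late = split_by_watermark(events, watermark)
```

Keeping the late records on their own path means the main pipeline does not have to stop or reprocess while history is being corrected.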
Issue: Missing Attributes
Action: schema validation and reconciliation
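A minimal sketch of schema validation, assuming records arrive as Python dictionaries. The schema contents and the name `validate_record` are hypothetical and only meant to show the idea.

```python
# Hypothetical expected schema: attribute name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing attribute: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(f"wrong type for {name}: {actual}")
    return problems

# A record without customer_id is reported instead of silently loaded.
issues = validate_record({"order_id": 1, "amount": 9.5})
```

Records that fail validation can then be set aside for reconciliation with the source system.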
Issue: Missing Values
Action: drop or impute
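Both options can be sketched in a few lines. Mean imputation is used here only as one simple choice, and the row layout and function names are assumptions for illustration.

```python
from statistics import mean

def drop_missing(rows, column):
    """Keep only the rows where the column has a value."""
    return [row for row in rows if row.get(column) is not None]

def impute_mean(rows, column):
    """Fill missing values in the column with the mean of the observed values."""
    observed = [row[column] for row in rows if row.get(column) is not None]
    if not observed:
        # Nothing to compute a mean from, so return unchanged copies.
        return [dict(row) for row in rows]
    fill = mean(observed)
    return [
        {**row, column: fill if row.get(column) is None else row[column]}
        for row in rows
    ]

rows = [{"id": 1, "age": 30}, {"id": 2, "age": None}, {"id": 3, "age": 40}]
dropped = drop_missing(rows, "age")
imputed = impute_mean(rows, "age")
```

Dropping is safer when few rows are affected. Imputing keeps the row count intact but changes the data, so it is worth recording which values were filled.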
Issue: Missing Records
Action: attach metadata
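One reading of this action is that each batch carries metadata, such as an expected record count, so that gaps can be detected on arrival. The sketch below follows that reading. The manifest layout and the name `reconcile_batch` are assumptions, not a standard format.

```python
def reconcile_batch(manifest, rows):
    """Compare the record count promised in the metadata with what was loaded."""
    expected = manifest["record_count"]
    actual = len(rows)
    return {
        "expected": expected,
        "actual": actual,
        "missing": max(expected - actual, 0),
    }

# Hypothetical manifest shipped alongside the data file.
manifest = {"source": "orders", "record_count": 3}
rows = [{"id": 1}, {"id": 2}]
report = reconcile_batch(manifest, rows)
```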
Issue: Default Values
Action: document default values in the data catalog
Issue: Duplication of records
Action: check and purge/merge
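A minimal sketch of the merge option, assuming duplicates share a key and that the first non-empty value for each field should win. The field names and the function name are illustrative only.

```python
def merge_duplicates(records, key):
    """Collapse records that share a key, filling gaps from later duplicates."""
    merged = {}
    for record in records:
        k = record[key]
        if k not in merged:
            merged[k] = dict(record)
            continue
        for field, value in record.items():
            # Keep the first non-empty value seen for each field.
            if merged[k].get(field) is None and value is not None:
                merged[k][field] = value
    return list(merged.values())

records = [
    {"id": 1, "name": "Ann", "email": None},
    {"id": 1, "name": "Ann", "email": "ann@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]
unique = merge_duplicates(records, "id")
```

Purging is the simpler variant: keep the first record per key and discard the rest.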
Issue: Attribute format inconsistency
Action: standardize attribute format throughout its lifecycle
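Dates are a typical case. The sketch below normalizes a few input formats to ISO 8601. The list of accepted formats is an assumption and would need to match what the source systems actually send.

```python
from datetime import datetime

# Assumed input formats, tried in order.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

def to_iso_date(text):
    """Parse a date string in any known format and return it as YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

normalized = [to_iso_date(s) for s in ("2024-03-05", "05/03/2024", "20240305")]
```

Raising an error on unknown formats is deliberate here: guessing would hide exactly the inconsistency this action is meant to remove.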
In summary, we should place controls at various levels to fix issues in our data platform:
- Fixes at the source system
- Fixes during the transformation process
- Continuous data profiling
- Guardrails at the metadata layer
- Data quality (DQ) checks and alerts at the consumption layer
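As a rough illustration of the last two points, the sketch below profiles the null rate of each column and raises an alert message when it crosses a limit. The thresholds, column names, and function names are assumptions. A real setup would send the alerts to a monitoring channel.

```python
def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)

def run_dq_checks(rows, max_null_rates):
    """Return an alert message for every column whose null rate is too high."""
    alerts = []
    for column, limit in max_null_rates.items():
        rate = null_rate(rows, column)
        if rate > limit:
            alerts.append(
                f"{column}: null rate {rate:.0%} exceeds limit {limit:.0%}"
            )
    return alerts

rows = [
    {"a": 1, "b": None},
    {"a": 2, "b": None},
    {"a": None, "b": 3},
    {"a": 4, "b": 5},
]
alerts = run_dq_checks(rows, {"a": 0.5, "b": 0.25})
```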