Thursday, October 13, 2022

File Formats in Bigdata

 

Impact of Data File Formats in Big Data

  • Imagine you visit a grocery store where nothing is in order and related items are scattered across different shelves. Would this make your shopping experience better? I think the answer is no. In fact, you might never visit this grocery store again.
  • If the example above makes sense, you can imagine the impact that unorganized data can have on a company.
  • Companies receive tens to thousands of GB of data every day. If this data is not stored in a proper format, understanding it will be difficult, and sometimes impossible.
  • The more time you spend sorting through the data, the more opportunities the company misses to retain customers or generate more orders and revenue.

Use-case 1:

If you are looking at total sales data from a table, the query mostly needs to scan a single column of that table, sale_amount.

Use-case 2:

If you are trying to identify consumer behavior:

  • What kind of items are customers placing the order for?
  • Which category of item has the customer placed the most orders from?

Wednesday, October 12, 2022

DE Series - 3

 Data Engineering Series

As part of the Data Engineering Series, we have already covered part 1 (Data Engineering: Introduction) and part 2 (Basic Python). Continuing from the previous post on Basic Python, this post covers advanced Python concepts.


Magic Methods in Python

In Python, magic methods are special methods whose names start and end with double underscores.

  • Magic methods are not meant to be invoked directly by you. Python invokes them internally when a certain action is performed on the object
  • Examples of magic methods are: __new__, __repr__, __init__, __add__, __len__, __del__ etc. The __init__ method, used for initialization, is invoked automatically when an instance is created, without an explicit call
  • Use the dir() function to list the attributes of a class, including the magic methods it inherits
  • The advantage of using Python’s magic methods is that they provide a simple way to make objects behave like built-in types
  • Magic methods let user-defined objects emulate the behavior of built-in types. Therefore, whenever you want to control how a user-defined object responds to operators or built-in functions (printing, addition, len() and so on), use magic methods.

Example :

v = 4

v.__add__(2)
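To connect the bullets above to a user-defined type, here is a minimal sketch. The class name Money and its attribute are hypothetical and exist only for illustration; the point is that +, len() and printing are routed to __add__, __len__ and __repr__.

```python
class Money:
    """Hypothetical value type used only for illustration."""

    def __init__(self, amount):
        self.amount = amount

    def __add__(self, other):
        # Called for: self + other
        return Money(self.amount + other.amount)

    def __len__(self):
        # Called for: len(self); here, the number of digits in the amount
        return len(str(self.amount))

    def __repr__(self):
        # Called by repr(); print() falls back to it when __str__ is not defined
        return "Money({})".format(self.amount)


total = Money(40) + Money(2)
print(total)       # Money(42)
print(len(total))  # 2
```

None of the magic methods is called by name here; Python invokes them in response to the operator and the built-in functions.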

Implementation —

# __del__ method
from os.path import expanduser, join

class FileObject:
    def __init__(self, file_path='~', file_name='test.txt'):
        # open() does not expand '~' by itself, so expand it explicitly
        self.file = open(join(expanduser(file_path), file_name), 'rt')

    def __del__(self):
        self.file.close()
        del self.file

Implementation —

# __repr__ method
class String:
    def __init__(self, string):
        self.string = string

    def __repr__(self):
        return 'Object: {}'.format(self.string)

Inheritance and Polymorphism in Python

  • In Python, inheritance and polymorphism are very powerful and important concepts
  • Using inheritance, a class can use (inherit) all the data fields and methods available in the parent class
  • On top of that, you can add your own methods and data fields
  • Python allows multiple inheritance, i.e. you can inherit from multiple classes
  • Inheritance provides a way to write better-organized code and to reuse code

Syntax —

class ParentClass:
    Body of parent class

class DerivedClass(ParentClass):
    Body of derived class

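  • In Python, polymorphism allows us to define a method in the child class with the same name as a method in its parent class. The child's version overrides the parent's, so the same call behaves differently depending on the object's class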

Example —

class X:
    def sample(self):
        print("sample() method from class X")

class Y(X):
    def sample(self):
        print("sample() method from class Y")

Implementation —

# Inheritance
class Vehicle:
    def __init__(self, name, color):
        self.__name = name
        self.__color = color

    def getColor(self):
        return self.__color

    def setColor(self, color):
        self.__color = color

    def get_Name(self):
        return self.__name

class Bike(Vehicle):
    def __init__(self, name, color, model):
        super().__init__(name, color)  # call parent class
        self.__model = model

    def get_details(self):
        # The " " separator is needed to match the output shown below
        return (self.get_Name() + " " + self.__model + " in " +
                self.getColor() + " color")

b_obj = Bike("Cziar", "red", "TK720")
print(b_obj.get_details())
print(b_obj.get_Name())

Output —

Cziar TK720 in red color
Cziar
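The bullets earlier in this section mention multiple inheritance without showing it. A minimal sketch, with hypothetical class names (Engine, Radio, Car) chosen only for illustration:

```python
class Engine:
    def start(self):
        return "engine started"


class Radio:
    def play(self):
        return "radio playing"


class Car(Engine, Radio):
    # Inherits start() from Engine and play() from Radio
    pass


car = Car()
print(car.start())  # engine started
print(car.play())   # radio playing

# Method resolution order: the sequence Python searches for attributes
print([cls.__name__ for cls in Car.__mro__])  # ['Car', 'Engine', 'Radio', 'object']
```

When two parents define a method with the same name, the method resolution order decides which one is used: the parent listed first in the class statement wins.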

Implementation —

# Polymorphism
from math import pi

class Shape:
    def __init__(self, name):
        self.name = name

    def area(self):
        pass

class Sqr(Shape):
    def __init__(self, length):
        super().__init__("Square")
        self.length = length

    def area(self):
        return self.length**2

class Circle(Shape):
    def __init__(self, radius):
        super().__init__("Circle")
        self.radius = radius

    def area(self):
        return pi*self.radius**2

a = Sqr(6)  # the class is named Sqr, not Square
b = Circle(10)
print(a.area())
print(b.area())

Output —

36
314.1592653589793

Errors and Exception Handling in Python

In Python, an error can be a syntax error or an exception.

Syntax errors occur when the parser detects an incorrect statement.

  • Exceptions are raised when an event occurs at run time that changes the normal flow of the program
  • An exception occurs whenever syntactically correct Python code results in an error while it runs
  • Python comes with various built-in exceptions, and users can also create user-defined exceptions

Some of python’s built in exceptions —

IndexError : When a sequence index is out of range

ImportError : When an imported module cannot be found or loaded

KeyError : When a key is not found in a dictionary

NameError : When a variable is not defined

MemoryError : When a program runs out of memory

TypeError : When a function or operation is applied to an object of an inappropriate type

AssertionError : When an assert statement fails

AttributeError : When an attribute reference or assignment fails
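As a quick way to see several of these exceptions in action, the sketch below triggers each one on purpose and prints the name of the exception class that was raised. The helper describe_failure and the module name in the sample are hypothetical, made up for this illustration.

```python
def describe_failure(action):
    """Run a zero-argument callable and return the name of the exception it raises."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return "no error"


samples = {
    "list index": lambda: [1, 2, 3][5],
    # Raises ModuleNotFoundError, which is a subclass of ImportError
    "missing module": lambda: __import__("module_that_does_not_exist_xyz"),
    "dict key": lambda: {"a": 1}["b"],
    "undefined name": lambda: undefined_variable,
    "wrong type": lambda: "abc" + 1,
    "missing attribute": lambda: (42).no_such_attribute,
}

for label, action in samples.items():
    print(label, "->", describe_failure(action))
```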

Try and Except in Python

In Python, exceptions can be handled using a try statement.

  • The block of code that can raise an exception is placed inside the try clause. The code that handles the exception is written in the except clause
  • If no exception occurs, the except block is skipped and the normal flow of the program continues
  • A try clause can have any number of except clauses to handle different exceptions, but at most one of them is executed when an exception occurs
  • We can also raise exceptions ourselves using the raise keyword
  • The try statement can have an optional finally clause, which executes regardless of the result of the try and except blocks

Example :

try:
    print(a)
except:
    print("Something went wrong")
finally:
    print("Exit")

Implementation —

# try, except, finally
try:
    print(1 / 0)
except:
    print("Error occurred")
finally:
    print("Exit")

Output —

Error occurred
Exit
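The implementation above uses a bare except, which catches everything. The sketch below shows the other points from the bullets: several except clauses for different exceptions, the raise keyword, and finally. The function names are hypothetical, and the else clause (which runs only when the try block raised nothing) is an addition that the text above does not cover.

```python
def safe_divide(a, b):
    try:
        result = a / b
    except ZeroDivisionError:
        return "cannot divide by zero"
    except TypeError:
        return "both arguments must be numbers"
    else:
        # Runs only when the try block raised nothing
        return result
    finally:
        # Runs on every path, even though the clauses above return
        print("done")


def parse_age(text):
    age = int(text)
    if age < 0:
        # Raise a built-in exception explicitly
        raise ValueError("age cannot be negative")
    return age


print(safe_divide(10, 2))    # done, then 5.0
print(safe_divide(10, 0))    # done, then cannot divide by zero
print(safe_divide(10, "x"))  # done, then both arguments must be numbers

try:
    parse_age("-3")
except ValueError as err:
    print("Invalid input:", err)
```

Naming the exception in each except clause keeps unrelated errors (for example a typo that causes a NameError) from being silently swallowed.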

User-defined Exceptions

In Python, users can define their own errors by creating a new exception class.

  • Exceptions need to be derived from the Exception class, either directly or indirectly
  • User-defined error handling can be implemented by raising an exception explicitly, by using an assert statement, or by defining custom exception classes
  • Use the assert statement to enforce constraints in the program. When the condition given in the assert statement is not met, the program raises an AssertionError
  • You can raise an existing exception by using the raise keyword followed by the name of the exception
  • To create a custom exception class with its own error message, derive the class from Exception (directly, or through another exception class)
  • When creating a module that can raise several distinct errors, a common practice is to create a base class for the exceptions defined by that module, and to subclass it to create specific exception classes for different error conditions. This is called hierarchical custom exceptions

Example —

class class_name(Exception):
    pass

Implementation —

class Error(Exception):
    pass

class TooSmallValueError(Error):
    pass

number = 100
while True:
    try:
        num = int(input("Enter a number: "))
        if num < number:
            raise TooSmallValueError
        break
    except TooSmallValueError:
        print("Value too small")

Output —

Enter a number: 40
Value too small
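The bullets above also mention custom error messages, assert, and hierarchical exceptions. A minimal non-interactive sketch of those three ideas follows; ValidationError and check_value are hypothetical names, and the default minimum of 100 simply mirrors the implementation above.

```python
class ValidationError(Exception):
    """Hypothetical base class for the errors raised by this sketch."""


class TooSmallValueError(ValidationError):
    def __init__(self, value, minimum):
        self.value = value
        self.minimum = minimum
        # Pass the message to Exception so that str(err) returns it
        super().__init__("{} is below the minimum of {}".format(value, minimum))


def check_value(value, minimum=100):
    # assert raises AssertionError when the condition is false
    # (assert statements are skipped when Python runs with -O)
    assert isinstance(value, int), "value must be an int"
    if value < minimum:
        raise TooSmallValueError(value, minimum)
    return value


try:
    check_value(40)
except ValidationError as err:
    # Catching the base class also catches its subclasses
    print(type(err).__name__, "-", err)
```

Catching the base class lets callers handle every error from the module in one place, while still being able to catch a specific subclass when they need to.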

Garbage Collection in Python

In Python, garbage collection is the memory management feature that reclaims memory a running program was using once the program no longer needs it, so that the memory can be reused.

  • In Python, garbage collection works automatically. This relieves the programmer of manual memory management and helps prevent wasted memory
  • A garbage collection pass can be forced by calling the collect() function of the gc module
  • In CPython, when no reference to an object is left, the object is destroyed automatically and its __del__() method is executed. Objects that reference each other in a cycle are cleaned up by the cyclic garbage collector instead

Example :

import gc

gc.collect()

Implementation —

# manual garbage collection
import sys, gc

def test():
    items = [18, 19, 20, 34, 78]  # renamed from "list" to avoid shadowing the built-in
    items.append(items)           # the list references itself, creating a cycle

def main():
    print("Garbage Creation")
    for i in range(5):
        test()
    print("Collecting..")
    n = gc.collect()
    print("Unreachable objects collected by GC:", n)
    print("Uncollectable garbage list:", gc.garbage)

if __name__ == "__main__":
    main()
    sys.exit()

Output —

Garbage Creation
Collecting..
Unreachable objects collected by GC: 33
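To make the effect of gc.collect() observable, the sketch below builds a reference cycle and uses a weak reference to check whether the objects are still alive. The Node class and make_cycle function are hypothetical, and the behaviour described assumes CPython, where reference counting alone cannot free a cycle.

```python
import gc
import weakref


class Node:
    def __init__(self, name):
        self.name = name
        self.partner = None


def make_cycle():
    a = Node("a")
    b = Node("b")
    a.partner = b  # a -> b
    b.partner = a  # b -> a: a reference cycle
    # A weak reference does not keep the object alive
    return weakref.ref(a)


gc.collect()  # start from a clean state
ref = make_cycle()

# The names a and b are gone, but the two nodes still reference each other,
# so they wait for the cyclic collector.
found = gc.collect()
print("unreachable objects found:", found)
print("node freed:", ref() is None)
```

The exact count returned by gc.collect() depends on the Python version, so the example prints it without assuming a particular value.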

Python Debugger

Debugging is the process of locating and fixing errors in a program. In Python, pdb, which is part of the standard library, is used to debug code.

  • The pdb module internally makes use of the bdb and cmd modules
  • It supports setting breakpoints and single stepping at the source line level, inspection of stack frames, source code listing etc.

Syntax —

import pdb

pdb.set_trace()

  • Since Python 3.7, breakpoints can also be set with the built-in function breakpoint(), which by default calls pdb.set_trace()

Implementation —

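import pdb

def multiply(a, b):
    answer = a * b
    return answer

pdb.set_trace()  # execution pauses here and the (Pdb) prompt appears
a = int(input("Enter first number : "))
b = int(input("Enter second number : "))
product = multiply(a, b)  # renamed from "sum", which shadowed the built-in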
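A minimal sketch of the breakpoint() built-in mentioned above. The DEBUG flag is a hypothetical switch added so that the script runs to completion unattended; set it to True to stop inside the function.

```python
DEBUG = False  # hypothetical flag; set to True to stop in the debugger


def multiply(a, b):
    answer = a * b
    if DEBUG:
        # Python 3.7+: by default this is equivalent to
        # "import pdb; pdb.set_trace()"
        breakpoint()
    return answer


print(multiply(6, 7))  # 42
```

At the (Pdb) prompt, n runs the next line, s steps into a call, c continues execution, p answer prints a value, l lists the surrounding source, and q quits the debugger.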

Decorators in Python

In Python, a decorator is any callable Python object that is used to modify a function or a class. It takes a function, adds some functionality, and returns it.

  • Decorators are a very powerful and useful tool in Python, since they allow programmers to modify or control the behavior of a function or class.
  • In a decorator, a function is passed as an argument to another function and is then called inside a wrapper function.
  • A decorator is usually applied with the @ syntax on the line directly above the definition of the function you want to decorate.

There are two different kinds of decorators in Python:

Function decorators

Class decorators

  • When multiple decorators are stacked on a single function, they are applied from the bottom up: the decorator closest to the function definition is applied first
  • A decorator can be reused by applying the same decorator function to other functions

Implementation —

# Decorators
def test_decorator(func):
    def function_wrapper(x):
        print("Before calling" + func.__name__)
        res = func(x)
        print(res)
        print("After calling" + func.__name__)
    return function_wrapper

@test_decorator
def sqr(n):
    return n**2

sqr(20)

Output —

Before callingsqr
400
After callingsqr

Implementation —

# Multiple Decorators
def lowercase_decorator(function):
    def wrapper():
        func = function()
        make_lowercase = func.lower()
        return make_lowercase
    return wrapper

def split_string(function):
    def wrapper():
        func = function()
        split_string = func.split()
        return split_string
    return wrapper

@split_string          # applied second
@lowercase_decorator   # applied first (closest to the function)
def test_func():
    return 'MOTHER OF DRAGONS'

print(test_func())

Output —

['mother', 'of', 'dragons']
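The implementations above cover function decorators only. The sketch below adds two things: functools.wraps, which keeps the decorated function's name and docstring intact, and a decorator applied to a class, which is one reading of "class decorators" in the list above. The names log_calls, add_repr and Point are hypothetical.

```python
import functools


def log_calls(func):
    @functools.wraps(func)  # keep func.__name__ and its docstring on the wrapper
    def wrapper(*args, **kwargs):
        print("calling", func.__name__)
        return func(*args, **kwargs)
    return wrapper


def add_repr(cls):
    """Class decorator: attach a simple __repr__ to the decorated class."""
    def __repr__(self):
        return "{}({})".format(cls.__name__, vars(self))
    cls.__repr__ = __repr__
    return cls


@log_calls
def sqr(n):
    return n ** 2


@add_repr
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


print(sqr(20))       # prints "calling sqr", then 400
print(sqr.__name__)  # sqr (without functools.wraps this would be "wrapper")
print(Point(1, 2))   # Point({'x': 1, 'y': 2})
```

Unlike the wrapper in the first implementation, this wrapper accepts any arguments and returns the result of the decorated function, so the same decorator can be reused on functions with different signatures.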

Memoization using Decorators

In Python, memoization is a technique which allows you to optimize a Python function by caching its output based on the parameters you supply to it.

  • Once you memoize a function, it will only compute its output once for each set of parameters you call it with. Every call after the first will be quickly retrieved from a cache.
  • If you want to speed up the parts in your program that are expensive, memoization can be a great technique to use.

There are four approaches to memoization:

Using global

Using objects

Using default parameter

Using a Callable Class

Implementation —

# Fibonacci series using memoization with decorators
def memoization_func(t):
    dict_one = {}
    def h(z):
        if z not in dict_one:
            dict_one[z] = t(z)
        return dict_one[z]
    return h

@memoization_func
def fib(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fib(n-1) + fib(n-2)

print(fib(20))

Output —

6765
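The list of approaches above includes a callable class, which the implementation does not show. A minimal sketch follows; the class name Memoize and its calls counter are hypothetical. For comparison, the standard library's functools.lru_cache decorator is shown as well; it is not one of the approaches listed in this post.

```python
import functools


class Memoize:
    """Callable class: instances wrap a function and cache its results."""

    def __init__(self, func):
        self.func = func
        self.cache = {}
        self.calls = 0  # how many times the wrapped function really ran

    def __call__(self, n):
        if n not in self.cache:
            self.calls += 1
            self.cache[n] = self.func(n)
        return self.cache[n]


@Memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


print(fib(20))    # 6765
print(fib.calls)  # 21: one real computation for each n from 0 to 20


@functools.lru_cache(maxsize=None)
def fib_cached(n):
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)


print(fib_cached(20))  # 6765
```

A class keeps the cache as an ordinary attribute, which makes it easy to inspect or clear, at the cost of a little more code than the closure-based decorator.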

Defaultdict

In Python, a dictionary is a container that holds key-value pairs. Keys must be unique and hashable (in practice, usually immutable objects).

  • If you try to access a key that doesn’t exist in a regular dictionary, it raises a KeyError and interrupts your code. To tackle this issue, Python provides the defaultdict type, a dictionary-like class
  • If you try to access or modify a missing key, defaultdict automatically creates the key and generates a default value for it
  • With a default factory set, looking up a missing key with d[key] does not raise a KeyError
  • Any key that does not exist gets the value returned by the default factory
  • Hence, whenever you need a dictionary in which each element’s value should start with a default value, use a defaultdict

Syntax —

from collections import defaultdict

demo = defaultdict(int)

Implementation —

from collections import defaultdict

default_dict_var = defaultdict(list)

for i in range(10):
    default_dict_var[i].append(i)

print(default_dict_var)

Output —

defaultdict(<class 'list'>, {0: [0], 1: [1], 2: [2], 3: [3], 4: [4], 5: [5], 6: [6], 7: [7], 8: [8], 9: [9]})
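The syntax example above uses defaultdict(int), whose default value is 0. A minimal sketch of why that is handy for counting, contrasted with a plain dict (the word list is made-up sample data):

```python
from collections import defaultdict

words = ["spark", "hive", "spark", "kafka", "hive", "spark"]  # sample data

# A plain dict raises KeyError when a missing key is read
plain = {}
try:
    plain["spark"] += 1
except KeyError:
    print("plain dict: KeyError on a missing key")

# defaultdict(int) starts every missing key at int(), which is 0
counts = defaultdict(int)
for word in words:
    counts[word] += 1
print(dict(counts))  # {'spark': 3, 'hive': 2, 'kafka': 1}

# Note: merely reading a missing key inserts it with the default value
print(counts["flink"])    # 0
print("flink" in counts)  # True
```

The last two lines show a side effect worth knowing: a lookup with square brackets adds the key, so use the in operator or get() when you only want to test for presence.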

OrderedDict

In Python, OrderedDict is one of the container datatypes in the collections module and a subclass of dict. It maintains the order in which the keys are inserted, and that order is used when iterating. If a key is deleted and inserted again, it moves to the end of the order.

  • It’s a dictionary subclass that remembers the order in which its contents are added
  • When the value of a specified key is changed, the ordering of keys does not change for the OrderedDict
  • If an item is overwritten in the OrderedDict, its position is maintained
  • OrderedDict popitem() removes items in LIFO order by default; popitem(last=False) removes them in FIFO order
  • The reversed() function can be used with OrderedDict to iterate elements in the reverse order
  • OrderedDict has a move_to_end() method to efficiently reposition an element to an endpoint

Example —

from collections import OrderedDict

my_dict = {'Sunday': 0, 'Monday': 1, 'tuesday': 2}

# creating ordered dict

ordered_dict = OrderedDict(my_dict)
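A minimal sketch of the methods listed in the bullets above: overwriting a value, move_to_end(), reversed() and popitem(). The keys are sample data.

```python
from collections import OrderedDict

days = OrderedDict([("Sunday", 0), ("Monday", 1), ("Tuesday", 2)])

days["Sunday"] = 7  # overwriting a value keeps the key's position
print(list(days))   # ['Sunday', 'Monday', 'Tuesday']

days.move_to_end("Sunday")  # move the key to the right end
print(list(days))           # ['Monday', 'Tuesday', 'Sunday']

days.move_to_end("Sunday", last=False)  # move it back to the front
print(list(reversed(days)))             # ['Tuesday', 'Monday', 'Sunday']

print(days.popitem())            # ('Tuesday', 2): last item, LIFO
print(days.popitem(last=False))  # ('Sunday', 7): first item, FIFO
print(list(days))                # ['Monday']
```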

Generators in Python

In Python, generator functions look just like regular functions, with one difference: they use the yield keyword instead of return. A generator function is a function that returns an iterator. A generator expression is an expression that also returns an iterator.

  • Generator objects are consumed either by calling next() on the generator object or by using the generator object in a “for in” loop.
  • A return statement terminates a function entirely, but a yield statement pauses the function, saving all its state, and later continues from there on successive calls.
  • Generator expressions can be used as function arguments. Just like list comprehensions, generator expressions let you create a generator object in a single line of code.
  • The major difference between a list comprehension and a generator expression is that a list comprehension produces the entire list at once, while a generator expression produces one item at a time (lazy evaluation). For this reason, a generator expression is much more memory efficient than a list comprehension when the sequence is large

Example —

def generator():
    yield "x"
    yield "y"

for i in generator():
    print(i)

Implementation —

def test_sequence():
    num = 0
    while num < 10:
        yield num
        num += 1

for i in test_sequence():
    print(i, end=",")

Output —

0,1,2,3,4,5,6,7,8,9,

Implementation —

# Python generator with Loop
# Reverse a string
def reverse_str(test_str):
    length = len(test_str)
    for i in range(length - 1, -1, -1):
        yield test_str[i]

for char in reverse_str("Trojan"):
    print(char, end=" ")

Output —

n a j o r T

Implementation —

# Generator Expression
# Initialize the list
test_list = [1, 3, 6, 10]
# list comprehension
list_comprehension = [x**3 for x in test_list]
# generator expression
test_generator = (x**3 for x in test_list)
print(list_comprehension)
print(type(test_generator))
print(tuple(test_generator))

Output —

[1, 27, 216, 1000]
<class 'generator'>
(1, 27, 216, 1000)
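The bullets above make three claims that a short sketch can illustrate: a generator expression uses less memory than the equivalent list, it can be passed directly as a function argument, and next() pulls one item at a time. Exact byte sizes vary by Python version, so only the comparison is printed.

```python
import sys

squares_list = [x * x for x in range(10_000)]  # all 10,000 values exist now
squares_gen = (x * x for x in range(10_000))   # values are produced on demand

# The generator object stays small no matter how long the range is
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # True

# Generator expression used directly as a function argument
print(sum(x * x for x in range(5)))  # 30

# Pulling items one at a time with next()
letters = (ch for ch in "ab")
print(next(letters))  # a
print(next(letters))  # b
try:
    next(letters)
except StopIteration:
    print("generator exhausted")
```

A for loop handles StopIteration for you, which is why the earlier implementations never need to catch it.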

Coroutine in Python

  • Coroutines are program components that generalize subroutines for non-preemptive multitasking by allowing execution to be suspended and resumed
  • Because coroutines can pause and resume their execution context, they’re well suited to concurrent processing
  • Coroutines are a special type of function that yields control back to the caller without ending its context in the process. Instead, the context is kept in an idle state until execution resumes
  • In a coroutine, the yield keyword can also be used on the right-hand side of an = operator to signify that the coroutine will accept a value at that point.

Example —

def func():
    print("My first Coroutine")
    while True:
        var = (yield)
        print(var)

coroutine = func()
next(coroutine)

Implementation —

def func():
    print("My first Coroutine")
    while True:
        var = (yield)
        print(var)

coroutine = func()
next(coroutine)

Output —

My first Coroutine
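The implementation above primes the coroutine with next() but never passes a value into it. Values are passed with the generator's send() method. A minimal sketch, with running_total as a hypothetical coroutine that keeps a sum between calls:

```python
def running_total():
    total = 0
    while True:
        # Hand the current total back to the caller, then pause
        # until the caller passes the next value in with send()
        value = yield total
        total += value


acc = running_total()
next(acc)            # prime the coroutine: run it up to the first yield

print(acc.send(10))  # 10
print(acc.send(5))   # 15
acc.close()          # stop the coroutine
```

The local variable total survives between send() calls, which is the "context kept in an idle state" described in the bullets above.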

Spark- Window Function

  Window functions in Spark -> Spark Window functions operate on a group of rows like pa...