Tuesday, October 18, 2022

Best websites for resume making

5 websites to make your resume 5x better.



1. Overleaf

⚡ https://www.overleaf.com/

Overleaf provides a rich-text editor, so you don't need to know any code to get started. You can simply edit the text, add images, and watch your resume or CV get created automatically.

2. Resume Worded

⚡️https://resumeworded.com/

Designed by top recruiters, the AI-powered platform instantly gives you tailored feedback on your resume and LinkedIn profile.

3. Grammarly

⚡️https://app.grammarly.com/

Grammarly improves your writing by checking your grammar, spelling, and punctuation. It also improves clarity by making your sentences more concise, which helps when writing your resume.

4. Resume Worded

⚡️https://lnkd.in/dVVyRKg6

This free AI-powered resume checker scores your resume on the key criteria recruiters and hiring managers look for.

5. Naukri.com

⚡️https://lnkd.in/dymAz6jh

Get a resume feedback report and learn what to improve.

Happy Learning!






Sunday, October 16, 2022

Airflow- Introduction


Airflow for Data Engineers

Working with data involves a ton of prerequisites to get up and running: acquiring the required data, formatting it, and storing it. The first step of a data science process is data engineering, which plays a crucial role in streamlining every other process of a data science project.

Traditionally, data engineering processes involve three steps: Extract, Transform and Load, which is also known as the ETL process. The ETL process involves a series of actions and manipulations on the data to make it fit for analysis and modeling. Most data science processes require these ETL processes to run almost every day for the purpose of generating daily reports.

Ideally, these processes should be executed automatically in a definite time and order. But it isn't as easy as it sounds. You might have tried a time-based scheduler such as cron by defining the workflows in a crontab. This works fairly well for simple workflows. However, as the number of workflows and their dependencies grows, things start getting complicated. It becomes difficult to effectively manage and monitor these workflows, since they may fail and need to be recovered manually. Apache Airflow is a tool that can be very helpful in that case, whether you are a data scientist, data engineer, or even a software engineer.

What is Apache Airflow?

“Apache Airflow is an open-source workflow management platform. It started at Airbnb in October 2014 as a solution to manage the company’s increasingly complex workflows. “

Apache Airflow (or simply Airflow) is a highly versatile tool that can be used across multiple domains for managing and scheduling workflows. It allows you to run and automate simple to complex processes written in Python and SQL. Airflow lets you view and create workflows in the form of Directed Acyclic Graphs (DAGs) with the help of command-line tools as well as a GUI.

Apache Airflow is a revolutionary open-source tool for people working with data and its pipelines. It is easy to use and deploy, given that most data scientists have basic knowledge of Python. Airflow provides the flexibility to create workflows as Python scripts, along with various ready-to-use operators for easy integration with platforms such as Amazon Web Services (AWS), Google Cloud Platform, and Microsoft Azure. Moreover, it ensures that tasks are ordered correctly based on their dependencies with the help of DAGs, and it continuously tracks the state of tasks as they execute.

One of the most crucial features of Airflow is its ability to recover from failure and to manage the allocation of scarce resources dynamically. This makes Airflow a great choice for running any kind of data processing or modeling task in a fairly scalable and maintainable way.

In this tutorial, we will look at how to install Airflow, how to create a pipeline, and why data scientists should be using it.

Understanding the workflow and DAG

The set of processes that take place at regular intervals is termed the 'workflow'. It can consist of any task, ranging from extracting data to manipulating it.

Directed Acyclic Graphs (DAGs) are one of the key components of Airflow. A DAG represents the series of tasks that need to be run as part of the workflow, with each task represented as a single node in the graph along with the path it takes for execution.

Airflow also lets you specify the order and relationship (if any) between two or more tasks, and enables you to add dependencies on the data values required for a task to execute. A DAG file is a Python script and is saved with a .py extension.
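
To make this concrete, here is a minimal sketch of a DAG file. The DAG id, task names, and schedule below are purely hypothetical, and the imports assume Airflow 2.x:

# my_first_dag.py - a minimal, hypothetical example of a DAG file (Airflow 2.x)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def print_hello():
    # A trivial Python task, used only to illustrate the structure of a DAG
    print("Hello from Airflow!")

with DAG(
    dag_id="my_first_dag",           # hypothetical name
    start_date=datetime(2022, 10, 1),
    schedule_interval="@daily",      # run once a day
    catchup=False,
) as dag:
    print_date = BashOperator(
        task_id="print_date",
        bash_command="date",
    )
    say_hello = PythonOperator(
        task_id="say_hello",
        python_callable=print_hello,
    )
    # The >> operator defines the dependency: print_date runs before say_hello
    print_date >> say_hello

Saving a file like this in the DAGs folder (by default, ~/airflow/dags) is enough for the scheduler to pick it up.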

Basic Components of Apache Airflow

Now that you have a basic idea about workflows and DAG, we’ve listed below some of the commonly used components of Apache Airflow that make up the architecture of the Apache Airflow pipeline.

Web Server: It is the graphical user interface built as a Flask app. It provides details regarding the status of your tasks and gives the ability to read logs from a remote file store.

Scheduler: This component is primarily responsible for scheduling tasks, i.e., the execution of DAGs. It retrieves and updates the status of the task in the database.

Executor: It is the mechanism that initiates the execution of the different tasks, one by one.

Metadata Database: It is the centralized database where Airflow stores the status of all the tasks. All read/write operations of a workflow are done from this database.

Now that you understand the basic architecture of Airflow, let us begin by installing Python and Apache Airflow on our system.

Create your first Airflow pipeline

The Apache Airflow pipeline gives data engineers an easy and scalable way to create, monitor, and schedule one or more workflows simultaneously. Airflow requires a database backend for running the workflows, which is why we will start by initializing the database using the command:

airflow initdb

Note: on Airflow 2.x, the equivalent command is airflow db init. Upon initializing the database, you can now start the web server using the command:

airflow webserver -p 8080

This will start an Airflow webserver at port 8080 on your localhost.
Note: You can specify any port as per your preference.

Now, open up another terminal and start the airflow scheduler using the command:

airflow scheduler

Now, if you have successfully started the Airflow scheduler, you can access the tool from your browser at localhost:8080/admin to monitor and manage the status of completed and ongoing tasks in your workflows.
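
Once the web server and scheduler are running, you can also sanity-check a DAG from the command line. The commands below assume Airflow 2.x, and the DAG and task ids refer to the hypothetical example sketched earlier:

# List all DAGs the scheduler has picked up
airflow dags list

# Run a single task in isolation for a given date, without recording its state in the database
airflow tasks test my_first_dag print_date 2022-10-16

The tasks test subcommand is handy while developing, because it lets you exercise one task at a time before scheduling the whole DAG.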

DE Series- 4

How to write efficient code in Python?


As a part of the Data Engineering Series, we have already covered Part 1 (Data Engineering - Introduction), Part 2 (Basic Python), and Part 3 (Advanced Python). Continuing from the previous post on advanced Python, in this post we are going to see how to write efficient code in Python.

In Python, enumerate() helps you write efficient code. Many a time we need to keep a count of iterations; Python's enumerate() takes a collection (i.e., an iterable), adds a counter to it, and returns it as an enumerate object.

Syntax :

enumerate(iterable, start=0)

Implementation —

"""
Enumerate : Use enumerate() function : Python’s enumerate takes a collection i.e iterable, adds counter to it and returns it as an enumerate object.
"""
countries = ['USA','Canada','Singapore','Taiwan']
enum_countries = enumerate(countries)
enumerate_countries = enumerate(countries,5)
print(list(enumerate_countries))
print(type(enumerate_countries))

Output —

[(5, 'USA'), (6, 'Canada'), (7, 'Singapore'), (8, 'Taiwan')]
<class 'enumerate'>

Implementation 2 —

countries = ['USA', 'Canada', 'Singapore', 'Taiwan']
for i, item in enumerate(countries):
    print(i, item)

Output —

0 USA
1 Canada
2 Singapore
3 Taiwan

In Python, zip() takes one or more iterables (lists, tuples, etc.), aggregates their elements into tuples, and returns an iterator.

Syntax :

zip(*iterators)

Implementation —

# Use zip(): it takes one or more iterables and aggregates them into
# tuples, returning an iterator object.
name = ["Steve", "Paul", "Brad"]
roll_no = [4,1,3]
marks = [20,40,50]
mapped = zip(name,roll_no,marks)
mapped = set(mapped)
print(mapped)

Output —

{('Brad', 3, 50), ('Steve', 4, 20), ('Paul', 1, 40)}

To make code run faster, use built-in functions and libraries such as map(), which applies a function to every member of an iterable sequence and returns the result.

Implementation —

"""
Map function: In Python, map() applies the given function to each item
of a given iterable (i.e., lists, tuples, etc.) and returns a map object.
"""
numbers =(100,200,300)
result = map(lambda x:x+x,numbers)
total = list(result)
print(total)

Output —

[200, 400, 600]

NumPy arrays are homogeneous and provide a fast and memory-efficient alternative to Python lists. With vectorization, NumPy performs operations on all elements of an array at once, which allows the programmer to efficiently perform calculations over entire arrays.

Implementation —

import numpy as np

def reciprocals(values):
    # Loop-based version: iterates over the array element by element
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.random.randint(1, 15, size=6)
reciprocals(values)
# The vectorized equivalent, 1.0 / values, produces the same result
# without an explicit Python loop.

Output —

array([0.25      , 0.5       , 0.1       , 0.16666667, 0.14285714,
0.07142857])
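
To see why the vectorized form matters, you can time both approaches in a notebook or IPython session. The array size below is arbitrary, and the exact timings will vary by machine:

# Compare the loop-based function above with the vectorized expression
values = np.random.randint(1, 15, size=1_000_000)

%timeit reciprocals(values)   # explicit Python loop over the array
%timeit 1.0 / values          # single vectorized NumPy operation

On a typical machine, the vectorized version is faster by roughly two orders of magnitude.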

To swap variables, use multiple assignment.

Implementation —

# Use multiple assignment
f_name, l_name, city = "Steve", "Paul", "NewYork"
print(f_name, l_name, city)

# To swap variables
a = 5
b = 10
a, b = b, a
print(a, b)

Output —

Steve Paul NewYork
10 5

Use Comprehensions

Implementation —

# List comprehension
list_two = [5, 10, 15, 20, 20, 40, 50, 60]
new_list = [x**3 for x in list_two]
print(new_list)

# Dictionary comprehension
dict_one = [1, 2, 3, 4]
new_dict = {x: x**2 for x in dict_one if x % 2 == 0}
print(new_dict)

Output —

[125, 1000, 3375, 8000, 8000, 64000, 125000, 216000]
{2: 4, 4: 16}

Membership: To check membership in a list, it is generally faster to use the "in" keyword.

Implementation —

days = ["sunday", "monday", "tuesday"]
for d in days:
    print('Today is {}'.format(d))
print('tuesday' in days)
print('friday' in days)

Output —

Today is sunday
Today is monday
Today is tuesday
True
False

Counter: Counter, from the collections module, is one of Python's high-performance container data types.

Implementation —

from collections import Counter
sample_dict = {'a':4,'b':8,'c':2}
print(Counter(sample_dict))

Output —

Counter({'b': 8, 'a': 4, 'c': 2})

Python's itertools module provides fast, memory-efficient functions: a collection of building blocks for handling iterators.

Implementation —

import itertools

for i in itertools.count(30, 4):
    print(i)
    if i > 30:
        break

Output —

30
34

Implementation 2 —

import itertools

countries = [("West", "USA"), ("East", "Singapore"), ("West", "Canada"), ("East", "Taiwan")]
iter_one = itertools.groupby(countries, lambda x: x[0])
for key, group in iter_one:
    result = {key: list(group)}
    print(result)

Output —

{'West': [('West', 'USA')]}
{'East': [('East', 'Singapore')]}
{'West': [('West', 'Canada')]}
{'East': [('East', 'Taiwan')]}

Use sets to remove duplicates

Implementation —

s1 = {1,2,4,6,0,3,2,1,7,4,3}
s1.add(10)
s1.update([12,13])
print(s1)

Output —

{0, 1, 2, 3, 4, 6, 7, 10, 12, 13}

Use Generators

In Python 2, range() built the whole list in memory, so the lazy xrange() was preferred; in Python 3, range() is already lazy. More generally, use generator functions (with yield) when you only need to iterate over values once, instead of building a full list in memory.

Implementation —

def test_sequence():
    num = 0
    while num < 10:
        yield num
        num += 1

for i in test_sequence():
    print(i, end=",")

Output —

0,1,2,3,4,5,6,7,8,9,

Practice writing idiomatic code, as it will often make your code run faster.

Examine the runtime of your code snippets

Implementation —

%timeit x = 3; L = [x**n for n in range(20)]

Output —

Timings vary by machine; expect on the order of a few microseconds per loop for this comprehension. Note that passing the statement as a quoted string (e.g. %timeit ('x=3; ...')) only times evaluation of the string literal itself, which is why it appears to take just a few nanoseconds.

Thursday, October 13, 2022

File Formats in Big Data

 

Impact of Data File Formats in Big Data

  • Imagine you visit a grocery store and nothing is in order: items are scattered across random shelves. Would this make your shopping experience better? The answer is no; in fact, you might never visit that grocery store again.
  • If you understood the example above, you can now imagine what the impact of unorganized data can be for a company.
  • Every company receives tens to thousands of gigabytes of data every day. If that data is not stored in a proper format, understanding it becomes difficult, or sometimes impossible.
  • The more time you spend sorting through the data, the more the company misses out on opportunities to retain customers or generate more orders and revenue.

Use-case 1:

If you are looking at the total sales from a table, then mostly a single column, sale_amount, needs to be scanned/queried.

Use-case 2:

If you are trying to identify consumer behavior:

  • What kinds of items are customers placing orders for?
  • Which item category has the customer ordered from the most?
