Hadoop and Spark by Leela Prasad
Visit the blog to find useful articles on Hadoop and Spark by Leela Prasad.
Hadoop and Spark by Leela Prasad
1M ago
In Python, you define a class by using the class keyword followed by a name and a colon. Then you use .__init__() to declare which attributes each instance of the class should have:
# dog.py
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age
In the body of .__init__(), there are two statements using the self variable:
self.name = name creates an attribute called name and assigns the value of the name parameter to it.
self.age = age creates an attribute called age an ..read more
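A quick usage sketch of the class above (the instance name and values are just illustrative):
# dog.py (continued) - __init__ runs when the class is instantiated
miles = Dog("Miles", 4)
print(miles.name)  # Miles
print(miles.age)   # 4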
Hadoop and Spark by Leela Prasad
2M ago
Databricks provides a free Community Edition that can be used to explore its capabilities and to try things out in its notebooks. Both Python and Scala are supported.
Filesystem: Its filesystem is called DBFS (the Databricks File System).
df.write.partitionBy("Location").mode("overwrite").parquet("Table1")
To view the files written (DBFS works much like HDFS, S3, or GCS in GCP):
dbutils.fs.ls("/Table1 ..read more
Hadoop and Spark by Leela Prasad
5M ago
Problems: Redundancy
Different kinds of Normal Forms:
1NF, 2NF, 3NF, etc.
Data Modelling: Star Schema
Star schemas denormalize the data, which means adding redundant columns to some dimension tables to make querying and working with the data faster and easier. The purpose is to trade some redundancy (duplication of data) in the data model for increased query speed, by avoiding computationally expensive join operations.
In this model, the fact table is normalized but the dimension tables are not. That is, data from the fact table exists only in the fact table, but dimension tables m ..read more
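A small PySpark sketch of the idea, using hypothetical fact_sales and dim_store tables (assumes a SparkSession named spark and those tables exist):
from pyspark.sql import functions as F

# hypothetical star schema: fact_sales(store_id, amount) and dim_store(store_id, region)
fact_sales = spark.table("fact_sales")
dim_store = spark.table("dim_store")

# one join against the denormalized dimension answers the question directly
revenue_by_region = (fact_sales
    .join(dim_store, "store_id")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount")))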
Hadoop and Spark by Leela Prasad
5M ago
How to improve an existing Data Architecture?
How do you choose between a data lake and a data warehouse for your solution?
Types of Data Stores - OLTP, ODS, OLAP, Data Mart, Cube, etc - https://www.youtube.com/watch?v=bZoO48yPi-Q
References: https://www.youtube.com/watch?v=9ToVk0Fgsz0 ..read more
Hadoop and Spark by Leela Prasad
4y ago
Performance: Many articles claim that "Spark with Scala is 10 times faster than PySpark", but from Spark 2.x onwards this is no longer true. PySpark used to be buggy and poorly supported, but it has improved considerably in recent releases. However, for batch jobs where the data volume is large, Spark with Scala still gives better performance.
Library Stack:
Pandas support in PySpark is an advantage.
Python's visualization libraries complement PySpark; equivalents are not available in Scala.
Python comes with libraries that are well known for data analysis. Several libraries are available, like Machine lear ..read more
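A minimal sketch of the pandas interplay (assumes a SparkSession named spark is available; the data is made up):
import pandas as pd

# pandas -> Spark and back; handy for plotting or quick local analysis
pdf = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.9, 0.7]})
sdf = spark.createDataFrame(pdf)               # distribute the pandas DataFrame
result = sdf.filter("score > 0.6").toPandas()  # collect back for visualization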
Hadoop and Spark by Leela Prasad
4y ago
Input configuration as a CSV, from which we get the primary keys and the Updated_date column for the respective tables.
Source Unload - Select primarykeys, Updated_date from srcTbl1 where Updated_date between X and Y
Sink Unload - Select primarykeys, Updated_date from sinkTbl1 where Updated_date between X and Y
Recon Comparison Process:
Get Max Updated_date from srcTbl - val srcMaxUpdatedDate = srcDf.agg(max("Updated_date")).rdd.map(x => x.mkString).collect()(0)
From the sink table, get only the rows whose Updated_date is less than the max Updated_date of the source.
val sinkDf = spark.sql("select * from sinkTbl where Updated_ ..read more
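The same comparison can be sketched in PySpark (assumes src_df and the sinkTbl table exist; primary_key is a stand-in for the configured key columns):
from pyspark.sql import functions as F

# max watermark on the source side
src_max_updated = src_df.agg(F.max("Updated_date")).collect()[0][0]

# restrict the sink to rows before that watermark, then compare keys
sink_df = spark.table("sinkTbl").filter(F.col("Updated_date") < F.lit(src_max_updated))
missing_in_sink = src_df.select("primary_key").exceptAll(sink_df.select("primary_key"))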
Hadoop and Spark by Leela Prasad
4y ago
If we are doing a bulky join and writing the result as Parquet, below is the screenshot of the failed task.
Output - The estimated amount of data produced by this task after the join operation, in size and record count.
Shuffle Read - This can be considered the input for this task, as it represents the amount of data read as input for the join.
Shuffle Spill (Memory) - The size in memory of the data that had to be spilled during this operation.
Shuffle Spill (Disk) - The amount of data written to disk for this operation. Typically, spilling records to disk is not ideal and indicates that this stage ..read more
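When spill shows up, a common first step is to raise the shuffle parallelism so each task handles a smaller slice of the join (a sketch; the values are illustrative and depend on the job and cluster):
# more, smaller shuffle partitions reduce per-task memory pressure (default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "800")

# on Spark 3.x, adaptive execution can also split oversized/skewed partitions
spark.conf.set("spark.sql.adaptive.enabled", "true")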
Hadoop and Spark by Leela Prasad
4y ago
Below are two useful links:
https://medium.com/swlh/4-simple-tips-to-improve-your-spark-job-performance-fbf586ce8f7d
https://medium.com/datakaresolutions/key-factors-to-consider-when-optimizing-spark-jobs-72b1a0dc22bf ..read more
Hadoop and Spark by Leela Prasad
4y ago
This article explains how to write, read, and update data in MongoDB.
My friend Thomas has written a nice article in which he explains the same in an awesome manner. Please follow the link https://medium.com/@thomaspt748/how-to-load-millions-of-data-into-mongo-db-using-apache-spark-3-0-8bcf089bd6ed
I would like to add 3 points apart from the ones explained by my friend.
1. Dealing with Nested JSON.
val foodDf = Seq((123,"food2",false,"Italian",2), (123,"food3",true,"American",3), (123,"food1",true,"Mediterranean",1)) .toDF("userId","foodName","isFavFood","cuisine","sco ..read more