Classes and Object Oriented Python
Hadoop and Spark by Leela Prasad · 1M ago
In Python, you define a class by using the class keyword followed by a name and a colon. Then you use .__init__() to declare which attributes each instance of the class should have:

# dog.py
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

In the body of .__init__(), there are two statements using the self variable:
self.name = name creates an attribute called name and assigns the value of the name parameter to it.
self.age = age creates an attribute called age and assigns the value of the age parameter to it…
Databricks
Hadoop and Spark by Leela Prasad · 2M ago
Databricks provides a free Community Edition that can be used to explore its capabilities and to try things out in its notebooks. Both Python and Scala are supported.
Filesystem: its filesystem is called DBFS.

df.write.partitionBy("Location").mode("overwrite").parquet("Table1")

To view the files written (similar to HDFS, S3, or GCS in GCP):

dbutils.fs.ls("/Table1") …
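A minimal Scala follow-up sketch of reading that partitioned output back inside a Databricks notebook (the Location value used in the filter is hypothetical; display and dbutils are only available in the notebook environment):

import org.apache.spark.sql.functions.col

// Read the partitioned Parquet table back from DBFS
val table1Df = spark.read.parquet("/Table1")

// Partition pruning: only files under the matching Location= directory are scanned
table1Df.filter(col("Location") === "London").show()

// List the partition directories that were written
display(dbutils.fs.ls("/Table1"))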
Design Patterns in Data Engineering
Hadoop and Spark by Leela Prasad · 5M ago
Normalization in DBMS
Hadoop and Spark by Leela Prasad · 5M ago
Problems: redundancy.
Different kinds of normal forms: 1NF, 2NF, 3NF, etc.
Data Modelling - Star Schema
Star schemas denormalize the data, which means adding redundant columns to some dimension tables to make querying and working with the data faster and easier. The purpose is to trade some redundancy (duplication of data) in the data model for increased query speed, by avoiding computationally expensive join operations. In this model, the fact table is normalized but the dimension tables are not. That is, data from the fact table exists only in the fact table, but dimension tables m…
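To make the star-schema trade-off concrete, a minimal Spark SQL sketch with hypothetical sales_fact and store_dim tables (all table and column names here are illustrative, not from the post):

// Fact table holds foreign keys and measures; the dimension is denormalized,
// carrying city/region/country in a single table to avoid extra joins
val salesByCity = spark.sql("""
  SELECT d.city, SUM(f.amount) AS total_amount
  FROM sales_fact f
  JOIN store_dim d
    ON f.store_key = d.store_key
  GROUP BY d.city
""")
salesByCity.show()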
Database Architecture
Hadoop and Spark by Leela Prasad · 5M ago
How do you improve an existing data architecture? How do you choose between a data lake and a data warehouse for your solution?
Types of data stores - OLTP, ODS, OLAP, Data Mart, Cube, etc. - https://www.youtube.com/watch?v=bZoO48yPi-Q
References: https://www.youtube.com/watch?v=9ToVk0Fgsz0 …
Spark Scala vs pySpark
Hadoop and Spark by Leela Prasad · 4y ago
Performance: Many articles say that "Spark Scala is 10 times faster than pySpark", but in reality, from Spark 2.x onwards, this statement is no longer true. pySpark used to be buggy and poorly supported, but it has improved considerably in recent releases. However, for batch jobs where the data magnitude is larger, Spark with Scala still gives better performance.
Library stack: Pandas in pySpark is an advantage, and Python's visualization libraries complement pySpark; these are not available in Scala. Python comes with libraries that are well known for data analysis. Several libraries are available, like machine lear…
Reconciliation in Spark
Hadoop and Spark by Leela Prasad · 4y ago
Take the input configuration as a CSV and get the primary keys and the Updated_date column for the respective tables.
Source unload - select primary keys, Updated_date from srcTbl1 where Updated_date between X and Y
Sink unload - select primary keys, Updated_date from sinkTbl1 where Updated_date between X and Y
Recon comparison process:
Get the max Updated_date from the source table:

import org.apache.spark.sql.functions.max
val srcMaxUpdatedDate = srcDf.agg(max("Updated_date")).rdd.map(x => x.mkString).collect().head

From the sink table, get only the rows whose Updated_date is less than the max Updated_date of the source:

val sinkDf = spark.sql("select * from sinkTbl where Updated_…
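A minimal Scala sketch of how the comparison step could continue, assuming srcDf and sinkDf (already restricted to the source's max Updated_date, as above) both expose a single primary-key column named id; the id column name and the comparison via except are illustrative, not from the post:

// Keys present in the source unload but missing in the sink
val missingInSink = srcDf.select("id").except(sinkDf.select("id"))

// Keys present in the sink but absent from the source
val extraInSink = sinkDf.select("id").except(srcDf.select("id"))

missingInSink.show()
extraInSink.show()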
Handling Spark job Failures
Hadoop and Spark by Leela Prasad · 4y ago
If we are doing a bulky join and writing the result as Parquet, below is the screenshot of the failed task.
Output - the estimated amount of data produced by this task after the join operation, in size and record count.
Shuffle Read - this can be considered the input for the job, as it represents the amount of data involved in, or taken as input for, the join.
Shuffle Spill (Memory) - the amount of RAM consumed for this operation.
Shuffle Spill (Disk) - the amount of data written to disk for this operation.
Typically it is not ideal to spill records to disk, and it indicates that this stage…
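A rough Scala sketch of one common way to keep a bulky join from spilling: give the shuffle more, smaller partitions so each task's data fits in executor memory. The partition count, DataFrame names, join key, and output path below are placeholders:

import org.apache.spark.sql.functions.col

// More shuffle partitions means less data per task (placeholder value)
spark.conf.set("spark.sql.shuffle.partitions", "800")

// Optionally repartition the larger side on the join key before the join
val joinedDf = bigDf
  .repartition(800, col("join_key"))
  .join(otherDf, Seq("join_key"))

joinedDf.write.mode("overwrite").parquet("/tmp/joined_output")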
Improve Spark job performance
Hadoop and Spark by Leela Prasad · 4y ago
Below are two useful links:
https://medium.com/swlh/4-simple-tips-to-improve-your-spark-job-performance-fbf586ce8f7d
https://medium.com/datakaresolutions/key-factors-to-consider-when-optimizing-spark-jobs-72b1a0dc22bf
Mongo Spark Connector
Hadoop and Spark by Leela Prasad · 4y ago
This article explains how to write, read, and update data in MongoDB. A friend of mine, Thomas, has written a nice article in which he explains the same in an awesome manner; please follow the link: https://medium.com/@thomaspt748/how-to-load-millions-of-data-into-mongo-db-using-apache-spark-3-0-8bcf089bd6ed
I would like to add three points apart from the ones explained by my friend.
1. Dealing with nested JSON.

import spark.implicits._
val foodDf = Seq(
  (123, "food2", false, "Italian", 2),
  (123, "food3", true, "American", 3),
  (123, "food1", true, "Mediterranean", 1)
).toDF("userId", "foodName", "isFavFood", "cuisine", "sco…
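A minimal Scala sketch of the nested-JSON idea: roll the flat rows above up into one document per user before writing. The MongoDB write options follow the 10.x connector's format string and are placeholders, so adjust them for your connector version and cluster:

import org.apache.spark.sql.functions.{col, collect_list, struct}

// One document per user, with the foods nested as an array of sub-documents
val nestedDf = foodDf
  .withColumn("food", struct(col("foodName"), col("isFavFood"), col("cuisine")))
  .groupBy("userId")
  .agg(collect_list("food").as("foods"))

// Write to MongoDB (option names as in Spark connector 10.x; adjust per version)
nestedDf.write
  .format("mongodb")
  .option("connection.uri", "mongodb://localhost:27017")  // placeholder URI
  .option("database", "testdb")                           // placeholder database
  .option("collection", "userFoods")                      // placeholder collection
  .mode("append")
  .save()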
