Hadoop and Spark by Leela Prasad
Visit the blog to find useful articles on Hadoop and Spark by Leela Prasad.
Hadoop and Spark by Leela Prasad
1M ago
In Python, you define a class by using the class keyword followed by a name and a colon. Then you use .__init__() to declare which attributes each instance of the class should have:
# dog.py
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age
In the body of .__init__(), there are two statements using the self variable:
self.name = name creates an attribute called name and assigns the value of the name parameter to it.
self.age = age creates an attribute called age an ..read more
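A quick usage sketch of the class above (the instance name and values are just illustrative):
# dog.py (continued) - __init__ runs when the class is instantiated
miles = Dog("Miles", 4)
print(miles.name)  # Miles
print(miles.age)   # 4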
Hadoop and Spark by Leela Prasad
2M ago
Databricks provides a free Community Edition that can be used to explore its capabilities and to try things out in its notebooks. Both Python and Scala are supported.
Filesystem: Its filesystem is called DBFS (the Databricks File System).
df.write.partitionBy("Location").mode("overwrite").parquet("Table1")
To view the files written (DBFS works much like HDFS, S3, or GCS in GCP):
dbutils.fs.ls("/Table1 ..read more
Hadoop and Spark by Leela Prasad
5M ago
Problems: Redundancy
Different kinds of Normal Forms:
1NF, 2NF, 3NF, etc.
Data Modelling: Star Schema
Star schemas denormalize the data, which means adding redundant columns to some dimension tables to make querying and working with the data faster and easier. The purpose is to trade some redundancy (duplication of data) in the data model for increased query speed, by avoiding computationally expensive join operations.
In this model, the fact table is normalized but the dimension tables are not. That is, data from the fact table exists only in the fact table, but dimension tables m ..read more
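A small PySpark sketch of the idea, using hypothetical fact_sales and dim_store tables (assumes a SparkSession named spark and those tables exist):
from pyspark.sql import functions as F

# hypothetical star schema: fact_sales(store_id, amount) and dim_store(store_id, region)
fact_sales = spark.table("fact_sales")
dim_store = spark.table("dim_store")

# one join against the denormalized dimension answers the question directly
revenue_by_region = (fact_sales
    .join(dim_store, "store_id")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount")))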
Hadoop and Spark by Leela Prasad
5M ago
How to improve an existing Data Architecture?
How do you choose between a data lake and a data warehouse for your solution?
Types of Data Stores - OLTP, ODS, OLAP, Data Mart, Cube, etc - https://www.youtube.com/watch?v=bZoO48yPi-Q
References: https://www.youtube.com/watch?v=9ToVk0Fgsz0 ..read more
Hadoop and Spark by Leela Prasad
4y ago
Performance: Many articles claim that "Spark with Scala is 10 times faster than PySpark", but from Spark 2.x onwards this is no longer true. PySpark used to be buggy and poorly supported, but it has improved considerably in recent releases. However, for batch jobs where the data volume is large, Spark with Scala still gives better performance.
Library Stack:
Pandas support in PySpark is an advantage.
Python's visualization libraries complement PySpark; equivalents are not available in Scala.
Python comes with libraries that are well known for data analysis. Several libraries are available, like Machine lear ..read more
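A minimal sketch of the pandas interplay (assumes a SparkSession named spark is available; the data is made up):
import pandas as pd

# pandas -> Spark and back; handy for plotting or quick local analysis
pdf = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.9, 0.7]})
sdf = spark.createDataFrame(pdf)               # distribute the pandas DataFrame
result = sdf.filter("score > 0.6").toPandas()  # collect back for visualization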
Hadoop and Spark by Leela Prasad
4y ago
Input configuration as a CSV, from which we get the primary keys and the Updated_date column for the respective tables.
Source Unload - Select primarykeys, Updated_date from srcTbl1 where Updated_date between X and Y
Sink Unload - Select primarykeys, Updated_date from sinkTbl1 where Updated_date between X and Y
Recon Comparison Process:
Get Max Updated_date from srcTbl - val srcMaxUpdatedDate = srcDf.agg(max("Updated_date")).rdd.map(x => x.mkString).collect()(0)
From the sink table, get only the rows whose Updated_date is less than the max Updated_date of the source.
val sinkDf = spark.sql("select * from sinkTbl where Updated_ ..read more
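The same comparison can be sketched in PySpark (assumes src_df and the sinkTbl table exist; primary_key is a stand-in for the configured key columns):
from pyspark.sql import functions as F

# max watermark on the source side
src_max_updated = src_df.agg(F.max("Updated_date")).collect()[0][0]

# restrict the sink to rows before that watermark, then compare keys
sink_df = spark.table("sinkTbl").filter(F.col("Updated_date") < F.lit(src_max_updated))
missing_in_sink = src_df.select("primary_key").exceptAll(sink_df.select("primary_key"))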
Hadoop and Spark by Leela Prasad
4y ago
If we are doing a bulky join and writing the result as Parquet, below is the screenshot of the failed task.
Output - The estimated amount of data produced by this task after the join operation, in size and record count.
Shuffle Read - This can be considered the input for this task, as it represents the amount of data read as input for the join.
Shuffle Spill (Memory) - The size in memory of the data that had to be spilled during this operation.
Shuffle Spill (Disk) - The amount of data written to disk for this operation. Typically, spilling records to disk is not ideal and indicates that this stage ..read more
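When spill shows up, a common first step is to raise the shuffle parallelism so each task handles a smaller slice of the join (a sketch; the values are illustrative and depend on the job and cluster):
# more, smaller shuffle partitions reduce per-task memory pressure (default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "800")

# on Spark 3.x, adaptive execution can also split oversized/skewed partitions
spark.conf.set("spark.sql.adaptive.enabled", "true")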
Hadoop and Spark by Leela Prasad
4y ago
Below are two useful links:
https://medium.com/swlh/4-simple-tips-to-improve-your-spark-job-performance-fbf586ce8f7d
https://medium.com/datakaresolutions/key-factors-to-consider-when-optimizing-spark-jobs-72b1a0dc22bf ..read more
Hadoop and Spark by Leela Prasad
4y ago
This article explains how to write, read, and update data in MongoDB.
My friend Thomas has written a nice article in which he explains the same in an awesome manner. Please follow the link https://medium.com/@thomaspt748/how-to-load-millions-of-data-into-mongo-db-using-apache-spark-3-0-8bcf089bd6ed
I would like to add 3 points apart from the ones explained by my friend.
1. Dealing with Nested JSON.
val foodDf = Seq((123,"food2",false,"Italian",2), (123,"food3",true,"American",3), (123,"food1",true,"Mediterranean",1)) .toDF("userId","foodName","isFavFood","cuisine","sco ..read more