Spark Scala vs pySpark
Hadoop and Spark by Leela Prasad
3y ago
Performance: Many articles claim that "Spark Scala is 10 times faster than PySpark", but from Spark 2.x onwards that statement is no longer true. PySpark used to be buggy and poorly supported, but it has improved considerably in recent releases. However, for batch jobs where the data magnitude is large, Spark Scala still gives better performance. Library Stack: Pandas support in PySpark is an advantage, and Python's visualization libraries complement PySpark; these are not available in Scala. Python also comes with libraries that are well known for data analysis, and several libraries are available for machine lear…
Reconciliation in Spark
Hadoop and Spark by Leela Prasad
3y ago
Take the input configuration as CSV and get the primary keys and the Updated_date column for the respective tables.
Source Unload - select primary keys, Updated_date from srcTbl1 where Updated_date between X and Y.
Sink Unload - select primary keys, Updated_date from sinkTbl1 where Updated_date between X and Y.
Recon Comparison Process: Get the max Updated_date from srcTbl - val srcMaxUpdatedDate = srcDf.agg(max("Updated_date")).rdd.map(x => x.mkString).collect.head
From the sink table, get only the rows whose Updated_date is less than the max Updated_date of the source: val sinkDf = spark.sql("select * from sinkTbl where Updated_…
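The comparison itself is cut off in this excerpt; below is a minimal sketch of how such a recon check could be put together, assuming hypothetical table names, a composite primary key (pk1, pk2) and placeholder window bounds X and Y (all of these are my assumptions, not details from the post):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.appName("recon").getOrCreate()

// Source and sink unloads: primary keys plus Updated_date within the window.
val srcDf  = spark.sql("select pk1, pk2, Updated_date from srcTbl1 where Updated_date between 'X' and 'Y'")
val sinkDf = spark.sql("select pk1, pk2, Updated_date from sinkTbl1 where Updated_date between 'X' and 'Y'")

// Max Updated_date on the source side.
val srcMaxUpdatedDate = srcDf.agg(max("Updated_date")).rdd.map(_.mkString).collect.head

// Compare only sink rows that are not newer than the source's max Updated_date.
val sinkCmpDf = sinkDf.filter(s"Updated_date <= '$srcMaxUpdatedDate'")

// Rows present in the source but missing in the sink for the same keys and timestamp.
val missingInSink = srcDf.join(sinkCmpDf, Seq("pk1", "pk2", "Updated_date"), "left_anti")
missingInSink.show(false)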
Handling Spark job Failures
Hadoop and Spark by Leela Prasad
4y ago
If we are doing a bulky join and writing the result as Parquet, below is the screenshot of the failed task.
Output - the estimated output of this task after the join operation, in size and record count.
Shuffle Read - the amount of data read in for this join; it can be considered the input for this task.
Shuffle Spill (Memory) - the amount of RAM consumed for this operation.
Shuffle Spill (Disk) - the amount of data written to disk for this operation. Typically it is not ideal to spill records to disk, and it indicates that this stage…
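The excerpt is truncated here; a common first step when a stage spills heavily is to raise the shuffle partition count so that each reduce task handles less data. A minimal sketch, assuming a hypothetical Spark SQL join job (the table names, join key and partition count are illustrative values, not recommendations from the post):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("bulky-join").getOrCreate()

// Default is 200 shuffle partitions; more (smaller) partitions reduce the
// chance that a single task has to spill its shuffle data to disk.
spark.conf.set("spark.sql.shuffle.partitions", "1000")

// Hypothetical inputs and join key, used only for illustration.
val left  = spark.table("big_tbl_a")
val right = spark.table("big_tbl_b")

left.join(right, Seq("join_key"))
  .write
  .mode("overwrite")
  .parquet("/tmp/joined_output")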
Improve Spark job performance
Hadoop and Spark by Leela Prasad
4y ago
Below are two useful links:
https://medium.com/swlh/4-simple-tips-to-improve-your-spark-job-performance-fbf586ce8f7d
https://medium.com/datakaresolutions/key-factors-to-consider-when-optimizing-spark-jobs-72b1a0dc22bf …
Mongo Spark Connector
Hadoop and Spark by Leela Prasad
4y ago
This article explains how to write, read and update data in MongoDB. A friend of mine, Thomas, has written a nice article in which he explains the same in an awesome manner; please follow the link https://medium.com/@thomaspt748/how-to-load-millions-of-data-into-mongo-db-using-apache-spark-3-0-8bcf089bd6ed I would like to add 3 points apart from the ones explained by my friend.
1. Dealing with nested JSON.
val foodDf = Seq((123,"food2",false,"Italian",2),
  (123,"food3",true,"American",3),
  (123,"food1",true,"Mediterranean",1))
  .toDF("userId","foodName","isFavFood","cuisine","sco…
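The excerpt is cut off mid-snippet; below is a rough sketch of where it could be heading — folding the flat rows into one nested document per user and saving it through the MongoDB Spark connector. The truncated column name (assumed to be "score"), the connection URI and the collection are my assumptions, and the data source name varies by connector version:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

val spark = SparkSession.builder
  .appName("mongo-write")
  // Hypothetical connection string; replace with your own database/collection.
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.userFoods")
  .getOrCreate()
import spark.implicits._

val foodDf = Seq(
  (123, "food2", false, "Italian", 2),
  (123, "food3", true, "American", 3),
  (123, "food1", true, "Mediterranean", 1)
).toDF("userId", "foodName", "isFavFood", "cuisine", "score")

// Fold the flat rows into one nested document per user:
// { userId: 123, foods: [ { foodName, isFavFood, cuisine, score }, ... ] }
val nestedDf = foodDf
  .groupBy("userId")
  .agg(collect_list(struct("foodName", "isFavFood", "cuisine", "score")).as("foods"))

// "mongo" is the short data source name in the 2.x/3.x connector;
// the 10.x connector uses "mongodb" and different configuration keys.
nestedDf.write.format("mongo").mode("append").save()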
Updating data in a Hive table
Hadoop and Spark by Leela Prasad
4y ago
This can be achieved without the ORC file format and with transaction=false, but only when the table is a partitioned table. This is a 2-step process:
1. Create a data set with the updated entries, using a union of the new records and the non-updated records in the partition.
select tbl2.street,tbl2.city,tbl2.zip,tbl2.state,tbl2.beds,tbl2.baths,tbl2.sq__ft,tbl2.sale_date,tbl2.price,tbl2.latitude,tbl2.longitude,tbl2.type from (select * from samp_tbl_part where type = "Multi-Family") tbl1 JOIN (select * from samp_tbl where type = "Multi-Family") tbl2 ON tbl1.zip=tbl2.zip UNION ALL select tbl1.* from (se…
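The second step is cut off in this excerpt; presumably it overwrites the affected partition with the data set built in step 1. A minimal sketch of what that could look like from Spark SQL, assuming the table is partitioned by the type column and the step-1 result is registered as a temp view named updated_partition (both are my assumptions):

// spark is an existing SparkSession; "updated_partition" holds the unioned
// (updated + non-updated) rows for the "Multi-Family" partition from step 1.
spark.sql(
  """insert overwrite table samp_tbl_part partition (type = 'Multi-Family')
    |select street, city, zip, state, beds, baths, sq__ft,
    |       sale_date, price, latitude, longitude
    |from updated_partition""".stripMargin)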
CICD Process
Hadoop and Spark by Leela Prasad
5y ago
Basic structure and usage: Initially, Master would be created and would hold version 1.0 of the code. Two more branches would be created out of it, namely staging and Dev.
Development process:
1. Each developer would create their own feature branch, named "feature/JIRATicketNumber", from Dev. This branch would be created only on the local machine and not on the remote server at this point. Below are the steps for this process:
- Clone the code from all 3 branches Master, staging and Dev - command: git clone.
- Checkout Dev to create a branch out of it - command: git checkout Dev.
- Crea…