Hadoop In Real World
720 FOLLOWERS
We are a group of Hadoop engineers who are passionate about Hadoop and related Big Data technologies. Read the blog to find helpful articles on Hadoop technologies.
Hadoop In Real World
5M ago
The Hadoop Developer In Real World 3-node Hadoop cluster will be shut down permanently on June 30th, 2024.
Reason for the shutdown
The Hadoop Developer In Real World 3-node Hadoop cluster has been in operation for 9 years, since the go-live of the Hadoop Developer In Real World course on Udemy.
From the inception of the cluster, we have made it clear that access to the cluster is complimentary and that we reserve the right to deny access or shut down the cluster at any time. User activity on the cluster has gone down significantly and has almost flatlined. Looking at the student enrollment numbers and the us ..read more
Hadoop In Real World
1y ago
Apache Spark is a powerful open-source distributed computing system used for big data processing. However, sometimes you may need to kill a running Spark application, for example when it is stuck, consuming too many resources, or taking too long to complete. In this post, we will discuss how to kill a running Spark application.
Finding the application ID
To kill a running Spark application, you first need to find its application ID. You can find the application ID by running the following command in the Spark shell:
sc.applicationId
This command will return the appli ..read more
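Once you have the application ID, the application can be killed from outside the Spark shell. As a minimal sketch, assuming a YARN deployment (the application ID below is a placeholder):
# kill the running Spark application on YARN (placeholder application ID)
yarn application -kill application_1681234567890_0001
On YARN, this terminates the ApplicationMaster along with all of its executors.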
Hadoop In Real World
1y ago
This is going to be a short post.
Number of executors in YARN deployments
spark.executor.instances controls the number of executors in YARN. By default, the number of executors is 2.
Number of executors in Standalone deployments
Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.
You can, however, control the number of cores available to each executor with spark.executor.cores and spark.cores.max.
Number of cores allocated will be spark.executor.cores or spark.cores.max if spark.execu ..read more
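As a rough sketch, both settings can be passed at submit time; the class name, jar, and numbers below are placeholders for illustration only:
# YARN: request 4 executors with 2 cores each (illustrative values)
spark-submit --master yarn --num-executors 4 --executor-cores 2 --class com.example.MyApp myapp.jar
# Standalone: cap the total cores used across the cluster (illustrative values)
spark-submit --master spark://master:7077 --conf spark.cores.max=8 --conf spark.executor.cores=2 --class com.example.MyApp myapp.jar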
Hadoop In Real World
1y ago
There is no separate command in the AWS CLI to find the number of objects in an S3 bucket, but there is a workaround.
Solution
aws s3 ls with the --recursive option will list all the objects in the bucket.
[osboxes@wk1 ~]$ aws s3 ls s3://hirw-kickstarter --recursive
2020-11-18 16:46:03 0 Output2/_SUCCESS2
2020-11-18 16:43:34 111 Output2/_committed_1768500173264775255
2020-11-18 16:43:34 0 Output2/_started_1768500173264775255
2020-11-18 16:43:34 170043 Output2/part-00000-tid-1768500173264775255-99c4578a-e958-4e22-869b-46bfaf1acbfa-611-c000.csv
2018-04-07 16:20:35 46500 ..read more
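Since each object appears on its own line in the recursive listing, one common way to get the count is to pipe the output through wc -l; this assumes the --summarize option is not used, so every line corresponds to one object:
[osboxes@wk1 ~]$ aws s3 ls s3://hirw-kickstarter --recursive | wc -l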
Hadoop In Real World
1y ago
Apache Spark, the popular distributed computing framework, has taken a significant leap forward with the release of Apache Spark 3.0. Packed with new features and enhancements, Spark 3.0 introduces Adaptive Query Execution (AQE) along with several other advancements that enhance performance and usability. In this blog post, we will delve into the key features of Spark 3.0.
Adaptive Query Execution (AQE)
Adaptive Query Execution is a game-changer introduced in Spark 3.0. It addresses the limitations of traditional static execution plans by dynamically optimizing query execution based on runtime ..read more
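AQE is controlled by the spark.sql.adaptive.enabled property, which is off by default in Spark 3.0 and 3.1. As a hedged sketch, it can be turned on explicitly at submit time (the class and jar names are placeholders):
# enable AQE and its partition-coalescing optimization (placeholder class and jar)
spark-submit --conf spark.sql.adaptive.enabled=true --conf spark.sql.adaptive.coalescePartitions.enabled=true --class com.example.MyApp myapp.jar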
Hadoop In Real World
1y ago
Both spark.sql.shuffle.partitions and spark.default.parallelism control the number of tasks that get executed at runtime, thereby controlling the distribution and parallelism, which means both properties have a direct effect on performance.
spark.sql.shuffle.partitions
spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations. The default for this property is 200.
spark.default.parallelism
spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and paralleli ..read more
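To illustrate, both properties can be set explicitly when submitting a job; the values, class name, and jar below are placeholders rather than recommendations:
# set shuffle partitions for SQL/DataFrame operations and default parallelism for RDDs (illustrative values)
spark-submit --conf spark.sql.shuffle.partitions=400 --conf spark.default.parallelism=400 --class com.example.MyApp myapp.jar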
Hadoop In Real World
1y ago
This post aims at describing the differences between client and cluster deploy modes in Spark.
Client mode
The driver is launched directly within the spark-submit process which acts as a client to the cluster.
Driver does not use any cluster resources if the driver is launched from a node outside the cluster.
The application cannot be tracked if the node that kicked off the driver has connectivity issues to the Spark cluster.
Not ideal if the node that kicks off the driver is not on the same network as the cluster or the bandwidth between the driver node and cluster is not optimal.
Client mode i ..read more
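For reference, the deploy mode is selected with the --deploy-mode flag of spark-submit. A minimal sketch, with placeholder class and jar names:
# client mode: the driver runs inside the spark-submit process on the submitting machine
spark-submit --master yarn --deploy-mode client --class com.example.MyApp myapp.jar
# cluster mode: the driver runs on a node inside the cluster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar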
Hadoop In Real World
1y ago
In modern software systems, data is often generated and consumed in real-time. To handle these data streams, various processing techniques have been developed, including stream processing and message processing. However, there is often confusion about the difference between these two techniques. In this blog post, we will explore the differences between stream processing and message processing.
Once we understand the difference, we will see where Kafka and other messaging systems like RabbitMQ fit.
Stream processing and message processing are both techniques used to handle real-time data strea ..read more
Hadoop In Real World
1y ago
By default, shards should be automatically allocated to nodes. But in extreme cases, for example after adding or removing nodes from the cluster or after several restarts, your cluster might be in a non-green status and you could have shards that are not allocated.
Solution
cluster.routing.allocation.enable is a cluster-level property and it is a dynamic property. Set it to all at the cluster level and monitor the cluster to see whether the error goes away.
In most cases, the command below should fix the issue.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
"transient" : { "cluster.routing.allocation.enable" : "all ..read more
Hadoop In Real World
1y ago
This is a pretty common requirement and here is the solution.
Solution
Let’s create a bucket named hirw-sample-aws-bucket first.
[osboxes@wk1 ~]$ aws s3 mb s3://hirw-sample-aws-bucket
Use the cp command with the --recursive option to upload the folder named aws to hirw-sample-aws-bucket.
[osboxes@wk1 ~]$ aws s3 cp --recursive aws s3://hirw-sample-aws-bucket
----
----
upload: aws/dist/libz.so.1 to s3://hirw-sample-aws-bucket/dist/libz.so.1
upload: aws/dist/math.cpython-37m-x86_64-linux-gnu.so to s3://hirw-sample-aws-bucket/dist/math.cpython-37m-x86_64-linux-gnu.so
upload: aws/dist/termios.cpython-37m-x86_ ..read more