Hadoop In Real World
720 FOLLOWERS
We are a group of Hadoop engineers who are passionate about Hadoop and related Big Data technologies. Read the blog to find helpful articles on Hadoop technologies.
Hadoop In Real World
5M ago
The Hadoop Developer In Real World 3-node Hadoop cluster will be shut down permanently on June 30th, 2024.
Reason for the shutdown
The Hadoop Developer In Real World 3-node Hadoop cluster has been in operation for 9 years, since the go-live of the Hadoop Developer In Real World course on Udemy.
From the inception of the cluster, we have made it clear that access to the cluster is complimentary and that we reserve the right to deny access or shut down the cluster at any time. User activity on the cluster has gone down significantly and has almost flatlined. Looking at the student enrollment numbers and the us ..read more
Hadoop In Real World
1y ago
Apache Spark is a powerful open-source distributed computing system used for big data processing. However, sometimes you may need to kill a running Spark application, for example when it is stuck, consuming too many resources, or taking too long to complete. In this post, we will discuss how to kill a running Spark application.
Finding the application ID
To kill a running Spark application, you first need to find its application ID. You can find the application ID by running the following command in the Spark shell:
sc.applicationId
This command will return the appli ..read more
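Once you have the application ID, the application can be killed from outside the Spark shell. As a minimal sketch, assuming a YARN deployment (the application ID below is a placeholder):
# kill the running Spark application on YARN (placeholder application ID)
yarn application -kill application_1681234567890_0001
On YARN, this terminates the ApplicationMaster along with all of its executors.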
Hadoop In Real World
1y ago
This is going to be a short post.
Number of executors in YARN deployments
spark.executor.instances controls the number of executors in YARN. By default, the number of executors is 2.
Number of executors in Standalone deployments
Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.
You can, however, control the number of cores available to each executor with spark.executor.cores and spark.cores.max.
Number of cores allocated will be spark.executor.cores or spark.cores.max if spark.execu ..read more
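As a rough sketch, both settings can be passed at submit time; the class name, jar, and numbers below are placeholders for illustration only:
# YARN: request 4 executors with 2 cores each (illustrative values)
spark-submit --master yarn --num-executors 4 --executor-cores 2 --class com.example.MyApp myapp.jar
# Standalone: cap the total cores used across the cluster (illustrative values)
spark-submit --master spark://master:7077 --conf spark.cores.max=8 --conf spark.executor.cores=2 --class com.example.MyApp myapp.jar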
Hadoop In Real World
1y ago
There is no separate command in the AWS CLI to find the number of objects in an S3 bucket, but there is a workaround.
Solution
aws s3 ls with the --recursive option will list all the objects in the bucket.
[osboxes@wk1 ~]$ aws s3 ls s3://hirw-kickstarter --recursive
2020-11-18 16:46:03 0 Output2/_SUCCESS2
2020-11-18 16:43:34 111 Output2/_committed_1768500173264775255
2020-11-18 16:43:34 0 Output2/_started_1768500173264775255
2020-11-18 16:43:34 170043 Output2/part-00000-tid-1768500173264775255-99c4578a-e958-4e22-869b-46bfaf1acbfa-611-c000.csv
2018-04-07 16:20:35 46500 ..read more
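Since each object appears on its own line in the recursive listing, one common way to get the count is to pipe the output through wc -l; this assumes the --summarize option is not used, so every line corresponds to one object:
[osboxes@wk1 ~]$ aws s3 ls s3://hirw-kickstarter --recursive | wc -l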
Hadoop In Real World
1y ago
Apache Spark, the popular distributed computing framework, has taken a significant leap forward with the release of Apache Spark 3.0. Packed with new features and enhancements, Spark 3.0 introduces Adaptive Query Execution (AQE) along with several other advancements that enhance performance and usability. In this blog post, we will delve into the key features of Spark 3.0.
Adaptive Query Execution (AQE)
Adaptive Query Execution is a game-changer introduced in Spark 3.0. It addresses the limitations of traditional static execution plans by dynamically optimizing query execution based on runtime ..read more
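AQE is controlled by the spark.sql.adaptive.enabled property, which is off by default in Spark 3.0 and 3.1. As a hedged sketch, it can be turned on explicitly at submit time (the class and jar names are placeholders):
# enable AQE and its partition-coalescing optimization (placeholder class and jar)
spark-submit --conf spark.sql.adaptive.enabled=true --conf spark.sql.adaptive.coalescePartitions.enabled=true --class com.example.MyApp myapp.jar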
Hadoop In Real World
1y ago
Both spark.sql.shuffle.partitions and spark.default.parallelism control the number of tasks that get executed at runtime, thereby controlling the distribution and parallelism, which means both properties have a direct effect on performance.
spark.sql.shuffle.partitions
spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations. The default for this property is 200.
spark.default.parallelism
spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and paralleli ..read more
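To illustrate, both properties can be set explicitly when submitting a job; the values, class name, and jar below are placeholders rather than recommendations:
# set shuffle partitions for SQL/DataFrame operations and default parallelism for RDDs (illustrative values)
spark-submit --conf spark.sql.shuffle.partitions=400 --conf spark.default.parallelism=400 --class com.example.MyApp myapp.jar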
Hadoop In Real World
1y ago
This post aims at describing the differences between client and cluster deploy modes in Spark.
Client mode
The driver is launched directly within the spark-submit process which acts as a client to the cluster.
Driver does not use any cluster resources if the driver is launched from a node outside the cluster.
The application cannot be tracked if the node that kicked off the driver has connectivity issues to the Spark cluster.
Not ideal if the node that kicks off the driver is not on the same network as the cluster or the bandwidth between the driver node and cluster is not optimal.
Client mode i ..read more
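For reference, the deploy mode is selected with the --deploy-mode flag of spark-submit. A minimal sketch, with placeholder class and jar names:
# client mode: the driver runs inside the spark-submit process on the submitting machine
spark-submit --master yarn --deploy-mode client --class com.example.MyApp myapp.jar
# cluster mode: the driver runs on a node inside the cluster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar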
Hadoop In Real World
1y ago
In modern software systems, data is often generated and consumed in real-time. To handle these data streams, various processing techniques have been developed, including stream processing and message processing. However, there is often confusion about the difference between these two techniques. In this blog post, we will explore the differences between stream processing and message processing.
Once we understand the difference, we will see where Kafka and other messaging systems like RabbitMQ fit.
Stream processing and message processing are both techniques used to handle real-time data strea ..read more
Hadoop In Real World
1y ago
By default, shards should be automatically allocated to nodes. But in extreme cases, for example after adding or removing nodes from the cluster or after several restarts, your cluster might be in a non-green status and you could have shards that are not allocated.
Solution
cluster.routing.allocation.enable is a cluster-level property and it is a dynamic property. Set it to all at the cluster level and monitor the cluster to see whether the error goes away.
In most cases, the command below should fix the issue.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
"transient" : { "cluster.routing.allocation.enable" : "all ..read more
Hadoop In Real World
1y ago
This is a pretty common requirement and here is the solution.
Solution
Let’s create a bucket named hirw-sample-aws-bucket first.
[osboxes@wk1 ~]$ aws s3 mb s3://hirw-sample-aws-bucket
Use the cp command with the --recursive option to upload the folder named aws to hirw-sample-aws-bucket.
[osboxes@wk1 ~]$ aws s3 cp --recursive aws s3://hirw-sample-aws-bucket
----
----
upload: aws/dist/libz.so.1 to s3://hirw-sample-aws-bucket/dist/libz.so.1
upload: aws/dist/math.cpython-37m-x86_64-linux-gnu.so to s3://hirw-sample-aws-bucket/dist/math.cpython-37m-x86_64-linux-gnu.so
upload: aws/dist/termios.cpython-37m-x86_ ..read more