How the GoDaddy data platform achieved over 60% cost reduction and 50% performance boost by adopting Amazon EMR Serverless
AWS Blog » Amazon EMR
by Brandon Abear
2M ago
This is a guest post co-written with Brandon Abear, Dinesh Sharma, John Bush, and Ozcan IIikhan from GoDaddy. GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, and manage their work. At GoDaddy, we take pride in being a data-driven company. Our relentless pursuit of valuable insights from data fuels our business decisions and ensures customer satisfaction. Our commitment to efficiency is unwavering, an ..read more
Visit website
GoDaddy benchmarking results in up to 24% better price-performance for their Spark workloads with AWS Graviton2 on Amazon EMR Serverless
AWS Blog » Amazon EMR
by Mukul Sharma
7M ago
This is a guest post co-written with Mukul Sharma, Software Development Engineer, and Ozcan IIikhan, Director of Engineering from GoDaddy. GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 22 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, and manage their work. GoDaddy is a data-driven company, and getting meaningful insights from data helps us drive business decisions to delight our customers. At GoDaddy, we embarked on a journey to uncover the effic ..read more
Visit website
Define per-team resource limits for big data workloads using Amazon EMR Serverless
AWS Blog » Amazon EMR
by Gaurav Sharma
8M ago
Customers face a challenge when distributing cloud resources between different teams running workloads such as development, testing, or production. The resource distribution challenge also occurs when you have different line-of-business users. The objective is not only to ensure sufficient resources be consistently available to production workloads and critical teams, but also to prevent adhoc jobs from using all the resources and delaying other critical workloads due to mis-configured or non-optimized code. Cost controls and usage tracking across these teams is also a critical factor. In the ..read more
Visit website
Query big data with resilience using Trino in Amazon EMR with Amazon EC2 Spot Instances for less cost
AWS Blog » Amazon EMR
by Ashwini Kumar
8M ago
Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand prices. Amazon EMR provides a managed Hadoop framework that makes it straightforward, fast, and cost-effective to process vast amounts of data using EC2 instances. Amazon EMR with Spot Instances allows you to reduce costs for running your big data workloads on AWS. Amazon EC2 can interrupt Spot Instances with a 2-minute notification whenever Amazon EC2 needs to reclaim capacity for On-Demand customers. Spot Instances are best suited for runni ..read more
Visit website
Apache Iceberg optimization: Solving the small files problem in Amazon EMR
AWS Blog » Amazon EMR
by Avijit Goswami
8M ago
In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform. Iceberg tables store metadata in manifest files. As the number of data files increase, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proport ..read more
Visit website
Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment
AWS Blog » Amazon EMR
by Lotfi Mouhib
1y ago
Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) in order to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also allows you to improve resource utilization, reduce cost, and simplify infrastructure management to support different application deployments. This model is beneficial for those running Apache Spark workloads, for several rea ..read more
Visit website
Configure Hadoop YARN CapacityScheduler on Amazon EMR on Amazon EC2 for multi-tenant heterogeneous workloads
AWS Blog » Amazon EMR
by Suvojit Dasgupta
1y ago
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster resource manager responsible for assigning computational resources (CPU, memory, I/O), and scheduling and monitoring jobs submitted to a Hadoop cluster. This generic framework allows for effective management of cluster resources for distributed data processing frameworks, such as Apache Spark, Apache MapReduce, and Apache Hive. When supported by the framework, Amazon EMR by default uses Hadoop YARN. Please note that not all frameworks offered by Amazon EMR use Hadoop YARN, such as Trino/Presto and Apache HBase. In this post, we ..read more
Visit website
Disaster recovery considerations with Amazon EMR on Amazon EC2 for Spark workloads
AWS Blog » Amazon EMR
by Bharat Gamini
1y ago
Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR launches all nodes for a given cluster in the same Amazon Elastic Compute Cloud (Amazon EC2) Availability Zone to improve performance. During an Availability Zone failure or due to any unexpected interruption, Amazon EMR may not be accessible, and we need a disaster recovery (DR) strategy to mitigate this problem. Part of architecting a re ..read more
Visit website
Build a self-service environment for each line of business using Amazon EMR and AWS Service Catalog
AWS Blog » Amazon EMR
by Tanzir Musabbir
1y ago
Enterprises often want to centralize governance and compliance requirements, and provide a common set of policies on how Amazon EMR instances should be set up. You can use AWS Service Catalog to centrally manage commonly deployed Amazon EMR cluster configurations, and this helps you achieve consistent governance and meet your compliance requirements, while at the same time enabling your end users to quickly deploy only the approved EMR cluster configurations on a self-service basis. In this post, we will demonstrate how enterprise administrators can use AWS Service Catalog to create and manage ..read more
Visit website
How Drop used the Amazon EMR runtime for Apache Spark to halve costs and get results 5.4 times faster
AWS Blog » Amazon EMR
by Michael Chau
1y ago
February 2022 update – When this Blog post was published in June 2020 AWS Glue V1 offered an average starting time of 10 minutes. In September 2020 Glue V2 was released offering 10X faster start times. Because of this the part of this blog post that compares the starting times between AWS Glue and Amazon EMR is no longer valid and should not be consider as part of the selection process between these two services. This is a guest post by Michael Chau, software engineer with Drop, and Leonardo Gomez, AWS big data specialist solutions architect. In their own words, “Drop is on a mission to level ..read more
Visit website

Follow AWS Blog » Amazon EMR on FeedSpot

Continue with Google
Continue with Apple
OR