Large Scale Industrialization Key to Open Source Innovation
Cloudera Blog | Apache Hadoop
by Sudhir Menon
1y ago
We are now well into 2022, and the megatrends that drove the last decade in data—the Apache Software Foundation as a primary innovation vehicle for big data, the arrival of cloud computing, and the debut of cheap distributed storage—have converged and offer clear patterns: competitive advantage for vendors and value for customers. Cloudera has been parlaying those patterns into clear wins for the community at large and, more importantly, streamlining the benefits of that innovation for our customers. At Cloudera, we have had the benefit of an early start, and as a result we have cu…
Apache Ozone Powers Data Science in CDP Private Cloud
by George Huang
2y ago
Apache Ozone is a scalable distributed object store that can efficiently manage billions of small and large files. Ozone natively provides Amazon S3- and Hadoop Filesystem-compatible endpoints in addition to its own native object store API, and is designed to work seamlessly with enterprise-scale data warehousing, machine learning, and streaming workloads. The object store is readily available alongside HDFS in CDP (Cloudera Data Platform) Private Cloud Base 7.1.3+. This means there is out-of-the-box support for Ozone storage in services like Apache Hive, Apache Impala, and Apache Spark…
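The multi-endpoint design means one object is reachable under several addressing styles. A minimal sketch of how the same object might be addressed, assuming the `ofs://` Hadoop-compatible scheme and the `s3a://` connector described in the Ozone docs (the service name, volume, and bucket below are illustrative):

```python
# Sketch only: path schemes follow the Ozone documentation, but verify
# against your cluster's configuration before relying on them.

def ozone_paths(om_service: str, volume: str, bucket: str, key: str) -> dict:
    """Return the same object's address under each endpoint style."""
    return {
        # Native Hadoop-compatible filesystem endpoint (ofs scheme)
        "ofs": f"ofs://{om_service}/{volume}/{bucket}/{key}",
        # S3-compatible gateway: S3 clients see just bucket/key,
        # without the Ozone volume in the path
        "s3a": f"s3a://{bucket}/{key}",
    }

paths = ozone_paths("omservice1", "vol1", "warehouse", "part-00000.parquet")
print(paths["ofs"])  # ofs://omservice1/vol1/warehouse/part-00000.parquet
```

The point of the design is that Hive, Impala, and Spark can use the Hadoop-compatible path while S3-native tools use the gateway, all against the same stored data.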
One billion files in Ozone
by Nandakumar Vadivelu
3y ago
Apache Hadoop Ozone is a distributed key-value store that can manage small and large files alike. Ozone was designed to address the scale limitations of HDFS with respect to small files: HDFS is designed to store large files, the recommended limit is around 300 million files per Namenode, and it doesn't scale well beyond that. The principal features that help Ozone achieve scalability are: the namespace in Ozone is written to a local RocksDB instance, a design that strikes a balance between performance (keeping everything in memory) and scalability (persisting the less used me…
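The performance/scalability trade-off described above can be sketched as a bounded in-memory cache in front of a persistent key-value store. This is a toy model of the general pattern, not Ozone's actual implementation; a plain dict stands in for RocksDB:

```python
from collections import OrderedDict

class CachedNamespace:
    """Toy model: hot namespace entries are served from a bounded
    in-memory LRU cache, cold entries fall back to the persistent
    store, so total namespace size is not limited by RAM."""

    def __init__(self, capacity: int):
        self.store = {}             # stand-in for the on-disk RocksDB
        self.cache = OrderedDict()  # bounded in-memory hot set
        self.capacity = capacity

    def put(self, key: str, value: str) -> None:
        self.store[key] = value     # always persisted
        self._touch(key, value)

    def get(self, key: str) -> str:
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.store[key]     # cache miss: read from "disk"
        self._touch(key, value)
        return value

    def _touch(self, key: str, value: str) -> None:
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
```

With this shape, memory bounds only the hot working set while the full namespace lives on disk — which is why a RocksDB-backed namespace can grow past what an all-in-memory Namenode can hold.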
EMR workloads + CDP = better performance and lower costs
by Wim Stoop
3y ago
The first thing that comes to mind when talking about synergy is how 2 + 2 = 5. Being the writer that he was, Mark Twain described it a lot more eloquently as "the bonus that is achieved when things work together harmoniously". There is a multitude of product and business examples to illustrate the point, and I particularly like how car manufacturers can bring together relatively small engines to do big things. To provide supercar performance in a more environmentally friendly way for the i8, BMW stepped away from ever-bigger power plants. They paired the same 1.5-liter petrol engine as you'll…
An Architecture for Secure COVID-19 Contact Tracing
by Tristan Stevens
3y ago
This post describes an architecture, and the associated privacy controls, for a data platform supporting a nationwide proactive contact tracing solution. Background: after calls for a way of using technology to facilitate the lifting of restrictions on freedom of movement for people not self-isolating, whilst meeting regulatory obligations such as the UK Human Rights Act and equivalent GDPR provisions, this paper proposes a reference architecture for a contact pairing database that maintains privacy yet is built to scale to support large-scale lifting of restrictions of movement. Contact Tracin…
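One common privacy pattern for a contact-pairing store — a sketch of the general technique, not necessarily this paper's exact design — is to persist only a salted hash of each unordered ID pair, so the database can answer "have these two ephemeral IDs met?" without storing the IDs themselves:

```python
import hashlib

# Illustrative only: salt handling, ID rotation, and key management are
# exactly the hard parts a real architecture must specify.
SALT = b"deployment-specific-salt"

def pair_key(id_a: str, id_b: str) -> str:
    """Canonical, order-independent key for one contact pair."""
    a, b = sorted((id_a, id_b))  # unordered pair -> canonical order
    return hashlib.sha256(SALT + f"{a}|{b}".encode()).hexdigest()

seen = {pair_key("eph-123", "eph-456")}
assert pair_key("eph-456", "eph-123") in seen  # order-independent lookup
```

Because lookups work only when both ephemeral IDs are presented, the store itself reveals little if breached — the property the privacy controls above are aiming for.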
Operational Database Management
by Gokul Kamaraj
3y ago
This blog post is part of a series on Cloudera's Operational Database (OpDB) in CDP. Each post goes into more detail about new features and capabilities. Start from the beginning of the series with Operational Database in CDP. This blog post gives you an overview of the OpDB management tools and features in the Cloudera Data Platform. The tools discussed in this article will help you understand the options available for managing the operations of your OpDB cluster. Backup and recovery tools: Cloudera provides multiple mechanisms for backup and recovery, including: Snapshots, R…
Hadoop: Decade Two, Day Zero*
by Arun Murthy
3y ago
This blog was originally published on Medium. The Data Cloud — Powered by Hadoop. One key aspect of the Cloudera Data Platform (CDP), which is just beginning to be understood, is how much of a recombinant evolution it represents, from an architectural standpoint, vis-à-vis Hadoop in its first decade. I've been having a blast showing CDP to customers over the past few months, and the response has been nothing short of phenomenal… Through these discussions, I see that the natural proclivity is to imagine that CDP is just another "distro" (i.e. a "unity distro") of the two parent distros (CDH &…
Operational Database Accessibility
by Liliana Kadar
3y ago
This blog post is part of a series on Cloudera's Operational Database (OpDB) in CDP. Each post goes into more detail about new features and capabilities. Start from the beginning of the series with Operational Database in CDP. Cloudera's OpDB provides a rich set of capabilities to store and access data. In this blog post, we'll look at the accessibility capabilities of OpDB and how you can use them to access your data. Distribution and sharding: Cloudera's Operational Database (OpDB) is a scale-out Database Management System (DBMS) that is designed to s…
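To make "distribution and sharding" concrete, here is a minimal sketch of range-based sharding, the general scheme HBase-style operational databases use: each shard ("region") owns a contiguous row-key range, and a sorted list of split points routes every key. The split points below are illustrative:

```python
import bisect

# 3 shards: keys before "g", keys in ["g", "p"), keys from "p" onward.
split_points = ["g", "p"]

def shard_for(row_key: str) -> int:
    """Route a row key to the shard owning its key range."""
    return bisect.bisect_right(split_points, row_key)

assert shard_for("apple") == 0
assert shard_for("mango") == 1
assert shard_for("zebra") == 2
```

Range-based placement keeps adjacent keys on the same shard, which is what makes efficient row-key range scans possible in a scale-out DBMS.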
Apache Hadoop Ozone Security – Authentication
by Xiaoyu Yao
3y ago
Apache Ozone is a distributed object store built on top of the Hadoop Distributed Data Store service. It can manage billions of small and large files that are difficult for other distributed file systems to handle. Ozone supports rich APIs such as Amazon S3 and Kubernetes CSI, as well as native Hadoop File System APIs. This makes Ozone easily consumable by different kinds of big data workloads, such as data warehousing on Apache Hive, data ingestion with Apache NiFi, streaming with Apache Spark/Flink, and machine learning with TensorFlow. With the growing data footprint and multifac…
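A toy sketch of the token pattern Hadoop-family systems use after the initial Kerberos handshake: a manager issues an HMAC-signed token, and data servers verify the signature with a shared secret instead of contacting the KDC on every request. Field names and the secret below are illustrative, not Ozone's wire format:

```python
import hashlib
import hmac

SECRET = b"shared-secret-rotated-by-the-master"  # illustrative

def issue_token(user: str, block_id: str) -> str:
    """Sign (user, block) so data servers can verify access offline."""
    payload = f"{user}:{block_id}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def verify_token(token: str) -> bool:
    """Recompute the HMAC and compare in constant time."""
    payload, _, sig = token.rpartition(":")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

tok = issue_token("alice", "blk_1001")
assert verify_token(tok)
```

The design choice is about scale: Kerberos authenticates the initial session, and cheap symmetric-key tokens authorize the high-volume per-block traffic that follows.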
Disk and Datanode Size in HDFS
by Lokesh Jain
3y ago
This blog discusses questions such as: what is the right disk size for a datanode, and what is the right capacity for a datanode? A few of our customers have asked us about using dense storage nodes. It is certainly possible to use dense nodes for archival storage, because IO bandwidth requirements are usually lower for cold data. However, the decision to use denser nodes for hot data must be evaluated carefully, as it can have an impact on the performance of the cluster. You may be able to use denser nodes for hot data if you have provisioned adequate network bandwidth to mitigate the higher…
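One way to reason about datanode density is re-replication time after a node failure: HDFS spreads recovery across the surviving nodes, so a denser node means more data to re-protect per failure. A back-of-the-envelope sketch — the numbers and the 50% utilization factor are illustrative assumptions, not Cloudera guidance:

```python
def rereplication_hours(node_capacity_tb: float,
                        surviving_nodes: int,
                        per_node_gbit_s: float,
                        utilization: float = 0.5) -> float:
    """Hours to re-create a failed node's replicas, assuming recovery
    traffic is spread evenly and only `utilization` of each node's
    network bandwidth is available for it."""
    data_bits = node_capacity_tb * 8e12                 # TB -> bits
    agg_bits_per_s = surviving_nodes * per_node_gbit_s * 1e9 * utilization
    return data_bits / agg_bits_per_s / 3600

# A 100 TB dense node failing in a 50-node cluster on 10 GbE:
hours = rereplication_hours(100, 49, 10, 0.5)   # roughly 0.9 hours
```

Doubling node capacity doubles this window, during which the cluster runs with reduced replication — which is why dense hot-data nodes need correspondingly more network bandwidth.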