Data Management with Open Table Formats
Curated SQL | Hadoop
by Kevin Feasel
1M ago
Anandaganesh Balakrishnan covers a few open-source products and formats: Apache Iceberg is an open-source table format designed for large-scale data lakes, aiming to improve data reliability, performance, and scalability. Its architecture introduces several key components and concepts that address the challenges commonly associated with big data processing and analytics, such as managing large datasets, schema evolution, efficient querying, and ensuring transactional integrity. Here's a deep dive into the core components and architectural design of Apache Iceberg: Click through for that deep dive.
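For a feel of what that architecture means in practice, here is a minimal sketch of creating an Iceberg table, evolving its schema, and inspecting snapshots with PySpark. It assumes Spark was launched with the Iceberg runtime on the classpath and a Hadoop-type catalog named local; the table and column names are purely illustrative.

```python
# A minimal Iceberg sketch with PySpark; catalog, table, and column names
# are illustrative, and the iceberg-spark-runtime JAR is assumed present.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table with hidden partitioning on the event date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id BIGINT,
        payload STRING,
        event_ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Schema evolution is a metadata-only operation: no data files are rewritten.
spark.sql("ALTER TABLE local.db.events ADD COLUMN source STRING")

# Every commit produces a snapshot; the snapshots metadata table lists them,
# and those snapshot IDs are what time-travel queries reference.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
```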
Using Schema Registry for Data Quality in Apache Kafka
Curated SQL | Hadoop
by Kevin Feasel
3M ago
Kai Waehner talks data quality: Good data quality is one of the most critical requirements in decoupled architectures, like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry enforces message structures. This blog post looks at enhancements to leverage data contracts for policies and rules to enforce good data quality at the field level, plus advanced use cases like routing malicious messages to a dead letter queue. Click through to learn more about the topic. This focuses a lot …
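As a concrete illustration of schema enforcement on the producer side, here is a minimal sketch using the confluent-kafka Python client against a Schema Registry. The broker and registry addresses, topic name, and record shape are all assumptions for the example.

```python
# A minimal sketch of schema-enforced produces; endpoints, topic, and the
# Avro record layout are illustrative.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = AvroSerializer(sr_client, schema_str)
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Serialization fails fast if the message does not match the registered
# schema, so malformed records never reach the topic as raw byte arrays.
value = serializer(
    {"id": "p-1001", "amount": 42.50},
    SerializationContext("payments", MessageField.VALUE),
)
producer.produce(topic="payments", value=value)
producer.flush()
```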
Continuing the Advent of Fabric
Curated SQL | Hadoop
by Kevin Feasel
4M ago
Tomaz Kastrun has been busy. On day 9, we build a custom environment: Microsoft Fabric provides you with the capability to create a new environment, where you can select different Spark runtimes, configure your compute resources, and create a list of Python libraries (public or custom; from Conda or PyPI) to be installed. Custom environments behave the same way as any other environment and can be used and attached to your notebook or used on a workspace. Custom environments can also be attached to Spark job definitions. On day 10, we have Spark job definitions: An Apache Spark job definition …
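A Spark job definition ultimately points at a main definition file that Fabric submits as a batch job. As a rough idea of what that file can look like, here is a minimal PySpark sketch; the lakehouse path and table name are illustrative, not taken from Tomaz's posts.

```python
# A minimal sketch of a PySpark script that could serve as the main
# definition file of a Fabric Spark job definition. "Files/raw/sales/" is
# the relative path convention for an attached lakehouse; names are
# illustrative.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("nightly-load").getOrCreate()

    # Read raw files from the attached lakehouse and persist a managed table.
    df = spark.read.option("header", "true").csv("Files/raw/sales/")
    df.write.mode("overwrite").saveAsTable("sales_staging")

    spark.stop()
```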
Databricks Security Analysis Tool
Curated SQL | Hadoop
by Kevin Feasel
4M ago
Advait Bhadane takes a look at a tool: In today's data-driven world, a cutting-edge platform is required that seamlessly integrates with the cloud, embraces open-source innovation, and prioritises robust data security. Databricks is a pioneer in this field. Not only does it provide a unified lakehouse platform, but it takes data protection to the next level with its Security Analysis Tool (SAT). In this blog, we will unravel the power of Databricks' SAT, focusing on the pivotal role it plays in generating daily health reports for your workspaces. It will also walk you through the step-by-step …
Producing Messages with librdkafka
Curated SQL | Hadoop
by Kevin Feasel
4M ago
Jakub Korab dives into a Kafka library: In a previous blog post (How To Survive an Apache Kafka® Outage) I outlined the effects on applications during partial or total Kafka cluster outages and proposed some architectural strategies to handle these types of service interruptions. The applications most heavily impacted by this type of outage are external interfaces that receive data, do not control request flow, and possibly perform some form of business transaction with the outside world before producing to Kafka. These applications are most commonly found in finance and written in languages …
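librdkafka itself is a C library, but its produce-side behavior is easy to show through confluent-kafka, the Python wrapper built on top of it. Here is a minimal sketch of asynchronous produces with a delivery callback and handling of a full local queue, which is exactly the kind of thing that matters while riding out an outage; the broker address, topic, and timeout values are illustrative.

```python
# A minimal sketch of outage-aware producing via confluent-kafka
# (librdkafka's Python wrapper); broker, topic, and timeouts illustrative.
from confluent_kafka import Producer

def delivery_report(err, msg):
    # Invoked from poll()/flush() once the broker acks, or the message fails.
    if err is not None:
        print(f"Delivery failed for key {msg.key()}: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}]@{msg.offset()}")

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    # How long librdkafka keeps retrying a message before giving up;
    # this bounds how long an outage can be ridden out locally.
    "message.timeout.ms": 120000,
})

for i in range(100):
    try:
        producer.produce("orders", key=str(i), value=f"order-{i}",
                         callback=delivery_report)
    except BufferError:
        # The local queue is full (e.g., the cluster is down): wait for
        # space to free up, then retry the produce once.
        producer.poll(1)
        producer.produce("orders", key=str(i), value=f"order-{i}",
                         callback=delivery_report)
    producer.poll(0)  # serve pending delivery callbacks

producer.flush(30)  # block until outstanding messages deliver or expire
```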
Key Constraints in Databricks Unity Catalog
Curated SQL | Hadoop
by Kevin Feasel
5M ago
Meagan Longoria gives us a warning: I've been building lakehouses using Databricks Unity Catalog for a couple of clients. Overall, I like the technology, but there are a few things to get used to. This includes the fact that primary key and foreign key constraints are informational only and not enforced. If you come from a relational database background, this unenforced constraint may bother you a bit, as you may be used to enforcing it to help with referential integrity. Read on to see what is available and why it can nonetheless be useful in some circumstances.
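To make the "informational only" point concrete, here is a minimal sketch of declaring primary and foreign keys on Unity Catalog tables from a notebook, where a spark session is already available. The catalog, schema, and table names are illustrative, and note that the final insert succeeds even though it violates the foreign key.

```python
# A minimal sketch of informational PK/FK constraints in Unity Catalog;
# names are illustrative, and nothing here is enforced at write time.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.dim_customer (
        customer_id BIGINT NOT NULL,
        name STRING,
        CONSTRAINT pk_dim_customer PRIMARY KEY (customer_id)
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.fact_orders (
        order_id BIGINT NOT NULL,
        customer_id BIGINT,
        CONSTRAINT pk_fact_orders PRIMARY KEY (order_id),
        CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
            REFERENCES main.sales.dim_customer (customer_id)
    )
""")

# This succeeds even if customer 999 does not exist in dim_customer: the FK
# is metadata for tools and optimizers, so integrity stays the pipeline's job.
spark.sql("INSERT INTO main.sales.fact_orders VALUES (1, 999)")
```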
Lakehouse Management in Fabric via mssparkutils
Curated SQL | Hadoop
by Kevin Feasel
5M ago
Sandeep Pawar scripts out some lakehouse work: At MS Ignite, Microsoft unveiled a variety of new APIs designed for working with Fabric items, such as workspaces, Spark jobs, lakehouses, warehouses, ML items, and more. You can find detailed information about these APIs here. These APIs will be critical in the automation and CI/CD of Fabric workloads. With the release of these APIs, a new method has been added to the mssparkutils library to simplify working with lakehouses. In this blog, I will explore the available options and provide examples. Please note that at the …
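As a rough sketch of what that looks like, the snippet below is hypothetical: the method names and parameters are assumptions based on the post's description and the usual mssparkutils conventions, so check mssparkutils.lakehouse.help() in your own environment for the real signatures.

```python
# A hypothetical sketch of the new lakehouse helpers in mssparkutils; method
# names and parameters are assumptions, not confirmed signatures. Lakehouse
# names are illustrative.
from notebookutils import mssparkutils

# Print the module's self-documentation to see what is actually available.
mssparkutils.lakehouse.help()

# Create a lakehouse in the current workspace (assumed signature).
new_lh = mssparkutils.lakehouse.create("sales_lh", "Lakehouse for sales data")

# Retrieve metadata for an existing lakehouse (assumed signature).
props = mssparkutils.lakehouse.get("sales_lh")
print(props)
```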
An Overview of Data Lake Operations with Apache NiFi
Curated SQL | Hadoop
by Kevin Feasel
6M ago
Lav Kumar gives us a 10,000-foot view: In the world of data-driven decision-making, ETL (Extract, Transform, Load) processes play a pivotal role. The effective management and transformation of data are essential to ensure that businesses can make informed choices based on accurate and relevant information. Data lakes have emerged as a powerful way to store and analyze massive amounts of data, and Apache NiFi is a robust tool for streamlining ETL processes in a data lake environment. Read on for a brief primer on NiFi and how some of its capabilities can assist in ETL and …
Apache Zookeeper Vulnerability
Curated SQL | Hadoop
by Kevin Feasel
6M ago
The Instaclustr team reviews an announcement: On October 11, 2023, the Apache ZooKeeper project announced that a security vulnerability had been identified in Apache ZooKeeper: CVE-2023-44981. The Apache ZooKeeper project has classified the severity of this CVE as critical. The CVSS (Common Vulnerability Scoring System) 3.x severity rating for this vulnerability from the NVD (National Vulnerability Database) is a base score of 9.1 (Critical). That's a rather high base score, and it comes about if you have the setting quorum.auth.enableSasl=true. Updating to the ZooKeeper …
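For reference, the vulnerability only applies when quorum SASL authentication is turned on. A zoo.cfg excerpt might look like the sketch below; the first property is the one named above, while the other two are typical companion settings rather than anything from the announcement.

```
# zoo.cfg excerpt (illustrative): the CVE applies when quorum SASL
# authentication is enabled via the first property below.
quorum.auth.enableSasl=true
quorum.auth.learnerRequireSasl=true
quorum.auth.serverRequireSasl=true
```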
Capturing a TCP Dump in an Azure Databricks Notebook
Curated SQL | Hadoop
by Kevin Feasel
6M ago
Stithi Panigrahi does some troubleshooting: Due to the potential impact on performance and storage costs, Azure Databricks clusters don't capture networking logs by default. Follow the instructions below if you need to capture a tcpdump to investigate networking issues related to the cluster. These steps will capture a TCP dump on each cluster node, both driver and workers, during the entire lifetime of the cluster. Click through for an initiation script, which generates the actual script, which itself generates the TCP dumps.
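In the same spirit as the post, here is a minimal sketch of generating such an init script from a notebook with dbutils.fs.put; the paths and tcpdump flags are illustrative, not the post's actual script.

```python
# A minimal sketch: write a cluster init script to DBFS that starts tcpdump
# on every node (driver and workers). Paths and flags are illustrative.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/start-tcpdump.sh",
    """#!/bin/bash
# Rotate capture files hourly (-G 3600) and at ~100 MB (-C 100) so a
# long-running cluster doesn't fill its disk with packet captures.
mkdir -p /tmp/tcpdump
nohup tcpdump -i any -w /tmp/tcpdump/capture-%Y%m%d-%H%M%S.pcap -G 3600 -C 100 &
""",
    True,  # overwrite if the script already exists
)
# Then attach dbfs:/databricks/init-scripts/start-tcpdump.sh as a cluster
# init script and restart the cluster so each node runs it at startup.
```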