Data Management with Open Table Formats
Curated SQL | Hadoop
by Kevin Feasel
1M ago
Anandaganesh Balakrishnan covers a few open-source products and formats: Apache Iceberg is an open-source table format designed for large-scale data lakes, aiming to improve data reliability, performance, and scalability. Its architecture introduces several key components and concepts that address the challenges commonly associated with big data processing and analytics, such as managing large datasets, schema evolution, efficient querying, and ensuring transactional integrity. Here's a deep dive into the core components and architectural design of Apache Iceberg: Click through for that deep dive.
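For a feel of what that architecture means in practice, here is a minimal sketch of creating an Iceberg table, evolving its schema, and inspecting snapshots with PySpark. It assumes Spark was launched with the Iceberg runtime on the classpath and a Hadoop-type catalog named local; the table and column names are purely illustrative.

```python
# A minimal Iceberg sketch with PySpark; catalog, table, and column names
# are illustrative, and the iceberg-spark-runtime JAR is assumed present.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table with hidden partitioning on the event date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id BIGINT,
        payload STRING,
        event_ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Schema evolution is a metadata-only operation: no data files are rewritten.
spark.sql("ALTER TABLE local.db.events ADD COLUMN source STRING")

# Every commit produces a snapshot; the snapshots metadata table lists them,
# and those snapshot IDs are what time-travel queries reference.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
```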
Using Schema Registry for Data Quality in Apache Kafka
Curated SQL | Hadoop
by Kevin Feasel
3M ago
Kai Waehner talks data quality: Good data quality is one of the most critical requirements in decoupled architectures, like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry enforces message structures. This blog post looks at enhancements to leverage data contracts for policies and rules to enforce good data quality at the field level, plus advanced use cases like routing malicious messages to a dead letter queue. Click through to learn more about the topic. This focuses a lot …
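As a concrete illustration of schema enforcement on the producer side, here is a minimal sketch using the confluent-kafka Python client against a Schema Registry. The broker and registry addresses, topic name, and record shape are all assumptions for the example.

```python
# A minimal sketch of schema-enforced produces; endpoints, topic, and the
# Avro record layout are illustrative.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = AvroSerializer(sr_client, schema_str)
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Serialization fails fast if the message does not match the registered
# schema, so malformed records never reach the topic as raw byte arrays.
value = serializer(
    {"id": "p-1001", "amount": 42.50},
    SerializationContext("payments", MessageField.VALUE),
)
producer.produce(topic="payments", value=value)
producer.flush()
```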
Continuing the Advent of Fabric
Curated SQL | Hadoop
by Kevin Feasel
4M ago
Tomaz Kastrun has been busy. On day 9, we build a custom environment: Microsoft Fabric provides you with the capability to create a new environment, where you can select different Spark runtimes, configure your compute resources, and create a list of Python libraries (public or custom; from Conda or PyPI) to be installed. Custom environments behave the same way as any other environment and can be used and attached to your notebook or used on a workspace. Custom environments can also be attached to Spark job definitions. On day 10, we have Spark job definitions: An Apache Spark job definition …
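A Spark job definition ultimately points at a main definition file that Fabric submits as a batch job. As a rough idea of what that file can look like, here is a minimal PySpark sketch; the lakehouse path and table name are illustrative, not taken from Tomaz's posts.

```python
# A minimal sketch of a PySpark script that could serve as the main
# definition file of a Fabric Spark job definition. "Files/raw/sales/" is
# the relative path convention for an attached lakehouse; names are
# illustrative.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("nightly-load").getOrCreate()

    # Read raw files from the attached lakehouse and persist a managed table.
    df = spark.read.option("header", "true").csv("Files/raw/sales/")
    df.write.mode("overwrite").saveAsTable("sales_staging")

    spark.stop()
```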
Databricks Security Analysis Tool
Curated SQL | Hadoop
by Kevin Feasel
4M ago
Advait Bhadane takes a look at a tool: In today's data-driven world, a cutting-edge platform is required that seamlessly integrates with the cloud, embraces open-source innovation, and prioritises robust data security. Databricks is a pioneer in this field. Not only does it provide a unified lakehouse platform, but it takes data protection to the next level with its Security Analysis Tool (SAT). In this blog, we will unravel the power of Databricks' SAT, focusing on the pivotal role it plays in generating daily health reports for your workspaces. It will also walk you through the step-by-step …
Producing Messages with librdkafka
Curated SQL | Hadoop
by Kevin Feasel
4M ago
Jakub Korab dives into a Kafka library: In a previous blog post (How To Survive an Apache Kafka® Outage) I outlined the effects on applications during partial or total Kafka cluster outages and proposed some architectural strategies to handle these types of service interruptions. The applications most heavily impacted by this type of outage are external interfaces that receive data, do not control request flow, and possibly perform some form of business transaction with the outside world before producing to Kafka. These applications are most commonly found in finance and written in languages …
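librdkafka itself is a C library, but its produce-side behavior is easy to show through confluent-kafka, the Python wrapper built on top of it. Here is a minimal sketch of asynchronous produces with a delivery callback and handling of a full local queue, which is exactly the kind of thing that matters while riding out an outage; the broker address, topic, and timeout values are illustrative.

```python
# A minimal sketch of outage-aware producing via confluent-kafka
# (librdkafka's Python wrapper); broker, topic, and timeouts illustrative.
from confluent_kafka import Producer

def delivery_report(err, msg):
    # Invoked from poll()/flush() once the broker acks, or the message fails.
    if err is not None:
        print(f"Delivery failed for key {msg.key()}: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}]@{msg.offset()}")

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    # How long librdkafka keeps retrying a message before giving up;
    # this bounds how long an outage can be ridden out locally.
    "message.timeout.ms": 120000,
})

for i in range(100):
    try:
        producer.produce("orders", key=str(i), value=f"order-{i}",
                         callback=delivery_report)
    except BufferError:
        # The local queue is full (e.g., the cluster is down): wait for
        # space to free up, then retry the produce once.
        producer.poll(1)
        producer.produce("orders", key=str(i), value=f"order-{i}",
                         callback=delivery_report)
    producer.poll(0)  # serve pending delivery callbacks

producer.flush(30)  # block until outstanding messages deliver or expire
```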
Key Constraints in Databricks Unity Catalog
Curated SQL | Hadoop
by Kevin Feasel
5M ago
Meagan Longoria gives us a warning: I've been building lakehouses using Databricks Unity Catalog for a couple of clients. Overall, I like the technology, but there are a few things to get used to. This includes the fact that primary key and foreign key constraints are informational only and not enforced. If you come from a relational database background, this unenforced constraint may bother you a bit, as you may be used to enforcing it to help with referential integrity. Read on to see what is available and why it can nonetheless be useful in some circumstances.
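To make the "informational only" point concrete, here is a minimal sketch of declaring primary and foreign keys on Unity Catalog tables from a notebook, where a spark session is already available. The catalog, schema, and table names are illustrative, and note that the final insert succeeds even though it violates the foreign key.

```python
# A minimal sketch of informational PK/FK constraints in Unity Catalog;
# names are illustrative, and nothing here is enforced at write time.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.dim_customer (
        customer_id BIGINT NOT NULL,
        name STRING,
        CONSTRAINT pk_dim_customer PRIMARY KEY (customer_id)
    )
""")

spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.fact_orders (
        order_id BIGINT NOT NULL,
        customer_id BIGINT,
        CONSTRAINT pk_fact_orders PRIMARY KEY (order_id),
        CONSTRAINT fk_orders_customer FOREIGN KEY (customer_id)
            REFERENCES main.sales.dim_customer (customer_id)
    )
""")

# This succeeds even if customer 999 does not exist in dim_customer: the FK
# is metadata for tools and optimizers, so integrity stays the pipeline's job.
spark.sql("INSERT INTO main.sales.fact_orders VALUES (1, 999)")
```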
Lakehouse Management in Fabric via mssparkutils
Curated SQL | Hadoop
by Kevin Feasel
5M ago
Sandeep Pawar scripts out some lakehouse work: At MS Ignite, Microsoft unveiled a variety of new APIs designed for working with Fabric items, such as workspaces, Spark jobs, lakehouses, warehouses, ML items, and more. You can find detailed information about these APIs here. These APIs will be critical in the automation and CI/CD of Fabric workloads. With the release of these APIs, a new method has been added to the mssparkutils library to simplify working with lakehouses. In this blog, I will explore the available options and provide examples. Please note that at the …
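As a rough sketch of what that looks like, the snippet below is hypothetical: the method names and parameters are assumptions based on the post's description and the usual mssparkutils conventions, so check mssparkutils.lakehouse.help() in your own environment for the real signatures.

```python
# A hypothetical sketch of the new lakehouse helpers in mssparkutils; method
# names and parameters are assumptions, not confirmed signatures. Lakehouse
# names are illustrative.
from notebookutils import mssparkutils

# Print the module's self-documentation to see what is actually available.
mssparkutils.lakehouse.help()

# Create a lakehouse in the current workspace (assumed signature).
new_lh = mssparkutils.lakehouse.create("sales_lh", "Lakehouse for sales data")

# Retrieve metadata for an existing lakehouse (assumed signature).
props = mssparkutils.lakehouse.get("sales_lh")
print(props)
```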
An Overview of Data Lake Operations with Apache NiFi
Curated SQL | Hadoop
by Kevin Feasel
6M ago
Lav Kumar gives us a 10,000-foot view: In the world of data-driven decision-making, ETL (Extract, Transform, Load) processes play a pivotal role. The effective management and transformation of data are essential to ensure that businesses can make informed choices based on accurate and relevant information. Data lakes have emerged as a powerful way to store and analyze massive amounts of data, and Apache NiFi is a robust tool for streamlining ETL processes in a data lake environment. Read on for a brief primer on NiFi and how some of its capabilities can assist in ETL and …
Apache Zookeeper Vulnerability
Curated SQL | Hadoop
by Kevin Feasel
6M ago
The Instaclustr team reviews an announcement: On October 11, 2023, the Apache ZooKeeper project announced that a security vulnerability had been identified in Apache ZooKeeper: CVE-2023-44981. The Apache ZooKeeper project has classified the severity of this CVE as critical. The CVSS (Common Vulnerability Scoring System) 3.x severity rating for this vulnerability from the NVD (National Vulnerability Database) is a base score of 9.1 (Critical). That's a rather high base score, and it comes about if you have the setting quorum.auth.enableSasl=true. Updating to the ZooKeeper …
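For reference, the vulnerability only applies when quorum SASL authentication is turned on. A zoo.cfg excerpt might look like the sketch below; the first property is the one named above, while the other two are typical companion settings rather than anything from the announcement.

```
# zoo.cfg excerpt (illustrative): the CVE applies when quorum SASL
# authentication is enabled via the first property below.
quorum.auth.enableSasl=true
quorum.auth.learnerRequireSasl=true
quorum.auth.serverRequireSasl=true
```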
Capturing a TCP Dump in an Azure Databricks Notebook
Curated SQL | Hadoop
by Kevin Feasel
6M ago
Stithi Panigrahi does some troubleshooting: Due to the potential impact on performance and storage costs, Azure Databricks clusters don't capture networking logs by default. Follow the instructions below if you need to capture a tcpdump to investigate networking issues related to the cluster. These steps will capture a TCP dump on each cluster node, both driver and workers, during the entire lifetime of the cluster. Click through for an initiation script, which generates the actual script, which itself generates the TCP dumps.
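In the same spirit as the post, here is a minimal sketch of generating such an init script from a notebook with dbutils.fs.put; the paths and tcpdump flags are illustrative, not the post's actual script.

```python
# A minimal sketch: write a cluster init script to DBFS that starts tcpdump
# on every node (driver and workers). Paths and flags are illustrative.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/start-tcpdump.sh",
    """#!/bin/bash
# Rotate capture files hourly (-G 3600) and at ~100 MB (-C 100) so a
# long-running cluster doesn't fill its disk with packet captures.
mkdir -p /tmp/tcpdump
nohup tcpdump -i any -w /tmp/tcpdump/capture-%Y%m%d-%H%M%S.pcap -G 3600 -C 100 &
""",
    True,  # overwrite if the script already exists
)
# Then attach dbfs:/databricks/init-scripts/start-tcpdump.sh as a cluster
# init script and restart the cluster so each node runs it at startup.
```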