Databricks Unity Catalog
Beginner's Hadoop
by beginnershadoop
10M ago
The Databricks Unity Catalog is a feature of the Databricks Unified Data Analytics Platform that lets you organize and manage metadata about your data assets, such as tables, databases, and views. It provides a centralized metadata repository that enables users to discover, understand, and collaborate on data assets within a Databricks environment. Unity Catalog integrates with various data sources and supports different metadata management capabilities. Some key features and benefits of the Databricks Unity Catalog include: Metadata Management: The Unity Catalog allows you to r ..read more
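To make the three-level namespace (catalog.schema.table) concrete, here is a minimal Python sketch of creating and querying Unity Catalog objects from a Databricks notebook; the catalog, schema, and table names are made up for illustration.

Code (Python):

from pyspark.sql import SparkSession

# In a Databricks notebook a SparkSession named `spark` already exists;
# getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

# Create a catalog and a schema inside it (requires the right privileges).
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.sales")

# Register a managed table under the catalog and query it by its full name.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_catalog.sales.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
""")
spark.sql("SELECT * FROM demo_catalog.sales.orders").show()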
What is airflow task decorator
Beginner's Hadoop
by beginnershadoop
10M ago
In Apache Airflow, the task decorator is a Python decorator used to define tasks within a Directed Acyclic Graph (DAG). Airflow is an open-source platform used to programmatically author, schedule, and monitor workflows. Each task within a DAG represents a unit of work that needs to be executed, such as running a script, executing a SQL query, or any other operation. The task decorator in Airflow allows you to define a task from a plain Python function: when you decorate a function with @task, it is wrapped as an airflow.decorators.TaskDecorator instance, and calling it inside a DAG creates the actual task. The decorator sets properties and at ..read more
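As a minimal sketch of the TaskFlow API (assuming a recent Airflow 2.x release), the DAG below defines two decorated tasks; the DAG id, schedule, and task bodies are illustrative only.

Code (Python):

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def example_taskflow():

    @task
    def extract():
        # Each @task-decorated function becomes a task in the DAG.
        return [1, 2, 3]

    @task
    def total(values):
        # Return values flow between tasks via XCom automatically.
        print(sum(values))

    total(extract())


# Calling the @dag-decorated function registers the DAG with Airflow.
example_taskflow()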
Delta Lake Architecture
Beginner's Hadoop
by beginnershadoop
1y ago
Delta Lake is an open-source storage layer that allows developers to build scalable and efficient data pipelines for big data workloads. It adds a transactional layer on top of existing data lakes, bringing reliability, performance, and flexibility to big data workflows. Delta Lake stores data in versioned Parquet files backed by an append-only transaction log and provides ACID transactions to ensure data consistency. It can be used with Apache Spark, which provides a scalable and distributed processing engine for big data workloads, and can be used to store data in a variety o ..read more
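As a minimal PySpark sketch (assuming the delta-spark package is available, as it is on Databricks), the snippet below writes a small Delta table, reads it back, and uses time travel; the path is a placeholder.

Code (Python):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])

# Every write is an ACID transaction recorded in the _delta_log directory.
df.write.format("delta").mode("append").save("/tmp/delta/events")

# Read the current version of the table.
spark.read.format("delta").load("/tmp/delta/events").show()

# Time travel: read an earlier version of the same table.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events").show()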
Data Architecture for Beginners
Beginner's Hadoop
by beginnershadoop
1y ago
Big data is a term used to describe large and complex data sets that require advanced computational and analytical tools to process and interpret. The field of big data has evolved rapidly in recent years, and with it a number of big data architecture models have emerged. These models are designed to help organizations manage and analyze big data more effectively. In this article, we will explore some of the most important big data architecture models and compare their strengths and weaknesses: Lambda Architecture, Kappa Architecture, Microservices Architecture, Event-Driven Archi ..read more
Apache Spark: WindowSpec & Window
Beginner's Hadoop
by beginnershadoop
4y ago
WindowSpec is a window specification that defines which rows are included in a window (frame), i.e. the set of rows that are associated with the current row by some relation. A WindowSpec is created from the following: a partition specification (Seq[Expression]), which defines which records are in the same partition (with no partition defined, all records belong to a single partition), and an ordering specification (Seq[SortOrder]), which defines how records in a partition are ordered and in turn determines the position of a record in a partition. The ordering c ..read more
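A minimal PySpark sketch of the same idea (column names and data are illustrative): the window below partitions by one column, orders by another, and a running total is computed over it.

Code (Python):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, 20.0), ("b", 1, 5.0)],
    ["group", "seq", "value"],
)

# Partition specification: rows with the same "group" share a partition.
# Ordering specification: within a partition, rows are ordered by "seq".
w = Window.partitionBy("group").orderBy("seq")

# With an ordering defined, sum() over the window yields a running total.
df.withColumn("running_total", F.sum("value").over(w)).show()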
Barrier Execution Mode in Spark
Beginner's Hadoop
by beginnershadoop
4y ago
The barrier execution mode is experimental and only handles limited scenarios; see SPIP: Barrier Execution Mode and the design doc. In case of a task failure, instead of restarting only the failed task, Spark aborts the entire stage and re-launches all tasks for that stage. Use the RDD.barrier transformation to mark the current stage as a barrier stage. barrier(): RDDBarrier[T] barrier simply creates an RDDBarrier that comes with the barrier-aware mapPartitions transformation. mapPartitions[S: ClassTag]( f: Iterator[T] => Iterator[S], preservesPart ..read more
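A minimal PySpark sketch of a barrier stage (requires Spark 2.4+; the computation itself is just a placeholder):

Code (Python):

from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), numSlices=4)


def work(iterator):
    ctx = BarrierTaskContext.get()
    # Every task of the barrier stage waits here until all tasks arrive;
    # if any task fails, the whole stage is retried.
    ctx.barrier()
    yield sum(iterator)


print(rdd.barrier().mapPartitions(work).collect())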
Impala Export to CSV
Beginner's Hadoop
by beginnershadoop
4y ago
Apache Impala is an open-source, massively parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop. In some cases, impala-shell is installed manually on machines that are not managed through Cloudera Manager. You can still launch impala-shell from those external machines and submit queries to a DataNode where impalad is running: start impala-shell and connect to the remote host by passing the appropriate hostname and port (21000 by default). To use Impala shell to connect to Impala daem ..read more
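A minimal sketch of exporting query results to CSV by driving impala-shell from Python; the host, query, and output file names are hypothetical, and 21000 is the default impalad port mentioned above.

Code (Python):

import subprocess

host = "datanode01.example.com:21000"
query = "SELECT * FROM sales.orders LIMIT 100"

subprocess.run(
    [
        "impala-shell",
        "-i", host,              # impalad host:port to connect to
        "-q", query,             # run this query non-interactively
        "-B",                    # plain delimited output instead of pretty-printing
        "--output_delimiter=,",  # separate fields with commas
        "-o", "orders.csv",      # write the result to a file
    ],
    check=True,
)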
Spark Structured Streaming and Streaming Queries
Beginner's Hadoop
by beginnershadoop
4y ago
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. Reference: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html ..read more
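A minimal PySpark sketch using the built-in rate source (so no external system is needed); the windowed count and console sink are illustrative.

Code (Python):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# A streaming DataFrame: one row per second with a timestamp and a value.
stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# Express the computation as if it were a batch query; Spark runs it
# incrementally as new data arrives.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()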
Add constant column in spark
Beginner's Hadoop
by beginnershadoop
4y ago
If we want to add a column with a constant (default) value, we can do so in Spark. In Spark 2.2 there are two ways to add a constant value to a column in a DataFrame: 1) using lit, 2) using typedLit. The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map. Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala): import org.apache.spark.sql.functions.typedLit df.withColumn("some_array", typedLit(Seq(1, 2, 3))) df.withColumn("some_struct", typedLit(("foo", 1, 0.3))) df.withCo ..read more
Salesforce connector in Spark
Beginner's Hadoop
by beginnershadoop
4y ago
Salesforce is a customer relationship management solution that brings companies and customers together. It's one integrated CRM platform that gives all your departments, including marketing, sales, commerce, and service, a single, shared view of every customer. Get the Salesforce connector from here. Code: import org.apache.spark.sql.SparkSession import org.apache.spark.sql._ object Sample { def main(arg: Array[String]) { val spark = SparkSession.builder(). appName("salesforce"). master("local[*]"). getOrCreate() val tableName = "account" val outp ..read more
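For comparison, a hedged PySpark sketch, assuming the connector linked in the post is the springml spark-salesforce package (com.springml.spark.salesforce); the credentials and SOQL query below are placeholders.

Code (Python):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("salesforce")
    .master("local[*]")
    .getOrCreate()
)

accounts = (
    spark.read
    .format("com.springml.spark.salesforce")   # assumed connector package
    .option("username", "user@example.com")
    .option("password", "password_plus_security_token")
    .option("soql", "SELECT Id, Name FROM Account")
    .load()
)

accounts.show()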