Databricks Unity Catalog
Beginner's Hadoop
by beginnershadoop
10M ago
The Databricks Unity Catalog is a feature of the Databricks Unified Data Analytics Platform that lets you organize and manage metadata about your data assets, such as tables, databases, and views. It provides a centralized metadata repository that enables users to discover, understand, and collaborate on data assets within a Databricks environment. Unity Catalog integrates with various data sources and supports different metadata management capabilities. Some key features and benefits of the Databricks Unity Catalog include: Metadata Management: The Unity Catalog allows you to r ..read more
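To make the three-level namespace (catalog.schema.table) concrete, here is a minimal Python sketch of creating and querying Unity Catalog objects from a Databricks notebook; the catalog, schema, and table names are made up for illustration.

Code (Python):

from pyspark.sql import SparkSession

# In a Databricks notebook a SparkSession named `spark` already exists;
# getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

# Create a catalog and a schema inside it (requires the right privileges).
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.sales")

# Register a managed table under the catalog and query it by its full name.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_catalog.sales.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
""")
spark.sql("SELECT * FROM demo_catalog.sales.orders").show()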
What is airflow task decorator
Beginner's Hadoop
by beginnershadoop
10M ago
In Apache Airflow, the task decorator is a Python decorator used to define tasks within a Directed Acyclic Graph (DAG). Airflow is an open-source platform used to programmatically author, schedule, and monitor workflows. Each task within a DAG represents a unit of work that needs to be executed, such as running a script, executing a SQL query, or any other operation. The task decorator in Airflow allows you to define a task from a plain Python function: when you decorate a function with @task, it is wrapped as an airflow.decorators.TaskDecorator instance, and calling it inside a DAG creates the actual task. The decorator sets properties and at ..read more
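As a minimal sketch of the TaskFlow API (assuming a recent Airflow 2.x release), the DAG below defines two decorated tasks; the DAG id, schedule, and task bodies are illustrative only.

Code (Python):

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def example_taskflow():

    @task
    def extract():
        # Each @task-decorated function becomes a task in the DAG.
        return [1, 2, 3]

    @task
    def total(values):
        # Return values flow between tasks via XCom automatically.
        print(sum(values))

    total(extract())


# Calling the @dag-decorated function registers the DAG with Airflow.
example_taskflow()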
Delta Lake Architecture
Beginner's Hadoop
by beginnershadoop
1y ago
Delta Lake is an open-source storage layer that allows developers to build scalable and efficient data pipelines for big data workloads. It adds a transactional layer on top of existing data lakes, bringing reliability, performance, and flexibility to big data workflows. Delta Lake stores data in versioned Parquet files backed by an append-only transaction log and provides ACID transactions to ensure data consistency. It can be used with Apache Spark, which provides a scalable and distributed processing engine for big data workloads, and can be used to store data in a variety o ..read more
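As a minimal PySpark sketch (assuming the delta-spark package is available, as it is on Databricks), the snippet below writes a small Delta table, reads it back, and uses time travel; the path is a placeholder.

Code (Python):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])

# Every write is an ACID transaction recorded in the _delta_log directory.
df.write.format("delta").mode("append").save("/tmp/delta/events")

# Read the current version of the table.
spark.read.format("delta").load("/tmp/delta/events").show()

# Time travel: read an earlier version of the same table.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events").show()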
Data Architecture for Beginners
Beginner's Hadoop
by beginnershadoop
1y ago
Big data is a term used to describe large and complex data sets that require advanced computational and analytical tools to process and interpret. The field of big data has evolved rapidly in recent years, and with it a number of big data architecture models have emerged. These models are designed to help organizations manage and analyze big data more effectively. In this article, we will explore some of the most important big data architecture models and compare their strengths and weaknesses: Lambda Architecture, Kappa Architecture, Microservices Architecture, Event-Driven Archi ..read more
Apache Spark: WindowSpec & Window
Beginner's Hadoop
by beginnershadoop
4y ago
WindowSpec is a window specification that defines which rows are included in a window (frame), i.e. the set of rows that are associated with the current row by some relation. A WindowSpec is created from the following: a partition specification (Seq[Expression]), which defines which records are in the same partition (with no partition defined, all records belong to a single partition), and an ordering specification (Seq[SortOrder]), which defines how records in a partition are ordered and in turn determines the position of a record in a partition. The ordering c ..read more
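A minimal PySpark sketch of the same idea (column names and data are illustrative): the window below partitions by one column, orders by another, and a running total is computed over it.

Code (Python):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 10.0), ("a", 2, 20.0), ("b", 1, 5.0)],
    ["group", "seq", "value"],
)

# Partition specification: rows with the same "group" share a partition.
# Ordering specification: within a partition, rows are ordered by "seq".
w = Window.partitionBy("group").orderBy("seq")

# With an ordering defined, sum() over the window yields a running total.
df.withColumn("running_total", F.sum("value").over(w)).show()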
Barrier Execution Mode in Spark
Beginner's Hadoop
by beginnershadoop
4y ago
The barrier execution mode is experimental and only handles limited scenarios; see SPIP: Barrier Execution Mode and the design doc. In case of a task failure, instead of restarting only the failed task, Spark aborts the entire stage and re-launches all tasks for that stage. Use the RDD.barrier transformation to mark the current stage as a barrier stage. barrier(): RDDBarrier[T] barrier simply creates an RDDBarrier that comes with the barrier-aware mapPartitions transformation. mapPartitions[S: ClassTag]( f: Iterator[T] => Iterator[S], preservesPart ..read more
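A minimal PySpark sketch of a barrier stage (requires Spark 2.4+; the computation itself is just a placeholder):

Code (Python):

from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), numSlices=4)


def work(iterator):
    ctx = BarrierTaskContext.get()
    # Every task of the barrier stage waits here until all tasks arrive;
    # if any task fails, the whole stage is retried.
    ctx.barrier()
    yield sum(iterator)


print(rdd.barrier().mapPartitions(work).collect())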
Impala Export to CSV
Beginner's Hadoop
by beginnershadoop
4y ago
Apache Impala is an open-source, massively parallel processing SQL query engine for data stored in a computer cluster running Apache Hadoop. In some cases, impala-shell is installed manually on machines that are not managed through Cloudera Manager. You can still launch impala-shell from those external machines and submit queries to a DataNode where impalad is running: start impala-shell and connect to the remote host by passing the appropriate hostname and port (21000 by default). To use Impala shell to connect to Impala daem ..read more
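A minimal sketch of exporting query results to CSV by driving impala-shell from Python; the host, query, and output file names are hypothetical, and 21000 is the default impalad port mentioned above.

Code (Python):

import subprocess

host = "datanode01.example.com:21000"
query = "SELECT * FROM sales.orders LIMIT 100"

subprocess.run(
    [
        "impala-shell",
        "-i", host,              # impalad host:port to connect to
        "-q", query,             # run this query non-interactively
        "-B",                    # plain delimited output instead of pretty-printing
        "--output_delimiter=,",  # separate fields with commas
        "-o", "orders.csv",      # write the result to a file
    ],
    check=True,
)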
Spark Structured Streaming and Streaming Queries
Beginner's Hadoop
by beginnershadoop
4y ago
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. Reference: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html ..read more
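A minimal PySpark sketch using the built-in rate source (so no external system is needed); the windowed count and console sink are illustrative.

Code (Python):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# A streaming DataFrame: one row per second with a timestamp and a value.
stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# Express the computation as if it were a batch query; Spark runs it
# incrementally as new data arrives.
counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()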
Add constant column in spark
Beginner's Hadoop
by beginnershadoop
4y ago
If we want to add a column with a constant (default) value, we can do so in Spark. In Spark 2.2 there are two ways to add a constant value to a column in a DataFrame: 1) using lit, 2) using typedLit. The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map. Spark 2.2 introduces typedLit to support Seq, Map, and Tuples (SPARK-19254), and the following calls should be supported (Scala): import org.apache.spark.sql.functions.typedLit df.withColumn("some_array", typedLit(Seq(1, 2, 3))) df.withColumn("some_struct", typedLit(("foo", 1, 0.3))) df.withCo ..read more
Salesforce connector in Spark
Beginner's Hadoop
by beginnershadoop
4y ago
Salesforce is a customer relationship management solution that brings companies and customers together. It's one integrated CRM platform that gives all your departments, including marketing, sales, commerce, and service, a single, shared view of every customer. Get the Salesforce connector from here. Code: import org.apache.spark.sql.SparkSession import org.apache.spark.sql._ object Sample { def main(arg: Array[String]) { val spark = SparkSession.builder(). appName("salesforce"). master("local[*]"). getOrCreate() val tableName = "account" val outp ..read more
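For comparison, a hedged PySpark sketch, assuming the connector linked in the post is the springml spark-salesforce package (com.springml.spark.salesforce); the credentials and SOQL query below are placeholders.

Code (Python):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("salesforce")
    .master("local[*]")
    .getOrCreate()
)

accounts = (
    spark.read
    .format("com.springml.spark.salesforce")   # assumed connector package
    .option("username", "user@example.com")
    .option("password", "password_plus_security_token")
    .option("soql", "SELECT Id, Name FROM Account")
    .load()
)

accounts.show()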