By Srinath Shankar and Prakash Chockalingam
Databricks is thrilled to announce our new optimized auto-scaling feature. The new Apache Spark™-aware resource manager leverages Spark shuffle and executor statistics to resize a cluster intelligently, improving resource utilization. When we tested long-running big data workloads, we observed cloud cost savings of up to 30%.
What’s the problem with current state-of-the-art auto-scaling approaches?
Today, every big data tool can auto-scale compute to lower costs. But most of these tools expect a static resource size allocated for a single job. Resource schedulers like YARN then take care of “coarse-grained” auto-scaling between different jobs, releasing resources back only after a Spark job finishes.
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --num-executors 50 \
An example spark-submit command that takes the number of executors required for the Spark job as a parameter.
This introduces two major problems:
Identifying the right number of executors required for a single job: How much compute resource does my job require to finish within an acceptable SLA? There is significant trial and error here to decide the right number of executors.
Sub-optimal resource utilization, which typically stems from over-provisioning. Users over-provision resources because:
Production Spark jobs typically have multiple Spark stages. Some stages might require huge compute resources compared to other stages. Users provide a number of executors based on the stage that requires maximum resources. Having such a static size allocated to an entire Spark job with multiple stages results in suboptimal utilization of resources.
The amount of data processed by ETL jobs fluctuates based on time of day, day of week, and other seasonalities like Black Friday. Typically resources are provisioned for the Spark job in expectation of maximum load. This is highly inefficient when the ETL job is processing small amounts of data.
To overcome the above problems, Apache Spark has a Dynamic Allocation option, as described here. But this requires setting up a shuffle service external to the executor on each worker node in the same cluster to allow executors to be removed without deleting the shuffle files that they write. And while the executor can be removed, the worker node is still kept alive so the external shuffle service can continue serving files.
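For reference, enabling Dynamic Allocation looks roughly like the following Spark configuration (the property names are Spark's own; the bounds and idle timeout shown are illustrative, not recommendations):

```properties
# Let Spark add and remove executors based on the task backlog
spark.dynamicAllocation.enabled              true
# Required so shuffle files survive executor removal
spark.shuffle.service.enabled                true
# Illustrative bounds on the executor count
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         24
# Remove an executor after it has been idle this long
spark.dynamicAllocation.executorIdleTimeout  60s
```

Note that even with these settings, the external shuffle service keeps the worker node alive after its executors are removed, which is exactly the limitation described above.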
Introducing Databricks Optimized Auto-Scaling
The new optimized auto-scaling service for compute resources allows clusters to scale up and down more aggressively in response to load and improves the utilization of cluster resources automatically without the need for any complex setup from users.
Traditional coarse-grained auto-scaling algorithms do not scale down the resources allocated to a Spark job while the job is running, mainly because they lack information on executor usage. Removing workers that have active tasks or in-use shuffle files would trigger task re-attempts and recomputation of intermediate data, which hurts performance and effective utilization, and therefore raises costs for the user. But when only a few tasks are active on a cluster (for example, when the Spark job exhibits skew, or when a particular stage has lower resource requirements), the inability to scale down also leads to poor utilization and higher costs. This is a major missed opportunity for traditional auto-scaling.
Databricks’ optimized auto-scaling solves this problem by periodically reporting detailed statistics on idle executors and the location of intermediate files within the cluster. The Databricks service uses this information to more precisely target workers to scale down when utilization is low. In particular the service can scale down and remove idle workers on an under-utilized cluster even when there are tasks running on other executors for the same Spark job. This behavior is different from traditional auto-scaling, which requires the entire Spark job to be finished to begin scale-down. During scale-down, the Databricks service removes a worker only if it is idle and does not contain any shuffle data that is being used by running queries. Therefore running jobs and queries are not affected during down-scaling.
Since Databricks can precisely target workers for scale-down under low utilization, clusters can be resized much more aggressively in response to load. In particular, under low utilization, Databricks clusters can be scaled down aggressively without killing tasks or recomputing intermediate results. This keeps wasted compute resources to a minimum while also maintaining the responsiveness of the cluster. And since Databricks can scale a cluster down aggressively, it also scales the cluster up aggressively in response to demand to keep responsiveness high without sacrificing efficiency.
The following section illustrates the behavior and benefits of the new auto-scaling feature when used to run a job in Databricks.
We have a genomics data pipeline that is periodically scheduled to run as a Databricks job on its own cluster. That is, each instance of the pipeline periodically spins up a cluster in Databricks, runs the pipeline, and shuts down the cluster after it is completed.
We ran identical instances of this pipeline on two separate clusters with identical compute configurations. In both instances the clusters were running Databricks Runtime 4.0 and configured to scale between 1 and 24 i3.2xl instances. The first cluster was set up to scale up and down in the traditional manner, and in the second we enabled the new Databricks optimized auto-scaling.
The following figure plots the number of executors deployed and the number of executors actually in use as the job progresses (x axis is time in minutes).
Figure 1. Traditional auto-scaling: Active executors vs total executors
Clearly, the number of deployed workers climbs to 24 and never comes down for the duration of the workload. That is, for a single Spark job, traditional auto-scaling is not much better than simply allocating a fixed amount of resources.
Figure 2. Databricks’ optimized auto-scaling: Active executors vs total executors
With Databricks’ optimized auto-scaling, the number of deployed workers tracks the workload usage more closely. In this case, aggressive auto-scaling results in 25% fewer resources being deployed over the lifetime of the workload, meaning a 25% cost savings for the user. The end-to-end runtime of the workload was only slightly higher (193 minutes with aggressive auto-scaling vs. 185 minutes).
You’ll get the new optimized auto-scaling algorithm when you run Databricks jobs on Databricks Runtime 3.4+ clusters that have the “Enable auto-scaling” flag selected. See Cluster Size and auto-scaling in the Databricks documentation for more information.
Start running your Spark jobs on the Databricks Unified Analytics Platform and start saving on your cloud costs by signing up for a free trial.
By Justin Olsson, Sr. Legal Counsel
With GDPR enforcement rapidly approaching (May 25, 2018), many companies are still trying to figure out how to comply. A big pain point, particularly for companies who utilize data lakes to store vast amounts of data, is how to comply with one of the main requirements under the GDPR – data subject requests, also known as “DSRs”.
What is a DSR?
One of the most operationally significant parts of the GDPR for companies is the data subject request. The GDPR provides all European data subjects (that is, any individual person located in Europe) with a set of enumerated rights related to their personal data including the right to:
access (i.e., the right to know what personal data a controller or processor has about the individual),
rectification (i.e., the right to update incorrect personal data),
erasure (i.e., the right to be forgotten), and
portability (i.e., the right to export personal data in a machine-readable format).
Companies have, unless the request is “complex” or “numerous”, thirty days from receipt of the data subject request to comply with the request (keeping in mind any applicable exceptions).
So what’s the big deal?
Finding data in a data lake is hard; being sure that you’ve found all data about a particular individual is very hard. And many data lakes do not even enable users to perform “delete” operations, even once the data is located, so actually removing it may be practically impossible. In the best case, finding and removing such data is computationally difficult, expensive, and time consuming. And if a company receives more than just a few data subject requests in a short period of time, the resources spent to comply with the requests could be significant. Further, failure to comply with the GDPR could result in significant penalties, potentially as high as €20 million (or even more – up to 4% of a company’s global annual revenues).
So that sounds bad. Is there anything that can be done?
Fortunately, Databricks offers a solution. Enter Databricks Delta, a unified data management system built into the Databricks platform, that brings data reliability and performance optimizations to cloud data lakes.
Databricks Delta’s structured data management system adds transactional capabilities to your data lake that enable you to easily and quickly search, modify, and clean your data using standard SQL DML statements (e.g. DELETE, UPDATE, MERGE INTO). To accomplish this, first ingest your raw data into Delta tables which adds metadata to your files. Once ingested, you can easily search and modify individual records within your Delta tables to meet DSR obligations. The final step is to make Delta your single source of truth by erasing any underlying raw data. This removes any lingering records from your raw data sets. We suggest setting up a retention policy with AWS or Azure of thirty days or less to automatically remove raw data so that no further action is needed to delete the raw data to meet DSR response timelines under the GDPR.
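As a sketch of that ingest step (the table name and path here are hypothetical), loading raw Parquet files into a Delta table can be a single CTAS statement, after which DSR searches run as ordinary SQL against the Delta table:

```sql
-- Ingest raw files into a Delta table (hypothetical names and path)
CREATE TABLE customer_data
USING DELTA
AS SELECT * FROM parquet.`/mnt/raw/customer-data`;

-- Records for a data subject can now be found with ordinary SQL
SELECT * FROM customer_data WHERE email = 'firstname.lastname@example.org';
```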
Can you provide an example of how this works?
Let’s say your organization received a DSR to delete information related to Justin Olsson (firstname.lastname@example.org). After ingesting your raw data into Delta tables, Databricks Delta would enable you to find and delete information related to that user by running two commands:
(1) DELETE FROM data WHERE email LIKE 'firstname.lastname@example.org';
(2) VACUUM data;
The first command identifies all records whose email column matches the address in the request and deletes them, rewriting the respective underlying files with the user’s data removed. The second command cleans up the Delta table, removing any stale files containing logically deleted records once they fall outside the retention period (7 days by default).
After running these commands, and waiting for your default retention period to delete the underlying raw files, you would be able to state that you had removed records relating to the user firstname.lastname@example.org from your data lake.
Okay, that sounds great, but if I put my data in a Delta table, won’t I be locked in? What if I want to go somewhere else?
Nope! Databricks Delta is architected with portability in mind. It stores data in an open file format (Parquet), so if you ever decide to stop using Delta, or need to feed a system that cannot read Delta tables, you can quickly and easily convert your data back into a format that other tools can read. Doing so, particularly on an ongoing basis, leaves you with the additional DSR obligation of deleting or exporting any personal data contained in the data that was moved out of Databricks Delta. But that data, too, benefits from having flowed through Databricks Delta: it will be in a much more structured format, dramatically simplifying that process as well.
With over 4 billion subscribers, Viacom is focused on delivering amazing viewing experiences to their global audiences. Core to this strategy is ensuring petabytes of streaming content is delivered flawlessly through web, mobile and streaming applications. This is critically important during popular live events like the MTV Video Music Awards.
Streaming this much video can strain delivery systems resulting in long load times, mid-stream freezes and other issues. Not only does this create a poor experience, but can also result in lost ad dollars. To combat this, Viacom set out to build a scalable analytics platform capable of processing terabytes of streaming data for real-time insights on the viewer experience.
After evaluating a number of technologies, Viacom found their solution in Amazon S3 and the Databricks Unified Analytics Platform powered by Apache SparkTM. The rapid scalability of S3 coupled with the ease and processing power of Databricks, enabled Viacom to rapidly deploy and scale Spark clusters and unify their entire analytics stack – from basic SQL to advanced analytics on large scale streaming and historical datasets – with a single platform. Viacom can now proactively monitor video feed quality enabling their teams to reduce the time to first frame by 33%.
Spark + AI Summit will be held in San Francisco on June 4-6, 2018. Check out the full agenda and get your ticket before it sells out! Register today with the discount code 5Reasons and get 15% off.
Convergence of Knowledge
For any Apache Spark enthusiast, these summits are the convergence of Spark knowledge. Used by a growing global community of enterprises, academics, contributors, and advocates, attendees have convened at these summits since 2013 to share knowledge. And this summer attendees will return to San Francisco—to an expanded scope and agenda.
Expansion of Scope
Today, unified analytics is paramount for building big data and Artificial Intelligence (AI) applications. AI applications require massive amounts of data to enhance and train machine learning models at scale, and so far Spark has been the only engine that combines large-scale data processing with the execution of state-of-the-art machine learning and AI algorithms in a unified manner.
“AI has always been one of the most exciting applications of big data and Apache Spark, so with this change, we are planning to bring in keynotes, talks and tutorials about the latest tools in AI in addition to the great data engineering and data science content we already have” — Matei Zaharia.
For this expanded scope and much more, here are my five reasons as a program chair why you should join us.
1. Keynotes from Distinguished Engineers, Academics and Industry Leaders
To support our expanded scope, we have added five tracks to cover AI and Use Cases, Deep Learning Techniques, Python and Advanced Analytics, Productionizing Machine Learning Models, and Hardware in the Cloud. Combined with all the other tracks, these sessions give you over 180 talks to choose from. And if you miss any sessions, you can peruse the recordings on the summit website later.
3. Apache Spark Training
Update your skills with training from Databricks’ best trainers, who have trained over 3,600 summit attendees. On a day dedicated to training, you can choose from four courses and stay abreast of the latest in Spark 2.3 and Deep Learning: Data Science with Apache Spark; Understand and Apply Deep Learning with Keras, TensorFlow, and Apache Spark; Apache Spark Tuning and Best Practices; and Apache Spark Essentials. Depending on your preference, you can register for each class on either the AWS or Azure cloud. Plus, we will offer a half-day Databricks Developer Certification for Apache Spark prep course, after which you can sit for the exam the same day. Get Databricks Certified!
4. Meetups
Apache Spark Meetups are reputed for their tech talks. At the summit meetups, you can learn what other Spark developers from all over are up to, mingle and enjoy the beverages and camaraderie in an informal setting, and ask burning questions.
5. City By The Bay
San Francisco is a city famed for its restaurants, cable cars, hills, Golden Gate Bridge, and vibrant nightlife. Take a breather after days of cerebral sessions, chill out at the Fisherman’s Wharf, visit MOMA, and much more…
We hope to see you in San Francisco!
With less than six weeks left, tickets are selling fast. If you haven’t registered yet, do so today with the discount code 5Reasons and get 15% off.
In collaboration with the local chapter of the Women in Big Data Meetup, and as part of the Databricks diversity team’s continuing effort to bring more women in the big data space on stage to share their subject-matter expertise, we hosted our second meetup at Databricks, featuring diverse and highly accomplished women from their respective technical fields as speakers at the Bay Area Spark Meetup.
For those who missed the meetup, which was moderated by Maddie Schults and Yvette Ramirez, below are the videos and links to the presentation slides. You can peruse the slides and view the videos at your leisure. To those who helped and attended, thank you for your participation and continued community support. And to those who wish to be part of this diversity effort, join your local Women in Big Data chapter and be a champion.
Bringing a Jewel from Python World to the JVM with Apache Spark, Arrow, and Spacy
By Holden Karau
Bringing a Jewel as a starter from the Python world to the JVM with Apache Spark, Arrow, and Spacy - YouTube
Just Enough DevOps for Data Scientists (Part II)
By Anya Bida
Just enough DevOps for Data Scientists Part II - YouTube
Our next Bay Area Spark Meetup will be held in June 2018 at Moscone Center in San Francisco as part of the Spark + AI pre-summit meetup. If you are not a member, join us today. If you have not yet registered for Spark + AI Summit, please do so now and use the BASMU code to get a 15% discount. We hope to see you there.
As a digital society built around data and devices, we have reached a pivotal juncture where data and Artificial Intelligence must be accessible to everyone. Riding this trend, many homes now contain smart devices such as the Amazon Echo or Google Home.
Yet these devices only offer limited computational power and AI capabilities. To remedy this problem, Databricks is proud to present the Data Brick™, a new all-in-one smart device that delivers the full power of Artificial Intelligence to every home.
The Data Brick is built to deliver exceptional sound. It excels at holding notebooks on a bookshelf or a workspace. It interconnects with all your home smart devices through a unified management console. And its language assistant, Bricky, is a polyglot, understanding verbal commands in both natural and programming languages.
The Data Brick runs Apache Spark™, a powerful technology that seamlessly distributes AI computations across a network of other Data Bricks. The unique form factor of the Data Brick means that multiple Data Bricks can be stacked on top of each other, forming a rack of bricks like servers in a data center, and communicate with each other to execute workloads. However, even a single Data Brick contains multiple cores and up to 1 TB of memory, so most users will find that a few Data Bricks, placed at convenient locations throughout their home, are sufficient for their AI needs.
Voice Communication and Computation
Current smart speakers are limited to pre-programmed queries. In contrast, the Data Brick can support arbitrarily complex computations through Apache Spark. Bricky, its language assistant, supports spoken SQL, Scala, Python, and R. Users can simply speak queries to the Data Brick anywhere, and Bricky will deliver the answers. She will read from all your data sources and generate reports for the busy analysts or CTO.
The Data Brick™: The building block of DataBricks' Unified Analytics Platform - YouTube
The Future: Enabling A Next-Generation Cloud
The Data Brick can perform arbitrary computations because of its unique form factor and networking capability. We plan to release a new version of the DataBricks Unified Analytics Platform on a public cloud of Data Bricks, called the Brick Cloud, which represents the latest advance in modular datacenter design. The Brick Cloud will offer tremendous computing power in a small volume to answer questions faster than ever.
An artist’s vision of the future Brick Cloud™. Credit: Nina Aldin Thune
At Databricks, we strive to make big data simple, and what is simpler than a Data Brick? The future is here and it is concrete. Try DataBricks today.
Have a wonderful April Fool’s Day!
Contributed by Tim Hunter, Nick Lee, Joy Xie, Jules Damji, and Reynold Xin, with special appearances from Alexandra Cong, Haoyi Li, Prakash Chockalingam, and Matei Zaharia.
Click is an open-source tool that lets you quickly and easily run commands against Kubernetes resources, without copy/pasting all the time, and that easily integrates into your existing command line workflows.
At Databricks we use Kubernetes, a lot. We deploy our services (of which there are many) in unique namespaces, across multiple clouds, in multiple regions. Each of those (service, namespace, cloud, region) must be specified to target a particular Kubernetes cluster and object. That means that crafting the correct kubectl command line can become a feat in and of itself. A slow, error-prone feat.
Because it is difficult to target the right Kubernetes resource with kubectl, our ability to deploy, debug, and fix problems is hampered. This is enough of a pain point that almost all of our engineers who regularly interact with Kubernetes have hacked up some form of shell alias that injects parameters into their kubectl commands to alleviate parts of this.
After many tedious revisions of our developers’ personal aliases, we felt like there had to be a better way. Out of this requirement, Click was born.
Click remembers the current Kubernetes “thing” (a “resource” in Kubernetes terms), making it easier for the operator to interact with that resource. This was motivated by a common pattern we noticed using, something like:

kubectl get pods
(copy the name of the pod)
kubectl logs [pasted pod]
(oh right, forgot the container)
kubectl logs -c [container] [pasted pod]
(then we want the events from the pod too, so we have to paste again)
Note that these different actions are often applied to the same object (be it a pod, service, deployment, etc.). We often want to inspect the logs, then check what version is deployed, or view associated events. This idea of a “current object” inspired us to start writing Click.
Another major motivation was the recognition that the command line was already an excellent tool for most of what we were trying to do. Something like “count how many log lines mention a particular alert” is elegantly expressible as a kubectl -> grep -> wc pipeline. We felt that pushing developers toward something like the Kubernetes dashboard (or some other GUI tool) would be a major downgrade in our ability to quickly drill down into the plethora of information available through kubectl (or really, through the Kubernetes API).
A caveat: Click isn’t intended to be used in a scripting context. kubectl is already excellent in that space, and we saw little value in trying to replace any of its functionality there. Click therefore doesn’t have a “run one command and exit” mode, but rather always functions as a REPL.
Overview of Click
Note first that Click is organized as a REPL. When you run it, you’re dropped into the “Click shell,” where you start running Click commands. So, enough prelude: what does Click actually look like? Below is a short clip showing Click in action.
Firstly, Click has comprehensive help built right in. You can just type help for a list and brief description of all the commands you can run. From within a Click shell, you can run any specific command with -h to get a full description of its usage. From experience, once the “REPL” model of interaction is understood, most users discover how to do what they want without too much trouble.
Click commands fall into four categories; each command does one of the following:
Set the current context or namespace (the combination of which I’ll call a “scope”)
Search the current scope for a resource
Select a resource returned from the search
Operate on the selected resource
The ctx command sets the current context, and the ns command the current namespace. Click remembers these settings and reflects them in your prompt.
To search for a resource, there are commands like pods or nodes. These return a list of all the resources that match in the currently set scope. Selecting an object is done by simply specifying its number in the returned list. Again, Click will reflect this selection in its prompt.
Once a resource is selected, we can run more active commands against that resource. For instance, the logs command will fetch logs from the selected pod, or the describe command will return an output similar to kubectl describe.
Click can pass any output to the shell via standard shell operators like |, >, or >>. This enables the above log line counting command to be issued as: logs -c container | grep alertName | wc -c, much as one would in your favorite shell.
The above model allows for quick iteration on a particular Kubernetes resource. You can quickly find the resource you care about, and then issue multiple commands against it. For instance, from the current scope, you could select a pod, get its description, see any recent events related to the pod, pull the logs from the foo container to a file, and then delete the pod as follows:
pods // search for pods in the current context and namespace
2 // select the second pod returned (assuming it’s the one you want)
describe // this will output a description of the pod
events // see recent events
logs -c foo > /tmp/podfoo.log // save logs to specified file
delete // delete the pod (this will ask for confirmation)
In the spirit of a picture being worth a thousand words, here is a screencast that shows off a few (but by no means all) of Click’s features.
You can install Click via Rust’s package manager, cargo, by running cargo install click. If you don’t have cargo, check out rustup.
This section covers some of the details of Click’s implementation. You can safely skip it if you’re mostly just interested in using Click.
Click is implemented in Rust. This was partly because the author just likes writing code in Rust, but also because Click is designed to stay running for a long time, so leaking memory would be unacceptable. I also wanted a fast (read: instant) start-up time, so people wouldn’t be tempted to reach for kubectl to “just run one quick command”. Finally, I wanted Click to crash as little as possible. To those ends, Rust is a pretty good choice.
The allure of unwrap
There’s been an effort made to remove as many references to unwrap from the Click codebase as possible (or to notate why it’s okay when we do use it). This is because unwrap can cause your Rust program to panic and exit in a rather user-unfriendly manner.
Rust without unwrap leads to code that has to explicitly handle most error cases, and in turn leads to a system that doesn’t die unexpectedly and that generally reports errors in a clean way to the user. Click is not perfect in this regard, but it’s quite good, and will continue to get better.
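To illustrate the difference, here is a hypothetical sketch (not code from Click’s codebase) contrasting a resource-selection parser written with unwrap against one that propagates the error with `?`:

```rust
use std::num::ParseIntError;

// With unwrap(): malformed input panics and kills the whole REPL session.
#[allow(dead_code)]
fn parse_selection_unwrap(input: &str) -> usize {
    input.trim().parse::<usize>().unwrap()
}

// With `?`: the error is handed back to the caller, which can report it
// cleanly and leave the REPL running.
fn parse_selection(input: &str) -> Result<usize, ParseIntError> {
    Ok(input.trim().parse::<usize>()?)
}

fn main() {
    // A bad selection becomes a friendly message instead of a panic.
    match parse_selection("not-a-number") {
        Ok(n) => println!("selected resource {}", n),
        Err(e) => println!("invalid selection: {}", e),
    }
}
```

The caller decides what an error means, which is what lets an interactive tool report bad input and keep going.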
Communicating with Kubernetes
For most operations, Click talks directly to the Kubernetes API server. The port-forward and exec commands currently leverage kubectl, as that functionality has not yet been replicated in Click. These commands may at some point become self-contained, but so far there hasn’t been a strong motivation to do so.
Click is in beta, and as such it’s experimental. Over time, with community contributions and feedback, we will continue to improve it and make it more robust. Additionally, at Databricks we use it on a daily basis, and continually fix any issues. Features we plan to add include:
The ability to apply commands to sets of resources at once. This would make it trivial, for instance, to pull the logs for all pods in a particular deployment.
Auto-complete for lots more commands
Support for all types of Kubernetes resources, including Custom resources
Support for patching and applying to modify your deployed resources
The ability to specify what colors Click uses, or to turn off color completely.
Upgrading to the latest version of Hyper (or possibly switching to Reqwest).
We are excited to release this tool to the Kubernetes community and hope many of you find it useful. Please file issues at https://github.com/databricks/click/issues and open pull requests for features you’d like to see. We are eager to see Click grow and improve based on community feedback.
The confluence of cloud, data, and AI is driving unprecedented change. The ability to utilize data and turn it into breakthrough insights is foundational to innovation today. Our goal is to empower organizations to unleash the power of data and reimagine possibilities that will improve our world.
To enable this journey, we are excited to announce the general availability of Azure Databricks, a fast, easy, and collaborative Apache® Spark™-based analytics platform optimized for Azure.
Fast, easy, and collaborative
Over the past five years, Apache Spark has emerged as the open source standard for advanced analytics, machine learning, and AI on Big Data. With a massive community of over 1,000 contributors and rapid adoption by enterprises, we see Spark’s popularity continue to rise.
Azure Databricks is designed in collaboration with Databricks whose founders started the Spark research project at UC Berkeley, which later became Apache Spark. Our goal with Azure Databricks is to help customers accelerate innovation and simplify the process of building Big Data & AI solutions by combining the best of Databricks and Azure.
To meet this goal, we developed Azure Databricks with three design principles.
First, enhance user productivity in developing Big Data applications and analytics pipelines. Azure Databricks’ interactive notebooks enable data science teams to collaborate using popular languages such as R, Python, Scala, and SQL and create powerful machine learning models by working on all their data, not just a sample data set. Native integration with Azure services further simplifies the creation of end-to-end solutions. These capabilities have enabled companies such as renewables.AI to boost the productivity of their data science teams by over 50 percent.
“Instead of one data scientist writing AI code and being the only person who understands it, everybody uses Azure Databricks to share code and develop together.”
– Andy Cross, Director, renewables.AI
Microsoft Azure Databricks | renewablesAI - YouTube
Second, enable our customers to scale globally without limits, working on big data with a fully managed, cloud-native service that automatically scales to meet their needs without high cost or complexity. Azure Databricks not only provides an optimized Spark platform, which is much faster than vanilla Spark, but also simplifies the process of building batch and streaming data pipelines and deploying machine learning models at scale. This makes the analytics process faster for customers such as E.ON and Lennox International, enabling them to accelerate innovation.
“Every day, we analyze nearly a terabyte of wind turbine data to optimize our data models. Before, that took several hours. With Microsoft Azure Databricks, it takes a few minutes. This opens a whole range of possible new applications.”
– Sam Julian, Product Owner, Data Services, E.ON
“At Lennox International, we have 1000’s of devices streaming data back into our IoT environment. With Azure Databricks, we moved from 60% accuracy to 94% accuracy on detecting equipment failures. Using Azure Databricks has opened the flood gates to all kinds of new use cases and innovations. In our previous process, 15 devices, which created 2 million records, took 6 hours to process. With Azure Databricks, we are able to process 25,000 devices – 10 billion records – in under 14 minutes.”
– Sunil Bondalapati, Director of Information Technology, Lennox International
Third, ensure that we provide our customers with the enterprise security and compliance they have come to expect from Azure. Azure Databricks protects customer data with enterprise-grade SLAs, simplified security and identity, and role-based access controls with Azure Active Directory integration. As a result, organizations can safeguard their data without compromising productivity of their users.
Azure is the best place for Big Data & AI
We are excited to add Azure Databricks to the Azure portfolio of data services and have taken great care to integrate it with other Azure services to unlock key customer scenarios.
High-performance connectivity to Azure SQL Data Warehouse, a petabyte-scale, elastic cloud data warehouse, allows organizations to build Modern Data Warehouses to load and process any type of data at scale for enterprise reporting and visualization with Power BI. It also enables data science teams working in Azure Databricks notebooks to easily access high-value data from the warehouse to develop models.
Integration with Azure IoT Hub, Azure Event Hubs, and Azure HDInsight Kafka clusters enables enterprises to build scalable streaming solutions for real-time analytics scenarios such as recommendation engines, fraud detection, predictive maintenance, and many others.
We are committed to making Azure the best place for organizations to unlock the insights hidden in their data to accelerate innovation. With Azure Databricks and its native integration with other services, Azure is the one-stop destination to easily unlock powerful new analytics, machine learning, and AI scenarios.
Get started today!
We are excited for you to try Azure Databricks! Get started today and let us know your feedback.
Structured Streaming in Apache Spark 2.0 decoupled micro-batch processing from its high-level APIs for a couple of reasons. First, it made the developer experience with the APIs simpler: the APIs did not have to account for micro-batches. Second, it allowed developers to treat a stream as an infinite table to which they could issue queries as they would a static table.
In this blog, we are going to illustrate the use of continuous processing mode, its merits, and how developers can use it to write continuous streaming applications with millisecond-level latency requirements. Let’s start with a motivating scenario.
Suppose we want to build a real-time pipeline to flag fraudulent credit card transactions. Ideally, we want to identify and deny a fraudulent transaction as soon as the culprit has swiped his/her credit card. However, we don’t want to delay legitimate transactions as that would annoy customers. This leads to a strict upper bound on the end-to-end processing latency of our pipeline. Given that there are other delays in transit, the pipeline must process each transaction within 10-20 ms.
Let’s try to build this pipeline in Structured Streaming. Assume that we have a user-defined function “isPaymentFlagged” that can identify the fraudulent transactions. To minimize latency, we’ll use a 0-second processing-time trigger, indicating that Spark should start each micro-batch as soon as the previous one finishes, with no delays. At a high level, the query looks like this.
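The post shows the query only at a high level; a minimal PySpark sketch of it might look like the following. The broker address, the "payments" topic name, and the body of the isPaymentFlagged UDF are assumptions for illustration; the real notebook defines these itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.appName("fraud-detection").getOrCreate()

# Hypothetical UDF standing in for isPaymentFlagged (its logic is not shown
# in the post).
is_payment_flagged = udf(lambda payment: payment.startswith("FRAUD"),
                         BooleanType())

payments = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")  # assumed broker address
    .option("subscribe", "payments")                  # assumed topic name
    .load())

flagged = (payments
    .select(col("value").cast("string").alias("payment"))
    .filter(is_payment_flagged(col("payment"))))

# processingTime="0 seconds": start each micro-batch as soon as the previous
# one ends, i.e., as fast as the engine can go.
query = (flagged.writeStream
    .format("console")
    .trigger(processingTime="0 seconds")
    .start())
```

This sketch requires a running Kafka broker and a Spark cluster, so it is not runnable standalone.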
You can see the complete code by downloading and importing this example notebook to your Databricks workspace (use a Databricks Community Edition). Let’s see what end-to-end latency we get.
The records are taking more than 100 ms to flow through Spark! While this is fine for many streaming pipelines, this is insufficient for this use case. Can our new Continuous Processing mode help us out?
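Per the Spark 2.3 API, switching a query to continuous mode is a one-line change to the trigger. A minimal sketch, assuming `flagged` is the streaming DataFrame built by the query described above:

```python
# Same query as before; only the trigger changes. The interval ("1 second")
# is the checkpoint (epoch) interval, not a batching interval -- records are
# still processed as soon as they arrive at the source.
query = (flagged.writeStream
    .format("console")
    .trigger(continuous="1 second")
    .start())
```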
Now we are getting less than 1 ms latency — more than two orders of magnitude improvement and well below our target latency! To understand why the latency was so high with micro-batch processing, and how continuous processing helped, we’ll have to dig into the details of the Structured Streaming engine.
Structured Streaming by default uses a micro-batch execution model. This means that the Spark streaming engine periodically checks the streaming source, and runs a batch query on new data that has arrived since the last batch ended. At a high-level, it looks like this.
In this architecture, the driver checkpoints progress by saving record offsets to a write-ahead log, which can then be used to restart the query. Note that the range of offsets to be processed in the next micro-batch is saved to the log before the micro-batch starts, in order to get deterministic re-executions and end-to-end exactly-once semantics. As a result, a record that is available at the source may have to wait for the current micro-batch to complete before its offset is logged and the next micro-batch processes it. At the record level, the timeline looks like this.
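As a plain-Python sketch of this control flow (an illustration only, not Spark's actual implementation), the driver loop logs each batch's offset range to the write-ahead log before processing it, which is why a newly arrived record must wait for the in-flight batch to finish:

```python
# Toy model of micro-batch execution: the driver logs the offset range to a
# write-ahead log (WAL) *before* running each batch, so re-execution after a
# failure is deterministic.

def run_micro_batches(source, process, wal, num_batches):
    committed = 0  # offset up to which data has already been processed
    for batch_id in range(num_batches):
        available = len(source)            # check the source for new data
        if available == committed:
            continue                       # nothing new this round
        # 1. Log this batch's offset range BEFORE processing it.
        wal.append((batch_id, committed, available))
        # 2. Run a batch query over just the new records.
        process(source[committed:available])
        committed = available
    return committed

source = ["tx1", "tx2", "tx3"]
processed = []
wal = []
run_micro_batches(source, processed.extend, wal, num_batches=2)
print(processed)  # ['tx1', 'tx2', 'tx3']
print(wal)        # [(0, 0, 3)]
```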
This results in latencies of 100s of milliseconds at best, between the time an event is available at the source and when the output is written to the sink.
We originally built Structured Streaming with this micro-batch engine to easily leverage the existing batch processing engine in Spark SQL, which had already been optimized for performance (see our past blogs on code generation and Project Tungsten). This allowed us to achieve high throughput with latencies as low as 100 ms. Over the past few years, while working with thousands of developers and hundreds of different use cases, we have found that second-scale latencies are sufficient for most practical streaming workloads such as ETL and real-time monitoring. However, some workloads (e.g., the aforementioned fraud detection use case) do benefit from even lower latencies, and that motivated us to build the Continuous Processing mode. Let us understand how this works.
In Continuous Processing mode, instead of launching periodic tasks, Spark launches a set of long-running tasks that continuously read, process, and write data. At a high level, the setup and the record-level timeline look like this (contrast them with the above diagrams of micro-batch execution).
Since events are processed and written to sink as soon as they are available in the source, the end-to-end latency is a few milliseconds.
Furthermore, query progress is checkpointed by an adaptation of the well-known Chandy-Lamport algorithm. Special marker records are injected into the input data stream of every task; we call them “epoch markers,” and the gaps between them “epochs.” When a task encounters a marker, it asynchronously reports the last offset it processed to the driver. Once the driver receives the offsets from all the tasks writing to the sink, it writes them to the aforementioned write-ahead log. Since checkpointing is completely asynchronous, the tasks can continue uninterrupted and provide consistent millisecond-level latencies.
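A toy illustration of the epoch-marker idea (plain Python, not Spark internals): each task reports its current offset when it sees a marker, and the driver commits an epoch to the write-ahead log only once every task has reported for that epoch.

```python
# Toy model of asynchronous, epoch-based checkpointing. Integer records stand
# in for epoch markers; string records stand in for normal data.

class Driver:
    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.reports = {}   # epoch -> {task_id: offset reported so far}
        self.wal = []       # committed (epoch, offsets) entries

    def report(self, epoch, task_id, offset):
        self.reports.setdefault(epoch, {})[task_id] = offset
        if len(self.reports[epoch]) == self.num_tasks:
            # Every task has reached this epoch boundary: commit it.
            self.wal.append((epoch, dict(self.reports[epoch])))

def run_task(task_id, stream, driver):
    offset = 0
    for record in stream:
        if isinstance(record, int):        # epoch marker carrying epoch number
            driver.report(record, task_id, offset)
        else:
            offset += 1                    # process a normal record

driver = Driver(num_tasks=2)
run_task(0, ["a", "b", 1, "c", 2], driver)
run_task(1, ["x", 1, "y", "z", 2], driver)
print(driver.wal)  # [(1, {0: 2, 1: 1}), (2, {0: 3, 1: 3})]
```

Because reporting is decoupled from processing, a real task never blocks on the driver's bookkeeping; this toy model runs the tasks sequentially only for simplicity.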
Experimental Release in Apache Spark 2.3.0
In Apache Spark 2.3.0, the Continuous Processing mode is an experimental feature, and only a subset of the Structured Streaming sources and DataFrame/Dataset/SQL operations are supported in it. Specifically, you can set the optional Continuous Trigger in queries that satisfy the following conditions:
Reads from supported sources like Kafka and writes to supported sinks like Kafka, memory, and console (memory and console are good for debugging).
Has only map-like operations (i.e., selections and projections like select, where, map, flatMap, and filter).
Uses only SQL functions other than aggregate functions and current-time-based functions like current_timestamp() and current_date().
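Putting those conditions together, a query eligible for the continuous trigger might look like the following PySpark sketch (the broker address, topic names, and checkpoint path are assumptions). Adding an aggregation such as groupBy().count() would make the query ineligible for this mode.

```python
# A continuous-mode-eligible query: supported source (Kafka), map-like
# operations only (selectExpr/where), supported sink (Kafka).
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")  # assumed broker address
    .option("subscribe", "input-topic")               # assumed topic name
    .load())

query = (events
    .selectExpr("CAST(value AS STRING) AS value")
    .where("value IS NOT NULL")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("topic", "output-topic")                  # assumed topic name
    .option("checkpointLocation", "/tmp/checkpoint")  # assumed path
    .trigger(continuous="1 second")
    .start())
```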
With the release of Apache Spark 2.3, developers can choose between two streaming modes, continuous and micro-batch, depending on their latency requirements. While the default Structured Streaming mode (micro-batch) offers acceptable latencies for most real-time streaming applications, you can now opt for continuous mode when you have millisecond-scale latency requirements.
Early last month, we announced our agenda for Spark + AI Summit 2018, with over 180 selected talks across 11 tracks, plus training courses. For this summit, we have added four new tracks to expand its scope to include Deep Learning Frameworks, AI, Productionizing Machine Learning, Hardware in the Cloud, and Python and Advanced Analytics.
These new tracks have fostered some unique AI-related submissions that require massive amounts of constantly updated training data to build state-of-the-art models as well as ways of using Apache Spark at scale.
When we want a worthy experience, whether picking a restaurant for dinner, choosing a book to read, or deciding which talk to attend at a conference, a nudge often guides, a nod affirms, and a review confirms. In the Harvard Business Review article “How to Get the Most Out of a Conference,” Rebecca Knight suggests listening to what experts have to say, being strategic with your time, and choosing the right sessions.
As the program chair of the summit, and to help you choose your sessions, I want to highlight a few talks from each of these new tracks that caught my attention: all seem prominent in their promise, possibility, and practicality, and all suggest how data and AI, brought together through unified analytics, engender the best AI applications. I wish to share them with you: