Follow Databricks Blog on Feedspot

Continue with Google
Continue with Facebook


The value of analytics and machine learning to organizations is well understood. Our recent CIO survey showed that 90% of organizations are investing in analytics, machine learning and AI. But we’ve also noted that the biggest barrier is getting the right data in the right place and in the right format. So we’ve partnered with Informatica to enable organizations to achieve more success by enabling new ways to discover, ingest and prepare data for analytics.

Ingesting Data Directly Into Delta Lake

Getting high volumes of data from hybrid data sources into a data lake in a way that is reliable and high-performant is difficult. Datasets are often dumped in unmanaged data lakes with no thought of a purpose. Data is dumped into data lakes with no consistent format, making it impossible to mix reads and appends. Data can also be corrupted in the process of writing it to a data lake, as writes can fail and leave partial datasets.

Informatica Cloud Data Ingestion (CDI) enables ingestion of data from hundreds of data sources. By integrating CDI with Delta Lake, a smart ingestion can take place with the benefits of Delta Lake. ACID transactions ensure that writes are complete, or are backed out if they fail, leaving no artifacts. Delta Lake schema enforcement ensures that the data types are correct and required columns are present, preventing bad data from causing data corruption. The seamless integration between Informatica CDI and Delta Lake enables data engineers to quickly ingest high volumes of data from multiple hybrid sources into a data lake with high reliability and performance


Every organization is limited in resources to format data for analytics. Ensuring the datasets can be used in ML models requires complex transformations that are time consuming to create. There are not enough highly skilled data engineers available to code advanced ETL transformations for data at scale. Furthermore, ETL code can be difficult to troubleshoot or modify.

The integration of Informatica Big Data Management (BDM) and the Databricks Unified Analytics Platform makes it easier to create high-volume data pipelines for data at scale. The drag and drop interface of BDM lowers the bar for teams to create data transformations by removing the need to write code to create data pipelines. And the easy to maintain and modify pipelines of BDM can leverage the high volume scalability of Databricks by pushing that work down for processing. The result is faster and lower cost development of high-volume data pipelines for machine learning projects. Pipeline creation and deployment is increased 5x, and pipelines are easier to maintain and troubleshoot.


Finding the right datasets for machine learning is difficult. Data scientists waste precious time looking for the right datasets for their models to help solve critical problems. They can’t identify which datasets are complete and properly formatted, and have been properly verified for usage as the correct datasets.

With the integration of Informatica Enterprise Data Catalog (EDC) with the Databricks Unified Analytics Platform, Data Scientists can now find the right data for creating models and performing analytics. Informatica’s CLAIRE engine uses AI and machine learning to automatically discover data and make intelligent recommendations for data scientists. Data scientists can find, validate, and provision their analytic models quickly, significantly reducing the time to value. Databricks can run ML models at unlimited scale to enable high-impact insights. And EDC can now track data in Delta Lake as well, making it part of the catalog of enterprise data.


Tracing the lineage of data processing for analytics has been nearly impossible. Data Engineers and Data Scientists can’t provide any proof of lineage to show where the data came from. And when data is processed for creating models, identifying which version of a dataset, model, or even which analytics frameworks and libraries were used has become so complex it has moved beyond our capacity for manual tracking.

With the integration of Informatica EDC, along with Delta Lake and MLflow running inside of Databricks, Data Scientists can verify lineage of data from the source, track the exact version of data in the Delta Lake, and track and reproduce models, frameworks and libraries used to process the data for analytics. This ability to track Data Science decisions all the way back to the source provides a powerful way for organizations to be able to audit and reproduce results as needed to demonstrate compliance.

We are excited about these integrations and the impact they will have on making organizations successful, by enabling them to automate data pipelines and provide better insights into those pipelines. For more information, register for this webinar https://dbricks.co/INFA19.


Try Databricks for free. Get started today.

The post Databricks and Informatica Accelerate Development and Complete Data Governance for Intelligent Data Pipelines appeared first on Databricks.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Databricks’ commitment to education is at the center of the work we do. Through Instructor-Led Training, Certification, and Self-Paced Training, Databricks Academy provides strong pathways for users to learn Apache Spark™ and Databricks to push their knowledge to the next level.

Our latest offering is a series of short videos introducing the Natural Language Processing technique, Latent Semantic Analysis (LSA). This series explains the conceptual framework of the technique and how the Databricks Runtime for Machine Learning can be used to apply the technique to a body of text documents using Scikit-Learn and Apache Spark.

If you’d like to follow along with the videos on your own computer, simply download the Databricks notebook. If you don’t have a Databricks account yet, get started for free on Databricks Community Edition.

If you’d like to dive deeper into Machine Learning using Databricks, check out our self-paced course Introduction to Data Science and Machine Learning / AWS (also available on Azure) at Databricks Academy.

Introduction to Latent Semantic Analysis

Introduction to Latent Semantic Analysis (1/5) - YouTube

This video introduces the core concepts in Natural Language Processing and the Unsupervised Learning technique, Latent Semantic Analysis (LSA). The purposes and benefits of the technique are discussed. In particular, the video highlights how the technique can aid in gaining an understanding of latent, or hidden, aspects of a body of documents—in addition to reducing the dimensionality of the original dataset.

A Trivial Implementation of LSA using Scikit-Learn

A Trivial Implementation of LSA using Scikit Learn (2/5) - YouTube

This video introduces the steps in a full LSA Pipeline and shows how they can be implemented in Databricks Runtime for Machine Learning using the open-source libraries Scikit-Learn and Pandas.

These steps are:

This video uses a trivial list of strings as the body of documents so that you can compare your own intuition to the results of the LSA. After completing the process, we examine two byproducts of the LSA—the dictionary and the encoding matrix—in order to gain an understanding of how the documents are encoded in topic space.

A Second LSA

A Second LSA (3/5) - YouTube

Here we work through the same steps from the previous video in a second full LSA Pipeline, once more in Databricks Runtime for Machine Learning using the open-source libraries Scikit-Learn and Pandas.

This video uses a slightly more complicated the body of documents: strings of text from two popular children’s books. After completing the process, we examine two byproducts of the LSA—the dictionary and the encoding matrix—in order to gain an understanding of how the documents are encoded in topic space. Finally, we plot the resulting documents in their topic-space encoding using the open source library Matplotlib.

Improving the LSA with a TFIDF

Improving the LSA with a TFIDF (4/5) - YouTube

This video works through a third full LSA Pipeline using Databricks’ Runtime for Machine Learning and the open-source libraries Scikit-Learn and Pandas.

Here we iterate on the previous LSA Pipeline by using an alternate method, Term Frequency-Inverse Document Frequency, to prepare the Document-Term Matrix. After completing the process, the video examines two byproducts of the LSA—the dictionary and the encoding matrix—in order to gain an understanding of how the documents are being encoded in topic space. Finally, the video plots the resulting documents in their topic-space encoding using the open source library Matplotlib and compares the plot to the plot prepared in the previous video.

Latent Semantic Analysis with Apache Spark

Latent Semantic Analysis with Apache Spark (5/5) - YouTube

In this video, we begin looking at a new, larger dataset: the 20 newsgroups dataset. In order to work with this larger dataset, we move the analysis pipeline to Apache Spark using the Scala programming language. This video introduces a new type of NLP-specific preprocessing: lemmatization. We also discusses key differences between performing NLP in Scikit-Learn and Apache Spark.

We hope that you find these videos informative, as well as entertaining! The full video playlist is here. If you’d like to dive deeper into Machine Learning using Databricks, check out our self-paced course Introduction to Data Science and Machine Learning / AWS (also available on Azure) at Databricks Academy,


Try Databricks for free. Get started today.

The post New videos from Databricks Academy: Introduction to Natural Language Processing—Latent Semantic Analysis appeared first on Databricks.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Diego Link is VP of Engineering at Tilting Point

Tilting Point is a new-generation games partner that provides top development studios with expert resources, services, and operational support to optimize high quality live games for success. Through its user acquisition fund and its world-class technology platform, Tilting Point funds and runs performance marketing management and live games operations to help developers achieve profitable scale.

At Tilting Point, we were running daily / hourly batch jobs for reporting on game analytics. We wanted to make our reporting near real-time and make sure that we get insights in 5 to 10 mins. We also wanted to make our in-game live-ops decisions based on real-time player behavior for giving real time data to a bundles and offer system, provide up-to-the-minute alerting on LiveOPs changes that actually might have unforeseen detrimental effects and even alert on service interruptions in game operations. Additionally, we had to store encrypted Personally Identifiable Information (PII) data separately for GDPR purposes.

How data flows and associated challenges

We have a proprietary SDK that developers integrate with to send data from game servers to an ingest server hosted in AWS. This service removes all PII data and then sends the raw data to an Amazon Firehose endpoint. Firehose then dumps the data in JSON format continuously to S3.

To clean up the raw data and make it available quickly for analytics, we considered pushing the continuous data from Firehose to a message bus (e.g. Kafka, Kinesis) and then use Apache Spark’s Structured Streaming to continuously process data and write to Delta Lake tables. While that architecture sounds ideal for low latency requirements of processing data in seconds, we didn’t have such low latency needs for our ingestion pipeline. We wanted to make the data available for analytics in a few minutes, not seconds. Hence we decided to simplify our architecture by eliminating a message bus and instead using S3 as a continuous source for our structured streaming job. But the key challenge in using S3 as a continuous source is identifying files that changed recently.

Listing all files every few minutes has 2 major issues:

  • Higher latency: Listing all files in a directory with a large number of files has high overhead and increases processing time.
  • Higher cost: Listing lot of files every few minutes can quickly add to the S3 cost.
Leveraging Structured Streaming with Blob Store as Source and Delta Lake Tables as Sink

To continuously stream data from cloud blob storage like S3, we use Databricks’ S3-SQS source. The S3-SQS source provides an easy way for us to incrementally stream data from S3 without the need to write any state management code on what files were recently processed. This is how our ingestion pipeline looks:

  • Configure Amazon S3 event notifications to send new file arrival information to SQS via SNS.
  • We use the S3-SQS source to read the new data arriving in S3. The S3-SQS source reads the new file names that arrived in S3 from SQS and uses that information to read the actual file contents in S3. An example code below:
spark.readStream \
  .format("s3-sqs") \
  .option("fileFormat", "json") \
  .option("queueUrl", ...) \
  .schema(...) \
  • Our structured streaming job then cleans up and transforms the data. Based on the game data, the streaming job uses the foreachBatch API of Spark streaming and writes to 30 different Delta Lake tables.
  • The streaming job produces lot of small files. This affects performance of downstream consumers. So, an optimize job runs daily to compact small files in the table and store them as right file sizes so that consumers of the data have good performance while reading the data from Delta Lake tables. We also run a weekly optimize job for a second round of compaction.

Architecture showing continuous data ingest into Delta Lake Tables

The above Delta Lake ingestion architecture helps in the following ways:

  • Incremental loading: The S3-SQS source incrementally loads the new files in S3. This helps quickly process the new files without too much overhead in listing files.
  • No explicit file state management: There is no explicit file state management needed to look for recent files.
  • Lower operational burden: Since we use S3 as a checkpoint between Firehose and structured streaming jobs, the operational burden to stop streams and re-process data is relatively low.
  • Reliable ingestion: Delta Lake uses optimistic concurrency control to offer ACID transactional guarantees. This helps with reliable data ingestion.
  • File compaction: One of the major problems with streaming ingestion is tables ending up with a large number of small files that can affect read performance. Before Delta Lake, we had to setup a different table to write the compacted data. With Delta Lake, thanks to ACID transactions, we can compact the files and rewrite the data back to the same table safely.
  • Snapshot isolation: Delta Lake’s snapshot isolation allows us to expose the ingestion tables to downstream consumers while data is being appended by a streaming job and modified during compaction.
  • Rollbacks: In case of bad writes, Delta Lake’s Time Travel helps us rollback to a previous version of the table.

In this blog, we walked through our use cases and how we do streaming ingestion using Databricks’ S3-SQS source into Delta Lake tables efficiently without too much operational overhead to make good quality data readily available for analytics.


Try Databricks for free. Get started today.

The post How Tilting Point Does Streaming Ingestion into Delta Lake appeared first on Databricks.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Every enterprise today wants to accelerate innovation by building AI into their business. However, most companies struggle with preparing large datasets for analytics, managing the proliferation of ML frameworks, and moving models in development to production.

AWS and Databricks are presenting a series of Dev Day events where we will cover best practices for enterprises to use powerful open source technologies to simplify and scale your ML efforts. We’ll discuss how to leverage Apache Spark™, the de-facto data processing and analytics engine in enterprises today, for data preparation as it unifies data at massive scale across various sources. You’ll also learn how to use ML frameworks (i.e. Tensorflow, XGBoost, Scikit-Learn, etc.) to train models based on different requirements. And finally, you can learn how to use MLflow to track experiment runs between multiple users within a reproducible environment, and manage the deployment of models to production on AWS Sagemaker.

Join us at the half-day workshop near you to learn how unified analytics can bring data science and engineering together to accelerate your ML efforts. This free workshop will give you the opportunity to:

  • Learn how to build highly scalable and reliable pipelines for analytics
  • Get deeper insights into Apache Spark and Databricks, and managing data using Delta Lakes.
  • Train a model against data and learn best practices for working with ML frameworks (i.e. – XGBoost, Scikit-Learn, etc.)
  • Learn about MLflow to track experiments, share projects and deploy models in the cloud with AWS Sagemaker
  • Network and learn from your ML and Apache Spark peers

Join us in these cities:

Austin, TX  | McLean, VA | Dallas, TX | Atlanta, GA | Cambridge, MA
Santa Monica, CA | Chicago, IL | Toronto, ON | New York, NY


Try Databricks for free. Get started today.

The post AWS + Databricks – Developer Day Events appeared first on Databricks.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

This is a community guest blog from Sim Simeonov, the founder & CTO of Swoop and IPM.ai.

Pre-aggregation is a common technique in the high-performance analytics toolbox. For example, 10 billion rows of website visitation data per hour may be reducible to 10 million rows of visit counts, aggregated by the superset of dimensions used in common queries, a 1000x reduction in data processing volume with a corresponding decrease in processing costs and waiting time to see the result of any query. Further improvements could come from computing higher-level aggregates, e.g., by day in the time dimension or by the site as opposed to by URL.

In this blog, we introduce the advanced HyperLogLog functionality of the open-source library spark-alchemy and explore how it addresses data aggregation challenges at scale. But first, let’s explore some of the challenges.

The Challenges of Reaggregation

Pre-aggregation is a powerful analytics technique… as long as the measures being computed are reaggregable. In the dictionary, aggregate has aggregable, so it’s a small stretch to invent reaggregable as having the property that aggregates may be further reaggregated. Counts reaggregate with SUM, minimums with MIN, maximums with MAX, etc. The odd one out is distinct counts, which are not reaggregable. For example, the sum of the distinct count of visitors by site will typically not be equal to the distinct count of visitors across all sites because of double counting: the same visitor may visit multiple sites.

The non-reaggregability of distinct counts has far-reaching implications. The system computing distinct counts must have access to the most granular level of data. Further, queries that return distinct counts have to touch every row of granular data.

When it comes to big data, distinct counts pose an additional challenge: during computation, they require memory proportional to the size of all distinct values being counted. In recent years, big data systems such as Apache Spark and analytics-oriented databases such as Amazon Redshift have introduced functionality for approximate distinct counting, a.k.a., cardinality estimation, using the HyperLogLog (HLL) probabilistic data structure. To use approximate distinct counts in Spark, replace COUNT(DISTINCT x) with approx_count_distinct(x [, rsd]). The optional rsd argument is the maximum estimation error allowed. The default is 5%. HLL performance analysis by Databricks indicates that Spark’s approximate distinct counting may enable aggregations to run 2-8x faster compared to when precise counts are used, as long as the maximum estimation error is 1% or higher. However, if we require a lower estimation error, approximate distinct counts may actually take longer to compute than precise distinct counts.

A 2-8x reduction in query execution time is a solid improvement on its own, but it comes at the cost of an estimation error of 1% or more, which may not be acceptable in some situations. Further, 2-8x reduction gains for distinct counts pale in comparison to the 1000x gains available through pre-aggregation. What can we do about this?

Revisiting HyperLogLog

The answer lies in the guts of the HyperLogLog algorithm. In the partitioned MapReduce pseudocode, the way Spark processes, HLL looks like this:

  1. Map (for each partition)
    • Initialize an HLL data structure, called an HLL sketch
    • Add each input to the sketch
    • Emit the sketch
  2. Reduce
    • Merge all sketches into an “aggregate” sketch
  3. Finalize
    • Compute approximate distinct count from the aggregate sketch

Note that HLL sketches are reaggregable: when they are merged in the reduce operation, the result is an HLL sketch. If we serialize sketches as data, we can persist them in pre-aggregations and compute the approximate distinct counts at a later time, unlocking 1000x gains. This is huge!

There is another, subtler, but no less important, benefit: we are no longer bound by the practical requirement to have estimation errors of 1% or more. When pre-aggregation allows 1000x gains, we can easily build HLL sketches with very, very small estimation errors. It’s rarely a problem for a pre-aggregation job to run 2-5x slower if there are 1000x gains at query time. This is the closest to a free lunch we can get in the big data business: significant cost/performance improvements without a negative trade-off from a business standpoint for most use cases.

Introducing Spark-Alchemy: HLL Native Functions

Since Spark does not provide this functionality, Swoop open-sourced a rich suite of native (high-performance) HLL functions as part of the spark-alchemy library. Take a look at the HLL docs, which have lots of examples. To the best of our knowledge, this is the richest set of big data HyperLogLog processing capabilities, exceeding even BigQuery’s HLL support.

The following diagram demonstrates how spark-alchemy handles initial aggregation (via hll_init_agg), reaggregation (via hll_merge) and presentation (via hll_cardinality).

If you are wondering about the storage cost of HLL sketches, the simple rule of thumb is that a 2x increase in HLL cardinality estimation precision requires a 4x increase in the size of HLL sketches. In most applications, the reduction in the number of rows far outweighs the increase in storage due to the HLL sketches.

error sketch_size_in_bytes
0.005 43702
0.01 10933
0.02 2741
0.03 1377
0.04 693
0.05 353
0.06 353
0.07 181
0.08 181
0.09 181
0.1 96
HyperLogLog Interoperability

The switch from precise to approximate distinct counts and the ability to save HLL sketches as a column of data has eliminated the need to process every row of granular data at final query time, but we are still left with the implicit requirement that the system working with HLL data has to have access to all granular data. The reason is that there is no industry-standard representation for HLL data structure serialization. Most implementations, such as BigQuery’s, use undocumented opaque binary data, which cannot be shared across systems. This interoperability challenge significantly increases the cost and complexity of interactive analytics systems.

A key requirement for interactive analytics systems is very fast query response times. This is not a core design goal for big data systems such as Spark or BigQuery, which is why interactive analytics queries are typically executed by some relational or, in some cases, NoSQL database. Without HLL sketch interoperability at the data level, we’d be back to square one.

To address this issue, when implementing the HLL capabilities in spark-alchemy, we purposefully chose an HLL implementation with a published storage specification and [built-in support for Postgres-compatible databases]((https://github.com/citusdata/postgresql-hll) and even JavaScript. This allows Spark to serve as a universal data pre-processing platform for systems that require fast query turnaround times, such as portals & dashboards. The benefits of this architecture are significant:

  • 99+% of the data is managed via Spark only, with no duplication
  • 99+% of processing happens through Spark, during pre-aggregation
  • Interactive queries run much, much faster and require far fewer resources

In summary, we have shown how the commonly-used technique of pre-aggregation can be efficiently extended to distinct counts using HyperLogLog data structures, which not only unlocks potential 1000x gains in processing speed but also gives us interoperability between Apache Spark, RDBMSs and even JavaScript. It’s hard to believe, but we may have gotten very close to two free lunches in one big data blog post, all because of the power of HLL sketches and Spark’s powerful extensibility.

Advanced HLL processing is just one of the goodies in spark-alchemy. Check out what’s coming and let us know which items on the list are important to you and what else you’d like to see there.

Last but most definitely not least, the data engineering and data science teams at Swoop would like to thank the engineering and support teams at Databricks for partnering with us to redefine what is possible with Apache Spark. You rock!


Try Databricks for free. Get started today.

The post Advanced Analytics with Apache Spark appeared first on Databricks.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Spark + AI Summit 2019, the world’s largest data and machine learning conference for the Apache Spark™ Community, brought nearly 5000 data scientists, engineers, and business leaders to San Francisco’s Moscone Center to find out what’s coming next. Watch the keynote recordings today and learn more about the latest product announcements for Apache Spark, MLflow, and our newest open source addition, Delta Lake!

Product Announcements

The Spark + AI Summit keynotes included several major product announcements. Reynold Xin, Apache Spark PMC member and number-one code committer to Spark, opened the summit by presenting the upcoming work planned for Spark 3.0 later this year, with more than 1000 improvements, features, and bug fixes, ranging from Hydrogen-accelerator aware scheduling, Spark Graph, Spark on Kubernetes, ANSI-SQL Parser, and many more.

Reynold also announced a new release bringing the Pandas DataFrame API to Spark, under a new open-source project called Koalas. Pandas has long been the Python standard to manipulate and analyze data, particularly for small and medium-sized datasets, and the project opens up a more frictionless progression to large data sets on Spark. With compatible API syntax, data scientists trained on Pandas can now use Koalas to transition easily to working on larger, distributed data sets geared for production environments.

Ali Ghodsi, CEO and Co-founder of Databricks, announced the open-source release of Delta Lake, a storage layer that brings increased reliability and quality to data lakes. Previously, data lakes frequently faced garbage-in-garbage-out issues that made data quality too low for data science and machine learning, resulting in large, expensive “data swamps.” This project brings a suite of new features to data lakes, including ACID transactions, schema enforcement, and even data time travel, which help ensure data integrity for downstream analytics and projects. Customers who previously used Databricks Delta gave extremely positive feedback for the core problems that it solved for them, and we’re excited to open-source the project for the larger community. The ecosystem of Apache Spark, MLflow, and now Delta Lake, continues to expand to solve end-to-end data and ML challenges.

Rohan Kumar, CVP Azure Data from Microsoft, made a number of announcements:

  • Azure Machine Learning support for MLflow, with Microsoft joining as an open-source supporter of MLflow
  • .NET support for Apache Spark, to bring more developers into the Spark ecosystem

Matei Zaharia, Databricks Chief Technologist, announced the next new set of components for the open-source MLflow project with MLflow Workflows and MLflow Model Registry. These modules further extend management of the end-to-end machine learning lifecycle with multistep pipelines and model management. Matei also announced the upcoming work around MLflow 1.0, with a stabilized API for long-term usage and additional feature releases. Managed MLflow is also now Generally Available on AWS and Azure.

Keynote Speakers

In addition to the product announcements was an extensive lineup of keynotes from industry luminaries, across a variety of topics. Turing Award winner David Patterson shared insights on the golden age of computer architectures and his vision for more open designs. Netflix VP of Data Science and Analytics Caitlin Smallwood presented on how Netflix uses data throughout the company, from predicting what content Netflix viewers will want to watch to investing in content production itself. Michael I. Jordan shared principles around human-centered AI and how a marketplace strategy to applying algorithms and data could create greater economic value. Timnit Gebru of Google Brain and Black in AI, spoke about biases and error rates found for different gender and racial groups in data and AI systems and the need for improved standards around such systems. Google’s Anitha Vijayakumar shared some of the latest features and developments with TensorFlow 2.0. Jitendra Malik of Facebook AI Research gavean overview into the rapid advances in visual understanding and computer vision with deep-learning techniques.

Jan Neumann and Jim Forsythe from Comcast shared how they architected a scalable data and machine learning platform with Delta Lake, MLflow, and Databricks to support their AI-powered voice remotes, including enriching petabytes of user session data and feeding it into an agile environment for model development and deployment. Check out the keynote recordings to hear perspectives from all the speakers.

Sessions and Trainings

Deeper technical content and tutorials continued outside the keynotes, with more than 120 sessions across over a dozen tracks, with speakers from Yelp, AirBNB, Lyft, Nike, Starbucks, Optum, FINRA, IBM, Verizon, Tencent, and many others. Topics varied widely, including stream processing, quality monitoring, testing, model serving, TensorFlow, SciKit-Learn, Keras, PyTorch, data pipelines, migrations, NLP, data governance, performance optimization, predictive modeling, graph algorithms, and many more. Sold out hands-on trainings dived deep into data engineering, data science, Spark programming, certification, deep learning, and machine learning.

Women in Unified Analytics

Female leaders shared their perspectives across a series of events sponsored by the Databricks Diversity Committee. With speakers from Netflix, Stanford, LinkedIn, Workday, Apple, Google, and more, women brought thought leadership and perspectives to attendees at the Bay Area Spark Meetup, Women’s Breakfast, and Lunch Tech Talk + Panel.

What’s Next

Spark+AI Summit 2019 keynote videos are now available to see the newest product announcements and thought leadership. Follow @Databricks on Twitter or subscribe to our newsletter if you want to be notified whenever new content becomes available. You can also learn Apache Spark today on our free Databricks Community Edition or build a production Spark application with a free Databricks trial. As always, thanks for your support, and we look forward to seeing you in Spark + AI Summit Europe in Amsterdam in October and next year as well!


Try Databricks for free. Get started today.

The post Spark + AI Summit 2019 Product Announcements and Recap. Watch the keynote recordings today! appeared first on Databricks.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 
Databricks Blog by Dillon Bostwick And Brenner Heintz - 2w ago

Managing cloud infrastructure and provisioning resources can be a headache that DevOps engineers are all too familiar with. Even the most capable cloud admins can get bogged down with managing a bewildering number of interconnected cloud resources – including data streams, storage, compute power, and analytics tools.

Take, for example, the following scenario: a customer has completed creating a Databricks workspace, and they want to connect a Databricks cluster to a Redshift cluster in AWS. The diagram below demonstrates the resulting state if all of these steps are completed correctly, as well as how data flows between each resource.

Achieving this state can be a lengthy process, and each configuration step involves significant substeps. For example, configuring the IAM roles to access S3 from Databricks alone requires a 7 step process.

Complex as this setup may be, it is by no means uncommon. Many customers have looked to cloud automation processes to simplify and speed the deployment of cloud resources, but these processes can pose challenges of their own. In general, we’ve found that many of the challenges in cloud automation include:

  • Scalability– Adding new resources to an existing cloud deployment can become exponentially more difficult and cumbersome due to resolving dependencies between cloud resources.
  • Modularity – Many deployment processes are repeatable and inter-dependent (for example, deploying to Redshift also requires a connection to S3 for staging the results).
  • Consistency – Tracking a deployment state may simplify remediation and reduces risk, but it is difficult to maintain and resolve.
  • Lifecycle management – Even if you can audit changes to some cloud resources, it may be unclear what actions are necessary to update an entire end-to-end state (such as the infrastructure state demonstrated in the above diagram).

To address these issues for our customers, Databricks is introducing a solution to automate your cloud infrastructure. Databricks Cloud Automation leverages the power of Terraform, an open source tool for building, changing, and versioning cloud infrastructure safely and efficiently. It offers an intuitive graphical user interface along with pre-built, “batteries included” Terraform modules that make it easier to connect common cloud resources to Databricks.

Keep in mind that many of these setup tasks such as VPC Peering and S3 authentication are specific to AWS. In Azure Databricks, for example, a connection to an Azure SQL Data Warehouse is simply a matter of authenticating with AAD, as the network connectivity is self-managed.

In developing Databricks Cloud Automation, we aim to:

  • Accelerate the deployment process through automation.
  • Democratize the cloud infrastructure deployment process to non-DevOps/cloud specialists.
  • Reduce risk by maintaining a replicable state of your infrastructure.
  • Provide a universal, “cloud-agnostic” solution.

With this new tool, connecting your cloud resources to Databricks is faster and simpler than ever. Let’s take a look at some of the reasons our customers are using Databricks Cloud Automation.

A graphical user interface to democratize Databricks cloud deployments

Databricks democratizes the deployment process by presenting a streamlined user interface that is easy and accessible, so you can be comfortable deploying cloud resources on Databricks without any DevOps experience. This high-level user interface allows us to manage cloud configurations behind the scenes, prompting you only for essential information.

An elegant solution for tracking infrastructure state

Aside from the simple setup, the tool’s most powerful feature is its ability to locally track and maintain the current and prior state of each cloud resource. As many DevOps engineers are aware, cloud resources often form a complex web of dependencies, each resource relying on others to function properly.

For example, a seemingly simple change to a single ACL entry can cause a cascade of errors downstream (and migraines for DevOps engineers). In the past, each of those dependent resources would need to be manually identified and to perform troubleshooting.

Terraform makes life easier by maintaining a “diff” between the desired state of resources and their current state. When it identifies a change, Terraform updates its internal “resource graph” and creates an ordered execution plan, automating the resolution of all cascading changes that would typically be necessary to keep these connections functioning.

A modular framework for your cloud infrastructure

For companies looking to dramatically scale their cloud infrastructure – now or in the future – Databricks Cloud Automation provides a simple interface to connect resources to Databricks using Terraform’s powerful infrastructure management capabilities. While casual users appreciate the graphical user interface and quick setup time, more experienced users appreciate the modular, extensible design principles that the tool embodies – allowing companies to rapidly grow their Databricks infrastructure without the hassle of complicated and easily-broken manual configurations and the risk of developing a monolithic architecture.

Terraform employs a high-level, declarative syntax known as HCL, allowing users to concisely define the end state of the resources that they intend to connect to Databricks. Many users prefer this style of quick, high-level resource declaration, which allows them to connect resources to Databricks right out of the box, while still giving them the flexibility to manually configure resources to their hearts’ content.

When a new resource is added, Terraform updates its internal resource graph, automatically resolves dependencies between resources and creates a new execution plan to seamlessly integrate the new resource into the existing infrastructure. Users can then view a summary of the changes Terraform will make before committing, by calling terraform plan.

Modules that can be shared, versioned and reused

One of the tool’s top features is the ability to create and save custom configurations as modules that can be reused or called by other modules. As a loose analogy, imagine using modules like a software engineer might use a class – to recreate an object, or to create a subclass that inherits properties from a superclass. For example, an S3 bucket configuration can be reused when creating a Redshift cluster, or a user can copy the configuration for a development environment directly to a production environment to ensure that they match.

This emphasis on repeatable, scriptable modules and “infrastructure as code” make it incredibly easy and efficient to scale up your cloud infrastructure. Modules can be shared amongst team members, edited, reviewed and even versioned as code – allowing DevOps engineers to make quick, iterative changes on the fly without fear of breaking their systems. This approach cuts the time-to-launch for new resource deployments by orders of magnitude and fits in nicely with many of the project management methodologies used today.

Connect to any IaaS provider

Terraform is “cloud agnostic” – it can be used to connect to any cloud provider or other systems – so DevOps engineers aren’t locked into a single ecosystem, and can connect resources between different providers easily.

They can also have confidence that they will not have to raze and rebuild their cloud infrastructure from the ground up if they decide to choose a new cloud vendor or connect a new system. In fact, there is a robust online community devoted to publishing robust modules and providers for nearly every cloud resource and use case.

Example: Connecting an S3 bucket to Databricks using the GUI

Let’s take a look at an example of how easy it is to set up and maintain your cloud infrastructure using Terraform with Databricks, by connecting an existing S3 bucket to Databricks using IAM roles (see here for the manual instructions). You will need:

  • A previously created S3 bucket. Make note of the name and region. For a list of region identifiers, click here.
  • AWS access key and secret key – to find or create your credentials, from the AWS console, navigate to IAM → Users → Security Credentials. If the S3 bucket was created by a different user, you’ll need the access key and secret key for their account, too.
  • Name of the IAM role you used to connect Databricks to AWS. You can find that here.

First, using the command line, let’s download and install the Databricks Cloud Automation package, which includes Terraform:

pip install databricks-cloud-automation

To launch the web-based GUI, enter databricks-cloud-manager in the command line, then navigate to the following address in a web browser:

Here you’ll find examples of cloud infrastructure that you can add to Databricks using Terraform. Follow these instructions to get your S3 bucket connected:

  1. Click Select under s3_to_databricks_via_iam.
  2. Enter the credentials for the S3 bucket we’re connecting to. Under aws_region, enter the region that you use to connect to AWS. In our case, since we’re using US West (Oregon), we enter region code us-west-2.
  3. Under databricks_deployment_role, enter the name of the IAM role you used to allow Databricks to access clusters on AWS. In our case, we enter the role name databricks.
  4. Under custom_iam_name_role, enter a brand new name for an IAM role that we’ll create in order to access the S3 bucket.
  5. Under aws_foreign_acct_access_key, aws_foreign_acct_secret_key, and aws_foreign_acct_region, leave these blank if your S3 bucket is under the same AWS account as the account you use to connect to Databricks. If you’ve got access to an S3 bucket owned by a different AWS user, enter those keys here.

  1. Submit the form. You’ll see a summary of the plan, including resources that will be added, changed, or deleted. Review the list of proposed changes thoroughly, and select Apply changes to infrastructure. If you’ve entered valid credentials, you’ll get a page indicating that you successfully applied all changes, along with a list of everything that was changed. (Note that this added new resources, but it also updated some existing resources – for example, it might have added a line to an IAM policy that already existed). Scroll to the bottom of the success page and copy the s3_role_instance_profile written under the Output section, as seen below.

  1. Sign in to your Databricks deployment. Click the Account icon in the upper right, and select Admin Console. Navigate to the IAM Roles tab, paste the role you copied in the previous step, and click Add, as shown below.

Congratulations! You’ve set up your first piece of cloud infrastructure, and managing it is now easier than ever. You can now launch clusters with your new IAM role, and connect S3 to the Databricks file system (DBFS) to access your data using the following command in a Databricks notebook cell:


Terraform is now tracking the state of our newly created infrastructure, and we can view the state by entering terraform show in the command line. If our resource has changed in any way, we can run terraform apply to repair it and any dependent cloud resources.

Connecting an S3 bucket to Databricks using the Command Line

If you prefer to use the command line to execute tasks, you can still get Databricks connected to an S3 bucket using Terraform using the Terraform CLI directly. This will allow you to leverage Terraform’s advanced features that are not easily accessible via the GUI, such as terraform import. To access the modules directly, run the following at a command prompt to install directly from the source:

git clone https://github.com/databrickslabs/databricks-cloud-automation.git
cd databricks-cloud-automation
python setup.py install

Once you have installed from source, simply navigate to the modules folder, and cd into the folder for a module (s3_to_databricks_via_iam, for the sake of our example). From there, run terraform init to initialize Terraform, then run terraform apply to enter your AWS credentials in the prompts that follow.

(Note: You can apply with the -var-file flag to specify input variables in a separate JSON or HCL file)

When you’re all done, copy the Instance Profile ARN provided, and paste it into Databricks via the Admin Console as we did in step 7 above. Voilà – you’ve connected Databricks to S3! Using the IAM role you’ve set up, you’ll be able to read and write data back and forth between Databricks and your S3 bucket seamlessly.

Connecting a Redshift Cluster to Databricks

Let’s revisit the example we proposed at the introduction of this blog – the most complex of our examples so far – to see how much easier the setup can be.

Imagine that you’ve just gotten started with Databricks, and you want to connect your company’s existing AWS Redshift cluster and S3 buckets so that you can get started. Normally, this would involve a time-consuming, procedural approach, requiring writing down various IDs and ARNs, and the following steps:

  1. VPC Peering to the VPC containing the Redshift cluster, including adding new security group rules and route table entries
  2. Create a new IAM role and attach it to the Databricks cluster
  3. Create an S3 bucket with a policy that references the new IAM role
  4. Grant AssumeRole permissions between the Databricks EC2 instance policy and the new role

The diagram below illustrates the complexity of setting up this architecture. Tedious as it may be, this is a real-life example that many of our customers face – and one that we can make significantly easier by using Databricks Cloud Automation.

Notice that in this example, in addition to connecting to a Redshift cluster, we are also connecting to an S3 bucket, just like we’ve done in the last two examples. Here is where the beauty of Terraform’s modular, “infrastructure as code” approach comes into play – since the S3 bucket module has already been built, there’s no need to “reinvent the wheel” and build a new S3 connection from scratch. Instead, we can call upon that already-built module, and add it like an interlocking puzzle piece that fits nicely into our existing resource graph. Likewise, we are also calling a separate “VPC peering” module which can even be refitted to set up VPC peering to resources other than just Redshift.

Just as before, we can use the Databricks Cloud Automation GUI to simplify and expedite this process. After calling databricks-cloud-manager from the command line, we then visit in our browser and select redshift_to_databricks. In addition to the credentials needed to set up the S3 bucket, we will also need:

  • Redshift cluster ID
  • Databricks VPC ID
  • Enterprise workspace ID (leave this blank if you are using a multi-tenant deployment; otherwise, contact Databricks to determine your Workspace ID.)

Once the proper credentials are entered, the Databricks Cloud Manager will configure all of the resource dependencies automatically, and let you know if there are any action steps you need to take. This process, though naturally still requiring you to find credentials for your resources, can be orders of magnitude faster than the alternative.


Whether you’re connecting to a single cloud resource or scaling your company’s infrastructure to meet your users’ growing demands, the Databricks Cloud Manager can dramatically reduce the time it takes to get you up and running. It’s popular among our customers due to its easy to use yet powerful features, including:

  • A graphical user interface for quickly deploying your cloud resources without deep DevOps expertise.
  • A high-level, declarative style that automates dependency and configuration management, allowing you to skip to an end result that simply works.
  • “Infrastructure as code” paradigm, allowing you to reuse and modify modules to quickly scale your deployment without increasing the complexity.
  • “Cloud agnostic” architecture, allowing you to connect to resources from different providers and systems, seamlessly.

Connecting your resources to Databricks is easier than ever, and with the power of the Databricks Unified Analytics Platform, harnessing the power of the cloud to find insights in your data is just a click away. If you’d like more information about this project, contact us, or talk to your Databricks representative.


Try Databricks for free. Get started today.

The post Efficient Databricks Deployment Automation with Terraform appeared first on Databricks.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 
Try this notebook in Databricks

Detecting fraudulent patterns at scale is a challenge, no matter the use case. The massive amounts of data to sift through, the complexity of the constantly evolving techniques, and the very small number of actual examples of fraudulent behavior are comparable to finding a needle in a haystack while not knowing what the needle looks like. In the world of finance, the added concerns with security and the importance of explaining how fraudulent behavior was identified further increases the complexity of the task.

To build these detection patterns, a team of domain experts often comes up with a set of rules that define fraudulent behavior. A typical workflow may include a subject matter expert in the financial fraud detection space putting together a set of requirements for a particular behavior. A data scientist may then take a subsample of the available data and build a model using these requirements and possibly some known fraud cases. To put the pattern in production, a data engineer may convert the resulting model to a set of rules with thresholds, often implemented using SQL.

This approach allows the financial institution to present a clear set of characteristics that led to the identification of fraud that is compliant with the General Data Protection Regulation (GDPR). However, this approach also poses numerous difficulties. The implementation of the detection pattern using a hardcoded set of rules is very brittle. Any changes to the pattern would take a very long time to update. This, in turn, makes it difficult to keep up with and adapt to the shift in fraudulent behaviors that are happening in the current marketplace.

Additionally, the systems in the workflow described above are often siloed, with the domain experts, data scientists, and data engineers all compartmentalized. The data engineer is responsible for maintaining massive amounts of data and translating the work of the domain experts and data scientists into production level code. Due to a lack of common platform, the domain experts and data scientists have to rely on sampled down data that fits on a single machine for analysis. This leads to difficulty in communication and ultimately a lack of collaboration.

In this blog, we will showcase how to convert several such rule-based detection use cases to machine learning use cases on the Databricks platform, unifying the key players in fraud detection: domain experts, data scientists, and data engineers. We will learn how to create a fraud-detection data pipeline and visualize the data leveraging a framework for building modular features from large data sets. We will also learn how to detect fraud using decision trees and Apache Spark MLlib. We will then use MLflow to iterate and refine the model to improve its accuracy.

Solving with ML

There is a certain degree of reluctance with regard to machine learning models in the financial world as they are believed to offer a “black box” solution with no way of justifying the identified fraudulent cases. GDPR requirements, as well as financial regulations, make it seemingly impossible to leverage the power of machine learning. However, several successful use cases have shown that applying machine learning to detect fraud at scale can solve a host of the issues mentioned above.

Training a supervised machine learning model to detect financial fraud is very difficult due to the low number of actual confirmed examples of fraudulent behavior. However, the presence of a known set of rules that identify a particular type of fraud can help create a set of synthetic labels and an initial set of features. The output of the detection pattern that has been developed by the domain experts in the field has likely gone through the appropriate approval process to be put in production. It produces the expected fraudulent behavior flags and may, therefore, be used as a starting point to train a machine learning model. This simultaneously mitigates three concerns:

  1. The lack of training labels,
  2. The decision of what features to use,  and
  3. Having an appropriate benchmark for the model.

Training a machine learning model to recognize the rule-based fraudulent behavior flags offers a direct comparison with the expected output via a confusion matrix. Provided that the results closely match the rule-based detection pattern, this approach helps gain confidence in machine learning based fraud detection with the skeptics. The output of this model is very easy to interpret and may serve as a baseline discussion of the expected false negatives and false positives when compared to the original detection pattern.

Furthermore, the concern with machine learning models being difficult to interpret may be further assuaged if a decision tree model is used as the initial machine learning model. Because the model is being trained to a set of rules, the decision tree is likely to outperform any other machine learning model. The additional benefit is, of course, the utmost transparency of the model, which will essentially show the decision-making process for fraud, but without human intervention and the need to hard code any rules or thresholds. Of course, it must be understood that the future iterations of the model may utilize a different algorithm altogether to achieve maximum accuracy. The transparency of the model is ultimately achieved by understanding the features that went into the algorithm. Having interpretable features will yield interpretable and defensible model results.

The biggest benefit of the machine learning approach is that after the initial modeling effort, future iterations are modular and updating the set of labels, features, or model type is very easy and seamless, reducing the time to production. This is further facilitated on the Databricks Unified Analytics Platform where the domain experts, data scientists, data engineers may work off the same data set at scale and collaborate directly in the notebook environment. So let’s get started!

Ingesting and Exploring the Data

We will use a synthetic dataset for this example. To load the dataset yourself, please download it to your local machine from Kaggle and then import the data via Import Data – Azure and AWS

The PaySim data simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The below table shows the information that the data set provides:

Exploring the Data

Creating the DataFrames – Now that we have uploaded the data to Databricks File System (DBFS), we can quickly and easily create DataFrames using Spark SQL

# Create df DataFrame which contains our simulated financial fraud detection dataset
df = spark.sql("select step, type, amount, nameOrig, oldbalanceOrg, newbalanceOrig, nameDest, oldbalanceDest, newbalanceDest from sim_fin_fraud_detection")

Now that we have created the DataFrame, let’s take a look at the schema and the first thousand rows to review the data.

# Review the schema of your data
|-- step: integer (nullable = true)
|-- type: string (nullable = true)
|-- amount: double (nullable = true)
|-- nameOrig: string (nullable = true)
|-- oldbalanceOrg: double (nullable = true)
|-- newbalanceOrig: double (nullable = true)
|-- nameDest: string (nullable = true)
|-- oldbalanceDest: double (nullable = true)
|-- newbalanceDest: double (nullable = true)
Types of Transactions

Let’s visualize the data to understand the types of transactions the data captures and their contribution to the overall transaction volume.

To get an idea of how much money we are talking about, let’s also visualize the data based on the types of transactions and on their contribution to the amount of cash transferred (i.e. sum(amount)).

Rules-based Model

We are not likely to start with a large data set of known fraud cases to train our model. In most practical applications, fraudulent detection patterns are identified by a set of rules established by the domain experts. Here, we create a column called label based on these rules.

# Rules to Identify Known Fraud-based
df = df.withColumn("label", 
                       (df.oldbalanceOrg <= 56900) & (df.type == "TRANSFER") & (df.newbalanceDest <= 105)) | ( (df.oldbalanceOrg > 56900) & (df.newbalanceOrig <= 12)) | ( (df.oldbalanceOrg > 56900) & (df.newbalanceOrig > 12) & (df.amount > 1160000)
                           ), 1

Visualizing Data Flagged by Rules

These rules often flag quite a large number of fraudulent cases. Let’s visualize the number of flagged transactions. We can see that the rules flag about 4% of the cases and 11% of the total dollar amount as fraudulent.

Selecting the Appropriate Machine Learning Models

In many cases, a black box approach to fraud detection cannot be used. First, the domain experts need to be able to understand why a transaction was identified as fraudulent. Then, if action is to be taken, the evidence has to be presented in court. The decision tree is an easily interpretable model and is a great starting point for this use case. Read this blog “The wise old tree” on decision trees to learn more.

Creating the Training Set

To build and validate our ML model, we will do an 80/20 split using .randomSplit. This will set aside a randomly chosen 80% of the data for training and the remaining 20% to validate the results.

# Split our dataset between training and test datasets
(train, test) = df.randomSplit([0.8, 0.2], seed=12345)

Creating the ML Model Pipeline

To prepare the data for the model, we must first convert categorical variables to numeric using .StringIndexer. We then must assemble all of the features we would like for the model to use. We create a pipeline to contain these feature preparation steps in addition to the decision tree model so that we may repeat these steps on different data sets. Note that we fit the pipeline to our training data first and will then use it to transform our test data in a later step.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Encodes a string column of labels to a column of label indices
indexer = StringIndexer(inputCol = "type", outputCol = "typeIndexed")

# VectorAssembler is a transformer that combines a given list of columns into a single vector column
va = VectorAssembler(inputCols = ["typeIndexed", "amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "orgDiff", "destDiff"], outputCol = "features")

# Using the DecisionTree classifier model
dt = DecisionTreeClassifier(labelCol = "label", featuresCol = "features", seed = 54321, maxDepth = 5)

# Create our pipeline stages
pipeline = Pipeline(stages=[indexer, va, dt])

# View the Decision Tree model (prior to CrossValidator)
dt_model = pipeline.fit(train)
Visualizing the Model

Calling display() on the last stage of the pipeline, which is the decision tree model, allows us to view the initial fitted model with the chosen decisions at each node. This helps to understand how the algorithm arrived at the resulting predictions.


Visual representation of the Decision Tree model

Model Tuning

To ensure we have the best fitting tree model, we will cross-validate the model with several parameter variations. Given that our data consists of 96% negative and 4% positive cases, we will use the Precision-Recall (PR) evaluation metric to account for the unbalanced distribution.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Build the grid of different parameters
paramGrid = ParamGridBuilder() \
.addGrid(dt.maxDepth, [5, 10, 15]) \
.addGrid(dt.maxBins, [10, 20, 30]) \

# Build out the cross validation
crossval = CrossValidator(estimator = dt,
                          estimatorParamMaps = paramGrid,
                          evaluator = evaluatorPR,
                          numFolds = 3)  
# Build the CV pipeline
pipelineCV = Pipeline(stages=[indexer, va, crossval])

# Train the model using the pipeline, parameter grid, and preceding BinaryClassificationEvaluator
cvModel_u = pipelineCV.fit(train)
Model Performance

We evaluate the model by comparing the Precision-Recall (PR) and Area under the ROC curve (AUC) metrics for the training and test sets. Both PR and AUC appear to be very high.

# Build the best model (training and test datasets)
train_pred = cvModel_u.transform(train)
test_pred = cvModel_u.transform(test)

# Evaluate the model on training datasets
pr_train = evaluatorPR.evaluate(train_pred)
auc_train = evaluatorAUC.evaluate(train_pred)

# Evaluate the model on test datasets
pr_test = evaluatorPR.evaluate(test_pred)
auc_test = evaluatorAUC.evaluate(test_pred)

# Print out the PR and AUC values
print("PR train:", pr_train)
print("AUC train:", auc_train)
print("PR test:", pr_test)
print("AUC test:", auc_test)

# Output:
# PR train: 0.9537894984523128
# AUC train: 0.998647996459481
# PR test: 0.9539170535377599
# AUC test: 0.9984378183482442

To see how the model misclassified the results, let’s use matplotlib and pandas to visualize our confusion matrix.

Balancing the Classes

We see that the model is identifying 2421 more cases than the original rules identified. This is not as alarming as detecting more potential fraudulent cases could be a good thing. However, there are 58 cases that were not detected by the algorithm but were originally identified. We are going to attempt to improve our prediction further by balancing our classes using undersampling.  That is, we will keep all the fraud cases and then downsample the non-fraud cases to match that number to get a balanced data set. When we visualized our new data set, we see that the yes and no cases are 50/50.

# Reset the DataFrames for no fraud (`dfn`) and fraud (`dfy`)
dfn = train.filter(train.label == 0)
dfy = train.filter(train.label == 1)

# Calculate summary metrics
N = train.count()
y = dfy.count()
p = y/N

# Create a more balanced training dataset
train_b = dfn.sample(False, p, seed = 92285).union(dfy)

# Print out metrics
print("Total count: %s, Fraud cases count: %s, Proportion of fraud cases: %s" % (N, y, p))
print("Balanced training dataset count: %s" % train_b.count())

# Output:
# Total count: 5090394, Fraud cases count: 204865, Proportion of fraud cases: 0.040245411258932016
# Balanced training dataset count: 401898

# Display our more balanced training dataset

Updating the Pipeline

Now let’s update the ML pipeline and create a new cross validator. Because we are using ML pipelines, we only need to update it with the new dataset and we can quickly repeat the same pipeline steps.

# Re-run the same ML pipeline (including parameters grid)
crossval_b = CrossValidator(estimator = dt,
estimatorParamMaps = paramGrid,
evaluator = evaluatorAUC,
numFolds = 3)
pipelineCV_b = Pipeline(stages=[indexer, va, crossval_b])

# Train the model using the pipeline, parameter grid, and BinaryClassificationEvaluator using the `train_b` dataset
cvModel_b = pipelineCV_b.fit(train_b)

# Build the best model (balanced training and full test datasets)
train_pred_b = cvModel_b.transform(train_b)
test_pred_b = cvModel_b.transform(test)

# Evaluate the model on the balanced training datasets
pr_train_b = evaluatorPR.evaluate(train_pred_b)
auc_train_b = evaluatorAUC.evaluate(train_pred_b)

# Evaluate the model on full test datasets
pr_test_b = evaluatorPR.evaluate(test_pred_b)
auc_test_b = evaluatorAUC.evaluate(test_pred_b)

# Print out the PR and AUC values
print("PR train:", pr_train_b)
print("AUC train:", auc_train_b)
print("PR test:", pr_test_b)
print("AUC test:", auc_test_b)

# Output: 
# PR train: 0.999629161563572
# AUC train: 0.9998071389056655
# PR test: 0.9904709171789063
# AUC test: 0.9997903902204509
Review the Results

Now let’s look at the results of our new confusion matrix. The model misidentified only one fraudulent case. Balancing the classes seems to have improved the model.

Model Feedback and Using MLflow

Once a model is chosen for production, we want to continuously collect feedback to ensure that the model is still identifying the behavior of interest. Since we are starting with a rule-based label, we want to supply future models with verified true labels based on human feedback. This stage is crucial for maintaining confidence and trust in the machine learning process. Since analysts are not able to review every single case, we want to ensure we are presenting them with carefully chosen cases to validate the model output. For example, predictions, where the model has low certainty, are good candidates for analysts to review. The addition of this type of feedback will ensure the models will continue to improve and evolve with the changing landscape.

MLflow helps us throughout this cycle as we train different model versions. We can keep track of our experiments, comparing the results of different model configurations and parameters. For example here, we can compare the PR and AUC of the models trained on balanced and unbalanced data sets using the MLflow UI. Data scientists can use MLflow to keep track of the various model metrics and any additional visualizations and artifacts to help make the decision of which model should be deployed in production. The data engineers will then be able to easily retrieve the chosen model along with the library versions used for training as a .jar file to be deployed on new data in production. Thus, the collaboration between the domain experts who review the model results, the data scientists who update the models, and the data engineers who deploy the models in production, will be strengthened throughout this iterative process.


We have reviewed an example of how to use a rule-based fraud detection label and convert it to a machine learning model using Databricks with MLflow. This approach allows us to build a scalable, modular solution that will help us keep up with ever-changing fraudulent behavior patterns. Building a machine learning model to identify fraud allows us to create a feedback loop that allows the model to evolve and identify new potential fraudulent patterns. We have seen how a decision tree model, in particular, is a great starting point to introduce machine learning to a fraud detection program due to its interpretability and excellent accuracy.

A major benefit of using the Databricks platform for this effort is that it allows for data scientists, engineers, and business users to seamlessly work together throughout the process. Preparing the data, building models, sharing the results, and putting the models into production can now happen on the same platform, allowing for unprecedented collaboration. This approach builds trust across the previously siloed teams, leading to an effective and dynamic fraud detection program.

Try this notebook by signing up for a free trial in just a few minutes and get started creating your own models.


Try Databricks for free. Get started today.

The post Detecting Financial Fraud at Scale with Decision Trees and MLflow on  Databricks appeared first on Databricks.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 
Databricks Blog by Ricardo Portilla, Brenner Heintz An.. - 3w ago
Try this notebook in Databricks

This blog is part 1 of our two-part series Using Dynamic Time Warping and MLflow to Detect Sales Trends. To go to part 2, go to Using Dynamic Time Warping and MLflow to Detect Sales Trends.

The phrase “dynamic time warping,” at first read, might evoke images of Marty McFly driving his DeLorean at 88 MPH in the Back to the Future series. Alas, dynamic time warping does not involve time travel; instead, it’s a technique used to dynamically compare time series data when the time indices between comparison data points do not sync up perfectly.

As we’ll explore below, one of the most salient uses of dynamic time warping is in speech recognition – determining whether one phrase matches another, even if it the phrase is spoken faster or slower than its comparison. You can imagine that this comes in handy to identify the “wake words” used to activate your Google Home or Amazon Alexa device – even if your speech is slow because you haven’t yet had your daily cup(s) of coffee.

Dynamic time warping is a useful, powerful technique that can be applied across many different domains. Once you understand the concept of dynamic time warping, it’s easy to see examples of its applications in daily life, and its exciting future applications. Consider the following uses:

  • Financial markets – comparing stock trading data over similar time frames, even if they do not match up perfectly. For example, comparing monthly trading data for February (28 days) and March (31 days).
  • Wearable fitness trackers – more accurately calculating a walker’s speed and the number of steps, even if their speed varied over time.
  • Route calculation – calculating more accurate information about a driver’s ETA, if we know something about their driving habits (for example, they drive quickly on straightaways but take more time than average to make left turns).

Data scientists, data analysts, and anyone working with time series data should become familiar with this technique, given that perfectly aligned time-series comparison data can be as rare to see in the wild as perfectly “tidy” data.

In this blog series, we will explore:

  • The basic principles of dynamic time warping
  • Running dynamic time warping on sample audio data
  • Running dynamic time warping on sample sales data using MLflow
Dynamic Time Warping

The objective of time series comparison methods is to produce a distance metric between two input time series. The similarity or dissimilarity of two-time series is typically calculated by converting the data into vectors and calculating the Euclidean distance between those points in vector space.

Dynamic time warping is a seminal time series comparison technique that has been used for speech and word recognition since the 1970s with sound waves as the source; an often cited paper is Dynamic time warping for isolated word recognition based on ordered graph searching techniques.


This technique can be used not only for pattern matching, but also anomaly detection (e.g. overlap time series between two disjoint time periods to understand if the shape has changed significantly, or to examine outliers). For example, when looking at the red and blue lines in the following graph, note the traditional time series matching (i.e. Euclidean Matching) is extremely restrictive. On the other hand, dynamic time warping allows the two curves to match up evenly even though the X-axes (i.e. time) are not necessarily in sync.  Another way is to think of this is as a robust dissimilarity score where a lower number means the series is more similar.

Source: Wiki Commons: File:Euclidean_vs_DTW.jpg

Two-time series (the base time series and new time series) are considered similar when it is possible to map with function f(x) according to the following rules so as to match the magnitudes using an optimal (warping) path.

Sound pattern matching

Traditionally, dynamic time warping is applied to audio clips to determine the similarity of those clips.  For our example, we will use four different audio clips based on two different quotes from a TV show called The Expanse. There are four audio clips (you can listen to them below but this is not necessary) – three of them (clips 1, 2, and 4) are based on the quote:

“Doors and corners, kid. That’s where they get you.”

and one clip (clip 3) is the quote

“You walk into a room too fast, the room eats you.”

Doors and Corners, Kid.
That’s where they get you. [v1]

Doors and Corners, Kid.
That’s where they get you. [v2]

You walk into a room too fast,
the room eats you.

Doors and Corners, Kid.
That’s where they get you [v3]

Quotes are from The Expanse

Below are visualizations using matplotlib of the four audio clips:

  • Clip 1: This is our base time series based on the quote “Doors and corners, kid. That’s where they get you”.
  • Clip 2: This is a new time series [v2] based on clip 1 where the intonation and speech pattern is extremely exaggerated.
  • Clip 3: This is another time series that’s based on the quote “You walk into a room too fast, the room eats you.” with the same intonation and speed as Clip 1.
  • Clip 4: This is a new time series [v3] based on clip 1 where the intonation and speech pattern is similar to clip 1.

The code to read these audio clips and visualize them using matplotlib can be summarized in the following code snippet.

from scipy.io import wavfile
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure

# Read stored audio files for comparison
fs, data = wavfile.read("/dbfs/folder/clip1.wav")

# Set plot style

# Create subplots
ax = plt.subplot(2, 2, 1)
ax.plot(data1, color='#67A0DA')

# Display created figure

The full code-base can be found in the notebook Dynamic Time Warping Background.

As noted below, when the two clips (in this case, clips 1 and 4) have different intonations (amplitude) and latencies for the same quote.

If we were to follow a traditional Euclidean matching (per the following graph), even if we were to discount the amplitudes, the timings between the original clip (blue) and the new clip (yellow) do not match.

With dynamic time warping, we can shift time to allow for a time series comparison between these two clips.

For our time series comparison, we will use the fastdtw PyPi library; the instructions to install PyPi libraries within your Databricks workspace can be found here: Azure | AWS.  By using fastdtw, we can quickly calculate the distance between the different time series.

from fastdtw import fastdtw

# Distance between clip 1 and clip 2
distance = fastdtw(data_clip1, data_clip2)[0]
print(“The distance between the two clips is %s” % distance)

The full code-base can be found in the notebook Dynamic Time Warping Background.

Base Query Distance
Clip 1 Clip 2 480148446.0
Clip 3 310038909.0
Clip 4 293547478.0

Some quick observations:

  • As noted in the preceding graph, Clips 1 and 4 have the shortest distance as the audio clips have the same words and intonations
  • The distance between Clips 1 and 3 is also quite short (though longer than when compared to Clip 4) even though they have different words, they are using the same intonation and speed.
  • Clips 1 and 2 have the longest distance due to the extremely exaggerated intonation and speed even though they are using the same quote.

As you can see, with dynamic time warping, one can ascertain the similarity of two different time series.


Now that we have discussed dynamic time warping, let’s apply this use case to detect sales trends.


Try Databricks for free. Get started today.

The post Understanding Dynamic Time Warping appeared first on Databricks.

Read Full Article
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 
Databricks Blog by Ricardo Portilla, Brenner Heintz An.. - 3w ago
Try this notebook series in Databricks

This blog is part 2 of our two-part series Using Dynamic Time Warping and MLflow to Detect Sales Trends

The phrase “dynamic time warping,” at first read, might evoke images of Marty McFly driving his DeLorean at 88 MPH in the Back to the Future series. Alas, dynamic time warping does not involve time travel; instead, it’s a technique used to dynamically compare time series data when the time indices between comparison data points do not sync up perfectly.

As we’ll explore below, one of the most salient uses of dynamic time warping is in speech recognition – determining whether one phrase matches another, even if it the phrase is spoken faster or slower than its comparison. You can imagine that this comes in handy to identify the “wake words” used to activate your Google Home or Amazon Alexa device – even if your speech is slow because you haven’t yet had your daily cup(s) of coffee.

Dynamic time warping is a useful, powerful technique that can be applied across many different domains. Once you understand the concept of dynamic time warping, it’s easy to see examples of its applications in daily life, and its exciting future applications. Consider the following uses:

  • Financial markets – comparing stock trading data over similar time frames, even if they do not match up perfectly. For example, comparing monthly trading data for February (28 days) and March (31 days).
  • Wearable fitness trackers – more accurately calculating a walker’s speed and the number of steps, even if their speed varied over time.
  • Route calculation – calculating more accurate information about a driver’s ETA, if we know something about their driving habits (for example, they drive quickly on straightaways but take more time than average to make left turns).

Data scientists, data analysts, and anyone working with time series data should become familiar with this technique, given that perfectly aligned time-series comparison data can be as rare to see in the wild as perfectly “tidy” data.

In this blog series, we will explore:

  • The basic principles of dynamic time warping
  • Running dynamic time warping on sample audio data
  • Running dynamic time warping on sample sales data using MLflow

For more background on dynamic time warping, refer to the previous post Understanding Dynamic Time Warping.


Imagine that you own a company that creates 3D printed products. Last year, you knew that drone propellers were showing very consistent demand, so you produced and sold those, and the year before you sold phone cases. The new year is arriving very soon, and you’re sitting down with your manufacturing team to figure out what your company should produce for next year. Buying the 3D printers for your warehouse put you deep into debt, so you have to make sure that your printers are running at or near 100% capacity at all times in order to make the payments on them.

Since you’re a wise CEO, you know that your production capacity over the next year will ebb and flow – there will be some weeks when your production capacity is higher than others. For example, your capacity might be higher during the summer (when you hire seasonal workers), and lower during the 3rd week of every month (because of issues with the 3D printer filament supply chain). Take a look at the chart below to see your company’s production capacity estimate:

Your job is to choose a product for which weekly demand meets your production capacity as closely as possible. You’re looking over a catalog of products which includes last year’s sales numbers for each product, and you think this year’s sales will be similar.

If you choose a product with weekly demand that exceeds your production capacity, then you’ll have to cancel customer orders, which isn’t good for business. On the other hand, if you choose a product without enough weekly demand, you won’t be able to keep your printers running at full capacity and may fail to make the debt payments.

Dynamic time warping comes into play here because sometimes supply and demand for the product you choose will be slightly out of sync. There will be some weeks when you simply don’t have enough capacity to meet all of your demand, but as long as you’re very close and you can make up for it by producing more products in the week or two before or after, your customers won’t mind. If we limited ourselves to comparing the sales data with our production capacity using Euclidean Matching, we might choose a product that didn’t account for this, and leave money on the table. Instead, we’ll use dynamic time warping to choose the product that’s right for your company this year.

Load the product sales data set

We will use the weekly sales transaction data set found in the UCI Dataset Repository to perform our sales-based time series analysis. (Source Attribution: James Tan, jamestansc ‘@’ suss.edu.sg, Singapore University of Social Sciences)

import pandas as pd

# Use Pandas to read this data
sales_pdf = pd.read_csv(sales_dbfspath, header='infer')

# Review data

Each product is represented by a row, and each week in the year is represented by a column. Values represent the number of units of each product sold per week. There are 811 products in the data set.

Calculate distance to optimal time series by product code
# Calculate distance via dynamic time warping between product code and optimal time series
import numpy as np
import _ucrdtw

def get_keyed_values(s):
    return(s[0], s[1:])

def compute_distance(row):
    return(row[0], _ucrdtw.ucrdtw(list(row[1][0:52]), list(optimal_pattern), 0.05, True)[1])

ts_values = pd.DataFrame(np.apply_along_axis(get_keyed_values, 1, sales_pdf.values))
distances = pd.DataFrame(np.apply_along_axis(compute_distance, 1, ts_values.values))
distances.columns = ['pcode', 'dtw_dist']

Using the calculated dynamic time warping ‘distances’ column, we can view the distribution of DTW distances in a histogram.

From there, we can identify the product codes closest to the optimal sales trend (i.e., those that have the smallest calculated DTW distance). Since we’re using Databricks, we can easily make this selection using a SQL query. Let’s display those that are closest.

-- Top 10 product codes closest to the optimal sales trend
select pcode, cast(dtw_dist as float) as dtw_dist from distances order by cast(dtw_dist as float) limit 10

After running this query, along with the corresponding query for the product codes that are furthest from the optimal sales trend, we were able to identify the 2 products that are closest and furthest from the trend. Let’s plot both of those products and see how they differ.

As you can see, Product #675 (shown in the orange triangles) represents the best match to the optimal sales trend, although the absolute weekly sales are lower than we’d like (we’ll remedy that later). This result makes sense since we’d expect the product with the closest DTW distance to have peaks and valleys that somewhat mirror the metric we’re comparing it to. (Of course, the exact time index for the product would vary on a week-by-week basis due to dynamic time warping). Conversely, Product #716 (shown in the green stars) is the product with the worst match, showing almost no variability.

Finding the optimal product: Small DTW distance and similar absolute sales numbers

Now that we’ve developed a list of products that are closest to our factory’s projected output (our “optimal sales trend”), we can filter them down to those that have small DTW distances as well as similar absolute sales numbers. One good candidate would be Product #202, which has a DTW distance of 6.86 versus the population median distance of 7.89 and tracks our optimal trend very closely.

# Review P202 weekly sales  
y_p202 = sales_pdf[sales_pdf['Product_Code'] == 'P202'].values[0][1:53]

Using MLflow to track best and worst products, along with artifacts

MLflow is an open source platform for managing the machine learning lifecycle, including experimentation, reproducibility, and deployment. Databricks notebooks offer a fully integrated MLflow environment, allowing you to create experiments, log parameters and metrics, and save results. For more information about getting started with MLflow, take a look at the excellent documentation.

MLflow’s design is centered around the ability to log all of the inputs and outputs of each experiment we do in a systematic, reproducible way. On every pass through the data, known as a “Run,” we’re able to log our experiment’s:

  • Parameters – the inputs to our model.
  • Metrics – the output of our model, or measures of our model’s success.
  • Artifacts – any files created by our model – for example, PNG plots or CSV data output.
  • Models – the model itself, which we can later reload and use to serve predictions.

In our case, we can use it to run the dynamic time warping algorithm several times over our data while changing the “stretch factor,” the maximum amount of warp that can be applied to our time series data. To initiate an MLflow experiment, and allow for easy logging using mlflow.log_param(), mlflow.log_metric(),  mlflow.log_artifact(), and mlflow.log_model(), we wrap our main function using:

with mlflow.start_run() as run:

as shown in the abbreviated code below.

import mlflow

def run_DTW(ts_stretch_factor):
    # calculate DTW distance and Z-score for each product
    with mlflow.start_run() as run:
        # Log Model using Custom Flavor
        dtw_model = {'stretch_factor' : float(ts_stretch_factor), 'pattern' : optimal_pattern}       
        mlflow_custom_flavor.log_model(dtw_model, artifact_path="model")

        # Log our stretch factor parameter to MLflow
        mlflow.log_param("stretch_factor", ts_stretch_factor)

        # Log the median DTW distance for this run
        mlflow.log_metric("Median Distance", distance_median)

        # Log artifacts - CSV file and PNG plot - to MLflow
        mlflow.log_artifact('zscore_outliers_' + str(ts_stretch_factor) + '.csv')

    return run.info

stretch_factors_to_test = [0.0, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5]
for n in stretch_factors_to_test:

With each run through the data, we’ve created a log of the “stretch factor” parameter being used, and a log of products we classified as being outliers based upon the Z-score of the DTW distance metric. We were even able to save an artifact (file) of a histogram of the DTW distances. These experimental runs are saved locally on Databricks and remain accessible in the future if you decide to view the results of your experiment at a later date.

Now that MLflow has saved the logs of each experiment, we can go back through and examine the results. From your Databricks notebook, select the  icon in the upper right-hand corner to view and compare the results of each of our runs.

Not surprisingly, as we increase our “stretch factor,” our distance metric decreases. Intuitively, this makes sense: as we give the algorithm more flexibility to warp the time indices forward or backward, it will find a closer fit for the data. In essence, we’ve traded some bias for variance.

Logging Models in MLflow

MLflow has the ability to not only log experiment parameters, metrics, and artifacts (like plots or CSV files), but also to log machine learning models. An MLflow Model is simply a folder that is structured to conform to a consistent API, ensuring compatibility with other MLflow tools and features. This interoperability is very powerful, allowing any Python model to be rapidly deployed to many different types of production environments.

MLflow comes pre-loaded with a number of common model “flavors” for many of the most popular machine learning libraries, including scikit-learn, Spark MLlib, PyTorch, TensorFlow, and others. These model flavors make it trivial to log and reload models after they are initially constructed, as demonstrated in this blog post. For example, when using MLflow with scikit-learn, logging a model is as easy as running the following code from within an experiment:

mlflow.sklearn.log_model(model=sk_model, artifact_path="sk_model_path")

MLflow also offers a “Python function” flavor, which allows you to save any model from a third-party library (such as XGBoost, or spaCy), or even a simple Python function itself, as an MLflow model. Models created using the Python function flavor live within the same ecosystem and are able to interact with other MLflow tools through the Inference API. Although it’s impossible to plan for every use case, the Python function model flavor was designed to be as universal and flexible as possible. It allows for custom processing and logic evaluation, which can come in handy for ETL applications. Even as more “official” Model flavors come online, the generic Python function flavor will still serve as an important “catch all,” providing a bridge between Python code of any kind and MLflow’s robust tracking toolkit.

Logging a Model using the Python function flavor is a straightforward process. Any model or function can be saved as a Model, with one requirement: it must take in a pandas Dataframe as input, and return a DataFrame or NumPy array. Once that requirement is met, saving your function as an MLflow Model involves defining a Python class that inherits from PythonModel, and overriding the .predict() method with your custom function, as described here.

Loading a logged model from one of our runs

Now that we’ve run through our data with several different stretch factors, the natural next step is to examine our results and look for a model that did particularly well according to the metrics that we’ve logged. MLflow makes it easy to then reload a logged model, and use it to make predictions on new data, using the following instructions:

  1. Click on the link for the run you’d like to load our model from.
  2. Copy the ‘Run ID’.
  3. Make note of the name of the folder the model is stored in. In our case, it’s simply named “model.”
  4. Enter the model folder name and Run ID as shown below:
import custom_flavor as mlflow_custom_flavor

loaded_model = mlflow_custom_flavor.load_model(artifact_path='model', run_id='e26961b25c4d4402a9a5a7a679fc8052')

To show that our model is working as intended, we can now load the model and use it to measure DTW distances on two new products that we’ve created within the variable new_sales_units:

# use the model to evaluate new products found in ‘new_sales_units’
output = loaded_model.predict(new_sales_units)
Next steps

As you can see, our MLflow Model is predicting new and unseen values with ease. And since it conforms to the Inference API, we can deploy our model on any serving platform (such as Microsoft Azure ML, or Amazon Sagemaker), deploy it as a local REST API endpoint, or create a user-defined function (UDF) that can easily be used with Spark-SQL.    In closing, we demonstrated how we can use dynamic time warping to predict sales trends using the Databricks Unified Analytics Platform.  Try out the Using Dynamic Time Warping and MLflow to Predict Sales Trends notebook with Databricks Runtime for Machine Learning today.


Try Databricks for free. Get started today.

The post Using Dynamic Time Warping and MLflow to Detect Sales Trends appeared first on Databricks.

Read Full Article

Read for later

Articles marked as Favorite are saved for later viewing.
  • Show original
  • .
  • Share
  • .
  • Favorite
  • .
  • Email
  • .
  • Add Tags 

Separate tags by commas
To access this feature, please upgrade your account.
Start your free month
Free Preview