If you've ever wanted to play around with big data sets in a Spark cluster from R with the sparklyr package, but haven't gotten started because setting up a Spark cluster is too hard, well ... rest easy. You can get up and running in about 5 minutes using the guide SparklyR on Azure with AZTK, and you don't even have to install anything yourself. I'll summarize the steps below, but basically you'll run a command-line utility to launch a cluster in Azure with everything you need already installed, and then connect to RStudio Server using your browser to analyze data with sparklyr.
Step 1: Install the Azure Distributed Data Engineering Toolkit (aztk). For this, you'll need a Unix command line with Python 3 installed. I'm on Windows, so I used a bash shell from the Windows Subsystem for Linux and it worked great. (I just had to use pip3 instead of pip to install, since the default there is Python 2.) The same process should work with other Linux distros or from a Mac terminal.
Step 3: Back at the command line, set up authentication in the secrets.yaml file. You'll be using the Azure portal to retrieve the necessary keys, and you'll need to create an Azure Batch account if you don't have one already. (Batch is the HPC cluster and job-management service in Azure.) You can find step-by-step details in the aztk documentation.
Step 4: Configure your cluster defaults in the cluster.yaml file. Here you can define the default VM instance size used for the cluster nodes; for example vm_size: standard_a2 gives you basic 2-core nodes. (You can override this in the command line, but it's convenient to set it here.) You'll also need to specify a dockerfile here that will be used to set up the node images, and for use with sparklyr you'll need to specify one that includes R and the version of Spark you want. I used:
This provides an image with Spark 2.2.0, R 3.4.1, and a suite of R packages pre-installed, including sparklyr and tidyverse. (You could provide your own dockerfile here, if you need other things installed on the nodes.)
Step 5: Provision a Spark cluster. This is the easy bit: just use the command line tool like this:
In this case, it will launch a cluster of 4 nodes, each with 2 cores (per the vm_size option configured above). Each node will be pre-installed with R and Spark, as defined by the dockerfile. (Warning: the default quotas for Azure Batch are laughably low: for me it was 24 cores total at first. You can get your limit raised fairly easily, but it can take a day to get approval.) Provisioning a cluster takes about 5 minutes; while you're waiting you can check on the progress by clicking on the cluster name in the "Pools" section of your Azure Batch account within the Azure Portal.
Once it's ready, you'll also need to provide a password for the head node unless you set up ssh keys in the secrets.yaml file.
Step 6: Connect to the head node of the Spark cluster. Normally you'd need to find the IP address first, but aztk makes it easy with its ssh command:
aztk spark cluster ssh --id mysparklyr4
(You'll need to provide a password here, if you set one up in Step 5.) This gives you a shell on the head node, but more importantly it maps the ports for Spark and RStudio server, so that you can connect to them using http://localhost URLs in the next step. Don't exit from this shell until you're done with the next steps, or the port mappings will be cancelled.
Step 7: Connect to RStudio Server
Open a browser window on your desktop, and browse to http://localhost:8787. This will open up RStudio Server in your browser. (The default login is rstudio/rstudio.) To be clear, RStudio Server is running on the head node of your cluster in Azure Batch, not on your local machine: the port mapping from the previous step is redirecting your local port 8787 to the remote cluster.
From here, you can use RStudio as you normally would. In particular, the sparklyr package is already installed, so you can connect to the Spark cluster directly and use RStudio Server's built-in features for working with Spark.
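As a rough sketch of what that first connection might look like from the RStudio Server console: the code below assumes the Spark master runs in standalone mode on the head node and listens on Spark's default standalone port 7077, which may not match your cluster. The helper names spark_master_url and summarise_on_spark are made up for this illustration; check the guide for the exact master URL to use.

```r
# Build a Spark standalone master URL. Port 7077 is Spark's default for
# standalone masters; the port used by your cluster is an assumption here.
spark_master_url <- function(host, port = 7077L) {
  sprintf("spark://%s:%d", host, as.integer(port))
}

# Connect, copy a small data frame into Spark, and summarise it with dplyr.
# Only call this from an R session on the cluster (e.g. RStudio Server).
summarise_on_spark <- function(master = spark_master_url(Sys.info()[["nodename"]])) {
  sc <- sparklyr::spark_connect(master = master)
  on.exit(sparklyr::spark_disconnect(sc))

  # copy_to() uploads the local data frame and returns a reference to the Spark table
  cars_tbl <- dplyr::copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

  # These verbs are translated to Spark SQL; collect() brings the result back to R
  by_cyl <- dplyr::group_by(cars_tbl, cyl)
  avg <- dplyr::summarise(by_cyl, avg_mpg = mean(mpg, na.rm = TRUE))
  dplyr::collect(avg)
}
```

Calling summarise_on_spark() on the head node should return a small local data frame with one row per cylinder count, if the assumed master URL is right for your cluster.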
One of the nice things about using RStudio Server is that you can shut down your browser or even your machine, and RStudio Server will preserve its state so that you can pick up exactly where you left off next time you log in. (Just use aztk spark cluster ssh to reapply the port mappings first, if necessary.)
Step 8: When you're finished, shut down your cluster using the aztk spark cluster delete command. (While you can delete the nodes from the Pools view in the Azure portal, the command does some additional cleanup for you.) You'll be charged for each node in the cluster at the usual VM rates for as long as the cluster is provisioned. (One cost-saving option is to use low-priority VMs for the nodes, for savings of up to 90% compared to the usual rates.)
That's it! Once you get used to it, it's all quick and easy -- the longest part is waiting for the cluster to spin up in Step 5. This is just a summary; for the full details, see the guide SparklyR on Azure with AZTK.
The SparklyR package from RStudio provides a high-level interface to Spark from R. This means you can create R objects that point to data frames stored in the Spark cluster and apply some familiar R paradigms (like dplyr) to the data, all the while leveraging Spark's distributed architecture without having to worry about memory limitations in R. You can also access the distributed machine-learning algorithms included in Spark directly from R functions.
To interface with Azure Data Lake, you'll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R script. There's a 500 MB limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this data you can use any function from base R or any R package. (Several common R packages are provided in the environment, or you can upload and install other packages directly, or use the checkpoint package to install everything you need.) The R engine used is R 3.2.2.
The tutorial walks you through the following stages:
I’m delighted to announce the release of version 1.0.0 of the dplyrXdf package. dplyrXdf began as a simple (relatively speaking) backend to dplyr for Microsoft Machine Learning Server/Microsoft R Server’s Xdf file format, but has now become a broader suite of tools to ease working with Xdf files.
This update to dplyrXdf brings the following new features:
Support for Spark and Hadoop clusters, including integration with the sparklyr package to process Hive tables in Spark
Integration with dplyr to process SQL Server tables in-database
Simplified handling of parallel processing for grouped data
Several utility functions for Xdf and file management
Workarounds for various glitches and unexpected behaviour in MRS and dplyr
Spark, Hadoop and HDFS
New in version 1.0.0 of dplyrXdf is support for Xdf files and datasets stored in HDFS in a Hadoop or Spark cluster. Most verbs and pipelines behave the same way, whether the computations are taking place in your R session itself, or in-cluster (except that they should be much more scalable in the latter case). Similarly, dplyrXdf can handle both the scenarios where your R session is taking place on the cluster edge node, or on a remote client.
For example, here is some sample code where we extract a table from Hive, then create a pipeline to process it in the cluster:
If you are logged into the edge node, dplyrXdf also has the ability to call sparklyr to process Hive tables in Spark. This can be more efficient than converting the data to Xdf format, since less I/O is involved. To run the above pipeline with sparklyr, we simply omit the step of creating an Xdf file:
One of the key strengths of dplyr is its ability to interoperate with SQL databases. Given a database table as input, dplyr can translate the verbs in a pipeline into a SQL query which is then executed in the database. For large tables, this can often be much more efficient than importing the data and running the pipeline locally. dplyrXdf can take advantage of this with an MRS data source that is a table in a SQL database, including (but not limited to) Microsoft SQL Server: rather than importing the data to Xdf, the data source is converted to a dplyr tbl and passed to the database for processing.
# copy the flights dataset to SQL Server
flightsSql <- RxSqlServerData("flights", connectionString=connStr)
flightsHd <- copy_to(flightsSql, nycflights13::flights)
# this is run inside SQL Server by dplyr
# (group_by/summarise reconstructed from the output columns; arr_delay is assumed)
flightsQry <- flightsSql %>%
    filter(month > 6) %>%
    group_by(carrier) %>%
    summarise(avg_delay=mean(arr_delay))
flightsQry
#> # Source: lazy query [?? x 2]
#> # Database: Microsoft SQL Server
#> # 13.00.4202[dbo@DESKTOP-TBHQGUH/sqlDemoLocal]
#>   carrier avg_delay
#>   <chr>       <dbl>
#> 1 9E          5.37
#> 2 AA         -0.743
#> 3 AS        -16.9
#> 4 B6          8.53
#> 5 DL          1.55
#> # ... with more rows
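If you'd like to see the kind of SQL that dplyr generates for a filter-and-summarise pipeline of this kind, without a SQL Server instance to hand, one option is dbplyr's simulated backends. This is only a sketch: the flights_stub table and its three columns are stand-ins invented for the illustration, and the SQL that a real dplyrXdf data source produces may differ in its details.

```r
library(dplyr)
library(dbplyr)

# A lazy table with no real database behind it. simulate_mssql() tells dbplyr
# to translate as if the target were Microsoft SQL Server.
flights_stub <- lazy_frame(
  month = 1L, carrier = "AA", arr_delay = 0,
  con = simulate_mssql()
)

# Build the pipeline, then render it to SQL instead of executing it
flights_sql <- flights_stub %>%
  filter(month > 6) %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  sql_render()

print(flights_sql)
```

The rendered query is a single SELECT with WHERE and GROUP BY clauses, which is why the whole pipeline can run in the database and only the summary rows need to come back to R.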
Even without a Hadoop or Spark cluster, dplyrXdf makes it easy to parallelise the handling of groups. To do this, it takes advantage of Microsoft R Server's distributed compute contexts: for example, if you set the compute context to "localpar", grouped transformations will be done in parallel on a local cluster of R processes. The cluster will be shut down automatically when the transformation is complete.
More broadly, you can create a custom backend and tell dplyrXdf to use it by setting the compute context to "dopar". This allows you a great deal of flexibility and scalability, for example by creating a cluster of multiple machines (as opposed to multiple cores on a single machine). Even if you do not have the physical machines, packages like AzureDSVM and doAzureParallel allow you to deploy clusters of VMs in the cloud, and then shut them down again. For more information, see the “Parallel processing of grouped data” section of the Using dplyrXdf vignette.
Data and file management
New in dplyrXdf 1.0.0 is a suite of functions to simplify managing Xdf files and data sources:
HDFS file management: upload and download files with hdfs_file_upload and hdfs_file_download; copy/move/delete files with hdfs_file_copy, hdfs_file_move, hdfs_file_remove; list files with hdfs_dir; and more
Xdf data management: upload and download datasets with copy_to, collect and compute; import/convert to Xdf with as_xdf; copy/move/delete Xdf data sources with copy_xdf, move_xdf and delete_xdf; and more
Other utilities: run a block of code in the local compute context with local_exec; convert an Xdf file to a data frame with as.data.frame; extract columns from an Xdf file with methods for [, [[ and pull
Obtaining dplyr and dplyrXdf
dplyrXdf 1.0.0 is available from GitHub. It requires Microsoft R Server 8.0 or higher, and dplyr 0.7 or higher. Note that dplyr 0.7 will not be in the MRAN snapshot that is your default repo, unless you are using the recently-released MRS 9.2; you can install it, and its dependencies, from CRAN. If you want to use the SQL Server and sparklyr integration facility, you should install the odbc, dbplyr and sparklyr packages as well.
Works with Spark and Hadoop clusters and files in HDFS
Several utility functions to ease working with files and datasets
Many bugfixes and workarounds for issues with the underlying RevoScaleR functions
This (pre-)release of dplyrXdf requires Microsoft R Server or Client version 8.0 or higher, and dplyr 0.7 or higher. If you’re using R Server, dplyr 0.7 won’t be in the MRAN snapshot that is your default repo, but you can get it from CRAN:
This completely changes the way in which dplyr handles standard evaluation. Previously, if you wanted to program with dplyr pipelines, you had to use special versions of the verbs ending with "_": mutate_, select_, and so on. You then provided inputs to these verbs via formulas or strings, in a way that was almost but not quite entirely unlike normal dplyr usage. For example, if you wanted to programmatically carry out a transformation on a given column in a data frame, you did the following:
This is prone to errors, since it requires creating a string and then parsing it. Worse, it's also insecure, as you can't always guarantee that the input string won't be malicious.
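To make the security point concrete, here is a small base R sketch (not dplyr's actual implementation) comparing the two approaches. The string version pastes the column name into code and parses it; the symbol version, using as.name and bquote as a base R stand-in for tidyeval, turns the name into a single symbol. The function names and the audit environment are invented for this example.

```r
# String approach: paste the column name into code, then parse and evaluate it
double_col_string <- function(data, col) {
  code <- paste0("2 * ", col)
  eval(parse(text = code), data)
}

# Symbol approach: the whole string becomes one symbol, never parsed as code
double_col_symbol <- function(data, col) {
  eval(bquote(2 * .(as.name(col))), data)
}

x <- "mpg"
ok_string <- double_col_string(mtcars, x)
ok_symbol <- double_col_symbol(mtcars, x)

# A hostile "column name" that smuggles in a second statement
audit <- new.env()
x_bad <- "mpg; assign('injected', TRUE, envir = audit)"

res_bad <- double_col_string(mtcars, x_bad)  # the assign() call runs too
injected_by_string <- isTRUE(audit$injected)

# Reset the flag, then try the same input with the symbol version
audit$injected <- FALSE
symbol_result <- tryCatch(
  double_col_symbol(mtcars, x_bad),
  error = function(e) "error"  # no column has that name, so evaluation fails
)
injected_by_symbol <- isTRUE(audit$injected)
```

With a well-behaved name both functions give the same answer. With the hostile name, the string version silently runs the extra statement, while the symbol version simply fails to find a column with that name.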
The tidyeval framework replaces all of that. In dplyr 0.7, you call the same functions for both interactive use and programming. The equivalent of the above in the new framework would be:
library(dplyr)
# the rlang package implements the tidyeval framework used by dplyr
library(rlang)
x <- "mpg"  # name of the column to transform (example value)
x_sym <- sym(x)
transmute(mtcars, mpg2=2 * (!!x_sym))
Here, the !! symbol is a special operator that means to get the column name from the variable to its right. The verbs in dplyr 0.7 understand the special rules for working with quoted symbols introduced in the new framework. The same code also works in dplyrXdf 0.10:
# use the new as_xdf function to import to an Xdf file
mtx <- as_xdf(mtcars)
transmute(mtx, mpg2=2 * (!!x_sym)) %>% as.data.frame
New features in dplyrXdf
Copy, move and delete Xdf files
The following functions let you manipulate Xdf files as files:
copy_xdf and move_xdf copy and move an Xdf file, optionally renaming it as well.
rename_xdf does a strict rename, ie without changing the file’s location.
delete_xdf deletes the Xdf file.
HDFS file transfers
The following functions let you transfer files and datasets to and from HDFS, for working with a Spark or Hadoop cluster:
copy_to uploads a dataset (a data frame or data source object) from the native filesystem to HDFS, saving it as an Xdf file.
collect and compute do the reverse, downloading an Xdf file from HDFS.
hdfs_upload and hdfs_download transfer arbitrary files and directories to and from HDFS.
Uploading and downloading works (or should work) both from the edge node and from a remote client. The interface is the same in both cases: no need to remember when to use rxHadoopCopyFromLocal and rxHadoopCopyFromClient. The hdfs_* functions mostly wrap the rxHadoop* functions, but also add extra functionality in some cases (eg vectorised copy/move, test for directory existence, etc).
HDFS file management
The following functions are for file management in HDFS, and mirror similar functions in base R for working with the native filesystem:
hdfs_dir lists files in an HDFS directory, like dir() for the native filesystem.
hdfs_dir_exists and hdfs_file_exists test for existence of a directory or file, like dir.exists() and file.exists().
hdfs_file_copy, hdfs_file_move and hdfs_file_remove copy, move and delete files in a vectorised fashion, like file.copy(), file.rename() and unlink().
hdfs_dir_create and hdfs_dir_remove make and delete directories, like dir.create() and unlink(recursive=TRUE).
in_hdfs returns whether a data source is in HDFS or not.
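For reference, here is a short sketch of the base R calls that these functions are modelled on, run against a throwaway local directory. The comments name the HDFS counterpart from the list above; the hdfs_* functions themselves are not called here, and their exact arguments may differ from the base R versions.

```r
# Work in a scratch directory so nothing outside tempdir() is touched
work <- file.path(tempdir(), "hdfs_analogy")
unlink(work, recursive = TRUE)
dir.create(work)                                     # cf. hdfs_dir_create

src <- file.path(work, "a.csv")
write.csv(head(mtcars), src, row.names = FALSE)

exists_before <- file.exists(src)                    # cf. hdfs_file_exists
copied <- file.copy(src, file.path(work, "b.csv"))   # cf. hdfs_file_copy
moved <- file.rename(file.path(work, "b.csv"),
                     file.path(work, "c.csv"))       # cf. hdfs_file_move
listing <- sort(dir(work))                           # cf. hdfs_dir

unlink(file.path(work, "c.csv"))                     # cf. hdfs_file_remove
listing_after <- dir(work)

unlink(work, recursive = TRUE)                       # cf. hdfs_dir_remove
dir_gone <- !dir.exists(work)                        # cf. hdfs_dir_exists
```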
As far as possible, the functions avoid reading the data via rxDataStep and so should be more efficient. The only times when rxDataStep is necessary are when importing from a non-Xdf data source, and converting between standard and composite Xdfs.
as_xdf imports a dataset or data source into an Xdf file, optionally as composite. as_standard_xdf and as_composite_xdf are shortcuts for creating standard and composite Xdfs respectively.
is_xdf and is_composite_xdf return whether a data source is a (composite) Xdf.
local_exec runs an expression in the local compute context: useful for when you want to work with local Xdf files while connected to a remote cluster.
dplyrXdf 0.10 is tentatively scheduled for a final release at the same time as the next version of Microsoft R Server, or shortly afterwards. In the meantime, please download this and give it a try; if you run into any bugs, or if you have any feedback, you can log an issue at the GitHub repo.
H2O.ai is an open-source AI platform that provides a number of machine-learning algorithms that run on the Spark distributed computing framework. Azure HDInsight is Microsoft's fully-managed Apache Hadoop platform in the cloud, which makes it easy to spin up and manage Azure clusters of any size. It's also easy to run H2O on HDInsight: H2O AI Platform is available as an application on HDInsight, which pre-installs everything you need as the cluster is created.
You can also drive H2O from R, but the R packages don't come auto-installed on HDInsight. To make this easy, the Azure HDInsight team has provided a couple of scripts that will install the necessary components on the cluster for you. These include RStudio (to provide an R IDE on the cluster) and the rsparkling package. With these components installed, from R you can:
Query data in Spark using the dplyr interface, and add new columns to existing data sets.
Convert data for training, validation, and testing to "H2O Frames" in preparation for modeling.
Apply any of the machine learning models provided by Sparkling Water to your data, using the distributed computing capabilities provided by the HDInsight platform.
For details on how to install the R components on HDInsight, follow the link below.
The sparklyr package (by RStudio) provides a high-level interface between R and Apache Spark. Among many other things, it allows you to filter and aggregate data in Spark using the dplyr syntax. In Microsoft R Server 9.1, you can now connect to a Spark session using the sparklyr package as the interface, allowing you to combine the data-preparation capabilities of sparklyr and the data-analysis capabilities of Microsoft R Server in the same environment.
In a presentation at the Spark Summit (embedded below, and you can find the slides here), Ali Zaidi shows how to connect to a Spark session from Microsoft R Server, and use the sparklyr package to extract a data set. He then shows how to build predictive models on this data (specifically, a deep Neural Network and a Boosted Trees classifier). He also shows how to build general ensemble models, cross-validate hyper-parameters in parallel, and even gives a preview of forthcoming streaming analysis capabilities.
Extending the R API for Spark with sparklyr and Microsoft R Server - Ali Zaidi - YouTube
An easy way to try out these capabilities is with Azure HDInsight 3.6, which provides a managed Spark 2.1 instance with Microsoft R Server 9.1.