Optimizing partitioning for Apache Spark database loads via JDBC for performance
Jozef's Rblog
3y ago
Introduction Apache Spark is a popular open-source analytics engine for big data processing and, thanks to the sparklyr and SparkR packages, the power of Spark is also available to R users. Apart from using HDFS-based data storage, a very common task in working with Spark is interfacing with traditional RDBMSs such as Oracle, MS SQL Server, and others. A lot of performance can be gained by efficiently partitioning data for these types of data loads. In this post, we will explore the partitioning options that are available for Spark’s JDBC reading capabilities and inves ..read more
A guide to retrieval and processing of data from relational database systems using Apache Spark and JDBC with R and sparklyr
4y ago
Introduction The {sparklyr} package lets us connect to and use Apache Spark for high-performance, highly parallelized, and distributed computations. We can also use Spark’s capabilities to improve and streamline our data processing pipelines, as Spark supports reading from and writing to many popular sources such as Parquet, ORC, etc. and most database systems via JDBC drivers. In this post, we will explore using R to perform data loads to Spark and optionally R from relational database management systems such as MySQL, Oracle, and MS SQL Server and show how such processes can be simplified. We wil ..read more
A review of my experience with the Big Data Analysis with Scala and Spark course
4y ago
Introduction Apache Spark is an open-source distributed cluster-computing framework implemented in Scala that first came out in 2014 and has since become popular for many computing applications, including machine learning, thanks in part to its user-friendly APIs. This popularity also gave rise to many online courses of varied quality. In this post, I share my personal experience with completing the Big Data Analysis with Scala and Spark course on Coursera in May 2020, briefly walk through the content and write about the course assignments. I wrote down each of the paragraphs as ..read more
Exploring and plotting positional ice hockey data on goals, penalties and more from R with the {nhlapi} package
4y ago
Introduction The National Hockey League (NHL) is considered to be the premier professional ice hockey league in the world, founded 102 years ago in 1917. Like many other sports, the data about teams, players, games, and more are a great resource to dive in and analyze using modern software tools. Thanks to the open NHL API, the data is accessible to everyone and the {nhlapi} R package aims to make that data readily available for analysis to R users. In this post, we will use the {nhlapi} R package to explore the positional data on in-game events, which will provide us with information on the ..read more
A review of my experience with the Functional Programming Principles in Scala course
4y ago
Introduction Functional programming is a programming paradigm where programs are constructed by applying and composing functions, and it is quite popular in data science applications because some of its properties can help, for example, with scaling computations. One well-known resource to get into functional programming is the Functional Programming Principles in Scala course by École Polytechnique Fédérale de Lausanne. In this post, I share my personal experience with completing the Functional Programming Principles in Scala course on Coursera in May 2020, briefly walk through the content ..read more
Automating R package checks across platforms with GitHub Actions and Docker in a portable way
4y ago
Introduction Automating the execution, testing and deployment of R code is a powerful way to ensure the reproducibility, quality and overall robustness of the code that we are building. A relatively recent GitHub feature, GitHub Actions, allows us to do just that without using additional tools such as Travis or Jenkins for our repositories stored on GitHub. In this post, we will examine using GitHub Actions and Docker to test our R packages across platforms in a portable way and show how this setup works for the CRAN package languageserversetup. Contents Many different tools, many di ..read more
Setting up R with Visual Studio Code quickly and easily with the languageserversetup package
4y ago
Introduction Over the past years, R has been gaining popularity, bringing to life new tools to work with it. Thanks to the amazing work by contributors implementing the Language Server Protocol for R and writing Visual Studio Code Extensions for R, the most popular development environment amongst developers across the world now has very strong support for R as well. In this post, we will look at the languageserversetup package that aims to make the setup of the R Language Server robust and easy to use by installing it into a separate, independent library and adjusting R startup in a way that ..read more
R is turning 20 years old next Saturday. Here is how much bigger, stronger and faster it got over the years
4y ago
Introduction It is almost the 29th of February 2020! A day that is very interesting for R, because it marks 20 years since the release of R v1.0.0, the first official public release of the R programming language. In this post, we will look back on the 20 years of R with a bit of history and 3 interesting perspectives: how much faster R got over the years, how many R packages have been released since 2000 and how the number of package downloads grew. Contents The first release of R, 29th February 2000 Further down in history, to 1977 Faster - How performant is R today versus 20 ye ..read more
Releasing and open-sourcing the Using Spark from R for performance with arbitrary code series
4y ago
Introduction Over the past months, we published and refined a series of posts on Using Spark from R for performance with arbitrary code. As the posts grew in size and scope, blog posts were no longer the best medium to share the content in the way most useful to readers, so we decided to compile a publication instead and open-source it for all readers to use freely. In this post, we present Using Spark from R for performance, an open-source online publication that will serve as a medium to communicate the current and future installments of the series comprehensively, including in ..read more
4 great free tools that can make your R work more efficient, reproducible and robust
4y ago
Introduction It is Christmas time again! And just like last year, what better time than this to write about the great tools that are available to all interested in working with R. This post is meant as praise for a few selected tools and packages that helped me to be more efficient and productive with R in 2019. In this post, we will praise free tools that can help your work become more efficient, reproducible and productive, namely the data.table package, the Rocker project for R-based Docker images, the base package parallel, and the R-Hub service for package checking. Contents data.ta ..read more
