SparkR

Introduction

SparkR is very similar to PySpark. Like PySpark, you can launch a SparkR shell from the command line by running sparkR. This opens an R environment configured to work with an existing Spark cluster.
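For reference, here is a minimal sketch of what that setup looks like from an ordinary R process rather than the shell (the sparkR shell does this for you); the master URL and app name are placeholders, not anything from a real cluster.

```r
# Minimal sketch: start a SparkR session from a plain R process.
# The sparkR shell performs this setup automatically.
library(SparkR)

sparkR.session(
  master = "local[*]",          # placeholder; point this at your actual cluster master
  appName = "sparkr-sketch"     # arbitrary application name
)

# ... distributed work happens here ...

sparkR.session.stop()           # shut the session down when finished
```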

This section should be prefaced with the warning that the last time the author programmed in R, the Space Shuttle was still actively flying missions. Your best bet is probably just to read the guide here.

Maps and Reduces

That is all that really happens in Spark: maps, reduces, and the occasional shuffle. SparkR DataFrames appear to closely match R data frames where possible. One can use as.DataFrame or createDataFrame to create a Spark DataFrame, which has all the fancy features associated with Spark.
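As a rough sketch, using the faithful dataset that ships with R (nothing here depends on a particular cluster):

```r
library(SparkR)
sparkR.session()

# Turn a local R data.frame into a Spark DataFrame; createDataFrame(faithful)
# would do the same thing.
df <- as.DataFrame(faithful)

printSchema(df)   # Spark's view of the column types
head(df)          # first few rows, returned as a local R data.frame
```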

In contrast to PySpark, where you had pretty good control of mapping and reducing, SparkR bakes a lot more of it into its operations. Explicit partitioning appears to be supported but is very poorly documented.

Maps

spark.lapply is probably going to be the most familiar operation. It applies a function to each element of a list by distributing the workload out to the workers. It returns the results to the master machine, so they need to fit in that machine's memory or the program will fail. The syntax is spark.lapply(list, func).
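A minimal sketch; the squaring function is just a stand-in for real work:

```r
library(SparkR)
sparkR.session()

# Distribute a function over a list; the results come back as a local R list,
# so they have to fit in the master machine's memory.
squares <- spark.lapply(1:10, function(x) x * x)
unlist(squares)   # 1 4 9 16 25 36 49 64 81 100
```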

dapply works on partitions and keeps the data partitioned. This is likely the function you will want if you need to transform your data without doing any sort of aggregation; note that it requires you to declare the schema of the result. You can get the results back with dapplyCollect, at which point they need to fit in memory on a single machine.
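A sketch of both on the faithful data; the waiting_hours column is just an invented example:

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(faithful)

# dapply transforms each partition; the schema of the returned rows must be declared.
schema <- structType(
  structField("eruptions", "double"),
  structField("waiting", "double"),
  structField("waiting_hours", "double")
)
transformed <- dapply(df, function(part) {
  # part is an ordinary R data.frame holding one partition
  part$waiting_hours <- part$waiting / 60
  part
}, schema)

# dapplyCollect does the same but returns a local R data.frame on the master,
# so the whole result has to fit in that machine's memory.
local_result <- dapplyCollect(df, function(part) {
  part$waiting_hours <- part$waiting / 60
  part
})
head(local_result)
```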

gapply applies a function to each group. This can work similarly to reducing by key, where the column provided for grouping is used as the key. Like dapply, there is a gapplyCollect function that collects the results on the master node, and after a reduction the results need to fit in memory. So long as there is not a huge number of groups, this is much more feasible than collecting the results after a dapply, which works on partitions. It is possible to use gapply without aggregating, but that job is better suited to dapply, which works on partitions rather than groups and so should be more efficient.
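A sketch using mtcars, grouping on cyl (an arbitrary choice) and taking a per-group mean:

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(mtcars)

# gapply applies the function once per group; key holds the grouping value
# and the schema describes the rows the function returns.
schema <- structType(
  structField("cyl", "double"),
  structField("avg_mpg", "double")
)
avg_by_cyl <- gapply(df, "cyl", function(key, x) {
  data.frame(key, avg_mpg = mean(x$mpg))
}, schema)

# gapplyCollect skips the schema and brings the result back to the master.
local_avg <- gapplyCollect(df, "cyl", function(key, x) {
  y <- data.frame(key, mean(x$mpg))
  colnames(y) <- c("cyl", "avg_mpg")
  y
})
local_avg
```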

Reduces

Reduces in SparkR don't have the same fold options that existed in Python. Instead you would use a function like agg or rollup. The former aggregates across the entire DataFrame; the latter creates the groups you specify (including hierarchical subtotals), which you then aggregate across with agg.
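A sketch of both on mtcars; the particular aggregates are arbitrary:

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(mtcars)

# agg across the whole DataFrame
head(agg(df, avg_mpg = avg(df$mpg), max_hp = max(df$hp)))

# rollup builds the grouped data (by cyl and gear, by cyl alone, plus a grand
# total), and agg then performs the aggregation over those groups.
head(agg(rollup(df, "cyl", "gear"), avg_mpg = avg(df$mpg)))
```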

Spark SQL with R

Like PySpark, you can use SQL on Spark DataFrames from R. Were I to work with a lot of R and Spark, this might be my preferred path, as the R documentation is somewhat lacking. Just like with Python, we need to register views: createOrReplaceTempView(df, view) takes a DataFrame and a view name and registers that view with the Spark session. At that point we can use the sql function to run SQL queries against the view, which returns result sets as DataFrames.
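A sketch of the round trip; the view name is arbitrary:

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(faithful)

# Register the DataFrame as a temporary view, then query it with SQL.
createOrReplaceTempView(df, "faithful")

long_waits <- sql("SELECT waiting, eruptions FROM faithful WHERE waiting > 80")
head(long_waits)   # sql() returns its result set as a SparkDataFrame
```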

Review

SparkR exists. While I am not an expert in R, it seems somewhat less robust than the Scala, Java, or Python implementations. That said, it would certainly be valuable to a programmer whose workflow is in R and who needs to distribute work better.