Apache Spark SQL provides a Thrift JDBC/ODBC server mode that implements the HiveServer2 protocol from Hive 0.13. See the Apache Spark documentation, *Distributed SQL Engine - Running the Thrift JDBC/ODBC server*, for details.
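For reference, the Thrift server ships with the Spark binary distribution and can be started with the bundled script (the path assumes a standard distribution; port 10000 is the HiveServer2 default and matches the connection example below):

```shell
# Run from the Spark home directory; 10000 is the default HiveServer2 Thrift port.
./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10000
```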
dplyrSparkSQL is an experimental project to build a Spark SQL backend for dplyr.
- Download the prebuilt binary manually from https://spark.apache.org/downloads.html.
- Check out https://github.com/wush978/dplyrSparkSQL to `dplyrSparkSQL`.
- Extract the `.tgz` file and copy the jars in `<spark-home>/lib` to `dplyrSparkSQL/inst/drv`.
- Install the package from `dplyrSparkSQL`.
Alternatively, install via devtools:

```r
library(devtools)
install_github("bridgewell/dplyrSparkSQL")
```

If you install the package this way, dplyrSparkSQL will try to download the Spark binaries automatically to retrieve the driver.
```r
src <- src_spark_sql(host = "localhost", port = "10000", user = Sys.info()["nodename"])
```

Please change the host, port, and user accordingly.
The following command creates the table `people` from JSON:
```r
db_create_table(src, "people", stored_as = "JSON", temporary = TRUE,
                location = sprintf("file://%s",
                                   system.file(file.path("examples", "people.json"),
                                               package = "dplyrSparkSQL")))
```

Note that `src` here connects to Spark in local mode, so it can access files on the local file system. When connecting to a real Spark cluster, the location should be a directory or a file on HDFS.
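As a sketch of the cluster case (the HDFS path below is purely illustrative, not a path shipped with the package):

```r
# Hypothetical example: with src pointing at a real Spark cluster, pass an
# HDFS location instead of a file:// URI. The path below is made up.
db_create_table(src, "people", stored_as = "JSON", temporary = TRUE,
                location = "hdfs:///path/to/people.json")
```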
The following command creates a table `users` from Parquet:
```r
db_create_table(src, "users", stored_as = "PARQUET", temporary = TRUE,
                location = sprintf("file://%s",
                                   system.file("examples/users.parquet",
                                               package = "dplyrSparkSQL")))
```

dplyr obtains the tbl objects via:

```r
people <- tbl(src, "people")
users <- tbl(src, "users")
```

We can then apply dplyr verbs to `people` and `users`:
```r
people
nrow(people)
filter(people, age < 20)
select(users, name, favorite_color)
mutate(users, test_column = 1)
mutate(users, test_column = 1) %>% collect
```
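As with other dplyr database backends, verbs on a remote source compose lazily, and `collect()` pulls the result into a local data frame. A sketch building on the tables created above:

```r
# Compose verbs; the query runs on the Spark SQL server only when collected.
young_people <- people %>%
  filter(age < 20) %>%
  select(name)

local_df <- collect(young_people)  # a local data frame
```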