PySpark - Using Spark with Python
This post summarizes some basic PySpark operations.
First, we introduce Apache Zeppelin, a web-based notebook that can act as a Spark shell.
A sample paragraph looks like this:
%livy.pyspark
print('Hello Zeppelin')
# HiveContext lives in pyspark.sql (it is deprecated since Spark 2.0 in favor of SparkSession)
from pyspark.sql import HiveContext
# Show databases
spark.sql("show databases").collect()
spark.sql("use dw")
# Show tables
spark.sql("show tables").collect()
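# Read a Hive table as a DataFrame object
# (get_date is not a Spark built-in; it is presumably a UDF registered in this environment)
hc_df = HiveContext(sc).sql('from table_name select * where dt = get_date(-1) limit 100')
hc_df.show()
# Select a single column from the DataFrame (assumes the table has a target_id column)
target_id_col = hc_df.select("target_id")
target_id_col.show()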
Note that because an RDD is partitioned across the cluster and evaluated lazily, operations on it do not behave like operations on ordinary Python objects. As far as I know, model training is usually done offline in plain Python rather than online in Spark. I may have a better idea about this once I learn more.