This post summarizes some basic operations in PySpark.
First, we introduce Apache Zeppelin, a web-based notebook that acts as a Spark shell.
A sample Zeppelin paragraph looks like this:
%livy.pyspark

print('Hello Zeppelin')

from pyspark.sql import HiveContext

# Show databases
spark.sql("show databases").collect()
spark.sql("use dw")

# Show tables
spark.sql("show tables").collect()

# Read a Hive table as a DataFrame object
hc_rdd = HiveContext(sc).sql('from table_name select * where dt = get_date(-1) limit 100')
hc_rdd.show()

# Read a column from a DataFrame
target_id_col = hc_rdd.select("target_id")
target_id_col.show()
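Once the table is loaded as a DataFrame, the usual DataFrame API applies. Below is a minimal sketch of a few common follow-up operations on the hc_rdd DataFrame from above; the column names used here (target_id, dt) are assumptions about the table schema and serve only as an illustration.

# Filter rows and keep selected columns (target_id, dt are assumed columns)
filtered = hc_rdd.filter(hc_rdd.target_id.isNotNull()).select("target_id", "dt")

# Aggregate: count rows per target_id
counts = filtered.groupBy("target_id").count()
counts.show(10)

# Bring a small result set back to the driver as Python Row objects
rows = counts.take(5)
for row in rows:
    print(row["target_id"], row["count"])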
Note that because of how RDDs are stored and distributed, operations on an RDD are not the same as operations on ordinary Python objects. As far as I know, the training phase is usually completed offline in Python rather than online with Spark. Maybe I will have a better idea about this as I learn more.
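To make the difference concrete, here is a small sketch (using sc, the SparkContext that Zeppelin provides) showing that RDD transformations are lazy and the data stays distributed until an action such as take() or count() brings results back to the driver; the numbers are made up for illustration.

# Transformations (map, filter) are lazy: nothing is computed yet
rdd = sc.parallelize(range(10))
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# Unlike a Python list, you cannot index or iterate the RDD directly;
# an action triggers the computation and returns results to the driver
print(evens.take(3))      # e.g. [0, 4, 16]
print(evens.count())      # number of elements that passed the filter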