Spark SQL编程案例2_电商用户评论数据集分析
本案例的数据集包含电子产品类别的大约169万条Amazon评论。我们可以直接读取JSON数据集来创建Spark SQLDataFrame。
【例】从JSON文件中读取一组订单记录进行分析。
实现代码如下:
def main(args: Array[String]): Unit = {
import org.apache.spark.sql.SparkSession
// 创建SparkSession的实例
val spark = SparkSession.builder()
.master("local[*]")
.appName("Spark Basic Example")
.getOrCreate()
val sc = spark.sparkContext
// 数据集路径
val filePath = "src/main/resources/amazon/Electronics_5.json"
// 加载数据文件,创建DataFrame
val reviewsDF = spark.read.json(filePath).cache()
// 简单查看
reviewsDF.printSchema()
// reviewsDF.show(5)
// 创建临时视图
reviewsDF.createOrReplaceTempView("reviewsTable")
// 执行SQL查询,找出给出综合评分(overall)大于3的评论
val sqlStr =
"""
SELECT asin, overall, reviewTime, reviewerID, reviewerName
FROM reviewsTable
WHERE overall >= 3
"""
val selectedDF = spark.sql(sqlStr)
selectedDF.show(5)
import spark.implicits._
// 提取数组元素
val selectedJSONArrayElementDF = reviewsDF
.select($"asin", $"overall", $"helpful")
.where($"helpful".getItem(0) < 3)
selectedJSONArrayElementDF.show(5)
}
执行以上代码,输出结果如下:
root |-- asin: string (nullable = true) |-- helpful: array (nullable = true) | |-- element: long (containsNull = true) |-- overall: double (nullable = true) |-- reviewText: string (nullable = true) |-- reviewTime: string (nullable = true) |-- reviewerID: string (nullable = true) |-- reviewerName: string (nullable = true) |-- summary: string (nullable = true) |-- unixReviewTime: long (nullable = true) +----------+-------+-----------+--------------+--------------------+ | asin|overall| reviewTime| reviewerID| reviewerName| +----------+-------+-----------+--------------+--------------------+ |0528881469| 5.0| 06 2, 2013| AO94DHGC771SJ| amazdnu| |0528881469| 3.0| 09 9, 2010|A3N7T0DY83Y4IG| C. A. Freeman| |0594451647| 5.0| 01 3, 2014|A2JXAZZI9PHK9Z|Billy G. Noland "...| |0594451647| 5.0| 05 4, 2014| AAZ084UMH8VZ2|D. L. Brown "A Kn...| |0594451647| 4.0|07 11, 2014| AEZ3CR6BKIROJ| Mark Dietter| +----------+-------+-----------+--------------+--------------------+ only showing top 5 rows +----------+-------+-------+ | asin|overall|helpful| +----------+-------+-------+ |0528881469| 5.0| [0, 0]| |0528881469| 1.0| [0, 0]| |0594451647| 2.0| [0, 0]| |0594451647| 5.0| [0, 0]| |0594451647| 4.0| [0, 0]| +----------+-------+-------+ only showing top 5 rows