Spark SQL编程案例2_电商用户评论数据集分析
本案例的数据集包含电子产品类别的大约169万条Amazon评论。我们可以直接读取JSON数据集来创建Spark SQLDataFrame。
【例】从JSON文件中读取一组订单记录进行分析。
实现代码如下:
def main(args: Array[String]): Unit = {
    import org.apache.spark.sql.SparkSession
    // 创建SparkSession的实例
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark Basic Example")
      .getOrCreate()
    val sc = spark.sparkContext
    // 数据集路径
    val filePath = "src/main/resources/amazon/Electronics_5.json"
    // 加载数据文件,创建DataFrame
    val reviewsDF = spark.read.json(filePath).cache()
    // 简单查看
    reviewsDF.printSchema()
    // reviewsDF.show(5)
    // 创建临时视图
    reviewsDF.createOrReplaceTempView("reviewsTable")
    // 执行SQL查询,找出给出综合评分(overall)大于3的评论
    val sqlStr =
      """
        SELECT asin, overall, reviewTime, reviewerID, reviewerName
        FROM reviewsTable
        WHERE overall >= 3
      """
    val selectedDF = spark.sql(sqlStr)
    selectedDF.show(5)
    import spark.implicits._
    // 提取数组元素
    val selectedJSONArrayElementDF = reviewsDF
      .select($"asin", $"overall", $"helpful")
      .where($"helpful".getItem(0) < 3)
    selectedJSONArrayElementDF.show(5)
  }
执行以上代码,输出结果如下:
root |-- asin: string (nullable = true) |-- helpful: array (nullable = true) | |-- element: long (containsNull = true) |-- overall: double (nullable = true) |-- reviewText: string (nullable = true) |-- reviewTime: string (nullable = true) |-- reviewerID: string (nullable = true) |-- reviewerName: string (nullable = true) |-- summary: string (nullable = true) |-- unixReviewTime: long (nullable = true) +----------+-------+-----------+--------------+--------------------+ | asin|overall| reviewTime| reviewerID| reviewerName| +----------+-------+-----------+--------------+--------------------+ |0528881469| 5.0| 06 2, 2013| AO94DHGC771SJ| amazdnu| |0528881469| 3.0| 09 9, 2010|A3N7T0DY83Y4IG| C. A. Freeman| |0594451647| 5.0| 01 3, 2014|A2JXAZZI9PHK9Z|Billy G. Noland "...| |0594451647| 5.0| 05 4, 2014| AAZ084UMH8VZ2|D. L. Brown "A Kn...| |0594451647| 4.0|07 11, 2014| AEZ3CR6BKIROJ| Mark Dietter| +----------+-------+-----------+--------------+--------------------+ only showing top 5 rows +----------+-------+-------+ | asin|overall|helpful| +----------+-------+-------+ |0528881469| 5.0| [0, 0]| |0528881469| 1.0| [0, 0]| |0594451647| 2.0| [0, 0]| |0594451647| 5.0| [0, 0]| |0594451647| 4.0| [0, 0]| +----------+-------+-------+ only showing top 5 rows