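PySpark RDD Programming Example: The Top N Problem
In this section we apply what we have learned so far to implement a few common algorithmic scenarios.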
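[Example] Given a list of employee records, find the N highest-paid employees (the Top N problem). Because the sample data below contains only seven records, the code uses N = 3.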
The sample data file employees.csv contains the following:
ename,title,department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
张三,paramedic i/c,fire,f,salary,,91080.00,
李四,lieutenant,fire,f,salary,,114846.00,
王老五,sergeant,police,f,salary,,104628.00,
赵六,police officer,police,f,salary,,96060.00,
钱七,clerk iii,police,f,salary,,53076.00,
周扒皮,firefighter,fire,f,salary,,87006.00,
吴用,law clerk,law,f,hourly,35,,14.51
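To make the column layout concrete, here is a small pure-Python sketch (no Spark required) that splits two of the sample rows and reads the Annual Salary field at index 6. The helper name `annual_salary` is an assumption introduced for illustration. Note that hourly employees have an empty Annual Salary field, so the sketch treats it as 0.0.

```python
# Two rows copied from the sample data above.
salaried = "李四,lieutenant,fire,f,salary,,114846.00,"
hourly = "吴用,law clerk,law,f,hourly,35,,14.51"


def annual_salary(line):
    """Return the Annual Salary column (index 6) as a float, or 0.0 if it is empty."""
    fields = line.split(",")
    return float(fields[6]) if fields[6] else 0.0


print(len(salaried.split(",")))  # 8 (the trailing comma yields an empty last field)
print(annual_salary(salaried))   # 114846.0
print(annual_salary(hourly))     # 0.0
```

A plain `split(",")` is enough here because no field in the sample contains a quoted comma; real-world CSV files may need a proper CSV parser.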
The implementation is as follows.
from pyspark.sql import SparkSession

# Build the SparkSession and SparkContext instances
spark = SparkSession.builder \
    .master("spark://xueai8:7077") \
    .appName("pyspark demo") \
    .getOrCreate()
sc = spark.sparkContext

# Build the RDD from the CSV file
inputPath = "file:///home/hduser/data/spark/employees.csv"
rdd = sc.textFile(inputPath)

# Sort key: the Annual Salary column (index 6), or 0.0 when it is empty
def sortFun(arr):
    if len(arr[6]) > 0:
        return float(arr[6])
    else:
        return 0.0

# Pipeline: drop the header line, split each line into fields,
# then sort by salary in descending order
sortedData = rdd \
    .filter(lambda line: not line.startswith("ename")) \
    .map(lambda line: line.split(",")) \
    .sortBy(sortFun, ascending=False)

# Take the top 3
top = sortedData.take(3)
for row in top:
    print(row)
Running the code above produces the following result:
['李四', 'lieutenant', 'fire', 'f', 'salary', '', '114846.00', '']
['王老五', 'sergeant', 'police', 'f', 'salary', '', '104628.00', '']
['赵六', 'police officer', 'police', 'f', 'salary', '', '96060.00', '']
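Sorting the entire RDD only to keep a few rows does more work than the problem strictly requires. As a hedged alternative, the sketch below uses `RDD.top(num, key=...)`, which returns the largest elements in descending order and selects candidates per partition rather than sorting the whole dataset. The function names `salary_key` and `top_n_by_salary` are assumptions introduced here; they are not part of the original listing. The function expects an RDD of raw CSV lines, such as the `rdd` built with `sc.textFile` above.

```python
def salary_key(fields):
    """Return the Annual Salary column (index 6) as a float, or 0.0 if missing or empty."""
    return float(fields[6]) if len(fields) > 6 and fields[6] else 0.0


def top_n_by_salary(lines_rdd, n):
    """Return the n highest-paid rows from an RDD of raw CSV lines.

    lines_rdd is assumed to be an RDD of strings, one CSV line each.
    """
    return (
        lines_rdd
        .filter(lambda line: not line.startswith("ename"))  # drop the header line
        .map(lambda line: line.split(","))                  # split into fields
        .top(n, key=salary_key)                             # n largest by salary
    )
```

With the `rdd` from the listing above, `top_n_by_salary(rdd, 3)` should return the same three rows as `sortBy(...).take(3)`. Like `take`, `top` collects its result on the driver, so it is only suitable when N is small.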