Published: 2022-11-13

Initial Setup

Set the table name, the base path, and a data generator that produces the records used in the examples, as follows:

// spark-shell
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

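// Table name, storage location of the table, and the quickstart data generator
val tableName = "hudi_trips_cow"
val basePath = "hdfs://xueai8:8020/hudi/hudi_trips_cow"
val dataGen = new DataGenerator

// Print two generated records as JSON to check the generator
convertToStringList(dataGen.generateInserts(2)).foreach(println)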

Running the code above shows that Hudi's data generator produces two JSON records in the following format:

{"ts": 1647196090688, "uuid": "421b6078-2d2c-4e23-a6f5-b64713bdf81d", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.7340133901254792, "begin_lon": 0.5142184937933181, "end_lat": 0.7814655558162802, "end_lon": 0.6592596683641996, "fare": 49.527694252432056, "partitionpath": "asia/india/chennai"}
{"ts": 1646976254550, "uuid": "30d0b36f-ca4b-43ad-abb0-71de287ae259", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.1593867607188556, "begin_lon": 0.010872312870502165, "end_lat": 0.9808530350038475, "end_lon": 0.7963756520507014, "fare": 29.47661370147079, "partitionpath": "americas/united_states/san_francisco"}
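To inspect records like these as a table rather than as raw strings, they can be loaded into a Spark DataFrame. The following is a minimal sketch for spark-shell, assuming the Hudi Spark bundle is on the classpath; the names sampleGen, sampleJson, and sampleDf are introduced here for illustration only, and the schema is inferred by Spark from the JSON.

```scala
// spark-shell (the `spark` session is provided by the shell)
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._

// A separate generator instance, used only for this inspection (illustrative name)
val sampleGen = new DataGenerator

// Generate 10 sample records and convert them to JSON strings
val sampleJson = convertToStringList(sampleGen.generateInserts(10))

// Parse the JSON strings into a DataFrame; Spark infers the column types
val sampleDf = spark.read.json(spark.sparkContext.parallelize(sampleJson, 2))

// Show the inferred schema and a few columns of the first rows
sampleDf.printSchema()
sampleDf.select("uuid", "rider", "driver", "fare", "partitionpath").show(5, false)
```

The partition count of 2 passed to parallelize is an arbitrary choice for a small sample.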

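The sample data produced by DataGenerator contains fields of the following types (the JSON output above additionally carries a partitionpath string):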

{"name": "ts", "type": "long"},
{"name": "uuid", "type": "string"},
{"name": "rider", "type": "string"},
{"name": "driver", "type": "string"},
{"name": "begin_lat", "type": "double"},
{"name": "begin_lon", "type": "double"},
{"name": "end_lat", "type": "double"},
{"name": "end_lon", "type": "double"},
{"name": "fare", "type": "double"}
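The field list above can also be written as an explicit Spark schema, which avoids relying on type inference when reading the generated JSON. This is a sketch under stated assumptions rather than part of the Hudi API: tripSchema, typedGen, typedJson, and typedDf are names introduced here, and the partitionpath column is added because it appears in the JSON output shown earlier, not in the field list.

```scala
// spark-shell (the `spark` session is provided by the shell)
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructField, StructType}

// Explicit schema mirroring the field list (illustrative name)
val tripSchema = StructType(Seq(
  StructField("ts", LongType),
  StructField("uuid", StringType),
  StructField("rider", StringType),
  StructField("driver", StringType),
  StructField("begin_lat", DoubleType),
  StructField("begin_lon", DoubleType),
  StructField("end_lat", DoubleType),
  StructField("end_lon", DoubleType),
  StructField("fare", DoubleType),
  // Assumption: taken from the sample JSON output, not from the field list
  StructField("partitionpath", StringType)
))

// Generate a few records and read them with the explicit schema
val typedGen = new DataGenerator
val typedJson = convertToStringList(typedGen.generateInserts(5))
val typedDf = spark.read.schema(tripSchema).json(spark.sparkContext.parallelize(typedJson, 2))

typedDf.printSchema()
```

With an explicit schema, ts is read as a long and the coordinate and fare columns as doubles regardless of the values in a particular sample.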

Hudi supports inserting, updating, and deleting data in a Hudi dataset through Spark.