整理一下SparkSql DataSet Api的使用方式与心得cuiyaonan2000@163.com
参考资料:
- Dataset (Spark 3.2.1 JavaDoc)Dataset (Spark 3.2.1 JavaDoc)
- Getting Started - Spark 3.2.1 Documentation -----官网的用例
- 【SparkSQL笔记】SparkSQL的Dataset操作大全(二)_sdut菜鸟的博客-CSDN博客
在使用sparksql的时候,应用会返还给我们DataSet 来处理查询出来的数据.故此需要了解DataSet的操作方法,以应对业务的各种无理要求cuiyaonan2000@163.com
操作大致如下所示,同时可以看到还可以创建DataFrame,但是返回的是DataSet,说明我们直接研究DataSet的Api方向是正确的.----基于Hive.
总的来说跟我们使用JavaStream的方法,或者像是Flink的算子一样.没有必要一次性全都了解,可以根据需要主动去找就行了,cuiyaonan2000@163.com
show 返回值 方法头 说明void
show()
Displays the top 20 rows of Dataset in a tabular form.
返回默认20条记录void
show(boolean truncate)
Displays the top 20 rows of Dataset in a tabular form.
如果一个字段的字符数量超过20,则后面的就不显示void
show(int numRows)
Displays the Dataset in a tabular form.
设置显示的记录数void
show(int numRows, boolean truncate)
Displays the Dataset in a tabular form.
设置显示的记录数,如果超过20个字符是否需要截取void
show(int numRows, int truncate)
Displays the Dataset in a tabular form.
设置显示的记录数,每个字段最多能显示多少个字符void
show(int numRows, int truncate, boolean vertical)
Displays the Dataset in a tabular form.
collectAsList将DataSet 以List的方式返回. 因为DataSet是分布式的数据集合.转换成List就是将数据进行收集到一台服务器上,所以次方法的使用要注意cuiyaonan2000@163com
describe显示指定的列,无果没有指定列这返回所有列,
Dataset
describe(scala.collection.Seq cols)
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.
Dataset
describe(String... cols)
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.
该方法比较有意思,就是返回指定字段的统计信息:统计信息包含count,mean,stddev,min,max
统计信息的含义如下所示:
- Count:记录条数
- Mean:平均值
- Stddev:样本标准差
- Min:最小值
- Max:最大值
因为是少量Row的返回,所以住下返回的内容.
- first:获取第一行记录
- head:获取第一行记录
- head(int n)获取前n行记录,返回的是Array
- take(int n):获取前n行记录,返回的是Array
- takeAsList(int n):获取前n行数据,并以List的形式展现
java 用例如下所示:
// first
Row first = studentDataset.first();
// head
Row head = studentDataset.head();
// head(2)
Row[] heads = studentDataset.head(2);
// take(2)
Row[] take = studentDataset.take(2);
// takeAsList(2)
List rows = studentDataset.takeAsList(2);
DataSet转换成Sql的操作方式
这些方法的名字跟Sql的作用大致一样.注意根据需求使用即可cuiyaonan2000@163.com
where 返回值 方法体注释Dataset
where(Column condition)
Filters rows using the given condition.
使用Spark的方法来拼接条件Dataset
where(String conditionExpr)
Filters rows using the given SQL expression.
类似传入Sql这种写法
示例:
// where(String conditionExpr)
Dataset wheredataset = studentDataset.where("age > 13 and sex = '男'");
wheredataset.show();
// where(Column condition)
Dataset whereDataset2 = studentDataset.where(studentDataset.col("age").gt(13).and(studentDataset.col("sex").equalTo("男")));
whereDataset2.show();
filter
类似于where ,同时多提供了2种实现.总共四种实现----scala的就pass了
Dataset
filter(Column condition)
Filters rows using the given condition.
Dataset
filter(FilterFunction func)
(Java-specific) Returns a new Dataset that only contains elements where func
returns true
.
Dataset
filter(scala.Function1 func)
(Scala-specific) Returns a new Dataset that only contains elements where func
returns true
.
Dataset
filter(String conditionExpr)
Filters rows using the given SQL expression.
示例:
// filter(String conditionExpr)
studentDataset.filter("age > 13 and sex = '男'").show();
// filter(Column condition)
studentDataset.filter(studentDataset.col("age").gt(13).and(studentDataset.col("sex").equalTo("男"))).show();
// filter(new FilterFunction(){...})
studentDataset.filter(new FilterFunction() {
@Override
public boolean call(Row value) throws Exception {
return (long)value.getAs("age") > 13 && value.getAs("sex").equals("男");
}
}).show();
select
选择指定的列,官网提供了多种方法,同时可以针对指定的列增加一些操作.
返回值方法体Dataset
select(Column... cols)
Selects a set of column based expressions.
Dataset
select(scala.collection.Seq cols)
Selects a set of column based expressions.
Dataset
select(String col, scala.collection.Seq cols)
Selects a set of columns.
Dataset
select(String col, String... cols)
Selects a set of columns.
Dataset
select(TypedColumn c1)
Returns a new Dataset by computing the given Column expression for each element.
Dataset
select(TypedColumn c1, TypedColumn c2)
Returns a new Dataset by computing the given Column expressions for each element.
Dataset
select(TypedColumn c1, TypedColumn c2, TypedColumn c3)
Returns a new Dataset by computing the given Column expressions for each element.
Dataset
select(TypedColumn c1, TypedColumn c2, TypedColumn c3, TypedColumn c4)
Returns a new Dataset by computing the given Column expressions for each element.
Dataset
select(TypedColumn c1, TypedColumn c2, TypedColumn c3, TypedColumn c4, TypedColumn c5)
Returns a new Dataset by computing the given Column expressions for each element.
示例:
// select(String col, String... cols)
studentDataset.select("name","age","sex").show(2);
// select(String col, scala.collection.Seq cols)
ArrayStack arraySeq = new ArrayStack();
arraySeq.push("sex");
arraySeq.push("age");
studentDataset.select("name",arraySeq).show(2);
// select(Column... cols) colum可以对字段做一些处理,例如给年龄+5 并取别名age+5
studentDataset.select(studentDataset.apply("name"),studentDataset.col("age").plus(5).as("age+5"),studentDataset.col("sex")).show(2);
// select(scala.collection.Seq cols)
ArrayStack arraySeqClumn = new ArrayStack();
arraySeqClumn.push(studentDataset.col("sex"));
arraySeqClumn.push(studentDataset.col("age"));
arraySeqClumn.push(studentDataset.col("name"));
studentDataset.select(arraySeqClumn).show(2);
selectExp
可以对指定字段进行特殊处理
Dataset
selectExpr(scala.collection.Seq exprs)
Selects a set of SQL expressions.
Dataset
selectExpr(String... exprs)
Selects a set of SQL expressions.
实例:
studentDataset.selectExpr("name","age","age+1 as otherAge","round(age)","sex as性别").show(2);
drop
这个跟select 作用相反,是去掉某一列.
Dataset
drop(Column col)
Returns a new Dataset with a column dropped.
Dataset
drop(scala.collection.Seq colNames)
Returns a new Dataset with columns dropped.
Dataset
drop(String... colNames)
Returns a new Dataset with columns dropped.
Dataset
drop(String colName)
Returns a new Dataset with a column dropped.
示例:
studentDataset.drop("age").show(2);
limit
limit方法获取指定Dataset的前n行记录,得到一个新的Dataset对象。
和take与head不同的是,limit方法不是Action操作,因为take,head均获得的均为Array(数组),而limit返回的是一个新的转化生成的Dataset对象
Dataset
limit(int n)
Returns a new Dataset by taking the first n
rows.
order by 和 sort 都是按照指定字段排序,默认为升序。并且使用方法相同,支持多字段排序。
studentDataset.sort(studentDataset.col("age").desc(),studentDataset.apply("name").desc()).show(5);
studentDataset.orderBy(studentDataset.col("age").desc(),studentDataset.apply("name").desc()).show(5);