SparkSql DataSet Api

序言

整理一下SparkSql DataSet Api的使用方式与心得cuiyaonan2000@163.com

参考资料:

Dataset (Spark 3.2.1 JavaDoc)Dataset (Spark 3.2.1 JavaDoc)
Getting Started - Spark 3.2.1 Documentation -----官网的用例
【SparkSQL笔记】SparkSQL的Dataset操作大全（二）_sdut菜鸟的博客-CSDN博客

背景

在使用sparksql的时候,应用会返还给我们DataSet 来处理查询出来的数据.故此需要了解DataSet的操作方法,以应对业务的各种无理要求cuiyaonan2000@163.com

操作大致如下所示,同时可以看到还可以创建DataFrame,但是返回的是DataSet,说明我们直接研究DataSet的Api方向是正确的.----基于Hive.

Shall We

总的来说跟我们使用JavaStream的方法,或者像是Flink的算子一样.没有必要一次性全都了解,可以根据需要主动去找就行了,cuiyaonan2000@163.com

show 返回值方法头说明voidshow()

Displays the top 20 rows of Dataset in a tabular form.

返回默认20条记录voidshow(boolean truncate)

Displays the top 20 rows of Dataset in a tabular form.

如果一个字段的字符数量超过20,则后面的就不显示voidshow(int numRows)

Displays the Dataset in a tabular form.

设置显示的记录数voidshow(int numRows, boolean truncate)

Displays the Dataset in a tabular form.

设置显示的记录数,如果超过20个字符是否需要截取voidshow(int numRows, int truncate)

Displays the Dataset in a tabular form.

设置显示的记录数,每个字段最多能显示多少个字符voidshow(int numRows, int truncate, boolean vertical)

Displays the Dataset in a tabular form.

collectAsList

将DataSet 以List的方式返回. 因为DataSet是分布式的数据集合.转换成List就是将数据进行收集到一台服务器上,所以次方法的使用要注意cuiyaonan2000@163com

describe

显示指定的列,无果没有指定列这返回所有列,

Datasetdescribe(scala.collection.Seq cols)

Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.

Datasetdescribe(String... cols)

Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.

该方法比较有意思,就是返回指定字段的统计信息:统计信息包含count,mean,stddev,min,max

统计信息的含义如下所示:

Count：记录条数
Mean：平均值
Stddev：样本标准差
Min：最小值
Max：最大值

获取指定行数的Row

因为是少量Row的返回,所以住下返回的内容.

first：获取第一行记录
head:获取第一行记录
head（int n）获取前n行记录，返回的是Array
take(int n)：获取前n行记录，返回的是Array
takeAsList(int n)：获取前n行数据，并以List的形式展现

java 用例如下所示:

// first
Row first = studentDataset.first();
// head
Row head = studentDataset.head();
// head(2)
Row[] heads = studentDataset.head(2);
// take(2)
Row[] take = studentDataset.take(2);
// takeAsList(2)
List rows = studentDataset.takeAsList(2);

DataSet转换成Sql的操作方式

这些方法的名字跟Sql的作用大致一样.注意根据需求使用即可cuiyaonan2000@163.com

where 返回值方法体注释Datasetwhere(Column condition)

Filters rows using the given condition.

使用Spark的方法来拼接条件Datasetwhere(String conditionExpr)

Filters rows using the given SQL expression.

类似传入Sql这种写法

示例:

// where(String conditionExpr)
Dataset wheredataset = studentDataset.where("age > 13 and sex = '男'");
wheredataset.show();

// where(Column condition)
Dataset whereDataset2 = studentDataset.where(studentDataset.col("age").gt(13).and(studentDataset.col("sex").equalTo("男")));
whereDataset2.show();

filter

类似于where ,同时多提供了2种实现.总共四种实现----scala的就pass了

Datasetfilter(Column condition)

Filters rows using the given condition.

Datasetfilter(FilterFunction func)

(Java-specific) Returns a new Dataset that only contains elements where func returns true.

Datasetfilter(scala.Function1 func)

(Scala-specific) Returns a new Dataset that only contains elements where func returns true.

Datasetfilter(String conditionExpr)

Filters rows using the given SQL expression.

示例:

// filter(String conditionExpr)
studentDataset.filter("age > 13 and sex = '男'").show();


// filter(Column condition)
studentDataset.filter(studentDataset.col("age").gt(13).and(studentDataset.col("sex").equalTo("男"))).show();


// filter(new FilterFunction(){...})
studentDataset.filter(new FilterFunction() {
    @Override
    public boolean call(Row value) throws Exception {
        return (long)value.getAs("age") > 13 && value.getAs("sex").equals("男");
    }
}).show();

select

选择指定的列,官网提供了多种方法,同时可以针对指定的列增加一些操作.

返回值方法体Datasetselect(Column... cols)

Selects a set of column based expressions.

Datasetselect(scala.collection.Seq cols)

Selects a set of column based expressions.

Datasetselect(String col, scala.collection.Seq cols)

Selects a set of columns.

Datasetselect(String col, String... cols)

Selects a set of columns.

Datasetselect(TypedColumn c1)

Returns a new Dataset by computing the given Column expression for each element.

Datasetselect(TypedColumn c1, TypedColumn c2)

Returns a new Dataset by computing the given Column expressions for each element.

Datasetselect(TypedColumn c1, TypedColumn c2, TypedColumn c3)

Returns a new Dataset by computing the given Column expressions for each element.

Datasetselect(TypedColumn c1, TypedColumn c2, TypedColumn c3, TypedColumn c4)

Returns a new Dataset by computing the given Column expressions for each element.

Datasetselect(TypedColumn c1, TypedColumn c2, TypedColumn c3, TypedColumn c4, TypedColumn c5)

Returns a new Dataset by computing the given Column expressions for each element.

示例:

// select(String col, String... cols)
studentDataset.select("name","age","sex").show(2);


// select(String col, scala.collection.Seq cols)
ArrayStack arraySeq = new ArrayStack();
arraySeq.push("sex");
arraySeq.push("age");
studentDataset.select("name",arraySeq).show(2);

// select(Column... cols) colum可以对字段做一些处理，例如给年龄+5 并取别名age+5
studentDataset.select(studentDataset.apply("name"),studentDataset.col("age").plus(5).as("age+5"),studentDataset.col("sex")).show(2);


// select(scala.collection.Seq cols)
ArrayStack arraySeqClumn = new ArrayStack();
arraySeqClumn.push(studentDataset.col("sex"));
arraySeqClumn.push(studentDataset.col("age"));
arraySeqClumn.push(studentDataset.col("name"));
studentDataset.select(arraySeqClumn).show(2);

selectExp

可以对指定字段进行特殊处理

DatasetselectExpr(scala.collection.Seq exprs)

Selects a set of SQL expressions.

DatasetselectExpr(String... exprs)

Selects a set of SQL expressions.

实例:

studentDataset.selectExpr("name","age","age+1 as otherAge","round(age)","sex as性别").show(2);

drop

这个跟select 作用相反,是去掉某一列.

Datasetdrop(Column col)

Returns a new Dataset with a column dropped.

Datasetdrop(scala.collection.Seq colNames)

Returns a new Dataset with columns dropped.

Datasetdrop(String... colNames)

Returns a new Dataset with columns dropped.

Datasetdrop(String colName)

Returns a new Dataset with a column dropped.

示例:

studentDataset.drop("age").show(2);

limit

limit方法获取指定Dataset的前n行记录，得到一个新的Dataset对象。

和take与head不同的是，limit方法不是Action操作，因为take，head均获得的均为Array（数组），而limit返回的是一个新的转化生成的Dataset对象

Datasetlimit(int n)

Returns a new Dataset by taking the first n rows.

order by 和 sort

order by 和 sort 都是按照指定字段排序，默认为升序。并且使用方法相同，支持多字段排序。

studentDataset.sort(studentDataset.col("age").desc(),studentDataset.apply("name").desc()).show(5);
studentDataset.orderBy(studentDataset.col("age").desc(),studentDataset.apply("name").desc()).show(5);

SparkSql DataSet Api

[ 申请 ]友情链接：