Collecting data with streams

2019-07-23

Suppose you have a list of transactions and want to group them by currency. Before Java 8, even a simple task like this was cumbersome to implement, as shown below:

Map<Currency, List<Transaction>> transactionsByCurrencies = new HashMap<>();
for (Transaction transaction : transactions) {
    Currency currency = transaction.getCurrency();
    List<Transaction> transactionsForCurrency = transactionsByCurrencies.get(currency);
    if (transactionsForCurrency == null) {
        transactionsForCurrency = new ArrayList<>();
        transactionsByCurrencies.put(currency, transactionsForCurrency);
    }
    transactionsForCurrency.add(transaction);
}

From Java 8 on, a single statement produces exactly the same result:

Map<Currency, List<Transaction>> transactionsByCurrencies = transactions.stream().collect(groupingBy(Transaction::getCurrency));

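The Collector interface, through the factory methods of the Collectors class, offers three main capabilities:

Reducing and summarizing the elements of a stream to a single value
Grouping elements
Partitioning elements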

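Reducing and summarizing

Collectors (the arguments to the collect method) are typically used when the elements of a stream need to be reorganized into a collection. More generally, they can be used whenever you want to combine all the elements of a stream into a single result. For example, to count the dishes in a menu: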

long howManyDishes = menu.stream().collect(Collectors.counting());

This can also be written as:

long howManyDishes = menu.stream().count();

Finding the maximum and minimum

You can use Collectors.maxBy and Collectors.minBy to compute the maximum and minimum elements of a stream. Both take a Comparator as their argument:

Comparator<Dish> dishCaloriesComparator = Comparator.comparingInt(Dish::getCalories);
Optional<Dish> mostCalorieDish = menu.stream().collect(maxBy(dishCaloriesComparator));

Summarization

The Collectors class provides a dedicated factory method for summing, Collectors.summingInt. It takes a ToIntFunction as its argument and returns a Collector:

int totalCalories = menu.stream().collect(summingInt(Dish::getCalories));

Collectors.summingLong and Collectors.summingDouble work the same way for long and double values. Collectors.averagingInt, averagingLong, and averagingDouble compute the average of a set of values:

double avgCalories = menu.stream().collect(averagingInt(Dish::getCalories));

When you want several statistics at once, use Collectors.summarizingInt. This collector gathers the count, sum, minimum, average, and maximum into an IntSummaryStatistics object:

IntSummaryStatistics menuStatistics = menu.stream().collect(summarizingInt(Dish::getCalories));
// IntSummaryStatistics{count=9, sum=4300, min=120, average=477.777778, max=800}

Likewise, summarizingLong and summarizingDouble produce the corresponding LongSummaryStatistics and DoubleSummaryStatistics objects.
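
The snippets in this article rely on a Dish class and a menu list that are never shown. The following is a minimal sketch of what they might look like. The class name MenuStatisticsSketch, the simplified Dish, and the sample data are assumptions; the calorie values were chosen only so that the totals agree with the IntSummaryStatistics output printed above.

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.stream.Collectors;

class MenuStatisticsSketch {
    // Hypothetical stand-in for the Dish class the article assumes.
    static final class Dish {
        private final String name;
        private final int calories;

        Dish(String name, int calories) {
            this.name = name;
            this.calories = calories;
        }

        String getName() { return name; }
        int getCalories() { return calories; }
    }

    // Assumed sample menu: nine dishes whose calories add up to 4300.
    static final List<Dish> MENU = Arrays.asList(
        new Dish("pork", 800), new Dish("beef", 700), new Dish("chicken", 400),
        new Dish("french fries", 530), new Dish("rice", 350), new Dish("season fruit", 120),
        new Dish("pizza", 550), new Dish("prawns", 400), new Dish("salmon", 450));

    // Collects count, sum, min, average and max of the calories in one pass.
    static IntSummaryStatistics summarize(List<Dish> menu) {
        return menu.stream().collect(Collectors.summarizingInt(Dish::getCalories));
    }

    public static void main(String[] args) {
        IntSummaryStatistics stats = summarize(MENU);
        System.out.println("count = " + stats.getCount());
        System.out.println("sum   = " + stats.getSum());
        System.out.println("min   = " + stats.getMin());
        System.out.println("max   = " + stats.getMax());
    }
}
```

The individual values are read back through the getters of IntSummaryStatistics, so a single pass over the stream replaces separate counting, summingInt, minBy, and maxBy collections.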

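Joining strings

The collector returned by Collectors.joining concatenates all the elements of a stream into a single string. It only accepts a stream of CharSequence, which is why the dishes are mapped to their names first: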

String shortMenu = menu.stream().map(Dish::getName).collect(joining());

An overloaded version of Collectors.joining accepts a delimiter string that is placed between consecutive elements:

String shortMenu = menu.stream().map(Dish::getName).collect(joining(","));
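
Collectors.joining has a third overload that also takes a prefix and a suffix. The sketch below illustrates it on a plain list of names; the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JoiningSketch {
    // Joins the names with ", " and wraps the result in square brackets.
    static String describe(List<String> names) {
        return names.stream().collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(describe(Arrays.asList("pork", "beef", "chicken")));
        // prints [pork, beef, chicken]
    }
}
```

On an empty stream this overload still emits the prefix and suffix, so the result is "[]".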

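Generalized summarization with reduction

All the collectors discussed so far are, in fact, convenient specializations of the more general Collectors.reducing factory method. For example: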

int totalCalories = menu.stream().collect(reducing(0, Dish::getCalories, (i, j) -> i + j));
Optional<Dish> mostCalorieDish = menu.stream().collect(reducing((d1, d2) -> d1.getCalories() > d2.getCalories() ? d1 : d2));

For instance, the counting collector used earlier can be defined as follows:

public static <T> Collector<T, ?, Long> counting() {
    return reducing(0L, e -> 1L, Long::sum);
}

The same result can be obtained in several ways, for example:

int totalCalories = menu.stream().map(Dish::getCalories).reduce(Integer::sum).get();
int totalCalories = menu.stream().mapToInt(Dish::getCalories).sum();
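
The two forms of reducing used above behave differently on an empty stream: the three-argument form falls back to its identity value, while the one-argument form has no starting value and therefore returns an Optional. A small sketch on strings (the class and method names are illustrative only):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import static java.util.stream.Collectors.reducing;

class ReducingSketch {
    // Three-argument form: identity value, mapping function, combining operator.
    static int totalLength(List<String> words) {
        return words.stream().collect(reducing(0, String::length, Integer::sum));
    }

    // One-argument form: no identity, so the result is wrapped in an Optional.
    static Optional<String> longest(List<String> words) {
        return words.stream().collect(reducing((a, b) -> a.length() >= b.length() ? a : b));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "ccc", "bb");
        System.out.println(totalLength(words)); // prints 6
        System.out.println(longest(words));     // prints Optional[ccc]
    }
}
```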

Grouping

A common database operation is grouping data by one or more properties. In Java 8, the groupingBy collector achieves the same effect:

Map<Dish.Type, List<Dish>> dishesByType = menu.stream().collect(groupingBy(Dish::getType));

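You cannot always use a method reference as the classification function, though, because you may want to classify by something more complex than a simple property accessor. In that case, pass a lambda: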

public enum CaloricLevel {DIET, NORMAL, FAT}
Map<CaloricLevel, List<Dish>> dishesByCaloricLevel = menu.stream().collect(
    groupingBy(dish -> {
        if (dish.getCalories() <= 400) return CaloricLevel.DIET;
        else if (dish.getCalories() <= 700) return CaloricLevel.NORMAL;
        else return CaloricLevel.FAT;
    }));

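Manipulating grouped elements

Suppose you want to group dishes by type and keep only the dishes with more than 500 calories in each group. You might try filtering before grouping: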

Map<Dish.Type, List<Dish>> caloricDishesByType = menu.stream().filter(dish -> dish.getCalories() > 500).collect(groupingBy(Dish::getType));

Printing caloricDishesByType gives:

{OTHER=[french fries, pizza], MEAT=[pork, beef]}

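This approach has a problem: the FISH group has disappeared, because no fish dish passes the filter and so its key is never created. To solve this, Collectors.groupingBy has an overloaded version that takes a second Collector argument. Combined with Collectors.filtering (available since Java 9), it looks like this: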

Map<Dish.Type, List<Dish>> caloricDishesByType = menu.stream().collect(groupingBy(Dish::getType, filtering(dish -> dish.getCalories() > 500, toList())));

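filtering takes a predicate that filters the elements of each group, and a further Collector that regroups the elements that pass. The FISH key is now kept, mapped to an empty list: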

{OTHER=[french fries, pizza], MEAT=[pork, beef], FISH=[]}
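
The difference between the two approaches can be reproduced with a small self-contained sketch. The Dish class, the Type enum, and the four-dish menu below are simplified assumptions made for the example, and Collectors.filtering requires Java 9 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.filtering;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.toList;

class FilteringSketch {
    enum Type { MEAT, FISH, OTHER }

    // Hypothetical, simplified Dish; toString returns the name for readable output.
    static final class Dish {
        private final String name;
        private final int calories;
        private final Type type;

        Dish(String name, int calories, Type type) {
            this.name = name;
            this.calories = calories;
            this.type = type;
        }

        int getCalories() { return calories; }
        Type getType() { return type; }
        @Override public String toString() { return name; }
    }

    // Assumed sample data: the only fish dish is below the 500-calorie threshold.
    static final List<Dish> MENU = Arrays.asList(
        new Dish("pork", 800, Type.MEAT),
        new Dish("salmon", 450, Type.FISH),
        new Dish("pizza", 550, Type.OTHER),
        new Dish("rice", 350, Type.OTHER));

    // Filtering the stream first: groups with no surviving element never get a key.
    static Map<Type, List<Dish>> filterThenGroup(List<Dish> menu) {
        return menu.stream()
            .filter(d -> d.getCalories() > 500)
            .collect(groupingBy(Dish::getType));
    }

    // Filtering inside each group: every type seen in the stream keeps its key.
    static Map<Type, List<Dish>> groupThenFilter(List<Dish> menu) {
        return menu.stream()
            .collect(groupingBy(Dish::getType, filtering(d -> d.getCalories() > 500, toList())));
    }

    public static void main(String[] args) {
        System.out.println(filterThenGroup(MENU).containsKey(Type.FISH)); // prints false
        System.out.println(groupThenFilter(MENU).get(Type.FISH));         // prints []
    }
}
```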

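Another common way to manipulate grouped elements is to transform them with a mapping function. Collectors.mapping takes a Function and another Collector that gathers the mapped elements of each group: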

Map<Dish.Type, List<String>> dishNamesByType = menu.stream().collect(groupingBy(Dish::getType, mapping(Dish::getName, toList())));

groupingBy can also be combined with flatMapping (also added in Java 9). Suppose we have the following data:

Map<String, List<String>> dishTags = new HashMap<>();
dishTags.put("pork", asList("greasy", "salty"));
dishTags.put("beef", asList("salty", "roasted"));
dishTags.put("chicken", asList("fried", "crisp"));
dishTags.put("french fries", asList("greasy", "fried"));
dishTags.put("rice", asList("light", "natural"));
dishTags.put("season fruit", asList("fresh", "natural"));
dishTags.put("pizza", asList("tasty", "salty"));
dishTags.put("prawns", asList("tasty", "roasted"));
dishTags.put("salmon", asList("delicious", "fresh"));

To collect the set of tags for each type of dish, flatten each dish's tag list inside its group:

Map<Dish.Type, Set<String>> dishNamesByType = menu.stream().collect(groupingBy(Dish::getType, flatMapping(dish -> dishTags.get(dish.getName()).stream(), toSet())));

Multilevel grouping

The two-argument version of Collectors.groupingBy can perform a two-level grouping when a second groupingBy is passed as the downstream collector:

Map<Dish.Type, Map<CaloricLevel, List<Dish>>> dishesByTypeCaloricLevel =
    menu.stream().collect(
        groupingBy(Dish::getType,
            groupingBy(dish -> {
                if (dish.getCalories() <= 400) return CaloricLevel.DIET;
                else if (dish.getCalories() <= 700) return CaloricLevel.NORMAL;
                else return CaloricLevel.FAT;
            })
        )
    );

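Multilevel grouping extends to any number of levels: an n-level grouping produces an n-level Map.

Collecting data in subgroups

The second argument of groupingBy can be any type of Collector. For example, counting computes the number of elements in each group: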

Map<Dish.Type, Long> typesCount = menu.stream().collect(groupingBy(Dish::getType, counting()));

In fact, the one-argument groupingBy(f) is simply shorthand for groupingBy(f, toList()). As another example, here is how to find the highest-calorie dish of each type:

Map<Dish.Type, Optional<Dish>> mostCaloricByType = menu.stream().collect(groupingBy(Dish::getType, maxBy(comparingInt(Dish::getCalories))));

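Adapting the collector result to a different type

To adapt the result returned by a collector to a different type, use the collector returned by the Collectors.collectingAndThen factory method, as shown below. Calling Optional::get is safe here because groupingBy only creates a key when at least one element maps to it, so the Optional of every group is non-empty.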

Map<Dish.Type, Dish> mostCaloricByType = menu.stream().collect(groupingBy(Dish::getType,
	collectingAndThen(maxBy(comparingInt(Dish::getCalories)), Optional::get)));

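Other examples of collectors used with groupingBy

More generally, the collector passed as the second argument to groupingBy performs a further reduction on all the stream elements classified into the same group. For example: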

Map<Dish.Type, Integer> totalCaloriesByType = menu.stream().collect(groupingBy(Dish::getType, summingInt(Dish::getCalories)));

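Another collector often used with groupingBy is the one produced by the mapping method. It takes two arguments: a function that transforms the elements of the stream, and a collector that gathers the transformed objects. For example: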

Map<Dish.Type, Set<CaloricLevel>> caloricLevelsByType =
    menu.stream().collect(
        groupingBy(Dish::getType, mapping(dish -> {
                if (dish.getCalories() <= 400) return CaloricLevel.DIET;
                else if (dish.getCalories() <= 700) return CaloricLevel.NORMAL;
                else return CaloricLevel.FAT;
            }, toSet())));

In the example above, there is no guarantee about which type of Set toSet returns. To control the set implementation, use toCollection:

Map<Dish.Type, Set<CaloricLevel>> caloricLevelsByType =
    menu.stream().collect(
        groupingBy(Dish::getType, mapping(dish -> {
                if (dish.getCalories() <= 400) return CaloricLevel.DIET;
                else if (dish.getCalories() <= 700) return CaloricLevel.NORMAL;
                else return CaloricLevel.FAT;
            }, toCollection(HashSet::new))));
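
The same idea applies to the outer Map: groupingBy has a three-argument overload that takes a map factory between the classifier and the downstream collector. The sketch below, on a plain list of words, uses TreeMap and TreeSet so that both keys and values come out sorted; the class and method names are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

class SortedGroupingSketch {
    // Groups words by length; TreeMap sorts the keys, TreeSet sorts and deduplicates the values.
    static Map<Integer, Set<String>> wordsByLength(List<String> words) {
        return words.stream().collect(
            Collectors.groupingBy(String::length, TreeMap::new,
                Collectors.toCollection(TreeSet::new)));
    }

    public static void main(String[] args) {
        System.out.println(wordsByLength(Arrays.asList("pear", "fig", "apple", "kiwi", "fig")));
        // prints {3=[fig], 4=[kiwi, pear], 5=[apple]}
    }
}
```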

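Partitioning

Partitioning is a special case of grouping in which a predicate, called the partitioning function, serves as the classification function. Because the partitioning function returns a boolean, the resulting Map has Boolean keys and therefore at most two groups: one for true and one for false.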

Map<Boolean, List<Dish>> partitionedMenu = menu.stream().collect(partitioningBy(Dish::isVegetarian));

Advantages of partitioning

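Like groupingBy, the partitioningBy method has an overloaded version that accepts a second collector, as shown below. Note that the second example calls Optional::get, which assumes that both partitions contain at least one dish.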

Map<Boolean, Map<Dish.Type, List<Dish>>> vegetarianDishesByType =
    menu.stream().collect(
        partitioningBy(Dish::isVegetarian, groupingBy(Dish::getType)));

Map<Boolean, Dish> mostCaloricPartitionedByVegetarian =
    menu.stream().collect(
        partitioningBy(Dish::isVegetarian,
            collectingAndThen(maxBy(comparingInt(Dish::getCalories)),
                Optional::get)));
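
One practical difference from grouping is that the Map returned by partitioningBy always contains both the true and the false key, even when one side is empty, whereas groupingBy only creates keys it actually encounters. A minimal sketch on integers (the class and method names are made up for the example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.partitioningBy;

class PartitionKeysSketch {
    // Counts even (true) and odd (false) numbers with partitioningBy.
    static Map<Boolean, Long> partitionCounts(List<Integer> numbers) {
        return numbers.stream().collect(partitioningBy(n -> n % 2 == 0, counting()));
    }

    // Same classification expressed with groupingBy, for comparison.
    static Map<Boolean, Long> groupCounts(List<Integer> numbers) {
        return numbers.stream().collect(groupingBy(n -> n % 2 == 0, counting()));
    }

    public static void main(String[] args) {
        List<Integer> odds = Arrays.asList(1, 3, 5);
        System.out.println(partitionCounts(odds).get(true));     // prints 0
        System.out.println(groupCounts(odds).containsKey(true)); // prints false
    }
}
```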

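Partitioning numbers into prime and nonprime

Suppose you want to write a method that takes an integer n and partitions the first n natural numbers into primes and nonprimes. First you need a predicate that tests whether a given candidate is prime: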

public boolean isPrime(int candidate) {
    return IntStream.range(2, candidate).noneMatch(i -> candidate % i == 0);
}

A simple optimization is to test only the divisors up to the square root of the candidate:

public boolean isPrime(int candidate) {
    int candidateRoot = (int) Math.sqrt((double) candidate);
    return IntStream.rangeClosed(2, candidateRoot).noneMatch(i -> candidate % i == 0);
}

You can now partition the integers from 2 to n into primes and nonprimes:

public Map<Boolean, List<Integer>> partitionPrimes(int n) {
    return IntStream.rangeClosed(2, n).boxed().collect(partitioningBy(candidate -> isPrime(candidate)));
}
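
Put together as a runnable sketch, the predicate and the partitioning method look like this. The wrapper class name is an assumption, and the methods are made static so that they can be called from main.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PrimePartitionSketch {
    // Assumes candidate >= 2; for 0 and 1 the range is empty and noneMatch would return true.
    static boolean isPrime(int candidate) {
        int candidateRoot = (int) Math.sqrt((double) candidate);
        return IntStream.rangeClosed(2, candidateRoot).noneMatch(i -> candidate % i == 0);
    }

    // Partitions the integers 2..n: primes under true, nonprimes under false.
    static Map<Boolean, List<Integer>> partitionPrimes(int n) {
        return IntStream.rangeClosed(2, n).boxed()
            .collect(Collectors.partitioningBy(PrimePartitionSketch::isPrime));
    }

    public static void main(String[] args) {
        System.out.println(partitionPrimes(20).get(true));
        // prints [2, 3, 5, 7, 11, 13, 17, 19]
    }
}
```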

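The following table lists the main static factory methods of the Collectors class:

Factory method	Returned type	Used to
toList	List<T>	Gather all the stream's elements into a List
toSet	Set<T>	Gather all the stream's elements into a Set, eliminating duplicates
toCollection	Collection<T>	Gather all the stream's elements into the collection created by the given supplier
counting	Long	Count the number of elements in the stream
summingInt	Integer	Sum the values of an int property of the elements
averagingInt	Double	Compute the average of an int property of the elements
summarizingInt	IntSummaryStatistics	Collect statistics (count, sum, min, average, max) on an int property of the elements
joining	String	Concatenate the strings produced by calling toString on each element
maxBy	Optional<T>	The maximum element according to the given Comparator, or Optional.empty() if the stream is empty
minBy	Optional<T>	The minimum element according to the given Comparator, or Optional.empty() if the stream is empty
reducing	The type produced by the reduction	Combine the stream's elements one by one into a single value using a BinaryOperator
collectingAndThen	The type returned by the transforming function	Wrap another collector and apply a transformation function to its result
groupingBy	Map<K, List<T>>	Group the elements by the value of one of their properties, used as the key of the resulting Map
partitioningBy	Map<Boolean, List<T>>	Partition the elements according to a predicate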

The Collector interface

The Collector interface declares the following methods:

public interface Collector<T, A, R> {
    Supplier<A> supplier();
    BiConsumer<A, T> accumulator();
    Function<A, R> finisher();
    BinaryOperator<A> combiner();
    Set<Characteristics> characteristics();
}

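where:

T is the type of the stream elements to be collected.
A is the type of the accumulator, the object in which partial results are gathered during collection.
R is the type of the result.

For example, you could implement a ToListCollector<T> class that gathers all the elements of a Stream<T> into a List<T>: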

public class ToListCollector<T> implements Collector<T, List<T>, List<T>>

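Making sense of the methods declared by the Collector interface

The supplier method

The supplier method must return a function that creates a new, empty accumulator, for example: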

public Supplier<List<T>> supplier() {
    return () -> new ArrayList<T>();
}

You can also use a constructor reference:

public Supplier<List<T>> supplier() {
	return ArrayList::new;
}

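The accumulator method

The accumulator method returns the function that performs the reduction. That function takes two arguments, the accumulator and the n-th element of the stream, and returns void because the accumulator is modified in place. For example: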

public BiConsumer<List<T>, T> accumulator() {
	return (list, item) -> list.add(item);
}

You can also use a method reference:

public BiConsumer<List<T>, T> accumulator() {
	return List::add;
}

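The finisher method

The finisher method returns a function that is invoked at the end of the accumulation process to transform the accumulator object into the final result. Here the accumulator already is the final result, so the identity function is enough: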

public Function<List<T>, List<T>> finisher() {
	return Function.identity();
}

[Figure: the processing flow of the collect method]

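The combiner method

The combiner method returns a function that merges the accumulators produced from different subparts of the stream when it is processed in parallel. For example: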

public BinaryOperator<List<T>> combiner() {
    return (list1, list2) -> {
        list1.addAll(list2);
        return list1;
    };
}

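The characteristics method

The characteristics method returns a set of Characteristics that defines the behavior of the collector, in particular hints about whether the stream can be reduced in parallel and which optimizations are valid. Characteristics is an enumeration with three items:

UNORDERED: the result of the reduction is not affected by the order in which the stream elements are traversed and accumulated.
CONCURRENT: the accumulator function can be called concurrently from multiple threads on the same container, so the collector can perform a parallel reduction of the stream. Unless the collector is also UNORDERED, this only applies to unordered data sources.
IDENTITY_FINISH: the function returned by finisher is the identity, so it can be skipped and the accumulator A can be used as the result R with an unchecked cast.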
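
Assembling the five methods discussed above gives a complete collector. The following is a sketch rather than a reference implementation: it declares only IDENTITY_FINISH, which is the conservative choice because ArrayList is not thread-safe and list order usually matters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;
import java.util.stream.Stream;

class ToListCollector<T> implements Collector<T, List<T>, List<T>> {
    @Override
    public Supplier<List<T>> supplier() {
        return ArrayList::new;              // creates an empty accumulator
    }

    @Override
    public BiConsumer<List<T>, T> accumulator() {
        return List::add;                   // adds one element to the accumulator
    }

    @Override
    public BinaryOperator<List<T>> combiner() {
        return (list1, list2) -> {          // merges two partial results
            list1.addAll(list2);
            return list1;
        };
    }

    @Override
    public Function<List<T>, List<T>> finisher() {
        return Function.identity();         // the accumulator is already the result
    }

    @Override
    public Set<Characteristics> characteristics() {
        return Collections.unmodifiableSet(EnumSet.of(Characteristics.IDENTITY_FINISH));
    }

    public static void main(String[] args) {
        List<String> names = Stream.of("pork", "beef").collect(new ToListCollector<String>());
        System.out.println(names); // prints [pork, beef]
    }
}
```

Because the collector is neither UNORDERED nor CONCURRENT, a parallel stream gives each subtask its own list and merges them in encounter order through the combiner.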

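The overloaded version of collect

For an IDENTITY_FINISH collection operation, you can define a custom collection without implementing the Collector interface at all. Stream has an overloaded collect method that takes three functions: supplier, accumulator, and combiner: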

List<Dish> dishes = menuStream.collect(ArrayList::new, List::add, List::addAll);
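
A middle ground between the three-argument collect and a full class is the static factory Collector.of, which builds a reusable Collector from the same kind of functions. The sketch below uses the overload without a finisher, so the accumulator itself is returned as the result; the class and method names are illustrative assumptions. Note that Collector.of expects a BinaryOperator as combiner (it returns the merged container), whereas the three-argument collect takes a BiConsumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

class CollectorOfSketch {
    // A reusable to-list collector built from lambdas instead of a dedicated class.
    static <T> Collector<T, List<T>, List<T>> toListCollector() {
        return Collector.of(
            ArrayList::new,                 // supplier: new empty accumulator
            List::add,                      // accumulator: add one element
            (left, right) -> {              // combiner: merge two partial lists
                left.addAll(right);
                return left;
            });
    }

    public static void main(String[] args) {
        List<String> names = Stream.of("pork", "beef").collect(toListCollector());
        System.out.println(names); // prints [pork, beef]
    }
}
```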

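Developing your own collector

Dividing only by prime numbers

To test whether a number is prime, it is enough to divide it by the primes smaller than itself, rather than by every smaller number: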

public static boolean isPrime(List<Integer> primes, int candidate) {
    return primes.stream().noneMatch(i -> candidate % i == 0);
}

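// Optimized alternative to the method above: only test the primes up to the
// square root of the candidate (Stream.takeWhile requires Java 9 or later).
public static boolean isPrime(List<Integer> primes, int candidate) {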
    int candidateRoot = (int) Math.sqrt((double) candidate);
    return primes.stream()
        .takeWhile(i -> i <= candidateRoot)
        .noneMatch(i -> candidate % i == 0);
}

Defining the collector class

public class PrimeNumbersCollector implements Collector<Integer, Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>>

Implementing the reduction process

public Supplier<Map<Boolean, List<Integer>>> supplier() {
    return () -> new HashMap<Boolean, List<Integer>>() {{
        put(true, new ArrayList<Integer>());
        put(false, new ArrayList<Integer>());
    }};
}

public BiConsumer<Map<Boolean, List<Integer>>, Integer> accumulator() {
    return (Map<Boolean, List<Integer>> acc, Integer candidate) -> {
        acc.get(isPrime(acc.get(true), candidate))
            .add(candidate);
    };
}

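Making the collector work in parallel (if possible)

Note that this collector cannot really be used in parallel, because the algorithm is inherently sequential: isPrime relies on the list of primes found so far. The combiner is implemented here only for completeness.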

public BinaryOperator<Map<Boolean, List<Integer>>> combiner() {
    return (Map<Boolean, List<Integer>> map1, Map<Boolean, List<Integer>> map2) -> {
        map1.get(true).addAll(map2.get(true));
        map1.get(false).addAll(map2.get(false));
        return map1;
    };
}

The last two methods

public Function<Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>> finisher() {
    return Function.identity();
}

public Set<Characteristics> characteristics() {
    return Collections.unmodifiableSet(EnumSet.of(IDENTITY_FINISH));
}

You can now use this custom collector to partition the primes:

public Map<Boolean, List<Integer>> partitionPrimesWithCustomCollector(int n) {
    return IntStream.rangeClosed(2, n).boxed().collect(new PrimeNumbersCollector());
}
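
For reference, the pieces shown above can be assembled into a single compilable class along the following lines. This is a sketch under a few assumptions: it replaces the double-brace initialization with a plain HashMap, makes the helper methods static, and needs Java 9 or later for takeWhile.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;
import java.util.stream.IntStream;

class PrimeNumbersCollector
        implements Collector<Integer, Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>> {

    // Tests the candidate only against the primes found so far, up to its square root.
    static boolean isPrime(List<Integer> primes, int candidate) {
        int candidateRoot = (int) Math.sqrt((double) candidate);
        return primes.stream()
            .takeWhile(i -> i <= candidateRoot)
            .noneMatch(i -> candidate % i == 0);
    }

    @Override
    public Supplier<Map<Boolean, List<Integer>>> supplier() {
        return () -> {
            Map<Boolean, List<Integer>> acc = new HashMap<>();
            acc.put(true, new ArrayList<>());   // primes
            acc.put(false, new ArrayList<>());  // nonprimes
            return acc;
        };
    }

    @Override
    public BiConsumer<Map<Boolean, List<Integer>>, Integer> accumulator() {
        return (acc, candidate) -> acc.get(isPrime(acc.get(true), candidate)).add(candidate);
    }

    @Override
    public BinaryOperator<Map<Boolean, List<Integer>>> combiner() {
        // Present for completeness; the algorithm only makes sense sequentially.
        return (map1, map2) -> {
            map1.get(true).addAll(map2.get(true));
            map1.get(false).addAll(map2.get(false));
            return map1;
        };
    }

    @Override
    public Function<Map<Boolean, List<Integer>>, Map<Boolean, List<Integer>>> finisher() {
        return Function.identity();
    }

    @Override
    public Set<Characteristics> characteristics() {
        return Collections.unmodifiableSet(EnumSet.of(Characteristics.IDENTITY_FINISH));
    }

    static Map<Boolean, List<Integer>> partitionPrimes(int n) {
        return IntStream.rangeClosed(2, n).boxed().collect(new PrimeNumbersCollector());
    }

    public static void main(String[] args) {
        System.out.println(partitionPrimes(20).get(true));
        // prints [2, 3, 5, 7, 11, 13, 17, 19]
    }
}
```

The collector depends on receiving the numbers in ascending order, so it should only be used on a sequential, ordered stream such as the one in partitionPrimes.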

You can also use the three-argument overload of collect, but the code becomes less readable and is no longer reusable:

public Map<Boolean, List<Integer>> partitionPrimesWithCustomCollector(int n) {
    return IntStream.rangeClosed(2, n).boxed().collect(
        () -> new HashMap<Boolean, List<Integer>>() {{
            put(true, new ArrayList<Integer>());
            put(false, new ArrayList<Integer>());
        }},
        (acc, candidate) -> {
            acc.get(isPrime(acc.get(true), candidate))
                .add(candidate);
        },
        (map1, map2) -> {
            map1.get(true).addAll(map2.get(true));
            map1.get(false).addAll(map2.get(false));
        });
}
