统计词频¶

难度：中等
链接：https://leetcode-cn.com/problems/word-frequency

描述¶

写一个 bash 脚本以统计一个文本文件 words.txt 中每个单词出现的频率。

为了简单起见，你可以假设：

words.txt只包括小写字母和 ' ' 。
每个单词只由小写字母组成。
单词间由一个或多个空格字符分隔。

示例

输入：words.txt

the day is sunny the the
the sunny is is

输出:（以词频降序排列）

the 4
is 3
sunny 2
day 1

说明

不要担心词频相同的单词的排序问题，每个单词出现的频率都是唯一的。
你可以使用一行 Unix pipes 实现吗？

题解¶

分析¶

解题思路如下：

首先将所有单词独立出来
然后对所有单词进行排序，并生成单词集合（每个单词出现一次）
遍历单词集合，对每个单词计数，并且将单词和对应的计数值（如the 4）保存到一个变量中，每个单词直接使用\n分隔
将得到的变量安装第二列逆序排序后输出

实现¶

对于第一步，将文件中的单词独立出来，可以使用sed将文件中的空格都替换成换行；
通过sort和uniq生成单词集合，完成第二步；
for循环遍历集合，grep -Ewo结合wc -l得到每个单词的出现次数，将单词和计数信息合并到一个变量
使用sort --key=2 -reverse -n对变量按计数值逆序排序，打印输出

# Read from the file words.txt and output the word frequency list to stdout.
# 获取单词集合
uniq_words=$(sed -E 's/\ +/\n/g' words.txt |sort |uniq)

res=""

# 计算每个单词的出现次数
for word in $uniq_words
do
    count=$(grep -Ewo $word words.txt |wc -l)
    res="${res}${word} $count\n"
done

# 按照单词频次逆序排序
echo -en "$res" |sort --key=2 --reverse -n

sort 长参数

sort --reverse -n 可以合并为 sort -rn
sort 某些长参数在 LeetCode 是不包含的，使用后会报错

优化¶

以上解法显然不是最优的，那么可以考虑优化，思考下获取单词集合的方法。

使用cat和xargs, 将每个单词都当做一个参数看待，可以用以下方式得到集合。

uniq_words=$(cat words.txt |xargs -n 1| uniq)

这是其一，其二是打印输出时的排序，如果不使用sort的--key参数该怎么实现呢，可以在合并单词和计数值时先用计数值在前的方式存储，然后直接sort -rn排序，最后通过awk调换输出顺序。

优化后的程序如下：

# Read from the file words.txt and output the word frequency list to stdout.
# 获取单词集合
uniq_words=$(cat words.txt |xargs -n 1| uniq)

res=""

# 计算每个单词的出现次数
for word in $uniq_words
do
    count=$(grep -Ewo $word words.txt |wc -l)
    res="${res}${count} $word\n"
done

# 按照单词频次逆序排序
echo -en "$res" |sort -rn |awk '{print $2" "$1}'

再优化¶

最后再想想还能怎么优化，按照题目的说明信息可知，肯定有方法可以一行搞定，那怎么实现呢？看了别人的讨论发现，我忽略了uniq的强有力参数-c，没错，uniq -c本身就可以计数，并且默认输出时将计数值放在字符串之前，以tab分隔，如4 the.

OK, 有了这个就好办了，根本无需循环遍历，直接一步到位，完美。

# Read from the file words.txt and output the word frequency list to stdout.
cat words.txt |xargs -n 1 |sort |uniq -c |sort -rn |awk '{print $2" "$1}'

但是实践证明，这种方式并不是最高效的，但确实是最简洁的。

小结¶

xargs 有时候真的有妙用，可以把单词变成参数；还有uniq -c，意想不到的巧妙；最后加上awk的格式化输出，果然团结就是力量~