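Preface

Regular expressions are typically used to search for and replace text that matches a given pattern (rule). A regular expression is a text pattern made up of ordinary characters (for example, the letters a to z) and special characters (called "metacharacters"). A single pattern string describes and matches a whole set of strings that follow a given syntactic rule.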

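I. Regular Expression Overview

① Regular expressions are also known as regex or regexp.
② A regular expression uses a string to describe and match a set of strings that follow a given rule. It is a method of string matching that uses special symbols to quickly find, delete, or replace specific strings.

③ Components of a regular expression
Ordinary characters
Uppercase and lowercase letters, digits, punctuation, and some other symbols.
Metacharacters
Special characters that have a particular meaning in a regular expression. They can specify how the preceding character (the character immediately before the metacharacter) may occur in the target text.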

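Regular expressions are mostly used in scripts and text editors.
Among the common Linux text processing tools, grep and sed use basic regular expressions by default, while egrep and awk use extended regular expressions. egrep is a legacy name for grep -E, and recent GNU grep releases mark it as obsolescent.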

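Tool	Basic regular expressions	Extended regular expressions
vi editor	Supported	-
grep	Supported	With the -E option
egrep	Supported	Supported
sed	Supported	With the -E option
awk	Supported	Supported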
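The difference between the two flavors is easiest to see with a quantifier such as "+". The following is a minimal sketch, assuming a POSIX-style grep; the three sample words are made up for illustration and are not part of the article's test file.

```shell
# Hypothetical sample input, one word per line.
sample=$(printf 'wood\nwoood\nwo+d\n')

# Basic regular expression: "+" is an ordinary character here,
# so only the literal text "wo+d" matches.
bre=$(printf '%s\n' "$sample" | grep 'wo+d')

# Extended regular expression (grep -E): "+" means "one or more
# of the preceding character", so "wood" and "woood" match.
ere=$(printf '%s\n' "$sample" | grep -E 'wo+d')

echo "BRE matches: $bre"
echo "ERE matches:"
printf '%s\n' "$ere"
```

The same pattern text selects different lines depending on the flavor, which is why it matters to know which one a tool uses.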

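1. Basic regular expressions

Metacharacter	Meaning
^	Matches the start of the line. Inside a bracket expression, a leading "^" instead negates the character set. To match a literal "^", use "\^".
$	Matches the end of the line. In engines that have a multiline mode (for example, the Multiline property of a RegExp object), "$" also matches before "\n" or "\r". To match a literal "$", use "\$".
.	Matches any single character except the newline.
\	Backslash, the escape character. Removes the special meaning of the metacharacter or wildcard that immediately follows it.
*	Matches the preceding subexpression zero or more times. To match a literal "*", use "\*".
[]	Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^]	Negated character set. Matches any one character that is not enclosed. For example, "[^abc]" matches "p", "l", "i", and "n" in "plain".
[n1-n2]	Character range. Matches any one character in the given range. For example, "[a-z]" matches any lowercase letter from "a" to "z". Note: the hyphen (-) denotes a range only when it appears inside the brackets between two characters; at the start of the bracket expression it stands for a literal hyphen.
{n}	n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob" but matches the "oo" in "food". In basic regular expressions the braces are written escaped, as "\{n\}".
{n,}	n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" but matches all the o's in "foooood". "o{1,}" is equivalent to "o+" and "o{0,}" is equivalent to "o*".
{n,m}	m and n are non-negative integers, with n not greater than m. Matches at least n times and at most m times.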

Examples

1. Search for a specific string

Create a text file to use in the examples.

[root@yzq mnt]#cat test.txt 
he was short and fat.
He was wearing a blue polo shirt with black pants.
The home of Football on BBC Sport online.
the tongue is boneless but it breaks bones.12! 
google is the best tools for search keyword.
The year ahead will test our political establishment to the limit.
PI=3.141592653589793238462643383249901429
a wood cross!
Actions speak louder than words


#woood #
#woooooood #
#AxyzxyzxyzxyzC
I bet this place is really spooky late at night! 
Misfortunes never come alone/single.

I shouldn't have lett so tast.

grep -n 'the' test.txt
Finds the lines of test.txt that contain the string "the"; the "-n" option prints line numbers.

[root@yzq mnt]#grep -n 'the' test.txt 
4:the tongue is boneless but it breaks bones.12! 
5:google is the best tools for search keyword.
6:The year ahead will test our political establishment to the limit.

The "-i" option makes the match case-insensitive.

[root@yzq mnt]#grep -ni 'the' test.txt 
3:The home of Football on BBC Sport online.
4:the tongue is boneless but it breaks bones.12! 
5:google is the best tools for search keyword.
6:The year ahead will test our political establishment to the limit.

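To invert the selection, for example to find lines that do not contain "the", use grep's "-v" option, combined here with "-n" to print line numbers and "-i" to ignore case.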

[root@yzq mnt]#grep -niv 'the' test.txt 
1:he was short and fat.
2:He was wearing a blue polo shirt with black pants.
7:PI=3.141592653589793238462643383249901429
8:a wood cross!
9:Actions speak louder than words
10:
11:
12:#woood #
13:#woooooood #
14:#AxyzxyzxyzxyzC
15:I bet this place is really spooky late at night! 
16:Misfortunes never come alone/single.
17:
18:I shouldn't have lett so tast.

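2. Use square brackets "[]" to match a set of characters

No matter how many characters appear inside "[]", the expression stands for a single character. In other words, "[io]" matches either "i" or "o".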

[root@yzq mnt]#grep -n 'sh[io]rt' test.txt 
1:he was short and fat.
2:He was wearing a blue polo shirt with black pants.

3. Find lines that start with #

[root@yzq mnt]#grep -n '^#' test.txt 
12:#woood #
13:#woooooood #
14:#AxyzxyzxyzxyzC

4. Find lines that end with !

[root@yzq mnt]#grep -n '!$' test.txt 
8:a wood cross!
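Only line 8 is reported because lines 4 and 15 of test.txt have a space after the "!", so "!" is not their last character. The sketch below shows one way to tolerate trailing spaces by combining "*" with "$"; the three sample lines are shortened, illustrative copies of those lines rather than the full file.

```shell
# Illustrative sample: the first and third lines end with a trailing space.
sample=$(printf 'breaks bones.12! \na wood cross!\nlate at night! \n')

# '!$' requires "!" to be the very last character of the line.
strict=$(printf '%s\n' "$sample" | grep -n '!$')

# '! *$' allows zero or more spaces between "!" and the end of the line.
relaxed_count=$(printf '%s\n' "$sample" | grep -c '! *$')

echo "strict match: $strict"
echo "lines matched when trailing spaces are allowed: $relaxed_count"
```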

5. Find lines containing a four-character string that starts with w and ends with d

[root@yzq mnt]#grep -n 'w..d' test.txt 
5:google is the best tools for search keyword.
8:a wood cross!
9:Actions speak louder than words

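6. Find lines that contain a period (.)

As noted above, "." has a special meaning, so matching a literal "." requires escaping it with a backslash (\).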

[root@yzq mnt]#grep -n '\.' test.txt 
1:he was short and fat.
2:He was wearing a blue polo shirt with black pants.
3:The home of Football on BBC Sport online.
4:the tongue is boneless but it breaks bones.12! 
5:google is the best tools for search keyword.
6:The year ahead will test our political establishment to the limit.
7:PI=3.141592653589793238462643383249901429
16:Misfortunes never come alone/single.
18:I shouldn't have lett so tast.
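As an alternative to escaping, grep's -F option treats the whole pattern as a fixed string, so no character in it is special. A small sketch comparing the two approaches, using made-up sample lines:

```shell
# Illustrative sample: two lines contain a period, one does not.
sample=$(printf 'PI=3.14\nno dots here\nalone/single.\n')

# Escaped dot in a basic regular expression.
escaped=$(printf '%s\n' "$sample" | grep -n '\.')

# -F: the pattern is a literal string, so "." needs no backslash.
fixed=$(printf '%s\n' "$sample" | grep -nF '.')

printf '%s\n' "$escaped"
printf '%s\n' "$fixed"
```

Both commands select the same lines; -F is convenient when the search text contains several metacharacters.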

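7. Match a set of characters

"[]" matches any single character listed inside the brackets, so "[abc]" matches "a", "b", or "c". A range such as "[a-z]" matches any one lowercase letter from "a" to "z".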

[root@yzq mnt]#grep -n '[abc]' test.txt 
1:he was short and fat.
2:He was wearing a blue polo shirt with black pants.
3:The home of Football on BBC Sport online.
4:the tongue is boneless but it breaks bones.12! 
5:google is the best tools for search keyword.
6:The year ahead will test our political establishment to the limit.
8:a wood cross!
9:Actions speak louder than words
15:I bet this place is really spooky late at night! 
16:Misfortunes never come alone/single.
18:I shouldn't have lett so tast.

8. To exclude matches where "oo" is preceded by a lowercase letter, use grep -n '[^a-z]oo' test.txt. "a-z" denotes the lowercase letters; uppercase letters are written "A-Z".

[root@yzq mnt]#grep -n '[^a-z]oo' test.txt 
3:The home of Football on BBC Sport online.

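To select lines that start with a lowercase letter, use the pattern "^[a-z]"; for lines that start with an uppercase letter, use "^[A-Z]"; for lines that do not start with a letter, use "^[^a-zA-Z]".
The "^" symbol behaves differently inside and outside a bracket expression "[]": inside the brackets it negates the set, and outside it anchors the match to the start of the line.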

[root@yzq mnt]#grep -n '^[^a-zA-Z]' test.txt 
12:#woood #
13:#woooooood #
14:#AxyzxyzxyzxyzC
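The three line-start patterns can be tried side by side. This is a minimal sketch on four made-up lines; LC_ALL=C is set as an assumption so that ranges such as [a-z] mean plain ASCII ranges in every environment.

```shell
# Use the C locale so [a-z] and [A-Z] are plain ASCII ranges.
export LC_ALL=C

# Illustrative sample lines.
sample=$(printf 'he was short\nThe home\n#woood #\n42 items\n')

# "^" outside brackets anchors to the start of the line.
lower=$(printf '%s\n' "$sample" | grep '^[a-z]')
upper=$(printf '%s\n' "$sample" | grep '^[A-Z]')

# The second "^" is inside brackets, where it negates the set.
other=$(printf '%s\n' "$sample" | grep '^[^a-zA-Z]')

echo "starts with lowercase: $lower"
echo "starts with uppercase: $upper"
echo "starts with a non-letter:"
printf '%s\n' "$other"
```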

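9. Repetition counts with {}

In basic regular expressions a bare "{" is an ordinary character, so the braces must be escaped as "\{" and "\}" to express a repetition count. Note that "o\{1,4\}" matches every line that contains at least one "o", because grep only needs to find a match somewhere in the line.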

[root@yzq mnt]#grep -n 'o\{1,4\}' test.txt 
1:he was short and fat.
2:He was wearing a blue polo shirt with black pants.
3:The home of Football on BBC Sport online.
4:the tongue is boneless but it breaks bones.12! 
5:google is the best tools for search keyword.
6:The year ahead will test our political establishment to the limit.
8:a wood cross!
9:Actions speak louder than words
12:#woood #
13:#woooooood #
15:I bet this place is really spooky late at night! 
16:Misfortunes never come alone/single.
18:I shouldn't have lett so tast.
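Repetition counts become more selective once the repeated character is surrounded by other characters. The sketch below, on a made-up word list, contrasts "*" with the three interval forms from the table.

```shell
# Illustrative sample: "w", a varying number of "o", then "d".
sample=$(printf 'wd\nwod\nwood\nwoood\nwoooooood\n')

# "*" allows zero or more "o", so every line matches.
star_count=$(printf '%s\n' "$sample" | grep -c 'wo*d')

# Exactly two "o" between "w" and "d".
exactly_two=$(printf '%s\n' "$sample" | grep 'wo\{2\}d')

# Three or more "o".
at_least_three=$(printf '%s\n' "$sample" | grep 'wo\{3,\}d')

# Between two and three "o".
two_to_three=$(printf '%s\n' "$sample" | grep 'wo\{2,3\}d')

echo "wo*d matches $star_count lines"
echo "exactly two: $exactly_two"
printf 'at least three:\n%s\n' "$at_least_three"
printf 'two to three:\n%s\n' "$two_to_three"
```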

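2. Extended regular expressions

Metacharacter	Meaning and example
+	Repeats the preceding character one or more times. Example: egrep -n 'wo+d' test.txt finds "wood", "woood", and "woooooood".
?	Matches the preceding character zero or one time. Example: egrep -n 'bes?t' test.txt finds the strings "bet" and "best".
|	Alternation (or): matches any one of several strings. Example: egrep -n 'of|is|on' test.txt finds "of", "is", or "on".
()	Groups part of a pattern. Example: egrep -n 't(a|e)st' test.txt. "tast" and "test" share the leading "t" and trailing "st", so "a" and "e" are listed inside "()" separated by "|" to find either "tast" or "test".
()+	Repeats a group. Example: egrep -n 'A(xyz)+C' test.txt finds strings that start with "A", end with "C", and have one or more "xyz" in between.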
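The "?", "|", and "()" rows can be checked with a few lines. This sketch uses grep -E (equivalent to egrep) on shortened, illustrative lines modeled on test.txt; it is not the full file.

```shell
# Illustrative sample lines, loosely based on test.txt.
sample=$(printf '%s\n' \
  'google is the best tools' \
  'I bet this place' \
  'I should not have lett so tast.' \
  'The year ahead will test' \
  '#AxyzxyzxyzxyzC')

# "s?" makes the "s" optional: both "best" and "bet" match.
optional_count=$(printf '%s\n' "$sample" | grep -Ec 'bes?t')

# A group with alternation: "tast" or "test".
group=$(printf '%s\n' "$sample" | grep -E 't(a|e)st')

# A repeated group: "A", one or more "xyz", then "C".
repeated=$(printf '%s\n' "$sample" | grep -E 'A(xyz)+C')

echo "lines with bet/best: $optional_count"
printf '%s\n' "$group"
echo "$repeated"
```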

1. Match lines that contain at least one 0

[root@yzq mnt]#egrep '0+' test.txt 
PI=3.141592653589793238462643383249901429

2. Match lines that contain at least two consecutive o's

[root@yzq mnt]#egrep 'ooo?' test.txt 
The home of Football on BBC Sport online.
google is the best tools for search keyword.
a wood cross!
#woood #
#woooooood #
I bet this place is really spooky late at night!

3. Match lines that contain at least one xyz

[root@yzq mnt]#egrep '(xyz)+' test.txt 
#AxyzxyzxyzxyzC

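II. Text Processing Tools

1. cut: column extraction tool

The cut command cuts bytes, characters, or fields from each line of a file and writes them to standard output.

If no file argument is given, cut reads standard input. One of the -b, -c, or -f options must be specified.
Bytes versus characters
Byte: a unit of measurement for the amount of data and for storage capacity; one byte is normally 8 bits.
Character: a letter, digit, ideograph, or symbol used in a computer.
An ASCII letter or symbol occupies one byte. A Chinese character occupies two bytes in GBK and usually three bytes in UTF-8.

-b	Select by byte
-c	Select by character, often used for Chinese text
-d	Set the field delimiter; the default is the tab character
-f	Select fields (columns), usually together with -d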

Examples

1. Extract the first ten user names on the system

-d: uses ":" as the delimiter
-f1 selects the first field

[root@yzq mnt]#cat /etc/passwd | head | cut -d: -f1
root
bin
daemon
adm
lp
sync
shutdown
halt
mail
operator

2. Extract the first byte of each line

[root@yzq mnt]#cat /etc/passwd | head | cut -b1
r
b
d
a
l
s
s
h
m
o
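The -f option also accepts lists and ranges, and -c accepts character ranges. A minimal sketch on two passwd-style lines (illustrative data, not read from the real /etc/passwd):

```shell
# Two illustrative passwd-style records.
sample=$(printf 'root:x:0:0:root:/root:/bin/bash\nbin:x:1:1:bin:/bin:/sbin/nologin\n')

# Fields 1 and 7 (user name and login shell), still joined by ":".
name_shell=$(printf '%s\n' "$sample" | cut -d: -f1,7)

# The first three characters of each line.
first_three=$(printf '%s\n' "$sample" | cut -c1-3)

printf '%s\n' "$name_shell"
printf '%s\n' "$first_three"
```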

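2. sort: sorting tool

1. Sorting with sort

sort orders the lines of a file, treating each line as a unit, and can compare lines according to different data types. For example, numbers and text are ordered differently.

Format
sort [options] file

## Common options:

	-b ignores leading blank characters on each line.
	-k specifies the sort key, that is, which field or range of fields to sort on.
	-n sorts numerically; the default is to sort as text.
	 
	-u is equivalent to piping through uniq: identical lines are output only once. Note: lines that differ only in trailing spaces are not considered identical.

	-o FILE writes the sorted result to the given file.
	
	-r reverses the order; the default is ascending, so -r gives descending order.
	
	-t SEP sets the field separator; by default fields are separated by blanks (spaces or tabs).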
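The difference between text order and numeric order is clearest on a short list of numbers. A minimal sketch, with LC_ALL=C set as an assumption so the text comparison is plain byte order:

```shell
# Use the C locale so text comparison is plain byte order.
export LC_ALL=C

nums=$(printf '10\n9\n2\n100\n')

# Default: compared as text, character by character.
text_order=$(printf '%s\n' "$nums" | sort)

# -n: compared by numeric value.
numeric_order=$(printf '%s\n' "$nums" | sort -n)

# -nr: numeric, largest first.
numeric_desc=$(printf '%s\n' "$nums" | sort -nr)

echo "text:         $(echo $text_order)"
echo "numeric:      $(echo $numeric_order)"
echo "numeric desc: $(echo $numeric_desc)"
```

As text, "10" and "100" sort before "2" because the comparison starts with the first character.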

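Examples

1. With no options, sort compares whole lines in ascending order, so letters are listed from a to z, top to bottom

-f1-3 (a cut option) selects fields 1 to 3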

[root@yzq ~]#cat /etc/passwd | head | cut -d: -f1-3 | sort
adm:x:3
bin:x:1
daemon:x:2
halt:x:7
lp:x:4
mail:x:8
operator:x:11
root:x:0
shutdown:x:6
sync:x:5

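2. Use ":" as the separator and sort numerically on the second field (descending)

-nr sorts numerically in descending order
-t: uses ":" as the separator
-k2 sorts on the key that starts at the second field

In this data the second field is "x" on every line, so all numeric keys compare equal and sort falls back to comparing whole lines, reversed by -r. The output below is therefore in reverse alphabetical order; to order by the numeric third field, the key would be -k3,3.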

[root@yzq ~]#cat /etc/passwd | head | cut -d: -f1-3 | sort -nr -t: -k2
sync:x:5
shutdown:x:6
root:x:0
operator:x:11
mail:x:8
lp:x:4
halt:x:7
daemon:x:2
bin:x:1
adm:x:3
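To order these records by the numeric value in the third field, the key has to name that field explicitly. A sketch using four of the records above as sample data; -k3,3 limits the key to field 3 only, and the "nr" suffix applies numeric, descending order to that key.

```shell
export LC_ALL=C

# Four records in the same name:x:uid form as above.
sample=$(printf 'adm:x:3\nbin:x:1\noperator:x:11\nroot:x:0\n')

# -t: sets the separator; -k3,3nr sorts on field 3 only,
# numerically and in descending order.
by_uid_desc=$(printf '%s\n' "$sample" | sort -t: -k3,3nr)

printf '%s\n' "$by_uid_desc"
```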

3. Remove duplicates

[root@yzq mnt]#cat txt
1
1
1
1
2
2
2
1
1
1
2
2
3
3
2
2
2
[root@yzq mnt]#cat txt | sort -u
1
2
3

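3. uniq: duplicate removal tool

uniq removes consecutive duplicate lines.
Note that the lines must be adjacent. uniq is therefore usually combined with sort: sort first so that identical lines become adjacent, then remove the duplicates. Duplicates that are not adjacent are not removed.

Format
uniq [options] file
## Common options:
	
	-c prefixes each line with the number of times it occurred.
	
	-d prints only the lines that are repeated.
	
	-u prints only the lines that occur once.

	-f N skips the first N fields when comparing.
	 
	-s N skips the first N characters when comparing.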

Examples

1. Remove consecutive duplicate lines

[root@yzq mnt]#cat txt
1
1
1
1
2
2
2
1
1
1
2
2
3
3
2
2
2
[root@yzq mnt]#cat txt | uniq
1
2
1
2
3
2

2. Count repeated lines; duplicates that are not adjacent are counted separately

The first column is the number of occurrences.

[root@yzq mnt]#uniq -c txt
      4 1
      3 2
      3 1
      2 2
      2 3
      3 2

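3. Combine with sort to remove all duplicates (sort txt | uniq is equivalent to sort -u txt)
The final sort -r works here because uniq -c pads the counts to the same width; sort -nr is the more robust choice because it compares the counts numerically.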

[root@yzq mnt]#sort txt | uniq
1
2
3
[root@yzq mnt]#sort txt | uniq -c
      7 1
      8 2
      2 3
[root@yzq mnt]#sort  txt | uniq -c | sort -r
      8 2
      7 1
      2 3

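4. uniq -u prints only the lines that occur once

A line whose repeats are not adjacent also counts as occurring once. In this example txt has been edited so that it starts with 1 and 4; its contents are shown after the command.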

[root@yzq mnt]#uniq -u txt
1
4
[root@yzq mnt]#cat txt 
1
4
1
1
1
2
2
2
1
1
1
2
2
3
3
2
2
2
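The -d option from the list above is not shown in the examples, and the effect of sorting first is easy to miss. A small sketch on made-up input compares -u with and without sort, and shows -d:

```shell
export LC_ALL=C

# Illustrative input with both adjacent and non-adjacent repeats.
sample=$(printf '1\n4\n1\n1\n2\n2\n3\n')

# Without sort: the first "1" has no adjacent copy, so it counts as unique.
unique_unsorted=$(printf '%s\n' "$sample" | uniq -u)

# With sort: all copies of "1" are adjacent, so it is no longer unique.
unique_sorted=$(printf '%s\n' "$sample" | sort | uniq -u)

# -d prints one copy of each line that is repeated.
repeated=$(printf '%s\n' "$sample" | sort | uniq -d)

echo "uniq -u without sort: $(echo $unique_unsorted)"
echo "uniq -u after sort:   $(echo $unique_sorted)"
echo "uniq -d after sort:   $(echo $repeated)"
```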

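4. tr: character translation and squeezing tool

Format
tr [OPTION]... SET1 [SET2]
	# Translates, squeezes, and/or deletes characters from standard input and writes the result to standard output
	
## Common options:

	-c uses the complement of SET1: characters in SET1 are left alone and all other characters are processed.
	 
	-d deletes the characters in SET1.
	 
	-s squeezes each run of a repeated character into a single occurrence.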

Examples

1. Replace a with A

Enclose the sets in single quotes ''.

[root@yzq mnt]#cat txt
AAA
aaa
ABB
abb
[root@yzq mnt]#cat txt | tr 'a' 'A'
AAA
AAA
ABB
Abb

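tr maps characters position by position. With GNU tr, when SET2 is shorter than SET1 its last character is repeated to fill the gap, so 'lisi' is mapped to '1233'. "i" occurs twice in SET1 and the last mapping wins, so l becomes 1 while both i and s become 3; the s in zhangsan is replaced by 3 accordingly.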

[root@yzq mnt]#cat 1.txt 
zhangsan
lisi
wangwu
[root@yzq mnt]#cat 1.txt | tr 'lisi' '123'
zhang3an
1333
wangwu

2. Delete every a

[root@yzq mnt]#cat txt
AAA
aaa
ABB
abb
[root@yzq mnt]#cat txt | tr -d 'a'
AAA

ABB
bb

3. Squeeze repeated a's

[root@yzq mnt]#cat txt 
AAA
aaa
ABB
abb
[root@yzq mnt]#cat txt | tr -s 'a'
AAA
a
ABB
abb
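The -c option and character classes are listed above but not demonstrated. The following sketch shows three common uses on made-up strings: case conversion with character classes, keeping only digits with -cd, and squeezing runs of spaces with -s.

```shell
# Convert lowercase to uppercase using POSIX character classes.
upper=$(printf 'zhangsan\n' | tr '[:lower:]' '[:upper:]')

# -c complements the set and -d deletes: everything that is NOT a digit
# is removed, including the trailing newline.
digits=$(printf 'bones.12!\n' | tr -cd '0-9')

# -s squeezes each run of spaces into a single space.
squeezed=$(printf 'a   b    c\n' | tr -s ' ')

echo "$upper"
echo "$digits"
echo "$squeezed"
```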

Find the 10 IP addresses with the most requests

[root@yzq mnt]#cat nginx.access.log-2021013 | cut -d " " -f1 | sort -n | uniq -c | sort -nr | head
   5498 122.51.38.20
   2161 117.157.173.214
    953 211.159.177.120
    219 58.87.87.99
    100 222.218.17.189
    100 218.201.62.71
    100 122.139.5.237
    100 120.195.144.116
    100 118.121.41.14
    100 1.177.191.161
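The pipeline works in four stages: cut extracts the client address (the first space-separated field), sort makes identical addresses adjacent, uniq -c counts each address, and sort -nr orders the counts from highest to lowest before head trims the list. The sketch below runs the same stages on a tiny made-up log; the addresses and request lines are hypothetical and only imitate the shape of an access log.

```shell
export LC_ALL=C

# Hypothetical access-log lines: the client address is the first field.
log=$(printf '%s\n' \
  '10.0.0.1 - - "GET / HTTP/1.1" 200' \
  '10.0.0.2 - - "GET /a HTTP/1.1" 200' \
  '10.0.0.1 - - "GET /b HTTP/1.1" 404' \
  '10.0.0.3 - - "GET / HTTP/1.1" 200' \
  '10.0.0.2 - - "GET / HTTP/1.1" 200' \
  '10.0.0.1 - - "GET /c HTTP/1.1" 200')

# Extract addresses, group them, count them, sort by count, keep the top 2.
top=$(printf '%s\n' "$log" | cut -d ' ' -f1 | sort | uniq -c | sort -nr | head -n 2)

printf '%s\n' "$top"
```

A plain sort is enough before uniq -c here, because the only requirement at that stage is that identical addresses end up next to each other.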

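Summary

sort: sorts lines
By default it sorts as text; numeric sorting needs the -n option and reverse sorting needs -r.
It can also sort on a chosen column: define the separator with -t, select the column with -k, and the lines are sorted on that key.
The sorted result can be written to another file by naming it with the -o option.
It can also remove duplicates with -u, including duplicates that are not adjacent.
uniq: removes duplicates
The key point is that duplicate lines must be adjacent, otherwise they are not removed.
The -u option prints only the lines that are not repeated.
The -d option prints only the repeated lines.
The -c option counts consecutive repeats.

tr: modifies characters
It replaces characters one for one by position. If a character is repeated in SET1, the replacement paired with its last occurrence is used; if SET2 is too short, its last character is reused for the remaining characters.
The -d option deletes the given characters from the text.
The -s option squeezes repeats, but only when the repeated characters are adjacent (one is kept); otherwise nothing is squeezed.

cut: extracts columns
It selects by character (-c), by byte (-b), or by field (-f) using a delimiter set with -d.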