regexp - syntax of regular expression patterns
Miscellaneous Information
The newline character at the end of each input line is never explicitly matched
by any regular expression or part thereof.
expr, ex, vi, and ed take basic regular expressions; all other MKS commands accept extended regular expressions. grep and sed accept basic regular expressions, but can accept extended regular expressions if the -E option is used.
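The difference between the two syntaxes can be seen by running the same pattern through grep with and without -E. This is a minimal sketch; the sample input and the variable names (input, bre_hits, ere_hits) are made up for illustration.

```shell
# Sample input: three lines, one of which contains a literal plus sign.
input=$(printf 'aaa\na+b\nbbb\n')

# Basic syntax (plain grep): + is an ordinary character here,
# so only the line containing the two characters "a+" matches.
bre_hits=$(printf '%s\n' "$input" | grep 'a+')

# Extended syntax (grep -E): + means "one or more of the preceding item",
# so every line containing at least one a matches.
ere_hits=$(printf '%s\n' "$input" | grep -E 'a+')

echo "basic:"
echo "$bre_hits"
echo "extended:"
echo "$ere_hits"
```

The same pattern text therefore selects different lines depending on which syntax the command expects.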
Regular expressions may be made up of normal characters and/or special
characters, sometimes called metacharacters. Basic and extended
regular expressions differ only in the metacharacters they can contain.
The basic regular expression metacharacters are:
^ $ . * \( \) [ \{ \} \
The extended regular expression metacharacters are:
| ^ $ . * + ? ( ) [ { } \
In addition, vi, ex, and egrep (grep -E) also accept these two metacharacters:
\< \>
These have the following meanings:
.
A dot character matches any single character of the input line.
^
The ^ character does not match any character but represents the beginning of the input line. For example, ^A is a regular expression matching the letter A at the beginning of a line. The ^ character is only special at the beginning of a regular expression, or after a ( or |.
$
This does not match any character but represents the end of the input line. For example, A$ is a regular expression matching the letter A at the end of a line. The $ character is only special at the end of a regular expression, or before a ) or |.
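The two anchors can be tried with grep as follows. This is only a sketch; the input lines and variable names are invented for the example, and plain grep (basic regular expressions) is assumed.

```shell
# Four sample lines; A appears at the start, at the end, alone, or in the middle.
input=$(printf 'ABC\nCBA\nA\nxAx\n')

# ^A : an A at the beginning of the line.
starts_with_A=$(printf '%s\n' "$input" | grep '^A')

# A$ : an A at the end of the line.
ends_with_A=$(printf '%s\n' "$input" | grep 'A$')

# ^A$ : both anchors together, so the line must consist of a single A.
only_A=$(printf '%s\n' "$input" | grep '^A$')

# In a basic regular expression, a ^ that is not at the beginning is an ordinary character.
literal_caret=$(printf 'a^b\n' | grep 'a^b')

echo "$starts_with_A"
echo "$ends_with_A"
echo "$only_A"
echo "$literal_caret"
```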
[bracket-expression]
A bracket expression enclosed in square brackets is a regular expression that matches a single character, or collating element.
If the initial character is a circumflex (^), then the bracket expression is complemented: it matches any character or collating element except those specified in the bracket expression.
If the first character after any circumflex is either a dash (-) or a closing square bracket (]), then it matches exactly that character; that is, a literal dash or closing square bracket.
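A short sketch of complemented bracket expressions and of the literal dash and closing bracket follows. The input lines and variable names are made up, and the C locale is set only so that [0-9] behaves predictably.

```shell
# Use the C (POSIX) locale so the range 0-9 means exactly the ten digits.
LC_ALL=C
export LC_ALL

input=$(printf 'abc\n123\na-b\n[x]\n')

# [^0-9] : complemented expression; lines containing at least one non-digit.
non_digit=$(printf '%s\n' "$input" | grep '[^0-9]')

# []] : a ] placed first in the bracket expression is a literal closing bracket.
closing=$(printf '%s\n' "$input" | grep '[]]')

# [-+] : a - placed first is a literal dash (this expression matches - or +).
dash=$(printf '%s\n' "$input" | grep '[-+]')

echo "$non_digit"
echo "$closing"
echo "$dash"
```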
Collating sequences may be specified by enclosing their name inside square bracket period, that is, between [. and .]. For example, [.ch.] matches the multi-character collating sequence ch (if the current language supports that collating sequence). A single character stands for itself. It is an error to give a collating sequence that isn't part of the current locale.
Equivalence classes may be specified by enclosing a character or collating sequence inside square bracket equals, that is, between [= and =]. For example, [=a=] matches any character in the same equivalence class as a. This normally expands to all the variants of a in the current locale: for example, a, ä, à, ... On some locales it might include both the uppercase and lowercase of a given character. In the POSIX locale, this always expands to only the character given.
Within a character class expression (one made with square brackets), the following constructs may be used to represent sets of characters. These constructs are used for internationalization and handle the different collating sequences as required by POSIX.
[:alpha:]
[:lower:]
[:upper:]
[:digit:]
[:alnum:]
[:space:]
[:graph:]
[:print:]
[:punct:]
[:cntrl:]
Because these constructs are valid only inside a bracket expression, to use one such as [:alpha:] you need to enclose the expression within another set of square brackets, as in:
/[[:alpha:]]/
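The character class constructs can be exercised with grep as in the following sketch. The input and variable names are hypothetical, and the C locale is assumed so that the classes cover only ASCII characters.

```shell
# Use the C (POSIX) locale so the classes have their plain ASCII meaning.
LC_ALL=C
export LC_ALL

input=$(printf 'abc\n123\nab12\n!?\n')

# [[:alpha:]] : lines containing at least one alphabetic character.
has_alpha=$(printf '%s\n' "$input" | grep '[[:alpha:]]')

# ^[[:digit:]][[:digit:]]*$ : lines made up of one or more digits and nothing else.
all_digits=$(printf '%s\n' "$input" | grep '^[[:digit:]][[:digit:]]*$')

# [[:punct:]] : lines containing at least one punctuation character.
has_punct=$(printf '%s\n' "$input" | grep '[[:punct:]]')

echo "$has_alpha"
echo "$all_digits"
echo "$has_punct"
```

Note the doubled brackets in every pattern: the outer pair forms the bracket expression and the inner pair names the class.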
Character ranges are specified by a dash (-) between two characters or collating sequences. This indicates all characters or collating sequences that collate between the two endpoints. The range does not refer to the native character set. For example, in the POSIX locale, [a-z] means all lowercase letters, even if they don't agree with the binary machine ordering. However, since many other locales do not collate in this manner, ranges should not be used in Strictly Conforming POSIX.2 applications. A collating sequence may explicitly be an endpoint of a range; for example, [[.ch.]-[.ll.]] is valid. However, equivalence classes and character classes may not: [[=a=]-z] is illegal.
\
This character is used to turn off the special meaning of metacharacters. For example, \. only matches a dot character. Note that \\ matches a literal \ character. Also note the special case of \d described below.
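The effect of the backslash can be checked with a small grep sketch. The three input lines and the variable names are invented for the example.

```shell
# Three sample lines: "a.c", "abc", and "a\c" (with a real backslash).
input=$(printf 'a.c\nabc\na\\c\n')

# a.c : the unescaped dot matches any single character, so all three lines match.
any_char=$(printf '%s\n' "$input" | grep 'a.c')

# a\.c : the escaped dot matches only a literal dot.
literal_dot=$(printf '%s\n' "$input" | grep 'a\.c')

# a\\c : a doubled backslash matches one literal backslash.
literal_backslash=$(printf '%s\n' "$input" | grep 'a\\c')

echo "$any_char"
echo "$literal_dot"
echo "$literal_backslash"
```

The patterns are single-quoted so that the shell passes the backslashes to grep unchanged.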
\d
For d representing any single decimal digit (from 1 to 9), this pattern is equivalent to the string matching the dth expression enclosed within the () characters (or \(\) for some commands) found at an earlier point in the regular expression. Parenthesized expressions are numbered by counting ( characters from the left.
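Constructs of this form can also be used in the replacement strings of substitution commands (for example, the s command in Ex, or the sub function of awk), to stand for constructs matched by parts of the regular expression.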
For example, in the following Ex command
s/\(.*\):\(.*\)/\2:\1/
the \1 stands for everything matched by the first \(.*\) and the \2 stands for everything matched by the second. The result of the command is to swap everything before the : with everything after.
*
A regular expression regexp followed by * matches a string of zero or more strings that would match regexp. For example, A* matches A, AA, AAA, and so on. It also matches the null string (zero occurrences of A).
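The * operator and the \d back-references can be tried together using sed, which, like Ex, takes basic regular expressions. The following is an illustrative sketch only: the sample strings and variable names are made up, and it assumes sed's s command handles this substitution the same way as the Ex command discussed earlier.

```shell
# A* also matches the null string, so even the line with no A counts as a match.
# grep -c prints the number of matching lines.
star_hits=$(printf 'AAA\nB\n' | grep -c 'A*')

# Swap the text before the colon with the text after it.
swapped=$(printf 'key:value\n' | sed 's/\(.*\):\(.*\)/\2:\1/')

# With two colons, the first .* takes as much as it can,
# so the split happens at the last colon.
greedy=$(printf 'a:b:c\n' | sed 's/\(.*\):\(.*\)/\2:\1/')

echo "$star_hits"
echo "$swapped"
echo "$greedy"
```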
+
A regular expression regexp followed by + matches a string of one or more strings that would match regexp.
?
A regular expression regexp followed by ? matches a string of zero or one occurrences of strings that would match regexp.
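Since + and ? are extended metacharacters, a quick way to try them is grep -E. This is a minimal sketch with made-up input and variable names.

```shell
input=$(printf 'ac\nabc\nabbc\n')

# ab+c : an a, then one or more b, then a c.
one_or_more=$(printf '%s\n' "$input" | grep -E 'ab+c')

# ^ab?c$ : an a, then zero or one b, then a c, filling the whole line.
zero_or_one=$(printf '%s\n' "$input" | grep -E '^ab?c$')

echo "$one_or_more"
echo "$zero_or_one"
```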
char{n}
char\{n\}
In this expression (and the ones to follow), char is a regular expression that stands for a single character (for example, a literal character or a period (.)). Such a regular expression followed by a number in braces stands for that number of repetitions of a character. For example, X\{3\} stands for XXX. In basic regular expressions, in order to reduce the number of special characters, { and } must be escaped by the \ character to make them special, as shown in the second form (and the ones to follow).
char{min,}
char\{min,\}
When a number, min, followed by a comma appears in braces following a single-character regular expression, it stands for at least min repetitions of a character. For example, X\{3,\} stands for at least three repetitions of X.
char{min,max}
char\{min,max\}
When a single-character regular expression is followed by a pair of numbers in braces, it stands for at least min repetitions and no more than max repetitions of a character. For example, X\{3,7\} stands for three to seven repetitions of X.
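The three brace forms can be compared with grep -x, which requires the pattern to match the whole line. This is a sketch under the assumption that grep follows the basic and extended syntaxes described here; the input (runs of 2, 3, 5, and 8 X characters) and the variable names are invented.

```shell
input=$(printf 'XX\nXXX\nXXXXX\nXXXXXXXX\n')

# Basic syntax: exactly three X characters.
exactly_three=$(printf '%s\n' "$input" | grep -x 'X\{3\}')

# Basic syntax: at least three X characters.
at_least_three=$(printf '%s\n' "$input" | grep -x 'X\{3,\}')

# Extended syntax: the braces are written without backslashes.
three_to_seven=$(printf '%s\n' "$input" | grep -x -E 'X{3,7}')

echo "$exactly_three"
echo "$at_least_three"
echo "$three_to_seven"
```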
regexp1|regexp2
This expression matches either regular expression regexp1 or regexp2.
(regexp)
\(regexp\)
This lets you group parts of regular expressions. Except where overridden by parentheses, concatenation has the highest precedence. In basic regular expressions, in order to reduce the number of special characters, ( and ) must be escaped by the \ character to make them special, as shown in the second form.
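How grouping changes the scope of | can be seen in the following sketch, which uses grep -E with -x so that each pattern must match the whole line. The input lines and variable names are hypothetical.

```shell
input=$(printf 'abd\nacd\nab\ncd\nad\n')

# a(b|c)d : the parentheses limit the alternation to the middle character.
grouped=$(printf '%s\n' "$input" | grep -x -E 'a(b|c)d')

# ab|cd : without parentheses, concatenation binds more tightly than |,
# so the alternatives are the whole strings ab and cd.
ungrouped=$(printf '%s\n' "$input" | grep -x -E 'ab|cd')

echo "$grouped"
echo "$ungrouped"
```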
\<
This matches the beginning of an identifier, defined as the boundary between non-alphanumerics and alphanumerics (including underscore). This matches no characters, only the context.
\>
This construct is analogous to the \< notation except that it matches the end of an identifier.
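The identifier boundaries can be tried with a grep that accepts \< and \>. That is an assumption about the local grep (GNU grep, for instance, supports them as an extension); the input lines and variable names are made up.

```shell
input=$(printf 'cat\nconcatenate\nthe cat sat\ncat_food\n')

# \<cat\> : cat as a complete identifier. The underscore counts as part of an
# identifier, so cat_food does not match.
whole_word=$(printf '%s\n' "$input" | grep '\<cat\>')

# \<cat : cat at the beginning of an identifier only.
word_start=$(printf '%s\n' "$input" | grep '\<cat')

echo "$whole_word"
echo "$word_start"
```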
The regular expressions accepted by Ex and Vi are similar to basic regular expressions, except that the \{ and \} characters are not special, the [: :] character class expressions are not available, and the \< and \> metacharacters can be used. The metacharacters they accept are:
^ $ . * \( \) [ \ \< \>
Notation | awk | ed | egrep | expr | gres | pg | sed | vi |
---|---|---|---|---|---|---|---|---|
. | • | • | • | • | • | • | • | • |
^ | • | • | • | • | • | • | • | |
$ | • | • | • | • | • | • | • | • |
[...] | • | • | • | • | • | • | • | • |
[::] | • | • | • | • | • | • | • | |
re* | • | • | • | • | • | • | • | • |
re+ | • | • | • | • | ||||
re? | • | • | • | • | ||||
re|re | • | • | • | • | ||||
\d | • | • | • | • | • | • | • | • |
(...) | • | • | • | • | ||||
\(...\) | • | • | • | • | ||||
\< | • | • | ||||||
\> | • | • | ||||||
\{ \} | • | • | • | • | • | |||
{ } | • | • |
abc
matches any line of text containing the three letters abc in that order.
a.c
matches any string beginning with the letter a, followed by any character, followed by the letter c.
^.$
matches any line containing exactly one character (the newline is not counted).
a(b*|c*)d
matches any string beginning with a letter a, followed by either zero or more of the letter b, or zero or more of the letter c, followed by the letter d.
.* [a-z]+ .*
matches any line containing a word, consisting of lowercase alphabetic characters, delimited by at least one space on each side.
(morty).*\1
morty.*morty
These expressions both match lines containing at least two occurrences of the string morty.
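The equivalence of the two forms can be checked with grep. The sketch below writes the back-reference in the basic syntax, \(morty\).*\1, because plain grep takes basic regular expressions; the input lines and variable names are invented for the example.

```shell
input=$(printf 'morty\nmorty and morty\nrick and morty\n')

# Back-reference form: \1 must repeat whatever the group matched.
with_backref=$(printf '%s\n' "$input" | grep '\(morty\).*\1')

# Spelled-out form: the string is simply written twice.
spelled_out=$(printf '%s\n' "$input" | grep 'morty.*morty')

echo "$with_backref"
echo "$spelled_out"
```

Both commands select only the line in which morty appears twice.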
[[:space:][:alnum:]]
Matches any character that is either a white space character or alphanumeric.