Searching and Replacing in Strings (GNU Octave (version 8.4.0))

5.3.5 Searching and Replacing in Strings

: newstr = strrep (str, ptn, rep)

: newstr = strrep (cellstr, ptn, rep)

: newstr = strrep (…, "overlaps", val)

Replace all occurrences of the pattern ptn in the string str with the string rep and return the result.

The optional argument "overlaps" determines whether the pattern can match at every position in str (true), or only for unique occurrences of the complete pattern (false). The default is true.

str may also be a cell array of strings, in which case the replacement is done for each element and a cell array is returned.

Example:

strrep ("This is a test string", "is", "&%$")
    ⇒  "Th&%$ &%$ a test string"
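
As a minimal sketch of the "overlaps" option and of cell array input, the following calls may help. The sample strings and variable names are made up for illustration.

```octave
## "aba" occurs at positions 1, 3, and 5 of "abababa".
## With the default "overlaps" = true, every occurrence is replaced.
with_overlaps = strrep ("abababa", "aba", "X")
    ## ⇒ "XXX"

## With "overlaps" = false, only the non-overlapping occurrences
## at positions 1 and 5 are replaced.
without_overlaps = strrep ("abababa", "aba", "X", "overlaps", false)
    ## ⇒ "XbX"

## Cell array input: each element is processed independently.
names = strrep ({"file_a.txt", "file_b.txt"}, ".txt", ".dat")
    ## ⇒ {"file_a.dat", "file_b.dat"}
```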

See also: regexprep, strfind.

: newstr = erase (str, ptn)

Delete all occurrences of ptn within str.

str and ptn can be ordinary strings, cell arrays of strings, or character arrays.

Examples

## string, single pattern
erase ("Hello World!", " World")
    ⇒ "Hello!"

## cellstr, single pattern
erase ({"Hello", "World!"}, "World")
    ⇒ {"Hello", "!"}

## string, multiple patterns
erase ("The Octave interpreter is fabulous", ...
       {"interpreter ", "The "})
    ⇒ "Octave is fabulous"

## cellstr, multiple patterns
erase ({"The ", "Octave interpreter ", "is fabulous"}, ...
       {"interpreter ", "The "})
    ⇒ {"", "Octave ", "is fabulous"}

Programming Note: When occurrences of a pattern overlap within a string, erase deletes only the first of the overlapping instances. For example:

erase ("abababa", "aba")
    ⇒ "b"

For processing overlaps, see strrep.
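
The difference can be sketched by running both functions on the same input. The variable names here are illustrative only, and the strrep call relies on its default "overlaps" value of true.

```octave
str = "abababa";
ptn = "aba";

## erase removes the matches at positions 1 and 5 and keeps the "b"
## between them.
erased = erase (str, ptn)
    ## ⇒ "b"

## strrep with an empty replacement also consumes the overlapping
## match at position 3, so nothing is left.
stripped = strrep (str, ptn, "")
    ## ⇒ ""
```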

See also: strrep, regexprep.

: [s, e, te, m, t, nm, sp] = regexp (str, pat)

: […] = regexp (str, pat, "opt1", …)

Regular expression string matching.

Search for pat in UTF-8 encoded str and return the positions and substrings of any matches, or empty values if there are none.

The matched pattern pat can include any of the standard regex operators, including:

.

Match any character

* + ? {}

Repetition operators, representing

*: Match zero or more times
+: Match one or more times
?: Match zero or one times
{n}: Match exactly n times
{n,}: Match n or more times
{m,n}: Match between m and n times

[…] [^…]

List operators. The pattern will match any character listed between "[" and "]". If the first character is "^" then the pattern is inverted and any character except those listed between brackets will match.

Escape sequences defined below can also be used inside list operators. For example, a template for a floating point number might be [-+.\d]+.

() (?:)

Grouping operator. The first form, parentheses only, also creates a token.

|

Alternation operator. Match one of a choice of regular expressions. The alternatives must be delimited by the grouping operator () above.

^ $

Anchoring operators. Require the pattern to occur at the start (^) or end ($) of the string.

In addition, the following escaped characters have special meaning.

\d: Match any digit
\D: Match any non-digit
\s: Match any whitespace character
\S: Match any non-whitespace character
\w: Match any word character
\W: Match any non-word character
\<: Match the beginning of a word
\>: Match the end of a word
\B: Match within a word
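
A few of the operators and escape sequences above can be tried on a short sample string. This is only a sketch: the string and the variable names are invented for the example.

```octave
str = "cat bat 42 -3.14 category";

## List operator plus repetition: runs of sign, dot, and digit characters.
numbers = regexp (str, '[-+.\d]+', 'match')
    ## ⇒ {"42", "-3.14"}

## Grouping with alternation: "cat" or "bat" anywhere in the string.
animals = regexp (str, '(cat|bat)', 'match')
    ## ⇒ {"cat", "bat", "cat"}

## Word anchors: "cat" only where it forms a complete word.
whole = regexp (str, '\<cat\>', 'start')
    ## ⇒ 1

## String anchors: the word at the very start and at the very end.
first_word = regexp (str, '^\w+', 'match')
    ## ⇒ {"cat"}
last_word = regexp (str, '\w+$', 'match')
    ## ⇒ {"category"}
```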

Implementation Note: For compatibility with MATLAB, escape sequences in pat (e.g., "\n" => newline) are expanded even when pat has been defined with single quotes. To disable expansion use a second backslash before the escape sequence (e.g., "\\n") or use the regexptranslate function.
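
The effect of this note can be illustrated with single-quoted patterns. The sample strings below are assumptions chosen to make the two cases easy to tell apart.

```octave
## A double-quoted string holds a real newline at position 2.
str = "a\nb";

## The single-quoted pattern '\n' is still treated as a newline.
idx_newline = regexp (str, '\n', 'start')
    ## ⇒ 2

## A single-quoted string keeps the backslash, so this has 4 characters.
literal = 'a\nb';

## A second backslash makes the pattern match the literal text \n.
idx_literal = regexp (literal, '\\n', 'start')
    ## ⇒ 2
```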

The outputs of regexp default to the order given below

s: The start indices of each matching substring
e: The end indices of each matching substring
te: The extents of each matched token surrounded by (…) in pat
m: A cell array of the text of each match
t: A cell array of the text of each token matched
nm: A structure containing the text of each matched named token, with the name being used as the fieldname. A named token is denoted by (?<name>…).
sp: A cell array of the text not returned by match, i.e., what remains if you split the string based on pat.
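
As an illustration of the default output order, the sketch below requests all seven outputs for a pattern with two named tokens. The input string and the token names k and v are hypothetical.

```octave
str = "set key=value now";
pat = '(?<k>\w+)=(?<v>\w+)';

[s, e, te, m, t, nm, sp] = regexp (str, pat);

s       ## ⇒ 5, start of "key=value"
e       ## ⇒ 13, end of "key=value"
te{1}   ## ⇒ [5 7; 9 13], one row of extents per token
m       ## ⇒ {"key=value"}
t{1}    ## ⇒ {"key", "value"}
nm.k    ## ⇒ "key"
nm.v    ## ⇒ "value"
sp      ## ⇒ {"set ", " now"}
```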

Particular output arguments, or the order of the output arguments, can be selected by additional opt arguments. These are strings, and the correspondence between the optional arguments and the output arguments is

'start': s
'end': e
'tokenExtents': te
'match': m
'tokens': t
'names': nm
'split': sp
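
For instance, the selector strings can be used to request only some outputs, in any order. This is a small sketch with an invented input string.

```octave
str = "ab12 cd345";
pat = '([a-z]+)(\d+)';

## Request tokens first and whole matches second.
[tok, mat] = regexp (str, pat, 'tokens', 'match');

tok{1}   ## ⇒ {"ab", "12"}
tok{2}   ## ⇒ {"cd", "345"}
mat      ## ⇒ {"ab12", "cd345"}

## Request only the text between matches.
parts = regexp (str, pat, 'split');
numel (parts)   ## ⇒ 3 (empty text, a single space, empty text)
```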

Additional arguments are summarized below.

‘once’

Return only the first occurrence of the pattern.

‘matchcase’

Make the matching case sensitive. (default)

Alternatively, use (?-i) in the pattern.

‘ignorecase’

Ignore case when matching the pattern to the string.

Alternatively, use (?i) in the pattern.

‘stringanchors’

Match the anchor characters at the beginning and end of the string. (default)

Alternatively, use (?-m) in the pattern.

‘lineanchors’

Match the anchor characters at the beginning and end of the line.

Alternatively, use (?m) in the pattern.

‘dotall’

The pattern . matches all characters including the newline character. (default)

Alternatively, use (?s) in the pattern.

‘dotexceptnewline’

The pattern . matches all characters except the newline character.

Alternatively, use (?-s) in the pattern.

‘literalspacing’

All characters in the pattern, including whitespace, are significant and are used in pattern matching. (default)

Alternatively, use (?-x) in the pattern.

‘freespacing’

The pattern may include arbitrary whitespace and also comments beginning with the character ‘#’.

Alternatively, use (?x) in the pattern.

‘noemptymatch’

Zero-length matches are not returned. (default)

‘emptymatch’

Return zero-length matches.

regexp ('a', 'b*', 'emptymatch') returns [1 2] because there are zero or more 'b' characters at positions 1 and end-of-string.
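
The following sketch exercises several of the options above on short, made-up inputs. It assumes the defaults stated in this section (string anchors, dotall, literal spacing).

```octave
## 'once' returns the first match as a string rather than a cell array.
first = regexp ("one two", '\w+', 'match', 'once')
    ## ⇒ "one"

## 'ignorecase' matches regardless of letter case.
both = regexp ("Hello hello", 'hello', 'match', 'ignorecase')
    ## ⇒ {"Hello", "hello"}

## 'lineanchors' lets ^ match at the start of every line.
str = "a1\nb2";
heads_default = regexp (str, '^\w', 'match')
    ## ⇒ {"a"}
heads_lines = regexp (str, '^\w', 'match', 'lineanchors')
    ## ⇒ {"a", "b"}

## 'dotexceptnewline' stops . from matching the newline.
dot_default = regexp (str, '1.b', 'match');
dot_strict = regexp (str, '1.b', 'match', 'dotexceptnewline');
numel (dot_default)   ## ⇒ 1
numel (dot_strict)    ## ⇒ 0

## 'freespacing' ignores whitespace inside the pattern.
spaced = regexp ("ab", 'a b', 'match', 'freespacing')
    ## ⇒ {"ab"}
```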

Stack Limitation Note: Pattern searches are done with a recursive function which can overflow the program stack when there are a large number of matches. For example,

regexp (repmat ('a', 1, 1e5), '(a)+')

may lead to a segfault. As an alternative, consider constructing pattern searches that reduce the number of matches (e.g., by creatively using set complement), and then further processing the return variables (now reduced in size) with successive regexp searches.

See also: regexpi, strfind, regexprep.

: [s, e, te, m, t, nm, sp] = regexpi (str, pat)

: […] = regexpi (str, pat, "opt1", …)

Case insensitive regular expression string matching.

Search for pat in UTF-8 encoded str and return the positions and substrings of any matches, or empty values if there are none. See regexp for details on the syntax of the search pattern.
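
A brief sketch, using an invented sample string, of how regexpi compares with regexp called with the 'ignorecase' option:

```octave
str = "Hello HELLO hello";

## Case-insensitive search.
m1 = regexpi (str, 'hello', 'match')
    ## ⇒ {"Hello", "HELLO", "hello"}

## The same search through regexp with the 'ignorecase' option.
m2 = regexp (str, 'hello', 'match', 'ignorecase');
isequal (m1, m2)
    ## ⇒ 1
```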

See also: regexp.

: outstr = regexprep (string, pat, repstr)

: outstr = regexprep (string, pat, repstr, "opt1", …)

Replace occurrences of pattern pat in string with repstr.

The pattern is a regular expression as documented for regexp.

All strings must be UTF-8 encoded.

The replacement string may contain $i, which substitutes for the ith set of parentheses in the match string. For example,

regexprep ("Bill Dunn", '(\w+) (\w+)', '$2, $1')

returns "Dunn, Bill"

Options in addition to those of regexp are

‘once’: Replace only the first occurrence of pat in the result.
‘warnings’: This option is present for compatibility but is ignored.
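
A few further calls, sketched with made-up inputs, show token substitution together with the 'once' option and one of the regexp options:

```octave
## Reorder date fields using numbered tokens.
date = regexprep ("2024-01-15", '(\d+)-(\d+)-(\d+)', '$3/$2/$1')
    ## ⇒ "15/01/2024"

## 'once' replaces only the first match.
first_only = regexprep ("aaa", 'a', 'b', 'once')
    ## ⇒ "baa"

## regexp options such as 'ignorecase' are accepted as well.
renamed = regexprep ("Hello World", 'world', 'Octave', 'ignorecase')
    ## ⇒ "Hello Octave"
```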

See also: regexp, regexpi, strrep.

: str = regexptranslate (op, s)

Translate a string for use in a regular expression.

This may include either wildcard replacement or special character escaping.

The behavior is controlled by op which can take the following values

"wildcard"

The wildcard characters ., *, and ? are replaced with wildcards that are appropriate for a regular expression. For example:

regexptranslate ("wildcard", "*.m")
     ⇒ '.*\.m'

"escape"

The characters $.?[] that have special meaning in regular expressions are escaped so that they are treated literally. For example:

regexptranslate ("escape", "12.5")
     ⇒ '12\.5'
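
As a sketch of how the translated strings might be used with regexp (the test strings and variable names are assumptions made for this example):

```octave
## Without escaping, the dot in "12.5" matches any character.
loose = regexp ("1235 12.5", '12.5', 'match')
    ## ⇒ {"1235", "12.5"}

## With escaping, only the literal text matches.
pat = regexptranslate ("escape", "12.5");
strict = regexp ("1235 12.5", pat, 'match')
    ## ⇒ {"12.5"}

## A wildcard converted to a regular expression, then anchored so that
## it must cover the whole file name.
wild = regexptranslate ("wildcard", "*.m");
anchored = ["^", wild, "$"];
is_m_file = ! isempty (regexp ("foo.m", anchored, 'once'))
    ## ⇒ 1
is_mat_file = ! isempty (regexp ("foo.mat", anchored, 'once'))
    ## ⇒ 0
```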

See also: regexp, regexpi, regexprep.