Regular expressions in PHP and other languages. Regular expressions: search with inversion

Lecture



I needed to find all the words in a sentence except certain ones. For example: in a sentence, find everything except the words computing and matching.

Test text: 'Provides wildcard characters in a file system.'

Regular expression: '/\b(?!(?:computing|matching)\b)\w+\b/'

Result: a list of all words except computing and matching.

Where:

\b - word boundary

\w - a character from the set [A-Za-z0-9_]

(?: ) - non-capturing brackets, i.e. what matches inside the group is not stored as a separate result

(?! ) - search with inversion, i.e. a match is allowed only where what is inside the brackets does not follow
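The expression can be tried out in Python, whose `re` module supports the same constructs. This is only a minimal sketch: the sample sentence and the names `EXCLUDE_WORDS`, `sentence` and `words` are assumptions made for illustration, not part of the original test.

```python
import re

# The pattern from the example, without the PHP delimiters.
EXCLUDE_WORDS = re.compile(r"\b(?!(?:computing|matching)\b)\w+\b")

# Hypothetical sample sentence, chosen so that it contains both excluded words.
sentence = "Fast computing needs pattern matching in a file system"

words = EXCLUDE_WORDS.findall(sentence)
print(words)
# ['Fast', 'needs', 'pattern', 'in', 'a', 'file', 'system']
```

The \b inside the negative check matters: without it, any word that merely starts with "computing" or "matching" would be rejected as well.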

In general, the ?! construct is called a negative lookahead. It is one of four types of positional checks:

Type                   Regular expression   Successful if the subexpression ...
Positive lookbehind    (?<=..)              can match on the left
Negative lookbehind    (?<!..)              cannot match on the left
Positive lookahead     (?=..)               can match on the right
Negative lookahead     (?!..)               cannot match on the right
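The four positional checks can be compared side by side on one string. The following is a small sketch in Python; the sample text and the variable names are assumptions chosen for illustration.

```python
import re

# Hypothetical sample text.
prices = "100 USD, 200 EUR, $300"

# Positive check on the right: digits that are followed by " USD".
followed_by_usd = re.findall(r"\d+(?= USD)", prices)

# Negative check on the right: whole numbers that are not followed by " USD".
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", prices)

# Positive check on the left: digits that come right after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative check on the left: numbers that do not come right after a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(followed_by_usd)      # ['100']
print(not_followed_by_usd)  # ['200', '300']
print(after_dollar)         # ['300']
print(not_after_dollar)     # ['100', '200']
```

None of the checks consumes text: the " USD" suffix and the "$" prefix never appear in the results.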

To check the work, you can use:

  • RegExr is an online regular expression checker tool.
  • rejex.heroku.com is another online regexp debugger.
  • strfriend.com is a regular expression visualizer.
  • pythex.appspot.com

Firefox extension: Regular Expressions Tester

Useful Material - RegexAdvice Forums - regular expressions forum.

While reading the bestseller on regular expressions, J. Friedl's "Mastering Regular Expressions", I learned two interesting things:

  • according to the POSIX specification, there are two dialects of regular expressions: BRE and ERE
  • there are two basic technologies on which regular expression engines are built: NFA and DFA

BRE - basic regular expressions

ERE - extended regular expressions

NFA (nondeterministic finite automaton) - the engine is driven by the regular expression

DFA (deterministic finite automaton) - the engine is driven by the text

Metacharacter support in the two dialects

Metacharacters                       BRE            ERE
Dot, ^, $, [..], [^..]               yes            yes
Any-number quantifier                *              *
Quantifiers + and ?                  -              + ?
Interval quantifier                  \{min,max\}    {min,max}
Grouping                             \(..\)         (..)
Applying quantifiers to brackets     yes            yes
Backreferences                       \1..\9         -
Alternation                          -              yes

Short comparison of DFA and NFA

Feature                        DFA    NFA
backreference support          no     yes
storing text in parentheses    no     yes
fast match search              yes    no
fast compilation               no     yes
lower memory costs             no     yes
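Two of these traits are easy to observe in practice. The sketch below uses Python's `re` module, which is a backtracking engine of the NFA type; the sample strings and variable names are assumptions for illustration.

```python
import re

# Backreference: \1 must repeat exactly what the first brackets captured.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))  # is

# An engine driven by the expression tries the alternatives from left to right
# and keeps the first one that leads to a match, even if a longer one exists.
first_wins = re.match(r"one|oneself", "oneself")
print(first_wins.group())  # one
```

A text-driven POSIX engine would be expected to report the longest alternative in the second case, which is one practical way to tell the two engine types apart.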

Editor's note. There are many dialects of the regular expression language. The expressions in this article use Perl syntax, including features that are not available in other dialects. A similar dialect outside Perl is known as PCRE (Perl-Compatible Regular Expressions).

Most text editors let you search for a word or part of a word, with or without case sensitivity. These features are usually sufficient, because the texts to be processed are relatively simple: the human brain immediately evaluates each match that is found and decides what to do with it.

Sooner or later there comes a time when human resources are not enough to process all the incoming information, or when data processing has to be fully automated, for example to publish news from various sources on your website. This is where automated systems come into play, and regular expressions are widely used in them. They can reduce the amount of code needed for text processing many times over. On first encounter, regular expressions provoke a variety of reactions, but what these reactions have in common is that people are put off by how hard it is to understand the meaning of these "statements". This article is intended to explain how to read an expression and understand its meaning.

Terms

  • operator - a symbol used in regular expressions; operators control the search, but are not themselves part of the match.
  • match - one or more characters and regular expression operators that correspond to the text being checked.
  • quantifier - a regular expression operator that determines how many times a match may repeat.
    • greedy (another definition - "maximal") - captures all available matches, after which it can give some of them back if the operators to its right require it. By default, all quantifiers are greedy.
    • non-greedy (another definition - "minimal") - captures one more match only when what stands to its right cannot match yet.
    • possessive - greedy, but never gives back the matches it has captured.
  • consecutive match - a sequence of identical matches.
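The difference between greedy and non-greedy quantifiers is easiest to see on a string with two candidate endings. This is a minimal sketch in Python with a made-up sample string; possessive quantifiers are left out because Python's `re` module supports them only from version 3.11.

```python
import re

# Hypothetical sample markup.
markup = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as possible, then gives characters back
# until the final </b> can match.
greedy = re.findall(r"<b>.*</b>", markup)

# Non-greedy: .*? grows one character at a time and stops
# at the first position where </b> matches.
non_greedy = re.findall(r"<b>.*?</b>", markup)

print(greedy)      # ['<b>one</b> and <b>two</b>']
print(non_greedy)  # ['<b>one</b>', '<b>two</b>']
```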

Operators dictionary

  • characters
    • <character> - a printable character: letters, digits, punctuation marks, etc.
    • . - any character
    • substring - a literal set of characters (one with no operators).
  • match control (quantifiers)
    • * - zero or more consecutive matches
    • + - one or more consecutive matches
    • ? - zero or one match
    • *? - non-greedy zero or more consecutive matches
    • +? - non-greedy one or more consecutive matches
    • *+ - possessive (greedy, never gives anything back) zero or more consecutive matches
    • ++ - possessive one or more consecutive matches
    • {n1,n2} - from n1 to n2 consecutive matches
    • {n} - exactly n matches
  • sequences - only the entire sequence can match
    • ( - start capturing matches into a sequence and store it in memory (capturing brackets)
    • (?: - start capturing matches into a sequence, but do not store it (non-capturing brackets)
    • ) - finish capturing matches
    • | - an alternative sequence, tried when the sequence on the left does not match
  • character sets - match any one of the listed characters
    • [ - open the character set
    • ] - close the character set
    • - - specify a range of characters
    • ^ - the set contains all characters except those listed
  • positional checks
    • (?= - start checking for a match on the right
    • (?! - start checking for the absence of a match on the right
    • (?<= - start checking for a match on the left
    • (?<! - start checking for the absence of a match on the left
    • ) - finish the check
  • references to previous matches
    • \1 .. \9 - the number of the sequence in capturing brackets
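Several entries of the dictionary can be seen working together in a short sketch. It is written in Python; the sample line and all names in it are assumptions made for illustration.

```python
import re

# Hypothetical sample line.
log_line = "2020-05-11 user='anna' action=\"login\""

# Capturing brackets store the year and the day;
# the month sits in non-capturing brackets and is not stored.
date = re.search(r"(\d{4})-(?:\d{2})-(\d{2})", log_line)
print(date.groups())  # ('2020', '11')

# Character set with a range plus an interval quantifier:
# whole words of 3 to 5 lowercase letters.
short_words = re.findall(r"\b[a-z]{3,5}\b", log_line)
print(short_words)  # ['user', 'anna', 'login']

# Alternative sequence plus a reference to a previous match:
# the closing quote (\2) must be the character that sequence 2 captured.
pairs = [
    (m.group(1), m.group(3))
    for m in re.finditer(r"(user|action)=(['\"])(.*?)\2", log_line)
]
print(pairs)  # [('user', 'anna'), ('action', 'login')]
```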

This is a translation of regular expression constructs into "human" language. Later, when analyzing the examples, I will describe them in detail.

Reading and writing regular expressions

Putting the rules into words and writing them down is the main difficulty in mastering text parsing with regular expressions. Without experience, it is hard to formulate the rules verbally. Nevertheless, this is the most efficient way to write regular expressions.

The main rule when creating regular expressions is to write them out in expanded form, on a sheet of paper or on the screen. This makes it easiest to see whether the text will be processed correctly. Another rule is to build an expression from the general to the specific. Following it significantly reduces both the time spent writing the expression and the number of errors.

Let us formulate the conditions for successfully developing a regular expression:

  1. The expression must have a "literary" (written-out, verbal) form.
  2. The verbal description must be logical.
  3. The expression should be built:
    1. from simple to complex;
    2. from general to specific.

Although such a notation is cumbersome, it speeds up the development and debugging of text-parsing rules and makes them more effective.
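One way to keep an expression in expanded form on the screen is the "verbose" mode that many engines offer. The sketch below uses Python's `re.VERBOSE` flag; the pattern (an opening tag with its name) and the name `OPENING_TAG` are chosen only for illustration.

```python
import re

# In verbose mode, whitespace in the pattern is ignored and '#' starts a comment,
# so each operator can sit on its own line next to its verbal description.
OPENING_TAG = re.compile(
    r"""
    <           # find the substring '<'
    (           # start capturing characters in a sequence
      \w+       #   one or more word characters (the tag name)
    )           # finish capturing
    [^>]*       # zero or more characters other than '>'
    >           # the substring '>'
    """,
    re.VERBOSE,
)

tag_names = OPENING_TAG.findall('<p class="content">text</p><br>')
print(tag_names)  # ['p', 'br']
```

Written this way, the verbal description and the operators stay together, and each line can be checked against the rule it implements.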

Expression Example

Task

Select from HTML markup the contents of certain blocks with given attributes:

  • a paragraph (<p> tag) with the CSS class content
  • a list item (<li> tag) with the CSS class content

Source markup:

 <p>Paragraph 1</p>
 <p class="content">Paragraph 2</p>
 <ul>
    <li>Element 1</li>
    <li class="content">Element 2</li>
 </ul>

Expected result:

 <p class="content">Paragraph 2</p>
 <li class="content">Element 2</li>

Solution Sequence

  1. Create a regular expression that selects all tags.
  2. Extend it to select only paired tags.
  3. Extend it to select only the specified tags.
  4. Write an expression to find attribute_name="value" pairs.
  5. Extend the main expression to select the specified tags with certain attributes.
  6. Add search modifiers.

People usually try to do all 6 steps at once, which causes serious problems when debugging. I suggest proceeding gradually, step by step.

I will give the solution right away:

 #<(p|li)\s+[^>]*?class\s*=\s*(['"])content\2[^>]*>((?:(?!</\1>).)*)</\1>#is

You will agree that it looks like pure gibberish. However, as you follow the description, you will see that it is not that difficult.

So, let's begin:

Step 1. Select all tags.

Let's write down the parsing rules in plain language:

  1. Find the substring '<'
  2. Start capturing characters in a sequence.
    1. Grab one or more word characters (\w)
  3. Finish capturing matches
  4. Grab 0 or more characters that do not match the '>' character set.
  5. Grab the substring '>'
  6. Start capturing characters in a sequence.
    1. Grab 0 or more characters that do not match the '<' character set.
  7. Finish capturing matches

Now that the task has been accurately described, you can start writing it as a regular expression:

  1. <
  2. (
    1. \w+
  3. )
  4. [^>]*
  5. >
  6. (
    1. [^<]*
  7. )

We got the following expression:

 <(\w+)[^>]*>([^<]*)

It has 2 drawbacks:

  1. It captures all tags, not just paired ones.
  2. It does not handle nested tags properly.
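Both drawbacks can be reproduced with a couple of short strings. This is a sketch in Python; the sample strings and the name `ALL_TAGS` are assumptions for illustration.

```python
import re

# The expression from step 1.
ALL_TAGS = re.compile(r"<(\w+)[^>]*>([^<]*)")

# Drawback 1: an unpaired tag is captured as well.
unpaired = ALL_TAGS.findall("<br>line")
print(unpaired)  # [('br', 'line')]

# Drawback 2: the text of <p> is cut off at the nested <b> tag.
nested = ALL_TAGS.findall("<p>a <b>b</b> c</p>")
print(nested)  # [('p', 'a '), ('b', 'b')]
```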

Step 2. Selecting paired tags

We write the parsing rules in a formal language:

  1. Find the substring '<'
  2. Start capturing characters in a sequence.
    1. Grab one or more word characters (\w)
  3. Finish capturing matches
  4. Grab 0 or more characters that do not match the '>' character set.
  5. Grab the substring '>'
  6. Start capturing characters in a sequence.
    1. Start capturing characters in a non-capturing sequence.
      1. Start checking for the absence of a match on the right with the sequence of
        1. '</'
        2. the match found in steps 2-3 (reference to sequence 1)
        3. '>'
      2. Finish the check.
      3. Grab any character
    2. Finish capturing matches.
    3. Grab the sequence 0 or more times.
  7. Finish capturing matches.
  8. Grab the substring '</'
  9. Grab the match found in steps 2-3 (reference to sequence 1)
  10. Grab the substring '>'

Now that the task has been accurately described, you can start writing it as a regular expression:

  1. <
  2. (
    1. \w+
  3. )
  4. [^>]*
  5. >
  6. (
    1. (?:
      1. (?!
        1. </
        2. \1
        3. >
      2. )
      3. .
    2. )
    3. *
  7. )
  8. </
  9. \1
  10. >

So, we got the following expression:

 <(\w+)[^>]*>((?:(?!</\1>).)*)</\1>

It captures any paired tags along with the content.
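A quick check of this behaviour, as a sketch in Python. The sample string and the name `PAIRED_TAGS` are assumptions; the `re.DOTALL` flag is added so that "." also matches newlines in multi-line markup.

```python
import re

# The expression from step 2.
PAIRED_TAGS = re.compile(r"<(\w+)[^>]*>((?:(?!</\1>).)*)</\1>", re.DOTALL)

# The nested <b> pair stays inside the content of <p>,
# and the unpaired <br> is no longer captured.
paired = PAIRED_TAGS.findall("<p>a <b>b</b> c</p><br>")
print(paired)  # [('p', 'a <b>b</b> c')]
```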

Step 3. Highlight the required tags.

Using the regular expression obtained in the previous step, we can select several kinds of tags from the text at once with the construct "alternative sequence, tried when the sequence on the left does not match". In the description we shorten this to "alternative sequence".

Let's add selection of the full contents of paragraphs and list items:

  1. Find the substring '<'
  2. Start capturing characters in a sequence.
    1. Grab the substring 'p'
    2. Add an alternative sequence
    3. Grab the substring 'li'
  3. Finish capturing matches
  4. Check for a match on the right with the character set '[\s>]'
  5. Grab 0 or more characters that do not match the '>' character set.
  6. Grab the substring '>'
  7. Start capturing characters in a sequence.
    1. Start capturing characters in a non-capturing sequence.
      1. Start checking for the absence of a match on the right with the sequence of
        1. '</'
        2. the match found in steps 2-3 (reference to sequence 1)
        3. '>'
      2. Finish the check
      3. Grab any character
    2. Finish capturing matches
    3. Grab the sequence 0 or more times.
  8. Finish capturing matches
  9. Grab the substring '</'
  10. Grab the match found in steps 2-3 (reference to sequence 1)
  11. Grab the substring '>'

Item 4 (the check on the right) was added so that the expression does not capture tags whose names merely begin with the names of the selected tags. For example, so that a search for the <p> tag does not capture <param> tags.

We translate it into regular expression operators:

  1. <
  2. (
    1. p
    2. |
    3. li
  3. )
  4. (?=[\s>])
  5. [^>]*
  6. >
  7. (
    1. (?:
      1. (?!
        1. </
        2. \1
        3. >
      2. )
      3. .
    2. )
    3. *
  8. )
  9. </
  10. \1
  11. >

New regular expression:

 <(p|li)(?=[\s>])[^>]*>((?:(?!</\1>).)*)</\1>

Now only p and li tags and all their contents will be selected from the text.
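The effect of the check on the right is easiest to see by comparing the expression with and without it. A sketch in Python follows; the sample markup and the names `P_OR_LI` and `WITHOUT_CHECK` are assumptions for illustration.

```python
import re

# The expression from step 3.
P_OR_LI = re.compile(r"<(p|li)(?=[\s>])[^>]*>((?:(?!</\1>).)*)</\1>", re.DOTALL)

# The same expression without the check on the right, for comparison only.
WITHOUT_CHECK = re.compile(r"<(p|li)[^>]*>((?:(?!</\1>).)*)</\1>", re.DOTALL)

# Hypothetical sample markup with a tag whose name merely starts with "p".
html = '<pre>x</pre><p>y</p><li class="c">z</li>'

print(P_OR_LI.findall(html))
# [('p', 'y'), ('li', 'z')]

print(WITHOUT_CHECK.findall(html))
# [('p', 'x</pre><p>y'), ('li', 'z')]
```

Without the check, "<pre" is accepted as a "p" tag followed by extra characters, so the match starts too early and swallows the real paragraph.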

Step 4. Selecting attribute_name="value" pairs

Expression requirements

  1. there must be whitespace to the left of the attribute name
  2. to the right of the value there must be whitespace or the closing bracket of the tag
  3. the value must be enclosed in single or double quotes
  4. there may be whitespace between the equal sign, the attribute name and its value

We describe the problem in a formal language:

  1. Find 1 or more \s characters
  2. Grab 1 or more \w characters
  3. Grab 0 or more \s characters
  4. Grab the substring '='
  5. Grab 0 or more \s characters
  6. Start capturing characters in a sequence.
    1. Grab a character from the set ['"]
  7. Finish capturing matches.
  8. Grab 0 or more characters, each of which is not the character found in steps 6-7 (reference to sequence 1). A reference does not work inside a character set, so this is written as a check for the absence of a match on the right, followed by any character.
  9. Grab the character found in steps 6-7 (reference to sequence 1)

We translate into regular expression operators:

  1. \s+
  2. \w+
  3. \s*
  4. =
  5. \s*
  6. (
    1. ['"]
  7. )
  8. (?:(?!\1).)*
  9. \1

The following regular expression is obtained:

 \s+\w+\s*=\s*(['"])(?:(?!\1).)*\1

Modify it so that the expression matches only the attribute name 'class' and its value 'content':

 \s+class\s*=\s*(['"])content\1
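The attribute expressions can be tried on a single opening tag. The sketch below is in Python and makes two assumptions of its own: extra capturing brackets are added around the name and the value (so the quote becomes sequence 2), and the value is matched with a negative check on the right, because Python's `re` treats \1 inside a character set as a character code rather than as a reference.

```python
import re

# Generic name="value" pair. Sequence 1 is the name, 2 the quote, 3 the value.
ATTRIBUTE = re.compile(r"""\s+(\w+)\s*=\s*(['"])((?:(?!\2).)*)\2""")

# Only class="content" (or class='content'), as in the modified expression.
CLASS_CONTENT = re.compile(r"""\s+class\s*=\s*(['"])content\1""")

# Hypothetical opening tag mixing both kinds of quotes.
tag = """<p class="content" id='x1' title="it's">"""

attributes = [(m.group(1), m.group(3)) for m in ATTRIBUTE.finditer(tag)]
print(attributes)
# [('class', 'content'), ('id', 'x1'), ('title', "it's")]

has_content_class = CLASS_CONTENT.search(tag) is not None
print(has_content_class)  # True
```

The title attribute shows why the closing quote is tied to the opening one: an apostrophe inside a double-quoted value does not end the value.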

Step 5. Adding a check for certain attributes to the main expression

We describe the problem in a formal language:

  1. Find the substring '<'
  2. Start capturing characters in a sequence.
    1. Grab the substring 'p'
    2. Add an alternative sequence
    3. Grab the substring 'li'
  3. Finish capturing matches
  4. Grab 1 or more \s characters
  5. Grab the minimum of 0 or more characters that do not match the '>' character set.
  6. Add the regular expression from step 4: class\s*=\s*(['"])content\2 (the quote is now sequence 2, so the reference changes from \1 to \2)
  7. Grab 0 or more characters that do not match the '>' character set.
  8. Grab the substring '>'
  9. Start capturing characters in a sequence.
    1. Start capturing characters in a non-capturing sequence.
      1. Start checking for the absence of a match on the right with the sequence of
        1. '</'
        2. the match found in steps 2-3 (reference to sequence 1)
        3. '>'
      2. Finish the check
      3. Grab any character
    2. Finish capturing matches
    3. Grab the sequence 0 or more times.
  10. Finish capturing matches
  11. Grab the substring '</'
  12. Grab the match found in steps 2-3 (reference to sequence 1)
  13. Grab the substring '>'

We translate it into regular expression operators:

  1. <
  2. (
    1. p
    2. |
    3. li
  3. )
  4. \s+
  5. [^>]*?
  6. class\s*=\s*(['"])content\2
  7. [^>]*
  8. >
  9. (
    1. (?:
      1. (?!
        1. </
        2. \1
        3. >
      2. )
      3. .
    2. )
    3. *
  10. )
  11. </
  12. \1
  13. >

Resulting expression:

 <(p|li)\s+[^>]*?class\s*=\s*(['"])content\2[^>]*>((?:(?!</\1>).)*)</\1>

Step 6. Adding modifiers and delimiters

Delimiters

In many languages that support regular expressions, PHP and Perl among them, you can choose the expression delimiters. The most common ones are / / and # #. In principle, almost any pair of characters can be used if the interpreter supports it. When choosing delimiters, look at which characters occur in the regular expression and pick ones that do not. Otherwise you will have to escape those characters, which makes the expression harder to read. In our case the standard / / are not suitable, because the slash occurs inside the expression. Therefore I suggest using the # # delimiters.

Modifiers

For information on all the search modifiers, I advise you to consult the reference documentation, for example the PHP and Perl manuals. Here we use i (case-insensitive search) and s (the "." character also matches newlines).
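The effect of the two modifiers can be checked separately. The sketch below is in Python, where modifiers are passed as flags (`re.IGNORECASE` for i, `re.DOTALL` for s) rather than written after the closing delimiter; the sample text is an assumption.

```python
import re

# Hypothetical sample: an upper-case tag whose content spans two lines.
text = "<P>first\nsecond</P>"
pattern = r"<p>.*</p>"

no_flags = re.findall(pattern, text)
i_only = re.findall(pattern, text, re.IGNORECASE)
i_and_s = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)

print(no_flags)  # []  - <p> does not match <P>
print(i_only)    # []  - "." stops at the newline
print(i_and_s)   # ['<P>first\nsecond</P>']
```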

Final expression

 #<(p|li)\s+[^>]*?class\s*=\s*(['"])content\2[^>]*>((?:(?!</\1>).)*)</\1>#is
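To see the final expression at work on the markup from the task, here is a minimal sketch in Python. Python has no pattern delimiters, so the # # pair is dropped and the i and s modifiers become flags; the names `CONTENT_BLOCKS`, `html`, `blocks` and `contents` are assumptions for illustration.

```python
import re

# The final expression: sequence 1 is the tag name, 2 the quote, 3 the content.
CONTENT_BLOCKS = re.compile(
    r"""<(p|li)\s+[^>]*?class\s*=\s*(['"])content\2[^>]*>((?:(?!</\1>).)*)</\1>""",
    re.IGNORECASE | re.DOTALL,
)

# The markup from the task.
html = """<p>Paragraph 1</p>
<p class="content">Paragraph 2</p>
<ul>
    <li>Element 1</li>
    <li class="content">Element 2</li>
</ul>"""

blocks = [m.group(0) for m in CONTENT_BLOCKS.finditer(html)]
contents = [m.group(3) for m in CONTENT_BLOCKS.finditer(html)]

print(blocks)
# ['<p class="content">Paragraph 2</p>', '<li class="content">Element 2</li>']
print(contents)
# ['Paragraph 2', 'Element 2']
```

The tags without attributes are skipped because the expression requires at least one whitespace character after the tag name.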

As you can see, the problem is solved quite simply. I chose it for this article because the questions "how do I select the content of a particular tag" and "how do I parse HTML markup" come up on the forums so often. The solution is in front of you.

Useful links

  • Regular-expressions.info - a large amount of information on regexes and their implementation in different programming languages.
  • Search & Replace for Far - a great plugin for working with regexes in Far Manager.
  • RegexBuddy - a powerful program for visualizing regular expressions, with support for a huge number of dialects.
11.5.2020 1:7

Definitions of special characters for regular expressions
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before the newline at the end)
| Alternation
() Grouping
[] Character class
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
More special characters
\t tab (HT, TAB)
\n newline (LF, NL)
\r carriage return (CR)
\f form feed (FF)
\a alarm (bell) (BEL)
\e escape (think troff) (ESC)
\033 octal character (think of a PDP-11)
\x1B hexadecimal character
\cX control character (for example, \c[ is ESC)
\l lowercase the next character (think vi)
\u uppercase the next character (think vi)
\L lowercase until \E (think vi)
\U uppercase until \E (think vi)
\E end case modification (think vi)
\Q quote (disable) pattern metacharacters until \E

Even more special characters
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit
\D Match a non-digit character
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at the beginning of the string
\Z Match only at the end of the string, or before the newline at the end
\z Match only at the end of the string
\G Match only where the previous m//g left off (works only with /g)

