Lecture
I needed to find all the words in a sentence except certain ones. For example: find every word except computing and matching.
Test text: 'Provides wildcard characters in a file system.'
Regular expression: '/\b(?!(?:computing|matching)\b)\w+\b/'
Result: a list of all words except computing and matching.
Where,
\b - word boundary
\w - character set [A-Za-z0-9_]
?: - non-capturing group, i.e. whatever the grouping brackets match is not stored in the result
?! - negative lookahead, i.e. the match is rejected at any position where the enclosed pattern matches
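As a minimal sketch, the same expression can be tried with Python's `re` module, which supports this syntax. The sample sentence is a hypothetical one, chosen so that both excluded words actually occur in it; the `/.../` delimiters are dropped because Python takes the pattern as a plain string.

```python
import re

# Hypothetical test sentence (not from the article) that contains both excluded words.
text = "Provides computing and matching of wildcard characters in a file system."

# At every word start, the negative lookahead rejects the position
# if the whole word is "computing" or "matching".
pattern = re.compile(r"\b(?!(?:computing|matching)\b)\w+\b")

words = pattern.findall(text)
print(words)

# The \b inside the lookahead means only whole words are excluded:
# "computings" is a different word and is kept.
longer = pattern.findall("computings matching")
print(longer)
```

Because the group is non-capturing, `findall` returns the whole matched words rather than the contents of a group.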
In general, the ?! construct is called a negative lookahead. It is one of four types of positional (lookaround) checks:
Type | Regular expression | Succeeds if the subexpression ... |
---|---|---|
Positive lookbehind | (?<=..) | can match to the left |
Negative lookbehind | (?<!..) | cannot match to the left |
Positive lookahead | (?=..) | can match to the right |
Negative lookahead | (?!..) | cannot match to the right |
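The four checks can be compared side by side in a short Python sketch. The input string and the variable names are made up for illustration; note that Python's `re` only accepts fixed-width patterns inside a lookbehind.

```python
import re

# Hypothetical sample text.
text = "price: $100, cost: 200 EUR, $300"

# Positive lookbehind: digits that have "$" immediately to the left.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: digit runs with neither "$" nor another digit to the left.
not_after_dollar = re.findall(r"(?<![$\d])\d+", text)

# Positive lookahead: digits that have " EUR" immediately to the right.
before_eur = re.findall(r"\d+(?= EUR)", text)

# Negative lookahead: digit runs with neither " EUR" nor another digit to the right.
not_before_eur = re.findall(r"\d+(?! EUR|\d)", text)

print(after_dollar, not_after_dollar, before_eur, not_before_eur)
```

The extra `\d` in the two negative checks keeps the engine from settling for part of a number after backtracking. None of the checks consumes text, so `$` and ` EUR` never appear in the results.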
To check your work, you can use:
Regular Expressions Tester, an extension for Firefox
Useful material: RegexAdvice Forums, a forum about regular expressions.
While reading the bestseller on regular expressions, J. Friedl's "Regular Expressions", I learned a few interesting things:
BRE (basic regular expressions)
ERE (extended regular expressions)
NFA (nondeterministic finite automaton) - the engine is driven by the regular expression
DFA (deterministic finite automaton) - the engine is driven by the text
Metacharacter support in the BRE and ERE dialects
Metacharacters | BRE | ERE |
---|---|---|
Dot, ^, $, [..], [^..] | yes | yes |
"Any number" quantifier | * | * |
Quantifiers + and ? | | + ? |
Interval quantifier | \{min,max\} | {min,max} |
Grouping | \(..\) | (..) |
Applying quantifiers to parentheses | yes | yes |
Backreferences | \1..\9 | |
Alternation | | yes |
Short comparison of DFA and NFA
Feature | DFA | NFA |
---|---|---|
Backreference support | no | yes |
Capturing text in parentheses | no | yes |
Fast matching | yes | no |
Fast compilation | no | yes |
Lower memory usage | no | yes |
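Python's `re` module is a backtracking, regex-directed (NFA-style) engine, so two traits of that family can be observed directly. The sketch below is only an illustration with made-up input strings.

```python
import re

# Backreference: \1 must repeat exactly what group 1 captured (a doubled word).
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))

# Ordered alternation: a regex-directed engine tries the alternatives
# from left to right and stops at the first one that lets the match succeed.
first = re.search(r"to|tonight", "tonight").group(0)
print(first)
```

The second result is `to` rather than `tonight`: the engine follows the order written in the expression, whereas a text-directed engine with leftmost-longest semantics would report the longer alternative.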
Editor's note. There are many dialects of the regular expression language. The expressions in this article use Perl syntax, including features that are not available in other dialects. A similar dialect outside Perl is known as "Perl-compatible regular expressions" (PCRE).
In most text editors you can search for a word or part of a word, with or without case sensitivity. Most of the time these capabilities are enough, because the texts being processed are relatively simple: the human brain immediately evaluates each match and decides what to do with it.
Sooner or later a point comes when human effort is not enough to process all the incoming information, or when fully automated processing is needed, for example to publish news from various sources on your website. This is where automated systems, which make heavy use of regular expressions, come into play. Regular expressions can reduce the amount of code needed for text processing by a factor of tens. On first encounter they provoke a variety of reactions, but one thing is common to all of them: people are put off by how hard it is to understand what these "statements" mean. This article explains how to read an expression and understand its meaning.
Terms
Operators dictionary
This is a translation of regular expression constructs into "human" language. Later, when analyzing the examples, I will describe them in detail.
Reading and writing regular expressions
Saying the rules out loud and writing them down is the main difficulty in learning to parse text with regular expressions. Without experience, it is hard to put the rules into words. Nevertheless, this is the most efficient way to write regular expressions.
The main rule when creating regular expressions is to write them out in expanded form, on paper or on screen. This makes it easiest to see whether the text will be processed correctly. Another rule is to build the expression from the general to the specific. Following it significantly reduces both the time spent writing the expression and the number of errors.
Let us formulate the conditions for successfully developing a regular expression:
Although such a written record is cumbersome, it speeds up the development and debugging of text-parsing rules and makes them more effective.
Expression Example
Task
Extract from HTML markup the contents of certain blocks that have given attributes:
<p>Paragraph 1</p> <p class="content">Paragraph 2</p> <ul> <li>Element 1</li> <li class="content">Element 2</li> </ul>
<p class="content">Paragraph 2</p> <li class="content">Element 2</li>
Solution Sequence
People usually try to carry out all 6 steps at once, which causes serious problems when debugging. I suggest proceeding gradually, step by step.
I will give a solution right away:
#<(p|li)\s+[^>]*?class\s*=\s*(['"])content\2[^>]*>((?:(?!</\1>).)*)</\1>#is
Admittedly, it looks like pure gibberish. However, if you follow the description, you will see that it is not that difficult.
So, let's begin:
Step 1. Select all tags.
Let's write down the parsing rules in plain language:
Now that the task has been accurately described, you can start writing it as a regular expression:
We got the following expression:
<(\w+)[^>]*>([^<]*)
It has 2 drawbacks:
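To see how this expression behaves in practice, here is a minimal Python sketch on a hypothetical fragment. Two things stand out in the output: closing tags are never checked, and the captured content stops at the first nested tag. Whether these are exactly the two drawbacks meant above is an assumption.

```python
import re

# Hypothetical fragment with one nested tag (<b> inside <li>).
html = "<p>Paragraph 1</p><ul><li>Element <b>1</b></li></ul>"

# Group 1: tag name; group 2: text up to the next "<".
step1 = re.compile(r"<(\w+)[^>]*>([^<]*)")

tags = step1.findall(html)
for name, content in tags:
    print(name, repr(content))
```

The `li` entry holds only `'Element '`, because `[^<]*` cannot step over the nested `<b>` tag, and `ul` gets an empty string for the same reason.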
Step 2. Selecting paired tags
We write the parsing rules in a formal language:
Now that the task has been accurately described, you can start writing it as a regular expression:
So, we got the following expression:
<(\w+)[^>]*>((?:(?!</\1>).)*)</\1>
It captures any paired tags along with the content.
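A minimal Python sketch of this step, on a hypothetical single-line fragment: `(?:(?!</\1>).)*` consumes one character at a time for as long as the closing tag of the captured name does not start at the current position.

```python
import re

# Hypothetical fragment with nested tags.
html = "<p>Paragraph 1</p><ul><li>Element <b>1</b></li></ul>"

# Group 1: tag name; group 2: everything up to the matching closing tag.
step2 = re.compile(r"<(\w+)[^>]*>((?:(?!</\1>).)*)</\1>")

pairs = [(m.group(1), m.group(2)) for m in step2.finditer(html)]
for name, content in pairs:
    print(name, repr(content))
```

Matches do not overlap, so the `li` and `b` tags appear only as part of the `ul` content. Without the s modifier the dot does not match a newline, which is why the sample stays on one line.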
Step 3. Highlight the required tags.
Using the regular expression from the previous step, we can select several types of tags from the text at once with the "alternative sequence with no match to the left" construct. In the description below we shorten this to "alternative sequence".
Let's add the selection of the entire contents of paragraphs and list items:
Items 7 and 8 were added so that the expression does not capture tags whose names merely begin with the names of the selected tags. For example, a search for the <p> tag must not capture <param> tags.
We translate it into regular expression operators:
New regular expression:
<(p|li)(?=[\s>])[^>\w]*>((?:(?!</\1>).)*)</\1>
Now only the p and li tags and all their contents will be matched in the text.
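The effect of the `(?=[\s>])` check can be sketched in Python by comparing the expression with a hypothetical variant that omits it (called `naive` here, not part of the article). The sample fragment is made up and deliberately uses tags without attributes, because `[^>\w]*` as written does not step over word characters.

```python
import re

# Hypothetical fragment: <param> and <link> start like <p> and <li>.
html = "<param>x</param><p>Paragraph</p><li>Item</li><link>y</link>"

# Variant without the lookahead, for comparison only.
naive = re.compile(r"<(p|li)[^>]*>((?:(?!</\1>).)*)</\1>")

# Expression from this step: the tag name must be followed by whitespace or ">".
step3 = re.compile(r"<(p|li)(?=[\s>])[^>\w]*>((?:(?!</\1>).)*)</\1>")

naive_hits = naive.findall(html)
step3_hits = step3.findall(html)
print(naive_hits)
print(step3_hits)
```

The variant without the lookahead treats `<param>` as a `p` tag and swallows everything up to the first `</p>`, while the expression from this step skips `<param>` and `<link>` entirely.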
Step 4. Extracting attribute_name="value" pairs
Expression requirements
We describe the problem in a formal language:
We translate into regular expression operators:
The following regular expression is obtained:
\s+\w+\s*=\s*(['"])[^\1]*\1
Modify it so that the expression matches only the name of the 'class' attribute and its value 'content':
\s+class\s*=\s*(['"])content\1
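Both expressions of this step can be sketched in Python on a hypothetical opening tag. One substitution is made and should be read as such: Python treats `\1` inside a character class as an octal escape, not as a backreference, so the generic pattern below uses `(?:(?!\1).)*` in place of `[^\1]*`.

```python
import re

# Hypothetical opening tag mixing both quote styles.
tag = """<li id="x1" class='content' title = "a b">"""

# Generic pair: name, "=", then a value closed by the same quote that opened it.
generic = re.compile(r"""\s+\w+\s*=\s*(['"])(?:(?!\1).)*\1""")
found = [m.group(0).strip() for m in generic.finditer(tag)]
print(found)

# Specific pair: only class with the value content, in either quote style.
specific = re.compile(r"""\s+class\s*=\s*(['"])content\1""")
hit = specific.search(tag)
miss = specific.search("""<li class="content'>""")  # mismatched quotes
print(hit.group(0), miss)
```

The backreference `\1` is what forces the closing quote to be the same character as the opening one.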
Step 5. Adding a check for specific attributes to the main expression
We describe the problem in a formal language:
We translate it into regular expression operators:
Resulting expression:
<(p|li)\s+[^>]*?class\s*=\s*(['"])content\2[^>]*>((?:(?!</\1>).)*)</\1>
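A minimal Python sketch that runs this expression against the markup from the task. The point to notice is the group numbering: the tag name is group 1, so the quote character becomes group 2 (hence `\2` after `content`) and the tag contents land in group 3.

```python
import re

# Markup from the task statement.
html = ('<p>Paragraph 1</p> <p class="content">Paragraph 2</p> '
        '<ul> <li>Element 1</li> <li class="content">Element 2</li> </ul>')

step5 = re.compile(
    r"""<(p|li)\s+[^>]*?class\s*=\s*(['"])content\2[^>]*>((?:(?!</\1>).)*)</\1>"""
)

blocks = [m.group(0) for m in step5.finditer(html)]
groups = [m.groups() for m in step5.finditer(html)]
print(blocks)
print(groups)
```

The tags without the class attribute are skipped because `\s+` requires at least one whitespace character after the tag name.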
Step 6. Adding modifiers and delimiters
Delimiters
In almost every language that supports regular expressions, you can choose the expression delimiters. The most common are /.../ and #...#. In principle, almost any pair of characters will do, provided the interpreter supports it. When choosing delimiters, look at which characters occur in the regular expression and pick ones that do not. Otherwise you will have to escape those characters, which makes the expression harder to read. In our case the standard /.../ is not suitable because / occurs inside the expression, so I suggest using #...# as delimiters.
Modifiers
For information on all the search modifiers, I recommend consulting reference material such as the PHP and Perl documentation. Here we use i (case-insensitive search) and s (the "." character also matches newlines).
Final expression
#<(p|li)\s+[^>]*?class\s*=\s*(['"])content\2[^>]*>((?:(?!</\1>).)*)</\1>#is
As you can see, this problem is solved quite simply. I chose it for this article because forum users often ask "how do I select the content of a particular tag" and "how do I parse HTML markup". The solution is in front of you.
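Python has no pattern delimiters, so in a Python sketch the `#...#is` wrapper turns into flags passed to the `re` functions: `re.IGNORECASE` for i and `re.DOTALL` for s. The sample markup is hypothetical and chosen so that both modifiers matter.

```python
import re

pattern = r"""<(p|li)\s+[^>]*?class\s*=\s*(['"])content\2[^>]*>((?:(?!</\1>).)*)</\1>"""

# Hypothetical markup: upper-case tag and attribute names, content spanning two lines.
html = '<P CLASS="content">line one\nline two</P>'

no_flags = re.search(pattern, html)                         # fails on upper case
only_i = re.search(pattern, html, re.IGNORECASE)            # "." stops at the newline
both = re.search(pattern, html, re.IGNORECASE | re.DOTALL)  # equivalent of #...#is

print(no_flags, only_i)
print(repr(both.group(3)))
```

Only the combination of both flags finds the block, which is the reason the final expression carries both modifiers.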
useful links
Definitions of special characters for regular expressions
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
More special characters
\t tab (HT, TAB)
\n newline (LF, NL)
\r return (CR)
\f form feed (FF)
\a alarm (bell) (BEL)
\e escape (think troff) (ESC)
\033 octal character (think of a PDP-11)
\x1B hexadecimal character
\c[ control character
\l lowercase next character (think vi)
\u uppercase next character (think vi)
\L lowercase until \E (think vi)
\U uppercase until \E (think vi)
\E end case modification (think vi)
\Q quote (disable) pattern metacharacters until \E
Even more special characters
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
\s Match a whitespace character
\S Match a non-whitespace character
\d Match a digit character
\D Match a non-digit character
\b Match a word boundary
\B Match a non-(word boundary)
\A Match only at the beginning of the string
\Z Match only at the end of the string, or before a newline at the end
\z Match only at the end of the string
\G Match only where the previous m//g left off (works only with /g)