Skip to content

Latest commit

 

History

History
389 lines (288 loc) · 14.6 KB

09_re.md

File metadata and controls

389 lines (288 loc) · 14.6 KB

Regular Expressions

At the end of the previous section on Emacs, you learned how to tags-query-replace to search and replace across many files at once. These searches consist of regular expressions, a mini pattern-matching language distinct from and more powerful than shell wildcards. For those who have read prior sections on Unix, Emacs's regexps are implemented a bit differently than in the grep and sed commands. For the benefit those Windows users who have only read the section on Emacs, this section reviews some of the same terminology. Emacs regexps also vary slightly from those in major programming languages, but here is where you'll likely get the most use out of them when routinely editing files. This section shows you how to apply regular expressions to a bunch of useful editing scenarios. By the end, you will be able to build a regexp that matches passive-voice constructions.

  1. For this unit, specify all the markdown source files for global search and replace, then open any file:

    $ etags *.md ; emacs TAGS
    

    Run the same tags-query-replace command you learned about in the previous unit. You'll be doing that repeatedly for all these searches. You can make the changes or not, or decide not to save the changes after you make them.

  2. Search for DELETE and change to __DELETE__. (In markdown, the double underscores format the text as bold.)

  3. Search for TAB and change to __TAB__. These are simple searches for literal text.

    • literal
  4. Suppose you have a lot of different text in ALL CAPS that you want to format as bold. One way to do it is to search for the following typical regular expression:

    \b\([A-Z][A-Z]+\)\b
    

    Then change it to:

    __\1__
    
  5. Pick apart how the regexp works:

    • The \b at the start and end marks a word boundary. The boundary itself doesn't match any text, and also referred to as zero-width or as a positioner. This example prevents matches within words, such as XMLHttpRequest or JSONPath.

    • The \( and \) captures the text, so that you can use \1 to insert it back in the replacement. This is also often called a group or a trap.

    • The [A-Z] defines a custom character class, in this case a single character that's a range of possible matches: any uppercase letter.

    • The + is a quantifier that means one or more of whatever comes before, in this case another [A-Z] character class.

    In practice, the two character classes in a row followed by the + means two or more uppercase characters, which ignores common matches such as A and I in text.

    • character class
    • quantifier
    • capture, group, trap
    • boundary, positioner, zero-width
  6. That search casts a wide net, and may yield far too many ALL CAPS matches to be useful. If you have a narrower set of things you want to change at once, you can search for something like this:

    \b\(TAB\|DELETE\|RETURN\|SHIFT\)\b
    

    Within the \( ... \) group, the \| serve as alternators, so that each sequence of text may match separately.

    Note that parentheses and bar characters match literally by themselves, but get this special grouping behavior when preceded by a backslash. This is the opposite of how it works for most other metacharacters in Emacs. For example, [ by itself opens a custom character class, while \[ matches the literal [ character.

    • alternators
    • sequence
  7. The [A-Z] above is a custom character class, but often you use built-in classes, the most important of which is the dot (.) metacharacter, which stands for any character within a line, anything except for a linefeed character. (A metacharacter is one that has a special meaning within a regular expression, the opposite of a literal.)

    Suppose you have some quoted text such as "this" and "that" that should convert to inline monospace formatting. Search for this regexp:

    "\(.+\)"
    

    Replace with:

    `\1`
    
    • custom character class
    • metacharacter
  8. There's a problem with this search. What if "this" and "that" both appear on the same line, as they actually did above? By default, the + quantifier after the . matches as many characters as possible, so "this" and "that" match rather than "this" and "that" separately. This is called a greedy match. Adding a ? after the quantifier makes it a lazy match:

    "\(.+?\)"
    

    That translates to: one or more non-linefeed characters, up to the very first double-quote character the match encounters. Lazy matches need to be followed by something that is certain to match, or nothing will.

    • greedy match
    • lazy match
  9. What if the text to match falls on separate lines?

    Blah blah blah blah blah blah blah blah blah blah blah "this
    text" blah blah blah blah blah blah blah blah blah blah "that
    text" blah blah.
    

    The dot (.) matches any character except the one that separates lines, so you need to use a different approach to match the text. Find out if there's some character that doesn't appear in your text, and define a character class that matches anything that's not that character. For example, tab characters are typically excluded from text. Within the search, define a custom class ([]), add a caret character to reverse it ([^]), followed by the tab character itself. Follow the class with the lazy quantifier. Together, it looks like this:

    "\([^	]+?\)"
    
  10. What if there's punctuation within the quotes, but you want to place it outside the font formatting specified in the replacement? Maybe you're searching on "this," "that," and "the other." Add another group to the previous search to specify punctuation characters such as commas and periods:

    "\([^	]+?\)\([,.]?\)"
    

    What does this do? The [,.] specifies the punctuation, as a custom class of discrete characters rather than a range like you saw before. Note that within the custom class, the dot . character is treated as a literal. The ? is quantifier that means zero or one, so that the match is optional. The lazy quantifier in the first group matches only up until the optional match in the second group, so the regexp matches cases with or without punctuation such as "this" and "that." This is the replacement, where the contents of the second group \2 may consist of empty text when no punctuation is present:

    `\1`\2
    
  11. You can cast a wider net to match any other punctuation that may appear within the quotes. Note that like the dot, the question mark is treated literally within the character class:

    "\([^	]+?\)\([,.!?:;]?\)"
    
  12. There's a less ugly way to do it. Use the \s. built-in class, which stands for any punctuation character, broadly defined to include many characters that don't often appear in prose. This is the alternate regexp:

    "\([^	]+?\)\(\s.?\)"
    
  13. There's even more built-in syntax available for Posix classes, which follow a named [:class:] format:

    "\([^	]+?\)\([:punct:]?\)"
    

    Don't mistake these for custom classes. You can actually specify them within classes, like the following that matches any non-whitespace, non-punctuation character:

    [^[:space:][:punct:]]
    
  14. There's another quantifier available, *, which stands for zero or more. Suppose these three words appear in text: prince, princes, and princess.

    Searching for princes? (zero or one) matches prince or princes.

    Searching for princes+ (one or more) matches princes or princess.

    Searching for princes* (zero or more) matches all three.

  15. This regexp shows a couple of other built-in classes, along with a useful technique you can use to find repeated words:

    \b\(\w+\)\s-+\1\b
    

    The \w stands for any alphanumeric word character, and the \s- means any whitespace character. This regexp captures any word that's followed by whitespace, followed by the same word again, all bounded with \b positioners to keep from matching text such as an angle.

    You typically use \1 in replacements, but you can also specify it in searches to refer to text you've already trapped.

    You can also use the \sw code to refer to word characters. You can include any built-in code that starts with \s within custom classes. For example, [^\s.\s-] matches characters that are neither punctuation nor whitespace.

  16. Here are a few other zero-width positioners that come in handy. The ^ character stands for start of line. Replacing this with a dash - allows you to change all the numbered lists here to bullet lists:

    ^[0-9]+\.
    

    Note the backslash before the dot character. You're matching the number followed by a literal dot (such as 1.), so you have to escape the metacharacter to interpret it literally. Otherwise it would match any character. You'd have to treat other metacharacters literally within searches, such as \[ for an open square brace. You'd have to escape a literal backslash \\ in both searches and replacements.

    • escape
  17. The $ character stands for end of line. When editing code, it's often considered bad form to allow extra spaces at the end of lines. Searching for a space character followed by +$ allows you to remove them all.

  18. These positioning characters get disabled when they don't appear contextually at the start or end of the match. That means $\w+ matches certain $variable names without any fuss.

  19. The \` code stands for the start of the file (or editing buffer). Search for this and replace with nothing to remove extra space at the start of the file:

    \`\s-+
    
  20. The \' code stands for the end of the file. What if you want to make sure there's a carriage return at the end of the file, which is also considered good form when editing code? Search for this:

    \(.\)\'
    

    But you can't type a carriage return in the replacement without it being interpreted as responding to Emacs's minibuffer prompt mode. The following two steps offer two different ways to do it, first the more involved way, then the simpler way.

  21. You can copy (kill) the text, and paste (yank) it into the replacment. Go to the end of any line (CTRL-e), kill the subsequent linefeed, (CTRL-k), and yank it back into the text (CTRL-y). Then later when specifying replacement text, type CTRL-y to yank it again, this time into the minibuffer.

  22. The simpler way to insert a literal linefeed character anywhere within Emacs is to type the C-q C-j sequence. Searching for a sequence of three of those followed by a + plus sign, then replacing with a sequence of two, makes sure there's only one blank line in a row.

  23. The next example shows how to search for many passive-voice constructions. While useful, it's sufficiently complex (or at least long) that if you use it you're likely to save it in a file and copy and paste it when searching:

    \b\(were\|was\|is\|are\|be\|ha[sd]\s-+been\)\(\s-+\w+\)?\s-+\w+e[dn]\b
    

    It consists of two grouped sequences followed by an ungrouped sequence. The first group specifies the word were, was, is, are, be, or the phrases has been or had been.

    The second group specifies an optional word preceded by whitespace. Note the ? after the group. Earlier you saw this quantifier follow a character class to indicate an optional punctuation character, but here it means the entire grouped sequence of characters is optional.

    The final sequence specifies whitespace, then any word ending either ed or en.

    With the optional second group, the regexp matches both is recommended and is strongly recommended.

  24. What if you want to match the entire passive-voice regular expression above and mark it with text such as FIX=> and <=FIX? Simply surround the entire expression with \( and \) and invoke \1 in the replacement. A group may contain nested groups.

  25. One final feature, then you're done. Suppose you want to lowercase some HTML anchor links:

    #camelCase
    #mixOfUpperAndLower
    

    Search for \(#\w+\) and replace with:

    \,(downcase \1)
    

    The \, marks what follows in the parentheses as a Lisp expression to evaluate in the replacement text. You can do the same to \,(capitalize \1) the match. The following search and replace applies title case throughout markdown headings:

    ^\(#+ *\)(.+)
    \1\,(capitalize \2)
    

    You can also \,(upcase \2) to convert text to ALL CAPS.

Reference

Position adjacent expression:

    ^              ~ Start of line
    $              ~ End of line
    \b             ~ Word boundary
    \B             ~ Not a word boundary
    \<             ~ Start of word
    \>             ~ End of word
    \`             ~ Start of buffer
    \'             ~ End of buffer

Quantify previous expression:

    ?              ~ Zero or one
    *              ~ Zero or more
    *?             ~ Zero or more, lazy
    +              ~ One or more
    +?             ~ One or more, lazy
    \{N\}          ~ Exactly N matches

Group sequences:

    \(  \)         ~ Capture text
    \1, \2 ... \9  ~ Retrieve captured text
    \|             ~ Mark alternative within group
    \(?: \)        ~ Group uncaptured text

Custom character classes:

    [aeiou]        ~ Match custom character set
    [^aeiou]       ~ Match characters not in set
    [a-z]          ~ Match custom character range
    [^a-z]         ~ Match characters not in range
    [-'_a-zA-Z0-9] ~ Match both set and ranges

Built-in character classes:

    .              ~ Match any character except newline
    \w             ~ Match word element
    \W             ~ Match non-word element
    \sw            ~ Match word element: alternate, works within custom []
    \s-            ~ Match whitespace
    \s.            ~ Match punctuation
    \s_            ~ Match symbol

Named Posix classes:

    [:digit:]
    [:alpha:]
    [:alnum:]
    [:upper:]
    [:lower:]
    [:space:]
    [:print:]
    [:punct:]

Escape metacharacters

    \. \[ \^ \$ \\, etc.