Regular Expressions
Regular expressions (regex, or regexp) are a powerful text/string searching tool that the internet often describes as “wildcards on steroids”. They provide a framework for flexible string searching, pattern-matching, and replacement.
Regular expressions are most useful in the following situations:
- Reorganizing contents of a text file
- Parsing a text file (with Python’s re module for regular expressions!)
- Obtaining information from the command line about files or file contents
Importantly, regular expressions often differ somewhat between programming languages (e.g. Python, Bash, Perl), but the basic elements are usually the same. Here are some useful websites for practicing and testing regular expressions (be sure the website is set to the language you are using!):
- General: https://regex101.com/
- Python-specific: http://pythex.org/
This post discusses Perl-style regular expressions. If you find that your regular expression isn’t working in a certain program, it’s probably because the program doesn’t accept this flavor of regex (instead, it probably uses POSIX-style regular expressions, which simply have a different syntax).
What are regular expressions?
In their simplest form, regular expressions are simply strings for searching text: any Ctrl+F search you’ve ever done could be a regular expression. Where regular expressions get exciting is in their wildcards and pattern-matching capabilities. Today’s cheatsheet provides a basic set of regular expression wildcards and other useful nuggets, specifically for Perl-style regular expressions.
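To make the difference between a literal search and a wildcard search concrete, here is a minimal sketch using Python's re module. The sample sentence is invented for illustration.

```python
import re

# Invented sample text, used only for illustration.
text = "specimen1 and specimen22 were sequenced; specimen was not."

# A literal pattern behaves like a plain Ctrl+F search.
literal_hits = re.findall(r"specimen", text)

# \d+ is a wildcard meaning "one or more digits".
wildcard_hits = re.findall(r"specimen\d+", text)

print(literal_hits)   # every occurrence of the literal word
print(wildcard_hits)  # only occurrences followed by digits
```

The literal pattern finds the word wherever it appears, while the wildcard version only matches when at least one digit follows it.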
Example: Renaming files
One common task is to rename lots of files, in a similar way. For example, assume you have a directory full of files like this:
These file names all follow a common pattern: specimen<numbers>_CAPLETTERS_othernumbers.txt. Let’s say we want to rename all of these files to have a new name with this pattern: specimen<numbers>_othernumbers.txt. Essentially, we want to remove the middle chunk of capital letters from all file names. This is easily accomplished with regular expressions!
The simplest way to do this task is with a text editor that can handle regular expressions, like TextWrangler for Mac or Kate for Linux. Several command line tools, notably sed and awk, can also be used, but we won’t cover them here.
The basic strategy for this task is to:
- Redirect all file names to a file for editing: ls *.txt > rename.txt
- Open the editing file (rename.txt) and perform a regex search/replace to obtain mv commands for renaming
- Execute that file (sh rename.txt) to rename files
One example of a regex search/replace for this system is the following:
- Search:
(specimen\d+)(_\w+_)(\d+\.txt)
- Replace:
mv \1\2\3 \1_\3
These commands result in this example replacement:
- Original:
specimen1_RMFIPBW_39640.txt
- New:
mv specimen1_RMFIPBW_39640.txt specimen1_39640.txt
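The same search/replace can also be tried out in Python before touching any real files. The sketch below is one possible way to do it, not the only one. It only builds the mv commands as strings and renames nothing. Only the first file name comes from the example above. The other two names and the helper to_mv_command are made up for illustration. The replacement uses \3 for the final group so that the output matches the example replacement, and the dot before txt is escaped so that it matches a literal period.

```python
import re

# Only the first name comes from the example; the other two are hypothetical.
filenames = [
    "specimen1_RMFIPBW_39640.txt",
    "specimen27_QZLK_5.txt",
    "specimen305_ABCDEFG_1234.txt",
]

# Three capture groups: the specimen prefix, the middle chunk, the trailing number.
pattern = re.compile(r"(specimen\d+)(_\w+_)(\d+\.txt)")


def to_mv_command(name):
    """Return an mv command that drops the middle chunk from name (hypothetical helper)."""
    # \1\2\3 rebuilds the old name; \1_\3 is the new name without group 2.
    return pattern.sub(r"mv \1\2\3 \1_\3", name)


commands = [to_mv_command(name) for name in filenames]
for command in commands:
    print(command)
```

Printing the commands first makes it easy to check them by eye before running anything that actually renames files.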
Remember to be flexible here. For example, the number of digits after the word “specimen” differs between file names, so we shouldn’t enforce, say, exactly 2 digits. We should capture however many digits are there (hence \d+ instead of \d\d, which matches exactly 2 digits).
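A quick way to see why the flexible version matters is to test both patterns against names with different digit counts. This is a small sketch in Python. The second file name is hypothetical.

```python
import re

# The second name is hypothetical, added to show a two-digit specimen number.
names = ["specimen1_RMFIPBW_39640.txt", "specimen27_QZLK_5.txt"]

flexible = re.compile(r"specimen\d+_")   # one or more digits
rigid = re.compile(r"specimen\d\d_")     # exactly two digits

# match() anchors at the start of each name.
flexible_matches = [bool(flexible.match(name)) for name in names]
rigid_matches = [bool(rigid.match(name)) for name in names]

print(flexible_matches)
print(rigid_matches)
```

The rigid pattern misses the single-digit specimen, while the flexible pattern matches both names.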
Python re module
Python comes with a module called re (regular expressions) for working with, you guessed it, regular expressions. This module is extremely useful for parsing and/or searching within text files. The module is documented extensively in the official Python documentation.
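As a brief orientation, here is a minimal sketch of three commonly used functions from the re module: re.search, re.findall, and re.sub. It reuses the file name from the renaming example as its input. The variable names are chosen for illustration only.

```python
import re

line = "specimen1_RMFIPBW_39640.txt"

# re.search returns a match object (or None); groups() gives the captured pieces.
match = re.search(r"specimen(\d+)_([A-Z]+)_(\d+)\.txt", line)
if match:
    specimen_id, letters, number = match.groups()
else:
    specimen_id = letters = number = None

# re.findall returns every non-overlapping match as a list of strings.
digit_runs = re.findall(r"\d+", line)

# re.sub replaces each match with the given replacement string.
shortened = re.sub(r"_[A-Z]+_", "_", line)

print(specimen_id, letters, number)
print(digit_runs)
print(shortened)
```

Checking for None before calling groups() matters when parsing real files, since not every line will match the pattern.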
In today’s files, you will find the directory hyphy_output/, which contains files for parsing and the two parsing scripts parse_hyphy_output.py and parse_hyphy_output_extended.py. These scripts parse the (poorly-formatted) files for particular piece(s) of information (with the latter script extracting more information). See the scripts themselves for details!