Regular Expressions


Regular expressions (regex, or regexp) are a powerful text/string searching tool that the internet often describes as “wildcards on steroids”. They provide a framework for flexible string searching, pattern-matching, and replacement.

Regular expressions are most useful in the following situations:

Importantly, regular expressions often differ somewhat between programming languages (e.g. Python, Bash, Perl), but the basic elements are usually the same. Here are some useful websites for practicing and testing regular expressions (be sure the website is set to the language you are using!):

This post discusses Perl-style regular expressions. If you find that your regular expression isn’t working in a certain program, it’s probably because the program doesn’t accept this flavor of regex. Instead, it probably uses POSIX-style regular expressions, which simply have a different syntax.

What are regular expressions?

In their simplest form, regular expressions are simply strings for searching text; any Ctrl+F search you’ve ever done could be a regular expression. Where regular expressions get exciting is in their wildcards and pattern-matching capabilities. Today’s cheatsheet provides a basic set of regular expression wildcards and other useful nuggets, specifically for Perl-style regular expressions.
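To make the difference between a plain search and a wildcard search concrete, here is a minimal sketch using Python's built-in re module. The sample sentence is made up for illustration; only the two patterns matter.

```python
import re

# Made-up sample text, for illustration only.
text = "specimen10 was collected before specimen3 and specimen42."

# A literal pattern behaves like an ordinary Ctrl+F search.
literal_hits = re.findall(r"specimen", text)

# \d+ is a wildcard-style pattern meaning "one or more digits".
numbered_hits = re.findall(r"specimen\d+", text)

print(literal_hits)   # ['specimen', 'specimen', 'specimen']
print(numbered_hits)  # ['specimen10', 'specimen3', 'specimen42']
```

The literal search finds the same word three times, while the pattern with \d+ also picks up whatever digits follow it, however many there are.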

Example: Renaming files

One common task is to rename lots of files in a similar way. For example, assume you have a directory full of files like this:

Stephanies-MacBook-Pro:files sjspielman$ ls
specimen10_GHAUFKM_04363.txt specimen33_YADXQVE_02397.txt
specimen11_NPIOLVY_86041.txt specimen34_JDCUSVN_02478.txt
specimen12_OKDJUNG_78425.txt specimen35_DNTRAFS_65102.txt
specimen13_HBXIRNT_39720.txt specimen36_JHINLWM_17330.txt
specimen14_ZNGVOMY_63214.txt specimen37_WITHCZJ_62140.txt
specimen15_MDKCJQR_10489.txt specimen38_KFXRPHQ_33157.txt
specimen16_JTQVOIZ_38621.txt specimen39_ASBCJHX_30563.txt
specimen17_ACGJPLB_02384.txt specimen3_HINOJDW_64209.txt
specimen18_GZXBJAL_63579.txt specimen40_SFROJIC_09235.txt
specimen19_EVLWRBI_58372.txt specimen41_XQISFKH_18073.txt
specimen1_RMFIPBW_39640.txt  specimen42_DRGNXYK_39143.txt

These file names all follow a common pattern: specimen<numbers>_CAPLETTERS_othernumbers.txt. Let’s say we want to rename all of these files to have a new name with this pattern: specimen<numbers>_othernumbers.txt. Essentially, we want to remove the middle chunk of capital letters from all file names. This is easily accomplished with regular expressions!

The simplest way to do this task is to use a text editor that can handle regular expressions, like TextWrangler for Mac or Kate for Linux. There are several command line tools, notably sed and awk, which can also be used, but we won’t cover them here.

The basic strategy for this task is to:

  1. Redirect all file names to a file for editing: ls *.txt > rename.txt.
  2. Open the editing file (rename.txt) and perform a regex search/replace to turn each file name into an mv command for renaming.
  3. Execute that file (sh rename.txt) to rename the files.

One example of a regex search/replace for this system is the following:

These commands result in this example replacement:

Remember to be extra flexible here. For example, the number of digits after the word “specimen” differs between file names, so we shouldn’t enforce, say, 2 digits only. We should capture however many digits might be there (hence \d+, instead of \d\d for exactly 2 digits).
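As one possible illustration of this kind of search/replace, here is a minimal Python sketch that turns file names into mv commands. The pattern and the helper name make_mv_command are assumptions made for this example, not the exact commands from the original post. In a text editor, the same idea would be a search pattern with two captured groups and a replacement that reuses them; the backreference syntax for that depends on the editor.

```python
import re

# Assumed pattern for names like specimen10_GHAUFKM_04363.txt.
# Group 1: "specimen" plus any number of digits (\d+, not \d\d).
# Group 2: the trailing block of digits before ".txt".
PATTERN = re.compile(r"^(specimen\d+)_[A-Z]+_(\d+)\.txt$")

def make_mv_command(filename):
    """Return an mv command that drops the capital-letter chunk,
    or None if the file name does not fit the pattern."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    new_name = f"{match.group(1)}_{match.group(2)}.txt"
    return f"mv {filename} {new_name}"

# Two names taken from the listing above, plus one that should be skipped.
names = [
    "specimen10_GHAUFKM_04363.txt",
    "specimen3_HINOJDW_64209.txt",
    "notes.txt",
]

for name in names:
    command = make_mv_command(name)
    if command is not None:
        print(command)
```

This sketch only prints the commands, which mirrors step 2 of the strategy: the output can be inspected before anything is actually renamed. Because of \d+, both specimen10 and specimen3 are handled by the same pattern.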

Python re module

Python comes with a module called re (regular expressions) for working with, you guessed it, regular expressions. This module is extremely useful for parsing and/or searching within text files. Extensive documentation for this module can be found in the official Python documentation.
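For a feel of how the module is typically used, here is a small sketch of two common calls, re.search and re.sub. The input line is hypothetical and is not taken from any real output file.

```python
import re

# Hypothetical line of text, invented for this example.
line = "site 12: score = 0.87"

# re.search scans the string for the first match; parentheses capture groups.
match = re.search(r"site (\d+): score = ([\d.]+)", line)
if match:
    site = int(match.group(1))     # first captured group, as an integer
    score = float(match.group(2))  # second captured group, as a float
    print(site, score)

# re.sub replaces every match; here, runs of whitespace become one space.
cleaned = re.sub(r"\s+", " ", "too    many   spaces")
print(cleaned)
```

Captured groups come back as strings, so converting them with int() or float() is usually the next step when parsing numbers out of a file.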

In today’s files, you will find the directory hyphy_output/, which contains files for parsing, and the two parsing scripts parse_hyphy_output.py and parse_hyphy_output_extended.py. These scripts parse the (poorly formatted) files for particular piece(s) of information, with the latter script extracting more complex information. See the scripts themselves for details!
