A presentation at Future of Web Apps London in London, UK by Drew McLellan
Regular Expressions
For the fearful.
Hello!
flickr.com/photos/85520404@N03/9535499657
Created by b mijnlieff from the Noun Project
Created by Christy Presler from the Noun Project
Created by Yi Chen from the Noun Project
Humans are great at matching patterns.
RegExp are great at matching patterns.
RegExp Humans
Finding text that matches patterns. (On the slide, these words are picked out of a block of lorem ipsum filler text.)
Regular Expressions Server rewrite rules. Form validation. Text editor search & replace. Application code.
Flavours POSIX basic & extended. Perl and Perl-compatible (PCRE). Most common implementations are Perl-like (PHP, JavaScript and HTML5, mod_rewrite, nginx)
In this exciting episode Basic syntax. Matching. Repeating. Grouping. Replacing.
But first… A regular expression tester is a great way to try things out. There’s an excellent online tester at: regex101.com
RegExp Basics
Basics / regex goes here /
/ regex goes here / modifiers / [A-Z]\w[A-Z] / i Delimiters are usually slashes by default. Some engines allow you to use other delimiters. Modifiers include things like case sensitivity.
Basics / this\/that / Delimiters and other special characters need to be escaped with backslashes.
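As a rough illustration of delimiters, modifiers and escaping, here is a minimal JavaScript sketch. The sample strings are made up for the example, and the `RegExp` constructor is shown only as one alternative way to avoid escaping the delimiter.

```javascript
// A regex literal sits between slash delimiters; modifiers follow the closing slash.
const threeChars = /[A-Z]\w[A-Z]/i; // "i" makes the match case-insensitive
const caseInsensitive = threeChars.test("abc"); // lowercase input still matches

// A literal slash inside a regex literal has to be escaped with a backslash.
const escapedSlash = /this\/that/;
const matchesSlash = escapedSlash.test("either this/that or the other");

// The RegExp constructor takes a string instead, so there is no delimiter to escape.
const fromString = new RegExp("this/that", "i");
const matchesFromString = fromString.test("THIS/THAT");

console.log(caseInsensitive, matchesSlash, matchesFromString);
```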
Basics / \w\s\d /
Matching
Words \w (lowercase W) / \w / Hello, world, 1234. Matches a single alphanumeric character, including underscore. Here only the first character, H, matches.
Global modifier The ‘g’ global modifier returns all matches. Doesn’t stop at the first match.
Words \w (lowercase W) / \w / g Hello, world, 1234. Matches an alphanumeric character, including underscore. With the g modifier, every letter and digit matches; the commas, spaces and full stop do not.
Digits \d / \d / Hello, world, 1234. Only the 1 matches.
/ \d / g Hello, world, 1234. Each of 1, 2, 3 and 4 matches. Matches single digits 0-9.
Spaces \s / \s / Hello, world, 1234.
/ \s / g Hello, world, 1234. Matches a single whitespace character. Includes spaces, tabs, new lines.
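The shorthand tokens and the global modifier can be tried out with a short JavaScript sketch like the one below. It uses the slides' sample string; the variable names are only for illustration.

```javascript
const text = "Hello, world, 1234.";

// Without "g", match() stops at the first match.
const firstWordChar = text.match(/\w/)[0];
const firstDigit = text.match(/\d/)[0];

// With "g", match() returns every match as an array of strings.
const allWordChars = text.match(/\w/g);
const allDigits = text.match(/\d/g);
const allSpaces = text.match(/\s/g);

console.log(firstWordChar, firstDigit);
console.log(allWordChars.join(""), allDigits.join(""), allSpaces.length);
```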
Character classes These are all shorthand character classes. Character classes match one character, but offer a set of acceptable possibilities for the match. The tokens we’ve looked at are shorthand for more complex character classes.
Words \w [A-Za-z0-9_] Character classes match one character only. They can use ranges like A-Z. They are denoted by [square brackets].
Digits \d [0-9]
Spaces \s [ \r\n\t\f] A literal space, plus: \r Carriage return, \n New line, \t Tab, \f Form feed.
Custom classes [ol3] / [ol3] /g Hello, world, 1234. Matches each l, o and 3. [a-z0-9-] / [a-z0-9-] /g /2009/nice-title Matches every character except the slashes.
Negative classes [^ol3] / [^ol3] /g Hello, world, 1234. Matches every character except l, o and 3. Use a caret to indicate the class should match none of the given characters. [^a-z0-9-] / [^a-z0-9-] /g /2009/nice-title Matches only the slashes.
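A small JavaScript sketch of custom and negated classes, using the sample strings from the slides. Joining the matches into a string is just a convenient way to see which characters were picked.

```javascript
const greeting = "Hello, world, 1234.";
const path = "/2009/nice-title";

// A custom class matches any one of the listed characters.
const picked = greeting.match(/[ol3]/g).join("");

// A leading caret negates the class: match anything NOT listed.
const notPicked = greeting.match(/[^ol3]/g).join("");

// Ranges and a literal hyphen (placed last) can be combined in one class.
const slugChars = path.match(/[a-z0-9-]/g).join("");
const nonSlugChars = path.match(/[^a-z0-9-]/g);

console.log(picked, notPicked, slugChars, nonSlugChars);
```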
Dot A dot (period) matches any character other than a line break. It’s often over-used. Try to use something more specific if possible.
Dot / . /g Hello, world, 1234. Matches any character other than a line break.
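To see both points about the dot, here is a hedged JavaScript sketch. The time-like strings are hypothetical examples chosen to show how a dot can accept more than intended.

```javascript
// The dot skips line breaks: "\n" is not among the matches.
const twoLines = "ab\ncd";
const dotMatches = twoLines.match(/./g);
const crossesLineBreak = /b.c/.test(twoLines);

// Over-use: a dot where a colon was meant also accepts unrelated input.
const loose = /\d\d.\d\d/; // intended for times like 12:30
const strict = /\d\d:\d\d/; // says what it means
const looseAcceptsDigits = loose.test("12345");
const strictAcceptsDigits = strict.test("12345");
const strictAcceptsTime = strict.test("12:30");

console.log(dotMatches, crossesLineBreak);
console.log(looseAcceptsDigits, strictAcceptsDigits, strictAcceptsTime);
```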
!false Developer joke time.
So where does this get us?
Matching Hello world (1980-02-21). / \d\d\d\d-\d\d-\d\d / Hello world (1980-02-21). So that’s something, right?
Repetition
Repetition Matching single characters gets old fast. There are four main operators or ‘quantifiers’ for specifying repetition.
Repetition ? Match zero or once. + Match once or more. * Match zero or more. {x} Match x times. {x,y} Match between x and y times.
Repetition / \d\d\d\d-\d\d-\d\d / / \d{4}-\d{2}-\d{2} / / [a-z0-9-]+ /g /2009/nice-title
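The quantifiers can be exercised with a sketch along these lines. The date and slug patterns come from the slides; the `colou?r`, `ab*c` and `\d{2,4}` patterns are hypothetical examples added to cover the remaining operators, and the `^` and `$` used in them simply pin the test to the whole string.

```javascript
// {x}: an exact count replaces the repeated \d tokens.
const date = "Hello world (1980-02-21).".match(/\d{4}-\d{2}-\d{2}/)[0];

// +: one or more characters from the class, so whole runs match at once.
const slugParts = "/2009/nice-title".match(/[a-z0-9-]+/g);

// ?: zero or one "u".
const optionalU = ["color", "colour"].every((w) => /^colou?r$/.test(w));

// *: zero or more "b".
const zeroOrMore = ["ac", "abc", "abbbc"].every((w) => /^ab*c$/.test(w));

// {x,y}: between two and four digits.
const twoToFour = ["1", "12", "1234", "12345"].map((w) => /^\d{2,4}$/.test(w));

console.log(date, slugParts, optionalU, zeroOrMore, twoToFour);
```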
Greediness Repetition quantifiers are ‘greedy’ by default. They’ll try to match as many times as possible, within their scope. Sometimes that’s not quite what we want, and we can change this behaviour to make them ‘lazy’.
Greediness / <.+> / This <em>is</em> some HTML. EXPECTED match: <em>. ACTUAL match: <em>is</em>. Repetition quantifiers try to match as many times as they’re allowed to.
Greediness / <.+?> / This <em>is</em> some HTML. Now the match is just <em>. Quantifiers can be made ‘lazy’ with a question mark.
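The difference between greedy and lazy matching can be checked with a minimal JavaScript sketch. The patterns `<.+>` and `<.+?>` are an assumption about what the slides demonstrate, based on the expected and actual results they describe.

```javascript
const html = "This <em>is</em> some HTML.";

// Greedy: .+ grabs as much as it can, so the match runs to the LAST ">".
const greedy = html.match(/<.+>/)[0];

// Lazy: .+? grabs as little as it can, so the match stops at the FIRST ">".
const lazy = html.match(/<.+?>/)[0];

// With "g", the lazy pattern finds each tag separately.
const allTags = html.match(/<.+?>/g);

console.log(greedy, lazy, allTags);
```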
Anchors
Anchors Anchors don’t match characters, but the position within the string. There are three main anchors in common use.
Anchors ^ The beginning of the string. $ The end of the string. \b A word boundary.
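Anchors / ^Hello /g Hello, Hello (only the first Hello matches) / Hello$ /g Hello, Hello (only the last Hello matches) Anchors find matches based on position.
Anchors / cat /g cat concatenation (matches twice, including inside concatenation) / \bcat\b /g cat concatenation (matches the standalone word only) Word boundaries are useful for avoiding accidental sub-matches.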
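One way to confirm where anchored patterns match is to list the match positions. The helper `indexesOf` below is a hypothetical convenience function written for this sketch, built on the standard `String.prototype.matchAll`.

```javascript
// Hypothetical helper: the start index of every match of a global pattern.
const indexesOf = (pattern, s) => [...s.matchAll(pattern)].map((m) => m.index);

// ^ and $ pin the match to the start or end of the string.
const startOnly = indexesOf(/^Hello/g, "Hello, Hello");
const endOnly = indexesOf(/Hello$/g, "Hello, Hello");

// Without \b, "cat" is also found inside "concatenation".
const anyCat = indexesOf(/cat/g, "cat concatenation");
const wholeWordCat = indexesOf(/\bcat\b/g, "cat concatenation");

console.log(startOnly, endOnly, anyCat, wholeWordCat);
```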
[‘hip’, ‘hip’] Developer joke time.
Grouping
Grouping Parts of a pattern can be grouped together with (parentheses). This enables repetition to be applied on the group, and enables us to control how the result is ‘captured’.
? /
? )+ / [ ‘abc123-’, ‘def456-’, ‘ghi789’ ] Round brackets enable us to create groups that can then be repeated.
? )+ / Groups are captured by default. If you don’t need the group to be captured, make it non-capturing.
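The following JavaScript sketch illustrates repeated groups and the difference between capturing and non-capturing groups. The pattern `[a-z]{3}\d{3}-?` and the input string are assumptions chosen to fit the result array shown above; they are not necessarily the pattern used on the slide.

```javascript
const codes = "abc123-def456-ghi789";

// The ungrouped chunk pattern with "g" finds each chunk separately.
const chunks = codes.match(/[a-z]{3}\d{3}-?/g);

// Grouping lets + repeat the whole chunk, so one match spans the full string.
// A repeated capturing group keeps only its LAST iteration.
const repeated = codes.match(/([a-z]{3}\d{3}-?)+/);

// (?: ... ) groups for repetition without capturing anything.
const nonCapturing = codes.match(/(?:[a-z]{3}\d{3}-?)+/);

console.log(chunks);
console.log(repeated[0], repeated[1], repeated.length);
console.log(nonCapturing[0], nonCapturing.length);
```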
Grouping / \w+@\w+\.\w+ / drew@allinthehead.com / (\w+)@(\w+\.\w+) / [ ‘drew’, ‘allinthehead.com’ ] Capturing groups is very useful!
Grouping / (?<user>\w+)@(?<domain>\w+\.\w+) / [ user: ‘drew’, domain: ‘allinthehead.com’ ] Some engines offer named groups.
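JavaScript is one engine that supports named groups (since ES2018). A minimal sketch, with the dot escaped so that it only matches a literal full stop:

```javascript
const email = "drew@allinthehead.com";

// Numbered capturing groups are read by index from the match array.
const numbered = email.match(/(\w+)@(\w+\.\w+)/);
const byIndex = [numbered[1], numbered[2]];

// Named groups are read from the "groups" property of the match.
const named = email.match(/(?<user>\w+)@(?<domain>\w+\.\w+)/);
const user = named.groups.user;
const domain = named.groups.domain;

console.log(byIndex, user, domain);
```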
Replacing
Replacing If you’ve used capturing groups in your pattern, you can re-insert any of those matched values back into your replacement. This is done with ‘back references’. Back references use the index number of the captured group.
Replacing with back references
<?php
$str = 'drew@allinthehead.com';
$pattern = '/(\w+)@(\w+\.\w+)/';
// $1 and $2 re-insert the first and second captured groups.
$replacement = '$1 is now fred@$2';
$result = preg_replace($pattern, $replacement, $str);
echo $result;
?>
drew is now fred@allinthehead.com
PHP uses the preg (Perl Regular Expression) functions to perform matches and replacements.
Replacing with back references
var str = 'drew@allinthehead.com';
var pattern = /(\w+)@(\w+\.\w+)/;
// $1 and $2 re-insert the first and second captured groups.
var replacement = '$1 is now fred@$2';
var result = str.replace(pattern, replacement);
console.log(result);
drew is now fred@allinthehead.com
JavaScript uses the replace() method of a string object.
Putting it to use
HTML5 input validation <input name="sku" type="text" pattern="[A-Z]{3}[0-9]{8,10}">
HTML5 adds the pattern attribute on form fields. They’re parsed using the browser’s JavaScript engine.
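A pattern attribute has to match the whole field value, not just part of it. The sketch below approximates that in plain JavaScript by wrapping the pattern in `^(?:` and `)$`. The function name `matchesPatternAttribute` and the sample SKU values are hypothetical, the range quantifier is written `{8,10}`, and the flags a browser applies when compiling the attribute are left out for simplicity.

```javascript
// Hypothetical helper approximating how a pattern attribute is checked:
// the pattern must match the entire value.
function matchesPatternAttribute(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

// Three capital letters followed by eight to ten digits.
const skuPattern = "[A-Z]{3}[0-9]{8,10}";

const eightDigits = matchesPatternAttribute(skuPattern, "ABC12345678");
const sevenDigits = matchesPatternAttribute(skuPattern, "ABC1234567");
const elevenDigits = matchesPatternAttribute(skuPattern, "ABC12345678901");
const extraPrefix = matchesPatternAttribute(skuPattern, "xABC12345678");

console.log(eightDigits, sevenDigits, elevenDigits, extraPrefix);
```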
Apache mod_rewrite
RewriteEngine On
RewriteRule ^news/([1-2]{1}[0-9]{3})/([a-z0-9-]+)/? /news.php?year=$1&slug=$2
URL rewriting in Apache uses PCRE.
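To see how the captured groups feed the target URL, the rule's pattern can be tried in JavaScript. This only imitates the capture-and-substitute step and is not how Apache itself evaluates rules; the `rewrite` function and the sample paths are hypothetical.

```javascript
// The rule's pattern as a JavaScript regex (slashes escaped inside the literal).
const rule = /^news\/([1-2]{1}[0-9]{3})\/([a-z0-9-]+)\/?/;
const target = "/news.php?year=$1&slug=$2";

// Hypothetical stand-in for the rewrite step: if the path matches,
// fill $1 and $2 in the target from the captured groups.
function rewrite(path) {
  const m = path.match(rule);
  if (!m) return path; // no match: leave the path alone
  return target.replace(/\$(\d)/g, (_, n) => m[Number(n)]);
}

const rewritten = rewrite("news/2015/regex-talk/");
const untouched = rewrite("about/");

console.log(rewritten, untouched);
```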
Your application code
<?php
$str = 'Look at this https://www.youtube.com/watch?v=loab4A_SqoQ and this https://www.youtube.com/watch?v=I-19GRsBW-Y';
$pattern = '/(\w+:\/\/[^\s"]+)/';
$replacement = '<a href="$1">$1</a>';
echo preg_replace($pattern, $replacement, $str);
?>
Look at this <a href="https://www.youtube.com/watch?v=loab4A_SqoQ">https://www.youtube.com/watch?v=loab4A_SqoQ</a> and this <a href="https://www.youtube.com/watch?v=I-19GRsBW-Y">https://www.youtube.com/watch?v=I-19GRsBW-Y</a>
Don’t copy this example - it’s simplified and insecure.
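One reason the example above is unsafe is that the surrounding text and the URL are inserted into HTML without escaping, and any scheme is accepted. The JavaScript sketch below shows one possible way to tighten it: only `http` and `https` URLs are linked, and every piece is HTML-escaped. The functions `escapeHtml` and `linkify` are hypothetical names for this illustration, and this is still a sketch, not a vetted sanitiser.

```javascript
// Hypothetical helper: escape the characters that are special in HTML.
function escapeHtml(s) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return s.replace(/[&<>"']/g, (c) => entities[c]);
}

// Hypothetical helper: link http(s) URLs, escaping the text between them.
function linkify(text) {
  const urlPattern = /https?:\/\/[^\s"<>]+/g;
  let html = "";
  let last = 0;
  for (const m of text.matchAll(urlPattern)) {
    html += escapeHtml(text.slice(last, m.index)); // plain text before the URL
    const url = escapeHtml(m[0]);
    html += `<a href="${url}">${url}</a>`;
    last = m.index + m[0].length;
  }
  return html + escapeHtml(text.slice(last)); // plain text after the last URL
}

const linked = linkify("See <b> https://www.youtube.com/watch?v=I-19GRsBW-Y now");
const notLinked = linkify("javascript://example");

console.log(linked);
console.log(notLinked);
```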
Further reading Teach Yourself Regular Expressions in 10 minutes, by Ben Forta. (Not actually in 10 minutes.) Mastering Regular Expressions, by Jeffrey E. F. Friedl.
Further learning regex101.com
Thanks! @drewm speakerdeck.com/drewm/getting-to-grips-with-regular-expressions
Regular Expressions are everywhere, from configuring routing in your web app, to form validation, to server configuration. They’re a tool every developer should master, yet most of us just copy and paste solutions from StackOverflow. If you feel like you never really understood regular expressions, this is the introduction you need. Next time you need to import content from a legacy platform, or deal with messy data from a badly designed API, you’ll have all the tools you need.
Here’s what was said about this presentation on social media.
Hip hip array! @drewm #FOWA pic.twitter.com/8cXokUDl03
— Shannon Burns (@karishannon) October 6, 2015
#FOWA learning regular expressions with @drewm pic.twitter.com/3EWNS7WSnO
— Rachel Andrew (@rachelandrew) October 6, 2015
Brilliant talk on Regular expressions from @drewm. Enjoyed that ['hip', 'hip']! #fowa
— Kiran Mistry (@iamthemistry) October 6, 2015
Really good introduction to Regular Expressions by @drewm. I’ve always just… “got by” with RegEx. Some great terminology and tips #FOWA
— Gary Garside (@garygarside) October 6, 2015
!false "It's funny cause it's true." @drewm #FOWA
— Shannon Burns (@karishannon) October 6, 2015