SPELLBINDERS:

Amanda Swinsten
UNDERSTANDING GREED BEHAVIOUR IN REGULAR EXPRESSIONS

Regular Expressions (those snippets of code that can be used to search for a certain pattern in within strings) have probably always been a bit neglected in javaScript programming (or programming without adjectives) mostly due to the difficulty to understand what "greed" is.
December 2001
{ @ }



Regular Expressions and their most troublesome characteristic: greed.
A comprehensive analysis of what greed is and on whether it is a justified or helping feature in Regular Expressions

  1. Boring Foreword
    Nothing is most annoying than a solution that complicates the problem, aside from a solution that solves the problem it arises.
    Nadia Auermann

    Normally you set out with such a remark to assess, waving warning and sagacious fingers, how you should never be, the way you should never program or script, or to scold your by-standers on dysfunctional solutions to programming riddles.
    But I won't do that, I won't take you by hand to platitude hill. For when you're at programming, warning too long it betrays a basic lack of understanding of what programming implies. For programming has a roaming genius of its own, which like an Aladdin's lamp holy smoke can at times sieze the scene, and reminds us that dealing with complexity and complex solutions is the hallmark of programming, and furnishing a simple problem with a cumbersome solution may not be the consequence of a dire mistake, but the sound of the trumpet which heralds programming quint essences. Programmers just can't happen to be plain guys, that is.
    Functionality, she needs to be demystified. Pragmatism calls upon us for polished codes. But programming springs out from that heritage common to mankind which has been to defy the fiercest among the elements: «to defy material injury from external annoyance»; complexity radiates fascination. At times you just get caught by its charm. So never blame the programmer who hands over to you a complex solution, for the very simple reason this doesn't make of him/her less of a programmer of he/she who passes to you a simpler one. The same guy who today gives to you a simple solution may get lost tomorrow, the same guy who gives to you a complex solution which is unfit to the nature of the problem today, is the same guy who could save your day tomorrow. Programmers are just swindlers.

  2. What regular expressions are


    If you're reading this,
    matisse
    I suppose you already know something about regular expressions: they are those snippets of code, put in between a leading and a trailing forward slashes, that define patterns in the possible sequence of words and letters in a string such that they can check [if used in methods such as replace(regExpression, replacerText) or search(regExpression)] whether an instance of the given pattern is actually present in a given string. Dumb example: /hallo/ would check for the word hallo.
    In javaScript you have a method which is called match() which gets as an argument (in brackets) a regular expression and as its object a string
    Instance:
    var myString="hallo world and hallo folks"
    myString.match(/hallo/g) /*g is a flag for GLOBAL. Search ALL string, that is. Must be after the last slash*/

    The match method would return an array of instances matching with the given regular expression (and returns keyword null if no match occurred): in our case it would return a two entries array.
    You may wonder, why it returns an array: I already know the word is "hallo" I do not need to have a collection of words matching my regular expression, it could have returned just a number flagging the amount of matches (which, by the way, is the returned by the match method array length).
    You are approaching the greed problem. In fact regular expressions have not been introduced just to spot flat words, but also anonymous patterns whose structure we know but whose specific contents we ignore: and in such cases an array of all the different possible outcome texts matching it would be very useful.
    For instance, the escaped letter d (that is: \d a d with a backward slash before it) means a one digit number: so an expression like /\d/g would mean search(/.../) for single digits (\d) on all (g: global) the string object; and therefore an array which gives to you all the possible matching instances found in the string is no longer a quirk (for numbers may well differ from each other, such as 3 5 1 0; or may be the same, such as 4 4 4: but how could you guess if a method like match would not return a convenient array which you may scan later?).
    Now I bet this might be a feature you could start appreciating very very much. Or may be not, but would you blame the language here for providing you with the most complex possibility?

  3. The notorious quantifiers

    In the very same fashion regular expressions have a syntax for matching whatever single digit number (\d), they also have syntax for loads of different stuff which may be addressed anonymously enough. We won't review all of them here since we're focused on understanding a problem and not on providing a syntax manual (there are enough out there). But we do need to stress the quantifiers.
    Quantifiers are those segments of regular expressions (let me use RE from now on) whose task is to define the amount of the required instance you're looking for. In fact you may be looking not just for one digit, but for a specific amount of digits although you do not care what digits they are (say three digits? well then 345 would be a match as well as 409): you just care they make up for a number whose length is the length (amount) you have in mind (imagine a credit card number: searching for a pattern like \d\d\d\d \s \d\d\d\d \s \d\d\d\d \s \d\d\d\d might do, although not a nice way to express it).
    A prospect on quantifiers may look like this, considering all of them go after the pattern (first the pattern, then the amount of it you're looking for. Convincing, isn't it?)
    Those below are the quantifiers, as you see they cover every possible desire.
    So if you want to check only for those numbers which are groups of three digits, you may do:
    /\d{3}/g

Leaann Tweeden
{X} only X instances
{X,} at least X instances, no upper limit
{X,Y} at least X instances, at most Y
? shortcut: 0 or 1 at most
* shortcut: 0 or whatever amount
+ shortcut: at least 1, no upper limit

You might probably want a device to test around with some instances:
(please if you're running Netscape read a warning [click here] on Netscape)


No forward slashes needed
global? (adds a g modifier) » « case INsensitive?(adds i)


No confirm boxes »


  1. Only in certains specific cases you have to worry. Keep them in mind.

    The first thing you must keep in mind when you want to understand greed is this: you will face greed problems (we're to see soon what they are) only and uniquely when quantifiers are called in: what I want you to understand is that it is pointless to get concerned about greed as long as a quantifier is not put in action, while on the other side is mandatory to get concerned about greed as soon as a quantifier is put in action. But you need a second requirements to start thinking of greed: you must get worried of greed only when:
    • Quantifiers appear on the scene
    • They set a range
    What is a range? A range is a quantifier which not only determines how many of the previous instances should be searched for, but also what should be AFTER that amount, so a range defines a bigger lot to search for, a certain amount of a certain data type followed by some other data type (such as: 4 digits followed by one letter. I understand it may appear complicated, but it's not a language's fault: it is that searched for items may be complicated!); so a range is whatever quantifier which is not only preceded by an instance, but also followed by other stuff, a range is a data type plus an amount (quantifier) plus another data type. That's the only case when you must get worried about greed:
    • a data type
    • a quantifier
    • another data type after it

  2. Previous condition examples
    Let's imagine an html code like this:
    <a href='file.html'><img src='pic.jpg'></a>
    Let's imagine you want to search for the tags in a html file which contains that string (an oversimplification, I agree): you cannot do:
    /<>/g
    for that would mean searching for an empty tag: you're forced to use quantifiers to say: start with a < sign, then whatever chars in between, and check for a closing >.
    Since a dot (.) means in RE whatever type of char, you might write:
    /<.*>/g
  3. You must consider that if you'd use \w which means a letter whatever or \d a digit whatever or \s a space, even if you'd group them in a class (which is a group surrounded by square brackets, such as [\w\s\d], and which means: whatever of the type included here) and then put a * after them to mean: search for a < followed by whatever a \w or a \s or a \d for whatever AMOUNT *, until a > is found: /<[\w\s\d]*>/g, you'd easily get a null result for there could be in between some quote or other character in the searched string which doesn't belong either to the letter (word: \w) type or to the digit (number: \d) type or to the white-space type (\s) taken into account in your regular expression, and that would be enough to make your search fail: in fact searching for a leading < sign followed by letters or digits or spaces and eventually followed by a > symbol, means exactly what it means: that if I find a starting < sign followed by letters digits spaces and also other characters which are not letters digits or spaces (even one # or $ sign would compromise your search, since it's neither a digit nor a word nor a white space!), means that the whole string which is parsed won't match even if it finishes with a > sign. When programming almost a match means no match at all.

    Thus the dot symbol is what is used to mean whatever type of character, also symbols. If you use the dot sign, you could certainly get the idea to write:
    /<[\w\s\d.]*>/g
    to mean a leading < followed by letter space digit or whatever else char, for whatever amount, followed by a > sign, parsing on the global input string; but this would be a bit overshooting even before than wrong: since a dot already means whatever type of sign, it already includes letter chars, digit type chars, space type instances, or whatever sign, so it is much easier to write:
    /<.*>/g
    And with this we back to the closing we were at at point 5.


    BUT, to your extreme puzzlement, this code, apparently correct, won't work.
    Why?
    Since we were mistaken already in the steps in point 6 which illustrated the logic sequence which led us to the final snippet!
    Behold.

  4. How greed operates

    I am forced to provide you with another example: just be attentive: a new string:
    "john82 and mary4"
    Do not disregard it as trivial, as it is going to show to you an authentic programming threat.
    Imagine you want to search such string, say still with the match() method for our convenience, for whatever amount of letters followed by one number: a nifty regular expression might be:
    /\w+\d/g
    which means whatever letter (\w) at least one or more times (+) followed by one digit (single \s).
    Trigger on this string the match() method with the given regular expression and it would return to you a two entry array:
    • john82
    • mary4

    Are you happy? Don't be: the program did not obey your intentions at all!
    In fact if mary4 perfectly matches your criteria (some letters plus one digit), john82 doesn't match at all: some letters plus... 2 digits! If you deem this irrelevant, wait to parse with that for the correctness of a credit card number in a 1 billion dollars transaction...


  5. You were right: greed is a problem we could have done without
    Greed has no justifications in the shape it currently has

    So what happened there, and in what circumstances?
    The fact is this: although you may perfectly know what you're looking for, you have no idea what environment you might find. You may search for 4 letters plus a digit, but you may encounter 4 letters plus 2 digits. Why the regular expression (RE) doesn't return the right pattern?
    You may know that if you instruct either a subroutine or a built in method or even a single row statement of a program to do something, it won't return (yield) a result, neither a temporary one, until a minimal amount of input is not parsed. You can know what the eventual result of a programming expression will be, but you can not know what the result will be in the meanwhile until it has come to a full stop. An informatic expression or program won't relinquish to your screen a result until its task is not fully performed (no matter how fast your PC is).
    So here is what happens when you get the wrong result in the RE above:
    1. The search starts
    2. finds a letter: j.
      RE asks for an amount of letters plus a digit, so doesn't match yet.
    3. finds a letter, o.
      RE asks for an amount of letters plus a digit, so doesn't match yet.
    4. Finds a letter, h.
      RE asks for an amount of letters plus a digit, so doesn't match yet.
    5. Finds a 4th letter, n.
      RE asks for an amount of letters plus a digit, so doesn't match yet.
    6. Finds a number, 8.
      RE asks for an amount of letters plus a digit, so this is a match, and john8 is stored.
    7. Finds a second digit, 2.
      RE asks for an amount of letters plus a digit, the RE wonders whether since you had not specified you meant amount of letters and nothing else in between, if you might like this outcome as well, and stores john82

    Now: wait. Is this crazy and dysfunctional? Yes, it is. Very much so, in my opinion: the RE attempts to behave with generosity and grabs all it can (greed).
    In my probably conceited opinion it means not to obey the programmer; for if in some circumstances almost a match means no match, and therefore I am required to craft my RE as carefully and as skilfully as I can, then I should not have been supposed to provide the system with details on what I do not want in other circumstances; for if in the former circumstances the system should stuck with diligence to what I do want and to that only, why in the latter circumstances it just doesn't keep doing it and presumes what I do do not want?

    If I forget to say that a char like a $ sign should be taken into consideration, the RE program rightly doesn't assume I wanted it to be taken into consideration (like we were discussing at point 6, remember?), and doing so it is strict and correct. Programming languages should not wander if they are to be reliable.
    So I wonder why the hack, if RE make the right assumption in one case, when I set a quantifier and a second data type after it, they wrongly assume that if I do not specify I do not want the second data type to be taken into account but like an end, it takes it into account like a possible goes in between! If I say some letters plus a digit, grabbing some letters plus 2 digits is arbitrary, it is a... uhm: a bug!
    What makes you lose your patience is not the problem, it is the obstinacy with which they keep saying: no, it is correct, that's very much logic.


    We can make a round double trip around the world here to save this day, but the truth is that what the inventor of RE made is already so good that it makes no difference if he/they eventually let it default to such deeply irrational a thing like greed: but we cannot say, in order to acknowledge the fantastic utility of RE (like some do indeed) that greed is... logical! It is the opposite of logic: it's a backdoor!
    For, behold at the standard explanations given to you on the manuals to justify to your eyes greed logics, and see how deceptive it is:

    They always introduce to you the dot case pattern: our RE might be changed into:
    /.+\d/g
    Which means: whatever data type (dot) one or plus times, followed by a digit. The interpreter would still show its greed, but guess what: in such case (DOT case!) greed would make sense, which is immediately wielded to carefully cloak the instance case I brought where it makes no sense at all, and proclaim a presumptive attained rational basis for greed:


    1. The search starts
    2. finds a letter: j.
      RE asks for an amount of dot chars, that is whatever type plus a digit, so (missing the closing digit) doesn't match yet.
    3. finds a letter, o.
      RE asks for an amount of dot chars, that is whatever type plus a digit, so doesn't match yet.
    4. Finds a letter, h.
      RE asks for an amount of dot chars, that is whatever type plus a digit, so doesn't match yet.
    5. Finds a 4th letter, n.
      RE asks for an amount of dot chars, that is whatever type plus a digit, so doesn't match yet.
    6. Finds a number, 8.
      RE asks for an amount of dot chars, that is whatever type plus a digit, so this is a match, and john8 is stored.
    7. Finds a second digit, 2.
      RE asks for dot type chars, that is: whatever type plus a digit, therefore your RE, having found a new digit, safely assumes the previous one (number 8) now belongs to the whatever type category, and the agency of the closing digit should now be overtaken by this latest digit (number 4): thus john8, now that another number has been found, is no longer deemed like an instance of the some chars plus a digit criteria, but like a some chars whatever they are (dot) instance, whose subsequent matching digit is now number 4! So it... «rightly» picks john84.

    In the latest case (dot case) they attempt to prove to you that greed behaviour is just a logical behavior in the inner workings of a program. But they chose the most cause-friendly example to prove it, for if we use another example, then greed shows its full lack of logics, even more: it proves itself like a serious flaw that (if no countermeasures would have been available: and we're talking countermeasures) was on the brim to jeopardize the correct parsing of every match, and after all the fact everybody spent months of life working around it with the so called denied type classes (I won't go deeper in it now) should have raised more than a suspicious eyebrow before those statements like : «You must understand the inner workings of the interpreter in order to understand greed»: which always sounded to me like those expressions «socialism with an human face» to which somebody replied: « if it is so unnatural to give a human face to something, I do not want to have anything to share with it»; analogously, if an interpreter inner workings are so sophisticated that I have to elate myself in heavens to understand why it does what it does, I'd prefer to switch interpreter.

  6. Solutions to force the Regular Expressions interpreter to do what it should

    Before I proceed, let me tell immediatly the solution.
    We've seen the conditions:
    1. some data (obviously: a quantifier without some data first would make no sense: a minimum amount of X and a max amount of Y of... nothing, still means nothing!)
    2. a quantifier
    3. other data after the quantifier
    The solution is: force the interpreter to grab only the first instance of the given RE pattern it finds, which is achieved putting a question mark sing after the quantifier. I understand it may appear confusing, but this time it is not. A question mark is already a quantifier in itself (see table above: it meant: 0 or 1 at most), but it is a convention that a question mark after a quantifier (I said after!) would mean: limit your search to the shorter instance. So there can be no confusion: a single question mark after some data would be a quantifier, but a second question mark after a question mark as a quantifier, can only be a scope limiter.
    So also in our old example at point 5:
    <a href='file.html'><img src='pic.jpg'></a>
    the correct RE to extract tags would be:
    /<.+?>/g
    So here is your solution, rule of thumb I suggest to you: whenever you insert a quantifier and after it you'd put a range closing data, stuck a damned question mark soon after the quantifier!

    WARNING: if you're testing on our form above, be aware that Netscape's engine may not recognize the question mark syntax; thus you'd resort to negated character classes: I do not want to make this already long essay unduly longer, so I won't discuss negated char classes here; enough to say they were the convoluted way we had in order to work around greed before the question mark patch, so to call it, was made available: that was, a class (namely a data type between square brackets []) which was negated ( the ^ symbol, soon after the first square bracket: [^] ) with after the ^ sign the very same symbol that follows after quantifier: you happy, aren't you? That would have meant:
    /<.[^>]+>/g
    Ya see, the symbology [^] overtakes the question mark, and in between those brackets goes a mere repetition (better: anticipation) of what will be soon after the quantifier (for we said no quantifier, no need to shield from greed)... Would have meant: search for everything in between a < and a closing > tag sign, except a sign equal to the closing one (in our case a >, so excluding it would mean: stop as soon as you find some > ), and this search (the search for such a pattern, that is) must go on for a + quantifier amount: thus when the interpeter found right one >, stopped and stored what was just before (I said before) it as a possible valid pattern: then started going on again parsing for the next char and a possible >, and guess what: since it went back one slot to store the previous possible matter, now that it re-starts going on, it found a > soon (the same as before, yes: but the interpreter went one step backward to store what detected first this > sign, since we instructed it that we want whatever after a < sign but not a > sign: so when it met one, it went backward of one instance to clutch and store what was before). So now that it restarts, it finds immediately a > which would be exactly the closing range: what a combination! so there was its match out of greed...

    I love these things, especially when they party. Don't you?


  7. More on greed and its strange mix of logics and absence of logics

    I want to go deeper in this, although our discussion under a practical point of view might have been completed with point above.
    RE interpreter was bound to find itself facing this problem: given a research range which goes from, say, A to B, what instance of B you'd take into account, the first one you meet or the last one?

    It is correct, in my opinion, to let the default choice being the latter: given a range from A to B, the closing B is not the first B we meet, but the last in case there are multiple instances of B.

    In fact, should be regarded pretty convincing, and not arbitrary, that a program instructed to find a final data B, without further instructions would yield the outmost instance of B it can detect: after all, that's logic twice: first because the closing range, without further specifications, should be the last available instance of the case, and second because if we're talking about a closing/final data, that's one more reason to grab the.. final instance.

    Under these assumptions greed might appear logic.
    There are two points that reverse our judgement:

    1. Could have the interpreter being set differently, that is: to grab the first instance of B?
      Of course. Would have this made our lives easier? Yes. Should therefore we deem the opted choice illogic? No. Could have we made a different choice? Of course. Would have it been more logic? Not at all. Last means last: therefore when a language grabs the last B, it works correctly although it has the same faults the ancient Rome saying went: «summum ius summa iniura», grand justice grand injustice: a program so literal may appear excessive. But in my opinion it must be so.
      What is not convincing in it is this: RE already could get avail of the so called modifiers like the g letter to mean global scope; so since you could already supply your RE interpreter with a global scoping instruction, I wonder why we could not give to our researches such scope by modifiers and had to find this capability bestowed twice by allowing the interpreter to default to a greed behaviour which exhibits once again a global scope (before returning a closing range match).

      Of course, it could be argued that when preparing the original interpreter, modifiers were not available and were introduced as a second or third feature at a stage where the core of the interpreter was already based upon a greed conception; in other words we could argue that maybe first there was the global scoping by greed and later a global scoping by modifiers.

      But if it were so, we should wonder why in an environment which already defaulted to a global scoping was felt the necessity to introduce a modifier which gave to it exactly what it already got: a global scoping, instead of a more logically expected modifier restricting such sweep.
      No, greed really appears like an unnecessary complication.
       

    2. Moreover and finally: it is logical that we grab the last instance of a closing B object. But it is not logic at all that one time we do, another time we don't. For if I have an expression like the one we brought:
      /\w+\d/g
      and given a string like "john82" you grab the whole string, this is not equal to having grabbed the last digit of the requested digit type, but it equals to having grabbed arbitrary data, for the RE is clear: whatever amount of letters (\w) followed by a digit: if you grab "john82", would mean whatever amount of letters plus a non letter (a digit) followed by a digit...
      Not at all what I wrote, which could not be plainer: if I have to beseech an interpreter to understand right what is already clear, we could go on years attempting to pesuade us that the interpeter we made was right and we're now dead wrong. On the other hand, if I set a RE without a \d symbol, it won't match those strings containing digits. One time it does, another it does not, and we have to bear it and grin.

  8. Last considerations on a strange outcome

    Wherever you opt into something, you declare more than you say. Your style is a traitor.
    Whenever you make a choice, you reveal more than you meant to tell, with a proportion growing with the degree of the equivalence of the alternatives that were facing you. I find thus that we can find here either a philosophical or clinic aspect. It is undoubtedly tremendously overwhelming to perceive the metaphor behind the scene. A world of affluence and abundant staples, enough to make you happy, whereas the abundance produces a short circuiting in those who should enjoy it: instead of being regarded as a blessing, this abundance and generosity produces problems; not in itself but under the spotlight of the choice we made, which is crafted in order to make us believe we were right: to put in place a necessity to protect yourself by an omnivore tendency which, would have not we felt such a necessity, would have not raised, and thus compel us into a policy where generosity is a fault and brings about catastrophes, and strictness a virtue which justifies itself with the necessities it itself conjured.

    Whatever logics may have been behind it, and whatever constraints may have forced us to keep greed with us throughout years to date, I still encounter strictly and clearly very personal difficulties in appreciating greed principles whenever I consider that neither me nor other programmers have ever been resorting to such a "feature" one single time in order to exploit its advantages, but conversely we have always been intent in carefully working around its obvious blessings.
    Difficulties, and a deep seated annoyance which I'd be glad in the first place to be able to get rid of, for I can't take off my mind that in such way the enforced formula wasn't of making of steel necessity a virtue, but of drab vice a necessity.