Not that I deem this essay so precious, but hey: some drops of life have been brewed in, so:
INFO ON COPYRIGHT INFRINGEMENTS[you can quote freely only if you provide a clear acknowledgement/link]
MERCURY DWELLING IN TWELFTH
|
|
Brief foreword before the technicalities
|
This file will not deal with the so called greed in Regular Expressions.
If you are searching for a spellbinder on that specific topic, or for one more file devoted to the Regular Expression topic, you can read my Understanding Greed in Regular Expressions
The sunman cometh, and an astounding clarity brought about from nowhere is to shine, shine crazy, all above you.
«Still, things might have happened differently, had not the two dead men come out from under the stones and hushed the hot words in his throat. They led him quite gently from the cache, which he forgot to close. That consummation was reached; that something they had whispered to him in his dreams was about to happen. They guided him gently, very gently, to the woodpile, where they put the axe in his hands. Then they helped him shove open the cabin door, and he felt sure they shut it after him, or at least he heard it slam and the latch fall sharply into place. And he knew they were waiting just without, waiting for him to do his task.» [Jack London, In A Far Country]
The storehouse of the unknown powers is going to unleash its dogs once again: for the one thing is when you believe you have understood something, and the other completely different thing is the fact you have understood it: the former is characteristically marked by the following sentence: "it is such simple a thing!". Oh, indeed?
So let it be: the revelation is here, shined crazy above me and I'm gonna share it with you: for who can be silent after he heard the whisper of the supreme?
Ya got it: I understood what all those strange, absurd properties and methods present in JavaScript and meant to handle Regular Expressions are all about. And in a short moment you too.
The vision begins. Be inspired - by the angels.
AN OVERVIEW FOR THE LESS FAMILIAR WITH REGULAR EXPRESSIONS
|
|
What RegExp are in a short, perhaps boring to the most experienced, but none the less obviously necessary summary before tackling the serious stuff
|
I already said in the headers (which are not merely cosmetic) what Regular Expressions are, and I want to suppose you already have a clue on what they're all about if you landed here after some intentional pursuit.
Anyway, and for good measure, I am arguably supposed too to provide less experienced javaScripters with a slightly irksome to a few but to my eyes indispensable summary: I'm not gonna leave rookies entirely out of the bigger party.
Regular Expressions are those (with somewhat awkward if not ugly a look, we ought admit) snippets of code that are meant to organize a query (a search) on a string in order to see whether the given input string carries one (or more) matches with a given pattern: and it's all in this word: pattern.
In fact Regular Expressions (from now on at times shortened into RE) are exactly a standardized way agreed upon to craft a pattern request, so that a string search would be able to return all the words (if any) that match the pattern: therefore a RE would let you, for instance, search for all the words in a string which are composed of 5 and only 5 letters, or for words followed by numbers, or for symbols followed by nothing or only by specific elements, in a way completely independent of the specific verbose incarnations in which such patterns may materialize themselves inside the string.
Needless to stress the remarkable utility of so great a flexibility when searching items in a string knowing the structure they must have but not what exactly they may spell: like in a phone number, you know it is composed of digits but you don't need to see a specific one to recognize the pattern.
A likewise RE pattern on the example would thus be something like: find in this string all those items which are arguably phone numbers (3 digits plus a dash plus, say, from 5 to 12 digits?), but find them provided the fact that, alas and obviously enough, I've no clue about which specific phone number out of the 500 active millions may have been included.
Talk the difference between: find me all the Dodge cars in the parking lot, versus: find me a Dodge which is red, whose driver is my bud Jim and whose plate recites "wahoo!".
RE are concerned with the former, more generalized and less lucky strike dependant duty.
We should probably distinguish between what I call explicit and implicit forms of a RE, whereas actually the implicit form is just a form that has to be turned into the explicit one to be used as an actual RE, therefore reducing the whole of this volatile distinction between Regular Expressions to one single actual instance/type: the explicit one.
A RE is delimited by two forward slashes, one at its start and one at its end, which are by mere convention considered the boundaries of its starting and ending point: for simplicity I will just use the following in our examples:
/\w{5}/
Notice the slashes. Note that the statement is not in between quotes. Note that the statement never includes quotes meant to surround the elements within. In other words: it is not a string: in fact it is... a regular expression.
The expression within the two forward slashes stands for the following hypersimplified pattern: whatever sequence of letters (the expression \w, whereas the w stands for word type and the backward slash flags it must be meant as such and not as... an actual "w" like in words such as: "www"! Strictly speaking, \w also matches digits and the underscore, but letters will do for our purposes).
Additionally, it demands to search for such letters as long as they are 5 and only 5 in a row ( the expression {5}; tip: peruse my Understanding Greed in Regular Expressions to learn more about these elements called quantifiers if they're totally new to you).
So, I call this Regular Expression representation, the representation including the forward slashes, the explicit representation.
What I called the implicit representation is what we could call a "wannabe" Regular Expression that is featured without the forward slashes and in between quotes: you got it, a STRING. Our example in its implicit form would thus look like:
"\w{5}"
And since there are no forward slashes, you can not -apparently, but will not be so for long- add modifiers such as g and i.
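To make the distinction tangible, here is a minimal sketch you can paste in any JavaScript console. The variable names explicitForm and implicitForm are mine, introduced only for illustration; the backslash in the string is doubled for reasons this file explains further on.

```javascript
// Explicit form: a regular expression literal, delimited by forward slashes.
var explicitForm = /\w{5}/;
// Implicit form: a plain string that merely looks like a pattern.
var implicitForm = "\\w{5}";

console.log(explicitForm instanceof RegExp); // true
console.log(typeof implicitForm);            // "string"
console.log(explicitForm.test("Hallo"));     // true
// A string owns no RegExp methods, so there is nothing to call here:
console.log(typeof implicitForm.test);       // "undefined"
```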
EXPLICIT | IMPLICIT
/\w{5}/ | "\w{5}"
TURNING IMPLICIT INTO EXPLICIT
|
|
We start with a bit more serious stuff that is hopefully going to add some spice to your understanding of javaScript methods and properties that manage Regular Expressions
|
The only purpose of a Regular Expression is to be used: therefore a RE is just a shape that you mould in order to let it be used by third things (such as functions). Out of such environment, namely without a third element between the Regular Expression itself and the string to search upon, Regular Expressions in themselves are nearly nothing.
And the only way for a Regular Expression to be available for usage by a third element, is that it is in its explicit form.
Why? because in its stringed (what I called implicit) version it is not a Regular Expression: it is exactly still and again another mere instance of a text string.
Thence the utility, to begin with, of a built in method whose purpose is to turn a string into a regular expression. Such built in method's name is: compile("wannabe RE here").
Here are the 2 elements to keep in mind in order to use it competently:
COMPILE uncovered
|
THE USE
|
As I already stressed there must always be 3 things on our stage: a Regular Expression, a String, and a Method which uses the regular expression.
The method is compile, the String we have it (the so called wanna be regular expression in its implicit form), we clearly lack the third protagonist: a Regular Expression.
You may object that we cannot have it, for our task is precisely to get a RE out of a String, and you'd be somewhat right: none the less JavaScript methods must all be belongings of some object in order to be possible to invoke/run them.
Whether you're aware of it or not, even the functions you define by yourself -call them methods or subroutines, we're not going to quibble now- do not autarkically (statically) run by and on themselves, but are automatically appended as belongings of the window object.
Thus a call such as myFunc() is tantamount to window.myFunc() in an abridged form.
Being window the topmost object of a hierarchy (JavaScript's), it is the only one that can be safely omitted for it is the only one too that can be safely assumed as always present as well!
[you always, invariably have a window object if you're reading this, correct?]
The compile method makes no exception, and since it deals with Regular Expressions, when deciding which object such method must belong to well he who engineered javaScript found logical and more consistent not to assign it to the window object, but to a Regular Expression object.
It is just a policy. It could have been assigned to the document object: but would have that made much sense after all?
There we go, if it has been assigned to the regular expression object, this just means in JavaScript there exists a Regular Expression object like the much more familiar window or document objects.
And such new object is:
RegExp
Yeah, just named like that: and no matter if it is the first time you see it and you're puzzled: rest assured the first time I read of it I was perplexed as well.
Therefore in order to use compile you must first create some foo foo RegExp object from which you can then invoke the compile method legitimately; which you do by the following syntax, whereas the word foo is just a placeholder for whatever other variable name you prefer (and all the rest are indispensable keywords):
var foo = new RegExp()
That positively creates a new instance of a RegExp object named foo.
You can now compile our "\w{5}" wannabe a RE string, by issuing the following command:
foo.compile( "\\w{5}" ) [why two backslashes? see a bit further on]
Note that this generates and assigns to foo the nature of no longer an anonymous RE, but of a specific RE: our string turned into:
/\w{5}/
and foo is the variable name that now represents such specific RE.
This is what compile does: transforms a string into a specific RegExp object, drawn by that string.
So, on the whole, the compile method works with Regular Expressions and takes as an argument the string to convert (consider compile a synonym of convert!). To run compile, both the "wannabe" String and a Regular Expression object (the latter meant to run the method from, the former to run the method unto), do have to exist.
And if you originally had only a string, you have to devise/scramble out a RegExp object on the fly in the aforesaid fashion, to make the compilation of a String into a Regular Expression a possible process.
Since strings cannot include modifiers, the compile method allows for a second argument which must be a string (in between quotes, namely) where you can add either an i or a g or both (be safe: no spaces in between): the method will take care of the rest:
foo.compile( "\\w{5}" , "gi")
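As a hedged sketch of the same conversion: the RegExp constructor itself accepts the pattern string and the modifiers string, and current references flag compile as deprecated, so the lines below use the constructor. That is my assumption about what you would reach for today, not the approach this file describes. The name fromString is just a placeholder.

```javascript
// Build a RegExp object from a pattern string plus a modifiers string.
var fromString = new RegExp("\\w{5}", "gi");

console.log(fromString.source);     // \w{5}
console.log(fromString.global);     // true
console.log(fromString.ignoreCase); // true
```

Reading the source, global and ignoreCase properties is a quick way to check what the string actually turned into.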
|
ONE BIG ISSUE
|
There is one complex issue with the compile method: knowing it means to arrange your string meant to be transformed into a regular expression, forging it competently as the future valid regular expression it yearns to become by being compiled.
The issue is this: if your string (I repeat: this affects STRINGS namely text in between quotes: therefore does not affect bareword regular expressions such as: /hallo/) includes backward slashes (by the way include no leading and trailing forward slashes: the compile method is meant to take care of adding them on its own, otherwise what compiler would it be??), well such backward slashes sometimes will be... removed!
Why?
Because in nearly all of the existing programming languages (do not blame JavaScript, that is) a convention goes and imposes that backward slashes are to be considered as the so called escape characters; an escape char has the following purpose: it flags that the next element after it has not to be considered like what it is namely a mere letter, but as a letter which purportedly has a machine meaning for the running programming language: for instance if in a string I write "\n", what javaScript in our case would write if such a string is sent to a page and printed, is not a letter n preceded by a slash, but a New line (say a carriage return like the html tag <BR> which Breaks a Line).
To provide you with another example, in a language like Perl an element like "\a" would generate a... beep sound (so called bell or Audio) out of the computer! What the heck is such a thing for, you may wonder; well, a beep command nested into a string might warn you that a long text processing has reached its end... and solicit your attention in case you left the computer alone!
Got it?
Well, do you know what happens if the letter after a back slash has attached no special meaning for the programming language? You might guess an error. No, for then a script might be beset by errors: it just ignores them. See what the engine which interprets the given programming language does:
- It sees a back slash: it consequently understands it is implied that what follows must not be considered a trivial letter but a command.
- Sees the letter after it, and cumulates the back slash plus the letter into one thing: the command.
- Searches in its codes the command linked to the item produced above and:
- Finds it, therefore executes it, and then discards the couple back slash+letter namely drops it and shows it not in the returned text, for its only purpose was to trigger the command or to show some special formatting.
- If it does not find it, it just discards (removes) the back slash and prints the letter.
In other words what happens is that whichever the case, a back slash plus a letter results into not showing something of their couple in case the string gets printed on an output device or passed to another processing thread!!
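The steps above can be watched in action with a couple of string literals. This is only a sketch; the variable names knownEscape and unknownEscape are mine.

```javascript
// "\n" is an escape JavaScript knows: two typed characters become ONE newline.
var knownEscape = "\n";
console.log(knownEscape.length);    // 1

// "\w" means nothing special inside a JavaScript string:
// the backslash is dropped and the letter is kept.
var unknownEscape = "\w{5}";
console.log(unknownEscape);         // w{5}
console.log(unknownEscape.length);  // 4
```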
Of course, the situation is not so hopeless as it may appear: it is not desperately impossible to force a script to show/consider a back slash just as such.
The way to achieve this is, by convention, to prepend to the back slash another back slash: this combination \\ forces the interpreter to print one backslash: escaping one escaper sign compels by convention the machine to print one of them. And the reason of this convention is that we had to be provided with a way out as well, without adding a new symbol to escape the escaper, which as such might have been liable to call for another escaper sign, and so on endlessly: so the escaper can also escape itself, if it gets duplicated; by convention the escaping of an escaper returns one escaper as a mere literal in a string.
Got it? Gulp it anyway: it is a convention.
Anyway keep in mind this rule of thumb: in a language that accepts escaping:
- if you escape an element which unescaped would have NO meaning for the language, it means that now it must be considered as having a special meaning. [from "w" to \w]
- if you escape an element which unescaped would have meaning for the language, it means that now it must be considered as having no more such special meaning. [from \ to \\]
Since in our Regular Expression case, and in most of the Regular Expression cases actually, we do not want our backward slashes being stripped off at all when turning a string into a Regular Expression because such backward slashes are a significant, intentional, and critically functional part of the way our future RE has to be, the trick to force the compile method to keep into place a backward slash where there is one, in order to pass it to the regular expression, is to be sure there are... two consecutive backward slashes:
"\\w{5}"
One is to survive!
Have you guessed what is properly in action here? Two languages both relying on backward slashes as a way to mean escaping signs and as a way to flag letters as meant to be parsed not as such but as commands (our Regular Expression \w as whatever letter of a Word, and not as an actual "w" as in "www", remember?) meet, clash, and overlap: thus you have to overcome the first language (javaScript) and prevent it from removing the backslashes in the assumption they were commands meant for "him", in order to make sure the correct escapes are passed to the second language (the Regular Expression!) they were actually meant for!
If you don't, and javaScript would therefore stumble into escaped letters that in javascript are not commands (such as \w), it would strip off such alleged escapers downright: but they were vital and quite valid indicators of commands for the Regular Expression!
Got it? It is an interference issue, a misunderstanding not between humans but between... languages.
And remember, do not get confused, we're talking not of already valid Regular Expression objects, but of STRINGS that still have to undergo a transformation into a Regular Expression Object: whenever facing a string, javaScript cannot guess on its own and magically that this time an escape sign in this specific string is not meant to be escaped anymore, unless, precisely and as we saw, you double it. That's the way. Out of this way, no other way to let javaScript guess this time, on this string, it has not to escape...
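Here is a small sketch of that clash between the two languages. It builds the RegExp with the constructor rather than with compile (my choice for brevity), and the variable names are placeholders.

```javascript
var single = "\w{5}";   // after JavaScript's own escaping the string holds: w{5}
var doubled = "\\w{5}"; // after JavaScript's own escaping the string holds: \w{5}

var broken = new RegExp(single);    // behaves like /w{5}/ : five literal "w" in a row
var intended = new RegExp(doubled); // behaves like /\w{5}/

console.log(broken.test("Hallo"));   // false
console.log(broken.test("wwwww"));   // true
console.log(intended.test("Hallo")); // true
```

Only the doubled version delivers a backslash to the Regular Expression: one is to survive, indeed.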
I want you now to keep in mind one critical thing: web pages FORM fields always return a string data type when read by a script. This firstly means you need not to include your inputs typed in form textfields in between quotes: they are already considered strings anyway and by default.
But secondly and most importantly, what happens next with web forms is what follows.
The so far discussed behaviour of the compile method and its related removal of backward slashes occurs only if your strings are strings manually included in your scripts (whereas there can well be: in your codes there can be numbers, keywords, strings...!), but would not occur if your strings are drawn from form fields. Before I explain to you why, test it.
In the form below you can see the two behaviours on all types of strings:
- One backward slash, drawn from a FORM
- Two backward slashes, drawn from a FORM
- One backward slash, handwritten in the code (do trust me...) and identical to point 1
- Two backward slashes, handwritten in the code (do trust me...) and identical to point 2
Hint: the right versions are identical: there cannot be but one right version and many wrong ones. In fact as Ilya Prigogine writes quoting Lagrange: «You cannot have but one Isaac Newton: there isn't but one world to discover.»
The reason for these apparently crazy behaviours is the following: when drawing what we called the wannabe regular expression from a form, javaScript never presumes that inputs from a form that include a backward slash should be handled, as far as those slashes are concerned, like actual escape signs (the escaping is worked out when the engine reads the string literals written in your code; a value that arrives while the script runs is never read that way): had JavaScript assumed otherwise, the consequences would have been dire: in fact a user who sends, for instance, data from a web form in order to purchase an item and includes backslashes to separate for instance security alphanumeric codes, might well have sent through an invalid code, for escaped letters that perchance were commands are... escaped, namely stripped off along with the slash, and thus made invisible!
Or a user including by mistake an \a might have caused your server... beep!
Or a user writing in a form a web address using backslashes instead of forward slashes (quite legitimate: browsers consider both ways as valid and identical addresses), might have very likely sent through the form a wrong address.
So, Forms Do Not Escape. Take it like a saying, like Norman Mailer's «Tough Guys Don't Dance».
But when an escape character (namely a backward slash) is handwritten in the code by a coder (a... human javascript programmer, say!), then javaScript goes back to what it correctly and normally does: it safely considers backward slashes as traditional escapers: and thus in order to force it to consider them as normal backslashes (and pass them as such to the Regular Expression without tampering with them namely without assuming they were meant for itself), it calls for you... escaping the escape sign with an escape sign...: \\
That flags by convention that one escape sign \ must be preserved, as we saw previously. This way the escaped sign navigates its way unharmed through the jaws of the scripting language, and lands into the realm of the Regular Expression compiler with its correct syntax preserved.
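A plain snippet has no form to read from, so the sketch below fakes one: it assembles the value at run time with String.fromCharCode(92) (92 being the character code of the backslash), which stands in for text a user typed in a field. The name typedByUser is hypothetical; in a page the value would come from the field's value property instead.

```javascript
// Stand-in for text typed in a form field: \w{5} with ONE backslash.
// It is assembled at run time, so no string literal escaping is involved.
var typedByUser = String.fromCharCode(92) + "w{5}";

console.log(typedByUser.length); // 5 : the single backslash is really there

var fromField = new RegExp(typedByUser, "gi");
console.log(fromField.source);   // \w{5}
console.log("Hallo folks".match(fromField).join(",")); // Hallo,folks
```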
Treasure this: I never found this explanation in any book (can you guess how many javascript books I read?), I never found this explanation in any newsgroup, I never found this explanation when asking for help, I conversely found only a few lines on such articulated an issue; and this file is absolutely for free, and it took me years to understand these things (not in an intensive meaning of "year", of course: in a year I had to cope with compile about 10 times, no more. None the less, whichever way, it took time). This is the way I understood these crazy behaviours, and my bet is now that I understood them right.
|
Missing a better place to add this small information, be warned that Netscape and Internet Explorer return a different data type for a regular expression: Explorer considers regular expressions as Objects, whereas oddly enough Netscape considers regular expressions as instances of the Function data type!
MATCH & SEARCH
|
|
Two important built in methods using Regular Expressions, and a script to add a missing one!
|
Here we investigate all the javascript built in methods (about 5 on the whole, depending on how we consider the method exec) that in JavaScript are meant to handle relations between Strings and Regular Expressions.
To explain them we use as a regular expression our usual:
/\w{5}/gi
already in the form of a RegExp object, assuming that if any compiling process was necessary, it is already done. Note it has the two trailing modifiers to be global scoping and case INsensitive.
As a string we will use:
"Hallo folks, I said Hallo"
which conveniently contains 3 words of 5 letters: two times Hallo and once folks. The reason of this choice is to be soon clear to you.
Now the methods:
- match:
Its form is: STRING.match(RegExp)
namely in our case:
"Hallo folks, I said Hallo".match(/\w{5}/gi)
Do not be fooled by the fact I included a so to name it physical string before it: it may hold a variable that holds as a value a string as well!
It returns either the keyword null if no match was found, or an Array indexed from zero onward, each entry of which is the text of one match found in the searched string.
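A quick sketch of both outcomes; the variable names found and none are mine.

```javascript
var found = "Hallo folks, I said Hallo".match(/\w{5}/gi);
console.log(found.length);      // 3
console.log(found.join(" | ")); // Hallo | folks | Hallo

// No run of 5 word characters here, so match hands back null, not an empty array.
var none = "a bc def".match(/\w{5}/gi);
console.log(none);              // null
```

Since the miss is null and not an empty Array, check the result before reading its length.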
- search:
Its form is: STRING.search(RegExp)
namely in our case:
"Hallo folks, I said Hallo".search(/\w{5}/gi)
It returns a number, either -1 if no match is found, or a number from zero onward representing the offset position in the string where the first occurrence of a match was found. In our example it would return zero, for the first Hallo is at the very beginning. Try to put some word of less than 5 letters before it to see a different index number returned.
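The three possible situations, sketched with strings of my own choosing besides our usual one:

```javascript
var atStart = "Hallo folks, I said Hallo".search(/\w{5}/gi);
var shifted = "Oh, Hallo folks".search(/\w{5}/gi);
var missing = "a bc def".search(/\w{5}/gi);

console.log(atStart); // 0  : the first Hallo opens the string
console.log(shifted); // 4  : "Oh, " takes up positions 0 to 3
console.log(missing); // -1 : no match at all
```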
- You got it: we have a missing feature here: what if I want to return all the indexes of all the matches, even the ones that may be double (like Hallo in our example)? Strangely enough, no one thought of producing such a function despite its obvious utility and absence.
United Scripters is here precisely for this. I made the function that does it.
It is called imatch, i stands for Indexes in a php naming style this time, and the function wants two arguments: the first a String, the second a RegExp (if by chance you pass a String also as its second argument, it -would you guess?- compiles it!): standard invocation on our example:
imatch("Hallo folks, I said Hallo" , /\w{5}/gi)
There is a third possible argument named flags: only if you pass a string as second argument instead of a RegExp object you may want to pass such third argument: it must be a string and include either a g or an i or both gi in case you want to add them as the modifiers argument to the compile method run to generate the RegExp out of the string. If you don't pass it and the second argument is still a String, no modifiers will be applied (and therefore the default compilation would not set the generated RegExp to global)!
Example:
imatch("Hallo folks, I said Hallo" , "\\w{5}" , "gi")
Eventually, the fourth and last argument named doubleReturn, if passed and set as, for instance, number 1, would slightly modify the output, returning this time an array of two entries: the first entry is an array on its own holding once again the indexes of the matches, whereas the second entry is still again another array but this time holding the corresponding found matches.
The arguments are obviously separated by a comma.
The function returns either the keyword null if no match was found at all, or an array each entry of which is the search equivalent index, but this time for each match: thus if you run in parallel first the built in match method and then my function, you will generate two parallel arrays of the same length: the one returned by match carrying the matches, and the other one, returned by my imatch, carrying at the corresponding array indexes the... positions!
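The code of imatch is not reproduced in this file, so what follows is only a sketch of how a function with the behaviour described above could be written. It is named imatchSketch to make clear it is not the original, it leans on the built in exec method, and it uses the RegExp constructor where the description mentions compile.

```javascript
// Hypothetical reconstruction: returns the positions of all matches, or null.
function imatchSketch(input, pattern, flags, doubleReturn) {
  // A string pattern gets converted; a RegExp object is used as it is.
  var re = (typeof pattern === "string") ? new RegExp(pattern, flags || "") : pattern;
  var indexes = [];
  var texts = [];
  var m;
  re.lastIndex = 0; // always start from the beginning of the input
  if (!re.global) {
    // Without the g modifier only the first match can be reported.
    m = re.exec(input);
    if (m) { indexes.push(m.index); texts.push(m[0]); }
  } else {
    while ((m = re.exec(input)) !== null) {
      indexes.push(m.index);
      texts.push(m[0]);
      if (m[0] === "") { re.lastIndex++; } // avoid looping forever on empty matches
    }
  }
  if (indexes.length === 0) { return null; }
  return doubleReturn ? [indexes, texts] : indexes;
}

console.log(imatchSketch("Hallo folks, I said Hallo", /\w{5}/gi));    // the positions 0, 6 and 20
console.log(imatchSketch("Hallo folks, I said Hallo", "\\w{5}", "gi")); // same positions
```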
Be warned that in the textfields below I set the scripts inside this file to be run always as global and case insensitive by default (if you use imatch with a string meant to be compiled, you'd have to pass the third argument named flag as "gi" to actually achieve this: but here, in this file, to simplify I set the option in our test form as always implicitly present and set to global and case insensitive)
Also, I slightly rearranged our RegExp (taken from a form, therefore a string: the imatch function can work out whether it has to compile it or not); I rearranged it in this fashion:
\b\w{5}\b
Why?
I added two pattern definitions: two \b which means word boundaries.
Therefore our example regular expression now by and large means: search for a word boundary (the start or the end of the string, or the spot where a letter meets something which is not a letter), followed by 5 letters, followed by another word boundary.
If it weren't so and instead of number 5 you would have inserted number 4 (with the intention, for instance, to produce a match on the word "said" which is present in our string "Hallo folks, I said Hallo"), well you would have got not just 1 item, but as many as 4: in fact if you don't instruct a regular expression to match only whole words, it can match even substrings in the words, and therefore return not only "said" but even "Hall" "folk" and "Hall" again: in fact, all of them are strings of 4 letters, correct?
Moral: you have to specify something more, if you want to get only those words that are composed of an x amount of letters and are also whole isolated words.
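The difference the two \b make can be checked in a few lines; the variable names are placeholders of mine.

```javascript
var sentence = "Hallo folks, I said Hallo";

var loose = sentence.match(/\w{4}/g);      // substrings of longer words count too
var whole = sentence.match(/\b\w{4}\b/g);  // only whole words of 4 letters
var five = sentence.match(/\b\w{5}\b/g);   // only whole words of 5 letters

console.log(loose.join(",")); // Hall,folk,said,Hall
console.log(whole.join(",")); // said
console.log(five.join(","));  // Hallo,folks,Hallo
```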
TEST & REPLACE
|
|
Two important built in methods using Regular Expressions, and a script to add a missing one!
|
Two more methods.
- test
It has the following shape:
aRegExp.test(STRING)
It merely checks whether one instance of a given pattern is present and returns a boolean true or false.
So consider these methods for searching a match as basically differing in this: the returned type:
- match returns array (or null)
- search returns number
- test returns boolean (note also that unlike the previous two methods, it gets a STRING as an argument and a RegExp as the object it is invoked from)
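The three returned types side by side, as a sketch. The RegExp here carries no g modifier on purpose: a global RegExp remembers where its last test stopped, which can make repeated calls on the same object look erratic.

```javascript
var fiveLetters = /\w{5}/i;

console.log(fiveLetters.test("Hallo folks")); // true
console.log(fiveLetters.test("Hi you"));      // false

console.log(typeof fiveLetters.test("Hallo"));          // "boolean"
console.log(typeof "Hallo".search(fiveLetters));        // "number"
console.log(Array.isArray("Hallo".match(fiveLetters))); // true
```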
- replace
It has the following shape:
String.replace(RegExp, replaceString)
It just replaces the occurrences of the pattern RegExp passed as argument with the string passed as second argument (in between quotes, for it must be a String), if it finds any of them on the String. That simple (and useful!).
Instance:
aString.replace(/\b\w{5}\b/gi , "")
which replaces all 5 letters long words with nothing: an empty string in this particular case; it is a standard procedure to remove words; of course, you could have inserted in between quotes a word.
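Applied to our usual string, a sketch of the removal and of a plain word replacement (XXXXX is an arbitrary filler of mine):

```javascript
var original = "Hallo folks, I said Hallo";

var stripped = original.replace(/\b\w{5}\b/gi, "");
console.log(JSON.stringify(stripped)); // " , I said "

var masked = original.replace(/\b\w{5}\b/gi, "XXXXX");
console.log(masked);   // XXXXX XXXXX, I said XXXXX

// replace returns a new string: the original is left untouched.
console.log(original); // Hallo folks, I said Hallo
```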
No, actually it doesn't let you specify the second argument as another Regular Expression: in fact a pattern as such is anonymous, so whereas you can find specific incarnations of a pattern to search for (and replace), you cannot replace it with a ghost (a pattern) but with something more concrete than just a pattern specification.
Anyway it is not entirely true that replace does not let you specify as its second argument something similar to a regular expressions: it lets you in fact include a highly specific type of string which appears like this (when I say like this, I mean exactly like this, comrade):
"$1 $2"
Now, what is that?
It is called backreferencing: if in a regular expression you include round brackets, they have a special meaning (and if you do not mean a round bracket as carrying the special meaning Regular Expressions attach to round brackets I'm going to explain to you, well: escape it pre-pending to it a \!): such meaning is: keep in memory the incarnation of the instance of the pattern segment included in between these parentheses, if you found a match.
So consider a RegExp like:
/(\w{4})\d{2}/
meaning (regardless of how much sense it may have or have not: it is just a trivial example) get those patterns which are 4 letters followed by 2 numbers (d=digits); the round brackets would allow for the storage of the, let's call it so, subquery \w{4} (the parenthesized element indeed!) as if it were independent of the main query and should be appointed with a memory of it; storing, but storing where, you may be wondering?
Well, we saw that every RegExp is an object (although I always sponsored against this lexicon "object", preferring to it structured data as I already stated on my essay on pointers): as such it can memorize properties: the mystery in this can last only as long as you ignore the names of these properties, by calling which you could read the values.
We're going to see it better, although you're to see it in action in replace as well: parentheses do let the RegExp object store and memorize elements on what it does, and one of these elements is the so called captured parenthesized elements whose names are as funny a thing like $1, $2, $3... and so on (funny if you never coded in Perl or Php, obviously: for in Perl and Php both all the names of variables start exactly with a dollar sign, believe it or not).
Thus if you memorized as many as two of such captured elements, the replace method allows you by the syntax
aString.replace(/(\w{5})(\d{2})/gi , "$2$1")
to write the second captured element before the first one, thus performing an actual swapping! In our latest example, it would find matches like Helen67 and would turn them into 67Helen! (A replacement string like "$1 $2" would instead keep the order and just slip a space in between: Helen 67.)
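A quick check of how the $1 and $2 placeholders behave, depending on the order in which you write them; the variable name pair is mine.

```javascript
var pair = /(\w{5})(\d{2})/gi;

var spaced = "Helen67".replace(pair, "$1 $2");  // same order, a space in between
var swapped = "Helen67".replace(pair, "$2$1");  // second capture first: a swap

console.log(spaced);  // Helen 67
console.log(swapped); // 67Helen
```

The order of the placeholders in the replacement string decides the order in the output; the order of the parentheses in the pattern only decides which one is $1 and which one is $2.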
Enjoy.
- Perhaps you got it again: we miss another logical dovetail to these methods: a method that allows you to split a String according to a RegExp pattern.
You guessed: here it is:
It is called isplit and gets arguments as follows: input String, regexp, flags.
This last one (flags) only if regexp is a string itself: flags defaults to ignite a global search (global split) if it is not passed at all.
If it is passed as a Number, it would add no modifiers (an arguably senseless feature for if you want to split a string most of the time you want to split it after all the recurrences of the found matches, and not just after the first one encountered - which would be exactly what happens if you pass the regexp argument as a String and the flags as a Number).
The returned type: it is either null if invalid arguments were provided, or an Array: if it is of one entry only [index 0], it is probably the very same string meaning it was found no match to split it with (the reason I return it in the shape of an array instead of just returning the string itself, is that by doing so I keep consistency with the output type -an array once again as you're to see- and thus I arguably spare you the time to scramble different functions to handle different outputs).
If the function finds matches, it can remove them all (every split process, since it splits by a given element -in our case a... pattern!- excludes and drops off the splitting element/s: it is a split, not a join: a join includes the joiner whereas a split excludes the splitter) and would return an Array whose each entry [indexes from zero onward as usual for arrays] is the fragmented rests of the string (this may well include apparently empty items, for they may just be white spaces in the input string once removed the matches, or even those typical Regular Expression phantasmal start of line and end of line invisible "entities", showing up in the appearences of empty strings, much alike vampires become showing up like festering bats ya see...).
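What follows is only a sketch of how a function honoring the contract just described might look: it is not the author's isplit, and the name isplitSketch and all its internals are assumptions made for illustration.

```javascript
// Hypothetical re-creation of the described contract, NOT the original isplit.
function isplitSketch(input, regexp, flags) {
  var re, m, parts = [], last = 0;
  if (typeof input !== "string") return null;            // invalid input
  if (typeof regexp === "string") {
    // flags omitted -> global split; flags given as a Number -> no modifiers
    var mods = flags === undefined ? "g" : (typeof flags === "string" ? flags : "");
    try { re = new RegExp(regexp, mods); } catch (e) { return null; }
  } else if (regexp instanceof RegExp) {
    re = regexp;
  } else {
    return null;                                         // invalid pattern
  }
  if (!re.global) {
    // Non global: split only at the first match found.
    m = re.exec(input);
    if (m === null) return [input];
    return [input.slice(0, m.index), input.slice(m.index + m[0].length)];
  }
  re.lastIndex = 0;
  while ((m = re.exec(input)) !== null) {
    parts.push(input.slice(last, m.index));              // text before the match
    last = m.index + m[0].length;                        // skip the splitter
    if (m[0].length === 0) re.lastIndex++;               // avoid looping on empty matches
  }
  parts.push(input.slice(last));                         // the remaining tail
  return parts;
}

var allPieces = isplitSketch("a1b22c", "\\d+");      // flags omitted: global
var firstOnly = isplitSketch("a1b22c", "\\d+", 0);   // flags as a Number
var untouched = isplitSketch("abc", /x/g);           // no match at all
var invalid = isplitSketch(5, /x/);                  // invalid input
```

Returning a one entry array when nothing matches keeps the output type constant, which is the design choice the text argues for.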
EXEC
|
|
The last built in method: the mysterious exec
|
Here we are at the last of the 5 built in methods.
So, you've noticed that you can do a variety of things with the built in methods, and none the less there were still a few remarkable exceptions (2 at least, as we saw) whose gap we just filled in with the cute isplit and imatch.
Well, what if there are others not detected yet?
Let me tell you you haven't been the only one to have this doubt. In fact here comes along exec, as a spider.
Such built in method is meant to fill in those instances which the types of fruition of Regular Expressions management provided by replace, match, search and test didn't account for.
exec cannot guess what next you may want to do or get from a Regular Expression versus a String, but it attempts to lend a helpful hand by presenting itself as a way to pass over to you a wealth of information on a Regular Expression when it gets executed on a String: it is up to you then to pick from this information.
And in this lies the only objective the method stands for: it does nothing to the String it is executed upon: it just returns a set of data on the relations found between the Regular Expression and the String, and puts them at your disposal.
Its syntax is:
aRegExp.exec(String)
In which shape is exec to pass this data to you? Being a variety of things, its shape is the shape of an Object (or null, if no match is found at all): namely it returns something that can be later questioned by addressing either the names of some of its properties or even some numerical indexes (the two things can both coexist inside an Object).
They are:
- index: is a number, and is the numerical index at which the first match for the given RegExp pattern starts inside String.
- [0]: is a string and is the matching text whose index is equivalent to index (see above)
- input: it is the entire String text of the original string. You wonder: I already knew that. Well, the idea is the one described by Danny Goodman in his exceptional JavaScript Bible (no comparison on earth as far as JavaScript is concerned, period):
«The value of having [properties you already know stored] in the object, is that their contents are safely stowed in the [returned] object, w[hereas] the RegExp object and its properties [and the input String itself] may be modified soon by another call [process] to a regular expression method.»
It is something I myself often do when I ideate my own functions: I try to return as much as possible, to be sure that, even if you pick only one piece of data, you can arguably find in the returned stuff some meat indeed.
- [1-x] namely array indexes from 1 onward such as [1] or [10]: each of them stores one parenthesized captured item (remember?): therefore I deem that exec is mainly useful exactly to store a relevant amount of captured sub-patterns!
- An example:
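var foo=/\b(\w{2})\b/.exec("Do you think I am sexy?");
foo.input; // "Do you think I am sexy?": the whole original string
foo.index; // 0: the first match starts at the very beginning
foo[0]; // "Do": the whole matched text
foo[1]; // "Do": the first (and here only) captured element
foo[NUMBER]
//if any. Indexes exceeding the captured sets return undefined.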
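A slightly larger sketch of the same idea, with two captured sub-patterns instead of one. The pattern, the sample string and the names (summary, noMatch) are assumptions chosen only for illustration:

```javascript
var re = /(\w+)@(\w+)\.com/;
var text = "write to helen@example.com today";
var result = re.exec(text);

// Gather what exec hands back: named properties and numerical indexes coexist.
var summary = {
  index: result.index,   // position where the match starts
  match: result[0],      // the whole matched text
  user: result[1],       // first captured element
  host: result[2],       // second captured element
  beyond: result[3],     // no third capture: undefined
  input: result.input    // the original string, safely stowed
};

// When nothing matches, exec returns null instead of an object.
var noMatch = re.exec("nothing to find here");
```

Checking the result against null before reading its indexes is the one precaution exec asks of you.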
READ:
THE PROPERTIES
|
|
The last chapter is devoted to the properties: in fact Regular Expressions have not just methods, but attributes as well
|
All right, I'm gonna make you understand it all in one second: there are some properties in a Regular Expression object (regardless of the fact it is born such or you've generated it out of a string by compile) which can:
- Be read both before and after the given Regular Expression has been run on a String (even before it actually gets used, that is)
- Be read only after it has been engaged on some String: as such, these latter "used" properties reflect only the relationship the Regular Expression entertained with the last string it has been employed upon, regardless of which method has called the given Regular Expression in: either a built in one or one crafted by yourself.
Remember what I have said to you in the beginning? Regular Expressions are meant to be used.
In fact as soon as they are used, their last usage stores in the Regular Expression object new information on the latest usage it underwent.
- Let me stress it is critical you don't blur your mental focus away from the fact that a Regular Expression object is not a creature of the Aether, but an actual variable to which you have assigned the role of Regular Expression, either by the syntax foo=new RegExp(), or by igniting it directly as, say, foo=/stuff/gi
Namely, this elusive "Regular Expression Object" is precisely, in this example, the variable you named foo! That is the object. The only competitor to such regular expression as embodied in the variable name (in our case foo) would be the keyword RegExp which is case sensitive and is the so called constructor object.
- Last but not least, it may be useful to stress that the properties available on a regular expression before it is run on a String, by whatever method employing such regular expression, can be read also after the regular expression has been used.
The vice versa is not true: until a regular expression gets employed and engaged on the field at least once, you cannot read the properties defined as readable only after at least one usage.
Yeah, makes sense doesn't it? In fact when either you understand it or they explain it to you clearly, it does!
PROPERTIES
|
BEFORE and AFTER USE (foo)
|
- source: foo.source
Is the String version of the RegExp itself, stripped of the leading and trailing forward slashes (and modifiers are excluded from this stringed representation as well).
- global: foo.global
Is a boolean, either true or false, flagging whether the RegExp object is set to perform a GLOBAL performance as soon as used on a String. It is a read-only property.
- ignoreCase: foo.ignoreCase
Is a boolean, either true or false, flagging whether the RegExp object is set to perform a case INsensitive performance as soon as used on a String. It is a read-only property.
- multiline: foo.multiline
Is a boolean either true or false.
I owe the following explanation of this property to mr. Danny Goodman, the author of the acclaimed JavaScript Bible, in my opinion possibly the very best of the JavaScript books available out there. None the less, up to the 4th edition included, this explanation about the meaning and purposes of the multiline feature is not available in the book.
When you include in a regular expression modifiers like the previous two (global, case insensitive) you could also include, actually, a third type of modifier (either alone or together with both or each of the above): this modifier is: m.
Example:
/hallo/igm or:
/hallo/m or whichever combination.
The reason this property is not widely publicized stems from the fact that its parsing is somewhat flawed on some Macintosh versions of the main browsers, and it does not work, so it seems, on Netscape 4 (would you guess?).
So, what does this multiline flag do?
In regular expressions the caret symbol ^ means beginning of input.
So a regular expression like:
/^hallo/
would match "hallo" only if an instance is found at the beginning of the input, but would not match something like "I said hallo" because "hallo", inside such input, is not at the beginning of the input (which the caret symbol requires).
Now, a string like "I said hallo" can appear in two fashions: flowing in one single line, or broken up into more lines, correct? like:
I
said
hallo
There we go: if you want your regular expression to consider as beginning of input not only the beginning of the input as a whole but also the start of each single line, you add the flag m to your regular expression.
Consequently if on the multilined string:
I
said
hallo
A regular expression like:
/^hallo/
would report failure (no hallo at the very beginning of the input), a regular expression like:
/^hallo/m
would return a successful match, for the m modifier flagged and instructed the regular expression to consider as beginnings not just the very leftward edge of your whole input, but the start of each single line of it!
Got it?
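A small sketch that reads the properties listed above and then runs the multiline test on the three line string; the variable names are mine and purely illustrative:

```javascript
// Readable before any use: source and the flags.
var foo = /^hallo/gi;
var info = {
  source: foo.source,         // pattern text, no slashes and no modifiers
  global: foo.global,         // g is set
  ignoreCase: foo.ignoreCase, // i is set
  multiline: foo.multiline    // m is not set
};

// "I", "said", "hallo" on three separate lines.
var text = "I\nsaid\nhallo";

// Without m the caret only means the beginning of the whole input.
var plain = /^hallo/.test(text);

// With m the caret also matches right after each line break.
var multi = /^hallo/m.test(text);
```

Here plain comes out false and multi true, which is exactly the difference the m modifier is meant to make.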
- I know, you're wondering how can I set these properties before a RegExp is run: in other words, what if I have a RegExp set to global but then for some reason I wanna set it no more to global, and then perhaps reset it to global again? Who knows what the next day may bring.
Of course, you can rewrite the RegExp. Or use this function, named setModifiers: it affects the original (!) RegExp passed as its first argument, changing its global or ignore case status (peek the code, it is quite intuitive):
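The sketch below is not the setModifiers function mentioned above, whose code is not reproduced here. It shows the other route, rewriting the RegExp, under the assumption that global and ignoreCase are read only (as stated above), so that a NEW RegExp has to be built from the old source. The name withFlags and its arguments are hypothetical:

```javascript
// Hypothetical helper: returns a new RegExp with the same source and the
// requested global / ignore case status; the original is left untouched.
function withFlags(re, wantGlobal, wantIgnoreCase) {
  var flags = (wantGlobal ? "g" : "") +
              (wantIgnoreCase ? "i" : "") +
              (re.multiline ? "m" : "");   // keep the multiline status as it was
  return new RegExp(re.source, flags);
}

var foo = /stuff/gi;
var notGlobal = withFlags(foo, false, true);         // no more global
var globalAgain = withFlags(notGlobal, true, true);  // and global once again
```

Unlike a function that alters the original, this one leaves foo exactly as it was, so any other code still holding foo is not surprised.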
An important note:
- lastIndex: foo.lastIndex
This property is set to zero at the creation of the Regular Expression, but is mostly to be read after the RegExp is engaged on a String; in fact, if engaged, this property updates itself, carrying the position (index) right after the end of the found match within the string: to see more specifically what is properly meant here by last index, see the table below.
So, if our match in the string
"hey you folks how are you" is by chance you, the index stored is as follows: it takes the latest instance of you found (if the regular expression was set to global and kept running until the end, it is the last you) and stores the position right after that match: that is, the starting position of the match plus the length of the matched word, which lands one position past the last matched character: last indeed...
So in the example below, if a global RegExp searching for "you" gets run by a method like exec or test again and again until the last match, the number stored in lastIndex after that last match would be... 25 !
After the first match only, it would be... 7! Mind that in standard engines only a global RegExp updates lastIndex: a RegExp without the g flag leaves it at zero, although some older browsers are reported to have updated it (to 7, in our case) even without the flag.
The table shows where the matches end (index 6 for the first you, index 24 for the last one): the property stores the index soon after!
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| H | E | Y |   | Y | O | U |   | F | O | L  | K  | S  |    | H  | O  | W  |    | A  | R  | E  |    | Y  | O  | U  |
|
|
|
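The positions in the table can be checked with a short sketch: a global RegExp is run with exec until it finds nothing more, and lastIndex is recorded after each match. The names (positions, single) are mine, and the last lines assume a standard engine, where a RegExp without the g flag never moves lastIndex:

```javascript
var text = "hey you folks how are you";
var re = /you/g;
var positions = [];
var m;

// Each successful exec moves lastIndex right past the match just found.
while ((m = re.exec(text)) !== null) {
  positions.push({ start: m.index, lastIndex: re.lastIndex });
}
// The failed exec that ended the loop put lastIndex back to zero.
var afterFailure = re.lastIndex;

// Without the g flag, lastIndex stays where it was at creation.
var single = /you/;
single.exec(text);
var singleLastIndex = single.lastIndex;
```

The recorded pairs are start 4 with lastIndex 7, then start 22 with lastIndex 25: each lastIndex is the start plus the three characters of "you".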
Let's still imagine we're running global searches on our previous instance String:
"hey you folks how are you"
and that we're searching for the last "you" by a trivial RegExp like: /you/g
The most relevant thing you have to keep in mind is the following (forgive the wording) absurdity: although you can usefully read these properties only after some RegExp has been run on some String, you cannot read them from the variable name you initialized as a new RegExp (in other words, you cannot call: foo.input)
but conversely you have to read them through the RegExp Object itself, as if you were calling in the gods: exactly like the following, no placeholders:
RegExp.input
That positively lets you read the properties of the LAST regular expression that has undergone some method - so, obviously, if several different Regular Expressions named foo1, foo2, foo3 or so exist and all of them have undergone some use on some string, the input in the example would refer to the last specific RegExp embodiment called upon to perform a duty.
- input: RegExp.input
Technically, for the last run Regular Expression in the family of Regular Expressions alive in the document, this property returns the input String it has been run upon. This holds true.
Additionally, let me quote what Danny Goodman says on it (I consider his JavaScript Bible the ultimate reference on these issues): I have not tested all these statements, but it is some small work you can do yourself if you're really interested.
Goodman says about this property that if a RegExp is called upon from an html object like:
- LINK
- SELECT form field
- TEXT form field
- TEXTAREA form field
namely from an event handler such as onChange, then such RegExp.input property stores the following:«If a text, textarea, select or link objects contains an event handler that invokes a function containing a Regular Expression, (...) the input property becomes the content of the [html] object [whose event handler invoked the function containing the RegExp]; for the select object it is the text (not the value) of the selected option; for a link it is the text of the link, [for the other objects it is the content of the field]»
I have tested this behaviour with a link and a textarea, and this time it did not seem to hold entirely true: RegExp.input returned an empty string although both the testing textarea and the link had text in them, and an onClick event handler calling a function that runs a globally defined RegExp on a trivial task ignited inside the function, just to be sure the RegExp had done something, as the documentations (I'm not referring exclusively to Danny Goodman here) want.
The explanation I could think of is that these behaviours hold true only if a match is produced: as a matter of fact, if a match is found, then these contents are returned as predicted. None the less, the apparent assumption in the documentations is that the text of a textarea should be returned anyway, namely not only in case of a successful match but just as a consequence of having searched for a match and thus having activated a RegExp: in fact, the text objectively present within a textarea should exist anyway and thus should be liable to be returned. So, when handling them from events on forms, rely on them only in case they produce successful matches.
- lastMatch: RegExp.lastMatch
The last found match: in our case it cannot be but "you", but if we had used a more generic regular expression searching not for a highly specific string but for a pattern, this property lastMatch would just store the last specific string whose characteristics matched the requested pattern.
If no match has been found when the last regular expression in the document has run, this property returns an empty string (in engines that update these properties only upon a successful match, it keeps instead the value left by the last successful match, and is empty only if nothing has ever matched).
- leftContext: RegExp.leftContext
All the String that is on the left, namely before the last found match in the input String for the last run Regular Expression in the family of Regular Expressions alive in the document:
"hey you folks how are "
If no match has been found when the last regular expression in the document has run, this property returns an empty string.
- rightContext: RegExp.rightContext
All the String that is on the right, namely after the last found match in the input String for the last run Regular Expression in the family of Regular Expressions alive in the document:
""
yeah, an empty string in our example, for after the last found you the input String finishes.
If no match has been found when the last regular expression in the document has run, this property returns an empty string.
- $1 ... $9: RegExp.$2
As you may remember, a list of the parenthesized and therefore captured elements: unlike the method named exec, which can store as many fragments as happen to get captured, this property has a capacity limit (due to the heavens know what) that stops at 9, starting from 1.
I believe the first 9 are implied, not the last 9, if there are (unlikely indeed) several dozens of parenthesized captures.
If no match has been found when the last regular expression in the document has run, these properties return an empty string.
- lastParen: RegExp.lastParen
Returns the last parenthesized element captured (Paren stands for parenthesis, not for parent: I stress this for in JavaScript some objects meant, say, for XML do have a property whose name begins with parent, like: parentNode). If there are for instance two sets of capturing parentheses, it would return the equivalent of $2, in simpler words.
If no match has been found when the last regular expression in the document has run, this property returns an empty string.
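A sketch that reads the properties listed above through the RegExp constructor itself, on the same sample string. It assumes the engine still exposes these old static properties (current browsers and Node.js do, as far as I know) and that it leaves them as set by the last successful match. The names seen and captures are mine:

```javascript
var text = "hey you folks how are you";
var re = /you/g;

// Run the global RegExp until it finds nothing more: the last
// successful match is the second "you".
while (re.exec(text) !== null) {}

// Read through the RegExp Object itself, not through the variable re.
var seen = {
  input: RegExp.input,
  lastMatch: RegExp.lastMatch,
  leftContext: RegExp.leftContext,
  rightContext: RegExp.rightContext
};

// Another RegExp, with two captured elements, now becomes the last one run.
/(\w+) (\w+)/.exec("hey you");
var captures = {
  first: RegExp.$1,
  second: RegExp.$2,
  lastParen: RegExp.lastParen   // same as $2 here: the last captured element
};
```

Since every RegExp in the document writes to the same shared properties, it is safer to copy what you need into your own variables right after the match, as done here.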
|
|