THE WEIRD THINGS INSIDE THE THING
What this script was written for, and why it may therefore carry some strange things
Nothing works as smoothly as a compiler written for one specific language.
Therefore this script might not. A compiler is, among other things, basically this: a program that checks the syntax correctness of a given programming language; it is therefore usually highly specific, and the more specific it is, the fewer the challenges in preparing it.
The purpose I was after with these subroutines was to provide you with a JavaScript utility fit to parse scripting syntaxes (for instance, pasted into a textarea or typed in by a user) and consequently check their correctness for educational purposes. Since the compiler is generic, by correctness I mean that no nesting errors are present in the defined tuples (= couples) of syntax elements, and that no isolated element is present where you'd expect a defined tuple (= couple).
If you want a highly JavaScript-specific compiler written in JavaScript, I may suggest the excellent The JavaScript Verifier by Douglas Crockford.
We could envision a website where you may want to introduce a few forms to let your users exercise with some challenging syntax, and therefore the first requirement would be the correctness of the basics of such syntax.
Since I didn't know beforehand what language you might want to do this with, there we go: I had to devise a compiler as general as possible, capable of allowing in:
- Scripting and programming syntaxes, mainly based on checking the correctness of the nesting of parentheses.
- HTML tags.
- A third unknown entity: starting items followed by corresponding closing items as defined by you.
All this came with a price: I had to nest into the script some code that crafts regular expressions in a multipurpose way, and that at times produces expressions like:
\s*\s*
which means: some white space or not, followed by some white space or not. It is a repetitive tautology which does not affect the functionality at all, since both repetitions are optional, but which may nonetheless be considered queer. So if you are scanning the code and wondering about these things, please consider that my challenge was not to make a compiler for a specific language, but an educational tool that might be tailored to compile unknown stuff.
Last but not least, consider what a hell it is to deal with regular expressions that must work on an unknown type of incoming text (unknown opening items followed by unknown sets of closing items). Some of these items may carry within themselves characters doomed to interfere with regular expression keywords, and when that happens I cannot guarantee that the inner workings will always be able, for instance, to guess that an escape char has to be escaped as well. I can only tell you that it surely works for standard programming syntaxes and for HTML, and grants some degree of compatibility for unknown third types, although at times unavoidable compromises have been set up inside the script. For instance, the regular expressions are forcibly rewritten to accommodate items that interfere heavily with regular expressions, as scripting comment syntaxes do: /* or */ which, inside regular expressions, given the asterisk, would unavoidably amount to unexpected quantifiers. And no, escaping them was not enough, for I had to let in that \s*\s* stuff as well. Try to tackle this riddle yourself, and you will eventually see what tautologies it traps you into.
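Both claims in the paragraph above can be checked with a small sketch. It is not taken from the script: escapeForRegExp is a hypothetical helper written only for this illustration, and the sample strings are made up.

```javascript
// Hypothetical helper (not part of the original script): put a backslash
// before every character that has a special meaning in a regular expression.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// 1) "\s*\s*" accepts exactly the same texts as a single "\s*".
var doubled = new RegExp("^a\\s*\\s*b$");
var single = new RegExp("^a\\s*b$");
var samples = ["ab", "a b", "a   b", "a\n b", "a-b"];
var sameAnswers = samples.every(function (sample) {
  return doubled.test(sample) === single.test(sample);
});
console.log(sameAnswers);

// 2) An unescaped "/*" is read as "zero or more slashes", so it matches
// any text at all, even one with no comment opener in it.
var unescaped = new RegExp("/*");
var escaped = new RegExp(escapeForRegExp("/*"));
console.log(unescaped.test("plain text"));
console.log(escaped.test("plain text"));
console.log(escaped.test("a /* b */"));
```

The escaped version only matches where a literal /* occurs, which is the behaviour a comment opener needs.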
So this is definitely one of those cases where the availability of my Test Form (a feature that other sites dealing with free scripts do not offer at all, to date) can be of great help in intensively checking the behaviour of this subroutine on unknown/unexpected entities.
HOW IT WORKS
Understanding how it works by understanding how to pass to it its arguments
Here is how you have to pass the arguments to this script (the Test Form is further on): by understanding how to craft them, you'd understand the inner workings of this subroutine.
Arguments are:
| ARGUMENTS OVERVIEW |
| input |
It is the input text which has to be parsed for nesting errors or mistakes or missing closing entries or isolated ending entries. |
| priorityType |
Either 0 or 1.
- If 0, it means that the matching open/closed couples must just be even as far as their overall count is concerned.
- If 1, it means that the matching open/closed couples must be strictly parsed: that is, if something closes, it must be the last opened instance.
So something like: {[] ()} would be ok. But:
{ ( [ ) ] }
would not be, for as you see the set of round brackets closes before the set of square brackets closes, and the latter should have been closed earlier (or, conversely, the round brackets should have been closed later on).
This type of precedence is traditionally called LIFO, an acronym for Last In First Out: a rather deceptive way to say something much more comprehensive than just "out". The last object in is the first to be dealt with next (for instance, the last open parenthesis must be the first to be closed by its matching closing companion). Its conceptual complement is FIFO, First In First Out, another deceptive way to mean that the first object which gets in is also the first to be dealt with next, and which is of no interest in our case.
Typical LIFO objects are called piles or stacks.
Typical FIFO objects are called queues.
Therefore passing this argument named priorityType as 1 means: apply LIFO.
Passing it as 0 means: apply nothing actually, just make sure the overall closing/opening couples match and yield an even balance.
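The difference between the two modes can be sketched as follows. This is a simplified illustration and not the script's own code: countsBalance and strictlyNested are hypothetical names, and both handle single-character items only.

```javascript
// Hypothetical sketch of priorityType 0: only the overall counts must match.
function countsBalance(input, opening, closing) {
  return opening.every(function (open, i) {
    var opened = input.split(open).length - 1;        // occurrences of the opener
    var closed = input.split(closing[i]).length - 1;  // occurrences of its closer
    return opened === closed;
  });
}

// Hypothetical sketch of priorityType 1: LIFO, checked with a stack.
function strictlyNested(input, opening, closing) {
  var stack = [];
  for (var i = 0; i < input.length; i++) {
    var ch = input.charAt(i);
    if (opening.indexOf(ch) !== -1) {
      stack.push(ch); // last in...
    } else if (closing.indexOf(ch) !== -1) {
      var lastOpened = stack.pop(); // ...first out
      // The closer must be the partner of the most recently opened item.
      if (lastOpened === undefined ||
          opening.indexOf(lastOpened) !== closing.indexOf(ch)) {
        return false;
      }
    }
  }
  return stack.length === 0; // nothing may be left open
}

var opening = ["{", "[", "("];
var closing = ["}", "]", ")"];
console.log(countsBalance("{ ( [ ) ] }", opening, closing));
console.log(strictlyNested("{ ( [ ) ] }", opening, closing));
console.log(strictlyNested("{[] ()}", opening, closing));
```

An input with crossed brackets passes the count-only check and fails the stack check, which is why the strict mode is the one to use for nesting errors.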
|
| openingChars |
If neither this nor the next argument is passed, the default set of opening elements is the following:
{ [ ( /* <!--
You can pass your own elements instead, such as a set of HTML tags; or, in case you want to verify that every tag in an input is closed, the only element you pass as openingChars can be: <
This argument must be either:
- An Array, each entry of which is one of the starting items
- A String, therefore in between quotes, listing the elements
If you pass a string listing multiple elements, the function by default produces an Array out of it by splitting it into all the chars included in the string. Conversely, if you pass a String of elements each separated by a specific separator (say a typical comma), and you also pass as the splitter argument the String which separates them, the function splits the openingChars string at each splitter and produces an Array of the resulting elements.
If you are to pass HTML tags, beware: we have right here the main setback I was hinting at earlier. The function will do the following things:
- Each element which is not a word element (such as a symbol) is treated as allowing optional white spaces, if any, before or after it. So:
<font
would match even all the instances like:
< font
or
<    font
regardless of how many white spaces there may be next to the non-word symbols. (By "word", regular expressions lump together letters and digits alike, whether the digits are nested in, appended to or prepended to a set of letters: a questionable implementation which is not my fault and which, by the way, adds greediness to regular expressions' greediness. For more on this you may want to peek at: Understanding Greed In Regular Expressions.)
- Conversely, spaces are not allowed within alphabetical or alphanumerical words (which as such are not symbols), unless for some reason you include white spaces in them yourself.
Also, after the last letter you pass, the script requires (adds) a word boundary: namely, that letter may be followed by anything except another letter or digit.
So if you want to pass openingChars as a list of tags, do not include in them the final > symbol. Such an apparently irrational request has a sound basis indeed: a tag does not necessarily end with the tag name, but may well include a whole set of properties before closing. If you were to include the final > in a tag meant to be an instance of opening chars, the script would parse only those tags which include no properties at all: an extremely unlikely thing, you see.
In other words, you can see that the general Compiler has been written for general purposes, but obviously keeping an eye on the most likely tasks.
Instance of a String separated by commas to inspect a list of specified tags:
<a,<body,<table,<tr,<td
In such case, remember to pass priorityType as 1, and most significantly the argument named splitter (see further on) as: ","
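The white space and word boundary policies described for openingChars can be sketched like this. It is an assumption-laden illustration, not the script's real code: openingItemPattern is a hypothetical name, and its two flags only mimic what dontAllowSpaces and noWordBound are described as doing.

```javascript
// Hypothetical sketch: turn one opening item into a regular expression.
function openingItemPattern(item, dontAllowSpaces, noWordBound) {
  var gap = dontAllowSpaces ? "" : "\\s*";
  // Split the item into runs of word characters and single symbols.
  var pieces = item.match(/\w+|\W/g) || [];
  var source = pieces.map(function (piece) {
    if (/^\w+$/.test(piece)) {
      return piece; // words are taken as they are, no spaces allowed inside
    }
    // Symbols are escaped and may have optional white space around them.
    return gap + piece.replace(/[.*+?^${}()|[\]\\]/g, "\\$&") + gap;
  }).join("");
  if (!noWordBound && /\w$/.test(item)) {
    source += "\\b"; // the item's last letter must end a word
  }
  return new RegExp(source);
}

var fontTag = openingItemPattern("<font");
console.log(fontTag.test("<font size=2>"));
console.log(fontTag.test("<   font>"));
console.log(fontTag.test("<fonts>"));
```

Note that with this construction an item made of two symbols, such as /*, ends up with a doubled \s*\s* between them. That may be one way such a tautology arises, although the source does not say so.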
|
| closingChars |
It is the Array (or String, separated by the same separator as the opening chars) of elements that are the closing ones for the given opening chars.
Do not pass them randomly! They must be in the very same listed order of the opening chars!!
The same policies as above are applied as far as spaces are concerned, but no policy on word boundaries is applied. So if they are HTML tags, this time they have to include both the starting < and the closing >. Instance:
</a>,</body>,</table>,</tr>,</td>
|
| splitter |
If the opening and closing groups are passed as strings and the elements in the string are separated by a certain char, this separator must be passed as the splitter argument.
Known Issue: if you pass openingChars or closingChars as a string and they are meant to hold only one element (a possible chance, indeed) like, for instance, <a and </a>, keep in mind that if you pass no splitter argument, those strings are split char by char into a set of subsequent elements, like say: < and a
In order to avoid this, pass a splitter argument whenever you're using strings to pass the openingChars or closingChars arguments. In the case outlined just above, pass some fake splitter argument (even a comma would do: ","): the script will attempt to split the arguments and, finding no splitter in them to split with, it will keep the whole </a> (in our example) as one item (which is what you likely want). |
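The splitting behaviour, including the known issue, can be reproduced with a few lines. This is only a sketch of the rules as described above: toItemList is a hypothetical name and not a function of the script.

```javascript
// Hypothetical sketch: normalize openingChars or closingChars into an Array.
function toItemList(items, splitter) {
  if (Array.isArray(items)) {
    return items.slice(); // already a list, just copy it
  }
  // With no splitter the string is cut into its single characters.
  return items.split(splitter === undefined ? "" : splitter);
}

console.log(toItemList("{[("));          // one entry per char
console.log(toItemList("<a,<body", ",")); // one entry per tag
console.log(toItemList("<a"));           // the known issue: "<" and "a"
console.log(toItemList("<a", ","));      // fake splitter keeps "<a" whole
```

The last call shows why a splitter that does not occur in the string is enough to keep a single multi-character item in one piece.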
| dontAllowSpaces |
See openingChars: if passed as 1, doesn't allow white spaces in between symbols and letters. |
| noWordBound |
See openingChars: if passed as 1, doesn't demand any longer the word boundary after the opening letters. |
As far as the returned values are concerned, the script can return:
- A Number: it is 1. So if it just returns a number, it means it found no mismatches whatsoever.
- If some mismatch has been intercepted, the script invariably returns an array of 3 entries that provide all the data. More specifically:
- if you assign your function to a foo variable:
var foo=generalCompiler( various arguments )
Then foo would be, if mismatches have been detected, an array of three entries:
foo[0]
foo[1]
foo[2]
Such entries carry the following data:
- foo[0]
Those instances for which a closing item has been found but no starting item.
This entry is itself a literally indexed array, each literal index of which is one of the chars that you have passed as closingChars; if you passed none, remember that the default closing ones are }]) */ -->
Therefore you can call up one of these item reports as:
foo[0]["}"]
Note the quotes surrounding the } to indicate that we're pinpointing an entry whose index is a literal and not a traditional number. If you had passed among your closingChars something like: </a>, then:
foo[0]["</a>"]
Be sure not to include spaces or any difference from what you've used.
Each of these objects, again, is an Array, numerically indexed: each entry of the array holds the numerical index of the input text where the reported mismatch was located.
You may consider this complex only as long as you don't realize that's actually the smoothest way to deliver the highest amount of immediately searchable data.
So if you want to know how many instances of a closed item without an opening one (if any) have been found, grab the entry [0] of foo, then loop it by a for-in loop (remember, literal Arrays can only be scoured by a for-in loop, not by traditional loops: for more on this: looping hash (associative) arrays). Each entry that has a length holds the indexes of the mismatches found for the relative literally indexed instance:
// Report every closing item that was found without an opening one
for(var x in foo[0]){
    if(foo[0][x].length){
        alert("Mismatch indexes found for "+x+" are: "
            +foo[0][x].toString());
    }
}
- foo[1]
Those instances for which an opening item has been found but never a closing one.
This entry is itself a literally indexed array, each literal index of which is one of the chars that you have passed as openingChars; if you passed none, remember that the default opening ones are {[( /* <!--
Therefore you can call up one of these item reports as:
foo[1]["["]
Note the quotes surrounding the [ to indicate that we're pinpointing an entry whose index is a literal and not a traditional number. If you had passed among your openingChars something like: <a, then:
foo[1]["<a"]
Be sure not to include spaces or any difference from what you've used.
Each of these objects, again, is an Array, numerically indexed: each entry of the array holds the numerical index of the input text where the reported mismatch was located.
You may consider this complex only as long as you don't realize that's actually the smoothest way to deliver the highest amount of immediately searchable data.
So if you want to know how many instances of an open item without a closing one (if any) have been found, grab the entry [1] of foo, then loop it by a for-in loop (remember, literal Arrays can only be scoured by a for-in loop, not by traditional loops: for more on this: looping hash (associative) arrays). Each entry that has a length holds the indexes of the mismatches found for the relative literally indexed instance:
// Report every opening item that was never closed
for(var x in foo[1]){
    if(foo[1][x].length){
        alert("Mismatch indexes found for "+x+" are: "
            +foo[1][x].toString());
    }
}
- foo[2]
If priorityType is set as 1, this entry carries those instances for which a nesting error has been found.
This entry is itself a literally indexed array, each literal index of which is one of the chars that you have passed as closingChars; if you passed none, remember that the default closing ones are }]) */ -->
Therefore you can call up one of these item reports as:
foo[2]["}"]
Note the quotes surrounding the } to indicate that we're pinpointing an entry whose index is a literal and not a traditional number. If you had passed among your closingChars something like: </a>, then:
foo[2]["</a>"]
Be sure not to include spaces or any difference from what you've used.
Each of these objects, again, is an Array, numerically indexed: each entry of the array holds the numerical index of the input text where the reported mismatch was located.
You may consider this complex only as long as you don't realize that's actually the smoothest way to deliver the highest amount of immediately searchable data.
So if you want to know how many instances of an item have been closed in a wrongly nested manner (if any), grab the entry [2] of foo, then loop it by a for-in loop (remember, literal Arrays can only be scoured by a for-in loop, not by traditional loops: for more on this: looping hash (associative) arrays). Each entry that has a length holds the indexes of the mismatches found for the relative literally indexed instance:
// Report every closing item that was found in a wrongly nested position
for(var x in foo[2]){
    if(foo[2][x].length){
        alert("Mismatch indexes found for "+x+" are: "
            +foo[2][x].toString());
    }
}
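The three loops above can be folded into one helper that also covers the case where the script returns the number 1. What follows is a sketch under stated assumptions: describeMismatches and its labels are hypothetical, the foo value is made-up sample data shaped as described above, and plain objects stand in for the literally indexed arrays.

```javascript
// Made-up sample result, shaped like the three entries described above.
var foo = [
  { "}": [12, 40], "]": [], ")": [] }, // closing items with no opening one
  { "{": [], "[": [3], "(": [] },      // opening items never closed
  { "}": [], "]": [], ")": [27] }      // closing items wrongly nested
];

// Hypothetical helper: turn a returned value into readable report lines.
function describeMismatches(result) {
  if (result === 1) {
    return ["No mismatches found."];
  }
  var labels = [
    "Closing item without an opening one:",
    "Opening item never closed:",
    "Wrongly nested closing item:"
  ];
  var lines = [];
  result.forEach(function (entry, n) {
    for (var item in entry) {
      if (entry[item].length) {
        lines.push(labels[n] + " " + item +
          " at index(es) " + entry[item].toString());
      }
    }
  });
  return lines;
}

console.log(describeMismatches(foo).join("\n"));
console.log(describeMismatches(1).join("\n"));
```

Returning the lines instead of calling alert keeps the helper usable both in a page and in a console.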
That's all as far as the returned values are concerned. The only thing I could add is that the returned index(es) of the mismatches are the index(es) of the location(s) of the first char of each mismatching text. Also, consider that white spaces and even new lines (carriage returns, that is) are considered by digital languages as chars (yes!), so a carriage return is taken into account while counting positions for the index of an item's first offset. Location counting starts with zero, not with 1.
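A short illustration of how positions are counted, using only standard string methods; the input text is a made-up sample.

```javascript
// Every char counts, including spaces and line breaks, starting from zero:
// a(0) space(1) {(2) newline(3) space(4) space(5) b(6) newline(7) }(8)
var input = "a {\n  b\n}";
var openingIndex = input.indexOf("{");
var closingIndex = input.indexOf("}");
console.log(openingIndex, closingIndex, input.length);
```

So a reported index can be passed straight to methods like charAt or substring to show the offending spot to the user.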
Also it may be useful to stress again that if you include the argument openingChars, then the argument closingChars must include the closing partners in the same exact order as you passed/listed the opening ones. Since this is a general purpose scripting compiler (or syntax checker), I could not assume that, say, the closing partner of a starting anchor tag would be a closing anchor tag and therefore produce it without your approval, for the fewer the assumptions, the greater the generality.
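Why the order matters can be seen in a sketch of pairing by position. pairItems is a hypothetical helper, not part of the script, and the length check is an added assumption.

```javascript
// Hypothetical sketch: the closing partner of openingChars[i] is simply
// closingChars[i]; nothing is guessed from the items themselves.
function pairItems(openingChars, closingChars) {
  if (openingChars.length !== closingChars.length) {
    throw new Error("openingChars and closingChars must have the same length");
  }
  var pairs = {};
  openingChars.forEach(function (open, i) {
    pairs[open] = closingChars[i];
  });
  return pairs;
}

var rightOrder = pairItems(["<a", "<td"], ["</a>", "</td>"]);
var wrongOrder = pairItems(["<a", "<td"], ["</td>", "</a>"]);
console.log(rightOrder["<a"]);
console.log(wrongOrder["<a"]);
```

With the lists in a different order, an anchor tag would be paired with a closing table cell and every check would report mismatches.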
THE CODE & THE TEST FORM
Put the script to work on your own inputs
Your test form is below: it is a good tool to get acquainted with this script.