WORD STATS AND DUPLICATE REPLACER
A function to gather statistics about word recurrences and to detect and replace duplicates
The first function featured here is called wordStats. It is a typical case of a sought-after function that looks simple and yet can cause significant headaches when you try to implement it. It caused me far more difficulty than I first expected.
The purpose of the function is to detect consecutive duplicated words, like, say:
was was
is is
and the like: first to report at what char index they are located, and then to remove them if requested (by passing an argument to the function, as we'll see).
Since the indexes of the duplicated instances are reported, it is important to understand that the reported index is that of the first occurrence of the duplicated word.
So for instance:
I said said that
would report 2 as the index of the duplicate: indexes are counted as usual starting from zero, and in this case 2 is the index of the first char of the first instance of the repeated word "said", namely s.
The (unfortunate indeed, I'd say) case of multiple duplicates such as, say:
I was was was
would be reported as one index: that of the beginning of the whole run of repeated words, in the example the index of the first w. The amount of interposed white space does not interfere with the detection.
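The reporting rule described above can be sketched as follows. This is only a minimal illustration, not the code of wordStats itself: the name findDuplicateRuns and the shape of the returned objects are assumptions made up for the example.

```javascript
// Hypothetical helper (not the author's wordStats): reports where each run
// of consecutively repeated words starts.
function findDuplicateRuns(input) {
  // \b(\w+)\b captures one word; (?:\s+\1\b)+ requires one or more further
  // copies of it, separated only by whitespace. The "i" flag lets
  // "Was was" count as a repetition too.
  const pattern = /\b(\w+)\b(?:\s+\1\b)+/gi;
  const runs = [];
  let match;
  while ((match = pattern.exec(input)) !== null) {
    // match.index is the position of the first char of the first copy.
    runs.push({ word: match[1].toLowerCase(), index: match.index });
  }
  return runs;
}

console.log(findDuplicateRuns("I said said that")); // one entry, index 2
console.log(findDuplicateRuns("I was   was was"));  // one entry for the whole run
```

A run of three or more copies yields a single entry, because the repeated group consumes the whole run in one match.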
Duplicates separated by punctuation are not considered duplicates by default, but you can instruct the function to consider them as such.
That is, things like:
How are You? You look good.
By default the substring You? You is not considered a duplicate; in the same manner, something like Well, well, where a punctuation mark is interposed, won't be considered a duplicate by default either.
We are on debatable ground when we consider potential duplicates with a punctuation mark in between: is it a mistake, or a new sentence?
Making a decision could jeopardize fully valid data from the input string. So I made this behaviour optional - you can set it in the arguments you pass to the function, as you'll see shortly.
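To make the optional behaviour concrete, here is a rough sketch of a detector that also accepts one interposed punctuation mark. The function name and the set of punctuation chars are assumptions chosen for the example; the real function may recognize a different set.

```javascript
// Hypothetical helper: like a plain duplicate detector, but the copies may be
// separated by whitespace OR by exactly one punctuation mark (with optional
// whitespace around it).
function findDuplicatesAcrossPunctuation(input) {
  // Separator alternatives: "\s*[.,;:!?]\s*" (one punctuation mark) or "\s+".
  const pattern = /\b(\w+)\b(?:(?:\s*[.,;:!?]\s*|\s+)\1\b)+/gi;
  const found = [];
  let match;
  while ((match = pattern.exec(input)) !== null) {
    found.push({ word: match[1].toLowerCase(), index: match.index });
  }
  return found;
}

console.log(findDuplicatesAcrossPunctuation("How are You? You look good."));
console.log(findDuplicatesAcrossPunctuation("Well, well"));
```

Two punctuation marks in a row, as in "You?? You", are not bridged by this pattern, in line with the "one punctuation" rule.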
Keep in mind that the function reports the indexes of every word instance as well as the indexes of every duplicate. So if, let's imagine, the word "hallo" has duplicates, you can compare the indexes of "hallo" reported in the duplicates section with those reported in the words section, and work out the differences and/or intersections between the two.
Of course, there are also occasions when a duplicated set of words is intentional and does not represent a mistake, like in the case of:
that that
These latter cases will be reported correctly, but it is up to you to decide whether to replace them or not. How could I guess it? Computers are good at counting, not at assigning gestalt meaning.
By and large, removing a duplicate like "that that" does not alter or compromise the meaning of a sentence in the least as far as English is concerned, but of course it all depends on the language you're dealing with and its grammatical rules, which a function cannot guess.
So if you instruct the function to remove duplicates, it will just remove them all, including things like "that that", though as you will see I have arranged it so that you can pass a few exceptions.
This means the function can be a bit long, but let me state again my scripting policy here: when I feature a function, I provide you with as many capabilities as I can think of, for it is easier to remove them later if you really don't need or don't want them than to recast the whole function in order to add them.
I now describe the arguments that you can pass to the function, which will help you understand its whole range of features. It's rather complete.
Arguments passed in this order:
- input: your input String you want to check for the duplicates.
If this were PHP, it might have been a file line grabbed by fgets().
- removeFoundDuplicates: defaults to zero.
By default the function will only report the indexes of the duplicates. If you also want to remove them, pass this argument as number 1.
- skipPunctuations: if passed as number 1, it instructs the function to also consider as duplicates those repeated words separated by whitespace where one punctuation mark (a dot, a comma, a question or exclamation mark) is interposed, or which are separated by just one punctuation mark even if no whitespace is present.
Let me stress I said one punctuation.
By the way, the function will also remove duplicated punctuation marks, but it cannot be predicted which will be removed first, the duplicated words or the duplicated punctuation marks.
- exceptionsArray: an Array of Strings, each representing an exception, namely a word which, if found duplicated, must not be removed (though it will still be reported among the indexes of the duplicates).
Such words can be in any case: they will all be converted to lowercase, so each exception applies to its word in whatever case that word is found within the input text.
The exceptions affect only the removal (that is, exceptions will not be removed even if duplicated), not the reporting, which will still list these instances among the indexes flagged as duplicates. I decided so because, if you only want the indexes returned and nothing removed (which is achieved by passing removeFoundDuplicates as zero, or not passing it at all), you probably want to verify the duplicates personally at a second stage or in a second tier of the application, and I didn't want to prejudge your evaluation of any potential duplicate.
The array will be converted to an associative one if it has not already been passed as such. If you pass it as associative already, it is your responsibility to make sure all the keys are lowercase and none of the values amounts to false (like zero, null and the like). If you don't know what all this means, just ignore this detail.
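The handling of exceptions described in the list above might look like the following sketch. The helper names normalizeExceptions and isException are invented for the illustration and are not part of the original function.

```javascript
// Hypothetical helper: turns ["That", "HAD"] into a lowercase lookup table.
// An object that is already keyed by word is used as passed, so its keys are
// expected to be lowercase and its values truthy.
function normalizeExceptions(exceptions) {
  if (!Array.isArray(exceptions)) {
    return exceptions || Object.create(null);
  }
  const lookup = Object.create(null); // no inherited keys such as "constructor"
  for (const word of exceptions) {
    lookup[String(word).toLowerCase()] = true;
  }
  return lookup;
}

// Hypothetical helper: true when the word must be reported but not removed.
function isException(lookup, word) {
  const key = word.toLowerCase();
  return Object.prototype.hasOwnProperty.call(lookup, key) && Boolean(lookup[key]);
}

const exceptions = normalizeExceptions(["That", "HAD"]);
console.log(isException(exceptions, "THAT")); // true
console.log(isException(exceptions, "was"));  // false
```

A pre-built table such as { that: 0 } would not protect "that", since the stored value amounts to false.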
The function returns an array of three entries:
- output[0]: it is the input itself; if removal was requested, this will be the input but with the duplicates removed as per request.
- output[1]: this is itself an array. It is an associative array, meaning it can be scanned only through a for-in loop.
Each key is a word (or as said a punctuation) present in the input string, in its lowercase version.
The held value is an array again, representing the set of indexes where such a word or punctuation was located, therefore an expression like:
output[1]["hallo"].length
will report how many instances of that word are in the input string.
It required adding only one line of code to achieve this, so including this statistical feature was natural.
- output[2]: this is itself an array. It is an associative array, meaning it can be scanned only through a for-in loop.
Each key is a word (or, as said, a punctuation mark) that showed a duplicate, stored in its lowercase version.
The value held by each of these entries is again an array, each item being a number: the index in the original input string where a duplicate is located.
So, for instance:
output[2]["hallo"].length
would tell how many duplicates of the word "hallo" there are.
To get an index of a duplicate:
output[2]["hallo"][0 or a Number within length-1]
that would yield a number on its own, which would be one index for one of the duplicates found for the word "hallo" (if any, of course).
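As an illustration of the word-to-indexes layout described for output[1], the sketch below builds a comparable structure and scans it with a for-in loop. It is an assumption-laden stand-in, not the original code: collectWordIndexes is a made-up name, and the token pattern (words plus a handful of punctuation marks) is a guess at what counts as an entry.

```javascript
// Hypothetical helper: maps every lowercase word (or single punctuation mark)
// to the list of indexes at which it starts in the input.
function collectWordIndexes(input) {
  const indexes = Object.create(null);
  const token = /\w+|[.,;:!?]/g;
  let match;
  while ((match = token.exec(input)) !== null) {
    const key = match[0].toLowerCase();
    if (!indexes[key]) indexes[key] = [];
    indexes[key].push(match.index);
  }
  return indexes;
}

const stats = collectWordIndexes("Hallo world, hallo again");
for (const word in stats) {
  // stats[word].length is the number of instances of that word.
  console.log(word, stats[word].length, stats[word]);
}
```

With this input, stats["hallo"].length is 2 and stats["hallo"][1] is 13, the index of the second instance.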
TECHNICAL NOTES -you can skip if uninterested-
The function will replace all duplicated white spaces with a single white space per each set.
The function initializes an internal private method, though the function is not a class; it does so by an unusual statement like:
window["wordStats"]["_imatch"]=function(){/*blah blah*/}
Such a private method, initialized in so unorthodox a manner, will be correctly recognized by the whole window scope as belonging to the function's scope only, as tests with the typeof operator can demonstrate.
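The technique of hanging a helper on the function object can be sketched like this. The sketch refers to the function by its own name instead of going through window so that it also runs outside a browser; wordStatsSketch and the body of _imatch are assumptions made for the example.

```javascript
// Hypothetical stand-in for the real function: it delegates to a helper that
// is stored as a property of the function object itself.
function wordStatsSketch(input) {
  return wordStatsSketch._imatch(input, "was");
}

// The helper counts case-insensitive whole-word occurrences. The word is
// assumed to contain plain letters only (no regular expression metachars).
wordStatsSketch._imatch = function (text, word) {
  const found = text.match(new RegExp("\\b" + word + "\\b", "gi"));
  return found ? found.length : 0;
};

console.log(typeof wordStatsSketch._imatch); // "function"
console.log(typeof _imatch);                 // "undefined": no global of that name
console.log(wordStatsSketch("I was Was here"));
```

Note that such a helper is namespaced under the function rather than strictly hidden: it stays reachable through the function object, but it does not add a second name to the global scope.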
You can judge how immensely hostile this function has been by considering the following issue, which concerns the regular expression used to detect doubled words:
(word_here\\b\\s*){2,}
Focus on the expression \\b\\s*: if you omit the \\s* (which means optional whitespace after the word - we'll see soon why it can only be optional), the regular expression leaves no room for whitespace between the copies, so the only candidates left would be fully adjoining copies such as:
hallohallo
This is unacceptable, and thus I had to add the \\s portion.
Yet, because of the \\b, the regular expression will not treat two fully adjacent repeated words as copies - which I thought could be ok after all, for there exist words like "gogo". But it has been a difficult choice. So keep in mind that with this function things like "foofoofoo" will not be considered copies.
If conversely you omit the \\b (word boundary) portion and instead require a whitespace after each repeated word (I said each), then a duplicate followed by a punctuation mark rather than by a whitespace would not match: that is, the following would not be considered a match:
hallo hallo.
because the second hallo ends with a dot rather than with the expected whitespace!
At the same time, this implies that things like
hello, hello
or
hello.hello
will not be considered duplicates, though you can force the function to consider them as such by passing the argument named skipPunctuations.
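The behaviour of the pattern discussed in these notes can be checked directly. The sketch below builds the expression from a string, which is why the backslashes are doubled; buildDuplicatePattern is a made-up name, the "i" flag is an assumption, and the word is assumed to contain plain letters only.

```javascript
// Hypothetical helper: builds (word\b\s*){2,} for a given word.
function buildDuplicatePattern(word) {
  return new RegExp("(" + word + "\\b\\s*){2,}", "i");
}

const withBoundary = buildDuplicatePattern("hallo");
console.log(withBoundary.test("hallo hallo."));  // true: \b accepts the dot
console.log(withBoundary.test("hallohallo"));    // false: no boundary between copies
console.log(withBoundary.test("hallo, hallo"));  // false: the comma is in the way

// Variant without \b that requires whitespace after every copy.
const whitespaceOnly = /(hallo\s+){2,}/i;
console.log(whitespaceOnly.test("hallo hallo."));  // false: the last copy ends with a dot
console.log(whitespaceOnly.test("hallo hallo "));  // true
```

The second variant shows why the whitespace has to stay optional once the word boundary is in place.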
You can imagine what a headache it was to provide this function to you for free. I had to struggle among nearly equivalent options, and eventually I had to make a decision.
Here follows your function code and then the Test Form so that you can get acquainted with the function:
THE UNIVERSAL ALPHABETICAL SORTER
Sort an array even by exotic alphabets not built into your operating system
Whether you are aware of it or not, one of the main issues when sorting an array, namely ordering it alphabetically, is this: you can sort it only by the set of characters that your operating system recognizes as native.
Therefore, if you are not German and you have strange chars like, say, ß, your computer will most likely be unable to guess where exactly a word including that char must be inserted: is that a b perhaps, or does it stand for something similar to an s?
Things get even worse with the Russian alphabet, for instance, unless you are Russian or you have a Russian version of the operating system.
This script solves the problem in a simple manner: given an input array, you can specify a string of alphabetical chars, as long as these are arranged in the order you consider the alphabetical one (with, by the way, the extreme consequence that you could even pass an alphabet ordered entirely after your own whim).
These chars must include both the uppercase and lowercase versions of each char - when possible - and must also include numbers.
Let me repeat: the order in which the chars appear in the string must be the same order in which you would rank them alphabetically.
The function has built in a default such string, which is this:
-_.,:;!?0123456789AaBbCcDd&EeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz
Though you may not notice it, it starts with a whitespace: a white space must be included too.
That string gives an idea of what your alphabet should look like: any exotic additional char can be included as long as you know exactly where to position it.
Then you can sort any array by any alphabet.
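The idea of ranking chars by their position in a caller-supplied alphabet can be sketched as follows. This is not the code of alphasorter: sortByAlphabet is an invented name, it keeps the ranking in a closure instead of a window scope variable, and placing unknown chars last is an assumption.

```javascript
// Hypothetical helper: sorts a copy of the array by the order in which chars
// appear in the given alphabet string.
function sortByAlphabet(words, alphabet) {
  const rank = new Map();
  Array.from(alphabet).forEach((ch, position) => rank.set(ch, position));
  // Chars missing from the alphabet are ranked after all known ones.
  const rankOf = (ch) => (rank.has(ch) ? rank.get(ch) : alphabet.length);

  return words.slice().sort((a, b) => {
    const shared = Math.min(a.length, b.length);
    for (let i = 0; i < shared; i++) {
      const diff = rankOf(a[i]) - rankOf(b[i]);
      if (diff !== 0) return diff;
    }
    return a.length - b.length; // a shorter prefix comes first
  });
}

// Illustrative alphabet with ß placed right after s.
const germanLike = " 0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsßTtUuVvWwXxYyZz";
console.log(sortByAlphabet(["Mat", "Maß", "Mas"], germanLike)); // [ 'Mas', 'Maß', 'Mat' ]
console.log(sortByAlphabet(["a", "b", "c"], "cba"));            // [ 'c', 'b', 'a' ]
```

The second call shows the "own whim" case: the alphabet is simply whatever order the string spells out.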
The function takes in the following arguments:
- input: must be an array, for the sorting relies on the array's built-in sort.
- alphabet: an ordered alphabetical string, in case you want to override the default one (an associative array in which each key is a char and each value is the number representing its ordinal position is also fine).
- alphabethSplitter: defaults to an empty string, and is the char on which the passed alphabet will be split: as hinted, the alphabet is passed as a string for your convenience, but internally it becomes an associative array.
- fromCharCode: an advanced option. If passed as number 1, it means that the alphabet is not a list of chars but of char codes (and presumably the splitter will be a white space): each char will then be derived from its char code.
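How the three alphabet-related arguments could combine into an associative array is sketched below. The function name buildAlphabetMap is an assumption, and so is the exact way the original treats its arguments; only the general idea (split, optionally decode char codes, store ordinal positions) comes from the descriptions above.

```javascript
// Hypothetical helper: turns an alphabet string into { char: position }.
function buildAlphabetMap(alphabet, splitter, fromCharCode) {
  const items = alphabet.split(splitter === undefined ? "" : splitter);
  const map = Object.create(null);
  items.forEach((item, position) => {
    // With fromCharCode set, each item is a numeric char code, not a char.
    const ch = fromCharCode ? String.fromCharCode(Number(item)) : item;
    map[ch] = position;
  });
  return map;
}

console.log(buildAlphabetMap("abc"));
// Char codes 1041, 1042, 1040 are the Cyrillic Б, В, А, here deliberately
// ranked in a non-standard order.
console.log(buildAlphabetMap("1041 1042 1040", " ", 1));
```

Passing the codes instead of the chars can be handy when the chars themselves are awkward to type or to store in the source file.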
Note: the function automatically creates a window scope variable named alphasortAlphabeth: make sure it does not clash with other globally defined variable names (unlikely as that is).
Creating this window scope variable is somewhat necessary because the function takes advantage of the built-in sort support; and when a comparison procedure is passed to the sort as I do here, the only scope visible from within that procedure is not the internal one of the alphasorter but the broader one of the window Object: the order of the alphabet is stored there, in the window scope, in the shape of an associative array.
Here is your function code and a Test Form with a few exotic chars so that you can have some fun too. The name of the function is alphasorter
THE NUMBER FORMATTER
Format a Number
This is a simple function that many nevertheless want. It correctly formats a number of arbitrary length, taking into account that it may be signed (that is, prefixed by a plus or minus sign), that it may or may not have a decimal part, or that both things may occur (decimal and signed).
In short, you can add the thousands separators - correctly, and with a somewhat elegant stack unshift procedure. As a bonus, you can choose whether to add the separators to the decimal part too, which is normally not required.
The name of the function is numberFormat and takes in these arguments:
- number (also a string of only numbers is ok).
- separator: defaults to a comma.
- formatFloatingPortionToo: a self-explanatory argument name, I guess.
The function returns false if what you pass includes chars other than a sign and digits and one dot for the floating part.
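A minimal sketch of the behaviour described above follows. It is not the code of numberFormat: the name formatNumberSketch is invented, and grouping the decimal part from the left in chunks of three is an assumption, since the description does not say how that part is split.

```javascript
// Hypothetical helper: adds a separator every three digits of the integer
// part, keeping an optional sign and an optional decimal part.
function formatNumberSketch(number, separator, formatFloatingPortionToo) {
  const sep = separator === undefined ? "," : separator;
  // Accept only: optional sign, digits, and optionally one dot plus digits.
  const parts = /^([+-]?)(\d+)(?:\.(\d+))?$/.exec(String(number));
  if (!parts) return false;

  // Walk the integer digits from the right, unshifting chunks of up to three.
  const stack = [];
  for (let end = parts[2].length; end > 0; end -= 3) {
    stack.unshift(parts[2].slice(Math.max(0, end - 3), end));
  }
  let result = parts[1] + stack.join(sep);

  if (parts[3] !== undefined) {
    const fraction = formatFloatingPortionToo
      ? parts[3].match(/\d{1,3}/g).join(sep) // assumption: grouped from the left
      : parts[3];
    result += "." + fraction;
  }
  return result;
}

console.log(formatNumberSketch("-1234567.891"));        // -1,234,567.891
console.log(formatNumberSketch("1234.56789", ",", 1));  // 1,234.567,89
console.log(formatNumberSketch("12a4"));                // false
```

Numbers passed as strings avoid any surprise from the way very large or very small Number values are converted to text.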
Here is the code and a small test form.