This article is an excerpt from the recently released book, Programming with Mathematica: An Introduction by Paul Wellin © 2013 [1]. The book, which follows on the well-known An Introduction to Mathematica Programming, provides an example-driven primer on the foundations of the Mathematica programming language.
Strings are used across many disciplines to represent filenames, data, and other objects: linguists working with text data study representation, classification, and patterns involved in audio and text usage; biologists dealing with genomic data as strings are interested in sequence structure and assembly and perform extensive statistical analysis of their data; programmers operate on string data for such tasks as text search, file manipulation, and text processing. Strings are so ubiquitous that almost every modern programming language has a string datatype and dozens of functions for operating on and with strings.
Introduction
In Mathematica, strings are represented by any concatenation of characters enclosed in double quotes.
Strings are also used to represent file names that you import and export.
Strings are used as arguments, option values, and as the output to many functions.
In this chapter we will introduce the tools available for working with strings in Mathematica. We will begin with a look at the structure and syntax of strings, then move on to a discussion of the many high-level functions that are optimized for string manipulation. String patterns follow on the discussion of patterns in Chapter 4 and we will introduce an alternative syntax (regular expressions) that provides a very compact mechanism for working with strings. The chapter closes with several applied examples drawn from computer science (checksums) as well as bioinformatics (working with DNA sequences) and also word games (anagrams, blanagrams).
9.1. Structure and Syntax
Strings are expressions consisting of a number of characters enclosed in quotes. The characters can be anything you can type from your keyboard, including uppercase and lowercase letters, numbers, punctuation marks, and spaces. For example, here is the standard set of printable Ascii characters.
Other character sets are available as well. For example, here are the lowercase Greek letters. These are typically entered from one of Mathematica‘s many built-in character palettes, or using a keyboard shortcut such as –a– for .
When Mathematica displays a string in output, it appears without the quotes. This is the default behavior of the formatting rules for OutputForm.
Use InputForm or FullForm to display these quotes in output.
Various predicates test whether a string consists entirely of letters, or uppercase and lowercase letters.
Use === (SameQ) to test for equality of strings.
Several functions are available for working with the structure of strings.
StringLength also works with lists of strings. In other words, it has the Listable attribute.
Character Codes
One way to work with strings is to convert them to a list of character codes and then operate on the codes using mathematical functions. Each character in a computer’s character set is assigned a number, called its character code. By general agreement, almost all computers use the same character codes, called the Ascii code. In this code, the uppercase letters A, B, …, Z are assigned the numbers 65, 66, …, 90 while the lowercase letters a, b, …, z have the numbers 97, 98, …, 122 (note that the number of an uppercase letter is 32 less than its lowercase version). The numbers 0, 1, …, 9 are coded as 48, 49, …, 57 while the punctuation marks period, comma, and exclamation point have the codes 46, 44, and 33, respectively. The space character is represented by the code 32. Table 9.1 shows the characters and their codes.
Here are the printable Ascii characters.
converts any string character char to its Ascii code.
You can also get a list of the characters in a range if you know how they are ordered by their character codes.
Characters from other languages can also be used, for example, Greek and Japanese.
Unicode charts for many languages are available online (for example, www.unicode.org/charts). With these charts you can find the hexadecimal code for characters in many different languages. For Gujarati, the first character in its code table has hex value 0A90. Here we convert from base 16 and then display the character.
Using the character code representation of characters, the following series of computations changes a word from lowercase to uppercase.
Or, simply use a built-in function that is designed specifically for this task.
Sorting Lists of Characters
As a practical example of the use of character codes, we will extend the simple sorting function from Chapter 4 to work with lists of string characters. Although written to operate on numbers, this rule can be overloaded to work on characters by making only a few small changes. Here is the original rule from Section 4.3.
The first change is to check that the patterns a and b have head String instead of testing for numbers with the predicate NumericQ. Second, instead of the numerical comparison b < a, we need to compare their character codes.
Here is a list of characters.
Here is the sort.
Section 9.5 explores the use of character codes to create hash tables, or checksums.
Ordered Words
When studying word or language structure, a common task is to find all words within a corpus that meet some criteria you are interested in. In this brief example, we will use character codes to search for words whose letters are “in order” when read from the first letter to the last. We will create a Boolean function OrderedWordQ that returns True or False depending upon whether its argument is in alphabetic order. So would return True but would return False. Then we will use this predicate to find all words in a dictionary that are ordered in this sense.
Start by getting a list of all words in the dictionary using DictionaryLookup.
Alternatively, you can use the data in WordData, which contains phrases in addition to words. You could use any similar resource for your list of words.
First, consider the character code of a string.
Then we only need to know if this list of codes is in order.
Here is a predicate that returns True if its argument is ordered in this alphabetic sense.
Now we will find all the words in the dictionary file that comes with Mathematica that are ordered in this way; we will use Select to return those words that pass the test. Finally, we randomly sample 40 of them.
Almost correct! In the English character code set, capitals appear before lowercase letters. So, although our words are ordered in the sense of character codes, they are not ordered in the commonly-used sense.
One approach to resolving this issue is to only work with words of the same case. We could either convert words of the form uppercase/lowercase to lowercase/lowercase or we could select only words from the dictionary that match a pattern that codes for this. We will wait until the discussion of string patterns in Section 9.3 to correct this issue.
Exercises
- Convert the first character in a string (which you may assume to be a lowercase letter) to uppercase.
- Given a string of digits of arbitrary length, convert it to its integer value. (Hint: you may find that the Dot function is helpful.)
- Create a function that takes a string as its argument and returns a list of the unique characters in that string. For example, should return .
9.2. Operations on Strings
Strings are expressions and, like other expressions (such as numbers and lists), there are built-in functions available to operate on them. Many of these functions are very similar to those for operating on lists. In this section we will first look at some of these basic functions for operating on strings and then use them on some nontrivial examples: analyzing a large piece of text, encoding strings, creating index variables, and finally, a word game for creating anagrams.
Basic String Operations
StringTake, which has a similar syntax to Take, is used to extract parts of a string. The second argument specifies the positions of the characters to extract. So, for example, this takes the first twelve characters in this string.
And this takes the last twelve characters from the string.
A list of the individual characters is returned by Characters.
StringJoin concatenates strings.
The shorthand notation for StringJoin is .
The following functions mirror those for list operations.
Some functions are quite specific to strings and do not have analogs with lists. For example, conversion to uppercase and lowercase.
This trims substrings from a string using alternative patterns (discussed further in Section 9.3). So if either "https://" or "/" is found, they will be trimmed.
Strings vs. Lists
For some computations, you might be tempted to convert a string to a list of characters and then operate on the list using some list manipulation functions. For example, this first constructs a list of the individual characters and then uses Count to get the number of occurrences of the letter B in the list of characters from the text of Charles Darwin’s On the Origin of Species.
Since the string functions in Mathematica are optimized for working on strings directly you will often find that they are much faster than the more general list manipulation functions.
This speedup results from the fact that the string pattern matching algorithms are operating only on a well-defined finite alphabet and string expressions are essentially flat structures, whereas the algorithms for more general expression matching are designed to operate on arbitrary expressions with potentially much more complicated structures.
Converting to lists and using list manipulation functions will often be more cumbersome than working with the string functions directly. For example, counting the occurrences of a word within a chunk of text by first converting to a list of characters would be quite indirect and computationally more taxing than simply using StringCount directly.
In fact, sometimes you will even find it more efficient to convert a numerical problem to one involving strings, do the work with string manipulation functions, and then convert back to numbers as in the subsequence example in Section 9.5.
Encoding Text
In this example, we will develop functions for coding and decoding strings of text. The particular coding that we will use is quite simplistic compared with contemporary commercial-grade ciphers, but it will give us a chance to see how to combine string manipulation, the use of functional programming constructs, and rule-based programming all in a very practical example that should be accessible to anyone.
The problem in encryption is to develop an algorithm that can be used to encode a string of text and then a dual algorithm that can be used to decode the encrypted message. Typically, the input string is referred to as the plaintext and the encoded output as the ciphertext.
To start, we will limit ourselves to the 26 lowercase letters of the alphabet.
One of the simplest encryption schemes is attributed to Julius Caesar who is said to have used this cipher to encode communications with his generals. The scheme is simply to shift each letter of the alphabet some fixed number of places to the left and is commonly referred to as a substitution cipher. Using Thread, we can set up rules that implement this shift, here just shifting one place to the left.
The decoding rules are simply to reverse the encoding rules.
To code a string, we will decompose the string into individual characters, apply the code rules, and then join up the resulting characters in a “word.”
Here is the function to accomplish this.
Similarly, here is the decoding function.
Let us try it out on a phrase.
In this example, we have shifted one position for each letter to encode (and decode). It is thought that Caesar (or his cryptographers) used a shift of length three to encode his military messages. In the exercises, you are asked to implement a different shift length in the encoding and decoding functions.
Even with longer shifts, the Caesar cipher is terribly insecure and highly prone to cracking since there are only 26 possible shifts with this simple cipher. A slightly more secure cipher involves permuting the letters of the alphabet.
Using Thread, we create a rule for each letter paired up with the corresponding letter from the permutation p.
Again, the decoding rules are obtained by simply reversing the above rules.
Although these substitution ciphers are not terribly difficult to crack, they should give you some good practice in working with strings and the various Mathematica programming constructs. Modern commercial-grade ciphers such as public-key ciphers are often based on the difficulty of factoring large integers. For a basic introduction to the history of ciphers, see Sinkov (1966) [2]. A more thorough treatment can be found in Paar and Pelzl (2010) [3].
Indexed Symbols
When developing algorithms that operate on large structures (for example, large systems of equations), it is often helpful to be able to create a set of unique symbols with which to work. As an example of operations on strings, we will use some of the functions discussed in this section to develop a little utility function that creates unique symbols. Although there is a built-in function, Unique, that does this, it has some limitations for this particular task.
One potential limitation of Unique is that it uses the first unused symbol of a particular form. It does this to avoid overwriting existing symbols.
However, if you want to explicitly create a list of indexed symbols with a set of specific indices, it is useful to create a different function. First, note that a string can be converted to a symbol using ToExpression or by wrapping the string in Symbol.
StringJoin is used to concatenate strings. So, let us concatenate the variable with the index, first with one number and then with a range of numbers.
We put all the pieces of code together.
Let us create an additional rule for this function that takes a range specification as its second argument.
Note that we have not been too careful about argument checking.
In the exercises you are asked to correct this.
Anagrams
Anagrams are words that have the same set of letters but in a different order. Good Scrabble players are adept at anagram creation. Anagrams can be created by taking a word, extracting and permuting its characters, and then finding which permutations are real words.
Start by getting the characters in a word.
Permute the characters.
Concatenate the characters in each list.
Now, which of these “words” are really words? One way to check is to select those that are in the dictionary. Those elements in words that are not in the dictionary will return when run against DictionaryLookup, so we omit those using ≠.
Putting all the pieces together, we have the function Anagrams.
Other than extracting the characters of a word and joining the permuted list of characters, the operations here are essentially those on lists (of strings) and pattern matching. Exercise 2 in Section 9.5 discusses a more direct approach to this problem, one that avoids the creation of permutations of the characters in the word.
Exercises
- Create a function that returns a value of True if its argument str is a palindrome, that is, if the string str is the same forward and backward. For example, refer is a palindrome.
- Create a function that takes a string str, and returns a string with the characters rotated to the left places. For example:
- In creating the function MakeVarList in this section, we were not careful about the arguments that might be passed. Correct this problem using pattern matching on the arguments to this function to insure that the indices are positive integers only.
- Create a function that pads the end of a string with whitespace characters. Then create a second rule that pads the string out to length . If the input string has length greater than , issue a warning message. Finally, mirroring the argument structure for the built-in PadLeft, create a third rule that pads with whitespaces at the front and whitespaces at the end of the string.
- Modify the Caesar cipher so that it encodes by shifting five places to the right. Include the space character in the alphabet.
- A mixed-alphabet cipher is created by first writing a keyword followed by the remaining letters of the alphabet and then using this as the substitution (or cipher) text. For example, if the keyword is django, the cipher text alphabet would be:
- Modify the alphabet permutation cipher so that instead of being based on single letters, it is instead based on adjacent pairs of letters. The single letter cipher will have permutations; the adjacent pairs cipher will have permutations.
9.3. String Patterns
Most of the string operations we have looked at up until this point have involved literal strings. For example, in string replacement, we have specified both the explicit string that we are operating on as well as the replacement string.
But the real power of programming with strings comes with the use of patterns to represent different classes of strings. A string pattern is a string expression that contains symbolic patterns. Much of the pattern matching discussed in the previous chapters extends to strings in a very powerful manner. For example, this uses patterns to change the first letter in a string to uppercase.
Or, use a conditional pattern to check if a word begins with an uppercase character.
To get started, you might find it helpful to think of strings as a sequence of characters and use the same general principles on these expressions as you do with lists.
For example, the expression matches the pattern because it is a list that starts with a sequence of one or more elements, it contains an element repeated once, and then ends with a sequence of one or more elements.
If we now use a string instead of a list and StringMatchQ instead of MatchQ, we get a similar result using the shorthand notation ~~ for StringExpression.
is shorthand notation for , which, for the purpose of pattern matching, represents a sequence of strings.
StringExpression is quite similar to StringJoin (both can be used to concatenate strings) except that with StringExpression, you can concatenate nonstrings.
The next example also shows the similarity between the general expression pattern matching that we explored earlier in Chapter 4 and string patterns. Using Cases, the following returns all those expressions that match the pattern _Symbol, that is, pick out all symbols from the list.
With strings we use StringCases whose second argument is a pattern that represents a class of characters to match. StringCases returns those substrings that match a given pattern. Many named patterns are available for various purposes. For example, LetterCharacter matches a single letter.
Match single digits with DigitCharacter and one or more digits with NumberString.
To see the generality and power of working with string patterns, suppose we were looking for a nucleotide sequence in a gene consisting of a repetition of A followed by any character, followed by T. Using a gene from the human genome, the following string pattern neatly does the job.
Here are the starting and ending positions of these substrings. StringPosition takes the same syntax as StringCases, analogous to Position and Cases.
And if you wanted to return those characters that follow all occurrences of the string "GTC", you can name the pattern and use a rule to return it.
In this example, the pattern is . This pattern is named pat and it consists of the string GTC which is then followed by any character. That character is named x so that we can refer to it in the replacement expression on the right-hand side of the rule. The replacement expression is the pattern pat concatenated with the character named x.
As another example of the use of string patterns, suppose you were interested in scraping phone numbers off of a web page; you need to construct a pattern that matches the form of the phone numbers you are looking for. In this case we use the form which matches the form of North American phone numbers. NumberString comes in handy as it picks up strings of numbers of any length. Otherwise you would have to use which matches repeating digits.
Finding Subsequences with Strings
In this section we will explore a related problem to the one in Section 4.3, where we searched for subsequences within a sequence of numbers. Here we will transform the problem from working with lists of digits to one where we work with strings.
Using pattern matching it is not too difficult to construct the pattern of interest. For example, suppose we were looking for the substring are within a larger string. Using the special named string pattern WordBoundary which matches the beginning or end of a word, we concatenate (StringJoin) the patterns we need. See Table 9.3 in the next section for a listing of other named patterns.
To start, we will prototype with a short sequence of digits of , converted to a string.
Check that the output is in fact a string.
For our purposes here, we are only interested in the digits following the decimal point. We can extract them by splitting the string of digits on the decimal point and then taking the second part of that expression. This will generalize for numbers with an arbitrary number of digits before the decimal point.
The subsequence 3238 occurs starting 15 positions to the right of the decimal point.
Collecting the code fragments, we turn this into a function.
Let us try it out on a more challenging example: finding occurrences of the sequence 314159 in the decimal expansion of to ten million places.
Comparing with the function that takes lists of digits developed in Section 4.3, our string implementation is about twice as fast.
Alternatives
We have already seen general patterns with alternatives discussed in Chapter 4. Here we will use alternatives with string patterns. The idea is quite similar. For example, a common task in genome analysis is determining the GC content or ratios of the nucleobases guanine (G) and cytosine (C) to all four bases in a given fragment of genetic material.
You could count the occurrences of G and the occurrences of C and add them together.
But it is much easier to use alternatives to indicate that you want to count all occurrences of either G or C. The syntax for using alternative string patterns is identical to that for general expressions that we introduced in Section 4.1.
We will return to the computation of GC content in Section 9.5.
As a slightly more involved example, suppose you are interested in tallying the lengths of words in a corpus. You might start by using StringSplit to split the large string into a list of words.
Looking at the result, you will see that some elements of this list include various types of punctuation. For example, StringSplit, with default delimiters, missed certain hyphenated words and some punctuation.
There are 149863 elements in this split list.
Fortunately, StringSplit takes a second argument that specifies the delimiters to match. The pattern is given as a set of alternatives followed by the repeated operator to catch one or more repetitions of any of these delimiters. Searching through the text will help to come up with this list of alternatives.
Notice that this list contains many more elements than the initial approach given above.
Finally, here is a histogram showing the distribution of word lengths in the text, On the Origin of Species.
Let us compare this with a different text: A Portrait of the Artist as a Young Man, by James Joyce [4] (available online at Project Gutenberg [5]). We are postprocessing here by removing metadata at the beginning and at the end of the file.
An alternative syntax uses a list of delimiters as given by Characters. The repeated pattern, .., helps to catch such constructions as “--“, “::” and double-spaces.
In the next section, on regular expressions, we will see that there are more compact ways of accomplishing some of these tasks.
Exercises
- At the end of Section 9.1 we created a predicate OrderedWordQ to find all words in a dictionary whose letters are in alphabetic order. This predicate used character codes and returned incorrect results for words that started with a capital letter. Correct this error by only selecting words from the dictionary that start with a lowercase letter. Consider using a conditional string pattern involving the built-in function LowerCaseQ.
- Given a list of words, some of which start with uppercase characters, convert them all to words in which the first character is lowercase. You can use the words in the dictionary as a good sample set.
- Create a function that finds all palindromic words of length in the dictionary. For example, kayak is a five-letter palindrome.
- Find the number of unique words in a body of text such as Alice in Wonderland.
After splitting the text into words, convert all uppercase characters to lowercase so that you count words such as hare and Hare as the same word.
Such computations are important in information retrieval systems, for example, in building term-document incidence matrices used to compare the occurrence of certain terms across a set of documents (Manning, Raghavan, and Schütze 2008 [6]).
9.4. Regular Expressions
In addition to the use of string patterns discussed up to this point, you can also specify string patterns using what are known as regular expressions. Regular expressions in Mathematica follow a syntax very close to that of the Perl programming language. This syntax is quite compact and powerful but it comes at the cost of readability – regular expressions tend to be quite cryptic to humans. As a result, we will only cover a few examples of their use here and refer the interested reader to the Mathematica documentation on string patterns (Working with String Patterns, WMDC [7]).
You should think of regular expressions as an alternative syntax for string pattens. To indicate that you are using a regular expression, wrap the expression in RegularExpression. For example, the regular expression . is a wildcard character. It matches any single character except a newline. To use it as a string pattern, write .
The string "abc" does not match the pattern because it does not consist of a single character.
You can also match a set or range of characters. For example, this matches any of the characters a through z.
Certain constructs give patterns with repeating elements. For example, "c*" is a pattern matched by a string with character c repeated zero or more times; "c+" stands in for the character c repeated one or more times.
You can also match on concatenated characters using the syntax .
Several constructs are available for classes of characters. The named classes in the last two entries of Table 9.2 include alpha, ascii, blank, digit, space, word, and several more.
The regular expression matches any expression beginning with the character a followed by any sequence of characters.
The regular expression represents any digit 0 through 9.
The regular expression matches any expression beginning with an a, followed by any character repeated one or more times, followed by a digit.
Let us try something more ambitious. This finds all words in text that are of length 16 to 18.
The regular expression matches any word boundary (typically whitespace, period, comma, etc.) and matches any word of length 16 to 18.
Various shortcuts exist for some commonly used patterns (Table 9.3).
Conveniently, you can mix regular expressions and other string patterns in various ways. This accomplishes the same thing as the previous computation, but using WordBoundary, instead of the regular expression .
Sometimes you will need to refer to the pattern by name in order to perform some operation on it. This is similar to the situation with regular named patterns. For example, given a list of words, some of which are uppercase/lowercase, this uses string patterns to transform the list to all lowercase words, naming the pattern that is matched by the first character after a word boundary, a.
So how do we name a pattern with regular expressions so that we can refer to it on the right-hand side of a rule? The syntax using regular expressions is to wrap the pattern in parentheses and then refer to it using "$n" , where is the th occurrence of such patterns. For example, is a named pattern that is matched by an expression consisting of a word boundary followed by a word character. The subexpression matching is referenced by "$1" on the right-hand side of the rule.
To change the second character after the word boundary to uppercase, use "$2" to refer to the expression that matches the second .
A particularly useful construct in many situations is the lookahead/lookbehind construct. is used when the following text must match patt and is used when the preceding text must match patt. For example, this finds all those words in some example text that follow "Raven, ".
There are many more constructs available for doing quite sophisticated things with regular expressions. We will explore some of these in the examples and exercises below as well as in the applications in Section 9.5. For a more detailed discussion, see the tutorials Regular Expressions (WMDC [8]) and Working with String Patterns (WMDC [7]).
Word Stemming
Many aspects of linguistic analysis include a study of the words used in a piece of text or in a speech. For example, you might be interested in comparing the complexity of articles written in two different newspapers. The length and frequency of certain words might be a useful measure for such an analysis. Patterns in usage of certain word combinations can be used to identify authenticity or the identity of an author of a work.
There are some basic issues that arise again and again in such analyses. For example, what should be done with contractions such as shouldn’t? What about sets of words such as run, runs, ran, running. Are they considered distinct? One approach in language processing is to strip suffixes and reduce alternate forms to some stem. This process, known as word stemming, is extensively used in many online search systems to try to distill user’s queries to some basic form that can be processed and operated on. It is a bit tricky, as natural languages are notorious for exceptions to almost any rule. For example, although the word entertainment can sensibly be stemmed to entertain, the stem of comment is certainly not com. In other words, a rule that dropped the suffix ment is too broad and returns nonwords in many cases. In most word stemming algorithms, there are numerous rules for the many cases that need to be examined; and there are many special cases. In this section, we will create a set of rules for word stemming to show how these rules are described and how the string pattern constructs in Mathematica provide a good set of tools to implement these concepts. A full-fledged stemming application would include hundreds of rules for each language, so we will only give a small set here to indicate the general process.
Words Ending in …xes
The first set of stemming rules we will create involves a relatively small set of words in the English language – those ending in xes, such as boxes or complexes. The rule is to strip off the es.
To prototype, we collect all the words in the dictionary that end in xes. We will also restrict ourselves to words that are all lowercase. Quiet is used here to suppress the error messages that arise when StringTake operates on words of length less than three. Alternatively, you could put an extra clause () in the conjunction.
Here is the replacement rule. The regular expression "(\\w)(x)es" will be matched by any word character followed by xes. It is replaced by that word character followed only by x. On the right-hand side of the rule, $1 refers to the first pattern on the left, ; and $2 refers to the second pattern on the left, .
This is pretty good; it appears only four stemmed words are not in the dictionary; although max might be considered an abbreviation, postfix is certainly a word! Nonetheless, these sorts of exceptions are common and will need to be dealt with separately.
Plural Nouns Ending …mming
A word such as programming has a stem of program; so the rule for words ending in …mming could be: drop the ming. Start by gathering all the words in the dictionary that end with …mming.
Recall the regular expression represents a word character repeated some number of times; represents a word boundary; the $1 refers to the first expression , that is, the characters up to the mming. These characters will be joined with a single m.
Again, this is quite good although the word lemming has been stemmed to the nonword lem, something that will need to be dealt with as a special case. The way to do that is to order the rules so that the special cases are caught first.
Words Ending in …otes
Numerous rules are needed for turning plural words into their singular stems. To see this, consider a naive rule that simply drops the s for any such words.
This is clearly too general a rule. In fact, several different rules are needed for words that end in s, depending upon the preceding characters. Here, we will only deal with words that end in …otes. First gather the words in the dictionary that match this pattern.
Here is the replacement rule.
Stemming litotes gives the nonword litote. This can again be resolved by adding some specific rules for these not uncommon situations.
Plural to Singular
Let us try to deal with the general problem of stemming plural forms to singular. This is a more difficult scenario to deal with as there are many rules and even more exceptions. We will begin by showing how the order of the replacement rules matters in the stemming process.
You might imagine the rules given in Table 9.4 being used to stem plurals (these are not complete, but they will get us started). In fact, these are step 1a of the commonly-used Porter’s algorithm for word stemming in the English language.
The order in which such rules are used is important. You do not want the last rule being used before any of the others. As we saw with the previous set of rules, Mathematica will apply rules in the order in which they are given, assuming that they have roughly the same level of specificity. Note also that some of these rules are designed to leave certain words unchanged. For example neither pass nor abacus are plural and they should not be stemmed.
Here then is a rough attempt at stemming plural words. First we gather the words from the dictionary that end in s and display a random sample of them.
This process of word stemming requires a lot of trial and error and the creation of many rules for the exceptions. Another approach, called lemmatization, does a more careful and thorough job by working with vocabularies and performing morphological analysis of the words to better understand how to reduce them to a root. For more information, see Manning, Raghavan, and Schütze (2008) [6].
Exercises
- Rewrite the genomic example in Section 9.3 to use regular expressions instead of string patterns to find all occurrences of the sequence . Here is the example using general string patterns.
- Rewrite the web page example in Section 9.3 to use regular expressions to find all phone numbers on the page, that is, expressions of the form . Modify accordingly for other web pages and phone numbers formatted for other regions.
- Create a function that takes its argument word and returns the word with the first letter uppercase and the rest of the letters lowercase.
- Use a regular expression to find all words given by DictionaryLookup that consist only of the letters a, e, i, o, u, and y in any order with any number of repetitions of the letters.
- The basic rules for pluralizing words in the English language are roughly, as follows: if a noun ends in ch, s, sh, j, x, or z, it is made plural by adding es to the end. If the noun ends in y and is preceded by a consonant, replace the y with ies. If the word ends in ium, replace with ia (Chicago Manual of Style 2010 [9]). Of course, there are many more rules and even more exceptions, but you can implement a basic set of rules to convert singular words to plural based on these rules and then try them out on the following list of words.
- A common task in transcribing audio is cleaning up text, removing certain phrases such as um, er, and so on, and other tags that are used to make a note of some sort. For example, the following transcription of a lecture from the University of Warwick, Centre for Applied Linguistics (BASE Corpus [10]), contains quite a few fragments that should be removed, including newline characters, parenthetical remarks, and nonwords. Use StringReplace with the appropriate rules to “clean” this text and then apply your code to a larger corpus.
- Find the distribution of sentence lengths for any given piece of text. contains several well-known books and documents that you can use. You will need to think about and identify sentence delimiters carefully. Take care to deal properly with words such as Mr., Dr., and so on that might incorrectly trigger a sentence-ending detector.
- In web searches and certain problems in natural language processing (NLP), it is often useful to filter out certain words prior to performing the search or processing of the text to help with the performance of the algorithms. Words such as the, and, is, and so on are commonly referred to as stop words for this purpose. Lists of stop words are almost always created manually based on the constraints of a particular application. We will assume you can import a list of stop words as they are commonly available across the internet. For our purposes here, we will use one such list that comes with the materials for this book.
- Modify the previous exercise so that the user can supply a list of punctuation in addition to the list of stop words to be used to filter the text.
9.5. Examples and Applications
This section puts together many of the concepts and techniques developed earlier in the chapter to solve several nontrivial applied problems. The first example creates a function to generate random strings, mirroring the syntax of the built-in random number functions. People who work with large strings, such as those in genomic research, often partition their strings into small blocks and then perform some analysis on those substrings. We develop functions for partitioning strings as well as several examples for analyzing sequences of genetic code. An additional example covers checksums, which are used to verify stored and transmitted data. Finally, a word game is included in which we create blanagrams, a variant of anagrams.
Random Strings
A blasphemous sect suggested …that all men should juggle letters and symbols until they constructed, by an improbable gift of chance, these canonical books.
— Jorge L. Borges, The Library of Babel [11]
Those who work with genomic data often need to test their algorithms on strings. While it may be sensible to test against real data – for example, using genes on the human genome – random data might be more appropriate to quickly test and measure the efficiency of an algorithm. Although Mathematica has a variety of functions for creating random numbers, random variates, and so on, it does not have a function to create random sequences of strings. In this section we will create one.
To start, we will choose the characters A, C, T, and G – representing the nucleotide, or DNA, bases – as our alphabet, that is, the letters in the random strings we will create.
The key observation is that we want to choose one character at random from this list. Since we need to repeat this times, we need to randomly choose with replacement. That is the purpose of RandomChoice.
This expression is a list of strings.
Finally, we concatenate the strings.
So a first attempt at putting these pieces together would look like this. Note the use of a default value of 1 for the optional argument n (see Section 6.1 for a discussion of default values for arguments to functions).
The default value of the second argument gives one choice.
We can make the arguments a bit more general using structured patterns. The first argument in this next version must be a list consisting of a sequence of one or more strings.
Here is a ten-character password generator.
It is not hard to extend this function to create random strings of a given length. We essentially pass that argument structure to RandomChoice.
The exercises at the end of this section include a problem that asks you to add an option that provides a mechanism to weight the individual characters in the random string.
Partitioning Strings
Some string analysis requires strings to be broken up into blocks of a certain size and then computations are performed on those blocks. Although there is no built-in function for partitioning strings, we can easily create one, taking advantage of the syntax and speed of the built-in Partition function.
The Partition function requires a list as its first argument. To start, we will give it a list of the characters in a prototype string, a gene on the human genome.
Now, partition this list of characters into lists of length 4 with offset 1.
Because the number of characters in str is not a multiple of 4, this use of Partition has padded the last sublist with the first two characters from the original string; in other words, this has treated the list cyclically; not quite what we want here.
A slightly different syntax for Partition gives an uneven subset at the end. We will need to use this form so as not to lose or introduce any spurious information.
Finally, convert each sublist into a contiguous string.
This puts everything together in a function.
This partitions the string into nonoverlapping blocks of length 12.
This function operates on large strings fairly fast. Here we partition a random string of length ten million into nonoverlapping blocks of length ten.
Adler Checksum
Checksums, or hashes, are commonly used to check the integrity of data when that data are either stored or transmitted. A checksum might be created, for example, when some data are stored on a disk. To check the integrity of that data, the checksum can be recomputed and if it differs from the stored value, there is a very high probability that the data was tampered with. Hash functions are used to create hash tables which are used for record lookup in large arrays of data. As an example of the use of character codes, we will implement a basic checksum algorithm, the Adler checksum.
Mathematica has a built-in function, Hash, that can be used to create hash codes, or checksums.
If the string is changed, its checksum changes accordingly.
We will implement a basic hash code, known as the Adler-32 checksum algorithm. Given a string consisting of concatenated characters , we form two 16-bit sums and as follows:
where is the character code for the character . The number 65521 is chosen as it is the largest prime smaller than . Choosing primes for this task seems to reduce the probability that an interchange of two bytes will not be detected. Finally, the Adler checksum is given by
Let us take Mathematica as our test word. We start by getting the Ascii character codes for each character.
The number above is given by the cumulative sums of the character codes, with 1 prepended to that list. (This step could also be done using FoldList.)
The number is given by the cumulative sums from this last list, omitting the 1 at the beginning as it is already part of the cumulative sums.
We can check our result against the algorithm implemented in the Hash function.
Finally, this puts these steps together in a reusable function. Our prototype worked with small numbers and so the need to work mod 65521 was not necessary. For general inputs, the arithmetic will be done using this modulus.
As an aside, here is its hash code in hexadecimal.
And here is a lengthier example.
Search for Substrings
As we have seen in this chapter, string patterns provide a powerful and compact mechanism for operating on text data. In this example, we will create a function that searches the dictionary for words containing a specified substring.
If our test substring is cite, here is how we would find all words that end in cite. Note the triple blank pattern to match any sequence of zero or more characters.
Here are all words that begin with cite.
And this gives all words that have cite somewhere in them, at the beginning, middle, or end.
Using the double blank gives words that have cite in them but not beginning or ending with cite.
Let us put these pieces together in a reusable function FindWordsContaining. We will include one option, WordPosition that identifies where in the word the substring is expected to occur.
Depending upon the value of the option WordPosition, Which directs which expression will be evaluated.
Using the default value for WordPosition, this finds all words in the dictionary that start with the string cite.
And this finds all words that have cite anywhere in the word.
Finally, here is a dynamic interface that includes a text field in which you can enter an input string; tabs are used to specify in which part of the word you expect the string to occur.
For more on the creation of these sorts of dynamic interfaces, see Chapter 11.
DNA Sequence Analysis
DNA molecules are composed of sequences of the nitrogenous bases guanine, cytosine, thymine, and adenine. Guanine and cytosine bond with three hydrogen bonds and thymine and adenine bond with two. Research has indicated that high GC content (guanine and cytosine) DNA is more stable than that with lower GC. The exact reasons for this are not completely understood and determining the GC content of various DNA materials is an active area of biomolecular research. GC content is often described as a percentage of the guanine and cytosine nucleotides compared to the entire nucleotide content (Cristianini and Hahn 2007 [12]). In this section we will create a function to compute the ratio of GC in any given DNA sequence or fragment.
We will start by importing a FASTA file consisting of human mitochondrial DNA, displaying some information about the contents of this file.
We use StringCount to count the number of occurrences of G or C in this sequence.
And here is the number of occurrences of A or T.
The GC percentage is given by the following ratio.
Here then is an auxiliary function we will use in what follows.
Note that gcRatio expects a string as an argument, but this fails with hsMito, imported from an external source.
It fails because Import returns a list consisting of a string, not a raw string. We can remedy this by writing a rule to deal with this argument structure and then call the first rule.
Typically, researchers are interested in studying the GC ratio on particular fragments of DNA and comparing it with similar fragments on another molecule. One common way of doing this is to compute the GC ratio for blocks of nucleotides of some given length. We will use the function StringPartition, developed earlier to partition the sequence into blocks of a given size. We will work with a small random sequence to prototype.
Here are the GC ratios for each of the blocks given by lis.
Finally, it is helpful to be able to identify each block by its starting position. So we first create a list of the starting positions for each block and then transpose that with the ratios.
Here are all the pieces in one function, GCRatio.
And again, a second rule in case the string is wrapped in a list.
Let us try it out first on our prototype sequence.
And then on the human mitochondrial DNA with block size 1000.
Various types of analysis can then be performed on these blocks. For example, using Select, this quickly finds regions of high GC content.
Here is a quick visualization of the GC content across the blocks.
Numerous comparative studies have been done looking at the GC content for different organisms. One much-studied organism is Thermoplasma volcanium, a bacterium-like organism that exists in very high-acid and high-temperature environments. To accommodate the extreme conditions, organisms in such environments often have high GC content which has a higher thermal stability. The following sequence is in the public domain and was obtained courtesy of the National Center for Biotechnology Information (NCBI Nucleotide Database [13]).
Here is the GC ratio for the entire sequence.
And here are the ratios for block sizes of 100000.
Displaying DNA Sequences
DNA sequences are typically long strings of nucleotides that are difficult to visualize simply by looking at the string of characters. Various visualization tools have been used to work with these sequences and in this section we will look at a common way of viewing them in a formatted table.
As before, we will prototype with a short random string consisting of nucleotide characters G, C, A, and T.
Using StringPartition developed earlier in this chapter, we split the string into blocks of a desired size.
We have 13 blocks here, but for readability purposes, we will put five blocks on each line of output. We use the blank string " " to pad out any string shorter than the blocksize, in this case 10.
The following code gives the starting positions for each line once we have set the block length and row length.
We prepend the starting position of each row at the head of the row. Recall, the second argument to Prepend is the expression you wish to put in front (the indices) of your target expression (the rows).
This is what the formatted output should look like.
Finally, let us put this all together, setting up an option, BlockSize that is combined with the inherited options from Grid.
Let us exercise some of the options.
Blanagrams
A blanagram is an anagram for another word except for the substitution of one letter. Think of Scrabble with a blank square (blank + anagram = blanagram). For example, phyla is a blanagram of glyph: replace the g with an a and find anagrams. In this section we will create a function that finds all blanagrams of a given word.
We will prototype with a simple word, glyph.
Start by replacing the first letter in glyph with an a and then finding all anagrams (using Anagrams from Section 9.2). The third argument to StringReplacePart is a list of beginning and ending positions for the replacement.
Now do the same for each character position in the word.
Running Anagrams on each of these strings, only two appear as words in the dictionary.
Having done this for the letter a, we now repeat for all other single characters.
Because of the extra nesting () we need to flatten the output at a deeper level; and delete duplicates.
Finally, put all the pieces together to create the function Blanagrams.
This turns out to be fairly quick for small words, but it bogs down for larger words.
We will wait until Section 12.3 to optimize this code by profiling (identifying slow computational chunks) and taking advantage of parallel processing built into Mathematica.
Exercises
- Generalize the RandomString function to allow for a Weights option so that you can provide a weight for each character in the generated string. Include a rule to generate a message if the number of weights does not match the number of characters. For example:
- Write the function Anagrams developed in Section 9.2 without resorting to the use of Permutations. Consider using the Sort function to sort the characters. Note the difference in speed of the two approaches: one involving string functions and the other list functions that operate on lists of characters. Increase the efficiency of your search by only searching for words of the same length as your source word.
- Rewrite the function FindWordsContaining using regular expressions instead of the patterns used in this section.
- Using the text from several different sources, compute and then compare the number of punctuation characters per 1000 characters of text. gives a listing of many different texts that you can use.
- The function StringPartition was developed specifically to deal with genomic data where one often needs uniformly-sized blocks to work with. Generalize StringPartition to fully accept the same argument structure as the built-in Partition.
- Rewrite the text encoding example from Section 9.2 using StringReplace and regular expressions. First create an auxiliary function to encode a single character based on a key list of the form where is a plaintext character and is its ciphertext encoding. For example, the pair would indicate the character z in the plaintext will be encoded as an a in the ciphertext. Then create an encoding function using regular expressions to encode any string str using the key consisting of the plaintext/ciphertext character pairs.
References
[1] | P. Wellin, Programming with Mathematica®: An Introduction, Cambridge, UK: Cambridge University Press, 2013. www.cambridge.org/wellin. |
[2] | A. Sinkov, Elementary Cryptanalysis: A Mathematical Approach, Washington, DC: The Mathematical Association of America, 1966. |
[3] | C. Paar and J. Pelzl, Understanding Cryptography: A Textbook for Students and Practitioners, New York: Springer, 2010. |
[4] | J. Joyce, Finnegans Wake, New York: Viking Penguin Inc., 1939. |
[5] | Project Gutenberg. (Apr 3, 2013) www.gutenberg.org. |
[6] | C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, New York: Cambridge University Press, 2008. |
[7] | Wolfram Research. “Working with String Patterns” from the Wolfram Mathematica Documentation Center—A Wolfram Web Resource. reference.wolfram.com/mathematica/tutorial/WorkingWithStringPatternsOverview.html. |
[8] | Wolfram Research. “Regular Expression” from the Wolfram Mathematica Documentation Center—A Wolfram Web Resource. reference.wolfram.com/mathematica/tutorial/RegularExpressions.html. |
[9] | University of Chicago Press, The Chicago Manual of Style, 16th ed., Chicago: University of Chicago Press, 2010. |
[10] | Centre for Applied Linguistics, University of Warwick. “British Academic Spoken English (BASE) and BASE Plus Collections.” (Apr 3, 2013) www2.warwick.ac.uk/fac/soc/al/research/collect/base. |
[11] | J. L. Borges, “The Library of Babel,” in Labyrinths: Selected Stories & Other Writings, New York: Modern Library, 1983. |
[12] | N. Cristianini and M. W. Hahn, Introduction to Computational Genomics: A Case Studies Approach. New York: Cambridge University Press, 2007. |
[13] | National Center for Biotechnology Information. “Nucleotide.” (Apr 3, 2013) www.ncbi.nlm.nih.gov/nuccore. |
P. Wellin, “Strings,” The Mathematica Journal, 2013. dx.doi.org/doi:10.3888/tmj.15-4. |
List of Additional Material
Additional electronic files:
- content.wolfram.com/sites/19/2013/04/638154522.tar.gz
- content.wolfram.com/sites/19/2013/04/StopWords.dat
About the Author
Paul Wellin worked for Wolfram Research from the early 1990s through 2011, directing the Mathematica training efforts with the Wolfram Education Group. Prior to that, he taught mathematics at both public school and at the university level for over 12 years. He has given talks, workshops, and seminars around the world on the integration of technical computing and education and he has served on numerous government advisory panels on these issues. He is the author or coauthor of several books on Mathematica. In addition to his writing, he is currently involved with several consulting projects, as well as book design and composition projects.
Paul Wellin
Chico, California
paulwellin@gmail.com