String manipulation is a basic operation in many algorithms and utilities, such as data validation, text parsing, and file conversion. The Java APIs contain several classes that are used to work with character data:
Character -- A class whose instances can hold a single character value.
String -- An immutable class for working with multiple characters.
StringBuffer and StringBuilder -- Mutable classes for working with multiple characters.
The String and StringBuffer classes are the two you will use the most in your programming assignments. Use the String class when you want to prohibit data modification; otherwise use the StringBuffer class.
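The difference between the two classes can be seen in a minimal sketch. The class name MutabilityDemo and its two helper methods are hypothetical, chosen only for illustration; the String and StringBuffer calls themselves are standard.

```java
final class MutabilityDemo {
    // Returns a new String; the argument s itself is never modified.
    static String withSuffix(String s, String suffix) {
        return s.concat(suffix);
    }

    // Modifies the buffer in place; the caller sees the change.
    static void appendInPlace(StringBuffer sb, String suffix) {
        sb.append(suffix);
    }

    public static void main(String[] args) {
        String s = "edge";
        String t = withSuffix(s, " of history");
        System.out.println(s);  // edge
        System.out.println(t);  // edge of history

        StringBuffer sb = new StringBuffer("edge");
        appendInPlace(sb, " of history");
        System.out.println(sb); // edge of history
    }
}
```

Every "modifying" method of String works like concat() here: it leaves the original object alone and hands back a new one.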
In Java, strings can be created in two different ways: either using the new operator or using a string literal.
String demo1 = new String("This is a string");
char[] demo2 = {'s','t','r','i','n','g'};
String str = new String(demo2);
String demo3 = "This is a string";
The example below demonstrates the difference between these initializations:
String s1 = new String("Fester");
String s2 = new String("Fester");
String s3 = "Fester";
String s4 = "Fester";
Then
s1 == s2 returns false
s1 == s3 returns false
s3 == s4 returns true
Because strings are so important in practice, Java stores (at compile time) all strings created from string literals, such as String s3 = "Fester", in a special internal table. This process is called canonicalization (also known as interning): it replaces multiple string objects with a single object. This is why in the above example s3 and s4 refer to the same object. Also note that creating strings like s3 and s4 is more efficient, because no new object has to be allocated. Review the code example StringOptimization.java, which demonstrates time comparisons between these two ways of string creation.
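The effect of the internal table can be checked directly. The sketch below is only an illustration (the class name InternDemo and the helper sameObject are hypothetical); it uses the standard String.intern() method, which returns the canonical copy of a string from the table.

```java
final class InternDemo {
    // True only when both references point to the very same object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String s1 = new String("Fester"); // a fresh object on the heap
        String s3 = "Fester";             // the canonical object from the table
        String s4 = "Fester";             // the same canonical object again

        System.out.println(sameObject(s3, s4));          // true
        System.out.println(sameObject(s1, s3));          // false
        System.out.println(sameObject(s1.intern(), s3)); // true
        System.out.println(s1.equals(s3));               // true: same contents
    }
}
```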
Here are some important facts you must know about strings:
- An individual character of a string is accessed with the charAt() method. In this code snippet we get the fourth character, which is 't':
String str = "on the edge of history";
char ch = str.charAt(3);
- The toString() method is used when we need a string representation of an object. In your own classes you can override toString() and provide your own string representation.
- Comparing the contents of two strings with == is the most common mistake beginners make. Compare the contents using either the equals() or the compareTo() method.
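The last two facts can be illustrated with a short sketch. The class GridPoint is hypothetical and exists only to show an overridden toString(); the comparison calls are standard String methods.

```java
final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Our own string representation instead of the default one from Object.
    @Override
    public String toString() {
        return "(" + x + ", " + y + ")";
    }

    public static void main(String[] args) {
        System.out.println(new GridPoint(3, 4));  // (3, 4)

        String a = "history";
        String b = new String("history");
        System.out.println(a == b);               // false: different objects
        System.out.println(a.equals(b));          // true: same contents
        System.out.println(a.compareTo(b));       // 0: same contents
        System.out.println("edge".compareTo("history") < 0); // true: "edge" sorts first
    }
}
```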
The String class contains a large number of useful methods for string manipulation. The following table presents the most common String methods:
str.charAt(k) -- returns the char at position k in str
str.substring(k) -- returns the substring from index k to the end of str
str.substring(k, n) -- returns the substring from index k to index n-1 of str
str.indexOf(s) -- returns the index of the first occurrence of String s in str, or -1 if there is none
str.indexOf(s, k) -- returns the index of the first occurrence of String s in str at or after index k
str.startsWith(s) -- returns true if str starts with s
str.startsWith(s, k) -- returns true if str starts with s at index k
str.equals(s) -- returns true if the two strings have equal values
str.equalsIgnoreCase(s) -- same as above, ignoring case
str.compareTo(s) -- compares two strings lexicographically; the result is negative, zero, or positive
str.compareToIgnoreCase(s) -- same as above, ignoring case
Examine the code in BasicStringDemo.java for further details.
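As a quick reference, here is a minimal sketch that exercises the methods from the table on one string. It is not the course's BasicStringDemo.java; the class name StringMethodsSketch and the helper countOccurrences are assumptions made for this illustration.

```java
final class StringMethodsSketch {
    // Counts non-overlapping occurrences of s in str using indexOf(s, k).
    static int countOccurrences(String str, String s) {
        if (s.isEmpty()) return 0;            // avoid an endless loop on ""
        int count = 0;
        int k = str.indexOf(s);
        while (k >= 0) {
            count++;
            k = str.indexOf(s, k + s.length()); // continue after this match
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "on the edge of history";
        System.out.println(str.charAt(3));                 // t
        System.out.println(str.substring(15));             // history
        System.out.println(str.substring(7, 11));          // edge
        System.out.println(str.indexOf("e"));              // 5
        System.out.println(str.indexOf("e", 6));           // 7
        System.out.println(str.startsWith("on"));          // true
        System.out.println(str.startsWith("edge", 7));     // true
        System.out.println(str.equals("On The Edge Of History"));           // false
        System.out.println(str.equalsIgnoreCase("On The Edge Of History")); // true
        System.out.println("edge".compareTo("history"));   // -3 ('e' - 'h')
        System.out.println("edge".compareToIgnoreCase("EDGE")); // 0
        System.out.println(countOccurrences(str, "e"));    // 3
    }
}
```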
In many cases when you deal with strings you will use methods available in the companion StringBuffer class. This mutable class is used when you want to modify the contents of the string. It provides an efficient approach to dealing with strings, especially for large dynamic string data. StringBuffer is similar to ArrayList in that the memory allocated to an object is automatically expanded to hold additional data.
Here is an example of reversing a string using string concatenation
public static String reverse1(String s) {
    String str = "";
    // Walk the input from its last character to its first.
    for (int i = s.length() - 1; i >= 0; i--)
        str += s.charAt(i);   // creates a new String on every pass
    return str;
}
and using a StringBuffer's append
public static String reverse2(String s) {
    StringBuffer sb = new StringBuffer();
    for (int i = s.length() - 1; i >= 0; i--)
        sb.append(s.charAt(i));   // grows the same buffer in place
    return sb.toString();
}
Another way to reverse a string is to convert a String object into a StringBuffer object, use the reverse() method, and then convert it back to a string:
public static String reverse3(String s) {
    return new StringBuffer(s).reverse().toString();
}
The performance difference between these two classes is that StringBuffer is faster than String when performing many concatenations. Each time two Strings are concatenated, a new String object is created and the existing characters are copied into it, so building a long string piece by piece consumes excessive time and memory.
Review the code example StringOverhead.java that demonstrates time comparisons of concatenation on Strings and StringBuffer.
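A rough way to observe the overhead yourself is sketched below. This is not the course's StringOverhead.java: the class ConcatSketch and its method names are hypothetical, and the sketch uses StringBuilder, the unsynchronized counterpart of StringBuffer listed at the top of these notes. The measured times depend on the machine and the JVM, so treat them as an indication rather than a benchmark.

```java
final class ConcatSketch {
    // Builds a string of n characters by repeated String concatenation.
    static String withConcat(int n) {
        String s = "";
        for (int i = 0; i < n; i++)
            s += 'x';            // allocates a new String each time
        return s;
    }

    // Builds the same string by appending to one mutable buffer.
    static String withBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++)
            sb.append('x');      // reuses the same buffer
        return sb.toString();
    }

    public static void main(String[] args) {
        int n = 20000;

        long t0 = System.nanoTime();
        String a = withConcat(n);
        long t1 = System.nanoTime();
        String b = withBuilder(n);
        long t2 = System.nanoTime();

        System.out.println("same result:   " + a.equals(b));
        System.out.println("concatenation: " + (t1 - t0) / 1000 + " microseconds");
        System.out.println("StringBuilder: " + (t2 - t1) / 1000 + " microseconds");
    }
}
```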
The StringTokenizer class (from the java.util package) allows you to break a string into tokens (substrings). Tokens are groups of characters separated by delimiters, such as a space, a semicolon, and so on. In other words, a token is a maximal sequence of consecutive characters that are not delimiters. Here is an example of the use of the tokenizer (by default, whitespace characters such as the space act as delimiters):
String s = "Nothing is as easy as it looks";
StringTokenizer st = new StringTokenizer(s);
while (st.hasMoreTokens()) {
    String token = st.nextToken();
    System.out.println("Token [" + token + "]");
}
Here, the hasMoreTokens() method checks if there are more tokens available from the string, and the nextToken() method returns the next token from the string tokenizer.
The set of delimiters (the characters that separate tokens) may be specified in the second argument of the StringTokenizer constructor. In the following example, StringTokenizer has a set of two delimiters, a space and an underscore:
String s = "Every_solution_breeds new problems";
StringTokenizer st = new StringTokenizer(s, " _");
while (st.hasMoreTokens()) {
    String token = st.nextToken();
    System.out.println("Token [" + token + "]");
}
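The two loops above can be wrapped into a small reusable helper, sketched here under the assumption that the tokens should be collected into a list rather than printed. The class TokenizerSketch and the method tokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TokenizerSketch {
    // Returns all tokens of s, where every character of delims is a delimiter.
    static List<String> tokens(String s, String delims) {
        StringTokenizer st = new StringTokenizer(s, delims);
        List<String> result = new ArrayList<>();
        while (st.hasMoreTokens())
            result.add(st.nextToken());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Every_solution_breeds new problems", " _"));
        // [Every, solution, breeds, new, problems]
        System.out.println(tokens("a__b", "_"));
        // [a, b]: consecutive delimiters never produce empty tokens
    }
}
```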
Regular expressions are the most common programming technique for scanning strings and extracting substrings based on common characteristics. They are an essential part of many programming languages. In the following table the left-hand column specifies the regular expression constructs, while the right-hand column describes the conditions under which each construct will match.
Character Classes
[abc] -- a, b, or c (simple class)
[^abc] -- any character except a, b, or c (negation)
[a-zA-Z] -- a through z, or A through Z, inclusive (range)
[a-d[m-p]] -- a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] -- d, e, or f (intersection)
[a-z&&[^bc]] -- a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] -- a through z, and not m through p: [a-lq-z] (subtraction)
\\d -- any digit from 0 to 9
\\w -- any word character (a-z, A-Z, 0-9 and _)
\\W -- any non-word character
\\s -- any whitespace character
? -- appearing once or not at all
* -- appearing zero or more times
+ -- appearing one or more times
The constructs \\d, \\w, \\W and \\s are written here as they appear inside a Java string literal, where the backslash must be doubled.
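A few rows of the table can be tried out with String.matches(), which tests whether a whole string fits a regular expression. The sketch below is only an illustration; the class CharClassSketch and the helper inClass are hypothetical names.

```java
final class CharClassSketch {
    // True if the single character c is matched by the given character class.
    static boolean inClass(String charClass, char c) {
        return String.valueOf(c).matches(charClass);
    }

    public static void main(String[] args) {
        System.out.println(inClass("[a-d[m-p]]", 'n'));    // true  (union)
        System.out.println(inClass("[a-d[m-p]]", 'e'));    // false
        System.out.println(inClass("[a-z&&[^bc]]", 'b'));  // false (subtraction)
        System.out.println(inClass("[a-z&&[^bc]]", 'd'));  // true
        System.out.println(inClass("\\w", '_'));           // true
        System.out.println(inClass("\\W", '_'));           // false

        System.out.println("colour".matches("colou?r"));   // true: u appears once
        System.out.println("color".matches("colou?r"));    // true: u does not appear
        System.out.println("".matches("\\d*"));            // true: zero digits
        System.out.println("".matches("\\d+"));            // false: needs one digit
    }
}
```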
The Java String class has several methods that allow you to perform an operation using a regular expression on that string in a minimal amount of code.
The matches("regex") method returns true or false depending on whether the string can be matched entirely by the regular expression "regex". For example,
"abc".matches("abc")    // true
"abc".matches("bc")     // false: the whole string must match
String regex = ".*" + "abc" + "_+";
"..abc___".matches(regex);    // true
"abc___".matches(regex);      // true
"abc_".matches(regex);        // true
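Because matches() must consume the whole string, it is a natural fit for data validation. Here is a minimal sketch, assuming we want to accept an optional minus sign followed by one or more digits; the class ValidationSketch and the method isInteger are hypothetical names.

```java
final class ValidationSketch {
    // True if s is an optional '-' followed by at least one digit, and nothing else.
    static boolean isInteger(String s) {
        return s.matches("-?\\d+");
    }

    public static void main(String[] args) {
        System.out.println(isInteger("-42"));  // true
        System.out.println(isInteger("12a"));  // false: trailing 'a' is not matched
        System.out.println(isInteger("4 2"));  // false: the space breaks the match
        System.out.println(isInteger(""));     // false: at least one digit required
    }
}
```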
The method replaceAll("regex", "replacement") replaces each substring of the string that matches the given regular expression "regex" with the given "replacement". As an example, let us remove all non-letters from a given string:
String str = "Nothing 2is as <> easy AS it +_=looks!";
str = str.replaceAll("[^a-zA-Z]", "");    // NothingisaseasyASitlooks
In the next example, we replace each run of zero or more a's followed by a b with "-":
String str = "aabfooaaaabfooabfoob";
str = str.replaceAll("a*b", "-");    // -foo-foo-foo-
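The same method is handy for cleaning up input before further processing. The following sketch wraps two such clean-up steps into helper methods; the class CleanupSketch and both method names are hypothetical.

```java
final class CleanupSketch {
    // Keeps only the letters a-z and A-Z.
    static String lettersOnly(String s) {
        return s.replaceAll("[^a-zA-Z]", "");
    }

    // Trims the ends and replaces every run of whitespace with one space.
    static String collapseWhitespace(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("Nothing 2is as <> easy AS it +_=looks!"));
        // NothingisaseasyASitlooks
        System.out.println(collapseWhitespace("  Nothing   is\tas easy "));
        // Nothing is as easy
    }
}
```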
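The split("regex") method splits the string at each "regex" match and returns an array of strings in which each element is a part of the original string between two "regex" matches. In the following examples we break a sentence into words, first using a space as the delimiter, then using either an underscore or a space, and then using a run of one or more underscores:
String s = "Nothing is as easy as it looks";
String[] st = s.split(" ");      // [Nothing, is, as, easy, as, it, looks]

String s = "Every_solution_breeds new problems";
String[] st = s.split("_| ");    // [Every, solution, breeds, new, problems]

String s = "Every_solution____breeds_new__problems";
String[] st = s.split("_+");     // [Every, solution, breeds, new, problems]
Two adjacent delimiters produce an empty string between them:
String[] st = "Tomorrow".split("r");    // [Tomo, , ow]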
One widely used application of split() is breaking a given text file into words. This is easily done by means of the metacharacter "\W" (any non-word character), which matches everything that separates words. A "word character" is a letter (a-z and A-Z), a digit (0-9), or an underscore.
"Let's go, Steelers!!!".split("\\W+");
returns [Let, s, go, Steelers]. The "+" treats each run of non-word characters as a single delimiter; with "\\W" alone the result would also contain an empty string between the comma and the space that follows it.
Examine the code in Split.java for further details.
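Once a text is split into words, counting them takes only a few lines. The sketch below is not the course's Split.java; the class WordCountSketch and the method wordCounts are hypothetical, and the choice to lower-case the text first is an assumption made for this example.

```java
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Maps each word of text (lower-cased) to the number of times it occurs.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            // A text that starts with a delimiter yields an empty first element.
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCounts("Let's go, Steelers!!! Go!"));
        // {go=2, let=1, s=1, steelers=1}
    }
}
```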
Pattern matching in Java is based on the use of two classes, Pattern and Matcher, both from the java.util.regex package. A typical invocation is the following. First we create a pattern:
String seq = "CCCAA"; Pattern p = Pattern.compile("C*A*");
This pattern describes any number of Cs followed by any number of As. Then we create a Matcher object that matches the string seq against our pattern:
Matcher m = p.matcher(seq);
Finally, we perform the actual matching; matches() returns true only if the entire string matches the pattern:
boolean res = m.matches();
The Matcher class has another widely used method, find(), which finds the next substring that matches a given pattern. In the following example we count the number of occurrences of "ACC":
String seq = "CGTATCCCACAGCACCACACCCAACAACCCA";
Pattern p = Pattern.compile("A{1}C{2}");
Matcher m = p.matcher(seq);
int count = 0;
while (m.find())
    count++;
System.out.println("there are " + count + " ACC");
Examine the code example Matching.java for further details.
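Besides counting matches, it is often useful to know where they occur. The sketch below uses Matcher.start(), which returns the index at which the most recent match begins. It is not the course's Matching.java; the class MatchPositions and the method starts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchPositions {
    // Returns the start index of every non-overlapping match of regex in seq.
    static List<Integer> starts(String seq, String regex) {
        Matcher m = Pattern.compile(regex).matcher(seq);
        List<Integer> result = new ArrayList<>();
        while (m.find())            // each call resumes after the previous match
            result.add(m.start());
        return result;
    }

    public static void main(String[] args) {
        String seq = "CGTATCCCACAGCACCACACCCAACAACCCA";
        System.out.println(starts(seq, "A{1}C{2}"));   // [13, 18, 26]
    }
}
```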
The DNA (the genetic blueprint) of a species can be composed of billions of ACGT nucleotides (about 3 billion base pairs in the human genome). DNA forms a double helix in which two strands bind and twist together. In pattern matching problems we ignore the fact that DNA forms a double helix and think of it only as a single strand. The other strand is complementary: knowing one strand allows us to uniquely determine the other one. Thus, DNA is essentially a linear molecule that looks like a string composed of only four characters: A, C, G, and T.
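The claim that one strand determines the other can be made concrete: A always pairs with T, and C always pairs with G. The following is a minimal sketch under that pairing rule; the class StrandSketch and its method names are hypothetical. Since the two strands run in opposite directions, the sketch also offers a reversed variant built with StringBuilder.reverse().

```java
final class StrandSketch {
    // Returns the base that pairs with the given one: A<->T, C<->G.
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:
                throw new IllegalArgumentException("not a nucleotide: " + base);
        }
    }

    // Complements every base, keeping the order of the input.
    static String complementStrand(String strand) {
        StringBuilder sb = new StringBuilder(strand.length());
        for (int i = 0; i < strand.length(); i++)
            sb.append(complement(strand.charAt(i)));
        return sb.toString();
    }

    // The complementary strand read in the opposite direction.
    static String reverseComplement(String strand) {
        return new StringBuilder(complementStrand(strand)).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(complementStrand("ACGT"));   // TGCA
        System.out.println(reverseComplement("AAC"));   // GTT
    }
}
```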
Pattern matching in computational biology arises from the need to know characteristics of DNA sequences, for example to:
- find the best way to align two sequences,
- find any common subsequences,
- determine how well a sequence fits into a given model.
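As one illustration of the second task, the length of the longest common subsequence of two sequences can be computed with a standard dynamic programming table. This is a sketch of that well-known technique, not a method prescribed by these notes; the class LcsSketch and the method lcsLength are hypothetical names.

```java
final class LcsSketch {
    // Length of the longest sequence of characters that appears, in order
    // but not necessarily contiguously, in both a and b.
    static int lcsLength(String a, String b) {
        // dp[i][j] = LCS length of the first i chars of a and first j chars of b.
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1))
                    dp[i][j] = dp[i - 1][j - 1] + 1;   // extend a common subsequence
                else
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(lcsLength("ACGGT", "AGCGT"));  // 4, for example ACGT
        System.out.println(lcsLength("AAAA", "CCCC"));    // 0
    }
}
```

The table has (a.length() + 1) * (b.length() + 1) entries, so this direct version is practical only for sequences of moderate length.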
Comparing DNA sequences has many uses. Current scientific theories suggest that very similar DNA sequences have a common ancestor. The more similar two sequences are, the more recently they evolved from a single ancestor. With such knowledge we can, for example, reconstruct a phylogenetic tree (known as a "tree of life") that shows how long ago various organisms diverged and which species are closely related.