September 9, 2010 (Lecture 6)

PERL

Remember the spectrum we discussed? From scripting languages to interpreted languages to compiled languages? Today, we are going to begin discussing PERL. And, in doing so, we are moving into the domain of interpreted languages.

You'll find that the building blocks in PERL are less complex than in shell -- you'll be using libraries and language features, rather than whole programs. But, as compared to shell scripting, PERL has less overhead, as the libraries are part of the same program -- and don't incur the cost of separate processes and interprocess communication via pipes.

PERL also has more powerful operators. In shell, there were string operators -- and then some others added as afterthoughts. In PERL, the operators are very powerful and include such things as exponentiation and regular expression matching. They are, in fact, more powerful than the operators of compiled languages such as C and Java.

The reason for this is that the designers of compiled languages like the operators to closely map to features frequently available in hardware. They don't like to "hide" tremendous complexity behind a simple-seeming operator. Instead, complex operations are left to libraries so that programmers understand that the operation is complex and also so that there can be different options in the implementation. PERL is, for example, famous for the power of its regular expression and string processing.

Of course, compiled languages execute more rapidly as they are ready for the processor and require no parsing, interpretation, or translation. But, again, the trade-off is that the building blocks are smaller and, so, development time is higher -- and maintenance is more costly, making the applications less agile and dynamic, a big concern in changing environments or with changing needs.

PERL - The Practical Extraction and Report Language

Perl isn't exactly a scripting language. But, we've got a spare day in the schedule -- and Perl is certainly one of the tools of the trade.

Perl is an important language for many "quick programs". It was originally designed, as its name implies, as a tool for system administrators and others to "extract and report" -- basically to process log files, &c. As a result, it has a tremendously flexible and powerful regular expression capability -- something lacking in C, C++, and, to some extent, in Java.

And, this capability, combined with a language designed to make the common case convenient, has made Perl the language of choice not only for system administrators, but also as the "glue" used by IT developers, Web developers, &c. Basically, Perl is an interpreted language -- somewhere between a shell script and a full-fledged compiled HLL (but much closer to a compiled HLL, in many respects).

Shell scripting provides an excellent way to solve complex tasks with very little effort. But, it does this by pulling together powerful programs, usually using files and pipes as IPC tools. And, these techniques can be slow and cumbersome.

By comparison, the building blocks in Perl tend to be a bit smaller, but much more integrated. As a result, shell is often excellent for solving small but complex problems quickly. Perl is often used for medium-sized problems. And, truly large problems might be better done in a compiled language. But, especially with the current availability of tremendous processing power, the run-time savings of a compiled language are often insignificant.

Hello World!

A Perl program looks much like a shell script, except the program exec'd by the shell to process the script isn't, well, a shell -- it is the Perl interpreter. And the program that it is interpreting isn't, well, written in the language of the shell -- it is written in Perl.

The program below shows the invocation of the Perl interpreter at the top of the program -- just like the shell -- and also a quick "Hello World!"

There are a few other features to note. Just like shell, comments begin with a #. Much like C, C++, or Java, statements end with a ";". And, lastly, quotes are interpreted just as they are in shell. "Double quoted strings allow for the interpretation of escapes, such as the newline\n", whereas 'single quoted strings are exactly literal -- no interpretation at all.'

  
  #!/bin/perl

  # The usual hello world program -- and an example of a comment
  print "Hello world.\n"; # Much like C, all statements end in a ;

Scalar Variables

Variables in Perl are typeless. They can hold strings, as well as characters, integers, and decimal numbers. Much like in shell, "typeless" really means "strings available for interpretation". But, in some ways, this interpretation is more natural in Perl. For example, mathematical operations can be performed without need for an external program.

In Perl, scalar variables always have the prefix $. We'll soon see that arrays and associative arrays have different prefixes.

I guess I should also note that literal values can be used just as in other languages, except that, for example, '3' and 3 are equivalent. Why? Everything is typeless and interpreted on the fly.
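Here is a small sketch of what "interpreted on the fly" looks like in practice. The variable names are made up for illustration, and the use strict, use warnings, and my declarations are good-habit additions that the examples in these notes otherwise leave out.

```perl
use strict;
use warnings;

my $quoted   = '3';   # a string that happens to look like a number
my $unquoted = 3;     # a number

my $sum     = $quoted + $unquoted;   # both used as numbers: 6
my $product = '3.5' * 2;             # decimal strings work, too: 7
my $text    = "$quoted$unquoted";    # both used as strings: "33"

print "$sum $product $text\n";       # prints "6 7 33"

if ($quoted == $unquoted) {
    print "same\n";                  # prints "same"
}
```

The operator, not the variable, decides whether a value is treated as a number or as a string.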

The Arithmetic operators

Perl basically uses the same set of arithmetic operators as C -- plus some:

  
  $sum = 1 + 2;
  $difference = $value1 - $value2;
  $product = 5 * $value;
  $quotient = $value1 / $value2;
  $remainder = $value1 % $value2;
  $incrementafter++;
  ++$incrementbefore;

  # and, here's a new and very cool one: The power operator
  $value = $base ** $exponent;

  # The following also work, as usual...more soon
  # <, >, <=, >=, ==, !=

String operators

Strings in Perl seem to have been inspired by shell scripts. As we already discussed, the "" vs '' works the same way. And, variable substitution to form strings works the same way:

  
  $firstname = "Greg";
  $lastname = "Kesden";
  $fullname = "${firstname} ${lastname}";

  # and, ge, gt, le, and lt compare strings. Careful: shell's -ge, -gt,
  # -le, and -lt compare numbers; in Perl, numbers use >=, >, <=, and <

In addition, the "." and "x" operators are also lots of fun. "." is concatenate and "x" literally causes a string to be repeated. Incidentally, the ".=" operator works just fine, too.

Please note: Although I don't think we discussed it in the context of shell scripts, the ${var} notation is also part of shell. It is used to set off the name of a variable within a string. The reason is that in some cases, the variable name and the text that follows it would otherwise become impossible to distinguish, $varfollowedbysomethingelse, for example.

  
  $fullname = $firstname." ".$lastname;
  $treepeat = "Bottle of beer" x 99;

One important note for Java programmers is that Perl treats strings as values, not objects. So, when strings are assigned, values are copied, not aliased via references.
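A quick sketch makes the copy-versus-alias point concrete. The variable names are made up for illustration, and use strict and my are additions these notes don't otherwise use.

```perl
use strict;
use warnings;

my $original = "Greg";
my $copy     = $original;   # the characters are copied, not shared

$copy .= " Kesden";         # changing the copy...

print "$original\n";        # ...leaves the original alone: prints "Greg"
print "$copy\n";            # prints "Greg Kesden"
```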

Arrays

Perl provides traditional indexed arrays -- with some really cool operators. In Perl, array variables begin with an @ instead of a $. But, when referencing a single element of an array, the $ is used, because the value is that of a single element, not the entire array. Array indexing begins with 0. As is the case with the Java ArrayList or C++ STL vector, Perl arrays grow dynamically.

The following code segment declares an empty array and also demonstrates the creation of an array with several initialized elements and the access, by index, of a single item within the array:

  
  @winners = (); # An empty array
  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  print "${contestants[2]}\n"; # Prints Rich

Array operations

The push and pop operations should be pretty intuitive once you're thinking in the right context: think LIFO stack. push adds an item to the end (high index) of the array and then returns the new length of the array. pop removes the last item and returns it. $#arrayname returns the index of the last item in the array -- not its length.

  
  @winners = (); # An empty array
  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  print "${contestants[2]}\n"; # Prints Rich

  $winner = pop (@contestants);
  push (@winners, $winner);

  print "@winners\n";
  print "@contestants\n";

  print "$#contestants\n"; # Prints 3 -- the last index, not the length

Arrays can also be used to build other arrays: an array placed inside a list is flattened into it. Check this one out:


  $wildcard = "Jeff";

  @winners = (); # An empty array
  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  print "${contestants[2]}\n"; # Prints Rich

  $winner = pop (@contestants);
  push (@winners, $winner);

  print "@winners\n";
  print "@contestants\n";

  @nextcontestants = ($wildcard, @winners);

  print "@{nextcontestants}\n";
  

Here's another nice array trick: Initializing a string from an array:

  
  @namesarray = ("Greg", "Mark", "Jeff", "Rich", "Tim");
  $names = "@namesarray";

  print "${names}\n";

But, we need to be careful. Check out the example below. Notice the absence of the quotes. This will assign the length of the array to the variable on the left:

  
  @namesarray = ("Greg", "Mark", "Jeff", "Rich", "Tim");
  $count = @namesarray;

  print "${count}\n";

Arrays can also be used in a bizarre way to make parallel assignments:


  # $item1 = $item1prime
  # $item2 = $item2prime
  ($item1, $item2) = ($item1prime, $item2prime);
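One handy consequence of parallel assignment is swapping two values without a temporary. A minimal sketch, with made-up variable names (use strict and my are additions these notes don't otherwise use):

```perl
use strict;
use warnings;

my ($first, $second) = ("left", "right");

# The right-hand list is built before anything is assigned,
# so no temporary variable is needed.
($first, $second) = ($second, $first);

print "$first $second\n";   # prints "right left"
```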
  

Conditionals

Conditionals in Perl work much like conditionals in C, except that the {}-braces are always required and that they also offer the optional elsif construct, much like the elif that we saw in shell scripting:


  if ($x == $y) {
    # blah blah
  }
  else {
    # ha ha 
  }



  if ($x == $y) {
    # blah blah
  }
  elsif ($x == $a) {
    # blah blah
  }
  elsif ($x == $b) {
    # blah blah
  } 
  else {
    # blah blah
  }
  

The Traditional for and while loops

Perl has for and while loops that exactly mimic the syntax of C, C++, or Java:


  for ($count=0; $count < 10; $count++) {
    print "$count\n";
  }


  
  while ( $option ne "Quit") {
    dosomethinguseful();
  }
  

The foreach Loop

Perl also has a special form of the for loop designed to make array access more convenient. It basically allows you to walk through the array in an iterator-like fashion. Below is an example written both ways: with a foreach loop and with a traditional for loop.

The new-fangled, but super-convenient foreach version:


  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  foreach $contestant (@contestants) {
    print "$contestant\n";
  }
  

The venerable, familiar, good old-fashioned version:


  @contestants = ("Greg", "Mark", "Rich", "Tim", "Angie"); # initialized array

  for ( $count=0; $count <= $#contestants; $count++) {
    print "${contestants[$count]}\n";
  }
  

Files

Perl file manipulation will be very familiar to those who have worked in Java or C. Files are manipulated through file handles, which are basically special identifiers used for open files. They are not prefixed with a $ and, by convention, they are written in all capitals.

Below is a pretty typical example. It opens $filename as DATA_FILE and then uses the <> operator to construct an array representation. The lines are then printed from the array and the file is closed.


  # open the file named $filename and associate it with the handle DATA_FILE
  open (DATA_FILE, $filename);

  # Read each line of the file into the array @lines
  @lines = <DATA_FILE>;

  # Iterate through the lines, printing each one
  foreach $line (@lines) {
    print "$line";   # Notice: No \n. This is already at the end of the $line
  }

  # Close the file
  close (DATA_FILE);
  

Output to a file is often handled simply by specifying the handle before the formatting string of a print, as follows:


  print OUTPUT_FILE "This will land in the file!";
  

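In addition to the plain open shown above, which opens the file for reading, Perl allows files to be opened for other types of access. This is done by placing a <, >, >>, +<, +>, or +>> before the $filename. < reads, > writes (emptying the file first), >> appends, and the + forms allow both reading and writing.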
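Here is a sketch of the three most common modes in action. It assumes a scratch file created with the standard File::Temp module; the handle names OUT and IN are made up for illustration, and the "or die" checks are an addition to what these notes show.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file for the demonstration (the name is generated for us)
my ($tmp, $filename) = tempfile();
close($tmp);

# ">" opens for writing, emptying any existing content
open(OUT, ">$filename") or die "Can't write $filename: $!";
print OUT "first line\n";
close(OUT);

# ">>" opens for appending
open(OUT, ">>$filename") or die "Can't append to $filename: $!";
print OUT "second line\n";
close(OUT);

# "<" opens for reading only
open(IN, "<$filename") or die "Can't read $filename: $!";
my @lines = <IN>;
close(IN);

print @lines;          # prints both lines
unlink($filename);     # clean up the scratch file
```

Newer Perl code usually spells this as a three-argument open, e.g., open(OUT, ">", $filename), which keeps the mode separate from the file name.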

Additionally, it shouldn't come as a surprise that STDIN, STDOUT, and STDERR are predefined file handles. They work as expected.

The last thing I want to mention is that piping to or from a command is really just another form of file I/O. $filename can be replaced with "| command" to write to the command's standard input, or with "command |" to read from its standard output.
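A minimal sketch of both directions follows. It assumes a Unix-like system where the echo and sort commands are available; the handle names WORDS and SORTED are made up for illustration.

```perl
use strict;
use warnings;

$| = 1;   # flush our own output right away so it isn't reordered around sort's

# "command |" : read the command's output as if it were a file
open(WORDS, "echo one two three |") or die "Can't run echo: $!";
my $line = <WORDS>;
close(WORDS);
print $line;                      # prints "one two three"

# "| command" : whatever we print becomes the command's input
open(SORTED, "| sort") or die "Can't run sort: $!";
print SORTED "pear\n", "apple\n", "fig\n";
close(SORTED);                    # sort prints apple, fig, pear
```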

Associative Arrays

Perl has a really cool language feature called Associative Arrays. Associative arrays are like regular arrays, except that they allow the use of strings as keys. So, when storing a value into an array, we don't need to use a numerical index -- we can use something descriptive. They are called "associative" arrays because they associate the key and the value, as contrasted with traditional arrays, which associate a value with a somewhat arbitrary, and certainly less descriptive, numerical index. For example, consider an array of bank balances:

  $bankbalances{"Greg"} = 100.00;
  $bankbalances{"Amy"} = 150.00;

  $mybalance = $bankbalances{"Greg"};
  

We see that the keys are strings: "Greg" and "Amy". We see that the associative array, in effect, pairs them with their values, 100.00 and 150.00, respectively.

For those of you who happen by chance to be familiar with hash tables, it is easy to see that they are the natural implementation for associative arrays. The string key is hashed and the resulting hash is used as the numerical index into an array. For those of you who are not presently familiar with hash tables, no worries -- we'll talk about them soon enough.
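To make the "hash the key, use it as an index" idea concrete, here is a toy sketch. The toy_hash function and the table size of 8 are invented for illustration; this is not how Perl itself implements associative arrays, and it ignores the problem of two keys landing on the same index.

```perl
use strict;
use warnings;

# Toy hash function: add up the character codes, then wrap the sum
# into the range of valid array indices. (Illustrative only.)
sub toy_hash {
    my ($key, $buckets) = @_;
    my $sum = 0;
    foreach my $char (split //, $key) {
        $sum += ord($char);
    }
    return $sum % $buckets;
}

my $buckets = 8;   # size of the underlying array
my @table;         # the underlying traditional array

$table[ toy_hash("Greg", $buckets) ] = 100.00;   # lands at index 5
$table[ toy_hash("Amy",  $buckets) ] = 150.00;   # lands at index 7

my $mybalance = $table[ toy_hash("Greg", $buckets) ];
print "$mybalance\n";   # prints 100
```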

A "Middle Spectrum" Language Feature

Associative arrays are tremendously useful to programmers. They are useful anytime we want effectively constant time, immediate, access to a value via some key. But, we tend not to see them as a language feature in scripting languages or in compiled high-level languages. Instead, they are generally available as a language feature in the languages in the "middle" of the spectrum from scripting languages to compiled languages, languages such as Perl, Python, AWK, &c.

Associative arrays are a type of complex data structure. And, we tend not to see real data structures in scripting languages -- they are hard to share with the programs that do the real work. Instead, we tend to use the file system to store information that can't be held in scalar variables or arrays in memory. The idea is that we generally pipe data into and out of programs that are capable of building their own complex data structures.

Unlike traditional arrays, which are easily projected into memory, associative arrays really are complex data structures. So, instead of hiding this complexity behind normally simple operators, compiled languages generally leave it to libraries, such as C++'s STL or Java's Collections classes.

Unlike scripting languages, the languages in the "middle" of our language spectrum have a different model. Rather than stringing together external programs to solve problems, they make very heavy use of their own data structures and complex libraries. But, unlike compiled languages, which have some concern about efficiency, "middle spectrum" languages are more concerned about convenience than efficiency.

As such, they are more than willing to hide the complexity of associative arrays from the programmer within a first-class language feature since it makes programming more efficient. The fact is that the average Perl programmer would rather have associative arrays and not care about the implementation than to have to make use of a library for a very commonly used feature.

Syntax Detail

Recall that, in Perl, scalar variables begin with a $-dollar sign and traditional arrays begin with an @-sign. Associative arrays begin with a %-percent sign. As with traditional arrays, when referencing an individual element of an associative array, a $-dollar sign is used. As with traditional arrays, the reason is simple -- a single element of an array is a scalar value, not an array.

Instead of using []-brackets to identify individual elements, associative arrays use {}-squiggles. Take a second look at our prior example to see the basic use cases: setting and getting the value of an element:

  $bankbalances{"Greg"} = 100.00;
  $bankbalances{"Amy"} = 150.00;

  $mybalance = $bankbalances{"Greg"};
  

Initializing an Associative Array

An associative array can be initialized as a whole. To do this, we just feed it a list of key-value pairs. The individual pairs are just run together:

  %bankbalances = ("Greg", 100.00, "Amy", 150.00);
  

It is very important to notice that the key-value pairs are grouped together by their relative position within the initializer list. As a consequence, if we want to have an empty value, we can't just leave it out completely -- we need to save the place with an explicit placeholder, such as undef, within the list. Leaving nothing at all between two commas won't do: Perl ignores the gap, and the remaining items pair up wrongly. Notice that "Greg" has no value below -- but we still preserved its space within the list. Having discussed this syntax, I'd also like to observe that it is much better to set some initial value, any initial value, than to leave it out. At least that way, you'll know where you are starting.

  %bankbalances = ("Greg", undef, "Amy", 150.00);
  

Iterating Through Associative Arrays

It is possible to iterate through an associative array using a foreach loop, just as it is with a traditional array. Notice the addition of the keys or values function. This controls whether the iteration traverses the keys or the values.

  foreach $client (keys %bankbalances) {
    print "Client: $client\n";
  }

  foreach $amount (values %bankbalances) {
    print "Amount: $amount\n";
  }

  # We can look up the value given the key so that we can, in effect,
  # iterate through (key, value) tuples
  foreach $client (keys %bankbalances) {
    print "$client: $bankbalances{$client}\n";
  }
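Since keys hands back the keys in no particular order, a report usually reads better if the keys are sorted first. A small sketch of that, using the built-in sort function (the @report array is made up for illustration, as are the use strict and my declarations):

```perl
use strict;
use warnings;

my %bankbalances = ("Greg", 100.00, "Amy", 150.00);

# keys returns the keys in no particular order, so sort them first
my @report;
foreach my $client (sort keys %bankbalances) {
    my $line = "$client: $bankbalances{$client}";
    push(@report, $line);
    print "$line\n";        # prints "Amy: 150" and then "Greg: 100"
}
```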
  

Some Other Fun Stuff

It is possible to assign associative arrays to traditional arrays and vice-versa. When converting from a traditional array into an associative array, Perl groups the elements into key-value pairs by assuming that they alternate, e.g., key1, value1, key2, value2, key3, value3, ...keyN, valueN. When an associative array is assigned into a traditional array, the traditional array will be organized the same way, as a list of alternating keys and values. If an associative array is assigned to a traditional array and then back, there is no guarantee that, if it is again converted to a traditional array, the pairs will be in the same order. As you'll learn when we talk about hash tables, because of "collisions", the order of insertion can affect the organization.

  @flatbankbalances = %bankbalances;
  %bankbalances = @flatbankbalances;
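Here is a short sketch of the round trip. The %restored name is made up for illustration; the order of the pairs in the flat array is not predictable, so the sketch only checks the count and the lookups.

```perl
use strict;
use warnings;

my %bankbalances = ("Greg", 100.00, "Amy", 150.00);

# Associative array -> traditional array: alternating keys and values
my @flatbankbalances = %bankbalances;
my $count = @flatbankbalances;     # 4: two keys plus two values

# Traditional array -> associative array: paired up as key, value, key, value
my %restored = @flatbankbalances;

print "$count\n";                  # prints 4
print $restored{"Greg"}, "\n";     # prints 100
```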
  

Regular Expressions and String Manipulation

These days, there are many great reasons to program in Perl. One of them happens to be the original one: its natural ability to play with strings and, in particular, regular expressions.

The following two operators, =~ (match) and !~ (no match), are among the most basic. =~ returns true if a substring matching the regular expression is found in the supplied string and false (not found) otherwise. By itself, it does not count the matches. The "not in" operator !~ returns true if no match is found.

The general forms are as follows:


    $found = ($somestring =~ /regular expression/); 
    $notin = ($somestring !~ /regular expression/); 
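A small sketch of both operators, plus one common way to count matches: match with the /g (global) flag in list context and take the length of the resulting list. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $somestring = "abc 123 def 456";

my $found = ($somestring =~ /[0-9]+/) ? "yes" : "no";   # "yes"
my $notin = ($somestring !~ /xyz/)    ? "yes" : "no";   # "yes": xyz never appears

# To count matches: match with /g in list context, then take the
# length of the resulting list.
my @numbers    = ($somestring =~ /[0-9]+/g);   # ("123", "456")
my $nummatches = @numbers;                     # 2

print "$found $notin $nummatches\n";           # prints "yes yes 2"
```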
  

If you group parts of a regular expression within ()-parentheses, and the regular expression is matched, the text matched by each ()-group will be saved into a special variable -- much as was the case with, for example, sed. These special variables are $1, $2, etc. Careful! Careful! Everyone wants to believe that these variables represent command-line arguments as they do in shell. They don't! It is also worth noting that, although not preferred, Perl will accept the \1, \2, \3, etc, notation common in many other programs. Regardless, here's a quick example:

  if ( $somestring =~ /([0-9]+)[a-zA-Z]*([0-9]+)/) {
    # $1 is the number at the beginning of the match
    # $2 is the number at the end of the match
  } else {
    # $1 and $2 are unchanged
  }
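Here is a runnable sketch along the same lines. It adds ^ and $ anchors, which are an addition to the pattern above, so that the numbers really must sit at the beginning and the end of the string. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $somestring = "42abc7";
my ($first, $last);

if ($somestring =~ /^([0-9]+)[a-zA-Z]*([0-9]+)$/) {
    # Copy the captures right away; the next successful match overwrites them
    ($first, $last) = ($1, $2);
}

print "$first $last\n";   # prints "42 7"
```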
  

Perl also has a special variable, $_, which represents the default string. Several important operators act on this string by default. For example, Perl can do sed-style searching and replacing. When this type of expression is used on its own, without =~, it acts upon $_:


  $_ = "This is an example string: Hello World";

  $changes = s/World/WORLD/g;

  print "$_\n"; # "World" is now WORLD 

  print "$changes\n"; # The number of substitutions made; in this case, 1
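The same substitution can be aimed at any variable, instead of $_, by binding it with =~. A minimal sketch, with a made-up $greeting variable:

```perl
use strict;
use warnings;

my $greeting = "Hello World, goodbye World";

# Bind the substitution to $greeting instead of $_
my $changes = ($greeting =~ s/World/WORLD/g);

print "$greeting\n";   # prints "Hello WORLD, goodbye WORLD"
print "$changes\n";    # prints 2
```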
  

The tr function is also very powerful. It acts much like the tr command. It allows the user to define a mapping of character-for-character substitutions and applies them to $_. Each character in the first field will be replaced by the corresponding character in the second field. As with the s function above, it returns the number of substitutions:


  $changes = tr/abc/123/; # a becomes 1, b becomes 2, c becomes 3
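As with s, tr can be bound to a named variable with =~. A minimal sketch, with a made-up $text variable:

```perl
use strict;
use warnings;

my $text = "a cab";

# Bound with =~, tr works on $text instead of $_
my $changes = ($text =~ tr/abc/123/);

print "$text\n";      # prints "1 312"
print "$changes\n";   # prints 4
```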
  

Please note: In the examples above, there are no quotes around the tr and s expressions. This is important. If the expressions are quoted, they'll be interpreted as strings and assigned, instead of interpreted as regex operations and performed.

Greedy, Reluctant, and Possessive Quantifiers

By default, the quantifiers (?, +, *, and {x,y}) are greedy: they match as much as they can while still letting the rest of the expression match. We can change this default behavior to so-called reluctant quantifiers by appending a ?-mark, e.g., "??", "*?", "+?", or "{x,y}?". Reluctant quantifiers match no more than is necessary to make the expression match.

Lastly, Perl allows quantifiers to be annotated as possessive by adding a "+", e.g., "?+", "*+", "++", and "{x,y}+". These are nasty, selfish quantifiers. As before, they are processed from left-to-right, but they will eat as much as they can and never give any of it back -- even if that leaves nothing to satisfy parts of the expression to the right, in which case the match fails.
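The three behaviors are easiest to see side by side on the same string. This is a sketch with a made-up sample string; the possessive form assumes Perl 5.10 or newer, where that syntax was introduced.

```perl
use strict;
use warnings;

my $text = "<b>bold</b>";

# Greedy: .* grabs as much as it can, then backs off just far enough
# for the final > to match
my ($greedy) = ($text =~ /(<.*>)/);        # "<b>bold</b>"

# Reluctant: .*? grabs as little as it can
my ($reluctant) = ($text =~ /(<.*?>)/);    # "<b>"

# Possessive: .*+ grabs everything and never gives any back,
# so nothing is left for the final > and the match fails
my $possessive = ($text =~ /<.*+>/) ? "match" : "no match";

print "$greedy\n";       # prints "<b>bold</b>"
print "$reluctant\n";    # prints "<b>"
print "$possessive\n";   # prints "no match"
```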