File Input
Sometimes we want to read data from files or write data to files. In practice, this can become very tangled, but fundamentally it is simple. Let's create a file called "input.txt", as below:
one
1
2
two
And, given "input.txt", let's take a look at the example below:
#!/usr/bin/python

inputfile = open("input.txt")

line = inputfile.readline()
print line       # (but, nothing to see here)

line = inputfile.readline()
print line       # (or here)

line = inputfile.readline()
print line,      # NOTICE THE COMMA HERE

line = inputfile.readline()
print line,      # AND THE COMMA HERE

# Done with this use
inputfile.close()

# Re-open it
print "After reopen -- getting all the lines at once"
inputfile = open("input.txt")
lines = inputfile.readlines()
print lines[0]
print lines[1]
print lines[2]
print lines[3]

# Done with the file again
inputfile.close()
And, pay attention to its output:
one

1

2
two
After reopen -- getting all the lines at once
one

1

2

two

We can also read all of the lines of the file into a List of lines using readlines(), e.g. "lines = inputfile.readlines()".
What are the take-aways?
- open() opens the file with the name we provide, and gives us back a "file object" that represents it
- We can use the readline() "method", which is like a function associated with an "object", to read a line from the file represented by the file object
- the "dot operator" . associates a method with an object. In this way the object performing the readline() is clearly denoted.
- When we print a line we have read from a file, the output is double spaced. This is because print automatically adds a newline, and the line we read itself ends in a newline, so we end up printing both. Placing a comma at the end of the print statement tells print not to add its newline, so we get only one: the one at the end of the line we read. The comma syntax might seem strange, but it makes some sense: it was developed to make it easy to print multiple items on the same line, and listing items separated by commas, such as "a, b, c, d", is a very common, very human notation.
- readlines() reads an entire file into a List of strings, one line per string.
- The close() method indicates that we are done with the file object. Once closed, it is no longer connected to our file and no longer keeps track of our position within it, e.g. how much we've read.
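The points above can be sketched in a few lines. This is only an illustration, not part of the original example: it writes its own four-line sample file into a temporary directory (the use of `tempfile` and the names `path`, `first`, and `rest` are assumptions made here), and it uses single-argument `print(...)` calls so that it runs under both Python 2 and Python 3. It also shows rstrip("\n") as one alternative to the trailing comma for avoiding double spacing.

```python
import os
import tempfile

# Create a small four-line sample file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "input.txt")
samplefile = open(path, "w")
samplefile.write("one\n1\n2\ntwo\n")
samplefile.close()

inputfile = open(path)

# readline() returns one line, including its trailing newline
first = inputfile.readline()

# readlines() returns every remaining line as a List of strings
rest = inputfile.readlines()

# We are done reading, so close the file object
inputfile.close()

# Removing the trailing newline means print does not double space
print(first.rstrip("\n"))
for line in rest:
    print(line.rstrip("\n"))
```

Note that readlines() picks up where readline() left off, because the file object tracks our position in the file until it is closed.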
File Output
Writing to a file is very similar to reading from a file. Notice, though, the added argument to open(), specifically the "w". We are now giving open() the name of the file and asking it to allow us to write to it. Notice also the "\n". This is called the "newline escape character". It is a way of asking the system to insert a new line, like hitting enter, at that point.
#!/usr/bin/python

outfile = open("outfile.txt", "w")
outfile.write("Hello world!\n")
outfile.write("Hello great, wonderful world!\n")
outfile.close()
The file we created is shown below. We can view it, for example, via "more outfile.txt" from the command prompt.
Hello world!
Hello great, wonderful world!
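As a quick check, the following sketch writes the same two lines and then reads them back. It is an illustration under assumptions: the file goes into a temporary directory rather than the current one, and the names `path`, `infile`, and `lines` are introduced here. It is written to run under both Python 2 and Python 3.

```python
import os
import tempfile

# Write into a temporary directory so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "outfile.txt")

# "w" asks open() for write access; the file is created, or emptied if it exists
outfile = open(path, "w")
outfile.write("Hello world!\n")
outfile.write("Hello great, wonderful world!\n")
outfile.close()

# Read the file back; each "\n" we wrote ends one line
infile = open(path)
lines = infile.readlines()
infile.close()

print(len(lines))
```

write() does not add a newline on its own. Without the "\n" at the end of each string, both messages would land on a single line of the file.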
Stripping White Space
People tend to be very insensitive to white space in strings -- we just don't notice it very easily. As a result, when processing strings entered by humans, we often want to strip out the extra white space (for example, spaces and tabs), leaving the rest of the string. Three methods are of help to us:
- lstrip() -- strips leading white space, i.e. that on the left side
- rstrip() -- strips trailing white space, i.e. that on the right side
- strip() -- strips leading and trailing white space, but leaves white space in the middle
Please consider the example below:
#!/usr/bin/python

spacedPhrase = " Greetings and Welcome "

print "phrase: ---" + spacedPhrase + "---"
print "strip: ---" + spacedPhrase.strip() + "---"
print "lstrip: ---" + spacedPhrase.lstrip() + "---"
print "rstrip: ---" + spacedPhrase.rstrip() + "---"
print "lstrip + rstrip: ---" + spacedPhrase.lstrip().rstrip() + "---"
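To make the differences easier to see, here is a small sketch that stores the results in variables instead of printing them inline. The sample string, with its leading tab and trailing newline, and the names `stripped`, `leftOnly`, and `rightOnly` are assumptions chosen for illustration. It runs under both Python 2 and Python 3.

```python
# Leading spaces and a tab, trailing space and a newline
spacedPhrase = "  \tGreetings and Welcome \n"

stripped = spacedPhrase.strip()     # "Greetings and Welcome"
leftOnly = spacedPhrase.lstrip()    # "Greetings and Welcome \n"
rightOnly = spacedPhrase.rstrip()   # "  \tGreetings and Welcome"

# repr() makes the remaining white space visible
print(repr(stripped))
print(repr(leftOnly))
print(repr(rightOnly))
```

All three methods remove tabs and newlines as well as spaces, and none of them touches the spaces between the words.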
More String Methods and Functions
Before writing any function to manipulate a string, see if the Python libraries already provide it. Strings themselves are very rich objects and have many methods for manipulating themselves. Beyond that, there are a few additional string functions.
Back in "The Day", there were only functions to manipulate strings, because strings were more like simple data types than objects. These days, strings are rich objects. In many cases you'll find both an old function and a new method that do essentially the same thing. In these cases, it is considered proper form to use the method, not the function. In some cases, though, there is no equivalent method, in which case the function is the only way to go -- and perfectly fine.
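The difference between a method and a function is mostly one of calling style, as the sketch below suggests. The sample phrase and variable names are assumptions for illustration. It uses len() as an example of a built-in function that we call by passing the string in. The older function forms mentioned above are not shown, since their availability depends on the Python version.

```python
phrase = "Greetings and Welcome"

# Methods: called on the string itself with the dot operator
shouted = phrase.upper()            # "GREETINGS AND WELCOME"
position = phrase.find("and")       # index where "and" starts
dashed = phrase.replace(" ", "-")   # "Greetings-and-Welcome"

# Built-in function: the string is passed in as an argument
length = len(phrase)

print(shouted)
print(dashed)
```

None of these calls changes `phrase`; each one returns a new value.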
In class we perused the official documentation. You should do the same as you study -- and again any time you need a quick reference:
Inverted Index: Nested Collections Example
Today we built a data structure known as an Inverted Index. It is an index for quickly finding the locations of words in text files. We implemented it as a Dictionary that maps each word in the file to a List of the numbers of the lines in which the word appears.
Dictionaries are the logical choice for mapping words to lists. For the lists themselves, Tuples would have been a poor choice, because each list needs to change as we are building it. We could have used Sets, but then we'd have to sort the line numbers, since a Set does not maintain them in order internally. As a result, Lists are the correct choice for this application.
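The shape of the data structure can be seen in a tiny sketch. The three sample lines are made up for illustration, and this is a simplified stand-in, not the class example itself: it does no cleaning of words. It runs under both Python 2 and Python 3.

```python
# Made-up sample text, one string per line
lines = ["the cat sat", "the dog ran", "a cat ran"]

# Dictionary mapping word --> List of line numbers
index = {}
for lineNumber in range(len(lines)):
    for word in lines[lineNumber].split():
        if word not in index:
            index[word] = []
        index[word].append(lineNumber)

# A List keeps the line numbers in the order they were added
catLines = index["cat"]

# A Set could hold the same numbers, but promises no order,
# so we would have to sort before using them
catSorted = sorted(set(catLines))

print(catLines)
```

Because we visit the lines from first to last, each List ends up in increasing line-number order without any sorting.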
invertedIndexExample.py
#!/usr/bin/python

# This function removes punctuation, etc, from words
# This is done to make sure that "hello," or "hello." are seen as "hello"
# This was mostly an exercise in using string library methods for practice
def cleanWord(word):
    # Get rid of spaces and dashes, convert to upper case as a canonical form
    word = word.strip().upper().replace("-", "")

    # What about punctuation in the middle?
    # We'll index only the first part

    # Find first alphanum
    begin = 0
    while begin < len(word):
        if word[begin].isalnum():
            break
        begin += 1

    # Find the end of that first run of alphanums
    end = begin
    while end < len(word):
        if not word[end].isalnum():
            break
        end += 1

    # Keep only that slice
    word = word[begin:end]

    return word


# Gets a List containing each line of the file
def getFileLines(fileName):
    inputFile = open(fileName, "r")
    lines = inputFile.readlines()
    inputFile.close()

    return lines


# This actually builds the Inverted Index
# It returns a Dictionary mapping word-->List[line numbers]
def buildIndex(lines):
    index = {}      # Create an empty Dictionary
    lineNumber = 0  # Number the first line 0

    # For each line, we know its number, walk through words adding to index
    for line in lines:
        # Strip leading and trailing space from line and split into list of words
        line = line.strip()
        words = line.split()

        # For each of those words
        # 1. Clean it to make it upper-case and w/out attached punctuation
        # 2. Add it to the index
        #    2a. Note that, if we haven't seen the word before, the dictionary
        #        has no entry for it, so looking it up raises a KeyError.
        #        So, we need to map a list, then we can add the line number
        #        to the list (or, alternately, map a list with just the
        #        line number).
        # Remember, we are mapping the word to a List.
        for word in words:
            word = cleanWord(word)
            if word == "":
                continue  # Nothing left after cleaning, e.g. only punctuation
            try:
                index[word].append(lineNumber)
            except KeyError:
                index[word] = []
                index[word].append(lineNumber)
                # Could also do in one line, as below:
                # index[word] = [lineNumber]

        lineNumber += 1  # Get ready for the next line, which has the next number

    # Processed all words of all lines -- return the complete index (Dictionary)
    return index


# This function uses our index
# Given a word, it looks it up in the index
# Then it either prints a message telling the user it isn't in the document
# Or, it uses the index to get the list of line numbers, and then prints each
# one via the list of lines we originally created and used to build the index
def printMatchingLines(lines, index, word):
    # Canonicalize the word, as we did to create the index
    word = cleanWord(word)

    try:
        lineNumbers = index[word]  # This will raise a KeyError, if not there
    except KeyError:
        print word + " was NOT found (sowwy!)."  # So we tell the user
        return

    # Print the matching lines
    print word + " was found as below..."
    for lineNumber in lineNumbers:
        print str(lineNumber) + ": " + lines[lineNumber],
    print ""


# Index the declaration of independence
lines = getFileLines("declaration.txt")
index = buildIndex(lines)

# Look up some words, print matches (and no-match message)
printMatchingLines(lines, index, "friends")
printMatchingLines(lines, index, "byeByeBrits")
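The try/except blocks above are one way to cope with a word that is not yet in the Dictionary. As a hedged alternative, the sketch below uses the Dictionary methods setdefault() and get() for the same two jobs. The function names `buildSmallIndex` and `findLines` and the sample lines are hypothetical, introduced only for this illustration, and the sketch skips the punctuation cleaning done by cleanWord(). It runs under both Python 2 and Python 3.

```python
def buildSmallIndex(lines):
    # Hypothetical simplified builder: upper-case words, no punctuation cleaning
    index = {}
    for lineNumber, line in enumerate(lines):
        for word in line.upper().split():
            # setdefault returns the existing List, or stores and returns a new one
            index.setdefault(word, []).append(lineNumber)
    return index


def findLines(index, word):
    # get returns the default (an empty List) instead of raising a KeyError
    return index.get(word.upper(), [])


# Made-up sample text, one string per line
lines = ["Hello world", "hello again", "goodbye world"]
index = buildSmallIndex(lines)

print(findLines(index, "hello"))
print(findLines(index, "byeByeBrits"))
```

Which style to prefer is a judgment call: try/except makes the "not found" case explicit, while setdefault() and get() keep the common case to one line.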