Oh … snips and snails and puppy dog tails – that’s what databases are made out of … OK, so they’re not. Databases are also referred to as files or datafiles. Sometimes a database is split into subfiles, but not necessarily. The ERIC database is a good example of a database that is split into two subfiles – the Current Index to Journals in Education (CIJE) and Resources in Education (RIE). The important thing to remember is that some databases will have subfiles and some won’t. When they exist, they can sometimes be helpful to craft a search strategy. All databases will be made up of database records. Each database record is made up of fields. Sometimes fields are divided into subfields. Perhaps an example will help.

Here’s an example from the online catalog of a library. The online catalog is an example of a database. The record is an example of the familiar bibliographic record. It simply describes a book. The fields should also be familiar: author, title, publisher, etc. The book happens to be one of my favorites. It was my textbook for a course in US and Canadian geography. The book suggests that North America could be split into nine different nations that would make more sense than the three that we currently have. The author explains why he picks the borders that he does. One memorable one was a diagonal line from northwest Connecticut to southeast Connecticut. People to the north and east of this line have a very strong tendency to be Boston Red Sox fans and people to the south and west of the line tend to be New York Yankees fans. Sorry Mets fans … I don’t think he paid them much notice!

The record was too long to fit on one screen. Maybe we should look at the underlying structure …

I know! I know! Eeek! It’s the dreaded MARC record – well, at least some of the variable fields. The numbers on the extreme left are the tags for the fields of the records. They’re unique to the MARC record and you’ll often see them displayed as the more human friendly “Personal Author”, “Title”, etc. Between the colons are what are known as “indicators.” They are truly unique to the MARC record. From my forever dimming memory, I can only recall that the “4” means to ignore the first four character spaces or “The “ – the initial article. The “|b” or “|c” are known as “delimiters” and signify the beginnings of subfields of a field.
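
For anyone who likes to see the chopping done by a program, here is a rough Python sketch of pulling a display line apart into its tag, indicators, and subfields. The sample line, the function names, and the "tag:indicators: content" layout are assumptions made up for illustration; this is not a real MARC parser, and the line is not copied from the slide.

```python
def parse_marc_display_line(line):
    """Split a 'TAG:IND: content' display line into tag, indicators, subfields.

    Assumes the layout described above: a tag on the far left, indicators
    between colons, and '|x' delimiters starting each later subfield.
    """
    tag, indicators, content = line.split(":", 2)
    pieces = content.strip().split("|")
    # Text before the first delimiter is treated here as subfield 'a'.
    subfields = [("a", pieces[0].strip())]
    for piece in pieces[1:]:
        if piece:
            subfields.append((piece[0], piece[1:].strip()))
    return tag.strip(), indicators, subfields


def filing_title(title, second_indicator):
    """Skip the number of leading characters named by the second indicator."""
    skip = int(second_indicator) if second_indicator.isdigit() else 0
    return title[skip:]


# Hypothetical line, invented for this sketch.
sample = "245:14: The sample title :|ba made-up subtitle /|cA. N. Author."
tag, indicators, subfields = parse_marc_display_line(sample)
```

With the indicator "4", filing_title drops the four characters of "The " so the title would file under "sample" rather than under the article.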

… and this would be the continuation of the underlying structure. Can you pick out the subfields?

Now I’ll try my very best not to confuse you. This is part of an actual database that I built so that my wife could use a title listing to know which books were already in my “presidential library.” I’m going to try to describe how database records are “parsed” or chopped up into parts that the computer can recognize as “words.” The end result would be the production of an “inverted file” – an alphabetized listing of what words appear in what position of which record. In addition to this example, see Walker and Janes pages 55 through 63 and the GEP DIALOG Lab Workbook pages 3-6 through 3-10. This screen and the next two show record numbers (I’ve designated them ‘RN’) 101, 102, and 103. My database has three fields: Author (AU), Title (TI), and Subject (SU). We’re going to build the inverted file for this tiny database of three records by hand.

Familiarize yourself with each record … you should begin to wonder … just how is Matt going to chop up his database?

This is a very good book by the way …

How we chop the record up into words and phrases will lead to what the inverted file will look like.

Well, I’ve been introducing an awful lot of vocabulary through these first ten slides. Remember that Walker and Janes has a pretty nice glossary. The existence of a basic index versus additional indexes is extremely important for searchers to understand. From a searcher’s perspective, whenever you search a database without specifying a particular field to search, the basic index is searched. It’s kind of a default instruction for the computer to follow. It’s also VERY important to realize that the basic index in one database can be different from the basic index in another database. To complicate matters even further, a database that you’ll find on the DIALOG service might also be found in EBSCOHost or FirstSearch. Even though they host some of the same databases, they may choose to index them in different ways. Sometimes a database host might choose to leave certain fields completely out of the indexing process. The important lesson to learn here would be that the savvy librarian must be ready for anything and should endeavor to find out what is defined as the basic index and which additional indexes exist for all the databases that they search. Thankfully you don’t have to memorize all that junk! You just need to know how to look up this information and how to best articulate the situation to your end users. We’ll learn about how to do this with Dialog and that will provide you with a very good example for the future.

Oh, here comes yet another bit of jargon – the “Stop Words.” This is a list of words that the computer is instructed to ignore as it builds the inverted file. It’s as if the words don’t even exist. Note that DIALOG’s list is very short – shorter than many hosts of databases. They even index the word “a” – remember that phrases like “type A” or “grade A” can often be found. DIALOG and some other database hosts feel that it’s important for their users to easily find them.

Now, this is where we start to chop up the database. What’s up with the author field? Both the last name and first name on one line – why isn’t the name chopped into a first and last name? Well, our author index … an additional index … will, by my design, be a phrase index. That’s the only way the author will appear in our index: Lastname, Firstname.

“101” is the record number (much smaller than the record numbers that you’ll see in DIALOG). “AU” is the field. “1” is the first phrase.

For another example, “psychohistorical” is also in record 101: it’s in the title field and it’s the 6th word.
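
The chopping of one field into entries can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the stop list is a short stand-in (note that it leaves out "a"), the title is invented rather than taken from the slides, and the sketch assumes that a stop word is skipped but still uses up its position in the field.

```python
import re

# Assumed stop list for illustration; short, and it deliberately omits "a".
STOP_WORDS = {"an", "and", "by", "for", "from", "of", "the", "to", "with"}


def word_postings(record_number, field, text):
    """Return (word, record, field, position) tuples for one field.

    Assumption: a stop word is left out of the index but still counts
    as a position, so later words keep their true place in the field.
    """
    postings = []
    words = re.findall(r"[a-z0-9]+", text.lower())
    for position, word in enumerate(words, start=1):
        if word not in STOP_WORDS:
            postings.append((word, record_number, field, position))
    return postings


# Hypothetical title, not one of the three records on the slides.
entries = word_postings(101, "TI", "The history of a small town library")
```

Each tuple reads the same way as the entries on the slide: the word, then the record number, the field, and the position within the field.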

And the saga continues …

Wait a minute! This name is chopped up into words! Yup! I determined that design too. We’re in the middle of chopping up a subject field into words. Even “1913” is a word according to a computer – aaw! They never could spell.

Whoa Milhouse! What’s this? Well, this madcap designer has decided that the subject field of his database is both word and phrase indexed. We can do that you know! Even within DIALOG you’ll see variations in whether a particular field is word or phrase indexed. You simply have to memorize all this stuff for 500 plus databases. Nah! I’m just kidding. Never memorize something that can easily be looked up. DIALOG has a set of tools called “Bluesheets.” There’s a Bluesheet for each database that goes through each and every field and indicates how the field is indexed … or if it’s indexed.

And so it goes …

Thankfully we’ll only take this until we get the idea. There’s a point where one begins to really appreciate what we can assign to computers to do. Let’s continue by alphabetizing our list.

And, numbers would kick off our alphabetized list of words …

Each ‘word’ has its record number, field, and position within the field in our list.
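
Here is a rough Python sketch of handing the alphabetizing over to the computer. The three records are invented stand-ins for record numbers 101 through 103, the choice of which fields are word indexed or phrase indexed simply follows the design described in these notes, and stop words are not removed in this short version.

```python
import re

# Made-up stand-ins for records 101-103; the real records are on the slides.
RECORDS = {
    101: {"AU": "Smith, Ann", "TI": "Presidents in 1913", "SU": "Presidents"},
    102: {"AU": "Jones, Bob", "TI": "White House years", "SU": "White House"},
    103: {"AU": "Brown, Cy", "TI": "Campaign of 1912", "SU": "Elections"},
}
WORD_FIELDS = ("TI", "SU")   # chopped into single words
PHRASE_FIELDS = ("SU",)      # also kept whole, as a phrase


def build_basic_index(records):
    """Return a sorted list of (term, record, field, position) entries."""
    entries = []
    for rn, fields in records.items():
        for field in WORD_FIELDS:
            words = re.findall(r"[a-z0-9]+", fields[field].lower())
            for pos, word in enumerate(words, start=1):
                entries.append((word, rn, field, pos))
        for field in PHRASE_FIELDS:
            entries.append((fields[field].lower(), rn, field, 1))
    # set() drops the duplicate made when a one-word subject is both
    # a word entry and a phrase entry.
    return sorted(set(entries))


def build_author_index(records):
    """The separate additional index: one 'Lastname, Firstname' phrase per record."""
    return sorted((f["AU"].lower(), rn, "AU", 1) for rn, f in records.items())


basic_index = build_basic_index(RECORDS)
author_index = build_author_index(RECORDS)
```

Because digits sort ahead of letters in a plain string sort, the numbers land at the top of the list.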

Finally we get to the “a’s” in the list.

Hopefully I haven’t screwed up the numbers of the records. This is hard enough to follow as it is.

We also can’t forget the separate additional index that we’ve built. There are only three entries – they’re the authors of the three books.

Oooh! How am I going to do this? I’ll tell you what. I’ll do my best to go through the ERIC database Bluesheet using the Camtasia Studio software – you’ll be able to listen to my explanation and I’ll try to point out what is important. The link in the slide is to the DIALOG Bluesheet for the ERIC database. One of the things that I’ll show is the pair of example records – one is a journal record from the CIJE subfile of ERIC and the other is a document record from the RIE subfile of ERIC.

The link above is to my work web page at Carnegie Mellon. You can actually view the structure that exists within any web page. Use the “VIEW” pull-down menu on your browser and then choose “Source.” The first code recognized by a computer would be the “<HTML>” tag that you’ll see about five lines down (after a bunch of remarks that the computer ignores). This would be equivalent to a document type – if you’ve ever done an advanced Google or Yahoo search, you’ll notice that you can restrict your search to a type of document. Right after the “<HTML>” tag you’ll see the beginning of the “<HEAD>” of the document. The chief portion of the head of an HTML document is the “<TITLE>.” Note that I put a bunch of different variants of my name in the title of my page. I did that so that if someone looks for me as “Matt Marsteller” or “Matthew Marsteller” or “Matthew R. Marsteller,” there’s a good chance that this page will turn up, because most search engines give greater weight to the title of a web page. The title of a web page is what shows up in the blue bar at the top of your browser window. The web page also has “Web Page of Matthew R. Marsteller” in big bold letters. If you look in the source code (about a third of the way through the code), you’ll see that this phrase is enclosed in “<H1>” and “</H1>” tags. This is a heading within a document and can also be given more prominence by a search engine. The key operative phrase is “can be given more prominence” – nothing is guaranteed. When search engines rank their search results, a lot of things come into play. Ranking algorithms (rules the computer will follow) are often redesigned. We’ll explore searching of the Internet at the end of our study together. For now, I hope that you’ll appreciate that the typical web page on the Internet can, and usually does, have structure that search engines can make use of. Perhaps it’s a bit more crude than we’re used to … but it’s there!
There are other concerns such as … is it standard HTML code? The answer is, unfortunately, no. One thing that is popular for people to do is to use a word processor like Microsoft Word to produce their HTML code (using the “SAVE As” function). It’s easy to do, but the result is NOT standard HTML. Most browsers will interpret it correctly (kind of) – hopefully most search engines will as well, eh?
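
To make the idea of structure in a web page concrete, here is a small Python sketch that pulls out the title and the top-level heading using the standard library's html.parser module. The page text is a hypothetical stand-in shaped like the description above, not the actual source of the page, and the class name is invented for this sketch. It says nothing about how any real search engine weighs these fields.

```python
from html.parser import HTMLParser


class TitleAndHeadingParser(HTMLParser):
    """Collect the text found inside <title> and <h1> elements."""

    def __init__(self):
        super().__init__()
        self._current = None
        self.found = {"title": [], "h1": []}

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag in self.found:
            self._current = tag
            self.found[tag].append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.found[self._current][-1] += data


# Hypothetical page, loosely shaped like the one described above.
PAGE = """<HTML><HEAD><TITLE>Matt Marsteller, Matthew R. Marsteller</TITLE></HEAD>
<BODY><H1>Web Page of Matthew R. Marsteller</H1><P>Hello.</P></BODY></HTML>"""

parser = TitleAndHeadingParser()
parser.feed(PAGE)
```

After the call to feed, parser.found holds the title text and the heading text as two separate "fields," which is the kind of structure an indexer could choose to treat differently from the body text.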

This should reinforce the “First Contact” tutorial a bit. DIALOG uses the “?” (question mark) as the command prompt. The command prompt is the symbol that the computer system uses when the searcher needs to supply input. This slide shows the straightforward example of how to start in … or switch to … a database while you are connected to DIALOG. This would be similar to clicking a checkbox to choose the database or databases that you want to search within a collection of databases like EBSCOHost or pointing your browser to the proper URL for a library’s online catalog.

This simply shows the system’s response to the BEGIN command. We’re in the ERIC database and another command prompt is showing.

The SELECT command is the next tool to learn. In this example, I’ve tasked the computer to find all records with the word “mathematics” in the basic index of the ERIC database. It will look in all the fields of each record that are designated as part of the basic index. Again, you’ll find a description of the basic index in each and every Bluesheet.

If you’ve viewed the tutorial, the search should look a little familiar. The number of records retrieved is probably a bit lower – some time has passed since I did the example search for the PowerPoint slides. In this slide, I searched for the word “fear” after completing the search for the word “mathematics.” Note that DIALOG has assigned a set number for each set of records. I’m able to use these set numbers (“S1” and “S2”) to find records where both words show up (again, simply in the basic index). A third set, “S3,” is created and the system responds with another prompt. The word “AND” that I put between “S1” and “S2” is significant. The DIALOG system recognizes that as a special word known as a Boolean operator …

These are the three main Boolean Operators. Note that the description of the AND Operator matches what I was trying to accomplish with finding records on the fear of mathematics. Let’s explore each separately …

Back in third grade, I first learned about set theory and Venn Diagrams. I have to admit that I never foresaw a career where I tend to draw Venn diagrams several times a day! Let’s let the left circle represent all the records in a database that contain the word “mathematics.” The circle on the right represents the set of all records that contain the word “fear.” Thanks to people like me that struggled mightily with multiplying and dividing fractions (at first) in fifth grade, we can expect that educators have looked into the problem of kids being a bit afraid of math. The green shaded area represents records that contain both words. The circles aren’t drawn to scale of course. The set of records with the word “mathematics” was much larger … more than ten times larger. Hopefully this makes the search more understandable. One thing that you should realize is that the result of using an AND Operator should be a set of records that is SMALLER than either of the two sets that you started with (it could be the same size as the smaller of the two sets … but that is not likely to happen). It’s important to note that using too many AND Operators may yield sets that are too small to cover the topic. Start with the most significant concepts for your search and combine them until you have a reasonably sized set to work with. Don’t overuse the AND Operator. Use it, but do so wisely. Think about the topic and the impact that your strategy might have on your results.

Typically, a computer will only do what it is instructed. Searching for “mathematics” is interpreted exactly. If any of the records contained the more casual “math,” well, our earlier search would have missed the ones that didn’t also have the word “mathematics.” For the purposes of a database search, the words in the Venn Diagram above are synonyms. The OR Operator looks for either word to show up in a record and gathers them into a set that we can use later. Note that the results for truly synonymous terms will have overlap. The result of an OR Operator is a set that is larger than the set of records retrieved for each individual word (unless there’s a case of complete overlap … theoretically possible, but again, highly unlikely). Database searching often calls for the searcher to think out of the box a little. Perhaps, for my purposes, “fractions” would or could be considered synonymous with mathematics. “Fractions” would be a narrower term, but what if any of the three would be acceptable to the information seeker? Have I got you thinking?

These three slides with the Boolean Operators illustrated were borrowed from a training tool from DIALOG. Don’t blame their example on me! The NOT operator must always be used with great caution. The guidance given in this slide is most wise. Remember it!
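
The three operators line up neatly with set operations, so a tiny Python sketch may help. The record numbers below are invented; the real ERIC sets are far larger.

```python
# Hypothetical record numbers for each word.
mathematics = {1, 2, 3, 4, 5, 6, 7, 8}
math = {6, 7, 8, 9, 10}
fear = {2, 7, 11}

# AND: only records that contain both words (never larger than either set).
both = mathematics & fear

# OR: records that contain either word (never smaller than either set).
either = mathematics | math

# NOT: records with the first word, minus any that also contain the second.
without = mathematics - fear
```

Notice that the NOT result throws away records 2 and 7 even though they are about mathematics, which is the reason for the caution above.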

In this slide, I’ve used the OR Operator. It helped me to retrieve more than 4,000 additional records that should cover the concept of mathematics.

After that, I may be losing track of the sets that I’ve created. “DS” is the abbreviation for “DISPLAY SETS.” I use this command a lot.

Note set “S4.” It shows a mistake. The “S” before the number for sets is VERY important. Set “S3” is quite different from set “S4!” The results of set S4 are a Boolean AND of records containing the “word” “1” and the “word” “2” – the computer will treat a number as if it were a word unless given specific instructions not to do so. We’ll see some of those examples later in the class.
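
A toy Python session can show why the "S" matters. This is only a sketch of the idea: the index contents, the Session class, and its select method are all hypothetical, and it handles nothing beyond a single term or terms joined by AND.

```python
import re

# Hypothetical basic index: word -> record numbers.
# "1" and "2" are indexed like any other word.
BASIC_INDEX = {
    "mathematics": {10, 11, 12, 13},
    "fear": {11, 13, 14},
    "1": {12, 20},
    "2": {20, 21},
}


class Session:
    def __init__(self):
        self.sets = {}  # "S1" -> set of record numbers

    def _lookup(self, token):
        if re.fullmatch(r"S\d+", token.upper()):
            return self.sets[token.upper()]          # an earlier set
        return BASIC_INDEX.get(token.lower(), set())  # an ordinary word

    def select(self, expression):
        """Handle 'term' or 'term AND term'; store the result as the next set."""
        parts = re.split(r"\s+AND\s+", expression.strip(), flags=re.IGNORECASE)
        result = set(self._lookup(parts[0]))
        for part in parts[1:]:
            result &= self._lookup(part)
        name = f"S{len(self.sets) + 1}"
        self.sets[name] = result
        return name, len(result)


session = Session()
session.select("mathematics")   # becomes S1
session.select("fear")          # becomes S2
session.select("S1 AND S2")     # combines the two earlier sets
session.select("1 AND 2")       # searches for the "words" 1 and 2
```

The last two commands look almost alike on the screen, but the first combines earlier sets while the second goes looking for records that contain the "words" 1 and 2.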

So, my new set S6 is a little larger than set S3. It is most likely a superior set of search results from a completeness standpoint.

I told you I use that DS command a lot!

If you think over our search strategy, set S3 is indeed a subset of set S6. Sets S6 and S7 would be considered equivalent sets.

The next thing that I tried was to limit my search to only English records. This command works in a lot of DIALOG databases, but when it doesn’t work, the system indicates that the command is ignored. If you look at the ERIC Bluesheet (link provided), and home in on the section of the Bluesheet that discusses Limits, you’ll find no mention of a limit for English language materials. Thankfully there’s another way to handle the problem. Here’s a mind bender! DIALOG will let you search multiple files together. What happens when one file accepts the “/ENG” limit and others don’t? Well, sometimes the search gets a little messy. Preplanning your search is the only good remedy.

In a technique similar to LIMIT suffixes in DIALOG, it’s important to note that you can restrict your search to a part of the basic index. You would often want to do this to make sure … for example … mathematics is a very important concept in a particular set of search results. It is important to be cognizant of what fields are in the basic index. I know I’m probably beginning to sound like a broken record, but it’s a very important concept. The second example shows a search for the word “mathematics” restricted to the title (ti) field. There are times when this extreme narrowing of a search is beneficial. One example of this is when you try to verify a citation (patrons often struggle trying to find a citation that is slightly incorrect – maybe they misheard a detail or two in a discussion with a colleague at a conference). There’s always the hopeless case where even the best of databases won’t help … one famous item of library lore would be the kid that visits his local library looking for a book called (according to the kid) “Oranges and Peaches” – a good reference interview eventually revealed to the librarian that the kid’s teacher had mentioned a book on evolution by some guy named Darwin. Aha! The kid wanted “Origin of the Species!” As a person that has struggled with hearing impairment, I’m on the kid’s side! Oh … how many times have I misheard things!

Sometimes I’ll restrict a search to the titles or descriptors … like you see in the third example above. If the information need is a few good articles on a topic, this can be a reasonable choice. For multiple fields, simply separate the field abbreviations by a comma. What fields are available for suffix searching? Again, it depends on the database – the Bluesheet is where to look!
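
Here is a rough Python sketch of the difference between an unqualified search and a suffix search. The postings, the field codes, and the list of basic index fields are all assumptions for illustration (the Bluesheet gives the real list for each database), and the select function is a hypothetical stand-in, not DIALOG's own logic.

```python
# Hypothetical postings: word -> set of (record number, field) pairs.
POSTINGS = {
    "mathematics": {(1, "TI"), (1, "AB"), (2, "DE"), (3, "AB"), (4, "SO")},
}

# Assumed basic index fields for this toy database.
BASIC_INDEX_FIELDS = {"TI", "AB", "DE", "ID"}


def select(query):
    """Accept 'word' or 'word/ti,de' and return the matching record numbers."""
    word, _, suffix = query.partition("/")
    if suffix:
        requested = {code.strip().upper() for code in suffix.split(",")}
        # A suffix narrows the search to part of the basic index.
        fields = requested & BASIC_INDEX_FIELDS
    else:
        fields = BASIC_INDEX_FIELDS
    hits = POSTINGS.get(word.strip().lower(), set())
    return {rn for rn, field in hits if field in fields}
```

In this toy data, select("mathematics") finds records 1, 2, and 3, select("mathematics/ti") finds only record 1, and select("mathematics/ti,de") finds records 1 and 2. Record 4 never shows up because its field is assumed to sit outside the basic index.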

Okay! I get to introduce a new command and bail us out of our problem that we had in the ERIC database – that one without the Limit command that we needed! If you scour the ERIC Bluesheet, you’ll notice the Language Field as one of the additional indexes. At this point, I’ve always been more comfortable with using the EXPAND command or “E” to browse an additional index. Some folks would just use:

?S LA=ENGLISH

And be done with the problem. I’ll often start with an EXPAND command and …

… follow it up with a SELECT command of the “E Number.” Thus,

?S E3

… puts a set of English language records in a (huge) set …

Then I can return to use of the AND Operator to finish the job.
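
The EXPAND idea, browsing the alphabetical neighborhood of a term and then picking an entry by its "E Number," can be sketched in Python with the standard bisect module. The index entries and record counts are invented, and the expand function and its window size are assumptions for this sketch.

```python
import bisect

# Hypothetical entries from a language index: (term, number of records).
LANGUAGE_INDEX = sorted([
    ("LA=CHINESE", 40), ("LA=DUTCH", 25), ("LA=ENGLISH", 9000),
    ("LA=FRENCH", 300), ("LA=GERMAN", 150), ("LA=SPANISH", 200),
])


def expand(term, before=2, size=6):
    """Return (E number, term, count) rows around where `term` falls alphabetically."""
    terms = [t for t, _ in LANGUAGE_INDEX]
    start = max(bisect.bisect_left(terms, term.upper()) - before, 0)
    window = LANGUAGE_INDEX[start:start + size]
    return [(f"E{i}", t, n) for i, (t, n) in enumerate(window, start=1)]


rows = expand("la=english")
e_numbers = {e: t for e, t, _ in rows}
```

In this toy index the term being expanded comes back as E3, so selecting E3 amounts to selecting LA=ENGLISH without retyping it.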

Gosh, I’ve been doing a lot of searching. What kind of a bill am I running up? The “COST” command allows me to check. Sometimes I’ll use the COST command after a TYPE command to see how much I just spent. Always make sure that you’re TYPE-ing the correct set number or the subsequent COST command might be a bit unsettling. Thankfully, if you ever make an expensive mistake and generate useless output, you can call DIALOG and explain the situation. Be prepared to tell them your User number and the Session number. In this case, the User number is 556323 and the Session number was D1.2. Oh, what’s the TYPE command? Well, I got a little ahead of myself. We have this great (we hope) set S10 and we’d like to look at some of the results. The TYPE command is what we need to use to retrieve the results of our work.

This slide and the next one show the beginning of the results of our TYPE command example …

Note that I asked for “Format 8.” This is one of the formats of database output that is free in the ERIC database. This is often not even enough information to find the document, although in this particular case I would be trying “ED452367” in a library’s huge set of ERIC microfiche. Many libraries would put the ERIC documents in “ED” number order.

Hopefully you’ll find this to be a good example. The Bluesheet will indicate which formats are available to the searcher. If you ever wanted all of the results of a particular set, the word “ALL” could be substituted for “/1-5” in our example.
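
The three pieces of a TYPE request (which set, which format, which items) can be sketched as a small Python function. The argument layout "S10/8/1-5" is assumed from the example discussed in these notes, and the function name and the set_size parameter are hypothetical.

```python
def parse_type_argument(argument, set_size):
    """Split 'S10/8/1-5' into set name, format, and a list of item numbers.

    Assumes three slash-separated parts: set, format, and items, where
    items is a single number, a range like 1-5, or the word ALL.
    """
    set_name, output_format, items = argument.upper().split("/")
    if items == "ALL":
        numbers = list(range(1, set_size + 1))
    elif "-" in items:
        first, last = items.split("-")
        numbers = list(range(int(first), int(last) + 1))
    else:
        numbers = [int(items)]
    return set_name, output_format, numbers
```

For example, parse_type_argument("S10/8/1-5", 40) asks for items 1 through 5 of set S10 in format 8, while "S10/8/ALL" would ask for every item in the set.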

Now, the ERIC database is one of the databases that uses something called Controlled Vocabulary. Specifically, the database uses the Thesaurus of ERIC Descriptors. If you EXPAND a word or a phrase in a database and you see a column headed with an “RT” then you’ll know you’re in a database that has an online thesaurus (but, of course you scoured the DIALOG Bluesheet ahead of time and KNEW the database had a thesaurus. Right!!??!!!)

In this example, I’m thinking of trying to focus my search to fear of mathematics in fifth grade. I had spotted the “GRADE 1, 2, 3, …” descriptors in the earlier results in Format 8 and thought I’d expand on the phrase “GRADE 5” to see if I could use it and perhaps other terms that would be or could be considered synonymous for my purposes.

In this example, I’ve used the “SELECT” command on E number “E3,” but it would have been more interesting to use the EXPAND command:

?E E3

This would have listed the three related terms for me. Bummer! I must have had a bad day! I think I’ll do it and stick the slide in!

Here’s an example of the TYPE command that requests Format 9 for record 1 of the set S12.

It’s a 16 page document … again probably in a huge bank of ERIC microfiche documents that one will find in many libraries with good education collections. ED219235 is what we’d be looking for.

When you’re connected to the DIALOG service, the meter is running. So, we’ll also want to stop spending money! The LOGOFF command will end your search session. By the way, with these student accounts, don’t worry about costs. DIALOG provides this service to us so that students have the chance to learn how to use their system. Note that I put “$0.00 2 Type(s) in Format 9” in bold face font. I wanted to point out that in a real situation you would be charged per record for the output.

One of the things that we’re able to do in Dialog is to enter a rather complex strategy in a single command line. We do this with something called “nesting.” The first command above is just a simple search for two words using an OR Operator between them. We could have put each word in a separate select statement and then combined them with an OR, but this is equivalent. The second command is simply a search for the word “leukemia.” The third command should look familiar – it would be a way to combine the concepts of S1 and S2. We could search for all three words in the same select command, but here we have to be careful. For the fourth command in the slide, Dialog will always process an AND Operator before an OR Operator. This would mean that all the records of a database containing both feline AND leukemia would be found, then the result of that would be combined with the set of all records that contain the word “cat” using an OR operator. That is not likely what would be intended. We would really want either “cat” or “feline” to be present and then the word “leukemia” to also show up in the record. In Dialog (and most other systems), we can force the system to interpret our logic correctly with the use of parentheses. That’s what I show in the fifth and final command on the slide.
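
The effect of processing AND before OR is easy to see with Python sets, since Python's & operator also binds more tightly than its | operator. The record numbers are invented for illustration.

```python
# Hypothetical record numbers for each word.
cat = {1, 2, 3, 4}
feline = {3, 5, 6}
leukemia = {2, 5, 7}

# Without parentheses, the AND is done first:
# cat OR (feline AND leukemia)
unparenthesized = cat | feline & leukemia

# With parentheses, the OR is done first:
# (cat OR feline) AND leukemia
parenthesized = (cat | feline) & leukemia

# The step-by-step route gives the same answer as the nested one.
s1 = cat | feline
s2 = leukemia
s3 = s1 & s2
```

The unparenthesized version keeps every "cat" record whether or not it mentions leukemia, while the nested version keeps only records 2 and 5.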

Being able to combine concepts in a single select statement can be very helpful. For example, we’ll soon learn about a tool that helps us scope out the number of records that match a particular query without actually searching every file. The only problem is that our command mud