Oh … snips and snails and puppy dog tails
– that’s what databases are made out of … OK, so they’re not. Databases are also referred to as files or datafiles. Sometimes a database is split into
subfiles, but not necessarily. The ERIC
database is a good example of a database that is split into two subfiles – the
Current Index to Journals in Education (CIJE) and Resources in Education
(RIE). The important thing to remember
is that some databases will have subfiles and some won’t. When they exist, they can sometimes be helpful
in crafting a search strategy. All
databases will be made up of database records.
Each database record is made up of fields. Sometimes fields are divided into
subfields. Perhaps an example will
help.
Here’s an example from the online catalog
of a library. The online catalog is an
example of a database. The record is an
example of the familiar bibliographic record.
It simply describes a book. The
fields should also be familiar: author, title, publisher, etc. The book happens to be one of my favorites. It was my textbook for a course in US and
Canadian geography. The book suggested
that North America could be split into nine different nations that would make
more sense than the three that we currently have. The author explains why he picks the borders that he
does. One memorable one was a diagonal
line from northwest Connecticut to southeast Connecticut. People to the north and east of this line
have a very strong tendency to be Boston Red Sox fans and people to the south
and west of the line tend to be New York Yankees fans. Sorry Mets fans … I don’t think he paid them
much notice!
The record was too long to fit on one
screen. Maybe we should look at the underlying
structure …
I know!
I know! Eeek! It’s the dreaded MARC record – well, at
least some of the variable fields. The numbers on the extreme left are the tags
for the fields of the records. They’re
unique to the MARC record and you’ll often see them displayed as the more
human friendly “Personal Author”, “Title”, etc. Between the colons are what are known as
“indicators.” They are truly unique to
the MARC record. From my forever
dimming memory, I can only recall that the “4” means to ignore the first four
character spaces or “The “ – the initial article. The “|b” or “|c” are known as “delimiters” and signify the
beginnings of subfields of a field.
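If it helps to see the tag, the indicators, and the delimiters pulled apart, here is a minimal Python sketch. The sample line is made up by me (it is not the record on the slide), and I am assuming the display layout described above: tag, colon, indicators, colon, then the field content with "|" in front of each subfield code. Treating the text before the first delimiter as subfield "a" is also my assumption.

```python
def parse_marc_line(line):
    """Split one displayed MARC line into its tag, indicators, and subfields."""
    # Only split on the first two colons; later colons belong to the content.
    tag, indicators, content = line.split(":", 2)
    pieces = content.strip().split("|")
    # Assumption: text before the first delimiter is subfield "a".
    subfields = [("a", pieces[0].strip())]
    for piece in pieces[1:]:
        # The first character after "|" is the subfield code.
        subfields.append((piece[0], piece[1:].strip()))
    return {"tag": tag.strip(), "indicators": indicators, "subfields": subfields}


def filing_title(field):
    """Skip the initial article, using the second indicator as a character count."""
    second = field["indicators"][1:2]
    skip = int(second) if second.isdigit() else 0
    return field["subfields"][0][1][skip:]


# Hypothetical title field, invented for illustration only.
line = "245:14: The sample title :|ba made-up subtitle /|cA. N. Author."
field = parse_marc_line(line)
print(field["tag"], field["subfields"])
print(filing_title(field))
```

The "4" in the indicators is what lets the computer file this title under "sample" rather than under "The".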
… and this would be the continuation of
the underlying structure. Can you pick out
the subfields?
Now I’ll try my very best not to confuse
you.
This is part of an actual database
that I built so that my wife could use a title listing to know which books
were already in my “presidential library.”
I’m going to try and describe how database records are “parsed” or
chopped up into parts that the computer can recognize as “words.”
The end result would be the production of an
“inverted file” – an alphabetized listing of what words appear in what
position of which record.
In addition
to this example, see Walker and Janes pages 55 through 63 and the GEP DIALOG
Lab Workbook pages 3-6 through 3-10.
This screen and the next two show record numbers (I’ve designated them
‘RN’) 101, 102, and 103.
My database
has three fields Author (AU), Title (TI), and Subject (SU).
We’re going to build the inverted file for
this tiny database of three records by hand.
Familiarize yourself with each record …
you should begin to wonder … just how is Matt going to chop up his
database?
This is a very good book by the way …
How we chop the record up into words and
phrases will lead to what the inverted file will look like.
Well, I’ve been introducing an awful lot
of vocabulary through these first ten slides.
Remember that Walker and Janes has a pretty nice glossary. The existence of a basic index versus
additional indexes is extremely important for searchers to understand. From a searcher’s perspective, whenever you search
a database without specifying a particular field to search, the basic index is
searched. It’s kind of a default
instruction for the computer to follow.
It’s also VERY important to realize that the basic index in one
database can be different from the basic index in another database. To complicate matters even further, a
database that you’ll find on the DIALOG service might also be found in
EBSCOHost or FirstSearch. Even though
they host some of the same databases, they may choose to index them in
different ways. Sometimes a database
host might choose to leave certain fields completely out of the indexing
process. The important lesson to learn
here would be that the savvy librarian must be ready for anything and should
endeavor to find out what is defined as the basic index and which additional
indexes exist for all the databases that they search. Thankfully you don’t have to memorize all
that junk! You just need to know how to
look up this information and how to best articulate the situation to your end
users. We’ll learn about how to do this
with Dialog and that will provide you with a very good example for the future.
Oh, here comes yet another bit of jargon
– the “Stop Words.” This is a list of words
that the computer is instructed to ignore as it builds the inverted file. It’s as if the words don’t even exist. Note that DIALOG’s list is very short –
shorter than many hosts of databases.
They even index the word “a” – remember that phrases like “type A” or
“grade A” can often be found. DIALOG
and some other database hosts feel that it’s important for their users to
easily find them.
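Here is a small Python sketch of what a stop word list does while a field is being chopped into words. The STOP_WORDS set below is an illustrative stand-in of my own, not copied from the slide, and I am assuming that a skipped stop word still uses up a position number. Notice that "a" is not in the list, so "Grade A" stays findable.

```python
import re

# Illustrative stop list (an assumption, not the list from the slide).
STOP_WORDS = {"an", "and", "by", "for", "from", "of", "the", "to", "with"}


def index_words(text):
    """Return (position, word) pairs, leaving out the stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: positions are counted before stop words are dropped.
    return [(pos, word) for pos, word in enumerate(words, start=1)
            if word not in STOP_WORDS]


print(index_words("The Grade A Guide to Fractions"))
```

"The" and "to" never make it into the inverted file, which is why searching for them would be pointless.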
Now, this is where we start to chop up
the database.
What’s up with the author
field?
Both the last name and first
name on one line – why isn’t the name chopped into a first and last name?
Well, our author index … an additional index
… will, by my design, be a phrase index.
That’s the only way the author will appear in our index:
Lastname, Firstname.
“101” is the record number (much smaller than the record numbers that you’ll see
in DIALOG).
“AU” is the field.
“1” is the first phrase.
For another example, “psychohistorical” is also in the 101st record: it’s in the title field of record 101
and it’s the 6th word.
And the saga continues …
Wait a minute! This name is chopped up into words! Yup!
I determined that design too.
We’re in the middle of chopping up a subject field into words. Even “1913” is a word according to a
computer – aaw! They never could spell.
Whoa Milhouse! What’s this?
Well, this madcap designer has decided that the subject field of his
database is both word and phrase indexed.
We can do that you know! Even
within DIALOG you’ll see variations in whether a particular field is word or
phrase indexed. You simply have to
memorize all this stuff for 500 plus databases. Nah!
I’m just kidding. Never memorize
something that can easily be looked up.
DIALOG has a set of tools called “Bluesheets.” There’s a Bluesheet for each database that
goes through each and every field and indicates how the field is indexed … or
if it’s indexed.
And so it goes …
Thankfully we’ll only take this until we
get the idea. There’s a point where one
begins to really appreciate what we can assign computers to do. Let’s continue by alphabetizing our list.
And, numbers would kick off our alphabetized list of words …
Each ‘word’ has its record number, field, and position within the field
in our list.
Finally we get to the “a’s” in the list.
Hopefully I haven’t screwed up the
numbers of the records. This is hard enough
to follow as it is.
We also can’t forget the separate
additional index that we’ve built.
There are only three entries – they’re the authors of the three books.
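For anyone who would rather let the computer do the chopping, here is a rough Python sketch of the whole exercise. The three records are invented stand-ins (not the ones on my slides), and the stop list and the semicolon between subject phrases are assumptions too. The design follows what we did by hand: the author field is phrase indexed into its own additional index, the title is word indexed, and the subject field is both word and phrase indexed.

```python
import re

STOP_WORDS = {"an", "and", "by", "for", "from", "of", "the", "to", "with"}

# Hypothetical records; RN is the key, fields are AU, TI, and SU.
RECORDS = {
    101: {"AU": "Rivera, Ana", "TI": "The Making of a President",
          "SU": "United States presidents; Political campaigns"},
    102: {"AU": "Chen, Lee", "TI": "A President at War",
          "SU": "United States presidents; Tariff reform, 1913"},
    103: {"AU": "Okafor, Ben", "TI": "Letters from the White House",
          "SU": "United States presidents; Presidential correspondence"},
}


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_indexes(records):
    """Return (basic index, author index) as sorted lists of entries."""
    basic, author = [], []
    for rn, rec in records.items():
        # Author: one phrase entry per record, in a separate additional index.
        author.append((rec["AU"].lower(), rn, "AU", 1))
        # Title: word indexed.
        for pos, word in enumerate(words(rec["TI"]), start=1):
            if word not in STOP_WORDS:
                basic.append((word, rn, "TI", pos))
        # Subject: each phrase gets an entry, and so does each word.
        pos = 0
        for n, phrase in enumerate(rec["SU"].split(";"), start=1):
            basic.append((phrase.strip().lower(), rn, "SU", n))
            for word in words(phrase):
                pos += 1
                if word not in STOP_WORDS:
                    basic.append((word, rn, "SU", pos))
    # Sorting is the "alphabetizing" step; digits sort ahead of letters.
    return sorted(basic), sorted(author)


basic_index, author_index = build_indexes(RECORDS)
for entry in basic_index[:5] + author_index:
    print(entry)
```

Each entry reads as term, record number, field, position. Just as in the hand-built list, the number "1913" kicks things off, and the author index ends up with only three entries.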
Oooh!
How am I going to do this? I’ll
tell you what. I’ll do my best to go through
the ERIC database Bluesheet using the Camtasia Studio software – you’ll be
able to listen to my explanation and I’ll try to point out what is important. The link in the slide is to the DIALOG
Bluesheet for the ERIC database. One of
the things that I’ll show is the pair of example records – one is a journal
record from the CIJE subfile of ERIC and the other is a document record from
the RIE subfile of ERIC.
The link above is to my work web page at
Carnegie Mellon. You can actually view
the structure that exists within any web page.
Use the “VIEW” pull-down menu on your browser and then choose
“Source.” The first code recognized by
a computer would be the “<HTML>” tag that you’ll see about five lines
down (after a bunch of remarks that the computer ignores). This would be equivalent to a document type
– if you’ve ever done an advanced Google or Yahoo search, you’ll notice that
you can restrict your search to a type of document. Right after the “<HTML>” tag you’ll
see the beginning of the “<HEAD>” of the document. The chief portion of the head of an html document
is the “<TITLE>.” Note that I put
a bunch of different variants of my name in the title of my page. I did that so that if someone looks for me
as “Matt Marsteller” or “Matthew Marsteller” or “Matthew R. Marsteller,” then
there’s a good chance that they’ll find this page once it has been indexed, because
most search engines give greater weight to the title of a web page. The title of a web page is what shows up in
the blue bar at the top of your browser window. The web page also has “Web Page of Matthew
R. Marsteller” in big bold letters. If
you look in the source code (about a third of the way through the code),
you’ll see that this phrase is enclosed in “<H1>” and “</H1>”
tags. This is a heading within a
document and can also be given more prominence by a search engine. The key operative phrase is “can be given
more prominence” – nothing is guaranteed.
When search engines rank their search results, a lot of things come
into play. Ranking algorithms (rules
the computer will follow) are often redesigned. We’ll explore searching of the Internet at
the end of our study together. For now,
I hope that you’ll appreciate that the typical web page on the Internet can,
and usually does, have structure that search engines can make use of. Perhaps it’s a bit more crude than we’re
used to … but it’s there! There are
other concerns such as … is it standard HTML code? The answer is, unfortunately, no. One thing that is popular for people to do
is to use a word processor like Microsoft Word to produce their HTML code
(using the “SAVE As” function). It’s
easy to do, but the result is NOT standard HTML. Most browsers will interpret it correctly
(kind of) – hopefully most search engines will as well, eh?
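To make the idea of "structure a search engine can use" a little more concrete, here is a minimal Python sketch that pulls the title and the top-level heading out of a page. The PAGE text is a tiny stand-in that I typed up for illustration, not the real source of my web page, and a real search engine certainly does far more than this.

```python
from html.parser import HTMLParser


class StructureSniffer(HTMLParser):
    """Collect the text found inside <title> and <h1> tags."""

    def __init__(self):
        super().__init__()
        self.current = None
        self.found = {"title": [], "h1": []}

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names in lower case.
        if tag in self.found:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.found[self.current].append(data.strip())


# Hypothetical page, loosely modeled on the description above.
PAGE = """<!-- remarks that the computer ignores -->
<HTML><HEAD>
<TITLE>Matt Marsteller, Matthew Marsteller, Matthew R. Marsteller</TITLE>
</HEAD><BODY>
<H1>Web Page of Matthew R. Marsteller</H1>
<P>Everything else on the page.</P>
</BODY></HTML>"""

sniffer = StructureSniffer()
sniffer.feed(PAGE)
print(sniffer.found)
```

Think of the title and the heading as two crude "fields" of the web page, which a ranking algorithm can choose to weigh more heavily than the ordinary paragraph text.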
This should reinforce the “First Contact”
tutorial a bit. DIALOG uses the “?” (question
mark) as the command prompt. The
command prompt is the symbol that the computer system uses when the searcher
needs to supply input. This slide shows
the straightforward example of how to start in … or switch to … a database
while you are connected to DIALOG. This
would be similar to clicking a checkbox to choose the database or databases
that you want to search within a collection of databases like EBSCOHost or
pointing your browser to the proper URL for a library’s online catalog.
This simply shows the system’s response
to the BEGIN command. We’re in the ERIC
database and another command prompt is showing.
The SELECT command is the next tool to
learn. In this example, I’ve tasked the
computer to find all records with the word “mathematics” in the basic index of
the ERIC database. It will look in all
the fields of each record that are designated as part of the basic index. Again, you’ll find a description of the basic
index in each and every Bluesheet.
If you’ve viewed the tutorial, the search
should look a little familiar. The number
of records retrieved is probably a bit lower – some time has passed since I
did the example search for the PowerPoint slides. In this slide, I searched for the word
“fear” after completing the search for the word “mathematics.” Note that DIALOG has assigned a set number
for each set of records. I’m able to
use these set numbers (“S1” and “S2”) to find records where both words show up
(again, simply in the basic index). A
third set, “S3,” is created and the system responds with another prompt. The word “AND” that I put between “S1” and
“S2” is significant. The DIALOG system recognizes
that as a special word known as a Boolean operator …
These are the three main Boolean
Operators. Note that the description of
the AND Operator matches what I was trying to accomplish with finding records
on the fear of mathematics. Let’s
explore each separately …
Back in third grade, I first learned
about set theory and Venn Diagrams. I
have to admit that I never foresaw a career where I tend to draw Venn diagrams
several times a day! Let’s let the left
circle represent all the records in a database that contain the word
“mathematics.” The circle on the right represents
the set of all records that contain the word “fear.” Thanks to people like me that struggled
mightily with multiplying and dividing fractions (at first) in fifth grade, we
can expect that educators have looked into the problem of kids being a bit
afraid of math. The green shaded area
represents records that contain both words.
The circles aren’t drawn to scale of course. The set of records with the word
“mathematics” was much larger … more than ten times larger. Hopefully this makes the search more
understandable. One thing that you
should realize is that the result of using an AND Operator should be a set of
records that is SMALLER than either of the two sets that you started with (it could
be the same size as the smaller of the two sets … but that is not likely to happen). It’s important to note that using too many
AND Operators may yield sets that are too small to cover the topic. Start with the most significant concepts for
your search and combine them until you have a reasonably sized set to work
with. Don’t overuse the AND
Operator. Use it, but do so
wisely. Think about the topic and the
impact that your strategy might have on your results.
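If you like to tinker, the AND Operator behaves just like set intersection in Python. The record numbers below are invented, and the four "concepts" are only there to show how quickly a pile of ANDs can shrink a set.

```python
from functools import reduce

# Hypothetical sets of record numbers for each word.
POSTINGS = {
    "mathematics": set(range(1, 41)),   # 40 records
    "fear": {3, 7, 12, 55},
    "fractions": {3, 12, 18, 60},
    "grade": {3, 90},
}


def and_together(words):
    """Intersect the sets for the given words, the way AND combines sets."""
    return reduce(lambda left, right: left & right,
                  (POSTINGS[word] for word in words))


all_words = list(POSTINGS)
for n in range(1, len(all_words) + 1):
    chosen = all_words[:n]
    print(" AND ".join(chosen), "->", len(and_together(chosen)), "records")
```

Every extra AND can only keep the set the same size or make it smaller, so by the fourth concept we are down to a single record.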
Typically, a computer will only do what
it is instructed. Searching for “mathematics”
is interpreted exactly. If any of the
records contained the more casual “math,” well, our earlier search would have
missed the ones that didn’t also have the word “mathematics.” For the purposes of a database search, the
words in the Venn Diagram above are synonyms.
The OR Operator looks for either word to show up in a record and
gathers them into a set that we can use later.
Note that the results for truly synonymous terms will have
overlap. The result of an OR Operator
is a set that is larger than the set of records retrieved for each individual
word (unless there’s a case of complete overlap … theoretically possible, but
again, highly unlikely). Database
searching often calls for the searcher to think out of the box a little. Perhaps, for my purposes, “fractions” would
or could be considered synonymous with mathematics. “Fractions” would be a narrower term, but
what if any of the three would be acceptable to the information seeker? Have I got you thinking?
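The OR Operator is set union. Here is a tiny Python sketch with made-up record numbers for the three words, just to show the overlap and the growth of the set.

```python
# Hypothetical sets of record numbers for each word.
SETS = {
    "mathematics": {1, 2, 3, 4, 5},
    "math": {4, 5, 6, 7},
    "fractions": {7, 8},
}

either = SETS["mathematics"] | SETS["math"]
any_of_three = either | SETS["fractions"]
overlap = SETS["mathematics"] & SETS["math"]

print("mathematics OR math:", sorted(either))
print("adding fractions:   ", sorted(any_of_three))
print("counted only once:  ", sorted(overlap))
```

Records 4 and 5 contain both "mathematics" and "math," but they land in the combined set only once, which is why the OR result is smaller than the two counts added together.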
These three slides with the Boolean
Operators illustrated were borrowed from a training tool from DIALOG. Don’t blame their example on me! The NOT operator must always be used with
great caution. The guidance given in
this slide is most wise. Remember
it!
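Here is my own small illustration of why NOT calls for caution (it is not the example from the DIALOG slide). The three titles are invented. NOT works like set difference, and it throws out a record the moment the unwanted word shows up anywhere, even when the record is exactly what the patron wanted.

```python
import re

# Hypothetical titles keyed by record number.
TITLES = {
    1: "Teaching mathematics with games",
    2: "Mathematics without fear: a confident classroom",
    3: "Fear of mathematics in fifth grade",
}


def records_with(word):
    """Record numbers whose title contains the word."""
    return {rn for rn, title in TITLES.items()
            if word in re.findall(r"[a-z]+", title.lower())}


kept = records_with("mathematics") - records_with("fear")
dropped = records_with("mathematics") & records_with("fear")
print("mathematics NOT fear keeps:", sorted(kept))
print("and silently drops:       ", sorted(dropped))
```

If the patron wanted upbeat material on teaching math, record 2 would have been a fine hit, but NOT removed it along with record 3.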
In this slide, I’ve used the OR
Operator.
It helped me to retrieve more
than 4,000 additional records that should cover the concept of
mathematics.
Following that, I may be losing track of the sets that I’ve created.
“DS” is the abbreviation for “DISPLAY
SETS.”
I use this command a lot.
Note set “S4.”
It shows a mistake.
The “S” before the number for sets is VERY
important.
Set “S3” is quite different
from set “S4!”
The results of set S4
are a Boolean AND of records containing the “word” “1” and the “word” “2” –
the computer will treat a number as if it were a word unless given specific instructions
not to do so.
We’ll see some of those
examples later in the class.
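The missing "S" mistake is easier to see in a sketch. This little Python function is only my guess at the logic, not DIALOG's actual code: an operand that looks like "S" plus digits is treated as a set number, and anything else, including a bare number, is looked up as a word. The index contents are invented.

```python
import re

# Hypothetical basic index: word -> record numbers.
BASIC_INDEX = {
    "mathematics": {1, 2, 3, 4},
    "fear": {3, 4, 9},
    "1": {2, 50},
    "2": {50, 60},
}
# Sets created earlier in the session.
SETS = {"S1": BASIC_INDEX["mathematics"], "S2": BASIC_INDEX["fear"]}


def select_and(expression):
    """Evaluate 'X AND Y', where X and Y are set numbers or plain words."""
    operands = [tok for tok in expression.split() if tok.upper() != "AND"]
    results = []
    for token in operands:
        if re.fullmatch(r"[Ss]\d+", token):
            results.append(SETS[token.upper()])          # a set number
        else:
            results.append(BASIC_INDEX.get(token.lower(), set()))  # a word
    return set.intersection(*results)


print(sorted(select_and("S1 AND S2")))   # the search I meant to run
print(sorted(select_and("1 AND 2")))     # the mistake: the "words" 1 and 2
```

The two commands differ by only two keystrokes, yet they return completely different records.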
So, my new set S6 is a little larger than
set S3.
It is most likely a superior
set of search results from a completeness standpoint.
I told you I use that DS command a lot!
If you think over our search strategy,
set S3 is indeed a subset of set S6.
Sets S6 and S7 would be considered equivalent sets.
The next thing that I tried was to limit my search to only English
records.
This command works in a lot of
DIALOG databases, but when it doesn’t work, the system indicates that the
command is ignored.
If you look at the
ERIC Bluesheet (link provided), and home in on the section of the Bluesheet
that discusses Limits, you’ll find no mention of a limit for English language materials.
Thankfully there’s another way to handle the
problem.
Here’s a mind bender!
DIALOG will let you search multiple files
together.
What happens when one file
accepts the “/ENG” limit and others don’t?
Well, sometimes the search gets a little messy.
Preplanning your search is the only good
remedy.
In a technique similar to LIMIT
suffixes in DIALOG, it’s important to note that you can restrict your search
to a part of the basic index. You would
often want to do this to make sure … for example … mathematics is a very
important concept in a particular set of search results. It is important to be cognizant of what
fields are in the basic index. I know
I’m probably beginning to sound like a broken record, but it’s a very
important concept. The second example shows
a search for the word “mathematics” restricted to the title (ti) field. There are times when this extreme narrowing
of a search is beneficial. One example
of this is when you try to verify a citation (patrons often struggle trying to
find a citation that is slightly incorrect – maybe they misheard a detail or
two in a discussion with a colleague at a conference). There’s always the hopeless case where even
the best of databases won’t help … one famous item of library lore would be
the kid that visits his local library looking for a book called (according to
the kid) “Oranges and Peaches” – a good reference interview eventually
revealed to the librarian that the kid’s teacher had mentioned a book on
evolution by some guy named Darwin.
Aha! The kid wanted “Origin of
the Species!” As a person that has
struggled with hearing impairment, I’m on the kid’s side! Oh … how many times have I misheard
things!
Sometimes I’ll restrict a search to
the titles or descriptors … like you see in the third example above. If the information need is a few good
articles on a topic, this can be a reasonable choice. For multiple fields, simply separate the
field abbreviations by a comma. What
fields are available for suffix searching?
Again, it depends on the database – the Bluesheet is where to look!
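Here is a rough Python sketch of suffix searching. Everything in it is an assumption for illustration: the postings are invented, the "/ti,de" query shape mimics the examples on the slide, and I am assuming "de" for descriptors and "ab" for abstracts as the field abbreviations. With no suffix, the whole (hypothetical) basic index is searched.

```python
# Hypothetical postings: (word, record number, field).
POSTINGS = [
    ("mathematics", 1, "TI"),
    ("mathematics", 2, "AB"),
    ("mathematics", 3, "DE"),
    ("fear", 1, "AB"),
]
# Assumed set of fields that make up the basic index.
BASIC_INDEX_FIELDS = {"TI", "AB", "DE"}


def select(query):
    """Search one word, optionally restricted by a suffix like /ti or /ti,de."""
    term, _, suffix = query.partition("/")
    if suffix:
        fields = {code.strip().upper() for code in suffix.split(",")}
    else:
        fields = BASIC_INDEX_FIELDS
    return {rn for word, rn, field in POSTINGS
            if word == term.strip().lower() and field in fields}


print(sorted(select("mathematics")))        # anywhere in the basic index
print(sorted(select("mathematics/ti")))     # title only
print(sorted(select("mathematics/ti,de")))  # title or descriptor
```

Record 2 mentions mathematics only in its abstract, so it drops out as soon as the search is narrowed to titles or descriptors.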
Okay!
I get to introduce a new command and bail us out of our problem that we
had in the ERIC database – that one without the Limit command that we needed!
If you scour the ERIC Bluesheet, you’ll
notice the Language Field as one of the additional indexes.
At this point, I’ve always been more
comfortable with using the EXPAND command or “E” to browse an additional
index.
Some folks would just use:
?S LA=ENGLISH
And be done with the problem.
I’ll
often start with an EXPAND command and …
… follow it up with a SELECT command of the “E Number.” Thus,
?S E3
… puts a set of English language records in a (huge) set …
Then I can return to use of the AND Operator to finish the job.
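Here is a sketch of the EXPAND, SELECT, and AND sequence in Python. The language index and its record numbers are invented, and the window of two entries before the term is my assumption, chosen so that the term I typed lands on E3 as it did on the slide.

```python
import bisect

# Hypothetical additional index for the language field.
LANGUAGE_INDEX = {
    "LA=CHINESE": {11},
    "LA=DUTCH": {12, 13},
    "LA=ENGLISH": {1, 2, 3, 4, 5, 6},
    "LA=FRENCH": {7, 8},
    "LA=GERMAN": {9},
}


def expand(index, term, before=2, after=3):
    """Show the alphabetical neighborhood of a term, labeled E1, E2, ..."""
    terms = sorted(index)
    at = bisect.bisect_left(terms, term.upper())
    window = terms[max(0, at - before):at + after]
    return {f"E{n}": entry for n, entry in enumerate(window, start=1)}


e_numbers = expand(LANGUAGE_INDEX, "LA=ENGLISH")
for e_number, entry in e_numbers.items():
    print(e_number, len(LANGUAGE_INDEX[entry]), entry)

# "S E3" picks up the English records; AND finishes the job.
english = LANGUAGE_INDEX[e_numbers["E3"]]
fear_of_math = {2, 5, 8, 12}          # hypothetical earlier set
print(sorted(fear_of_math & english))
```

Browsing first is slower than typing the SELECT outright, but it shows exactly how the index spells each entry before any set is built.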
Gosh, I’ve been doing a lot of searching.
What kind of a bill am I running up?
The “COST” command allows me to check.
Sometimes I’ll use the COST command after a TYPE command to see how
much I just spent.
Always make sure
that you’re TYPE-ing the correct set number or the subsequent COST command
might be a bit unsettling.
Thankfully,
if you ever make an expensive mistake and generate useless output, you can
call DIALOG and explain the situation.
Be prepared to tell them your User number and the Session number.
In this case, the User number is 556323 and
the Session number was D1.2.
Oh, what’s
the TYPE command?
Well, I got a little
ahead of myself.
We have this great (we
hope) set S10 and we’d like to look at some of the results.
The TYPE command is what we need to use to
retrieve the results of our work.
This slide and the next one show the
beginning of the results of our TYPE command example …
Note that I asked for “Format 8.” This is one of the formats of database
output that is free in the ERIC database.
This is often not even enough information to find the document,
although in this particular case I would be trying “ED452367” in a library’s
huge set of ERIC microfiche. Many
libraries would put the ERIC documents in “ED” number order.
Hopefully you’ll find this to be a good
example. The Bluesheet will indicate which
formats are available to the searcher.
If you ever wanted all of the results of a particular set, the word
“ALL” could be substituted for “/1-5” in our example.
Now, the ERIC database is one of the
databases that uses something called Controlled Vocabulary.
Specifically, the database uses the
Thesaurus
of ERIC Database Descriptors.
If you EXPAND a word or a phrase in a database and you see a column
headed with an “RT” then you’ll know you’re in a database that has an online
thesaurus (but, of course you scoured the DIALOG Bluesheet ahead of time and
KNEW the database had a thesaurus.
Right!!??!!!)
In this example, I’m thinking of trying to focus my search to fear of mathematics
in fifth grade.
I had spotted the
“GRADE 1, 2, 3, …” descriptors in the earlier results in Format 8 and thought
I’d expand on the phrase “GRADE 5” to see if I could use it and perhaps other
terms that would be or could be considered synonymous for my purposes.
In this example, I’ve used the “SELECT”
command on E number “E3,” but it would have been more interesting to use the
EXPAND command:
?E E3
This would have listed the three related terms for me.
Bummer!
I must have had a bad day!
I
think I’ll do it and stick the slide in!
Here’s an example of the TYPE command
that requests Format 9 for record 1 of the set S12.
It’s a 16 page document … again probably in a huge bank of ERIC microfiche documents
that one will find in many libraries with good education collections.
ED219235 is what we’d be looking for.
When you’re connected to the DIALOG
service, the meter is running. So,
we’ll also want to stop spending money!
The LOGOFF command will end your search session. By the way, with these student accounts,
don’t worry about costs. DIALOG
provides this service to us so that students have the chance to learn how to
use their system. Note that I put
“$0.00 2 Type(s) in Format 9” in bold face font. I wanted to point out that in a real
situation you would be charged per record for the output.
One of the things that we’re able to do in
Dialog is to enter a rather complex strategy in a single command line. We do this with something called
“nesting.” The first command above is
just a simple search for two words using an OR Operator between them. We could have put each word in a separate
select statement and then combined them with an OR, but this is
equivalent. The second command is
simply a search for the word “leukemia.”
The third command should look familiar – it would be a way to combine
the concepts of S1 and S2. We could search for
all three words in the same select command, but here we have to be
careful. For the fourth command in the
slide, Dialog will always process an AND Operator before an OR Operator. This would mean that all the records of a
database containing both feline AND leukemia would be found, then the result
of that would be combined with the set of all records that contain the word
“cat” using an OR operator. That is not
likely what would be intended. We would
really want either “cat” or “feline” to be present and then the word
“leukemia” to also show up in the record.
In Dialog (and most other systems), we can force the system to
interpret our logic correctly with the use of parentheses. That’s what I show in the fifth and final
command on the slide.
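Python's set operators happen to follow the same order of operations, so they make a handy way to try this out: "&" (AND) is processed before "|" (OR) unless parentheses say otherwise. The record numbers are invented.

```python
# Hypothetical sets of record numbers for each word.
cat = {1, 2, 3}
feline = {3, 4}
leukemia = {2, 4, 9}

# Without parentheses: feline AND leukemia first, then OR with cat.
unnested = cat | feline & leukemia

# With parentheses: cat OR feline first, then AND with leukemia.
nested = (cat | feline) & leukemia

print("cat OR feline AND leukemia   ->", sorted(unnested))
print("(cat OR feline) AND leukemia ->", sorted(nested))
```

The unnested version drags in records 1 and 3, which mention cats but never leukemia. The nested version keeps only the records where the disease shows up alongside either word.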
Being able to combine concepts in a single
select statement can be very helpful.
For example, we’ll soon learn about a tool that helps us scope out the number
of records that match a particular query without actually searching every
file. The only problem is that our
command mud