San José State University

SCHOOL OF LIBRARY & INFORMATION SCIENCE

Overview of Subject Representation

Metadata created for any document or collection of documents needs to give various kinds of information. A very short list of what metadata does includes allowing resources to be found by relevant criteria; identifying resources; bringing similar resources together; distinguishing dissimilar resources; and giving location information (NISO, 2004). Providing some kind of information about the subjects of documents in a collection allows searchers to find all the documents which are about a particular topic.

This section gives an overview of the ways that the subject of a document can be represented. Excellent longer treatments can be found in Rowley’s Organizing Knowledge (1992) and Taylor’s The Organization of Information (2004).

There are three basic ways in which the subjects of documents can be represented. One, familiar to us from Internet searching, is to take the actual text of the documents and place the words into an index which can be searched. The second way, familiar to us from library catalogs, is to assign various subject headings to each document, and allow a searcher to retrieve all the documents to which a particular subject heading or combination of subject headings has been assigned. The third way, familiar to us from browsing around in a physical library, is to assign a document to a particular place in a classification scheme, with all documents about the same subject assigned to the same place. All three of these forms of subject representation will be discussed here.


Representing subject through full text – the natural language of the document

This technique (also known as “natural language”) takes the actual words from a document and sorts them into an alphabetical index which can then be searched by a computer. It’s based on the assumption that the words in the document do a good job of indicating the subject of a document. This assumption is correct in a general sense, but there are serious problems with it.

The first problem is that the words of a document reflect the context of the document as well as its actual information – in a journal written for insurance salespeople, for instance, an article may never actually use the word “insurance,” but rather refer more specifically to “whole life,” “term,” and so on. Thus someone searching for the word “insurance” would miss this article entirely.

A second problem is that words will occur in a document that do not reflect what the document is actually about. This section of the Online Resource provides an excellent example: if an index containing the words from this document (and other documents) were being searched by someone interested in insurance, this document would be retrieved – and it’s useless for anyone who actually wants information about insurance.

A third problem with depending on the language of documents for subject representation is that languages are very rich and complex; I may use the term “subject representation” but another author writing about the same subject may use words like “topic” or “knowledge representation.” There is no mechanism for retrieving an article about the subject if it doesn’t contain the word the searcher enters as a search term.
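The index described at the start of this section, and the problems just listed, can be sketched in a few lines of Python. This is only a minimal illustration: the two sample documents and the function names (build_index, search) are invented for the example, and a real retrieval system does considerably more.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, word):
    """Return the ids of all documents whose text contains the word."""
    return sorted(index.get(word.lower(), set()))


# Two invented documents: one written for insurance salespeople,
# one about subject representation that merely mentions insurance.
documents = {
    "article-1": "Selling whole life and term policies to new clients",
    "article-2": "An overview of subject representation, "
                 "with an insurance journal as the example",
}

index = build_index(documents)

print(search(index, "insurance"))  # finds article-2, misses article-1
print(search(index, "topic"))      # finds nothing, although article-2 is on that subject
```

The search for “insurance” misses the one article that is really about insurance and retrieves the one that is not, and the search for “topic” retrieves nothing because neither author happened to use that word.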

Thus the subject of a document is imperfectly related to the words that are present in that document. “What a document is about” is tied to meaning, and meaning arises not from individual words but from the way those words are put together. Each word is affected by the context of a phrase or sentence, and gains further meaning in the context of a paragraph, then of the entire document. The document itself has a context: the journal it appears in, the other writings of the same author, a particular time, place, and cultural frame of reference. And meaning, of course, ultimately arises in our consciousness, in the context of our own experience and prior knowledge.

In a database that contains several thousand documents, natural language techniques can work fairly well despite the problems. But when the web, with its vast number of documents, is searched, the problems are magnified to the point where they are not tolerable in an information retrieval system.

Designers of search engines for the web, therefore, have developed sophisticated computer algorithms for searching the indexes composed of words taken from documents. These algorithms identify words which are used often within a document but not too often – the assumption is that if the word “subject,” for instance, is used several times, it probably is closely related to what the document is about, but if a word is used very frequently, such as “way,” it probably is not. (Note the use of the word “probably” – this is why Internet search engines retrieve irrelevant materials – the assumptions underlying the various algorithms hold true most of the time but not all of the time.) Another algorithm might be to give words that occur in section headings of a document more weight for retrieval than words occurring in the middle of the text. The algorithms used become much more complex than this, but these two examples give you a sense of the way in which words in the full text of a document that are closely related to its subject can be identified.
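The two assumptions just described can be sketched as a toy scoring function. Everything specific here is an assumption made for illustration: the function name score_terms, the thresholds (min_count, max_share), the heading_boost multiplier, and the sample text are all invented, and this is not the method of any actual search engine.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercased words."""
    return re.findall(r"[a-z]+", text.lower())


def score_terms(headings, body, min_count=2, max_share=0.25, heading_boost=3):
    """Score words as candidate subject terms for one document.

    A word is kept if it occurs at least min_count times in the body but
    makes up no more than max_share of all the words in the body. Words
    that also occur in a section heading have their count multiplied by
    heading_boost. All three thresholds are arbitrary illustrative values.
    """
    body_words = tokenize(body)
    counts = Counter(body_words)
    total = len(body_words)
    heading_words = set(tokenize(" ".join(headings)))

    scores = {}
    for word, count in counts.items():
        if count < min_count:
            continue  # used too rarely to suggest the subject
        if count / total > max_share:
            continue  # used so often that it probably carries no subject meaning
        boost = heading_boost if word in heading_words else 1
        scores[word] = count * boost
    return scores


headings = ["Building the index"]
body = ("the index lists the words of the document and the searcher uses "
        "the index to find the document by subject")

scores = score_terms(headings, body)
print(scores)  # {'index': 6, 'document': 2}
```

In this sample, “the” is dropped because it is used too often, “subject” is dropped because it is used only once, and “index” outranks “document” because it also appears in a heading. Note that “subject” is arguably relevant and is still lost, which illustrates why such assumptions hold most of the time but not all of the time.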

These techniques for making the full text of the documents searchable are also called natural language techniques, since the document’s own language is used rather than a pre-specified set of terms as with controlled vocabularies and classification systems.

It should be clear that this way of providing subject access depends first on human beings to write sophisticated processing algorithms, and then after that on computers to do all the processing.

Representing subject through controlled vocabularies

“Controlled vocabularies” is a generic term which usually refers to subject headings, descriptors, and index terms. (These terms are roughly equivalent, though there are differences in when we tend to use them.) In this method of indicating the subject of a document, a list of subjects is first created. The list may be very short (one art library uses a list of 150) or very long (the Library of Congress’s list contains almost 300,000 terms which can be expanded to create millions of terms). This list of subjects is called “controlled” because only these terms can be used. If “eye-hand coordination” is the term in the vocabulary, then there can’t be a subject heading for “hand-eye coordination.” If “human resources administration” is the authorized term in the vocabulary, then there can’t be another term for “personnel administration.” The “control” is the control over what terms are provided to search on.
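The idea of “control” can be sketched as a lookup against the list of authorized terms. The two authorized terms come from the examples above; the table of cross-references that points a variant wording to its authorized term, and the function name resolve_heading, are assumptions added for this illustration.

```python
# The controlled list: only these terms may be assigned or searched on.
AUTHORIZED_TERMS = {
    "eye-hand coordination",
    "human resources administration",
}

# Hypothetical cross-references from a variant wording to the authorized term.
USE_REFERENCES = {
    "hand-eye coordination": "eye-hand coordination",
    "personnel administration": "human resources administration",
}


def resolve_heading(term):
    """Return the authorized term for the given wording, or raise ValueError."""
    key = term.strip().lower()
    if key in AUTHORIZED_TERMS:
        return key
    if key in USE_REFERENCES:
        return USE_REFERENCES[key]
    raise ValueError(f"{term!r} is not in the controlled vocabulary")


print(resolve_heading("Personnel administration"))
# human resources administration
```

A wording that is neither authorized nor cross-referenced is rejected outright, which is what keeps two headings for the same subject from appearing in the list.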

Most documents are fairly complex, and they need more than one descriptor to represent their subjects adequately.

For instance, a book entitled Evaluation and privatization: Cases in waste management has the following three subject headings assigned in the Pollock Library: Evaluation research, Sewage disposal—United States—Management, and Privatization—United States. This has two results. The first is that a person interested in just one of these topics can find all the books about it. Searching for “Sewage disposal—United States—Management” retrieves all the books on the management of sewage disposal in the U.S.; some of these may involve privatization or evaluation, but many won’t. Similarly, searching for “Evaluation research” groups together all the books on that topic, so the searcher will also retrieve books on evaluation research in Scandinavia, the history of evaluation research, etc. If “Privatization—United States” is searched on, the searcher will retrieve books on privatization of Amtrak, Air Force depot maintenance, and prisons, as well as sewage disposal.

The second result is that a searcher can be very precise in her search. If she is interested specifically in the evaluation of privatized sewage disposal in the United States, she can search on all three of the terms and retrieve all the books that are on that very specific topic.

Note that no one of these three terms adequately describes the work – but a person interested in just one aspect such as evaluation research is able to find all the books about that topic. So assigning several terms accomplishes three different things: it allows the subject of the book to be represented thoroughly, it allows searchers to group documents about a particular topic, and it allows searchers to group documents for a very complex topic which takes more than one subject heading to describe.

It works just the same way in journal indexes – a journal article entitled “Secondary Traumatic Stress in the Classroom: Ameliorating Stress in Graduate Students” might be assigned descriptors Educational Strategies, Graduate Students, Mental Health Workers, Stress Management, and Violence from the list of controlled vocabulary. This combination of subject headings gives a reasonably thorough indication of the document’s subject; no one term alone can do so. It also allows the article to be retrieved by someone interested in violence and graduate students, and by someone with the very different interest of stress management and appropriate educational strategies, and by someone trying to find everything in the database that deals with graduate students.

In summary, then, controlled vocabularies:

  1. are lists of vocabulary terms (subject headings, descriptors, indexing terms) developed by human beings
  2. require a human indexer or cataloger to choose the terms from that list which best represent the subject of each document in the collection
  3. allow very specific description of content through the assignment of several terms, each of which represents a particular aspect of the topic covered by the document
  4. allow searchers to group documents in several different ways – they can use only one descriptor, which will retrieve many documents that are related to that subject but may be very different in other ways, or they can use a combination of descriptors, which will retrieve documents that share complex topics.
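The grouping described in this summary can be sketched with sets of descriptors. The first record uses the journal article and descriptors from the example above; the other two records, and the function name search_descriptors, are hypothetical and exist only to give the search something to distinguish.

```python
# Each record maps a title to the descriptors an indexer assigned to it.
records = {
    "Secondary Traumatic Stress in the Classroom: "
    "Ameliorating Stress in Graduate Students": {
        "Educational Strategies",
        "Graduate Students",
        "Mental Health Workers",
        "Stress Management",
        "Violence",
    },
    # Hypothetical records, invented for this illustration.
    "Hypothetical article on thesis advising": {
        "Educational Strategies",
        "Graduate Students",
    },
    "Hypothetical article on clinic safety": {
        "Mental Health Workers",
        "Violence",
    },
}


def search_descriptors(records, *descriptors):
    """Return titles of records that were assigned ALL of the given descriptors."""
    wanted = set(descriptors)
    return sorted(title for title, assigned in records.items()
                  if wanted <= assigned)


# One descriptor: a broad group of otherwise quite different documents.
broad = search_descriptors(records, "Graduate Students")

# A combination of descriptors: only documents that share the complex topic.
narrow = search_descriptors(records, "Violence", "Graduate Students")

print(len(broad), len(narrow))  # 2 1
```

The same article is retrieved by both searches, because each of its descriptors is an independent access point.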

Note the differences from searching on the full text of a document:

  1. the searcher can be assured that if she searches on the term “violence,” all the documents related to that topic will be retrieved, whether or not the documents themselves use the word violence (they may use the words “attack” or “unacceptable behavior”)
  2. the searcher can be assured that if she searches on the term “violence” she will not retrieve documents that describe “the violence of color” in a painting of a fire
  3. a human being (an indexer or cataloger) has to understand the document and decide both what it is about and which terms from the controlled vocabulary list best express that
  4. the searcher does not have to guess what terms might have been used by an author to discuss the subject, but does have to know, or have some means to learn, what the authorized terms are that may be searched on

Representing subject through classification

Both the use of the full text of documents and the use of controlled vocabularies allow multiple ways for a document to be grouped with other documents about the same subject, depending on how the searcher defines “same subject.” Any combination of the meaningful words in a full text system can be searched on, thus allowing a great deal of flexibility but creating some uncertainty about the appropriateness of the documents which will be retrieved. Any combination of the authorized subject terms in a controlled vocabulary can be searched on, with some certainty that the documents retrieved will be appropriate.

Classification systems work quite differently. First a person or task force creates a set of categories, and then each document in the collection is placed into a category. Thus if the class number for dogs in the Dewey Decimal System is 636.7, books about dogs will all be assigned that number. The fact that only one category is assigned per document means that classification systems may be used to put documents in order physically as well as intellectually. Thus all the books about dogs will be on the same shelf or shelves, followed immediately by the books on cats, which have the number 636.8.
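Because each book carries exactly one class number, shelf order is simply a sort on that number. The sketch below uses the two Dewey numbers from the paragraph above; the book titles are hypothetical, and arranging books by title within a class is an assumption made to keep the example short.

```python
from decimal import Decimal

# (title, class number) pairs; exactly one class number per book.
shelf_list = [
    ("Hypothetical cat care guide", "636.8"),
    ("Hypothetical dog training manual", "636.7"),
    ("Hypothetical guide to dog breeds", "636.7"),
]


def shelf_order(items):
    """Sort books by class number, then by title within the same class."""
    return sorted(items, key=lambda item: (Decimal(item[1]), item[0]))


for title, number in shelf_order(shelf_list):
    print(number, title)
```

Both dog books come out side by side, followed immediately by the cat book, which mirrors how the books would sit on the shelf.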

Electronic classification systems work the same way. Even though the electronic document may be physically stored anywhere in the computer’s memory, it will be assigned to one category, and thus the link to the document will always go from that category to the document. (With electronic documents, it’s possible to assign more than one category, but when you search the effect is the same as if it were in only one place – when you browse the Yahoo! directory, for instance, you can only look in one category at a time; you can’t combine categories to express complex subjects such as “animals and photography”.)

Another example of a classification system is a file cabinet, where file folders are created for particular subjects, and all the documents (for instance, photocopies of journal articles) about a particular subject are placed in the same file folder.

This creates some rigidity in how the subject of a document can be represented in a classification system. If you have a photocopy of an article about the differences between indexing images and indexing text, do you put it in the file folder for “image indexing” or the file folder for “text indexing”? Should a book on lighthouses and their role in New England’s whaling industry go in SH381 (LC’s classification number for whaling), or VK1020 (for lighthouses) or F4 (for New England)? These are not easy questions to answer, and while cataloging rules provide some guidance, different librarians may make different decisions. And notice that a book which covers all three of these topics but emphasizes the geographic location may well be classified in F4, while another book that covers all three topics but emphasizes the architecture and history of lighthouses may be classed in VK1020. Thus a searcher would need to check both classes (or all three) to be sure of finding everything on the topic.

One further issue: classes (categories) and notation. The classification systems we’re most familiar with use a notation (a sort of code) to express the subject of each category – in Dewey, the subject of dogs is expressed by the number 636.7 and the subject of cats is expressed by 636.8; in the Library of Congress Classification System, the subject of whaling is expressed by the alphanumeric SH381. These notation schemes are very useful for indicating the subject in a compressed form and for creating an order for the subjects in the classification. But they’re not an essential part of a classification system; file folders in a four-drawer cabinet, for instance, probably just have words written on their tabs rather than any notation or coding.

Classification systems, then:

  1. begin with a set of categories created by human beings
  2. require a human being to decide which category a given document should be in
  3. allow similar documents to be placed in proximity to one another, if the classification system is used as the basis for how documents are arranged physically
  4. allow groupings of documents in only one way, predetermined by the categories in the classification system
  5. may have a notation that is a code for the subject and indicates the sequence of classes and subclasses.

Conclusion

This section of the Online Resource has discussed three ways which have been developed for representing the subjects of documents – and for searching for documents by subject.

The first, natural language or full text techniques, use computer programs to create an index of all the significant words in the documents in the collection, and by searching on any of those words, a person can find documents which are about the subjects those words represent. In databases, it is usually the case that if the word(s) searched on appears in the document even once, the document will be retrieved. On the web, because of the vast number of documents present, algorithms have been developed for search engines which are reasonably successful in identifying which words in a document are actually related to the subject of the document in a substantial way. Because any combination of words can be searched on, a searcher may search for any complex subject in which he or she is interested – though there is no guarantee that documents will be retrieved.

The second, controlled vocabularies, use lists of authorized terms to ensure that all documents about a subject can be retrieved with the same term, and that only documents about that subject will be retrieved by the term. While there is very interesting experimental work with having computers process the text of documents and automatically assign terms from a controlled vocabulary, in most cases human beings are still doing this work. Because from one to a dozen (or more) terms from the list can be assigned to each document, the document can be described very precisely, and searchers can search on any combination of authorized terms that they feel describe the subject they’re interested in – though again there is no guarantee that documents will be found that have been assigned all of the terms.

The third, classification systems, use categories established by the designers of the system, and each document is placed into one of the categories. This allows for useful grouping of documents by subject – but is limited in the ability to represent complex subjects, since a document is placed into only one category. (There is no reason a document couldn’t be placed into more than one category – but then the system couldn’t be used for the physical organization of like documents. As noted above, with electronic documents this isn’t a problem, but for the searcher the effect is still the same.)

Each of these systems has its strengths and weaknesses, a few of which have been discussed here. Many collections therefore use a combination of systems. In a database, for instance, each document may have several descriptors from a controlled vocabulary assigned, and the database may also have a searchable index of all the words from the documents. A searcher can search on the controlled vocabulary terms or in the full-text index, or both simultaneously. (A discussion of when each strategy is appropriate is beyond the scope of this section.) Electronic documents may have controlled vocabulary terms assigned, as well as being searchable through full text. Libraries use a classification system to place books on the shelf with similar subjects grouped together, and they assign several controlled vocabulary terms for searching in the catalog. (This greatly benefits the person interested in books on lighthouses and the whaling industry in New England – by searching on “lighthouses and whaling and New England,” all the books related to that subject can be retrieved, whatever classification number they may have been assigned.)
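The lighthouse example can be sketched by giving each book one class number and several subject headings. The class numbers are the ones used earlier in this section; the book titles are hypothetical, and the headings are simplified wordings chosen for the illustration rather than the authorized forms from any actual list.

```python
books = [
    {"title": "Hypothetical book emphasizing the region",
     "class_number": "F4",
     "headings": {"Lighthouses", "Whaling", "New England"}},
    {"title": "Hypothetical book emphasizing lighthouse architecture",
     "class_number": "VK1020",
     "headings": {"Lighthouses", "Whaling", "New England"}},
    {"title": "Hypothetical general history of whaling",
     "class_number": "SH381",
     "headings": {"Whaling"}},
]


def by_class(books, class_number):
    """Browse one class: each book can be found in only one place."""
    return [b["title"] for b in books if b["class_number"] == class_number]


def by_headings(books, *headings):
    """Search the catalog: return books assigned ALL of the given headings."""
    wanted = set(headings)
    return [b["title"] for b in books if wanted <= b["headings"]]


print(by_class(books, "F4"))
print(by_headings(books, "Lighthouses", "Whaling", "New England"))
```

Browsing class F4 finds only one of the two books on the combined topic, while the search on all three headings finds both, whatever class number each was given.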

There are probably some exceptions to every statement on this web page, but this is an overview of the way these three techniques work in general.

It should also be mentioned briefly that there are at least two kinds of controlled vocabularies (precoordinate and postcoordinate) and two kinds of classification schemes (hierarchical and faceted). Other sections of the Online Resource will address these in more detail.

Which system, or combination of systems, should be used to represent the subjects of documents in a collection depends upon the size of the collection, the breadth of subject material it includes, the financial and technical resources available, and the kinds of information needs which the collection was created to meet.

References

NISO (2004). Understanding metadata. Bethesda, MD: NISO Press. Retrieved January 5, 2005, from http://www.niso.org/publications/press/UnderstandingMetadata.pdf

Rowley, Jennifer E. (1992). Organizing knowledge: An introduction to information retrieval, 2nd ed. Brookfield, VT: Gower.

See especially chapters 12-18.

Taylor, Arlene G. (2004). The organization of information, 2nd ed. Westport, CT: Libraries Unlimited.

See especially chapters 9-11.
