FREE TEXT MEDICAL DOCUMENT RETRIEVAL VIA PHRASE-BASED VECTOR SPACE MODELING
UCLA Technology Available For Licensing

Researchers in the UCLA Department of Computer Science have developed and reduced to practice algorithms and methods for obtaining information, primarily medical information, from free text sources, such as patient medical records. The techniques involve three sets of innovations: (1) keyword extraction and indexing (UCLA Case 2003-358); (2) query expansion (UCLA Case 2003-357); and (3) phrase-based vector space models (VSM) of document retrieval, described herein.

BACKGROUND:  Information retrieval based on the VSM relies on a model in which each document is a vector of index terms. Concepts have been proposed to replace word stems as the index terms to improve retrieval accuracy. However, past research revealed that such concept-based systems did not outperform stem-based systems. Knowledge sources should improve retrieval accuracy, but knowledge sources are based on word stems, which precludes significant improvement. To remedy this problem, we propose to represent documents using phrases.

INNOVATION:  The innovation herein combines the use of a knowledge source, the use of phrases rather than word stems as index terms, and the application of a vector space with metrics for document similarity.

A phrase consists of multiple concepts and word stems. The similarity between two phrases is jointly determined by their conceptual similarity and their common word stems. Document similarity can in turn be derived from phrase similarity. Using VSM, a document is represented by a vector of terms. The basis of the vector space consists of distinct concepts. Components of the document vector are the weights applied to corresponding terms, which represent the relative importance of the distinct concepts in the document. The weight of a concept is based on the number of times the phrase is used in the document. Furthermore, higher weights are assigned to longer phrases, which correspond to more specific concepts. On the other hand, the more documents a phrase belongs to, the less discriminating power it has, and thus the less important it is. The cosine of the angle between two document vectors measures the similarity between the documents. Retrieval of documents is achieved by finding the document vectors closest to the query vector.
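
The weighting and cosine ranking described above can be illustrated with a minimal sketch. This is not the inventors' implementation: it assumes that phrases have already been extracted, treats distinct phrases as unrelated (the conceptual similarity drawn from a knowledge source is omitted), leaves out the extra weight for longer phrases, and uses a common smoothed inverse-document-frequency formula chosen for illustration. The function names and the sample phrases are hypothetical.

```python
import math
from collections import Counter


def idf(phrase, docs):
    """Smoothed inverse document frequency (an assumed formula):
    a phrase that occurs in more documents receives a lower weight."""
    df = sum(1 for d in docs if phrase in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def weigh(phrases, docs):
    """Map each phrase to (count in this document) * idf."""
    counts = Counter(phrases)
    return {p: c * idf(p, docs) for p, c in counts.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document indices ordered from most to least similar."""
    qv = weigh(query, docs)
    scores = [(cosine(qv, weigh(d, docs)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scores, key=lambda t: (-t[0], t[1]))]


# Hypothetical documents, each already reduced to a list of phrases.
docs = [
    ["chest pain", "myocardial infarction", "chest pain"],
    ["lung cancer", "chest pain"],
    ["hip fracture", "physical therapy"],
]
query = ["myocardial infarction", "chest pain"]
ranking = rank(query, docs)
```

In this toy collection the first document shares both query phrases and ranks first, the second shares only the more common phrase "chest pain", and the third shares none. A fuller version would replace the exact phrase match in the dot product with a phrase-to-phrase similarity that accounts for related concepts and shared word stems.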

DEVELOPMENT TO DATE:  Using OHSUMED as a test collection and UMLS as the knowledge source, our experiments show that phrase-based VSM yields a 16% increase in retrieval accuracy compared to the stem-based model. The test comprised 105 queries, 14,000 judged documents and 1.3 million phrases in UMLS.

This work was performed by the CoBase Database Group at UCLA (http://www.cobase.cs.ucla.edu/).

Reference: UCLA Case No. 2003-510

For additional technical details and current licensing
availability, please contact the following UCLA office:

UCLA Office of Intellectual Property
11000 Kinross Avenue, Suite #200
Los Angeles, CA 90095-7231
Tel: 310-794-0558 Fax: 310-794-0638
email: ncd@research.ucla.edu
NCD URL:   http://www.research.ucla.edu/tech/ucla03-510.htm

Lead Inventor: Wesley Chu

UCLA Technologies Available for Licensing
http://www.research.ucla.edu/tech

Copyright © 2003 The Regents of the University of California.

keywords: bioinformatics datamining medical devices uclancd ucla technologies intellectual property patents technology transfer invention business card