Type of Document: Dissertation
Author: Sornil, Ohm
Author's Email Address: firstname.lastname@example.org
URN: etd-02062001-114915
Title: Parallel Inverted Indices for Large-Scale, Dynamic Digital Libraries
Degree: PhD
Department: Computer Science

Advisory Committee
- Fox, Edward Alan (Committee Chair)
- Edwards, Stephen H. (Committee Member)
- Koelling, Charles Patrick (Committee Member)
- Ramakrishnan, Naren (Committee Member)
- Varadarajan, Srinidhi (Committee Member)

Keywords
- incremental update
- information retrieval
- parallel inverted index
- hybrid partitioning
- digital library
- terabyte text collection
Date of Defense: 2001-01-25
Availability: unrestricted

Abstract

The dramatic increase in the amount of content available in digital forms gives rise to large-scale digital libraries, targeted to support millions of users and terabytes of data. Retrieving information from a system of this scale in an efficient manner is a challenging task due to the size of the collection as well as the index. This research deals with the design and implementation of an inverted index that supports searching for information in a large-scale digital library, implemented atop a massively parallel storage system. Inverted index partitioning is studied in a simulation environment, aiming at a terabyte of text. As a result, a high performance partitioning scheme is proposed. It combines the best qualities of the term and document partitioning approaches in a new Hybrid Partitioning Scheme. Simulation experiments show that this organization provides good performance over a wide range of conditions. Further, the issues of creation and incremental updates of the index are considered. A disk-based inversion algorithm and an extensible inverted index architecture are described, and experimental results with actual collections are presented. Finally, distributed algorithms to create a parallel inverted index partitioned according to the hybrid scheme are proposed, and performance is measured on a portion of the equipment that normally makes up the 100 node Virginia Tech PetaPlex™ system.
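The term and document partitioning approaches named in the abstract can be illustrated with a small sketch. This is not the dissertation's implementation. The function names, the toy collection, and the round-robin placement are all assumptions made for illustration. In particular, `chunked_partition` is only one hypothetical way to combine the two approaches (splitting each posting list into chunks and spreading the chunks across nodes); the record above does not describe how the Hybrid Partitioning Scheme actually works.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def term_partition(index, num_nodes):
    """Term partitioning: a term's whole posting list lives on one node."""
    nodes = [dict() for _ in range(num_nodes)]
    for i, term in enumerate(sorted(index)):
        nodes[i % num_nodes][term] = index[term]
    return nodes


def document_partition(docs, num_nodes):
    """Document partitioning: each node indexes its own subset of documents."""
    shards = [dict() for _ in range(num_nodes)]
    for doc_id, text in docs.items():
        shards[doc_id % num_nodes][doc_id] = text
    return [build_index(shard) for shard in shards]


def chunked_partition(index, num_nodes, chunk_size):
    """Hypothetical hybrid (an assumption, not the author's scheme):
    split each posting list into fixed-size chunks and place the
    chunks on nodes in round-robin order."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    slot = 0
    for term in sorted(index):
        postings = index[term]
        for start in range(0, len(postings), chunk_size):
            chunk = postings[start:start + chunk_size]
            nodes[slot % num_nodes][term].extend(chunk)
            slot += 1
    return [dict(node) for node in nodes]


def lookup(nodes, term):
    """Merge the postings for a term from every node that holds part of it."""
    return sorted(d for node in nodes for d in node.get(term, []))


# Toy collection with integer document ids (made up for this sketch).
docs = {
    0: "parallel inverted index",
    1: "inverted index update",
    2: "digital library index",
    3: "parallel digital library",
}
index = build_index(docs)
by_term = term_partition(index, 2)
by_doc = document_partition(docs, 2)
by_chunk = chunked_partition(index, 2, 2)
```

All three layouts return the same postings from `lookup`; they differ in how many nodes a single-term query has to contact and in how evenly the posting lists are spread, which is the kind of trade-off a partitioning study examines.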
NOTE: (02/2007) An updated copy of this ETD was added after there were patron reports of problems with the file.
Approximate Download Time (Hours:Minutes:Seconds)

| Filename | Size | 28.8 Modem | 56K Modem | ISDN (64 Kb) | ISDN (128 Kb) | Higher-speed Access |
|---|---|---|---|---|---|---|
| dissertation.pdf | 1.07 Mb | 00:04:56 | 00:02:32 | 00:02:13 | 00:01:06 | 00:00:05 |
| dissertation_printTo7.pdf | 1.13 Mb | 00:05:13 | 00:02:41 | 00:02:21 | 00:01:10 | 00:00:06 |