CS 419/519
Information Filtering and Retrieval
Class Meeting Times/Location:
TR 9:30-10:50 1/9/06-3/17/06
Exam Dates:
Tuesday, Feb 14th (9:30-10:50AM),
Thursday, Mar 9th (9:30-10:50AM)
Final Presentation Times:
Thursday, March 16th (9:30-10:50AM),
(may change) Friday, March 24th (7:30-9:20AM)
Instructor:
Jon Herlocker
Electrical Engineering and Computer Science
1048 Kelley Engineering Center (Office: Kelley 2053)
(541) 737-8894
Fax: (541) 737-3014
herlock@eecs.oregonstate.edu
Teaching Assistant: Mr. Dana Benson - bensond@eecs.oregonstate.edu
Office Hours:
Dr. Herlocker: TR 11-Noon (this time may change; check the web site) in Kelley 2053. To make an appointment outside of office hours, send an email request to the address listed above.
Teaching Assistant: MW 11-Noon in the KEC Computer Lab
Text: Modern Information Retrieval, Ricardo Baeza-Yates, Berthier Ribeiro-Neto. ACM Press: New York. 1999.
Syllabus: IR Syllabus winter 2006.doc
Slides
- Lecture Slides
- Guest Speaker Presentation Slides:
- Week 2
- Week 3
- Week 4
- Tuesday, Jan 31: Matt McLaughlin, MusicStrands Slides 1, Slides 2
- Thursday, Feb 2: No Class
Course Objectives: At the end of this course, you should be able to...
- Explain what information filtering and retrieval are and recognize core terminology specific to information retrieval and filtering.
- Explain the basic capabilities of current text information retrieval technology.
- Describe in detail the Boolean and vector-space retrieval algorithms for text.
- Design an empirical evaluation for an information retrieval or filtering system.
- Describe what collaborative filtering is and explain how it works, its strengths, and its weaknesses.
- Describe the purpose of popular text encoding standards and information retrieval protocols.
- Find relevant articles using library journal indexes, library catalogs, bibliographic citation databases, and article indexes.
- Describe one topic area of information retrieval in depth that you covered in your class project.
Overall Grading Distribution:
- One course-long project: 50%
- Two exams: 40%
- Other, including contributions to the class in general: 10%
Tools
- Collaborative Filtering Engines
- Precision/Recall Graphs
- Morphology/Stemming Tools
- Porter
- TreeTagger
- A free POS tagger that works for many languages, including English.
- http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
- Porter or Lovins stemmers do not necessarily produce real words (computing => comput), whereas TreeTagger does (computing => compute). On the other hand, TreeTagger does not group related nouns and verbs together (computing => compute; computation => computation), while Porter does (computing => comput; computation => comput).
- Snowball
- Used to specify your own stemmer, or stemmers for languages other than English.
- Snowball is a language for specifying lexical stemmers. Snowball currently supports many languages.
- http://snowball.tartarus.org/
- Search Engines
- Lucene
- http://lucene.apache.org/java/docs/
- Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
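The stemmer trade-off described under Morphology/Stemming Tools above can be illustrated with a small sketch. This is a toy example only: `toy_stem` is a hypothetical suffix stripper, not the Porter algorithm, and `toy_lemma` uses a hand-written lookup table standing in for a POS-aware lemmatizer such as TreeTagger. The suffix list and the lemma table are assumptions chosen to reproduce the computing/computation example.

```python
# Toy illustration only: NOT the Porter algorithm and NOT TreeTagger.

# Hypothetical suffix rules, checked in order (longest first).
SUFFIXES = ("ation", "ing", "e")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma table standing in for a dictionary-based lemmatizer.
LEMMAS = {
    "computing": "compute",
    "computes": "compute",
    "computation": "computation",
}


def toy_lemma(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


for w in ("computing", "computation", "compute"):
    # Stemming conflates all three forms into a non-word;
    # lemmatization yields real words but keeps noun and verb apart.
    print(f"{w:12} stem={toy_stem(w):8} lemma={toy_lemma(w)}")
```

In a search index, the stemmed forms would let a query for "computation" match documents containing "computing", at the cost of index terms that are not real words; the lemmatized forms stay readable but miss that match.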
Links
- Books
- HCI
- Managing Gigabytes
- http://www.cs.mu.oz.au/mg/
- "Managing Gigabytes" is a great book that teaches all the gory details of how to actually build a high-performance search engine. This link provides information about the book and points to the authors' free open source software: the mg indexer and the seft search engine.
- Alternative Search Engines