The following objects are used to accomplish indexing:
IndexWriter
Analyzer
Document
Field
You prime an index writer with a directory path and an analyzer. The index writer will write the index files to this directory path, and it will use the analyzer to process documents, looking for significant items to index.
A document (unlike an HTML or Word document) is a collection of fields, where each field can contain any amount of text. You drop this document into the index writer to index it, and you can add any number of documents to the same index writer.
A field, depending on its type (keyword, unindexed, unstored, text), may or may not be kept in the stored document.
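To make that concrete, here is a minimal indexing sketch against the Lucene 1.x API of the time; the index path, field names, and values are made up for illustration.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexExample {
    public static void main(String[] args) throws Exception {
        // Prime the writer with a directory path and an analyzer;
        // 'true' means create a new index at that path
        IndexWriter writer =
            new IndexWriter("/tmp/myindex", new StandardAnalyzer(), true);

        // A document is just a collection of named fields
        Document doc = new Document();
        doc.add(Field.Keyword("author", "Professor Henry"));    // stored, indexed, not tokenized
        doc.add(Field.Text("title", "Superstring basics"));     // stored, indexed, tokenized
        doc.add(Field.UnStored("body", "Some article text")); // indexed only, not stored

        // Drop the document into the writer; repeat for any number of documents
        writer.addDocument(doc);

        writer.optimize();
        writer.close();
    }
}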
Satya - Wednesday, June 01, 2005 6:10:51 PM
General idea of searching
The counterpart of the IndexWriter is an IndexSearcher, which takes the directory path as its input. It is then ready to search for a specific query. A Query is obtained from your input string via a QueryParser, and a query parser takes an analyzer as its input.
In summary, the relevant objects are:
IndexSearcher
Query
QueryParser
Analyzer
Hits
Hits is a collection of lazily loaded documents obtained by the search.
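The searching counterpart of the earlier sketch, again a minimal example against the Lucene 1.x API; the index path and field names match the made-up ones from the indexing sketch above.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // The searcher is primed with the same directory path the writer used
        IndexSearcher searcher = new IndexSearcher("/tmp/myindex");

        // The parser turns the input string into a Query; "body" is the default field
        Query query = QueryParser.parse("relativity AND superstring", "body",
                                        new StandardAnalyzer());

        // Hits loads the matching documents lazily as you iterate over it
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(hits.score(i) + " : " + doc.get("title"));
        }
        searcher.close();
    }
}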
Satya - Wednesday, June 01, 2005 6:16:05 PM
What is the role of fields in searching
While searching, the QueryParser takes a field name as one of its arguments: the default field. What is the general role of fields in searching? What happens if you don't specify a field name for a search query?
"Google could index the articles but we wouldn't be able to show results based on questions such as, "show me all the articles by Professor Henry that deal with relativity and have superstring in their title."
A Lucene index is a data store similar to a table, and you can search the index the way you search a table. Documents are inserted into the table as rows ("added," to be precise). The documents may or may not all have the same fields (columns). Each row (document) has fields that are indexed and fields that are not (as in a database).
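That table analogy shows up directly in the query syntax: a bare term is searched against the default field handed to the QueryParser, while a field:term prefix targets a specific field, much like naming a column in a WHERE clause. A small sketch (reusing the imports from the search example above; the field names are again illustrative):

// Bare terms ("relativity") go against the default field ("body");
// "author:" and "title:" target specific fields, like columns
Query query = QueryParser.parse(
        "author:Henry AND title:superstring AND relativity",
        "body", new StandardAnalyzer());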
"The first step is to find out how to "crawl the web". That is: request a page using the HTTP protocol, receive the page, extract the text in the page, and harvest the links in the page. Then repeat this process for every link found."
The interesting conclusion, then, is that if your page is not linked from an already public site, it is hidden from the crawlers.
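As a rough sketch of that crawl loop (request a page, receive it, harvest its links, repeat), here is a toy crawler in pre-generics Java; the start URL is a placeholder, and a real crawler would use an HTML parser, robots.txt handling, and politeness delays rather than a regex.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {
    // Crude href extractor; real crawlers parse the HTML instead
    static final Pattern LINK =
        Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        LinkedList queue = new LinkedList();
        Set seen = new HashSet();
        queue.add("http://example.com/");   // placeholder start page

        while (!queue.isEmpty() && seen.size() < 10) {
            String pageUrl = (String) queue.removeFirst();
            if (!seen.add(pageUrl)) continue;   // skip pages already fetched

            // Request the page over HTTP and receive its contents
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(pageUrl).openStream()));
            StringBuffer page = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null) page.append(line).append('\n');
            in.close();

            // Here the extracted text would be handed to the indexer.

            // Harvest the links and repeat the process for each one
            Matcher m = LINK.matcher(page);
            while (m.find()) queue.add(m.group(1));
        }
    }
}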
Teaching assistants - Wang Lam, Mahati Mahabhashyam
Satya - Thursday, June 02, 2005 9:02:44 AM
Look for some articles on "relevance"
So far, the search has been for a set of keywords known to the user. Look for strategies where, given a document's worth of information, one can find similar documents that are already in the database.
This is probably being done by such players as Google already. I wonder if their desktop toolkit has this built in.
What about Lucene? Look for some literature or its newsgroup on this subject. See what Pramod came up with from the book.
See if some of the researchers at Stanford have any information on this.
Satya - Thursday, June 02, 2005 10:21:09 AM
Some ideas/notes on information retrieval from Mahati Mahabhashyam
This is very useful, as most of the core ideas in IR are here, along with what they are called in the literature. This will allow us to search for those ideas in Google.
For example, see the following extract:
Content-Based Filtering: The process of filtering by extracting features from the text of documents to determine the documents' relevance. Also called "cognitive filtering".