Satya - Sunday, May 29, 2005 3:17:03 PM
A brief history
1997 - Doug Cutting started
2000 - Goes to SourceForge
2004 - Widely accepted
Satya - Sunday, May 29, 2005 3:21:01 PM
What is nutch?
Nutch is an open source search engine. Doug Cutting is the primary developer here as well.
Satya - Sunday, May 29, 2005 3:35:33 PM
What is simpy?
Simpy is a social bookmarking website or service created by Otis, co-author of Lucene in Action.
Satya - Wednesday, June 01, 2005 5:11:30 PM
General idea of indexing
The following objects are used to accomplish indexing:
IndexWriter, Analyzer, Document, Field
You prime an IndexWriter with a directory path and an analyzer. The IndexWriter writes the index files into this directory and uses the analyzer to process each document, looking for significant terms to index.
A document (unlike an HTML or Word document) is a collection of fields, where each field can contain any amount of text. You hand this document to the IndexWriter to index it. You can add any number of documents to the same IndexWriter.
Depending on its type (keyword, unindexed, unstored, text), a field's value may or may not be stored in the index, and may or may not be indexed for searching.
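The indexing flow above can be mimicked in a few lines of plain Java. This is only a toy model and not Lucene's implementation: the class names ToyField and ToyIndex, the "field:term" key layout, and the crude lowercase-and-split analyzer are all made up for illustration. It shows just one idea: a document is a bag of named fields, and each field decides separately whether it is indexed (searchable) and whether it is stored (retrievable).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model of a field; not a Lucene class.
final class ToyField {
    final String name;
    final String value;
    final boolean indexed; // searchable through the inverted index
    final boolean stored;  // original text kept for retrieval

    ToyField(String name, String value, boolean indexed, boolean stored) {
        this.name = name;
        this.value = value;
        this.indexed = indexed;
        this.stored = stored;
    }
}

// Hypothetical toy model of an index; not a Lucene class.
final class ToyIndex {
    // "field:term" -> numbers of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document number -> (field name -> stored text)
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    // Stand-in for an analyzer: lowercase, then split on anything
    // that is not a letter or a digit.
    private static String[] analyze(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    // Adds one document (a list of fields) and returns its document number.
    int addDocument(List<ToyField> fields) {
        int docNumber = storedFields.size();
        Map<String, String> kept = new HashMap<>();
        for (ToyField f : fields) {
            if (f.indexed) {
                for (String term : analyze(f.value)) {
                    if (term.isEmpty()) continue;
                    postings.computeIfAbsent(f.name + ":" + term,
                            k -> new TreeSet<>()).add(docNumber);
                }
            }
            if (f.stored) kept.put(f.name, f.value);
        }
        storedFields.add(kept);
        return docNumber;
    }

    // Returns the numbers of the documents whose given field contains the term.
    Set<Integer> search(String field, String term) {
        Set<Integer> hits = postings.get(field + ":" + term.toLowerCase());
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    // Returns the stored text of a field, or null if it was not stored.
    String getStored(int docNumber, String field) {
        return storedFields.get(docNumber).get(field);
    }
}
```

A field added with indexed=true and stored=false behaves like the "unstored" type described above: you can find the document through it, but you cannot read the text back.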
Satya - Wednesday, June 01, 2005 6:10:51 PM
General idea of searching
The counterpart of the IndexWriter is an IndexSearcher, which takes the directory path as its input. Once constructed, it is ready to run a specific query. The Query is obtained from your input string by a QueryParser, and the QueryParser in turn takes an analyzer as its input.
In summary, the relevant objects are:
IndexSearcher, Query, QueryParser, Analyzer, Hits
Hits is a collection of lazily loaded documents returned by the search.
Satya - Wednesday, June 01, 2005 6:16:05 PM
What is the role of fields in searching
While searching, the QueryParser takes a field name as one of its arguments; this is the default field. What is the general role of fields in searching? What happens if you don't specify a field name for a search query?
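As far as the query syntax goes, a term written as title:superstring is looked up in the named field, while a bare term falls back to the default field handed to the parser. The plain Java sketch below imitates only that one rule. ToyQueryParser is a made-up name, and the real QueryParser does much more (analysis, phrases, boolean operators), so treat this as an illustration of the default-field idea and nothing else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy parser; it only illustrates the idea of a default field.
final class ToyQueryParser {
    // Splits the query on whitespace and returns "field:term" clauses.
    // Tokens without a field prefix are assigned the default field.
    static List<String> parse(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            int colon = token.indexOf(':');
            if (colon > 0 && colon < token.length() - 1) {
                // Explicit field, e.g. "title:superstring"
                String field = token.substring(0, colon);
                String term = token.substring(colon + 1).toLowerCase();
                clauses.add(field + ":" + term);
            } else {
                // Bare term: fall back to the default field
                clauses.add(defaultField + ":" + token.toLowerCase());
            }
        }
        return clauses;
    }
}
```

Parsing "author:Henry title:superstring relativity" with the default field "content" sends the first two terms to their named fields and the last one to "content".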
Satya - Thursday, June 02, 2005 7:40:39 AM
You can do the following with Lucene
A quote -
"Google could index the articles but we wouldn't be able to show results based on questions such as, 'show me all the articles by Professor Henry that deal with relativity and have superstring in their title.'"
- by Thomas Paul at Java Ranch
Satya - Thursday, June 02, 2005 7:50:10 AM
Elaborating on the "fields"
Satya - Thursday, June 02, 2005 7:59:03 AM
Parallels to database indexing
A Lucene index is a data store that is similar to a table. You can search that index like you search a table. Documents are inserted into the table as rows ("added", to be precise). The documents may or may not all have the same fields (columns). Each row (document) has fields that are indexed and fields that are not, as in a database.
Satya - Thursday, June 02, 2005 8:06:11 AM
How is the web crawled
He says
"The first step is to find out how to "crawl the web". That is: request a page using the HTTP protocol, receive the page, extract the text in the page, and harvest the links in the page. Then repeat this process for every link found."
The interesting conclusion, then, is that if your page is not reachable through a link from an already public site, it is hidden from the crawlers.
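The conclusion about hidden pages is easy to check on a toy link graph. The sketch below skips the HTTP and text-extraction steps entirely and crawls an in-memory map of page names to outgoing links; ToyCrawler and the page names in the usage note are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy crawl over an in-memory link graph; a real crawler would fetch
// each page over HTTP and extract the links from its content.
final class ToyCrawler {
    // Returns every page reachable from the seeds by following links.
    static Set<String> crawl(Map<String, List<String>> linksByPage, List<String> seeds) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty()) {
            String page = frontier.removeFirst();
            if (!visited.add(page)) continue; // already crawled
            // "Harvest the links" of this page and queue them for later
            List<String> links = linksByPage.get(page);
            if (links != null) frontier.addAll(links);
        }
        return visited;
    }
}
```

If "home" links to "about" and "blog", and a page called "hidden" links to "home" but nothing links to "hidden", a crawl seeded with "home" never visits "hidden". Outgoing links do not help a page get found; only incoming ones do.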
Satya - Thursday, June 02, 2005 8:28:28 AM
Information retrieval and web mining
Information retrieval and web mining: A Stanford lecture
Professors - Prabhakar Raghavan, Hinrich Schütze
Teaching assistants - Wang Lam, Mahati Mahabhashyam
Satya - Thursday, June 02, 2005 9:02:44 AM
Look for some articles on "relevance"
So far, the search has been for a certain number of keywords that are known to the user. Look for strategies where, given a whole document's worth of information, you find similar documents that are already in the database.
This is probably already being done by players such as Google. Wonder if their desktop toolkit has this built in.
What about Lucene? Look for some literature, or search their newsgroup, on this subject. See what Pramod came up with from the book.
See if any of the researchers at Stanford have information on this.
Satya - Thursday, June 02, 2005 10:21:09 AM
Some ideas/notes on information retrieval from Mahathi Mahabhashyam
Satya - Friday, June 03, 2005 8:29:51 AM
A strategy for indexing: A case study - Dion Almaer
Satya - Friday, June 03, 2005 9:25:10 AM
Glossary of IR terms
This is very useful, as most of the ideas in IR are listed here along with what they are called in the literature. This will allow us to search for those ideas on Google.
For example, see the following extract:
Content-Based Filtering: The process of filtering by extracting features from the text of documents to determine the documents' relevance. Also called "cognitive filtering".
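One common way to turn features extracted from text into a relevance score is the cosine similarity between two term frequency maps. The glossary entry does not prescribe this measure; it is an assumption chosen here for illustration, and ToySimilarity is a made-up name. A rough sketch in plain Java:

```java
import java.util.Map;

// Hypothetical helper: cosine similarity between two term -> frequency maps.
final class ToySimilarity {
    // Returns a value between 0.0 (no shared terms) and 1.0 (same direction).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) dot += fa * fb; // only shared terms contribute
        }
        for (int fb : b.values()) {
            normB += fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) return 0.0; // an empty document matches nothing
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two documents with identical term frequencies score 1.0, and two documents with no term in common score 0.0, so a filter could keep the documents whose score against a reference document exceeds some threshold.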
Satya - Friday, June 03, 2005 9:30:46 AM
Another interesting read from IR
Satya - Friday, June 03, 2005 9:36:32 AM
Content based filtering - Oard and Marchionini
See if the article presents strategies
Satya - Friday, June 03, 2005 9:38:58 AM
Finally a close enough query for google
Satya - Friday, June 03, 2005 9:55:30 AM
Go through the Lucene FAQ page on jGuru
See if I can find out about content-based filtering here
Satya - Saturday, June 04, 2005 11:45:53 AM
Follow up on lucene content filtering at JGuru
Satya - Monday, June 06, 2005 4:07:53 PM
Some discussion on finding similar documents
Satya - Tuesday, June 07, 2005 9:13:00 AM
A weblog: Bayesian Nets, Latent Semantics, Despamming and other speculations
Satya - Monday, June 27, 2005 7:49:07 PM
Search the news group
Search the lucene mailing list at open subscriber
Somehow I couldn't do this from the lucene homepage at apache
Satya - Monday, June 27, 2005 8:11:57 PM
Example of a term frequency vector
{content: 0/1, 02/1, 03/1, 04/1, 05/1, 1/4, 10/4, 12/1, 14/1, 2/1, 2.0/1, 2005/8, 22/1, 24/5, 26/1, 27/5, 28/1, 33/2, 34/2, 36/1, 5/1, access/1, accessed/1, agent/5, agentdao/1, akc/1, already/1, am/4, append/1, architectural/1, architecture/3, author/3, b/1, back/1, between/1, blogs/1, blue/2, ccp/9, central/1, channel/1, class/1, classic/1, clearcase/1, column/1, content/1, create/1, cross/1, current/2, cvs/1, cvsroot/1, data/4, default/1, delivery/1, develop/1, directory/1, display/1, doc/2, docs/3, embedded/1, enquiry/1, essentially/1, excel/1, feedback/1, fileupload/1, florida/1, folder/1, format/1, framework/1, friday/4, from/1, functionality/2, general/1, generic/2, get/1, go/1, google/1, have/1, home/2, host/1, how/3, i/1, idea/1, information/1, initial/1, interface/5, june/8, knowledge/2, library/2, links/1, look/1, main/1, manage/3, managers/1, manipulate/1, mapping/1, masterpage/1, model/1, monday/4, mq/1, much/1, my/2, needs/1, new/4, next/1, object/1, obtained/1, other/1, page/1, paging/3, parent/1, password/2, path/1, piece/1, plans/1, pm/4, pmfweb/1, port/1, portal/5, print/1, products/1, project/1, prototype/1, pserver/1, public/1, purpose/1, put/1, r2/1, r3/1, r3/saa/1, r4/2, rating/2, read/4, records/2, release/3, releases/1, repository/1, request/1, requirements/2, requires/1, response/1, returning/1, review/1, sales/1, satya/8, schedules/1, search/1, see/4, senior/1, service/1, set/1, shield/1, siebel/5, site/1, sorting/1, specs/1, staff/1, standard/1, strategic/1, sufficient/2, summary/1, support/1, sync/1, test/1, text/1, through/3, together/1, ui/1, urls/3, validate/1, via/1, vision/1, web/2, welcome/1, what/4, windows/1, work/5, xml/8}
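A vector in this term/frequency layout can be approximated without Lucene. The sketch below is not how Lucene computes it: the tokenizer is a crude lowercase-and-split stand-in for an analyzer (no stop words are removed), and ToyTermVector is a made-up name. It counts the terms of a text and prints them sorted as plain strings, the same order the vector above appears to use.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper that imitates the "term/frequency" layout shown above.
final class ToyTermVector {
    // Counts how often each term occurs; a TreeMap keeps terms in string order.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (term.isEmpty()) continue;
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }

    // Renders the counts as {field: term/freq, term/freq, ...}.
    static String format(String field, Map<String, Integer> freq) {
        StringBuilder sb = new StringBuilder("{").append(field).append(": ");
        boolean first = true;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (!first) sb.append(", ");
            sb.append(e.getKey()).append('/').append(e.getValue());
            first = false;
        }
        return sb.append('}').toString();
    }
}
```

For the text "Search the web, search the index" and the field name "content", format returns {content: index/1, search/2, the/2, web/1}.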
Satya - Monday, June 27, 2005 8:13:28 PM
What on earth is a docnum?
In Lucene, the IndexReader can give you this term frequency vector if you know the document number. To get the document number for the n-th hit, do:
int docnumber = hits.id(n);
It looks like the id is the document number.
Satya - Monday, June 27, 2005 8:20:10 PM
Lucene sample code
Take a look at some sample code that helped in generating the above
annonymous - Saturday, July 02, 2005 2:02:51 PM
Here is its term frequency vector
{content: 1/1, 1356/8, 2/1, 216.187.231.34/3, 216.187.231.34/akc/2, 3/1, 8080/1, 8080/akc/1, about/1, above/5, absolute/2, access/1, account/1, additional/1, address/3, adress/1, advantage/2, advantageous/1, akc/9, aliases/1, all/1, also/2, any/3, application/2, application1/1, application2/1, applications/1, approached/1, argument/4, arguments/5, article/1, aspect/1, aspire/1, associate/1, assumes/1, available/2, background/1, based/2, because/2, belongs/1, both/1, browser/6, called/5, came/1, can/9, care/1, case/2, change/3, class/1, client/4, clients/1, comma/1, completely/1, consider/1, create/1, creating/2, deal/1, decide/1, declare/1, definition/1, deliver/2, delivered/2, dependent/1, devlivering/1, different/1, discussed/1, display/3, displayed/1, displaynotempurl/7, displayservlet/10, divided/1, do/2, document/3, doesn't/2, don't/2, done/1, downloaded/1, dyanmic/1, earlier/1, either/1, equivalent/1, especially/1, etc/2, ever/1, every/1, example/2, existing/1, explanation/1, explicitly/1, far/1, file/2, filename/2, first/1, focuses/1, follow/2, following/3, follows/1, from/5, ftp/1, fully/1, further/2, gets/1, given/1, google/1, guess/1, handed/1, hari/7, has/2, have/5, hiding/1, host/4, host/application1/servlet/1, host/application2/servlet/1, house/1, how/3, html/1, http/8, i/2, id/1, identified/1, identifier/6, identifies/3, identifying/1, inside/1, instruction/1, internal/1, invocation/1, ip/2, its/1, java/6, just/1, keep/1, key/1, know/2, known/2, knows/1, komatineni/7, let/2, lik/1, like/2, limit/1, linking/1, links/4, list/1, located/2, logic/2, logical/1, long/2, look/1, lookup/1, machine/7, mail/1, maintain/1, maintains/1, make/1, mappings/1, master/2, may/1, me/2, meaningful/1, means/2, methods/1, much/1, myservlet/2, name/5, names/3, need/1, needs/3, new/3, next/1, notebook/1, notice/2, now/2, nuances/1, number/8, one/1, only/1, ordinary/1, other/1, over/1, owner/1, owneruserid/1, page/12, pages/4, paint/1, pairs/1, parent/1, part/8, particular/1, 
parts/1, path/2, people/1, points/1, port/10, ports/2, possible/1, practical/1, prefix/2, primarily/1, process/1, properties/1, protcol/1, protocol/8, protocols/2, purpose/1, really/2, refinement/1, relative/11, removed/1, report/1, reportid/1, request/1, reside/1, resource/2, responsible/1, rest/2, returns/1, revisit/1, rewrite/1, rewritten/3, same/4, scheme/1, second/1, see/2, sense/1, separate/1, separated/1, separator/1, server/11, servers/4, service/1, servlet/19, servlets/2, several/1, short/1, shortening/1, side/1, similar/1, single/1, so/6, some/2, something/1, specific/1, specified/2, specify/2, start/2, starts/1, static/1, stays/1, string/3, structure/1, sub/1, summary/2, table/1, taking/1, tell/1, tells/1, them/1, think/1, through/2, two/1, type/1, understanding/2, universal/1, up/1, uri/3, url/36, urls/7, use/1, user/2, uses/1, using/4, usually/3, value/1, very/1, waiting/1, way/2, web/31, webapp/1, webpage/2, webserver/10, webservers/2, well/1, what/8, when/5, where/2, which/1, while/1, won't/1, you/12, your/3}
Satya - Monday, July 18, 2005 6:34:54 PM
Working with boolean queries: sample code
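import java.util.Iterator;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Builds a query that matches documents containing any of the given
// words in the "content" field.
public static Query getRelevanceQuerySimple(List wordList)
{
    // Construct an empty boolean query
    BooleanQuery bq = new BooleanQuery();
    // Flags reused for every clause: neither required nor prohibited,
    // so each term is optional
    boolean bNotRequired = false;
    boolean bNotProhibited = false;
    Iterator wordItr = wordList.iterator();
    while (wordItr.hasNext())
    {
        String word = (String) wordItr.next();
        // Set up a term query on the "content" field
        TermQuery tq = new TermQuery(new Term("content", word));
        // Add it as an optional clause
        bq.add(tq, bNotRequired, bNotProhibited);
    }
    return bq;
}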
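The two boolean flags passed to add in the sample above decide how each clause takes part in matching. The plain Java sketch below imitates that matching rule on a set of terms. It is a toy model, not Lucene code: ToyBooleanQuery is a made-up name, scoring is ignored, and the rule it encodes (prohibited terms exclude, required terms must all be present, and with no required terms at least one optional term must match) is my understanding of the behavior rather than something taken from the Lucene source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of required / optional / prohibited clauses.
final class ToyBooleanQuery {
    private final List<String> required = new ArrayList<>();
    private final List<String> optional = new ArrayList<>();
    private final List<String> prohibited = new ArrayList<>();

    // Mirrors the (query, required, prohibited) argument order of the sample.
    void add(String term, boolean isRequired, boolean isProhibited) {
        if (isRequired && isProhibited) {
            throw new IllegalArgumentException("a clause cannot be both required and prohibited");
        }
        if (isRequired) required.add(term);
        else if (isProhibited) prohibited.add(term);
        else optional.add(term);
    }

    // Decides whether a document, given as its set of terms, matches.
    boolean matches(Set<String> docTerms) {
        for (String term : prohibited) {
            if (docTerms.contains(term)) return false;
        }
        for (String term : required) {
            if (!docTerms.contains(term)) return false;
        }
        if (!required.isEmpty()) return true; // optional terms would only affect ranking
        for (String term : optional) {
            if (docTerms.contains(term)) return true;
        }
        return false;
    }
}
```

With both flags false for every word, as in getRelevanceQuerySimple, a document matches as soon as it contains any one of the words. That makes the query a reasonable first attempt at a "find something like this word list" search.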
Satya - Tuesday, July 19, 2005 9:22:50 AM
A brief overview of Lucene's querying capabilities
You will see here a short introduction to many of the features of Lucene. A good read before creating your own hand-crafted queries.
Satya - Tuesday, July 19, 2005 9:26:38 AM
See what a multiterm query and fuzzy query can do
Can these be used for relevancy search? Check the mailing list. Check the book.
Satya - Tuesday, July 19, 2005 9:44:38 AM
Finding similar documents
Contents
MoreLikeThis.java SimilarityQueries.java
These seem to have been written by Doug.
Satya - Saturday, August 13, 2005 12:51:28 PM
How to enable Lucene for storing term frequency vectors
When the index is built, if you want to keep the term frequency vector for a document, you need to do something special.
When you add an indexed text field to the document, there is a boolean argument that you need to set to true. Example:
Field.Text(x,y,true);
See the API for the Field.Text method