Awesome Information Retrieval
A curated list of awesome information retrieval resources
Thank you harpribot & contributors
C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008. (First book for getting started with Information Retrieval).
Bruce Croft, Don Metzler, and Trevor Strohman. 2009. (Great book for readers interested in knowing how Search Engines work. The book is very detailed).
R. Baeza-Yates, B. Ribeiro-Neto. Addison-Wesley, 1999.
B. Croft, D. Metzler, T. Strohman. Pearson Education, 2009.
S. Chakrabarti. Morgan Kaufmann, 2002.
W.B. Croft, J. Lafferty. Springer, 2003. (Covers the language modeling approach to Information Retrieval and gives an extensive, interesting treatment of the probabilistic perspective in this domain).
Ed Greengrass, 2000. (Comprehensive survey of conventional Information Retrieval, before the Deep Learning era).
Matthew Lease (University of Texas at Austin).
Chris Manning and Pandu Nayak (Stanford University).
Raymond J. Mooney (University of Texas at Austin).
Vagelis Hristidis (University of California - Riverside).
Ray R. Larson (UC Berkeley).
David Yarowsky (Johns Hopkins University).
Andrea LaPaugh (Princeton University).
Dr. Jilles Vreeken, Prof. Dr. Gerhard Weikum (MPI).
Prof. Wei Wang (University of New South Wales).
Open Source Search Engine that can be used to test Information Retrieval algorithms. Twitter uses this core for its real-time search.
The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.
Another Open Source Search Engine, a competitor of Apache Lucene.
Standard IR Collections
This is one of the first collections in the IR domain. The dataset is too small for any statistical significance analysis, but it is nevertheless suitable for pilot runs.
TREC is the benchmark dataset used by most IR and Web search algorithms. It has several tracks, each of which provides a dataset for testing a specific task. The tracks, along with their suggested use cases, are:
Address challenges in building large chemical testbeds for chemical IR.
Investigate techniques to link medical cases to information relevant for patient care.
Investigate search techniques for complex information needs (context and user interests based).
Explore crowdsourcing methods for performing and evaluating search.
Perform entity-related search (find entities and their properties) on Web data.
Make a binary decision on whether to retrieve each new incoming document, given a stable information need.
Study merge performance for results from various search services.
Study retrieval efficiency of genomics data and corresponding documentation.
Obtain High Accuracy Retrieval from Documents by leveraging the searcher's context.
Study user interaction with text retrieval systems.
Study algorithms that improve the efficiency of human knowledge base curators.
Study retrieval systems that achieve high recall for the legal document use case.
Explore unstructured search performance over patient record data.
Examine satisfaction of real-time information need for microblogging sites.
Explore ad-hoc retrieval over a large set of queries.
Investigate systems' abilities to locate new (non-redundant) information.
Test systems that scale beyond document retrieval, to retrieve answers to factoid, list and definition type questions.
Support deep evaluation of relevance feedback processes.
Develop methods for measuring multiple-query sessions where information needs drift.
Test whether systems can infer the possible tasks users might be trying to accomplish with a query.
Develop systems that allow users to efficiently monitor the information associated with an event over time.
Test the scalability of IR systems to large-scale collections.
Explore information seeking behaviors common in general web search.
This is one of the largest Web collections of documents, obtained from a crawl of government websites by Charlie Clarke and Ian Soboroff using NIST hardware and network, then formatted by Nick Craswell.
This is a collection of a wide variety of datasets, ranging from ad-hoc collections and Chinese IR collections to mobile clickthrough and medical collections. The focus of this collection is mostly on East Asian languages and cross-language information retrieval.
This dataset can be used for cross-lingual IR between CJKE (Chinese-Japanese-Korean-English) languages. It is suitable for the following tasks:
It supports the following bilingual and monolingual settings:
The dataset is used for the task of cross-lingual question answering, but the task is more complex than with the CLQA dataset.
It contains a multi-lingual document collection. The test suite includes:
This data set consists of 20,000 newsgroup messages (posts) taken from 20 newsgroup topics.
This data set is a comprehensive archive of English newswire text data including headlines, datelines and articles.
External Curation Links
Manik Verma (Microsoft Research)
Tim Berners-Lee (TED Talk) [Tim Berners-Lee invented the World Wide Web. He leads the World Wide Web Consortium (W3C), overseeing the Web's standards and development].
Gary Flake, Technical Fellow at Microsoft (TED Talks).
Jeff Dean (WSDM Conference, 2009).
David Milne (The University of Waikato, 2008).
Steve Tjoa (RackSpace Developers) [This talk shows that IR is not just text and images].
Liron Shapira (Box Tech Talk).
Doug Imbruce (TechCrunch Disrupt) [Doug Imbruce is the Founder of Qwiki, Inc, a technology startup in New York, NY, acquired by Yahoo! in 2013].
Andreas Ekström (Swedish Author & Journalist, TED Talk).
Eli Pariser (Author of the Filter Bubble, TED Talk).
Andy Yen (CERN, TED Talk) [This talk covers the privacy that Search Engines intrude on and how people can protect it].
Michael Douglas [TEDx SouthBank].
Information Retrieval from Lip Reading.
Bias in Relevance.